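Baidu Baike describes it this way:
NVMe (Non-Volatile Memory Express) is an AHCI-like protocol built on the M.2 interface and designed specifically for flash storage. (Strictly speaking, NVMe runs over PCIe; M.2 is only one of the form factors that carry it.) The specific advantages of NVMe include:
① a several-fold improvement in performance;
② latency reduced by more than 50%;
③ an NVMe PCIe SSD can deliver ten times the IOPS of a high-end enterprise SATA SSD;
④ automatic power-state switching and dynamic power management greatly reduce power consumption;
⑤ enough extensibility to support the next ten years of technology development.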
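How should a programmer understand it?
It is a storage protocol. Since it is a storage protocol, doesn't it need fast reads and writes?
A: Yes.
PCIe is the fastest protocol, so why not just use PCIe?
A: PCIe is very complicated.
So can we put a simpler wrapper around PCIe?
A: NVMe is exactly that wrapper around PCIe.
How does NVMe do it?
A: PCIe is an essay question and NVMe is fill-in-the-blank, yet the end result is the same.
How do you fill it in, and with what?
A: Fill in the table below according to the command you want to send. The whole command is 64 bytes; set the fields you do not need to 0.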
I/O command (fields listed from high offset to low):
appmask | apptag | reftag | dsmgmt | slba | addr | metadata | rsvd | nblocks | control | Flags | Opcode
Admin command (fields listed from high offset to low):
rsvd11 | numd | offset | lid | prp2 | prp1 | rsvd1 | command_id | flags | Opcode
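To make the "64 bytes, unused fields set to 0" rule concrete, here is a minimal sketch of the common layout of a submission queue entry as 16 dwords. The struct and function names (`sqe_sketch`, `sqe_init`) are made up for this illustration and are not the driver's own types; the fields are shown in host byte order, which matches the little-endian wire format only on a little-endian CPU.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration type: one 64-byte submission queue entry. */
struct sqe_sketch {
    uint8_t  opcode;      /* CDW0 bits 7:0   */
    uint8_t  flags;       /* CDW0 bits 15:8  */
    uint16_t command_id;  /* CDW0 bits 31:16 */
    uint32_t nsid;        /* CDW1: namespace ID */
    uint32_t cdw2;        /* CDW2-3: reserved or command specific */
    uint32_t cdw3;
    uint64_t metadata;    /* CDW4-5: metadata pointer */
    uint64_t prp1;        /* CDW6-7: first data pointer */
    uint64_t prp2;        /* CDW8-9: second data pointer */
    uint32_t cdw10;       /* CDW10-15: command specific */
    uint32_t cdw11;
    uint32_t cdw12;
    uint32_t cdw13;
    uint32_t cdw14;
    uint32_t cdw15;
};

/* The entry must be exactly 64 bytes, with no compiler padding. */
_Static_assert(sizeof(struct sqe_sketch) == 64, "SQ entry must be 64 bytes");

/* Zero everything first, then fill only the fields that are needed. */
static void sqe_init(struct sqe_sketch *c, uint8_t opcode,
                     uint16_t command_id, uint32_t nsid)
{
    memset(c, 0, sizeof(*c));
    c->opcode = opcode;
    c->command_id = command_id;
    c->nsid = nsid;
}
```

Calling `sqe_init(&c, 0x02, 7, 1)` would give a read command with ID 7 for namespace 1 in which every other field is still 0.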
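NVMe is a protocol for communication between a host and an SSD, and it sits in the upper layer of the protocol stack.
NVMe defines the commands exchanged between the host and the SSD, and how those commands are executed.
NVMe has two kinds of commands: Admin Commands, which the host uses to manage and control the SSD, and I/O Commands, which transfer data between the host and the SSD. The commands supported by NVMe 1.2 are listed below:
Admin Commands supported by NVMe:
I/O Commands supported by NVMe: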
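As a rough companion to the two command lists, the sketch below gives a subset of the opcodes as C enums. The enumerator names follow the style of the Linux driver's nvme.h, and the list is deliberately partial; `opcode_data_direction` is a helper invented for this example, relying on the convention that the low two bits of an opcode encode the data transfer direction.

```c
#include <stdint.h>

/* Subset of Admin command opcodes. */
enum nvme_admin_opcode {
    nvme_admin_delete_sq    = 0x00,
    nvme_admin_create_sq    = 0x01,
    nvme_admin_get_log_page = 0x02,
    nvme_admin_delete_cq    = 0x04,
    nvme_admin_create_cq    = 0x05,
    nvme_admin_identify     = 0x06,
    nvme_admin_abort_cmd    = 0x08,
    nvme_admin_set_features = 0x09,
    nvme_admin_get_features = 0x0a,
    nvme_admin_async_event  = 0x0c,
};

/* Subset of I/O (NVM command set) opcodes. */
enum nvme_opcode {
    nvme_cmd_flush       = 0x00,
    nvme_cmd_write       = 0x01,
    nvme_cmd_read        = 0x02,
    nvme_cmd_write_uncor = 0x04,
    nvme_cmd_compare     = 0x05,
    nvme_cmd_dsm         = 0x09,
};

/* Hypothetical helper: opcode bits 1:0 give the data direction.
 * 0 = no data, 1 = host to controller, 2 = controller to host. */
static unsigned opcode_data_direction(uint8_t opcode)
{
    return opcode & 0x3u;
}
```

For instance, a write (0x01) moves data from host to controller, while a read (0x02) and Identify (0x06) both move data from controller to host.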
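So let's set up two buffers:
Submission buffer: Submission Queue (SQ).
Completion buffer: Completion Queue (CQ).
To notify the other side, just write the Doorbell Register (DB).
Storage is divided into namespaces, each a range of logical blocks on the flash, and each namespace has an ID called the namespace ID.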
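Example: the host needs to read nblock = 2 of data from flash address 0x02000000, and PRP1 gives the memory address 0x10000000. How is this done?
First we build an nvme_cmd packet. It is a read command that carries the address to read from (0x02000000), the length (nblock = 2), and where to put the data (prp). We then drop the packet into the SQ and write the doorbell to tell the controller that a command is waiting. The controller fetches the command, turns it into TLPs, and uses PCIe Memory Write transactions to copy the data at 0x02000000 into host memory at 0x10000000. It then writes a completion entry at the tail of the CQ and interrupts the host. Finally the host writes the CQ doorbell to tell the controller that the completion has been processed.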
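1: The host places the command in the SQ.
2: The host writes the SQ Tail DB to notify the SSD to fetch the command.
3: The SSD receives the notification and fetches the command from the SQ in host memory, by sending a PCIe Memory Read TLP to the host's SQ.
4: The SSD executes the read command: it reads the data from flash into its cache and then transfers the data to the host.
5: The SSD writes the status back into the host's CQ.
6: The SSD uses an interrupt to tell the host to process the CQ.
7: The host processes the corresponding CQ entry and then writes the CQ Head DB to release it.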
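The steps above can be sketched as a small in-memory simulation of one queue pair. Everything here is hypothetical illustration code (`struct qpair`, `host_submit`, `device_process`, `host_reap`), not the driver's implementation: real doorbells are memory-mapped registers, the device fetches commands by DMA, and the host learns the SQ head from completion entries rather than by peeking at device state.

```c
#include <stdint.h>
#include <string.h>

#define QDEPTH 4

struct cmd { uint8_t opcode; uint16_t cid; uint64_t slba; uint16_t nblocks; uint64_t prp1; };
struct cqe { uint16_t cid; uint16_t status; uint8_t phase; };

struct qpair {
    struct cmd sq[QDEPTH];
    struct cqe cq[QDEPTH];
    uint16_t sq_tail;     /* host: next free SQ slot */
    uint16_t sq_tail_db;  /* SQ Tail doorbell value last written by the host */
    uint16_t sq_head;     /* device: next SQ slot to fetch */
    uint16_t cq_tail;     /* device: next free CQ slot */
    uint16_t cq_head;     /* host: next CQ slot to check */
    uint16_t cq_head_db;  /* CQ Head doorbell value last written by the host */
    uint8_t  dev_phase;   /* phase tag the device writes in this pass */
    uint8_t  host_phase;  /* phase tag the host expects in this pass */
};

static void qp_init(struct qpair *q)
{
    memset(q, 0, sizeof(*q));
    q->dev_phase = 1;
    q->host_phase = 1;
}

/* Steps 1-2: place the command in the SQ, then ring the SQ Tail doorbell. */
static int host_submit(struct qpair *q, const struct cmd *c)
{
    uint16_t next = (uint16_t)((q->sq_tail + 1) % QDEPTH);
    if (next == q->sq_head)
        return -1;                      /* queue full */
    q->sq[q->sq_tail] = *c;
    q->sq_tail = next;
    q->sq_tail_db = next;
    return 0;
}

/* Steps 3-6: fetch each new command, "execute" it, post a completion. */
static int device_process(struct qpair *q)
{
    int n = 0;
    while (q->sq_head != q->sq_tail_db) {
        struct cmd c = q->sq[q->sq_head];
        q->sq_head = (uint16_t)((q->sq_head + 1) % QDEPTH);
        /* the data transfer to c.prp1 would happen here */
        q->cq[q->cq_tail] = (struct cqe){ .cid = c.cid, .status = 0,
                                          .phase = q->dev_phase };
        q->cq_tail = (uint16_t)((q->cq_tail + 1) % QDEPTH);
        if (q->cq_tail == 0)
            q->dev_phase ^= 1;          /* phase flips on wrap-around */
        n++;
    }
    return n;
}

/* Step 7: consume one completion, then ring the CQ Head doorbell. */
static int host_reap(struct qpair *q, struct cqe *out)
{
    struct cqe *e = &q->cq[q->cq_head];
    if (e->phase != q->host_phase)
        return 0;                       /* nothing new */
    *out = *e;
    q->cq_head = (uint16_t)((q->cq_head + 1) % QDEPTH);
    if (q->cq_head == 0)
        q->host_phase ^= 1;
    q->cq_head_db = q->cq_head;
    return 1;
}
```

For the read in the example, the host would submit a command with opcode 0x02, slba 0x02000000, nblocks 2 and prp1 0x10000000, let `device_process` run, and then call `host_reap` to pick up the completion with status 0.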
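This debugging session uses a third-party NVMe card, with Linux kernel 3.11.10 as the software environment. After the card is inserted, device 1987:5007 is visible in the PCI tree, as shown in the figure:
The NVMe card is now recognized as a PCI device, so the next step is to port the driver. Download and unpack Linux 3.11.10, extract the three files nvme-core.c, nvme-scsi.c and nvme.h, and then write a makefile as follows:
Then load the driver with # insmod nvme_driver.ko, after which the nvme devices appear:
Note: the nvme0 device is the one for which we register file_operations, while nvme0n1 corresponds to block_device_operations.
Now that both the device and the driver work, we can use ioctl to debug how commands are submitted and parsed.
Getting the namespace_id is the simplest ioctl operation, so the code is not pasted here. The result is as follows:
Submit IO corresponds to reads and writes of the disk. Only the submission of READ/WRITE commands is described here: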
READ command:
appmask | apptag | reftag | dsmgmt | slba | addr | metadata | rsvd | nblocks | control | Flags | Opcode
        |        |        | 0xc1   |      | addr |          |      | n       |         |       | 0x02
Opcode: read opcode, 0x02
Flags: clear to 0
Control: clear to 0
nblocks: number of blocks to read; must not exceed the maximum. Note that the Linux driver treats this field as 0's based and transfers nblocks + 1 blocks
metadata: not used for now
addr: address where the data is stored; it is best to allocate an array of at least 16 KB
dsmgmt: 0xc1 = 11000001b: incompressible, sequential read, no latency information provided, typical number of reads and writes expected for this LBA range.
Reftag: This field is only used if the namespace is formatted to use end-to-end protection information.
Apptag: This field is only used if the namespace is formatted to use end-to-end protection information.
Appmask: This field is only used if the namespace is formatted to use end-to-end protection information.
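The dsmgmt value 0xc1 packs several hints into one byte. The sketch below splits that byte into its bit fields so the binary breakdown given above can be checked mechanically; `struct dsm_hints` and `dsm_decode` are names invented for this illustration.

```c
#include <stdint.h>

/* Hypothetical illustration type for the low byte of dsmgmt. */
struct dsm_hints {
    unsigned access_frequency;  /* bits 3:0, 1 = typical reads and writes */
    unsigned access_latency;    /* bits 5:4, 0 = no latency information */
    unsigned sequential;        /* bit 6, part of a sequential request */
    unsigned incompressible;    /* bit 7, data is not compressible */
};

static struct dsm_hints dsm_decode(uint8_t dsm)
{
    struct dsm_hints h;
    h.access_frequency = dsm & 0x0fu;
    h.access_latency   = (dsm >> 4) & 0x3u;
    h.sequential       = (dsm >> 6) & 0x1u;
    h.incompressible   = (dsm >> 7) & 0x1u;
    return h;
}
```

Decoding 0xc1 this way yields access_frequency 1, access_latency 0, sequential 1 and incompressible 1, which matches the description in the field list.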
WRITE command:
appmask | apptag | reftag | dsmgmt | slba | addr | metadata | rsvd | nblocks | control | Flags | Opcode
        |        |        | 0xc1   |      | addr |          |      | n       |         |       | 0x01
Opcode: write opcode, 0x01
Flags: clear to 0
Control: clear to 0
nblocks: number of blocks to write; must not exceed the maximum. Note that the Linux driver treats this field as 0's based and transfers nblocks + 1 blocks
metadata: not used for now
addr: address of the buffer holding the data to be written; it is best to allocate an array of at least 16 KB
dsmgmt: 0xc1 = 11000001b: incompressible, sequential request, no latency information provided, typical number of reads and writes expected for this LBA range.
Reftag: This field is only used if the namespace is formatted to use end-to-end protection information.
Apptag: This field is only used if the namespace is formatted to use end-to-end protection information.
Appmask: This field is only used if the namespace is formatted to use end-to-end protection information.
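One thing to watch is the cmd argument of ioctl: the cmd seen by the driver is built in user space by combining the magic number (type), the ordinal number, the transfer direction and the size of the argument type, each shifted into its own bit field.
In testing, a returned status and result that are both 0 mean the command succeeded; anything else means the command failed.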
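The encoding of the ioctl cmd can be sketched by hand. The code below assumes the generic Linux layout (direction in bits 31:30, size in bits 29:16, type in bits 15:8, number in bits 7:0), which a few architectures do not use, and assumes the NVMe driver uses magic 'N' with number 0x40 for the ID ioctl and 0x42 for Submit IO; check both against your kernel's headers. `sk_ioc` and `struct sk_user_io` are made-up names; in real code you would use `_IO`/`_IOW` from `<sys/ioctl.h>` and the driver's own struct.

```c
#include <stdint.h>

/* Direction values in the generic layout. */
enum { SK_IOC_NONE = 0, SK_IOC_WRITE = 1, SK_IOC_READ = 2 };

/* Hypothetical mirror of the Submit IO argument, in the order of the
 * READ/WRITE tables (lowest offset first). */
struct sk_user_io {
    uint8_t  opcode;
    uint8_t  flags;
    uint16_t control;
    uint16_t nblocks;
    uint16_t rsvd;
    uint64_t metadata;
    uint64_t addr;
    uint64_t slba;
    uint32_t dsmgmt;
    uint32_t reftag;
    uint16_t apptag;
    uint16_t appmask;
};

/* Pack direction, argument size, magic type and ordinal into one cmd. */
static uint32_t sk_ioc(uint32_t dir, uint32_t type, uint32_t nr, uint32_t size)
{
    return (dir << 30) | (size << 16) | (type << 8) | nr;
}
```

With a 48-byte argument struct, `sk_ioc(SK_IOC_WRITE, 'N', 0x42, sizeof(struct sk_user_io))` evaluates to 0x40304E42, which is the number the driver would then compare against in its ioctl switch under these assumptions.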
Get Log Page command:
rsvd11 | numd | offset | lid | prp2 | prp1 | rsvd1 | command_id | flags | Opcode
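Opcode: nvme_admin_get_log_page
Flags: clear to 0
Command_id: clear to 0
Prp1: address where the data is stored; it is best to allocate an array of at least 16 KB
Prp2: datalength; take care to set the data length correctly
Lid:
Offset: clear to 0
Numd: clear to 0
Note that no namespace_ID field is defined here; it is best to set rsvd1[0] = ~0 (presumably rsvd1[0] overlays the namespace ID dword, so 0xFFFFFFFF addresses all namespaces).
Get Log Page: SMART / Health Information
Critical Warning: 00
Composite Temperature: (32 01) 0x0132 = 306 K, about 33 °C
Available Spare: (64) 0x64 = 100%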
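The three SMART values above come from the first four bytes of the log page (00 32 01 64). A minimal sketch of decoding them, assuming the little-endian layout of critical warning in byte 0, composite temperature in bytes 1-2 and available spare in byte 3; `struct smart_head` and `smart_decode` are names made up for this example.

```c
#include <stdint.h>

/* Hypothetical illustration type for the start of the SMART log page. */
struct smart_head {
    uint8_t  critical_warning;
    uint16_t temperature_kelvin;
    uint8_t  available_spare_pct;
};

static struct smart_head smart_decode(const uint8_t *log)
{
    struct smart_head s;
    s.critical_warning    = log[0];
    /* low byte first, high byte second */
    s.temperature_kelvin  = (uint16_t)(log[1] | (log[2] << 8));
    s.available_spare_pct = log[3];
    return s;
}

/* Whole-degree conversion; the exact offset is 273.15. */
static int kelvin_to_celsius(uint16_t kelvin)
{
    return (int)kelvin - 273;
}
```

Feeding in the bytes 00 32 01 64 gives no warning bits, 306 K (33 °C) and 100% spare.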
Identify command:
rsvd11 | cns | Prp2 | Prp1 | Rsvd2 | nsid | command_id | flags | Opcode
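Opcode: nvme_admin_identify
Flags: clear to 0
Command_id: clear to 0
Nsid: 0
Prp1: address where the data is stored; it is best to allocate an array of at least 16 KB
Prp2: datalength; take care to set the data length correctly
Cns: 0x01
Identify Controller Data Structure: see the attachment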
Set Features command & Get Features command:
rsvd12 | dword11 | Fid | Prp2 | Prp1 | rsvd2 | Nsid | command_id | flags | Opcode
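Opcode: nvme_admin_get_features & nvme_admin_set_features
Flags: clear to 0
Command_id: clear to 0
Nsid: 0
Prp1: address where the data is stored; it is best to allocate an array of at least 16 KB
Prp2: datalength; take care to set the data length correctly
Fid: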
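The file operations of the NVMe block device are declared when the disk device is allocated, as follows:
disk->fops = &nvme_fops;
static const struct block_device_operations nvme_fops = {
	.owner		= THIS_MODULE,	/* the module that owns these operations */
	.ioctl		= nvme_ioctl,	/* native ioctl entry point */
	.compat_ioctl	= nvme_ioctl,	/* 32-bit caller on a 64-bit kernel */
};
The owner member indicates that this fops belongs to the NVMe block device driver. ioctl and compat_ioctl are the two ways a user ioctl call can arrive; normally it is ioctl, but either way the call ends up in nvme_ioctl.
Once inside nvme_ioctl(), the driver decodes the cmd type and takes the corresponding branch. The two branches of interest here are NVME_IOCTL_ADMIN_CMD and NVME_IOCTL_SUBMIT_IO.
Note that both paths eventually call: nvme_submit_sync_cmd(nvmeq, &c, NULL, NVME_IO_TIMEOUT);
This function submits the command synchronously, waits for it to return, and handles the final status. Because it sleeps, it must be called with preemption enabled. The caller may be preempted at any point and rescheduled later.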
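The latest version of the protocol is currently the NVMe 1.2.1 Specification, which can be downloaded from http://www.nvmexpress.org/specifications/. Drivers are at http://www.nvmexpress.org/drivers/, which currently offers driver code for Microsoft, Linux, VMware, UEFI, FreeBSD, Solaris and other systems.
Identify Controller data returned by the card, with multi-byte fields stored low byte first and high byte last (little-endian):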
87 19: PCI Vendor ID (VID): 0x1987
87 19: PCI Subsystem Vendor ID (SSVID): 0x1987
36 37 43 45 30 37 36 36 31 30 31 37 30 30 30 30 30 31 38 33: Serial Number (SN): ASCII "67CE0766101700000183"
50 43 49 65 20 53 53 44 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20: Model Number (MN): ASCII "PCIe SSD", padded with spaces
45 37 46 4d 30 31 2e 31: Firmware Revision (FR): ASCII "E7FM01.1"
01: Recommended Arbitration Burst (RAB): reported as a power of two, so 2^1 = 2 commands
00 00 00: IEEE OUI Identifier (IEEE)
00: Controller Multi-Path I/O and Namespace Sharing Capabilities (CMIC): the NVM subsystem contains only a single PCI Express port.
09: Maximum Data Transfer Size (MDTS): the value is in units of the minimum memory page size (CAP.MPSMIN) and is reported as a power of two (2^n), so 2^9 = 512 units
00 00: Controller ID (CNTLID)
00 02 01 00: Version (VER): 0x00010200, Major Version Number 1, Minor Version Number 2
80 4f 12 00: RTD3 Resume Latency (RTD3R): 0x00124f80 = 1,200,000 microseconds
60 e3 16 00: RTD3 Entry Latency (RTD3E): 0x0016e360 = 1,500,000 microseconds
07 00: Optional Admin Command Support (OACS): the controller supports the Firmware Commit and Firmware Image Download commands, the Format NVM command, and the Security Send and Security Receive commands.
03: Abort Command Limit (ACL): maximum number of concurrently outstanding Abort commands (0's based, so 4)
03: Asynchronous Event Request Limit (AERL): maximum number of concurrently outstanding Asynchronous Event Request commands (0's based, so 4)
02: Firmware Updates (FRMW): the controller requires a reset for firmware to be activated; the number of firmware slots the controller supports (1 to 7) is 1; the first firmware slot (slot 1) is read/write
03: Log Page Attributes (LPA): the controller supports the Command Effects log page, and the controller supports the SMART / Health information log page on a per namespace basis
3f: Error Log Page Entries (ELPE): the maximum number of Error Information log entries that are stored by the controller (0's based, so 64)
04: Number of Power States Support (NPSS): the number of NVM Express power states supported by the controller (0's based, so 5)
01: Admin Vendor Specific Command Configuration (AVSCC): all Admin Vendor Specific Commands use the format defined in Figure 13.
01: Autonomous Power State Transition Attributes (APSTA): the controller supports autonomous power state transitions.
7f 01: Warning Composite Temperature Threshold (WCTEMP): warning temperature, 0x017f = 383 K
93 01: Critical Composite Temperature Threshold (CCTEMP): critical temperature, 0x0193 = 403 K
66: Submission Queue Entry Size (SQES): maximum Submission Queue entry size when using the NVM Command Set: 6; required Submission Queue entry size when using the NVM Command Set: 6 (2^6 = 64 bytes)
44: Completion Queue Entry Size (CQES): maximum Completion Queue entry size when using the NVM Command Set: 4; required Completion Queue entry size when using the NVM Command Set: 4 (2^4 = 16 bytes)
01 00 00 00: Number of Namespaces (NN): the number of valid namespaces present for the controller: 1
1e 00: Optional NVM Command Support (ONCS): the controller does not support the Compare command; the controller supports the Write Uncorrectable command, the Dataset Management command, the Write Zeroes command, and the Save field in the Set Features command and the Select field in the Get Features command.
00 00: Fused Operation Support (FUSES): the controller does not support the Compare and Write fused operation.
01: Format NVM Attributes (FNA): all namespaces shall be configured with the same attributes, and a format of any namespace results in a format of all namespaces
01: Volatile Write Cache (VWC, byte 525): indicates that a volatile write cache is present
ff 00: Atomic Write Unit Normal (AWUN): maximum number of logical blocks in an atomic write (0's based, so 256)
00 00: Atomic Write Unit Power Fail (AWUPF)
01: NVM Vendor Specific Command Configuration (NVSCC): all NVM Vendor Specific Commands use the format defined in Figure 13.
00 00: Atomic Compare & Write Unit (ACWU)
16 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 52 1c 40 00 16 03 81 00 00 00 00 00 00 00 00 00: Power State 0 Descriptor (PSD0)
16 03: the maximum power consumed by the NVM subsystem in this power state: 0x0316 = 790, in units of 0.01 W, i.e. 7.9 W
00: Reserved
00: the controller processes I/O commands in this power state; the scale of the Maximum Power field is 0.01 W.
00 00 00 00: the maximum entry latency in microseconds associated with entering this power state.
00 00 00 00: the maximum exit latency in microseconds associated with exiting this power state.
00: the relative read throughput associated with this power state.
00: the relative read latency associated with this power state.
00: the relative write throughput associated with this power state.
52 1c: the typical power consumed by the NVM subsystem over 30 seconds in this power state when idle: 0x1c52 = 7250 × 0.0001 W = 0.725 W
40: Idle Power Scale (0.0001 W)
00: Reserved
81: Active Power Scale: 0.01 W; the workload used to calculate maximum power for this power state: 001b
f0 00: the largest average power consumed by the NVM subsystem over a 10 second period in this power state with the workload indicated in the Active Power Workload field.
00 00 00 00 00 00 00 00 00 Power State 1 Descriptor (PSD1):
{
be 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 52 1c 40 00 be 00 81 00 00 00 00 00 00 00 00 00 Power State 2 Descriptor (PSD2);
4c 04 00 03 58 02 00 00 58 02 00 00 02 02 02 02 4c 04 40 00 4c 04 41 00 00 00 00 00 00 00 00 00 Power State 3 Descriptor (PSD3):
32 00 00 03 a0 86 01 00 00 71 02 00 03 03 03 03 32 00 40 00 32 00 41 00 00 00 00 00 00 00 00 00 Power State 4 Descriptor (PSD4):
}
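Several of the values in the dump above are easy to misread because of the byte order and the packed sub-fields. The sketch below shows one way to decode them: a little-endian read, the split of SQES/CQES into required and maximum sizes, and the power scale. All function names (`rd_le`, `queue_entry_bytes`, `power_in_100uw`) are invented for this illustration.

```c
#include <stdint.h>

/* Read an n-byte little-endian value (n <= 4): low byte first. */
static uint32_t rd_le(const uint8_t *p, int n)
{
    uint32_t v = 0;
    for (int i = n - 1; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* SQES/CQES: bits 3:0 are the required size and bits 7:4 the maximum
 * size, both as a power of two in bytes. */
static unsigned queue_entry_bytes(uint8_t es, int want_max)
{
    unsigned exponent = want_max ? (unsigned)(es >> 4) : (unsigned)(es & 0x0f);
    return 1u << exponent;
}

/* Convert a raw power field to units of 0.0001 W.
 * scale_is_100uw = 0 means the raw value is in 0.01 W units,
 * scale_is_100uw = 1 means it is already in 0.0001 W units. */
static uint32_t power_in_100uw(uint16_t raw, int scale_is_100uw)
{
    return scale_is_100uw ? raw : (uint32_t)raw * 100u;
}
```

With the bytes from the dump, `rd_le` turns 87 19 into 0x1987 and 00 02 01 00 into 0x00010200 (major 1, minor 2), `queue_entry_bytes(0x66, 0)` gives 64, and `power_in_100uw(0x0316, 0)` gives 79000, i.e. 7.9 W.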