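The READ_6 command can only address about 1GB on a device with 512B blocks, because its LBA field is just 21 bits wide, so the official recommendation is to migrate from READ_6 to READ_10.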
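READ_10 can address 2TB, which is more than enough for an 800GB NVMe device. READ_10 also has a pile of other features, but the current nvme-scsi.c ignores most of them, so I will leave them aside for now.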
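The addressing limits can be checked with a few lines of arithmetic. This is a minimal sketch, assuming READ(6) carries a 21-bit LBA and READ(10) a 32-bit LBA; the helper name max_addressable_bytes is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Largest byte range a command can address, given the width of its LBA
 * field in bits and the logical block size in bytes.
 * Hypothetical helper, not part of any SCSI library. */
static uint64_t max_addressable_bytes(unsigned lba_bits, uint32_t block_size)
{
    assert(lba_bits < 32 + 1);   /* READ(6)/READ(10) LBAs fit in 32 bits */
    return ((uint64_t)1 << lba_bits) * block_size;
}
```

With 512B blocks, max_addressable_bytes(21, 512) is 2^30 bytes (1GB) and max_addressable_bytes(32, 512) is 2^41 bytes (2TB).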
Sending a READ_10 SCSI command through ioctl takes at least the four steps below.
As the figure shows, the command consists of 10 chars (bytes). We need to fill in the opcode, the LBA and the Transfer Length:
unsigned char rdCmd[10] = {READ_10, 0, 0, 0, 0, 0, 0, 0, 0, 0};
rdCmd[2] = (unsigned char)((start_lba >> 24) & 0xff);
rdCmd[3] = (unsigned char)((start_lba >> 16) & 0xff);
rdCmd[4] = (unsigned char)((start_lba >> 8) & 0xff);
rdCmd[5] = (unsigned char)(start_lba & 0xff);
rdCmd[7] = (unsigned char)((lba_num >> 8) & 0xff);
rdCmd[8] = (unsigned char)(lba_num & 0xff);
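The same byte layout can be wrapped in a small function. This is only a sketch: build_read10 and READ_10_OPCODE are names invented here, and 0x28 is the READ(10) operation code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define READ_10_OPCODE 0x28   /* READ(10) operation code */

/* Fill a 10-byte READ(10) CDB: big-endian 32-bit LBA in bytes 2..5,
 * big-endian 16-bit transfer length (in blocks) in bytes 7..8.
 * All other bytes (flags, group number, control) are left at zero. */
static void build_read10(unsigned char cdb[10], uint32_t start_lba,
                         uint16_t lba_num)
{
    assert(cdb != NULL);
    memset(cdb, 0, 10);
    cdb[0] = READ_10_OPCODE;
    cdb[2] = (unsigned char)(start_lba >> 24);
    cdb[3] = (unsigned char)(start_lba >> 16);
    cdb[4] = (unsigned char)(start_lba >> 8);
    cdb[5] = (unsigned char)start_lba;
    cdb[7] = (unsigned char)(lba_num >> 8);
    cdb[8] = (unsigned char)lba_num;
}
```

Taking lba_num as a uint16_t makes the 16-bit limit of the Transfer Length field visible in the signature, so a caller cannot silently pass a count that would be truncated.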
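That is, read lba_num logical blocks starting from start_lba; a logical block is usually 512B. Since Transfer Length is a 16-bit field, lba_num cannot exceed 65535.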
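The NVMe driver accepts two read/write modes, which differ in how iovec_count is set in the sg_io_hdr of the next step. When this field is non-zero, scatter gather is in use and a vector of requests is passed to the device.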
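The non-sg method can request a transfer of any size, and the driver splits the read/write request automatically, which keeps it safe and reliable. However, passing one large buffer this way easily produces an ENOMEM error.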
The following passage is quoted from sg.danny.cz:
Scatter gather allows large buffers (previously limited to 128 KB on i386) to be used. Scatter gather is also a lot more “kernel friendly”. The original driver used a single large buffer which made it impossible to run 2 or more sg-based applications at the same time. With the new driver a buffer is reserved for each file descriptor guaranteeing that at least that buffer size will be available for each request on the file descriptor. A user may request a larger buffer size on any particular request but runs the (usually remote) risk of an out of memory (ENOMEM) error.
So we prefer the scatter gather method and manually split a large request into many small chunks. The process is described in detail below.
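nvme-scsi.c in the NVMe driver has a function named nvme_trans_io, which does the work of translating a standard SCSI read/write command into NVMe commands.
This function places some restrictions on the SCSI request. It contains the following statement: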
/* IO vector sizes should be multiples of block size */
if (sgl.iov_len % (1 << ns->lba_shift) != 0) {
    res = nvme_trans_completion(hdr,
                SAM_STAT_CHECK_CONDITION,
                ILLEGAL_REQUEST,
                SCSI_ASC_INVALID_PARAMETER,
                SCSI_ASCQ_CAUSE_NOT_REPORTABLE);
    goto out;
}
So the split must use the LBA as its basic unit: every chunk has to be a whole number of logical blocks.
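The same condition can be checked in user space before the request is ever submitted. A minimal sketch, assuming lba_shift is log2 of the block size (9 for 512B blocks) as in the driver snippet; is_block_multiple is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 when len is a whole number of logical blocks, 0 otherwise.
 * lba_shift is log2(block size), e.g. 9 for 512-byte blocks. */
static int is_block_multiple(size_t len, unsigned lba_shift)
{
    assert(lba_shift < 8 * sizeof(size_t));
    /* Block sizes are powers of two, so the remainder is the low bits. */
    return (len & (((size_t)1 << lba_shift) - 1)) == 0;
}
```

Running this over every iov_len catches a misaligned chunk locally instead of getting back a CHECK CONDITION with ILLEGAL_REQUEST from the driver.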
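Note also that nvme_trans_do_nvme_io, the function nvme_trans_io calls to carry out the actual read or write, puts the request size from the user-supplied sg_iovec into the NVMe command without any check: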
if (hdr->iovec_count > 0) {
    unit_len = sgl.iov_len;
    unit_num_blocks = unit_len >> ns->lba_shift;
}
c.rw.length = cpu_to_le16(unit_num_blocks - 1);
nvme_sc = nvme_submit_io_cmd(dev, ns, &c, NULL);
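However, every NVMe command has an upper limit on its transfer size, so when splitting we must also make sure no chunk exceeds it.
On the 800GB NVMe P3700 this limit is 256 LBAs, i.e. 128KB. How to obtain it for a particular NVMe device was covered in an earlier post of mine.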
Below is the actual splitting process; IOVEC_ELEMS caps the number of chunks in a single read or write.
// split into iovecs
rem = len * lba_size;              // total bytes to transfer
max_size = max_blocks * lba_size;  // per-command limit in bytes
for (i = 0; i < IOVEC_ELEMS; ++i) {
    iovec[i].iov_base = (char*)(base + i * max_size);
    iovec[i].iov_len = (rem > max_size) ? max_size : rem;
    if (rem <= max_size)
        break;                     // last chunk: i + 1 iovecs are in use
    rem -= max_size;
}
if (i >= IOVEC_ELEMS) {            // ran out of slots before covering the request
    printf("Too much data!\n");
    goto exit;
}
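The splitting logic can also be written as a standalone function that returns the number of entries it filled, which is the value iovec_count needs later. This is a sketch under assumptions: split_into_iovecs is a made-up name, and it uses struct iovec from sys/uio.h, which as far as I know has the same two fields (iov_base, iov_len) as sg_iovec_t.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>   /* struct iovec */

/* Split `total` bytes starting at `base` into chunks of at most `max_size`
 * bytes each. Returns the number of entries filled, or -1 if the request
 * needs more than `max_elems` entries. */
static int split_into_iovecs(struct iovec *iov, int max_elems,
                             char *base, size_t total, size_t max_size)
{
    int n = 0;

    assert(iov != NULL && max_size > 0);
    while (total > 0) {
        size_t chunk = total < max_size ? total : max_size;

        if (n == max_elems)
            return -1;            /* too much data for one request */
        iov[n].iov_base = base;
        iov[n].iov_len = chunk;
        base += chunk;
        total -= chunk;
        ++n;
    }
    return n;
}
```

For example, 300 blocks of 512B with a 256-block limit yield two entries: one of 128KB and one of 22KB. If total and max_size are both multiples of the block size, every chunk is too, so the driver's alignment check is satisfied.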
sg_io_hdr
sg_io_hdr is defined in sg.h and holds all the information about one SCSI command:
typedef struct sg_io_hdr
{
int interface_id; /* [i] 'S' for SCSI generic (required) */
int dxfer_direction; /* [i] data transfer direction */
unsigned char cmd_len; /* [i] SCSI command length ( <= 16 bytes) */
unsigned char mx_sb_len; /* [i] max length to write to sbp */
unsigned short iovec_count; /* [i] 0 implies no scatter gather */
unsigned int dxfer_len; /* [i] byte count of data transfer */
void * dxferp; /* [i], [*io] points to data transfer memory
or scatter gather list */
unsigned char * cmdp; /* [i], [*i] points to command to perform */
unsigned char * sbp; /* [i], [*o] points to sense_buffer memory */
unsigned int timeout; /* [i] MAX_UINT->no timeout (unit: millisec) */
unsigned int flags; /* [i] 0 -> default, see SG_FLAG... */
int pack_id; /* [i->o] unused internally (normally) */
void * usr_ptr; /* [i->o] unused internally */
unsigned char status; /* [o] scsi status */
unsigned char masked_status;/* [o] shifted, masked scsi status */
unsigned char msg_status; /* [o] messaging level data (optional) */
unsigned char sb_len_wr; /* [o] byte count actually written to sbp */
unsigned short host_status; /* [o] errors from host adapter */
unsigned short driver_status;/* [o] errors from software driver */
int resid; /* [o] dxfer_len - actual_transferred */
unsigned int duration; /* [o] time taken by cmd (unit: millisec) */
unsigned int info; /* [o] auxiliary information */
} sg_io_hdr_t; /* 64 bytes long (on i386) */
These are basically the fields that need to be filled in:
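memset(&io_hdr, 0, sizeof(io_hdr));          // zero every field we do not set
io_hdr.interface_id = 'S';
io_hdr.cmd_len = sizeof(rdCmd);
io_hdr.cmdp = rdCmd;
io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;  // device -> host, i.e. a read
io_hdr.dxfer_len = lba_size * len;           // total bytes across all iovecs
io_hdr.iovec_count = num_of_iovec;           // non-zero selects scatter gather
io_hdr.dxferp = iovec;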
rdCmd and iovec are the ones filled in above.
In a nutshell:
err = ioctl(fd, SG_IO, &io_hdr);
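Putting the steps together, a complete call might look like the sketch below. It is not the original program: sg_read10 is a made-up name, the sense buffer size and the 20-second timeout are arbitrary choices, and mapping a non-zero status to EIO is my own convention. It also checks the status fields, since ioctl returning 0 only means the request was delivered.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

/* Issue one READ(10) through SG_IO. `iov` holds `iov_count` scatter gather
 * entries that together cover lba_num * lba_size bytes. Returns 0 on
 * success, -1 if the ioctl fails or the command reports an error. */
static int sg_read10(int fd, uint32_t start_lba, uint16_t lba_num,
                     uint32_t lba_size, sg_iovec_t *iov,
                     unsigned short iov_count)
{
    unsigned char cdb[10] = { 0x28 /* READ(10) opcode */ };
    unsigned char sense[32];          /* arbitrary sense buffer size */
    sg_io_hdr_t io_hdr;

    assert(iov != NULL && iov_count > 0);

    /* Big-endian LBA in bytes 2..5, transfer length in bytes 7..8. */
    cdb[2] = (unsigned char)(start_lba >> 24);
    cdb[3] = (unsigned char)(start_lba >> 16);
    cdb[4] = (unsigned char)(start_lba >> 8);
    cdb[5] = (unsigned char)start_lba;
    cdb[7] = (unsigned char)(lba_num >> 8);
    cdb[8] = (unsigned char)lba_num;

    memset(&io_hdr, 0, sizeof(io_hdr));
    io_hdr.interface_id = 'S';
    io_hdr.cmd_len = sizeof(cdb);
    io_hdr.cmdp = cdb;
    io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
    io_hdr.dxfer_len = lba_size * lba_num;
    io_hdr.iovec_count = iov_count;
    io_hdr.dxferp = iov;
    io_hdr.sbp = sense;
    io_hdr.mx_sb_len = sizeof(sense);
    io_hdr.timeout = 20000;           /* milliseconds; arbitrary */

    if (ioctl(fd, SG_IO, &io_hdr) < 0)
        return -1;                    /* errno was set by ioctl */

    /* The ioctl itself succeeded; now check how the command ended. */
    if (io_hdr.status != 0 || io_hdr.host_status != 0 ||
        io_hdr.driver_status != 0) {
        errno = EIO;
        return -1;
    }
    return 0;
}
```

A caller would open the device node, fill the iovec array as in the splitting step, and pass the number of filled entries as iov_count.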