Sending a SCSI READ_10 command to an NVMe device

0 The READ_10 command

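READ_6 carries only a 21-bit LBA, so on a device with 512 B blocks it can address just 1 GiB. The official recommendation is therefore to migrate from READ_6 to READ_10.
READ_10 carries a 32-bit LBA and can address 2 TiB, which is plenty for an 800 GB NVMe device. READ_10 has a number of other features as well, but the current nvme-scsi.c ignores most of them, so they are left aside here.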
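
The addressing limits above follow directly from the width of the LBA field. A minimal sketch of the arithmetic, assuming the 21-bit (READ_6) and 32-bit (READ_10) field widths and 512 B blocks; `addressable_bytes` is a hypothetical helper name, not something from the driver:

```c
#include <stdint.h>

/* Bytes reachable by a CDB whose LBA field is lba_bits wide.
 * Assumes lba_bits < 64 and that the product fits in 64 bits. */
static uint64_t addressable_bytes(unsigned lba_bits, uint64_t block_size)
{
    return ((uint64_t)1 << lba_bits) * block_size;
}
```

With 512 B blocks, `addressable_bytes(21, 512)` gives 2^30 bytes (1 GiB) and `addressable_bytes(32, 512)` gives 2^41 bytes (2 TiB).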

Sending a READ_10 SCSI command through ioctl takes at least the four steps below.

1 Building a READ(10) command

The READ(10) command is defined as follows:
[Figure: layout of the READ(10) command]

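As the figure shows, the command is 10 bytes long. The fields to fill in are the opcode, the LBA (bytes 2 to 5, most significant byte first) and the Transfer Length (bytes 7 and 8):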

    unsigned char rdCmd[10] = {READ_10, 0, 0, 0, 0, 0, 0, 0, 0, 0};
    rdCmd[2] = (unsigned char)((start_lba >> 24) & 0xff);
    rdCmd[3] = (unsigned char)((start_lba >> 16) & 0xff);
    rdCmd[4] = (unsigned char)((start_lba >> 8) & 0xff);
    rdCmd[5] = (unsigned char)(start_lba & 0xff);
    rdCmd[7] = (unsigned char)((lba_num >> 8) & 0xff);
    rdCmd[8] = (unsigned char)(lba_num & 0xff);

In other words, read lba_num logical blocks starting at start_lba. A logical block is typically 512 B.
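
The same byte packing can be wrapped in a small function so that the field widths are enforced by the parameter types. This is only a sketch: `build_read10_cdb` and `READ_10_OPCODE` are names made up for this example, and 0x28 is assumed to be the value that `READ_10` has in `<scsi/scsi.h>`.

```c
#include <stdint.h>
#include <string.h>

#define READ_10_OPCODE 0x28  /* assumed equal to READ_10 in <scsi/scsi.h> */

/* Fill a 10-byte READ(10) command: 32-bit LBA in bytes 2..5 and
 * 16-bit transfer length (in blocks) in bytes 7..8, both big-endian.
 * All other bytes are left at zero. */
static void build_read10_cdb(uint8_t cdb[10], uint32_t start_lba, uint16_t lba_num)
{
    memset(cdb, 0, 10);
    cdb[0] = READ_10_OPCODE;
    cdb[2] = (uint8_t)(start_lba >> 24);
    cdb[3] = (uint8_t)(start_lba >> 16);
    cdb[4] = (uint8_t)(start_lba >> 8);
    cdb[5] = (uint8_t)start_lba;
    cdb[7] = (uint8_t)(lba_num >> 8);
    cdb[8] = (uint8_t)lba_num;
}
```

Taking `lba_num` as a `uint16_t` makes it visible at the call site that a single READ(10) cannot ask for more than 65535 blocks.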

2 Splitting the read/write request

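NVMe accepts two read/write modes. The difference lies in how iovec_count in the sg_io_hdr (built in the next step) is set: when this field is non-zero, scatter gather is in use and a vector of requests is passed to the device.
The non-sg method can request a transfer of any size, and the driver splits the request automatically, which keeps it safe and reliable. However, passing one large buffer with the non-sg method can easily fail with ENOMEM.
The following passage is quoted from sg.danny.cz: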

Scatter gather allows large buffers (previously limited to 128 KB on i386) to be used. Scatter gather is also a lot more “kernel friendly”. The original driver used a single large buffer which made it impossible to run 2 or more sg-based applications at the same time. With the new driver a buffer is reserved for each file descriptor guaranteeing that at least that buffer size will be available for each request on the file descriptor. A user may request a larger buffer size on any particular request but runs the (usually remote) risk of an out of memory (ENOMEM) error.

So we prefer the scatter gather method and manually split a large request into many small chunks. The process is described below.

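nvme-scsi.c in the NVMe driver has a function named nvme_trans_io, which does the work of translating a standard SCSI read/write command into NVMe commands.
This function places some restrictions on the SCSI request. It contains the following check: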

/* IO vector sizes should be multiples of block size */
if (sgl.iov_len % (1 << ns->lba_shift) != 0) {
    res = nvme_trans_completion(hdr,
            SAM_STAT_CHECK_CONDITION,
            ILLEGAL_REQUEST,
            SCSI_ASC_INVALID_PARAMETER,
            SCSI_ASCQ_CAUSE_NOT_REPORTABLE);    
    goto out;
}

So the request must be split in units of LBAs: every chunk has to be a whole number of logical blocks.
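
The same condition can be checked in user space before the request is submitted, which avoids a round trip that would end in ILLEGAL_REQUEST. A minimal sketch, assuming the block size is a power of two given as a shift (as `ns->lba_shift` is in the driver); `iov_len_is_block_multiple` is a hypothetical name:

```c
#include <stddef.h>

/* Nonzero if iov_len is a whole number of logical blocks,
 * where one block is (1 << lba_shift) bytes. */
static int iov_len_is_block_multiple(size_t iov_len, unsigned lba_shift)
{
    size_t block_size = (size_t)1 << lba_shift;
    return (iov_len & (block_size - 1)) == 0;
}
```

For 512 B blocks the shift is 9, so a 4096-byte chunk passes and a 1000-byte chunk does not.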

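Also note that nvme_trans_do_nvme_io, the function nvme_trans_io calls to carry out the read or write, copies the request size from the user-supplied sg_iovec into the NVMe command without checking it: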

if (hdr->iovec_count > 0) {
    unit_len = sgl.iov_len;
    unit_num_blocks = unit_len >> ns->lba_shift;
}
c.rw.length = cpu_to_le16(unit_num_blocks - 1);
nvme_sc = nvme_submit_io_cmd(dev, ns, &c, NULL);

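Each NVMe command, however, has an upper limit on its transfer size, so when splitting, no chunk may exceed that limit.
For the 800GB NVMe P3700 the limit is 256 LBAs, i.e. 128KB. How to obtain this limit for a particular NVMe device was covered in one of my earlier posts.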
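
The earlier post is not reproduced here. As a rough sketch only, and assuming the limit is taken from the MDTS field of the Identify Controller data (which the NVMe specification defines as a power of two in units of the controller's minimum memory page size, with 0 meaning no limit), the conversion to bytes and LBAs could look like this. The helper names and the example values (MDTS of 5, 4 KiB pages) are illustrative assumptions, not figures from this article:

```c
#include <stdint.h>

/* Maximum bytes per command implied by MDTS.
 * Returns 0 when mdts is 0, i.e. the controller reports no limit. */
static uint64_t mdts_to_bytes(unsigned mdts, uint64_t min_page_size)
{
    if (mdts == 0)
        return 0;
    return ((uint64_t)1 << mdts) * min_page_size;
}

/* Convert a byte count to logical blocks of (1 << lba_shift) bytes. */
static uint64_t bytes_to_blocks(uint64_t bytes, unsigned lba_shift)
{
    return bytes >> lba_shift;
}
```

With those example values, `mdts_to_bytes(5, 4096)` is 131072 bytes, and `bytes_to_blocks(131072, 9)` is 256 blocks, which matches the 256 LBA / 128KB figure quoted above.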

The splitting itself looks like this. IOVEC_ELEMS caps the number of chunks in a single read or write.

// split into iovecs
rem = len * lba_size;
max_size = max_blocks * lba_size;
for (i = 0; i < IOVEC_ELEMS; ++i){
    iovec[i].iov_base = (char*)(base + i * max_size);
    iovec[i].iov_len = (rem > max_size) ? max_size : rem;
    if (rem <= max_size)
        break;
    rem -= max_size;
}
if (i >= IOVEC_ELEMS){
    printf("Too many data!\n");
    goto exit;
}
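
For reuse, the loop above can be turned into a function that returns how many entries it filled, which is exactly the value needed later for iovec_count. This is a sketch rather than the original code: `split_into_iovecs` and its parameters are hypothetical, and it relies on `sg_iovec_t` from `<scsi/sg.h>` on Linux.

```c
#include <stddef.h>
#include <scsi/sg.h>   /* sg_iovec_t */

/* Split `total` bytes starting at `base` into chunks of at most
 * `max_size` bytes each. Returns the number of entries written to
 * `iov`, or -1 if the arguments are invalid or more than `max_elems`
 * entries would be needed. */
static int split_into_iovecs(sg_iovec_t *iov, int max_elems,
                             char *base, size_t total, size_t max_size)
{
    int n = 0;

    if (total == 0 || max_size == 0)
        return -1;

    while (total > 0) {
        size_t chunk = total > max_size ? max_size : total;

        if (n >= max_elems)
            return -1;              /* request needs too many chunks */

        iov[n].iov_base = base;
        iov[n].iov_len = chunk;
        base += chunk;
        total -= chunk;
        n++;
    }
    return n;
}
```

The caller is still responsible for passing a `total` and `max_size` that are multiples of the logical block size; when both are, every chunk, including the last one, is a whole number of blocks.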

3 Building an sg_io_hdr

sg_io_hdr is defined in sg.h and carries all the information for one SCSI command:

typedef struct sg_io_hdr
{
    int interface_id;           /* [i] 'S' for SCSI generic (required) */
    int dxfer_direction;        /* [i] data transfer direction  */
    unsigned char cmd_len;      /* [i] SCSI command length ( <= 16 bytes) */
    unsigned char mx_sb_len;    /* [i] max length to write to sbp */
    unsigned short iovec_count; /* [i] 0 implies no scatter gather */
    unsigned int dxfer_len;     /* [i] byte count of data transfer */
    void * dxferp;              /* [i], [*io] points to data transfer memory
                                              or scatter gather list */
    unsigned char * cmdp;       /* [i], [*i] points to command to perform */
    unsigned char * sbp;        /* [i], [*o] points to sense_buffer memory */
    unsigned int timeout;       /* [i] MAX_UINT->no timeout (unit: millisec) */
    unsigned int flags;         /* [i] 0 -> default, see SG_FLAG... */
    int pack_id;                /* [i->o] unused internally (normally) */
    void * usr_ptr;             /* [i->o] unused internally */
    unsigned char status;       /* [o] scsi status */
    unsigned char masked_status;/* [o] shifted, masked scsi status */
    unsigned char msg_status;   /* [o] messaging level data (optional) */
    unsigned char sb_len_wr;    /* [o] byte count actually written to sbp */
    unsigned short host_status; /* [o] errors from host adapter */
    unsigned short driver_status;/* [o] errors from software driver */
    int resid;                  /* [o] dxfer_len - actual_transferred */
    unsigned int duration;      /* [o] time taken by cmd (unit: millisec) */
    unsigned int info;          /* [o] auxiliary information */
} sg_io_hdr_t;  /* 64 bytes long (on i386) */

Basically, these are the fields that need to be filled in:

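io_hdr.interface_id = 'S';
io_hdr.cmd_len = sizeof(rdCmd);
io_hdr.cmdp = rdCmd;
io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
io_hdr.dxfer_len = lba_size * len;      /* total bytes across all iovecs */
io_hdr.iovec_count = num_of_iovec;      /* i + 1 after the split loop above */
io_hdr.dxferp = iovec;                  /* the sg_iovec array, not the data buffer */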

rdCmd and iovec are the ones filled in above.
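
Put together as a function, the setup might look like the sketch below. It goes slightly beyond the list above: it zeroes the structure first so the remaining input fields start at 0, attaches a sense buffer so that error details can be retrieved, and sets a timeout. The function name, the parameter names and the 20 second timeout are assumptions chosen for this example.

```c
#include <string.h>
#include <scsi/sg.h>   /* sg_io_hdr_t, sg_iovec_t, SG_DXFER_FROM_DEV */

/* Prepare an sg_io_hdr for a scatter-gather read. */
static void fill_read_hdr(sg_io_hdr_t *hdr,
                          unsigned char *cdb, unsigned char cdb_len,
                          sg_iovec_t *iov, unsigned short iov_count,
                          unsigned int total_bytes,
                          unsigned char *sense, unsigned char sense_len)
{
    memset(hdr, 0, sizeof(*hdr));        /* all other fields start at 0 */
    hdr->interface_id = 'S';
    hdr->cmdp = cdb;
    hdr->cmd_len = cdb_len;
    hdr->dxfer_direction = SG_DXFER_FROM_DEV;
    hdr->dxfer_len = total_bytes;        /* sum of all iov_len values */
    hdr->iovec_count = iov_count;        /* non-zero selects scatter gather */
    hdr->dxferp = iov;
    hdr->sbp = sense;                    /* filled in on CHECK CONDITION */
    hdr->mx_sb_len = sense_len;
    hdr->timeout = 20000;                /* milliseconds; arbitrary choice */
}
```

A typical call would pass the 10-byte command from step 1, the iovec array and count from step 2, and a sense buffer of 32 bytes or so.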

4 Sending the request

In one line:

err = ioctl(fd, SG_IO, &io_hdr);
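
A zero return from ioctl only means the request was delivered; the command itself may still have failed, which is reported in the output fields of sg_io_hdr. A hedged sketch of a wrapper that checks both, with `sg_io_ok` and `send_sg_io` as hypothetical helper names:

```c
#include <sys/ioctl.h>
#include <scsi/sg.h>   /* SG_IO, sg_io_hdr_t */

/* Nonzero if a completed header reports no SCSI, host or driver error. */
static int sg_io_ok(const sg_io_hdr_t *hdr)
{
    return hdr->status == 0 &&
           hdr->host_status == 0 &&
           hdr->driver_status == 0;
}

/* Submit the request on fd.
 * Returns 0 on success, -1 if ioctl itself failed (errno is set),
 * and 1 if the ioctl went through but the command reported an error. */
static int send_sg_io(int fd, sg_io_hdr_t *hdr)
{
    if (ioctl(fd, SG_IO, hdr) < 0)
        return -1;
    return sg_io_ok(hdr) ? 0 : 1;
}
```

When the result is 1 and a sense buffer was attached, `sb_len_wr` tells how many sense bytes were written, and `resid` shows how much of `dxfer_len` was not transferred.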
