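Those Linux Things: I Am a SCSI Disk (7): The ioctl That Walks In from the Application Layer

2007 is over. The celebrities carried on as ever this year: the big names kept leading the pack, and the juniors hoping to move up tried every trick they had. Those due to fall in love fell in love, those due to chase publicity chased publicity, those due for plastic surgery got it. The established did charity work; the not-yet-established manufactured scandals. As for me, my job is to keep writing my blog, keep talking about those Linux things, and keep talking about those boring functions. Having finished sd_probe, we are about to meet some new functions. First up is ioctl, which in the sd module means sd_ioctl.

When we send a command to a SCSI disk, this function will most likely be called. Let's demonstrate with kdb. First set a breakpoint on sd_ioctl in kdb: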

Entering kdb (current=0xffff81022adcf140, pid 4074) on processor 6 due to KDB_ENTER()

[6]kdb> bp sd_ioctl

Instruction(i) BP #0 at 0xffffffff880b1de2 ([sd_mod]sd_ioctl)

    is enabled globally adjust 1

[6]kdb> go

Then run a SCSI command, for example INQUIRY:

[root@lfg2 tedkdb]# sg_inq /dev/sdg

Needless to say, the kdb prompt pops up in a flash:

Entering kdb (current=0xffff81022cdae760, pid 4095) on processor 5 due to Breakpoint @ 0xffffffff880b1de2

[5]kdb> bt

Stack traceback for pid 4095

0xffff81022cdae760     4095     4044  1    5   R  0xffff81022cdaea40 *sg_inq

rsp                rip                Function (args)

 =======================

0xffff81022f619fd8 0xffffffff880b1de2 [sd_mod]sd_ioctl

 =======================

0xffff81022d7f7d40 0xffffffff803101a0 blkdev_driver_ioctl+0x63

0xffff81022d7f7d80 0xffffffff80310810 blkdev_ioctl+0x65b

0xffff81022d7f7da0 0xffffffff80278825 __handle_mm_fault+0x6d3

0xffff81022d7f7db0 0xffffffff8029a1ea may_open+0x65

0xffff81022d7f7e10 0xffffffff80322a48 __up_read+0x7a

0xffff81022d7f7e40 0xffffffff80249b4f up_read+0x9

0xffff81022d7f7e50 0xffffffff80482d6d do_page_fault+0x48a

0xffff81022d7f7e60 0xffffffff8027acff __vma_link+0x52

0xffff81022d7f7ea0 0xffffffff802b58d5 block_ioctl+0x1b

0xffff81022d7f7eb0 0xffffffff8029da98 do_ioctl+0x2c

0xffff81022d7f7ee0 0xffffffff8029dd6d vfs_ioctl+0x247

0xffff81022d7f7f30 0xffffffff8029dde9 sys_ioctl+0x5f

0xffff81022d7f7f80 0xffffffff80209efc tracesys+0xdc

[5]kdb>

The path actually taken is the ioctl system call: we come in through sys_ioctl and eventually arrive at sd_ioctl.

sd_ioctl comes from drivers/scsi/sd.c:

    641 /**

    642  *      sd_ioctl - process an ioctl

    643  *      @inode: only i_rdev/i_bdev members may be used

    644  *      @filp: only f_mode and f_flags may be used

    645  *      @cmd: ioctl command number

    646  *      @arg: this is third argument given to ioctl(2) system call.

    647  *      Often contains a pointer.

    648  *

    649  *      Returns 0 if successful (some ioctls return postive numbers on

    650  *      success as well). Returns a negated errno value in case of error.

    651  *

    652  *      Note: most ioctls are forward onto the block subsystem or further

    653  *      down in the scsi subsytem.

    654  **/

    655 static int sd_ioctl(struct inode * inode, struct file * filp,

    656                     unsigned int cmd, unsigned long arg)

    657 {

    658         struct block_device *bdev = inode->i_bdev;

    659         struct gendisk *disk = bdev->bd_disk;

    660         struct scsi_device *sdp = scsi_disk(disk)->device;

    661         void __user *p = (void __user *)arg;

    662         int error;

    663

    664         SCSI_LOG_IOCTL(1, printk("sd_ioctl: disk=%s, cmd=0x%x\n",

    665                                                 disk->disk_name, cmd));

    666

    667         /*

    668          * If we are in the middle of error recovery, don't let anyone

    669          * else try and use this device.  Also, if error recovery fails, it

    670          * may try and take the device offline, in which case all further

    671          * access to the device is prohibited.

    672          */

    673         error = scsi_nonblockable_ioctl(sdp, cmd, p, filp);

    674         if (!scsi_block_when_processing_errors(sdp) || !error)

    675                 return error;

    676

    677         /*

    678          * Send SCSI addressing ioctls directly to mid level, send other

    679          * ioctls to block level and then onto mid level if they can't be

    680          * resolved.

    681          */

    682         switch (cmd) {

    683                 case SCSI_IOCTL_GET_IDLUN:

    684                 case SCSI_IOCTL_GET_BUS_NUMBER:

    685                         return scsi_ioctl(sdp, cmd, p);

    686                 default:

    687                         error = scsi_cmd_ioctl(filp, disk, cmd, p);

    688                         if (error != -ENOTTY)

    689                                 return error;

    690         }

    691         return scsi_ioctl(sdp, cmd, p);

    692 }
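
The tail of sd_ioctl is a small "try the block layer first, fall back to the mid level" dispatch. Below is a user-space sketch of just that control flow. Everything in it is hypothetical: the command numbers, the mock_* functions and the marker return value 100 are stand-ins of my own, not kernel definitions. The only thing borrowed from the listing is the shape of the switch and the use of -ENOTTY as the "not mine" answer.

```c
#include <errno.h>

/* Hypothetical command numbers, for illustration only. */
enum { CMD_GET_IDLUN = 1, CMD_GET_BUS_NUMBER, CMD_SG_IO, CMD_OTHER };

/* Stand-in for scsi_cmd_ioctl(): this mock block layer understands
 * CMD_SG_IO and answers -ENOTTY for anything it does not recognise. */
static int mock_block_ioctl(int cmd)
{
    return cmd == CMD_SG_IO ? 0 : -ENOTTY;
}

/* Stand-in for scsi_ioctl(): returns the marker value 100 so a caller
 * can tell that the mid level handled the command. */
static int mock_midlevel_ioctl(int cmd)
{
    (void)cmd;
    return 100;
}

/* Same shape as the switch at the end of sd_ioctl. */
int mock_sd_dispatch(int cmd)
{
    int error;

    switch (cmd) {
    case CMD_GET_IDLUN:
    case CMD_GET_BUS_NUMBER:
        /* addressing ioctls go straight to the mid level */
        return mock_midlevel_ioctl(cmd);
    default:
        /* everything else is offered to the block layer first */
        error = mock_block_ioctl(cmd);
        if (error != -ENOTTY)
            return error;
    }
    /* the block layer did not recognise it: fall back to the mid level */
    return mock_midlevel_ioctl(cmd);
}
```

With these stand-ins, CMD_SG_IO is answered by the mock block layer, while the two addressing commands and any unrecognised command end up at the mock mid level.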

Set further breakpoints on scsi_cmd_ioctl and scsi_ioctl, and you will find that the one that gets called is scsi_cmd_ioctl.

Instruction(i) breakpoint #2 at 0xffffffff802ee554 (adjusted)

0xffffffff802ee554 scsi_cmd_ioctl:         int3

 

Entering kdb (current=0xffff81022e1a01c0, pid 3583) due to Breakpoint @ 0xffffffff802ee554

kdb> bt

Stack traceback for pid 3583

0xffff81022e1a01c0     3583     3425  1    0   R  0xffff81022e1a0490 *sg_inq

rsp                rip                Function (args)

 =======================

0xffffffff805dafa0 0xffffffff88080dcc [sd_mod]sd_ioctl

0xffffffff805dafd8 0xffffffff802ee554 scsi_cmd_ioctl

 =======================

0xffff8102112d7d70 0xffffffff88080e68 [sd_mod]sd_ioctl+0x9c

0xffff8102112d7db0 0xffffffff802ec9b2 blkdev_driver_ioctl+0x3a

0xffff8102112d7dc0 0xffffffff802ed00d blkdev_ioctl+0x627

0xffff8102112d7e40 0xffffffff80239e1b up_read+0x9

0xffff8102112d7e50 0xffffffff80435872 do_page_fault+0x48a

0xffff8102112d7ea0 0xffffffff802641f6 vma_link+0x3a

0xffff8102112d7ee0 0xffffffff80295155 block_ioctl+0x1b

0xffff8102112d7ef0 0xffffffff8027ec3b do_ioctl+0x1b

0xffff8102112d7f00 0xffffffff8027ee7e vfs_ioctl+0x20e

0xffff8102112d7f30 0xffffffff8027eeef sys_ioctl+0x5f

0xffff8102112d7f80 0xffffffff80209c8c tracesys+0xdc

scsi_cmd_ioctl comes from block/scsi_ioctl.c:

    520 int scsi_cmd_ioctl(struct file *file, struct gendisk *bd_disk, unsigned int cmd, void __user *arg)

    521 {

    522         request_queue_t *q;

    523         int err;

    524

    525         q = bd_disk->queue;

    526         if (!q)

    527                 return -ENXIO;

    528

    529         if (blk_get_queue(q))

    530                 return -ENXIO;

    531

    532         switch (cmd) {

    533                 /*

    534                  * new sgv3 interface

    535                  */

    536                 case SG_GET_VERSION_NUM:

    537                         err = sg_get_version(arg);

    538                         break;

    539                 case SCSI_IOCTL_GET_IDLUN:

    540                         err = scsi_get_idlun(q, arg);

    541                         break;

    542                 case SCSI_IOCTL_GET_BUS_NUMBER:

    543                         err = scsi_get_bus(q, arg);

    544                         break;

    545                 case SG_SET_TIMEOUT:

    546                         err = sg_set_timeout(q, arg);

    547                         break;

    548                 case SG_GET_TIMEOUT:

    549                         err = sg_get_timeout(q);

    550                         break;

    551                 case SG_GET_RESERVED_SIZE:

    552                         err = sg_get_reserved_size(q, arg);

    553                         break;

    554                 case SG_SET_RESERVED_SIZE:

    555                         err = sg_set_reserved_size(q, arg);

    556                         break;

    557                 case SG_EMULATED_HOST:

    558                         err = sg_emulated_host(q, arg);

    559                         break;

    560                 case SG_IO: {

    561                         struct sg_io_hdr hdr;

    562

    563                         err = -EFAULT;

    564                         if (copy_from_user(&hdr, arg, sizeof(hdr)))

    565                                 break;

    566                         err = sg_io(file, q, bd_disk, &hdr);

    567                         if (err == -EFAULT)

    568                                 break;

    569

    570                         if (copy_to_user(arg, &hdr, sizeof(hdr)))

    571                                 err = -EFAULT;

    572                         break;

    573                 }

    574                 case CDROM_SEND_PACKET: {

    575                         struct cdrom_generic_command cgc;

    576                         struct sg_io_hdr hdr;

    577

    578                         err = -EFAULT;

    579                         if (copy_from_user(&cgc, arg, sizeof(cgc)))

    580                                 break;

    581                         cgc.timeout = clock_t_to_jiffies(cgc.timeout);

    582                         memset(&hdr, 0, sizeof(hdr));

    583                         hdr.interface_id = 'S';

    584                         hdr.cmd_len = sizeof(cgc.cmd);

    585                         hdr.dxfer_len = cgc.buflen;

    586                         err = 0;

    587                         switch (cgc.data_direction) {

    588                                 case CGC_DATA_UNKNOWN:

    589                                         hdr.dxfer_direction = SG_DXFER_UNKNOWN;

    590                                         break;

    591                                 case CGC_DATA_WRITE:

    592                                         hdr.dxfer_direction = SG_DXFER_TO_DEV;

    593                                         break;

    594                                 case CGC_DATA_READ:

    595                                         hdr.dxfer_direction = SG_DXFER_FROM_DEV;

    596                                         break;

    597                                 case CGC_DATA_NONE:

    598                                         hdr.dxfer_direction = SG_DXFER_NONE;

    599                                         break;

    600                                 default:

    601                                         err = -EINVAL;

    602                         }

    603                         if (err)

    604                                 break;

    605

    606                         hdr.dxferp = cgc.buffer;

    607                         hdr.sbp = cgc.sense;

    608                         if (hdr.sbp)

    609                                 hdr.mx_sb_len = sizeof(struct request_sense);

    610                         hdr.timeout = cgc.timeout;

    611                         hdr.cmdp = ((struct cdrom_generic_command __user*) arg)->cmd;

    612                         hdr.cmd_len = sizeof(cgc.cmd);

    613

    614                         err = sg_io(file, q, bd_disk, &hdr);

    615                         if (err == -EFAULT)

    616                                 break;

    617

    618                         if (hdr.status)

    619                                 err = -EIO;

    620

    621                         cgc.stat = err;

    622                         cgc.buflen = hdr.resid;

    623                         if (copy_to_user(arg, &cgc, sizeof(cgc)))

    624                                 err = -EFAULT;

    625

    626                         break;

    627                 }

    628

    629                 /*

    630                  * old junk scsi send command ioctl

    631                  */

    632                 case SCSI_IOCTL_SEND_COMMAND:

    633                         printk(KERN_WARNING "program %s is using a deprecated SCSI ioctl, please convert it to SG_IO\n", current->comm);

    634                         err = -EINVAL;

    635                         if (!arg)

    636                                 break;

    637

    638                         err = sg_scsi_ioctl(file, q, bd_disk, arg);

    639                         break;

    640                 case CDROMCLOSETRAY:

    641                         err = blk_send_start_stop(q, bd_disk, 0x03);

    642                         break;

    643                 case CDROMEJECT:

    644                         err = blk_send_start_stop(q, bd_disk, 0x02);

    645                         break;

    646                 default:

    647                         err = -ENOTTY;

    648         }

    649

    650         blk_put_queue(q);

    651         return err;

    652 }

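Tracing with kdb, you will find that the cmd passed in is SG_IO. In other words, this long switch-case finally settles on the SG_IO case at line 560. That makes one structure very important here: sg_io_hdr, from include/scsi/sg.h.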

     83 typedef struct sg_io_hdr

     84 {

     85     int interface_id;           /* [i] 'S' for SCSI generic (required) */

     86     int dxfer_direction;        /* [i] data transfer direction  */

     87     unsigned char cmd_len;      /* [i] SCSI command length ( <= 16 bytes) */

     88     unsigned char mx_sb_len;    /* [i] max length to write to sbp */

     89     unsigned short iovec_count; /* [i] 0 implies no scatter gather */

     90     unsigned int dxfer_len;     /* [i] byte count of data transfer */

     91     void __user *dxferp;        /* [i], [*io] points to data transfer memory

     92                                               or scatter gather list */

     93     unsigned char __user *cmdp; /* [i], [*i] points to command to perform */

     94     void __user *sbp;           /* [i], [*o] points to sense_buffer memory */

     95     unsigned int timeout;       /* [i] MAX_UINT->no timeout (unit: millisec) */

     96     unsigned int flags;         /* [i] 0 -> default, see SG_FLAG... */

     97     int pack_id;                /* [i->o] unused internally (normally) */

     98     void __user * usr_ptr;      /* [i->o] unused internally */

     99     unsigned char status;       /* [o] scsi status */

    100     unsigned char masked_status;/* [o] shifted, masked scsi status */

    101     unsigned char msg_status;   /* [o] messaging level data (optional) */

    102     unsigned char sb_len_wr;    /* [o] byte count actually written to sbp */

    103     unsigned short host_status; /* [o] errors from host adapter */

    104     unsigned short driver_status;/* [o] errors from software driver */

    105     int resid;                  /* [o] dxfer_len - actual_transferred */

    106     unsigned int duration;      /* [o] time taken by cmd (unit: millisec) */

    107     unsigned int info;          /* [o] auxiliary information */

    108 } sg_io_hdr_t;  /* 64 bytes long (on i386) */

Among these members, what the cmdp pointer points to is none other than the command itself.
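
To make the structure less abstract, here is a rough user-space sketch of how a tool in the spirit of sg_inq might fill in sg_io_hdr and hand it to ioctl(2). It is not the sg_inq source. The helper name send_inquiry, the 32-byte sense buffer and the 5000 ms timeout are my own assumptions; the fields, SG_IO and the SG_DXFER_* and SG_INFO_* constants come from <scsi/sg.h>.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <scsi/sg.h>

/* Hypothetical helper: issue a standard INQUIRY through SG_IO.
 * Returns 0 on success, -1 if open() or ioctl() fails or if the
 * kernel flags the command as needing a closer look. */
int send_inquiry(const char *dev, unsigned char *buf, unsigned char buflen)
{
    /* 6-byte INQUIRY CDB: opcode 0x12, allocation length in byte 4 */
    unsigned char cdb[6] = { 0x12, 0, 0, 0, buflen, 0 };
    unsigned char sense[32];            /* assumed sense buffer size */
    sg_io_hdr_t hdr;
    int fd, ret;

    fd = open(dev, O_RDONLY | O_NONBLOCK);
    if (fd < 0)
        return -1;

    memset(&hdr, 0, sizeof(hdr));
    hdr.interface_id = 'S';                   /* required */
    hdr.dxfer_direction = SG_DXFER_FROM_DEV;  /* INQUIRY reads data */
    hdr.cmd_len = sizeof(cdb);
    hdr.cmdp = cdb;                           /* the command itself */
    hdr.dxfer_len = buflen;
    hdr.dxferp = buf;
    hdr.mx_sb_len = sizeof(sense);
    hdr.sbp = sense;
    hdr.timeout = 5000;                       /* milliseconds, assumed */

    ret = ioctl(fd, SG_IO, &hdr);
    close(fd);
    if (ret < 0)
        return -1;
    if ((hdr.info & SG_INFO_OK_MASK) != SG_INFO_OK)
        return -1;
    return 0;
}
```

A caller would pass something like "/dev/sdg" and a 96-byte buffer. Running it against a real disk normally needs root, and it is this ioctl call that lands on the breakpoint shown earlier.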

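Anyone with Linux programming experience will surely know copy_from_user and copy_to_user, two veteran functions and the classic pair for moving data between kernel space and user space. Without going back too far: in the 2.4 kernel days, when the teacher had us write a simple character driver in class, we were already using these two. Here, specifically, the contents of arg are first copied into hdr, then sg_io is called to do the real work, and then the contents of hdr are copied back to arg.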
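
The copy-in, work, copy-out shape of the SG_IO case can be imitated in user space. The sketch below is only an analogy: mock_hdr, mock_copy and mock_sg_io are hypothetical stand-ins I made up, with memcpy playing the part of copy_from_user/copy_to_user and a NULL pointer playing the part of a bad user address. What it tries to preserve is the order of the error checks in lines 563 to 572.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for struct sg_io_hdr. */
struct mock_hdr {
    int interface_id;       /* input member */
    unsigned char status;   /* output member */
};

/* Stand-in for copy_from_user()/copy_to_user(): like the kernel
 * helpers it returns 0 on success and non-zero on failure. */
static unsigned long mock_copy(void *dst, const void *src, size_t n)
{
    if (!dst || !src)
        return n;
    memcpy(dst, src, n);
    return 0;
}

/* Stand-in for sg_io(): checks one input and writes one output. */
static int mock_sg_io(struct mock_hdr *hdr)
{
    if (hdr->interface_id != 'S')
        return -EINVAL;
    hdr->status = 0;
    return 0;
}

/* Mirrors the shape of the SG_IO case in scsi_cmd_ioctl. */
int mock_sg_io_case(struct mock_hdr *user_arg)
{
    struct mock_hdr hdr;
    int err = -EFAULT;

    if (mock_copy(&hdr, user_arg, sizeof(hdr)))
        return err;                 /* could not read the caller's header */
    err = mock_sg_io(&hdr);
    if (err == -EFAULT)
        return err;                 /* skip the copy back on -EFAULT */
    if (mock_copy(user_arg, &hdr, sizeof(hdr)))
        err = -EFAULT;              /* could not write the header back */
    return err;
}
```

Note that the header is copied back even when the worker returns an error other than -EFAULT, which matches the listing: only -EFAULT skips the copy_to_user step.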

Following sg_io further, it also comes from block/scsi_ioctl.c:

    225 static int sg_io(struct file *file, request_queue_t *q,

    226                 struct gendisk *bd_disk, struct sg_io_hdr *hdr)

    227 {

    228         unsigned long start_time, timeout;

    229         int writing = 0, ret = 0;

    230         struct request *rq;

    231         char sense[SCSI_SENSE_BUFFERSIZE];

    232         unsigned char cmd[BLK_MAX_CDB];

    233         struct bio *bio;

    234

    235         if (hdr->interface_id != 'S')

    236                 return -EINVAL;

    237         if (hdr->cmd_len > BLK_MAX_CDB)

    238                 return -EINVAL;

    239         if (copy_from_user(cmd, hdr->cmdp, hdr->cmd_len))

    240                 return -EFAULT;

    241         if (verify_command(file, cmd))

    242                 return -EPERM;

    243

    244         if (hdr->dxfer_len > (q->max_hw_sectors << 9))

    245                 return -EIO;

    246

    247         if (hdr->dxfer_len)

    248                 switch (hdr->dxfer_direction) {

    249                 default:

    250                         return -EINVAL;

    251                 case SG_DXFER_TO_DEV:

    252                         writing = 1;

    253                         break;

    254                 case SG_DXFER_TO_FROM_DEV:

    255                 case SG_DXFER_FROM_DEV:

    256                         break;

    257                 }

    258

    259         rq = blk_get_request(q, writing ? WRITE : READ, GFP_KERNEL);

    260         if (!rq)

    261                 return -ENOMEM;

    262

    263         /*

    264          * fill in request structure

    265          */

    266         rq->cmd_len = hdr->cmd_len;

    267         memset(rq->cmd, 0, BLK_MAX_CDB); /* ATAPI hates garbage after CDB */

    268         memcpy(rq->cmd, cmd, hdr->cmd_len);

    269

    270         memset(sense, 0, sizeof(sense));

    271         rq->sense = sense;

    272         rq->sense_len = 0;

    273

    274         rq->cmd_type = REQ_TYPE_BLOCK_PC;

    275

    276         timeout = msecs_to_jiffies(hdr->timeout);

    277         rq->timeout = (timeout < INT_MAX) ? timeout : INT_MAX;

    278         if (!rq->timeout)

    279                 rq->timeout = q->sg_timeout;

    280         if (!rq->timeout)

    281                 rq->timeout = BLK_DEFAULT_TIMEOUT;

    282

    283         if (hdr->iovec_count) {

    284                 const int size = sizeof(struct sg_iovec) * hdr->iovec_count;

    285                 struct sg_iovec *iov;

    286

    287                 iov = kmalloc(size, GFP_KERNEL);

    288                 if (!iov) {

    289                         ret = -ENOMEM;

    290                         goto out;

    291                 }

    292

    293                 if (copy_from_user(iov, hdr->dxferp, size)) {

    294                         kfree(iov);

    295                         ret = -EFAULT;

    296                         goto out;

    297                 }

    298

    299                 ret = blk_rq_map_user_iov(q, rq, iov, hdr->iovec_count,

    300                                           hdr->dxfer_len);

    301                 kfree(iov);

    302         } else if (hdr->dxfer_len)

    303                 ret = blk_rq_map_user(q, rq, hdr->dxferp, hdr->dxfer_len);

    304

    305         if (ret)

    306                 goto out;

    307

    308         bio = rq->bio;

    309         rq->retries = 0;

    310

    311         start_time = jiffies;

    312

    313         /* ignore return value. All information is passed back to caller

    314          * (if he doesn't check that is his problem).

    315          * N.B. a non-zero SCSI status is _not_ necessarily an error.

    316          */

    317         blk_execute_rq(q, bd_disk, rq, 0);

    318

    319         /* write to all output members */

    320         hdr->status = 0xff & rq->errors;

    321         hdr->masked_status = status_byte(rq->errors);

    322         hdr->msg_status = msg_byte(rq->errors);

    323         hdr->host_status = host_byte(rq->errors);

    324         hdr->driver_status = driver_byte(rq->errors);

    325         hdr->info = 0;

    326         if (hdr->masked_status || hdr->host_status || hdr->driver_status)

    327                 hdr->info |= SG_INFO_CHECK;

    328         hdr->resid = rq->data_len;

    329         hdr->duration = ((jiffies - start_time) * 1000) / HZ;

    330         hdr->sb_len_wr = 0;

    331

    332         if (rq->sense_len && hdr->sbp) {

    333                 int len = min((unsigned int) hdr->mx_sb_len, rq->sense_len);

    334

    335                 if (!copy_to_user(hdr->sbp, rq->sense, len))

    336                         hdr->sb_len_wr = len;

    337         }

    338

    339         if (blk_rq_unmap_user(bio))

    340                 ret = -EFAULT;

    341

    342         /* may not have succeeded, but output values written to control

    343          * structure (struct sg_io_hdr).  */

    344 out:

    345         blk_put_request(rq);

    346         return ret;

    347 }

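It is easy to see that hdr plays two roles here: one as input, one as output. The means by which we obtain the information is still blk_execute_rq, that is, still by submitting a request. Once blk_execute_rq returns, the information has been stored in the members of rq, and the code after that moves it from rq's members into hdr's members. It is like when I go swimming at the Cuigong Hotel. I paid 900 yuan for a swim card, the kind valid for half a year and good for 30 swims in total. Every time I go I need the card to get in, and when I come out the staff hand it back to me. The card has 30 boxes on it, and for each swim the staff cross one off. So hdr plays the part of my card: it is involved both on the way in and on the way out, and its state is different between the two.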
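
Lines 320 to 324 split the single integer rq->errors into several hdr members. The sketch below shows that unpacking in plain C. The struct and function names are mine, and the shifts and masks follow my recollection of the status_byte(), msg_byte(), host_byte() and driver_byte() macros; the exact mask used by status_byte() has differed between kernel versions, so treat the 0x7f here as an assumption.

```c
#include <stdint.h>

/* Hypothetical container for the output members filled at lines 320-324. */
struct decoded_status {
    unsigned char status;          /* raw SCSI status byte */
    unsigned char masked_status;   /* status shifted right by one, masked */
    unsigned char msg_status;
    unsigned short host_status;
    unsigned short driver_status;
};

/* Assumed layout, most significant byte first:
 * driver byte | host byte | message byte | status byte */
struct decoded_status decode_result(uint32_t result)
{
    struct decoded_status d;

    d.status        = result & 0xff;
    d.masked_status = (result >> 1) & 0x7f;   /* mask is an assumption */
    d.msg_status    = (result >> 8) & 0xff;
    d.host_status   = (result >> 16) & 0xff;
    d.driver_status = (result >> 24) & 0xff;
    return d;
}
```

In the listing, a non-zero masked_status, host_status or driver_status is what causes SG_INFO_CHECK to be set in hdr->info at line 327.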

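So by now we are basically clear on what this function does. Let's look at a few details, starting by reading the code alongside the great struct sg_io_hdr.

interface_id must be a capital 'S', standing for SCSI Generic. Historically it marks the module called sg. Its counterpart is another module called pg (the parallel port generic driver), which also has an interface_id variable, set to 'P'.

dxfer_direction is the direction of the data transfer. For a write command it can take the value SG_DXFER_TO_DEV; for a read command, SG_DXFER_FROM_DEV.

cmd_len needs little explanation: it is the length of the SCSI command. It must be at most 16, because a fixed-length SCSI CDB is at most 16 bytes. That is why line 237 checks whether it exceeds 16. (BLK_MAX_CDB is defined as 16.)

dxfer_len is how many bytes the data-transfer phase will move. max_hw_sectors is the maximum number of sectors a single request can transfer, which is a hardware limit. A sector is 512 bytes, so line 244 shifts left by 9, i.e. multiplies by 512.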
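
The checks at lines 235 to 257 can be collected into one small function and tried out in user space. This is a sketch, not kernel code: check_hdr and MOCK_MAX_CDB are names I made up, max_hw_sectors is passed in as a plain parameter instead of being read from the request queue, and the copy_from_user and verify_command steps are left out.

```c
#include <errno.h>
#include <scsi/sg.h>

#define MOCK_MAX_CDB 16   /* mirrors BLK_MAX_CDB as described in the text */

/* Returns 0 if the header passes the checks, or a negated errno.
 * *writing is set to 1 only for a host-to-device transfer. */
int check_hdr(const sg_io_hdr_t *hdr, unsigned int max_hw_sectors,
              int *writing)
{
    *writing = 0;

    if (hdr->interface_id != 'S')
        return -EINVAL;
    if (hdr->cmd_len > MOCK_MAX_CDB)
        return -EINVAL;
    /* sectors to bytes: shift left by 9, i.e. multiply by 512 */
    if (hdr->dxfer_len > (max_hw_sectors << 9))
        return -EIO;

    if (hdr->dxfer_len) {
        switch (hdr->dxfer_direction) {
        case SG_DXFER_TO_DEV:
            *writing = 1;
            break;
        case SG_DXFER_TO_FROM_DEV:
        case SG_DXFER_FROM_DEV:
            break;
        default:
            return -EINVAL;
        }
    }
    return 0;
}
```

With max_hw_sectors set to 1, a 512-byte write passes and a 513-byte one is rejected with -EIO, which is the arithmetic behind line 244.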

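Line 259: blk_get_request can basically be understood as allocating a struct request.

Then line 268 copies cmd into rq->cmd. cmd itself was copied from hdr->cmdp just before, at line 239.

Note that line 274 sets rq->cmd_type to REQ_TYPE_BLOCK_PC. The block layer handles different command types differently.

Line 283: iovec_count is inseparable from dxferp. If iovec_count is 0, dxferp is a user-space memory address. If iovec_count is greater than 0, dxferp actually points to a scatter-gather array, each element of which is an sg_iovec_t. struct sg_iovec is defined in include/scsi/sg.h: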

     76 typedef struct sg_iovec /* same structure as used by readv() Linux system */

     77 {                       /* call. It defines one scatter-gather element. */

     78     void __user *iov_base;      /* Starting address  */

     79     size_t iov_len;             /* Length in bytes  */

     80 } sg_iovec_t;
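
Seen from user space, the iovec_count case might be set up roughly as follows. This is a sketch under assumptions: the helper name setup_two_segment_read is hypothetical, it only fills in the transfer-related fields, and it leaves cmdp, cmd_len and the sense fields for the caller. The field names and sg_iovec_t come from <scsi/sg.h>.

```c
#include <stddef.h>
#include <string.h>
#include <scsi/sg.h>

/* Hypothetical helper: describe a device-to-host transfer that is
 * scattered into two separate user buffers, a and b. */
void setup_two_segment_read(sg_io_hdr_t *hdr, sg_iovec_t iov[2],
                            void *a, size_t alen, void *b, size_t blen)
{
    /* one scatter-gather element per buffer */
    iov[0].iov_base = a;
    iov[0].iov_len  = alen;
    iov[1].iov_base = b;
    iov[1].iov_len  = blen;

    memset(hdr, 0, sizeof(*hdr));
    hdr->interface_id    = 'S';
    hdr->dxfer_direction = SG_DXFER_FROM_DEV;
    hdr->iovec_count     = 2;      /* non-zero: dxferp is an array */
    hdr->dxferp          = iov;    /* points at the array, not at data */
    hdr->dxfer_len       = (unsigned int)(alen + blen);  /* total bytes */
}
```

When iovec_count is non-zero, the kernel side at lines 283 to 301 copies this array in with copy_from_user and hands it to blk_rq_map_user_iov, instead of treating dxferp as the data buffer itself.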

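blk_rq_map_user and blk_rq_map_user_iov do the same job: they set up a mapping between the user's data and the request. These days almost everyone has heard of the so-called zero-copy feature in Linux, and at this moment the legendary zero-copy turns out to be this close to us. The whole point of calling these two functions is to use zero-copy. What it is good for goes without saying: better performance and higher efficiency, which is in line with the Scientific Outlook on Development advocated at the 17th Party Congress. However, the waters here run rather too deep, so we will not wade in.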

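Once blk_execute_rq has finished, things are simple.

resid is how many bytes remain untransferred.

duration is the time from when the SCSI command was sent until it completed, in milliseconds.

sbp points to the user memory into which the SCSI sense buffer is written.

sb_len_wr is how many bytes were actually written to the user memory that sbp points to. If the command succeeds, the sense buffer is generally not written at all, in which case sb_len_wr is 0.

mx_sb_len is the maximum size that may be written to sbp. Even the guys selling fake student IDs outside the east gate of Renmin University know that sb_len_wr <= mx_sb_len.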
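
Two of these output values are simple arithmetic, sketched below as standalone functions. The function names are hypothetical, and hz is passed as a parameter here because the kernel's HZ constant is not available in user space. The first function is the min() from line 333, the second the conversion from line 329.

```c
/* Number of sense bytes that may be copied to the user's buffer:
 * never more than the caller allowed (mx_sb_len) and never more
 * than the request actually produced (sense_len). */
unsigned int sense_bytes_to_copy(unsigned char mx_sb_len,
                                 unsigned int sense_len)
{
    unsigned int max = mx_sb_len;

    return sense_len < max ? sense_len : max;
}

/* Elapsed time in milliseconds from a tick delta, where hz is the
 * number of ticks per second. */
unsigned int ticks_to_msecs(unsigned long ticks, unsigned long hz)
{
    return (unsigned int)((ticks * 1000) / hz);
}
```

Since the result of the first function is what ends up in sb_len_wr, the inequality sb_len_wr <= mx_sb_len holds by construction.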

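So, finally, we have a rough idea of how the application layer sends a command to a SCSI device. sg_inq triggers the ioctl system call, and after a few twists and turns sd_ioctl gets called. sd_ioctl calls scsi_cmd_ioctl, which the block layer provides in block/scsi_ioctl.c, and that in turn calls sg_io. The route finally taken is still blk_execute_rq, and as for how that function eventually ends up holding hands with usb-storage, we already covered it in detail when we analysed SCSI commands over on the block-layer side.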
