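2007 is over. The stars carried on as ever this year: the big names kept leading the pack, and the juniors hoping to move up tried every trick. Those due to fall in love fell in love, those due for hype got hyped, those due for plastic surgery went under the knife. The established did charity; the not-yet-established manufactured scandals. As for me, my job is to keep writing my blog, keep talking about those darn Linux things, and keep talking about those boring functions. Now that sd_probe is done, it is time to meet some new functions. First up is ioctl, which in the sd module means sd_ioctl.
When we send a command to a SCSI disk, this function will most likely be called. Let's demonstrate with kdb. First, set a breakpoint on sd_ioctl in kdb: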
Entering kdb (current=0xffff81022adcf140, pid 4074) on processor 6 due to KDB_ENTER()
[6]kdb> bp sd_ioctl
Instruction(i) BP #0 at 0xffffffff880b1de2 ([sd_mod]sd_ioctl)
is enabled globally adjust 1
[6]kdb> go
Then run a SCSI command, for example INQUIRY:
[root@lfg2 tedkdb]# sg_inq /dev/sdg
Needless to say, the kdb prompt pops up in a flash:
Entering kdb (current=0xffff81022cdae760, pid 4095) on processor 5 due to Breakpoint @ 0xffffffff880b1de2
[5]kdb> bt
Stack traceback for pid 4095
0xffff81022cdae760 4095 4044 1 5 R 0xffff81022cdaea40 *sg_inq
rsp rip Function (args)
======================= <debug>
0xffff81022f619fd8 0xffffffff880b1de2 [sd_mod]sd_ioctl
======================= <normal>
0xffff81022d7f7d40 0xffffffff803101a0 blkdev_driver_ioctl+0x63
0xffff81022d7f7d80 0xffffffff80310810 blkdev_ioctl+0x65b
0xffff81022d7f7da0 0xffffffff80278825 __handle_mm_fault+0x6d3
0xffff81022d7f7db0 0xffffffff8029a1ea may_open+0x65
0xffff81022d7f7e10 0xffffffff80322a48 __up_read+0x7a
0xffff81022d7f7e40 0xffffffff80249b4f up_read+0x9
0xffff81022d7f7e50 0xffffffff80482d6d do_page_fault+0x48a
0xffff81022d7f7e60 0xffffffff8027acff __vma_link+0x52
0xffff81022d7f7ea0 0xffffffff802b58d5 block_ioctl+0x1b
0xffff81022d7f7eb0 0xffffffff8029da98 do_ioctl+0x2c
0xffff81022d7f7ee0 0xffffffff8029dd6d vfs_ioctl+0x247
0xffff81022d7f7f30 0xffffffff8029dde9 sys_ioctl+0x5f
0xffff81022d7f7f80 0xffffffff80209efc tracesys+0xdc
[5]kdb>
The path actually taken is the ioctl system call: we come in through sys_ioctl and eventually arrive at sd_ioctl.
sd_ioctl comes from drivers/scsi/sd.c:
641 /**
642 * sd_ioctl - process an ioctl
643 * @inode: only i_rdev/i_bdev members may be used
644 * @filp: only f_mode and f_flags may be used
645 * @cmd: ioctl command number
646 * @arg: this is third argument given to ioctl(2) system call.
647 * Often contains a pointer.
648 *
649 * Returns 0 if successful (some ioctls return postive numbers on
650 * success as well). Returns a negated errno value in case of error.
651 *
652 * Note: most ioctls are forward onto the block subsystem or further
653 * down in the scsi subsytem.
654 **/
655 static int sd_ioctl(struct inode * inode, struct file * filp,
656 unsigned int cmd, unsigned long arg)
657 {
658 struct block_device *bdev = inode->i_bdev;
659 struct gendisk *disk = bdev->bd_disk;
660 struct scsi_device *sdp = scsi_disk(disk)->device;
661 void __user *p = (void __user *)arg;
662 int error;
663
664 SCSI_LOG_IOCTL(1, printk("sd_ioctl: disk=%s, cmd=0x%x\n",
665 disk->disk_name, cmd));
666
667 /*
668 * If we are in the middle of error recovery, don't let anyone
669 * else try and use this device. Also, if error recovery fails, it
670 * may try and take the device offline, in which case all further
671 * access to the device is prohibited.
672 */
673 error = scsi_nonblockable_ioctl(sdp, cmd, p, filp);
674 if (!scsi_block_when_processing_errors(sdp) || !error)
675 return error;
676
677 /*
678 * Send SCSI addressing ioctls directly to mid level, send other
679 * ioctls to block level and then onto mid level if they can't be
680 * resolved.
681 */
682 switch (cmd) {
683 case SCSI_IOCTL_GET_IDLUN:
684 case SCSI_IOCTL_GET_BUS_NUMBER:
685 return scsi_ioctl(sdp, cmd, p);
686 default:
687 error = scsi_cmd_ioctl(filp, disk, cmd, p);
688 if (error != -ENOTTY)
689 return error;
690 }
691 return scsi_ioctl(sdp, cmd, p);
692 }
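The tail of sd_ioctl follows a simple pattern: addressing ioctls go straight to the mid level, everything else is offered to the block level first, and only a -ENOTTY answer sends it on to the mid level. The following is a minimal user-space sketch of that pattern, not kernel code. The command numbers, the layer tags, and the names block_layer_ioctl, mid_layer_ioctl and dispatch_ioctl are all hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical command numbers, invented for this sketch only. */
enum { CMD_GET_IDLUN = 1, CMD_SG_IO = 2, CMD_VENDOR = 3 };
/* Which layer answered; also invented for this sketch. */
enum { LAYER_BLOCK = 1, LAYER_MID = 2 };

/* Stand-in for scsi_cmd_ioctl(): knows CMD_SG_IO, otherwise -ENOTTY. */
static int block_layer_ioctl(int cmd)
{
    return cmd == CMD_SG_IO ? 0 : -ENOTTY;
}

/* Stand-in for scsi_ioctl(): accepts everything in this model. */
static int mid_layer_ioctl(int cmd)
{
    (void)cmd;
    return 0;
}

/* Mirrors the switch at the end of sd_ioctl; *layer records who answered. */
int dispatch_ioctl(int cmd, int *layer)
{
    int error;

    assert(layer);
    switch (cmd) {
    case CMD_GET_IDLUN:             /* addressing ioctl: straight to mid level */
        *layer = LAYER_MID;
        return mid_layer_ioctl(cmd);
    default:
        error = block_layer_ioctl(cmd);
        if (error != -ENOTTY) {     /* the block level understood the command */
            *layer = LAYER_BLOCK;
            return error;
        }
    }
    *layer = LAYER_MID;             /* block level said "not mine": fall back */
    return mid_layer_ioctl(cmd);
}
```

In this model a command like CMD_SG_IO is answered by the block-level stand-in, which is the same outcome the real trace shows for sg_inq.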
Set further breakpoints on scsi_cmd_ioctl and scsi_ioctl, and you will find that it is scsi_cmd_ioctl that gets called.
Instruction(i) breakpoint #2 at 0xffffffff802ee554 (adjusted)
0xffffffff802ee554 scsi_cmd_ioctl: int3
Entering kdb (current=0xffff81022e1a01c0, pid 3583) due to Breakpoint @ 0xffffffff802ee554
kdb> bt
Stack traceback for pid 3583
0xffff81022e1a01c0 3583 3425 1 0 R 0xffff81022e1a0490 *sg_inq
rsp rip Function (args)
======================= <debug>
0xffffffff805dafa0 0xffffffff88080dcc [sd_mod]sd_ioctl
0xffffffff805dafd8 0xffffffff802ee554 scsi_cmd_ioctl
======================= <normal>
0xffff8102112d7d70 0xffffffff88080e68 [sd_mod]sd_ioctl+0x9c
0xffff8102112d7db0 0xffffffff802ec9b2 blkdev_driver_ioctl+0x3a
0xffff8102112d7dc0 0xffffffff802ed00d blkdev_ioctl+0x627
0xffff8102112d7e40 0xffffffff80239e1b up_read+0x9
0xffff8102112d7e50 0xffffffff80435872 do_page_fault+0x48a
0xffff8102112d7ea0 0xffffffff802641f6 vma_link+0x3a
0xffff8102112d7ee0 0xffffffff80295155 block_ioctl+0x1b
0xffff8102112d7ef0 0xffffffff8027ec3b do_ioctl+0x1b
0xffff8102112d7f00 0xffffffff8027ee7e vfs_ioctl+0x20e
0xffff8102112d7f30 0xffffffff8027eeef sys_ioctl+0x5f
0xffff8102112d7f80 0xffffffff80209c8c tracesys+0xdc
scsi_cmd_ioctl comes from block/scsi_ioctl.c:
520 int scsi_cmd_ioctl(struct file *file, struct gendisk *bd_disk, unsigned int cmd, void __user *arg)
521 {
522 request_queue_t *q;
523 int err;
524
525 q = bd_disk->queue;
526 if (!q)
527 return -ENXIO;
528
529 if (blk_get_queue(q))
530 return -ENXIO;
531
532 switch (cmd) {
533 /*
534 * new sgv3 interface
535 */
536 case SG_GET_VERSION_NUM:
537 err = sg_get_version(arg);
538 break;
539 case SCSI_IOCTL_GET_IDLUN:
540 err = scsi_get_idlun(q, arg);
541 break;
542 case SCSI_IOCTL_GET_BUS_NUMBER:
543 err = scsi_get_bus(q, arg);
544 break;
545 case SG_SET_TIMEOUT:
546 err = sg_set_timeout(q, arg);
547 break;
548 case SG_GET_TIMEOUT:
549 err = sg_get_timeout(q);
550 break;
551 case SG_GET_RESERVED_SIZE:
552 err = sg_get_reserved_size(q, arg);
553 break;
554 case SG_SET_RESERVED_SIZE:
555 err = sg_set_reserved_size(q, arg);
556 break;
557 case SG_EMULATED_HOST:
558 err = sg_emulated_host(q, arg);
559 break;
560 case SG_IO: {
561 struct sg_io_hdr hdr;
562
563 err = -EFAULT;
564 if (copy_from_user(&hdr, arg, sizeof(hdr)))
565 break;
566 err = sg_io(file, q, bd_disk, &hdr);
567 if (err == -EFAULT)
568 break;
569
570 if (copy_to_user(arg, &hdr, sizeof(hdr)))
571 err = -EFAULT;
572 break;
573 }
574 case CDROM_SEND_PACKET: {
575 struct cdrom_generic_command cgc;
576 struct sg_io_hdr hdr;
577
578 err = -EFAULT;
579 if (copy_from_user(&cgc, arg, sizeof(cgc)))
580 break;
581 cgc.timeout = clock_t_to_jiffies(cgc.timeout);
582 memset(&hdr, 0, sizeof(hdr));
583 hdr.interface_id = 'S';
584 hdr.cmd_len = sizeof(cgc.cmd);
585 hdr.dxfer_len = cgc.buflen;
586 err = 0;
587 switch (cgc.data_direction) {
588 case CGC_DATA_UNKNOWN:
589 hdr.dxfer_direction = SG_DXFER_UNKNOWN;
590 break;
591 case CGC_DATA_WRITE:
592 hdr.dxfer_direction = SG_DXFER_TO_DEV;
593 break;
594 case CGC_DATA_READ:
595 hdr.dxfer_direction = SG_DXFER_FROM_DEV;
596 break;
597 case CGC_DATA_NONE:
598 hdr.dxfer_direction = SG_DXFER_NONE;
599 break;
600 default:
601 err = -EINVAL;
602 }
603 if (err)
604 break;
605
606 hdr.dxferp = cgc.buffer;
607 hdr.sbp = cgc.sense;
608 if (hdr.sbp)
609 hdr.mx_sb_len = sizeof(struct request_sense);
610 hdr.timeout = cgc.timeout;
611 hdr.cmdp = ((struct cdrom_generic_command __user*) arg)->cmd;
612 hdr.cmd_len = sizeof(cgc.cmd);
613
614 err = sg_io(file, q, bd_disk, &hdr);
615 if (err == -EFAULT)
616 break;
617
618 if (hdr.status)
619 err = -EIO;
620
621 cgc.stat = err;
622 cgc.buflen = hdr.resid;
623 if (copy_to_user(arg, &cgc, sizeof(cgc)))
624 err = -EFAULT;
625
626 break;
627 }
628
629 /*
630 * old junk scsi send command ioctl
631 */
632 case SCSI_IOCTL_SEND_COMMAND:
633 printk(KERN_WARNING "program %s is using a deprecated SCSI ioctl, please convert it to SG_IO\n", current->comm);
634 err = -EINVAL;
635 if (!arg)
636 break;
637
638 err = sg_scsi_ioctl(file, q, bd_disk, arg);
639 break;
640 case CDROMCLOSETRAY:
641 err = blk_send_start_stop(q, bd_disk, 0x03);
642 break;
643 case CDROMEJECT:
644 err = blk_send_start_stop(q, bd_disk, 0x02);
645 break;
646 default:
647 err = -ENOTTY;
648 }
649
650 blk_put_queue(q);
651 return err;
652 }
Tracing with kdb shows that the cmd passed in is SG_IO. In other words, this long switch-case settles on the SG_IO case at line 560. That makes one structure very important here: sg_io_hdr, from include/scsi/sg.h.
83 typedef struct sg_io_hdr
84 {
85 int interface_id; /* [i] 'S' for SCSI generic (required) */
86 int dxfer_direction; /* [i] data transfer direction */
87 unsigned char cmd_len; /* [i] SCSI command length ( <= 16 bytes) */
88 unsigned char mx_sb_len; /* [i] max length to write to sbp */
89 unsigned short iovec_count; /* [i] 0 implies no scatter gather */
90 unsigned int dxfer_len; /* [i] byte count of data transfer */
91 void __user *dxferp; /* [i], [*io] points to data transfer memory
92 or scatter gather list */
93 unsigned char __user *cmdp; /* [i], [*i] points to command to perform */
94 void __user *sbp; /* [i], [*o] points to sense_buffer memory */
95 unsigned int timeout; /* [i] MAX_UINT->no timeout (unit: millisec) */
96 unsigned int flags; /* [i] 0 -> default, see SG_FLAG... */
97 int pack_id; /* [i->o] unused internally (normally) */
98 void __user * usr_ptr; /* [i->o] unused internally */
99 unsigned char status; /* [o] scsi status */
100 unsigned char masked_status;/* [o] shifted, masked scsi status */
101 unsigned char msg_status; /* [o] messaging level data (optional) */
102 unsigned char sb_len_wr; /* [o] byte count actually written to sbp */
103 unsigned short host_status; /* [o] errors from host adapter */
104 unsigned short driver_status;/* [o] errors from software driver */
105 int resid; /* [o] dxfer_len - actual_transferred */
106 unsigned int duration; /* [o] time taken by cmd (unit: millisec) */
107 unsigned int info; /* [o] auxiliary information */
108 } sg_io_hdr_t; /* 64 bytes long (on i386) */
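Of these members, the cmdp pointer points to nothing other than the command itself.
Anyone who has programmed under Linux will know copy_from_user and copy_to_user, the two veteran functions for passing data between kernel space and user space. Without going too far back: in the 2.4 kernel days, when the teacher had us write a simple character driver in class, we were already using these two. Here specifically, the contents behind arg are first copied into hdr, then sg_io is called to do the real work, and finally the contents of hdr are copied back to arg.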
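For orientation, here is roughly what the user-space side of that round trip could look like for an INQUIRY. This is a minimal sketch assuming the user-space header <scsi/sg.h>; it is not the source of sg_inq. The helper name build_inquiry, the two INQUIRY_* macros and the 5000 ms timeout are assumptions made for this example. The function only fills in the CDB and the header and sends nothing.

```c
#include <assert.h>
#include <string.h>
#include <scsi/sg.h>   /* sg_io_hdr_t, SG_DXFER_FROM_DEV, SG_IO */

#define INQUIRY_OPCODE  0x12   /* SCSI INQUIRY operation code */
#define INQUIRY_CDB_LEN 6      /* INQUIRY uses a 6-byte CDB */

/* Fill cdb[] and *hdr for a standard INQUIRY; nothing is sent here. */
void build_inquiry(sg_io_hdr_t *hdr, unsigned char cdb[INQUIRY_CDB_LEN],
                   unsigned char *data, unsigned char data_len,
                   unsigned char *sense, unsigned char sense_len)
{
    assert(hdr && cdb && data && sense);

    memset(cdb, 0, INQUIRY_CDB_LEN);
    cdb[0] = INQUIRY_OPCODE;
    cdb[4] = data_len;                        /* allocation length */

    memset(hdr, 0, sizeof(*hdr));
    hdr->interface_id = 'S';                  /* required by sg_io */
    hdr->dxfer_direction = SG_DXFER_FROM_DEV; /* device -> host */
    hdr->cmd_len = INQUIRY_CDB_LEN;
    hdr->cmdp = cdb;                          /* the command itself */
    hdr->dxfer_len = data_len;
    hdr->dxferp = data;                       /* where the reply lands */
    hdr->mx_sb_len = sense_len;
    hdr->sbp = sense;                         /* where sense data lands */
    hdr->timeout = 5000;                      /* milliseconds; arbitrary */
}
```

A caller would then open the disk node (for example the /dev/sdg used above) and issue ioctl(fd, SG_IO, &hdr). That is the call that lands in the SG_IO case at line 560, and on return the output members such as status, resid and sb_len_wr carry the result.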
Tracing further into sg_io, from block/scsi_ioctl.c:
225 static int sg_io(struct file *file, request_queue_t *q,
226 struct gendisk *bd_disk, struct sg_io_hdr *hdr)
227 {
228 unsigned long start_time, timeout;
229 int writing = 0, ret = 0;
230 struct request *rq;
231 char sense[SCSI_SENSE_BUFFERSIZE];
232 unsigned char cmd[BLK_MAX_CDB];
233 struct bio *bio;
234
235 if (hdr->interface_id != 'S')
236 return -EINVAL;
237 if (hdr->cmd_len > BLK_MAX_CDB)
238 return -EINVAL;
239 if (copy_from_user(cmd, hdr->cmdp, hdr->cmd_len))
240 return -EFAULT;
241 if (verify_command(file, cmd))
242 return -EPERM;
243
244 if (hdr->dxfer_len > (q->max_hw_sectors << 9))
245 return -EIO;
246
247 if (hdr->dxfer_len)
248 switch (hdr->dxfer_direction) {
249 default:
250 return -EINVAL;
251 case SG_DXFER_TO_DEV:
252 writing = 1;
253 break;
254 case SG_DXFER_TO_FROM_DEV:
255 case SG_DXFER_FROM_DEV:
256 break;
257 }
258
259 rq = blk_get_request(q, writing ? WRITE : READ, GFP_KERNEL);
260 if (!rq)
261 return -ENOMEM;
262
263 /*
264 * fill in request structure
265 */
266 rq->cmd_len = hdr->cmd_len;
267 memset(rq->cmd, 0, BLK_MAX_CDB); /* ATAPI hates garbage after CDB */
268 memcpy(rq->cmd, cmd, hdr->cmd_len);
269
270 memset(sense, 0, sizeof(sense));
271 rq->sense = sense;
272 rq->sense_len = 0;
273
274 rq->cmd_type = REQ_TYPE_BLOCK_PC;
275
276 timeout = msecs_to_jiffies(hdr->timeout);
277 rq->timeout = (timeout < INT_MAX) ? timeout : INT_MAX;
278 if (!rq->timeout)
279 rq->timeout = q->sg_timeout;
280 if (!rq->timeout)
281 rq->timeout = BLK_DEFAULT_TIMEOUT;
282
283 if (hdr->iovec_count) {
284 const int size = sizeof(struct sg_iovec) * hdr->iovec_count;
285 struct sg_iovec *iov;
286
287 iov = kmalloc(size, GFP_KERNEL);
288 if (!iov) {
289 ret = -ENOMEM;
290 goto out;
291 }
292
293 if (copy_from_user(iov, hdr->dxferp, size)) {
294 kfree(iov);
295 ret = -EFAULT;
296 goto out;
297 }
298
299 ret = blk_rq_map_user_iov(q, rq, iov, hdr->iovec_count,
300 hdr->dxfer_len);
301 kfree(iov);
302 } else if (hdr->dxfer_len)
303 ret = blk_rq_map_user(q, rq, hdr->dxferp, hdr->dxfer_len);
304
305 if (ret)
306 goto out;
307
308 bio = rq->bio;
309 rq->retries = 0;
310
311 start_time = jiffies;
312
313 /* ignore return value. All information is passed back to caller
314 * (if he doesn't check that is his problem).
315 * N.B. a non-zero SCSI status is _not_ necessarily an error.
316 */
317 blk_execute_rq(q, bd_disk, rq, 0);
318
319 /* write to all output members */
320 hdr->status = 0xff & rq->errors;
321 hdr->masked_status = status_byte(rq->errors);
322 hdr->msg_status = msg_byte(rq->errors);
323 hdr->host_status = host_byte(rq->errors);
324 hdr->driver_status = driver_byte(rq->errors);
325 hdr->info = 0;
326 if (hdr->masked_status || hdr->host_status || hdr->driver_status)
327 hdr->info |= SG_INFO_CHECK;
328 hdr->resid = rq->data_len;
329 hdr->duration = ((jiffies - start_time) * 1000) / HZ;
330 hdr->sb_len_wr = 0;
331
332 if (rq->sense_len && hdr->sbp) {
333 int len = min((unsigned int) hdr->mx_sb_len, rq->sense_len);
334
335 if (!copy_to_user(hdr->sbp, rq->sense, len))
336 hdr->sb_len_wr = len;
337 }
338
339 if (blk_rq_unmap_user(bio))
340 ret = -EFAULT;
341
342 /* may not have succeeded, but output values written to control
343 * structure (struct sg_io_hdr). */
344 out:
345 blk_put_request(rq);
346 return ret;
347 }
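It is not hard to see that hdr plays two roles here: input and output. The means by which we obtain the information is still blk_execute_rq, that is, still by submitting a request. Once blk_execute_rq returns, the information is already stored in the members of rq, and the code after it moves that information from the members of rq into the members of hdr. It is like my swimming at the Cuigong Hotel: I paid 900 yuan for a swimming card, the kind valid for half a year and good for 30 swims. Every time I go I need the card to get in, and when I come out the staff hand it back. The card has 30 boxes, and on each visit the staff cross one off. hdr works just like that card: you deal with it on the way in and on the way out, and its state differs between the two.
So what this function does is basically clear. Now let's look at some details, reading the code alongside the great struct sg_io_hdr.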
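interface_id must be a capital 'S', standing for SCSI Generic. Historically it identifies the module called sg. Its counterpart is another module called pg (parallel port generic driver), which also has an interface_id variable, set to 'P'.
dxfer_direction is the direction of the data transfer. For a write command it can take the value SG_DXFER_TO_DEV; for a read command it can take the value SG_DXFER_FROM_DEV.
cmd_len needs little explanation: it is the length of the SCSI command. It must be less than or equal to 16, because this code accepts commands of at most 16 bytes. That is why line 237 checks whether it exceeds 16. (BLK_MAX_CDB is defined as 16.)
dxfer_len is the number of bytes to be moved in the data transfer phase. max_hw_sectors is the maximum number of sectors a single request can transfer, which is a hardware limit. One sector is 512 bytes, so line 244 shifts left by 9 bits, that is, multiplies by 512.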
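The three checks on interface_id, cmd_len and dxfer_len can be modelled in a few lines of user-space C. This is only a sketch of lines 235 to 245: it leaves out the copy_from_user and verify_command steps in between, and struct model_hdr, MODEL_MAX_CDB and model_precheck are names invented for this example.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_MAX_CDB 16   /* mirrors BLK_MAX_CDB in the listing */

/* Only the three input members that the early checks look at. */
struct model_hdr {
    int interface_id;
    unsigned char cmd_len;
    unsigned int dxfer_len;
};

/* Returns 0 if the header passes, or the errno the listing would return. */
int model_precheck(const struct model_hdr *hdr, unsigned int max_hw_sectors)
{
    assert(hdr);

    if (hdr->interface_id != 'S')
        return -EINVAL;                 /* not an sg-style header */
    if (hdr->cmd_len > MODEL_MAX_CDB)
        return -EINVAL;                 /* command too long */
    if (hdr->dxfer_len > (max_hw_sectors << 9))
        return -EIO;                    /* sectors * 512 bytes exceeded */
    return 0;
}
```

With max_hw_sectors set to 1024, for instance, the largest transfer that passes is 1024 * 512 = 524288 bytes.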
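Line 259: blk_get_request can basically be understood as allocating a struct request.
Then line 268 copies cmd into rq->cmd. And cmd is simply what line 239 copied in from hdr->cmdp a moment earlier.
Note that line 274 sets rq->cmd_type to REQ_TYPE_BLOCK_PC; the block layer handles requests differently depending on the command type.
Line 283: iovec_count is inseparable from dxferp. If iovec_count is 0, dxferp is a user-space memory address. If iovec_count is greater than 0, dxferp actually points to a scatter-gather array, each element of which is an sg_iovec_t structure. struct sg_iovec is defined in include/scsi/sg.h: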
76 typedef struct sg_iovec /* same structure as used by readv() Linux system */
77 { /* call. It defines one scatter-gather element. */
78 void __user *iov_base; /* Starting address */
79 size_t iov_len; /* Length in bytes */
80 } sg_iovec_t;
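From user space, using the scatter-gather form means pointing dxferp at an array of these elements and setting iovec_count to the array length. The sketch below assumes the user-space header <scsi/sg.h>. The helper name attach_iovec is hypothetical, and setting dxfer_len to the sum of all iov_len values is an assumption of this example (it means "transfer everything the list describes").

```c
#include <assert.h>
#include <scsi/sg.h>   /* sg_io_hdr_t, sg_iovec_t */

/* Point hdr at a scatter-gather list; returns the total byte count. */
unsigned int attach_iovec(sg_io_hdr_t *hdr, sg_iovec_t *iov,
                          unsigned short count)
{
    unsigned int total = 0;

    assert(hdr && iov && count > 0);

    for (unsigned short i = 0; i < count; i++)
        total += (unsigned int)iov[i].iov_len;

    hdr->iovec_count = count;   /* non-zero: dxferp is an sg_iovec array */
    hdr->dxferp = iov;
    hdr->dxfer_len = total;     /* assumption: transfer the whole list */
    return total;
}
```

With a header prepared this way, sg_io takes the branch at line 283 and copies the array in before calling blk_rq_map_user_iov.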
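blk_rq_map_user and blk_rq_map_user_iov serve the same purpose: they set up the mapping between the user data and the request. Nearly everyone nowadays has heard of the so-called zero-copy feature in Linux, and at this moment the legendary zero copy is surprisingly close to us. The point of calling these two functions is precisely to use zero copy. Its benefits go without saying: better performance and higher efficiency, which is in line with the Scientific Outlook on Development advocated by the 17th Party Congress. However, the water here runs far too deep, so we will not wade in.
After blk_execute_rq has run, things get simple.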
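resid is the number of bytes that were left untransferred.
duration is the time from when the SCSI command was sent to when it completed, in milliseconds.
sbp points to the user memory into which the SCSI sense buffer is written.
sb_len_wr is the number of bytes actually written to the user memory that sbp points to. If the command succeeds, the sense buffer is generally not written, in which case sb_len_wr is 0.
mx_sb_len is the maximum size that may be written to sbp. Even the guys forging student IDs outside the east gate of Renmin University know that sb_len_wr <= mx_sb_len.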
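Two of these output values are plain arithmetic, so they are easy to model outside the kernel. The sketch below mirrors line 329 (duration) and line 333 (the clamp that keeps sb_len_wr within mx_sb_len). The function names duration_ms and sense_bytes_written are invented for this example, and the tick rate is passed as a parameter in place of the kernel's HZ.

```c
#include <assert.h>

/* Elapsed ticks converted to milliseconds, as on line 329. */
unsigned int duration_ms(unsigned long start, unsigned long now,
                         unsigned long hz)
{
    assert(hz > 0);
    return (unsigned int)(((now - start) * 1000) / hz);
}

/* Bytes of sense data that may be copied out: min(mx_sb_len, sense_len). */
unsigned char sense_bytes_written(unsigned char mx_sb_len,
                                  unsigned int sense_len)
{
    return sense_len < mx_sb_len ? (unsigned char)sense_len : mx_sb_len;
}
```

With a tick rate of 250, for example, one tick is 4 ms, so the reported duration can only be as fine-grained as the tick itself.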
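So in the end we have a rough picture of how a command is sent to a SCSI device from the application level. sg_inq actually triggers the ioctl system call, and after a few twists and turns sd_ioctl gets called. sd_ioctl in turn calls scsi_cmd_ioctl, which calls sg_io (both in block/scsi_ioctl.c), and the road finally taken is still blk_execute_rq. As for how that function eventually joins hands with usb-storage, we already covered it in detail when we analyzed SCSI commands on the block layer side.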