Linux Storage I/O Stack (4): SCSI Subsystem Overview

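Overview

Layered architecture of the Linux SCSI subsystem:

[Figure 1: layered architecture of the Linux SCSI subsystem]

  • Lower level: the actual drivers for the physical interfaces to SCSI, for example the drivers that vendors develop for their own host bus adapters (HBAs). A low-level driver discovers the SCSI devices attached to the host adapter, builds the in-memory data structures the SCSI subsystem needs, and provides the message-passing interface that translates sending and receiving SCSI commands into host adapter operations.

  • Upper level: drivers for the various SCSI device types, such as the SCSI disk driver and the SCSI tape driver. An upper-level driver claims the SCSI devices discovered by the low-level drivers, assigns names to them, converts I/O on a device into SCSI commands, and hands those commands to the low-level driver.

  • Mid level: the common service functions of the SCSI stack. The upper and lower levels do their work by calling mid-level functions, and while executing, the mid level in turn calls back into functions registered by the upper and lower levels for driver-specific processing.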

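Linux SCSI model

[Figure 2: the Linux SCSI model]

The Linux SCSI model is a kernel abstraction. A host adapter bridges the host I/O bus (such as PCI) and the storage I/O bus (such as a SCSI bus). A machine can have several host adapters, each host adapter can control one or more SCSI buses, each bus can have several targets attached, and each target can contain several logical units.

In the Linux SCSI subsystem, a target corresponds to a SCSI disk. A SCSI disk can contain several logical units, all managed by the disk controller. The logical units are the storage devices that actually terminate I/O, and the kernel abstracts each of them as a device. A Host in the kernel corresponds to a host adapter, which may be physical (an HBA or RAID card) or virtual (for example a software iSCSI initiator).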
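The host, channel, target, logical unit hierarchy can be made concrete with a small userspace sketch. The struct and function names below are hypothetical and are not kernel APIs. The field semantics follow the comment in struct Scsi_Host shown later in this article: max_channel is the highest channel number, while max_id and max_lun are one more than the highest id and LUN.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical userspace mirror of the three scan limits in struct Scsi_Host. */
struct host_limits {
    unsigned int max_channel; /* highest channel number (inclusive) */
    unsigned int max_id;      /* one more than the highest target id */
    uint64_t     max_lun;     /* one more than the highest LUN */
};

/* Number of distinct <channel, id, lun> addresses under one host. */
uint64_t address_count(const struct host_limits *l)
{
    return (uint64_t)(l->max_channel + 1) * l->max_id * l->max_lun;
}

/* Print every possible address of host number host_no in H:C:T:L form. */
void print_addresses(unsigned int host_no, const struct host_limits *l)
{
    for (unsigned int c = 0; c <= l->max_channel; c++)
        for (unsigned int t = 0; t < l->max_id; t++)
            for (uint64_t lun = 0; lun < l->max_lun; lun++)
                printf("%u:%u:%u:%llu\n", host_no, c, t,
                       (unsigned long long)lun);
}
```

With the defaults that scsi_host_alloc assigns (max_channel = 0, max_id = 8, max_lun = 8), this model yields 64 candidate addresses per host.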

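The kernel uses the four-tuple <host:channel:id:lun> to uniquely identify a SCSI logical unit. For example, the disk sda at <2:0:0:0> appears in sysfs as follows: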

root@ubuntu16:/home/comet/Costor/bin# ls /sys/bus/scsi/devices/2\:0\:0\:0/block/sda/
alignment_offset  device             events_poll_msecs  integrity  removable  sda5    subsystem
bdi               discard_alignment  ext_range          power      ro         size    trace
capability        events             holders            queue      sda1       slaves  uevent
dev               events_async       inflight           range      sda2       stat
root@ubuntu16:/home/comet/Costor/bin# cat /sys/bus/scsi/devices/2\:0\:0\:0/block/sda/dev
8:0
root@ubuntu16:/home/comet/Costor/bin# ll /dev/sda
brw-rw---- 1 root disk 8, 0 Sep 19 11:36 /dev/sda
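
The four fields are:

  • host: unique number of the host adapter.
  • channel: number of the SCSI channel within the host adapter, maintained by the adapter firmware.
  • id: unique identifier of the target.
  • lun: number of the logical unit within the target.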
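As a minimal sketch of how a tool might decode such a sysfs directory name, the function below parses an H:C:T:L string into its four fields. The struct scsi_addr type and the parse_hctl name are made up for this example.

```c
#include <stdio.h>

/* Hypothetical holder for the four-tuple. */
struct scsi_addr {
    unsigned int host, channel, id;
    unsigned long long lun;
};

/* Parse "H:C:T:L"; returns 0 on success, -1 on malformed input. */
int parse_hctl(const char *name, struct scsi_addr *out)
{
    char trailing;
    /* The extra %c only matches when characters follow the LUN. */
    int n = sscanf(name, "%u:%u:%u:%llu%c", &out->host, &out->channel,
                   &out->id, &out->lun, &trailing);
    return n == 4 ? 0 : -1;
}
```

Calling parse_hctl("2:0:0:0", &addr) fills addr with host 2, channel 0, id 0 and lun 0.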

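SCSI commands

A SCSI command is defined in a Command Descriptor Block (CDB). The CDB contains the operation code that identifies the operation to perform, plus a number of operation-specific parameters.

Command            Purpose
Test unit ready    Check whether the device is ready for data transfer
Inquiry            Request basic device information
Request sense      Request error information about the previous command
Read capacity      Request storage capacity information
Read               Read data from the device
Write              Write data to the device
Mode sense         Request mode pages (device parameters)
Mode select        Configure device parameters in a mode page

With roughly 60 commands available, SCSI can serve many kinds of devices, including random-access devices such as disks and sequential-access devices such as tapes. SCSI also provides dedicated commands for enclosure services, such as reading the sensors and temperature inside a storage enclosure.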
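To make the CDB layout concrete, here is a sketch that fills in a 10-byte READ(10) CDB: opcode 0x28, a big-endian 32-bit logical block address in bytes 2 to 5, and a big-endian 16-bit transfer length in bytes 7 and 8. The function and macro names are chosen for this example only.

```c
#include <stdint.h>
#include <string.h>

#define READ_10_OPCODE  0x28
#define READ_10_CDB_LEN 10

/* Fill cdb with a READ(10) for `blocks` logical blocks starting at `lba`. */
void build_read10(uint8_t cdb[READ_10_CDB_LEN], uint32_t lba, uint16_t blocks)
{
    memset(cdb, 0, READ_10_CDB_LEN);
    cdb[0] = READ_10_OPCODE;          /* operation code */
    cdb[2] = (uint8_t)(lba >> 24);    /* logical block address, big-endian */
    cdb[3] = (uint8_t)(lba >> 16);
    cdb[4] = (uint8_t)(lba >> 8);
    cdb[5] = (uint8_t)lba;
    cdb[7] = (uint8_t)(blocks >> 8);  /* transfer length in blocks, big-endian */
    cdb[8] = (uint8_t)blocks;
    /* cdb[1] (flags), cdb[6] (group number) and cdb[9] (control) stay 0 */
}
```

In the kernel, such bytes end up in the cmnd field of struct scsi_cmnd, described later in this article.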

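Core data structures

Host adapter template scsi_host_template

A host adapter template holds what is common to all host adapters of the same model, including the request queue depth, the SCSI command handling callback, and the error handling and recovery functions. When a host adapter structure is allocated, its fields are initialized from the template. When writing a SCSI low-level driver, the first step is to define a scsi_host_template; only then can host adapters be created from it.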

struct scsi_host_template {
    struct module *module;  //low-level driver module that implements hosts based on this template
    const char *name;       //host adapter name

    int (* detect)(struct scsi_host_template *);
    int (* release)(struct Scsi_Host *);

    const char *(* info)(struct Scsi_Host *); //returns information about the HBA; optional

    int (* ioctl)(struct scsi_device *dev, int cmd, void __user *arg); //implements the user-space ioctl; optional

#ifdef CONFIG_COMPAT
    //lets 32-bit user-space programs issue ioctls on a 64-bit kernel
    int (* compat_ioctl)(struct scsi_device *dev, int cmd, void __user *arg);
#endif

    //queues a SCSI command to the low-level driver; called by the mid-layer; mandatory
    int (* queuecommand)(struct Scsi_Host *, struct scsi_cmnd *);

    //the following five error-handling callbacks are called by the mid-layer in order of increasing severity
    int (* eh_abort_handler)(struct scsi_cmnd *);        //Abort
    int (* eh_device_reset_handler)(struct scsi_cmnd *); //Device Reset
    int (* eh_target_reset_handler)(struct scsi_cmnd *); //Target Reset
    int (* eh_bus_reset_handler)(struct scsi_cmnd *);    //Bus Reset
    int (* eh_host_reset_handler)(struct scsi_cmnd *);   //Host Reset

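    //called by the mid-layer when scanning finds a new device; the driver can allocate and initialize its per-device data here
    int (* slave_alloc)(struct scsi_device *);

    //called after the device has answered INQUIRY, to perform device-specific configuration
    int (* slave_configure)(struct scsi_device *);

    //called by the mid-layer before the SCSI device is destroyed, to free the private data allocated in slave_alloc
    void (* slave_destroy)(struct scsi_device *);

    //called by the mid-layer when a new target is discovered, to allocate target-private data
    int (* target_alloc)(struct scsi_target *);

    //called by the mid-layer before the target is destroyed; the low-level driver frees the data allocated in target_alloc
    void (* target_destroy)(struct scsi_target *);

    //for drivers with custom scan logic: the mid-layer polls this function until it returns 1, meaning the scan is finished
    int (* scan_finished)(struct Scsi_Host *, unsigned long);

    //for drivers with custom scan logic: called before scanning starts
    void (* scan_start)(struct Scsi_Host *);

    //changes the queue depth of a SCSI device and returns the depth that was set
    int (* change_queue_depth)(struct scsi_device *, int);

    //fills in the BIOS geometry of the disk: heads, sectors, cylinders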
    int (* bios_param)(struct scsi_device *, struct block_device *,
            sector_t, int []);

    void (*unlock_native_capacity)(struct scsi_device *);

    //read and write callbacks for the host's procfs entry
    int (*show_info)(struct seq_file *, struct Scsi_Host *);
    int (*write_info)(struct Scsi_Host *, char *, int);

    //called by the mid-layer when a SCSI command times out
    enum blk_eh_timer_return (*eh_timed_out)(struct scsi_cmnd *);

    //called when the host adapter is reset through its sysfs attribute
    int (*host_reset)(struct Scsi_Host *shost, int reset_type);
#define SCSI_ADAPTER_RESET  1
#define SCSI_FIRMWARE_RESET 2

    const char *proc_name; //name used in the proc filesystem

    struct proc_dir_entry *proc_dir;

    int can_queue; //number of commands the host adapter can accept at the same time

    int this_id;

    /*
     * This determines the degree to which the host adapter is capable
     * of scatter-gather.
     */  //scatter-gather list parameters
    unsigned short sg_tablesize;
    unsigned short sg_prot_tablesize;

    /*
     * Set this if the host adapter has limitations beside segment count.
     */ //maximum number of sectors a single SCSI command can access
    unsigned int max_sectors;

    /*
     * DMA scatter gather segment boundary limit. A segment crossing this
     * boundary will be split in two.
     */
    unsigned long dma_boundary; //DMA scatter-gather segment boundary; a segment crossing it is split in two

#define SCSI_DEFAULT_MAX_SECTORS    1024

    short cmd_per_lun;

    /*
     * present contains counter indicating how many boards of this
     * type were found when we did the scan.
     */
    unsigned char present;

    /* If use block layer to manage tags, this is tag allocation policy */
    int tag_alloc_policy;

    /*
     * Track QUEUE_FULL events and reduce queue depth on demand.
     */
    unsigned track_queue_depth:1;

    /*
     * This specifies the mode that a LLD supports.
     */
    unsigned supported_mode:2; //mode supported by the low-level driver (initiator or target)

    /*
     * True if this host adapter uses unchecked DMA onto an ISA bus.
     */
    unsigned unchecked_isa_dma:1;

    unsigned use_clustering:1;

    /*
     * True for emulated SCSI host adapters (e.g. ATAPI).
     */
    unsigned emulated:1;

    /*
     * True if the low-level driver performs its own reset-settle delays.
     */
    unsigned skip_settle_delay:1;

    /* True if the controller does not support WRITE SAME */
    unsigned no_write_same:1;

    /*
     * True if asynchronous aborts are not supported
     */
    unsigned no_async_abort:1;

    /*
     * Countdown for host blocking with no commands outstanding.
     */
    unsigned int max_host_blocked; //value the host_blocked countdown starts from when the host is blocked

#define SCSI_DEFAULT_HOST_BLOCKED   7

    /*
     * Pointer to the sysfs class properties for this host, NULL terminated.
     */
    struct device_attribute **shost_attrs; //host adapter class attributes

    /*
     * Pointer to the SCSI device properties for this host, NULL terminated.
     */
    struct device_attribute **sdev_attrs;  //attributes of the SCSI devices on this host

    struct list_head legacy_hosts;

    u64 vendor_id;

    /*
     * Additional per-command data allocated for the driver.
     */  //SCSI command pool: commands are preallocated and kept in cmd_pool
    unsigned int cmd_size;
    struct scsi_host_cmd_pool *cmd_pool;

    /* temporary flag to disable blk-mq I/O path */
    bool disable_blk_mq;  //flag that disables the block layer multi-queue (blk-mq) path
};
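The comment on the five eh_* callbacks says the mid-layer calls them in order of severity. The sketch below is a simplified userspace model of that escalation idea only. The types, return values and the eh_escalate function are hypothetical, and the real kernel error handler does considerably more work between steps.

```c
#include <stddef.h>

enum eh_result { EH_FAILED = 0, EH_SUCCESS = 1 };   /* hypothetical values */

struct cmd;                                          /* opaque stand-in for scsi_cmnd */
typedef enum eh_result (*eh_handler)(struct cmd *);

/* Stand-in for the five eh_* callbacks, from mildest to most drastic. */
struct eh_ops {
    eh_handler abort, device_reset, target_reset, bus_reset, host_reset;
};

/*
 * Try each recovery step in order of severity and stop at the first success.
 * Returns the index of the step that worked (0 = abort ... 4 = host reset),
 * or -1 if every implemented step failed.
 */
int eh_escalate(const struct eh_ops *ops, struct cmd *c)
{
    const eh_handler steps[] = {
        ops->abort, ops->device_reset, ops->target_reset,
        ops->bus_reset, ops->host_reset,
    };
    for (size_t i = 0; i < sizeof(steps) / sizeof(steps[0]); i++) {
        /* callbacks the driver did not implement are skipped */
        if (steps[i] && steps[i](c) == EH_SUCCESS)
            return (int)i;
    }
    return -1;
}

/* Two sample handlers for demonstration. */
enum eh_result always_fail(struct cmd *c) { (void)c; return EH_FAILED; }
enum eh_result always_ok(struct cmd *c)   { (void)c; return EH_SUCCESS; }
```

A driver that implements only some of the callbacks leaves the others NULL, which this model treats as "skip to the next, more drastic step".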

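Host adapter Scsi_Host

Scsi_Host describes a SCSI host adapter, which is usually a PCI expansion card or a SCSI controller chip. A host adapter can have several channels, and each channel drives one SCSI bus. Each channel can connect several SCSI targets; the exact number depends on the load capacity of the SCSI bus or is limited by the specific SCSI protocol. A real host bus adapter is attached to the host I/O bus (usually PCI). During system startup the devices on the PCI bus are scanned, and the host bus adapter structure is allocated at that point.
The Scsi_Host structure embeds a generic device that is linked into the device list of the SCSI bus type (scsi_bus_type).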

struct Scsi_Host {
    struct list_head    __devices; //list of SCSI devices
    struct list_head    __targets; //list of targets

    struct scsi_host_cmd_pool *cmd_pool; //SCSI command pool
    spinlock_t      free_list_lock;   //protects free_list
    struct list_head    free_list; /* backup store of cmd structs: preallocated spare commands */
    struct list_head    starved_list; //list of starved SCSI devices

    spinlock_t      default_lock;
    spinlock_t      *host_lock;

    struct mutex        scan_mutex;/* serialize scanning activity */

    struct list_head    eh_cmd_q; //list of failed SCSI commands awaiting error handling
    struct task_struct    * ehandler;  /* Error recovery thread. */
    struct completion     * eh_action; /* Wait for specific actions on the
                          host. */
    wait_queue_head_t       host_wait; //wait queue used during recovery
    struct scsi_host_template *hostt;  //host adapter template
    struct scsi_transport_template *transportt; //pointer to the SCSI transport template

    /*
     * Area to keep a shared tag map (if needed, will be
     * NULL if not).
     */
    union {
        struct blk_queue_tag    *bqt;
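        struct blk_mq_tag_set   tag_set; //used when SCSI runs in multi-queue mode
    };
    //number of SCSI commands already dispatched to the host adapter (low-level driver)
    atomic_t host_busy;        /* commands actually active on low-level */
    atomic_t host_blocked;  //countdown while the host is blocked (starts from max_host_blocked)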

    unsigned int host_failed;      /* commands that failed.
                          protected by host_lock */
    unsigned int host_eh_scheduled;    /* EH scheduled without command */

    unsigned int host_no;  /* Used for IOCTL_GET_IDLUN, /proc/scsi et al.; unique within the system */

    /* next two fields are used to bound the time spent in error handling */
    int eh_deadline;
    unsigned long last_reset; //time of the last reset


    /*
     * These three parameters can be used to allow for wide scsi,
     * and for host adapters that support multiple busses
     * The last two should be set to 1 more than the actual max id
     * or lun (e.g. 8 for SCSI parallel systems).
     */
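    unsigned int max_channel; //highest channel number of the host adapter
    unsigned int max_id;      //upper bound for target ids (one more than the highest id)
    u64 max_lun;              //upper bound for LUNs (one more than the highest LUN)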

    unsigned int unique_id;

    /*
     * The maximum length of SCSI commands that this host can accept.
     * Probably 12 for most host adapters, but could be 16 for others.
     * or 260 if the driver supports variable length cdbs.
     * For drivers that don't set this field, a value of 12 is
     * assumed.
     */
    unsigned short max_cmd_len;  //longest SCSI command the host adapter accepts
    //the following fields also exist in scsi_host_template and are copied from the template
    int this_id;
    int can_queue;
    short cmd_per_lun;
    short unsigned int sg_tablesize;
    short unsigned int sg_prot_tablesize;
    unsigned int max_sectors;
    unsigned long dma_boundary;
    /*
     * In scsi-mq mode, the number of hardware queues supported by the LLD.
     *
     * Note: it is assumed that each hardware queue has a queue depth of
     * can_queue. In other words, the total queue depth per host
     * is nr_hw_queues * can_queue.
     */
    unsigned nr_hw_queues; //number of hardware queues the low-level driver supports in scsi-mq mode
    /*
     * Used to assign serial numbers to the cmds.
     * Protected by the host lock.
     */
    unsigned long cmd_serial_number;  //counter used to assign command serial numbers

    unsigned active_mode:2;           //whether the host acts as initiator or target
    unsigned unchecked_isa_dma:1;
    unsigned use_clustering:1;

    /*
     * Host has requested that no further requests come through for the
     * time being.
     */
    unsigned host_self_blocked:1; //set when the low-level driver asks to block this host; the mid-layer then stops dispatching commands to its queue

    /*
     * Host uses correct SCSI ordering not PC ordering. The bit is
     * set for the minority of drivers whose authors actually read
     * the spec ;).
     */
    unsigned reverse_ordering:1;

    /* Task mgmt function in progress */
    unsigned tmf_in_progress:1;  //a task management function is in progress

    /* Asynchronous scan in progress */
    unsigned async_scan:1;       //an asynchronous scan is in progress

    /* Don't resume host in EH */
    unsigned eh_noresume:1;      //do not resume the host during error handling

    /* The controller does not support WRITE SAME */
    unsigned no_write_same:1;

    unsigned use_blk_mq:1;       //whether SCSI multi-queue mode is used
    unsigned use_cmd_list:1;

    /* Host responded with short (<36 bytes) INQUIRY result */
    unsigned short_inquiry:1;

    /*
     * Optional work queue to be utilized by the transport
     */
    char work_q_name[20];  //name of the work queue used by the SCSI transport layer
    struct workqueue_struct *work_q;

    /*
     * Task management function work queue
     */
    struct workqueue_struct *tmf_work_q; //work queue for task management functions

    /* The transport requires the LUN bits NOT to be stored in CDB[1] */
    unsigned no_scsi2_lun_in_cdb:1;

    /*
     * Value host_blocked counts down from
     */
    unsigned int max_host_blocked; //value host_blocked counts down from before dispatching to the host resumes

    /* Protection Information */
    unsigned int prot_capabilities;
    unsigned char prot_guard_type;

    /*
     * q used for scsi_tgt msgs, async events or any other requests that
     * need to be processed in userspace
     */
    struct request_queue *uspace_req_q; //request queue for scsi_tgt messages, async events and other requests that must be handled in user space

    /* legacy crap */
    unsigned long base;
    unsigned long io_port;   //I/O port number
    unsigned char n_io_port;
    unsigned char dma_channel;
    unsigned int  irq;


    enum scsi_host_state shost_state; //host state

    /* ldm bits */ //shost_gendev: embedded generic device; links the host into the device list of the SCSI bus type (scsi_bus_type)
    struct device       shost_gendev, shost_dev;
    //shost_dev: embedded class device; links the host into the device list of the SCSI host class (shost_class)
    /*
     * List of hosts per template.
     *
     * This is only for use by scsi_module.c for legacy templates.
     * For these access to it is synchronized implicitly by
     * module_init/module_exit.
     */
    struct list_head sht_legacy_list;

    /*
     * Points to the transport data (if any) which is allocated
     * separately
     */
    void *shost_data; //separately allocated transport data, used by the SCSI transport layer

    /*
     * Points to the physical bus device we'd use to do DMA
     * Needed just in case we have virtual hosts.
     */
    struct device *dma_dev;

    /*
     * We should ensure that this is aligned, both for better performance
     * and also because some compilers (m68k) don't automatically force
     * alignment to a long boundary.
     */ //host adapter private data
    unsigned long hostdata[0]  /* Used for storage of host specific stuff */
        __attribute__ ((aligned (sizeof(unsigned long))));
};
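The trailing hostdata[0] member lets the common host fields and the driver's private data live in one allocation. The userspace sketch below illustrates that pattern with a C99 flexible array member. The names mini_host, mini_host_alloc, mini_host_priv and my_hba are hypothetical; the kernel provides a comparable accessor, shost_priv(), for the real structure.

```c
#include <stdlib.h>

/* Hypothetical miniature of struct Scsi_Host: common fields plus a private tail. */
struct mini_host {
    unsigned int host_no;
    unsigned long hostdata[];   /* C99 flexible array member; the listing uses the older [0] form */
};

/* Allocate the host and privsize bytes of zeroed private data in one block. */
struct mini_host *mini_host_alloc(unsigned int host_no, size_t privsize)
{
    struct mini_host *h = calloc(1, sizeof(*h) + privsize);
    if (h)
        h->host_no = host_no;
    return h;
}

/* The private area starts right after the common fields. */
void *mini_host_priv(struct mini_host *h)
{
    return h->hostdata;
}

/* Example of driver-private state stored in the tail. */
struct my_hba {
    int irq;
    int queue_depth;
};
```

A driver would call mini_host_alloc(n, sizeof(struct my_hba)) and cast the result of mini_host_priv() to struct my_hba *; one free() releases both parts.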

Target scsi_target

The scsi_target structure embeds a driver-model device that is linked into the device list of the SCSI bus type scsi_bus_type.

struct scsi_target {
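    struct scsi_device  *starget_sdev_user; //SCSI device currently doing I/O on this target; NULL when there is none
    struct list_head    siblings;  //entry in the host adapter's target list
    struct list_head    devices;   //list of devices belonging to this target
    struct device       dev;       //generic device, used to join the driver model
    struct kref     reap_ref; /* last put renders target invisible; reference count of this structure */
    unsigned int        channel;   //channel number this target is on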
    unsigned int        id; /* target id ... replace
                     * scsi_device.id eventually */
    unsigned int        create:1; /* signal that it needs to be added */
    unsigned int        single_lun:1;   /* Indicates we should only
                         * allow I/O to one of the luns
                         * for the device at a time. */
    unsigned int        pdt_1f_for_no_lun:1;    /* PDT = 0x1f
                         * means no lun present. */
    unsigned int        no_report_luns:1;   /* Don't use
                         * REPORT LUNS for scanning. */
    unsigned int        expecting_lun_change:1; /* A device has reported
                         * a 3F/0E UA, other devices on
                         * the same target will also. */
    /* commands actually active on LLD. */
    atomic_t        target_busy;
    atomic_t        target_blocked;           //countdown while the target is blocked

    /*
     * LLDs should set this in the slave_alloc host template callout.
     * If set to zero then there is not limit.
     */
    unsigned int        can_queue;             //number of commands processed concurrently
    unsigned int        max_target_blocked;    //value target_blocked counts down from
#define SCSI_DEFAULT_TARGET_BLOCKED 3

    char            scsi_level;                //supported SCSI specification level
    enum scsi_target_state  state;             //target state
    void            *hostdata; /* available to low-level driver */
    unsigned long       starget_data[0]; /* for the transport: used by the SCSI transport layer */
    /* starget_data must be the last element!!!! */
} __attribute__((aligned(sizeof(unsigned long))));

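Logical device scsi_device

scsi_device describes a SCSI logical device, that is, a logical unit (LUN) of a SCSI disk. The device behind a scsi_device may be a SATA/SAS/SCSI disk or an SSD, possibly located in another storage system. When the operating system finds a logical device attached to a host adapter during scanning, it creates a scsi_device structure, which the SCSI upper-level drivers use to communicate with the device.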

struct scsi_device {
    struct Scsi_Host *host;  //host bus adapter this device belongs to
    struct request_queue *request_queue; //request queue

    /* the next two are protected by the host->host_lock */
    struct list_head    siblings;   /* list of all devices on this host */ //entry in the host adapter's device list
    struct list_head    same_target_siblings; /* just the devices sharing same target id */ //entry in the target's device list

    atomic_t device_busy;       /* commands actually active on LLDD */
    atomic_t device_blocked;    /* Device returned QUEUE_FULL. */

    spinlock_t list_lock;
    struct list_head cmd_list;  /* queue of in use SCSI Command structures */
    struct list_head starved_entry; //entry in the host adapter's "starved" list
    struct scsi_cmnd *current_cmnd; /* currently active command */
    unsigned short queue_depth; /* How deep of a queue we want */
    unsigned short max_queue_depth; /* max queue depth */
    unsigned short last_queue_full_depth; /* These two are used by */
    unsigned short last_queue_full_count; /* scsi_track_queue_full() */
    unsigned long last_queue_full_time; /* last queue full time */
    unsigned long queue_ramp_up_period; /* ramp up period in jiffies */
#define SCSI_DEFAULT_RAMP_UP_PERIOD (120 * HZ)

    unsigned long last_queue_ramp_up;   /* last queue ramp up time */

    unsigned int id, channel; //target id and channel number this scsi_device belongs to
    u64 lun;  //LUN of this device
    unsigned int manufacturer;  /* Manufacturer of device, for using
                     * vendor-specific cmd's */
    unsigned sector_size;   /* size in bytes: hardware sector size */

    void *hostdata;     /* available to low-level driver: private data */
    char type;          //SCSI device type
    char scsi_level;    //SCSI specification level supported, obtained from INQUIRY
    char inq_periph_qual;   /* PQ from INQUIRY data */
    unsigned char inquiry_len;  /* valid bytes in 'inquiry' */
    unsigned char * inquiry;    /* INQUIRY response data */
    const char * vendor;        /* [back_compat] point into 'inquiry' ... */
    const char * model;     /* ... after scan; point to static string */
    const char * rev;       /* ... "nullnullnullnull" before scan */

#define SCSI_VPD_PG_LEN                255
    int vpd_pg83_len;          //VPD page 0x83 (device identification), read with INQUIRY
    unsigned char *vpd_pg83;
    int vpd_pg80_len;          //VPD page 0x80 (unit serial number), read with INQUIRY
    unsigned char *vpd_pg80;
    unsigned char current_tag;  /* current tag */
    struct scsi_target      *sdev_target;   /* used only for single_lun */

    unsigned int    sdev_bflags; /* black/white flags as also found in
                 * scsi_devinfo.[hc]. For now used only to
                 * pass settings from slave_alloc to scsi
                 * core. */
    unsigned int eh_timeout; /* Error handling timeout */
    unsigned removable:1;
    unsigned changed:1; /* Data invalid due to media change */
    unsigned busy:1;    /* Used to prevent races */
    unsigned lockable:1;    /* Able to prevent media removal */
    unsigned locked:1;      /* Media removal disabled */
    unsigned borken:1;  /* Tell the Seagate driver to be
                 * painfully slow on this device */
    unsigned disconnect:1;  /* can disconnect */
    unsigned soft_reset:1;  /* Uses soft reset option */
    unsigned sdtr:1;    /* Device supports SDTR messages (synchronous data transfer) */
    unsigned wdtr:1;    /* Device supports WDTR messages (16-bit wide data transfer) */
    unsigned ppr:1;     /* Device supports PPR (parallel protocol request) messages */
    unsigned tagged_supported:1;    /* Supports SCSI-II tagged queuing */
    unsigned simple_tags:1; /* simple queue tag messages are enabled */
    unsigned was_reset:1;   /* There was a bus reset on the bus for
                 * this device */
    unsigned expecting_cc_ua:1; /* Expecting a CHECK_CONDITION/UNIT_ATTN
                     * because we did a bus reset. */
    unsigned use_10_for_rw:1; /* first try 10-byte read / write */
    unsigned use_10_for_ms:1; /* first try 10-byte mode sense/select */
    unsigned no_report_opcodes:1;   /* no REPORT SUPPORTED OPERATION CODES */
    unsigned no_write_same:1;   /* no WRITE SAME command */
    unsigned use_16_for_rw:1; /* Use read/write(16) over read/write(10) */
    unsigned skip_ms_page_8:1;  /* do not use MODE SENSE page 0x08 */
    unsigned skip_ms_page_3f:1; /* do not use MODE SENSE page 0x3f */
    unsigned skip_vpd_pages:1;  /* do not read VPD pages */
    unsigned try_vpd_pages:1;   /* attempt to read VPD pages */
    unsigned use_192_bytes_for_3f:1; /* ask for 192 bytes from page 0x3f */
    unsigned no_start_on_add:1; /* do not issue start on add */
    unsigned allow_restart:1; /* issue START_UNIT in error handler */
    unsigned manage_start_stop:1;   /* Let HLD (sd) manage start/stop */
    unsigned start_stop_pwr_cond:1; /* Set power cond. in START_STOP_UNIT */
    unsigned no_uld_attach:1; /* disable connecting to upper level drivers */
    unsigned select_no_atn:1;
    unsigned fix_capacity:1;    /* READ_CAPACITY is too high by 1 */
    unsigned guess_capacity:1;  /* READ_CAPACITY might be too high by 1 */
    unsigned retry_hwerror:1;   /* Retry HARDWARE_ERROR */
    unsigned last_sector_bug:1; /* do not use multisector accesses on
                       SD_LAST_BUGGY_SECTORS */
    unsigned no_read_disc_info:1;   /* Avoid READ_DISC_INFO cmds */
    unsigned no_read_capacity_16:1; /* Avoid READ_CAPACITY_16 cmds */
    unsigned try_rc_10_first:1; /* Try READ_CAPACACITY_10 first */
    unsigned is_visible:1;  /* is the device visible in sysfs */
    unsigned wce_default_on:1;  /* Cache is ON by default */
    unsigned no_dif:1;  /* T10 PI (DIF) should be disabled */
    unsigned broken_fua:1;      /* Don't set FUA bit */
    unsigned lun_in_cdb:1;      /* Store LUN bits in CDB[1] */

    atomic_t disk_events_disable_depth; /* disable depth for disk events */

    DECLARE_BITMAP(supported_events, SDEV_EVT_MAXBITS); /* supported events */
    DECLARE_BITMAP(pending_events, SDEV_EVT_MAXBITS); /* pending events */
    struct list_head event_list;    /* asserted events */
    struct work_struct event_work;

    unsigned int max_device_blocked; /* what device_blocked counts down from  */
#define SCSI_DEFAULT_DEVICE_BLOCKED 3

    atomic_t iorequest_cnt;
    atomic_t iodone_cnt;
    atomic_t ioerr_cnt;

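    struct device       sdev_gendev, //embedded generic device, linked into the device list of the SCSI bus type (scsi_bus_type)
                sdev_dev; //embedded class device, linked into the device list of the SCSI device class (sdev_class)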

    struct execute_work ew; /* used to get process context on put */
    struct work_struct  requeue_work;

    struct scsi_device_handler *handler; //custom device handler
    void            *handler_data;

    enum scsi_device_state sdev_state;  //SCSI device state
    unsigned long       sdev_data[0];   //used by the SCSI transport layer
} __attribute__((aligned(sizeof(unsigned long))));
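The comment "what device_blocked counts down from" hints at how blocking works: after the device refuses a command, a counter is loaded from max_device_blocked and dispatch resumes only once it has counted down to zero. The following is a simplified userspace model of that idea under stated assumptions. The mini_dev type and both functions are hypothetical, and the kernel's real queue-ready check also handles queue depth, locking and atomics.

```c
#include <stdbool.h>

#define DEFAULT_MAX_BLOCKED 3   /* mirrors SCSI_DEFAULT_DEVICE_BLOCKED in the listing */

/* Hypothetical miniature of the blocking-related fields of scsi_device. */
struct mini_dev {
    int busy;         /* commands currently outstanding */
    int blocked;      /* remaining countdown; 0 means not blocked */
    int max_blocked;  /* value the countdown restarts from */
};

/* The device refused a command (for example QUEUE_FULL): start the countdown. */
void dev_block(struct mini_dev *d)
{
    d->blocked = d->max_blocked;
}

/*
 * Called on each dispatch attempt. While blocked, the counter only ticks
 * down when nothing is outstanding; dispatch resumes once it reaches 0.
 */
bool dev_queue_ready(struct mini_dev *d)
{
    if (d->blocked > 0) {
        if (d->busy > 0)
            return false;      /* wait for outstanding commands first */
        if (--d->blocked > 0)
            return false;      /* still counting down */
    }
    return true;
}
```

In this model, the third dispatch attempt after a block succeeds when max_blocked is 3, provided the device is idle.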

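Kernel SCSI command structure scsi_cmnd

The scsi_cmnd structure is created by the SCSI mid-layer and passed to the SCSI low-level driver. A scsi_cmnd is created for every I/O request, but a scsi_cmnd does not necessarily originate from an I/O request. Each scsi_cmnd eventually becomes one concrete SCSI command. Besides the command descriptor block, scsi_cmnd carries richer information, including the data buffer, the sense data buffer, the completion callback and the associated block layer request. It is the context in which the mid-layer executes a SCSI command.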

struct scsi_cmnd {
    struct scsi_device *device;  //SCSI device this command belongs to
    struct list_head list;  /* scsi_cmnd participates in queue lists: entry in the device's command list */
    struct list_head eh_entry; /* entry for the host eh_cmd_q */
    struct delayed_work abort_work;
    int eh_eflags;      /* Used by error handlr */

    /*
     * A SCSI Command is assigned a nonzero serial_number before passed
     * to the driver's queue command function.  The serial_number is
     * cleared when scsi_done is entered indicating that the command
     * has been completed.  It is a bug for LLDDs to use this number
     * for purposes other than printk (and even that is only useful
     * for debugging).
     */
    unsigned long serial_number; //unique serial number of the command

    /*
     * This is set to jiffies as it was when the command was first
     * allocated.  It is used to time how long the command has
     * been outstanding
     */
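    unsigned long jiffies_at_alloc; //jiffies at allocation time, used to measure how long the command has been outstanding

    int retries;  //number of retries so far
    int allowed;  //number of retries allowed

    unsigned char prot_op;    //protection operation (DIF and DIX)
    unsigned char prot_type;  //DIF protection type
    unsigned char prot_flags;

    unsigned short cmd_len;   //command length
    enum dma_data_direction sc_data_direction;  //data transfer direction

    /* These elements define the operation we are about to perform */
    unsigned char *cmnd;  //command bytes (CDB) in the format defined by the SCSI specification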


    /* These elements define the operation we ultimately want to perform */
    struct scsi_data_buffer sdb;        //data buffer of the command
    struct scsi_data_buffer *prot_sdb;  //protection information buffer of the command

    unsigned underflow; /* Return error if less than
                   this amount is transferred */

    unsigned transfersize;  /* How much we are guaranteed to
                   transfer with each SCSI transfer
                   (ie, between disconnect /
                   reconnects.   Probably == sector
                   size */

    struct request *request;    /* The block layer request we are
                       working on */

#define SCSI_SENSE_BUFFERSIZE   96
    unsigned char *sense_buffer;    //sense data buffer of the command
                /* obtained by REQUEST SENSE when
                 * CHECK CONDITION is received on original
                 * command (auto-sense) */

    /* Low-level done function - can be used by low-level driver to point
     *        to completion function.  Not used by mid/upper level code. */
    void (*scsi_done) (struct scsi_cmnd *); //called when the low-level driver has completed the command

    /*
     * The following fields can be written to by the host specific code.
     * Everything else should be left alone.
     */
    struct scsi_pointer SCp;    /* Scratchpad used by some host adapters */

    unsigned char *host_scribble;   /* The host adapter is allowed to
                     * call scsi_malloc and get some memory
                     * and hang it here.  The host adapter
                     * is also expected to call scsi_free
                     * to release this memory.  (The memory
                     * obtained by scsi_malloc is guaranteed
                     * to be at an address < 16Mb). */

    int result;     /* Status code from lower level driver */
    int flags;      /* Command flags */

    unsigned char tag;  /* SCSI-II queued command tag */
};
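The sense_buffer field holds the sense data returned after a CHECK CONDITION. As an illustration, the sketch below decodes fixed-format sense data (response code 0x70 or 0x71): the sense key is the low nibble of byte 2, and the additional sense code and qualifier are bytes 12 and 13. The struct and function names are invented for this example and are not the kernel's own helpers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the three values most often inspected. */
struct sense_info {
    uint8_t key;   /* sense key, e.g. 0x5 = ILLEGAL REQUEST */
    uint8_t asc;   /* additional sense code */
    uint8_t ascq;  /* additional sense code qualifier */
};

/* Decode fixed-format sense data; returns false for anything else. */
bool parse_fixed_sense(const uint8_t *buf, size_t len, struct sense_info *out)
{
    if (len < 14)
        return false;                  /* bytes 0..13 are needed */
    uint8_t code = buf[0] & 0x7f;      /* bit 7 is the VALID flag */
    if (code != 0x70 && code != 0x71)
        return false;                  /* descriptor format is not handled here */
    out->key  = buf[2] & 0x0f;
    out->asc  = buf[12];
    out->ascq = buf[13];
    return true;
}
```

For instance, key 0x5 with ASC 0x24 and ASCQ 0x00 means ILLEGAL REQUEST, INVALID FIELD IN CDB.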

Driver scsi_driver

struct scsi_driver {
    struct device_driver    gendrv;  // "inherits" from device_driver

    void (*rescan)(struct device *); //called when the device is rescanned
    int (*init_command)(struct scsi_cmnd *);
    void (*uninit_command)(struct scsi_cmnd *);
    int (*done)(struct scsi_cmnd *);  //called when the low-level driver completes a SCSI command; computes the number of bytes completed
    int (*eh_action)(struct scsi_cmnd *, int); //error handling callback
};
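The "inheritance" through the embedded gendrv member works because a pointer to the embedded struct can be converted back to a pointer to the enclosing struct. The userspace sketch below shows this container_of technique with stand-in types; mini_device_driver, mini_scsi_driver and to_mini_scsi_driver are hypothetical names.

```c
#include <stddef.h>

/* Userspace version of the container_of idea: member pointer -> enclosing struct. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for struct device_driver. */
struct mini_device_driver {
    const char *name;
};

/* Stand-in for struct scsi_driver. */
struct mini_scsi_driver {
    int flags;
    struct mini_device_driver gendrv;   /* embedded "base class" */
    int (*done)(int bytes);
};

/* Recover the SCSI driver from the generic driver pointer. */
struct mini_scsi_driver *to_mini_scsi_driver(struct mini_device_driver *drv)
{
    return container_of(drv, struct mini_scsi_driver, gendrv);
}

/* Sample callback: reports the byte count it was given. */
int report_done(int bytes)
{
    return bytes;
}
```

Generic code only ever sees the embedded struct; SCSI-specific code converts back when it needs the extra callbacks.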

Device model

  • scsi_bus_type: the bus type of the SCSI subsystem

struct bus_type scsi_bus_type = {
    .name       = "scsi",   // corresponds to /sys/bus/scsi
    .match      = scsi_bus_match,
    .uevent     = scsi_bus_uevent,
#ifdef CONFIG_PM
    .pm     = &scsi_bus_pm_ops,
#endif
};
EXPORT_SYMBOL_GPL(scsi_bus_type);

  • shost_class: the class of SCSI host adapters

static struct class shost_class = {
    .name       = "scsi_host",  // corresponds to /sys/class/scsi_host
    .dev_release    = scsi_host_cls_release,
};

[Figure 3: SCSI device model]

Initialization

The SCSI subsystem is loaded when the operating system boots. Its entry function is init_scsi, registered with subsys_initcall:

static int __init init_scsi(void)
{
    int error;

    error = scsi_init_queue();  //initialize the memory pools needed for scatter-gather lists
    if (error)
        return error;
    error = scsi_init_procfs(); //create the SCSI-related entries in procfs
    if (error)
        goto cleanup_queue;
    error = scsi_init_devinfo();//set up the dynamic SCSI device information list
    if (error)
        goto cleanup_procfs;
    error = scsi_init_hosts();  //register the shost_class class, creating the scsi_host directory under /sys/class/
    if (error)
        goto cleanup_devlist;
    error = scsi_init_sysctl(); //register the SCSI sysctl table
    if (error)
        goto cleanup_hosts;
    error = scsi_sysfs_register(); //register the scsi_bus_type bus type and the sdev_class class
    if (error)
        goto cleanup_sysctl;

    scsi_netlink_init(); //initialize the SCSI transport netlink interface

    printk(KERN_NOTICE "SCSI subsystem initialized\n");
    return 0;

cleanup_sysctl:
    scsi_exit_sysctl();
cleanup_hosts:
    scsi_exit_hosts();
cleanup_devlist:
    scsi_exit_devinfo();
cleanup_procfs:
    scsi_exit_procfs();
cleanup_queue:
    scsi_exit_queue();
    printk(KERN_ERR "SCSI subsystem failed to initialize, error = %d\n",
           -error);
    return error;
}
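init_scsi uses the common goto-based unwinding pattern: each failure jumps to a label that undoes only the steps already completed, in reverse order. The self-contained sketch below reproduces that control flow with three dummy steps and a trace string. All names in it are hypothetical.

```c
#include <stdio.h>
#include <string.h>

static char trace[64];      /* records the order of init/exit calls */
static int  fail_at = 0;    /* step number that should fail; 0 = none */

static int step_init(int n)
{
    if (n == fail_at)
        return -1;
    snprintf(trace + strlen(trace), sizeof(trace) - strlen(trace), "+%d", n);
    return 0;
}

static void step_exit(int n)
{
    snprintf(trace + strlen(trace), sizeof(trace) - strlen(trace), "-%d", n);
}

/* Three-step init that unwinds completed steps in reverse order on failure. */
int subsystem_init(int fail_step)
{
    int error;

    trace[0] = '\0';
    fail_at = fail_step;

    error = step_init(1);
    if (error)
        return error;       /* nothing to undo yet */
    error = step_init(2);
    if (error)
        goto cleanup_1;
    error = step_init(3);
    if (error)
        goto cleanup_2;
    return 0;

cleanup_2:
    step_exit(2);           /* falls through to undo step 1 as well */
cleanup_1:
    step_exit(1);
    return error;
}

/* Accessor so callers can inspect what happened. */
const char *init_trace(void)
{
    return trace;
}
```

The labels are ordered so that falling through from one to the next performs the remaining cleanups, which is why init_scsi lists them from the latest step to the earliest.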

scsi_init_hosts registers shost_class, the class that the host adapters of the SCSI subsystem belong to:

int scsi_init_hosts(void)
{
    return class_register(&shost_class);
}

scsi_sysfs_register registers the bus type scsi_bus_type and sdev_class, the class that SCSI devices belong to:

int scsi_sysfs_register(void)
{
    int error;

    error = bus_register(&scsi_bus_type);
    if (!error) {
        error = class_register(&sdev_class);
        if (error)
            bus_unregister(&scsi_bus_type);
    }

    return error;
}

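SCSI low-level drivers are written for host adapters. When a low-level driver is loaded, it must add its host adapter, which can happen in two ways: (1) when the PCI subsystem scans the bus and binds the driver, or (2) manually. All host adapters with a hardware PCI interface use the first way. Adding a host adapter takes two steps:
1. Allocate the host adapter data structure with scsi_host_alloc
2. Add the host adapter to the system with scsi_add_host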

struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
{
    struct Scsi_Host *shost;
    gfp_t gfp_mask = GFP_KERNEL;

    if (sht->unchecked_isa_dma && privsize)
        gfp_mask |= __GFP_DMA;
    //allocate Scsi_Host and the private data area in a single allocation
    shost = kzalloc(sizeof(struct Scsi_Host) + privsize, gfp_mask);
    if (!shost)
        return NULL;

    shost->host_lock = &shost->default_lock;
    spin_lock_init(shost->host_lock);
    shost->shost_state = SHOST_CREATED; //set the initial state
    INIT_LIST_HEAD(&shost->__devices);  //initialize the SCSI device list
    INIT_LIST_HEAD(&shost->__targets);  //initialize the target list
    INIT_LIST_HEAD(&shost->eh_cmd_q);   //initialize the list of failed SCSI commands
    INIT_LIST_HEAD(&shost->starved_list);   //initialize the starved list
    init_waitqueue_head(&shost->host_wait);
    mutex_init(&shost->scan_mutex);

    /*
     * subtract one because we increment first then return, but we need to
     * know what the next host number was before increment
     */ //assign host adapter numbers in increasing order
    shost->host_no = atomic_inc_return(&scsi_host_next_hn) - 1;
    shost->dma_channel = 0xff;

    /* These three are default values which can be overridden */
    shost->max_channel = 0; //default highest channel number is 0
    shost->max_id = 8;      //default upper bound for target ids
    shost->max_lun = 8;     //default upper bound for LUNs

    /* Give each shost a default transportt */
    shost->transportt = &blank_transport_template;  //default (blank) SCSI transport template

    /*
     * All drivers right now should be able to handle 12 byte
     * commands.  Every so often there are requests for 16 byte
     * commands, but individual low-level drivers need to certify that
     * they actually do something sensible with such commands.
     */
    shost->max_cmd_len = 12;  //maximum SCSI command length
    shost->hostt = sht;       //remember the host adapter template
    shost->this_id = sht->this_id;
    shost->can_queue = sht->can_queue;
    shost->sg_tablesize = sht->sg_tablesize;
    shost->sg_prot_tablesize = sht->sg_prot_tablesize;
    shost->cmd_per_lun = sht->cmd_per_lun;
    shost->unchecked_isa_dma = sht->unchecked_isa_dma;
    shost->use_clustering = sht->use_clustering;
    shost->no_write_same = sht->no_write_same;

    if (shost_eh_deadline == -1 || !sht->eh_host_reset_handler)
        shost->eh_deadline = -1;
    else if ((ulong) shost_eh_deadline * HZ > INT_MAX) {
        shost_printk(KERN_WARNING, shost,
                 "eh_deadline %u too large, setting to %u\n",
                 shost_eh_deadline, INT_MAX / HZ);
        shost->eh_deadline = INT_MAX;
    } else
        shost->eh_deadline = shost_eh_deadline * HZ;

    if (sht->supported_mode == MODE_UNKNOWN) //the template specifies the HBA mode
        /* means we didn't set it ... default to INITIATOR */
        shost->active_mode = MODE_INITIATOR;  //the host adapter mode defaults to initiator
    else
        shost->active_mode = sht->supported_mode;

    if (sht->max_host_blocked)
        shost->max_host_blocked = sht->max_host_blocked;
    else
        shost->max_host_blocked = SCSI_DEFAULT_HOST_BLOCKED;

    /*
     * If the driver imposes no hard sector transfer limit, start at
     * machine infinity initially.
     */
    if (sht->max_sectors)
        shost->max_sectors = sht->max_sectors;
    else
        shost->max_sectors = SCSI_DEFAULT_MAX_SECTORS;

    /*
     * assume a 4GB boundary, if not set
     */
    if (sht->dma_boundary)
        shost->dma_boundary = sht->dma_boundary;
    else
        shost->dma_boundary = 0xffffffff;  //the DMA boundary defaults to 4 GB

    shost->use_blk_mq = scsi_use_blk_mq && !shost->hostt->disable_blk_mq;

    device_initialize(&shost->shost_gendev); // initialize the host adapter's embedded generic device
    dev_set_name(&shost->shost_gendev, "host%d", shost->host_no);
    shost->shost_gendev.bus = &scsi_bus_type;   // set the bus type of the host adapter
    shost->shost_gendev.type = &scsi_host_type; // set the device type of the host adapter

    device_initialize(&shost->shost_dev);    // initialize the host adapter's embedded class device
    shost->shost_dev.parent = &shost->shost_gendev; // the class device's parent is the generic device
    shost->shost_dev.class = &shost_class;   // the class device belongs to shost_class
    dev_set_name(&shost->shost_dev, "host%d", shost->host_no);
    shost->shost_dev.groups = scsi_sysfs_shost_attr_groups;  // attribute groups of the class device

    shost->ehandler = kthread_run(scsi_error_handler, shost,  // start the host adapter's error-recovery kernel thread
            "scsi_eh_%d", shost->host_no);
    if (IS_ERR(shost->ehandler)) {
        shost_printk(KERN_WARNING, shost,
            "error handler thread failed to spawn, error = %ld\n",
            PTR_ERR(shost->ehandler));
        goto fail_kfree;
    }
    // allocate the task management function (TMF) workqueue
    shost->tmf_work_q = alloc_workqueue("scsi_tmf_%d",
                        WQ_UNBOUND | WQ_MEM_RECLAIM,
                       1, shost->host_no);
    if (!shost->tmf_work_q) {
        shost_printk(KERN_WARNING, shost,
                 "failed to create tmf workq\n");
        goto fail_kthread;
    }
    scsi_proc_hostdir_add(shost->hostt); // add the host adapter's directory to procfs, i.e. create /proc/scsi/<host adapter name>
    return shost;

 fail_kthread:
    kthread_stop(shost->ehandler);
 fail_kfree:
    kfree(shost);
    return NULL;
}
EXPORT_SYMBOL(scsi_host_alloc);
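
The eh_deadline handling in scsi_host_alloc is easy to misread inline. The following is a minimal userspace sketch of the same three-way choice, not kernel code: SKETCH_HZ and sketch_eh_deadline are hypothetical names, and the tick rate of 250 is an assumed value chosen only for illustration.

```c
#include <limits.h>

/* Hypothetical stand-in for the kernel's HZ (timer ticks per second). */
#define SKETCH_HZ 250

/*
 * Mirrors the choice made in scsi_host_alloc():
 *   - a deadline of -1, or a template without a host reset handler,
 *     disables the deadline;
 *   - a value whose tick count would overflow an int is clamped to INT_MAX;
 *   - anything else is converted from seconds to ticks.
 */
int sketch_eh_deadline(int deadline_secs, int has_host_reset_handler)
{
    if (deadline_secs == -1 || !has_host_reset_handler)
        return -1;
    if ((unsigned long)deadline_secs * SKETCH_HZ > INT_MAX)
        return INT_MAX;
    return deadline_secs * SKETCH_HZ;
}
```

With these assumptions, a 10 second deadline on a host that has a reset handler becomes 2500 ticks, while the same deadline on a host without one stays disabled.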
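
scsi_add_host is a thin inline wrapper that passes the parent device as the DMA device too:

static inline int __must_check scsi_add_host(struct Scsi_Host *host,
                         struct device *dev) // dev is the parent device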
{
    return scsi_add_host_with_dma(host, dev, dev);
}

int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
               struct device *dma_dev)
{
    struct scsi_host_template *sht = shost->hostt;
    int error = -EINVAL;

    shost_printk(KERN_INFO, shost, "%s\n",
            sht->info ? sht->info(shost) : sht->name);

    if (!shost->can_queue) {
        shost_printk(KERN_ERR, shost,
                 "can_queue = 0 no longer supported\n");
        goto fail;
    }

    if (shost_use_blk_mq(shost)) {         // if the host adapter uses multi-queue (blk-mq) I/O,
        error = scsi_mq_setup_tags(shost); // set up the matching multi-queue tag environment
        if (error)
            goto fail;
    } else {
        shost->bqt = blk_init_tags(shost->can_queue,
                shost->hostt->tag_alloc_policy);
        if (!shost->bqt) {
            error = -ENOMEM;
            goto fail;
        }
    }

    /*
     * Note that we allocate the freelist even for the MQ case for now,
     * as we need a command set aside for scsi_reset_provider.  Having
     * the full host freelist and one command available for that is a
     * little heavy-handed, but avoids introducing a special allocator
     * just for this.  Eventually the structure of scsi_reset_provider
     * will need a major overhaul.
     */ // allocate the buffers for SCSI commands and sense data, plus the backup command freelist
    error = scsi_setup_command_freelist(shost);
    if (error)
        goto out_destroy_tags;

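    // Set the host adapter's parent device, which fixes its position in sysfs; the caller usually passes the device of a pci_dev as dev.
    if (!shost->shost_gendev.parent)
        shost->shost_gendev.parent = dev ? dev : &platform_bus; // fall back to platform_bus when dev is NULL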
    if (!dma_dev)
        dma_dev = shost->shost_gendev.parent;

    shost->dma_dev = dma_dev;

    error = device_add(&shost->shost_gendev);  // register the host adapter's generic device with the system
    if (error)
        goto out_destroy_freelist;

    pm_runtime_set_active(&shost->shost_gendev);
    pm_runtime_enable(&shost->shost_gendev);
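    device_enable_async_suspend(&shost->shost_gendev); // allow asynchronous suspend of the generic device

    scsi_host_set_state(shost, SHOST_RUNNING);  // set the host adapter state
    get_device(shost->shost_gendev.parent);     // take a reference on the generic device's parent

    device_enable_async_suspend(&shost->shost_dev);  // allow asynchronous suspend of the class device

    error = device_add(&shost->shost_dev);    // register the host adapter's class device with the system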
    if (error)
        goto out_del_gendev;

    get_device(&shost->shost_gendev);

    if (shost->transportt->host_size) {  // per-host data area used by the SCSI transport layer
        shost->shost_data = kzalloc(shost->transportt->host_size,
                     GFP_KERNEL);
        if (shost->shost_data == NULL) {
            error = -ENOMEM;
            goto out_del_dev;
        }
    }

    if (shost->transportt->create_work_queue) {
        snprintf(shost->work_q_name, sizeof(shost->work_q_name),
             "scsi_wq_%d", shost->host_no);
        shost->work_q = create_singlethread_workqueue( // allocate the workqueue used by the SCSI transport layer
                    shost->work_q_name);
        if (!shost->work_q) {
            error = -EINVAL;
            goto out_free_shost_data;
        }
    }

    error = scsi_sysfs_add_host(shost); // add the host adapter to the subsystem
    if (error)
        goto out_destroy_host;

    scsi_proc_host_add(shost);  // add the host adapter's information to procfs
    return error;

 out_destroy_host:
    if (shost->work_q)
        destroy_workqueue(shost->work_q);
 out_free_shost_data:
    kfree(shost->shost_data);
 out_del_dev:
    device_del(&shost->shost_dev);
 out_del_gendev:
    device_del(&shost->shost_gendev);
 out_destroy_freelist:
    scsi_destroy_command_freelist(shost);
 out_destroy_tags:
    if (shost_use_blk_mq(shost))
        scsi_mq_destroy_tags(shost);
 fail:
    return error;
}
EXPORT_SYMBOL(scsi_add_host_with_dma);
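
The error path of scsi_add_host_with_dma uses the usual kernel idiom of goto labels that unwind in reverse order of acquisition. The sketch below reproduces only that idiom in userspace, under the assumption that three plain heap allocations can stand in for the tags, the command freelist and the transport data; every name in it is hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the resources a host owns after registration. */
struct sketch_host {
    void *tags;
    void *freelist;
    void *transport_data;
};

/*
 * Acquire three resources in order. fail_at (1..3) simulates a failure at
 * that step; 0 means every step succeeds.
 */
int sketch_add_host(struct sketch_host *h, int fail_at)
{
    h->tags = NULL;
    h->freelist = NULL;
    h->transport_data = NULL;

    h->tags = (fail_at == 1) ? NULL : malloc(16);
    if (!h->tags)
        goto fail;

    h->freelist = (fail_at == 2) ? NULL : malloc(16);
    if (!h->freelist)
        goto out_destroy_tags;

    h->transport_data = (fail_at == 3) ? NULL : malloc(16);
    if (!h->transport_data)
        goto out_destroy_freelist;

    return 0;

    /* Labels run in reverse order of acquisition and fall through. */
out_destroy_freelist:
    free(h->freelist);
    h->freelist = NULL;
out_destroy_tags:
    free(h->tags);
    h->tags = NULL;
fail:
    return -ENOMEM;
}

/* Release everything a successful sketch_add_host() acquired. */
void sketch_remove_host(struct sketch_host *h)
{
    free(h->transport_data);
    free(h->freelist);
    free(h->tags);
    h->transport_data = h->freelist = h->tags = NULL;
}
```

Because each label falls through to the next, a failure at step N releases exactly the resources acquired in steps 1 to N-1, which is the property the kernel function relies on.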

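Device probing

During boot, the kernel scans the default PCI root bus, which triggers PCI device enumeration and builds the PCI device tree. A SCSI host adapter is a device attached to the PCI bus, so the PCI bus driver layer discovers it as a PCI device (PCI devices are discovered by accessing configuration space). Once the adapter is found, the operating system loads its driver, which is the low-level driver described above. That driver allocates a SCSI host adapter descriptor from the SCSI host adapter template, adds it to the system, and then starts scanning the next-level bus that the adapter exposes: the SCSI bus.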

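The SCSI mid-layer builds an INQUIRY command for each candidate ID and LUN and submits it to the block I/O subsystem. The block layer eventually calls back into the mid-layer's strategy routine, which recovers the SCSI command structure and hands it to the low-level driver's queuecommand callback.
If a target node with a given ID exists on the SCSI bus, it must respond to an INQUIRY addressed to LUN 0. So by sending INQUIRY to LUN 0 of each target ID, or by trying each LUN in turn, and checking whether a response arrives, the mid-layer learns which logical units are attached in the SCSI domain and what they are.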
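
To make the INQUIRY step concrete, here is a minimal userspace sketch of the 6-byte CDB that scsi_probe_lun (shown later in this article) fills in: the opcode goes in byte 0 and the allocation length in byte 4. The sketch_ and SKETCH_ names are hypothetical; 0x12 is the INQUIRY operation code defined by the SCSI Primary Commands standard.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_INQUIRY_OPCODE  0x12  /* INQUIRY operation code */
#define SKETCH_INQUIRY_CDB_LEN 6     /* INQUIRY uses a 6-byte CDB */

/*
 * Build a standard INQUIRY CDB that asks the device for alloc_len bytes
 * of INQUIRY data. All other fields stay zero, as in scsi_probe_lun().
 */
void sketch_build_inquiry_cdb(uint8_t cdb[SKETCH_INQUIRY_CDB_LEN],
                              uint8_t alloc_len)
{
    memset(cdb, 0, SKETCH_INQUIRY_CDB_LEN);
    cdb[0] = SKETCH_INQUIRY_OPCODE;
    cdb[4] = alloc_len;
}
```

Calling sketch_build_inquiry_cdb(cdb, 36) yields the conservative first-pass request that the kernel issues when it knows nothing about the device yet.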

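How the SCSI bus is actually scanned can be decided by the host adapter firmware or by the host adapter driver. This article only covers the case where the driver calls scsi_scan_host, the generic scan function provided by the SCSI mid-layer.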

void scsi_scan_host(struct Scsi_Host *shost)
{
    struct async_scan_data *data;

    if (strncmp(scsi_scan_type, "none", 4) == 0) // scanning is disabled by the scan type
        return;
    if (scsi_autopm_get_host(shost) < 0)
        return;

    data = scsi_prep_async_scan(shost); // prepare an asynchronous scan
    if (!data) {
        do_scsi_scan_host(shost);    // synchronous scan
        scsi_autopm_put_host(shost);
        return;
    }

    /* register with the async subsystem so wait_for_device_probe()
     * will flush this work
     */
    async_schedule(do_scan_async, data);  // asynchronous scan

    /* scsi_autopm_put_host(shost) is called in scsi_finish_async_scan() */
}
EXPORT_SYMBOL(scsi_scan_host);
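
As a rough userspace sketch of how the scan type string selects a path: scsi_scan_host itself only tests for "none" with a 4-character prefix comparison. The sketch assumes that the "sync" case is recognized the same way inside scsi_prep_async_scan, which is not shown in this article, and it ignores the other reasons that function may decline an asynchronous scan. The enum and function names are hypothetical.

```c
#include <string.h>

enum sketch_scan_mode {
    SKETCH_SCAN_NONE,   /* do not scan at all */
    SKETCH_SCAN_SYNC,   /* scan in the caller's context */
    SKETCH_SCAN_ASYNC   /* hand the scan to the async machinery */
};

/* Classify a scan type string by its first four characters. */
enum sketch_scan_mode sketch_parse_scan_type(const char *scan_type)
{
    if (strncmp(scan_type, "none", 4) == 0)
        return SKETCH_SCAN_NONE;
    if (strncmp(scan_type, "sync", 4) == 0)
        return SKETCH_SCAN_SYNC;
    return SKETCH_SCAN_ASYNC;
}
```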

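scsi_scan_host is the host adapter scan function provided by the SCSI mid-layer. A host adapter driver that needs its own scan logic can set the scan_start and scan_finished callbacks in the host adapter template, and the scan path invokes them instead of the generic scan.
The scsi_scan_type variable selects the scan mode: async, sync, or none. Whether the scan ends up synchronous or asynchronous, the work is done by do_scsi_scan_host: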

static void do_scsi_scan_host(struct Scsi_Host *shost)
{
    if (shost->hostt->scan_finished) {  // driver-defined scan
        unsigned long start = jiffies;
        if (shost->hostt->scan_start)
            shost->hostt->scan_start(shost); // driver callback that starts the scan

        while (!shost->hostt->scan_finished(shost, jiffies - start)) // returns 1 once the driver's scan is done
            msleep(10);
    } else { // generic SCSI scan; SCAN_WILD_CARD means scan every channel, target and LUN
        scsi_scan_host_selected(shost, SCAN_WILD_CARD, SCAN_WILD_CARD,
                SCAN_WILD_CARD, 0);
    }
}
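
The polling contract between do_scsi_scan_host and a driver's scan callbacks can be modeled in userspace as follows. This is only a sketch: the structures are hypothetical stand-ins for Scsi_Host and scsi_host_template, and a fake clock that advances by 10 per poll replaces jiffies and msleep(10).

```c
#include <stddef.h>

struct sketch_host;

/* Hypothetical subset of a host template: just the two scan callbacks. */
struct sketch_template {
    void (*scan_start)(struct sketch_host *h);
    int (*scan_finished)(struct sketch_host *h, unsigned long elapsed);
};

struct sketch_host {
    const struct sketch_template *hostt;
    int started;        /* set by the driver's scan_start */
    int polls;          /* how many times scan_finished was called */
    int generic_scans;  /* how many times the generic path was taken */
};

/* Same branch structure as do_scsi_scan_host(). */
void sketch_do_scan(struct sketch_host *h)
{
    if (h->hostt->scan_finished) {
        unsigned long elapsed = 0;

        if (h->hostt->scan_start)
            h->hostt->scan_start(h);
        while (!h->hostt->scan_finished(h, elapsed))
            elapsed += 10;  /* the kernel sleeps here instead */
    } else {
        h->generic_scans++; /* stands in for scsi_scan_host_selected() */
    }
}

/* Example driver callbacks: the scan "completes" after 30 time units. */
void sketch_scan_start(struct sketch_host *h)
{
    h->started = 1;
}

int sketch_scan_finished(struct sketch_host *h, unsigned long elapsed)
{
    h->polls++;
    return elapsed >= 30;
}
```

Note that scan_finished alone decides which branch runs: a template that sets scan_start but leaves scan_finished NULL still gets the generic scan.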

If the host adapter template provides custom scan callbacks, do_scsi_scan_host calls them. Otherwise it falls back to the default scan function, scsi_scan_host_selected.

int scsi_scan_host_selected(struct Scsi_Host *shost, unsigned int channel,
                unsigned int id, u64 lun, int rescan)
{
    SCSI_LOG_SCAN_BUS(3, shost_printk (KERN_INFO, shost,
        "%s: <%u:%u:%llu>\n",
        __func__, channel, id, lun));
    // check that channel, id and lun are valid
    if (((channel != SCAN_WILD_CARD) && (channel > shost->max_channel)) ||
        ((id != SCAN_WILD_CARD) && (id >= shost->max_id)) ||
        ((lun != SCAN_WILD_CARD) && (lun >= shost->max_lun)))
        return -EINVAL;

    mutex_lock(&shost->scan_mutex);
    if (!shost->async_scan)
        scsi_complete_async_scans();
    // check whether the Scsi_Host state allows scanning
    if (scsi_host_scan_allowed(shost) && scsi_autopm_get_host(shost) == 0) {
        if (channel == SCAN_WILD_CARD)
            for (channel = 0; channel <= shost->max_channel; // iterate over every channel
                 channel++)
                scsi_scan_channel(shost, channel, id, lun,  // scan one channel
                          rescan);
        else
            scsi_scan_channel(shost, channel, id, lun, rescan); // scan the specified channel
        scsi_autopm_put_host(shost);
    }
    mutex_unlock(&shost->scan_mutex);

    return 0;
}
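
The range check at the top of scsi_scan_host_selected has one subtlety: max_channel is an inclusive bound, while max_id and max_lun are exclusive. A small userspace sketch of the same test follows; the structure and macro names are hypothetical, and the wildcard is assumed to be the all-ones value.

```c
#include <stdint.h>

/* Assumed all-ones wildcard values, one per operand width. */
#define SKETCH_WILD_CARD     (~0U)
#define SKETCH_WILD_CARD_LUN (~0ULL)

/* Hypothetical subset of the limits a host advertises. */
struct sketch_limits {
    unsigned int max_channel;  /* highest valid channel (inclusive) */
    unsigned int max_id;       /* number of target IDs (exclusive bound) */
    uint64_t max_lun;          /* number of LUNs (exclusive bound) */
};

/* Returns 1 if the scan request is acceptable, 0 if it is out of range. */
int sketch_scan_args_valid(const struct sketch_limits *lim,
                           unsigned int channel, unsigned int id,
                           uint64_t lun)
{
    if (channel != SKETCH_WILD_CARD && channel > lim->max_channel)
        return 0;
    if (id != SKETCH_WILD_CARD && id >= lim->max_id)
        return 0;
    if (lun != SKETCH_WILD_CARD_LUN && lun >= lim->max_lun)
        return 0;
    return 1;
}
```

For a host with max_channel 0, max_id 8 and max_lun 8, channel 0 with id 7 and lun 7 is accepted, id 8 is rejected, and a request made entirely of wildcards is always accepted.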

scsi_scan_host_selected scans the given host adapter. Depending on its arguments it either iterates over every channel or scans one specific channel, delegating the work to scsi_scan_channel.

static void scsi_scan_channel(struct Scsi_Host *shost, unsigned int channel,
                  unsigned int id, u64 lun, int rescan)
{
    uint order_id;

    if (id == SCAN_WILD_CARD)
        for (id = 0; id < shost->max_id; ++id) {  // iterate over every target ID
            /*
             * XXX adapter drivers when possible (FCP, iSCSI)
             * could modify max_id to match the current max,
             * not the absolute max.
             *
             * XXX add a shost id iterator, so for example,
             * the FC ID can be the same as a target id
             * without a huge overhead of sparse id's.
             */
            if (shost->reverse_ordering)
                /*
                 * Scan from high to low id.
                 */
                order_id = shost->max_id - id - 1;
            else
                order_id = id;
            __scsi_scan_target(&shost->shost_gendev, channel, // scan the selected target
                    order_id, lun, rescan);
        }
    else
        __scsi_scan_target(&shost->shost_gendev, channel,
                id, lun, rescan);
}
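
The reverse_ordering branch above only changes the order in which target IDs are visited, not which IDs are visited. A one-function userspace sketch of that mapping, with a hypothetical name:

```c
/*
 * Map the loop counter id (0 .. max_id - 1) to the target ID that is
 * actually scanned, as scsi_scan_channel() does.
 */
unsigned int sketch_scan_order(unsigned int id, unsigned int max_id,
                               int reverse_ordering)
{
    return reverse_ordering ? max_id - id - 1 : id;
}
```

With max_id set to 8 and reverse ordering enabled, the first loop iteration scans target 7 and the last one scans target 0.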

__scsi_scan_target scans the LUNs inside the given target.

static void __scsi_scan_target(struct device *parent, unsigned int channel,
        unsigned int id, u64 lun, int rescan)
{
    struct Scsi_Host *shost = dev_to_shost(parent);
    int bflags = 0;
    int res;
    struct scsi_target *starget;

    if (shost->this_id == id)
        /*
         * Don't scan the host adapter
         */
        return;
    // allocate and initialize the target data structure for this id
    starget = scsi_alloc_target(parent, channel, id);
    if (!starget)
        return;
    scsi_autopm_get_target(starget);

    if (lun != SCAN_WILD_CARD) {
        /*
         * Scan for a specific host/chan/id/lun.
         */ // probe the scsi_device (LUN) with the given number in this target and add it to the subsystem
        scsi_probe_and_add_lun(starget, lun, NULL, NULL, rescan, NULL);
        goto out_reap;
    }

    /*
     * Scan LUN 0, if there is some response, scan further. Ideally, we
     * would not configure LUN 0 until all LUNs are scanned.
     */ // probe LUN 0 of the target
    res = scsi_probe_and_add_lun(starget, 0, &bflags, NULL, rescan, NULL);
    if (res == SCSI_SCAN_LUN_PRESENT || res == SCSI_SCAN_TARGET_PRESENT) {
        if (scsi_report_lun_scan(starget, bflags, rescan) != 0) // send REPORT LUNS to LUN 0 of the target
            /*
             * The REPORT LUN did not scan the target,
             * do a sequential scan.
             */
            scsi_sequential_lun_scan(starget, bflags,  // REPORT LUNS did not scan the target: probe the LUNs one by one
                         starget->scsi_level, rescan);
    }

 out_reap:
    scsi_autopm_put_target(starget);
    /*
     * paired with scsi_alloc_target(): determine if the target has
     * any children at all and if not, nuke it
     */
    scsi_target_reap(starget);

    put_device(&starget->dev);
}
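
The wildcard branch of __scsi_scan_target boils down to a small decision: probe LUN 0 first, prefer REPORT LUNS if the target answered, and fall back to a sequential scan only when REPORT LUNS did not do the job. The userspace sketch below captures just that decision; the enums and the function are hypothetical and no SCSI command is issued.

```c
/* Hypothetical mirror of the three probe outcomes used by the scan code. */
enum sketch_probe_result {
    SKETCH_NO_RESPONSE,     /* nothing answered at this target ID */
    SKETCH_TARGET_PRESENT,  /* target answered, but no usable LUN 0 */
    SKETCH_LUN_PRESENT      /* LUN 0 exists and was added */
};

enum sketch_scan_path {
    SKETCH_PATH_NONE,        /* stop: nothing to scan */
    SKETCH_PATH_REPORT_LUNS, /* REPORT LUNS enumerated the target */
    SKETCH_PATH_SEQUENTIAL   /* fall back to probing LUNs in sequence */
};

/*
 * lun0: result of probing LUN 0.
 * report_luns_ok: nonzero if the REPORT LUNS scan succeeded.
 */
enum sketch_scan_path sketch_choose_scan_path(enum sketch_probe_result lun0,
                                              int report_luns_ok)
{
    if (lun0 != SKETCH_LUN_PRESENT && lun0 != SKETCH_TARGET_PRESENT)
        return SKETCH_PATH_NONE;
    return report_luns_ok ? SKETCH_PATH_REPORT_LUNS
                          : SKETCH_PATH_SEQUENTIAL;
}
```

A target that responds with no LUN 0 configured still gets its other LUNs scanned, which is why SKETCH_TARGET_PRESENT is treated like SKETCH_LUN_PRESENT here.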

When a target is scanned, a scsi_target structure is allocated and initialized. scsi_probe_and_add_lun then probes a LUN in the target and adds each LUN it finds to the system.

static int scsi_probe_and_add_lun(struct scsi_target *starget,
                  u64 lun, int *bflagsp,
                  struct scsi_device **sdevp, int rescan,
                  void *hostdata)
{
    struct scsi_device *sdev;
    unsigned char *result;
    int bflags, res = SCSI_SCAN_NO_RESPONSE, result_len = 256;
    struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);

    /*
     * The rescan flag is used as an optimization, the first scan of a
     * host adapter calls into here with rescan == 0.
     */
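    sdev = scsi_device_lookup_by_target(starget, lun);  // look up the LUN with this number in the target
    if (sdev) {   // the LUN already exists in the target
        if (rescan || !scsi_device_created(sdev)) { // on rescan, or if the device has left the created state, report it as present without probing again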
            SCSI_LOG_SCAN_BUS(3, sdev_printk(KERN_INFO, sdev,
                "scsi scan: device exists on %s\n",
                dev_name(&sdev->sdev_gendev)));
            if (sdevp)
                *sdevp = sdev;
            else
                scsi_device_put(sdev);

            if (bflagsp)
                *bflagsp = scsi_get_device_flags(sdev,
                                 sdev->vendor,
                                 sdev->model);
            return SCSI_SCAN_LUN_PRESENT;
        }
        scsi_device_put(sdev);
    } else
        sdev = scsi_alloc_sdev(starget, lun, hostdata); // the LUN does not exist in the target yet: allocate a scsi_device
    if (!sdev)
        goto out;

    result = kmalloc(result_len, GFP_ATOMIC |
            ((shost->unchecked_isa_dma) ? __GFP_DMA : 0));
    if (!result)
        goto out_free_sdev;

    if (scsi_probe_lun(sdev, result, result_len, &bflags)) // send INQUIRY to the device to probe it
        goto out_free_result;

    if (bflagsp)
        *bflagsp = bflags;
    /*
     * result contains valid SCSI INQUIRY data.
     */
    if (((result[0] >> 5) == 3) && !(bflags & BLIST_ATTACH_PQ3)) {
        /*
         * For a Peripheral qualifier 3 (011b), the SCSI
         * spec says: The device server is not capable of
         * supporting a physical device on this logical
         * unit.
         *
         * For disks, this implies that there is no
         * logical disk configured at sdev->lun, but there
         * is a target id responding.
         */
        SCSI_LOG_SCAN_BUS(2, sdev_printk(KERN_INFO, sdev, "scsi scan:"
                   " peripheral qualifier of 3, device not"
                   " added\n"))
        if (lun == 0) {
            SCSI_LOG_SCAN_BUS(1, {
                unsigned char vend[9];
                unsigned char mod[17];

                sdev_printk(KERN_INFO, sdev,
                    "scsi scan: consider passing scsi_mod."
                    "dev_flags=%s:%s:0x240 or 0x1000240\n",
                    scsi_inq_str(vend, result, 8, 16),
                    scsi_inq_str(mod, result, 16, 32));
            });

        }

        res = SCSI_SCAN_TARGET_PRESENT;
        goto out_free_result;
    }

    /*
     * Some targets may set slight variations of PQ and PDT to signal
     * that no LUN is present, so don't add sdev in these cases.
     * Two specific examples are:
     * 1) NetApp targets: return PQ=1, PDT=0x1f
     * 2) USB UFI: returns PDT=0x1f, with the PQ bits being "reserved"
     *    in the UFI 1.0 spec (we cannot rely on reserved bits).
     *
     * References:
     * 1) SCSI SPC-3, pp. 145-146
     * PQ=1: "A peripheral device having the specified peripheral
     * device type is not connected to this logical unit. However, the
     * device server is capable of supporting the specified peripheral
     * device type on this logical unit."
     * PDT=0x1f: "Unknown or no device type"
     * 2) USB UFI 1.0, p. 20
     * PDT=00h Direct-access device (floppy)
     * PDT=1Fh none (no FDD connected to the requested logical unit)
     */
    if (((result[0] >> 5) == 1 || starget->pdt_1f_for_no_lun) &&
        (result[0] & 0x1f) == 0x1f &&
        !scsi_is_wlun(lun)) {
        SCSI_LOG_SCAN_BUS(3, sdev_printk(KERN_INFO, sdev,
                    "scsi scan: peripheral device type"
                    " of 31, no device added\n"));
        res = SCSI_SCAN_TARGET_PRESENT;
        goto out_free_result;
    }
    // add the SCSI device to the subsystem
    res = scsi_add_lun(sdev, result, &bflags, shost->async_scan);
    if (res == SCSI_SCAN_LUN_PRESENT) {
        if (bflags & BLIST_KEY) {
            sdev->lockable = 0;
            scsi_unlock_floptical(sdev, result);
        }
    }

 out_free_result:
    kfree(result);
 out_free_sdev:
    if (res == SCSI_SCAN_LUN_PRESENT) {
        if (sdevp) {
            if (scsi_device_get(sdev) == 0) {
                *sdevp = sdev;
            } else {
                __scsi_remove_device(sdev);
                res = SCSI_SCAN_NO_RESPONSE;
            }
        }
    } else
        __scsi_remove_device(sdev);
 out:
    return res;
}
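
Both "no LUN here" checks in scsi_probe_and_add_lun look only at byte 0 of the INQUIRY data: the top three bits are the peripheral qualifier (PQ) and the low five bits are the peripheral device type (PDT). The sketch below decodes those fields in userspace. It is deliberately simplified and assumes no blacklist flags apply, so it ignores BLIST_ATTACH_PQ3, pdt_1f_for_no_lun and well-known LUNs; the function names are hypothetical.

```c
#include <stdint.h>

/* Peripheral qualifier: bits 7..5 of INQUIRY byte 0. */
unsigned int sketch_inq_pq(const uint8_t *inq)
{
    return inq[0] >> 5;
}

/* Peripheral device type: bits 4..0 of INQUIRY byte 0. */
unsigned int sketch_inq_pdt(const uint8_t *inq)
{
    return inq[0] & 0x1f;
}

/*
 * Returns 1 if the simplified rules would go on to add the LUN, 0 if the
 * response only proves that the target exists.
 */
int sketch_lun_usable(const uint8_t *inq)
{
    if (sketch_inq_pq(inq) == 3)
        return 0;   /* PQ 3: no physical device can be supported here */
    if (sketch_inq_pq(inq) == 1 && sketch_inq_pdt(inq) == 0x1f)
        return 0;   /* PQ 1 with PDT 0x1f: no device connected */
    return 1;
}
```

A byte 0 of 0x00 (PQ 0, direct-access device) passes, 0x7f (PQ 3) and 0x3f (PQ 1, PDT 0x1f) do not, and 0x20 (PQ 1 with a real device type) still passes, matching the kernel comment about staying compatible with earlier behavior for PQ 1.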

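As its name suggests, scsi_probe_and_add_lun performs two operations on a LUN, probe and add:
1. scsi_probe_lun probes the logical unit by sending an INQUIRY command to the device.
2. scsi_add_lun adds the logical unit to the system based on the data returned by INQUIRY.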

static int scsi_probe_lun(struct scsi_device *sdev, unsigned char *inq_result,
              int result_len, int *bflags)
{
    unsigned char scsi_cmd[MAX_COMMAND_SIZE];
    int first_inquiry_len, try_inquiry_len, next_inquiry_len;
    int response_len = 0;
    int pass, count, result;
    struct scsi_sense_hdr sshdr;

    *bflags = 0;

    /* Perform up to 3 passes.  The first pass uses a conservative
     * transfer length of 36 unless sdev->inquiry_len specifies a
     * different value. */
    first_inquiry_len = sdev->inquiry_len ? sdev->inquiry_len : 36;
    try_inquiry_len = first_inquiry_len;
    pass = 1;

 next_pass:
    SCSI_LOG_SCAN_BUS(3, sdev_printk(KERN_INFO, sdev,
                "scsi scan: INQUIRY pass %d length %d\n",
                pass, try_inquiry_len));

    /* Each pass gets up to three chances to ignore Unit Attention */
    for (count = 0; count < 3; ++count) {
        int resid;

        memset(scsi_cmd, 0, 6);
        scsi_cmd[0] = INQUIRY;      // the command is INQUIRY
        scsi_cmd[4] = (unsigned char) try_inquiry_len;

        memset(inq_result, 0, try_inquiry_len);
        // send the SCSI command, allowing up to 3 retries
        result = scsi_execute_req(sdev,  scsi_cmd, DMA_FROM_DEVICE,
                      inq_result, try_inquiry_len, &sshdr,
                      HZ / 2 + HZ * scsi_inq_timeout, 3,
                      &resid);

        SCSI_LOG_SCAN_BUS(3, sdev_printk(KERN_INFO, sdev,
                "scsi scan: INQUIRY %s with code 0x%x\n",
                result ? "failed" : "successful", result));

        if (result) {
            /*
             * not-ready to ready transition [asc/ascq=0x28/0x0]
             * or power-on, reset [asc/ascq=0x29/0x0], continue.
             * INQUIRY should not yield UNIT_ATTENTION
             * but many buggy devices do so anyway.
             */
            if ((driver_byte(result) & DRIVER_SENSE) &&
                scsi_sense_valid(&sshdr)) {
                if ((sshdr.sense_key == UNIT_ATTENTION) &&
                    ((sshdr.asc == 0x28) ||
                     (sshdr.asc == 0x29)) &&
                    (sshdr.ascq == 0))
                    continue;
            }
        } else {
            /*
             * if nothing was transferred, we try
             * again. It's a workaround for some USB
             * devices.
             */
            if (resid == try_inquiry_len)
                continue;
        }
        break;
    }

    if (result == 0) {
        sanitize_inquiry_string(&inq_result[8], 8);
        sanitize_inquiry_string(&inq_result[16], 16);
        sanitize_inquiry_string(&inq_result[32], 4);

        response_len = inq_result[4] + 5;
        if (response_len > 255)
            response_len = first_inquiry_len;   /* sanity */

        /*
         * Get any flags for this device.
         *
         * XXX add a bflags to scsi_device, and replace the
         * corresponding bit fields in scsi_device, so bflags
         * need not be passed as an argument.
         */
        *bflags = scsi_get_device_flags(sdev, &inq_result[8],
                &inq_result[16]);

        /* When the first pass succeeds we gain information about
         * what larger transfer lengths might work. */
        if (pass == 1) {
            if (BLIST_INQUIRY_36 & *bflags)
                next_inquiry_len = 36;
            else if (BLIST_INQUIRY_58 & *bflags)
                next_inquiry_len = 58;
            else if (sdev->inquiry_len)
                next_inquiry_len = sdev->inquiry_len;
            else
                next_inquiry_len = response_len;

            /* If more data is available perform the second pass */
            if (next_inquiry_len > try_inquiry_len) {
                try_inquiry_len = next_inquiry_len;
                pass = 2;
                goto next_pass;
            }
        }

    } else if (pass == 2) {
        sdev_printk(KERN_INFO, sdev,
                "scsi scan: %d byte inquiry failed.  "
                "Consider BLIST_INQUIRY_36 for this device\n",
                try_inquiry_len);

        /* If this pass failed, the third pass goes back and transfers
         * the same amount as we successfully got in the first pass. */
        try_inquiry_len = first_inquiry_len;
        pass = 3;
        goto next_pass;
    }

    /* If the last transfer attempt got an error, assume the
     * peripheral doesn't exist or is dead. */
    if (result)
        return -EIO;

    /* Don't report any more data than the device says is valid */
    sdev->inquiry_len = min(try_inquiry_len, response_len);

    /*
     * XXX Abort if the response length is less than 36? If less than
     * 32, the lookup of the device flags (above) could be invalid,
     * and it would be possible to take an incorrect action - we do
     * not want to hang because of a short INQUIRY. On the flip side,
     * if the device is spun down or becoming ready (and so it gives a
     * short INQUIRY), an abort here prevents any further use of the
     * device, including spin up.
     *
     * On the whole, the best approach seems to be to assume the first
     * 36 bytes are valid no matter what the device says.  That's
     * better than copying < 36 bytes to the inquiry-result buffer
     * and displaying garbage for the Vendor, Product, or Revision
     * strings.
     */
    if (sdev->inquiry_len < 36) {
        if (!sdev->host->short_inquiry) {
            shost_printk(KERN_INFO, sdev->host,
                    "scsi scan: INQUIRY result too short (%d),"
                    " using 36\n", sdev->inquiry_len);
            sdev->host->short_inquiry = 1;
        }
        sdev->inquiry_len = 36;
    }

    /*
     * Related to the above issue:
     *
     * XXX Devices (disk or all?) should be sent a TEST UNIT READY,
     * and if not ready, sent a START_STOP to start (maybe spin up) and
     * then send the INQUIRY again, since the INQUIRY can change after
     * a device is initialized.
     *
     * Ideally, start a device if explicitly asked to do so.  This
     * assumes that a device is spun up on power on, spun down on
     * request, and then spun up on request.
     */

    /*
     * The scanning code needs to know the scsi_level, even if no
     * device is attached at LUN 0 (SCSI_SCAN_TARGET_PRESENT) so
     * non-zero LUNs can be scanned.
     */
    sdev->scsi_level = inq_result[2] & 0x07;
    if (sdev->scsi_level >= 2 ||
        (sdev->scsi_level == 1 && (inq_result[3] & 0x0f) == 1))
        sdev->scsi_level++;
    sdev->sdev_target->scsi_level = sdev->scsi_level;

    /*
     * If SCSI-2 or lower, and if the transport requires it,
     * store the LUN value in CDB[1].
     */
    sdev->lun_in_cdb = 0;
    if (sdev->scsi_level <= SCSI_2 &&
        sdev->scsi_level != SCSI_UNKNOWN &&
        !sdev->host->no_scsi2_lun_in_cdb)
        sdev->lun_in_cdb = 1;

    return 0;
}
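
Two small computations in scsi_probe_lun are worth isolating: how the real response length is derived from the Additional Length byte, and how the ANSI version field becomes the kernel's scsi_level. The userspace sketch below copies that arithmetic; the function names are hypothetical and the input is assumed to be at least 5 bytes of INQUIRY data.

```c
#include <stdint.h>

/*
 * Byte 4 (Additional Length) counts the bytes that follow it, so the full
 * response is that value plus 5. Implausible values fall back to the
 * length used on the first pass.
 */
int sketch_response_len(const uint8_t *inq, int first_inquiry_len)
{
    int len = inq[4] + 5;

    if (len > 255)
        len = first_inquiry_len;
    return len;
}

/*
 * The low three bits of byte 2 hold the ANSI version. The level is bumped
 * by one for version 2 and above, and for version 1 devices whose response
 * data format (low nibble of byte 3) is 1, i.e. SCSI-1 CCS.
 */
int sketch_scsi_level(const uint8_t *inq)
{
    int level = inq[2] & 0x07;

    if (level >= 2 || (level == 1 && (inq[3] & 0x0f) == 1))
        level++;
    return level;
}
```

The bump exists so that SCSI-1 CCS gets its own slot between SCSI-1 and SCSI-2 in the internal numbering.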


static int scsi_add_lun(struct scsi_device *sdev, unsigned char *inq_result,
        int *bflags, int async)
{
    int ret;

    /*
     * XXX do not save the inquiry, since it can change underneath us,
     * save just vendor/model/rev.
     *
     * Rather than save it and have an ioctl that retrieves the saved
     * value, have an ioctl that executes the same INQUIRY code used
     * in scsi_probe_lun, let user level programs doing INQUIRY
     * scanning run at their own risk, or supply a user level program
     * that can correctly scan.
     */

    /*
     * Copy at least 36 bytes of INQUIRY data, so that we don't
     * dereference unallocated memory when accessing the Vendor,
     * Product, and Revision strings.  Badly behaved devices may set
     * the INQUIRY Additional Length byte to a small value, indicating
     * these strings are invalid, but often they contain plausible data
     * nonetheless.  It doesn't matter if the device sent < 36 bytes
     * total, since scsi_probe_lun() initializes inq_result with 0s.
     */
    sdev->inquiry = kmemdup(inq_result,
                max_t(size_t, sdev->inquiry_len, 36),
                GFP_ATOMIC);
    if (sdev->inquiry == NULL)
        return SCSI_SCAN_NO_RESPONSE;

    sdev->vendor = (char *) (sdev->inquiry + 8); // bytes 8 to 15: vendor identification
    sdev->model = (char *) (sdev->inquiry + 16); // bytes 16 to 31: product identification
    sdev->rev = (char *) (sdev->inquiry + 32);   // bytes 32 to 35: product revision level

    if (strncmp(sdev->vendor, "ATA     ", 8) == 0) {
        /*
         * sata emulation layer device.  This is a hack to work around
         * the SATL power management specifications which state that
         * when the SATL detects the device has gone into standby
         * mode, it shall respond with NOT READY.
         */
        sdev->allow_restart = 1;
    }

    if (*bflags & BLIST_ISROM) {
        sdev->type = TYPE_ROM;
        sdev->removable = 1;
    } else {
        sdev->type = (inq_result[0] & 0x1f);
        sdev->removable = (inq_result[1] & 0x80) >> 7;

        /*
         * some devices may respond with wrong type for
         * well-known logical units. Force well-known type
         * to enumerate them correctly.
         */
        if (scsi_is_wlun(sdev->lun) && sdev->type != TYPE_WLUN) {
            sdev_printk(KERN_WARNING, sdev,
                "%s: correcting incorrect peripheral device type 0x%x for W-LUN 0x%16xhN\n",
                __func__, sdev->type, (unsigned int)sdev->lun);
            sdev->type = TYPE_WLUN;
        }

    }

    if (sdev->type == TYPE_RBC || sdev->type == TYPE_ROM) {
        /* RBC and MMC devices can return SCSI-3 compliance and yet
         * still not support REPORT LUNS, so make them act as
         * BLIST_NOREPORTLUN unless BLIST_REPORTLUN2 is
         * specifically set */
        if ((*bflags & BLIST_REPORTLUN2) == 0)
            *bflags |= BLIST_NOREPORTLUN;
    }

    /*
     * For a peripheral qualifier (PQ) value of 1 (001b), the SCSI
     * spec says: The device server is capable of supporting the
     * specified peripheral device type on this logical unit. However,
     * the physical device is not currently connected to this logical
     * unit.
     *
     * The above is vague, as it implies that we could treat 001 and
     * 011 the same. Stay compatible with previous code, and create a
     * scsi_device for a PQ of 1
     *
     * Don't set the device offline here; rather let the upper
     * level drivers eval the PQ to decide whether they should
     * attach. So remove ((inq_result[0] >> 5) & 7) == 1 check.
     */

    sdev->inq_periph_qual = (inq_result[0] >> 5) & 7;
    sdev->lockable = sdev->removable;
    sdev->soft_reset = (inq_result[7] & 1) && ((inq_result[3] & 7) == 2);

    if (sdev->scsi_level >= SCSI_3 ||
            (sdev->inquiry_len > 56 && inq_result[56] & 0x04))
        sdev->ppr = 1;
    if (inq_result[7] & 0x60)
        sdev->wdtr = 1;
    if (inq_result[7] & 0x10)
        sdev->sdtr = 1;

    sdev_printk(KERN_NOTICE, sdev, "%s %.8s %.16s %.4s PQ: %d "
            "ANSI: %d%s\n", scsi_device_type(sdev->type),
            sdev->vendor, sdev->model, sdev->rev,
            sdev->inq_periph_qual, inq_result[2] & 0x07,
            (inq_result[3] & 0x0f) == 1 ? " CCS" : "");

    if ((sdev->scsi_level >= SCSI_2) && (inq_result[7] & 2) &&
        !(*bflags & BLIST_NOTQ)) {
        sdev->tagged_supported = 1;
        sdev->simple_tags = 1;
    }

    /*
     * Some devices (Texel CD ROM drives) have handshaking problems
     * when used with the Seagate controllers. borken is initialized
     * to 1, and then set it to 0 here.
     */
    if ((*bflags & BLIST_BORKEN) == 0)
        sdev->borken = 0;

    if (*bflags & BLIST_NO_ULD_ATTACH)
        sdev->no_uld_attach = 1;

    /*
     * Apparently some really broken devices (contrary to the SCSI
     * standards) need to be selected without asserting ATN
     */
    if (*bflags & BLIST_SELECT_NO_ATN)
        sdev->select_no_atn = 1;

    /*
     * Maximum 512 sector transfer length
     * broken RA4x00 Compaq Disk Array
     */
    if (*bflags & BLIST_MAX_512)
        blk_queue_max_hw_sectors(sdev->request_queue, 512);
    /*
     * Max 1024 sector transfer length for targets that report incorrect
     * max/optimal lengths and relied on the old block layer safe default
     */
    else if (*bflags & BLIST_MAX_1024)
        blk_queue_max_hw_sectors(sdev->request_queue, 1024);

    /*
     * Some devices may not want to have a start command automatically
     * issued when a device is added.
     */
    if (*bflags & BLIST_NOSTARTONADD)
        sdev->no_start_on_add = 1;

    if (*bflags & BLIST_SINGLELUN)
        scsi_target(sdev)->single_lun = 1;

    sdev->use_10_for_rw = 1;

    if (*bflags & BLIST_MS_SKIP_PAGE_08)
        sdev->skip_ms_page_8 = 1;

    if (*bflags & BLIST_MS_SKIP_PAGE_3F)
        sdev->skip_ms_page_3f = 1;

    if (*bflags & BLIST_USE_10_BYTE_MS)
        sdev->use_10_for_ms = 1;

    /* some devices don't like REPORT SUPPORTED OPERATION CODES
     * and will simply timeout causing sd_mod init to take a very
     * very long time */
    if (*bflags & BLIST_NO_RSOC)
        sdev->no_report_opcodes = 1;

    /* set the device running here so that slave configure
     * may do I/O */
    ret = scsi_device_set_state(sdev, SDEV_RUNNING); // move the device to the running state
    if (ret) {
        ret = scsi_device_set_state(sdev, SDEV_BLOCK);

        if (ret) {
            sdev_printk(KERN_ERR, sdev,
                    "in wrong state %s to complete scan\n",
                    scsi_device_state_name(sdev->sdev_state));
            return SCSI_SCAN_NO_RESPONSE;
        }
    }

    if (*bflags & BLIST_MS_192_BYTES_FOR_3F)
        sdev->use_192_bytes_for_3f = 1;

    if (*bflags & BLIST_NOT_LOCKABLE)
        sdev->lockable = 0;

    if (*bflags & BLIST_RETRY_HWERROR)
        sdev->retry_hwerror = 1;

    if (*bflags & BLIST_NO_DIF)
        sdev->no_dif = 1;

    sdev->eh_timeout = SCSI_DEFAULT_EH_TIMEOUT;

    if (*bflags & BLIST_TRY_VPD_PAGES)
        sdev->try_vpd_pages = 1;
    else if (*bflags & BLIST_SKIP_VPD_PAGES)
        sdev->skip_vpd_pages = 1;

    transport_configure_device(&sdev->sdev_gendev); // configure the LUN in the SCSI transport layer

    if (sdev->host->hostt->slave_configure) {
        ret = sdev->host->hostt->slave_configure(sdev); // callback set in the host adapter template: driver-specific initialization of the scsi_device (LUN)
        if (ret) {
            /*
             * if LLDD reports slave not present, don't clutter
             * console with alloc failure messages
             */
            if (ret != -ENXIO) {
                sdev_printk(KERN_ERR, sdev,
                    "failed to configure device\n");
            }
            return SCSI_SCAN_NO_RESPONSE;
        }
    }

    if (sdev->scsi_level >= SCSI_3)
        scsi_attach_vpd(sdev);

    sdev->max_queue_depth = sdev->queue_depth;  // set the maximum queue depth

    /*
     * Ok, the device is now all set up, we can
     * register it and tell the rest of the kernel
     * about it.
     */ // add the scsi_device (LUN) to sysfs
    if (!async && scsi_sysfs_add_sdev(sdev) != 0)
        return SCSI_SCAN_NO_RESPONSE;

    return SCSI_SCAN_LUN_PRESENT;
}
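
scsi_add_lun keeps vendor, model and rev as pointers into the saved INQUIRY buffer, so the strings are fixed-width, space-padded and not NUL-terminated. A userspace sketch of pulling them out as ordinary C strings follows; the structure and function names are hypothetical, and trimming the trailing spaces is a convenience of the sketch rather than something the kernel code above does.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the three identification strings. */
struct sketch_inq_ids {
    char vendor[9];   /* 8 characters + NUL */
    char model[17];   /* 16 characters + NUL */
    char rev[5];      /* 4 characters + NUL */
};

/* Copy n bytes, terminate, then strip trailing spaces. */
static void sketch_copy_trimmed(char *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
    dst[n] = '\0';
    while (n > 0 && dst[n - 1] == ' ')
        dst[--n] = '\0';
}

/* inq must hold at least the first 36 bytes of standard INQUIRY data. */
void sketch_parse_inquiry_ids(const uint8_t *inq, struct sketch_inq_ids *out)
{
    sketch_copy_trimmed(out->vendor, inq + 8, 8);    /* bytes 8..15 */
    sketch_copy_trimmed(out->model, inq + 16, 16);   /* bytes 16..31 */
    sketch_copy_trimmed(out->rev, inq + 32, 4);      /* bytes 32..35 */
}
```

This is also why the kernel's own log line prints the fields with bounded formats such as %.8s, %.16s and %.4s.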
