Xen block device architecture (1)

Preface: these are notes I jotted down while reading the code; if you spot any mistakes, corrections are very welcome.


blktap2

The blktap2 README contains the following passage, which describes how blktap2 works:

Working in conjunction with the kernel blktap2 driver, all disk I/O
requests from VMs are passed to the userspace deamon (using a shared
memory interface) through a character device. Each active disk is
mapped to an individual device node, allowing per-disk processes to
implement individual block devices where desired.  The userspace
drivers are implemented using asynchronous (Linux libaio),
O_DIRECT-based calls to preserve the unbuffered, batched and
asynchronous request dispatch achieved with the existing blkback
code.  We provide a simple, asynchronous virtual disk interface that
makes it quite easy to add new disk implementations.

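After an I/O request travels from blkfront to blkback, the blktap2 driver hands it over to the address space of the user-space tapdisk2 process. The phrase "using a shared memory interface ... through a character device" refers, under the original blktap driver, to the /dev/xen/blktapX character devices with major number 247. With the blktap2 driver in Xen 4.0 and later, these devices are /dev/xen/blktap-2/blktapX. Information about them can also be read from /sys/class/blktap2/blktapX/. There is also a control device, /dev/xen/blktap-2/control.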

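tapdisk2 is the per-disk process. It supports different backend storage formats, such as img, vhd, qcow and vmdk, as well as distributed storage systems such as dynamo. In addition, tapdisk2 can issue I/O through different syscalls, for example pread/pwrite or libaio. I am a firm supporter of the libaio + O_DIRECT approach.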

 * tap_disk aims to provide a generic interface to easily implement new 
 * types of image accessors.  The structure-of-function-calls is similar
 * to disk interfaces used in qemu/denali/etc, with the significant 
 * difference being the expectation of asynchronous rather than synchronous 
 * I/O.  The asynchronous interface is intended to allow lots of requests to
 * be pipelined through a disk, without the disk requiring any of its own
 * threads of control.  As such, a batch of requests is delivered to the disk
 * using:
 * 
 *    td_queue_[read,write]()
 * 
 * and passing in a completion callback, which the disk is responsible for 
 * tracking.  Disks should transform these requests as necessary and return
 * the resulting iocbs to tapdisk using td_prep_[read,write]() and 
 * td_queue_tiocb().
 *
 * NOTE: tapdisk uses the number of sectors submitted per request as a 
 * ref count.  Plugins must use the callback function to communicate the
 * completion -- or error -- of every sector submitted to them.
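
The NOTE above is easy to skim past, so here is a minimal sketch of what "sectors as a ref count" could look like. The toy_request type and both functions are hypothetical names made up for illustration; they are not the real td_request_t or tapdisk code.

```c
#include <assert.h>

/* Hypothetical, simplified request; NOT the real td_request_t. */
struct toy_request {
    int secs_pending;   /* sectors still outstanding (the "ref count") */
    int error;          /* first error reported by any piece, 0 if none */
    int done;           /* set once every sector has been accounted for */
};

/* Start tracking a request that covers `secs` sectors. */
void toy_request_init(struct toy_request *req, int secs)
{
    assert(req != 0 && secs > 0);
    req->secs_pending = secs;
    req->error = 0;
    req->done = 0;
}

/*
 * Completion callback: a plugin reports that `secs` sectors finished
 * with result `err` (0 on success). The request only completes when
 * the outstanding sector count drops to zero.
 */
void toy_complete(struct toy_request *req, int secs, int err)
{
    assert(secs > 0 && secs <= req->secs_pending);
    if (err && !req->error)
        req->error = err;           /* remember the first failure */
    req->secs_pending -= secs;
    if (req->secs_pending == 0)
        req->done = 1;
}
```

A plugin that splits an 8-sector request into pieces of 3 and 5 sectors would call toy_complete twice, and the request is finished only after the second call. This is why every sector, including failed ones, has to be reported back.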


In the Xen 4.0.1 source tree, almost everything under xen/tools/blktap2/ is user-space code, i.e. the implementation of tapdisk2:

block-xxx.c: tapdisk2's concrete implementations for disks of different data formats. struct tap_disk is somewhat like the kernel's struct device: it defines the tapdisk methods through function pointers, e.g.

struct tap_disk {
    const char                  *disk_type;
    td_flag_t                    flags;
    int                          private_data_size;
    int (*td_open)               (td_driver_t *, const char *, td_flag_t);
    int (*td_close)              (td_driver_t *);
    int (*td_get_parent_id)      (td_driver_t *, td_disk_id_t *);
    int (*td_validate_parent)    (td_driver_t *, td_driver_t *, td_flag_t);
    void (*td_queue_read)        (td_driver_t *, td_request_t);
    void (*td_queue_write)       (td_driver_t *, td_request_t);
    void (*td_debug)             (td_driver_t *);
};

Each concrete disk format has a definition like the following, e.g.

static disk_info_t vhd_disk = {
       DISK_TYPE_VHD,
       "virtual server image (vhd)",
       "vhd",
       0,
#ifdef TAPDISK
       &tapdisk_vhd,
#endif
};
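
To make the function-pointer idea concrete, here is a small sketch of an ops table plus a registry that maps a short format name to it. Everything prefixed with toy_ is hypothetical and heavily cut down; the real struct tap_disk and disk_info_t have more fields and different signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down analogue of struct tap_disk. */
struct toy_disk_ops {
    const char *disk_type;
    int (*open)(const char *path);
    int (*close)(void);
};

static int toy_open_count;  /* how many toy disks are currently open */

static int toy_raw_open(const char *path)
{
    (void)path;             /* a real driver would open the image here */
    toy_open_count++;
    return 0;
}

static int toy_raw_close(void)
{
    toy_open_count--;
    return 0;
}

static const struct toy_disk_ops toy_raw = {
    "toy raw image", toy_raw_open, toy_raw_close
};

/* Hypothetical analogue of disk_info_t: short handle -> driver ops. */
struct toy_disk_info {
    const char *handle;
    const struct toy_disk_ops *ops;
};

static const struct toy_disk_info toy_disks[] = {
    { "raw", &toy_raw },
};

/* Look up a driver by its short handle; returns NULL if unknown. */
const struct toy_disk_ops *toy_find_driver(const char *handle)
{
    size_t i;

    assert(handle != NULL);
    for (i = 0; i < sizeof(toy_disks) / sizeof(toy_disks[0]); i++)
        if (strcmp(toy_disks[i].handle, handle) == 0)
            return toy_disks[i].ops;
    return NULL;
}
```

The caller never needs to know which format it is talking to: it looks up the ops once and then calls ops->open(...) and ops->close(). Adding a new format means adding one ops struct and one registry entry.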


blktaplib.h defines the data structures for the control information exchanged between the blktap2 driver and tapdisk2:

#define MAX_REQUESTS            BLK_RING_SIZE   /* BLK_RING_SIZE is 32 */

#define MAX_PENDING_REQS BLK_RING_SIZE

#define BLKTAP_RING_PAGES       1 /* Front */
#define BLKTAP_MMAP_REGION_SIZE (BLKTAP_RING_PAGES + MMAP_PAGES)

I am not yet sure what BLKTAP_MMAP_REGION is for; most likely it is the shared memory for I/O requests, through which the tapdisk2 process exchanges I/O requests with blktap2.

#define MMAP_PAGES                                                    \
    (MAX_PENDING_REQS * BLKIF_MAX_SEGMENTS_PER_REQUEST)

32 * 11 = 352 pages, which should be plenty.
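
The arithmetic can be spelled out as a small sketch. The TOY_ constants mirror the values quoted in this post. The 4 KiB page size is an assumption, and so is the layout in toy_page_index (ring page first, then each request's segments back to back); I have not verified that layout against the driver here.

```c
#include <assert.h>

#define TOY_BLK_RING_SIZE  32      /* value stated in this post */
#define TOY_MAX_SEGMENTS   11      /* BLKIF_MAX_SEGMENTS_PER_REQUEST */
#define TOY_RING_PAGES     1       /* BLKTAP_RING_PAGES */
#define TOY_PAGE_SIZE      4096UL  /* assumption: 4 KiB pages */

/* Data pages: one page per segment of every possible pending request. */
unsigned long toy_mmap_pages(void)
{
    return (unsigned long)TOY_BLK_RING_SIZE * TOY_MAX_SEGMENTS;
}

/* Whole region: the ring page(s) plus the data pages. */
unsigned long toy_region_pages(void)
{
    return TOY_RING_PAGES + toy_mmap_pages();
}

unsigned long toy_region_bytes(void)
{
    return toy_region_pages() * TOY_PAGE_SIZE;
}

/*
 * Assumed layout: page index inside the region of segment `seg`
 * belonging to request slot `req`.
 */
unsigned long toy_page_index(unsigned req, unsigned seg)
{
    assert(req < TOY_BLK_RING_SIZE && seg < TOY_MAX_SEGMENTS);
    return TOY_RING_PAGES + (unsigned long)req * TOY_MAX_SEGMENTS + seg;
}
```

Under these assumptions the region is 353 pages (one ring page plus 352 data pages), and the last segment of the last request slot lands on the final page of the region.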


blktap2/tapdisk.h defines the main data structures of the tapdisk2 process:

struct td_request {
    int                          op;
    char                        *buf;
    td_sector_t                  sec;
    int                          secs;
    uint8_t                      blocked; /* blocked on a dependency */
    td_image_t                  *image;
    void * /*td_callback_t*/     cb;
    void                        *cb_data;
    uint64_t                     id;
    int                          sidx;
    void                        *private;
#ifdef MEMSHR
    uint64_t                     memshr_hnd;
#endif
};

/* 
 * Prototype of the callback to activate as requests complete.
 */
typedef void (*td_callback_t)(td_request_t, int);


Finally, the implementation of tapdisk2.

We start from the main function in tapdisk2.c:

switch (fork()) {
case -1:
    printf("fork failed: %d\n", errno);
    return errno;
case 0:
    return tapdisk2_create_device(params);
default:
    return tapdisk2_wait_for_device();
}

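The forked child calls tapdisk2_create_device(params), where params is an argument of the form tap2:vhd:xxxxx. The parent calls tapdisk2_wait_for_device. Parent and child communicate over the pipe named channel: the parent checks that the child started correctly and then exits, so every tapdisk2 child process ends up reparented to init.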
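
The fork-plus-pipe handshake described above can be sketched roughly as follows. toy_spawn_and_wait is a hypothetical helper, not tapdisk2 code. One deliberate difference: the child here exits right after reporting so that the example terminates, whereas the real child keeps running as the daemon and the parent is the one that exits.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that writes `msg` into a pipe, and read it back in the
 * parent into `buf` (NUL-terminated). Returns the number of bytes
 * read, or -1 on failure.
 */
ssize_t toy_spawn_and_wait(const char *msg, char *buf, size_t len)
{
    int fds[2];
    pid_t pid;
    ssize_t n;

    assert(msg != NULL && buf != NULL && len > 0);
    if (pipe(fds) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: keep only the write end and report status. */
        close(fds[0]);
        if (write(fds[1], msg, strlen(msg)) < 0)
            _exit(1);
        close(fds[1]);
        _exit(0);   /* a real daemon would enter its main loop here */
    }

    /* Parent: keep only the read end and wait for the report. */
    close(fds[1]);
    n = read(fds[0], buf, len - 1);
    close(fds[0]);
    if (n >= 0)
        buf[n] = '\0';
    waitpid(pid, NULL, 0);  /* reap the short-lived toy child */
    return n;
}
```

If the child dies before writing anything, the parent's read returns 0 (end of file), which is how a parent can tell a failed start from a successful one.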

static int
tapdisk2_create_device(const char *params)
{
    char *path;
    int err, type;

    chdir("/");
    tapdisk_start_logging("tapdisk2");

    err = tapdisk2_set_child_fds();
    if (err)
        goto out;

    err = tapdisk2_check_environment();
    if (err)
        goto out;

    err = tapdisk_parse_disk_type(params, &path, &type);
    if (err)
        goto out;

    err = tapdisk2_prepare_device();
    if (err)
        goto out;

    err = tapdisk_server_initialize(NULL, NULL);
    if (err)
        goto fail;

    err = tapdisk2_open_device(type, path, params);
    if (err)
        goto fail;

    cprintf(0, "%s%d\n", BLKTAP2_IO_DEVICE, handle.minor);
    close(STDOUT_FILENO);

    err = tapdisk_server_run();
    if (err)
        goto fail;

    err = 0;

out:
    tapdisk_stop_logging();
    return err;

fail:
    tapdisk2_free_device();
    goto out;
}

tapdisk_start_logging("tapdisk2") sets up logging for tapdisk2; its messages end up in /var/log/messages, e.g.

Aug 11 14:28:33 r02b08013 tapdisk2[28960]: Created /dev/xen/blktap-2/blktap24 device

The tag has the form tapdisk2[pid], where pid is the process ID of the tapdisk2 process.

tapdisk2_set_child_fds sets up the write end of the channel pipe mentioned earlier, which is used to send the response to the parent process.

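tapdisk2_check_environment first checks whether the character device /dev/xen/blktap-2/control exists. If it does, there is nothing more to do and the function returns. If it does not, the function opens /proc/misc, looks up blktap-control to find its minor number (usually 55), and then calls mknod (the glibc wrapper) to create /dev/xen/blktap-2/control as a character device with major 10 and that minor, i.e. typically (10, 55).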
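
Here is a rough sketch of the /proc/misc lookup step. It assumes each line has the form "<minor> <name>". Both functions are hypothetical helpers written for this post, not the code in tapdisk2.c, and the mknod call itself is left out because it needs root.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one line in /proc/misc format ("<minor> <name>"). Returns the
 * minor number if the line names `name`, otherwise -1.
 */
int toy_match_misc_line(const char *line, const char *name)
{
    char found[64];
    int minor;

    assert(line != NULL && name != NULL);
    if (sscanf(line, "%d %63s", &minor, found) == 2 &&
        strcmp(found, name) == 0)
        return minor;
    return -1;
}

/*
 * Scan an already opened stream line by line and return the minor
 * number registered under `name`, or -1 if it is not listed.
 */
int toy_find_misc_minor(FILE *fp, const char *name)
{
    char line[128];
    int minor;

    assert(fp != NULL);
    while (fgets(line, sizeof(line), fp)) {
        minor = toy_match_misc_line(line, name);
        if (minor >= 0)
            return minor;
    }
    return -1;
}
```

In use, one would fopen("/proc/misc", "r"), pass the stream to toy_find_misc_minor together with "blktap-control", and feed the result to makedev(10, minor) before calling mknod.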

tapdisk_parse_disk_type parses out the disk type, such as aio, vhd or qcow.
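
As a rough illustration of this kind of parsing, the sketch below splits a "type:path" string. It assumes any tap2: prefix has already been stripped, and the type list is only illustrative. toy_parse_disk_type is a hypothetical stand-in; the real function's table and return conventions may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative list only; the real table lives in the blktap2 sources. */
static const char *const toy_types[] = { "aio", "vhd", "qcow" };

/*
 * Split "type:path" into an index into toy_types and a pointer to the
 * path part. Returns 0 on success, -1 if there is no colon or the
 * type name is unknown.
 */
int toy_parse_disk_type(const char *params, const char **path, int *type)
{
    const char *colon;
    size_t i, len;

    assert(params != NULL && path != NULL && type != NULL);
    colon = strchr(params, ':');
    if (colon == NULL)
        return -1;

    len = (size_t)(colon - params);
    for (i = 0; i < sizeof(toy_types) / sizeof(toy_types[0]); i++) {
        if (strlen(toy_types[i]) == len &&
            strncmp(params, toy_types[i], len) == 0) {
            *path = colon + 1;   /* points into params, not a copy */
            *type = (int)i;
            return 0;
        }
    }
    return -1;
}
```

Note that the returned path points into the original string, so the caller must keep params alive for as long as it uses the path.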

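tapdisk2_prepare_device creates the devices. It first calls the ioctl interface provided by the control device (the driver behind /dev/xen/blktap-2/control deserves a closer look), ioctl(fd, BLKTAP2_IOCTL_ALLOC_TAP, &handle), and then creates the /dev/xen/blktap-2/blktapX device and the /dev/xen/blktap-2/tapdevX block device. Like tapdisk2_check_environment above, it does this through tapdisk2_make_device, which ultimately calls mknod. As I currently understand it, tapdevX is the actual backend block device, while blktapX carries the request ring and is therefore a character device.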

tapdisk_server_initialize initializes the tapdisk2 process. The tapdisk_server functions will be covered in detail later.

tapdisk2_open_device creates a vbd device from params and path, e.g. a particular vhd file. A series of tapdisk_vbd operations follows; I will cover those in detail another time.

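Finally comes tapdisk_server_run. Like other daemons of this kind, the tapdisk2 process blocks in select waiting for I/O requests to arrive and then, for each fd that has an event pending, calls the handler registered for it. I won't go into more detail here; it is essentially the same pattern as web servers such as apache and nginx.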
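
For readers who have not seen this pattern before, here is a minimal sketch of one iteration of a select-based event loop. The toy_event struct and both functions are hypothetical; the real tapdisk_server keeps its own event list and also handles timeouts, which are omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

typedef void (*toy_handler_t)(int fd, void *arg);

/* One registered event source: an fd plus the callback to run on it. */
struct toy_event {
    int fd;
    toy_handler_t handler;
    void *arg;
};

/*
 * One loop iteration: block in select() until at least one registered
 * fd is readable, then call the handler of every ready fd. Returns
 * the number of handlers invoked, or -1 if select() failed.
 */
int toy_run_once(struct toy_event *events, int count)
{
    fd_set readfds;
    int i, maxfd = -1, handled = 0;

    assert(events != NULL && count > 0);
    FD_ZERO(&readfds);
    for (i = 0; i < count; i++) {
        FD_SET(events[i].fd, &readfds);
        if (events[i].fd > maxfd)
            maxfd = events[i].fd;
    }

    if (select(maxfd + 1, &readfds, NULL, NULL, NULL) < 0)
        return -1;

    for (i = 0; i < count; i++) {
        if (FD_ISSET(events[i].fd, &readfds)) {
            events[i].handler(events[i].fd, events[i].arg);
            handled++;
        }
    }
    return handled;
}

/* Example handler: read what is available, add the byte count to *arg. */
void toy_count_bytes(int fd, void *arg)
{
    char buf[64];
    ssize_t n = read(fd, buf, sizeof(buf));

    if (n > 0)
        *(int *)arg += (int)n;
}
```

A server would simply call toy_run_once in a loop until asked to stop. The fd set is rebuilt on every iteration because select() overwrites it with the set of ready descriptors.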


More links about blktap:

http://wiki.xen.org/xenwiki/blktap

http://wiki.xen.org/xenwiki/blktap2
