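Preface: these are notes I jotted down while reading the code. If you spot mistakes, corrections are very welcome.
The blktap2 README has this passage describing how blktap2 works: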
Working in conjunction with the kernel blktap2 driver, all disk I/O
requests from VMs are passed to the userspace deamon (using a shared
memory interface) through a character device. Each active disk is
mapped to an individual device node, allowing per-disk processes to
implement individual block devices where desired. The userspace
drivers are implemented using asynchronous (Linux libaio),
O_DIRECT-based calls to preserve the unbuffered, batched and
asynchronous request dispatch achieved with the existing blkback
code. We provide a simple, asynchronous virtual disk interface that
makes it quite easy to add new disk implementations.
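After an IO request travels from blkfront to blkback, the blktap2 driver hands it over to the address space of the user-space tapdisk2 process. The "shared memory interface ... through a character device" in the README refers, under the original blktap driver, to the /dev/xen/blktapX char devices with major 247. With the blktap2 driver (Xen 4.0 and later), these devices are /dev/xen/blktap-2/blktapX. Information about them can also be found under /sys/class/blktap2/blktapX/. There is also a control device, /dev/xen/blktap-2/control.
tapdisk2 is the per-disk process. It supports different backend storage formats such as img, vhd, qcow and vmdk, as well as distributed storage systems such as dynamo. On top of that, tapdisk2 can issue IO through different syscalls, e.g. pread/pwrite or libaio. Personally I am a firm supporter of libaio + O_DIRECT.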
* tap_disk aims to provide a generic interface to easily implement new
* types of image accessors. The structure-of-function-calls is similar
* to disk interfaces used in qemu/denali/etc, with the significant
* difference being the expectation of asynchronous rather than synchronous
* I/O. The asynchronous interface is intended to allow lots of requests to
* be pipelined through a disk, without the disk requiring any of its own
* threads of control. As such, a batch of requests is delivered to the disk
* using:
*
* td_queue_[read,write]()
*
* and passing in a completion callback, which the disk is responsible for
* tracking. Disks should transform these requests as necessary and return
* the resulting iocbs to tapdisk using td_prep_[read,write]() and
* td_queue_tiocb().
*
* NOTE: tapdisk uses the number of sectors submitted per request as a
* ref count. Plugins must use the callback function to communicate the
* completion -- or error -- of every sector submitted to them.
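In the xen4.0.1 source tree, almost everything under xen/tools/blktap2/ is user-space code, i.e. the implementation of tapdisk2:
block-xxx.c: the concrete tapdisk2 implementations for disks in different data formats. struct tap_disk is somewhat like the kernel's struct device: it defines the methods of a tapdisk through function pointers, e.g.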
struct tap_disk {
        const char *disk_type;
        td_flag_t flags;
        int private_data_size;
        int (*td_open) (td_driver_t *, const char *, td_flag_t);
        int (*td_close) (td_driver_t *);
        int (*td_get_parent_id) (td_driver_t *, td_disk_id_t *);
        int (*td_validate_parent) (td_driver_t *, td_driver_t *, td_flag_t);
        void (*td_queue_read) (td_driver_t *, td_request_t);
        void (*td_queue_write) (td_driver_t *, td_request_t);
        void (*td_debug) (td_driver_t *);
};
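To make the function-pointer idea concrete, here is a minimal sketch of the same pattern. Everything prefixed with demo_ is made up for illustration (simplified stand-in types, a do-nothing "demo_null" driver, a lookup helper); it is not the real tapdisk code, only the shape of it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for td_driver_t. */
typedef struct demo_driver {
        const char *name;       /* image name passed to open */
        int         is_open;
} demo_driver_t;

/* Hypothetical, trimmed-down analogue of struct tap_disk. */
struct demo_disk_ops {
        const char *disk_type;
        int (*open)(demo_driver_t *, const char *);
        int (*close)(demo_driver_t *);
};

static int demo_null_open(demo_driver_t *drv, const char *name)
{
        if (name == NULL || name[0] == '\0')
                return -1;
        drv->name = name;
        drv->is_open = 1;
        return 0;
}

static int demo_null_close(demo_driver_t *drv)
{
        drv->is_open = 0;
        return 0;
}

/* One "format" fills in the table with its own functions. */
static const struct demo_disk_ops demo_null_disk = {
        .disk_type = "demo_null",
        .open      = demo_null_open,
        .close     = demo_null_close,
};

static const struct demo_disk_ops *demo_ops_table[] = { &demo_null_disk };

/* Look up the ops table by type name; callers never name the format's functions directly. */
static const struct demo_disk_ops *demo_find_ops(const char *type)
{
        size_t i;

        for (i = 0; i < sizeof(demo_ops_table) / sizeof(demo_ops_table[0]); i++)
                if (strcmp(demo_ops_table[i]->disk_type, type) == 0)
                        return demo_ops_table[i];
        return NULL;
}
```

A caller would do something like demo_find_ops("demo_null")->open(&drv, "disk.img"); adding a new format then only means adding one more table entry.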
Each concrete disk format has a definition like the following, e.g.
static disk_info_t vhd_disk = {
        DISK_TYPE_VHD,
        "virtual server image (vhd)",
        "vhd",
        0,
#ifdef TAPDISK
        &tapdisk_vhd,
#endif
};
blktaplib.h defines the control data structures shared between the blktap2 driver and tapdisk2:
#define MAX_REQUESTS BLK_RING_SIZE /* BLK_RING_SIZE is 32 */
#define MAX_PENDING_REQS BLK_RING_SIZE
#define BLKTAP_RING_PAGES 1 /* Front */
#define BLKTAP_MMAP_REGION_SIZE (BLKTAP_RING_PAGES + MMAP_PAGES)
I am not yet sure what the BLKTAP_MMAP_REGION is for. Most likely it is the shared memory for IO requests, i.e. the tapdisk2 process exchanges IO requests with blktap2 through this memory.
#define MMAP_PAGES \
(MAX_PENDING_REQS * BLKIF_MAX_SEGMENTS_PER_REQUEST)
32 * 11 = 352 pages, which should be plenty.
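The arithmetic can be spelled out as a small sketch. The 32 and 11 come from the notes above; the 4 KiB page size is my assumption (typical for x86), and all the DEMO_/demo_ names are made up for illustration.

```c
/* Values from the notes above; DEMO_PAGE_SIZE is an assumption (4 KiB pages). */
#define DEMO_BLK_RING_SIZE              32u
#define DEMO_MAX_SEGMENTS_PER_REQUEST   11u
#define DEMO_RING_PAGES                 1u
#define DEMO_PAGE_SIZE                  4096ul

/* Data pages: one page per segment of every pending request. */
static unsigned demo_mmap_pages(void)
{
        return DEMO_BLK_RING_SIZE * DEMO_MAX_SEGMENTS_PER_REQUEST;
}

/* Whole region: the ring page plus the data pages. */
static unsigned demo_region_pages(void)
{
        return DEMO_RING_PAGES + demo_mmap_pages();
}

/* Region size in bytes under the assumed page size. */
static unsigned long demo_region_bytes(void)
{
        return demo_region_pages() * DEMO_PAGE_SIZE;
}
```

Under these assumptions the region is 353 pages in total, with the 352 data pages alone covering 1408 KiB of in-flight IO per disk.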
blktap2/tapdisk.h defines the main data structures of the tapdisk2 process:
struct td_request {
        int op;
        char *buf;
        td_sector_t sec;
        int secs;
        uint8_t blocked; /* blocked on a dependency */
        td_image_t *image;
        void * /*td_callback_t*/ cb;
        void *cb_data;
        uint64_t id;
        int sidx;
        void *private;
#ifdef MEMSHR
        uint64_t memshr_hnd;
#endif
};
/*
* Prototype of the callback to activate as requests complete.
*/
typedef void (*td_callback_t)(td_request_t, int);
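The header comment quoted earlier says the number of sectors per request acts as a ref count and that every sector must be reported through the callback. Here is a rough sketch of how I read that. The demo_ types are simplified stand-ins of my own, and the "plugin" that completes a request in two halves is purely hypothetical.

```c
#include <stdint.h>

typedef struct demo_request demo_request_t;
typedef void (*demo_callback_t)(demo_request_t, int);

/* Hypothetical, trimmed-down analogue of struct td_request. */
struct demo_request {
        uint64_t        sec;      /* first sector */
        int             secs;     /* number of sectors */
        demo_callback_t cb;       /* completion callback */
        void           *cb_data;  /* opaque pointer handed back via the request */
};

/* Submitter-side bookkeeping: sectors still outstanding, first error seen. */
struct demo_tracker {
        int secs_pending;
        int error;
};

/* Completion callback: every completed sector drops the count by one. */
static void demo_complete(demo_request_t req, int err)
{
        struct demo_tracker *t = req.cb_data;

        t->secs_pending -= req.secs;
        if (err && !t->error)
                t->error = err;
}

/* Hypothetical plugin: splits the request and completes each half separately. */
static void demo_queue_read(demo_request_t req)
{
        demo_request_t first = req, second = req;

        first.secs  = req.secs / 2;
        second.sec  = req.sec + (uint64_t)first.secs;
        second.secs = req.secs - first.secs;

        if (first.secs > 0)
                req.cb(first, 0);
        req.cb(second, 0);
}
```

The submitter sets secs_pending to the request's secs before queueing; the request only counts as finished once the callbacks have brought it back to zero, no matter how the plugin chopped it up.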
Finally, the implementation of tapdisk2.
Start from the main function in tapdisk2.c:
switch (fork()) {
case -1:
        printf("fork failed: %d\n", errno);
        return errno;
case 0:
        return tapdisk2_create_device(params);
default:
        return tapdisk2_wait_for_device();
}
The forked child calls tapdisk2_create_device(params), where params looks like tap2:vhd:xxxxx. The parent calls tapdisk2_wait_for_device. Parent and child talk over a pipe (the "channel"): the parent checks that the child came up correctly and then exits, so every tapdisk2 child process ends up being adopted by init.
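The parent/child handshake over a pipe can be sketched roughly like this. It is my own simplified illustration, not the tapdisk2 code: demo_spawn_and_wait is a made-up name, the child exits right after reporting (the real child keeps running as the daemon), and the parent reaps it with waitpid only so the sketch leaves no zombie behind.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that reports device_name over a pipe; the parent reads
 * the report into out. Returns 0 if something was received, -1 otherwise.
 */
static int demo_spawn_and_wait(const char *device_name, char *out, size_t outlen)
{
        int fds[2];
        pid_t pid;
        ssize_t n;
        size_t used = 0;

        if (outlen == 0 || pipe(fds) == -1)
                return -1;

        pid = fork();
        if (pid == -1) {
                close(fds[0]);
                close(fds[1]);
                return -1;
        }

        if (pid == 0) {
                /* Child: only needs the write end of the pipe. */
                close(fds[0]);
                if (write(fds[1], device_name, strlen(device_name)) < 0)
                        _exit(1);
                close(fds[1]);
                _exit(0);
        }

        /* Parent: close the write end so read() sees EOF once the child is done. */
        close(fds[1]);
        while (used < outlen - 1 &&
               (n = read(fds[0], out + used, outlen - 1 - used)) > 0)
                used += (size_t)n;
        out[used] = '\0';
        close(fds[0]);

        waitpid(pid, NULL, 0);
        return used > 0 ? 0 : -1;
}
```

If the child dies before writing anything, the parent sees EOF with nothing read and returns -1, which is roughly the "did the child start correctly" check.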
static int
tapdisk2_create_device(const char *params)
{
        char *path;
        int err, type;

        chdir("/");
        tapdisk_start_logging("tapdisk2");

        err = tapdisk2_set_child_fds();
        if (err)
                goto out;

        err = tapdisk2_check_environment();
        if (err)
                goto out;

        err = tapdisk_parse_disk_type(params, &path, &type);
        if (err)
                goto out;

        err = tapdisk2_prepare_device();
        if (err)
                goto out;

        err = tapdisk_server_initialize(NULL, NULL);
        if (err)
                goto fail;

        err = tapdisk2_open_device(type, path, params);
        if (err)
                goto fail;

        cprintf(0, "%s%d\n", BLKTAP2_IO_DEVICE, handle.minor);
        close(STDOUT_FILENO);

        err = tapdisk_server_run();
        if (err)
                goto fail;

        err = 0;

out:
        tapdisk_stop_logging();
        return err;

fail:
        tapdisk2_free_device();
        goto out;
}
tapdisk_start_logging("tapdisk2") sets up logging for tapdisk2; the entries end up in /var/log/messages, e.g.
Aug 11 14:28:33 r02b08013 tapdisk2[28960]: Created /dev/xen/blktap-2/blktap24 device
The tag has the form tapdisk2[pid], where pid is the process ID of the tapdisk2 process.
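One way to get a name[pid] tag like this is to build the string yourself and pass it to syslog as the ident. The sketch below assumes that approach; I have not copied it from the tapdisk2 source, and the demo_ function names are made up. openlog, closelog and the LOG_* constants are the standard syslog(3) API.

```c
#include <stdio.h>
#include <syslog.h>
#include <unistd.h>

/* openlog() keeps the ident pointer, so the buffer has to stay alive. */
static char demo_ident[128];

/* Build "name[pid]", e.g. "tapdisk2[28960]". */
static const char *demo_make_ident(const char *name, long pid)
{
        snprintf(demo_ident, sizeof(demo_ident), "%s[%ld]", name, pid);
        return demo_ident;
}

/* Open the syslog connection with the tag built above. */
static void demo_start_logging(const char *name)
{
        openlog(demo_make_ident(name, (long)getpid()),
                LOG_CONS | LOG_ODELAY, LOG_DAEMON);
}

static void demo_stop_logging(void)
{
        closelog();
}
```

After demo_start_logging("tapdisk2"), a call such as syslog(LOG_INFO, "Created %s device", path) would be tagged with the process name and pid; where the line lands depends on the local syslog configuration.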
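tapdisk2_set_child_fds sets up the write end of the channel pipe mentioned above, which is used to send the response to the parent.
tapdisk2_check_environment first checks whether the character device /dev/xen/blktap-2/control exists. If it does, there is nothing left to do and the function returns. If it does not, the function opens /proc/misc, looks up blktap-control and reads its minor number, which is usually 55. It then calls mknod (the glibc wrapper) to create the character device /dev/xen/blktap-2/control with major 10 and that minor, i.e. (10, 55).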
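The /proc/misc lookup can be sketched as below. This is my own illustration rather than the tapdisk2 code: demo_find_misc_minor is a made-up name, and it takes a FILE * so it can be tried on any text in the "minor name" per-line layout that /proc/misc uses.

```c
#include <stdio.h>
#include <string.h>

/*
 * Scan lines of the form "<minor> <name>" and return the minor number
 * registered under the given name, or -1 if the name is not listed.
 */
static int demo_find_misc_minor(FILE *fp, const char *name)
{
        char line[128];
        char entry[64];
        int minor;

        while (fgets(line, sizeof(line), fp) != NULL) {
                if (sscanf(line, "%d %63s", &minor, entry) == 2 &&
                    strcmp(entry, name) == 0)
                        return minor;
        }
        return -1;
}
```

With fp = fopen("/proc/misc", "r") and name "blktap-control", the returned minor would then go into something like mknod(path, S_IFCHR | 0600, makedev(10, minor)); the 0600 mode is my guess, not taken from the source.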
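tapdisk_parse_disk_type parses out the disk type, e.g. aio, vhd, qcow.
tapdisk2_prepare_device creates the devices. It first calls the ioctl interface offered by the control device (the driver behind /dev/xen/blktap-2/control deserves a careful read): ioctl(fd, BLKTAP2_IOCTL_ALLOC_TAP, &handle). It then creates the /dev/xen/blktap-2/blktapX device (just like tapdisk2_check_environment above, this goes through tapdisk2_make_device, which eventually calls mknod) and the /dev/xen/blktap-2/tapdevX block device. As far as I understand it so far, tapdevX is the actual backend block device, while blktapX carries the request ring, which is why it is a character device.
tapdisk_server_initialize initializes the tapdisk2 process. The tapdisk_server methods will be covered in detail later.
tapdisk2_open_device creates a vbd from params and path, e.g. from some vhd file. After that comes a series of tapdisk_vbd operations; more on those another time.
Last comes tapdisk_server_run. Like other daemons of this kind, the tapdisk2 process blocks in select waiting for IO requests, then calls the registered handler for each fd that has an event pending. I won't go into detail here; it is basically the same style as web servers such as apache and nginx.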
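For readers who have not seen this style of loop, here is a bare-bones sketch of one iteration: select on the registered fds, then run the callback of each readable one. It is a generic illustration with made-up demo_ names, not the tapdisk_server code, and it only handles read events.

```c
#include <sys/select.h>
#include <unistd.h>

typedef void (*demo_event_cb)(int fd, void *arg);

/* A registered event: which fd to watch and what to call when it is readable. */
struct demo_event {
        int           fd;
        demo_event_cb cb;
        void         *arg;
};

/*
 * One loop iteration: block in select() for up to timeout_ms, then call the
 * handler of every readable fd. Returns the number of handlers run,
 * 0 on timeout, or -1 on error.
 */
static int demo_server_iterate(struct demo_event *events, int count, int timeout_ms)
{
        fd_set readfds;
        struct timeval tv;
        int i, maxfd = -1, ready, handled = 0;

        FD_ZERO(&readfds);
        for (i = 0; i < count; i++) {
                FD_SET(events[i].fd, &readfds);
                if (events[i].fd > maxfd)
                        maxfd = events[i].fd;
        }

        tv.tv_sec  = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;

        ready = select(maxfd + 1, &readfds, NULL, NULL, &tv);
        if (ready <= 0)
                return ready;

        for (i = 0; i < count; i++) {
                if (FD_ISSET(events[i].fd, &readfds)) {
                        events[i].cb(events[i].fd, events[i].arg);
                        handled++;
                }
        }
        return handled;
}

/* Example handler: drain what is available and add the byte count to *arg. */
static void demo_count_bytes(int fd, void *arg)
{
        char buf[64];
        ssize_t n = read(fd, buf, sizeof(buf));

        if (n > 0)
                *(int *)arg += (int)n;
}
```

A server would simply call demo_server_iterate in a loop until asked to stop; the interesting work all lives in the registered handlers.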
More links about blktap:
http://wiki.xen.org/xenwiki/blktap
http://wiki.xen.org/xenwiki/blktap2