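Original post: http://godorz.info/
I recently wanted to study the Linux IO subsystem, so I picked up the flashcache source. Through the kernel's Device Mapper mechanism, flashcache combines an SSD and an ordinary disk into one virtual block device, with the SSD acting as a cache and the data ultimately landing on the disk. This hybrid-storage approach hides the underlying implementation from upper-layer applications such as mysql: all they see is a file system mounted on the virtual block device, which they read and write through the usual file system interfaces. That keeps compatibility while delivering decent performance. flashcache is only a few thousand lines of code, but the commit log shows frequent iterations, which have brought in quite a few new features I personally don't care about. The flashcache source says its author borrowed code from dm-cache, so I looked that up: it turns out to be the work of Chinese developers, comes in under two thousand SLOC, and can be read in one evening, which suits me perfectly. For how to use dm-cache, see the flashcache documentation; for how it works, see the flashcache design notes.
The idea behind dm-cache is very simple: it uses the SSD as a cache and persists data to the ordinary disk. The SSD cache is organized as a set-associative map, much like a CPU cache, except that the key here is the cacheblock number. A cacheblock is the unit dm-cache introduces to make data access convenient; its granularity sits above the disk sector. The cacheblock size can be specified when creating the dm-cache block device with dmsetup; by default one cacheblock consists of 8 consecutive 512-byte sectors, i.e. 4 KB. The Device Mapper framework splits upper-layer IO requests into bios that never cross a cacheblock boundary and are at most one cacheblock in size, and then hands them to dm-cache. In other words, for both the SSD and the ordinary disk, dm-cache does IO in units of cacheblocks. Its in-memory metadata is: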
/* Cache block metadata structure */
struct cacheblock {
    spinlock_t lock;        /* Lock to protect operations on the bio list */
    sector_t block;         /* Sector number of the cached block */
    unsigned short state;   /* State of a block */
    unsigned long counter;  /* Logical timestamp of the block's last access */
    struct bio_list bios;   /* List of pending bios */
};
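Here, the block field holds the starting sector number, on the source disk, of the data cached in this cacheblock. With an SSD as the cache, write requests can be handled under several policies, such as writeback and writethrough. Writeback writes data to the SSD first and lets a background thread write it back to disk at a suitable time. Writethrough writes data to the disk and the SSD at the same time. (flashcache adds writearound on top of these: writes bypass the SSD cache and go straight to disk, and the SSD is updated when a read request is handled.) Under either policy, writing data to disk (or reading it from disk to fill the cache) is not instantaneous, so each cacheblock must carry a state (the state field). A cache also needs eviction; dm-cache supports FIFO or LRU eviction, so each cacheblock records its last access time (the counter field). Finally, to serialize concurrent requests to the same cacheblock, each cacheblock has a spinlock. Later requests that are held off are recorded on the bios list. Once the operation in progress on the cacheblock completes, dm-cache resubmits the bios on that list.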
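The three write policies described above can be summarized as a choice of where a write lands first. The following is a minimal userspace sketch, not dm-cache or flashcache code; the enum names and the write_targets() helper are hypothetical and exist only for illustration.

```c
#include <assert.h>

/* Hypothetical names; the real modules use their own constants. */
enum policy { WRITE_BACK, WRITE_THROUGH, WRITE_AROUND };

/* Bit flags describing where a write request is sent immediately. */
enum { TO_CACHE = 1, TO_DISK = 2 };

static int write_targets(enum policy p)
{
    assert(p == WRITE_BACK || p == WRITE_THROUGH || p == WRITE_AROUND);
    switch (p) {
    case WRITE_BACK:
        return TO_CACHE;            /* disk is updated later */
    case WRITE_THROUGH:
        return TO_CACHE | TO_DISK;  /* both updated together */
    default:
        return TO_DISK;             /* writearound: SSD filled on reads */
    }
}
```

Under writeback the cacheblock stays dirty until it is written back, which is why the per-block state field is needed.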
Next, the top-level control structure of dm-cache, cache_c:
/*
 * Cache context
 */
struct cache_c {
    struct dm_dev *src_dev;              /* Source device */
    struct dm_dev *cache_dev;            /* Cache device */
    struct dm_kcopyd_client *kcp_client; /* Kcopyd client for writing back data */

    struct cacheblock *cache;            /* Hash table for cache blocks */
    sector_t size;                       /* Cache size */
    unsigned int bits;                   /* Cache size in bits */
    unsigned int assoc;                  /* Cache associativity */
    unsigned int block_size;             /* Cache block size */
    unsigned int block_shift;            /* Cache block size in bits */
    unsigned int block_mask;             /* Cache block mask */
    unsigned int consecutive_shift;      /* Consecutive blocks size in bits */
    unsigned long counter;               /* Logical timestamp of last access */
    unsigned int write_policy;           /* Cache write policy */
    sector_t dirty_blocks;               /* Number of dirty blocks */

    spinlock_t lock;                     /* Lock to protect page allocation/deallocation */
    struct page_list *pages;             /* Pages for I/O */
    unsigned int nr_pages;               /* Number of pages */
    unsigned int nr_free_pages;          /* Number of free pages */
    wait_queue_head_t destroyq;          /* Wait queue for I/O completion */
    atomic_t nr_jobs;                    /* Number of I/O jobs */
    struct dm_io_client *io_client;      /* Client memory pool */

    /* Stats */
    unsigned long reads;                 /* Number of reads */
    unsigned long writes;                /* Number of writes */
    unsigned long cache_hits;            /* Number of cache hits */
    unsigned long replace;               /* Number of cache replacements */
    unsigned long writeback;             /* Number of replaced dirty blocks */
    unsigned long dirty;                 /* Number of submitted dirty blocks */
};
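Here, src_dev and cache_dev are the DM framework's abstractions of the disk and the SSD, respectively. The cache field is a contiguous array of cacheblock structures whose element count is the size field. The remaining fields are self-explanatory and are not covered further.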
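The relationship between block_size, block_shift and block_mask is easy to state in code. The sketch below is a userspace illustration that assumes block_size is a power of two counted in sectors; struct geometry, make_geometry() and align_to_block() are hypothetical names, not dm-cache functions.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;  /* stand-in for the kernel's sector_t */

struct geometry {
    unsigned int block_size;   /* sectors per cacheblock */
    unsigned int block_shift;  /* log2(block_size) */
    unsigned int block_mask;   /* block_size - 1 */
};

static struct geometry make_geometry(unsigned int block_size)
{
    struct geometry g;

    /* Assumption: the cacheblock size is a power of two. */
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
    g.block_size = block_size;
    g.block_mask = block_size - 1;
    g.block_shift = 0;
    while ((1u << g.block_shift) < block_size)
        g.block_shift++;
    return g;
}

/* Round a sector down to the first sector of its cacheblock. */
static sector_t align_to_block(const struct geometry *g, sector_t sector)
{
    return sector - (sector & g->block_mask);
}
```

With the default of 8 sectors per cacheblock, the shift is 3 and the mask is 7, so sector 21 belongs to the cacheblock that starts at sector 16.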
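dm-cache's initialization code is fairly simple. The DM framework takes the dmsetup arguments and passes them to cache_ctr(), which builds a cache_c object and stores it in dm_target.private. Another important field of the dm_target structure is split_io, which sets the granularity at which the DM framework splits bios; cache_ctr() sets it to the cacheblock size.
On the kernel IO path, upper-layer read and write requests are represented as bios. For a bio aimed at a block device virtualized by Device Mapper, the DM framework uses the bio's sector number to find the dm_target or dm_targets it belongs to (one bio may span several) and calls dm_target.type->map on each. That field is a function pointer which dm-cache's module initialization function, dm_cache_init(), registers as cache_map() when the module is loaded into the kernel. In other words, cache_map() is the entry point for both reads and writes.
cache_map() is implemented as follows: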
/*
 * Decide the mapping and perform necessary cache operations for a bio request.
 */
static int cache_map(struct dm_target *ti, struct bio *bio,
                     union map_info *map_context)
{
    struct cache_c *dmc = (struct cache_c *) ti->private;
    sector_t request_block, cache_block = 0, offset;
    int res;

    offset = bio->bi_sector & dmc->block_mask;
    request_block = bio->bi_sector - offset;

    if (bio_data_dir(bio) == READ) dmc->reads++;
    else dmc->writes++;

    res = cache_lookup(dmc, request_block, &cache_block);
    if (1 == res) /* Cache hit; serve request from cache */
        return cache_hit(dmc, bio, cache_block);
    else if (0 == res) /* Cache miss; replacement block is found */
        return cache_miss(dmc, bio, cache_block);
    else if (2 == res) { /* Entire cache set is dirty; initiate a write-back */
        write_back(dmc, cache_block, 1);
        dmc->writeback++;
    }

    /* Forward to source device */
    bio->bi_bdev = dmc->src_dev->bdev;

    return 1;
}
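The function first fetches the cache_c *dmc from ti->private; this object was built in cache_ctr(). It then computes the starting sector of the cacheblock that contains the bio's starting sector (bio->bi_sector) and stores it in request_block. Next it calls cache_lookup() to search dmc->cache with request_block as the key. The return value selects one of three paths: 1 is a hit, 0 is a miss for which a replacement cacheblock was found, and 2 means every block in the set is dirty, in which case a write-back is started and the bio is forwarded to the source device. cache_lookup() itself is fairly simple and is not covered in detail.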
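Since cache_lookup() is not shown, here is a rough userspace sketch of what a set-associative lookup with LRU replacement can look like. It is an illustration under stated assumptions, not the dm-cache implementation: the set is chosen by a plain modulo (dm-cache uses its own hash), the state values are reduced to three, and the struct and function names are hypothetical. Only the return convention (1 hit, 0 miss with victim, 2 whole set dirty) follows cache_map().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;

enum { INVALID = 0, VALID = 1, DIRTY = 2 };  /* simplified states */

struct cacheblock {
    sector_t block;         /* start sector on the source device */
    unsigned short state;
    unsigned long counter;  /* logical timestamp of last access */
};

struct cache {
    struct cacheblock *blocks;
    size_t size;            /* total number of cacheblocks */
    size_t assoc;           /* cacheblocks per set */
    unsigned int block_shift;
    unsigned long counter;  /* global logical clock */
};

/* Returns 1 on hit, 0 on miss with a victim, 2 if the whole set is dirty. */
static int lookup(struct cache *c, sector_t request_block, size_t *slot)
{
    size_t nsets, first, i;
    size_t invalid = SIZE_MAX, clean = SIZE_MAX, dirty = SIZE_MAX;

    assert(c->assoc != 0 && c->size % c->assoc == 0);
    nsets = c->size / c->assoc;
    /* Assumption: modulo stands in for the real hash function. */
    first = (size_t)((request_block >> c->block_shift) % nsets) * c->assoc;

    for (i = first; i < first + c->assoc; i++) {
        struct cacheblock *b = &c->blocks[i];

        if (b->state != INVALID && b->block == request_block) {
            b->counter = ++c->counter;  /* refresh LRU timestamp */
            *slot = i;
            return 1;
        }
        if (b->state == INVALID) {
            if (invalid == SIZE_MAX)
                invalid = i;
        } else if (b->state == VALID) {
            if (clean == SIZE_MAX || b->counter < c->blocks[clean].counter)
                clean = i;
        } else {
            if (dirty == SIZE_MAX || b->counter < c->blocks[dirty].counter)
                dirty = i;
        }
    }
    if (invalid != SIZE_MAX) { *slot = invalid; return 0; }
    if (clean != SIZE_MAX)   { *slot = clean;   return 0; }
    *slot = dirty;  /* least recently used dirty block */
    return 2;
}
```

The victim preference here (an invalid slot, then the oldest clean block, then the oldest dirty block) is one plausible ordering chosen for the sketch.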
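If the lookup misses, the cache_miss() path is taken. Its last argument, cache_block, is the index into dmc->cache of the cacheblock that cache_lookup() picked for replacement according to the eviction policy.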
/* Handle cache misses */
static int cache_miss(struct cache_c *dmc, struct bio* bio, sector_t cache_block) {
    if (bio_data_dir(bio) == READ)
        return cache_read_miss(dmc, bio, cache_block);
    else
        return cache_write_miss(dmc, bio, cache_block);
}
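cache_miss() checks whether the bio is a read or a write, calling cache_read_miss() for a read and cache_write_miss() otherwise.
For reasons of space, we only follow the case of a read that misses the cache, where cache_read_miss() is called.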
1073 /*
1074 * Handle a read cache miss:
1075 * Update the metadata; fetch the necessary block from source device;
1076 * store data to cache device.
1077 */
1078 static int cache_read_miss(struct cache_c *dmc, struct bio* bio,
1079 sector_t cache_block) {
1080 struct cacheblock *cache = dmc->cache;
1081 unsigned int offset, head, tail;
1082 struct kcached_job *job;
1083 sector_t request_block, left;
1084
1085 offset = (unsigned int)(bio->bi_sector & dmc->block_mask);
1086 request_block = bio->bi_sector - offset;
1087
1095 cache_insert(dmc, request_block, cache_block); /* Update metadata first */
1096
1097 job = new_kcached_job(dmc, bio, request_block, cache_block);
1098
1099 head = to_bytes(offset);
1100
1101 left = (dmc->src_dev->bdev->bd_inode->i_size>>9) - request_block;
1102 if (left < dmc->block_size) {
1103 tail = to_bytes(left) - bio->bi_size - head;
1104 job->src.count = left;
1105 job->dest.count = left;
1106 } else
1107 tail = to_bytes(dmc->block_size) - bio->bi_size - head;
1108
1109 /* Requested block is aligned with a cache block */
1110 if (0 == head && 0 == tail)
1111 job->nr_pages= 0;
1112 else /* Need new pages to store extra data */
1113 job->nr_pages = dm_div_up(head, PAGE_SIZE) + dm_div_up(tail, PAGE_SIZE);
1114 job->rw = READ; /* Fetch data from the source device */
1115
1118 queue_job(job);
1119
1120 return 0;
1121 }
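The function first calls cache_insert() to update the cache metadata, setting the cacheblock's state to RESERVED. It then calls new_kcached_job() to allocate a kcached_job object. Lines 1099 to 1114 are the core. As described earlier, the DM framework has already split upper-layer bios on cacheblock boundaries, so each bio handled by cache_map() covers at most one cacheblock (the original post illustrated this with a figure). Line 1099 converts the bio's offset within the cacheblock to bytes and stores it in head. Line 1101 computes the number of sectors from request_block to the last sector of the disk. Line 1103 or 1107 computes tail, the number of bytes between the end of the bio's data and the end of the disk (or the end of the cacheblock). On line 1110, 0 == head and 0 == tail means the bio covers the whole cacheblock exactly. Otherwise the bio covers only part of the cacheblock, and pages must be allocated for the parts before and after it that the bio did not request, because dm-cache always reads from the disk in cacheblock-sized units. Line 1114 sets the job's direction to READ. Finally, line 1118 submits the job.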
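A worked example makes the head/tail arithmetic concrete. The sketch below is a userspace illustration that assumes 512-byte sectors and 4096-byte pages and ignores the end-of-disk case handled through left; struct padding, compute_padding() and div_up() are hypothetical names.

```c
#include <assert.h>

#define SECTOR_BYTES 512u
#define PAGE_BYTES   4096u  /* assumed page size */

struct padding {
    unsigned int head;      /* bytes of the cacheblock before the bio */
    unsigned int tail;      /* bytes of the cacheblock after the bio */
    unsigned int nr_pages;  /* extra pages needed for head and tail */
};

static unsigned int div_up(unsigned int n, unsigned int d)
{
    return (n + d - 1) / d;
}

/* block_size is in sectors and assumed to be a power of two. */
static struct padding compute_padding(unsigned long long bi_sector,
                                      unsigned int bi_size,
                                      unsigned int block_size)
{
    struct padding p;
    unsigned int offset = (unsigned int)(bi_sector & (block_size - 1));

    /* The bio must fit inside one cacheblock. */
    assert(offset * SECTOR_BYTES + bi_size <= block_size * SECTOR_BYTES);

    p.head = offset * SECTOR_BYTES;
    p.tail = block_size * SECTOR_BYTES - bi_size - p.head;
    if (p.head == 0 && p.tail == 0)
        p.nr_pages = 0;
    else
        p.nr_pages = div_up(p.head, PAGE_BYTES) + div_up(p.tail, PAGE_BYTES);
    return p;
}
```

For a 16-sector (8 KB) cacheblock, a 1024-byte read starting at sector 18 gives head = 1024, tail = 6144 and three extra pages, while a full 8 KB read at sector 16 needs none.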
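Back in cache_read_miss(), line 1097 allocates the job by calling new_kcached_job(). Its third argument, request_block, is the starting sector on the disk of the cacheblock that contains the bio; its fourth argument, cache_block, is the index of the target cacheblock on the SSD, which the function converts to a starting sector with cache_block << dmc->block_shift.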
static struct kcached_job *new_kcached_job(struct cache_c *dmc, struct bio* bio,
                                           sector_t request_block,
                                           sector_t cache_block)
{
    struct dm_io_region src, dest;
    struct kcached_job *job;

    src.bdev = dmc->src_dev->bdev;
    src.sector = request_block;
    src.count = dmc->block_size;
    dest.bdev = dmc->cache_dev->bdev;
    dest.sector = cache_block << dmc->block_shift;
    dest.count = src.count;

    job = mempool_alloc(_job_pool, GFP_NOIO);
    job->dmc = dmc;
    job->bio = bio;
    job->src = src;
    job->dest = dest;
    job->cacheblock = &dmc->cache[cache_block];

    return job;
}
Next, the definition of the job structure:
/* Structure for a kcached job */
struct kcached_job {
    struct list_head list;
    struct cache_c *dmc;
    struct bio *bio;            /* Original bio */
    struct dm_io_region src;
    struct dm_io_region dest;
    struct cacheblock *cacheblock;
    int rw;
    /*
     * When the original bio is not aligned with cache blocks,
     * we need extra bvecs and pages for padding.
     */
    struct bio_vec *bvec;
    unsigned int nr_pages;
    struct page_list *pages;
};
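In dm-cache a job is in one of three states, and its list field links it into the corresponding list: _io_jobs, _pages_jobs or _complete_jobs. _io_jobs holds jobs waiting to perform IO, _pages_jobs holds jobs waiting for pages to be allocated, and _complete_jobs holds jobs waiting for final cleanup. The bio field of kcached_job stores the bio that the DM framework passed to cache_map(); src and dest are the dm_io_region descriptors for the disk and the SSD respectively. The cacheblock pointer points to the element of the cache_c.cache array whose index is the number of the SSD cacheblock that the bio maps to. The rw field records the job's current IO direction.
Returning to cache_read_miss(): on line 1118 it calls queue_job() to submit the job. The code is: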
static void queue_job(struct kcached_job *job)
{
    atomic_inc(&job->dmc->nr_jobs);
    if (job->nr_pages > 0) /* Request pages */
        push(&_pages_jobs, job);
    else /* Go ahead to do I/O */
        push(&_io_jobs, job);
    wake();
}
As you can see, if pages must be allocated for the job, it is linked onto the _pages_jobs list; otherwise it goes onto the _io_jobs list. Then wake() is called:
static inline void wake(void)
{
    queue_work(_kcached_wq, &_kcached_work);
}
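wake() is just a wrapper around queue_work(); it submits _kcached_work to the _kcached_wq workqueue. In the module initialization function dm_cache_init(), do_work is registered as the callback of _kcached_work, so do_work() runs whenever _kcached_work is scheduled.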
static void do_work(struct work_struct *ignored)
{
    process_jobs(&_complete_jobs, do_complete);
    process_jobs(&_pages_jobs, do_pages);
    process_jobs(&_io_jobs, do_io);
}
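So do_work() walks the jobs on the _complete_jobs, _pages_jobs and _io_jobs lists in turn, calling do_complete, do_pages and do_io respectively with each job as the argument. The order matters: _complete_jobs is processed first because finishing those jobs may return pages to the page pool; _pages_jobs comes next because those jobs take pages from the page pool and can only perform IO once they have them; _io_jobs is processed last.
The code of process_jobs() is: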
/*
 * Run through a list for as long as possible. Returns the count
 * of successful jobs.
 */
static int process_jobs(struct list_head *jobs,
                        int (*fn) (struct kcached_job *))
{
    struct kcached_job *job;
    int r, count = 0;

    while ((job = pop(jobs))) {
        r = fn(job);

        if (r < 0) {
            /* error this rogue job */
            DMERR("process_jobs: Job processing error");
        }

        if (r > 0) {
            /*
             * We couldn't service this job ATM, so
             * push this job back onto the list.
             */
            push(jobs, job);
            break;
        }

        count++;
    }

    return count;
}
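It pops jobs off the list one by one and invokes the callback on each. If the callback returns a positive value, the job cannot be serviced yet: it is pushed back onto the list and the loop stops.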
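The push-back behaviour is easiest to see in a small userspace simulation. In the sketch below, the fixed-size ring buffer, the free_pages counter and take_pages() are hypothetical stand-ins for the kernel list, the page pool and do_pages; only the loop structure mirrors process_jobs().

```c
#include <assert.h>
#include <stddef.h>

#define MAX_JOBS 16

struct job {
    int id;
    int pages_needed;
};

/* FIFO ring buffer standing in for the kernel's list_head. */
struct job_list {
    struct job *items[MAX_JOBS];
    size_t head, len;
};

static void push(struct job_list *l, struct job *j)
{
    assert(l->len < MAX_JOBS);
    l->items[(l->head + l->len) % MAX_JOBS] = j;
    l->len++;
}

static struct job *pop(struct job_list *l)
{
    struct job *j;

    if (l->len == 0)
        return NULL;
    j = l->items[l->head];
    l->head = (l->head + 1) % MAX_JOBS;
    l->len--;
    return j;
}

/* Same shape as process_jobs(): stop at the first job that must wait. */
static int process_jobs(struct job_list *l, int (*fn)(struct job *))
{
    struct job *j;
    int count = 0;

    while ((j = pop(l)) != NULL) {
        if (fn(j) > 0) {   /* cannot service now: requeue and stop */
            push(l, j);
            break;
        }
        count++;
    }
    return count;
}

static int free_pages = 2;  /* hypothetical page pool */

/* Returns 1 (retry later) when the pool cannot satisfy the job. */
static int take_pages(struct job *j)
{
    if (j->pages_needed > free_pages)
        return 1;
    free_pages -= j->pages_needed;
    return 0;
}
```

With two free pages, a job needing two pages succeeds and the next job needing one page is requeued until pages are returned to the pool, which is why do_work() runs the completion list before the page list.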
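Back to queue_job(). As mentioned, dm-cache reads and writes the SSD and the disk in cacheblock-sized units, so if a bio is not aligned to a cacheblock, or its size is not equal to the cacheblock size, pages must be allocated for the leading and trailing parts of the cacheblock that the bio does not cover, which means submitting the job to the _pages_jobs list. Otherwise the job goes straight to the _io_jobs list.
do_pages, the callback for the _pages_jobs list, is very simple: it takes nr_pages pages from the page pool, stores them in the pages field of the kcached_job, and then submits the job to the _io_jobs list.
For jobs on the _io_jobs list, do_work() processes them with the do_io callback.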
static int do_io(struct kcached_job *job)
{
    int r = 0;

    if (job->rw == READ) { /* Read from source device */
        r = do_fetch(job);
    } else { /* Write to cache device */
        r = do_store(job);
    }

    return r;
}
For a read request, the do_fetch() branch is obviously taken.
400 /*
401 * Fetch data from the source device asynchronously.
402 * For a READ bio, if a cache block is larger than the requested data, then
403 * additional data are prefetched. Larger cache block size enables more
404 * aggressive read prefetching, which is useful for read-mostly usage.
405 * For a WRITE bio, if a cache block is larger than the requested data, the
406 * entire block needs to be fetched, and larger block size incurs more overhead.
407 * In scenarios where writes are frequent, 4KB is a good cache block size.
408 */
409 static int do_fetch(struct kcached_job *job)
410 {
411 int r = 0, i, j;
412 struct bio *bio = job->bio;
413 struct cache_c *dmc = job->dmc;
414 unsigned int offset, head, tail, remaining, nr_vecs, idx = 0;
415 struct bio_vec *bvec;
416 struct page_list *pl;
417 printk("do_fetch");
418 offset = (unsigned int) (bio->bi_sector & dmc->block_mask);
419 head = to_bytes(offset);
420 tail = to_bytes(dmc->block_size) - bio->bi_size - head;
425
426 if (bio_data_dir(bio) == READ) { /* The original request is a READ */
427 if (0 == job->nr_pages) { /* The request is aligned to cache block */
428 r = dm_io_async_bvec(1, &job->src, READ,
429 bio->bi_io_vec + bio->bi_idx,
430 io_callback, job);
431 return r;
432 }
433
434 nr_vecs = bio->bi_vcnt - bio->bi_idx + job->nr_pages;
435 bvec = kmalloc(nr_vecs * sizeof(*bvec), GFP_NOIO);
436 if (!bvec) {
437 DMERR("do_fetch: No memory");
438 return 1;
439 }
440
441 pl = job->pages;
442 i = 0;
443 while (head) {
444 bvec[i].bv_len = min(head, (unsigned int)PAGE_SIZE);
445 bvec[i].bv_offset = 0;
446 bvec[i].bv_page = pl->page;
447 head -= bvec[i].bv_len;
448 pl = pl->next;
449 i++;
450 }
451
452 remaining = bio->bi_size;
453 j = bio->bi_idx;
454 while (remaining) {
455 bvec[i] = bio->bi_io_vec[j];
456 remaining -= bvec[i].bv_len;
457 i++; j++;
458 }
459
460 while (tail) {
461 bvec[i].bv_len = min(tail, (unsigned int)PAGE_SIZE);
462 bvec[i].bv_offset = 0;
463 bvec[i].bv_page = pl->page;
464 tail -= bvec[i].bv_len;
465 pl = pl->next;
466 i++;
467 }
468
469 job->bvec = bvec;
470 r = dm_io_async_bvec(1, &job->src, READ, job->bvec, io_callback, job);
471 return r;
472 } else { /* The original request is a WRITE */
541 }
542 }
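- If the job requested no pages, meaning the bio is cacheblock-aligned and exactly one cacheblock in size, dm_io_async_bvec() is called directly.
- If the job did request pages, meaning the bio is not cacheblock-aligned or is not exactly one cacheblock in size, lines 434 to 467 build a new bio_vec array, bvec, store it in job->bvec, and then call dm_io_async_bvec().
Comparing the arguments passed to dm_io_async_bvec() in the two cases, only the fourth differs: the former passes the bvec of the original bio, the latter the newly built bvec.
dm_io_async_bvec() submits the IO, reading data from the disk (job->src) into the pages described by the fourth argument, and then calls back io_callback().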
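The layout of the padded vector (head pages, then the bio's own segments, then tail pages) can be sketched without any kernel types. The following userspace illustration assumes 4096-byte pages; struct seg and build_segments() are hypothetical names that track only segment lengths and whether a segment comes from the original bio.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_BYTES 4096u  /* assumed page size */

struct seg {
    unsigned int len;  /* bytes covered by this segment */
    int from_bio;      /* 1: the bio's own page, 0: padding page */
};

/* Lay out head padding, the bio's segments, then tail padding. */
static size_t build_segments(unsigned int head,
                             const unsigned int *bio_lens, size_t bio_n,
                             unsigned int tail,
                             struct seg *out, size_t cap)
{
    size_t i = 0, j;

    while (head) {                      /* leading padding pages */
        assert(i < cap);
        out[i].len = head < PAGE_BYTES ? head : PAGE_BYTES;
        out[i].from_bio = 0;
        head -= out[i].len;
        i++;
    }
    for (j = 0; j < bio_n; j++) {       /* the caller's own buffers */
        assert(i < cap);
        out[i].len = bio_lens[j];
        out[i].from_bio = 1;
        i++;
    }
    while (tail) {                      /* trailing padding pages */
        assert(i < cap);
        out[i].len = tail < PAGE_BYTES ? tail : PAGE_BYTES;
        out[i].from_bio = 0;
        tail -= out[i].len;
        i++;
    }
    return i;  /* number of segments written */
}
```

For head = 1024, one 1024-byte bio segment and tail = 6144, the result is four segments of 1024, 1024, 4096 and 2048 bytes, which together cover one 8 KB cacheblock while the caller's data still lands in its own page.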
static void io_callback(unsigned long error, void *context)
{
    struct kcached_job *job = (struct kcached_job *) context;

    if (error) {
        /* TODO */
        DMERR("io_callback: io error");
        return;
    }

    if (job->rw == READ) {
        job->rw = WRITE;
        push(&_io_jobs, job);
    } else
        push(&_complete_jobs, job);
    wake();
}
A read job has job->rw set to READ; io_callback() changes it to WRITE and submits the job to the _io_jobs list. do_work() then hands the job to do_io() again, and since job->rw is now WRITE, do_store() is called this time.
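The two passes a read-miss job makes through io_callback() form a tiny state machine. The sketch below models it in userspace; the enums and on_io_done() are hypothetical names, and returning a queue identifier stands in for the push() calls in the real callback.

```c
#include <assert.h>
#include <stddef.h>

enum dir { DIR_READ, DIR_WRITE };
enum queue { Q_NONE, Q_IO, Q_COMPLETE };

struct job {
    enum dir rw;  /* current IO direction of the job */
};

/* Decide which list the job goes to after an IO finishes. */
static enum queue on_io_done(struct job *j, int error)
{
    assert(j != NULL);
    if (error)
        return Q_NONE;        /* the original only logs the error */
    if (j->rw == DIR_READ) {
        j->rw = DIR_WRITE;    /* fetched from disk: now store to SSD */
        return Q_IO;
    }
    return Q_COMPLETE;        /* stored to SSD: finish the bio */
}
```

A job that starts as DIR_READ is queued for IO once more after the fetch and reaches the completion queue after the store.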
/*
 * Store data to the cache source device asynchronously.
 * For a READ bio request, the data fetched from the source device are returned
 * to kernel and stored in cache at the same time.
 * For a WRITE bio request, the data are written to the cache and source device
 * at the same time.
 */
static int do_store(struct kcached_job *job)
{
    int i, j, r = 0;
    struct bio *bio = job->bio;
    struct cache_c *dmc = job->dmc;
    unsigned int offset, head, tail, remaining, nr_vecs;
    struct bio_vec *bvec;

    offset = (unsigned int) (bio->bi_sector & dmc->block_mask);
    head = to_bytes(offset);
    tail = to_bytes(dmc->block_size) - bio->bi_size - head;

    if (0 == job->nr_pages) /* Original request is aligned with cache blocks */
        r = dm_io_async_bvec(1, &job->dest, WRITE, bio->bi_io_vec + bio->bi_idx,
                             io_callback, job);
    else {
        if (bio_data_dir(bio) == WRITE && head > 0 && tail > 0) {
            nr_vecs = job->nr_pages + bio->bi_vcnt - bio->bi_idx;
            if (offset && (offset + bio->bi_size < PAGE_SIZE)) nr_vecs++;
            bvec = kmalloc(nr_vecs * sizeof(*bvec), GFP_KERNEL);
            if (!bvec) {
                DMERR("do_store: No memory");
                return 1;
            }

            i = 0;
            while (head) {
                bvec[i].bv_len = min(head, job->bvec[i].bv_len);
                bvec[i].bv_offset = 0;
                bvec[i].bv_page = job->bvec[i].bv_page;
                head -= bvec[i].bv_len;
                i++;
            }
            remaining = bio->bi_size;
            j = bio->bi_idx;
            while (remaining) {
                bvec[i] = bio->bi_io_vec[j];
                remaining -= bvec[i].bv_len;
                i++; j++;
            }
            j = (to_bytes(offset) + bio->bi_size) / PAGE_SIZE;
            bvec[i].bv_offset = (to_bytes(offset) + bio->bi_size) -
                                j * PAGE_SIZE;
            bvec[i].bv_len = PAGE_SIZE - bvec[i].bv_offset;
            bvec[i].bv_page = job->bvec[j].bv_page;
            tail -= bvec[i].bv_len;
            i++; j++;
            while (tail) {
                bvec[i] = job->bvec[j];
                tail -= bvec[i].bv_len;
                i++; j++;
            }
            kfree(job->bvec);
            job->bvec = bvec;
        }

        r = dm_io_async_bvec(1, &job->dest, WRITE, job->bvec, io_callback, job);
    }
    return r;
}
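This code closely resembles do_fetch() and is not covered in detail. It takes the data that do_fetch() read from the disk and writes it to the SSD (job->dest) through dm_io_async_bvec(). io_callback() is then called again; this time job->rw is WRITE, so io_callback() submits the job to the _complete_jobs list, whose callback is do_complete():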
static int do_complete(struct kcached_job *job)
{
    int r = 0;
    struct bio *bio = job->bio;

    bio_endio(bio, 0);

    if (job->nr_pages > 0) {
        kfree(job->bvec);
        kcached_put_pages(job->dmc, job->pages);
    }

    flush_bios(job->cacheblock);
    mempool_free(job, _job_pool);

    if (atomic_dec_and_test(&job->dmc->nr_jobs))
        wake_up(&job->dmc->destroyq);

    return r;
}
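do_complete() first calls bio_endio() to tell the upper layers of the IO subsystem that the bio has been handled. It then frees the pages, calls flush_bios() to resubmit all bios that arrived for the same cacheblock after the current one, and finally frees the job.
That completes the read request. Write requests are handled in much the same way and are not covered here.
Moving IO handling into the kernel is an effective way to optimize IO. Carrying the IO path over the network (iSCSI) is also a clear trend, as in Amazon's EBS and Tencent's CBS (for an introduction, see "The World of Block Storage"). I hope to take a closer look at these in the future.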