最近想学习Linux IO子系统, 找了flashcache代码, 它通过内核提供的Device Mapper机制, 将一块SSD和一块普通磁盘虚拟为一个块设备, 其中SSD作为cache, 数据最终落地到普通磁盘. 这种混合存储的策略, 向上层应用(如mysql)屏蔽了底层的实现, 上层应用看到的只是一个挂载到虚拟块设备上的某种文件系统, 使用常见的文件系统接口即可读写数据, 一方面保持兼容, 一方面获得不错的性能. flashcache的代码只有几千行, 从commit log中可以看到版本迭代比较频繁, 也因此引入了较多我个人不关心的新特性. flashcache源码中作者写到借鉴了dm-cache的代码, 所以查了下资料, 竟是国人出品, sloc不足两千, 一晚上就可以看完, 正合胃口. dm-cache的使用可以参考flashcache文档, 原理见flashcache原理.

内存结构

dm-cache思路非常简单, 它把SSD作为cache, 将数据持久化到普通磁盘. 其中, SSD cache组织方式为set-associative map, 这和CPU cache的组织非常相像, 只是这里的key是cacheblock编号. cacheblock是dm-cache为了方便存取数据引入的单位, 粒度在磁盘block之上. 在通过dmsetup创建dm-cache块设备时时可以指定cacheblock的大小, 默认为8个连续的磁盘block组成一个cacheblock, 即4k字节. 上层的IO请求由Device Mapper框架切割为cacheblock大小(且对齐)的bio, 然后交由dm-cache处理. 也就是说, 不管是对SSD, 还是普通磁盘, dm-cache处理IO的单位都是cacheblock. 它在内存中的metadata为:

117 /* Cache block metadata structure */
118 struct cacheblock {
119     spinlock_t lock;    /* Lock to protect operations on the bio list */
120     sector_t block;     /* Sector number of the cached block */
121     unsigned short state;  /* State of a block */
122     unsigned long counter; /* Logical timestamp of the block’s last access */
123     struct bio_list bios;  /* List of pending bios */
124 };

其中, block字段表示当前cacheblock的起始扇区编号. 既然SSD作为cache, 针对写请求必定会有writeback和writethrough等多种选择. writeback即数据先写到SSD, 然后由后台线程在合适的时间写回磁盘. writethrough指数据同时写入磁盘和SSD. (flashcache在这基础上又增加了writearound的方式, 意思是绕过SSD cache, 数据直接写入磁盘, 在处理读请求时更新到SSD.) 不管是writeback, 还是writethrough, 数据写入磁盘(或者由磁盘读取数据更新至cache)都不可能一蹴而就, 所以每个cacheblock必定会有一个状态(state字段). 另外, cache有淘汰的概念, dm-cache支持FIFO或LRU淘汰, 所以需要为每个cacheblock保存其最后访问时间(counter字段). 最后, 为了互斥同时请求同一个cacheblock, 每个cacheblock还对应一个spinlock. 被互斥的后发请求记录在bios链表中. 在当前cacheblock上的操作完成后, dm-cache将重新提交bios链表上的bio.

接下来看下dm-cache的总控结构体cache_c:

 80 /*
 81  * Cache context
 82  */
 83 struct cache_c {
 84     struct dm_dev *src_dev;        /* Source device */
 85     struct dm_dev *cache_dev;  /* Cache device */
 86     struct dm_kcopyd_client *kcp_client; /* Kcopyd client for writing back data */
 87 
 88     struct cacheblock *cache;  /* Hash table for cache blocks */
 89     sector_t size;          /* Cache size */
 90     unsigned int bits;     /* Cache size in bits */
 91     unsigned int assoc;        /* Cache associativity */
 92     unsigned int block_size;   /* Cache block size */
 93     unsigned int block_shift;  /* Cache block size in bits */
 94     unsigned int block_mask;   /* Cache block mask */
 95     unsigned int consecutive_shift;    /* Consecutive blocks size in bits */
 96     unsigned long counter;     /* Logical timestamp of last access */
 97     unsigned int write_policy; /* Cache write policy */
 98     sector_t dirty_blocks;      /* Number of dirty blocks */
 99 
100     spinlock_t lock;        /* Lock to protect page allocation/deallocation */
101     struct page_list *pages;   /* Pages for I/O */
102     unsigned int nr_pages;     /* Number of pages */
103     unsigned int nr_free_pages;    /* Number of free pages */
104     wait_queue_head_t destroyq; /* Wait queue for I/O completion */
105     atomic_t nr_jobs;       /* Number of I/O jobs */
106     struct dm_io_client *io_client;   /* Client memory pool*/
107 
108     /* Stats */
109     unsigned long reads;       /* Number of reads */
110     unsigned long writes;      /* Number of writes */
111     unsigned long cache_hits;  /* Number of cache hits */
112     unsigned long replace;     /* Number of cache replacements */
113     unsigned long writeback;   /* Number of replaced dirty blocks */
114     unsigned long dirty;       /* Number of submitted dirty blocks */
115 };

其中, src_dev和cache_dev分别为磁盘和SSD在DM框架的抽象. cache字段为连续的cacheblock数组, 元素个数即size字段. 其余字段顾名思义, 不再赘述.

初始化

dm-cache的初始化代码相对简单, DM框架获取dmsetup参数, 传递给cache_ctr(), dm-cache通过该函数构造一个cache_c对象, 保存在dm_target.private中. dm_target结构中另一重要字段为split_io, 这个字段表示DM框架分割bio的粒度, cache_ctr()函数指定其为cacheblock大小.

上层的读写请求在IO内核路径上表示为bio, 针对Device Mapper框架虚拟出来的块设备的bio请求, DM框架通过bio的block编号找到所属的dm_targets(一个bio的请求可能横跨多个dm_target), 逐个回调dm_target.type->map, 该字段为函数指针, 在dm-cache模块加载到内核时, 由该模块的初始化函数dm_cache_init()注册为cache_map(). 也就是说, 读写请求的入口都是cache_map().

请求处理

如上所述, 读写请求的入口都是cache_map(), 其实现如下:

1202 /*
1203  * Decide the mapping and perform necessary cache operations for a bio request.
1204  */
1205 static int cache_map(struct dm_target *ti, struct bio *bio,
1206               union map_info *map_context)
1207 {
1208     struct cache_c *dmc = (struct cache_c *) ti->private;
1209     sector_t request_block, cache_block = 0, offset;
1210     int res;
1211 
1212     offset = bio->bi_sector & dmc->block_mask;
1213     request_block = bio->bi_sector – offset;
1214 
1220     if (bio_data_dir(bio) == READ) dmc->reads++;
1221     else dmc->writes++;
1222 
1223     res = cache_lookup(dmc, request_block, &cache_block);
1224     if (1 == res)  /* Cache hit; server request from cache */
1225         return cache_hit(dmc, bio, cache_block);
1226     else if (0 == res) /* Cache miss; replacement block is found */
1227         return cache_miss(dmc, bio, cache_block);
1228     else if (2 == res) { /* Entire cache set is dirty; initiate a write-back */
1229         write_back(dmc, cache_block, 1);
1230         dmc->writeback++;
1231     }
1232 
1233     /* Forward to source device */
1234     bio->bi_bdev = dmc->src_dev->bdev;
1235 
1236     return 1;
1237 }

该函数首先从ti->private中获取cache_c *dmc, 这个对象由cache_ctr()中构造. 接着获得bio所请求的起始扇区(即bio->bi_sector)所属的cacheblock的扇区编号, 保存在request_block变量. 接着通过cache_lookup()函数在dmc->cache中查找, key便是request_block. cache_lookup()代码相对简单, 不再细述.

如果cache中查找失败, 则进入cache_miss()逻辑. 其最后一个参数cache_block为cache_lookup()以某种淘汰形式找到的待替换的cacheblock的扇区编号.

1189 /* Handle cache misses */
1190 static int cache_miss(struct cache_c *dmc, struct bio* bio, sector_t cache_block) {
1191     if (bio_data_dir(bio) == READ)
1192         return cache_read_miss(dmc, bio, cache_block);
1193     else
1194         return cache_write_miss(dmc, bio, cache_block);
1195 }

cache_miss()函数判断bio是读是写, 读则调用cache_read_miss(), 否则调用cache_write_miss().

篇幅所限, 接下来我们只看下读请求未命中cache的情况, 这时cache_read_miss()将被调用.

1073 /*
1074  * Handle a read cache miss:
1075  *  Update the metadata; fetch the necessary block from source device;
1076  *  store data to cache device.
1077  */
1078 static int cache_read_miss(struct cache_c *dmc, struct bio* bio,
1079                            sector_t cache_block) {
1080     struct cacheblock *cache = dmc->cache;
1081     unsigned int offset, head, tail;
1082     struct kcached_job *job;
1083     sector_t request_block, left;
1084 
1085     offset = (unsigned int)(bio->bi_sector & dmc->block_mask);
1086     request_block = bio->bi_sector – offset;
1087 
1095     cache_insert(dmc, request_block, cache_block); /* Update metadata first */
1096 
1097     job = new_kcached_job(dmc, bio, request_block, cache_block);
1098 
1099     head = to_bytes(offset);
1100 
1101     left = (dmc->src_dev->bdev->bd_inode->i_size>>9) – request_block;
1102     if (left < dmc->block_size) {
1103         tail = to_bytes(left) – bio->bi_size – head;
1104         job->src.count = left;
1105         job->dest.count = left;
1106     } else
1107         tail = to_bytes(dmc->block_size) – bio->bi_size – head;
1108 
1109     /* Requested block is aligned with a cache block */
1110     if (0 == head && 0 == tail)
1111         job->nr_pages= 0;
1112     else /* Need new pages to store extra data */
1113         job->nr_pages = dm_div_up(head, PAGE_SIZE) + dm_div_up(tail, PAGE_SIZE);
1114     job->rw = READ; /* Fetch data from the source device */
1115 
1118     queue_job(job);
1119 
1120     return 0;
1121 }

函数首先调用cache_insert()更新cache, 设置该cacheblock.state为RESERVED. 然后调用new_kcached_job()分配一个kcached_job对象. 第1109~1104行是核心代码, 如前文所述, 上层请求的bio已经由DM框架按cacheblock单位切分, 也就是说, cache_map()所处理的每个bio请求的扇区数最大为cacheblock. 如下图所示: 第1099行获得这个bio在cacheblock中的偏移, 保存在head. 第1101行获得第request_block块扇区到磁盘最后一块扇区所跨过的扇区数. 第1103或1107行获得bio请求的数据最后一个字节离磁盘最后一字节(或者下一个cacheblock)的偏移. 第1110行, 如果0 == head且0 == tail, 说明所请求的bio正好覆盖整个cacheblock. 否则, 说明请求的bio只占cacheblock的一部分, 针对这种情况, 需要为该bio未请求的前后两部分分别分配页面. 因为dm-cache请求磁盘的单位为cacheblock大小. 第1114行指定job的读写方向为READ. 最后, 第1118行提交job.

回头看看cache_read_miss()中的1097行分配job所调用的函数new_kcached_job(), 第3个参数request_block表示bio请求在磁盘的起始扇区号, 第4个参数cache_block表示bio请求在SSD的起始扇区号.

1049 static struct kcached_job *new_kcached_job(struct cache_c *dmc, struct bio* bio,
1050                                            sector_t request_block,
1051                                            sector_t cache_block)
1052 {
1053     struct dm_io_region src, dest;
1054     struct kcached_job *job;
1055 
1056     src.bdev = dmc->src_dev->bdev;
1057     src.sector = request_block;
1058     src.count = dmc->block_size;
1059     dest.bdev = dmc->cache_dev->bdev;
1060     dest.sector = cache_block << dmc->block_shift;
1061     dest.count = src.count;
1062 
1063     job = mempool_alloc(_job_pool, GFP_NOIO);
1064     job->dmc = dmc;
1065     job->bio = bio;
1066     job->src = src;
1067     job->dest = dest;
1068     job->cacheblock = &dmc->cache[cache_block];
1069 
1070     return job;
1071 }

接下来看看job结构的定义:

126 /* Structure for a kcached job */
127 struct kcached_job {
128     struct list_head list;
129     struct cache_c *dmc;
130     struct bio *bio;   /* Original bio */
131     struct dm_io_region src;
132     struct dm_io_region dest;
133     struct cacheblock *cacheblock;
134     int rw;
135     /*
136      * When the original bio is not aligned with cache blocks,
137      * we need extra bvecs and pages for padding.
138      */
139     struct bio_vec *bvec;
140     unsigned int nr_pages;
141     struct page_list *pages;
142 };

在dm-cache中, job有3种状态, 它以list字段链入其所属于的链表, 分别为_io_jobs, _pages_jobs和_complete_jobs. 其中, io_jobs表示待执行IO的任务, page_jobs待分配页面的任务, compelet_jobs表示待收尾的任务. kcached_job的bio字段存储DM框架发给cache_map()的bio请求, src和dest分别指向磁盘和SSD的dm_io_region. cacheblock指针指向cache_c.cache数组中以请求的bio所落在的SSD磁盘上的cacheblock编号为下标的偏移. rw字段表示job当前的读写方向.

回到cache_read_miss()函数, 它在第1118行调用queue_job()提交了任务, 代码如下:

736 static void queue_job(struct kcached_job *job)
737 {
738     atomic_inc(&job->dmc->nr_jobs);
739     if (job->nr_pages > 0/* Request pages */
740         push(&_pages_jobs, job);
741     else /* Go ahead to do I/O */
742         push(&_io_jobs, job);
743     wake();
744 }

可以看到, 如果需要为job分配页面, 则将job链入_pages_jobs链表, 否则, 链入_io_jobs链表. 然后调用wake():

299 static inline void wake(void)
300 {
301     queue_work(_kcached_wq, &_kcached_work);
302 }

wake()函数只是对queue_work()的封装, 它将_kcached_work提交到_kcached_wq. 在dm-cache模块初始化函数dm_cache_init()中, _kcached_work被注册的回调为do_work. 所以, 当_kcached_work被调度时, do_work()将被回调.

729 static void do_work(struct work_struct *ignored)
730 {
731     process_jobs(&_complete_jobs, do_complete);
732     process_jobs(&_pages_jobs, do_pages);
733     process_jobs(&_io_jobs, do_io);
734 }

可见, do_work()依次遍历_complete_jobs, _pages_jobs和_io_jobs链表中的任务, 以任务为参数, 分别回调do_complete, do_pages, do_io. 在这里, 遍历的顺序是有讲究的: 先处理_complete_jobs任务, 是因为此类任务完成后可能释放一些页面回页面内存池; 然后处理_pages_jobs任务, 因为此类任务只有获取页面后才能执行IO操作, 它从页面内存池中获取页面; 最后处理_io_jobs链表任务.

process_jobs()代码如下:

696 /*
697  * Run through a list for as long as possible.  Returns the count
698  * of successful jobs.
699  */
700 static int process_jobs(struct list_head *jobs,
701                         int (*fn) (struct kcached_job *))
702 {
703     struct kcached_job *job;
704     int r, count = 0;
705 
706     while ((job = pop(jobs))) {
707         r = fn(job);
708 
709         if (r < 0) {
710             /* error this rogue job */
711             DMERR("process_jobs: Job processing error");
712         }
713 
714         if (r > 0) {
715             /*
716              * We couldn’t service this job ATM, so
717              * push this job back onto the list.
718              */
719             push(jobs, job);
720             break;
721         }
722 
723         count++;
724     }
725 
726     return count;
727 }

它依次遍历链表, 调用回调.

回到queue_job(), 前面说过, 因为dm-cache读写SSD及磁盘的粒度为cacheblock大小, 所以如果bio请求未对其cacheblock, 或请求大小不等于cacheblock大小, 则需要为该cacheblock中, bio不关心的前后部分分配页面, 即把job提交到_pages_jobs链表. 否则, 直接提交到_io_jobs链表.

_pages_jobs链表的回调函数do_pages非常简单, 它从页面内存池获取一些页面(页面数为nr_pages), 保存在kcached_job结构的pages字段, 然后将job提交到_io_jobs链表.

针对_io_jobs链表上的任务, do_work()将以do_io回调来处理该任务.

618 static int do_io(struct kcached_job *job)
619 {
620     int r = 0;
621 
622     if (job->rw == READ) { /* Read from source device */
623         r = do_fetch(job);
624     } else { /* Write to cache device */
625         r = do_store(job);
626     }
627 
628     return r;
629 }

针对读请求, 很明显是进入do_fetch()分支.

400 /*
401  * Fetch data from the source device asynchronously.
402  * For a READ bio, if a cache block is larger than the requested data, then
403  * additional data are prefetched. Larger cache block size enables more
404  * aggressive read prefetching, which is useful for read-mostly usage.
405  * For a WRITE bio, if a cache block is larger than the requested data, the
406  * entire block needs to be fetched, and larger block size incurs more overhead.
407  * In scenaros where writes are frequent, 4KB is a good cache block size.
408  */
409 static int do_fetch(struct kcached_job *job)
410 {
411     int r = 0, i, j;
412     struct bio *bio = job->bio;
413     struct cache_c *dmc = job->dmc;
414     unsigned int offset, head, tail, remaining, nr_vecs, idx = 0;
415     struct bio_vec *bvec;
416     struct page_list *pl;
417     printk("do_fetch");
418     offset = (unsigned int) (bio->bi_sector & dmc->block_mask);
419     head = to_bytes(offset);
420     tail = to_bytes(dmc->block_size) – bio->bi_size – head;
425 
426     if (bio_data_dir(bio) == READ) { /* The original request is a READ */
427         if (0 == job->nr_pages) { /* The request is aligned to cache block */
428             r = dm_io_async_bvec(1, &job->src, READ,
429                                  bio->bi_io_vec + bio->bi_idx,
430                                  io_callback, job);
431             return r;
432         }
433 
434         nr_vecs = bio->bi_vcnt – bio->bi_idx + job->nr_pages;
435         bvec = kmalloc(nr_vecs * sizeof(*bvec), GFP_NOIO);
436         if (!bvec) {
437             DMERR("do_fetch: No memory");
438             return 1;
439         }
440 
441         pl = job->pages;
442         i = 0;
443         while (head) {
444             bvec[i].bv_len = min(head, (unsigned int)PAGE_SIZE);
445             bvec[i].bv_offset = 0;
446             bvec[i].bv_page = pl->page;
447             head -= bvec[i].bv_len;
448             pl = pl->next;
449             i++;
450         }
451 
452         remaining = bio->bi_size;
453         j = bio->bi_idx;
454         while (remaining) {
455             bvec[i] = bio->bi_io_vec[j];
456             remaining -= bvec[i].bv_len;
457             i++; j++;
458         }
459 
460         while (tail) {
461             bvec[i].bv_len = min(tail, (unsigned int)PAGE_SIZE);
462             bvec[i].bv_offset = 0;
463             bvec[i].bv_page = pl->page;
464             tail -= bvec[i].bv_len;
465             pl = pl->next;
466             i++;
467         }
468 
469         job->bvec = bvec;
470         r = dm_io_async_bvec(1, &job->src, READ, job->bvec, io_callback, job);
471         return r;
472     } else { /* The original request is a WRITE */
541     }
542 }

-如果任务没有申请页面, 即bio请求正好cacheblock对齐且请求大小正好为一个cacheblock, 则直接调用dm_io_async_bvec().
-如果任务申请了页面, 即bio请求不是cacheblock对齐, 或者请求大小不是一个cacheblock, 则通过第434~467行代码主动构造一个bio_vec *bvec, 保存在job->bvec中, 然后调用dm_io_async_bvec().

仔细比较上述两种情况调用dm_io_async_bvec()所传递的参数, 不难发现, 只有第4个参数是不一样的. 前者传递的为原来请求的bio的bvec, 后者传递的为主动构造的bvec.

dm_io_async_bvec()函数提交IO, 从磁盘(job->src)中读取数据到第4个参数, 然后回调io_callback().

382 static void io_callback(unsigned long error, void *context)
383 {
384     struct kcached_job *job = (struct kcached_job *) context;
385 
386     if (error) {
387         /* TODO */
388         DMERR("io_callback: io error");
389         return;
390     }
391 
392     if (job->rw == READ) {
393         job->rw = WRITE;
394         push(&_io_jobs, job);
395     } else
396         push(&_complete_jobs, job);
397     wake();
398 }

读请求的job->rw为READ, 将其修改为WRITE后将job提交到_io_jobs链表. _io_jobs链表元素再次由do_work()以do_io()回调. 此时, 因为job->rw为WRITE, 所以调用的函数变成了do_store().

544 /*
545  * Store data to the cache source device asynchronously.
546  * For a READ bio request, the data fetched from the source device are returned
547  * to kernel and stored in cache at the same time.
548  * For a WRITE bio request, the data are written to the cache and source device
549  * at the same time.
550  */
551 static int do_store(struct kcached_job *job)
552 {
553     int i, j, r = 0;
554     struct bio *bio = job->bio ;
555     struct cache_c *dmc = job->dmc;
556     unsigned int offset, head, tail, remaining, nr_vecs;
557     struct bio_vec *bvec;
558     offset = (unsigned int) (bio->bi_sector & dmc->block_mask);
559     head = to_bytes(offset);
560     tail = to_bytes(dmc->block_size) – bio->bi_size – head;
566 
567     if (0 == job->nr_pages) /* Original request is aligned with cache blocks */
568         r = dm_io_async_bvec(1, &job->dest, WRITE, bio->bi_io_vec + bio->bi_idx,
569                              io_callback, job);
570     else {
571         if (bio_data_dir(bio) == WRITE && head > 0 && tail > 0) {
573             nr_vecs = job->nr_pages + bio->bi_vcnt – bio->bi_idx;
574             if (offset && (offset + bio->bi_size < PAGE_SIZE)) nr_vecs++;
576             bvec = kmalloc(nr_vecs * sizeof(*bvec), GFP_KERNEL);
577             if (!bvec) {
578                 DMERR("do_store: No memory");
579                 return 1;
580             }
581 
582             i = 0;
583             while (head) {
584                 bvec[i].bv_len = min(head, job->bvec[i].bv_len);
585                 bvec[i].bv_offset = 0;
586                 bvec[i].bv_page = job->bvec[i].bv_page;
587                 head -= bvec[i].bv_len;
588                 i++;
589             }
590             remaining = bio->bi_size;
591             j = bio->bi_idx;
592             while (remaining) {
593                 bvec[i] = bio->bi_io_vec[j];
594                 remaining -= bvec[i].bv_len;
595                 i++; j++;
596             }
597             j = (to_bytes(offset) + bio->bi_size) / PAGE_SIZE;
598             bvec[i].bv_offset = (to_bytes(offset) + bio->bi_size) -
599                                 j * PAGE_SIZE;
600             bvec[i].bv_len = PAGE_SIZE – bvec[i].bv_offset;
601             bvec[i].bv_page = job->bvec[j].bv_page;
602             tail -= bvec[i].bv_len;
603             i++; j++;
604             while (tail) {
605                 bvec[i] = job->bvec[j];
606                 tail -= bvec[i].bv_len;
607                 i++; j++;
608             }
609             kfree(job->bvec);
610             job->bvec = bvec;
611         }
612 
613         r = dm_io_async_bvec(1, &job->dest, WRITE, job->bvec, io_callback, job);
614     }
615     return r;
616 }

这段代码和do_fetch()非常相像, 不再细述. 它把do_fetch()中从磁盘读取的数据, 通过dm_io_async_bvec()函数, 写入SSD(job->dest). 然后io_callback()再次被回调. 此时, 因为job->rw为WRITE, io_callback()将任务提交到_complete_jobs链表. 该链表对应的回调函数为do_complete():

673 static int do_complete(struct kcached_job *job)
674 {
675     int r = 0;
676     struct bio *bio = job->bio;
677 
680     bio_endio(bio, 0);
681 
682     if (job->nr_pages > 0) {
683         kfree(job->bvec);
684         kcached_put_pages(job->dmc, job->pages);
685     }
686 
687     flush_bios(job->cacheblock);
688     mempool_free(job, _job_pool);
689 
690     if (atomic_dec_and_test(&job->dmc->nr_jobs))
691         wake_up(&job->dmc->destroyq);
692 
693     return r;
694 }

do_complete()首先调用bio_endio(), 告诉IO子系统上层, 当前bio已经处理完成. 然后释放页面. 之后调用flush_bios()重新提交在当前bio之后所有发往同个cacheblock的bios, 最后释放job.

至此, 读请求完成. 写请求与读请求大同小异, 不表.

总结陈词

IO处理内核化是一种有效的IO优化方式. 另外, IO路径网络化(iSCSI)也是大势所趋, 如Amazon的EBS及腾讯的CBS(入门参考块存储的世界). 希望以后一窥究竟.