Linux dm-cache: A Walk Through the dm-cache Source

I recently wanted to study the Linux I/O subsystem and started with the flashcache code. Through the kernel's Device Mapper mechanism, flashcache presents one SSD and one ordinary disk as a single virtual block device: the SSD serves as a cache, while data ultimately lands on the ordinary disk. This hybrid-storage scheme hides the underlying implementation from upper-layer applications (such as MySQL); all they see is a filesystem mounted on a virtual block device, read and written through the usual filesystem interfaces, preserving compatibility while delivering good performance. flashcache is only a few thousand lines of code, but its commit log shows frequent iteration, which has pulled in many new features I personally don't care about. The flashcache source notes that it borrowed from dm-cache, so I looked dm-cache up: it turns out to be written by a Chinese author, is under two thousand SLOC, and can be read in a single evening, which suited me perfectly. For how to use dm-cache, see the flashcache documentation; for the underlying principles, see the flashcache design write-up.

In-Memory Structures

The idea behind dm-cache is very simple: it uses the SSD as a cache and persists data to an ordinary disk. The SSD cache is organized as a set-associative map, much like a CPU cache, except that the key here is the cacheblock number. A cacheblock is the unit dm-cache introduces for convenient data access, at a granularity above the disk block. When creating a dm-cache block device with dmsetup, the cacheblock size can be specified; by default 8 consecutive disk sectors form one cacheblock, i.e. 4 KB. Upper-layer I/O requests are split by the Device Mapper framework into cacheblock-sized (and cacheblock-aligned) bios, which are then handed to dm-cache. In other words, whether it is talking to the SSD or the ordinary disk, dm-cache handles I/O in units of cacheblocks. Its in-memory metadata is:

117 /* Cache block metadata structure */
118 struct cacheblock {
119     spinlock_t lock;       /* Lock to protect operations on the bio list */
120     sector_t block;        /* Sector number of the cached block */
121     unsigned short state;  /* State of a block */
122     unsigned long counter; /* Logical timestamp of the block's last access */
123     struct bio_list bios;  /* List of pending bios */
124 };

The block field is the starting sector number of the current cacheblock. Since the SSD serves as a cache, write requests inevitably allow several policies such as writeback and writethrough. With writeback, data is first written to the SSD and later written back to disk by a background thread at a suitable time. With writethrough, data is written to the disk and the SSD simultaneously. (flashcache adds a writearound mode on top of these: the SSD cache is bypassed and data is written straight to disk, with the SSD updated when reads are served.) Whether writeback or writethrough, writing data to disk (or reading data from disk to update the cache) cannot happen in one shot, so every cacheblock needs a state (the state field). The cache also evicts entries; dm-cache supports FIFO and LRU eviction, so each cacheblock records its last access time (the counter field). Finally, to serialize concurrent requests to the same cacheblock, each cacheblock carries a spinlock. Requests that arrive while the block is busy are recorded on the bios list, and once the in-flight operation on the cacheblock finishes, dm-cache resubmits the bios on that list.
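The cacheblock geometry described above (default 8 sectors, i.e. 4 KB) reduces to a few shift/mask operations. A minimal user-space sketch, with names mirroring the cache_c fields introduced below; this is illustrative, not kernel code:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* Default geometry: 8 sectors of 512 B per cacheblock = 4 KB. */
static const unsigned block_size  = 8;  /* sectors per cacheblock */
static const unsigned block_shift = 3;  /* log2(block_size) */
static const unsigned block_mask  = 7;  /* block_size - 1 */

/* Sector offset of `sector` within its cacheblock. */
static sector_t block_offset(sector_t sector) { return sector & block_mask; }

/* First sector of the cacheblock containing `sector`. */
static sector_t block_start(sector_t sector) { return sector - block_offset(sector); }
```

These are exactly the two expressions cache_map() evaluates on bio->bi_sector.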

Next, let's look at dm-cache's central control structure, cache_c:

80 /*
81  * Cache context
82  */
83 struct cache_c {
84     struct dm_dev *src_dev;              /* Source device */
85     struct dm_dev *cache_dev;            /* Cache device */
86     struct dm_kcopyd_client *kcp_client; /* Kcopyd client for writing back data */
87
88     struct cacheblock *cache;       /* Hash table for cache blocks */
89     sector_t size;                  /* Cache size */
90     unsigned int bits;              /* Cache size in bits */
91     unsigned int assoc;             /* Cache associativity */
92     unsigned int block_size;        /* Cache block size */
93     unsigned int block_shift;       /* Cache block size in bits */
94     unsigned int block_mask;        /* Cache block mask */
95     unsigned int consecutive_shift; /* Consecutive blocks size in bits */
96     unsigned long counter;          /* Logical timestamp of last access */
97     unsigned int write_policy;      /* Cache write policy */
98     sector_t dirty_blocks;          /* Number of dirty blocks */
99
100    spinlock_t lock;                /* Lock to protect page allocation/deallocation */
101    struct page_list *pages;        /* Pages for I/O */
102    unsigned int nr_pages;          /* Number of pages */
103    unsigned int nr_free_pages;     /* Number of free pages */
104    wait_queue_head_t destroyq;     /* Wait queue for I/O completion */
105    atomic_t nr_jobs;               /* Number of I/O jobs */
106    struct dm_io_client *io_client; /* Client memory pool */
107
108    /* Stats */
109    unsigned long reads;            /* Number of reads */
110    unsigned long writes;           /* Number of writes */
111    unsigned long cache_hits;       /* Number of cache hits */
112    unsigned long replace;          /* Number of cache replacements */
113    unsigned long writeback;        /* Number of replaced dirty blocks */
114    unsigned long dirty;            /* Number of submitted dirty blocks */
115 };

Here src_dev and cache_dev are the DM-framework abstractions of the disk and the SSD respectively. The cache field is a contiguous array of cacheblocks whose element count is the size field. The remaining fields are self-explanatory.
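The flat cache array plus the assoc field is all a set-associative map needs: the array is treated as size / assoc sets of assoc ways each. A user-space sketch of the lookup shape; the set hash below (a simple modulo) is an assumption for illustration, not dm-cache's actual cache_lookup() hashing:

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t sector_t;

/* Simplified stand-in for struct cacheblock: just the key and a valid bit. */
struct entry { sector_t block; int valid; };

/* Find `block` within its set; return its index in `cache`, or -1 on miss.
 * `size` is the total entry count, `assoc` the ways per set.
 * The modulo set hash is illustrative only. */
static long lookup(struct entry *cache, size_t size, unsigned assoc,
                   sector_t block)
{
    size_t nr_sets = size / assoc;
    size_t set = (size_t)(block % nr_sets);
    for (size_t i = 0; i < assoc; i++) {
        size_t idx = set * assoc + i;       /* sets are contiguous runs */
        if (cache[idx].valid && cache[idx].block == block)
            return (long)idx;
    }
    return -1;
}
```

On a miss, the real code additionally picks a victim within the same set by FIFO or LRU, using the per-block counter field.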

Initialization

dm-cache's initialization code is relatively simple. The DM framework takes the dmsetup parameters and passes them to cache_ctr(), in which dm-cache builds a cache_c object and stores it in dm_target.private. Another important field of dm_target is split_io, which sets the granularity at which the DM framework splits bios; cache_ctr() sets it to the cacheblock size.

On the kernel I/O path, upper-layer read and write requests are represented as bios. For a bio aimed at a block device virtualized by the Device Mapper framework, DM uses the bio's block number to find the dm_targets it belongs to (one bio may span multiple dm_targets) and calls each one's dm_target.type->map in turn. That field is a function pointer, registered as cache_map() by the module's init function dm_cache_init() when dm-cache is loaded into the kernel. In other words, cache_map() is the entry point for both reads and writes.

Request Handling

As noted above, cache_map() is the entry point for all read and write requests. Its implementation:

1202 /*
1203  * Decide the mapping and perform necessary cache operations for a bio request.
1204  */
1205 static int cache_map(struct dm_target *ti, struct bio *bio,
1206                      union map_info *map_context)
1207 {
1208     struct cache_c *dmc = (struct cache_c *) ti->private;
1209     sector_t request_block, cache_block = 0, offset;
1210     int res;
1211
1212     offset = bio->bi_sector & dmc->block_mask;
1213     request_block = bio->bi_sector - offset;
1214
1220     if (bio_data_dir(bio) == READ) dmc->reads++;
1221     else dmc->writes++;
1222
1223     res = cache_lookup(dmc, request_block, &cache_block);
1224     if (1 == res)  /* Cache hit; serve request from cache */
1225         return cache_hit(dmc, bio, cache_block);
1226     else if (0 == res) /* Cache miss; replacement block is found */
1227         return cache_miss(dmc, bio, cache_block);
1228     else if (2 == res) { /* Entire cache set is dirty; initiate a write-back */
1229         write_back(dmc, cache_block, 1);
1230         dmc->writeback++;
1231     }
1232
1233     /* Forward to source device */
1234     bio->bi_bdev = dmc->src_dev->bdev;
1235
1236     return 1;
1237 }

The function first retrieves cache_c *dmc from ti->private, the object built in cache_ctr(). It then computes the starting sector number of the cacheblock containing the bio's first sector (bio->bi_sector) and stores it in request_block. Next it looks request_block up in dmc->cache via cache_lookup(), using request_block as the key. cache_lookup() itself is fairly simple and is not covered here.

If the lookup fails, we enter cache_miss(). Its last argument, cache_block, is the number of the victim cacheblock that cache_lookup() selected through the eviction policy.

1189 /* Handle cache misses */
1190 static int cache_miss(struct cache_c *dmc, struct bio* bio, sector_t cache_block) {
1191     if (bio_data_dir(bio) == READ)
1192         return cache_read_miss(dmc, bio, cache_block);
1193     else
1194         return cache_write_miss(dmc, bio, cache_block);
1195 }

cache_miss() checks whether the bio is a read or a write, calling cache_read_miss() for reads and cache_write_miss() for writes.

For brevity, we will only follow the case of a read request that misses the cache, where cache_read_miss() is invoked.

1073 /*
1074  * Handle a read cache miss:
1075  *  Update the metadata; fetch the necessary block from source device;
1076  *  store data to cache device.
1077  */
1078 static int cache_read_miss(struct cache_c *dmc, struct bio* bio,
1079                            sector_t cache_block) {
1080     struct cacheblock *cache = dmc->cache;
1081     unsigned int offset, head, tail;
1082     struct kcached_job *job;
1083     sector_t request_block, left;
1084
1085     offset = (unsigned int)(bio->bi_sector & dmc->block_mask);
1086     request_block = bio->bi_sector - offset;
1087
1095     cache_insert(dmc, request_block, cache_block); /* Update metadata first */
1096
1097     job = new_kcached_job(dmc, bio, request_block, cache_block);
1098
1099     head = to_bytes(offset);
1100
1101     left = (dmc->src_dev->bdev->bd_inode->i_size>>9) - request_block;
1102     if (left < dmc->block_size) {
1103         tail = to_bytes(left) - bio->bi_size - head;
1104         job->src.count = left;
1105         job->dest.count = left;
1106     } else
1107         tail = to_bytes(dmc->block_size) - bio->bi_size - head;
1108
1109     /* Requested block is aligned with a cache block */
1110     if (0 == head && 0 == tail)
1111         job->nr_pages= 0;
1112     else /* Need new pages to store extra data */
1113         job->nr_pages = dm_div_up(head, PAGE_SIZE) + dm_div_up(tail, PAGE_SIZE);
1114     job->rw = READ; /* Fetch data from the source device */
1115
1118     queue_job(job);
1119
1120     return 0;
1121 }

The function first calls cache_insert() to update the cache, setting the cacheblock's state to RESERVED, then calls new_kcached_job() to allocate a kcached_job object. Lines 1099 to 1113 are the core. As described earlier, the upper layer's bio has already been split by the DM framework on cacheblock boundaries, so each bio that cache_map() handles spans at most one cacheblock, as the figure below illustrates. Line 1099 computes the bio's byte offset within the cacheblock and stores it in head. Line 1101 computes how many sectors lie between sector request_block and the last sector of the disk. Line 1103 (or 1107) computes the byte distance from the last byte the bio requests to the last byte of the disk (or to the next cacheblock). At line 1110, if head == 0 and tail == 0, the bio exactly covers the whole cacheblock; otherwise the bio covers only part of it, and pages must be allocated for the leading and trailing parts the bio does not touch, because dm-cache issues requests to the disk in units of a full cacheblock. Line 1114 sets the job's direction to READ, and line 1118 submits the job.
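The head/tail arithmetic is easy to check in isolation. A user-space sketch of the non-truncated case (line 1107), where to_bytes() converts sectors to bytes, i.e. a left shift by 9; values assume the default 4 KB cacheblock and are illustrative:

```c
#include <stdint.h>

typedef uint64_t sector_t;

static uint64_t to_bytes(sector_t s) { return s << 9; }  /* 512 B sectors */

/* Bytes in the cacheblock before the bio's data (line 1099). */
static uint64_t head_bytes(sector_t bi_sector, unsigned block_mask)
{
    return to_bytes(bi_sector & block_mask);
}

/* Bytes in the cacheblock after the bio's data (line 1107). */
static uint64_t tail_bytes(unsigned block_size, uint64_t bi_size,
                           uint64_t head)
{
    return to_bytes(block_size) - bi_size - head;
}
```

For example, a 1 KB bio starting 2 sectors into an 8-sector cacheblock gets head = 1024 and tail = 4096 - 1024 - 1024 = 2048, so padding pages are needed on both sides.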

Going back to line 1097 of cache_read_miss(), the job is allocated by new_kcached_job(). Its third argument, request_block, is the bio's starting sector number on the disk; its fourth argument, cache_block, is the index of the bio's cacheblock on the SSD (converted to a starting sector with << dmc->block_shift at line 1060).

1049 static struct kcached_job *new_kcached_job(struct cache_c *dmc, struct bio* bio,
1050                                            sector_t request_block,
1051                                            sector_t cache_block)
1052 {
1053     struct dm_io_region src, dest;
1054     struct kcached_job *job;
1055
1056     src.bdev = dmc->src_dev->bdev;
1057     src.sector = request_block;
1058     src.count = dmc->block_size;
1059     dest.bdev = dmc->cache_dev->bdev;
1060     dest.sector = cache_block << dmc->block_shift;
1061     dest.count = src.count;
1062
1063     job = mempool_alloc(_job_pool, GFP_NOIO);
1064     job->dmc = dmc;
1065     job->bio = bio;
1066     job->src = src;
1067     job->dest = dest;
1068     job->cacheblock = &dmc->cache[cache_block];
1069
1070     return job;
1071 }

Next, the definition of the job structure:

126 /* Structure for a kcached job */
127 struct kcached_job {
128     struct list_head list;
129     struct cache_c *dmc;
130     struct bio *bio;   /* Original bio */
131     struct dm_io_region src;
132     struct dm_io_region dest;
133     struct cacheblock *cacheblock;
134     int rw;
135     /*
136      * When the original bio is not aligned with cache blocks,
137      * we need extra bvecs and pages for padding.
138      */
139     struct bio_vec *bvec;
140     unsigned int nr_pages;
141     struct page_list *pages;
142 };

In dm-cache a job can be in one of three states, and its list field links it into the corresponding list: _io_jobs, _pages_jobs, or _complete_jobs. _io_jobs holds jobs waiting to perform I/O, _pages_jobs holds jobs waiting for page allocation, and _complete_jobs holds jobs awaiting cleanup. The bio field stores the bio the DM framework handed to cache_map(); src and dest point at the dm_io_regions on the disk and the SSD respectively. The cacheblock pointer points into the cache_c.cache array, at the index of the SSD cacheblock the requested bio falls on. The rw field records the job's current I/O direction.

Back in cache_read_miss(), line 1118 submits the job via queue_job():

736 static void queue_job(struct kcached_job *job)
737 {
738     atomic_inc(&job->dmc->nr_jobs);
739     if (job->nr_pages > 0) /* Request pages */
740         push(&_pages_jobs, job);
741     else /* Go ahead to do I/O */
742         push(&_io_jobs, job);
743     wake();
744 }

As we can see, if the job still needs pages, it is linked onto _pages_jobs; otherwise it goes onto _io_jobs. Then wake() is called:

299 static inline void wake(void)
300 {
301     queue_work(_kcached_wq, &_kcached_work);
302 }

wake() is just a wrapper around queue_work(): it submits _kcached_work to _kcached_wq. In the module init function dm_cache_init(), _kcached_work was registered with do_work as its callback, so do_work() runs whenever _kcached_work is scheduled.

729 static void do_work(struct work_struct *ignored)
730 {
731     process_jobs(&_complete_jobs, do_complete);
732     process_jobs(&_pages_jobs, do_pages);
733     process_jobs(&_io_jobs, do_io);
734 }

do_work() walks the jobs on _complete_jobs, _pages_jobs and _io_jobs in turn, invoking do_complete, do_pages and do_io on each. The order matters: _complete_jobs is processed first because finishing those jobs may return pages to the page pool; _pages_jobs comes next because those jobs can only do I/O after obtaining pages, which they take from that same pool; _io_jobs is processed last.

The code of process_jobs() is as follows:

696 /*
697  * Run through a list for as long as possible.  Returns the count
698  * of successful jobs.
699  */
700 static int process_jobs(struct list_head *jobs,
701                         int (*fn) (struct kcached_job *))
702 {
703     struct kcached_job *job;
704     int r, count = 0;
705
706     while ((job = pop(jobs))) {
707         r = fn(job);
708
709         if (r < 0) {
710             /* error this rogue job */
711             DMERR("process_jobs: Job processing error");
712         }
713
714         if (r > 0) {
715             /*
716              * We couldn't service this job ATM, so
717              * push this job back onto the list.
718              */
719             push(jobs, job);
720             break;
721         }
722
723         count++;
724     }
725
726     return count;
727 }

It simply walks the list, invoking the callback on each job.
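The return-value protocol is worth spelling out: the handler returns a negative value on error (the job is logged and consumed), 0 on success, and a positive value for "can't service this right now", in which case the job is pushed back and the loop stops. A user-space sketch with an array in place of the kernel's linked list; the try_job handler is a made-up stand-in for do_io and friends:

```c
#include <stddef.h>

/* Drain a FIFO of integer "jobs" under the process_jobs() protocol:
 * fn < 0: error, job consumed; fn == 0: success; fn > 0: push back, stop.
 * Returns the count of jobs consumed without a retry. */
static int drain(int *jobs, size_t *n, int (*fn)(int job))
{
    int count = 0;
    while (*n > 0) {
        int job = jobs[0];                            /* pop front */
        for (size_t i = 1; i < *n; i++) jobs[i - 1] = jobs[i];
        (*n)--;
        int r = fn(job);
        if (r > 0) {                                  /* push back and stop */
            jobs[(*n)++] = job;
            break;
        }
        count++;                                      /* consumed (ok or error) */
    }
    return count;
}

/* Hypothetical handler: job 3 cannot be serviced yet. */
static int try_job(int job) { return job == 3 ? 1 : 0; }
```

Leaving the unserviceable job at the front and stopping preserves ordering: it will be retried on the next do_work() pass.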

Back to queue_job(). As mentioned, dm-cache reads and writes the SSD and the disk at cacheblock granularity, so if the bio is not cacheblock-aligned, or its size is less than one cacheblock, pages must be allocated for the leading and trailing parts of the cacheblock the bio does not cover: the job is pushed onto _pages_jobs. Otherwise it goes straight onto _io_jobs.

The _pages_jobs callback do_pages is trivial: it takes nr_pages pages from the page pool, stores them in the job's pages field, and moves the job onto _io_jobs.

For jobs on _io_jobs, do_work() invokes do_io:

618 static int do_io(struct kcached_job *job)
619 {
620     int r = 0;
621
622     if (job->rw == READ) { /* Read from source device */
623         r = do_fetch(job);
624     } else { /* Write to cache device */
625         r = do_store(job);
626     }
627
628     return r;
629 }

For a read request we clearly take the do_fetch() branch.

400 /*
401  * Fetch data from the source device asynchronously.
402  * For a READ bio, if a cache block is larger than the requested data, then
403  * additional data are prefetched. Larger cache block size enables more
404  * aggressive read prefetching, which is useful for read-mostly usage.
405  * For a WRITE bio, if a cache block is larger than the requested data, the
406  * entire block needs to be fetched, and larger block size incurs more overhead.
407  * In scenarios where writes are frequent, 4KB is a good cache block size.
408  */
409 static int do_fetch(struct kcached_job *job)
410 {
411     int r = 0, i, j;
412     struct bio *bio = job->bio;
413     struct cache_c *dmc = job->dmc;
414     unsigned int offset, head, tail, remaining, nr_vecs, idx = 0;
415     struct bio_vec *bvec;
416     struct page_list *pl;
417     printk("do_fetch");
418     offset = (unsigned int) (bio->bi_sector & dmc->block_mask);
419     head = to_bytes(offset);
420     tail = to_bytes(dmc->block_size) - bio->bi_size - head;
425
426     if (bio_data_dir(bio) == READ) { /* The original request is a READ */
427         if (0 == job->nr_pages) { /* The request is aligned to cache block */
428             r = dm_io_async_bvec(1, &job->src, READ,
429                                  bio->bi_io_vec + bio->bi_idx,
430                                  io_callback, job);
431             return r;
432         }
433
434         nr_vecs = bio->bi_vcnt - bio->bi_idx + job->nr_pages;
435         bvec = kmalloc(nr_vecs * sizeof(*bvec), GFP_NOIO);
436         if (!bvec) {
437             DMERR("do_fetch: No memory");
438             return 1;
439         }
440
441         pl = job->pages;
442         i = 0;
443         while (head) {
444             bvec[i].bv_len = min(head, (unsigned int)PAGE_SIZE);
445             bvec[i].bv_offset = 0;
446             bvec[i].bv_page = pl->page;
447             head -= bvec[i].bv_len;
448             pl = pl->next;
449             i++;
450         }
451
452         remaining = bio->bi_size;
453         j = bio->bi_idx;
454         while (remaining) {
455             bvec[i] = bio->bi_io_vec[j];
456             remaining -= bvec[i].bv_len;
457             i++; j++;
458         }
459
460         while (tail) {
461             bvec[i].bv_len = min(tail, (unsigned int)PAGE_SIZE);
462             bvec[i].bv_offset = 0;
463             bvec[i].bv_page = pl->page;
464             tail -= bvec[i].bv_len;
465             pl = pl->next;
466             i++;
467         }
468
469         job->bvec = bvec;
470         r = dm_io_async_bvec(1, &job->src, READ, job->bvec, io_callback, job);
471         return r;
472     } else { /* The original request is a WRITE */
541     }
542 }

- If the job requested no pages, i.e. the bio is cacheblock-aligned and exactly one cacheblock in size, dm_io_async_bvec() is called directly.

- If the job did request pages, i.e. the bio is not cacheblock-aligned or is smaller than one cacheblock, lines 434 to 467 build a bio_vec *bvec by hand, store it in job->bvec, and then call dm_io_async_bvec().

Comparing the dm_io_async_bvec() calls in the two cases, only the fourth argument differs: the former passes the original bio's bvec, the latter the hand-built one.
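The layout of the hand-built vector is always head padding pages, then the bio's own segments, then tail padding pages, and the segment lengths must sum to exactly one cacheblock. A user-space sketch that computes only the lengths (the real code also wires each entry to a page); PAGE_SZ and the single-segment bio are simplifying assumptions:

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SZ 4096u

/* Fill `lens` with the padded-vector segment lengths do_fetch() builds:
 * head padding (PAGE_SZ-sized chunks), the bio's data (collapsed here to
 * one segment for simplicity), then tail padding. Returns segment count. */
static size_t build_lens(uint32_t head, uint32_t bio_bytes, uint32_t tail,
                         uint32_t *lens, size_t max)
{
    size_t i = 0;
    while (head && i < max) {                 /* pad before the bio */
        lens[i] = head < PAGE_SZ ? head : PAGE_SZ;
        head -= lens[i]; i++;
    }
    if (bio_bytes && i < max) {               /* the bio's own data */
        lens[i] = bio_bytes; i++;
    }
    while (tail && i < max) {                 /* pad after the bio */
        lens[i] = tail < PAGE_SZ ? tail : PAGE_SZ;
        tail -= lens[i]; i++;
    }
    return i;
}
```

With head = 1024, a 1024-byte bio and tail = 2048, this yields three segments summing to the 4096-byte cacheblock, mirroring the three while-loops in do_fetch().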

dm_io_async_bvec() submits the I/O, reading data from the disk (job->src) into the vector passed as the fourth argument, then invokes io_callback().

382 static void io_callback(unsigned long error, void *context)
383 {
384     struct kcached_job *job = (struct kcached_job *) context;
385
386     if (error) {
387         /* TODO */
388         DMERR("io_callback: io error");
389         return;
390     }
391
392     if (job->rw == READ) {
393         job->rw = WRITE;
394         push(&_io_jobs, job);
395     } else
396         push(&_complete_jobs, job);
397     wake();
398 }

For a read request, job->rw is READ; it is flipped to WRITE and the job is pushed back onto _io_jobs. The job is then picked up by do_work() via do_io() once more, but since job->rw is now WRITE, the function invoked this time is do_store().

544 /*
545  * Store data to the cache source device asynchronously.
546  * For a READ bio request, the data fetched from the source device are returned
547  * to kernel and stored in cache at the same time.
548  * For a WRITE bio request, the data are written to the cache and source device
549  * at the same time.
550  */
551 static int do_store(struct kcached_job *job)
552 {
553     int i, j, r = 0;
554     struct bio *bio = job->bio;
555     struct cache_c *dmc = job->dmc;
556     unsigned int offset, head, tail, remaining, nr_vecs;
557     struct bio_vec *bvec;
558     offset = (unsigned int) (bio->bi_sector & dmc->block_mask);
559     head = to_bytes(offset);
560     tail = to_bytes(dmc->block_size) - bio->bi_size - head;
566
567     if (0 == job->nr_pages) /* Original request is aligned with cache blocks */
568         r = dm_io_async_bvec(1, &job->dest, WRITE, bio->bi_io_vec + bio->bi_idx,
569                              io_callback, job);
570     else {
571         if (bio_data_dir(bio) == WRITE && head > 0 && tail > 0) {
573             nr_vecs = job->nr_pages + bio->bi_vcnt - bio->bi_idx;
574             if (offset && (offset + bio->bi_size < PAGE_SIZE)) nr_vecs++;
576             bvec = kmalloc(nr_vecs * sizeof(*bvec), GFP_KERNEL);
577             if (!bvec) {
578                 DMERR("do_store: No memory");
579                 return 1;
580             }
581
582             i = 0;
583             while (head) {
584                 bvec[i].bv_len = min(head, job->bvec[i].bv_len);
585                 bvec[i].bv_offset = 0;
586                 bvec[i].bv_page = job->bvec[i].bv_page;
587                 head -= bvec[i].bv_len;
588                 i++;
589             }
590             remaining = bio->bi_size;
591             j = bio->bi_idx;
592             while (remaining) {
593                 bvec[i] = bio->bi_io_vec[j];
594                 remaining -= bvec[i].bv_len;
595                 i++; j++;
596             }
597             j = (to_bytes(offset) + bio->bi_size) / PAGE_SIZE;
598             bvec[i].bv_offset = (to_bytes(offset) + bio->bi_size) -
599                                 j * PAGE_SIZE;
600             bvec[i].bv_len = PAGE_SIZE - bvec[i].bv_offset;
601             bvec[i].bv_page = job->bvec[j].bv_page;
602             tail -= bvec[i].bv_len;
603             i++; j++;
604             while (tail) {
605                 bvec[i] = job->bvec[j];
606                 tail -= bvec[i].bv_len;
607                 i++; j++;
608             }
609             kfree(job->bvec);
610             job->bvec = bvec;
611         }
612
613         r = dm_io_async_bvec(1, &job->dest, WRITE, job->bvec, io_callback, job);
614     }
615     return r;
616 }

This code closely mirrors do_fetch() and is not detailed here. It takes the data that do_fetch() read from disk and writes it, via dm_io_async_bvec(), to the SSD (job->dest). io_callback() then fires again; this time job->rw is WRITE, so the job is pushed onto _complete_jobs, whose callback is do_complete():

673 static int do_complete(struct kcached_job *job)
674 {
675     int r = 0;
676     struct bio *bio = job->bio;
677
680     bio_endio(bio, 0);
681
682     if (job->nr_pages > 0) {
683         kfree(job->bvec);
684         kcached_put_pages(job->dmc, job->pages);
685     }
686
687     flush_bios(job->cacheblock);
688     mempool_free(job, _job_pool);
689
690     if (atomic_dec_and_test(&job->dmc->nr_jobs))
691         wake_up(&job->dmc->destroyq);
692
693     return r;
694 }

do_complete() first calls bio_endio() to tell the upper I/O layers that the bio is finished, then releases the pages. It then calls flush_bios() to resubmit any bios destined for the same cacheblock that were queued behind the current one, and finally frees the job.
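The whole read-miss round trip, fetch from disk, store to SSD, then complete, can be sketched as a tiny state machine mirroring the transition io_callback() makes; the enum and struct names below are invented for illustration:

```c
/* Illustrative sketch of the read-miss pipeline driven by io_callback():
 * a job starts as READ (fetch from disk); on completion it is flipped to
 * WRITE (store to SSD); the second completion retires the job. */
enum rw_dir { RW_READ, RW_WRITE };
enum job_queue { Q_IO, Q_COMPLETE };

struct mini_job { enum rw_dir rw; enum job_queue q; };

static void on_io_done(struct mini_job *job)
{
    if (job->rw == RW_READ) {
        job->rw = RW_WRITE;       /* fetched: now store to the cache */
        job->q = Q_IO;            /* requeue for the WRITE phase */
    } else {
        job->q = Q_COMPLETE;      /* stored: ready for do_complete() */
    }
}
```

Two calls to on_io_done() walk one job through both phases, exactly the two passes a read miss makes through _io_jobs before landing on _complete_jobs.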

That completes the read request. Write requests are largely similar and are not covered here.

Closing Thoughts

Moving I/O handling into the kernel is an effective way to optimize I/O. Putting the I/O path on the network (iSCSI) is also clearly the trend, as in Amazon's EBS and Tencent's CBS (for an introduction, see "The World of Block Storage"). I hope to dig into those someday.
