This section walks through how cached data is written back to disk. Let us introduce the two behind-the-scenes heroes of that flow:
1724/*
1725 * Sync all dirty blocks. We pick off dirty blocks, sort them, merge them with
1726 * any contigous blocks we can within the set and fire off the writes.
1727 */
1728void
1729flashcache_sync_blocks(struct cache_c *dmc)
It syncs every dirty block: it picks the dirty blocks out of the cache, sorts them, merges contiguous ones, and issues the writes to disk.
The second cache-flushing hero:
1004/*
1005 * Clean dirty blocks in this set as needed.
1006 *
1007 * 1) Select the n blocks that we want to clean (choosing whatever policy), sort them.
1008 * 2) Then sweep the entire set looking for other DIRTY blocks that can be tacked onto
1009 * any of these blocks to form larger contigous writes. The idea here is that if you
1010 * are going to do a write anyway, then we might as well opportunistically write out
1011 * any contigous blocks for free (Bob's idea).
1012 */
1013void
1014flashcache_clean_set(struct cache_c *dmc, int set)
It writes the dirty blocks of a single set back to disk. According to the reclaim policy (FIFO or LRU), it selects an appropriate number of dirty blocks and sorts them.
The first function, flashcache_sync_blocks, is mainly called when the flashcache device is about to be removed or when the user explicitly requests a cache flush. It walks through all cache blocks, records every dirty one it sees, and writes them back set by set.
1724/*
1725 * Sync all dirty blocks. We pick off dirty blocks, sort them, merge them with
1726 * any contigous blocks we can within the set and fire off the writes.
1727 */
1728void
1729flashcache_sync_blocks(struct cache_c *dmc)
1730{
1731 unsigned long flags;
1732 int index;
1733 struct dbn_index_pair *writes_list;
1734 int nr_writes;
1735 int i, set;
1736 struct cacheblock *cacheblk;
1737
1738 /*
1739 * If a (fast) removal of this device is in progress, don't kick off
1740 * any more cleanings. This isn't sufficient though. We still need to
1741 * stop cleanings inside flashcache_dirty_writeback_sync() because we could
1742 * have started a device remove after tested this here.
1743 */
1744 if (atomic_read(&dmc->fast_remove_in_prog) || sysctl_flashcache_stop_sync)
1745 return;
1746 writes_list = kmalloc(dmc->assoc * sizeof(struct dbn_index_pair), GFP_NOIO);
1747 if (writes_list == NULL) {
1748 dmc->memory_alloc_errors++;
1749 return;
1750 }
1751 nr_writes = 0;
1752 set = -1;
1753 spin_lock_irqsave(&dmc->cache_spin_lock, flags);
1754 index = dmc->sync_index;
1755 while (index < dmc->size &&
1756 (nr_writes + dmc->clean_inprog) < dmc->max_clean_ios_total) {
1757 VERIFY(nr_writes <= dmc->assoc);
1758 if (((index % dmc->assoc) == 0) && (nr_writes > 0)) {
1759 /*
1760 * Crossing a set, sort/merge all the IOs collected so
1761 * far and issue the writes.
1762 */
1763 VERIFY(set != -1);
1764 flashcache_merge_writes(dmc, writes_list, &nr_writes, set);
1765 spin_unlock_irqrestore(&dmc->cache_spin_lock, flags);
1766 for (i = 0 ; i < nr_writes ; i++)
1767 flashcache_dirty_writeback_sync(dmc, writes_list[i].index);
1768 nr_writes = 0;
1769 set = -1;
1770 spin_lock_irqsave(&dmc->cache_spin_lock, flags);
1771 }
1772 cacheblk = &dmc->cache[index];
1773 if ((cacheblk->cache_state & (DIRTY | BLOCK_IO_INPROG)) == DIRTY) {
1774 cacheblk->cache_state |= DISKWRITEINPROG;
1775 writes_list[nr_writes].dbn = cacheblk->dbn;
1776 writes_list[nr_writes].index = cacheblk - &dmc->cache[0];
1777 set = index / dmc->assoc;
1778 nr_writes++;
1779 }
1780 index++;
1781 }
1782 dmc->sync_index = index;
1783 if (nr_writes > 0) {
1784 VERIFY(set != -1);
1785 flashcache_merge_writes(dmc, writes_list, &nr_writes, set);
1786 spin_unlock_irqrestore(&dmc->cache_spin_lock, flags);
1787 for (i = 0 ; i < nr_writes ; i++)
1788 flashcache_dirty_writeback_sync(dmc, writes_list[i].index);
1789 } else
1790 spin_unlock_irqrestore(&dmc->cache_spin_lock, flags);
1791 kfree(writes_list);
1792}
We start at line 1744, which checks whether a fast removal of the device is in progress; if so, the function returns immediately without flushing anything. Next it allocates a buffer, writes_list, to record the dirty blocks. Each record is a struct dbn_index_pair, a very simple structure with only two fields:
393struct dbn_index_pair {
394 sector_t dbn;
395 int index;
396};
The dbn field records the disk sector the cache block maps to; it is the sort key used before writing back. The other field, index, is the block's index into dmc->cache. At line 1751 the variable nr_writes, which counts the cache blocks gathered for the current batch of disk writes, is initialized to 0. At line 1754, index is loaded from dmc->sync_index, the position where the previous scan stopped, so that the next call to this function resumes from the cache blocks that have not been scanned yet.
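To make the index-to-set arithmetic concrete, here is a tiny standalone sketch. The values of assoc and index are invented for illustration; the expressions simply mirror set = index / dmc->assoc at line 1777 and start_index = set * dmc->assoc in flashcache_merge_writes below.

#include <stdio.h>

/* Illustrative only: how a global cache-block index maps to its set. */
int main(void)
{
	int assoc = 512;               /* blocks per set (dmc->assoc), made-up value */
	int index = 1300;              /* global index into dmc->cache[], made-up value */
	int set = index / assoc;       /* which set the block belongs to */
	int start_index = set * assoc; /* first block of that set */
	int end_index = start_index + assoc;

	printf("index %d -> set %d, set range [%d, %d)\n",
	       index, set, start_index, end_index);
	return 0;
}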
Line 1755 begins a while loop that terminates when either of the following holds:
1) the last cache block has been scanned, or
2) the number of dirty blocks in flight has reached the system-wide limit.
Line 1758 checks whether the scan has just crossed into a new set; if any dirty blocks have been collected by then, they are written to disk at that point. That is also where the dirty blocks are sorted and merged; for a mechanical disk, sorted writes are noticeably faster.
402void
403flashcache_merge_writes(struct cache_c *dmc, struct dbn_index_pair *writes_list,
404 int *nr_writes, int set)
405{
406 int start_index = set * dmc->assoc;
407 int end_index = start_index + dmc->assoc;
408 int old_writes = *nr_writes;
409 int new_inserts = 0;
410 struct dbn_index_pair *set_dirty_list = NULL;
411 int ix, nr_set_dirty;
412
413 if (unlikely(*nr_writes == 0))
414 return;
415 sort(writes_list, *nr_writes, sizeof(struct dbn_index_pair),
416 cmp_dbn, swap_dbn_index_pair);
417 if (sysctl_flashcache_write_merge == 0)
418 return;
419 set_dirty_list = kmalloc(dmc->assoc * sizeof(struct dbn_index_pair), GFP_ATOMIC);
420 if (set_dirty_list == NULL) {
421 dmc->memory_alloc_errors++;
422 goto out;
423 }
424 nr_set_dirty = 0;
425 for (ix = start_index ; ix < end_index ; ix++) {
426 struct cacheblock *cacheblk = &dmc->cache[ix];
427
428 /*
429 * Any DIRTY block in "writes_list" will be marked as
430 * DISKWRITEINPROG already, so we'll skip over those here.
431 */
432 if ((cacheblk->cache_state & (DIRTY | BLOCK_IO_INPROG)) == DIRTY) {
433 set_dirty_list[nr_set_dirty].dbn = cacheblk->dbn;
434 set_dirty_list[nr_set_dirty].index = ix;
435 nr_set_dirty++;
436 }
437 }
438 if (nr_set_dirty == 0)
439 goto out;
Line 415 sorts the list with dbn as the key. The loop at line 425 then sweeps this set looking for dirty cache blocks, and if none are found it exits via line 439. At first this is puzzling: didn't the caller already scan these cache blocks before entering this function? And since that scan was done while holding the spinlock, surely no new dirty block can show up here? Reading the context again clears it up. On the first call dmc->sync_index is 0, so flashcache_merge_writes finds no additional dirty blocks; the same holds whenever the scan starts at the first block of a set. But if the previous pass stopped somewhere in the middle of a set, the next pass resumes from dmc->sync_index, and entering flashcache_merge_writes then amounts to sweeping the whole set once more to pick up the dirty blocks that were skipped.
Since flashcache_merge_writes re-scans the set anyway, couldn't flashcache_sync_blocks save itself the trouble and leave the scanning entirely to it? Not quite: flashcache_merge_writes is gated by the sysctl_flashcache_write_merge switch, and with merging disabled it only sorts and returns. So flashcache_sync_blocks still does its own scan and its own job; whether flashcache_merge_writes merges or not is its business, and all the caller needs to know is that after the call writes_list holds the final list of dirty blocks to write back (a simplified sketch of the merge idea follows).
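The listing above stops at line 439, so the merging loop itself is not shown. The userspace sketch below illustrates the idea of sorting by dbn and then tacking contiguous dirty blocks onto the write list; the values are invented, qsort stands in for the kernel's sort() with cmp_dbn/swap_dbn_index_pair, and none of this is the actual flashcache implementation.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint64_t sector_t;

struct dbn_index_pair {
	sector_t dbn;   /* disk sector the cache block maps to */
	int index;      /* index of the block in the cache array */
};

/* Order writes by ascending disk sector. */
static int cmp_dbn(const void *a, const void *b)
{
	sector_t da = ((const struct dbn_index_pair *)a)->dbn;
	sector_t db = ((const struct dbn_index_pair *)b)->dbn;

	return (da > db) - (da < db);
}

int main(void)
{
	/* Dirty blocks already picked for writeback (unsorted). */
	struct dbn_index_pair writes[8] = {
		{ 4096, 12 }, { 1024, 3 }, { 2048, 7 },
	};
	int nr_writes = 3;
	/* Other DIRTY blocks of the same set, found by the second sweep. */
	struct dbn_index_pair set_dirty[2] = { { 1032, 4 }, { 9000, 20 } };
	int nr_set_dirty = 2;
	sector_t block_sectors = 8;     /* a 4KB cache block spans 8 sectors */
	int i, j;

	qsort(writes, nr_writes, sizeof(writes[0]), cmp_dbn);

	/* If a dirty block sits right before or right after one we must
	 * write anyway, it rides along for free. */
	for (i = 0; i < nr_set_dirty; i++) {
		for (j = 0; j < nr_writes; j++) {
			if (set_dirty[i].dbn == writes[j].dbn + block_sectors ||
			    set_dirty[i].dbn + block_sectors == writes[j].dbn) {
				writes[nr_writes++] = set_dirty[i];
				break;
			}
		}
	}
	qsort(writes, nr_writes, sizeof(writes[0]), cmp_dbn);

	for (i = 0; i < nr_writes; i++)
		printf("dbn %llu (cache index %d)\n",
		       (unsigned long long)writes[i].dbn, writes[i].index);
	return 0;
}

Running it prints the merged, sorted batch: dbn 1024, 1032, 2048 and 4096, with the contiguous block at 1032 picked up for free and the isolated block at 9000 left behind.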
Back in flashcache_sync_blocks, line 1766 writes the dirty blocks out to disk. By now this function should be easy for us, but do we truly understand it? Look at its prototype:
1670static void
1671flashcache_dirty_writeback_sync(struct cache_c *dmc, int index)
From the prototype we can infer that, for a dirty block, the index alone is enough to derive both the source and the destination address, i.e. where the block sits on the SSD and where it belongs on disk. I will not spell out the answer here, since it was already shown in an earlier section; giving it away again would only waste an opportunity for independent thinking.
After the while loop exits, line 1782 records the index of the last cache block visited, so the next pass resumes there. We then check whether the loop ended because the number of in-flight dirty writebacks reached the maximum; if so, some recorded dirty blocks may not have been issued yet, which is why nr_writes is tested and, when it is greater than 0, the remaining dirty blocks are written back.
That completes one round of flushing. If the round ended because the in-flight limit was reached, how does the next call get triggered?
Naturally, the function is invoked again to issue a fresh batch once the earlier writebacks complete; tracing the exact path through the code yourself is more rewarding.
Now for flashcache_clean_set. A search through the code shows it is called from quite a few places, which boil down to:
1) when no usable cache block can be found;
2) when writing to a cache block;
3) when a disk write completes.
1004/*
1005 * Clean dirty blocks in this set as needed.
1006 *
1007 * 1) Select the n blocks that we want to clean (choosing whatever policy), sort them.
1008 * 2) Then sweep the entire set looking for other DIRTY blocks that can be tacked onto
1009 * any of these blocks to form larger contigous writes. The idea here is that if you
1010 * are going to do a write anyway, then we might as well opportunistically write out
1011 * any contigous blocks for free (Bob's idea).
1012 */
1013void
1014flashcache_clean_set(struct cache_c *dmc, int set)
1015{
1016 unsigned long flags;
1017 int to_clean = 0;
1018 struct dbn_index_pair *writes_list;
1019 int nr_writes = 0;
1020 int start_index = set * dmc->assoc;
1021
1022 /*
1023 * If a (fast) removal of this device is in progress, don't kick off
1024 * any more cleanings. This isn't sufficient though. We still need to
1025 * stop cleanings inside flashcache_dirty_writeback() because we could
1026 * have started a device remove after tested this here.
1027 */
1028 if (atomic_read(&dmc->fast_remove_in_prog))
1029 return;
1030 writes_list = kmalloc(dmc->assoc * sizeof(struct dbn_index_pair), GFP_NOIO);
1031 if (unlikely(sysctl_flashcache_error_inject & WRITES_LIST_ALLOC_FAIL)) {
1032 if (writes_list)
1033 kfree(writes_list);
1034 writes_list = NULL;
1035 sysctl_flashcache_error_inject &= ~WRITES_LIST_ALLOC_FAIL;
1036 }
1037 if (writes_list == NULL) {
1038 dmc->memory_alloc_errors++;
1039 return;
1040 }
1041 dmc->clean_set_calls++;
1042 spin_lock_irqsave(&dmc->cache_spin_lock, flags);
1043 if (dmc->cache_sets[set].nr_dirty < dmc->dirty_thresh_set) {
1044 dmc->clean_set_less_dirty++;
1045 spin_unlock_irqrestore(&dmc->cache_spin_lock, flags);
1046 kfree(writes_list);
1047 return;
1048 } else
1049 to_clean = dmc->cache_sets[set].nr_dirty - dmc->dirty_thresh_set;
1050 if (sysctl_flashcache_reclaim_policy == FLASHCACHE_FIFO) {
1051 int i, scanned;
1052 int start_index, end_index;
1053
1054 start_index = set * dmc->assoc;
1055 end_index = start_index + dmc->assoc;
1056 scanned = 0;
1057 i = dmc->cache_sets[set].set_clean_next;
1058 DPRINTK("flashcache_clean_set: Set %d", set);
1059 while (scanned < dmc->assoc &&
1060 ((dmc->cache_sets[set].clean_inprog + nr_writes) < dmc->max_clean_ios_set) &&
1061 ((nr_writes + dmc->clean_inprog) < dmc->max_clean_ios_total) &&
1062 nr_writes < to_clean) {
1063 if ((dmc->cache[i].cache_state & (DIRTY | BLOCK_IO_INPROG)) == DIRTY) {
1064 dmc->cache[i].cache_state |= DISKWRITEINPROG;
1065 writes_list[nr_writes].dbn = dmc->cache[i].dbn;
1066 writes_list[nr_writes].index = i;
1067 nr_writes++;
1068 }
1069 scanned++;
1070 i++;
1071 if (i == end_index)
1072 i = start_index;
1073 }
1074 dmc->cache_sets[set].set_clean_next = i;
1075 } else { /* flashcache_reclaim_policy == FLASHCACHE_LRU */
1076 struct cacheblock *cacheblk;
1077 int lru_rel_index;
1078
1079 lru_rel_index = dmc->cache_sets[set].lru_head;
1080 while (lru_rel_index != FLASHCACHE_LRU_NULL &&
1081 ((dmc->cache_sets[set].clean_inprog + nr_writes) < dmc->max_clean_ios_set) &&
1082 ((nr_writes + dmc->clean_inprog) < dmc->max_clean_ios_total) &&
1083 nr_writes < to_clean) {
1084 cacheblk = &dmc->cache[lru_rel_index + start_index];
1085 if ((cacheblk->cache_state & (DIRTY | BLOCK_IO_INPROG)) == DIRTY) {
1086 cacheblk->cache_state |= DISKWRITEINPROG;
1087 writes_list[nr_writes].dbn = cacheblk->dbn;
1088 writes_list[nr_writes].index = cacheblk - &dmc->cache[0];
1089 nr_writes++;
1090 }
1091 lru_rel_index = cacheblk->lru_next;
1092 }
1093 }
1094 if (nr_writes > 0) {
1095 int i;
1096
1097 flashcache_merge_writes(dmc, writes_list, &nr_writes, set);
1098 dmc->clean_set_ios += nr_writes;
1099 spin_unlock_irqrestore(&dmc->cache_spin_lock, flags);
1100 for (i = 0 ; i < nr_writes ; i++)
1101 flashcache_dirty_writeback(dmc, writes_list[i].index);
1102 } else {
1103 int do_delayed_clean = 0;
1104
1105 if (dmc->cache_sets[set].nr_dirty > dmc->dirty_thresh_set)
1106 do_delayed_clean = 1;
1107 spin_unlock_irqrestore(&dmc->cache_spin_lock, flags);
1108 if (dmc->cache_sets[set].clean_inprog >= dmc->max_clean_ios_set)
1109 dmc->set_limit_reached++;
1110 if (dmc->clean_inprog >= dmc->max_clean_ios_total)
1111 dmc->total_limit_reached++;
1112 if (do_delayed_clean)
1113 schedule_delayed_work(&dmc->delayed_clean, 1*HZ);
1114 dmc->clean_set_fails++;
1115 }
1116 kfree(writes_list);
1117}
Look at the parameters first: dmc, and set, the index of the set to clean.
Line 1028 checks whether a fast removal of the device is in progress; if so, nothing is done.
Line 1030 allocates memory for the write records; we have just seen the struct dbn_index_pair structure they use.
Line 1031 exists for testing: it injects an artificial allocation failure to verify that the code handles that case correctly.
Line 1037: if the allocation failed, return.
Line 1043 checks whether the number of dirty blocks in the set has reached the watermark; below the watermark there is no need to flush so eagerly.
Line 1049 computes how many dirty blocks need cleaning, namely the excess over the threshold; for example, if nr_dirty were 80 and dirty_thresh_set were 64, to_clean would be 16.
Line 1050: if the reclaim policy is FIFO, the set is walked in FIFO order and the dirty blocks are recorded.
Line 1075 handles the LRU policy.
Neither policy is inherently better; it is only that one or the other suits a particular workload.
Is there anything else worth comparing? Consider the memory overhead of the two policies.
FIFO adds a set_clean_next and a set_fifo_next field to each set's management structure, cache_set.
LRU adds lru_head and lru_tail to each set, plus lru_prev and lru_next to every cache block.
Note that lru_prev and lru_next are 16-bit unsigned integers; compared with a 64-bit pointer, each field saves 48 bits on a 64-bit system.
The price is that a set can then hold at most 2^16 (65536) cache blocks.
In some applications an index-based representation is far superior to pointers. In one project I worked on, the program had to recover instantly from a crash without affecting the clients currently using the service, which meant that on restart it had to return to exactly the state it was in before. None of the data that had to survive could contain pointers, because after a restart those pointers would be invalid; that is exactly where index-based links shine.
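The sketch below shows how such an index-linked LRU list works in isolation. LRU_NULL stands in for FLASHCACHE_LRU_NULL, the associativity value is invented, and the helper functions are illustrative rather than the flashcache source.

#include <stdint.h>
#include <stdio.h>

#define LRU_NULL 0xFFFF      /* plays the role of FLASHCACHE_LRU_NULL */
#define ASSOC    512         /* blocks per set; must stay below 65536 */

struct block {
	uint16_t lru_prev, lru_next;   /* 2 bytes each instead of 8-byte pointers */
};

struct set {
	uint16_t lru_head, lru_tail;   /* set-relative indices, not pointers */
	struct block blk[ASSOC];
};

static void lru_init(struct set *s)
{
	int i;

	s->lru_head = s->lru_tail = LRU_NULL;
	for (i = 0; i < ASSOC; i++)
		s->blk[i].lru_prev = s->blk[i].lru_next = LRU_NULL;
}

/* Append block 'rel' (a set-relative index) at the tail, the MRU end. */
static void lru_add_tail(struct set *s, uint16_t rel)
{
	s->blk[rel].lru_next = LRU_NULL;
	s->blk[rel].lru_prev = s->lru_tail;
	if (s->lru_tail != LRU_NULL)
		s->blk[s->lru_tail].lru_next = rel;
	else
		s->lru_head = rel;
	s->lru_tail = rel;
}

int main(void)
{
	static struct set s;
	int rel;

	lru_init(&s);
	lru_add_tail(&s, 3);
	lru_add_tail(&s, 7);

	/* Scan from the LRU end, just like the while loop at line 1080. */
	for (rel = s.lru_head; rel != LRU_NULL; rel = s.blk[rel].lru_next)
		printf("block %d\n", rel);
	return 0;
}

Because every link is a set-relative index, the structure remains valid after being saved and reloaded, which is exactly why indices beat pointers when state must be restored after a restart.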
Once the dirty-block records have been collected, the branch at line 1094 merges them and issues the writebacks.
Line 1102 covers two situations: either no dirty block was picked up, or the in-flight cleaning limit was reached. In the second case, line 1113 schedules another cleaning pass one second later.
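For readers who have not met the delayed-work mechanism used at line 1113, here is a minimal kernel-style sketch of the pattern. struct my_cache, my_delayed_clean_fn, my_setup and my_retry_later are hypothetical names; the real handler bound to dmc->delayed_clean is defined elsewhere in the flashcache source.

#include <linux/workqueue.h>
#include <linux/jiffies.h>

/* Hypothetical container standing in for struct cache_c. */
struct my_cache {
	struct delayed_work delayed_clean;
};

/* Runs in process context once the delay expires; a real handler would
 * retry the cleaning that could not proceed earlier. */
static void my_delayed_clean_fn(struct work_struct *work)
{
	struct my_cache *c =
		container_of(work, struct my_cache, delayed_clean.work);

	/* e.g. call flashcache_clean_set() again for the sets that are
	 * still over their dirty threshold. */
	(void)c;
}

/* Bind the handler once, typically at cache-creation time. */
static void my_setup(struct my_cache *c)
{
	INIT_DELAYED_WORK(&c->delayed_clean, my_delayed_clean_fn);
}

/* Retry one second later, like schedule_delayed_work(&dmc->delayed_clean, 1*HZ). */
static void my_retry_later(struct my_cache *c)
{
	schedule_delayed_work(&c->delayed_clean, 1 * HZ);
}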
By this point it may look as though we have walked the entire system. But we have been too optimistic all along: more important, and more interesting, scenes are still waiting to be explored.