mystudy_zhou

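Flashcache (Hybrid Storage): Structure

1.Overview

What is flashcache? flashcache is an open-source project from Facebook.

Split the name apart: flash + cache, that is, using flash as a cache. Flash here means SSD, so flashcache forms a three-tier structure with the SSD as a cache tier, and implements a form of hybrid storage.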

2.Device-Mapper

Flashcache is built on Device-Mapper, so to understand Flashcache you first need to understand Device-Mapper, or at least the data structures and interfaces that Flashcache uses.

Device-Mapper Framework

For details see: http://www.ibm.com/developerworks/cn/linux/l-devmapper/

2.1.DM

flashcache plugs into this module.

2.2.Target

A target acts as a logical handle: the DM module passes I/O into Flashcache through this object.

Structure

The structure is as follows (Linux 3.17, include/linux/device-mapper.h):

struct dm_target {
        struct dm_table *table;
        struct target_type *type;

        /* target limits */
        sector_t begin;
        sector_t len;

        /* If non-zero, maximum size of I/O submitted to a target. */
        uint32_t max_io_len;

        /*
         * A number of zero-length barrier bios that will be submitted
         * to the target for the purpose of flushing cache.
         *
         * The bio number can be accessed with dm_bio_get_target_bio_nr.
         * It is a responsibility of the target driver to remap these bios
         * to the real underlying devices.
         */
        unsigned num_flush_bios;

        /*
         * The number of discard bios that will be submitted to the target.
         * The bio number can be accessed with dm_bio_get_target_bio_nr.
         */
        unsigned num_discard_bios;

        /*
         * The number of WRITE SAME bios that will be submitted to the target.
         * The bio number can be accessed with dm_bio_get_target_bio_nr.
         */
        unsigned num_write_same_bios;

        /*
         * The minimum number of extra bytes allocated in each bio for the
         * target to use.  dm_per_bio_data returns the data location.
         */
        unsigned per_bio_data_size;

        /*
         * If defined, this function is called to find out how many
         * duplicate bios should be sent to the target when writing
         * data.
         */
        dm_num_write_bios_fn num_write_bios;

        /* target specific data */
        void *private;

        /* Used to provide an error string from the ctr */
        char *error;

        /*
         * Set if this target needs to receive flushes regardless of
         * whether or not its underlying devices have support.
         */
        bool flush_supported:1;

        /*
         * Set if this target needs to receive discards regardless of
         * whether or not its underlying devices have support.
         */
        bool discards_supported:1;

        /*
         * Set if the target required discard bios to be split
         * on max_io_len boundary.
         */
        bool split_discard_bios:1;

        /*
         * Set if this target does not return zeroes on discarded blocks.
         */
        bool discard_zeroes_data_unsupported:1;
};

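The field we care about is struct target_type *type;

2.3.Target Type

Through this structure flashcache registers the hook functions it needs with the DM layer.

Inside Flashcache this structure therefore acts as a handle: all data and control pass through it.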

struct target_type {
        uint64_t features;
        const char *name;
        struct module *module;
        unsigned version[3];
        dm_ctr_fn ctr;
        dm_dtr_fn dtr;
        dm_map_fn map;
        dm_map_request_fn map_rq;
        dm_endio_fn end_io;
        dm_request_endio_fn rq_end_io;
        dm_presuspend_fn presuspend;
        dm_postsuspend_fn postsuspend;
        dm_preresume_fn preresume;
        dm_resume_fn resume;
        dm_status_fn status;
        dm_message_fn message;
        dm_ioctl_fn ioctl;
        dm_merge_fn merge;
        dm_busy_fn busy;
        dm_iterate_devices_fn iterate_devices;
        dm_io_hints_fn io_hints;

        /* For internal device-mapper use. */
        struct list_head list;
};

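2.4.The Part Flashcache Implements

Because both control flow and data enter Flashcache from DM through this structure, the fields that matter deserve an explanation. (The Linux comments are very thorough.)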

2.4.1.constructor

/*
 * In the constructor the target parameter will already have the
 * table, type, begin and len fields filled in.
 */
typedef int (*dm_ctr_fn) (struct dm_target *target,
                          unsigned int argc, char **argv);

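The creation interface. By the time the callback runs, the dm_target handle it receives already has its table, type, begin and len fields filled in (begin and len being the start sector and length of the range the target covers); argc and argv carry the target's parameters.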

2.4.2.destructor

/*
 * The destructor doesn't need to free the dm_target, just
 * anything hidden ti->private.
 */
typedef void (*dm_dtr_fn) (struct dm_target *ti);

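The release interface. Any private memory you allocated (the data behind ti->private) is yours to free; the dm_target itself is not.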

2.4.3.map

/*
 * The map function must return:
 * < 0: error
 * = 0: The target will handle the io by resubmitting it later
 * = 1: simple remap complete
 * = 2: The target wants to push back the io
 */
typedef int (*dm_map_fn) (struct dm_target *ti, struct bio *bio);

2.4.4.status

typedef void (*dm_status_fn) (struct dm_target *ti, status_type_t status_type,
                              unsigned status_flags, char *result, unsigned maxlen);

Queries the target's status.

2.4.5.ioctl

typedef int (*dm_ioctl_fn) (struct dm_target *ti, unsigned int cmd,
                            unsigned long arg);

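The ioctl interface.

3.SSD Characteristics

First, a brief introduction to SSDs.

In essence an SSD does the same job as an HDD: it translates logical addresses into physical addresses.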

3.1.Sector

In computer disk storage, a sector is a subdivision of a track on a magnetic disk or optical disc. Each sector stores a fixed amount of user-accessible data, traditionally 512 bytes for hard disk drives (HDDs) and 2048 bytes for CD-ROMs and DVD-ROMs. Newer HDDs use 4096-byte (4 KiB) sectors, which are known as the Advanced Format (AF).

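Generally speaking, a Sector today is 512B, which on an HDD is the size of one disk sector.

3.2.Logical Address

What is a logical address?

Take a 128GB SSD: its logical address space spans 128GB/512B = 2^28 Sectors, numbered 0 to 2^28 - 1.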
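The arithmetic above can be sketched in a few lines of C. This is only an illustration: the names SECTOR_SIZE, sector_count and byte_to_sector are made up for this post and do not come from flashcache or the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u  /* bytes per Sector, as assumed in the text */

/* Number of 512-byte Sectors exposed by a device of the given capacity. */
static uint64_t sector_count(uint64_t capacity_bytes)
{
    assert(capacity_bytes % SECTOR_SIZE == 0);
    return capacity_bytes / SECTOR_SIZE;
}

/* Byte offset -> logical Sector number (the "logical address"). */
static uint64_t byte_to_sector(uint64_t byte_offset)
{
    return byte_offset / SECTOR_SIZE;
}
```

For a 128GB device, sector_count(128ULL << 30) comes out as 2^28, matching the figure above.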

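3.3.Physical Address

HDDs and SSDs each have their own storage medium and layout.

An HDD uses magnetic media divided into cylinders, tracks and sectors; cylinder + head (which together select a track) + sector make up the physical address of data on an HDD.

An SSD uses an array of solid-state memory chips, divided into Channel, Plane, Block and Page.

3.4.Address Mapping

Logical layers such as the file system deal only with logical addresses, while the physical layer understands only physical addresses, so a mapping between the two has to be established.

This mapping forms a table, which we call the mapping table.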

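3.5.SSD Logical Address Unit

The logical address unit of an SSD differs from that of an HDD: it is typically 4KB. Why 4KB? It comes down to the size of the SSD's mapping table relative to its DRAM. Take a 128GB SSD as an example, and suppose for now that the logical address unit were still one Sector (512B).

A 128GB SSD typically carries 128MB of DRAM.

It has 128GB/512B = 256M Sectors. If each mapping entry (logical address to physical address) takes 4B of memory, the whole mapping table needs 256M * 4B = 1GB of DRAM, far more than 128MB.

The SSD therefore enlarges the logical unit from a Sector (512B) to a Page (4KB), which shrinks the mapping table to 128GB / 4KB * 4B = 128MB.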
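To make the two table sizes easy to reproduce, here is a small sketch. It assumes a flat table with one fixed-size entry per mapping unit, which is the model the calculation above uses; real controllers may organize their tables differently, and map_table_bytes is a name invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* DRAM needed for a flat logical-to-physical table:
 * one entry of entry_bytes per mapping unit of unit_bytes. */
static uint64_t map_table_bytes(uint64_t capacity_bytes,
                                uint64_t unit_bytes,
                                uint64_t entry_bytes)
{
    assert(unit_bytes != 0 && capacity_bytes % unit_bytes == 0);
    return capacity_bytes / unit_bytes * entry_bytes;
}
```

With a 128GB capacity and 4B entries, a 512B unit gives 1GB of table and a 4KB unit gives 128MB, the same two numbers as in the text.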

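3.6.SSD Read and Write Units

The storage medium of an SSD is NAND Flash. Its read unit is 4KB, while the write unit varies slightly between controllers (8KB, 16KB and so on); it is a power-of-two multiple of the read unit.

We call the SSD read unit the LPA (or VPage) and the write unit the PPA (or Page).

Because of these properties, a few scenarios deserve attention when reading from or writing to an SSD.

Ø Reads: mind the order.

For example, Read Sectors 0~15.

If the sectors are requested in order, 0~15, the SSD turns this into two requests: LPA 0 (Sectors 0~7) and LPA 1 (Sectors 8~15);

but if the order becomes Sectors 0~3, Sectors 8~11, Sectors 4~7, Sectors 12~15, it turns into four requests: LPA 0 (Sectors 0~3), LPA 1 (Sectors 8~11), LPA 0 (Sectors 4~7) and LPA 1 (Sectors 12~15).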
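The effect of ordering can be counted with a toy model. The sketch below assumes that back-to-back accesses to the same LPA are served by one request and that coming back to an LPA later costs a new one; that is a simplification for illustration, not a description of any particular controller. The names sector_range and count_lpa_requests are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTORS_PER_LPA 8u  /* 4KB read unit / 512B Sector */

struct sector_range {
    uint64_t first;  /* first Sector, inclusive */
    uint64_t last;   /* last Sector, inclusive */
};

/* Count LPA-level requests for ranges issued in the given order.
 * Consecutive accesses that land in the same LPA are merged into one
 * request; returning to an LPA after touching another one is not. */
static size_t count_lpa_requests(const struct sector_range *r, size_t n)
{
    size_t requests = 0;
    int have_prev = 0;
    uint64_t prev_lpa = 0;

    for (size_t i = 0; i < n; i++) {
        assert(r[i].first <= r[i].last);
        uint64_t lo = r[i].first / SECTORS_PER_LPA;
        uint64_t hi = r[i].last / SECTORS_PER_LPA;
        for (uint64_t lpa = lo; lpa <= hi; lpa++) {
            if (!have_prev || lpa != prev_lpa)
                requests++;
            prev_lpa = lpa;
            have_prev = 1;
        }
    }
    return requests;
}
```

Under this model the single range 0~15 costs two requests, while the shuffled order 0~3, 8~11, 4~7, 12~15 costs four, as described above.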

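Ø Writes: mind the unit.

If the data you write is a multiple of 4KB (the LPA) but smaller than the write unit, the shortfall is filled with dummy data. With an 8KB write unit and only 4KB to write, the SSD pads the other 4KB (one LPA) with invalid data and then writes the full 8KB (one PPA) to NAND Flash in one go.

If the data you write is smaller than 4KB, things get more troublesome: the SSD has to read the rest of the LPA that the data belongs to and merge it before it can write anything (why? because the LPA is the smallest unit the mapping table tracks, so it has to be written as a whole). For example, writing Sector 0 takes these steps:

1) Read LPA 0 to obtain Sectors 1~7;

2) Merge Sector 0 with Sectors 1~7 to get the latest data for the whole of LPA 0;

3) Add one dummy LPA, which together with LPA 0 forms a PPA;

4) Write this PPA.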
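The steps above can be turned into a rough cost model. The sketch assumes an 8KB write unit (one of the sizes mentioned earlier), assumes that any LPAs can be packed together into a PPA, and only counts reads and NAND bytes; it does not move any data. The names write_cost and nand_write_cost are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_BYTES 512u
#define LPA_BYTES    4096u  /* read / mapping unit */
#define PPA_BYTES    8192u  /* write unit, assumed to be 8KB here */

struct write_cost {
    uint64_t lpas_read;   /* LPAs read back so they can be merged */
    uint64_t nand_bytes;  /* bytes actually written to NAND */
};

/* Cost of writing `count` Sectors starting at Sector `first`. */
static struct write_cost nand_write_cost(uint64_t first, uint64_t count)
{
    struct write_cost c = {0, 0};
    assert(count > 0);

    uint64_t start = first * SECTOR_BYTES;
    uint64_t end = start + count * SECTOR_BYTES;  /* exclusive */
    uint64_t lpas = (end - 1) / LPA_BYTES - start / LPA_BYTES + 1;

    /* A partially covered first or last LPA needs a read and a merge. */
    int head_partial = (start % LPA_BYTES) != 0;
    int tail_partial = (end % LPA_BYTES) != 0;
    if (lpas == 1)
        c.lpas_read = (head_partial || tail_partial) ? 1 : 0;
    else
        c.lpas_read = (uint64_t)head_partial + (uint64_t)tail_partial;

    /* Whole LPAs are packed into PPAs; a short last PPA gets dummy LPAs. */
    uint64_t lpas_per_ppa = PPA_BYTES / LPA_BYTES;
    uint64_t ppas = (lpas + lpas_per_ppa - 1) / lpas_per_ppa;
    c.nand_bytes = ppas * PPA_BYTES;
    return c;
}
```

Writing only Sector 0 gives one LPA read and 8KB written to NAND; writing a full 4KB LPA needs no read but still writes 8KB.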

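Ø WA

This is a good place to mention an SSD concept: WA (Write Amplification).

A simple formula: WA = (total data written - data intended to be written) / data intended to be written

In the first example 4KB needed to be written, but 8KB was actually written to NAND, so WA is as high as 100%.

The second example is worse: one Sector (512B) was intended but 16 Sectors (8KB) were written, so WA = (16 - 1) / 1 = 1500%.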
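As a quick check of both numbers, the formula can be written out directly. This sketch follows the definition used in this post (the extra data relative to the intended data); many references instead quote WA as the plain ratio written / intended, which would give 2 and 16 for the same two cases. The function name wa_percent is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Write amplification as defined in this post, in percent:
 * (written - intended) / intended * 100, truncated to an integer. */
static uint64_t wa_percent(uint64_t written_bytes, uint64_t intended_bytes)
{
    assert(intended_bytes > 0 && written_bytes >= intended_bytes);
    return (written_bytes - intended_bytes) * 100 / intended_bytes;
}
```

wa_percent(8192, 4096) is 100 and wa_percent(8192, 512) is 1500.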

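WA not only increases the amount of data written; it also shortens the life of the flash and eats up its available bandwidth, which indirectly hurts random write performance.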

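3.7.OP (Over-provisioning) Reserved Space

Because of these SSD characteristics, some space needs to be held in reserve to sustain the SSD's performance.

This matters most under Random Write workloads, which need ample reserved space. For example, on a 128GB SSD, try to use only 120GB. This greatly improves GC (garbage collection) efficiency, and with it the SSD's performance and lifespan.

It is therefore not advisable to fill the drive completely when the workload contains a lot of Random Writes.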

4.CacheDevice Structure

Note:

The unaligned portion is measured against the Associate Unit, not the Cache Block Size.

By default the Cache Block size is 4KB and the Associate Unit is 512 * 4KB = 2MB.
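A small sketch of what "unaligned relative to the Associate Unit" means in numbers. It assumes the usable cache space is simply divided into whole Associate Units and that whatever is left over is the unaligned part; that reading is my interpretation of the note above, not code taken from flashcache, and all three function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Size of one Associate Unit: `assoc` Cache Blocks of block_bytes each. */
static uint64_t assoc_unit_bytes(uint64_t block_bytes, uint64_t assoc)
{
    assert(block_bytes != 0 && assoc != 0);
    return block_bytes * assoc;
}

/* Whole Associate Units that fit into usable_bytes of cache space. */
static uint64_t whole_assoc_units(uint64_t usable_bytes, uint64_t unit_bytes)
{
    assert(unit_bytes != 0);
    return usable_bytes / unit_bytes;
}

/* Bytes left over after the last whole Associate Unit. */
static uint64_t unaligned_bytes(uint64_t usable_bytes, uint64_t unit_bytes)
{
    assert(unit_bytes != 0);
    return usable_bytes % unit_bytes;
}
```

With the defaults, assoc_unit_bytes(4096, 512) is 2MB; 5MB of usable space would then hold two whole units with 1MB left over.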

4.1.superblock

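Stores the Cache's configuration information.

Note: the initial value is the capacity of the SSD, in Sectors; after calculation it becomes the size of the Cache, in Blocks.

The default is the Meta Block Size, i.e. 4K.

Once the Cache Device has been created, its configuration is static data (called cold data in SSD terms). Ideally this data is never dragged into other operations.

The recommended size is therefore one LPA (i.e. 4K), so that later reads and writes of Meta Data Block data do not touch the configuration stored in the Super Block.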

4.2.Meta Data Block

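The per-block Meta entries stored in the Cache.

dbn is the logical address number of the data you store.

cache_state is the state of the Cache entry; it is compared during Hit Check and Cache Flush.

Alignment is critical here. Whether an entry is 32 Bytes (checksum enabled) or 16 Bytes (checksum disabled), the Meta Block size is guaranteed to be an integer multiple of it, which makes operating on Cache entries much more convenient.

Also pay attention to the atomicity of Meta data writes.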

/* 
 * We do metadata updates only when a block trasitions from DIRTY -> CLEAN
 * or from CLEAN -> DIRTY. Consequently, on an unclean shutdown, we only
 * pick up blocks that are marked (DIRTY | CLEAN), we clean these and stick
 * them in the cache.
 * On a clean shutdown, we will sync the state for every block, and we will
 * load every block back into cache on a restart.
 * 
 * Note: When using larger flashcache metadata blocks, it is important to make 
 * sure that a flash_cacheblock does not straddle 2 sectors. This avoids
 * partial writes of a metadata slot on a powerfail/node crash. Aligning this
 * a 16b or 32b struct avoids that issue.
 * 
 * Note: If a on-ssd flash_cacheblock does not fit exactly within a 512b sector,
 * (ie. if there are any remainder runt bytes), logic in flashcache_conf.c which
 * reads and writes flashcache metadata on create/load/remove will break.
 * 
 * If changing these, make sure they remain a ^2 size !
 */
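The alignment rule in that comment is easy to check numerically. The sketch below only tests divisibility; it does not use the real flash_cacheblock layout, and the names slot_fits_sector and slots_per_md_block are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_BYTES 512u

/* Nonzero if entries of slot_bytes tile a 512-byte Sector exactly,
 * so that no entry straddles two Sectors. */
static int slot_fits_sector(uint32_t slot_bytes)
{
    assert(slot_bytes != 0);
    return SECTOR_BYTES % slot_bytes == 0;
}

/* Number of entries held by one Meta Block of md_block_sectors Sectors. */
static uint32_t slots_per_md_block(uint32_t md_block_sectors,
                                   uint32_t slot_bytes)
{
    assert(slot_fits_sector(slot_bytes));
    return md_block_sectors * SECTOR_BYTES / slot_bytes;
}
```

Both 16 and 32 byte entries pass the check, and an 8-Sector Meta Block holds 256 or 128 of them respectively; a 24 byte entry would fail because 512 is not a multiple of 24.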

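The Meta Block size defaults to 8 Sectors, i.e. 4K.

A Meta Block is the size of one read unit, which has the following benefits:

A read covers one complete unit. (fully independent)
A write is one unit and can be combined with other Meta Blocks to form a write unit. (data independent)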

4.3.CacheBlock