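UBI is widely used on Android devices. According to the official UBI documentation (http://www.linux-mtd.infradead.org/doc/ubi.html#L_ubi_operations), UBI has several advantages over plain MTD partitions: it hides bad-block management, performs wear-leveling, supports volume updates, and automatically moves data to a good block when ECC errors occur. All of this reduces the work the upper-level software has to do: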

  • UBI volumes have no eraseblock wear-leveling constraints, so users do not have to care about this at all, which means the upper level software may be simpler;
  • UBI volumes have no bad eraseblocks, which also leads to simpler upper level software;
  • UBI volumes are dynamic in a sense that they may be created, removed or re-sized dynamically, while MTD partitions are static;
  • UBI handles bit-flips which again makes the upper level software simpler;
  • UBI provides a volume update operation which makes it easy to detect interrupted software updates and recover;
  • UBI provides an atomic logical eraseblock change operation which allows changing the contents of a logical eraseblock without losing the data if an unclean reboot happens during the operation; this may be very useful for the upper-level software (e.g., for a file-system);
  • UBI has an un-map operation, which just un-maps a logical eraseblock from the physical eraseblock, schedules the physical eraseblock for erasure and returns; this is very quick and frees upper level software from implementing their own mechanisms to defer erasures (e.g., JFFS2 has to implement such mechanisms).

The user-space tools live in the mtd-utils repository, git://git.infradead.org/mtd-utils.git:

  • ubinfo - provides information about UBI devices and volumes found in the system;
  • ubiattach - attaches MTD devices (which describe raw flash) to UBI and creates corresponding UBI devices;
  • ubidetach - detaches MTD devices from UBI devices (the opposite to what ubiattach does);
  • ubimkvol - creates UBI volumes on UBI devices;
  • ubirmvol - removes UBI volumes from UBI devices;
  • ubiupdatevol - updates UBI volumes; this tool uses the UBI volume update feature which leaves the volume in "corrupted" state if the update was interrupted; additionally, this tool may be used to wipe out UBI volumes;
  • ubicrc32 - calculates CRC-32 checksum of a file with the same initial seed as UBI would use;
  • ubinize - generates UBI images;
  • ubiformat - formats empty flash, erases flash and preserves erase counters, flashes UBI images to MTD devices;
  • mtdinfo - reports information about MTD devices found in the system.

Now let's look at how UBI is laid out on flash:
 

UBI headers

UBI stores two 64-byte headers in every physical eraseblock:

  • erase counter header (or EC header) which contains the erase counter of the physical eraseblock (PEB) plus some other not so important information;
  • volume identifier header (or VID header) which stores volume ID and logical eraseblock (LEB) number this PEB belongs to (plus some other not so important information).

This is why a logical eraseblock is smaller than a physical eraseblock. Both headers are protected by CRC-32 checksums; drivers/mtd/ubi/ubi-media.h describes the header layout in detail.
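
As an illustration of a CRC-protected 64-byte header, here is a minimal Python sketch that builds and validates an EC header. The field layout is my reading of struct ubi_ec_hdr in ubi-media.h, and the CRC convention (seed 0xFFFFFFFF, no final inversion) is an assumption about how the kernel computes it; check both against the header file before relying on them. The function names are hypothetical.

```python
import struct
import zlib

# Assumed layout of struct ubi_ec_hdr (big-endian, 64 bytes in total):
# magic, version, 3 pad bytes, erase counter, VID header offset,
# data offset, image sequence number, 32 pad bytes, header CRC.
EC_HDR_BODY = ">IB3sQIII32s"      # everything covered by the CRC (60 bytes)
EC_HDR_MAGIC = 0x55424923         # ASCII "UBI#"


def ubi_crc32(data: bytes) -> int:
    """CRC-32 seeded with 0xFFFFFFFF and without the final inversion (assumed)."""
    return ~zlib.crc32(data) & 0xFFFFFFFF


def build_ec_header(ec, vid_hdr_offset, data_offset, image_seq=0):
    """Serialize an EC header and append its CRC-32."""
    body = struct.pack(EC_HDR_BODY, EC_HDR_MAGIC, 1, b"\0" * 3, ec,
                       vid_hdr_offset, data_offset, image_seq, b"\0" * 32)
    return body + struct.pack(">I", ubi_crc32(body))


def parse_ec_header(raw: bytes) -> dict:
    """Validate magic and CRC, then return the interesting fields."""
    body, (crc,) = raw[:60], struct.unpack(">I", raw[60:64])
    magic, version, _, ec, vid_off, data_off, image_seq, _ = struct.unpack(EC_HDR_BODY, body)
    if magic != EC_HDR_MAGIC:
        raise ValueError("not an EC header")
    if crc != ubi_crc32(body):
        raise ValueError("EC header CRC mismatch")
    return {"ec": ec, "vid_hdr_offset": vid_off, "data_offset": data_off}
```

A flasher would call parse_ec_header() on the first 64 bytes of each PEB and treat a ValueError as a corrupted or missing header.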

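When UBI attaches an MTD device, it first reads and verifies these headers, then builds the erase counter table and the PEB-to-LEB mapping table in RAM.

Each time UBI erases a PEB, its erase counter is incremented. This means a PEB always carries an EC header, except during the short window after the old header has been erased and before the new one has been written. If power is lost in that window, the EC header is lost, and the next time UBI scans the block it assigns it the average erase counter.

The VID header is written only when the PEB is put into use:
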
  • The LEB un-map operation just un-maps the LEB from the PEB and schedules the PEB for erasure. When the PEB is erased, the EC header is written straight away. The VID header is not written.
  • The LEB map operation or a write operation to an un-mapped LEB makes UBI find an appropriate PEB and write the VID header to it (the EC header must already be there). Note, the write operation to an already mapped LEB just writes the data straight to PEB and does not change the UBI headers.
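
The two operations above can be modeled with a small in-memory toy. This is only a sketch of when each header gets written, not how the kernel implements it: the class names are hypothetical, erasure happens immediately instead of being scheduled, and picking the least-worn free PEB is an assumption made for illustration.

```python
class ToyPeb:
    """One physical eraseblock: EC header, optional VID header, data."""
    def __init__(self):
        self.ec = 0        # erase counter kept in the EC header
        self.vid = None    # (volume_id, leb_number), or None if no VID header
        self.data = b""


class ToyUbi:
    def __init__(self, peb_count):
        self.pebs = [ToyPeb() for _ in range(peb_count)]
        self.eba = {}                       # (vol_id, lnum) -> PEB index
        self.free = list(range(peb_count))  # PEBs holding only an EC header

    def _erase(self, pnum):
        # After erasure the EC header is rewritten straight away;
        # no VID header is written.
        peb = self.pebs[pnum]
        peb.ec += 1
        peb.vid = None
        peb.data = b""
        self.free.append(pnum)

    def unmap(self, vol_id, lnum):
        pnum = self.eba.pop((vol_id, lnum), None)
        if pnum is not None:
            self._erase(pnum)  # real UBI only schedules the erasure here

    def write(self, vol_id, lnum, data):
        key = (vol_id, lnum)
        if key not in self.eba:
            # Writing to an un-mapped LEB: pick a PEB and write its VID header.
            pnum = min(self.free, key=lambda n: self.pebs[n].ec)
            self.free.remove(pnum)
            self.pebs[pnum].vid = key
            self.eba[key] = pnum
        # Writing to a mapped LEB touches only the data, not the headers.
        self.pebs[self.eba[key]].data += data
```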

UBI keeps the two headers separate because the EC header and the VID header are written at different times, which has the following benefits:

  • after a PEB is erased, the EC header is written straight away, which minimizes the probability of losing the erase counter due to unclean reboots;
  • when UBI associates a PEB with an LEB, the VID header is written to the PEB.

When the EC header is written to a PEB, UBI does not yet know which volume and which LEB the PEB will be associated with.

UBI volume table

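The volume table is an on-flash data structure that describes every volume on the UBI device. It can be viewed as an array of volume table records, each of which contains: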

  • volume size;
  • volume name;
  • volume type (dynamic or static);
  • volume alignment;
  • update marker (set for volumes which had interrupted updates);
  • auto-resize flag;
  • CRC-32 checksum for this record.

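The array index corresponds to the volume ID. The number of records is limited by the LEB size and cannot exceed 128, so a UBI device can hold at most 128 volumes.

The corresponding record is updated whenever a volume is created, removed, re-sized, renamed, or updated. UBI maintains two copies of the volume table so that it can be recovered even if power is lost during an update. Internally, the volume table lives in a special volume called the layout volume, which occupies two LEBs, one per copy. The layout volume is invisible to users and is maintained by UBI itself, but it is updated through the same mechanisms as any other volume: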

  • Prepare in-memory buffer with the new volume table contents.
  • Un-map LEB0 of the layout volume.
  • Write the new volume table to LEB0.
  • Un-map LEB1 of the layout volume.
  • Write the new volume table to LEB1.
  • Flush the UBI work queue to make sure the PEBs corresponding to the un-mapped LEBs are erased.

When UBI attaches an MTD device, it checks that the two copies are equivalent. If they differ, LEB0 is copied to LEB1; if one copy is corrupted, it is restored from the other.
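
The recovery rule can be written down in a few lines. The sketch below is a hypothetical helper, not kernel code; it represents a corrupted copy (one that fails its CRC check) as None. LEB0 wins when both copies are valid but differ because the update sequence writes LEB0 first, so LEB0 holds the newer table.

```python
def recover_volume_table(leb0, leb1):
    """Return the (LEB0, LEB1) contents after attach-time recovery.

    Each argument is the raw volume table copy, or None if that copy
    is corrupted.
    """
    if leb0 is None and leb1 is None:
        raise ValueError("both volume table copies are corrupted")
    if leb0 is None:
        # LEB0 is corrupted: restore it from LEB1.
        return leb1, leb1
    # LEB0 is valid: it overwrites LEB1 whether LEB1 is corrupted
    # or merely out of date.
    return leb0, leb0
```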

Minimum flash input/output unit

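UBI treats a flash chip or MTD device as a set of eraseblocks, each either good or bad. Every good eraseblock can be read, written, and erased, and a good eraseblock can later be marked bad.

The minimum I/O unit depends on the flash type: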

  • NOR flashes usually have min. I/O unit size of 1 byte, because NOR flashes usually allow reading and writing single bytes (in fact, it is even possible to change individual bits).
  • Some NOR flashes may have other min. I/O unit sizes, e.g. 16 or 32 bytes in case of ECC'd NOR flashes.
  • NAND flashes usually have 512, 2048 or 4096 byte min. I/O unit size, which corresponds to NAND page size. NAND flashes store per-NAND page ECC codes in the OOB area, which means that whole NAND page has to be written at once to calculate the ECC code, and whole NAND page has to be read at once to check the ECC code.

The minimum I/O unit size is a very important property of an MTD device:

  • the position of the VID header depends on it, and therefore so does the LEB size: the larger the min. I/O unit, the smaller the LEB and the higher the UBI space overhead;
  • all writes to an LEB must be aligned to the min. I/O unit and be a multiple of it in size; reads have no such restriction at the UBI level, although the MTD layer still reads whole units into a buffer and copies only the requested bytes back to the caller.

NAND flash sub-pages

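As noted above, all writes must be aligned to the min. I/O unit, which for NAND is the page size. Some SLC NAND chips, however, allow writes in smaller units, which the MTD layer calls sub-pages. Not all NAND flashes have sub-pages: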

  • MLC NANDs do not have sub-pages, at least to the date of writing of this piece of documentation (April 2009).
  • SLC NANDs usually do have sub-pages. E.g., 512-byte NAND pages usually consist of 2x256-byte sub-pages, and 2048-byte NAND pages consist of 4x512-byte sub-pages.
  • SLC OneNAND chips with 2048 bytes NAND page size have 4x512-byte sub-pages.

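For example, take a flash with 128KiB eraseblocks and 2048-byte pages. Without sub-pages, the EC header is stored in the first page and the VID header at offset 2048, so the LEB size is 128KiB - 2048 - 2048 = 124KiB. With sub-pages, the EC header is stored in the first sub-page and the VID header at offset 512 (the second sub-page), so the LEB size becomes 128KiB - 2048 = 126KiB.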

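Sub-pages are used only internally by UBI to store its headers; the UBI API does not let users access sub-pages. The reason is speed: to write one sub-page the driver effectively performs a write of the whole page, so writing 4 sub-pages takes about 4 times as long as writing one page.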

UBI headers position
EC header always resides at offset 0 and takes 64 bytes, the VID header resides at the next available min. I/O unit or sub-page, and also takes 64 bytes. For example:

  • in case of NOR flash which has 1 byte min. I/O unit, the VID header resides at offset 64;
  • in case of NAND flash which does not have sub-pages, the VID header resides at the second NAND page;
  • in case of NAND flash which has sub-pages, the VID header resides at the second sub-page.
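
The placement rule can be turned into a short calculation. The following Python sketch derives the VID header offset, the data offset, and the resulting LEB size from the PEB size, the min. I/O unit, and an optional sub-page size. The function and parameter names are hypothetical; the rule that data starts at the next min. I/O unit boundary after the VID header is inferred from the examples in this article.

```python
UBI_HDR_SIZE = 64  # both the EC and the VID header take 64 bytes


def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def ubi_layout(peb_size, min_io, sub_page=None):
    """Return (vid_hdr_offset, data_offset, leb_size) in bytes."""
    # Headers are placed on sub-page boundaries when sub-pages exist,
    # otherwise on min. I/O unit boundaries.
    hdr_unit = sub_page if sub_page else min_io
    vid_hdr_offset = align_up(UBI_HDR_SIZE, hdr_unit)
    # User data must start on a min. I/O unit boundary.
    data_offset = align_up(vid_hdr_offset + UBI_HDR_SIZE, min_io)
    return vid_hdr_offset, data_offset, peb_size - data_offset
```

For a NAND chip with 128KiB eraseblocks and 2048-byte pages, ubi_layout(128 * 1024, 2048) describes the layout without sub-pages and ubi_layout(128 * 1024, 2048, sub_page=512) the layout with them.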

Flash space overhead

UBI itself consumes some flash space, which reduces the space available to users:

  • 2 PEBs are used to store the volume table;
  • 1 PEB is reserved for wear-leveling purposes;
  • 1 PEB is reserved for the atomic LEB change operation;
  • some amount of PEBs is reserved for bad PEB handling; this is applicable for NAND flash, but not for NOR flash; the percentage of reserved PEBs is configurable and is 1% by default;
  • UBI stores the EC and VID headers at the beginning of each PEB; the amount of bytes used for these purposes depends on the flash type and is explained below.

Let's introduce symbols:

  • P - total number of physical eraseblocks on the MTD device;
  • SP - physical eraseblock size;
  • SL - logical eraseblock size;
  • B - number of PEBs reserved for bad PEB handling; it is 1% of P for NAND by default, and 0 for NOR and other flash types which do not have bad PEBs;
  • O - the overhead related to storing EC and VID headers in bytes, i.e. O = SP - SL.

The UBI overhead is (B + 4) * SP + O * (P - B - 4) i.e., this amount of bytes will not be accessible for users. O is different for different flashes:

  • in case of NOR flash which has 1 byte minimum input/output unit, O is 128 bytes;
  • in case of NAND flash which does not have sub-pages (e.g., MLC NAND), O is 2 NAND pages, i.e. 4KiB in case of 2KiB NAND page and 1KiB in case of 512 bytes NAND page;
  • in case of NAND flash which has sub-pages, UBI optimizes its on-flash layout and puts the EC and VID headers at the same NAND page, but different sub-pages; in this case O is only one NAND page;
  • for other flashes the overhead should be 2 min. I/O units if the min. I/O unit size is greater than or equal to 64 bytes, and 2 times 64 bytes aligned to the min. I/O unit size if the min. I/O unit size is less than 64 bytes.
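
The overhead formula is easy to evaluate for a concrete chip. The sketch below uses the symbols defined above as parameter names; the sample geometry in the usage note (2048 PEBs of 128KiB, sub-page capable NAND, B rounded down to 20) is an assumed example, not a figure from the UBI documentation.

```python
def ubi_overhead(P, SP, SL, B):
    """Bytes unavailable to users: (B + 4) * SP + O * (P - B - 4).

    P  - total number of PEBs on the MTD device
    SP - physical eraseblock size in bytes
    SL - logical eraseblock size in bytes
    B  - number of PEBs reserved for bad PEB handling
    """
    O = SP - SL  # per-PEB header overhead
    return (B + 4) * SP + O * (P - B - 4)


# Assumed example: 256MiB NAND, 128KiB PEBs, 126KiB LEBs, 1% reserved.
overhead = ubi_overhead(P=2048, SP=128 * 1024, SL=126 * 1024, B=20)
print(f"{overhead} bytes ({overhead / (2048 * 128 * 1024):.1%} of the device)")
```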

Saving erase counters
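When working with UBI, the most important thing to understand is that UBI stores an EC header in every PEB to record how many times that block has been erased. This information must not be lost when the flash is erased or re-flashed.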

How UBI flasher should work
The following describes how a UBI-aware flasher should erase the flash and write images:
 

  • First of all, scan the flash and collect the erase counters. Namely, read the EC header from each PEB, check the CRC-32 checksum of the header, and save the erase counter in RAM. It is not necessary to read VID headers. Bad PEBs should be skipped.
  • Calculate average erase counter. It should be used for PEBs with corrupted or missing EC headers. Such PEBs may be there because of unclean reboots, but there shouldn't be too many of them.
  • If the intention is to just erase the flash, then each PEB has to be erased and proper EC header has to be written at the beginning of the PEB. The EC header should contain incremented erase counter. Bad PEBs should be just skipped. For NAND flashes, in case of I/O errors while erasing or writing, the PEB should be marked as bad (see the section "Marking eraseblocks as bad" for how UBI marks PEBs as bad).
  • If the intention is to flash an UBI image, then the flasher should do the following for each non-bad PEB.
    • Read the contents of this PEB from the UBI image (PEB size bytes) into a buffer.
    • Strip min. I/O units full of 0xFF bytes from the end of the buffer (the details are given below in this section).
    • Erase the PEB.
    • Change the EC header in the buffer - put the new erase counter value there and re-calculate the CRC-32 checksum.
    • Write the buffer to the physical eraseblock.
    As usual, bad PEBs should be just skipped. And for NAND flashes, in case of I/O errors while erasing or writing, the PEB should be marked as bad.
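
The erase-counter bookkeeping in the steps above can be sketched as follows. This is a simplified illustration, not the ubiformat implementation: the scan result is assumed to be a plain list with one entry per PEB (an integer counter, None for a corrupted or missing EC header, or the hypothetical BAD sentinel), and the average is rounded down.

```python
BAD = "bad"  # hypothetical marker for a PEB that is flagged bad


def new_erase_counters(scanned):
    """Return the erase counter to write to each PEB after erasing it.

    scanned holds one entry per PEB: an int erase counter, None when the
    EC header is corrupted or missing, or BAD for a bad PEB.
    Bad PEBs are skipped and get None in the result.
    """
    valid = [ec for ec in scanned if isinstance(ec, int)]
    # The average is used for PEBs whose own counter could not be read.
    average = sum(valid) // len(valid) if valid else 0

    result = []
    for ec in scanned:
        if ec == BAD:
            result.append(None)          # skip bad PEBs
        elif ec is None:
            result.append(average + 1)   # fall back to the average
        else:
            result.append(ec + 1)        # incremented erase counter
    return result
```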

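The image is usually smaller than the flash, so the flasher writes all the PEBs the image uses and then erases the remaining ones.

Note that the PEBs of the image do not have to be written in order: the first PEB of the input file need not land on the first physical eraseblock; it may just as well be written to the second or even the last one.

If you are writing a flasher for the production line, you do not need to change the EC values in the input image: on brand-new flash every PEB has an erase counter of 0, so the program logic is simpler.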

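If the image contains UBIFS and the flash is NAND, the flasher may have to drop the trailing 0xFF bytes from the end of each partially used PEB rather than write them. Not every NAND chip requires this, but skipping this step can lead to problems that are very hard to debug later.

In other words, for each PEB the flasher should write only the part that actually contains data and leave out the empty (all-0xFF) NAND pages at the end, instead of always writing the whole eraseblock. This also shortens flashing time. The issue is not specific to UBIFS; JFFS2 needs the same care. If it is ignored, ECC errors may show up later.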
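
A minimal sketch of the stripping step, assuming the PEB contents are already in a buffer whose length is a multiple of the min. I/O unit. The function name is hypothetical. Only whole trailing units that consist entirely of 0xFF are dropped; an empty unit in the middle of the data is kept.

```python
def strip_trailing_ff(buf: bytes, min_io: int) -> bytes:
    """Drop min. I/O units full of 0xFF bytes from the end of buf."""
    if len(buf) % min_io:
        raise ValueError("buffer length must be a multiple of the min. I/O unit")
    empty_unit = b"\xff" * min_io
    end = len(buf)
    # Walk backwards one unit at a time while the unit is empty.
    while end > 0 and buf[end - min_io:end] == empty_unit:
        end -= min_io
    return buf[:end]
```

The flasher would then write only the returned bytes to the freshly erased PEB.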

If the image is generated with mkfs.ubifs, the --space-fixup option can be used to avoid this problem.

Marking eraseblocks as bad
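UBI marks a PEB as bad in two cases:
1. A write to the eraseblock fails. UBI moves the data to another eraseblock and schedules the failed one for torture testing.
2. An erase operation fails with EIO. The eraseblock is marked bad immediately.

Torture testing runs in the background. Its purpose is to find out whether the eraseblock is really damaged, because a write can also fail for other reasons, such as a driver bug or misuse by the upper-level software (for example, writing the same area more than once). The test consists of the following steps: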

  • erase the eraseblock;
  • read it back and make sure it contains only 0xFF bytes;
  • write test pattern bytes;
  • read the eraseblock back and check the pattern;
  • and so on for several patterns (0xA5, 0x5A, 0x00).

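If the eraseblock passes the test, it is not marked bad. If a bit-flip is detected during the test, however, that is treated as sufficient reason to mark it bad. See the torture_peb() function.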
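
The torture sequence can be sketched against a simulated eraseblock. This is a toy model, not the kernel's torture_peb(): the ToyEraseblock class and its stuck_at fault injection are hypothetical and exist only so the example can run without hardware.

```python
PATTERNS = (0xA5, 0x5A, 0x00)


class ToyEraseblock:
    """In-memory eraseblock; stuck_at simulates a byte that always reads 0xFF."""
    def __init__(self, size, stuck_at=None):
        self.size = size
        self.stuck_at = stuck_at
        self.cells = bytearray(b"\xff" * size)

    def erase(self):
        self.cells = bytearray(b"\xff" * self.size)

    def write(self, data):
        self.cells = bytearray(data)

    def read(self):
        out = bytearray(self.cells)
        if self.stuck_at is not None:
            out[self.stuck_at] = 0xFF  # the faulty cell ignores what was written
        return bytes(out)


def survives_torture(block):
    """Erase, verify all-0xFF, write a pattern, verify it; repeat per pattern."""
    for pattern in PATTERNS:
        block.erase()
        if block.read() != b"\xff" * block.size:
            return False
        expected = bytes([pattern]) * block.size
        block.write(expected)
        if block.read() != expected:
            return False
    block.erase()  # leave the block clean for reuse
    return True
```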
 

Scalability issues

During initialization UBI has to read the headers of every PEB, so the larger the flash, the longer attaching takes.

  • UBI scans the MTD device when attaching - it reads the EC and VID headers from every single PEB; the headers are small (64 bytes each), so this means reading 128 bytes from each PEB in case of NOR flash or one or two NAND pages in case of NAND flash (this depends on whether the NAND flash supports sub-pages or not); this is anyway much less than JFFS2 needs to read when it mounts MTD devices, so UBI attaches MTD devices many times faster than JFFS2 would mount a file system on the same MTD device;
  • UBI calculates CRC-32 checksum of each EC and VID header, which consumes CPU, although this is usually minor compared to the flash I/O overhead.
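
Because the scan visits every PEB exactly once, attach time grows linearly with the number of eraseblocks. The sketch below shows the shape of that loop under simplifying assumptions: headers are already parsed and CRC-checked, each PEB is given as a hypothetical (erase_counter, vid) pair where vid is (volume_id, leb_number) or None, and the case of two PEBs claiming the same LEB is left out.

```python
def attach_scan(pebs):
    """Build the in-RAM tables from the per-PEB headers.

    pebs is a list of (erase_counter, vid) pairs, one per PEB, where vid
    is (volume_id, leb_number) or None when the PEB has no VID header.
    Returns (ec_table, mapping) with mapping: (volume_id, leb_number) -> PEB.
    """
    ec_table = []
    mapping = {}
    for pnum, (ec, vid) in enumerate(pebs):
        ec_table.append(ec)          # erase counter of this PEB
        if vid is not None:
            mapping[vid] = pnum      # this PEB currently holds that LEB
    return ec_table, mapping
```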


Some concrete numbers:

  • a 256MiB OneNAND flash found in Nokia N800 devices is attached for less than 1 sec; the flash does support sub-pages so UBI has to read the first 2KiB NAND page of each PEB while scanning;
  • a 1GiB NAND flash found in OLPC XO-1 devices is attached for about 2 seconds; the flash is an SLC NAND and supports sub-pages, but the Cafe controller which is used in the laptop does not allow sub-page writes, so UBI has to read two 2KiB NAND pages from each PEB.

Implementation details

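UBI needs three tables at run time:

  • volume table which contains per-volume information, like volume size, type, etc;
  • eraseblock association (EBA) table which contains the logical-to-physical eraseblock mapping information; for example, when reading an LEB, UBI first looks up the table to find the corresponding PEB number, then reads from this PEB;
  • erase counters (EC) table which contains the erase counter value for each physical eraseblock; UBI wear-leveling sub-system uses this table when it needs to find, for example, a highly worn-out PEB.

The volume table is kept on flash and changes only when a volume is created, removed, or re-sized. Latency requirements are low, so its management is simple.

The EBA and EC tables change every time an LEB is mapped to a PEB or a PEB is erased. This happens very frequently, so they must be managed quickly and efficiently.

Maintaining the EBA and EC tables on flash could not meet those latency and efficiency requirements, so UBI builds both tables in RAM each time it attaches an MTD device. This means UBI has to scan the whole flash, reading the EC and VID headers from every PEB, to construct them.

Volume auto-resize

NAND chips leave the factory with some PEBs already marked bad, and the number and position of the bad blocks differ from chip to chip. Vendors usually guarantee that the first few physical eraseblocks are good and that the number of bad blocks stays below some limit. For example, a 256MiB Samsung OneNAND chip is guaranteed to have no more than 40 bad 128KiB PEBs (the count may grow as the chip wears), which is about 2% of the total capacity.

When you create a UBI image to be flashed, you have to decide the size of every volume (the sizes are stored in the UBI volume table), but because the number of bad blocks varies, the usable capacity of each individual chip is different.

One solution is to assume the worst case, in which every chip has the maximum number of bad blocks. In practice most chips have far fewer. This is acceptable and even improves reliability, because UBI uses all the PEBs of the device anyway. On the other hand, UBI already reserves about 1% of the PEBs for bad PEB handling. For the Samsung chip mentioned above, this means 1% of the flash is reserved and a further 0-2% of the capacity goes unused.

The alternative is the auto-resize flag. When UBI runs for the first time, it enlarges the flagged volume to take up all available space and then clears the auto-resize flag in the volume table. Only one volume may carry the auto-resize flag.

For the flash in the example above, marking a volume auto-resize makes 0-2% more capacity available, while UBI still reserves 1% for bad PEB handling.

Auto-resize is supported in Linux kernel 2.6.25 and later.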