H2 MVStore

Translated from http://www.h2database.com/html/mvstore.html

Overview

The MVStore is a persistent, log structured key-value store. It is planned to be the next storage subsystem of H2, but it can also be used directly within an application, without using JDBC or SQL.

  • MVStore stands for "multi-version store".
  • Each store contains a number of maps that can be accessed using the java.util.Map interface.
  • Both file-based persistence and in-memory operation are supported.
  • It is intended to be fast, simple to use, and small.
  • Concurrent read and write operations are supported.
  • Transactions are supported (including concurrent transactions and 2-phase commit).
  • The tool is very modular. It supports pluggable data types and serialization, pluggable storage (to a file, to off-heap memory), pluggable map implementations (B-tree, R-tree, concurrent B-tree currently), BLOB storage, and a file system abstraction to support encrypted and compressed files.

Example Code

import org.h2.mvstore.*;

// open the store (in-memory if fileName is null)
MVStore s = MVStore.open(fileName);

// create/get the map named "data"
MVMap<Integer, String> map = s.openMap("data");

// add and read some data
map.put(1, "Hello World");
System.out.println(map.get(1));

// close the store (this will persist changes)
s.close();

The following sections show how to use these tools.

Store Builder

The MVStore.Builder provides a fluent interface to build a store if configuration options are needed.

Example usage:

MVStore s = new MVStore.Builder().
    fileName(fileName).
    encryptionKey("007".toCharArray()).
    compress().
    open();

The list of available options is:

  • autoCommitBufferSize: the size of the write buffer.
  • autoCommitDisabled: to disable auto-commit.
  • backgroundExceptionHandler: a handler for exceptions that could occur while writing in the background.
  • cacheSize: the cache size in MB.
  • compress: compress the data when storing, using the fast LZF algorithm.
  • compressHigh: compress the data when storing, using the slower Deflate algorithm.
  • encryptionKey: the key for file encryption.
  • fileName: the name of the file, for file based stores.
  • fileStore: the storage implementation to use.
  • pageSplitSize: the point where pages are split.
  • readOnly: open the file in read-only mode.
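
As a sketch of how several of these options combine (the file name and the numeric values below are purely illustrative; in current versions autoCommitBufferSize is given in KB and cacheSize in MB):

```java
import org.h2.mvstore.MVStore;

// open a store with a larger cache and a bigger auto-commit buffer;
// all values here are examples, not recommendations
MVStore s = new MVStore.Builder().
        fileName("test.mvstore").
        cacheSize(64).              // page cache of 64 MB
        autoCommitBufferSize(2048). // auto-commit after about 2 MB of changes
        open();
s.close();
```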

R-Tree

The MVRTreeMap is an R-tree implementation that supports fast spatial queries. It can be used as follows:

// create an in-memory store
MVStore s = MVStore.open(null);

// open an R-tree map
MVRTreeMap<String> r = s.openMap("data",
        new MVRTreeMap.Builder<String>());

// add two key-value pairs
// the first value is the key id (to make the key unique)
// then the min x, max x, min y, max y
r.add(new SpatialKey(0, -3f, -2f, 2f, 3f), "left");
r.add(new SpatialKey(1, 3f, 4f, 4f, 5f), "right");

// iterate over the intersecting keys
Iterator<SpatialKey> it =
        r.findIntersectingKeys(new SpatialKey(0, 0f, 9f, 3f, 6f));
for (SpatialKey k; it.hasNext();) {
    k = it.next();
    System.out.println(k + ": " + r.get(k));
}
s.close();

The default number of dimensions is 2. To use a different number of dimensions, call new MVRTreeMap.Builder<String>().dimensions(3). The minimum number of dimensions is 1, the maximum is 32.
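
For example, a 3-dimensional map could be opened as follows (assuming the SpatialKey varargs constructor takes a min/max pair per dimension, as in current versions):

```java
import org.h2.mvstore.MVStore;
import org.h2.mvstore.rtree.MVRTreeMap;
import org.h2.mvstore.rtree.SpatialKey;

// open an in-memory store and a 3-dimensional R-tree map
MVStore s = MVStore.open(null);
MVRTreeMap<String> r = s.openMap("data3d",
        new MVRTreeMap.Builder<String>().dimensions(3));

// key id, then min x, max x, min y, max y, min z, max z
r.add(new SpatialKey(0, 0f, 1f, 0f, 1f, 0f, 1f), "unit box");
s.close();
```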

Features

Maps

Each store contains a set of named maps. A map is sorted by key, and supports the common lookup operations, such as a direct lookup by key, lookup of the first / last key, iteration over some or all keys, and so on.

Also supported, even though uncommon for maps, are fast lookup by index and efficiently calculating the index (position) of a key. That means getting the median of two keys is very fast, and the number of keys in a range can be counted very quickly. The iterator supports fast skipping. This is possible because internally, each map is organized in the form of a counted B+-tree.
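
These index-lookup operations look roughly as follows with the MVMap methods getKeyIndex and getKey (verify the method names against your H2 version):

```java
import org.h2.mvstore.MVStore;
import org.h2.mvstore.MVMap;

MVStore s = MVStore.open(null);
MVMap<Integer, String> map = s.openMap("data");
for (int i = 0; i < 100; i++) {
    map.put(i * 2, "value" + i);
}

long index = map.getKeyIndex(40); // position of key 40 in the sorted map
Integer key = map.getKey(10);     // key at position 10

// counting the keys in a range only needs two index lookups
long keysInRange = map.getKeyIndex(80) - map.getKeyIndex(40);
s.close();
```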

In database terms, a map can be used like a table, where the key of the map is the primary key of the table, and the value is the row. A map can also represent an index, where the key of the map is the key of the index, and the value of the map is the primary key of the table (for non-unique indexes, the key of the map must also contain the primary key).

Versions

A version is a snapshot of all the data of all maps at a given point in time. Creating a snapshot is fast: only the pages that are changed after a snapshot are copied. This behavior is also called COW (copy on write). Old versions become read-only. Rollback to an old version is supported.

The following sample code shows how to create a store, open a map, add some data, and access the current and an old version:

// create/get the map named "data"
MVMap<Integer, String> map = s.openMap("data");

// add some data
map.put(1, "Hello");
map.put(2, "World");

// get the current version, for later use
long oldVersion = s.getCurrentVersion();

// from now on, the old version is read-only
s.commit();

// more changes, in the new version
// changes can be rolled back if required
// changes always go into "head" (the newest version)
map.put(1, "Hi");
map.remove(2);

// access the old data (before the commit)
MVMap<Integer, String> oldMap =
        map.openVersion(oldVersion);

// print the old version (can be done
// concurrently with further modifications)
// this will print "Hello" and "World":
System.out.println(oldMap.get(1));
System.out.println(oldMap.get(2));

// print the newest version ("Hi")
System.out.println(map.get(1));

Transactions

To support multiple concurrent open transactions, a transaction utility is included, the TransactionStore. It supports PostgreSQL-style "read committed" transaction isolation with savepoints, two-phase commit, and other features typically available in a database. There is no limit on the size of a transaction (the log is written to disk for large or long running transactions).
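
A minimal sketch of the TransactionStore API (in the 1.4.x line the class and its inner Transaction / TransactionMap types live in org.h2.mvstore.db; later versions moved them to org.h2.mvstore.tx, so check your version):

```java
import org.h2.mvstore.MVStore;
import org.h2.mvstore.db.TransactionStore;
import org.h2.mvstore.db.TransactionStore.Transaction;
import org.h2.mvstore.db.TransactionStore.TransactionMap;

MVStore s = MVStore.open(null);
TransactionStore ts = new TransactionStore(s);
ts.init();

Transaction tx = ts.begin();
TransactionMap<Integer, String> map = tx.openMap("data");
map.put(1, "Hello");
tx.commit(); // or tx.rollback() to undo the changes
s.close();
```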

In-Memory Performance and Usage

Performance of in-memory operations is about 50% slower than java.util.TreeMap.

The memory overhead for large maps is slightly better than for the regular map implementations, but there is a higher overhead per map. For maps with less than about 25 entries, the regular map implementations need less memory.

If no file name is specified, the store operates purely in memory. Except for persistence, all features are supported in this mode (multi-versioning, index lookup, R-tree, and so on). If a file name is specified, all operations occur in memory until the data is persisted.

As in all map implementations, keys need to be immutable, meaning a key object must not be changed after the entry was added to a map. If a file name is specified, the value also must not be changed after adding an entry, because it might already have been serialized (which could happen at any time when auto-commit is enabled).

Pluggable Data Types

Serialization is pluggable. The default serialization currently supports many common data types, and uses Java serialization for other object types. The following classes are directly supported: Boolean, Byte, Short, Character, Integer, Long, Float, Double, BigInteger, BigDecimal, String, UUID, Date, and arrays (both primitive arrays and object arrays). For serialized objects, the size estimate is adjusted using an exponential moving average.

Parameterized data types are supported.

The storage engine itself does not have any length limits, so keys, values, pages, and chunks can be very big. Also, there is no inherent limit on the number of maps and chunks. Due to the log structured storage, there is no special case handling for large keys or pages.

BLOB Support

There is a mechanism to store large binary objects by splitting them into smaller blocks. This allows storing objects that don't fit in memory. Streaming as well as random access reads on such objects are supported. This tool is written on top of the store, using only the map interface.
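
The tool in question is StreamStore. The following sketch stores a large stream as blocks in an ordinary map and reads it back (map name and sizes are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Map;
import org.h2.mvstore.MVStore;
import org.h2.mvstore.StreamStore;

MVStore s = MVStore.open(null);
Map<Long, byte[]> blocks = s.openMap("blobBlocks");
StreamStore streamStore = new StreamStore(blocks);

// store a 1 MB stream; the returned id is small and can be kept in another map
byte[] id = streamStore.put(new ByteArrayInputStream(new byte[1024 * 1024]));

// read it back; the returned stream supports skipping (random access)
InputStream in = streamStore.get(id);
s.close();
```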

R-Tree and Pluggable Map Implementations

The map implementation is pluggable. In addition to the default MVMap (multi-version map), there is a multi-version R-tree map implementation for spatial operations.

Concurrent Operations and Caching

Concurrent reads and writes are supported. All such read operations can occur in parallel. Concurrent reads from the page cache, as well as concurrent reads from the file system, are supported. Write operations first read the relevant pages from disk to memory (this can happen concurrently), and only then modify the data; the in-memory parts of write operations are synchronized. Writing changes to the file can occur concurrently to modifying the data, as writing operates on a snapshot.

Caching is done on the page level. The page cache is a concurrent LIRS cache, which should be resistant against scan operations.

For fully scalable concurrent write operations to a map (in-memory and to disk), the map could be split into multiple maps in different stores ('sharding'). The plan is to add such a mechanism later when needed.

Log Structured Storage

Internally, changes are buffered in memory, and once enough changes have accumulated, they are written in one continuous disk write operation. Compared to traditional database storage engines, this should improve write performance for file systems and storage systems that do not efficiently support small random writes, such as Btrfs, as well as SSDs. (According to a test, write throughput of a common SSD increases with write block size, until a block size of 2 MB, and then does not further increase.) By default, changes are automatically written when more than a number of pages are modified, and once every second in a background thread, even if only little data was changed. Changes can also be written explicitly by calling commit().

When storing, all changed pages are serialized, optionally compressed using the LZF algorithm, and written sequentially to a free area of the file. Each such change set is called a chunk. All parent pages of the changed B-trees are stored in this chunk as well, so that each chunk also contains the root of each changed map (which is the entry point for reading this version of the data). There is no separate index: all data is stored as a list of pages. Per store, there is one additional map that contains the metadata (the list of maps, where the root page of each map is stored, and the list of chunks).

There are usually two write operations per chunk: one to store the chunk data (the pages), and one to update the file header (so it points to the latest chunk). If the chunk is appended at the end of the file, the file header is only written at the end of the chunk. There is no transaction log, no undo log, and there are no in-place updates (however, unused chunks are overwritten by default).

Old data is kept for at least 45 seconds (configurable), so that there are no explicit sync operations required to guarantee data consistency. An application can also sync explicitly when needed. To reuse disk space, the chunks with the lowest amount of live data are compacted (the live data is stored again in the next chunk). To improve data locality and disk space usage, the plan is to automatically defragment and compact data.

Compared to traditional storage engines (that use a transaction log, undo log, and main storage area), the log structured storage is simpler, more flexible, and typically needs less disk operations per change, as data is only written once instead of twice or 3 times, and because the B-tree pages are always full (they are stored next to each other) and can be easily compressed. But temporarily, disk space usage might actually be a bit higher than for a regular database, as disk space is not immediately re-used (there are no in-place updates).

Off-Heap and Pluggable Storage

Storage is pluggable. Unless pure in-memory operation is used, the default storage is to a single file.

An off-heap storage implementation is available. This storage keeps the data in the off-heap memory, meaning outside of the regular garbage collected heap. This allows to use very large in-memory stores without having to increase the JVM heap, which would increase Java garbage collection pauses a lot. Memory is allocated using ByteBuffer.allocateDirect. One chunk is allocated at a time (each chunk is usually a few MB large), so that allocation cost is low. To use the off-heap storage, call:

OffHeapStore offHeap = new OffHeapStore();
MVStore s = new MVStore.Builder().
        fileStore(offHeap).open();

File System Abstraction, File Locking and Online Backup

The file system is pluggable. The same file system abstraction is used as H2 uses. The file can be encrypted using a encrypting file system wrapper. Other file system implementations support reading from a compressed zip or jar file. The file system abstraction closely matches the Java 7 file system API.

Each store may only be opened once within a JVM. When opening a store, the file is locked in exclusive mode, so that the file can only be changed from within one process. Files can be opened in read-only mode, in which case a shared lock is used.

The persisted data can be backed up at any time, even during write operations (online backup). To do that, automatic disk space reuse needs to be first disabled, so that new data is always appended at the end of the file. Then, the file can be copied. The file handle is available to the application. It is recommended to use the utility class FileChannelInputStream to do this. For encrypted databases, both the encrypted (raw) file content, as well as the clear text content, can be backed up.
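
A sketch of that backup sequence, assuming MVStore.setReuseSpace is the switch for disk space reuse (as in current versions; the file name is illustrative):

```java
import org.h2.mvstore.MVStore;

MVStore s = MVStore.open("test.mvstore");
s.setReuseSpace(false); // from now on, new data is only appended
// ... copy the store file here, e.g. using FileChannelInputStream ...
s.setReuseSpace(true);  // re-enable disk space reuse afterwards
s.close();
```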

Encrypted Files

File encryption ensures the data can only be read with the correct password. Data can be encrypted as follows:

MVStore s = new MVStore.Builder().
    fileName(fileName).
    encryptionKey("007".toCharArray()).
    open();

The following algorithms and settings are used:

The password char array is cleared after use, to reduce the risk that the password is stolen even if the attacker has access to the main memory.
The password is hashed according to the PBKDF2 standard, using the SHA-256 hash algorithm.
The length of the salt is 64 bits, so that an attacker can not use a pre-calculated password hash table (rainbow table). It is generated using a cryptographically secure random number generator.
To speed up opening an encrypted stores on Android, the number of PBKDF2 iterations is 10. The higher the value, the better the protection against brute-force password cracking attacks, but the slower is opening a file.
The file itself is encrypted using the standardized disk encryption mode XTS-AES. Only little more than one AES-128 round per block is needed.

Tools

There is a tool, the MVStoreTool, to dump the contents of a file.

Exception Handling

This tool does not throw checked exceptions. Instead, unchecked exceptions are thrown if needed. The error message always contains the version of the tool. The following exceptions can occur:

IllegalStateException if a map was already closed or an IO exception occurred, for example if the file was locked, is already closed, could not be opened or closed, if reading or writing failed, if the file is corrupt, or if there is an internal error in the tool. For such exceptions, an error code is added so that the application can distinguish between different error cases.
IllegalArgumentException if a method was called with an illegal argument.
UnsupportedOperationException if a method was called that is not supported, for example trying to modify a read-only map.
ConcurrentModificationException if a map is modified concurrently.

Storage Engine for H2

Since H2 version 1.4, the MVStore is the default storage engine (supporting SQL, JDBC, transactions, MVCC, and so on). For older versions, append ;MV_STORE=TRUE to the database URL. Even though it can be used with the default table level locking, by default the MVCC mode is enabled when using the MVStore.
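
For example, a database URL for an older version with the MVStore enabled might look like this (the database path is illustrative):

jdbc:h2:~/test;MV_STORE=TRUE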

File Format

The data is stored in one file. The file contains two file headers (for safety) and a number of chunks. Each file header is one block; a block is 4096 bytes. Each chunk is at least one block, but typically 200 blocks or more. Data is stored in the chunks in the form of a log structured storage. There is one chunk for every version.

[ file header 1 ] [ file header 2 ] [ chunk ] [ chunk ] ... [ chunk ]

Each chunk contains a number of B-tree pages. As an example, the following code:
MVStore s = MVStore.open(fileName);
MVMap<Integer, String> map = s.openMap("data");
for (int i = 0; i < 400; i++) {
    map.put(i, "Hello");
}
s.commit();
for (int i = 0; i < 100; i++) {
    map.put(0, "Hi");
}
s.commit();
s.close();

This results in the following two chunks (excluding metadata):

Chunk 1:
- Page 1: (root) node with 2 entries pointing to page 2 and 3
- Page 2: leaf with 140 entries (keys 0 - 139)
- Page 3: leaf with 260 entries (keys 140 - 399)
Chunk 2:
- Page 4: (root) node with 2 entries pointing to page 3 and 5
- Page 5: leaf with 140 entries (keys 0 - 139)
That means each chunk contains the changes of one version: the new version of the changed pages and their parent pages, recursively, up to the root page. Pages in subsequent chunks refer to pages in earlier chunks.

File Header

There are two file headers, which normally contain the same data. But while a file header is being updated, a write operation could partially fail; this is why there is a second header. The file headers are the only data that is updated in place. The file headers contain the following data:

H:2,block:2,blockSize:1000,chunk:7,created:1441235ef73,format:1,version:7,fletcher:3044e6cc


The data is stored in the form of key-value pairs. Each value is stored as a hexadecimal number.

The entries are:

H: The entry "H:2" stands for the H2 database.
block: The number of the block where one of the newest chunks starts (but not necessarily the newest one).
blockSize: The block size of the file; currently hex 1000, which is decimal 4096, to match the sector size of modern hard disks.
chunk: The chunk id, which is normally the same value as the version; it is 0 if there is no version yet.
created: The time the file was created, in milliseconds since 1970.
format: The file format number. Currently 1.
version: The version number of the chunk.
fletcher: The Fletcher-32 checksum of the header.


When opening the file, both file headers are read and the checksums are verified. If both headers are valid, the one with the newer version is used. The chunk with the latest version is then located, and the rest of the metadata is read from there. If the chunk id, block, and version are not stored in the file header, the latest chunk is searched for starting from the last chunk at the end of the file.

Chunk Format

There is one chunk per version. Each chunk consists of a header, the pages that were modified in this version, and a footer. The pages contain the actual data of the maps. The pages inside a chunk are stored right after the header, next to each other (unaligned). The size of a chunk is a multiple of the block size. The footer is stored in the last 128 bytes of the chunk.

[ header ] [ page ] [ page ] ... [ page ] [ footer ]

The footer allows to verify that the chunk is completely written (a chunk is written as one write operation), and allows to find the start position of the very last chunk in the file. The chunk header and footer contain the following data:

chunk:1,block:2,len:1,map:6,max:1c0,next:3,pages:2,root:4000004f8c,time:1fc,version:1
chunk:1,block:2,version:1,fletcher:aed9a4f6

The fields of the chunk header and footer are:

chunk: The chunk id.
block: The first block of the chunk (multiply by the block size to get the position in the file).
len: The size of the chunk, in number of blocks.
map: The id of the newest map; incremented when a new map is created.
max: The sum of the maximum lengths of all pages (see page format).
next: The predicted start block of the next chunk.
pages: The number of pages in the chunk.
root: The position of the metadata root page (see page format).
time: The time the chunk was written, in milliseconds after the file was created.
version: The version this chunk represents.
fletcher: The checksum of the footer.


Chunks are never updated in place. Each chunk contains the pages that were changed in that version (as explained above, there is one chunk per version), plus all the parent nodes of those pages, recursively, up to the root page. If an entry in a map is added, removed, or changed, the respective page is copied, modified, and stored in the next chunk, and the number of live pages in the old chunk is decremented. This mechanism is called copy-on-write, and is similar to how the Btrfs file system works. Chunks without live pages are marked as free, so the space can be re-used by more recent chunks. Because not all chunks are of the same size, there can be a number of free blocks in front of a chunk for some time (until a small chunk is written or the chunks are compacted). There is a delay of 45 seconds (by default) before a free chunk is overwritten, to ensure new versions are persisted first.

How the newest chunk is located when opening a store: The file header contains the position of a recent chunk, but not always the newest one. This is to reduce the number of file header updates. After opening the file, the file headers, and the chunk footer of the very last chunk (at the end of the file) are read. From those candidates, the header of the most recent chunk is read. If it contains a "next" pointer (see above), that chunk's header and footer are read as well. If it turned out to be a newer valid chunk, this is repeated, until the newest chunk was found. Before writing a chunk, the position of the next chunk is predicted based on the assumption that the next chunk will be of the same size as the current one. When the next chunk is written, and the previous prediction turned out to be incorrect, the file header is updated as well. In any case, the file header is updated if the next chain gets longer than 20 hops.

Page Format

Each map is a B-tree, and the map data is stored in (B-tree) pages. There are leaf pages that contain the key-value pairs of the map, and internal nodes that only contain keys and pointers to leaf pages. The root of a tree is either a leaf or an internal node. Unlike the file header and the chunk header and footer, the page data is not human readable. Instead, it is stored as byte arrays, with long (8 bytes), int (4 bytes), short (2 bytes), and variable size int and long (1 to 5 / 10 bytes). The page format is:

length (int): Length of the page in bytes.
checksum (short): Checksum (chunk id xor offset within the chunk xor page length).
mapId (variable size int): The id of the map this page belongs to.
len (variable size int): The number of keys in the page.
type (byte): The page type (0 for leaf page, 1 for internal node; plus 2 if the keys and values are compressed with the LZF algorithm, or plus 6 if the keys and values are compressed with the Deflate algorithm).
children (array of long; internal nodes only): The position of the children.
childCounts (array of variable size long; internal nodes only): The total number of entries for the given child page.
keys (byte array): All keys, stored depending on the data type.
values (byte array; leaf pages only): All values, stored depending on the data type.
Even though this is not required by the file format, pages are stored in the following order: For each map, the root page is stored first, then the internal nodes (if there are any), and then the leaf pages. This should speed up reads for media where sequential reads are faster than random access reads. The metadata map is stored at the end of a chunk.

Pointers to pages are stored as a long, using a special format: 26 bits for the chunk id, 32 bits for the offset within the chunk, 5 bits for the length code, 1 bit for the page type (leaf or internal node). The page type is encoded so that when clearing or removing a map, leaf pages don't have to be read (internal nodes do have to be read in order to know where all the pages are; but in a typical B-tree the vast majority of the pages are leaf pages). The absolute file position is not included so that chunks can be moved within the file without having to change page pointers; only the chunk metadata needs to be changed. The length code is a number from 0 to 31, where 0 means the maximum length of the page is 32 bytes, 1 means 48 bytes, 2: 64, 3: 96, 4: 128, 5: 192, and so on until 31 which means longer than 1 MB. That way, reading a page only requires one read operation (except for very large pages). The sum of the maximum length of all pages is stored in the chunk metadata (field "max"), and when a page is marked as removed, the live maximum length is adjusted. This allows to estimate the amount of free space within a block, in addition to the number of free pages.
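
Assuming the bit layout used by H2's DataUtils helpers (page type in the lowest bit, then the 5-bit length code, the 32-bit offset, and the 26-bit chunk id), a pointer can be decoded with plain shifts and masks:

```java
// build an illustrative page pointer: chunk 3, offset 4096,
// length code 5, page type 1 (internal node)
long pos = (3L << 38) | (4096L << 6) | (5L << 1) | 1L;

int chunkId    = (int) (pos >>> 38);        // -> 3
int offset     = (int) (pos >>> 6);         // int cast keeps the lower 32 bits -> 4096
int lengthCode = (int) ((pos >>> 1) & 31);  // -> 5
int pageType   = (int) (pos & 1);           // -> 1 (0 = leaf, 1 = internal node)
```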

The total number of entries in child pages are kept to allow efficient range counting, lookup by index, and skip operations. The pages form a counted B-tree.

Data compression: The data after the page type are optionally compressed using the LZF algorithm.

Metadata Map

In addition to the user maps, there is one metadata map that contains names and positions of user maps, and chunk metadata. The very last page of a chunk contains the root page of that metadata map. The exact position of this root page is stored in the chunk header. This page (directly or indirectly) points to the root pages of all other maps. The metadata map of a store with a map named "data", and one chunk, contains the following entries:

chunk.1: The metadata of chunk 1. This is the same data as the chunk header, plus the number of live pages, and the maximum live length.
map.1: The metadata of map 1. The entries are: name, createVersion, and type.
name.data: The map id of the map named "data". The value is "1".
root.1: The root position of map 1.
setting.storeVersion: The store version (a user defined value).
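
Assuming MVStore.getMetaMap() exposes this map read-only (as in current versions), the entries can be inspected directly; the entry names follow the scheme listed above:

```java
import org.h2.mvstore.MVStore;
import org.h2.mvstore.MVMap;

MVStore s = MVStore.open(null);
s.openMap("data").put(1, "Hello");
s.commit();

MVMap<String, String> meta = s.getMetaMap();
String mapId = meta.get("name.data");       // the id of the map named "data"
String rootPos = meta.get("root." + mapId); // where that map's root page is stored
s.close();
```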

Similar Projects and Differences to Other Storage Engines


Unlike similar storage engines like LevelDB and Kyoto Cabinet, the MVStore is written in Java and can easily be embedded in a Java and Android application.

The MVStore is somewhat similar to the Berkeley DB Java Edition because it is also written in Java, and is also a log structured storage, but the H2 license is more liberal.

Like SQLite 3, the MVStore keeps all data in one file. Unlike SQLite 3, the MVStore uses a log structured storage. The plan is to make the MVStore both easier to use as well as faster than SQLite 3. In a recent (very simple) test, the MVStore was about twice as fast as SQLite 3 on Android.

The API of the MVStore is similar to MapDB (previously known as JDBM) from Jan Kotek, and some code is shared between MVStore and MapDB. However, unlike MapDB, the MVStore uses a log structured storage. The MVStore does not have a record size limit.

Current State

The code is still experimental at this stage. The API as well as the behavior may partially change. Features may be added and removed (even though the main features will stay).

Requirements


The MVStore is included in the latest H2 jar file.

There are no special requirements to use it. The MVStore should run on any JVM as well as on Android.

To build just the MVStore (without the database engine), run:

./build.sh jarMVStore

This will create the file bin/h2mvstore-1.4.191.jar (about 200 KB).
