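How HBase implements random, real-time read/write access
Review:
HBase write path
WAL: 1. guarantees the durability of written data 2. provides a recovery mechanism after failures.
memstore: 1. fast writes (insert, update and delete [a delete works by attaching a type marker to the written data, flagging the data as deleted]) 2. data can be read immediately after it is written
flush to disk: the data is written out to files on disk
compact: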
HBase has two types of compaction: “minor compaction”, which just merges two or more small files into one, and “major compaction”, which picks up all the files in the region, merges them and performs some cleanup. In a major compaction, deleted key/values are removed: the new file contains no tombstone markers, and all duplicate key/values (replace value operations) are removed.
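The difference between the two compaction types can be sketched as below. This is a minimal illustration, not HBase code: `Cell`, `minor_compact` and `major_compact` are hypothetical names, a "file" is just a Python list, and the sketch assumes that only the newest cell per key survives a major compaction (real HBase can retain several versions and has more than one kind of delete marker).

```python
from typing import NamedTuple

class Cell(NamedTuple):
    key: str     # stands in for row key + column family + qualifier
    ts: int      # timestamp; a larger value means a newer cell
    kind: str    # "put" or "delete" (a tombstone marker)
    value: str

def sort_key(cell):
    # Key ascending, newest timestamp first; at equal timestamps
    # "delete" sorts before "put" (plain string order).
    return (cell.key, -cell.ts, cell.kind)

def minor_compact(files):
    """Merge several files into one; every cell is kept, tombstones included."""
    return sorted((cell for f in files for cell in f), key=sort_key)

def major_compact(files):
    """Merge all files, keep only the newest cell per key, drop tombstones."""
    result = []
    last_key = None
    for cell in minor_compact(files):
        if cell.key == last_key:
            continue  # an older version of a key that was already handled
        last_key = cell.key
        if cell.kind == "put":
            result.append(cell)  # a newest cell that is a tombstone is dropped
    return result

files = [
    [Cell("a", 1, "put", "v1"), Cell("b", 1, "put", "x")],
    [Cell("a", 2, "put", "v2"), Cell("b", 3, "delete", "")],
]
merged = minor_compact(files)    # four cells, sorted, tombstone still present
cleaned = major_compact(files)   # only the live, newest cells remain
```

After the major compaction in this example, key "a" keeps only its newer value and key "b" disappears entirely, because its newest cell is a tombstone.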
Apache Hadoop I/O file formats
SequenceFile: append-only; you can only append your key/value pairs.
MapFile:
contains two SequenceFiles, "/data" and "/index"
you append sorted key/value pairs, and every N keys (where N is a configurable interval) it stores the key and the offset in the index
lookups are fast because you scan the index instead of all the records
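A rough sketch of the MapFile idea follows. It is not the Hadoop API: `build_index` and `lookup` are hypothetical helpers, the "data file" is a sorted Python list, and a list position stands in for a byte offset.

```python
import bisect

def build_index(records, interval):
    """Record (key, position) for every `interval`-th record.

    `records` is a list of (key, value) pairs already sorted by key.
    """
    return [(records[i][0], i) for i in range(0, len(records), interval)]

def lookup(records, index, key):
    """Return the value stored under `key`, or None if it is absent."""
    index_keys = [k for k, _ in index]
    # Last index entry whose key is <= the key we are looking for.
    slot = bisect.bisect_right(index_keys, key) - 1
    if slot < 0:
        return None  # the key sorts before the first record
    start = index[slot][1]
    end = index[slot + 1][1] if slot + 1 < len(index) else len(records)
    # Only the records between two index entries are scanned.
    for k, v in records[start:end]:
        if k == key:
            return v
    return None

records = [(f"k{i:02d}", i) for i in range(10)]
index = build_index(records, 4)   # entries for k00, k04 and k08
value = lookup(records, index, "k05")
```

With an interval of 4, a lookup scans at most 4 records after one binary search over the small index, instead of scanning the whole data file.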
HBase & MapFile
The HBase Key consists of: the row key, column family, column qualifier, timestamp and a type.
To solve the problem of deleting key/value pairs, the idea is to use the “type” field to mark key as deleted (tombstone markers). Solving the problem of replacing key/value pairs is just a matter of picking the later timestamp (the correct value is near the end of the file, append only means last inserted is near the end).
To solve the “non-ordered” key problem we keep the last added key-values in memory. When they reach a threshold, HBase flushes them to a MapFile. In this way, you end up adding sorted key/values to a MapFile.
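The three ideas above (type marker for deletes, latest timestamp wins, sort in memory before flushing) can be put together in a small sketch. `MemStore` here is a hypothetical toy class, not the HBase one; the flush threshold is a cell count rather than a size in bytes, and flushed files are kept as in-memory lists.

```python
class MemStore:
    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold
        self.cells = {}     # (key, ts) -> (kind, value), in arrival order
        self.flushed = []   # each flush produces one sorted, immutable "file"

    def put(self, key, ts, value):
        self._add(key, ts, "put", value)

    def delete(self, key, ts):
        # A delete is just another write, carrying the "delete" type marker.
        self._add(key, ts, "delete", None)

    def _add(self, key, ts, kind, value):
        self.cells[(key, ts)] = (kind, value)
        if len(self.cells) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Sort by key, newest timestamp first, then write the file in one pass.
        ordered = sorted(self.cells.items(),
                         key=lambda item: (item[0][0], -item[0][1]))
        self.flushed.append([(k, ts, kind, v) for (k, ts), (kind, v) in ordered])
        self.cells = {}

    def get(self, key):
        # Collect every version from memory and from all flushed files.
        found = [(ts, kind, v) for (k, ts), (kind, v) in self.cells.items()
                 if k == key]
        for f in self.flushed:
            found += [(ts, kind, v) for k, ts, kind, v in f if k == key]
        if not found:
            return None
        ts, kind, value = max(found, key=lambda c: c[0])  # latest timestamp wins
        return None if kind == "delete" else value

store = MemStore(flush_threshold=3)
store.put("b", 1, "x")
store.put("a", 1, "v1")
store.put("a", 2, "v2")   # third cell reaches the threshold and triggers a flush
store.delete("a", 3)      # tombstone stays in memory until the next flush
```

Writes arrive in any order, yet every flushed file is sorted, and a read sees the tombstone for "a" immediately even though the older values are already on "disk".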
HFile v1
Difference from MapFile: metadata and the index are added to the data file
MapFile :
Data --> Data Index
HFile:
Data --> Data Index
Meta --> MetaIndex
FileInfo
Meta Block : Bloom Filter
FileInfo : max sequenceId,major compaction key,timerange info
In HBase 0.20, MapFile is replaced by HFile: a specific map file implementation for HBase. The idea is quite similar to MapFile, but it adds more features than just a plain key/value file: it supports metadata, and the index is now kept in the same file.
The data blocks contain the actual key/values, as in a MapFile. For each “block close operation” the first key is added to the index, and the index is written on HFile close.
The HFile format also adds two extra “metadata” block types: Meta and FileInfo. These two key/value blocks are written upon file close.
The Meta block is designed to keep a large amount of data with its key as a String, while FileInfo is a simple Map preferred for small information with keys and values that are both byte-array. Regionserver’s StoreFile uses Meta-Blocks to store a Bloom Filter, and FileInfo for Max SequenceId, Major compaction key and Timerange info. This information is useful to avoid reading the file if there’s no chance that the key is present (Bloom Filter), if the file is too old (Max SequenceId) or if the file is too new (Timerange) to contain what we’re looking for.
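How a Bloom Filter and the time range let a reader skip a file can be sketched as follows. Everything here is illustrative: `BloomFilter` and `StoreFile` are hypothetical toy classes, the bit-array size and hash scheme (SHA-256 with a seed prefix) are assumptions chosen for simplicity, and the Max SequenceId check is not modelled.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, key):
        # Derive `num_hashes` bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

class StoreFile:
    def __init__(self, cells):
        # cells: list of (row_key, timestamp, value)
        self.cells = cells
        self.bloom = BloomFilter()
        for row, _, _ in cells:
            self.bloom.add(row)
        self.min_ts = min(ts for _, ts, _ in cells)
        self.max_ts = max(ts for _, ts, _ in cells)

    def may_hold(self, row, ts_from, ts_to):
        """Decide from metadata alone whether the file is worth reading."""
        if ts_to < self.min_ts or ts_from > self.max_ts:
            return False  # requested time range does not overlap the file
        return self.bloom.might_contain(row)

f = StoreFile([("row1", 10, "a"), ("row2", 20, "b")])
worth_reading = f.may_hold("row1", 0, 100)
```

A Bloom Filter can report false positives but never false negatives, so a `False` answer is always safe to act on: the file is skipped without touching its data blocks.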
HFile v2
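To avoid loading the whole index and a huge Bloom Filter into memory as HFile v1 does, HFile v2 introduces multi-level indexes and block-level Bloom Filters. (My own understanding: each data block has its own leaf index and Bloom filter, as well as a block type, root index, etc.; the benefit is that less has to be loaded into memory to locate data, and lookups are faster.)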
In HBase 0.92, the HFile format was changed a bit (HBASE-3857) to improve the performance when large amounts of data are stored. One of the main problems with the HFile v1 is that you need to load all the monolithic indexes and large Bloom Filters in memory, and to solve this problem v2 introduces multi-level indexes and a block-level Bloom Filter. As a result, HFile v2 features improved speed, memory, and cache usage.
The main feature of v2 is “inline blocks”: the idea is to break the index and Bloom Filter up per block, instead of having the whole index and Bloom Filter of the whole file in memory. In this way you can keep in RAM just what you need.
Since the index is moved to block level you then have a multi-level index, meaning each block has its own index (leaf-index). The last key of each block is kept to create the intermediate/index that makes the multilevel-index b+tree like.
The block header now contains some information: the “Block Magic” field was replaced by the “Block Type” field, which describes the content of the block (Data, Leaf-Index, Bloom, Metadata, Root-Index, etc.). Also, three fields (compressed size, uncompressed size and offset of the previous block) were added to allow fast backward and forward seeks.
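A two-level lookup in the spirit of the multi-level index might look like the sketch below. It follows the description above in keying each index entry by the last key of a block; `build_levels`, `find` and the `fanout` parameter are hypothetical names, and blocks are plain Python lists rather than the on-disk format.

```python
import bisect

def build_levels(data_blocks, fanout):
    """Build a two-level index over sorted data blocks.

    data_blocks: list of blocks, each a sorted list of (key, value).
    Returns (root, leaves): each leaf is a list of (last_key, block_no),
    and root is a list of (last_key, leaf_no).
    """
    entries = [(block[-1][0], n) for n, block in enumerate(data_blocks)]
    leaves = [entries[i:i + fanout] for i in range(0, len(entries), fanout)]
    root = [(leaf[-1][0], n) for n, leaf in enumerate(leaves)]
    return root, leaves

def find(data_blocks, root, leaves, key):
    """Return (value, loaded), where `loaded` lists what had to be read."""
    loaded = []
    # First root entry whose last key is >= the key we want.
    slot = bisect.bisect_left([k for k, _ in root], key)
    if slot == len(root):
        return None, loaded  # key sorts after the last key in the file
    leaf_no = root[slot][1]
    leaf = leaves[leaf_no]
    loaded.append(f"leaf-index {leaf_no}")
    # Same search inside the one leaf-index block that was loaded.
    block_no = leaf[bisect.bisect_left([k for k, _ in leaf], key)][1]
    loaded.append(f"data block {block_no}")
    for k, v in data_blocks[block_no]:
        if k == key:
            return v, loaded
    return None, loaded

blocks = [[("a", 1), ("b", 2)], [("c", 3), ("d", 4)],
          [("e", 5), ("f", 6)], [("g", 7), ("h", 8)]]
root, leaves = build_levels(blocks, fanout=2)
value, loaded = find(blocks, root, leaves, "e")
```

Only the small root index has to stay in memory; each lookup then loads one leaf-index block and one data block, which is the point of splitting the monolithic v1 index.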
Data Block Encodings
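Performance analysis
My understanding of Prefix Encoding:
Because much of the data in the keys is identical (rowkey, family, qualifier), it is enough to record the key length and the part that differs from the previous record
The PREFIX_TREE structure in HBase, see <HBase PREFIX_TREE>
PREFIX_TREE builds three separate trees for the row, family and qualifier of the KeyValues. Taking the row as an example, the figure shows the process of inserting aa, aac, bb, bc, bc
The nodes of the tree fall into 3 kinds:
branch (yellow): an internal node; it does not represent actual data and always has child nodes
leaf (green): a leaf node; it has no child nodes and represents actual data
nub (purple): a mix of branch and leaf; it both represents actual data and has child nodes
The number in a node is the count of actual data entries that the node represents. If a row has several KeyValues, the qualifier, family, timestamp, etc. of each KeyValue are recorded during encoding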
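The three node kinds can be reproduced with a small trie. This is only a sketch: `Node`, `insert`, `classify` and `describe` are hypothetical names, and the trie uses one character per node, whereas a real prefix tree would typically merge single-child chains into one node.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.count = 0  # number of inserted rows that end exactly at this node

def insert(root, row):
    node = root
    for ch in row:
        node = node.children.setdefault(ch, Node())
    node.count += 1

def classify(node):
    if node.count and node.children:
        return "nub"     # represents data and also has children
    if node.count:
        return "leaf"    # represents data, no children
    return "branch"      # internal only, no data of its own

def describe(root):
    """Return {prefix: (kind, count)} for every node below the root."""
    out = {}
    stack = [("", root)]
    while stack:
        prefix, node = stack.pop()
        for ch, child in node.children.items():
            out[prefix + ch] = (classify(child), child.count)
            stack.append((prefix + ch, child))
    return out

root = Node()
for row in ["aa", "aac", "bb", "bc", "bc"]:
    insert(root, row)
kinds = describe(root)
```

With the rows from the example, "aa" comes out as a nub (it is a row itself and a prefix of "aac"), "aac" and "bb" are leaves with count 1, and "bc" is a leaf with count 2 because it was inserted twice.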
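My understanding of Diff Encoding:
split the key into fields; identical information is not written again, and the column family is stored once
if the key length, value length and type are the same as in the previous record, they are skipped and not stored
timestamp: stored as the delta from the previous one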
Since keys are sorted and usually very similar, it is possible to design a better compression than what a general purpose algorithm can do.
HBASE-4218 tried to solve this problem, and in HBase 0.94 you can choose between a couple of different algorithms: Prefix and Diff Encoding.
The main idea of Prefix Encoding is to store the common prefix only once, since the rows are sorted and the beginning is typically the same.
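Prefix Encoding can be sketched in a few lines. The functions below are hypothetical and operate on whole keys as byte strings; the real encoder works on the HBase key layout and a binary block format, which this sketch does not model.

```python
def common_prefix_len(a, b):
    """Number of leading bytes that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def prefix_encode(sorted_keys):
    """Store each key as (shared_prefix_length, remaining_suffix)."""
    encoded = []
    prev = b""
    for key in sorted_keys:
        shared = common_prefix_len(prev, key)
        encoded.append((shared, key[shared:]))
        prev = key
    return encoded

def prefix_decode(encoded):
    """Rebuild the full keys from the (length, suffix) pairs."""
    keys = []
    prev = b""
    for shared, suffix in encoded:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys

keys = [b"row1/cf:a", b"row1/cf:b", b"row2/cf:a"]
encoded = prefix_encode(keys)
```

The second key shares everything but its last byte with the first one, so only a single byte of suffix is stored for it. Decoding has to walk the block from the start, which is one reason scanning encoded blocks costs more CPU.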
The Diff Encoding pushes this concept further. Instead of considering the key as an opaque sequence of bytes, the Diff Encoder splits each key into its fields in order to compress each part in a better way. For example, the column family is stored only once. If the key length, value length and type are the same as in the previous row, the field is omitted. Also, for increased compression, the timestamp is stored as a diff from the previous one.
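The field-by-field idea can be illustrated with the sketch below. It is a simplification under stated assumptions: `diff_encode` and `diff_decode` are hypothetical, cells are encoded into Python dicts instead of bytes, and the omission of key length and value length is not modelled because Python strings carry their own length. Only three of the tricks are shown: family stored once, type omitted when unchanged, timestamp stored as a delta.

```python
def diff_encode(family, cells):
    """cells: list of (row, qualifier, timestamp, kind, value), one family."""
    out = {"family": family, "cells": []}  # the family is stored a single time
    prev = None
    for row, qualifier, ts, kind, value in cells:
        entry = {"row": row, "qualifier": qualifier, "value": value}
        if prev is None:
            entry["ts"] = ts            # first cell: full timestamp and type
            entry["kind"] = kind
        else:
            entry["ts_delta"] = ts - prev[2]   # difference from previous cell
            if kind != prev[3]:
                entry["kind"] = kind           # written only when it changes
        out["cells"].append(entry)
        prev = (row, qualifier, ts, kind, value)
    return out

def diff_decode(encoded):
    """Rebuild (family, cells) from the output of diff_encode."""
    cells = []
    ts = kind = None
    for entry in encoded["cells"]:
        ts = entry["ts"] if "ts" in entry else ts + entry["ts_delta"]
        kind = entry.get("kind", kind)   # fall back to the previous type
        cells.append((entry["row"], entry["qualifier"], ts, kind, entry["value"]))
    return encoded["family"], cells

cells = [("r1", "a", 1000, "put", "x"),
         ("r1", "b", 1003, "put", "y"),
         ("r2", "a", 998, "delete", "")]
encoded = diff_encode("cf", cells)
```

Timestamps of neighbouring cells are usually close, so the deltas are small numbers that need far fewer bytes than a full 8-byte timestamp in a binary format.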
Note that this feature is off by default since writing and scanning are slower but more data is cached. To enable this feature you can set DATA_BLOCK_ENCODING = PREFIX | DIFF | FAST_DIFF in the table info.
HFile v3
HBASE-5313 contains a proposal to restructure the HFile layout to improve compression:
Also, a columnar format or a columnar encoding is under investigation; take a look at AVRO-806 for a columnar file format by Doug Cutting.
As you can see, the trend in this evolution is to be more aware of what the file contains, in order to get better compression or better location awareness, which translates into less data to write to or read from disk. Less I/O means more speed!
[1] http://blog.cloudera.com/blog/2011/01/hadoop-io-sequence-map-set-array-bloommap-files/
[2] http://blog.cloudera.com/blog/2012/06/hbase-write-path/
Source document <http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/>