Bigtable: A Distributed Storage System for Structured Data
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf
http://www.dbthink.com/?p=493 (Chinese translation)
A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes. Bigtable supports single-row transactions.
A distributed database aimed at big data: it does not need to maintain a strict relational model like an RDBMS, and users can easily update the schema.
Bigtable guarantees row-level transactions, i.e. atomicity. In terms of the CAP theorem, Bigtable guarantees strong consistency of data at the expense of availability.
To understand Bigtable, look at it from the following aspects.
Data Model
A typical KV storage system only needs to support simple key-value storage and lookup; as a data structure it is just a map, so it is usually implemented with hashing, as in the Dynamo family.
Although Bigtable also counts as a KV store in terms of its storage structure, it needs more features, such as a table-like schema and support for range queries, so its data structure is more complex: a multidimensional sorted map.
Looking at the definition, it is first of all a sorted KV store (sorted by row key), and it is multidimensional.
What does multidimensional mean? The key is not just the row key.
As shown in the figure (Figure 1 of the paper), the key is a composite structure containing the row, column family, column qualifier, and timestamp, so it has multiple dimensions.
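A toy sketch of what this composite key looks like (purely illustrative, not Bigtable's actual implementation; the `webtable` data loosely follows Figure 1 of the paper):

```python
# Toy illustration of Bigtable's multidimensional sorted map:
# (row key, column family, qualifier, timestamp) -> uninterpreted bytes.
# The data below is hypothetical, loosely following Figure 1 of the paper.

webtable = {}

def put(row, family, qualifier, value, ts):
    webtable[(row, family, qualifier, ts)] = value

put("com.cnn.www", "contents", "", b"<html>...v3</html>", ts=6)
put("com.cnn.www", "contents", "", b"<html>...v2</html>", ts=5)
put("com.cnn.www", "anchor", "cnnsi.com", b"CNN", ts=9)

# Sorted iteration over the composite key yields row/column/timestamp order
# (a real implementation sorts timestamps descending so the newest comes first).
for key in sorted(webtable):
    print(key, webtable[key])
```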
LSMTree
Bigtable consists of a master and tablet servers, and a tablet server is essentially an implementation of an LSM Tree.
Since this is not Bigtable's invention, the paper only covers it briefly; see the earlier post explaining the SSTable format and LSM Tree indexing.
The LSM Tree is built on SSTable files, and Bigtable uses the GFS file system directly, so there is not much more to say about this part.
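For reference, a highly simplified sketch of the LSM write path (a memtable flushed to immutable sorted runs); the threshold and names here are invented:

```python
# Minimal LSM-style write path: memtable in memory, flushed to immutable
# sorted "SSTable" runs when it grows too large. Purely illustrative.

MEMTABLE_LIMIT = 4          # hypothetical flush threshold (entries, not bytes)
memtable = {}               # mutable, in-memory
sstables = []               # list of immutable sorted key/value runs on "disk"

def write(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        flush()

def flush():
    """Freeze the memtable and persist it as a sorted, immutable run."""
    global memtable
    sstables.append(sorted(memtable.items()))   # newest run appended last
    memtable = {}

for i in range(10):
    write(f"row{i:02d}", f"value{i}")
print(len(sstables), "sstables flushed;", len(memtable), "entries still in memtable")
```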
Master
The master's main job is to manage the tablet servers and assign tablets.
Since this is a master-based design, it is actually quite simple.
First, the master maintains the metadata relating tables, tablet servers, and tablets in memory.
Second, it relies on Chubby (similar to ZooKeeper) to do the following:
Distributed locking
To ensure that there is at most one active master at any time;
Distributed monitoring
To discover tablet servers and finalize tablet server deaths (see Section 5.2);
Distributed configuration management
To store the bootstrap location of Bigtable data (see Section 5.1);
To store Bigtable schema information (the column family information for each table);
To store access control lists.
I only take notes on the parts I find easy to understand; the implementation section of the paper is not very well organized, so I replace it with a version that is easier to follow...
Bigtable is designed to reliably scale to petabytes of data and thousands of machines. Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability.
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.
Bigtable is used by more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth. These products use Bigtable for a variety of demanding workloads, which range from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users.
As you can see, Bigtable (and by extension its open-source counterpart HBase) is used across a very wide range of applications...
A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key,column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
For example, we would use URLs as row keys, various aspects of web pages as column names, and store the contents of the web pages in the contents: column under the timestamps when they were fetched, as illustrated in Figure 1.
The row keys in a table are arbitrary strings (currently up to 64KB in size, although 10-100 bytes is a typical size for most of our users).
Every read or write of data under a single row key is atomic (regardless of the number of different columns being read or written in the row).
Bigtable maintains data in lexicographic order by row key. The row range for a table is dynamically partitioned.
Each row range is called a tablet, which is the unit of distribution and load balancing. As a result, reads of short row ranges are efficient and typically require communication with only a small number of machines.
Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in a column family is usually of the same type (we compress data in the same column family together). It is our intent that the number of distinct column families in a table be small (in the hundreds at most), and that families rarely change during operation. In contrast, a table may have an unbounded number of columns.
The purpose of defining column families is to organize columns effectively, so the number of families is relatively small, and the set of families is fairly stable and rarely changes.
A column key is named using the following syntax: family:qualifier.
Column family names must be printable, but qualifiers may be arbitrary strings.
Access control and both disk and memory accounting are performed at the column-family level.
Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp.
Bigtable timestamps are 64-bit integers. They can be assigned by Bigtable, in which case they represent “real time” in microseconds, or be explicitly assigned by client applications.
Applications that need to avoid collisions must generate unique timestamps themselves. Different versions of a cell are stored in decreasing timestamp order, so that the most recent versions can be read first.
To make the management of versioned data less onerous, we support two per-column-family settings that tell Bigtable to garbage-collect cell versions automatically.
The client can specify either that only the last n versions of a cell be kept, or that only new-enough versions be kept (e.g., only keep values that were written in the last seven days).
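A small sketch of how these two per-column-family GC policies could be applied to one cell's version list; the data layout (a list of (timestamp, value) pairs, newest first) is invented for illustration, not Bigtable's real storage format:

```python
# Sketch of the two per-column-family version GC policies described above:
# keep only the last n versions of a cell, or only versions newer than a cutoff.
import time

def gc_versions(versions, max_versions=None, max_age_seconds=None):
    """versions: list of (timestamp, value), sorted by decreasing timestamp."""
    kept = versions
    if max_versions is not None:
        kept = kept[:max_versions]
    if max_age_seconds is not None:
        cutoff = time.time() - max_age_seconds
        kept = [(ts, v) for ts, v in kept if ts >= cutoff]
    return kept

now = time.time()
cell = [(now, b"v3"), (now - 3 * 86400, b"v2"), (now - 30 * 86400, b"v1")]
print(gc_versions(cell, max_versions=2))               # keep the last 2 versions
print(gc_versions(cell, max_age_seconds=7 * 86400))    # keep the last seven days
```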
The Bigtable API provides functions for creating and deleting tables and column families. It also provides functions for changing cluster, table, and column family metadata, such as access control rights.
Bigtable supports several other features that allow the user to manipulate data in more complex ways.
First, Bigtable supports single-row transactions, which can be used to perform atomic read-modify-write sequences on data stored under a single row key. Bigtable does not currently support general transactions across row keys, although it provides an interface for batching writes across row keys at the clients.
Second, Bigtable allows cells to be used as integer counters.
Finally, Bigtable supports the execution of client-supplied scripts in the address spaces of the servers. The scripts are written in a language developed at Google for processing data called Sawzall [28].
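Going back to the single-row transactions and counters mentioned above, here is a hedged sketch of an atomic read-modify-write on one row; the per-row lock and table structure are invented, and real Bigtable performs this server-side:

```python
# Sketch of a single-row read-modify-write, the kind of atomicity Bigtable
# guarantees per row key. The per-row lock and table layout are toy stand-ins.
import threading
from collections import defaultdict

rows = defaultdict(dict)                    # row key -> {column: value}
row_locks = defaultdict(threading.Lock)     # one lock per row key

def read_modify_write(row_key, column, fn):
    """Atomically apply fn to one column of one row (e.g. an integer counter)."""
    with row_locks[row_key]:
        old = rows[row_key].get(column, 0)
        rows[row_key][column] = fn(old)
        return rows[row_key][column]

print(read_modify_write("com.cnn.www", "anchor_count", lambda x: x + 1))
print(read_modify_write("com.cnn.www", "anchor_count", lambda x: x + 1))
```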
Bigtable can be used with MapReduce [12], a framework for running large-scale parallel computations developed at Google.
Bigtable is built on several other pieces of Google infrastructure.
1. Bigtable uses the distributed Google File System (GFS) [17] to store log and data files.
2. A Bigtable cluster typically operates in a shared pool of machines that run a wide variety of other distributed applications, and Bigtable processes often share the same machines with processes from other applications. Bigtable depends on a cluster management system for scheduling jobs, managing resources on shared machines, dealing with machine failures, and monitoring machine status.
3. The Google SSTable file format is used internally to store Bigtable data.
An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specified key, and to iterate over all key/value pairs in a specified key range.
Internally, each SSTable contains a sequence of blocks (typically each block is 64KB in size, but this is configurable).
A block index (stored at the end of the SSTable) is used to locate blocks; the index is loaded into memory when the SSTable is opened.
A lookup can be performed with a single disk seek: we first find the appropriate block by performing a binary search in the in-memory index, and then reading the appropriate block from disk.
Optionally, an SSTable can be completely mapped into memory, which allows us to perform lookups and scans without touching disk.
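A sketch of this lookup path, assuming a toy in-memory layout (blocks plus an index of each block's last key); this is not the real SSTable format:

```python
# Sketch of an SSTable-style lookup: binary search in an in-memory block index
# (last key of each block -> block position), then scan only that one block.
import bisect

# Each "block" is a sorted list of (key, value); the index stores each block's last key.
blocks = [
    [("a", 1), ("c", 2)],
    [("f", 3), ("h", 4)],
    [("m", 5), ("z", 6)],
]
index_last_keys = [blk[-1][0] for blk in blocks]   # loaded into memory on open

def lookup(key):
    i = bisect.bisect_left(index_last_keys, key)   # find the candidate block
    if i == len(blocks):
        return None
    for k, v in blocks[i]:                         # one "disk read" of that block
        if k == key:
            return v
    return None

print(lookup("f"), lookup("x"))   # -> 3 None
```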
4. Bigtable relies on a highly-available and persistent distributed lock service called Chubby [8].
A Chubby service consists of five active replicas, one of which is elected to be the master and actively serve requests. The service is live when a majority of the replicas are running and can communicate with each other.
Chubby uses the Paxos algorithm [9, 23] to keep its replicas consistent in the face of failure. Chubby provides a namespace that consists of directories and small files. Each directory or file can be used as a lock, and reads and writes to a file are atomic. The Chubby client library provides consistent caching of Chubby files.
Each Chubby client maintains a session with a Chubby service. A client's session expires if it is unable to renew its session lease within the lease expiration time. When a client's session expires, it loses any locks and open handles. Chubby clients can also register callbacks on Chubby files and directories for notification of changes or session expiration.
Bigtable uses Chubby for a variety of tasks:
To ensure that there is at most one active master at any time;
To store the bootstrap location of Bigtable data (see Section 5.1);
To discover tablet servers and finalize tablet server deaths (see Section 5.2);
To store Bigtable schema information (the column family information for each table);
To store access control lists.
If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable.
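As a sketch of the "at most one active master" use, here is a toy exclusive-lock service standing in for Chubby; `LockService` is hypothetical, not a real client API:

```python
# Sketch of "at most one active master" via an exclusive lock on a well-known
# file in a lock service. `LockService` is an invented, in-process stand-in
# for Chubby (or ZooKeeper); it is NOT a real client library.
import threading

class LockService:
    """Toy lock service: exclusive, non-blocking file locks."""
    def __init__(self):
        self._locks = {}
        self._mutex = threading.Lock()

    def try_acquire(self, path, owner):
        with self._mutex:
            if path not in self._locks:
                self._locks[path] = owner
                return True
            return False

chubby = LockService()

def become_master(name):
    if chubby.try_acquire("/bigtable/master-lock", name):
        print(name, "is the active master")
    else:
        print(name, "stays on standby")

become_master("master-1")
become_master("master-2")   # fails: only one active master at a time
```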
Bigtable's implementation consists of three major components per Bigtable instance: a library that is linked into every client, one master server, and many tablet servers (cf. [CDG+06, p. 4]).
As stated in the previous sections, tables are dynamically split into tablets, and these tablets are distributed among multiple tablet servers, which can dynamically enter and leave a Bigtable instance at runtime. Hence, Bigtable has to provide means for managing and looking up tablet locations, so that master servers can redistribute tablets and client libraries can discover the tablet servers that are in charge of certain rows of a table.
We use a three-level hierarchy analogous to (only analogous to, not the same as) that of a B+-tree [10] to store tablet location information (Figure 4).
The locations of tablets are stored in a table named METADATA which is completely held in memory.
This table is partitioned into a special first tablet (root tablet) and an arbitrary number of further tablets (other METADATA tablets).
The other METADATA tablets contain the location information for all tablets of user tables (i.e. tables created by client applications), whereas the root tablet contains information about the location of the other METADATA tablets and is never split itself.
The location information for the root tablet is stored in a file placed in a Chubby namespace.
The location information for a tablet is stored in a row that is identified by the "tablet's table identifier and its end row".
The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the locations of all tablets of a special METADATA table, and each METADATA tablet contains the locations of a set of user tablets.
The root tablet is in fact just the first tablet of the METADATA table, but it is treated specially: it is never split, which guarantees that the tablet location hierarchy never has more than three levels.
The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet's table identifier and its end row.
Each row of the METADATA table takes roughly 1KB of memory. With a reasonably sized METADATA tablet capped at 128MB, this three-level scheme can address 2^34 tablets (or 2^61 bytes in total if each tablet stores 128MB of data).
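The same arithmetic, worked out explicitly under the stated assumptions (about 1KB per METADATA row, 128MB per METADATA tablet):

```python
# Worked version of the addressing math above (assuming ~1 KB per METADATA row
# and a 128 MB size limit per METADATA tablet).
ROW_SIZE = 2**10            # ~1 KB of metadata per tablet-location row
TABLET_LIMIT = 2**27        # 128 MB cap on a METADATA tablet

rows_per_metadata_tablet = TABLET_LIMIT // ROW_SIZE        # 2**17 = 131,072
addressable_tablets = rows_per_metadata_tablet ** 2        # root -> META -> user: 2**34
addressable_bytes = addressable_tablets * TABLET_LIMIT     # at 128 MB per tablet: 2**61

print(rows_per_metadata_tablet, addressable_tablets, addressable_bytes == 2**61)
```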
The client library caches tablet locations.
If the client does not know the location of a tablet, or if it discovers that cached location information is incorrect, then it recursively moves up the tablet location hierarchy.
Although tablet locations are stored in memory, so no GFS accesses are required, we further reduce this cost in the common case by having the client library prefetch tablet locations: it reads the metadata for more than one tablet whenever it reads the METADATA table.
We also store secondary information in the METADATA table, including a log of all events pertaining to each tablet (such as when a server begins serving it). This information is helpful for debugging and performance analysis.
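A rough sketch of the client-side location cache and its fall-back up the hierarchy; the lookup helper is a hypothetical placeholder, and a real cache is keyed by row ranges rather than single rows:

```python
# Sketch of the client-side location cache: check the cache first, and on a
# miss (or stale entry) walk up the location hierarchy. The helper below is a
# placeholder, not a real Bigtable client API.

location_cache = {}   # toy cache: (table, row) -> tablet server address
                      # (a real client caches row *ranges*, not single rows)

def locate_tablet(table, row):
    key = (table, row)
    if key in location_cache:
        return location_cache[key]
    # Cache miss: move up the hierarchy -> METADATA tablet -> root tablet -> Chubby.
    server = read_metadata_location(table, row)
    location_cache[key] = server
    return server

def read_metadata_location(table, row):
    # Placeholder: a real client reads the METADATA row keyed by
    # (table identifier, end row) and can prefetch neighbouring rows here.
    return "tabletserver-42.example.com"

print(locate_tablet("webtable", "com.cnn.www"))
```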
A tablet is created, deleted and assigned to a tablet server by the master server.
Each tablet of a Bigtable is assigned to at most one tablet server at a time;
When a tablet is unassigned, and a tablet server with sufficient room for the tablet is available, the master assigns the tablet by sending a tablet load request to the tablet server.
When a tablet server starts, it creates a uniquely-named file in a predefined directory of a Chubby namespace and acquires an exclusive lock on that file.
The master server of a Bigtable instance constantly monitors the tablet servers by asking them whether they have still locked their file in Chubby;
To detect when a tablet server is no longer serving its tablets, the master periodically asks each tablet server for the status of its lock.
If a tablet server reports that it has lost its lock, or if the master was unable to reach a server during its last several attempts,
the master attempts to acquire an exclusive lock on the server's file.
If the master is able to acquire the lock, then Chubby is live and the tablet server is either dead or having trouble reaching Chubby,
so the master ensures that the tablet server can never serve again by deleting its server file.
Once a server's file has been deleted, the master can move all the tablets that were previously assigned to that server into the set of unassigned tablets. To ensure that a Bigtable cluster is not vulnerable to networking issues between the master and Chubby, the master kills itself if its Chubby session expires. However, as described above, master failures do not change the assignment of tablets to tablet servers.
The point of the master's monitoring is to detect tablet servers that are dead or can no longer reach Chubby, because such a server can no longer serve: clients go through Chubby to find tablet servers.
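A sketch of this failure-detection logic; the lock service and tablet-server objects are toy stand-ins invented to show the decision flow, not a real protocol:

```python
# Sketch of the master's failure-detection logic described above.

class LockService:
    def __init__(self):
        self.files = {"ts-1.lock": "ts-1"}            # lock file -> current owner
    def try_acquire(self, path, owner):
        if self.files.get(path) is None:              # lock free (session expired)
            self.files[path] = owner
            return True
        return False
    def delete(self, path):
        self.files.pop(path, None)

class TabletServer:
    def __init__(self, name, tablets):
        self.name, self.lock_file, self.tablets = name, f"{name}.lock", tablets
        self.reachable, self.holds_lock = True, True
    def report_lock_status(self):
        if not self.reachable:
            return None                               # master could not reach it
        return "held" if self.holds_lock else "lost"

def check_tablet_server(server, chubby, unassigned):
    status = server.report_lock_status()
    if status == "held":
        return                                        # healthy
    # Server lost its lock or is unreachable: try to grab its lock file ourselves.
    if chubby.try_acquire(server.lock_file, owner="master"):
        chubby.delete(server.lock_file)               # the server can never serve again
        unassigned.update(server.tablets)             # its tablets become unassigned

chubby, unassigned = LockService(), set()
ts = TabletServer("ts-1", {"tablet-A", "tablet-B"})
ts.reachable, ts.holds_lock = False, False            # simulate a dead server
chubby.files["ts-1.lock"] = None                      # its Chubby session expired
check_tablet_server(ts, chubby, unassigned)
print(unassigned)
```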
The master executes the following steps at startup.
(1) The master grabs a unique master lock in Chubby, which prevents concurrent master instantiations.
(2) The master scans the servers directory in Chubby to find the live servers.
(3) The master communicates with every live tablet server to discover what tablets are already assigned to each server.
(4) The master scans the METADATA table to learn the set of tablets. Whenever this scan encounters a tablet that is not already assigned, the master adds the tablet to the set of unassigned tablets, which makes the tablet eligible for tablet assignment.
One complication is that the scan of the METADATA table cannot happen until the METADATA tablets have been assigned. Therefore, before starting this scan (step 4), the master adds the root tablet to the set of unassigned tablets if an assignment for the root tablet was not discovered during step 3. This addition ensures that the root tablet will be assigned. Because the root tablet contains the names of all METADATA tablets, the master knows about all of them after it has scanned the root tablet.
To add to this: we must be able to read the METADATA table before step 4, which requires ensuring at step 3 that the root tablet has been assigned.
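A compact sketch of the startup sequence above, with toy stand-ins for Chubby and the tablet servers:

```python
# Sketch of the master startup sequence (steps 1-4 above). Everything here is
# a toy stand-in invented for illustration.

def master_startup(chubby, tablet_servers, metadata_tablets):
    # (1) Grab the unique master lock to prevent concurrent masters.
    assert chubby.try_acquire("/bigtable/master-lock", "master")
    # (2) Scan the servers directory in Chubby to find live tablet servers.
    live_servers = [s for s in tablet_servers if s["alive"]]
    # (3) Ask each live server which tablets it already serves.
    assigned = {t for s in live_servers for t in s["tablets"]}
    # Make sure the root tablet is assigned before scanning METADATA.
    unassigned = set()
    if "root" not in assigned:
        unassigned.add("root")
    # (4) Scan the METADATA table; any tablet not already assigned becomes unassigned.
    unassigned |= set(metadata_tablets) - assigned
    return unassigned

class Chubby:
    def try_acquire(self, path, owner):
        return True   # toy: the lock is always free

servers = [{"alive": True, "tablets": {"root", "meta-1", "user-1"}},
           {"alive": False, "tablets": set()}]
print(master_startup(Chubby(), servers, ["meta-1", "user-1", "user-2"]))
```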
All write operations on tablets are "committed to a commit log that stores redo records", which is persisted in the Google File System (GFS).
The recently committed updates are put into a sorted RAM buffer called the memtable. When a memtable reaches a certain size, it is frozen, a new memtable is created, and the frozen memtable is transformed into the SSTable format and written to GFS; this process is called a minor compaction.
Hence, the older updates get persisted in a sequence of SSTables on disk while the newer ones are present in memory.
SSTables were covered in the earlier blog post; having read that, this part is easy to understand...
The information about where the SSTables comprising a tablet are located is stored in the METADATA table along with a set of pointers directing into one or more commit logs by which the memtable can be reconstructed when the tablet gets assigned to a tablet server.
Write operations are checked for well-formedness as well as authorization before they are written to the commit log and the memtable (when they are finally committed). The authorization information is maintained at the column-family level and stored in a Chubby namespace.
Read operations are also checked for well-formedness and whether the requesting client is authorized to issue them. If a read operation is permitted, it "is executed on a merged view of the sequence of SSTables and the memtable". The merged view required for read operations can be established efficiently because the SSTables and the memtable are lexicographically sorted.
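A minimal sketch of such a merged view over the memtable and the SSTable sequence; toy dictionaries stand in for the real structures:

```python
# Sketch of the "merged view" a read sees: the memtable plus the sequence of
# SSTables, consulted newest-to-oldest so the most recent write wins.

memtable = {"row3": "new-value"}
sstables = [                       # oldest first, in flush order
    {"row1": "v1", "row3": "old-value"},
    {"row2": "v2"},
]

def read(key):
    if key in memtable:                       # newest data lives in memory
        return memtable[key]
    for sst in reversed(sstables):            # then newer SSTables before older ones
        if key in sst:
            return sst[key]
    return None

def scan():
    """Merged, sorted view over all runs (newer runs shadow older ones)."""
    merged = {}
    for run in sstables + [memtable]:         # apply oldest first, newest last
        merged.update(run)
    return sorted(merged.items())

print(read("row3"))   # -> "new-value": the memtable shadows the older SSTable
print(scan())
```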
Besides minor compactions (the freezing, transformation, and persistence of memtables as SSTables), the SSTables themselves also get compacted from time to time. Such a merging compaction is executed asynchronously by a background service in a copy-on-modify fashion. The goal of merging compactions is to limit the number of SSTables that have to be considered for read operations.
A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction.
SSTables produced by non-major compactions can contain special deletion entries that suppress deleted data in older SSTables that are still live.
A major compaction, on the other hand, produces an SSTable that contains no deletion information or deleted data. Bigtable cycles through all of its tablets and regularly applies major compactions to them. These major compactions allow Bigtable to reclaim resources used by deleted data, and also allow it to ensure that deleted data disappears from the system in a timely fashion, which is important for services that store sensitive data.
This kind of major compaction is especially useful for workloads with many deletions and for storage that holds sensitive data...
During all compactions, read as well as write operations can still be served. This is because SSTables as well as frozen memtables are immutable and are only discarded once the compaction has finished successfully. In addition, a memtable for committed write operations is always present.
The implementation described in the previous section required a number of refinements to achieve the high performance, availability, and reliability required by our users. This section describes portions of the implementation in more detail in order to highlight these refinements.
Clients can group multiple column families together into a locality group. A separate SSTable is generated for each locality group in each tablet.
Segregating column families that are not typically accessed together into separate locality groups enables more efficient reads.
For example, page metadata in Webtable (such as language and checksums) can be in one locality group, and the contents of the page can be in a different group: an application that wants to read the metadata does not need to read through all of the page contents.
In addition, some useful tuning parameters can be specified on a per-locality group basis. For example, a locality group can be declared to be in-memory. SSTables for in-memory locality groups are loaded lazily into the memory of the tablet server. Once loaded, column families that belong to such locality groups can be read without accessing the disk. This feature is useful for small pieces of data that are accessed frequently: we use it internally for the location column family in the METADATA table.
Clients can control whether or not the SSTables for a locality group are compressed, and if so, which compression format is used. The user-specified compression format is applied to each SSTable block (whose size is controllable via a locality group specific tuning parameter).
Although we lose some space by compressing each block separately, we benefit in that small portions of an SSTable can be read without decompressing the entire file. Many clients use a two-pass custom compression scheme. The first pass uses Bentley and McIlroy's scheme [6], which compresses long common strings across a large window. The second pass uses a fast compression algorithm that looks for repetitions in a small 16 KB window of the data. Both compression passes are very fast: they encode at 100-200 MB/s and decode at 400-1000 MB/s on modern machines.
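A sketch of the per-block idea, using zlib as a stand-in for the two-pass scheme (block size and data are arbitrary):

```python
# Sketch of per-block compression: each SSTable block is compressed on its own,
# trading some compression ratio for the ability to read one block without
# decompressing the whole file. zlib stands in for Bigtable's real scheme.
import zlib

BLOCK_SIZE = 64 * 1024                      # configurable per locality group

def compress_blocks(data: bytes):
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    return [zlib.compress(b) for b in blocks]

def read_block(compressed_blocks, block_index):
    # Only the requested block is decompressed.
    return zlib.decompress(compressed_blocks[block_index])

payload = b"some highly repetitive page contents " * 10000
blocks = compress_blocks(payload)
print(len(blocks), "blocks;", len(read_block(blocks, 1)), "bytes in block 1")
```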
To improve read performance, tablet servers use two levels of caching.
The Scan Cache is a higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code.
The Block Cache is a lower-level cache that caches SSTable blocks that were read from GFS.
The Scan Cache is most useful for applications that tend to read the same data repeatedly. The Block Cache is useful for applications that tend to read data that is close to the data they recently read (e.g., sequential reads, or random reads of different columns in the same locality group within a hot row).
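A toy sketch of the two cache levels, using a tiny LRU as a stand-in for whatever eviction policy the real tablet server uses; all capacities and keys are invented:

```python
# Sketch of the two cache levels: a Scan Cache for key/value pairs returned by
# the SSTable interface, and a Block Cache for raw SSTable blocks read from GFS.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity, self.items = capacity, OrderedDict()
    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)
        return self.items[key]
    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)    # evict the least recently used entry

scan_cache = LRUCache(capacity=1024)    # caches (row, column) -> value pairs
block_cache = LRUCache(capacity=128)    # caches (sstable, block offset) -> bytes

scan_cache.put(("com.cnn.www", "contents:"), b"<html>...</html>")
block_cache.put(("sstable-0007", 65536), b"raw block bytes")
print(scan_cache.get(("com.cnn.www", "contents:")) is not None)
```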
As described in Section 5.3, a read operation has to read from all SSTables that make up the state of a tablet.
If these SSTables are not in memory, we may end up doing many disk accesses. We reduce the number of disk accesses by allowing clients to specify that Bloom filters [7] should be created for SSTables in a particular locality group. A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair. For certain applications, a small amount of tablet server memory used for storing Bloom filters drastically reduces the number of disk seeks required for read operations. Our use of Bloom filters also implies that most lookups for non-existent rows or columns do not need to touch disk.
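A minimal Bloom filter sketch showing why most lookups for non-existent rows or columns can skip disk entirely (sizes and hash choices are arbitrary):

```python
# Minimal Bloom filter: answers "might this SSTable contain data for this
# row/column pair?" with no false negatives and a small false-positive rate.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: bytes):
        for i in range(self.num_hashes):
            h = hashlib.sha256(i.to_bytes(2, "big") + key).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter()
bf.add(b"com.cnn.www/anchor:cnnsi.com")
print(bf.might_contain(b"com.cnn.www/anchor:cnnsi.com"))  # True
print(bf.might_contain(b"missing.row/contents:"))          # almost certainly False: skip the disk read
```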
If we kept the commit log for each tablet in a separate log file, a very large number of files would be written concurrently in GFS. Depending on the underlying file system implementation on each GFS server, these writes could cause a large number of disk seeks to write to the different physical log files. In addition, having separate log files per tablet also reduces the effectiveness of the group commit optimization, since groups would tend to be smaller.
To fix these issues, we append mutations to a single commit log per tablet server, co-mingling mutations for different tablets in the same physical log file [18, 20].
Using one log provides significant performance benefits during normal operation, but it complicates recovery.
When a tablet server dies, the tablets that it served will be moved to a large number of other tablet servers: each server typically loads a small number of the original server's tablets. To recover the state for a tablet, the new tablet server needs to reapply the mutations for that tablet from the commit log written by the original tablet server. However, the mutations for these tablets were co-mingled in the same physical log file.
One approach would be for each new tablet server to read this full commit log file and apply just the entries needed for the tablets it needs to recover. However, under such a scheme, if 100 machines were each assigned a single tablet from a failed tablet server, then the log file would be read 100 times (once by each server).
We avoid duplicating log reads by first sorting the commit log entries in order of the keys (table; row name; log sequence number). In the sorted output, all mutations for a particular tablet are
contiguous and can therefore be read efficiently with one disk seek followed by a sequential read. To parallelize the sorting, we partition the log file into 64 MB segments, and sort each segment in parallel on different tablet servers. This sorting process is coordinated by the master and is initiated when a tablet server indicates that it needs to recover mutations from some commit log file.
The distributed Google File System is not immune to latency spikes, e.g. due to server crashes or network congestion. To reduce the impact of such latencies, each tablet server uses two writer threads for commit logs, each writing to its own file. Only one of these threads is actively writing to GFS at a given time. If this active thread suffers from GFS "performance hiccups", the commit logging is switched to the second thread. As any operation in a commit log has a unique sequence number, duplicate entries in the two commit logs can be eliminated when a tablet server loads a tablet.
First: with one log file per tablet, efficiency would suffer because the number of tablets grows quickly, so this is optimized to one log file per tablet server.
The resulting problem is that the logs of all tablets are mingled together; when a tablet server dies and its tablets are rebalanced, each server taking over must pick out the log entries belonging to the tablets it is now responsible for.
The fix is simple: each log entry carries a (table, row name, log sequence number) triple, so the entries can be sorted at recovery time and the records for a given tablet extracted easily; when the log is large, it is split into 64 MB segments and sorted in parallel by multiple tablet servers, MapReduce-style.
Finally, to reduce the impact of GFS write latency, two writer threads are used, each writing to its own log file; if one becomes slow, logging switches to the other.
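A sketch of the recovery-time sort described above; log entries are toy tuples, and the parallel 64 MB segmentation is only noted in a comment:

```python
# Sketch of the recovery-time log sort: entries from the shared commit log are
# sorted by (table, row name, log sequence number) so each recovering server
# can read its tablets' mutations contiguously.

log_entries = [
    ("webtable", "com.cnn.www", 3, "put contents:=..."),
    ("users",    "alice",       1, "put name:=Alice"),
    ("webtable", "com.cnn.www", 2, "put anchor:=CNN"),
    ("users",    "bob",         4, "delete email:"),
]

# Sort by the triple; in Bigtable this is done in parallel over 64 MB segments.
sorted_log = sorted(log_entries, key=lambda e: (e[0], e[1], e[2]))

def mutations_for_tablet(table, start_row, end_row):
    """After sorting, one tablet's mutations form a contiguous range."""
    return [e for e in sorted_log
            if e[0] == table and start_row <= e[1] <= end_row]

print(mutations_for_tablet("webtable", "com.a", "com.z"))
```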
Tablet recovery is the process of tablet loading performed by the tablet server that a particular tablet has been assigned to. As discussed in this section and in section 6.1.4, a tablet server has to evaluate the commit logs that contain operations for the tablet being loaded. Besides the aforementioned Bloom filter optimization, Bigtable tries to avoid having a tablet server read a commit log at all when recovering a tablet.
This is achieved by employing two minor compactions when a tablet server stops serving a tablet.
The first compaction reduces "the amount of uncompacted state in the tablet server's commit log".
The second compaction (usually very fast) eliminates any remaining uncompacted state in the tablet server's log that arrived while the first minor compaction was being performed.
Before this second compaction is executed, the tablet server stops serving any requests. These optimizations to reduce tablet recovery time only apply when a tablet server stops serving tablets in a controlled fashion (i.e. it did not stop due to a crash).
Note that this only speeds up tablet recovery when the tablet server shuts down cleanly; after a crash there is no way to speed it up, because the memtable is lost and the commit log must be replayed.
On a clean shutdown there is actually no need to read the commit log at all, because everything can still be read from the memtable; the optimization is simply to run a minor compaction before stopping service, merging the memtable contents into SSTables.
It is done twice: the first compaction takes relatively long, so new write operations may arrive while it runs; the second compaction then takes care of those remaining writes.
Besides the SSTable caches, various other parts of the Bigtable system have been simplified by the fact that all of the SSTables that we generate are immutable. For example,
Firstly, we do not need any synchronization of accesses to the file system when reading from SSTables. As a result, concurrency control over rows can be implemented very efficiently. The only mutable data structure that is accessed by both reads and writes is the memtable. To reduce contention during reads of the memtable, we make each memtable row copy-on-write and allow reads and writes to proceed in parallel.
Secondly, since SSTables are immutable, the problem of permanently removing deleted data is transformed to garbage collecting obsolete SSTables. Each tablet's SSTables are registered in the METADATA table. The master removes obsolete SSTables as a mark-and-sweep garbage collection [25] over the set of SSTables, where the METADATA table contains the set of roots.
Finally, the immutability of SSTables enables us to split tablets quickly. Instead of generating a new set of SSTables for each child tablet, we let the child tablets share the SSTables of the parent tablet.
This is a consequence of Bigtable's append-only design: when data is never updated in place, many aspects of the design are greatly simplified.
Because SSTables are immutable (the data itself never changes), reading and writing SSTables never conflict, deleting data is easy, and even splitting a tablet becomes straightforward.
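A sketch of the mark-and-sweep garbage collection of obsolete SSTables, with the METADATA registrations as the roots (all file names invented):

```python
# Sketch of mark-and-sweep GC of obsolete SSTables: the SSTables registered in
# the METADATA table are the roots; anything in GFS not reachable from them can
# be deleted.

files_in_gfs = {"sst-001", "sst-002", "sst-003", "sst-004"}

metadata_table = {                      # tablet -> SSTables it currently uses
    "tablet-A": {"sst-001", "sst-002"},
    "tablet-B": {"sst-002", "sst-004"}, # child tablets may share a parent's SSTable
}

def collect_obsolete_sstables():
    live = set()
    for sstables in metadata_table.values():   # "mark": roots from METADATA
        live |= sstables
    return files_in_gfs - live                 # "sweep": everything unreachable

print(collect_obsolete_sstables())             # -> {'sst-003'}
```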
In the process of designing, implementing, maintaining, and supporting Bigtable, we gained useful experience and learned several interesting lessons.
Chang et al. criticize the assumption made in many distributed protocols that large distributed systems are only vulnerable to few failures like “the standard network partitions and fail-stop failures”. In contrast, they faced a lot more issues: “memory and network corruption, large clock skew, hung machines, extended and asymmetric network partitions, bugs in other systems that we are using (Chubby for example), overflow of GFS quotas, and planned and unplanned hardware maintenance”. Hence, they argue that such sources of failure also have to be addressed when designing and implementing distributed systems protocols. Examples that were implemented at Google are checksumming for RPC calls as well as removing assumptions in a part of the system about the other parts of the system (e. g. that only a fixed set of errors can be returned by a service like Chubby).
A lesson learned at Google while developing Bigtable is to implement new features into such a system only once the actual usage patterns for them are known.
A counterexample Chang et al. mention is general-purpose distributed transactions, which were planned for Bigtable but never implemented because there never was an immediate need for them. It turned out that most applications using Bigtable only needed single-row transactions. The only use case for distributed transactions that came up was the maintenance of secondary indices, which can be dealt with by a "specialized mechanism [. . . that] will be less general than distributed transactions, but will be more efficient". Hence, general-purpose implementations arising when no actual requirements and usage patterns are specified should be avoided, according to Chang et al.
A practical suggestion is to monitor the system as well as its clients in order to detect and analyze problems. In Bigtable, for example, the RPC layer produces "a detailed trace of the important actions", which helped to "detect and fix many problems such as lock contention on tablet data structures, slow writes to GFS while committing Bigtable mutations, and stuck accesses to the METADATA table when METADATA tablets are unavailable".
In the eyes of Chang et al., the most important lesson to be learned from Bigtable's development is that simplicity and clarity in design as well as code are of great value, especially for big and unexpectedly evolving systems like Bigtable. As an example they mention the tablet-server membership protocol, which was designed too simply at first, was then refactored iteratively until it became too complex and depended too much on seldom-used Chubby features, and in the end was redesigned "to a newer simpler protocol that depends solely on widely-used Chubby features" (see section 6.1.4).
This article is excerpted from Cnblogs (博客园); original publication date: 2012-07-07.