摘自:HBase Replication Overview
Apache HBase Replication is a way of copying data from one HBase cluster to a different and possibly distant HBase cluster. It works on the principle that the transactions from the originating cluster are pushed to another cluster. In HBase jargon, the cluster doing the push is called the master, and the one receiving the transactions is called the slave. This push of transactions is done asynchronously, and these transactions are batched in a configurable size (default is 64MB). Asynchronous mode incurs minimal overhead on the master, and shipping edits in a batch increases the overall throughput.
This blogpost discusses the possible use cases, underlying architecture and modes of HBase replication as supported in CDH4 (which is based on 0.92). We will discuss Replication configuration, bootstrapping, and fault tolerance in a follow up blogpost.
HBase replication supports replicating data across datacenters. This can be used for disaster recovery scenarios, where we can have the slave cluster serve real time traffic in case the master site is down. Since HBase replication is not intended for automatic failover, the act of switching from the master to the slave cluster in order to start serving traffic is done by the user. Afterwards, once the master cluster is up again, one can do a CopyTable job to copy the deltas to the master cluster (by providing the start/stop timestamps) as described in the CopyTable blogpost.
Another replication use case is when a user wants to run load intensive MapReduce jobs on their HBase cluster; one can do so on the slave cluster while bearing a slight performance decrease on the master cluster.
The underlying principle of HBase replication is to replay all the transactions from the master to the slave. This is done by replaying the WALEdits (Write Ahead Log entries) in the WALs (Write Ahead Log) from the master cluster, as described in the next section. These WALEdits are sent to the slave cluster region servers, after filtering (whether a specific edit is scoped for replication or not) and shipping in a customized batch size (default is 64MB). In case the WAL Reader reaches the end of the current WAL, it will ship whatever WALEdits have been read till then. Since this is an asynchronous mode of replication, the slave cluster may lag behind from the master in a write heavy application by the order of minutes.
In HBase, all the mutation operations (Puts/Deletes) are written to a memstore which belongs to a specific region and also appended to a write ahead log file (WAL) in the form of a WALEdit. A WALEdit is an object which represents one transaction, and can have more than one mutation operation. Since HBase supports single row-level transaction, one WALEdit can have entries for only one row. The WALs are repeatedly rolled after a configured time period (default is 60 minutes) such that at any given time, there is only one active WAL per regionserver.
IncrementColumnValue, a CAS (check and substitute) operation, is also converted into a Put when written to the WAL.
A memstore is an in-memory, sorted map containing keyvalues of the composing region; there is one memstore per each column family per region. The memstore is flushed to the disk as an HFile once it reaches the configured size (default is 64MB).
Writing to WAL is optional, but it is required to avoid data loss because in case a regionserver crashes, HBase may lose all the memstores hosted on that region server. In case of regionserver failure, its WALs are replayed by a Log splitting process to restore the data stored in the WALs.
For Replication to work, write to WALs must be enabled.
Every HBase cluster has a clusterID, a UUID type auto generated by HBase. It is kept in underlying filesystem (usually HDFS) so that it doesn’t change between restarts. This is stored inside the /hbase/hbaseid znode. This id is used to acheive master-master/acyclic replication. A WAL contains entries for a number of regions which are hosted on the regionserver. The replication code reads all the keyvalues and filters out the keyvalues which are scoped for replication. It does this by looking at the column family attribute of the keyvalue, and matching it to the WALEdit’s column family map data structure. In the case that a specific keyvalue is scoped for replication, it edits the clusterId parameter of the keyvalue to the HBase cluster Id.
The ReplicationSource is a java Thread object in the regionserver process and is responsible for replicating WAL entries to a specific slave cluster. It has a priority queue which holds the log files that are to be replicated. As soon as a log is processed, it is removed from the queue. The priority queue uses a comparator that compares the log files based on their creation timestamp, (which is appended to the log file name); so, logs are processed in the same order as their creation time (older logs are processed first). If there is only one log file in the priority queue, it will not be deleted as it represents the current WAL.
Zookeeper plays a key role in HBase Replication, where it manages/coordinates almost all the major replication activity, such as registering a slave cluster, starting/stopping replication, enqueuing new WALs, handling regionserver failover, etc. It is advisable to have a healthy Zookeeper quorum (at least 3 or more nodes) so as to have it up and running all the time. Zookeeper should be run independently (and not by HBase). The following figure shows a sample of replication related znodes structure in the master cluster (the text after colon is the data of the znode):
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
|
/
hbase
/
hbaseid
:
b53f7ec6
-
ed8a
-
4227
-
b088
-
fd6552bd6a68
…
.
/
hbase
/
rs
/
foo1
.
bar
.
com
,
40020
,
1339435481742
:
/
hbase
/
rs
/
foo2
.
bar
.
com
,
40020
,
1339435481973
:
/
hbase
/
rs
/
foo3
.
bar
.
com
,
40020
,
1339435486713
:
/
hbase
/
replication
:
/
hbase
/
replication
/
state
:
true
/
hbase
/
replication
/
peers
:
/
hbase
/
replication
/
peers
/
1
:
zk
.
quorum
.
slave
:
281
:
/
hbase
/
hbase
/
replication
/
rs
:
/
hbase
/
replication
/
rs
/
foo1
.
bar
.
com
.
com
,
40020
,
1339435084846
:
/
hbase
/
replication
/
rs
/
foo1
.
bar
.
com
,
40020
,
1339435481973
/
1
:
/
hbase
/
replication
/
rs
/
foo1
.
bar
.
com
,
40020
,
1339435481973
/
1
/
foo1
.
bar
.
com
.
1339435485769
:
1243232
/
hbase
/
replication
/
rs
/
foo3
.
bar
.
com
,
40020
,
1339435481742
:
/
hbase
/
replication
/
rs
/
foo3
.
bar
.
com
,
40020
,
1339435481742
/
1
:
/
hbase
/
replication
/
rs
/
foo3
.
bar
.
com
,
40020
,
1339435481742
/
1
/
foo3
.
bar
.
.
com
.
1339435485769
:
1243232
/
hbase
/
replication
/
rs
/
foo2
.
bar
.
com
,
40020
,
1339435089550
:
/
hbase
/
replication
/
rs
/
foo2
.
bar
.
com
,
40020
,
1339435481742
/
1
:
/
hbase
/
replication
/
rs
/
foo2
.
bar
.
com
,
40020
,
1339435481742
/
1
/
foo2
.
bar
.
.
com
.
13394354343443
:
1909033
|
Figure 1. Replication znodes hierarchy
As per Figure 1, there are three regionservers in the master cluster, namely foo[1-3].bar.com. There are three znodes related to replication:
There are three modes for setting up HBase Replication:
Figure 2. Sun and Earth, two HBase clusters
Lets say cluster#Sun receives a fresh and valid mutation M on a table and column family which is scoped for replication in both the clusters. It will have a default clusterID (0L). The replication Source instance ReplicationSrc-E will set its cluster#Id equal to the originator id (100), and ships it to cluster#Earth. When cluster#Earth receives the mutation, it replays it and the mutation is saved in its WAL, as per the normal flow. The cluster#Id of the mutation is kept intact in this log file at cluster#Earth.The Replication source instance at cluster#Earth, (ReplicationSrc-S, will read the mutation and checks its cluster#ID with the slaveCluster# (100, equal to cluster#Sun). Since they are equal, it will skip this WALEdit entry.
Figure 3. A circular replication set up
Figure 3. shows a circular replication setup, where cluster#Sun is replicating to cluster#Earth, cluster#Earth is replicating to cluster#Venus, and cluster#Venus is replicating to cluster#Sun.
Let’s say cluster#Sun receives a fresh mutation M and is scoped to replication across all the above clusters. It will be replicated to cluster#Earth as explained above in the master master replication. The replication source instance at cluster#Earth, ReplicationSrc-V, will read the WAL and see the mutation and replicate it to cluster#Venus. The cluster#Id of the mutation is kept intact (as of cluster#Sun), at cluster#Venus. At this cluster, the replication source instance for cluster#Sun, ReplicationSrc-S, will see that the mutation has the same clusterId as its slaveCluster# (as of cluster#Sun), and therefore, skip this from replicating.
HBase Replication is powerful functionality which can be used in a disaster recovery scenario. It was added as a preview feature in the 0.90 release, and has been evolving along with HBase in general, with addition of functionalities such as master-master replication, acyclic replication (both added in 0.92), and enabling-disabling replication at peer level (added in 0.94).
In the next blogpost, we will discuss various operational features such as Configuration, etc, and other gotchas with HBase Replication.