HBase: table replication/snapshot/backup within and across clusters



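serial no | solution          | level | steps                                   | advantages | shortcomings | use cases
----------+-------------------+-------+-----------------------------------------+------------+--------------+----------
1         | direct client API | log   | - transfer data via both                |            |              |
2         | export/import     | log   | - MR generates log files                |            |              |
          |                   |       | - transfer files                        |            |              |
          |                   |       | - import with MR                        |            |              |
3         | copy table        | file  | same as export, but the last step is:   |            |              |
          |                   |       | using hdfs put files                    |            |              |
4         | replication       | log   | similar to #1                           |            |              |
5         | bulkload          |       |                                         |            |              |
6         | snapshot          | file  | - flush before snapshotting if          |            |              |
          |                   |       | - create snapshot                       |            |              |
          |                   |       | - clone to new table                    |            |              |
          |                   |       | - restore from new table                |            |              |
          |                   |       |   [cluster internal]                    |            |              |
7         | distcp            | file  | - flush memstore                        |            |              |
          |                   |       | - distcp files within both clusters     |            |              |
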

Now I want to back up last month's data from one table to another cluster, but the two clusters cannot connect to each other (so no MR job can run across them). I therefore came up with these new steps:

1. Subset the table data (last month: 2014-06-01 --> 2014-06-30). The destination table new-tableX must already exist with the same column families as tableX.


hbase org.apache.hadoop.hbase.mapreduce.CopyTable -Dhbase.client.scanner.caching=1000 -Dmapred.map.tasks.speculative.execution=false --starttime=1401552000000 --endtime=1404057600000 --new.name=new-tableX tableX

 Then you MUST flush this table, because some data still lies in the memstores and the next step operates directly at the file level:


 echo "flush 'new-tableX'" | hbase shell
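
The --starttime and --endtime values in step 1 are epoch milliseconds. A minimal sketch of how they could be computed, assuming GNU date and a UTC+8 wall clock (the offset is inferred from the values in the command, it is not stated anywhere); to_epoch_ms is a hypothetical helper name. CopyTable treats --endtime as exclusive, and 1404057600000 is 2014-06-30 00:00 in UTC+8, so the sketch uses 2014-07-01 00:00 as the end in order to include the cells written on June 30:

```shell
# Convert a timestamp string to epoch milliseconds (assumes GNU date).
to_epoch_ms() {
  echo "$(( $(date -d "$1" +%s) * 1000 ))"
}

# Assumption: the cluster's wall clock is UTC+8.
START_MS=$(to_epoch_ms '2014-06-01 00:00:00 +0800')
END_MS=$(to_epoch_ms '2014-07-01 00:00:00 +0800')   # exclusive upper bound

echo "--starttime=$START_MS --endtime=$END_MS"
```

The printed options can be pasted into the CopyTable command line in place of the hard-coded values.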



2. Download the table's HFiles from HDFS

 hadoop fs -get /hbase/new-tableX new-tableX

 (Of course, you can run this command on multiple nodes in parallel by splitting the work by subdirectory.)
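
The parallel download could be planned like this. It is only a sketch: plan_downloads and the node names are hypothetical, and the function just prints one command per subdirectory (assigned round-robin to the nodes) so the plan can be reviewed before anything runs.

```shell
# plan_downloads DEST NODE...
# Reads HDFS directory paths from stdin and prints one ssh command per path,
# handing the paths to the given nodes in round-robin order.
plan_downloads() {
  [ "$#" -ge 2 ] || { echo "usage: plan_downloads DEST NODE..." >&2; return 1; }
  dest=$1; shift
  while IFS= read -r dir; do
    [ -n "$dir" ] || continue
    node=$1
    printf 'ssh %s "mkdir -p %s && hadoop fs -get %s %s/"\n' "$node" "$dest" "$dir" "$dest"
    shift; set -- "$@" "$node"   # rotate: the node just used goes to the back
  done
}

# Usage on a real cluster (not run here); nodeA..nodeC are placeholders:
#   hadoop fs -ls /hbase/new-tableX | awk 'NF >= 8 {print $NF}' \
#     | plan_downloads new-tableX nodeA nodeB nodeC
# Review the printed lines, then pipe them to sh to execute.
```

Printing instead of executing keeps the sketch safe to try; each node ends up with its own share of the table directory on local disk.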


3. Transfer these files to the other cluster in parallel

 a. scp the files, in parts, to local nodes A, B, C, ...

 b. from each of those nodes, scp its part to a peer node in the other cluster

 (This spreads the transfer over several nodes on both sides, so it is not limited by the network bandwidth of a single node.)
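
Because the files pass through two scp hops, it may be worth checking that they arrive intact. A sketch, assuming GNU coreutils (sha256sum, find, sort -z) on both sides; make_manifest and verify_manifest are hypothetical helper names, not part of HBase or Hadoop:

```shell
# make_manifest DIR
# Print a sha256 checksum line for every file under DIR (paths relative to DIR).
make_manifest() {
  ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum )
}

# verify_manifest DIR MANIFEST
# Re-check the files under DIR against MANIFEST (use an absolute path for it).
# Returns non-zero if any file is missing or differs.
verify_manifest() {
  ( cd "$1" && sha256sum --quiet -c "$2" )
}

# Usage (paths are placeholders):
#   source side:  make_manifest new-tableX > /tmp/new-tableX.sha256
#   copy /tmp/new-tableX.sha256 along with the files, then on the peer node:
#   verify_manifest new-tableX /tmp/new-tableX.sha256 && echo "transfer OK"
```

Running the check before step 4 avoids putting a truncated HFile into the destination HDFS.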


4. Now import the data into HDFS on the other cluster

 hadoop fs -put part-files /hbase


5. Load these HFiles into meta and assign the regions

 hbase hbck -fixMeta


 hbase hbck -fixAssignments

 (hbck then checks once more whether the table is readable or not)


6. Rename the new table to the original table name [optional]

hbase shell> disable 'tableName'
hbase shell> snapshot 'tableName', 'tableSnapshot'
hbase shell> clone_snapshot 'tableSnapshot', 'newTableName'
hbase shell> delete_snapshot 'tableSnapshot'
hbase shell> drop 'tableName'

  The snapshot utility is supported from version 0.94.6 onwards; if you run an older version, you can patch it in.
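
The rename commands can be generated for concrete table names. A small sketch: rename_table_script and the snapshot name suffix are hypothetical, and the function only prints the HBase shell commands listed in step 6. It assumes the target name does not yet exist on this cluster, since clone_snapshot creates it.

```shell
# rename_table_script OLD NEW
# Print the HBase shell commands that rename table OLD to NEW via a snapshot.
rename_table_script() {
  old=$1
  new=$2
  snap="${old}_rename_snapshot"   # hypothetical snapshot name
  cat <<EOF
disable '$old'
snapshot '$old', '$snap'
clone_snapshot '$snap', '$new'
delete_snapshot '$snap'
drop '$old'
EOF
}

# Dry run: print the commands for this walkthrough's tables.
rename_table_script new-tableX tableX

# To execute on a cluster with snapshot support:
#   rename_table_script new-tableX tableX | hbase shell
```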


Some optimizations for step 1:

 - tune the MapReduce maximum attempts and the tolerated failure ratio

 - disable HLog writing (this may require modifying Import.Importer in Import.java)

 - decrease the block replication

   -Ddfs.replication=2 or -Ddfs.replication=1

 - increase the buffer

 - presplit the new table at creation

   {NUMREGIONS => [1], SPLITALGO => 'HexStringSplit'}


 [1] hbase - how many regions fit a table when presplitting or while it keeps running






 jira: snapshot of table (with the design docs attached)

 Copying part of an HBase table for testing (some tools that use Java classes from the shell)




