| No. | Solution | Level | Precondition | Flow | Advantages | Shortcomings | Use cases |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | direct client API | log | - | transfer data via both clusters | | | |
| 2 | export/import | log | | 1) MR job generates log files; 2) transfer the files; 3) import with MR | | | |
| 3 | copy table | file | same as export, but the last step uses hdfs put to upload the files | | | | |
| 4 | replication | log | similar to #1 | | | | |
| 5 | bulkload | | | | | | |
| 6 | snapshot | file | flush before snapshotting if the table is online | 1) create snapshot; 2) clone to a new table; 3) restore from the new table [within one cluster] | | | |
| 7 | distcp | file | 1) flush memstore; 2) distcp the files between the two clusters | | | | |
Now I want to back up last month's data from one table to another cluster, but the two clusters cannot connect to each other (so no MR job can span them). I therefore worked out the following steps:
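1. Subset the table data (last month: 2014-06-01 to 2014-06-30). The time range is given in epoch milliseconds and --endtime is exclusive; the end value below is exactly 29 days after the start, so it stops short of the last day of June (use 1404144000000, i.e. 30 days after the start, to cover it).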
hbase org.apache.hadoop.hbase.mapreduce.CopyTable -Dhbase.client.scanner.caching=1000 -Dmapred.map.tasks.speculative.execution=false --starttime=1401552000000 --endtime=1404057600000 --new.name=new-tableX tableX
Then you MUST flush the new table, because some data still sits in the memstores and the next step works directly at the file level:
echo "flush 'new-tableX'" | hbase shell
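The --starttime and --endtime values are easy to get wrong by hand, so here is a minimal sketch for deriving them. It assumes GNU date (the -d option) and a UTC+8 local time, which is only inferred from the start value used above; to_epoch_ms is a hypothetical helper name, not part of HBase.

```shell
# Convert a timestamp with an explicit UTC offset to epoch milliseconds.
# Assumes GNU date; to_epoch_ms is a hypothetical helper.
to_epoch_ms() {
  # $1: e.g. "2014-06-01 00:00:00 +0800"
  echo "$(date -d "$1" +%s)000"
}

# Start of June (inclusive) and start of July (exclusive end of the range).
START_MS=$(to_epoch_ms "2014-06-01 00:00:00 +0800")
END_MS=$(to_epoch_ms "2014-07-01 00:00:00 +0800")

echo "--starttime=$START_MS --endtime=$END_MS"
```

Using the first instant of the following month as the end keeps the whole last day inside the range.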
2. Download the table's HFiles from HDFS
hadoop fs -get /hbase/new-tableX new-tableX
(Of course, you can run this command in parallel on multiple nodes by splitting the table's directories into subtasks.)
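One way to split the directories into subtasks is a simple round-robin assignment. The sketch below only computes the assignment and does not touch HDFS; assign_batches and the region names are hypothetical, and in practice the input list would come from listing the table directory.

```shell
# Read one directory name per line on stdin and assign each to one of N
# batches in round-robin order. Prints "<batch-index> <dir>" per line.
# assign_batches is a hypothetical helper.
assign_batches() {
  # $1: number of batches (for example, the number of download nodes)
  awk -v n="$1" '{ b = (NR - 1) % n; print b, $0 }'
}

# Example with made-up region directory names:
printf '%s\n' region-a region-b region-c | assign_batches 2
# prints:
# 0 region-a
# 1 region-b
# 0 region-c
```

Each node then runs hadoop fs -get only for the directories in its own batch.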
3. Transfer these files to the other cluster in parallel
a. scp parts of the files to local nodes A, B, C, ...
b. from each of those nodes, scp its part of the files to a peer node in the other cluster
(This spreads the load, so that neither side is limited by the network bandwidth of a single node.)
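The pairing of local parts with peer nodes can be planned before anything is copied. This is a dry-run sketch that only prints the scp commands; plan_transfers, the batch-N directory names, the peer host names, and the target path are all hypothetical.

```shell
# Print (do not run) one scp command per peer host, pairing batch-0 with
# the first peer, batch-1 with the second, and so on.
# plan_transfers and all names below are hypothetical.
plan_transfers() {
  # $1: target directory on the peer; remaining args: peer hosts
  target=$1
  shift
  i=0
  for peer in "$@"; do
    echo "scp -r batch-$i $peer:$target"
    i=$((i + 1))
  done
}

plan_transfers /data/staging peerA peerB
# prints:
# scp -r batch-0 peerA:/data/staging
# scp -r batch-1 peerB:/data/staging
```

Reviewing the printed plan first makes it easier to check that every part has exactly one destination.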
4. Now import the data into HDFS
hadoop fs -put part-files /hbase
5. Register these HFiles in META and assign the regions
hbase hbck -fixMeta
then
hbase hbck -fixAssignments
(The second command makes one more attempt to judge whether the table is readable.)
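Steps 4 and 5 must run in order, and each one only makes sense if the previous one succeeded. The sketch below wraps them in a small script with a dry-run mode; run, import_table, and the DRY_RUN variable are hypothetical names, while the hadoop and hbase commands are the ones shown above.

```shell
# Run a command, or only print it when DRY_RUN=1 (the default here).
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Upload the copied files, then repair META and region assignments.
# Stops at the first failing step.
import_table() {
  # $1: local directory holding the copied table files
  run hadoop fs -put "$1" /hbase || return 1
  run hbase hbck -fixMeta || return 1
  run hbase hbck -fixAssignments || return 1
}

import_table part-files
```

With DRY_RUN=1 the script only prints the three commands; setting DRY_RUN=0 would execute them on a node where the Hadoop and HBase clients are installed.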
6. Rename the new table to the original table name [optional]
hbase shell> disable 'tableName'
hbase shell> snapshot 'tableName', 'tableSnapshot'
hbase shell> clone_snapshot 'tableSnapshot', 'newTableName'
hbase shell> delete_snapshot 'tableSnapshot'
hbase shell> drop 'tableName'
The snapshot utility is supported from version 0.94.6 onward; if you run an older version, you can patch it in.
Some optimizations for step 1:
- Limit the number of attempts per map task:
-Dmapred.map.max.attempts=2
- Tolerate a small ratio of failed map tasks (the value is an integer percentage, so 5 means 5%):
-Dmapred.max.map.failures.percent=5
- Disable HLog writes (this may require modifying Import.Importer)
- Decrease the block replication factor:
-Ddfs.replication=2 or -Ddfs.replication=1
- Increase the client write buffer:
-Dhbase.client.write.buffer=10485760
- Pre-split the new table at creation time:
{NUMREGIONS => [1], SPLITALGO => 'HexStringSplit'}
[1] HBase: how many regions suit a table when pre-splitting or while it keeps running
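Putting the -D options together, the tuned CopyTable invocation could look like the sketch below. It only builds and prints the command line; build_copytable_cmd is a hypothetical helper, and the failure percentage is written as the integer 5 on the assumption that the property expects a whole-number percentage.

```shell
# Build the tuned CopyTable command line as a string (dry run only).
# build_copytable_cmd is a hypothetical helper.
build_copytable_cmd() {
  # $1: source table, $2: new table name, $3: start ms, $4: end ms
  echo "hbase org.apache.hadoop.hbase.mapreduce.CopyTable" \
    "-Dhbase.client.scanner.caching=1000" \
    "-Dmapred.map.tasks.speculative.execution=false" \
    "-Dmapred.map.max.attempts=2" \
    "-Dmapred.max.map.failures.percent=5" \
    "-Ddfs.replication=2" \
    "-Dhbase.client.write.buffer=10485760" \
    "--starttime=$3 --endtime=$4 --new.name=$2 $1"
}

CMD=$(build_copytable_cmd tableX new-tableX 1401552000000 1404057600000)
echo "$CMD"
```

Note that each option is written as -Dname=value, with a single -D and no equals sign between -D and the property name.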
References:
CDH: Introduction to Apache HBase Snapshots
JIRA: snapshot of table (design documents attached)
Copying part of an HBase table for testing (some tools that call Java classes from the shell)