HBase Data Migration: CopyTable

The previous article covered snapshot-based migration: HBase Data Migration: Based on HBase Snapshot. This article covers migration with CopyTable.

CopyTable is one of the tools for migrating HBase data, and it works at table granularity. Under the hood it is also a MapReduce job: the MR tasks scan the source table and write each scanned row to the target cluster's table with Put operations.

CopyTable's strengths are convenience and simplicity: it can copy within a single cluster or across clusters, supports incremental copies, and can rename the table along the way. Because it uses a scan-put path, however, its performance is relatively poor, and it is not recommended for large data volumes.
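Note that CopyTable does not create the destination table for you; the target must already exist with matching column families before the job runs. Below is a minimal sketch of an intra-cluster copy with a rename, where the table and column-family names are hypothetical and the source table is assumed to have a single family cf1:

$ echo "create 'newTestTable', 'cf1'" | bin/hbase shell
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=newTestTable TestTable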

First, let's look at CopyTable's option list, which is fairly extensive:

$ ./hbase org.apache.hadoop.hbase.mapreduce.CopyTable
Usage: CopyTable [general options] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] 

Options:
 rs.class     hbase.regionserver.class of the peer cluster
              specify if different from current cluster
 rs.impl      hbase.regionserver.impl of the peer cluster
 startrow     the start row
 stoprow      the stop row
 starttime    beginning of the time range (unixtime in millis)
              without endtime means from starttime to forever
 endtime      end of the time range.  Ignored if no starttime specified.
 versions     number of cell versions to copy
 new.name     new table's name
 peer.adr     Address of the peer cluster given in the format
              hbase.zookeeper.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
 families     comma-separated list of families to copy
              To copy from cf1 to cf2, give sourceCfName:destCfName. 
              To keep the same name, just give "cfName"
 all.cells    also copy delete markers and deleted cells
 bulkload     Write input into HFiles and bulk load to the destination table

Args:
 tablename    Name of the table to copy

Examples:
 To copy 'TestTable' to a cluster that uses replication for a 1 hour window:
 $ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable 
For performance consider the following general option:
  It is recommended that you set the following to >=100. A higher value uses more memory but
  decreases the round trip time to the server and may increase performance.
    -Dhbase.client.scanner.caching=100
  The following should always be set to false, to prevent writing data twice, which may produce 
  inaccurate results.
    -Dmapreduce.map.speculative=false

Example: migrate table TestTable to the cluster server1,server2,server3:2181:/hbase, specifying a time range (for an incremental copy) and column families, and renaming the table:

bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 --new.name='newTestTable' TestTable 
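The --starttime/--endtime values are HBase cell timestamps, i.e. unix time in milliseconds. As a small sketch of computing a one-hour window ending now, assuming a shell with GNU or BSD date (the arithmetic is an illustration, not part of CopyTable itself):

$ ENDTIME=$(( $(date +%s) * 1000 ))
$ STARTTIME=$(( ENDTIME - 3600 * 1000 ))
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=$STARTTIME --endtime=$ENDTIME --peer.adr=server1,server2,server3:2181:/hbase TestTable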

For a full migration, simply drop the time range:

bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable -Dhbase.client.scanner.caching=200 -Dmapreduce.local.map.tasks.maximum=16 -Dmapred.map.tasks.speculative.execution=false  --peer.adr=server1,server2,server3:2181:/hbase  --new.name='newTestTable' TestTable 
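If the scan-put write path is too slow for a large table, the bulkload option listed in the usage output writes the scanned data into HFiles and bulk loads them into the destination table instead of issuing Puts. A sketch of a full intra-cluster copy in bulkload mode, reusing the hypothetical table names from above:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --bulkload --new.name=newTestTable TestTable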

Parameter notes

  • mapreduce.local.map.tasks.maximum
    The maximum number of map tasks that run in parallel (this property only takes effect with the local job runner). If not specified, the default is 1 and all tasks run serially.
  • hbase.client.scanner.caching
    Recommended to be set to 100 or more. The larger the value, the more memory is used, but the fewer round trips the scan makes to the server, which helps read performance.
  • mapred.map.tasks.speculative.execution
    Recommended to be set to false, to prevent speculative execution from writing the same data twice.
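After a copy finishes, a quick sanity check is to compare row counts between the source and destination tables. RowCounter, another MapReduce tool shipped with HBase, can do this; the table names below follow the earlier hypothetical examples:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter TestTable
$ bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter newTestTable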

For more parameters, see the official documentation for CopyTable.

Reference: https://yq.aliyun.com/articles/176546
