The Complete Guide to Apache HBase

Basic Concepts

Coprocessor

 A Coprocessor is essentially an analysis component similar to MapReduce, but it greatly simplifies the MapReduce model: requests run independently and in parallel across all Regions, and a framework is provided so that users can flexibly implement custom Coprocessors
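
To make the model concrete, here is a minimal RegionObserver sketch against the 0.98-era API used elsewhere in this post (the class name and the Put counter are illustrative, not from the original): once attached to a table, its hooks run inside every Region, independently and in parallel.

import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

// Illustrative only: counts the Puts that each Region receives
public class PutCounterObserver extends BaseRegionObserver {

    private final AtomicLong puts = new AtomicLong();

    // Invoked by the hosting Region after every successful Put, so each
    // Region maintains its own counter in parallel with all the others
    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                        Put put, WALEdit edit, Durability durability) throws IOException {
        puts.incrementAndGet();
    }
}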

Programming Tips

Make full use of CellUtil
// Matching directly on byte[] is more efficient
// Bad: cf.equals(Bytes.toString(CellUtil.cloneFamily(cell)))
CellUtil.matchingFamily(cell, cf) && CellUtil.matchingQualifier(cell, col)
// Likewise, prefer `Bytes.equals` over `String#equals`
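
As a fuller sketch of this tip (the table layout and names are examples, not from the original): filter the cells of a Result at the byte[] level, without turning every cell into a String.

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CellMatching {

    private static final byte[] CF = Bytes.toBytes("info");
    private static final byte[] COL = Bytes.toBytes("name");

    // Returns the value of info:name, or null if the Result does not carry it
    static byte[] extractName(Result result) {
        for (Cell cell : result.rawCells()) {
            // byte[]-level comparison: no String is allocated per cell
            if (CellUtil.matchingFamily(cell, CF) && CellUtil.matchingQualifier(cell, COL)) {
                return CellUtil.cloneValue(cell);
            }
        }
        return null;
    }
}
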
Leverage the parallelism of coprocessors
// When it is hard to distribute table data evenly, pre-split the table into [00, 01, 02, ..., 99] and disable auto splitting (see: Common Commands - Splits), so that each Region holds only a single xx prefix. Then, while loading data, prepend an xx prefix to the rowkey in round-robin fashion, which guarantees there is no hot Region
// Inside the coprocessor, first obtain the xx prefix of the current Region, then prepend it to the startKey/endKey when building the Scan
static String getStartKeyPrefix(HRegion region) {
    if (region == null) throw new RuntimeException("Region is null!");
    byte[] startKey = region.getStartKey();
    // The first Region of a table has an empty start key, so fall back to "00"
    if (startKey == null || startKey.length == 0) return "00";
    String startKeyStr = Bytes.toString(startKey);
    return isEmpty(startKeyStr) ? "00" : startKeyStr.substring(0, 2);
}
private static boolean isEmpty(final String s) {
    return s == null || s.length() == 0;
}
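
A sketch of how the helper might be used when building the Scan, assuming the "two-digit prefix + business key" rowkey layout described above (buildPrefixedScan and its parameters are hypothetical; HRegion and Bytes are the same classes as above):

import org.apache.hadoop.hbase.client.Scan;

// Constrain the Scan to the current Region's prefix, so that every Region
// processes its own slice of the same logical key range in parallel
static Scan buildPrefixedScan(HRegion region, String businessStartKey, String businessEndKey) {
    String prefix = getStartKeyPrefix(region);
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes(prefix + businessStartKey));   // e.g. "07" + "20170101"
    scan.setStopRow(Bytes.toBytes(prefix + businessEndKey));
    return scan;
}
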
Handle exceptions inside coprocessors properly

 If an exception is thrown inside a coprocessor and the hbase.coprocessor.abortonerror parameter is not enabled, the coprocessor is simply removed from the environment it was loaded into. Otherwise, it depends on the exception type: an IOException is thrown directly; a DoNotRetryIOException is thrown without any retries; anything else is retried 10 times by default (hard-coded in AsyncConnectionImpl#RETRY_TIMER). So handle exceptions carefully according to your own business scenario
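
One hedged way to put that advice into practice inside a coprocessor hook (doBusinessLogic is a hypothetical helper; log is the Commons Logging instance shown in the next section): wrap unexpected failures in DoNotRetryIOException to fail fast instead of triggering the default retries.

import java.io.IOException;

import org.apache.hadoop.hbase.DoNotRetryIOException;

// Inside a coprocessor method:
try {
    doBusinessLogic();                            // hypothetical business helper
} catch (IOException ioe) {
    throw ioe;                                    // propagated to the client as-is
} catch (Exception e) {
    log.error("Coprocessor failed, not retrying", e);
    throw new DoNotRetryIOException("coprocessor failure", e);   // fail fast, no retries
}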

Logging
// Only the Apache Commons Log class can be used here; otherwise nothing will be printed
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

private static final Log log = LogFactory.getLog(CoprocessorImpl.class.getName());

Deployment

# First upload the coprocessor jar
$ hadoop fs -copyFromLocal /home/hbase/script/coprocessor-0.0.1.jar hdfs://yuzhouwan/hbase/coprocessor/
$ hadoop fs -ls hdfs://yuzhouwan/hbase/coprocessor/

# Unload the old coprocessor
$ alter 'yuzhouwan', METHOD => 'table_att_unset', NAME =>'coprocessor$1'
# Attach the new coprocessor
$ alter 'yuzhouwan', METHOD => 'table_att', 'coprocessor' => 'hdfs://yuzhouwan/hbase/coprocessor/coprocessor-0.0.1.jar|com.yuzhouwan.hbase.coprocessor.Aggregation|111|'

# Check the RegionServer logs to observe how the coprocessor is running
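
The same deployment can also be done programmatically. Below is a sketch using the 0.98-era admin API (the jar path and class name mirror the shell example above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.Coprocessor;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class DeployCoprocessor {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        TableName table = TableName.valueOf("yuzhouwan");
        HTableDescriptor desc = admin.getTableDescriptor(table);
        // Attach the coprocessor jar stored on HDFS to the table descriptor
        desc.addCoprocessor("com.yuzhouwan.hbase.coprocessor.Aggregation",
                new Path("hdfs://yuzhouwan/hbase/coprocessor/coprocessor-0.0.1.jar"),
                Coprocessor.PRIORITY_USER, null);
        admin.modifyTable(table, desc);
        admin.close();
    }
}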

Common Commands

Cluster

$ su - hbase
$ start-hbase.sh

# HMaster    ThriftServer
$ jps | grep -v Jps
  32538 ThriftServer
  9383 HMaster
  8423 HRegionServer

# BackUp HMaster    ThriftServer
$ jps | grep -v Jps
  24450 jar
  21882 HMaster
  2296 HRegionServer
  14598 ThriftServer
  5998 Jstat

# BackUp HMaster    ThriftServer
$ jps | grep -v Jps
  31119 Bootstrap
  8775 HMaster
  25289 Bootstrap
  14823 Bootstrap
  12671 Jstat
  9052 ThriftServer
  26921 HRegionServer

# HRegionServer
$ jps | grep -v Jps
  29356 hbase-monitor-process-0.0.3-jar-with-dependencies.jar    # monitor
  11023 Jstat
  26135 HRegionServer


$ export -p | egrep -i "(hadoop|hbase)"
  declare -x HADOOP_HOME="/home/bigdata/software/hadoop"
  declare -x HBASE_HOME="/home/bigdata/software/hbase"
  declare -x PATH="/usr/local/anaconda/bin:/usr/local/R-3.2.1/bin:/home/bigdata/software/java/bin:/home/bigdata/software/hadoop/bin:/home/bigdata/software/hive/bin:/home/bigdata/software/sqoop/bin:/home/bigdata/software/hbase/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin"


$ java -XX:+PrintFlagsFinal -version | grep MaxHeapSize
    uintx MaxHeapSize                              := 32126271488     {product}           # 29.919921875 GB
  java version "1.7.0_60-ea"
  Java(TM) SE Runtime Environment (build 1.7.0_60-ea-b15)
  Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)


$ top
  top - 11:37:03 up 545 days, 18:45,  5 users,  load average: 8.74, 10.39, 10.96
  Tasks: 653 total,   1 running, 652 sleeping,   0 stopped,   0 zombie
  Cpu(s): 32.9%us,  0.7%sy,  0.0%ni, 66.3%id,  0.0%wa,  0.0%hi,  0.1%si,  0.0%st
  Mem:  264484056k total, 260853032k used,  3631024k free,  2235248k buffers
  Swap: 10485756k total, 10485756k used,        0k free, 94307776k cached
  # Memory: 252 GB


# `hbase classpath` returns all HBase-related dependencies
$ java -classpath ~/opt/hbase/soft/yuzhouwan.jar:`hbase classpath` com.yuzhouwan.hbase.MainApp


# Usage
Usage: hbase [<options>] <command> [<args>]
Options:
  --config DIR    Configuration direction to use. Default: ./conf
  --hosts HOSTS   Override the list in 'regionservers' file

Commands:
Some commands take arguments. Pass no args or -h for usage.
  shell           Run the HBase shell
  hbck            Run the hbase 'fsck' tool
  hlog            Write-ahead-log analyzer
  hfile           Store file analyzer
  zkcli           Run the ZooKeeper shell
  upgrade         Upgrade hbase
  master          Run an HBase HMaster node
  regionserver    Run an HBase HRegionServer node
  zookeeper       Run a Zookeeper server
  rest            Run an HBase REST server
  thrift          Run the HBase Thrift server
  thrift2         Run the HBase Thrift2 server
  clean           Run the HBase clean up script
  classpath       Dump hbase CLASSPATH
  mapredcp        Dump CLASSPATH entries required by mapreduce
  pe              Run PerformanceEvaluation
  ltt             Run LoadTestTool
  version         Print the version
  CLASSNAME       Run the class named CLASSNAME


# HBase version info
$ hbase version
  2017-01-13 11:05:07,580 INFO  [main] util.VersionInfo: HBase 0.98.8-hadoop2
  2017-01-13 11:05:07,580 INFO  [main] util.VersionInfo: Subversion file:///e/hbase_compile/hbase-0.98.8 -r Unknown
  2017-01-13 11:05:07,581 INFO  [main] util.VersionInfo: Compiled by 14074019 on Mon Dec 26 20:17:32     2016


$ hadoop fs -ls /hbase
  drwxr-xr-x   - hbase hbase          0 2017-03-01 00:05 /hbase/.hbase-snapshot
  drwxr-xr-x   - hbase hbase          0 2016-10-26 16:42 /hbase/.hbck
  drwxr-xr-x   - hbase hbase          0 2016-12-19 13:02 /hbase/.tmp
  drwxr-xr-x   - hbase hbase          0 2017-01-22 20:18 /hbase/WALs
  drwxr-xr-x   - hbase hbase          0 2015-09-18 09:34 /hbase/archive
  drwxr-xr-x   - hbase hbase          0 2016-10-18 09:44 /hbase/coprocessor
  drwxr-xr-x   - hbase hbase          0 2015-09-15 17:21 /hbase/corrupt
  drwxr-xr-x   - hbase hbase          0 2017-02-20 14:34 /hbase/data
  -rw-r--r--   2 hbase hbase         42 2015-09-14 12:10 /hbase/hbase.id
  -rw-r--r--   2 hbase hbase          7 2015-09-14 12:10 /hbase/hbase.version
  drwxr-xr-x   - hbase hbase          0 2016-06-28 12:14 /hbase/inputdir
  drwxr-xr-x   - hbase hbase          0 2017-03-01 10:40 /hbase/oldWALs
  -rw-r--r--   2 hbase hbase     345610 2015-12-08 16:54 /hbase/test_bulkload.txt


$ hadoop fs -ls /hbase/WALs
  drwxr-xr-x   - hbase hbase          0 2016-12-27 16:08 /hbase/WALs/yuzhouwan03,60020,1482741120018-splitting
  drwxr-xr-x   - hbase hbase          0 2017-03-01 10:36 /hbase/WALs/yuzhouwan03,60020,1483442645857
  drwxr-xr-x   - hbase hbase          0 2017-03-01 10:37 /hbase/WALs/yuzhouwan02,60020,1483491016710
  drwxr-xr-x   - hbase hbase          0 2017-03-01 10:37 /hbase/WALs/yuzhouwan01,60020,1483443835926
  drwxr-xr-x   - hbase hbase          0 2017-03-01 10:36 /hbase/WALs/yuzhouwan03,60020,1483444682422
  drwxr-xr-x   - hbase hbase          0 2017-03-01 10:16 /hbase/WALs/yuzhouwan04,60020,1485087488577
  drwxr-xr-x   - hbase hbase          0 2017-03-01 10:37 /hbase/WALs/yuzhouwan05,60020,1484790306754
  drwxr-xr-x   - hbase hbase          0 2017-03-01 10:37 /hbase/WALs/yuzhouwan06,60020,1484931966988


$ hadoop fs -ls /hbase/WALs/yuzhouwan01,60020,1483443835926

  -rw-r--r--   3 hbase hbase  127540109 2017-03-01 09:49 /hbase/WALs/yuzhouwan01,60020,1483443835926/yuzhouwan01%2C60020%2C1483443835926.1488330961720
  # ...
  -rw-r--r--   3 hbase hbase         83 2017-03-01 10:37 /hbase/WALs/yuzhouwan01,60020,1483443835926/yuzhouwan01%2C60020%2C1483443835926.1488335822133


# log
$ vim /home/hbase/logs/hbase-hbase-regionserver-yuzhouwan03.log


# HBase batch processing
$ echo "<command>" | hbase shell
$ hbase shell ../script/batch.hbase


# HBase shell
$ hbase shell


$ status
  1 servers, 0 dead, 41.0000 average load


$ zk_dump
  HBase is rooted at /hbase
  Active master address: yuzhouwan03,60000,1481009498847
  Backup master addresses:
   yuzhouwan02,60000,1481009591957
   yuzhouwan01,60000,1481009567346
  Region server holding hbase:meta: yuzhouwan03,60020,1483442645857
  Region servers:
   yuzhouwan02,60020,1483491016710
   # ...
  /hbase/replication: 
  /hbase/replication/peers: 
  /hbase/replication/peers/1: yuzhouwan03,yuzhouwan02,yuzhouwan01:2016:/hbase
  /hbase/replication/peers/1/peer-state: ENABLED
  /hbase/replication/rs: 
  /hbase/replication/rs/yuzhouwan03,60020,1483442645857: 
  /hbase/replication/rs/yuzhouwan03,60020,1483442645857/1: 
  /hbase/replication/rs/yuzhouwan03,60020,1483442645857/1/yuzhouwan03%2C60020%2C1483442645857.1488334114131: 116838271
  /hbase/replication/rs/1485152902048.SyncUpTool.replication.org,1234,1: 
  /hbase/replication/rs/yuzhouwan06,60020,1484931966988: 
  /hbase/replication/rs/yuzhouwan06,60020,1484931966988/1: 
  # ...
  Quorum Server Statistics:
   yuzhouwan02:2015
    Zookeeper version: 3.4.6-1569965, built on 02/20/2014 09:09 GMT
    Clients:
     /yuzhouwan:62003[1](queued=0,recved=625845,sent=625845)
     # ...
     /yuzhouwan:11151[1](queued=0,recved=8828,sent=8828)  
    Latency min/avg/max: 0/0/1
    Received: 161
    Sent: 162
    Connections: 168
    Outstanding: 0
    Zxid: 0xc062e91c6
    Mode: follower
    Node count: 25428
   yuzhouwan03:2015
    Zookeeper version: 3.4.6-1569965, built on 02/20/2014 09:09 GMT
    Clients:
     /yuzhouwan:39582[1](queued=0,recved=399812,sent=399812)
     # ...
     /yuzhouwan:58770[1](queued=0,recved=3234,sent=3234)

$ stop-hbase.sh

CRUD

$ list
  TABLE
  mytable
  yuzhouwan
  # ...
  20 row(s) in 1.4080 seconds


$ create 'yuzhouwan', {NAME => 'info', VERSIONS => 3}, {NAME => 'data', VERSIONS => 1}
  0 row(s) in 0.2650 seconds
  => Hbase::Table - yuzhouwan


$ put 'yuzhouwan', 'rk0001', 'info:name', 'Benedict Jin'
$ put 'yuzhouwan', 'rk0001', 'info:gender', 'Man'
$ put 'yuzhouwan', 'rk0001', 'data:pic', '[picture]'


$ get 'yuzhouwan', 'rk0001', {FILTER => "ValueFilter(=, 'binary:[picture]')"}
  COLUMN                                              CELL
  data:pic                                           timestamp=1479092170498, value=[picture]
  1 row(s) in 0.0200 seconds


$ get 'yuzhouwan', 'rk0001', {FILTER => "QualifierFilter(=, 'substring:a')"}
  COLUMN                                              CELL
  info:name                                          timestamp=1479092160236, value=Benedict Jin
  1 row(s) in 0.0050 seconds


$ scan 'yuzhouwan', {FILTER => "QualifierFilter(=, 'substring:a')"}
  ROW                                                 COLUMN+CELL
  rk0001                                             column=info:name, timestamp=1479092160236, value=Benedict Jin
  1 row(s) in 0.0140 seconds


# Query by timestamp
$ scan 'yuzhouwan', { TIMERANGE => [0, 1416083300000] }


# [rk0001, rk0003)
$ put 'yuzhouwan', 'rk0003', 'info:name', 'asdf2014'
$ scan 'yuzhouwan', {COLUMNS => 'info', STARTROW => 'rk0001', ENDROW => 'rk0003'}


# row keys starting with 'rk'
$ put 'yuzhouwan', 'aha_rk0003', 'info:name', 'Jin'
$ scan 'yuzhouwan', {FILTER => "PrefixFilter('rk')"}
  ROW                                                 COLUMN+CELL
  rk0001                                             column=data:pic, timestamp=1479092170498, value=[picture]
  rk0001                                             column=info:gender, timestamp=1479092166019, value=Man
  rk0001                                             column=info:name, timestamp=1479092160236, value=Benedict Jin
  rk0003                                             column=info:name, timestamp=1479092728688, value=asdf2014
  2 row(s) in 0.0150 seconds


$ delete 'yuzhouwan', 'rk0001', 'info:gender'
$ get 'yuzhouwan', 'rk0001'
  COLUMN                                              CELL
  data:pic                                           timestamp=1479092170498, value=[picture]
  info:name                                          timestamp=1479092160236, value=Benedict Jin
  2 row(s) in 0.0100 seconds


$ disable 'yuzhouwan'
$ drop 'yuzhouwan'

Modifying Rows and Columns

# Alter a table
$ disable 'yuzhouwan'
# Add column families
$ alter 'yuzhouwan', NAME => 'f1'
$ alter 'yuzhouwan', NAME => 'f2'
  Updating all regions with the new schema...
  1/1 regions updated.
  Done.
  0 row(s) in 1.3020 seconds


# Modify a column qualifier (CQ)
$ create 'yuzhouwan', {NAME => 'info'}
$ put 'yuzhouwan', 'rk00001', 'info:name', 'China'

$ get 'yuzhouwan', 'rk00001', {COLUMN => 'info:name'}, 'value'
$ put 'yuzhouwan', 'rk00001', 'info:address', 'value'

$ scan 'yuzhouwan'
  ROW                                                 COLUMN+CELL
   rk00001                                            column=info:address, timestamp=1480556328381, value=value
  1 row(s) in 0.0220 seconds


# Delete column families
$ alter 'yuzhouwan', {NAME => 'f3'}, {NAME => 'f4'}
$ alter 'yuzhouwan', {NAME => 'f5'}, {NAME => 'f1', METHOD => 'delete'}, {NAME => 'f2', METHOD => 'delete'}, {NAME => 'f3', METHOD => 'delete'}, {NAME => 'f4', METHOD => 'delete'}

# Cannot go down to the CQ level: alter 'ns_rec:tb_mem_tag', {NAME => 'cf_tag:partyIdType', METHOD => 'delete'} will not work

# Delete a row
$ deleteall '<table>', '<rowkey>'

Truncating Table Data

# Truncate table data
$ describe 'yuzhouwan'
  Table yuzhouwan is ENABLED
  COLUMN FAMILIES DESCRIPTION
  {NAME => 'data', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
  {NAME => 'f5', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '1', TTL => 'FOREVER', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'FALSE'
  , BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
  {NAME => 'info', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
  3 row(s) in 0.0230 seconds

# Introduced in 0.98, truncates the table data while preserving the region splits
$ truncate_preserve 'yuzhouwan'

# truncate performs drop table plus create table under the hood
$ truncate 'yuzhouwan'
$ scan 'yuzhouwan'
  ROW                                                 COLUMN+CELL
  0 row(s) in 0.3170 seconds

Renaming a Table

# Note that a snapshot name cannot contain characters like ':', i.e. there is no need to distinguish namespaces in it
$ disable 'yuzhouwan'
$ snapshot 'yuzhouwan', 'yuzhouwan_snapshot'
$ clone_snapshot 'yuzhouwan_snapshot', 'ns_site:yuzhouwan'
$ delete_snapshot 'yuzhouwan_snapshot'
$ drop 'yuzhouwan'
$ grant 'site', 'CXWR', 'ns_site:yuzhouwan'

$ user_permission 'yuzhouwan'
  User                                                Table,Family,Qualifier:Permission
   site                                               default,yuzhouwan,,: [Permission: actions=CREATE,EXEC,WRITE,READ]
   hbase                                              default,yuzhouwan,,: [Permission: actions=READ,WRITE,EXEC,CREATE,ADMIN]

$ disable 'ns_site:yuzhouwan'
$ drop 'ns_site:yuzhouwan'

$ exists 'ns_site:yuzhouwan'
  Table ns_site:yuzhouwan does not exist
  0 row(s) in 0.0200 seconds

Changing Table Properties

$ disable 'yuzhouwan'

# versions
$ alter 'yuzhouwan', NAME => 'f', VERSIONS => 5

# ttl (note that the TTL property applies to a CF rather than the whole table, and its unit is seconds)
$ alter 'yuzhouwan', NAME => 'f', TTL => 20

$ enable 'yuzhouwan'
$ describe 'yuzhouwan'

Compression Algorithms

# Creating with compression 'SNAPPY' failed: ERROR: java.io.IOException: Compression algorithm 'snappy' previously failed test.
# Try LZ4 instead (lower compression ratio but fast; it is already the default codec in Spark 2.x)
$ create 'yuzhouwan', {NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}, {NAME => 'v', COMPRESSION => 'LZ4', BLOOMFILTER => 'NONE', DATA_BLOCK_ENCODING => 'FAST_DIFF'}

$ describe 'yuzhouwan'
  Table yuzhouwan is ENABLED
  COLUMN FAMILIES DESCRIPTION
  {NAME => 'v', DATA_BLOCK_ENCODING => 'FAST_DIFF', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'LZ4', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
  1 row(s) in 0.0280 seconds

Access Control

# ACL
# R - read
# W - write
# X - execute
# C - create
# A - admin
$ grant 'benedict', 'WRXC', 'yuzhouwan'
$ echo "scan 'hbase:acl'" | hbase shell > acl.txt
  yuzhouwan column=l:benedict, timestamp=1496216745249, value=WRXC
  yuzhouwan column=l:hbase, timestamp=1496216737326, value=RWXCA

$ user_permission                  # without a <table> argument, fetches everything from the 'hbase:acl' table
$ user_permission 'yuzhouwan'
    User                                 Table,Family,Qualifier:Permission
   hbase                               default,yuzhouwan,,: [Permission: actions=READ,WRITE,EXEC,CREATE,ADMIN]
   benedict                            default,yuzhouwan,,: [Permission: actions=WRITE,READ,EXEC,CREATE]
  2 row(s) in 0.0510 seconds

$ revoke 'benedict', 'yuzhouwan'

Splits

# splits
$ create 'yuzhouwan', {NAME => 'f'}, SPLITS => ['1', '2', '3']                      # 5 regions
$ alter 'yuzhouwan', SPLITS => ['1', '2', '3', '4', '5', '6', '7', '8', '9']        # does not work

# Disable auto splitting
$ alter 'yuzhouwan', {METHOD => 'table_att', SPLIT_POLICY => 'org.apache.hadoop.hbase.regionserver.DisabledRegionSplitPolicy'}

# Configure whether the Master balances the number of Regions across RegionServers
# When maintaining or restarting a RegionServer, the balancer is switched off, which can leave Regions unevenly distributed across RegionServers; in that case the balancer needs to be switched back on manually
$ balance_switch true
$ balance_switch false

Namespaces

# namespace
$ list_namespace_tables 'hbase'
  TABLE
  acl
  meta
  namespace
  3 row(s) in 0.0050 seconds

$ list_namespace
  NAMESPACE
  default
  hbase
  # ...
  50 row(s) in 0.3710 seconds

$ create_namespace 'www'
$ exists 'www:yuzhouwan.site'
$ create 'www:yuzhouwan.site', {NAME => 'info', VERSIONS=> 9}, SPLITS => ['1','2','3','4','5','6','7','8','9']
$ alter_namespace 'www', {METHOD => 'set', 'PROPERTY_NAME' => 'PROPERTY_VALUE'}

$ drop_namespace 'www'

Manual Split

$ create 'yuzhouwan', {NAME => 'info', VERSIONS => 3}, {NAME => 'data', VERSIONS => 1}
$ put 'yuzhouwan', 'rk0001', 'info:name', 'Benedict Jin'
$ put 'yuzhouwan', 'rk0001', 'info:gender', 'Man'
$ put 'yuzhouwan', 'rk0001', 'data:pic', '[picture]'
$ put 'yuzhouwan', 'rk0002', 'info:name', 'Yuzhouwan'

# Usage:
#   split 'tableName'
#   split 'namespace:tableName'
#   split 'regionName' # format: 'tableName,startKey,id'
#   split 'tableName', 'splitKey'
#   split 'regionName', 'splitKey'
$ split 'yuzhouwan', 'rk0002'
  # Name                                                            Region Server        Start Key   End Key    Locality    Requests
  yuzhouwan,,1500964657548.bd21cdf7ae9e2d8e5b2ed3730eb8b738.        yuzhouwan01:60020                rk0002     1.0         0
  yuzhouwan,rk0002,1500964657548.76f95590aed5d39291a087c5e8e83833.  yuzhouwan02:60020    rk0002                 1.0         2

Phoenix Commands

# Execute an external SQL script
$ sqlline.py <host>:<port>/phoenix sql.txt

Practical Tips

Importing Hive Data (Bulkload)

 Bulkload parses the RCFile according to the Hive table's schema, generates HBase HFiles with a MapReduce job, and finally loads those HFiles directly into HBase via the bulkload mechanism, i.e. they are placed straight into HDFS. This is far more efficient than importing record by record through the API (generally, loading Hive data into HBase is done via bulkload)
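
A condensed sketch of that flow with the 0.98-era classes (the mapper that parses the RCFile is elided, and the output path is an example, not from the original): HFileOutputFormat2 arranges for one HFile per Region, then LoadIncrementalHFiles moves them into the table.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;

public class BulkloadDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hive2hbase-bulkload");
        // ... set the mapper that parses RCFile rows into KeyValues here ...
        HTable table = new HTable(conf, "yuzhouwan");
        // Wires in the partitioner/sorter so that reducers emit one HFile per Region
        HFileOutputFormat2.configureIncrementalLoad(job, table);
        Path output = new Path("hdfs://yuzhouwan/tmp/hfiles");
        HFileOutputFormat2.setOutputPath(job, output);
        if (job.waitForCompletion(true)) {
            // Moves the generated HFiles straight into the table's Regions
            new LoadIncrementalHFiles(conf).doBulkLoad(output, table);
        }
    }
}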

Cross-cluster Replication (CopyTable + Replication)

Related Commands

Command Comment
add_peer Adds a replication peer. ID is the identifier of the peer, and CLUSTER_KEY has the format hbase.zookeeper.quorum:hbase.zookeeper.property.clientPort:zookeeper.znode.parent
list_peers Lists all replication peers
enable_peer Sets a peer to the enabled state. A peer added via add_peer is enabled by default; after it has been disabled with disable_peer, enable_peer makes it available again
disable_peer Sets a peer to the disabled state
remove_peer Removes a replication peer
set_peer_tableCFs Sets which tables a peer replicates
A peer added via add_peer replicates all tables of the cluster by default. To replicate only certain tables, use set_peer_tableCFs; the granularity goes down to a table's column families. Tables are separated by ';' and column families by ','. e.g. set_peer_tableCFs '2', "table1; table2:cf1,cf2; table3:cfA,cfB"
append_peer_tableCFs Adds tables that a peer should also replicate
remove_peer_tableCFs Removes tables that a peer should no longer replicate
show_peer_tableCFs Shows which tables a peer replicates; empty output means all tables are replicated
list_replicated_tables Lists all replicated tables

Monitoring Replication

HBase Shell
$ status 'replication'
Metrics

Source side

Metrics Name Comment
sizeOfLogQueue Number of WAL files still waiting to be processed
ageOfLastShippedOp Replication delay of the last shipped op
shippedBatches Number of batches shipped
shippedKBs Amount of data shipped, in KB
shippedOps Number of entries shipped
logEditsRead Number of logEdits read
logReadInBytes Amount of log data read, in bytes
logEditsFiltered Number of logEdits actually filtered out

Sink side

Metrics Name Comment
sink.ageOfLastAppliedOp Delay of the last applied op
sink.appliedBatches Number of batches applied
sink.appliedOps Number of entries applied

Complete Steps

CopyTable
# Determine the migration time range
2017-01-01 00:00:00 (1483200000000)          2017-05-01 00:00:00 (1493568000000)
# The times must be converted into 13-digit, millisecond-level unix timestamps
# Online converter: http://tool.chinaz.com/Tools/unixtime.aspx
# Or with shell
$ echo "`date -d "2017-01-01 00:00:00" +%s`000"
$ echo "`date -d "2017-05-01 00:00:00" +%s`000"
# No need to worry about boundary issues here: the range is [starttime, endtime)
# Run on the source cluster (to lift the starttime limit, pass --starttime=0 instead)
$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1483200000000 --endtime=1493568000000 --peer.adr=<zk1>,<zk2>,...:<port>:/<znode> <table>

# Check data consistency (run on both clusters and compare whether the RowCounts match)
$ hbase org.apache.hadoop.hbase.mapreduce.RowCounter <table> --endtime=1493568000000

# Further consistency check (run on both clusters and compare whether the byte sizes match)
$ hadoop fs -du hdfs://<cluster>/hbase/data/<namespace>/<table>
Replication
# Run on the smaller cluster
# Run list_peers first to avoid peer id conflicts
$ list_peers
$ add_peer '<peer_id>', "<zk1>,<zk2>,...:<port>:/<znode>"

# Turn on the table's REPLICATION_SCOPE
# 1: open; 0: close (default)
$ disable '<table>'
$ alter '<table>', {NAME => '<cf>', REPLICATION_SCOPE => '1'}
$ enable '<table>'
Troubleshooting
# Run on the source cluster
$ hbase hbck
# If problems show up: hbase hbck --repair
# Once clean, run in `hbase shell`:
$ balance_switch true

Disabling Auto Split

$ alter 'yuzhouwan', {METHOD => 'table_att', SPLIT_POLICY => 'org.apache.hadoop.hbase.regionserver.DisabledRegionSplitPolicy'}

Fetching Specific Metrics via JMX

# Syntax
http://namenode:50070/jmx?qry=<metric>

# For example, return only the NameNodeInfo metrics
http://namenode:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo
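
For completeness, a small Java sketch that pulls the same bean over HTTP (host, port and qry value are the examples from above):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class JmxFetcher {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://namenode:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // JSON payload for the requested bean
            }
        } finally {
            conn.disconnect();
        }
    }
}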

Architecture

Overview

(Image source: HBase: The Definitive Guide)

Pitfalls

Table is neither in disabled nor in enabled state

Description

 After a normal create-table statement completed, the process stayed stuck on the enable table step

Solution

# Checking shows the table is in neither the `enabled` nor the `disabled` state
$ is_enabled 'yuzhouwan'
  false
$ is_disabled 'yuzhouwan'
  false
$ hbase zkcli
$ delete /hbase/table/yuzhouwan
$ hbase hbck -fixMeta -fixAssignments
# Restart the active HMaster
$ is_enabled 'yuzhouwan'
  true
$ disable 'yuzhouwan'

No GCs detected

Solution

# Print safepoint statistics for diagnosis
-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=0
# Biased locking has been enabled by default since JDK6, but it is unsuitable for highly concurrent scenarios; Cassandra disables it by default (https://github.com/apache/cassandra/blob/trunk/conf/jvm.options#L116)
-XX:-UseBiasedLocking

Hexadecimal Values Not Recognized in the Shell

解决

# Simply wrap the value in double quotes
$ put 'yuzhouwan', 'rowkey01', 'cf:age', "\xFF"  #255

Performance Tuning

Community Follow-up

For details, see "The Open Source Community"

资料

Doc

  • HBase Metrics

Blog

Put

  • Data write path explained

Read

  • Data read path explained

Replication

  • HBase replication explained

BulkLoad

  • HBaseBulkLoad
  • Bulkload Hive tables into HBase

Flush

  • HBase Memstore Flush in depth

Code Resource

  • Opinionated HBase / Spark / BigData

For more resources, you are welcome to join us and learn together

Technical Discussion Group: (AI 1020982 (Advanced) & 1217710 (Intermediate) | BigData 1670647)


Post author:Benedict Jin
Post link: https://yuzhouwan.com/posts/45888/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.
