Resolving NotServingRegionException [HBase 1.1]

【Problem】Counting the rows of an HBase table fails

hbase(main):004:0> count 'users:apps_user', INTERVAL => 150000
ERROR: org.apache.hadoop.hbase.NotServingRegionException: Region users:apps_user,,1504001042919.1b6b1a81b931f95ad499816c6260b7b4. is not online on kmr-5b9c18fc-gn-7b3518df-core-1-005.ksc.com,16020,1516156948629
at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2928)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:974)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2259)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32295)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2127)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
at java.lang.Thread.run(Thread.java:745)
The error roughly means that region 1b6b1a81b931f95ad499816c6260b7b4 of the table is not online on the RegionServer named in the message.
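
Every later step keys off the encoded region name, so it can help to extract it from the error message mechanically instead of copying it by hand. A minimal sketch, assuming the message has the "Region <table>,<start key>,<timestamp>.<encoded name>. is not online" shape shown above; the function name encoded_region_name is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a NotServingRegionException message on stdin and
# print the 32-character hex encoded region name it contains.
encoded_region_name() {
  # The encoded name is the hex run between a '.' and the ". is not online" suffix.
  sed -n 's/.*\.\([0-9a-f]\{32\}\)\. is not online.*/\1/p'
}

msg='ERROR: org.apache.hadoop.hbase.NotServingRegionException: Region users:apps_user,,1504001042919.1b6b1a81b931f95ad499816c6260b7b4. is not online on kmr-5b9c18fc-gn-7b3518df-core-1-005.ksc.com,16020,1516156948629'
printf '%s\n' "$msg" | encoded_region_name
```

The printed value is the same name that appears as a directory under the table's HDFS path and as a child znode under region-in-transition.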


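Every time the HBase Master opens or closes a region, it adds an entry to the Master's Regions In Transition (RIT) list, because operations on a region must be atomic. Opening and closing a region is carried out jointly by the HMaster and a RegionServer. To coordinate these operations, roll them back, and keep them consistent, the HMaster uses the RIT mechanism together with znode state in ZooKeeper. A region can be in one of the following states: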

OFFLINE, // region is in an offline state
PENDING_OPEN, // sent rpc to server to open but has not begun
OPENING, // server has begun to open but not yet done
OPEN, // server opened region and updated meta
PENDING_CLOSE, // sent rpc to server to close but has not begun
CLOSING, // server has begun to close but not yet done
CLOSED, // server closed region and updated meta
SPLITTING, // server started split of a region
SPLIT // server completed split of a region


【Investigation】
Listing the HDFS directory shows that the region does contain data:
[hdfs@kmr-5b9c18fc-gn-7b3518df-master-1-001 root]$ hdfs dfs -ls -R /apps/hbase/data/data/users | grep --color 1b6b1a81b931f95ad499816c6260b7b4
drwxr-xr-x   - hbase hdfs          0 2018-01-24 10:22 /apps/hbase/data/data/users/apps_user/1b6b1a81b931f95ad499816c6260b7b4
-rw-r--r--   3 hbase hdfs         67 2017-08-29 18:04 /apps/hbase/data/data/users/apps_user/1b6b1a81b931f95ad499816c6260b7b4/.regioninfo
drwxr-xr-x   - hbase hdfs          0 2018-01-24 10:22 /apps/hbase/data/data/users/apps_user/1b6b1a81b931f95ad499816c6260b7b4/.tmp
drwxr-xr-x   - hbase hdfs          0 2018-01-24 10:22 /apps/hbase/data/data/users/apps_user/1b6b1a81b931f95ad499816c6260b7b4/cf
-rw-r--r--   3 hbase hdfs    1361119 2018-01-24 10:22 /apps/hbase/data/data/users/apps_user/1b6b1a81b931f95ad499816c6260b7b4/cf/c631bc7d8e554b8888856609ec775678
drwxr-xr-x   - hbase hdfs          0 2018-01-24 10:21 /apps/hbase/data/data/users/apps_user/1b6b1a81b931f95ad499816c6260b7b4/recovered.edits
-rw-r--r--   3 hbase hdfs          0 2018-01-24 10:21 /apps/hbase/data/data/users/apps_user/1b6b1a81b931f95ad499816c6260b7b4/recovered.edits/141.seqid
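Next I checked the HBase Master web UI and found that, under Regions In Transition (RIT), region 1b6b1a81b931f95ad499816c6260b7b4 had been in state=OFFLINE for a long time, which was puzzling.
After some reading, I learned that a region's RIT state is stored in two places: in ZooKeeper and in the Master's memory. The RIT state shown in the web UI comes from the Master's memory, so as long as the Master considers the region offline, the web UI keeps showing OFFLINE.
After restarting the Master, the region was still OFFLINE. Recalling that the RIT state is also stored in ZooKeeper, I logged in to ZooKeeper and, sure enough, found the region under the region-in-transition znode. So the plan was to delete this region's node under that znode.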


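【Solution】
(1) Use the "hbase zkcli" command to find the RIT-related znode, then delete the stuck region's child node with "rmr" (only that one node, not the whole HBase znode tree).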
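So is the RIT-related /region-in-transition znode under /hbase-secure or under /hbase-unsecure? The parent znode is set by zookeeper.znode.parent in hbase-site.xml. In my case I looked up hbase.rpc.protection through the Ambari management page; its value was authentication, and the relevant znodes were under /hbase-unsecure.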
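
The parent znode can also be read straight from the zookeeper.znode.parent property in hbase-site.xml rather than inferred. A minimal sketch, assuming the <name> and <value> elements sit on adjacent lines (as generated configs usually write them); the function name znode_parent and the example path are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of zookeeper.znode.parent from an
# hbase-site.xml file. Assumes <value> is on the line right after <name>.
znode_parent() {
  local site_xml="$1"
  grep -A1 '<name>zookeeper.znode.parent</name>' "$site_xml" \
    | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# Example call (the path is an assumption; adjust it to your installation):
# znode_parent /etc/hbase/conf/hbase-site.xml
```

If the function prints nothing, the property is not set in that file and HBase falls back to its built-in default parent znode.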
First, list the region names under the RIT znode:
[zk: core1:2181,core2:2181,core3:2181(CONNECTED) 4] ls /hbase-unsecure/region-in-transition
[6f6f4f8204ccaa0b71df08591de155ba, 23492eb0416646adf740a5af1995497d, e0a1130f8b76e3e74d798ef636227b7e,1b6b1a81b931f95ad499816c6260b7b4]
Then delete the stuck one:
[zk: core1:2181,core2:2181,core3:2181(CONNECTED) 5] rmr /hbase-unsecure/region-in-transition/1b6b1a81b931f95ad499816c6260b7b4
[zk: core1:2181,core2:2181,core3:2181(CONNECTED) 6] ls /hbase-unsecure/region-in-transition
[6f6f4f8204ccaa0b71df08591de155ba, 23492eb0416646adf740a5af1995497d, e0a1130f8b76e3e74d798ef636227b7e]
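
When many regions are listed, comparing 32-character hex names by eye is error-prone. A minimal sketch for checking whether a given encoded name is present in the zkcli ls output, assuming the bracketed, comma-separated format shown above; the function name region_in_rit is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: exit 0 if the encoded region name appears in a zkcli
# `ls` result such as "[a, b,c]", and exit non-zero otherwise.
region_in_rit() {
  local ls_output="$1" encoded="$2"
  # Drop brackets and spaces, put one name per line, then match a whole line.
  printf '%s\n' "$ls_output" | sed 's/[][ ]//g' | tr ',' '\n' | grep -qxF "$encoded"
}

before='[6f6f4f8204ccaa0b71df08591de155ba, 23492eb0416646adf740a5af1995497d, e0a1130f8b76e3e74d798ef636227b7e,1b6b1a81b931f95ad499816c6260b7b4]'
after='[6f6f4f8204ccaa0b71df08591de155ba, 23492eb0416646adf740a5af1995497d, e0a1130f8b76e3e74d798ef636227b7e]'

region_in_rit "$before" 1b6b1a81b931f95ad499816c6260b7b4 && echo "still in RIT"
region_in_rit "$after" 1b6b1a81b931f95ad499816c6260b7b4 || echo "znode removed"
```

The two sample strings are the ls results captured before and after the rmr above.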

(2) Restart the HMaster; it recreates its znodes in ZooKeeper. On startup the HMaster checks the /region-in-transition znode for regions in RIT state; finding no entry for this region, it treats the region as normal.
[zk: core1:2181,core2:2181,core3:2181(CONNECTED) 6] ls /hbase-unsecure/region-in-transition

[6f6f4f8204ccaa0b71df08591de155ba, 23492eb0416646adf740a5af1995497d, e0a1130f8b76e3e74d798ef636227b7e]

(3) Verify: the count now succeeds.

hbase(main):001:0> count 'users:apps_user', INTERVAL => 150000

13188 row(s) in 5.7620 seconds

=> 13188
