Some HBase Regions Offline and Stuck in RIT During a Phoenix Local Index Stress Test

Problem

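While stress-testing Phoenix secondary indexes, the HBase Master became unhealthy. The RegionServer log showed org.apache.hadoop.hbase.exceptions.RegionOpeningException, reporting that the region was still in the process of opening. The relevant log output is: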

INFO AsyncProcess
#7, table=_LOCAL_IDX_ADRECORD, attempt=46/350 failed=2ops, last exception: org.apache.hadoop.hbase.exceptions.RegionOpeningException: org.apache.hadoop.hbase.exceptions.RegionOpeningException: Region _LOCAL_IDX_ADRECORD,\x00v\xD5\xB0\x00\x00\x01tTR#\xF0\xD9z\x83\xAE,1599204014761.cad1741bb904a2456323fa07af04a9c7. is opening on hbase-master,60020,1599217809763
at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2994)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1069)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2100)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33656)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2191)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:183)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:163)
on hbase-master,60020,1599184059223, tracking started null, retrying after=20025ms, replay=2ops
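When many regions are stuck, it helps to pull the encoded region name (the 32-hex-character suffix of the full region name, which is also the region's directory name under the table directory on HDFS) out of client retry lines like the one above. The following is a minimal Python sketch: `parse_retry_line` is a hypothetical helper, and both regular expressions are inferred from this single log line rather than from any documented HBase log format.

```python
import re

# Both patterns are inferred from the single AsyncProcess line above; they are
# not taken from any documented HBase log format.
RETRY_RE = re.compile(
    r"table=(?P<table>[^,\s]+), attempt=(?P<attempt>\d+)/(?P<max>\d+) "
    r"failed=(?P<failed>\d+)ops"
)
# Full region name "<table>,<start key>,<timestamp>.<encoded name>." where the
# encoded name is 32 lowercase hex characters.
REGION_RE = re.compile(
    r"Region (?P<name>\S+?)\.(?P<encoded>[0-9a-f]{32})\. is opening on (?P<server>\S+)"
)


def parse_retry_line(line: str) -> dict:
    """Pull table, retry counters, region and server out of one retry line."""
    info = {}
    m = RETRY_RE.search(line)
    if m:
        info.update(table=m["table"], attempt=int(m["attempt"]),
                    max_attempts=int(m["max"]), failed_ops=int(m["failed"]))
    m = REGION_RE.search(line)
    if m:
        info.update(region=m["name"], encoded=m["encoded"], server=m["server"])
    return info


# Raw strings keep the escaped row-key bytes (\x00, \xD5, ...) as literal text,
# exactly as they appear in the log.
SAMPLE = (
    r"#7, table=_LOCAL_IDX_ADRECORD, attempt=46/350 failed=2ops, last exception: "
    r"org.apache.hadoop.hbase.exceptions.RegionOpeningException: "
    r"org.apache.hadoop.hbase.exceptions.RegionOpeningException: Region "
    r"_LOCAL_IDX_ADRECORD,\x00v\xD5\xB0\x00\x00\x01tTR#\xF0\xD9z\x83\xAE,"
    r"1599204014761.cad1741bb904a2456323fa07af04a9c7. is opening on "
    r"hbase-master,60020,1599217809763"
)
info = parse_retry_line(SAMPLE)
print(info["encoded"], info["server"], info["attempt"])
```

Grouping the parsed results by `encoded` across many log lines gives a list of the regions that clients keep retrying, which can then be checked on HDFS or in the Master UI.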

Solution

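The files on HDFS were checked and found intact, and repair attempts with hbase hbck and similar commands could not complete: they hung. A search of the Cloudera Community turned up a similar case (in an environment essentially identical to this cluster). There, the cause was a deadlock among the threads that open regions, and the fix was to increase the number of those threads. After changing the configuration and restarting the HBase cluster, everything returned to normal.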
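A deadlock that disappears once the pool is enlarged is characteristic of thread-pool starvation: tasks occupy every worker while waiting on other tasks still sitting in the same queue. The Python toy model below illustrates that pattern; it is not HBase or Phoenix code. It assumes, hypothetically, that each local index region's open blocks until a paired data region is online and that both kinds of open share one fixed-size pool. A timed wait stands in for the indefinite hang. The pool size of 3 matches HBase's built-in default for hbase.regionserver.executor.openregion.threads, and 100 is the value used in this post.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def simulate_open(num_threads: int, num_pairs: int, timeout: float = 1.0) -> bool:
    """Toy model: True if every region opens before `timeout` seconds.

    Hypothetical assumption: each local index region's open blocks until its
    paired data region is online, and both kinds of open share one pool.
    """
    data_online = [threading.Event() for _ in range(num_pairs)]

    def open_data_region(i: int) -> bool:
        data_online[i].set()  # mark data region i as online
        return True

    def open_index_region(i: int) -> bool:
        # wait() returns False on timeout; the timeout stands in for a hang.
        return data_online[i].wait(timeout)

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Index-region opens are queued first, so with a small pool they take
        # every worker and the data-region opens wait behind them.
        futures = [pool.submit(open_index_region, i) for i in range(num_pairs)]
        futures += [pool.submit(open_data_region, i) for i in range(num_pairs)]
        return all(f.result() for f in futures)


print(simulate_open(num_threads=3, num_pairs=5))    # False: starved pool
print(simulate_open(num_threads=100, num_pairs=5))  # True
```

With 3 workers and 5 pairs, the first three index opens hold every worker while the data opens they need are still queued, so they time out. With enough workers, the data opens run on spare threads and every wait succeeds.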
In CDH, add the following to the RegionServer Advanced Configuration Snippet (Safety Valve) for hbase-site.xml:

<property>
    <name>hbase.regionserver.executor.openregion.threads</name>
    <value>100</value>
</property>
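Before restarting, it can be worth confirming that the value actually reached the hbase-site.xml that Cloudera Manager deploys to the RegionServers. The following is a minimal Python sketch; `open_region_threads` is a hypothetical helper, and it assumes the XML has a single root element such as <configuration> (a lone <property> element also parses). The fallback of 3 is HBase's built-in default for this property.

```python
import xml.etree.ElementTree as ET

PROPERTY = "hbase.regionserver.executor.openregion.threads"
HBASE_DEFAULT = 3  # HBase's built-in default when the property is not set


def open_region_threads(hbase_site_xml: str) -> int:
    """Return the open-region thread count configured in hbase-site.xml text."""
    root = ET.fromstring(hbase_site_xml)
    value = HBASE_DEFAULT
    # iter() also visits the root, so a lone <property> element works too.
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == PROPERTY:
            # If the property is repeated, this sketch keeps the last one.
            value = int((prop.findtext("value") or "").strip())
    return value


SITE = """<configuration>
  <property>
    <name>hbase.regionserver.executor.openregion.threads</name>
    <value>100</value>
  </property>
</configuration>"""
print(open_region_threads(SITE))                # 100
print(open_region_threads("<configuration/>"))  # 3
```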

References

一次HBase集群崩溃的修复历程 ("Repairing an HBase cluster crash")
Can phoenix local indexes create a deadlock during an HBase restart? (Cloudera Community)
