Notes on an HBase META Table Failure

I'm an HBase beginner, so corrections and criticism are welcome.

HBase version: 0.90.6
32 nodes in total
Regionserver log from node 17:
2016-10-12 18:02:07,051 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:60020-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009 Unable to get data of znode /hbase/unassigned/e50b05f8446a86044f32d64fd92df5ee
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/unassigned/e50b05f8446a86044f32d64fd92df5ee
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:614)
        at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:707)
        at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNodeOpening(ZKAssign.java:589)
        at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNodeOpening(ZKAssign.java:582)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.transitionZookeeperOfflineToOpening(OpenRegionHandler.java:354)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:96)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:162)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
2016-10-12 18:02:07,051 ERROR org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009 Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/unassigned/e50b05f8446a86044f32d64fd92df5ee
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:614)
        at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:707)
        at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNodeOpening(ZKAssign.java:589)
        at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNodeOpening(ZKAssign.java:582)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.transitionZookeeperOfflineToOpening(OpenRegionHandler.java:354)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:96)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:162)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
2016-10-12 18:02:07,051 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Error transition from OFFLINE to OPENING for region=e50b05f8446a86044f32d64fd92df5ee
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/unassigned/e50b05f8446a86044f32d64fd92df5ee
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:614)
        at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:707)
        at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNodeOpening(ZKAssign.java:589)
        at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNodeOpening(ZKAssign.java:582)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.transitionZookeeperOfflineToOpening(OpenRegionHandler.java:354)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:96)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:162)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
2016-10-12 18:02:07,051 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:60020-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009 Unable to get data of znode /hbase/unassigned/c27fd73fc6e74513995fa537ef8cb0ff
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/unassigned/c27fd73fc6e74513995fa537ef8cb0ff
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:614)
        at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:707)
        at org.apache.hadoop.hbase.zookeeper.ZKAssign.retransitionNodeOpening(ZKAssign.java:622)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.tickleOpening(OpenRegionHandler.java:380)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:111)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:162)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
2016-10-12 18:02:07,051 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:60020-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009 Unable to get data of znode /hbase/unassigned/c27fd73fc6e74513995fa537ef8cb0ff
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/unassigned/c27fd73fc6e74513995fa537ef8cb0ff
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:614)
        at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:707)
        at org.apache.hadoop.hbase.zookeeper.ZKAssign.retransitionNodeOpening(ZKAssign.java:622)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.tickleOpening(OpenRegionHandler.java:380)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:111)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:162)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
2016-10-12 18:02:07,052 ERROR org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009-0x453857482f10009 Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/unassigned/c27fd73fc6e74513995fa537ef8cb0ff
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:614)
        at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:707)
        at org.apache.hadoop.hbase.zookeeper.ZKAssign.retransitionNodeOpening(ZKAssign.java:622)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.tickleOpening(OpenRegionHandler.java:380)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:111)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:162)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
2016-10-12 18:02:07,052 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for region=e50b05f8446a86044f32d64fd92df5ee
2016-10-12 18:02:07,052 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was hijacked? It no longer exists, encodedName=e50b05f8446a86044f32d64fd92df5ee
2016-10-12 18:02:07,052 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=pc-zjddxd17,60020,1458232919352, load=(requests=16, regions=3619, usedHeap=15891, maxHeap=15962): Exception refreshing OPENING; region=c27fd73fc6e74513995fa537ef8cb0ff, context=post_region_open
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/unassigned/c27fd73fc6e74513995fa537ef8cb0ff
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:614)
        at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:707)
        at org.apache.hadoop.hbase.zookeeper.ZKAssign.retransitionNodeOpening(ZKAssign.java:622)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.tickleOpening(OpenRegionHandler.java:380)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:111)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:162)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
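
The FATAL line above also records the server load at the moment of the abort. As a minimal sketch (the function name `parse_load` and the dictionary keys are my own, not anything from HBase), the numbers can be pulled out like this. Used heap is very close to max heap here, which may be worth checking, although the log alone does not prove it caused the ZooKeeper connection loss.

```python
import re

# The FATAL line from the log above, wrapped for readability.
ABORT_LINE = (
    "2016-10-12 18:02:07,052 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: "
    "ABORTING region server serverName=pc-zjddxd17,60020,1458232919352, "
    "load=(requests=16, regions=3619, usedHeap=15891, maxHeap=15962): "
    "Exception refreshing OPENING; region=c27fd73fc6e74513995fa537ef8cb0ff, "
    "context=post_region_open"
)

# Matches the load=(...) section exactly as it is printed in this log.
LOAD_RE = re.compile(
    r"load=\(requests=(\d+), regions=(\d+), usedHeap=(\d+), maxHeap=(\d+)\)"
)


def parse_load(line):
    """Return the load figures from an ABORTING line, or None if absent."""
    match = LOAD_RE.search(line)
    if match is None:
        return None
    requests, regions, used_heap, max_heap = map(int, match.groups())
    return {
        "requests": requests,
        "regions": regions,
        "used_heap": used_heap,
        "max_heap": max_heap,
        # Fraction of the heap in use when the server aborted.
        "heap_ratio": used_heap / max_heap,
    }


if __name__ == "__main__":
    load = parse_load(ABORT_LINE)
    print(load["regions"], "regions, heap ratio %.3f" % load["heap_ratio"])
```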

After restarting the regionserver, queries for some data failed with:
org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region is not online: dr_query20160402,39113587772078,1459623124495.09c83e19e72d05085a5916e12c60cf0a.
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2415)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1668)
        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)

org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region is not online: dr_query20160402,39113587772078,1459623124495.09c83e19e72d05085a5916e12c60cf0a.
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2415)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1668)
        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
...
...
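
To look a failing region up in META it helps to split the region name from the exception into its parts. The sketch below assumes the usual `<table>,<start key>,<region id>.<encoded name>.` layout seen in the error above, and that the region id is a creation timestamp in milliseconds. The helper name `split_region_name` is hypothetical.

```python
from datetime import datetime, timezone


def split_region_name(name):
    """Split '<table>,<start key>,<region id>.<encoded name>.' into parts."""
    # The table name is everything before the first comma.
    table, rest = name.split(",", 1)
    # The start key may itself contain commas, so split on the last one.
    start_key, tail = rest.rsplit(",", 1)
    # The tail is '<region id>.<encoded name>.'; the encoded part may be missing.
    region_id, _, encoded = tail.rstrip(".").partition(".")
    return {
        "table": table,
        "start_key": start_key,
        "region_id": int(region_id),
        "encoded_name": encoded,
        # Assumption: the region id is a millisecond timestamp.
        "created_utc": datetime.fromtimestamp(int(region_id) / 1000, tz=timezone.utc),
    }


if __name__ == "__main__":
    parts = split_region_name(
        "dr_query20160402,39113587772078,1459623124495."
        "09c83e19e72d05085a5916e12c60cf0a."
    )
    for key, value in parts.items():
        print(key, "=", value)
```

The encoded name is the same hash that appears in the znode path under /hbase/unassigned and in the HDFS region directory, so it is the handiest key for cross-checking the three places.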

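Querying the META table showed that the entries for the failing regions had their regionserver info pointing at the regionserver on node 17.
What I tried:
1. Stopped the regionserver on node 17 and enabled the balancer; recovery failed.
2. Started the regionserver on node 17 and ran graceful_stop against it. It hung after moving roughly 3100 of 3500 regions. After forcibly cancelling it, I started the node 17 regionserver again and reran graceful_stop, which succeeded, but the queries still failed; recovery failed.
3. Ran hbase hbck, with the following result: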
Version: 0.90.6
.......................................................................................................
Number of Tables: 878
Number of live region servers: 32
Number of dead region servers: 0
...Number of empty REGIONINFO_QUALIFIER rows in .META.: 6
ERROR: Region dr_query20161106,604,1467707757971.022595cb412e87def50d0b924e2b173a. not deployed on any region server.
ERROR: Region dr_query20161102,bf2,1467707544941.0278e8ee1c3f707e1c544960bda52e50. not deployed on any region server.
ERROR: Region dr_query20160707,e70,1458715535447.03ede891a7b25bab5b5844992a423e07. not deployed on any region server.
ERROR: Region dr_query20160703,814,1458715454615.05785c1fa2d5c9db008622707362f89f. not deployed on any region server.
ERROR: Region dr_query20160506,3f4,1462553098226.06094bf11f5d3e5cd2edf57bfa30ae30. not deployed on any region server.
ERROR: Region dr_query20160516,370,1463423461018.06cb538928b55350b25e15241d466226. not deployed on any region server.
ERROR: Region dr_query20160402,39113587772078,1459623124495.09c83e19e72d05085a5916e12c60cf0a. not deployed on any region server.
ERROR: Region dr_query20161026,b2c,1467707179546.0ace5ddd0c7a01d04fccf17ed2835f67. not deployed on any region server.
ERROR: Region dr_query20161128,dd6,1467708842082.0b48993d271ad2963fb445775a1bd5e8. not deployed on any region server.
ERROR: Region dr_query20160609,5cc18757409143,1465492961475.0bb9d0907649504ac0a00adc6a22c9b9. not deployed on any region server.
ERROR: Region dr_query20160926,23c,1467705563796.0c4243e304763cfd36ab45fd37f6b805. not deployed on any region server.
ERROR: Region dr_query20161223,ad4,1467709995651.0f01cd67d9d402b1784bf66bccf76e22. not deployed on any region server.
......
....
455 inconsistencies detected.
Status: INCONSISTENT

A few other regions showed different errors:
ERROR: Region UNKNOWN_REGION on pc-xxxxxx22:60020, key=62ab2e52ba494d6cbe6564ca44fc6b00, not on HDFS or in META but deployed on pc-xxxxxx22:60020
ERROR: Region hdfs://10.70.235.107:9001/hbasedata/dr_query20161011/6eb85c15a35b20b3d3e0212dd0c2a49c not in META, but deployed on pc-xxxxxx16:60020
ERROR: Region hdfs://10.70.235.107:9001/hbasedata/dr_query20161010/a81ea7f23bdfd9e1483e19940336f4b5 not in META, but deployed on pc-xxxxxx19:60020
ERROR: Region hdfs://10.70.235.107:9001/hbasedata/dr_query20161009/e373569fd47f7e49316c25e1370c4063 on HDFS, but not listed in META or deployed on any region server

4. Checked the META table again for the regions that hbck reported as not deployed; their entries were not there.
5. Assigned the regions manually:
   hbase shell
   hbase>assign 'dr_query20161106,604,1467707757971.022595cb412e87def50d0b924e2b173a.'
   ....
6. Ran hbase hbck again:
Version: 0.90.6
.......................................................................................................
Number of Tables: 878
Number of live region servers: 32
Number of dead region servers: 0
...Number of empty REGIONINFO_QUALIFIER rows in .META.: 6
ERROR: Region UNKNOWN_REGION on pc-zjddxd22:60020, key=62ab2e52ba494d6cbe6564ca44fc6b00, not on HDFS or in META but deployed on pc-zjddxd22:60020
ERROR: Region hdfs://10.70.235.107:9001/hbasedata/dr_query20161011/6eb85c15a35b20b3d3e0212dd0c2a49c not in META, but deployed on pc-zjddxd16:60020
ERROR: Region hdfs://10.70.235.107:9001/hbasedata/dr_query20161010/a81ea7f23bdfd9e1483e19940336f4b5 not in META, but deployed on pc-zjddxd19:60020
ERROR: Region hdfs://10.70.235.107:9001/hbasedata/dr_query20161009/e373569fd47f7e49316c25e1370c4063 on HDFS, but not listed in META or deployed on any region server
...
...
4 inconsistencies detected.
Status: INCONSISTENT
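
With 455 inconsistencies, typing each assign by hand is tedious. The following is a rough sketch, not a tested tool: it groups the ERROR lines of saved hbck output by kind and prints one `assign` line per region that hbck reports as not deployed, ready to paste into hbase shell. The patterns are copied from the 0.90.6 output shown above and may not match other versions; the group labels and function names are my own.

```python
import re
from collections import defaultdict

# One pattern per error kind seen in the hbck output above.
# Each captures a single identifier: region name, region key, or HDFS path.
PATTERNS = [
    ("not_deployed",
     re.compile(r"^ERROR: Region (\S+) not deployed on any region server\.$")),
    ("unknown_deployed",
     re.compile(r"^ERROR: Region UNKNOWN_REGION on \S+, key=(\w+), "
                r"not on HDFS or in META but deployed on \S+$")),
    ("not_in_meta_deployed",
     re.compile(r"^ERROR: Region (\S+) not in META, but deployed on \S+$")),
    ("hdfs_only",
     re.compile(r"^ERROR: Region (\S+) on HDFS, "
                r"but not listed in META or deployed on any region server$")),
]


def classify(hbck_output):
    """Group hbck ERROR lines by kind; unrecognised lines go under 'other'."""
    groups = defaultdict(list)
    for raw in hbck_output.splitlines():
        line = raw.strip()
        if not line.startswith("ERROR:"):
            continue
        for label, pattern in PATTERNS:
            match = pattern.match(line)
            if match:
                groups[label].append(match.group(1))
                break
        else:
            groups["other"].append(line)
    return dict(groups)


def assign_commands(region_names):
    """Build hbase shell assign lines; names containing a single quote
    would need escaping, which this sketch does not handle."""
    return ["assign '%s'" % name for name in region_names]


SAMPLE = """\
ERROR: Region dr_query20161106,604,1467707757971.022595cb412e87def50d0b924e2b173a. not deployed on any region server.
ERROR: Region UNKNOWN_REGION on pc-xxxxxx22:60020, key=62ab2e52ba494d6cbe6564ca44fc6b00, not on HDFS or in META but deployed on pc-xxxxxx22:60020
ERROR: Region hdfs://10.70.235.107:9001/hbasedata/dr_query20161011/6eb85c15a35b20b3d3e0212dd0c2a49c not in META, but deployed on pc-xxxxxx16:60020
ERROR: Region hdfs://10.70.235.107:9001/hbasedata/dr_query20161009/e373569fd47f7e49316c25e1370c4063 on HDFS, but not listed in META or deployed on any region server
"""

if __name__ == "__main__":
    groups = classify(SAMPLE)
    for label, items in groups.items():
        print(label, len(items))
    for command in assign_commands(groups.get("not_deployed", [])):
        print(command)
```

Only the not_deployed group is turned into commands, since assign was the step that cleared those errors here; the other three kinds are just counted.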


I don't know how to fix these four. For now I'm waiting for a maintenance window to restart the HBase cluster.
