HBase Cluster-Wide Outage Report (2016.7.13)

Situation and Operation Log

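At around 10:50, operations staff reported that every node in HBase cluster B was down. The following records all the steps taken to recover the cluster.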

Tried to open the HBase UI at http://192.168.3.146:60010/, but the page could not be reached.
Logged in to the hbase shell to check:

>status 'simple'
5 dead servers

All the regionservers were indeed down, so I immediately tried to bring them all back up:

service hbase-regionserver start

Checking the cluster state again with status 'simple' showed that the regionservers had not come up. The regionserver log showed:

2016-07-13 10:37:13,480 ERROR [regionserver60020] regionserver.HRegionServer: Failed init
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /hbase/WALs/Slave4.hadoop,60020,1468377431049. Name node is in safe mode.
Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE: If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1197)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:3568)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3544)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:739)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:558)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)

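According to the log above, the regionservers failed to start because HDFS had entered safe mode. I turned to HDFS and checked its state with hdfs dfsadmin -safemode get, which confirmed that safe mode was on. I ran hdfs dfsadmin -safemode leave to leave safe mode and then started all the regionservers again.
Back in the hbase shell, all the regionservers were now up. The web UI, however, showed that although the regionservers were running, the regions on them were not online. Running hbase hbck reported more than 1,000 inconsistencies, so I ran the repair command hbase hbck -repair. The web UI then showed the regionservers quickly bringing their regions online. Once all regions were online and the inconsistencies were gone, the HBase cluster was back in service.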
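
The recovery steps above can be summarized as a short plan. This is a minimal sketch, not the tooling used during the incident: the function names are hypothetical, the commands are the ones quoted in this report, and the check assumes that hdfs dfsadmin -safemode get prints a line containing "Safe mode is ON" or "Safe mode is OFF".

```python
def safemode_is_on(status_output):
    """Interpret the text printed by `hdfs dfsadmin -safemode get`."""
    return "safe mode is on" in status_output.lower()


def plan_recovery(status_output):
    """Return the commands from this report, in order, for the given safe mode status."""
    steps = []
    if safemode_is_on(status_output):
        # The NameNode warns that it re-enters safe mode if the resource
        # problem is still present, so fix the root cause first.
        steps.append(["hdfs", "dfsadmin", "-safemode", "leave"])
    steps.append(["service", "hbase-regionserver", "start"])  # on every regionserver host
    steps.append(["hbase", "hbck"])              # report inconsistencies
    steps.append(["hbase", "hbck", "-repair"])   # only if hbck reports inconsistencies
    return steps


for command in plan_recovery("Safe mode is ON"):
    print(" ".join(command))
```

The sketch only prints the plan; running each step by hand after review is the safer choice during an incident.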

Failure Analysis

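With the cluster recovered, the next step was to find out what had caused such a severe, cluster-wide failure. The regionserver log from the time of the crash shows: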

2016-07-13 10:29:40,383 WARN [Thread-16] regionserver.HStore: Failed flushing store file, retrying num=0
java.io.IOException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create file/hbase/data/default/platform_common_user_flow_consumer/e887812d9c1be58014f0733cf6e7b058/.tmp/72ace704aa374894afbabf2118225ebb. Name node is in safe mode.
Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE: If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1197)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2225)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2180)

Followed by:

2016-07-13 10:29:51,603 FATAL [Thread-16] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2016-07-13 10:29:51,660 INFO [Thread-16] regionserver.HRegionServer: STOPPED: Replay of HLog required. Forcing server shutdown
2016-07-13 10:29:51,663 INFO [RpcServer.handler=45,port=60020] ipc.RpcServer: RpcServer.handler=45,port=60020: exiting
2016-07-13 10:29:51,661 INFO [Priority.RpcServer.handler=0,port=60020] ipc.RpcServer: Priority.RpcServer.handler=0,port=60020: exiting
2016-07-13 10:29:51,661 INFO [RpcServer.handler=3,port=60020] ipc.RpcServer: RpcServer.handler=3,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=49,port=60020] ipc.RpcServer: RpcServer.handler=49,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=52,port=60020] ipc.RpcServer: RpcServer.handler=52,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=53,port=60020] ipc.RpcServer: RpcServer.handler=53,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=55,port=60020] ipc.RpcServer: RpcServer.handler=55,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=56,port=60020] ipc.RpcServer: RpcServer.handler=56,port=60020: exiting
2016-07-13 10:29:51,661 INFO [RpcServer.handler=0,port=60020] ipc.RpcServer: RpcServer.handler=0,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=59,port=60020] ipc.RpcServer: RpcServer.handler=59,port=60020: exiting
2016-07-13 10:29:51,661 INFO [RpcServer.handler=11,port=60020] ipc.RpcServer: RpcServer.handler=11,port=60020: exiting
2016-07-13 10:29:51,661 INFO [RpcServer.handler=10,port=60020] ipc.RpcServer: RpcServer.handler=10,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=64,port=60020] ipc.RpcServer: RpcServer.handler=64,port=60020: exiting
2016-07-13 10:29:51,661 INFO [RpcServer.handler=2,port=60020] ipc.RpcServer: RpcServer.handler=2,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=67,port=60020] ipc.RpcServer: RpcServer.handler=67,port=60020: exiting
2016-07-13 10:29:51,664 INFO [RpcServer.handler=69,port=60020] ipc.RpcServer: RpcServer.handler=69,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=66,port=60020] ipc.RpcServer: RpcServer.handler=66,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=65,port=60020] ipc.RpcServer: RpcServer.handler=65,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=63,port=60020] ipc.RpcServer: RpcServer.handler=63,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=62,port=60020] ipc.RpcServer: RpcServer.handler=62,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=61,port=60020] ipc.RpcServer: RpcServer.handler=61,port=60020: exiting
2016-07-13 10:29:51,664 INFO [RpcServer.handler=79,port=60020] ipc.RpcServer: RpcServer.handler=79,port=60020: exiting

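These logs show that the regionserver failed while flushing a store file to HDFS and then aborted: the HLog needed to be replayed, so the server was forced to shut down. The culprit was HDFS being in safe mode, so everything pointed at the NameNode. The NameNode log shows: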

2016-07-13 10:29:25,239 WARN org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker: Space available on volume 'null' is 0, which is below the configured reserved amount 104857600
2016-07-13 10:29:25,239 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: NameNode low on available disk space. Entering safe mode.
2016-07-13 10:29:25,239 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for (journal JournalAndStream(mgr=FileJournalManager(root=/data/namenode/nfsmount/nn), stream=EditLogFileOutputStream(/data/namenode/nfsmount/nn/current/edits_inprogress_0000000000339257361)))
java.io.IOException: Input/output error
at sun.nio.ch.FileDispatcherImpl.size0(Native Method)
at sun.nio.ch.FileDispatcherImpl.size(FileDispatcherImpl.java:83)
at sun.nio.ch.FileChannelImpl.size(FileChannelImpl.java:294)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.preallocate(EditLogFileOutputStream.java:219)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.flushAndSync(EditLogFileOutputStream.java:202)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:112)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:106)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:498)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:358)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:494)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:624)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2238)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2180)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:505)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:354)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980)
2016-07-13 10:29:25,240 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLog: Disabling journal JournalAndStream(mgr=FileJournalManager(root=/data/namenode/nfsmount/nn), stream=EditLogFileOutputStream(/data/namenode/nfsmount/nn/current/edits_inprogress_0000000000339257361))
2016-07-13 10:29:40,240 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /hbase/data/default/poi_history_v2/fc7face2e73dfea4605afe5fb55bd0c4/.tmp/ec0ceaeb684840128cca8358dcad6abd is closed by DFSClient_NONMAPREDUCE_-795547122_28
2016-07-13 10:29:40,240 INFO org.apache.hadoop.hdfs.StateChange: STATE* Safe mode is ON.
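
The first WARN line in this log comes from the NameNode's resource check, which compares the space available on a storage volume with a reserved amount, 104857600 bytes (100 MiB) here. The sketch below mirrors that comparison as an illustration only; it is not Hadoop's code. The names `free_bytes` and `volume_has_space`, and the choice to treat a path that cannot be queried as having 0 bytes available, are assumptions made for this example.

```python
import shutil

RESERVED_BYTES = 104857600  # reserved amount quoted in the NameNode log (100 MiB)


def free_bytes(path):
    """Free bytes on the filesystem holding `path`; 0 if it cannot be queried (assumption)."""
    try:
        return shutil.disk_usage(path).free
    except OSError:
        return 0


def volume_has_space(available, reserved=RESERVED_BYTES):
    """True when the available space is at or above the reserved amount."""
    return available >= reserved


print(volume_has_space(0))                # the value reported in the log
print(volume_has_space(free_bytes(".")))  # the current directory, for comparison
```

Under this assumption, a mount that stops answering looks exactly like a full disk to the check, even when the underlying storage has plenty of room.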

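According to this log, /data/namenode/nfsmount had no usable space, which put HDFS into safe mode. Operations staff, however, confirmed that /data/namenode/nfsmount had plenty of space, so why did writes to it fail? Looking further, /data/namenode/nfsmount turned out to be a network mount point rather than a local disk (the NameNode in this cluster is a single point of failure, and the network mount was there to protect against data loss). Checking what happened around 10:29, operations found the following in the system log: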

Jul 13 10:29:21 Master kernel: nfs: server Slave2.hadoop not responding, timed out
Jul 13 10:29:25 Master kernel: nfs: server Slave2.hadoop not responding, timed out
Jul 13 10:29:25 Master kernel: nfs: server Slave2.hadoop not responding, timed out
Jul 13 10:29:40 Master kernel: nfs: server Slave2.hadoop not responding, timed out
Jul 13 10:30:00 Master kernel: nfs: server Slave2.hadoop not responding, timed out
Jul 13 10:30:20 Master kernel: nfs: server Slave2.hadoop not responding, timed out
Jul 13 10:30:40 Master kernel: nfs: server Slave2.hadoop not responding, timed out
Jul 13 10:31:00 Master kernel: nfs: server Slave2.hadoop not responding, timed out

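Slave2.hadoop is the server behind that network mount. Its requests were timing out at exactly that moment, so flushing the data failed. The "insufficient space" message is probably just an imprecise error report. The real cause was the request timeout on the network mount, which pushed HDFS into safe mode and ultimately took all the regionservers offline.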
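
The ordering of events can be checked by lining up the timestamps quoted in this report. The following is a small sketch that uses only lines copied from the logs above. The helper names are hypothetical, and it assumes that the clocks on the hosts agree and that the syslog lines, which carry no year, are from 2016.

```python
from datetime import datetime

SYSLOG_LINE = "Jul 13 10:29:21 Master kernel: nfs: server Slave2.hadoop not responding, timed out"
NAMENODE_LINE = ("2016-07-13 10:29:25,239 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: "
                 "NameNode low on available disk space. Entering safe mode.")
REGIONSERVER_LINE = ("2016-07-13 10:29:51,603 FATAL [Thread-16] regionserver.HRegionServer: "
                     "RegionServer abort: loaded coprocessors are: []")


def parse_syslog_time(line, year=2016):
    """Parse 'Jul 13 10:29:21 ...'; the year is supplied because syslog omits it."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def parse_hadoop_time(line):
    """Parse the 'YYYY-MM-DD HH:MM:SS,mmm' prefix used by the Hadoop and HBase logs."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


events = sorted([
    (parse_hadoop_time(REGIONSERVER_LINE), "RegionServer aborts"),
    (parse_hadoop_time(NAMENODE_LINE), "NameNode enters safe mode"),
    (parse_syslog_time(SYSLOG_LINE), "first NFS timeout on Master"),
])
for when, what in events:
    print(when.time(), what)
```

Sorted this way, the NFS timeout comes first, the safe mode entry follows a few seconds later, and the regionserver abort comes last, which matches the causal chain described above.
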
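To keep this from happening on the new cluster, the NameNode will be set up with HA, metadata will not be stored on a network disk, and a JournalNode cluster will handle metadata synchronization, which should resolve this problem well.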
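
To illustrate why a JournalNode cluster avoids this failure mode: with quorum-based journaling, an edit counts as durable once a majority of JournalNodes acknowledge it, so one unresponsive node does not stall the NameNode the way the single NFS mount did. The following is a toy sketch of that majority rule, not Hadoop's implementation; the node names jn1 to jn3 are hypothetical.

```python
def majority(total_nodes):
    """Smallest number of acknowledgements that forms a majority."""
    return total_nodes // 2 + 1


def edit_is_durable(acks, total_nodes):
    """True when enough JournalNodes acknowledged the edit."""
    return sum(1 for ok in acks.values() if ok) >= majority(total_nodes)


# Hypothetical three-node journal in which jn2 times out, as Slave2.hadoop did.
acks = {"jn1": True, "jn2": False, "jn3": True}
print(edit_is_durable(acks, total_nodes=3))  # True: 2 of 3 is a majority
```

With three JournalNodes the cluster tolerates one failure; tolerating two would need five.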
