At this point we could basically conclude that the problem was a bad block: when HDFS blocks become corrupt or go missing, HDFS triggers its self-protection mechanism and the NameNode enters safe mode.
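The safe-mode state can be checked before deciding on a fix. A minimal sketch, assuming the usual output format of `hdfs dfsadmin -safemode get` ("Safe mode is ON" / "Safe mode is OFF"); a sample status line is used here in place of a live cluster call:

```shell
# On a live cluster this would be:
#   status=$(hdfs dfsadmin -safemode get)
# Sample output used here for illustration.
status="Safe mode is ON"

# Decide the NameNode state from the reported line.
case "$status" in
  *ON*)  msg="NameNode is in safe mode, writes are blocked" ;;
  *OFF*) msg="NameNode is operating normally" ;;
esac
echo "$msg"
```

While safe mode is on, lease renewals and writes fail exactly as the `SafeModeException` below shows.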
We tried restarting the master so that HDFS could repair the bad block on its own, but that failed; one block was still reported as abnormal:
2019-01-02 16:21:40,081 INFO ipc.Server (Server.java:logException(2394)) - IPC Server handler 583 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease from 72.118.0.12:36554 Call#1477 Retry#0: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot renew lease for DFSClient_NONMAPREDUCE_-2130476810_1. Name node is in safe mode.
We then tried leaving safe mode and starting the HDFS master; after it came up, the Ambari collector still reported that it could not fetch the data it needed:
2019-01-02 16:01:05,596 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=10, retries=35, started=68259 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.NotServingRegionException: Region SYSTEM.CATALOG,,1545030793895.709d49c78a9511adb4082e4177ca23e0. is not online on 1.snamenode1,61320,1546415985105
at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3077)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1015)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:1955)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32389)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2150)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)
row '' on table 'SYSTEM.CATALOG' at region=SYSTEM.CATALOG,,1545030793895.709d49c78a9511adb4082e4177ca23e0., hostname=1.snamenode1,61320,1545716568014, seqNum=43
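The key detail in that trace is the encoded name of the region that is not online. A small sketch of pulling it out of a log line with `grep -o`; the sample line is abbreviated from the exception above:

```shell
# Sample fragment of the collector log line shown above.
log_line='msg=org.apache.hadoop.hbase.NotServingRegionException: Region SYSTEM.CATALOG,,1545030793895.709d49c78a9511adb4082e4177ca23e0. is not online'

# Extract the full region name (table,start-key,timestamp.encoded-name.)
# so it can be looked up in the HBase master UI or hbck output.
region=$(echo "$log_line" | grep -o 'SYSTEM\.CATALOG[^ ]*')
echo "$region"
```

Here the offline region belongs to `SYSTEM.CATALOG`, a Phoenix system table, which is why the collector cannot come up until the underlying HDFS blocks are fixed.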
In the end the only option was a manual repair:
su - hdfs
hdfs dfsadmin -safemode leave
hdfs fsck / -delete  # the corrupt block was not important, so the affected files were simply deleted
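After the delete, it is worth confirming that fsck now reports a clean filesystem. A minimal sketch, assuming the standard fsck summary format; a sample summary line stands in for the live command:

```shell
# On a live cluster this would be:
#   summary=$(hdfs fsck / 2>/dev/null | tail -n 1)
# Sample healthy summary line used here for illustration.
summary="The filesystem under path '/' is HEALTHY"

# fsck ends with either HEALTHY or CORRUPT in its status line.
if echo "$summary" | grep -q HEALTHY; then
  result="filesystem healthy"
else
  result="filesystem still corrupt"
fi
echo "$result"
```

Only once fsck reports HEALTHY is it safe to restart the dependent services.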
After that we restarted the master and the collector once more, and the services came back to normal.