Unrecognized Data Disks Rendering HDFS Unusable

Environment:

Hadoop version: 2.7.2

Symptoms:

After the configuration was upgraded and HDFS restarted, reported storage capacity dropped sharply

HDFS was in an INCONSISTENT state and unusable; the DataNode processes exited shortly after startup
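(The capacity drop and the missing DataNodes are visible in dfsadmin output; hdfs dfsadmin -report is a standard command, and the head filter below is only for brevity:)

# Show configured/remaining capacity and live vs. dead DataNodes
hdfs dfsadmin -report | head -n 20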

Analysis:

Possible causes:

1. The Hadoop cluster has been expanded over time, so its configuration is heterogeneous and hdfs-site.xml differs between nodes; an scp of config files may have replaced the wrong file, leaving some disks unconfigured and therefore unrecognized

2. Some disks may have failed physically, making their data unreadable

Troubleshooting:

1. Inspected hdfs-site.xml on each node: the configuration was correct, ruling out the first cause (a quick way to compare the files across nodes is sketched below)
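A minimal consistency check, assuming the config lives at /etc/hadoop/conf; hostnames other than hd12 (which appears in the logs below) are placeholders:

# Identical checksums across nodes rule out a bad scp replacement
for h in hd11 hd12 hd13; do
    ssh "$h" md5sum /etc/hadoop/conf/hdfs-site.xml
done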

2. Inspected the logs of each DataNode.


The fatal error reads:

2018-07-23 10:22:16,959 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool (Datanode Uuid unassigned) service to hb1/192.168.10.32:9000. Exiting.
org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 6, volumes configured: 10, volumes failed: 4, volume failures tolerated: 0
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.<init>(FsDatasetImpl.java:285)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:34)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:30)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1371)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1323)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:802)
        at java.lang.Thread.run(Thread.java:745)

My first reaction was that the disks themselves had failed, so HDFS could not recognize them; since this cluster tolerates zero volume failures ("volume failures tolerated: 0" above, the default for dfs.datanode.failed.volumes.tolerated), a single unrecognized disk is enough to take the whole DataNode down. But checking every drive with smartctl -H /dev/sdb (and likewise for the other devices) returned SMART Health Status: OK each time, so the disks themselves were not the problem.
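For reference, a sketch of that health check as a loop, with device names /dev/sdb through /dev/sdk assumed for a ten-disk node, plus a standard way to confirm the tolerance setting:

# SMART health for each data disk (device names are an assumption here)
for dev in /dev/sd{b..k}; do
    echo "=== $dev ==="
    sudo smartctl -H "$dev"
done

# Confirm the volume-failure tolerance (defaults to 0)
hdfs getconf -confKey dfs.datanode.failed.volumes.tolerated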

Reading further up in the log revealed the real problem:

2018-07-23 10:22:16,731 WARN org.apache.hadoop.hdfs.server.common.Storage: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /data6 is in an inconsistent state: Root /data6: DatanodeUuid=4310e0f4-6667-4cdc-bc1b-931cba355606, does not match f385b27e-6a2f-4077-8468-40fd3dec7dc2 from other StorageDirectory.
2018-07-23 10:22:16,760 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data7/in_use.lock acquired by nodename 173507@hd12
2018-07-23 10:22:16,770 WARN org.apache.hadoop.hdfs.server.common.Storage: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /data7 is in an inconsistent state: Root /data7: DatanodeUuid=4310e0f4-6667-4cdc-bc1b-931cba355606, does not match f385b27e-6a2f-4077-8468-40fd3dec7dc2 from other StorageDirectory.
2018-07-23 10:22:16,778 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data8/in_use.lock acquired by nodename 173507@hd12
2018-07-23 10:22:16,803 INFO org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid BP-2078449412-192.168.10.32-1497544230293
2018-07-23 10:22:16,803 INFO org.apache.hadoop.hdfs.server.common.Storage: Locking is disabled for /data8/current/BP-2078449412-192.168.10.32-1497544230293
2018-07-23 10:22:16,837 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data9/in_use.lock acquired by nodename 173507@hd12
2018-07-23 10:22:16,862 WARN org.apache.hadoop.hdfs.server.common.Storage: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /data9 is in an inconsistent state: Root /data9: DatanodeUuid=4310e0f4-6667-4cdc-bc1b-931cba355606, does not match f385b27e-6a2f-4077-8468-40fd3dec7dc2 from other StorageDirectory.
2018-07-23 10:22:16,886 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data10/in_use.lock acquired by nodename 173507@hd12
2018-07-23 10:22:16,947 WARN org.apache.hadoop.hdfs.server.common.Storage: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /data10 is in an inconsistent state: Root /data10: DatanodeUuid=4310e0f4-6667-4cdc-bc1b-931cba355606, does not match f385b27e-6a2f-4077-8468-40fd3dec7dc2 from other StorageDirectory.

That is exactly four volumes (/data6, /data7, /data9, /data10), matching the "volumes failed: 4" in the fatal error: each carries a DatanodeUuid that does not match the UUID recorded by the node's other storage directories. This was the real problem.
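The affected directories can also be pulled straight out of the log; a sketch, with the log path assumed (default Hadoop log naming, host hd12 from the messages above):

# List the storage directories flagged as inconsistent
grep InconsistentFSStateException /var/log/hadoop/hadoop-*-datanode-hd12.log \
    | grep -oE 'Directory /data[0-9]+' | sort -u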

Solution:

In each affected volume's root (the directories listed under dfs.datanode.data.dir in hdfs-site.xml), open the VERSION file inside the current subdirectory, change its datanodeUuid property to the value the error log expects, save, and restart HDFS. In my case that meant changing 4310e0f4-6667-4cdc-bc1b-931cba355606 to f385b27e-6a2f-4077-8468-40fd3dec7dc2 on /data6, /data7, /data9, and /data10; after the restart the problem was resolved.
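A minimal sketch of that edit as shell commands, using the two UUIDs from this incident (back up each VERSION file first; the volume list is specific to this node):

OLD=4310e0f4-6667-4cdc-bc1b-931cba355606
NEW=f385b27e-6a2f-4077-8468-40fd3dec7dc2
for d in /data6 /data7 /data9 /data10; do
    # Keep a backup, then rewrite the mismatched datanodeUuid in place
    cp "$d/current/VERSION" "$d/current/VERSION.bak"
    sed -i "s/datanodeUuid=$OLD/datanodeUuid=$NEW/" "$d/current/VERSION"
done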

 
