The error message:
2023-08-23 17:12:44,688 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint
java.io.IOException: Inconsistent checkpoint fields.
LV = -63 namespaceID = 710154960 cTime = 0 ; clusterId = CID-46a67bdd-c7b0-4056-9c54-d82d5d84964a ; blockpoolId = BP-1873526852-172.16.21.30-1692769875005.
Expecting respectively: -63; 648161912; 0; CID-46a67bdd-c7b0-4056-9c54-d82d5d84964a; BP-1073838461-172.16.20.24-1691128065381.
Some background on how this came about: I had reset the cluster by reformatting it:
hdfs namenode -format
This left the cluster IDs inconsistent. Not wanting to lose data, I changed the cluster ID back to the original value everywhere, including in the VERSION files under the following directories.
Directories defined in hdfs-site.xml:
dfs.datanode.data.dir: /home/hadoop/dfs/data/current/VERSION
dfs.namenode.name.dir: /home/hadoop/dfs/name/current/VERSION
Directory defined in core-site.xml (hadoop.tmp.dir):
/home/hadoop/tmp/hadoop-${user.name}
This directory holds one subtree per user; the ones relevant to me are:
/home/hadoop/tmp/hadoop-root/dfs/namesecondary/current/VERSION
/home/hadoop/tmp/hadoop-hadoop/dfs/namesecondary/current/VERSION
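The consistency check can be scripted. This is a sketch run against a throwaway temp tree, not the exact commands I ran; on a real cluster you would point the glob at the VERSION files listed above:

```shell
# Demo on a throwaway tree: verify every VERSION file carries the same clusterID.
# On a real cluster, substitute the VERSION paths listed above.
WORK=$(mktemp -d)
mkdir -p "$WORK/name/current" "$WORK/data/current"
echo 'clusterID=CID-46a67bdd-c7b0-4056-9c54-d82d5d84964a' > "$WORK/name/current/VERSION"
echo 'clusterID=CID-46a67bdd-c7b0-4056-9c54-d82d5d84964a' > "$WORK/data/current/VERSION"

# Count distinct clusterID lines; anything other than 1 means a mismatch.
DISTINCT=$(grep -h '^clusterID=' "$WORK"/*/current/VERSION | sort -u | wc -l | tr -d ' ')
echo "distinct clusterIDs: $DISTINCT"   # → distinct clusterIDs: 1
```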
Even with the cluster IDs made consistent, the exact same error persisted.
A fix suggested online (it did not work):
Copy the following hadoop.tmp.dir property from core-site.xml into hdfs-site.xml:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/tmp/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>
Same error as before.
Note that namespaceID = 710154960 differs from the original namespaceID = 648161912. So the stale ID 710154960 had to be found and replaced with 648161912.
From the error above, the place to look is the block pool:
blockpoolId = BP-1873526852-172.16.21.30-1692769875005
Check that block pool's VERSION file:
cat /home/hadoop/dfs/data/current/BP-1873526852-172.16.21.30-1692769875005/current/VERSION
#Wed Aug 23 17:35:36 CST 2023
namespaceID=710154960
cTime=0
blockpoolID=BP-1873526852-172.16.21.30-1692769875005
layoutVersion=-56
Sure enough, it contains namespaceID=710154960, which I changed to namespaceID=648161912.
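The edit itself is a one-line substitution. Here it is demoed on a throwaway copy of the VERSION contents shown above (GNU sed assumed for `-i`):

```shell
# Demo on a throwaway copy of the VERSION file shown above.
WORK=$(mktemp -d)
cat > "$WORK/VERSION" <<'EOF'
namespaceID=710154960
cTime=0
blockpoolID=BP-1873526852-172.16.21.30-1692769875005
layoutVersion=-56
EOF

# Replace the stale namespaceID with the NameNode's value (GNU sed -i).
sed -i 's/^namespaceID=710154960$/namespaceID=648161912/' "$WORK/VERSION"
grep '^namespaceID=' "$WORK/VERSION"   # → namespaceID=648161912
```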
After restarting the cluster, the DataNode failed to come up, with this error:
2023-08-23 17:46:44,586 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to analyze storage directories for block pool BP-1873526852-172.16.21.30-1692769875005
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /work1/home/hadoop/dfs/data/current/BP-1873526852-172.16.21.30-1692769875005 is in an inconsistent state: namespaceID is incompatible with others.
at org.apache.hadoop.hdfs.server.common.StorageInfo.setNamespaceID(StorageInfo.java:189)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.setFieldsFromProperties(BlockPoolSliceStorage.java:329)
at org.apache.hadoop.hdfs.server.common.StorageInfo.readProperties(StorageInfo.java:232)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:364)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:173)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:216)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:244)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:395)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:477)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1358)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1323)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:802)
at java.lang.Thread.run(Thread.java:750)
2023-08-23 17:46:44,587 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to add storage for block pool: BP-1873526852-172.16.21.30-1692769875005 : Directory /work1/home/hadoop/dfs/data/current/BP-1873526852-172.16.21.30-1692769875005 is in an inconsistent state: namespaceID is incompatible with others.
2023-08-23 17:46:44,588 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool (Datanode Uuid unassigned) service to localhost/127.0.0.1:9000. Exiting.
java.io.IOException: All specified directories are failed to load.
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:478)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1358)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1323)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:802)
at java.lang.Thread.run(Thread.java:750)
Comparing the two block pool IDs reveals the real problem:
BP-1073838461-172.16.20.24-1691128065381
BP-1873526852-172.16.21.30-1692769875005
Note the IP addresses in the middle. One more piece of background: I had copied the entire installation from 172.16.20.24 to 172.16.21.30, which is why the IP changed. The fix was to rename BP-1073838461-172.16.20.24-1691128065381 to BP-1073838461-172.16.21.30-1691128065381, and then delete BP-1873526852-172.16.21.30-1692769875005.
After that, the error was gone and the DataNode no longer died.
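That final fix can be sketched like this, demoed on a throwaway tree standing in for the dfs.datanode.data.dir/current directory (on a real cluster, stop the DataNode before touching these directories):

```shell
# Demo on a throwaway tree standing in for .../dfs/data/current.
WORK=$(mktemp -d)
mkdir -p "$WORK/BP-1073838461-172.16.20.24-1691128065381" \
         "$WORK/BP-1873526852-172.16.21.30-1692769875005"

# Rename the original block pool so its name carries the new IP,
# keeping all of its block data.
mv "$WORK/BP-1073838461-172.16.20.24-1691128065381" \
   "$WORK/BP-1073838461-172.16.21.30-1691128065381"

# Remove the block pool created by the accidental reformat.
rm -rf "$WORK/BP-1873526852-172.16.21.30-1692769875005"

ls "$WORK"   # → BP-1073838461-172.16.21.30-1691128065381
```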