Hadoop error: Exception in doCheckpoint: java.io.IOException: Inconsistent checkpoint fields

The error

2023-08-23 17:12:44,688 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint
java.io.IOException: Inconsistent checkpoint fields.
LV = -63 namespaceID = 710154960 cTime = 0 ; clusterId = CID-46a67bdd-c7b0-4056-9c54-d82d5d84964a ; blockpoolId = BP-1873526852-172.16.21.30-1692769875005.
Expecting respectively: -63; 648161912; 0; CID-46a67bdd-c7b0-4056-9c54-d82d5d84964a; BP-1073838461-172.16.20.24-1691128065381.

Background on how this problem came up:

I had reset the cluster by reformatting the NameNode:

 hdfs namenode -format

This left the cluster IDs inconsistent. Not wanting to lose data, I set the cluster ID back to the original value everywhere, including in the VERSION files under the directories below (see the shell sketch after the list).

Directories defined in hdfs-site.xml (the VERSION file under each):

dfs.datanode.data.dir: /home/hadoop/dfs/data/current/VERSION

dfs.namenode.name.dir: /home/hadoop/dfs/name/current/VERSION

Directory defined in core-site.xml (hadoop.tmp.dir):

/home/hadoop/tmp/hadoop-${user.name}

Under it there is one subtree per user I ran Hadoop as:

/home/hadoop/tmp/hadoop-root/dfs/namesecondary/current/VERSION

/home/hadoop/tmp/hadoop-hadoop/dfs/namesecondary/current/VERSION
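A minimal sketch of that edit, assuming the paths above; the CID value is the one from the error message, so adjust both paths and IDs to your own layout:

 # Show the clusterID recorded in each VERSION file
 grep -H '^clusterID=' \
   /home/hadoop/dfs/name/current/VERSION \
   /home/hadoop/dfs/data/current/VERSION \
   /home/hadoop/tmp/hadoop-*/dfs/namesecondary/current/VERSION

 # Rewrite a stale clusterID back to the original
 NEW_CID='CID-46a67bdd-c7b0-4056-9c54-d82d5d84964a'
 sed -i "s/^clusterID=.*/clusterID=${NEW_CID}/" /home/hadoop/dfs/data/current/VERSION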

Even after making the cluster IDs consistent, the problem remained: the exact same error.

A fix suggested online (did not work for me):

Copy the following hadoop.tmp.dir property from core-site.xml into hdfs-site.xml as well:

   <property>
      <name>hadoop.tmp.dir</name>
      <value>/home/hadoop/tmp/hadoop-${user.name}</value>
      <description>A base for other temporary directories.</description>
   </property>

Still the same error.

Notice that namespaceID = 710154960 in the error differs from the original namespaceID = 648161912.

So I needed to find where 710154960 was stored and replace it with 648161912.
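A quick way to locate it (a sketch; the search roots are the directories from my config above):

 grep -rH 'namespaceID=710154960' /home/hadoop/dfs /home/hadoop/tmp 2>/dev/null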

The error above points at the block pool

blockpoolId = BP-1873526852-172.16.21.30-1692769875005.

Check that block pool's VERSION file:

cat /home/hadoop/dfs/data/current/BP-1873526852-172.16.21.30-1692769875005/current/VERSION 

#Wed Aug 23 17:35:36 CST 2023
namespaceID=710154960
cTime=0
blockpoolID=BP-1873526852-172.16.21.30-1692769875005
layoutVersion=-56

Sure enough, there is namespaceID=710154960; change it to namespaceID=648161912.
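The same edit as a one-liner (same VERSION path as the cat above):

 sed -i 's/^namespaceID=710154960$/namespaceID=648161912/' \
   /home/hadoop/dfs/data/current/BP-1873526852-172.16.21.30-1692769875005/current/VERSION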

After restarting the cluster, the DataNode failed to come up, with this error:

2023-08-23 17:46:44,586 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to analyze storage directories for block pool BP-1873526852-172.16.21.30-1692769875005
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /work1/home/hadoop/dfs/data/current/BP-1873526852-172.16.21.30-1692769875005 is in an inconsistent state: namespaceID is incompatible with others.
        at org.apache.hadoop.hdfs.server.common.StorageInfo.setNamespaceID(StorageInfo.java:189)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.setFieldsFromProperties(BlockPoolSliceStorage.java:329)
        at org.apache.hadoop.hdfs.server.common.StorageInfo.readProperties(StorageInfo.java:232)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:364)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:173)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:216)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:244)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:395)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:477)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1358)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1323)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:802)
        at java.lang.Thread.run(Thread.java:750)
2023-08-23 17:46:44,587 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to add storage for block pool: BP-1873526852-172.16.21.30-1692769875005 : Directory /work1/home/hadoop/dfs/data/current/BP-1873526852-172.16.21.30-1692769875005 is in an inconsistent state: namespaceID is incompatible with others.
2023-08-23 17:46:44,588 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool  (Datanode Uuid unassigned) service to localhost/127.0.0.1:9000. Exiting. 
java.io.IOException: All specified directories are failed to load.
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:478)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1358)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1323)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:802)
        at java.lang.Thread.run(Thread.java:750)

The trace reveals another problem:

BP-1073838461-172.16.20.24-1691128065381

BP-1873526852-172.16.21.30-1692769875005

Note the IP address embedded in each ID. Here some background was missing: I had copied the whole installation from 172.16.20.24 to 172.16.21.30, which is why the IP in the block pool ID changed.

The fix: rename BP-1073838461-172.16.20.24-1691128065381 to BP-1073838461-172.16.21.30-1691128065381, then delete BP-1873526852-172.16.21.30-1692769875005, as sketched below.
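As shell commands (a sketch assuming the dfs.datanode.data.dir layout shown earlier; stop the DataNode before touching these directories):

 DATA_DIR=/home/hadoop/dfs/data/current
 # Rename the original block pool so its ID carries the new IP
 mv "${DATA_DIR}/BP-1073838461-172.16.20.24-1691128065381" \
    "${DATA_DIR}/BP-1073838461-172.16.21.30-1691128065381"
 # Remove the block pool created by the accidental re-format
 rm -rf "${DATA_DIR}/BP-1873526852-172.16.21.30-1692769875005"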

After that, the error is gone and the DataNode no longer dies.
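To verify, restart HDFS and check that the DataNode registers (standard Hadoop scripts and tools):

 stop-dfs.sh && start-dfs.sh   # restart HDFS
 jps                           # NameNode, DataNode, SecondaryNameNode should be listed
 hdfs dfsadmin -report         # live DataNodes should appear with their capacity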
