could only be replicated to 0 nodes, instead of 1

用三台Linux搭建hadoop环境时出错,master主机部分信息日志如下:
2015-03-28 17:13:12,147 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for null bad datanode[0] nodes == null
2015-03-28 17:13:12,147 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/home/hadoop/tmp/mapred/system/jobtracker.info" - Aborting...
2015-03-28 17:13:12,147 WARN org.apache.hadoop.mapred.JobTracker: Writing to file hdfs://master:10000/home/hadoop/tmp/mapred/system/jobtracker.info failed!
2015-03-28 17:13:12,147 WARN org.apache.hadoop.mapred.JobTracker: FileSystem is not ready yet!
2015-03-28 17:13:12,151 WARN org.apache.hadoop.mapred.JobTracker: Failed to initialize recovery manager.
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /home/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1920)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:783)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)


初看还以为是 dfs.replication的配置错误,其实不然,当看到
Error Recovery for null bad datanode[0] nodes == null
怀疑是原先启动时造成的数据缓存问题,于是清空 hadoop.tmp.dir的数据并重启hadoop,访问http://master:50030和http://master:50070,正常显示页面,看起来算是成功解决了问题!

虽然50030和50070可以成功访问,但是其它2个slave节点使用 ps -ux命令时发现没有hadoop相关进程,这显然是不正常的,于是看了下save1的日志,如下:

2015-03-29 09:48:24,093 ERROR org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Error getting localhost name. Using 'localhost'...
java.net.UnknownHostException: hsdb01: hsdb01: unknown error
at java.net.InetAddress.getLocalHost(InetAddress.java:1484)
at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.getHostname(MetricsSystemImpl.java:481)
at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.configureSystem(MetricsSystemImpl.java:412)
at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.configure(MetricsSystemImpl.java:408)
at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.start(MetricsSystemImpl.java:152)
at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.init(MetricsSystemImpl.java:133)
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.init(DefaultMetricsSystem.java:40)
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.initialize(DefaultMetricsSystem.java:50)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1650)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1669)
at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1795)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1812)
Caused by: java.net.UnknownHostException: hsdb01: unknown error
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:907)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1302)
at java.net.InetAddress.getLocalHost(InetAddress.java:1479)
... 11 more


可见是 /etc/hosts配置遗漏,需加入hsdb01和hsdb02。如下:
127.0.0.1  localhost
28.18.19.34 master root123
28.18.12.57 slave1 hsdb01
28.18.12.58 slave2 hsdb02


重启Hadoop
1、bin/hadoop namenode -format
2、bin/start-all.sh

访问50030和50070正常,可查看slave的日志时发现
2015-03-29 10:15:47,221 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /home/hadoop/tmp/dfs/data: namenode namespaceID = 1374430296; datanode namespaceID = 627398707
at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:232)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:147)
at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:414)
at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:321)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1712)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1651)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1669)
at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1795)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1812)


关闭hadoop时信息如下:

could only be replicated to 0 nodes, instead of 1

产生此问题一般是由于两次或两次以上格式化namenode造成,解决方案借鉴了 hadoop常见问题(2).no datanode to stop中提到修改namespaceID,问题最终解决。(*^__^*) 嘻嘻!

PS:本次我修改master的namespaceID后重启hadoop时错误依旧,所以我修改的是slave的namespaceID,但如果slave很多,维护可就耗时了!hadoop的集群部署一次性成功最好,否则真是"后患无穷"啊!呵呵!

Hadoop集群(第5期)_Hadoop安装配置
hadoop启动和运行中的error总结和处理方法

hadoop异常“could only be replicated to 0 nodes, instead of 1” 解决

下面这两种方法从书籍《Hadoop实战》第2版中看到,在此记录一下,在实际应用也可能会用到。
1、重启坏掉的DataNode或JobTrack。当hadoop集群的单个节点出现问题时,一般不必重启整个系统,只须重启这个节点,它会自动连入这个集群。
在坏死的节点上输入如下命令即可:

bin/hadoop-daemon.sh  start  datanode
bin/hadoop-daemon.sh  start  jobtracker

2、动态加入DataNode或JobTracker。下面这条命令允许用户动态地将某个节点加入到集群中。

bin/hadoop-daemon.sh  --config  ./conf  start  datanode
bin/hadoop-daemon.sh  --config  ./conf  start  tasktracker

你可能感兴趣的:(hadoop)