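While setting up an experimental Hadoop cluster recently, I often got this error when starting the cluster: jobtracker.info could only be replicated to 0 nodes, instead of 1
I went through a lot of material and tried many fixes; the command below finally showed where the problem was.
Cause: Configured Capacity is 0, meaning the datanode has no capacity allocated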
[root@dev9106 bin]# ./hadoop dfsadmin -report
Configured Capacity: 0 (0 KB)
Present Capacity: 0 (0 KB)
DFS Remaining: 0 (0 KB)
DFS Used: 0 (0 KB)
DFS Used%: ?%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
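The check above can be scripted. This is a minimal sketch, assuming the report keeps the "Configured Capacity: <bytes> (<human readable>)" layout shown in the output; the function names are made up for illustration.

```shell
# Read `hadoop dfsadmin -report` output on stdin and print the number
# that follows the first "Configured Capacity:" line (the cluster total).
configured_capacity() {
    awk -F': *' '/^Configured Capacity:/ { split($2, parts, " "); print parts[1]; exit }'
}

# Succeed (exit status 0) when the report on stdin shows zero capacity.
capacity_is_zero() {
    [ "$(configured_capacity)" = "0" ]
}

# Typical use on a live cluster (not run here):
#   bin/hadoop dfsadmin -report | capacity_is_zero && echo "datanodes report no capacity"
```

A zero here only says that no datanode has registered usable storage; the reason still has to be found in the datanode logs or the configuration.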
Solution:
1. Check your file systems
[root@dev9106 /]# df -hl
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 1.9G 1.6G 302M 84% /
/dev/sda8 845G 47G 756G 6% /home
/dev/sda7 5.7G 147M 5.3G 3% /tmp
/dev/sda6 9.5G 4.0G 5.1G 45% /usr
/dev/sda5 9.5G 273M 8.8G 3% /var
/dev/sda1 190M 15M 167M 8% /boot
tmpfs 7.8G 0 7.8G 0% /dev/shm
2. In Hadoop's conf/core-site.xml, change the value of hadoop.tmp.dir so that it points to a partition with plenty of free space (here /home)
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/dhfs/tmp</value>
</property>
</configuration>
3. Stop the Hadoop services and reformat the namenode
4. Restart the services
5. OK
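The steps above can be sketched as two small shell helpers. This is only an illustration: roomiest_mount and reformat_and_restart are hypothetical names, HADOOP_HOME is an assumed variable pointing at the Hadoop install directory, and the parser assumes POSIX-format df output (df -P) with no spaces in mount points. Note that it will happily pick a tmpfs mount if that has the most free space, so the result still needs a sanity check.

```shell
# Read `df -P` output on stdin (fourth column = available blocks,
# sixth column = mount point) and print the mount point with the
# most available space. The header line is skipped.
roomiest_mount() {
    awk 'NR > 1 && $4 + 0 > max { max = $4 + 0; mount = $6 } END { print mount }'
}

# Steps 3 and 4: stop the daemons, format the namenode, start again.
# WARNING: formatting the namenode discards the existing HDFS metadata.
reformat_and_restart() {
    "$HADOOP_HOME"/bin/stop-all.sh &&
    "$HADOOP_HOME"/bin/hadoop namenode -format &&
    "$HADOOP_HOME"/bin/start-all.sh
}

# Typical use (not run here):
#   df -P -l | roomiest_mount      # choose where hadoop.tmp.dir should live
#   reformat_and_restart
```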
2. While a Hadoop job was running I killed it partway through; afterwards, adding files to or deleting files from HDFS failed with a "Name node is in safe mode" error:
rmr: org.apache.hadoop.dfs.SafeModeException: Cannot delete /user/hadoop/input. Name node is in safe mode
The command that fixes it:
bin/hadoop dfsadmin -safemode leave    # turn off safe mode
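Before forcing the namenode out of safe mode it can help to ask whether it is actually in it; dfsadmin also accepts -safemode get for that. The sketch below assumes the reply contains the text "Safe mode is ON" or "Safe mode is OFF"; the function names and the HADOOP_HOME variable are assumptions for illustration.

```shell
# Read the output of `hadoop dfsadmin -safemode get` on stdin and
# succeed (exit status 0) only when it reports that safe mode is on.
safemode_is_on() {
    grep -q 'Safe mode is ON'
}

# Leave safe mode only if the namenode is currently in it.
leave_safemode_if_on() {
    if "$HADOOP_HOME"/bin/hadoop dfsadmin -safemode get | safemode_is_on; then
        "$HADOOP_HOME"/bin/hadoop dfsadmin -safemode leave
    fi
}

# Typical use (not run here):
#   leave_safemode_if_on
```

The namenode also enters safe mode by itself at startup and normally leaves once enough blocks have been reported, so bin/hadoop dfsadmin -safemode wait is a gentler alternative when the cluster has only just been started.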
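I finally found the cause of the error. The datanode log was empty because of the Hadoop version: I was using the hadoop-0.19.1 bundled with nutch-1.0, which is probably too old. I installed hadoop-0.20.2 and tried again; with the same configuration the datanode now reported an error, and I was thrilled (finally, an error!): org.apache.hadoop.ipc.Client: Retrying connect to server: openlab0/192.168.1.180:9000. Already tried 0 time(s)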
…………
not available yet ,Zzzzz...
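After looking up the cause online, I changed fs.default.name from hdfs://openlab0:9000 to hdfs://192.168.1.180:9000, that is, replaced the host name with the IP address, and changed mapred.job.tracker the same way.
Then reformat again and it works.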
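The post does not say why the host name failed. One plausible explanation, stated here as an assumption, is that openlab0 did not resolve to the same address on every node. A small sketch for pulling the host out of the configured URI and checking whether it is a name or an IP address; uri_host and is_ipv4 are hypothetical helper names.

```shell
# Print the host part of a URI such as hdfs://openlab0:9000.
uri_host() {
    rest=${1#*://}                 # drop the scheme
    rest=${rest%%/*}               # drop any path
    printf '%s\n' "${rest%%:*}"    # drop the port
}

# Succeed when the argument looks like a dotted IPv4 address.
is_ipv4() {
    printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# Typical use on each node (not run here): see what the name resolves to
#   host=$(uri_host hdfs://openlab0:9000)
#   is_ipv4 "$host" || getent hosts "$host"
```

If the name resolves to a loopback address on the master, the namenode would be listening where the other nodes cannot reach it, which would match the "Retrying connect to server" messages.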
The problem that had plagued me for half a month is finally solved, and I'm damn happy!! :'( :)
I just checked port usage and found some problems:
Ports on the master node (yanxinhe), which is both namenode and datanode:
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 *:50020 *:* LISTEN
tcp 0 0 yanxinhe:38504 *:* LISTEN
tcp 0 0 yanxinhe:9000 *:* LISTEN
tcp 0 0 yanxinhe:9001 *:* LISTEN
tcp 0 0 *:50090 *:* LISTEN
tcp 0 0 *:50060 *:* LISTEN
tcp 0 0 *:50030 *:* LISTEN
tcp 0 0 *:50070 *:* LISTEN
tcp 0 0 *:ssh *:* LISTEN
tcp 0 0 yanxinhe:ipp *:* LISTEN
tcp 0 0 *:50010 *:* LISTEN
tcp 0 0 *:50075 *:* LISTEN
tcp 0 0 yanxinhe.local:ssh wangchi:38780 ESTABLISHED
tcp 0 0 yanxinhe:60070 yanxinhe:9000 ESTABLISHED
tcp 0 0 yanxinhe:9001 yanxinhe:51473 ESTABLISHED
tcp 0 0 yanxinhe.local:ssh 192.168.1.124:ripd ESTABLISHED
tcp 0 0 yanxinhe:51473 yanxinhe:9001 ESTABLISHED
tcp 0 0 yanxinhe:9000 yanxinhe:60070 ESTABLISHED
Port usage on datanode (wangchi):
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 *:50060 *:* LISTEN
tcp 0 0 *:ssh *:* LISTEN
tcp 0 0 wangchi:ipp *:* LISTEN
tcp 0 0 wangchi:39544 *:* LISTEN
tcp 0 0 wangchi.local:38185 wangchi1:ssh ESTABLISHED
tcp 0 0 wangchi.local:38780 yanxinhe:ssh ESTABLISHED
tcp6 0 0 [::]:ssh [::]:* LISTEN
tcp6 0 0 localhost:ipp [::]:* LISTEN
udp 0 0 *:mdns *:*
udp 0 0 *:39672 *:*
Port usage on datanode (wangchi1):
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 wangchi1:35171 *:* LISTEN
tcp 0 0 *:50060 *:* LISTEN
tcp 0 0 *:ssh *:* LISTEN
tcp 0 0 wangchi1:ipp *:* LISTEN
tcp 0 0 wangchi1.local:ssh wangchi:38185 ESTABLISHED
tcp6 0 0 [::]:ssh [::]:* LISTEN
tcp6 0 0 localhost:ipp [::]:* LISTEN
udp 0 0 *:46504 *:*
udp 0 0 *:mdns *:*
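Looking at the output above, both datanodes are listening only on port 50060 (the TaskTracker HTTP status address), while port 50020 (the DataNode IPC address), port 50010 (the DataNode service address), and port 50075 (the DataNode HTTP status address) are not open. This suggests that the TaskTracker started on those machines but the DataNode daemon did not.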
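The same comparison can be done mechanically. A minimal sketch, assuming netstat output in the layout shown above and the default DataNode ports named in the text; missing_datanode_ports is a made-up name.

```shell
# Read netstat output on stdin and print each default DataNode port
# (50010, 50020, 50075) that has no socket in the LISTEN state.
missing_datanode_ports() {
    input=$(cat)
    for port in 50010 50020 50075; do
        if ! printf '%s\n' "$input" | grep -Eq ":${port}[[:space:]].*LISTEN"; then
            echo "$port"
        fi
    done
}

# Typical use on each node (not run here):
#   netstat -tln | missing_datanode_ports
```

No output means all three ports are listening; on wangchi and wangchi1 above it would print all three.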