1. mapreduce.shuffle set in yarn.nodemanager.aux-services is invalid.The valid service name should only contain a-zA-Z0-9_ and can not start with numbers
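Solution:
In the yarn-site.xml configuration file, set yarn.nodemanager.aux-services to a name that satisfies this rule, i.e. mapreduce_shuffle instead of mapreduce.shuffle. If the file also sets a yarn.nodemanager.aux-services.mapreduce.shuffle.class property, its key needs the same rename.
Restart and everything is OK.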
This problem is actually due to
https://issues.apache.org/jira/i#browse/YARN-1289
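The naming rule quoted in the error message can be checked with a few lines of Python. This is only a sketch of the rule as the message states it (letters, digits, and underscores, not starting with a digit); the function name is_valid_aux_service is made up for illustration and is not part of Hadoop.

```python
import re

# Rule as stated in the error message: only a-zA-Z0-9_ is allowed and
# the name can not start with a number.
# Assumption: this regex mirrors the message text, not Hadoop's source code.
_NAME_RULE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_aux_service(name: str) -> bool:
    """Return True if `name` satisfies the rule from the error message."""
    return _NAME_RULE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("mapreduce.shuffle", "mapreduce_shuffle"):
        print(candidate, is_valid_aux_service(candidate))
```

The dot in mapreduce.shuffle is what fails the check; replacing it with an underscore passes.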
2. Datanode denied communication with namenode: DatanodeRegistration(0.0.0.0, storageID=DS-445663431-127.0.0.1-50010-1394867858930, infoPort=50075, ipcPort=50020, storageInfo=lv=-47;cid=CID-982d36dc-3def-47fa-8cf7-9f2f19089eaa;nsid=1165019335;c=0)
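After some searching online, two possible causes came up:
1. Mismatched clusterID: the namenode's cid differs from the datanode's cid. This happens because formatting the namenode does not format the datanode, so the cid stored on the datanode is still the one the namenode had before the format. The fix is to delete the datanode's dfs.datanode.data.dir directory and the tmp directory, then run start-dfs.sh again.
2. Even after removing the iptables rules, the "Datanode denied communication with namenode: DatanodeRegistration" error is still reported. According to http://stackoverflow.com/questions/17082789/cdh4-3exception-from-the-logs-after-start-dfs-sh-datanode-and-namenode-star, writing the IP address of every host in the cluster into /etc/hosts solves the problem.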
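The first cause can be confirmed before deleting anything by comparing the two clusterID values directly. The sketch below assumes the usual layout in which each storage directory keeps a properties-style file at current/VERSION containing a clusterID= line; the function names and the paths in the usage note are placeholders, not values taken from this setup.

```python
from pathlib import Path
from typing import Optional


def read_cluster_id(version_text: str) -> Optional[str]:
    """Extract the clusterID value from the text of a VERSION file."""
    for line in version_text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key == "clusterID":
            return value
    return None  # no clusterID line found


def cluster_ids_match(namenode_dir: str, datanode_dir: str) -> bool:
    """Compare clusterID in <dir>/current/VERSION for both directories."""
    nn = read_cluster_id(Path(namenode_dir, "current", "VERSION").read_text())
    dn = read_cluster_id(Path(datanode_dir, "current", "VERSION").read_text())
    return nn is not None and nn == dn
```

A call such as cluster_ids_match("/path/to/name/dir", "/path/to/data/dir"), with the directories configured for dfs.namenode.name.dir and dfs.datanode.data.dir, returning False would point to this first cause.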
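I clear the datanode data directory and tmp every time, so it was not the first cause. I did not fully understand the second one, though; my guess was that host name resolution through /etc/hosts was at fault. My HDFS configuration points only at the local machine, but it used an IP address rather than localhost. Guessing that this was the reason, I changed every IP address in the configuration files to localhost, and the problem was solved.
References: http://wang-2011-ying.iteye.com/blog/1996654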
http://grokbase.com/t/cloudera/scm-users/135y4jn3dw/datanode-denied-connection-with-namenode
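A quick way to test the second cause is to check that every host name used in the configuration files resolves at all. This is a minimal sketch using Python's standard socket module; the function name unresolvable_hosts and the host list in the example are hypothetical.

```python
import socket
from typing import Callable, Iterable, List


def unresolvable_hosts(
    names: Iterable[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> List[str]:
    """Return the names that the resolver fails to map to an IP address."""
    failed = []
    for name in names:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(name)
    return failed


if __name__ == "__main__":
    # Hypothetical host list; replace with the names used in the configs.
    print(unresolvable_hosts(["localhost"]))
```

The resolve parameter exists only so the function can be exercised without touching the real resolver; any name it reports is a candidate for an /etc/hosts entry.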
3. 2014-03-15 16:15:10,307 INFO org.apache.hadoop.service.AbstractService: Service ResourceManager failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:631)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:872)
Caused by: java.net.UnknownHostException: localhost.localdomain: localhost.localdomain
at java.net.InetAddress.getLocalHost(InetAddress.java:1425)
at org.apache.hadoop.security.SecurityUtil.getLocalHostName(SecurityUtil.java:227)
at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:247)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.doSecureLogin(ResourceManager.java:685)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:629)
... 2 more
2014-03-15 16:15:10,308 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ResourceManager metrics system...
2014-03-15 16:15:10,308 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system stopped.
2014-03-15 16:15:10,308 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system shutdown complete.
2014-03-15 16:15:10,309 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:631)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:872)
Caused by: java.net.UnknownHostException: localhost.localdomain: localhost.localdomain
at java.net.InetAddress.getLocalHost(InetAddress.java:1425)
at org.apache.hadoop.security.SecurityUtil.getLocalHostName(SecurityUtil.java:227)
at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:247)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.doSecureLogin(ResourceManager.java:685)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:629)
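The problem above occurred because, while fixing problem 2, I only changed hdfs.xml and did not change the /etc/hosts file.
There was no way around it, so I had to edit that as well: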
# Do not remove the following line, or various programs
# that require network functionality will fail.
#127.0.0.1 localhost.localdomain localhost
#::1 localhost6.localdomain6 localhost6
127.0.0.1 localhost
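Note that changing this alone is not enough; /etc/sysconfig/network has to be modified too.
Running the hostname command shows the current host name.
In other words, Hadoop obtains the host name that the hostname command reports, which is localhost.localdomain, and then tries to map it through /etc/hosts. The referenced post describes this for formatting HDFS; in the stack trace above the same lookup (InetAddress.getLocalHost) runs while the ResourceManager starts. No entry is found, meaning localhost.localdomain cannot be mapped to any IP address, hence the error. Modify the /etc/sysconfig/network file: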
NETWORKING=yes
NETWORKING_IPV6=yes
#HOSTNAME=localhost.localdomain
HOSTNAME=localhost
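The lookup failure explained above (a host name reported by hostname with no matching /etc/hosts entry) can be illustrated by checking whether a name appears in the hosts-file contents. This is a simplified sketch that only parses hosts-file text; real resolution may also consult DNS, and the function name hosts_file_maps is hypothetical.

```python
def hosts_file_maps(hosts_text: str, hostname: str) -> bool:
    """Return True if `hostname` appears as a name on any hosts-file entry."""
    for raw in hosts_text.splitlines():
        entry = raw.split("#", 1)[0].split()  # drop comments, split fields
        # entry[0] is the address; the remaining fields are names/aliases
        if len(entry) >= 2 and hostname in entry[1:]:
            return True
    return False


if __name__ == "__main__":
    edited = "#127.0.0.1 localhost.localdomain localhost\n127.0.0.1 localhost\n"
    print(hosts_file_maps(edited, "localhost.localdomain"))  # False
    print(hosts_file_maps(edited, "localhost"))              # True
```

With the edited hosts file, only localhost is mapped, which is why the host name itself also has to become localhost.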
Rerun the MapReduce job:
hadoop jar hadoop-mapreduce-examples-2.2.0.jar pi 200 1000
OK.
Reference: http://blog.csdn.net/shirdrn/article/details/6562292