Hadoop + Spark Big Data Cluster Notes 1 (Handling the "There are 0 datanode(s) running" Error)

Hadoop + Spark Big Data Cluster Notes 1

Since this project involves a Hadoop + Spark big data cluster, I am writing this document as a reference for handling similar problems in the future and as a solution others can follow. My knowledge is limited and the document will inevitably contain errors and omissions; feedback and discussion are welcome. My homepage: http://35.234.31.208

Problem Description

The cluster runs Hadoop 2.7.7 in a production environment, with one master node and one slave node, and is started with the following script:

#!/bin/bash
# Clear the local HDFS data and logs, then reformat the NameNode
cd /usr/local/
rm -rf ./hadoop/tmp
rm -rf ./hadoop/logs/*
cd /
/usr/local/hadoop/bin/hdfs namenode -format
# Start the Hadoop daemons, then the Spark master and workers
/usr/local/hadoop/sbin/start-all.sh
/usr/local/spark/sbin/start-master.sh
/usr/local/spark/sbin/start-slaves.sh
# Create the working directories in HDFS and upload the Spark dependencies
/usr/local/hadoop/bin/hadoop fs -mkdir -p /user/30/logs
/usr/local/hadoop/bin/hadoop fs -mkdir -p /user/spark-python
/usr/local/hadoop/bin/hadoop fs -put /usr/local/spark-python.zip /user/spark-python
/usr/local/hadoop/bin/hadoop fs -mkdir -p /user/spark/lib_jars/
/usr/local/hadoop/bin/hadoop fs -put /usr/local/spark/jars/*.jar /user/spark/lib_jars/
hdfs dfs -chmod 777 /user/30/logs
/bin/bash

When the script reached the /usr/local/hadoop/bin/hadoop fs -put commands, the following error appeared:

put: File /user/spark/lib_jars/spark-repl_2.11-2.2.0.jar._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1).  There are 0 datanode(s) running and no node(s) are excluded in this operation.
20/08/14 11:15:25 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/spark/lib_jars/spark-sketch_2.11-2.2.0.jar._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1).  There are 0 datanode(s) running and no node(s) are excluded in this operation.
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1620)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3135)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3059)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:725)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:493)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2217)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2213)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2211)

	at org.apache.hadoop.ipc.Client.call(Client.java:1476)
	at org.apache.hadoop.ipc.Client.call(Client.java:1413)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
	at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:418)
	at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy11.addBlock(Unknown Source)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1603)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1388)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:554)

Running jps on the slave node to inspect the Java processes showed no DataNode process, so the DataNode had failed to start.
Various online forums already offer solutions to this problem, which fall roughly into two categories:
1. The clusterID values in the VERSION files of the NameNode and DataNode do not match.
2. Port 9000 on the master node is bound to the wrong address.
Both were ruled out (the quick checks below show how); the situation discussed in this article falls into neither category, so they are not covered further.
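For reference, the two common causes can be checked with commands like the following. This is only a sketch: the VERSION file locations depend on dfs.namenode.name.dir and dfs.datanode.data.dir, assumed here to live under /usr/local/hadoop/tmp.

# On the master: clusterID recorded by the NameNode
grep clusterID /usr/local/hadoop/tmp/dfs/name/current/VERSION
# On the slave: clusterID recorded by the DataNode (must match the NameNode's)
grep clusterID /usr/local/hadoop/tmp/dfs/data/current/VERSION
# On the master: check which address port 9000 (fs.defaultFS) is bound to
netstat -tlnp | grep 9000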

Checking the log files under $HADOOP_HOME/logs/ showed that the master host entries in the configuration files core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml were wrong. I then remembered that, because of the project's functional requirements, this is an elastic cluster: several clusters are assembled on demand, each containing different slaves, i.e. a given machine acts as a slave of different clusters for different tasks. This requires automatically updating the host addresses in every configuration file before each start of the Hadoop cluster.
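That automated update can be as simple as a text substitution run before the startup script. A minimal sketch, assuming the configuration files live under /usr/local/hadoop/etc/hadoop/ and using hypothetical OLD_MASTER/NEW_MASTER placeholders for the previous and current master hostnames:

#!/bin/bash
OLD_MASTER=old-master   # hypothetical: hostname currently written in the configs
NEW_MASTER=new-master   # hypothetical: hostname this cluster should use now
for f in core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml; do
    sed -i "s/${OLD_MASTER}/${NEW_MASTER}/g" /usr/local/hadoop/etc/hadoop/$f
done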

Following the principle of locating the problem first, I corrected the configuration files by hand and restarted the cluster with the script. The error
could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation.
still occurred, but the DataNode process on the slave node now started successfully.

Then, running /usr/local/hadoop/bin/hadoop fs -put /usr/local/spark-python.zip /user/spark-python by hand succeeded.

This gave me good reason to suspect that, right after the script finishes start-all.sh, the DataNode on the slave still needs time to start, and the master begins the HDFS put operations before the slave's DataNode is up. So I modified the startup script to:

#!/bin/bash
cd /usr/local/
rm -rf ./hadoop/tmp
rm -rf ./hadoop/logs/*
cd /
/usr/local/hadoop/bin/hdfs namenode -format
/usr/local/hadoop/sbin/start-all.sh
/usr/local/spark/sbin/start-master.sh
/usr/local/spark/sbin/start-slaves.sh
/usr/local/hadoop/bin/hadoop fs -mkdir -p /user/30/logs
/usr/local/hadoop/bin/hadoop fs -mkdir -p /user/spark-python
sleep 10s   # give the slave's DataNode time to start before the first put
/usr/local/hadoop/bin/hadoop fs -put /usr/local/spark-python.zip /user/spark-python
/usr/local/hadoop/bin/hadoop fs -mkdir -p /user/spark/lib_jars/
/usr/local/hadoop/bin/hadoop fs -put /usr/local/spark/jars/*.jar /user/spark/lib_jars/
hdfs dfs -chmod 777 /user/30/logs
/bin/bash

Running the script again, the hdfs put of .../spark-python still hit the error above, while the hdfs put into .../lib_jars/ succeeded: the DataNode evidently came up between the two put commands, which confirmed my hypothesis that this was a timing issue.

After changing the startup script to the final version below, the problem was completely resolved.

#!/bin/bash
cd /usr/local/
rm -rf ./hadoop/tmp
rm -rf ./hadoop/logs/*
cd /
/usr/local/hadoop/bin/hdfs namenode -format
/usr/local/hadoop/sbin/start-all.sh
/usr/local/spark/sbin/start-master.sh
/usr/local/spark/sbin/start-slaves.sh
/usr/local/hadoop/bin/hadoop fs -mkdir -p /user/30/logs
/usr/local/hadoop/bin/hadoop fs -mkdir -p /user/spark-python
sleep 20s   # wait longer for the slave's DataNode to finish starting
/usr/local/hadoop/bin/hadoop fs -put /usr/local/spark-python.zip /user/spark-python
/usr/local/hadoop/bin/hadoop fs -mkdir -p /user/spark/lib_jars/
/usr/local/hadoop/bin/hadoop fs -put /usr/local/spark/jars/*.jar /user/spark/lib_jars/
hdfs dfs -chmod 777 /user/30/logs
/bin/bash
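
A fixed sleep works here, but it is fragile if daemon startup ever takes longer on a loaded machine. A more robust alternative, sketched below under the assumption that hdfs dfsadmin -report prints a "Live datanodes (N):" line in this Hadoop version, is to poll the NameNode until at least one DataNode has registered before running the put commands:

# Wait (up to about 60 seconds) until at least one DataNode has registered,
# instead of sleeping for a fixed time.
for i in $(seq 1 30); do
    live=$(/usr/local/hadoop/bin/hdfs dfsadmin -report 2>/dev/null \
           | grep -o 'Live datanodes ([0-9]*)' | grep -o '[0-9]*')
    if [ "${live:-0}" -ge 1 ]; then
        echo "DataNode registered, continuing"
        break
    fi
    sleep 2
done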
