Common Problems When Setting Up Hadoop and ZooKeeper Clusters

Problem 1: YARN startup exception:

2019-09-30 18:15:49,231 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
org.apache.hadoop.HadoopIllegalArgumentException: Configuration doesn't specify yarn.resourcemanager.cluster-id
    at org.apache.hadoop.yarn.conf.YarnConfiguration.getClusterId(YarnConfiguration.java:1785)
    at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceInit(EmbeddedElectorService.java:82)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:145)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:276)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1309)

Solution: configure a cluster id in yarn-site.xml, as follows:

<property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>cluster1</value>
</property>
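
After adding the property and restarting YARN, you can verify that the ResourceManager comes up. If ResourceManager HA is configured, its state can be checked per rm-id (rm1 below is an assumed rm-id from your yarn.resourcemanager.ha.rm-ids setting):

yarn rmadmin -getServiceState rm1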

Problem 2: ZKFailoverController fails to start
Exception 1:

2019-09-30 18:15:45,010 FATAL org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Got a fatal error, exiting now
java.lang.IllegalArgumentException: Missing required configuration 'ha.zookeeper.quorum' for ZooKeeper quorum
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:115)
    at org.apache.hadoop.ha.ZKFailoverController.initZK(ZKFailoverController.java:340)
    at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:190)
    at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:60)
    at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:171)
    at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:167)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:444)
    at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:167)
    at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:192)

Solution:
① Confirm that the following property is configured:

<property>
    <name>ha.zookeeper.quorum</name>
    <value>node01:2181,node02:2181,node03:2181</value>
</property>
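
Before restarting ZKFC, it can help to confirm that the quorum is actually reachable from the NameNode hosts. A quick check with ZooKeeper's four-letter-word commands (assuming they are not disabled via 4lw.commands.whitelist in newer ZooKeeper releases):

echo ruok | nc node01 2181

A healthy server replies imok.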

② Check whether the server clocks are synchronized; time synchronization must be done as root.
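
A minimal way to sync the clock, run as root on every node (ntp.aliyun.com is just an example NTP server; an internal NTP server works too):

ntpdate ntp.aliyun.com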

③ If the shared edits directory on the JournalNodes has not been initialized yet, initialize it:

hdfs namenode -initializeSharedEdits

Exception 2:

2019-09-30 15:42:05,418 FATAL org.apache.hadoop.ha.ZKFailoverController: Unable to start failover controller. Parent znode does not exist.
Run with -formatZK flag to initialize ZooKeeper.

Solution: this means your Hadoop HA state has not been registered in ZooKeeper yet and needs to be initialized:

hdfs zkfc -formatZK

Then just restart the cluster.
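
To confirm that the parent znode was created, you can check from the ZooKeeper CLI; /hadoop-ha is the default parent znode (configurable via ha.zookeeper.parent-znode):

zkCli.sh -server node01:2181
ls /hadoop-ha
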
Problem 3: NameNode startup exception

2019-10-18 15:32:36,835 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-xxx/xxx:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=100, sleepTime=10000 MILLISECONDS)
2019-10-18 15:32:37,263 WARN org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input streams from QJM to [xxx1:8485, xxx2:8485, xxx3:8485]. Skipping.
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
	at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
	at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:471)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:278)
	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1463)
	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1487)
	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:212)
	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:324)
	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
	at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:412)
	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
2019-10-18 15:32:37,267 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started for standby state
2019-10-18 15:32:37,267 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Edit log tailer interrupted
java.lang.InterruptedException: sleep interrupted
	at java.lang.Thread.sleep(Native Method)
	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:337)
	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
	at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:412)
	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
2019-10-18 15:32:46,837 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-xxx/xxx:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=100, sleepTime=10000 MILLISECONDS)
2019-10-18 15:32:46,838 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-xxx/xxx:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=100, sleepTime=10000 MILLISECONDS)
2019-10-18 15:32:56,841 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-xxx/xxx:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=100, sleepTime=10000 MILLISECONDS)

Cause:
When start-dfs.sh runs, the default startup order is namenode > datanode > journalnode > zkfc. If the JournalNodes and the NameNode start on different machines, network latency can easily keep the NameNode from reaching the JournalNodes; it then fails to obtain a write quorum, and the freshly started NameNode suddenly dies. The NameNode does have a retry mechanism to wait for the JournalNodes to come up, but the retry count is limited, so under poor network conditions the retries can be exhausted before startup succeeds.

Solution:
Method ①: start the NameNodes manually, which avoids waiting on the JournalNodes across a slow network; once both NameNodes have connected to the JournalNodes, the startup no longer fails (see the start sequence sketch at the end of this problem).
Method ②: start the JournalNodes first, then run start-dfs.sh (the same sketch applies).
Method ③: raise the NameNode's retry count or interval for contacting the JournalNodes, so that normal startup delays and network delays are tolerated. Adjust the ipc parameters in hdfs-site.xml; by default the NameNode retries 10 times at 1000 ms each, so under poor network conditions the values need to be increased. The specific changes:

<property>
    <name>ipc.client.connect.max.retries</name>
    <value>100</value>
    <description>Indicates the number of retries a client will make to establish
    a server connection.
    </description>
</property>
<property>
    <name>ipc.client.connect.retry.interval</name>
    <value>10000</value>
    <description>Indicates the number of milliseconds a client will wait for
    before retrying to establish a server connection.
    </description>
</property>

If the JournalNodes' shared edits directory also needs (re-)initializing, run:

hdfs namenode -initializeSharedEdits

Then restart the cluster; the NameNode now runs normally.
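
For methods ① and ② above, a minimal start sequence sketch (assuming Hadoop 2.x daemon scripts, with JournalNodes on the hosts from the earlier examples):

# on each JournalNode host, bring the JournalNode up first
hadoop-daemon.sh start journalnode
# then start HDFS as usual
start-dfs.sh
# or start a NameNode by hand on its own host
hadoop-daemon.sh start namenode
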
Problem 4: the NameNode is unreachable; the log shows

java.io.IOException: There appears to be a gap in the edit log.  We expected txid 1, but got txid 2.

Solution: repair the metadata using the tool in Hadoop's bin directory:

hadoop namenode -recover

At the prompts, answer y first, then c.
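
Since -recover rewrites the NameNode metadata, backing up the name directory first is prudent (the path below is an assumed example; use the dfs.namenode.name.dir value from your hdfs-site.xml):

cp -r /data/hadoop/dfs/name /data/hadoop/dfs/name.bak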
Problem 5: with Hadoop HA, killing the active NameNode does not trigger an automatic switch to the standby NameNode, yet after manually restarting the killed NameNode, the standby does become active. The log shows:

Caused by: java.net.ConnectException: Connection refused
Caused by: java.io.FileNotFoundException: /root/.ssh/id_dsa (No such file or directory)

Solution: per what I found online, the hdfs-site.xml parameter dfs.ha.fencing.methods determines how, on failure, the other NameNode is logged into and fenced before takeover, and dfs.ha.fencing.ssh.private-key-files points to the private key used for passwordless SSH to that node. The path /root/.ssh/id_dsa had simply been copied from someone else's configuration; changing it to the file where this machine actually keeps its private key fixed it. After the change and a cluster restart, killing the active NameNode made the standby switch to active automatically.
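
A typical fencing configuration in hdfs-site.xml looks like this (the key path is an example; point it at the private key that actually exists on each NameNode host):

<property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
</property>
<property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/home/hadoop/.ssh/id_rsa</value>
</property>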
Problem 6: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect

2016-11-11 18:27:37,185 INFO  org.apache.zookeeper.ClientCnxn.logStartConnect:966 - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2016-11-11 18:27:37,191 WARN  org.apache.zookeeper.ClientCnxn.run:1089 - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)

Troubleshooting:

/home/lib/apache-zookeeper-3.5.8-bin/bin/zkServer.sh status

JMX enabled by default
Using config: /Users/zhangxiaolong/Documents/zhangxiaolong/program/zookeeper-3.4.5/bin/../conf/zoo.cfg
Error contacting service. It is probably not running.

Solution: ZooKeeper was not running; restarting it fixed the problem, as sketched below.
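
A minimal restart, assuming zkServer.sh is on the PATH (otherwise call it by its full path as above):

zkServer.sh start
zkServer.sh status
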
Problem 7: ZooKeeper fails to start: Error contacting service. It is probably not running
Solution:
Method ①: open zkServer.sh and find the status) branch:

STAT=`echo stat | nc localhost $(grep clientPort "$ZOOCFG" | sed -e 's/.*=//') 2> /dev/null| grep Mode`

Add -q 1 between nc and localhost (that's the digit 1, not the letter l); if it is already present, try removing it instead.
Note: I was running ZooKeeper 3.4.8, and my zkServer.sh does not contain this line at all, so this method did nothing for me.
Method ②: I hit this when calling sh zkServer.sh status. Searching around, some people fixed it by changing an nc argument in the script, but the 3.4.8 script contains no nc call. In those reports the real cause was that the log directory specified in the configuration had not been created; creating the directory by hand and restarting solved it.
Note: I didn't believe logging was my issue, so I never tried this method.
Method ③: create the myid file in the data directory, i.e. the directory pointed to by dataDir in zoo.cfg, and put the server id in it. The id must match the number in the corresponding server.N line: for server.1=hadoop1:2888:3888, just write 1 at the top of myid, as shown below.
Note: on my second install I had not created myid under the dataDir directory and got this same error; after creating myid there, the error went away.
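
For example, if zoo.cfg contains dataDir=/data/zookeeper and this host is server.1, create the file like this (the dataDir path is an assumed example):

echo 1 > /data/zookeeper/myid
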
Method ④: the firewall was not turned off. Turn it off:
Disable the firewall at boot:

systemctl disable firewalld.service

Stop the firewall:

systemctl stop firewalld

Disable SELinux:

vim /etc/selinux/config
SELINUX=disabled

Check the firewall status (active means the firewall is still running; inactive means it has been stopped):

systemctl status firewalld

Method ⑤: the hostname-to-IP mappings have not been set up.
To create them, run vim /etc/hosts and append the mapping between each hostname and its IP address at the end of the file, as in the example below.
Note: only after the mappings are in place can machines on the same subnet transfer files by hostname. Problem solved!
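
The appended entries might look like this (the IP addresses are placeholders; substitute your own):

192.168.1.101 node01
192.168.1.102 node02
192.168.1.103 node03
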
This article collects, with help from sources around the web, exceptions you may run into with Hadoop; I hope it helps!
