Running a Spark job in Yarn-Client mode fails with: Error initializing SparkContext. Failed to connect to driver!

Environment: Red Hat Enterprise Linux 7.3

Setup: a 4-node big-data cluster, plus 1 machine outside the cluster.

A spark-sql job is submitted in yarn-client mode from the machine outside the cluster to the big-data cluster, along the following lines.
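For context, such a submission looks roughly like this; it is a hypothetical sketch, and the query and settings are placeholders rather than the original job:

# Illustrative only: run on the machine outside the cluster, in yarn-client mode
spark-sql --master yarn --deploy-mode client -e "SELECT count(*) FROM some_table"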

 

The job fails.

The main symptoms are:

The shell log on the submitting side reports:

ERROR SparkContext: Error initializing SparkContext.

org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.

The Yarn-side (ApplicationMaster) log reports:

17/10/12 11:28:52 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:53 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:54 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
... (the same retry message repeats roughly once per second) ...
17/10/12 11:29:14 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:15 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:15 ERROR ApplicationMaster: Uncaught exception: 
org.apache.spark.SparkException: Failed to connect to driver!

 

Solution:

Turn off the firewall on the machine outside the cluster.

Commands to enable, disable, start, stop, and check the firewall:
(1) Enable the firewall at boot: systemctl enable firewalld.service
(2) Disable the firewall at boot: systemctl disable firewalld.service
(3) Start the firewall: systemctl start firewalld
(4) Stop the firewall: systemctl stop firewalld
(5) Check the firewall status: systemctl status firewalld
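After stopping the firewall it is worth confirming that it is really inactive; a minimal check, assuming firewalld is the only firewall in use on the machine:

# Confirm firewalld is stopped and will not come back at boot
systemctl is-active firewalld     # should print "inactive"
firewall-cmd --state              # should print "not running"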

 

Pitfalls along the way:

I had actually checked the firewall at the very beginning, but with the iptables command; since it printed nothing, I wrongly assumed the firewall was already off, while in fact the driver's port was never reachable from the cluster.
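A more direct test is to probe the driver's port from one of the cluster nodes; a minimal sketch, using the address and port from the log above (the port is random and changes on every run, so use the one reported in the current ApplicationMaster log):

# Run on a cluster node: try to reach the driver port on the submitting machine
telnet 10.10.98.191 33937
# An immediate "Connection refused" still proves the network path and firewall are open;
# a long hang, timeout, or "No route to host" points to a firewall or routing problem.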

 

 

Addendum:

The same error showed up again today. The firewall was already off but the problem remained; this time the cause turned out to be the IP address.

The server has both a management-network IP and a business-network IP. When the cluster tries to connect back to the driver, it always picks the management-network IP, which the big-data cluster cannot ping. So this failure really has two possible causes: the driver's IP address is unreachable, or the driver's port is blocked.

When the port is blocked, the fix is to turn off the firewall;

when the IP address is unreachable, the fix is to append --conf spark.driver.host=$your_ip_address to the spark-sql command.
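Putting the two together, a minimal sketch of the corrected submission; the query is a placeholder, and $your_ip_address must be the business-network IP that the cluster nodes can actually route back to (on the submitting machine, ip route get against a cluster node's IP shows which local address that is in its "src" field):

# Find the local IP used to reach the cluster ($cluster_node_ip is any cluster node's address)
ip route get $cluster_node_ip

# Submit with the driver address pinned to that reachable IP
spark-sql --master yarn \
  --conf spark.driver.host=$your_ip_address \
  -e "SELECT count(*) FROM some_table"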

 

Reference article:

https://www.cnblogs.com/fbiswt/p/4667956.html

1. The machine where the client is installed is usually a virtual machine, and its hostname may have been set arbitrarily. In yarn-client mode, however, the submitting machine itself acts as the driver by default, so the other machines cannot reach it by that hostname. The resulting error is: Failed to connect to driver at x.x.x.x, retrying ...

Solution: append --conf spark.driver.host=$your_ip_address to the command, with the client machine's IP address filled in. Another option is export SPARK_JAVA_OPTS="-Dspark.driver.host=$your_ip_address", but once you use that for yarn-client you can no longer use yarn-cluster. Never put this parameter into spark-defaults.conf.

2. The client machine's firewall is on and blocks the port. Because the port that the cluster connects back on is random, the simplest fix is to turn the firewall off.
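If turning the firewall off entirely is not acceptable, a possible alternative (not mentioned in the original article, only a sketch) is to pin the driver-side ports to fixed values and open only those in the firewall:

# Fix the ports the ApplicationMaster and executors connect back to (values are arbitrary examples)
spark-sql --master yarn \
  --conf spark.driver.host=$your_ip_address \
  --conf spark.driver.port=40000 \
  --conf spark.blockManager.port=40001 \
  -e "SELECT count(*) FROM some_table"

# Open just those ports on the client machine's firewall
firewall-cmd --add-port=40000-40001/tcp --permanent
firewall-cmd --reload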
