(This blog is mostly a note-to-self, written for my still-very-green self.)
The nodes live in the cloud, so internal/external IPs and permissions cost me quite a while.
Environment:
spark: 2.1.2
hadoop: 2.7.5
java: 1.8
IDE: IDEA
The various detours:
--Running from IDEA never succeeded; in Spark standalone mode the job kept complaining it could not get resources (still unresolved).
Each node has 8 cores and 32G of memory; asking for just 1 core and 1024M from each still gives:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
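For the record, a minimal sketch of making the request explicit on submit; the master URL and the numbers here are assumptions, not my actual setup:
bin/spark-submit --class com.xxx.yyy.ZZZ \
--master spark://node1:7077 \
--executor-memory 1g \
--total-executor-cores 2 \
original-offline-engine-1.0-SNAPSHOT.jar
If the warning shows up even with numbers this small, it usually means no worker ever registered with the master (check the UI on port 8080), not a genuine shortage.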
--spark-shell cannot be used in cluster deploy mode:
Error: Cluster deploy mode is not applicable to Spark shells.
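Which makes sense: the shell's driver is interactive, so it has to run locally. Client mode is the only choice; a trivial illustration:
bin/spark-shell --master yarn --deploy-mode client # fine, the driver stays on this machine
bin/spark-shell --master yarn --deploy-mode cluster # rejected with the error above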
--Since Spark 2.0, writing "yarn-cluster" as the master no longer flies:
WARN SparkConf: spark.master yarn-cluster is deprecated in Spark 2.0+, please instead use "yarn" with specified deploy mode.
or:
Warning: Master yarn-cluster is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.
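The replacement just splits the master from the deploy mode (same story for the old "yarn-client"):
# old: --master yarn-cluster -> new: --master yarn --deploy-mode cluster
# old: --master yarn-client  -> new: --master yarn --deploy-mode client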
--Better to put the jars on HDFS and point spark.yarn.jars or spark.yarn.archive at them, so they are not re-uploaded on every submit:
WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
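A sketch of the one-time setup, assuming an HDFS path of my own choosing (any readable location works):
hdfs dfs -mkdir -p /spark/jars
hdfs dfs -put $SPARK_HOME/jars/* /spark/jars/
# then in conf/spark-defaults.conf:
# spark.yarn.jars hdfs://node1:8020/spark/jars/*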
-- $ netstat -nltp # check which ports are listening; the basics always come back to bite...
--HDFS listens on port 8020 of node1's internal IP; connecting to port 8020 on node1's external IP never reaches HDFS:
*18/03/15 17:43:32 INFO Client: Retrying connect to server: 101.198.186.9/101.198.186.9:8020. Already tried 15 time(s); maxRetries=45
Both IPs answer ping, but telnet only gets through to port 8020 on the internal IP.
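The quick check, for the record (the external IP is the one from the log; the internal IP is a placeholder):
ping 101.198.186.9 # fine, ICMP gets through
telnet 101.198.186.9 8020 # hangs; the NameNode is not bound to this interface
telnet <node1-internal-ip> 8020 # connects
If the NameNode really must answer on all interfaces, setting dfs.namenode.rpc-bind-host to 0.0.0.0 in hdfs-site.xml is the usual knob, though exposing 8020 to the public network is rarely a good idea.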
--Maven needs this dependency added:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-yarn_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
otherwise you get:
*Unable to load YARN support......
*Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
--Running in yarn cluster mode from IDEA gives:
*Detected yarn cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext.
*Please use spark-submit.
The rough cause is these two snippets in SparkContext:
// System property spark.yarn.app.id must be set if user code ran by AM on a YARN cluster
if (master == "yarn" && deployMode == "cluster" && !_conf.contains("spark.yarn.app.id")) {
  throw new SparkException("Detected yarn cluster mode, but isn't running on a cluster. " +
    "Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.")
}
// So that is what the id is for: in cluster mode the YARN ApplicationMaster sets
// spark.yarn.app.id before it runs the user class. If the property is absent, this
// SparkContext is clearly not inside a YARN container and cannot deploy itself there.
if (master == "yarn" && deployMode == "client") System.setProperty("SPARK_YARN_MODE", "true")
--Submitting from IDEA in yarn client mode, I could not find a parameter to point the driver at YARN's address.
And so:
Retrying connect to server: 0.0.0.0/0.0.0.0:8032 # it tries YARN's default address
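A sketch of the usual fix, assuming the cluster's Hadoop config files are reachable from the machine running IDEA (the path is hypothetical):
# let the driver pick up yarn-site.xml / core-site.xml
export HADOOP_CONF_DIR=/opt/bigdata/hadoop-2.7.5/etc/hadoop
# or pass the ResourceManager address straight into the Hadoop conf:
# --conf spark.hadoop.yarn.resourcemanager.address=node1:8032
From inside IDEA the equivalent is putting that config directory on the run configuration's classpath.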
# cluster mode
# too lazy to size the resources properly...
bin/spark-submit --class com.xxx.yyy.ZZZ \
--master yarn \
--deploy-mode cluster \
original-offline-engine-1.0-SNAPSHOT.jar
# Being cluster mode, the expected output does not show on the console; it is in the logs.
# While the job runs the logs are visible on YARN's port 8088; once it finishes they are
# moved off to an HDFS directory, leaving behind a not-very-helpful pointer.
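Even after that, the aggregated logs can be pulled back with yarn logs, assuming log aggregation is enabled; the application id below is a placeholder:
yarn logs -applicationId application_1521000000000_0001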
# client mode
# still no proper resource sizing... oh well
bin/spark-submit --class com.xxx.yyy.ZZZ \
--master yarn \
--deploy-mode client \
original-offline-engine-1.0-SNAPSHOT.jar
# In client mode the expected output does show up on the console.
Example of the SparkConf contents:
ArrayBuffer((spark.executor.extraClassPath,/opt/bigdata/nfs/spark-2.1.2-bin-hadoop2.7/mariadb-java-client-2.2.2.jar),