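Spark on YARN: submitting and running a job remotely from IDEA (running, not debugging)
1. Things to watch out for
1.1 On a cluster built on CentOS, containers may fail with the error "is running beyond virtual memory limits"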
Current usage: xx MB of xxGB physical memory used; xx GB of xx GB virtual memory used.
Fix:
# add the following property to yarn-site.xml, then restart YARN
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
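1.2 When IDEA on a Linux host connects to a cluster built with Docker, the host and the containers can ping each other, but the host's firewall can still stop the cluster from reaching the host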
19/01/21 16:44:16 INFO Client: Application report for application_1548058747747_0006 (state: ACCEPTED)
The program keeps printing this line and never leaves the ACCEPTED state. Fix: turn off the firewall on the host.
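Since ping succeeding says nothing about TCP ports, a quick TCP check can help confirm that the firewall is the cause. The following is a minimal sketch, not part of the original setup: PortCheck and canConnect are hypothetical names, and the idea is to run it from inside a cluster container against a port on the host.

```scala
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

object PortCheck {
  /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
  def canConnect(host: String, port: Int, timeoutMs: Int = 2000): Boolean =
    Try {
      val socket = new Socket()
      try socket.connect(new InetSocketAddress(host, port), timeoutMs)
      finally socket.close()
    }.isSuccess
}
```

For example, while the driver is running on the host, PortCheck.canConnect("192.168.1.9", 4040) tries the Spark web UI port that appears in the log in section 4. If ping works but this returns false, a firewall on the host is a likely reason.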
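1.3 The host cannot find the cluster and keeps trying 0.0.0.0:8032 (this setting is important)
This happens because the resource directory has not been marked as a resources root, so the XML files in it are not on the classpath and the YARN client falls back to the default ResourceManager address. Fix:
Right-click the resource directory and choose Mark Directory as >> Resources root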
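One way to verify that the directory is really on the classpath is to look the files up through the class loader before creating the SparkContext. This is a small sketch under that assumption; ClasspathCheck and missingFromClasspath are hypothetical names, and the file names are the three config files listed in section 2.

```scala
object ClasspathCheck {
  /** Returns the names from `files` that cannot be found on the classpath. */
  def missingFromClasspath(files: Seq[String]): Seq[String] = {
    val loader = getClass.getClassLoader
    files.filter(name => loader.getResource(name) == null)
  }

  def main(args: Array[String]): Unit = {
    val missing = missingFromClasspath(Seq("core-site.xml", "hdfs-site.xml", "yarn-site.xml"))
    if (missing.isEmpty) println("All cluster config files are on the classpath")
    else println(s"Not on the classpath: ${missing.mkString(", ")}")
  }
}
```

If yarn-site.xml is reported as missing, the client will most likely keep retrying 0.0.0.0:8032 as described above.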
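2. Final project layout (src part)
Create a new project in IDEA and build it with sbt. Any sbt version works; choose Scala 2.11.8. My cluster has no separately installed Scala, so I use the Scala bundled with spark-2.3.1-bin-hadoop2.7, which is version 2.11.8. The src directory looks like this:
# right-click resource and choose Mark Directory as >> Resources root, or set it in Project Structure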
src
├── main
│   ├── resource
│   │   ├── core-site.xml
│   │   ├── hdfs-site.xml
│   │   └── yarn-site.xml
│   └── scala
│       ├── SparkPI.scala
│       └── WordCount.scala
└── test
    └── scala
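2.1 Submitting WordCount as an example
This code alone will not run. The cluster side also has to be prepared: 1) provide the Spark jars, 2) package the project with sbt.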
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]): Unit = {
    // act as the user "root" on HDFS and YARN
    System.setProperty("HADOOP_USER_NAME", "root")
    System.setProperty("user.name", "root")
    val conf = new SparkConf().setAppName("WordCount").setMaster("yarn")
      .set("spark.submit.deployMode", "client") // "deploy-mode" is a spark-submit flag, not a SparkConf key
      .set("spark.yarn.jars", "hdfs:/user/root/jars/*") // the Spark jars on the cluster, uploaded by you
      .setJars(List("/home/lee/IdeaProjects/test/target/scala-2.11/test_2.11-0.1.jar")) // the jar produced by sbt package
      .setIfMissing("spark.driver.host", "192.168.1.9") // set this to your own host IP
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("hdfs:/input/README.txt")
    // split each line on spaces, pair every word with 1, then sum per word
    val count = rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    count.collect().foreach(println)
  }
}
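To see what the flatMap / map / reduceByKey chain computes without a cluster, the same logic can be sketched on plain Scala collections. This is only an illustration, not part of the submitted job: LocalWordCount is a hypothetical name, and groupBy plus a sum stands in for reduceByKey.

```scala
object LocalWordCount {
  /** Counts words the same way as the Spark job: split each line on a single space. */
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))   // one element per token, including empty tokens
      .map((_, 1))             // pair every token with 1
      .groupBy(_._1)           // stand-in for reduceByKey: group the pairs by word
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }
}
```

Splitting on a single space yields an empty token for every blank line and between consecutive spaces, which is why the job output in section 4 contains an entry with an empty word.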
2.2 Dependencies
# add the following to build.sbt
// https://mvnrepository.com/artifact/org.apache.spark/spark-yarn
libraryDependencies += "org.apache.spark" %% "spark-yarn" % "2.3.1"
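For orientation, a complete build.sbt could look roughly like the sketch below. The name and version values are assumptions inferred from the jar path test_2.11-0.1.jar used in WordCount, and scalaVersion follows the 2.11.8 choice from section 2; adjust them to your own project.

```scala
// build.sbt (sketch; name and version are inferred from the jar name test_2.11-0.1.jar)
name := "test"
version := "0.1"
scalaVersion := "2.11.8"

// https://mvnrepository.com/artifact/org.apache.spark/spark-yarn
libraryDependencies += "org.apache.spark" %% "spark-yarn" % "2.3.1"
```

The %% operator appends the Scala binary version to the artifact name, so this resolves spark-yarn_2.11.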
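3. Steps
3.1 Set up the jars
Note the .set("spark.yarn.jars", "hdfs:/user/root/jars/*") call on conf in WordCount. Because the Spark jars are not added to the local project, the job uses the jars stored on the cluster, and they have to be uploaded to HDFS from inside the cluster first.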
# in a Docker environment, you can use the following commands
docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -mkdir /input
docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -mkdir /user
docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -mkdir /user/root
docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -mkdir /user/root/jars
docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -put /opt/module/spark/jars/* /user/root/jars
docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -put /opt/module/hadoop/README.txt /input
# /opt/module/hadoop/ is your own Hadoop directory
# /opt/module/spark/ is your own Spark directory
# on the cluster itself, if the environment is already set up, you can simply run
hdfs dfs -mkdir /input
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/root
hdfs dfs -mkdir /user/root/jars
hdfs dfs -put your_spark_path/jars/* /user/root/jars
hdfs dfs -put /opt/module/hadoop/README.txt /input
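If you would rather not keep the jars under /user/root, you can pick another directory, as long as you change the path in WordCount to match.
3.2 Use local jars instead (choose either this or 3.1)
If you do not want to upload the Spark jars to the cluster, you can copy them into the project instead.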
ls /opt/module/spark
bin conf data examples jars kubernetes LICENSE licenses logs NOTICE python R README.md RELEASE sbin work yarn
Yes, it is the jars folder under SPARK_HOME. Copy it into the project; your_project/jars should then contain the following:
activation-1.1.1.jar hadoop-yarn-client-2.7.3.jar metrics-graphite-3.1.5.jar
......
zstd-jni-1.3.2-2.jar
hadoop-yarn-api-2.7.3.jar metrics-core-3.1.5.jar
Choose File >> Project Structure >> Modules, open the Dependencies tab below the Name field, click the + at the top right of that panel, choose 1 JARs or directories, and select your_project/jars in the dialog that pops up.
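3.3 Package
Open the sbt shell at the bottom of IDEA
first enter clean
then enter package
If you package the project some other way, update the path passed to setJars on conf accordingly.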
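The path passed to setJars follows sbt's default naming for the packaged jar, so it can be derived instead of typed by hand. The following sketch assumes that default layout (target/scala-<binary version>/<name>_<binary version>-<version>.jar); SbtJarPath and defaultJar are hypothetical names, and the values in main are the ones from the WordCount example.

```scala
import java.io.File

object SbtJarPath {
  /** Builds the path where `sbt package` writes the jar by default. */
  def defaultJar(projectDir: String, name: String, version: String, scalaBinary: String): File =
    new File(projectDir, s"target/scala-$scalaBinary/${name}_$scalaBinary-$version.jar")

  def main(args: Array[String]): Unit = {
    val jar = defaultJar("/home/lee/IdeaProjects/test", "test", "0.1", "2.11")
    // report whether the jar has actually been built before submitting the job
    println(s"${jar.getPath} exists: ${jar.exists()}")
  }
}
```

Checking that the file exists before creating the SparkContext can catch a forgotten package step early.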
4. Run
19/01/21 16:44:41 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
19/01/21 16:44:41 INFO DAGScheduler: ResultStage 1 (collect at WordCount.scala:20) finished in 0.827 s
19/01/21 16:44:41 INFO DAGScheduler: Job 0 finished: collect at WordCount.scala:20, took 6.945556 s
(under,1)
(this,3)
(distribution,2)
(Technology,1)
(country,1)
(is,1)
(Jetty,1)
(currently,1)
(permitted.,1)
(check,1)
(have,1)
(Security,1)
(U.S.,1)
(with,1)
(BIS,1)
(This,1)
(mortbay.org.,1)
((ECCN),1)
(using,2)
(security,1)
(Department,1)
(export,1)
(reside,1)
(any,1)
(algorithms.,1)
(from,1)
(re-export,2)
(has,1)
(SSL,1)
(Industry,1)
(Administration,1)
(details,1)
(provides,1)
(http://hadoop.apache.org/core/,1)
(country's,1)
(Unrestricted,1)
(740.13),1)
(policies,1)
(country,,1)
(concerning,1)
(uses,1)
(Apache,1)
(possession,,2)
(information,2)
(our,2)
(as,1)
(,18)
(Bureau,1)
(wiki,,1)
(please,2)
(form,1)
(information.,1)
(ENC,1)
(Export,2)
(included,1)
(asymmetric,1)
(Commodity,1)
(Software,2)
(For,1)
(it,1)
(The,4)
(about,1)
(visit,1)
(website,1)
( ,1)
(performing,1)
(Section,1)
(on,2)
((see,1)
(http://wiki.apache.org/hadoop/,1)
(classified,1)
(following,1)
(in,1)
(object,1)
(cryptographic,3)
(which,2)
(See,1)
(encryption,3)
(Number,1)
(and/or,1)
(software,2)
(for,3)
((BIS),,1)
(makes,1)
(at:,2)
(manner,1)
(Core,1)
(latest,1)
(your,1)
(may,1)
(the,8)
(Exception,1)
(includes,2)
(restrictions,1)
(import,,2)
(project,1)
(you,1)
(use,,2)
(another,1)
(if,1)
(or,2)
(Commerce,,1)
(source,1)
(software.,2)
(laws,,1)
(BEFORE,1)
(Hadoop,,1)
(License,1)
(written,1)
(code,1)
(Regulations,,1)
(software,,2)
(more,2)
(software:,1)
(see,1)
(regulations,1)
(of,5)
(libraries,1)
(by,1)
(exception,1)
(Control,1)
(code.,1)
(eligible,1)
(both,1)
(to,2)
(Foundation,1)
(Government,1)
(functions,1)
(and,6)
(5D002.C.1,,1)
((TSU),1)
(Hadoop,1)
19/01/21 16:44:42 INFO SparkContext: Invoking stop() from shutdown hook
19/01/21 16:44:42 INFO SparkUI: Stopped Spark web UI at http://192.168.1.9:4040
19/01/21 16:44:42 INFO YarnClientSchedulerBackend: Interrupting monitor thread
19/01/21 16:44:42 INFO YarnClientSchedulerBackend: Shutting down all executors
19/01/21 16:44:42 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
19/01/21 16:44:42 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
services=List(),
started=false)
19/01/21 16:44:42 INFO YarnClientSchedulerBackend: Stopped
19/01/21 16:44:42 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/01/21 16:44:42 INFO MemoryStore: MemoryStore cleared
19/01/21 16:44:42 INFO BlockManager: BlockManager stopped
19/01/21 16:44:42 INFO BlockManagerMaster: BlockManagerMaster stopped
19/01/21 16:44:42 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/01/21 16:44:42 INFO SparkContext: Successfully stopped SparkContext
19/01/21 16:44:42 INFO ShutdownHookManager: Shutdown hook called
19/01/21 16:44:42 INFO ShutdownHookManager: Deleting directory /tmp/spark-88c6c289-4d49-4035-96d7-19ba6410ef8a