This article discusses the behavior of spark-submit --files under the YARN client and cluster modes.
spark-submit --files file_paths

file_paths accepts several schemes: file:, hdfs://, http://, ftp://, local:. Multiple paths are separated by commas.
--files
File paths are comma-separated. In cluster mode, the files are shipped to the working directory of every executor and of the driver (the working directory can be read from the user.dir property); in client mode, they are shipped only to each executor's working directory. In client mode, file_paths must point to local files. In cluster mode, file_paths may be local files or globally visible files (e.g. hdfs://path). When a local path is used in cluster mode, the file does not need to exist on every node; it only needs to exist on the machine that submits the job.
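As a quick, hypothetical illustration (this is not Spark's actual parsing code, and schemesOf is a made-up helper), splitting such a comma-separated list and inspecting each entry's scheme might look like:

```scala
import java.net.URI

// Hypothetical helper: split a --files value and report each path's scheme.
// Entries without a scheme are plain local paths.
def schemesOf(filePaths: String): Seq[String] =
  filePaths.split(",").toSeq.map { p =>
    Option(new URI(p).getScheme).getOrElse("(no scheme: local path)")
  }

// schemesOf("file:/opt/a.conf,hdfs://nn:8020/b.conf,local:/c.jar")
// -> Seq("file", "hdfs", "local")
```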
When spark-submit --files is used, the paths listed after --files are recorded and passed to the driver process. When the driver starts, it calls SparkFiles.addFile(file_path) and copies each file into the driver's temporary file directory. After the executors start, they fetch the files from the driver into their own working directories.
So the path returned by SparkFiles.get(fileName) (which returns the full path of a file uploaded via SparkContext.addFile()) is SparkEnv.get.driverTmpDir + fileName on the driver, and workDir + fileName on an executor.
Verified in practice: the driver's temporary directory ends up empty (but in cluster mode the file is also written to the driver's working directory). Whether the file is first written to the working directory and then the temporary directory, or the other way around, could not be verified here; reading the source would settle it. It does not matter much: in cluster mode the driver's temporary directory ends up holding no files, and everything sits in its working directory, the same as on the executors. In client mode, the files are not copied into the driver's working directory; to read a file there you need its original path.
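A simplified model of that lookup, under the assumption stated above (the real SparkFiles.get does more; resolveDistributedFile is a made-up name):

```scala
import java.io.File

// Simplified model of SparkFiles.get-style resolution: join a role-specific
// root directory with the bare file name. rootDir stands in for
// SparkEnv.get.driverTmpDir on the driver, or the container work dir on an executor.
def resolveDistributedFile(rootDir: String, fileName: String): String =
  new File(rootDir, fileName).getAbsolutePath
```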
--jars
In cluster mode, external JARs are added to both the driver's and the executors' classpaths. In client mode, they are added only to the executors' classpaths. In client mode, the paths must be local; in cluster mode they may be local files or globally visible files (e.g. hdfs://path). When a local path is used in cluster mode, the file does not need to exist on every node, only on the submitting machine. These jars are also placed into the driver's and executors' working directories.
The recorded original file paths can be inspected through these properties:
import scala.collection.JavaConverters._
println(System.getProperties.asScala.mkString(System.lineSeparator()))
spark.yarn.dist.archives
Comma-separated list of archives to be extracted into the working directory of each executor.
spark.yarn.dist.files
Comma-separated list of files to be placed in the working directory of each executor.
spark.yarn.dist.jars
Comma-separated list of jars to be placed in the working directory of each executor.
Getting the file path:
filePath = SparkFiles.get(fileName)
Getting a file input stream:
driver:   inputStream = new FileInputStream(fileName)
executor: inputStream = new FileInputStream(fileName)
       or inputStream = new FileInputStream(SparkFiles.get(fileName))
Getting file contents:
driver:   Source.fromFile(fileName)
executor: Source.fromFile(fileName)
       or Source.fromFile(SparkFiles.get(fileName))
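For a self-contained illustration of reading by full path (the temp file below stands in for the distributed spark.properties):

```scala
import java.nio.file.Files
import scala.io.Source

// The temp file stands in for a distributed spark.properties; reading via
// Source.fromFile(fullPath) is what an executor would do with SparkFiles.get(fileName).
val tmp = Files.createTempFile("spark", ".properties")
Files.write(tmp, "spark.executor.memory=4g".getBytes("UTF-8"))

val src = Source.fromFile(tmp.toFile)
val contents = try src.mkString finally src.close()
// contents == "spark.executor.memory=4g"
```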
In client mode, executors behave the same as in cluster mode.
The driver, however, needs the file's original path:
Method 1: pass the original path in as a parameter oriPath, then Source.fromFile(oriPath).
Method 2: read System.getProperty("spark.yarn.dist.files"), filter out the entry matching the file name you need as oriPath, then Source.fromFile(oriPath).
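That dist-files lookup can be sketched as follows; the property value is simulated here, since in a real job spark-submit sets it (originalPathOf is a made-up helper):

```scala
// Sketch of recovering a file's original submit-side path from the
// comma-separated spark.yarn.dist.files property. The value below is
// simulated; in a real driver, spark-submit sets this property.
System.setProperty("spark.yarn.dist.files",
  "file:///opt/conf/spark.properties,file:///opt/conf/hbase-site.xml")

def originalPathOf(fileName: String): Option[String] =
  Option(System.getProperty("spark.yarn.dist.files"))
    .toSeq
    .flatMap(_.split(","))
    .find(_.endsWith("/" + fileName))
    .map(_.stripPrefix("file://"))  // Source.fromFile wants a plain path

// originalPathOf("spark.properties") -> Some("/opt/conf/spark.properties")
```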
Test program. Submit with:

spark-submit --files /opt/test/spark.properties
import java.io.File
import scala.collection.JavaConverters._
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

val ss = SparkSession.builder().enableHiveSupport().getOrCreate()
ss.sparkContext.setLogLevel("INFO")
// the distributed file
val filePath = SparkFiles.get("spark.properties")
val workDir = System.getProperty("user.dir")
LOGGER.info("**** Driver spark.properties path: " + filePath)
LOGGER.info("**** Driver working dir: " + workDir)
LOGGER.info(s"**** Driver properties: ${System.getProperties.asScala.mkString(System.lineSeparator())}")
def subDir(dir: File): Iterator[File] = {
  LOGGER.info(s"**** subDir of $dir")
  if (dir.listFiles() == null) Array.empty[File].toIterator
  else dir.listFiles().toIterator
}
// list every file under the temp directory:
LOGGER.info("**** Driver files under temp dir: " + subDir(new File(filePath).getParentFile).mkString(System.lineSeparator()))
// list every file under the working directory:
LOGGER.info("**** Driver files under working dir: " + subDir(new File(workDir)).mkString(System.lineSeparator()))
import ss.sqlContext.implicits._
val arr = Seq("1").toDF().map { x =>
  List(
    System.getProperties.asScala.mkString(System.lineSeparator()),
    System.getenv().asScala.mkString(System.lineSeparator()),
    SparkFiles.get("spark.properties"),
    System.getProperty("user.dir"),
    subDir(new File(System.getProperty("user.dir"))).mkString(System.lineSeparator()),
    subDir(new File(SparkFiles.get("spark.properties")).getParentFile).mkString(System.lineSeparator())
  )
}.collect().head
// the distributed file
LOGGER.info("**** Executor spark.properties path: " + arr(2))
LOGGER.info("**** Executor working dir: " + arr(3))
LOGGER.info(s"**** Executor properties: " + arr(0))
// listings computed inside the task:
LOGGER.info("**** Executor (inside task) files under file dir: " + arr(4))
LOGGER.info("**** Executor (inside task) files under working dir: " + arr(5))
// the same listings computed on the driver from the collected paths:
LOGGER.info("**** Executor (from driver) files under file dir: " + subDir(new File(arr(2)).getParentFile).mkString(System.lineSeparator()))
LOGGER.info("**** Executor (from driver) files under working dir: " + subDir(new File(arr(3))).mkString(System.lineSeparator()))
LOGGER.info("program finished")
Cluster-mode output:
**** Driver spark.properties path: /srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/spark-c307a8ed-3853-47af-ad92-7d1970683b3a/userFiles-eb591a81-139b-482f-8056-1829f6219a4b/spark.properties
**** Driver working dir: /srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001
**** Driver properties:
spark.yarn.dist.files -> file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/config/spark.properties,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/config/hbase-site.xml,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/config/kdc.conf
spark.yarn.dist.jars -> file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/bcprov-ext-jdk15on-1.68.jar,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/CryptoUtil-1.1.5.304.jar,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/fastjson-1.2.78.jar,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/commons-pool2-2.8.1.jar
**** Driver files under temp dir: (note: empty here!)
**** Driver files under working dir: /srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/user.keytab-ca6ead50-35d4-4f1d-8e37-d87c7366106c
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/tmp
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/bcprov-ext-jdk15on-1.68.jar
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/carbon.properties
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/mapred-site.xml
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/launch_container.sh
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/commons-pool2-2.8.1.jar
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/topology.properties
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/jaas-zk.conf
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/spark.properties
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/container_tokens
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/hbase-site.xml
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/__app__.jar
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/jets3t.properties
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/log4j-executor.properties
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/fastjson-1.2.78.jar
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/CryptoUtil-1.1.5.304.jar
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/__spark_libs__
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/__spark_conf__
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/kdc.conf
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/arm
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/x86
**** Executor spark.properties path: /srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/./spark.properties
**** Executor working dir: /srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003
**** Executor properties:
spark.yarn.dist.files is absent
spark.yarn.dist.jars is absent
**** Executor (inside task) files under file dir: /srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/__app__.jar
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/__spark_libs__
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/mapred-site.xml
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/kdc.conf
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/log4j-executor.properties
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/jets3t.properties
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/spark.properties
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/jaas-zk.conf
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/__spark_conf__
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/bcprov-ext-jdk15on-1.68.jar
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/arm
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/topology.properties
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/launch_container.sh
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/fastjson-1.2.78.jar
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/x86
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/commons-pool2-2.8.1.jar
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/carbon.properties
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/hbase-site.xml
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/CryptoUtil-1.1.5.304.jar
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/tmp
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/container_tokens
**** Executor (inside task) files under working dir: /srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/./__app__.jar
(same as above, omitted)
**** Executor (from driver) files under file dir:
**** Executor (from driver) files under working dir:
As the output shows, on the executor side the file's directory and the working directory are the same.
On the driver they differ, because SparkFiles.get records the temporary file directory; the file does exist in the working directory as well.
Client-mode output:
**** Driver spark.properties path: /tmp/spark-aa67cf79-1f10-4d85-99a0-c9fcced0ee80/userFiles-814810e5-3b1d-435b-a575-796b7e3a2a91/spark.properties
**** Driver working dir: /opt/HIBI-ExecuteShell
**** Driver properties:
spark.yarn.dist.files -> file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/config/spark.properties,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/config/hbase-site.xml,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/config/kdc.conf
spark.yarn.dist.jars -> file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/bcprov-ext-jdk15on-1.68.jar,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/CryptoUtil-1.1.5.304.jar,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/fastjson-1.2.78.jar,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/commons-pool2-2.8.1.jar
**** Driver files under temp dir:
**** Driver files under working dir: /opt/HIBI-ExecuteShell/... (omitted; no spark.properties)
**** Executor spark.properties path: /srv/BigData/hadoop/data2/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0019/container_e25_1686228163560_0019_01_000002/./spark.properties
**** Executor working dir: /srv/BigData/hadoop/data2/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0019/container_e25_1686228163560_0019_01_000002
**** Executor properties:
spark.yarn.dist.files is absent
spark.yarn.dist.jars is absent
The other executor output matches cluster mode (omitted).
As the output shows, in client mode the driver does not copy the files into its working directory.
Also: why is the working directory /opt/HIBI-ExecuteShell?
> echo $SPARK_HOME
/opt/ficlient20210106/Spark2x/spark
> echo $JAVA_HOME
/opt/ficlient20210106/JDK/jdk-8u201
Because the JVM's launch directory is whatever directory the process was started from: the custom spark-submit wrapper script exec-hive.sh lives under /opt/HIBI-ExecuteShell, and jobs are submitted as:

/opt/HIBI-ExecuteShell/exec-hive.sh your_script.sh
Now, why do FileInputStream and Source.fromFile also work with a bare file name? Because Scala IO resolves relative paths against the JVM's working directory, and the JVM's working directory coincides with the driver's and the executors' working directories.
When calling Source.fromFile(filePath), filePath may be a relative path. Relative paths are resolved against the user.dir JVM property. When Scala is launched from a shell, user.dir is the directory it was launched from; in IDEA, user.dir is the project root.
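A self-contained check of this resolution rule:

```scala
import java.io.File

// Relative File paths resolve against the user.dir system property.
val userDir  = System.getProperty("user.dir")
val resolved = new File("spark.properties").getAbsolutePath
// resolved is userDir + File.separator + "spark.properties"
```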
Listing a directory's files and subdirectories:
// recursively list all files under a directory
def main(args: Array[String]): Unit = {
  for (d <- subDirRec(new File("d:\\AAA\\")))
    println(d)
}

def subDirRec(dir: File): Iterator[File] = {
  // listFiles() may return null, so guard before filtering
  val entries = Option(dir.listFiles()).getOrElse(Array.empty[File])
  val dirs = entries.filter(_.isDirectory())
  val files = entries.filter(_.isFile())
  files.toIterator ++ dirs.toIterator.flatMap(subDirRec _)
}

// non-recursive
def subDir(dir: File): Iterator[File] = {
  if (dir.listFiles() == null) Array.empty[File].toIterator
  else dir.listFiles().toIterator
}
In Yarn-client mode, the ApplicationMaster only requests resources from YARN for the executors; the client itself then communicates with the containers to schedule the job.
Symptom
The Spark job throws the following exception:
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.SparkFiles$.getRootDirectory(SparkFiles.scala:37)
at org.apache.spark.SparkFiles$.get(SparkFiles.scala:31)
...
Analysis
SparkFiles.get() was called before the SparkContext was initialized.
Fix
Initialize the SparkContext first.
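The failure can be mimicked without Spark; RuntimeEnv below is a made-up stand-in for SparkEnv, whose instance only exists once the SparkContext has started:

```scala
import java.io.File

// Toy model of the init-order bug. RuntimeEnv.get plays the role of
// SparkEnv.get, which is only populated while the SparkContext starts.
class RuntimeEnv(val filesRoot: String)
object RuntimeEnv { var get: RuntimeEnv = _ }

def filesGet(name: String): String =
  new File(RuntimeEnv.get.filesRoot, name).getPath  // NPE while get is still null

val before = try { filesGet("x.conf") } catch { case _: NullPointerException => "NPE" }
RuntimeEnv.get = new RuntimeEnv("/tmp/work")  // "initialize the context" first
val after = filesGet("x.conf")
// before == "NPE", after == "/tmp/work/x.conf"
```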
Related notes from the official Running-on-YARN documentation:

Whether core requests are honored in scheduling decisions depends on which scheduler is in use and how it is configured.
In cluster mode, the local directories used by the Spark executors and the Spark driver will be the local directories configured for YARN (Hadoop YARN config yarn.nodemanager.local-dirs). If the user specifies spark.local.dir, it will be ignored. In client mode, the Spark executors will use the local directories configured for YARN while the Spark driver will use those defined in spark.local.dir. This is because the Spark driver does not run on the YARN cluster in client mode, only the Spark executors do.
The --files and --archives options support specifying file names with the # similar to Hadoop. For example, you can specify: --files localtest.txt#appSees.txt and this will upload the file you have locally named localtest.txt into HDFS but this will be linked to by the name appSees.txt, and your application should use the name as appSees.txt to reference it when running on YARN.
The --jars option allows the SparkContext.addJar function to work if you are using it with local files and running in cluster mode. It does not need to be used if you are using it with HDFS, HTTP, HTTPS, or FTP files.