Spark Configuration Parameters Explained

Spark configuration parameters can be set in several places. Taking executor memory as an example, there are three places to configure it:
(1) the --executor-memory option of spark-submit
(2) the spark.executor.memory setting in spark-defaults.conf
(3) the SPARK_EXECUTOR_MEMORY variable in spark-env.sh

Priority: spark-submit --option > spark-defaults.conf setting > spark-env.sh setting > default value
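For example, each of the following sets the executor memory to 4g (the value, class name and jar are illustrative placeholders); when more than one is present, the spark-submit option wins:

    # spark-submit option (highest of the three)
    spark-submit --executor-memory 4g --class com.example.MyApp myapp.jar

    # spark-defaults.conf
    spark.executor.memory  4g

    # spark-env.sh
    export SPARK_EXECUTOR_MEMORY=4g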

I. spark-defaults.conf
1. spark.driver.extraClassPath: Extra classpath entries to prepend to the classpath of the driver.
     Note: the path must exist on the machine where the driver is launched.
2. spark.driver.extraJavaOptions: A string of extra JVM options to pass to the driver.
     Use this to pass JVM options to the driver, but do not set -Xmx here; set the heap size with --driver-memory instead.
3. spark.driver.extraLibraryPath: Set a special library path to use when launching the driver JVM.
4. spark.executor.extraClassPath: Extra classpath entries to prepend to the classpath of executors.
     Note: the path must exist on the machines where executors are launched.
5. spark.executor.extraJavaOptions: A string of extra JVM options to pass to executors.
     Use this to pass JVM options to executors, but do not set -Xmx here; set the heap size with spark.executor.memory instead.
6. spark.executor.extraLibraryPath: Set a special library path to use when launching executor JVMs.
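     As a sketch, the driver/executor extra* settings above might look like this in spark-defaults.conf (the paths and JVM flags are purely illustrative; the classpath entries must exist on the machines running the driver/executors):

         spark.driver.extraClassPath      /opt/libs/ext/*
         spark.driver.extraJavaOptions    -XX:+UseG1GC -XX:+PrintGCDetails
         spark.executor.extraClassPath    /opt/libs/ext/*
         spark.executor.extraJavaOptions  -XX:+UseG1GC
         spark.executor.extraLibraryPath  /opt/libs/native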
7. spark.local.dir: Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk.
8. spark.yarn.archive or spark.yarn.jars: To make Spark runtime jars accessible from the YARN side,
      you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties.
      If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all
      jars under $SPARK_HOME/jars and upload it to the distributed cache.
      For example: hdfs:///spark_lib/spark_lib.zip
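      A minimal sketch of preparing such an archive yourself (the HDFS path mirrors the example above; adjust paths to your environment):

          cd $SPARK_HOME/jars
          zip -q -r spark_lib.zip *                  # bundle all runtime jars at the top level of the zip
          hdfs dfs -mkdir -p /spark_lib
          hdfs dfs -put spark_lib.zip /spark_lib/    # upload to HDFS
          # then, in spark-defaults.conf:
          # spark.yarn.archive  hdfs:///spark_lib/spark_lib.zip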
9. spark.yarn.dist.files: Comma-separated list of files to be placed in the working directory of each executor.
     When a job is submitted, the files specified here are uploaded to the distributed storage system (e.g. HDFS), and the
     working directory of every executor (and the driver) will contain them. This is much like files passed with --py-files,
     except that --py-files changes from job to job, while the files specified here are the same for every job.
10. spark.yarn.dist.jars: Similar to spark.yarn.dist.files, but for jars.
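     For instance (the file names are hypothetical), entries like these place the listed files and jars in every container's working directory:

         spark.yarn.dist.files  hdfs:///conf/hive-site.xml,hdfs:///conf/app.properties
         spark.yarn.dist.jars   hdfs:///libs/common-utils.jar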
----------Dynamic Allocation: scale executors up and down based on load----------
11. spark.dynamicAllocation.enabled: true enables dynamic executor allocation; spark.shuffle.service.enabled must also be true. To set it up on YARN (sample settings follow step (3) below):
    (1) Find spark-<version>-yarn-shuffle.jar under ${SPARK_HOME} (where <version> is the Spark version, e.g. 2.4.3) and copy it to
        ${HADOOP_HOME}/share/hadoop/yarn/lib/ on every node.
    (2) Modify yarn-site.xml on every node as follows:
   
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle,spark_shuffle</value>
    </property>

    <property>
        <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
        <value>org.apache.spark.network.yarn.YarnShuffleService</value>
    </property>

    (3) Related parameters: spark.dynamicAllocation.minExecutors, spark.dynamicAllocation.maxExecutors, spark.dynamicAllocation.initialExecutors
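    With the shuffle service in place, a sample set of spark-defaults.conf entries for dynamic allocation might look like this (the executor counts are illustrative):

        spark.shuffle.service.enabled              true
        spark.dynamicAllocation.enabled            true
        spark.dynamicAllocation.minExecutors       1
        spark.dynamicAllocation.initialExecutors   2
        spark.dynamicAllocation.maxExecutors       50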
12. spark.yarn.shuffle.stopOnFailure: Only relevant when spark.dynamicAllocation.enabled is true. Whether to stop the NodeManager when there's a failure in the Spark Shuffle Service's initialization. This prevents application failures caused by
    running containers on NodeManagers where the Spark Shuffle Service is not running.
13. spark.executor.memory: executor heap memory
14. spark.yarn.executor.memoryOverhead: executor off-heap (overhead) memory
15. spark.executor.cores: number of cores per executor
16. spark.driver.memory: driver heap memory
17. spark.yarn.driver.memoryOverhead: driver off-heap (overhead) memory
18. spark.driver.cores: number of driver cores
19. spark.driver.maxResultSize: maximum total size of results pulled back to the driver, e.g. by collect
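     Items 13-19 are usually set together when submitting a job; a sketch with purely illustrative values (class name and jar are placeholders):

         spark-submit \
           --master yarn \
           --driver-memory 4g \
           --driver-cores 2 \
           --executor-memory 8g \
           --executor-cores 4 \
           --conf spark.yarn.executor.memoryOverhead=1024 \
           --conf spark.driver.maxResultSize=2g \
           --class com.example.MyApp myapp.jar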
20. spark.default.parallelism: default number of partitions for RDDs
21. spark.speculation: whether to enable speculative execution
22. spark.master: The cluster manager to connect to. If set to yarn, the cluster manager is located through the Hadoop configuration directory specified by HADOOP_CONF_DIR or YARN_CONF_DIR in spark-env.sh
-----------Spark history server configuration-----------
23. spark.eventLog.enabled: true to record Spark event logs
24. spark.eventLog.compress: true to compress event logs
25. spark.eventLog.dir: where event logs are stored, e.g. an HDFS path
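     A sketch of the event-log settings that feed the history server (the HDFS path is illustrative and should match the spark.history.fs.logDirectory used in section II below):

         spark.eventLog.enabled   true
         spark.eventLog.compress  true
         spark.eventLog.dir       hdfs://ip:port/log/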
26. spark.task.maxFailures: maximum number of failures for any particular task before giving up on the job
27. spark.serializer: serializer class, e.g. org.apache.spark.serializer.KryoSerializer
28. spark.sql.shuffle.partitions: Configures the number of partitions to use when shuffling data for joins or aggregations.


II. spark-env.sh
1. HADOOP_CONF_DIR: path to the Hadoop configuration files
2. SPARK_HISTORY_OPTS: options for the Spark history server, e.g.
    "-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=1000 -Dspark.history.ui.maxApplications=1000 
     -Dspark.history.fs.logDirectory=hdfs://ip:port/log/ -Dspark.history.fs.cleaner.enabled=true 
     -Dspark.history.fs.cleaner.interval=1d -Dspark.history.fs.cleaner.maxAge=7d"
3. export SPARK_LOG_DIR=/data/spark/log : directory for Spark log files
4. export SPARK_LOCAL_DIRS=/data/spark/local : storage directories to use on this node for shuffle and RDD data
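
Putting it together, a minimal spark-env.sh sketch (the Hadoop config path is illustrative, and hdfs://ip:port/log/ is the event-log directory used above):

    export HADOOP_CONF_DIR=/etc/hadoop/conf
    export SPARK_LOG_DIR=/data/spark/log
    export SPARK_LOCAL_DIRS=/data/spark/local
    export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.fs.logDirectory=hdfs://ip:port/log/ -Dspark.history.fs.cleaner.enabled=true"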

 
