There are two places to set Spark configuration parameters:
1. In the $SPARK_HOME/conf/spark-env.sh script, using export statements such as:
export SPARK_DAEMON_MEMORY=1024m
2. Programmatically: call System.setProperty("xx", "xxx") to set the corresponding system property before the SparkContext is created; this can also be done in the spark-shell (see the sketch after this list). For example:
scala> System.setProperty("spark.akka.frameSize", "10240")   // the value is an integer number of MB, so no unit suffix
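A minimal sketch of the programmatic approach, assuming Spark 0.7.x (where the package is spark rather than org.apache.spark) and a local master; the property names are the ones listed in section II below, and the values are illustrative only:

import spark.SparkContext

object ConfigExample {
  def main(args: Array[String]) {
    // System properties must be set BEFORE the SparkContext is constructed;
    // properties set afterwards are ignored by the already-running context.
    System.setProperty("spark.executor.memory", "2g")
    System.setProperty("spark.akka.frameSize", "100")   // in MB
    val sc = new SparkContext("local[4]", "ConfigExample")
    // ... submit jobs through sc; they run with the properties above ...
    sc.stop()
  }
}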
I. Environment variables in spark-env.sh
SCALA_HOME # points to your Scala installation
MESOS_NATIVE_LIBRARY # set this if you want to run the cluster on Mesos
SPARK_WORKER_MEMORY # total amount of memory that jobs may use on each machine, in a format like 1000M or 2G (default: all of the machine's RAM minus 1 GB for the operating system); the memory used by each individual job is set separately with SPARK_MEM.
SPARK_JAVA_OPTS # extra JVM options; any system property can be passed with -D, e.g. SPARK_JAVA_OPTS+=" -Dspark.kryoserializer.buffer.mb=1024"
SPARK_MEM # how much memory to use per node, in the same format as JVM memory strings (e.g. 300m or 1g). Note: this option will soon be deprecated in favor of the spark.executor.memory system property, so we recommend using that property in new code.
SPARK_DAEMON_MEMORY # memory to allocate to the Spark master and worker daemons themselves (default: 512m)
SPARK_DAEMON_JAVA_OPTS # JVM options for the Spark master and worker daemons (default: none)
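Putting these together, a spark-env.sh fragment might look like the following (a sketch only; the paths and values are illustrative, not recommendations):

export SCALA_HOME=/usr/local/scala-2.9.3                          # illustrative install path
export SPARK_WORKER_MEMORY=4g                                     # total memory jobs may use on this worker
export SPARK_MEM=1g                                               # per-node memory for each job (being superseded by spark.executor.memory)
SPARK_JAVA_OPTS+=" -Dspark.kryoserializer.buffer.mb=32"           # pass system properties to the JVM with -D
export SPARK_JAVA_OPTS
export SPARK_DAEMON_MEMORY=1024m                                  # memory for the master/worker daemons themselves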
II. System Properties
Property Name | Default | Meaning |
spark.executor.memory | 512m | Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g). |
spark.akka.frameSize | 10 | Maximum message size to allow in "control plane" communication (for serialized tasks and task results), in MB. Increase this if your tasks need to send back large results to the driver (e.g. using collect() on a large dataset). |
spark.default.parallelism | 8 | Default number of tasks to use for distributed shuffle operations (groupByKey, reduceByKey, etc.) when not set by the user. |
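To illustrate the last property: per the table, when the user gives no partition count, shuffle operations such as reduceByKey fall back to spark.default.parallelism. A hedged sketch (again assuming the 0.7.x spark package; the data and the value 16 are arbitrary):

import spark.SparkContext
import spark.SparkContext._   // enables pair-RDD operations such as reduceByKey

object ParallelismExample {
  def main(args: Array[String]) {
    // Raise the default shuffle parallelism from 8 to 16 before creating the context.
    System.setProperty("spark.default.parallelism", "16")
    val sc = new SparkContext("local[4]", "ParallelismExample")

    val counts = sc.parallelize(Seq("a", "b", "a", "c"))
                   .map(word => (word, 1))
                   .reduceByKey(_ + _)   // no partition count given, so the default applies
    println(counts.collect().mkString(", "))
    sc.stop()
  }
}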
References:
http://rdc.taobao.org/?p=533
http://spark.incubator.apache.org/docs/0.7.3/configuration.html
https://github.com/amplab/shark/blob/master/conf/shark-env.sh.template
http://www.cnblogs.com/vincent-hv/p/3316502.html
http://www.07net01.com/linux/Sparkdulibushumoshi_545676_1374481945.html
https://groups.google.com/forum/#!searchin/spark-users/java.lang.OutOfMemoryError$3A$20GC$20overhead$20limit