Options: | Description | Notes [my own translation, plus remarks from hands-on use; corrections are welcome] |
--master MASTER_URL | spark://host:port, mesos://host:port, yarn, k8s://https://host:port, or local (Default: local[*]). | Most commonly local (local mode) or yarn (cluster mode) |
--deploy-mode DEPLOY_MODE | Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client). | Whether the driver starts on the local client or on a worker node in the cluster. In cluster mode, YARN manages the driver process, so the client can exit once the application has been created. In client mode, the driver runs inside the client process; YARN only provides resources for the executors and does not manage the driver. |
--class CLASS_NAME | Your application's main class (for Java / Scala apps). | Main class of the Java/Scala application |
--name NAME | A name of your application. | Gives the application a name |
--jars JARS | Comma-separated list of jars to include on the driver and executor classpaths. | Comma-separated list of jars, added to the classpath of the driver and the executors |
--packages | Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. Will search the local maven repo, then maven central and any additional remote repositories given by --repositories. The format for the coordinates should be groupId:artifactId:version. | Comma-separated list of Maven coordinates of packages, added to the classpath of the driver and the executors. The local Maven repo is searched first, then the remote repositories. |
--exclude-packages | Comma-separated list of groupId:artifactId, to exclude while resolving the dependencies provided in --packages to avoid dependency conflicts. | Comma-separated list of packages to skip during dependency resolution, to prevent dependency conflicts |
--repositories | Comma-separated list of additional remote repositories to search for the maven coordinates given with --packages. |
--py-files PY_FILES | Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps. | Comma-separated list of .zip, .egg, or .py files |
--files FILES | Comma-separated list of files to be placed in the working directory of each executor. File paths of these files in executors can be accessed via SparkFiles.get(fileName). | Comma-separated list of files placed in the working directory of each executor; they can be retrieved with SparkFiles.get(fileName). [Note: in my experience the files really do end up in the executor's working directory, so they can also be opened directly by file name, without SparkFiles.get(fileName).] |
--conf, -c PROP=VALUE | Arbitrary Spark configuration property. | A configuration option |
--properties-file FILE | Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf. | Configuration file |
--driver-memory MEM | Memory for driver (e.g. 1000M, 2G) (Default: 1024M). | Driver memory |
--driver-java-options | Extra Java options to pass to the driver. | |
--driver-library-path | Extra library path entries to pass to the driver. | |
--driver-class-path | Extra class path entries to pass to the driver. Note that jars added with --jars are automatically included in the classpath. | |
--executor-memory MEM | Memory per executor (e.g. 1000M, 2G) (Default: 1G). | Executor memory |
--proxy-user NAME | User to impersonate when submitting the application. This argument does not work with --principal / --keytab. | |
--help, -h | Show this help message and exit. | Run spark-submit --help for command-line help |
--verbose, -v | Print additional debug output. | |
--version | Print the version of current Spark. | Run spark-submit --version to see the current version number |
Cluster deploy mode only: | Options that apply only in cluster deploy mode | |
--driver-cores NUM | Number of cores used by the driver, only in cluster mode (Default: 1). | |
Spark standalone or Mesos with cluster deploy mode only: | ||
--supervise | If given, restarts the driver on failure. | |
Spark standalone, Mesos or K8s with cluster deploy mode only: | ||
--kill SUBMISSION_ID | If given, kills the driver specified. | |
--status SUBMISSION_ID | If given, requests the status of the driver specified. | |
Spark standalone, Mesos and Kubernetes only: | ||
--total-executor-cores NUM | Total cores for all executors. | |
Spark standalone, YARN and Kubernetes only: | ||
--executor-cores NUM | Number of cores used by each executor. (Default: 1 in YARN and K8S modes, or all available cores on the worker in standalone mode). | |
Spark on YARN and Kubernetes only: | Options that apply only to YARN and Kubernetes deployments | |
--num-executors NUM | Number of executors to launch (Default: 2). If dynamic allocation is enabled, the initial number of executors will be at least NUM. | Number of executors |
--principal PRINCIPAL | Principal to be used to login to KDC. | |
--keytab KEYTAB | The full path to the file that contains the keytab for the principal specified above. | |
Spark on YARN only: | Options that apply only to YARN deployments | |
--queue QUEUE_NAME | The YARN queue to submit to (Default: "default"). | Queue name |
--archives ARCHIVES | Comma separated list of archives to be extracted into the working directory of each executor. |
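The group headings in the table above restrict some flags to particular cluster managers or deploy modes. As a rough aid, the sketch below encodes those headings and reports the flags that the help text says do not apply to a given --master / --deploy-mode pair. The names `manager_of` and `inapplicable_flags` and the manager labels are assumptions made for illustration; spark-submit itself ships no such helper, and the rules are only as complete as the table.

```python
# Sketch only: rules transcribed from the group headings of the help table.
# Each flag maps to (allowed cluster managers, required deploy mode or None).
ANY = {"standalone", "mesos", "yarn", "k8s", "local"}
RULES = {
    "--driver-cores": (ANY, "cluster"),
    "--supervise": ({"standalone", "mesos"}, "cluster"),
    "--kill": ({"standalone", "mesos", "k8s"}, "cluster"),
    "--status": ({"standalone", "mesos", "k8s"}, "cluster"),
    "--total-executor-cores": ({"standalone", "mesos", "k8s"}, None),
    "--executor-cores": ({"standalone", "yarn", "k8s"}, None),
    "--num-executors": ({"yarn", "k8s"}, None),
    "--principal": ({"yarn", "k8s"}, None),
    "--keytab": ({"yarn", "k8s"}, None),
    "--queue": ({"yarn"}, None),
    "--archives": ({"yarn"}, None),
}


def manager_of(master):
    """Classify a --master value using the URL forms listed in the table."""
    if master.startswith("spark://"):
        return "standalone"
    if master.startswith("mesos://"):
        return "mesos"
    if master.startswith("k8s://"):
        return "k8s"
    if master == "yarn":
        return "yarn"
    if master.startswith("local"):
        return "local"
    raise ValueError("unrecognized master: " + master)


def inapplicable_flags(master, deploy_mode, flags):
    """Return the flags that the table restricts to a different setup."""
    manager = manager_of(master)
    rejected = []
    for flag in flags:
        # Flags without a rule are treated as valid everywhere.
        managers, mode = RULES.get(flag, (ANY, None))
        if manager not in managers or (mode is not None and mode != deploy_mode):
            rejected.append(flag)
    return rejected


print(inapplicable_flags("yarn", "client",
                         ["--driver-cores", "--num-executors", "--queue"]))
# ['--driver-cores']
```

For a YARN client-mode submission this reports `--driver-cores`, which the help text lists under "Cluster deploy mode only".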
spark-submit --class TestClass \
--master yarn \
--queue ${YOUR_QUEUE_NAME} \
--deploy-mode client \
--driver-memory 1G \
--conf spark.driver.maxResultSize=1G \
--driver-cores 2 \
--num-executors 4 \
--executor-cores 4 \
--executor-memory 16G \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.sql.shuffle.partitions=6400 \
--conf spark.default.parallelism=6400 \
--conf spark.storage.memoryFraction=0.4 \
--conf spark.shuffle.memoryFraction=0.4 \
--conf spark.blacklist.enabled=true \
--conf spark.speculation=true \
--conf spark.hadoop.hive.exec.orc.split.strategy=ETL \
--name scala_test \
spark-submit --class TestClass \
--master local \
--deploy-mode client \
--driver-memory 1G \
--conf spark.driver.maxResultSize=1G \
--executor-memory 16G \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.storage.memoryFraction=0.4 \
--conf spark.shuffle.memoryFraction=0.4 \
--conf spark.blacklist.enabled=true \
--conf spark.speculation=true \
--name scala_test \
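The repeated --conf flags in the templates above can also live in a file passed with --properties-file, as listed in the options table. The following is a minimal sketch that renders such a file, assuming the whitespace-separated `key value` layout used by conf/spark-defaults.conf. The function name `render_properties` is hypothetical, and only a few of the template's properties are included.

```python
def render_properties(conf):
    """Render a dict of Spark properties as spark-defaults.conf style lines."""
    lines = []
    for key, value in conf.items():
        # Assumption: only keys with the "spark." prefix are meaningful here,
        # so anything else is rejected early instead of being written out.
        if not key.startswith("spark."):
            raise ValueError("not a Spark property: " + key)
        lines.append(f"{key} {value}")
    return "\n".join(lines) + "\n"


# Values copied from the --conf flags of the template.
conf = {
    "spark.driver.maxResultSize": "1G",
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.speculation": "true",
}
text = render_properties(conf)
print(text, end="")
```

Saving the output under some name of your choosing and passing that path to --properties-file would stand in for the four matching --conf flags.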
2.2 Example object script
(1) A single Python script with no other dependency files
spark-submit \
--master yarn \
--queue ${YOUR_CLUSTER_QUEUE} \
--deploy-mode client \
--driver-memory 4G \
--driver-cores 4 \
--executor-memory 8G \
--executor-cores 4 \
--num-executors 100 \
--conf spark.default.parallelism=1600 \
--name "spark_demo_yarn" \
(2) A Python script plus one or more text files
(3) A Python script plus one or more dependent Python scripts
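Cases (2) and (3) map onto the --files and --py-files options from the table. The sketch below assembles both command lines as argument lists. Every file name in it (`main.py`, `stopwords.txt`, `utils.py`, `helpers.zip`) is a made-up placeholder, and the helper `build_command` is illustrative rather than part of Spark.

```python
import shlex


def build_command(script, files=(), py_files=(), master="yarn", deploy_mode="client"):
    """Assemble a spark-submit argument list for a PySpark script."""
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if files:
        # --files takes a single comma-separated list
        cmd += ["--files", ",".join(files)]
    if py_files:
        # --py-files accepts .zip, .egg, or .py entries, also comma-separated
        cmd += ["--py-files", ",".join(py_files)]
    cmd.append(script)  # the application script follows all options
    return cmd


# Case (2): one script plus a text file (hypothetical names)
with_text = build_command("main.py", files=["stopwords.txt"])
# Case (3): one script plus dependent Python code (hypothetical names)
with_deps = build_command("main.py", py_files=["utils.py", "helpers.zip"])

print(shlex.join(with_text))
print(shlex.join(with_deps))
```

Inside the script, a shipped text file would then be located with SparkFiles.get(fileName) as the table describes, and modules supplied through --py-files would be imported by name because they are placed on the PYTHONPATH.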
1.2 Example script: pyspark_example_yarn.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function

if __name__ == '__main__':
    from pyspark.sql import SparkSession

    # Build (or reuse) a session with Hive support so spark.sql can query Hive tables
    spark = SparkSession.builder \
        .appName("Word Count") \
        .config("spark.some.config.option", "some-value") \
        .enableHiveSupport() \
        .getOrCreate()

    df = spark.sql("""
        dt >= "2019-01-01"
        AND dt <= "2020-12-31"
        AND sku_code IN (700052, 721057)
2.1 spark-submit command template
spark-submit \
--master local \
--deploy-mode client \
--name "spark_demo_local" \
2.2 Example script: pyspark_example_local.py
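#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function

if __name__ == '__main__':
    from pyspark.sql import SparkSession

    # Build (or reuse) a session; getOrCreate() is what actually returns it
    spark = SparkSession.builder \
        .appName("Word Count") \
        .config("spark.some.config.option", "some-value") \
        .enableHiveSupport() \
        .getOrCreate()

    # Sum the ids greater than 500 and collect the single result row to the driver
    print(spark.range(5000).where("id > 500").selectExpr("sum(id)").collect())
    # Display rows with id > 400 (show() prints at most 20 rows by default)
    spark.range(500).where("id > 400").show()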
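As a sanity check on the two expressions above, the same filters can be written in plain Python without a Spark session. This mirrors the arithmetic only; it relies on spark.range(n) producing the integers 0 to n-1, and the variable names are illustrative.

```python
# Plain-Python counterpart of
# spark.range(5000).where("id > 500").selectExpr("sum(id)")
total = sum(i for i in range(5000) if i > 500)

# Plain-Python counterpart of the rows kept by
# spark.range(500).where("id > 400")
kept = [i for i in range(500) if i > 400]

print(total)      # 12372250
print(len(kept))  # 99
```

The first Spark call should therefore report a sum of 12372250, and the second filter keeps 99 rows (ids 401 to 499), of which show() displays only the first 20 unless told otherwise.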