7、spark-submit, the script for submitting Spark applications in production

一、Run spark-submit --help to see which options are available when submitting a job.

Options: description and notes [the notes are my own, based on experience; corrections are welcome]
  --master MASTER_URL          spark://host:port, mesos://host:port, yarn, k8s://https://host:port, or local (Default: local[*]). Note: local mode and yarn mode are the most commonly used.
  --deploy-mode DEPLOY_MODE    Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client). Note: this decides whether the driver runs in the local client process or on a worker node inside the cluster. In cluster mode, YARN manages the driver process, so the client can exit once the application has been created. In client mode, the driver runs inside the client process and the YARN application master is only used to request executor resources. See the cluster-mode example after this options list.
  --class CLASS_NAME           Your application's main class (for Java / Scala apps).
  --name NAME                  A name of your application.
  --jars JARS                  Comma-separated list of jars to include on the driver and executor classpaths.
  --packages                   Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. Will search the local maven repo, then maven central and any additional remote repositories given by --repositories. The format for the coordinates should be groupId:artifactId:version.
  --exclude-packages           Comma-separated list of groupId:artifactId, to exclude while resolving the dependencies provided in --packages to avoid dependency conflicts.
  --repositories               Comma-separated list of additional remote repositories to search for the maven coordinates given with --packages.
  --py-files PY_FILES          Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.
  --files FILES                Comma-separated list of files to be placed in the working directory of each executor. File paths of these files in executors can be accessed via SparkFiles.get(fileName). Note: because the files end up in each executor's working directory, you do not have to use SparkFiles.get(fileName); opening them directly by file name also works.
  --conf, -c PROP=VALUE        Arbitrary Spark configuration property.
  --properties-file FILE       Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf.
  --driver-memory MEM          Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options        Extra Java options to pass to the driver.  
  --driver-library-path        Extra library path entries to pass to the driver.  
  --driver-class-path          Extra class path entries to pass to the driver. Note that jars added with --jars are automatically included in the classpath.  
  --executor-memory MEM        Memory per executor (e.g. 1000M, 2G) (Default: 1G).
  --proxy-user NAME            User to impersonate when submitting the application. This argument does not work with --principal / --keytab.  
  --help, -h                   Show this help message and exit.
  --verbose, -v                Print additional debug output.
  --version                    Print the version of current Spark. Note: run spark-submit --version to check the current Spark version.
 Cluster deploy mode only:
  --driver-cores NUM           Number of cores used by the driver, only in cluster mode (Default: 1).  
 Spark standalone or Mesos with cluster deploy mode only:    
  --supervise                  If given, restarts the driver on failure.  
 Spark standalone, Mesos or K8s with cluster deploy mode only:    
  --kill SUBMISSION_ID         If given, kills the driver specified.  
  --status SUBMISSION_ID       If given, requests the status of the driver specified.  
 Spark standalone, Mesos and Kubernetes only:    
  --total-executor-cores NUM   Total cores for all executors.  
 Spark standalone, YARN and Kubernetes only:    
  --executor-cores NUM         Number of cores used by each executor. (Default: 1 in YARN and K8S modes, or all available cores on the worker in standalone mode).  
 Spark on YARN and Kubernetes only:
  --num-executors NUM          Number of executors to launch (Default: 2). If dynamic allocation is enabled, the initial number of executors will be at least NUM.
  --principal PRINCIPAL        Principal to be used to login to KDC.  
  --keytab KEYTAB              The full path to the file that contains the keytab for the principal specified above.  
 Spark on YARN only:
  --queue QUEUE_NAME           The YARN queue to submit to (Default: "default").
  --archives ARCHIVES          Comma separated list of archives to be extracted into the working directory of each executor.  
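
For example, the same jar from the Scala templates below could be submitted in cluster deploy mode. This is only a sketch that reuses the class, jar, and queue placeholder from those templates; tune the resource options for your own cluster.

spark-submit \
 --master yarn \
 --deploy-mode cluster \
 --class TestClass \
 --queue ${queue_name} \
 --driver-memory 1G \
 --num-executors 4 \
 --executor-cores 4 \
 --executor-memory 4G \
 --name scala_test_cluster \
 AtestSparkApplication.jar

Because the driver runs inside the cluster in this mode, the submitting client can exit once YARN has accepted the application.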

二、spark-submit for Scala applications

1、YARN mode

1.1 spark-submit command template

spark-submit --class TestClass \
--master yarn \
--queue ${queue_name} \
--deploy-mode client \
--driver-memory 1G \
--conf spark.driver.maxResultSize=1G \
--driver-cores 2 \
--num-executors 4 \
--executor-cores 4 \
--executor-memory 16G \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.sql.shuffle.partitions=6400 \
--conf spark.default.parallelism=6400 \
--conf spark.storage.memoryFraction=0.4 \
--conf spark.shuffle.memoryFraction=0.4 \
--conf spark.blacklist.enabled=true \
--conf spark.speculation=true \
--conf spark.hadoop.hive.exec.orc.split.strategy=ETL \
--name scala_test \
AtestSparkApplication.jar

1.2 Example Scala object
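
A minimal sketch of what TestClass inside AtestSparkApplication.jar could look like; the Hive query simply mirrors the Python example in section 三 and is purely illustrative.

import org.apache.spark.sql.SparkSession

object TestClass {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() is needed because the query reads a Hive table
    val spark = SparkSession.builder()
      .appName("scala_test")
      .enableHiveSupport()
      .getOrCreate()

    val df = spark.sql(
      """
        |SELECT COUNT(a.user_id)
        |FROM (
        |    SELECT user_id
        |    FROM app.app_purchase_table
        |    WHERE dt >= "2019-01-01"
        |      AND dt <= "2020-12-31"
        |      AND sku_code IN (700052, 721057)
        |    GROUP BY user_id
        |) a
        |""".stripMargin)
    df.show()

    spark.stop()
  }
}

Package the object into AtestSparkApplication.jar (for example with sbt package or mvn package) and submit it with the command template above.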

 

2、Local mode

2.1 spark-submit command template

spark-submit --class TestClass \
--master local \
--deploy-mode client \
--driver-memory 1G \
--conf spark.driver.maxResultSize=1G \
--executor-memory 16G \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.storage.memoryFraction=0.4 \
--conf spark.shuffle.memoryFraction=0.4 \
--conf spark.blacklist.enabled=true \
--conf spark.speculation=true \
--name scala_test \
AtestSparkApplication.jar

2.2 Example Scala object
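
A minimal local-mode sketch; it performs the same toy computations as the Python local example in section 三 and is purely illustrative.

import org.apache.spark.sql.SparkSession

object TestClass {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("scala_test")
      .getOrCreate()

    // Sum of the ids greater than 500 in a generated range
    spark.range(5000).where("id > 500").selectExpr("sum(id)").show()
    spark.range(500).where("id > 400").show()

    spark.stop()
  }
}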

 

三、spark-submit for Python scripts

1、YARN mode

1.1 spark-submit command templates

(1) A single Python script with no other dependency files

spark-submit \
 --master yarn \
 --queue ${queue_name} \
 --deploy-mode client \
 --driver-memory 4G \
 --driver-cores 4 \
 --executor-memory 8G \
 --executor-cores 4 \
 --num-executors 100 \
 --conf spark.default.parallelism=1600 \
 --name "spark_demo_yarn" \
 pyspark_example_yarn.py 

(2) A Python script plus one or more text files
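
If the script also needs one or more plain text files (config or dictionary files, for example), they can be shipped with --files. The sketch below reuses the template from case (1); config.txt and dict.txt are placeholder names.

spark-submit \
 --master yarn \
 --queue ${queue_name} \
 --deploy-mode client \
 --driver-memory 4G \
 --driver-cores 4 \
 --executor-memory 8G \
 --executor-cores 4 \
 --num-executors 100 \
 --files config.txt,dict.txt \
 --name "spark_demo_yarn_files" \
 pyspark_example_yarn.py

As noted for --files above, the files are placed in each executor's working directory, so the script can open them by file name (or via SparkFiles.get).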

(3) A Python script plus one or more dependent Python scripts
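
If the main script imports other Python modules, ship them with --py-files so they are added to the PYTHONPATH of the driver and executors. Again only a sketch; utils.py and deps.zip are placeholder names.

spark-submit \
 --master yarn \
 --queue ${queue_name} \
 --deploy-mode client \
 --driver-memory 4G \
 --driver-cores 4 \
 --executor-memory 8G \
 --executor-cores 4 \
 --num-executors 100 \
 --py-files utils.py,deps.zip \
 --name "spark_demo_yarn_pyfiles" \
 pyspark_example_yarn.py

The main script can then simply "import utils" as if the module were local.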

 

1.2 Example script: pyspark_example_yarn.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import print_function


if __name__ == '__main__':

    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
                .appName("Word Count") \
                .config("spark.some.config.option", "some-value") \
                .enableHiveSupport() \
                .getOrCreate()

    # Count the distinct users who bought the given sku codes in the date range
    df = spark.sql("""
        SELECT
            COUNT(a.user_id)
        FROM
            (
                SELECT
                    user_id
                FROM
                    app.app_purchase_table
                WHERE
                    dt >= "2019-01-01"
                    AND dt <= "2020-12-31"
                    AND sku_code IN (700052, 721057)
                GROUP BY
                    user_id
            ) a
        """)
    df.show()

2、Local mode

2.1 spark-submit command template

spark-submit \
 --master local \
 --deploy-mode client \
 --name "spark_demo_local" \
 pyspark_example_local.py 

2.2 Example script: pyspark_example_local.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import print_function


if __name__ == '__main__':

    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
                .appName("Word Count") \
                .config("spark.some.config.option", "some-value") \
                .enableHiveSupport() \
                .getOrCreate()

    print(spark.range(5000).where("id > 500").selectExpr("sum(id)").collect())
    spark.range(500).where("id > 400").show()

 
