Spark Program Run Modes

This article covers how to package a Spark application, submit it to run on a cluster, and write the corresponding shell scripts.

spark-submit options used in the submit scripts

Running spark-submit with no arguments on Linux prints the following usage information:

[hadoop@hadoop01 MyShell]$ spark-submit
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor. File paths of these files
                              in executors can be accessed via SparkFiles.get(fileName).

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.

  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.
************************************* The options above are common to all run modes; the options below apply only to standalone (and Mesos) deployments ***********************
 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)
**************************************** The options below are YARN-only **************************************************************
 YARN-only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.
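
Putting the common options together, a submission generally takes the following shape (a generic template; the class name, jar, and arguments are placeholders to be filled in):

# generic spark-submit template; replace the angle-bracket placeholders
spark-submit \
--class <your main class> \
--master <master URL> \
--deploy-mode <client|cluster> \
--executor-memory 1g \
<application jar> [application arguments]

The scripts in the sections below fill in these placeholders for local, standalone, and YARN mode.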

1. Running in local mode

The code is as follows:

package com.wordcount

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}


/**
  * WordCount implemented in Scala, reading its input from HDFS.
  * Note: because the HDFS HA nameservice is used here, core-site.xml and hdfs-site.xml
  * must be on the classpath (placing them under the resources directory is enough).
  *
  */
object _03wordCountHDFS {
  def main(args: Array[String]): Unit = {
    //create the Spark configuration
    val conf: SparkConf = new SparkConf()
      .setAppName(s"${_03wordCountHDFS.getClass.getSimpleName}")
      .setMaster("local")
    val sc = new SparkContext(conf)

    //create the RDD
    val textRDD: RDD[String] = sc.textFile("hdfs://bd1901/wordcount/in/word.txt")
    //apply the transformations and the action
    val retRDD: RDD[(String, Int)] = textRDD.flatMap(_.split("\\.")).map((_,1)).reduceByKey(_+_)
    //write out the final result
   // retRDD.foreach(x=>println(x._1+"------->"+x._2))
    retRDD.saveAsTextFile("hdfs://bd1901/sparkwordout")
    //release resources
    sc.stop()
  }
}

Package the project into a jar and upload it to a local directory on the Linux server.
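As a rough sketch of this step (the Maven command is standard; the host name and target directory are assumptions taken from the script below):

# build the application jar from the project root
mvn clean package
# copy the jar to the submit host
scp target/spark-core-1.0-SNAPSHOT.jar hadoop@hadoop01:/home/hadoop/jars/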
Then write the shell script:

#!/bin/sh


SPARK_BIN=/home/hadoop/apps/spark/bin

# --class: fully qualified name of the class to run
# --master local: run locally
# --deploy-mode client: client mode
# --executor-memory 600m: 600m of executor memory
# the last argument is the path of the application jar
# (comments must not follow the trailing backslashes, so the options are explained here)
${SPARK_BIN}/spark-submit \
--class com.wordcount._03wordCountHDFS \
--master local \
--deploy-mode client \
--executor-memory 600m \
/home/hadoop/jars/spark-core-1.0-SNAPSHOT.jar

2. Running on the standalone cluster

This mode runs the application on Spark's own (standalone) cluster manager.
The master URL is written as follows:
master: spark://bigdata01:7077 or spark://bigdata01:7077,bigdata02:7077

  • Single master: spark://bigdata01:7077

  • HA cluster: spark://bigdata01:7077,bigdata02:7077

deploy-mode: either cluster or client
The difference between the two:

  • client: the driver starts locally; the machine that submits the Spark job is the driver, and the SparkContext is created on that machine
  • cluster: the driver does not start locally; it is launched inside the Spark cluster, on one of the worker nodes

The code is as follows:

package com.wordcount

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}


/**
  * WordCount implemented in Scala, reading its input from HDFS.
  * Note: because the HDFS HA nameservice is used here, core-site.xml and hdfs-site.xml
  * must be on the classpath (placing them under the resources directory is enough).
  *
  */
object _04wordCountHDFSjar{
  def main(args: Array[String]): Unit = {
    //create the Spark configuration
    val conf: SparkConf = new SparkConf()
      .setAppName(s"${_04wordCountHDFSjar.getClass.getSimpleName}")
  //.set("HADOOP_USER_NAME","hadoop").set("fs.defaultFS", "hdfs://bd1901:8020")
    val sc = new SparkContext(conf)

    //create the RDD
    val textRDD: RDD[String] = sc.textFile("hdfs://bd1901/wordcount/in/word.txt")
    //apply the transformations and the action
    val retRDD: RDD[(String, Int)] = textRDD.flatMap(_.split("\\.")).map((_,1)).reduceByKey(_+_)
    //write out the final result
    retRDD.saveAsTextFile("hdfs://bd1901/sparkwordout")
    //release resources
    sc.stop()
  }
}

Package the code into a jar and submit it to the cluster. In this mode the jar should be placed on HDFS so that every node can access it.
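A minimal sketch of uploading the jar to HDFS (the target directory matches the path used in the scripts below; the local jar location is an assumption):

# create the directory on HDFS and upload the jar, overwriting any previous version
hdfs dfs -mkdir -p hdfs://bd1901/jars
hdfs dfs -put -f /home/hadoop/jars/spark-core-1.0-SNAPSHOT.jar hdfs://bd1901/jars/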
The submit scripts are written as follows.
Client-mode script:

#!/bin/sh


SPARK_BIN=/home/hadoop/apps/spark/bin

${SPARK_BIN}/spark-submit \
--class com.wordcount._04wordCountHDFSjar \
--master spark://hadoop01:7077,hadoop02:7077 \
--deploy-mode client \
--executor-memory 1000m \
--total-executor-cores 1 \
--executor-cores 1 \
hdfs://bd1901/jars/spark-core-1.0-SNAPSHOT.jar

Cluster-mode script:

#!/bin/sh


SPARK_BIN=/home/hadoop/apps/spark/bin

${SPARK_BIN}/spark-submit \
--class com.wordcount._04wordCountHDFSjar \
--master spark://hadoop01:7077,hadoop02:7077 \
--deploy-mode cluster \
--executor-memory 1000M \
--total-executor-cores 1 \
--executor-cores 1 \
hdfs://bd1901/jars/spark-core-1.0-SNAPSHOT.jar

Run whichever script matches the desired deploy mode.
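For example, assuming the cluster-mode script above was saved as submit_standalone_cluster.sh (the file name is an assumption), it can be run and the result inspected roughly like this:

sh submit_standalone_cluster.sh
# inspect the files written by saveAsTextFile (output path taken from the code above)
hdfs dfs -cat hdfs://bd1901/sparkwordout/part-*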

3. Running on YARN

The master is written simply as: master: yarn
deploy-mode:

client: the driver starts locally; the machine that submits the Spark job is the driver, and the SparkContext is created on that machine

cluster: the driver does not start locally; it is launched inside the YARN cluster, on one of the NodeManager nodes
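One prerequisite for --master yarn: spark-submit locates the ResourceManager through the Hadoop client configuration, so HADOOP_CONF_DIR (or YARN_CONF_DIR) must point at the directory containing yarn-site.xml. A minimal sketch, where the path is only an assumed typical installation location:

# adjust to the actual Hadoop configuration directory of the cluster
export HADOOP_CONF_DIR=/home/hadoop/apps/hadoop/etc/hadoop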
The code is as follows:

package com.wordcount

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}


/**
  * WordCount implemented in Scala, reading its input from HDFS.
  * Note: because the HDFS HA nameservice is used here, core-site.xml and hdfs-site.xml
  * must be on the classpath (placing them under the resources directory is enough).
  * This version takes its input path, sleep time, and output path as command-line arguments.
  */
object _05wordCountHDFSjarargs{
  def main(args: Array[String]): Unit = {
    if (args == null || args.length < 3) {
      println(
            """
              |Parameter Errors! Usage:<inputpath> <sleep> <output>
              |inputpath:   input path of the job
              |sleep:       sleep time in milliseconds
              |output:      output path of the job
            """.stripMargin
      )
      System.exit(-1) //exit early, otherwise the pattern match below throws a MatchError
    }
    val Array(inputpath, sleep, outputpath) = args //pattern match the arguments
    //create the Spark configuration
    val conf: SparkConf = new SparkConf()
      .setAppName(s"${_05wordCountHDFSjarargs.getClass.getSimpleName}")
    val sc = new SparkContext(conf)

    //create the RDD
    val textRDD: RDD[String] = sc.textFile(inputpath)
    //apply the transformations and the action
    val retRDD: RDD[(String, Int)] = textRDD.flatMap(_.split("\\.")).map((_,1)).reduceByKey(_+_)
    //write out the final result
    retRDD.saveAsTextFile(outputpath)
    Thread.sleep(sleep.toLong)
    //release resources
    sc.stop()
  }
}

Package the code into a jar and upload it to HDFS so that every node can share it, then write the shell scripts.

Client-mode script:

#!/bin/sh


SPARK_BIN=/home/hadoop/apps/spark/bin
OUT_PATH=hdfs://bd1901/out/spark/wc

# if the output path already exists, remove it so saveAsTextFile does not fail
hdfs dfs -test -e ${OUT_PATH}
if [ $? -eq 0 ]
then
  echo "output path ${OUT_PATH} already exists, removing it"
  hdfs dfs -rm -r ${OUT_PATH}
fi

${SPARK_BIN}/spark-submit \
--class com.wordcount._05wordCountHDFSjarargs \
--master yarn \
--deploy-mode client \
--executor-memory 1000m \
--num-executors 1 \
--executor-cores 1 \
hdfs://bd1901/jars/spark-core-1.0-SNAPSHOT.jar \
hdfs://bd1901/wordcount/in/word.txt 1000 ${OUT_PATH}

Cluster-mode script:

#!/bin/sh


SPARK_BIN=/home/hadoop/apps/spark/bin
# output path of the job
OUT_PATH=hdfs://bd1901/out/spark/wc

# check whether the output path already exists; if so, delete it to avoid an error
hdfs dfs -test -e ${OUT_PATH}
if [ $? -eq 0 ]
then
  echo "output path ${OUT_PATH} already exists, removing it"
  hdfs dfs -rm -r ${OUT_PATH}
fi

# --class: fully qualified name of the class to run
# --master yarn --deploy-mode cluster: run on YARN in cluster mode
# --executor-memory 1000m: 1000m of executor memory
# the last two lines are the jar path on HDFS and the program arguments
${SPARK_BIN}/spark-submit \
--class com.wordcount._05wordCountHDFSjarargs \
--master yarn \
--deploy-mode cluster \
--executor-memory 1000m \
--num-executors 1 \
--executor-cores 1 \
hdfs://bd1901/jars/spark-core-1.0-SNAPSHOT.jar \
hdfs://bd1901/wordcount/in/word.txt 1000 ${OUT_PATH}

Run the appropriate script. Note that the script passes the required arguments to the program, which makes the program more robust.
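After submitting, the job can be checked with the standard YARN command line (a sketch; the application id below is a placeholder taken from the spark-submit output):

# list running applications to find the application id
yarn application -list
# fetch the aggregated logs once the application has finished
yarn logs -applicationId application_XXXXXXXXXXXXX_XXXX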

Note:
When building the jar, check that it actually contains the class you intend to run before uploading it; otherwise the submit will fail with a class-not-found error. If the compiled classes are missing from the jar, the Maven build plugins are probably misconfigured; a quick check is shown below.
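A quick way to perform this check, assuming the jar name used in the scripts above:

# list the jar contents and confirm the main classes are present
jar tf spark-core-1.0-SNAPSHOT.jar | grep wordCount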
The following plugin configuration can be added to the pom to fix this:

<build>
    <pluginManagement>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.5.1</version>
            </plugin>
        </plugins>
    </pluginManagement>
    <plugins>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <executions>
                <execution>
                    <id>scala-compile-first</id>
                    <phase>process-resources</phase>
                    <goals>
                        <goal>add-source</goal>
                        <goal>compile</goal>
                    </goals>
                </execution>
                <execution>
                    <id>scala-test-compile</id>
                    <phase>process-test-resources</phase>
                    <goals>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <executions>
                <execution>
                    <phase>compile</phase>
                    <goals>
                        <goal>compile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.4.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <filters>
                            <filter>
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
