This article walks through submitting a Spark application to run on a cluster and writing the corresponding shell scripts.
In the Linux environment, running spark-submit with no arguments prints the following help output:
[hadoop@hadoop01 MyShell]$ spark-submit
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while
resolving the dependencies provided in --packages to avoid
dependency conflicts.
--repositories Comma-separated list of additional remote repositories to
search for the maven coordinates given with --packages.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor. File paths of these files
in executors can be accessed via SparkFiles.get(fileName).
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--proxy-user NAME User to impersonate when submitting the application.
This argument does not work with --principal / --keytab.
--help, -h Show this help message and exit.
--verbose, -v Print additional debug output.
--version, Print the version of current Spark.
*************************************The options above this line are common to all modes; the options below are cluster-manager-specific (standalone / Mesos / YARN), as labeled for each group***********************
Spark standalone with cluster deploy mode only:
--driver-cores NUM Cores for driver (Default: 1).
Spark standalone or Mesos with cluster deploy mode only:
--supervise If given, restarts the driver on failure.
--kill SUBMISSION_ID If given, kills the driver specified.
--status SUBMISSION_ID If given, requests the status of the driver specified.
Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.
Spark standalone and YARN only:
--executor-cores NUM Number of cores per executor. (Default: 1 in YARN mode,
or all available cores on the worker in standalone mode)
****************************************The options below are YARN-only**************************************************************
YARN-only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode
(Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of
executors will be at least NUM.
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
--principal PRINCIPAL Principal to be used to login to KDC, while running on
secure HDFS.
--keytab KEYTAB The full path to the file that contains the keytab for the
principal specified above. This keytab will be copied to
the node running the Application Master via the Secure
Distributed Cache, for renewing the login tickets and the
delegation tokens periodically.
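Most of these options end up as entries in the driver's SparkConf (for example --master becomes spark.master and --executor-memory becomes spark.executor.memory), so they can be inspected from application code. A minimal sketch of doing so, with a hypothetical object name:
import org.apache.spark.{SparkConf, SparkContext}

object ShowSubmitConf {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ShowSubmitConf"))
    // prints every property the application was submitted with (--master, --conf, ...)
    sc.getConf.getAll.sortBy(_._1).foreach { case (k, v) => println(s"$k = $v") }
    sc.stop()
  }
}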
The WordCount code is as follows:
package com.wordcount
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
/**
 * WordCount implemented in Scala, reading its input from HDFS.
 * Note: the paths below use the HDFS HA nameservice name, so core-site.xml and
 * hdfs-site.xml must be on the classpath; putting them under the resources
 * directory is enough.
 */
object _03wordCountHDFS {
  def main(args: Array[String]): Unit = {
    // create the configuration object
    val conf: SparkConf = new SparkConf()
      .setAppName(s"${_03wordCountHDFS.getClass.getSimpleName}")
      .setMaster("local")
    val sc = new SparkContext(conf)
    // create the RDD from the HDFS file
    val textRDD: RDD[String] = sc.textFile("hdfs://bd1901/wordcount/in/word.txt")
    // transformations, then reduceByKey to aggregate the counts
    val retRDD: RDD[(String, Int)] = textRDD.flatMap(_.split("\\.")).map((_, 1)).reduceByKey(_ + _)
    // write out the final result
    // retRDD.foreach(x => println(x._1 + "------->" + x._2))
    retRDD.saveAsTextFile("hdfs://bd1901/sparkwordout")
    // release resources
    sc.stop()
  }
}
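As noted in the comment, the HA nameservice is normally resolved by putting core-site.xml and hdfs-site.xml on the classpath. As a rough alternative sketch, the same information can be set on the SparkContext's Hadoop configuration in code; the NameNode ids, hosts, and ports below are placeholders, not values from this article:
import org.apache.spark.{SparkConf, SparkContext}

object _03wordCountHDFSInlineConf {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordCountHDFSInlineConf").setMaster("local"))
    val hc = sc.hadoopConfiguration
    // describe the HA nameservice bd1901 to the HDFS client (placeholder hosts and ports)
    hc.set("fs.defaultFS", "hdfs://bd1901")
    hc.set("dfs.nameservices", "bd1901")
    hc.set("dfs.ha.namenodes.bd1901", "nn1,nn2")
    hc.set("dfs.namenode.rpc-address.bd1901.nn1", "hadoop01:8020")
    hc.set("dfs.namenode.rpc-address.bd1901.nn2", "hadoop02:8020")
    hc.set("dfs.client.failover.proxy.provider.bd1901",
      "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
    sc.textFile("hdfs://bd1901/wordcount/in/word.txt")
      .flatMap(_.split("\\."))
      .map((_, 1))
      .reduceByKey(_ + _)
      .foreach(println)
    sc.stop()
  }
}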
Package the project into a jar and upload it to a local directory on the Linux server.
Write the shell script:
#!/bin/sh
SPARK_BIN=/home/hadoop/apps/spark/bin
# --class:           fully qualified name of the class to run
# --master local:    run locally
# --deploy-mode:     client mode
# --executor-memory: 600m of executor memory
# last argument:     path of the application jar
${SPARK_BIN}/spark-submit \
--class com.wordcount._03wordCountHDFS \
--master local \
--deploy-mode client \
--executor-memory 600m \
/home/hadoop/jars/spark-core-1.0-SNAPSHOT.jar
The above runs the job locally through spark-submit. To run on a Spark standalone cluster, the master is written as follows:
master: spark://bigdata01:7077, or spark://bigdata01:7077,bigdata02:7077
single-master cluster: spark://bigdata01:7077
HA cluster (multiple masters): spark://bigdata01:7077,bigdata02:7077
deploy-mode: either cluster or client.
The difference between the two modes: in client mode the driver runs on the machine where spark-submit is executed, while in cluster mode the driver is launched on one of the worker nodes inside the cluster, as the sketch below shows.
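To see the difference concretely, the driver can log which host it is running on and which deploy mode spark-submit set. A minimal sketch with a hypothetical object name, assuming a Spark version that sets spark.submit.deployMode on the driver's conf (1.5+):
import java.net.InetAddress
import org.apache.spark.{SparkConf, SparkContext}

object DeployModeCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DeployModeCheck"))
    val mode = sc.getConf.get("spark.submit.deployMode", "client")
    // client mode: prints the hostname of the machine where spark-submit ran
    // cluster mode: prints the hostname of the worker that hosts the driver
    println(s"deploy mode = $mode, driver host = ${InetAddress.getLocalHost.getHostName}")
    sc.stop()
  }
}
In cluster mode the println output appears in the driver's log on the worker (visible through the standalone web UI), not in the terminal where spark-submit was run, which is another way to observe the difference.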
The code is as follows (setMaster is not called here, so the master is taken from spark-submit):
package com.wordcount
import org.apache.hadoop.fs.FileSystem
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
/**
 * WordCount implemented in Scala, reading its input from HDFS.
 * Note: the paths below use the HDFS HA nameservice name, so core-site.xml and
 * hdfs-site.xml must be on the classpath (under the resources directory) when packaging.
 */
object _04wordCountHDFSjar {
  def main(args: Array[String]): Unit = {
    // create the configuration object
    val conf: SparkConf = new SparkConf()
      .setAppName(s"${_04wordCountHDFSjar.getClass.getSimpleName}")
      //.set("HADOOP_USER_NAME", "hadoop").set("fs.defaultFS", "hdfs://bd1901:8020")
    val sc = new SparkContext(conf)
    // create the RDD from the HDFS file
    val textRDD: RDD[String] = sc.textFile("hdfs://bd1901/wordcount/in/word.txt")
    // transformations, then reduceByKey to aggregate the counts
    val retRDD: RDD[(String, Int)] = textRDD.flatMap(_.split("\\.")).map((_, 1)).reduceByKey(_ + _)
    // write the final result to HDFS
    retRDD.saveAsTextFile("hdfs://bd1901/sparkwordout")
    // release resources
    sc.stop()
  }
}
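The example above imports org.apache.hadoop.fs.FileSystem but never uses it; presumably the intent was to delete an existing output path from the driver so that saveAsTextFile does not fail. A minimal sketch of that idea (hypothetical helper name, assuming the HDFS client configuration is on the classpath):
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object OutputPathCleaner {
  // delete the output path if it already exists (recursive delete)
  def deleteIfExists(output: String): Unit = {
    val fs = FileSystem.get(URI.create(output), new Configuration())
    val path = new Path(output)
    if (fs.exists(path)) {
      fs.delete(path, true)
    }
  }
}
Calling OutputPathCleaner.deleteIfExists("hdfs://bd1901/sparkwordout") before saveAsTextFile would replace the hdfs dfs -test check that the shell scripts further below perform.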
Package the code into a jar and submit it to the cluster. In this mode the jar should be placed on HDFS so that every node can access it.
Write the scripts as follows.
Script for client mode:
#!/bin/sh
SPARK_BIN=/home/hadoop/apps/spark/bin
${SPARK_BIN}/spark-submit \
--class com.wordcount._04wordCountHDFSjar \
--master spark://hadoop01:7077,hadoop02:7077 \
--deploy-mode client \
--executor-memory 1000m \
--total-executor-cores 1 \
--executor-cores 1 \
hdfs://bd1901/jars/spark-core-1.0-SNAPSHOT.jar
Script for cluster mode:
#!/bin/sh
SPARK_BIN=/home/hadoop/apps/spark/bin
${SPARK_BIN}/spark-submit \
--class com.wordcount._04wordCountHDFSjar \
--master spark://hadoop01:7077,hadoop02:7077 \
--deploy-mode cluster \
--executor-memory 1000M \
--total-executor-cores 1 \
--executor-cores 1 \
hdfs://bd1901/jars/spark-core-1.0-SNAPSHOT.jar
Run the appropriate script.
For YARN the master is written as: yarn
deploy-mode:
client: the driver starts locally; the machine where spark-submit is run is the driver, and the SparkContext is created on that machine.
cluster: the driver does not start locally; it is launched inside the YARN cluster (inside the ApplicationMaster), so the driver also runs on a NodeManager node.
The code, which takes its input path, sleep time, and output path as program arguments, is as follows:
package com.wordcount
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
/**
 * WordCount implemented in Scala, reading its input from HDFS.
 * Note: the HDFS HA nameservice name is used, so core-site.xml and hdfs-site.xml must be
 * on the classpath (under the resources directory) when packaging.
 * This version takes its paths as program arguments.
 */
object _05wordCountHDFSjarargs {
  def main(args: Array[String]): Unit = {
    if (args == null || args.length < 3) {
      println(
        """
          |Parameter Errors! Usage: <inputpath> <sleep> <output>
          |inputpath: input path of the job
          |sleep:     time to sleep (milliseconds) before stopping
          |output:    output path of the job
        """.stripMargin
      )
      System.exit(-1) // exit instead of continuing with invalid arguments
    }
    val Array(inputpath, sleep, outputpath) = args // destructure the arguments
    // create the configuration object
    val conf: SparkConf = new SparkConf()
      .setAppName(s"${_05wordCountHDFSjarargs.getClass.getSimpleName}")
    val sc = new SparkContext(conf)
    // create the RDD from the input path
    val textRDD: RDD[String] = sc.textFile(inputpath)
    // transformations, then reduceByKey to aggregate the counts
    val retRDD: RDD[(String, Int)] = textRDD.flatMap(_.split("\\.")).map((_, 1)).reduceByKey(_ + _)
    // write the final result to the output path
    retRDD.saveAsTextFile(outputpath)
    Thread.sleep(sleep.toLong)
    // release resources
    sc.stop()
  }
}
Package the code into a jar and upload it to HDFS so that it can be shared by all nodes.
Write the shell scripts.
Script for client mode:
#!/bin/sh
SPARK_BIN=/home/hadoop/apps/spark/bin
OUT_PATH=hdfs://bd1901/out/spark/wc
# if the output path already exists, remove it so the job does not fail
hdfs dfs -test -e ${OUT_PATH}
if [ $? -eq 0 ]
then
    echo "output path ${OUT_PATH} already exists, deleting it"
    hdfs dfs -rm -R ${OUT_PATH}
fi
${SPARK_BIN}/spark-submit \
--class com.wordcount._05wordCountHDFSjarargs \
--master yarn \
--deploy-mode client \
--executor-memory 1000m \
--num-executors 1 \
--executor-cores 1 \
hdfs://bd1901/jars/spark-core-1.0-SNAPSHOT.jar \
hdfs://bd1901/wordcount/in/word.txt 1000 ${OUT_PATH}
Script for cluster mode:
#!/bin/sh
SPARK_BIN=/home/hadoop/apps/spark/bin
# output path of the job
OUT_PATH=hdfs://bd1901/out/spark/wc
# if the output path already exists, remove it so the job does not fail
hdfs dfs -test -e ${OUT_PATH}
if [ $? -eq 0 ]
then
    echo "output path ${OUT_PATH} already exists, deleting it"
    hdfs dfs -rm -R ${OUT_PATH}
fi
# --class:           fully qualified name of the class to run
# --master yarn:     run on YARN
# --deploy-mode:     cluster mode
# --executor-memory: 1000m per executor
# last two lines:    jar path on HDFS, then the program arguments
${SPARK_BIN}/spark-submit \
--class com.wordcount._05wordCountHDFSjarargs \
--master yarn \
--deploy-mode cluster \
--executor-memory 1000m \
--num-executors 1 \
--executor-cores 1 \
hdfs://bd1901/jars/spark-core-1.0-SNAPSHOT.jar \
hdfs://bd1901/wordcount/in/word.txt 1000 ${OUT_PATH}
Run the appropriate script. Note that the script passes the input path, the sleep time, and the output path as arguments to the program (args(0), args(1), args(2)), which makes the program more flexible and robust.
Note:
When building the jar, check that it actually contains the class you intend to run (for example with jar tf spark-core-1.0-SNAPSHOT.jar); if the class is missing, do not upload the jar, because the submission will fail. Missing classes usually mean the Maven build plugins are not configured correctly.
The following plugin configuration can be added to the pom.xml:
<build>
    <plugins>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <executions>
                <execution>
                    <id>scala-compile-first</id>
                    <phase>process-resources</phase>
                    <goals>
                        <goal>add-source</goal>
                        <goal>compile</goal>
                    </goals>
                </execution>
                <execution>
                    <id>scala-test-compile</id>
                    <phase>process-test-resources</phase>
                    <goals>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.5.1</version>
            <executions>
                <execution>
                    <phase>compile</phase>
                    <goals>
                        <goal>compile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.4.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <filters>
                            <filter>
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>