At a high level, every Spark application consists of a driver program that runs the user's main function and executes various parallel operations on a cluster.
The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
#RDD is the core data structure of Spark Core; a partition plays a role similar to a split in MapReduce, and all the partitions together make up one RDD
RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it.
Users may also ask Spark to persist (cache) an RDD in memory, allowing it to be reused efficiently across parallel operations.
Finally, RDDs automatically recover from node failures.
Takeaway: an RDD resembles a Scala collection, but it is distributed: the data of an RDD is partitioned and stored across different machines so that it can be processed in parallel.
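To make the two ways of creating an RDD and the in-memory persistence concrete, here is a minimal spark-shell style sketch; it only assumes the /datas/wordcount.data test file used later in this section and the standard RDD API.
// Create an RDD from an existing Scala collection in the driver program
val numsRdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
// Create an RDD from a file in HDFS (or any Hadoop-supported file system)
val linesRdd = sc.textFile("/datas/wordcount.data")
// Ask Spark to persist the RDD in memory so later parallel operations can reuse it
linesRdd.cache()
// Parallel operations run across the partitions on the cluster
println(numsRdd.map(_ * 2).sum())   // 30
println(linesRdd.count())           // the first action also fills the cache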
| | MapReduce | Spark |
|---|---|---|
| Data storage | Disk (HDFS file system) | Builds in-memory resilient distributed datasets (RDDs) for computation and caching |
| Programming paradigm | Map + Reduce | DAG (directed acyclic graph): transformations + actions |
| Intermediate results | Spilled to disk; I/O plus serialization/deserialization is expensive | Kept in memory, several orders of magnitude faster than disk |
| Task execution | Tasks run as processes, so task startup is slow | Tasks run as threads, so task startup is fast |
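The "DAG: transformations + actions" row is easiest to see in code. The sketch below uses only the standard RDD API: transformations merely extend the lineage graph, an action triggers the actual job, and cache() keeps an intermediate RDD in memory for reuse.
val lines = sc.textFile("/datas/wordcount.data")       // transformation: nothing runs yet
val words = lines.flatMap(_.split("\\s+")).cache()     // transformation + mark for in-memory reuse
println(words.count())             // action: triggers the first job and fills the cache
println(words.distinct().count())  // action: the second job reuses the cached words instead of rereading HDFS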
Common data sources
Runs on a variety of distributed resource platforms
Versions
Distributions (vendors)
Because of the issues above, Spark has to be compiled manually against the Cloudera dependencies
spark-2.4.5-bin-cdh5.16.2-2.11.tgz
Master single point of failure: start multiple Masters
How to guarantee that only one Master is active at a time and that failover is automatic: Zookeeper
Any distributed master-slave architecture either relies on Zookeeper to solve the single-point-of-failure problem or implements Zookeeper-like functionality itself
For example, Kafka relies on Zookeeper, while Elasticsearch implements its own Zookeeper-like leader election and automatic failover
See Appendix 1 to import the virtual machines
See Appendix 2 to install the local environment
Start HDFS
hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode
Test
Start spark-shell
cd /export/server/spark
bin/spark-shell --master local[2]
Create a test file
hdfs dfs -mkdir /datas
vim wordcount.data
hadoop spark hbase
hive hive hive hive
hadoop spark spark
hdfs dfs -put wordcount.data /datas
Observe the spark-shell log
Setting default log level to "WARN".
#The default log level is WARN
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
#To change the log level, call sc.setLogLevel("INFO")
Spark context Web UI available at http://node1.itcast.cn:4040
#Every Spark application automatically opens a web monitoring UI; the ports start at 4040, a second application gets 4041, and so on
Spark context available as 'sc' (master = local[2], app id = local-1608102687971).
#An object has been created: sc, a SparkContext
Spark session available as 'spark'.
#An object has been created: spark, a SparkSession
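Both objects can be exercised right away to confirm the shell works; a minimal sketch using nothing beyond what the shell itself provides:
// sc is the SparkContext: the entry point for the RDD API
sc.parallelize(1 to 5).map(_ * 2).collect()   // Array(2, 4, 6, 8, 10)
// spark is the SparkSession: the entry point for the DataFrame / SQL API
spark.range(5).count()                        // 5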
Test WordCount
Input: read /datas/wordcount.data from HDFS
//Call the SparkContext method that reads an HDFS file and store it in an RDD object
scala> val inputRdd = sc.textFile("/datas/wordcount.data")
inputRdd: org.apache.spark.rdd.RDD[String] = /datas/wordcount.data MapPartitionsRDD[1] at textFile at <console>:24
//Look at the first line of the RDD
scala> inputRdd.first
res0: String = hadoop spark hbase
//Count the number of lines in the RDD
scala> inputRdd.count
res1: Long = 3
Transform: implement the word count
flatMap:RDD[String]
scala> inputRdd.flatMap(line => line.trim.split("\\s+"))
res2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:26
scala> inputRdd.flatMap(line => line.trim.split("\\s+")).foreach(println)
hadoop
spark
hbase
hive
hive
hive
hive
hadoop
spark
spark
map:RDD[(String,Int)]
scala> inputRdd.flatMap(line => line.trim.split("\\s+")).map(word => (word,1)).foreach(println)
(hadoop,1)
(spark,1)
(hbase,1)
(hive,1)
(hive,1)
(hive,1)
(hive,1)
(hadoop,1)
(spark,1)
(spark,1)
reduceByKey: equivalent to groupByKey + reduce, producing RDD[(String, Int)]
scala> val rsRDD = inputRdd.flatMap(line => line.trim.split("\\s+")).map(word => (word,1)).reduceByKey((tmp,item)=> tmp+item)
rsRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:25
scala> rsRDD.foreach(println)
(hive,4)
(spark,3)
(hadoop,2)
(hbase,1)
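To make the "reduceByKey = groupByKey + reduce" remark above concrete, here is a small sketch (standard RDD API only) showing that the two pipelines produce the same counts; reduceByKey is usually preferred because it pre-aggregates inside each partition before the shuffle.
val pairs = inputRdd.flatMap(_.trim.split("\\s+")).map(word => (word, 1))
// groupByKey followed by a per-key reduce: every (word, 1) pair is shuffled
val viaGroup = pairs.groupByKey().map { case (word, ones) => (word, ones.sum) }
// reduceByKey: combines values within each partition, then shuffles only the partial sums
val viaReduce = pairs.reduceByKey(_ + _)
viaGroup.collect().sorted.foreach(println)
viaReduce.collect().sorted.foreach(println)   // same result, less shuffle traffic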
Output
rsRDD.saveAsTextFile("/datas/output/output1")
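A quick way to check what saveAsTextFile produced (one part-* file per partition plus a _SUCCESS marker) is to read the output directory back in the same shell; the path below simply matches the save call above.
// textFile accepts a directory and reads every part file inside it
val savedRdd = sc.textFile("/datas/output/output1")
savedRdd.foreach(println)   // each line is the "(word,count)" string form of a tuple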
Test running a jar
spark-submit: used to run Spark application jars
Spark
spark-submit
[options]
--class specifies which class to run
xxxx.jar
args #number of tasks = number of CPU cores used
SPARK_HOME=/export/server/spark
${SPARK_HOME}/bin/spark-submit \
--master local[2] \
--class org.apache.spark.examples.SparkPi \
${SPARK_HOME}/examples/jars/spark-examples_2.11-2.4.5.jar \
10
Note: a second Spark application is reachable at port 4041
See Appendix 3 to install the cluster environment
Start the Spark Standalone cluster
Start HDFS: run on the first machine
start-dfs.sh
#Create a directory to store the Spark applications' event logs
hdfs dfs -mkdir -p /spark/eventLogs/
Start the Master: first machine
/export/server/spark/sbin/start-master.sh
Start the Workers: first machine
/export/server/spark/sbin/start-slaves.sh
View the web UI
node1:8080
Start the HistoryServer
/export/server/spark/sbin/start-history-server.sh
Access the web UI
node1:18080
Test
SPARK_HOME=/export/server/spark
${SPARK_HOME}/bin/spark-submit \
--master spark://node1:7077 \
--class org.apache.spark.examples.SparkPi \
${SPARK_HOME}/examples/jars/spark-examples_2.11-2.4.5.jar \
10
Stop all Spark processes
/export/server/spark/sbin/stop-slaves.sh
/export/server/spark/sbin/stop-master.sh
/export/server/spark/sbin/stop-history-server.sh
Edit the configuration file
cd /export/server/spark/conf/
vim spark-env.sh
#Comment out line 60
#SPARK_MASTER_HOST=node1
#Add at line 68
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=node1:2181,node2:2181,node3:2181 -Dspark.deploy.zookeeper.dir=/spark-ha"
Distribute to the other nodes
cd /export/server/spark/conf
scp -r spark-env.sh node2:$PWD
scp -r spark-env.sh node3:$PWD
Start ZooKeeper
zookeeper-daemons.sh start
zookeeper-daemons.sh status
Start the Masters
First machine
/export/server/spark/sbin/start-master.sh
Second machine
/export/server/spark/sbin/start-master.sh
Start the Workers
/export/server/spark/sbin/start-slaves.sh
Test
SPARK_HOME=/export/server/spark
#List every available Master after --master
${SPARK_HOME}/bin/spark-submit \
--master spark://node1:7077,node2:7077 \
--class org.apache.spark.examples.SparkPi \
${SPARK_HOME}/examples/jars/spark-examples_2.11-2.4.5.jar \
100
When a Spark application runs on a cluster, it consists of two parts: one Driver Program and multiple Executors.
From initial submission to final execution, a user program goes through the following stages:
1) When the user program creates a SparkContext, the newly created SparkContext instance connects to the ClusterManager. The Cluster Manager allocates compute resources for this submission according to the CPU and memory settings the user supplied, and starts the Executors.
2) The Driver process parses the code and builds the DAG. The Driver divides the program into execution stages; each stage consists of a group of identical Tasks, and each Task works on a different partition of the data to be processed. Once the stages have been determined and the Tasks created, the Driver sends the Tasks to the Executors.
3) After receiving a Task, an Executor downloads the Task's runtime dependencies, sets up the Task's execution environment, runs the Task, and reports the Task's running state back to the Driver.
4) The Driver handles each status update it receives. Tasks come in two kinds: a Shuffle Map Task reshuffles the data and writes the shuffle output to the file system of the node where the Executor runs; a Result Task produces the result data.
5) The Driver keeps scheduling Tasks, sending them to Executors and monitoring them all; it stops when every Task has completed successfully, or when the retry limit is exceeded without success.
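As an illustration of where the two Task kinds come from, take the word count used throughout this section: everything up to the map runs as Shuffle Map Tasks, reduceByKey introduces the shuffle boundary, and the final action runs as Result Tasks. A minimal sketch, using nothing beyond the standard RDD API:
// Stage 0: Shuffle Map Tasks, one per input partition; their shuffle output
// is written to the local file system of the node where each Executor runs
val pairs = sc.textFile("/datas/wordcount.data")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
// reduceByKey marks the stage boundary (a shuffle dependency)
val counts = pairs.reduceByKey(_ + _)
// Stage 1: Result Tasks read the shuffle output and produce the result data
// that is returned to the Driver
counts.collect().foreach(println)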
One partition corresponds to one Task, so why do the two partitions here need four Tasks?
When is the data actually produced?
Does a higher number of partitions in an RDD always mean a higher degree of parallelism?
Why are Stages numbered globally (across the different jobs of one Application)?
Some of the data is built on the Executors and some on the Driver; what is the difference?
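A hedged hint for the first two questions: tasks are counted per stage, not per job, and data is only materialized when an action triggers a job. A two-partition word count that contains one shuffle therefore runs two stages of two tasks each, i.e. four tasks, which the 4040 UI will confirm. A quick way to observe this in the shell (standard API only):
val rdd = sc.textFile("/datas/wordcount.data", 2)   // ask for 2 partitions
println(rdd.getNumPartitions)                       // 2
// One action, but the reduceByKey shuffle splits the job into two stages,
// each with one task per partition: 2 Shuffle Map Tasks + 2 Result Tasks = 4 tasks
rdd.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _).collect()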
Driver: parses and runs all of the code: the logical plan
Executor: runs the Tasks: the physical plan
When a Task finishes, it leaves its data on the Executor it ran on; when the code needs that data, the data is sent to the Driver (the code itself executes on the Driver)
println(inputRdd.first())
//the data behind inputRdd.first() lives on an Executor
//when it has to be printed, the data is sent to the Driver
All code is parsed and run on the Driver, but the data is materialized partly on the Driver and partly on the Executors
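The sketch below makes the Driver/Executor split visible; where each println appears is the whole point (this assumes a real cluster, since in local mode everything shares one JVM).
val inputRdd = sc.textFile("/datas/wordcount.data")
// Runs inside the Executors: the output goes to the Executors' stdout logs,
// not to the Driver console
inputRdd.foreach(line => println(s"on executor: $line"))
// first() ships the data back to the Driver; this println runs on the Driver
val firstLine = inputRdd.first()
println(s"on driver: $firstLine")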
package bigdata.itcast.cn.spark.scala.wordcount

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * @ClassName SparkCoreWordCount
 * @Description TODO Implement WordCount in our own code
 */
object SparkCoreWordCount {
  def main(args: Array[String]): Unit = {
    /**
     * step1: build the SparkContext first: initialize the resource object
     */
    //Build a SparkConf: manages all the configuration of the current program
    val conf = new SparkConf()
      //Give the current program a name
      .setAppName(this.getClass.getSimpleName.stripSuffix("$"))
      //Set the mode the current program runs in
      .setMaster("local[2]")
    //Build a SparkContext object
    val sc = new SparkContext(conf)
    //Adjust the log level
    sc.setLogLevel("WARN")

    /**
     * step2: process the data
     */
    //todo:1-read the data
    val inputRdd: RDD[String] = sc.textFile("/datas/wordcount.data")
    //todo:2-process the data
    val rsRdd = inputRdd
      .filter(line => null != line && line.trim.length > 0)
      .flatMap(line => line.trim.split("\\s+"))
      .map(word => word -> 1)
      .reduceByKey((tmp, item) => tmp + item)
    //todo:3-save the result
    rsRdd.foreach(println)

    /**
     * step3: release the resources
     */
    //Keep the program alive for a while so the 4040 web UI can be inspected
    Thread.sleep(1000000L)
    sc.stop()
  }
}
In local mode there are no Executors on the single machine, only the driver;
only with Executors can the computation run in a distributed way
package bigdata.itcast.cn.spark.scala.mode

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * @ClassName SparkCoreMode
 * @Description TODO Basic template for a Spark Core program
 */
object SparkCoreMode {
  def main(args: Array[String]): Unit = {
    /**
     * step1: build the SparkContext first: initialize the resource object
     */
    //Build a SparkConf: manages all the configuration of the current program
    val conf = new SparkConf()
      //Give the current program a name
      .setAppName(this.getClass.getSimpleName.stripSuffix("$"))
      //Set the mode the current program runs in
      .setMaster("local[2]")
    //Build a SparkContext object
    val sc = new SparkContext(conf)
    //Adjust the log level
    sc.setLogLevel("WARN")

    /**
     * step2: process the data
     */
    //todo:1-read the data
    //todo:2-process the data
    //todo:3-save the result

    /**
     * step3: release the resources
     */
    Thread.sleep(1000000L)
    sc.stop()
  }
}
Example: based on the word count, select the three most frequent words
package bigdata.itcast.cn.spark.scala.topkey

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * @ClassName SparkCoreWCTopKey
 * @Description TODO Word count with Spark Core, sorted by frequency
 */
object SparkCoreWCTopKey {
  def main(args: Array[String]): Unit = {
    /**
     * step1: initialize a SparkContext
     */
    //Build the configuration object
    val conf = new SparkConf()
      .setAppName(this.getClass.getSimpleName.stripSuffix("$"))
      .setMaster("local[2]")
    //Build the SparkContext instance: reuse it if it already exists, otherwise create it
    val sc = SparkContext.getOrCreate(conf)
    //Adjust the log level
    sc.setLogLevel("WARN")

    /**
     * step2: implement the processing: read, transform, save
     */
    //todo:1-read
    val inputRdd: RDD[String] = sc.textFile("datas/wordcount/wordcount.data")
    println(s"first line = ${inputRdd.first()}")
    println(s"count = ${inputRdd.count()}")
    //todo:2-transform
    val wcRdd = inputRdd
      //Filter out invalid lines: def filter(f: T => Boolean)
      .filter(line => null != line && line.trim.length > 0)
      //Extract all the words into one collection
      .flatMap(line => line.trim.split("\\s+"))
      //Convert each word into a tuple
      .map(word => (word, 1))
      //Group by word and aggregate
      .reduceByKey((tmp, item) => tmp + item)
    //Approach 1: sortByKey: sorts by the key of a (key, value) pair only, so swap the tuple first;
    //take is an action that pulls the data back and triggers execution
    wcRdd.map(tuple => tuple.swap)
      .sortByKey(ascending = false)
      .take(3)
      .foreach(println)
    //Approach 2: sortBy: sort by any field
    wcRdd.sortBy(tuple => -tuple._2)
      .take(3)
      .foreach(println)
    //Approach 3: top: sorts the data directly with the given ordering, descending, and returns the first N
    wcRdd.top(3)(Ordering.by(tuple => tuple._2)) //custom ordering
      .foreach(println)
    //todo:3-save

    /**
     * step3: release the resources
     */
    Thread.sleep(1000000L)
    sc.stop()
  }
}
sortBy
def sortBy[K](
f: (T) => K,              //which field to sort by
ascending: Boolean = true,
numPartitions: Int = this.partitions.length)
top: well suited to selecting from small amounts of data; the return value is an array, which is loaded into the driver's memory
def top(num: Int)(implicit ord: Ordering[T]): Array[T]
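A small usage sketch for the two operators above, on a throwaway RDD built in the shell (the values are arbitrary and only for illustration):
val counts = sc.parallelize(Seq(("hive", 4), ("spark", 3), ("hadoop", 2), ("hbase", 1)))
// sortBy: sort by the count in descending order, then take the first three on the driver
counts.sortBy(tuple => -tuple._2).take(3).foreach(println)
// top: the same result in one step; the implicit Ordering decides what "largest" means
counts.top(3)(Ordering.by(tuple => tuple._2)).foreach(println)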
First package the program into a jar
package bigdata.itcast.cn.spark.scala.wordcount

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * @ClassName SparkCoreWordCount
 * @Description TODO Implement WordCount in our own code
 */
object SparkCoreWordCount {
  def main(args: Array[String]): Unit = {
    /**
     * step1: build the SparkContext first: initialize the resource object
     */
    //Build a SparkConf: manages all the configuration of the current program
    val conf = new SparkConf()
      //Give the current program a name
      .setAppName(this.getClass.getSimpleName.stripSuffix("$"))
      //The running mode is no longer hard-coded here; it is supplied by spark-submit --master
      // .setMaster("local[2]")
    //Build a SparkContext object
    val sc = new SparkContext(conf)
    //Adjust the log level
    sc.setLogLevel("WARN")

    /**
     * step2: process the data
     */
    //todo:1-read the data
    //Use args(0) as the input path
    val inputRdd: RDD[String] = sc.textFile(args(0))
    //todo:2-process the data
    val rsRdd = inputRdd
      .filter(line => null != line && line.trim.length > 0)
      .flatMap(line => line.trim.split("\\s+"))
      .map(word => word -> 1)
      .reduceByKey((tmp, item) => tmp + item)
    //todo:3-save the result
    rsRdd.foreach(println)
    //Append a timestamp to args(1) so repeated runs do not fail because the output path already exists
    rsRdd.saveAsTextFile(args(1) + "-" + System.currentTimeMillis())

    /**
     * step3: release the resources
     */
    // Thread.sleep(1000000L)
    sc.stop()
  }
}
Upload it to Linux and put the jar on HDFS, so that it can be run directly from any machine
hdfs dfs -mkdir /spark/apps
hdfs dfs -put spark-chapter01_2.11-1.0.0.jar /spark/apps/
Run it with spark-submit
Usage
[root@node1 spark]# bin/spark-submit -h
Usage: spark-submit [options] [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn,
k8s://https://host:port, or local (Default: local[*]).
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while
resolving the dependencies provided in --packages to avoid
dependency conflicts.
--repositories Comma-separated list of additional remote repositories to
search for the maven coordinates given with --packages.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor. File paths of these files
in executors can be accessed via SparkFiles.get(fileName).
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--proxy-user NAME User to impersonate when submitting the application.
This argument does not work with --principal / --keytab.
--help, -h Show this help message and exit.
--verbose, -v Print additional debug output.
--version, Print the version of current Spark.
Cluster deploy mode only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode
(Default: 1).
Spark standalone or Mesos with cluster deploy mode only:
--supervise If given, restarts the driver on failure.
--kill SUBMISSION_ID If given, kills the driver specified.
--status SUBMISSION_ID If given, requests the status of the driver specified.
Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.
Spark standalone and YARN only:
--executor-cores NUM Number of cores per executor. (Default: 1 in YARN mode,
or all available cores on the worker in standalone mode)
YARN-only:
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of
executors will be at least NUM.
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
--principal PRINCIPAL Principal to be used to login to KDC, while running on
secure HDFS.
--keytab KEYTAB The full path to the file that contains the keytab for the
principal specified above. This keytab will be copied to
the node running the Application Master via the Secure
Distributed Cache, for renewing the login tickets and the
delegation tokens periodically.
Commonly used options
Submission options
--master: specifies where to submit the application: local, Standalone, YARN, Mesos, K8s
local[2]
spark://host:7077
yarn
--deploy-mode: specifies the deploy mode, either client or cluster
--class: specifies which class inside the jar to run
--jars: adds any extra jars the application needs
--conf: temporarily overrides a Spark configuration property
Driver process resources
--driver-memory MEM: memory for the driver process
Executor process resources
Common properties
--executor-memory: memory each Executor can use (Default: 1G)
Spark standalone and Mesos
--total-executor-cores: total number of CPU cores used by all Executors
Spark standalone and YARN
--executor-cores: number of CPU cores each Executor can use
YARN-only
--num-executors: number of Executors to launch
Question: in a Standalone cluster, how is the number of Executors specified? (When both options are set, it is roughly --total-executor-cores divided by --executor-cores.)
SPARK_HOME=/export/server/spark
$SPARK_HOME/bin/spark-submit \
--master local[2] \
--class cn.itcast.spark.pack.WordCountPack \
hdfs://node1:8020/spark/apps/spark_project-1.0-SNAPSHOT.jar \
/datas/wordcount.data \
/datas/output
Submit directly
SPARK_HOME=/export/server/spark
$SPARK_HOME/bin/spark-submit \
--master spark://node1:7077 \
--class cn.itcast.spark.pack.WordCountPack \
hdfs://node1:8020/spark/apps/spark_project-1.0-SNAPSHOT.jar \
/datas/wordcount.data \
/datas/output
Adjust the Executor and Driver resources
SPARK_HOME=/export/server/spark
$SPARK_HOME/bin/spark-submit \
--master spark://node1:7077 \
--driver-memory 512M \
--executor-memory 512M \
--executor-cores 1 \
--total-executor-cores 2 \
--class cn.itcast.spark.pack.WordCountPack \
hdfs://node1:8020/spark/apps/spark_project-1.0-SNAPSHOT.jar \
/datas/wordcount.data \
/datas/output
Edit yarn-site.xml
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>
<property>
    <name>yarn.log.server.url</name>
    <value>http://node1:19888/jobhistory/logs</value>
</property>
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
Distribute
cd /export/server/hadoop/etc/hadoop
scp -r yarn-site.xml root@node2:$PWD
scp -r yarn-site.xml root@node3:$PWD
Stop the Spark cluster
cd /export/server/spark
sbin/stop-master.sh
sbin/stop-slaves.sh
sbin/stop-history-server.sh
Edit spark-env.sh
#Add the YARN configuration directory
YARN_CONF_DIR=/export/server/hadoop/etc/hadoop
Configure where the Spark jars are stored on HDFS: this gives YARN the Spark dependency jars it needs when running Spark
hdfs dfs -mkdir -p /spark/apps/jars/
hdfs dfs -put /export/server/spark/jars/* /spark/apps/jars/
Edit spark-defaults.conf
#So that the Spark application's monitoring can be reached directly from the 8088 UI, a redirect is configured here: clicking History in YARN redirects to port 18080
spark.yarn.historyServer.address node1:18080
#Location of the Spark jars used when the application runs on YARN
spark.yarn.jars hdfs://node1:8020/spark/apps/jars/*
Distribute
cd /export/server/spark/conf/
scp spark-env.sh spark-defaults.conf node2:$PWD
scp spark-env.sh spark-defaults.conf node3:$PWD
Start YARN
start-yarn.sh
Start the JobHistoryServer
mr-jobhistory-daemon.sh start historyserver
Start the Spark HistoryServer
/export/server/spark/sbin/start-history-server.sh
Submit the application
SPARK_HOME=/export/server/spark
$SPARK_HOME/bin/spark-submit \
--master yarn \
--driver-memory 512M \
--executor-memory 512M \
--executor-cores 1 \
--num-executors 3 \
--class cn.itcast.spark.pack.WordCountPack \
hdfs://node1:8020/spark/apps/spark_project-1.0-SNAPSHOT.jar \
/datas/wordcount.data \
/datas/output
driver: the driver process always starts on the client machine that submits the application
executor: the number of Executors and their resources are specified by the user; which worker nodes they land on is decided and managed automatically by the cluster
If every driver starts on the same machine, two problems follow
Solution: the deploy mode, which decides which machine the driver process starts on
The difference is where the Driver runs
client mode
SPARK_HOME=/export/server/spark
$SPARK_HOME/bin/spark-submit \
--master spark://node1:7077 \
--driver-memory 512M \
--executor-memory 512M \
--executor-cores 1 \
--total-executor-cores 2 \
--class cn.itcast.spark.pack.WordCountPack \
hdfs://node1:8020/spark/apps/spark_project-1.0-SNAPSHOT.jar \
/datas/wordcount.data \
/datas/output
SPARK_HOME=/export/server/spark
$SPARK_HOME/bin/spark-submit \
--master spark://node1:7077 \
--deploy-mode client \
--driver-memory 512M \
--executor-memory 512M \
--executor-cores 1 \
--total-executor-cores 2 \
--class cn.itcast.spark.pack.WordCountPack \
hdfs://node1:8020/spark/apps/spark_project-1.0-SNAPSHOT.jar \
/datas/wordcount.data \
/datas/output
cluster mode
SPARK_HOME=/export/server/spark
$SPARK_HOME/bin/spark-submit \
--master spark://node1:7077 \
--deploy-mode cluster \
--driver-memory 512M \
--executor-memory 512M \
--executor-cores 1 \
--total-executor-cores 2 \
--class cn.itcast.spark.pack.WordCountPack \
hdfs://node1:8020/spark/apps/spark_project-1.0-SNAPSHOT.jar \
/datas/wordcount.data \
/datas/output
Spark on YARN in cluster mode: the driver process starts inside a NodeManager (it runs as part of the ApplicationMaster)
What is meant by a "network traffic surge"?
Picture it: say you have 100 executors and 1,000 tasks. When each stage runs, all 1,000 tasks are submitted to the executors, roughly 10 tasks per executor. Now the driver has to communicate frequently with the 1,000 tasks running on the executors; there are a great many messages and the communication frequency is very high. When one stage finishes the next stage runs, and the frequent communication starts all over again.
Throughout the whole lifetime of the Spark job there is constant communication and scheduling, and in client mode all of it is sent from, and received by, your local machine. That is the killer: during the job's run (say 30 minutes) your local machine performs heavy, frequent network communication, its network load becomes very high, and the traffic on its network card surges.
When several Spark programs run in parallel, cluster mode puts the different drivers on different machines, which avoids the traffic surge.