Introduction to Spark
Spark Deployment Modes
Developing Spark Programs
Implementing Sorting
sortByKey
sortBy
def sortBy[K](
    f: (T) => K,              // specifies the field to sort by
    ascending: Boolean = true,
    numPartitions: Int = this.partitions.length)
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]
top: well suited to selecting the top N from a small amount of data (the sorted result is returned to the driver as an array)
def top(num: Int)(implicit ord: Ordering[T]): Array[T]
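Before the full program below, a minimal sketch comparing the three approaches on a small, hypothetical (word, count) RDD; it assumes an existing SparkContext named sc:

import org.apache.spark.rdd.RDD

val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("spark", 10), ("hadoop", 7), ("hive", 12)))

// 1) sortByKey: only sorts by the key, so swap (word, count) -> (count, word) first
val bySortByKey = pairs.map(_.swap).sortByKey(ascending = false).take(2)

// 2) sortBy: sort by any field; negate the count to get descending order
val bySortBy = pairs.sortBy(tuple => -tuple._2).take(2)

// 3) top: returns the largest N elements according to the given Ordering
val byTop = pairs.top(2)(Ordering.by(tuple => tuple._2))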
package bigdata.itcast.cn.spark.scala.topkey

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * @ClassName SparkCoreWCTopKey
 * @Description TODO Spark Core word count with sorted results
 * @Date 2020/12/17 9:32
 * @Create By Frank
 */
object SparkCoreWCTopKey {
  def main(args: Array[String]): Unit = {
    /**
     * step1: initialize a SparkContext
     */
    //build the configuration object
    val conf = new SparkConf()
      .setAppName(this.getClass.getSimpleName.stripSuffix("$"))
      .setMaster("local[2]")
    //build the SparkContext instance: reuse it if it already exists, otherwise create it
    val sc = SparkContext.getOrCreate(conf)
    //adjust the log level
    sc.setLogLevel("WARN")
    /**
     * step2: implement the processing: read, transform, save
     */
    //todo:1-read
    val inputRdd: RDD[String] = sc.textFile("datas/wordcount/wordcount.data")
    println(s"first line = ${inputRdd.first()}")
    println(s"count = ${inputRdd.count()}")
    //todo:2-transform
    val rsRdd = inputRdd
      //filter out invalid lines: def filter(f: T => Boolean)
      .filter(line => null != line && line.trim.length > 0)
      //extract all words into a single collection
      .flatMap(line => line.trim.split("\\s+"))
      //turn each word into a tuple
      .map(word => (word,1))
      //group by word and aggregate
      .reduceByKey((tmp,item) => tmp+item)
      //option 1: sortByKey: def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length); only works on pair RDDs
      //it only sorts by the key, so swap the tuple first
      // .map(tuple => tuple.swap)
      // .sortByKey(ascending = false)
      //option 2: sortBy
      // .sortBy(tuple => -tuple._2)
      // .take(3)
      //option 3: top: sorts the data directly by the given Ordering (descending) and returns the top N values
      .top(3)(Ordering.by(tuple => tuple._2))
      .foreach(println)
    //todo:3-save
    /**
     * step3: release resources
     */
    Thread.sleep(1000000L) //keep the application alive so the web UI can be inspected
    sc.stop()
  }
}
Feedback and questions
First, package the program into a jar
package bigdata.itcast.cn.spark.scala.wordcount

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * @ClassName SparkCoreWordCount
 * @Description TODO Hand-written WordCount implementation
 * @Date 2020/12/16 17:22
 * @Create By Frank
 */
object SparkCoreWordCount {
  def main(args: Array[String]): Unit = {
    /**
     * step1: build the SparkContext first: initializes the resource objects
     */
    //build a SparkConf: manages all configuration for this program
    val conf = new SparkConf()
      //give the program a name
      .setAppName(this.getClass.getSimpleName.stripSuffix("$"))
      //set the run mode (left unset here so --master can be passed via spark-submit)
      // .setMaster("local[2]")
    //build a SparkContext object
    val sc = new SparkContext(conf)
    //adjust the log level
    sc.setLogLevel("WARN")
    /**
     * step2: process the data
     */
    //todo:1-read the data
    //use args(0) as the input path
    val inputRdd: RDD[String] = sc.textFile(args(0))
    //todo:2-process the data
    val rsRdd = inputRdd
      .filter(line => null != line && line.trim.length > 0)
      .flatMap(line => line.trim.split("\\s+"))
      .map(word => word -> 1)
      .reduceByKey((tmp,item) => tmp+item)
    //todo:3-save the results
    rsRdd.foreach(println)
    rsRdd.saveAsTextFile(args(1)+"-"+System.currentTimeMillis())
    /**
     * step3: release resources
     */
    // Thread.sleep(1000000L)
    sc.stop()
  }
}
Upload the jar to the Linux server and put it on HDFS so it can be run directly from any machine:
hdfs dfs -mkdir /spark/apps
hdfs dfs -put spark-chapter01_2.11-1.0.0.jar.jar /spark/apps/
Run with spark-submit
Usage
[root@node1 spark]# bin/spark-submit -h
Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn,
k8s://https://host:port, or local (Default: local[*]).
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while
resolving the dependencies provided in --packages to avoid
dependency conflicts.
--repositories Comma-separated list of additional remote repositories to
search for the maven coordinates given with --packages.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor. File paths of these files
in executors can be accessed via SparkFiles.get(fileName).
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--proxy-user NAME User to impersonate when submitting the application.
This argument does not work with --principal / --keytab.
--help, -h Show this help message and exit.
--verbose, -v Print additional debug output.
--version, Print the version of current Spark.
Cluster deploy mode only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode
(Default: 1).
Spark standalone or Mesos with cluster deploy mode only:
--supervise If given, restarts the driver on failure.
--kill SUBMISSION_ID If given, kills the driver specified.
--status SUBMISSION_ID If given, requests the status of the driver specified.
Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.
Spark standalone and YARN only:
--executor-cores NUM Number of cores per executor. (Default: 1 in YARN mode,
or all available cores on the worker in standalone mode)
YARN-only:
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of
executors will be at least NUM.
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
--principal PRINCIPAL Principal to be used to login to KDC, while running on
secure HDFS.
--keytab KEYTAB The full path to the file that contains the keytab for the
principal specified above. This keytab will be copied to
the node running the Application Master via the Secure
Distributed Cache, for renewing the login tickets and the
delegation tokens periodically.
Commonly used options
Submission options
--master: specifies where the application is submitted: local, Standalone, YARN, Mesos, or Kubernetes
local[2]
spark://host:7077
yarn
--deploy-mode: specifies the deploy mode, either client or cluster
--class: specifies which class inside the jar to run
--jars: adds extra jars that the application depends on
--conf: temporarily overrides a Spark configuration property
Resource configuration for the Driver process
--driver-memory MEM: memory for the Driver process
Resource configuration for the Executor processes
Common to all cluster managers
--executor-memory: memory each Executor can use (Default: 1G)
Spark standalone and Mesos
--total-executor-cores: total number of CPU cores across all Executors
Spark standalone and YARN
--executor-cores: number of CPU cores per Executor
YARN-only
--num-executors: number of Executors to launch
Question: in a Standalone cluster, how do you specify the number of Executors? There is no --num-executors option; the number of Executors is effectively total-executor-cores / executor-cores.
SPARK_HOME=/export/server/spark
$SPARK_HOME/bin/spark-submit \
--master local[2] \
--class bigdata.itcast.cn.spark.scala.wordcount.SparkCoreWordCount \
hdfs://node1:8020/spark/apps/spark-chapter01_2.11-1.0.0.jar.jar \
/datas/wordcount.data \
/datas/output
Submit directly to the Standalone cluster (default resources)
SPARK_HOME=/export/server/spark
$SPARK_HOME/bin/spark-submit \
--master spark://node1:7077 \
--class bigdata.itcast.cn.spark.scala.wordcount.SparkCoreWordCount \
hdfs://node1:8020/spark/apps/spark-chapter01_2.11-1.0.0.jar.jar \
/datas/wordcount.data \
/datas/output
Adjust the Executor and Driver resources
SPARK_HOME=/export/server/spark
$SPARK_HOME/bin/spark-submit \
--master spark://node1:7077 \
--driver-memory 512M \
--executor-memory 512M \
--executor-cores 1 \
--total-executor-cores 2 \
--class bigdata.itcast.cn.spark.scala.wordcount.SparkCoreWordCount \
hdfs://node1:8020/spark/apps/spark-chapter01_2.11-1.0.0.jar.jar \
/datas/wordcount.data \
/datas/output
Modify yarn-site.xml
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>
<property>
    <name>yarn.log.server.url</name>
    <value>http://node1:19888/jobhistory/logs</value>
</property>
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
Distribute the configuration to the other nodes
cd /export/server/hadoop/etc/hadoop
scp -r yarn-site.xml root@node2:$PWD
scp -r yarn-site.xml root@node3:$PWD
Stop the Spark cluster
cd /export/server/spark
sbin/stop-master.sh
sbin/stop-slaves.sh
sbin/stop-history-server.sh
Modify spark-env.sh
# add the YARN configuration directory
YARN_CONF_DIR=/export/server/hadoop/etc/hadoop
Configure where Spark's jars are stored on HDFS: this solves the problem of YARN not having Spark's dependency jars when running Spark applications
hdfs dfs -mkdir -p /spark/apps/jars/
hdfs dfs -put /export/server/spark/jars/* /spark/apps/jars/
Modify spark-defaults.conf
# so the Spark application monitoring UI can be reached from the YARN UI on port 8088: clicking History in YARN redirects to port 18080
spark.yarn.historyServer.address node1:18080
# location of the Spark jars used when running on YARN
spark.yarn.jars hdfs://node1:8020/spark/apps/jars/*
Distribute the configuration
cd /export/server/spark/conf/
scp spark-env.sh spark-defaults.conf node2:$PWD
scp spark-env.sh spark-defaults.conf node3:$PWD
Start YARN
start-yarn.sh
Start the MapReduce JobHistoryServer
mr-jobhistory-daemon.sh start historyserver
Start the Spark HistoryServer
/export/server/spark/sbin/start-history-server.sh
Submit the program
SPARK_HOME=/export/server/spark
$SPARK_HOME/bin/spark-submit \
--master yarn \
--driver-memory 512M \
--executor-memory 512M \
--executor-cores 1 \
--num-executors 3 \
--class bigdata.itcast.cn.spark.scala.wordcount.SparkCoreWordCount \
hdfs://node1:8020/spark/apps/spark-chapter01_2.11-1.0.0.jar.jar \
/datas/wordcount.data \
/datas/output
Problem: every time a new Spark program is submitted, a Driver has to run somewhere
Driver responsibilities: requesting resources, parsing and scheduling the job, distributing Tasks, and monitoring them
Solution: the deploy mode decides on which machine the Driver process is started
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
The difference is where the Driver runs
client mode (the default; the two submissions below are equivalent, the second simply sets --deploy-mode client explicitly)
SPARK_HOME=/export/server/spark
$SPARK_HOME/bin/spark-submit \
--master spark://node1:7077 \
--driver-memory 512M \
--executor-memory 512M \
--executor-cores 1 \
--total-executor-cores 2 \
--class bigdata.itcast.cn.spark.scala.wordcount.SparkCoreWordCount \
hdfs://node1:8020/spark/apps/spark-chapter01_2.11-1.0.0.jar.jar \
/datas/wordcount.data \
/datas/output
SPARK_HOME=/export/server/spark
$SPARK_HOME/bin/spark-submit \
--master spark://node1:7077 \
--deploy-mode client \
--driver-memory 512M \
--executor-memory 512M \
--executor-cores 1 \
--total-executor-cores 2 \
--class bigdata.itcast.cn.spark.scala.wordcount.SparkCoreWordCount \
hdfs://node1:8020/spark/apps/spark-chapter01_2.11-1.0.0.jar.jar \
/datas/wordcount.data \
/datas/output
cluster mode
SPARK_HOME=/export/server/spark
$SPARK_HOME/bin/spark-submit \
--master spark://node1:7077 \
--deploy-mode cluster \
--driver-memory 512M \
--executor-memory 512M \
--executor-cores 1 \
--total-executor-cores 2 \
--class bigdata.itcast.cn.spark.scala.wordcount.SparkCoreWordCount \
hdfs://node1:8020/spark/apps/spark-chapter01_2.11-1.0.0.jar.jar \
/datas/wordcount.data \
/datas/output
The process of running a MapReduce program on YARN
Spark on YARN client mode: the Driver process starts on the client machine
Both the Driver and the ApplicationMaster exist, as two separate processes
step1: the Spark Driver submits the application to the ResourceManager and requests that an ApplicationMaster be started
step2: the ApplicationMaster requests resources from the ResourceManager to start the Executors
step3: all Executors register back with the Driver process
step4: the Driver parses the job and assigns Tasks to the Executors to run
Spark on YARN cluster mode: the Driver process starts inside a NodeManager container
The Driver and the ApplicationMaster are merged into a single process
Definition: Resilient Distributed Datasets (RDDs)
Resilient: an RDD's data, including intermediate data produced during transformations, is held in memory whenever possible, and lost partitions can be recomputed
Distributed: an RDD can have multiple partitions, similar to splits in MapReduce
Dataset: like a collection; data of various types can be stored in an RDD
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection
in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
* Internally, each RDD is characterized by five main properties:
*
* - A list of partitions
* - A function for computing each split
* - A list of dependencies on other RDDs
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
* an HDFS file)
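These five properties can be inspected directly through the RDD API. A minimal sketch, assuming an existing SparkContext named sc and the word-count data path used earlier in these notes:

import org.apache.spark.rdd.RDD

val rdd: RDD[(String, Int)] = sc
  .textFile("datas/wordcount/wordcount.data")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

println(s"partitions   = ${rdd.partitions.length}")   // 1. the list of partitions
println(s"dependencies = ${rdd.dependencies}")        // 3. dependencies on parent RDDs
println(s"partitioner  = ${rdd.partitioner}")         // 4. Some(HashPartitioner) after reduceByKey
rdd.partitions.foreach { p =>
  println(s"preferred locations of partition ${p.index}: ${rdd.preferredLocations(p)}") // 5. data locality hints
}
// property 2 (a function to compute each split) is the compute() method Spark calls internally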
Method 1: parallelizing an existing collection
// assumes an existing SparkContext sc and: import org.apache.spark.TaskContext
val seqData: Seq[Int] = Seq(1,2,3,4,5,6,7,8,9,10)
val seqRDD: RDD[Int] = sc.parallelize(seqData)
//print the number of partitions
println(s"Number of partitions: ${seqRDD.getNumPartitions}")
//todo:2-transform the data
//todo:3-output the result
seqRDD.foreach(num => {
  //print the partition id and the current value
  println(s"partition=${TaskContext.getPartitionId()} num=${num}")
})
Method 2: reading from an external storage system: HDFS / MySQL / HBase
Regular file reading: sc.textFile
Reading many small files: sc.wholeTextFiles
val inputRDD = sc
  //read all files; each file becomes one (fileName, fileContent) element of the RDD
  .wholeTextFiles("datas/ratings100/", minPartitions = 2)
  //split each file's content into individual lines
  .flatMap(tuple => tuple._2.split("\\n"))
println(s"Partitions Number = ${inputRDD.getNumPartitions}")
println(s"Count = ${inputRDD.count()}")
Checking the number of partitions
The default number of partitions
Setting the number of partitions. Rule of thumb: aim for the number of RDD partitions to be 2 to 3 times the total number of Executor CPU cores in the cluster
The number of RDD partitions can be adjusted as appropriate
How the number of partitions is determined
1) The number of CPU cores specified at startup determines one parameter:
spark.default.parallelism = the specified number of CPU cores (at least 2 in cluster mode)
2) For a Scala collection, call parallelize(collection, numPartitions)
If no partition count is given, spark.default.parallelism is used
If one is given, the given value is used (do not make it larger than spark.default.parallelism)
3) For textFile(file, numPartitions)
defaultMinPartitions
If no partition count is given, sc.defaultMinPartitions = min(defaultParallelism, 2)
If one is given, sc.defaultMinPartitions = the given value
The resulting number of RDD partitions
For a local file
number of RDD partitions = max(number of local file splits, sc.defaultMinPartitions)
For an HDFS file
number of RDD partitions = max(number of HDFS blocks of the file, sc.defaultMinPartitions)
So if more than one core is allocated and the RDD is created by reading a file, the RDD may still end up with 2 partitions even if the HDFS file has only one split
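These rules can be verified by printing getNumPartitions for RDDs created in different ways. A minimal sketch, assuming a local[2] master and the word-count data path used earlier; the actual values depend on the allocated cores and the file layout:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("PartitionDemo")
  .setMaster("local[2]")                                  // 2 cores => spark.default.parallelism = 2
val sc = SparkContext.getOrCreate(conf)

// parallelize without a partition count: uses spark.default.parallelism
println(sc.parallelize(1 to 10).getNumPartitions)         // expected: 2

// parallelize with an explicit partition count: uses the given value
println(sc.parallelize(1 to 10, 4).getNumPartitions)      // expected: 4

// textFile without a partition count: minPartitions = min(defaultParallelism, 2) = 2,
// final partitions = max(number of file splits, minPartitions)
println(sc.textFile("datas/wordcount/wordcount.data").getNumPartitions)

// textFile with an explicit minimum number of partitions
println(sc.textFile("datas/wordcount/wordcount.data", 4).getNumPartitions)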
Transformations
Characteristics: lazy evaluation; each transformation returns a new RDD
Commonly used
filter
def filter(f: T => Boolean): RDD[T]
map
def map[U: ClassTag](f: T => U): RDD[U]
flatMap
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]
reduceByKey
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
sortByKey
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length): RDD[(K, V)]
Actions
Characteristics: trigger job execution; results are returned to the Driver or written to external storage
Commonly used
foreach
def foreach(f: T => Unit): Unit
saveAsTextFile
def saveAsTextFile(path: String): Unit
count: returns the number of elements in the RDD
first: returns the first element of the RDD
take(N): takes N elements from the RDD and returns them in an array
top(N): sorts the RDD and returns the top N elements in an array
collect: returns all elements of the RDD in an array (pulled back to the Driver)
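A minimal sketch of these actions on a small RDD, assuming an existing SparkContext named sc; the output path at the end is only a placeholder:

val nums = sc.parallelize(Seq(5, 3, 8, 1, 9, 2))

println(nums.count())                  // 6
println(nums.first())                  // the first element of the first partition
println(nums.take(3).mkString(","))    // the first 3 elements, in partition order
println(nums.top(3).mkString(","))     // the largest 3 elements: 9,8,5
println(nums.collect().mkString(","))  // all elements, pulled back to the Driver

nums.foreach(println)                  // runs on the Executors; prints to Executor stdout
nums.saveAsTextFile(s"datas/output-actions-${System.currentTimeMillis()}")  // one output file per partition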
<repositories>
    <repository>
        <id>aliyun</id>
        <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    </repository>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
    <repository>
        <id>jboss</id>
        <url>http://repository.jboss.com/nexus/content/groups/public</url>
    </repository>
</repositories>

<properties>
    <scala.version>2.11.12</scala.version>
    <scala.binary.version>2.11</scala.binary.version>
    <spark.version>2.4.5</spark.version>
    <hadoop.version>2.6.0-cdh5.16.2</hadoop.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
</dependencies>

<build>
    <outputDirectory>target/classes</outputDirectory>
    <testOutputDirectory>target/test-classes</testOutputDirectory>
    <resources>
        <resource>
            <directory>${project.basedir}/src/main/resources</directory>
        </resource>
    </resources>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.0</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.0</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>