Spark In-Memory Computing

Apache Spark

Overview

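Spark is a lightning-fast unified analytics engine (a compute framework) for large-scale data processing. For batch workloads, Spark is reported to run roughly 10 to 100 times faster than Hadoop MapReduce. One reason is its DAG-based task scheduling: a job is split into several stages, and the stages are submitted in batches to the cluster's compute nodes.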


MapReduce VS Spark

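MapReduce, the first-generation big data processing framework, was designed simply to meet the urgent need to compute over massive datasets. Hadoop was split out of the Nutch project (a Java search engine) in 2006, and it mainly addressed the problems people faced with their early, basic understanding of big data.

MapReduce computation is built on disk I/O. As big data technology spread, people began to rethink how such data should be processed: finishing a job within a reasonable time was no longer enough, and latency requirements became stricter. People also started to use MapReduce for complex, higher-order algorithms that usually cannot be completed in a single MapReduce pass. Because the MapReduce model always writes results to disk, every iteration has to load the data from disk back into memory, which adds latency to each subsequent iteration.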

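Spark was created in 2009 at the AMPLab of the University of California, Berkeley. It became popular with developers soon after it was open-sourced in 2010, entered the Apache Incubator in June 2013, and became an Apache top-level project in February 2014. Spark grew this quickly because its compute layer clearly outperforms Hadoop MapReduce's disk-based iteration: Spark can compute on data in memory and can also cache intermediate results in memory, which saves time in later iterations and greatly improves efficiency on massive datasets.

The Spark project also published a comparison of MapReduce and Spark on an iterative regression computation (an algorithm that needs n iterations), in which Spark ran roughly 10 to 100 times faster than MapReduce.

Spark's design also follows a "one stack to rule them all" strategy: on top of Spark's batch engine it provides further compute services, such as interactive queries, near-real-time stream processing, machine learning, and graph computation with GraphX.

Apache Spark sits at the compute layer and connects the storage layer below with the applications above; it does not discard existing Hadoop-based big data solutions. Spark can read data from HDFS, HBase, Cassandra, and Amazon S3, so adopting Spark as the compute layer requires no change to the user's existing storage architecture.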

Computation Flow

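Spark appeared after MapReduce and learned from its design, avoiding many of the weaknesses of MapReduce computation. First, recall how a MapReduce job runs. Its main drawbacks are:

  • Although MapReduce follows a data-parallel programming model, its computation model is too simple: a job is divided only into a Map stage and a Reduce stage, with no consideration for iterative computation.
  • Intermediate results of Map tasks are stored on local disk, which causes heavy I/O and poor read/write efficiency.
  • MapReduce submits the job first and requests resources during the computation. Execution is also heavyweight: each unit of parallelism runs as a separate JVM process.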

These drawbacks show where MapReduce falls short. Building on that experience, Spark introduced the DAGScheduler and the TaskScheduler. A MapReduce job has only two stages, Map and Reduce, which does not suit workloads with many iterations. Spark instead splits a job into stages: at the start of a job, the DAGScheduler computes the job's stages and wraps each stage into a TaskSet (a set of tasks), and the TaskScheduler then submits each TaskSet to the cluster for execution.
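The stage-splitting idea can be illustrated with a toy model in plain Scala. This is only a sketch under simplifying assumptions: the lineage is a linear chain, every operation that needs a shuffle starts a new stage, and the names `Op` and `splitIntoStages` are hypothetical and not part of Spark's scheduler API.

```scala
// Toy model only: Op and splitIntoStages are hypothetical names,
// not part of Spark's DAGScheduler.
object StageSplitSketch {
  // One operation in a linear lineage; shuffle = true marks a wide dependency.
  final case class Op(name: String, shuffle: Boolean)

  // Walk the lineage in order and start a new stage at every shuffle boundary.
  def splitIntoStages(ops: List[Op]): List[List[String]] =
    ops.foldLeft(List(List.empty[String])) { (stages, op) =>
      if (op.shuffle && stages.head.nonEmpty) List(op.name) :: stages
      else (op.name :: stages.head) :: stages.tail
    }.map(_.reverse).reverse

  def main(args: Array[String]): Unit = {
    val wordCount = List(
      Op("textFile", shuffle = false),
      Op("flatMap", shuffle = false),
      Op("map", shuffle = false),
      Op("reduceByKey", shuffle = true),
      Op("sortBy", shuffle = true),
      Op("saveAsTextFile", shuffle = false)
    )
    // Each printed line stands for one TaskSet handed to the task scheduler.
    splitIntoStages(wordCount).zipWithIndex.foreach { case (stage, i) =>
      println(s"Stage $i: ${stage.mkString(" -> ")}")
    }
  }
}
```

In this model the word-count lineage falls into three stages; a real scheduler works on a full dependency graph rather than a list, so the sketch only conveys where the boundaries come from.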

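Compared with MapReduce, Spark has the following advantages:

1) Smart DAG-based job splitting: a complex computation is divided into several stages, which suits iterative workloads.

2) Caching and fault-tolerance strategies: results can be stored in memory or on disk, which speeds up each stage and improves overall efficiency.

3) Compute resources are requested up front, before the job starts. Parallelism is achieved by starting threads inside Executor processes, which is lighter than MapReduce's approach of starting a process per task.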
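The benefit of caching for iterative work can be sketched as follows. The example assumes Spark 2.4.x on the classpath and a local master; the object name, the data, and the toy update rule (a fixed-step iteration that moves an estimate toward the mean) are all illustrative assumptions rather than anything from the original material.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheIterationSketch {
  // Scans the same RDD in every round; cache() keeps its partitions in
  // memory after the first pass so later rounds do not rebuild them.
  def iterate(sc: SparkContext, rounds: Int): Double = {
    val points = sc.parallelize(1 to 1000).map(_.toDouble).cache()
    var estimate = 0.0
    for (_ <- 1 to rounds) {
      val current = estimate // copy into a val so the closure captures a plain value
      val gradient = points.map(p => current - p).sum() / points.count()
      estimate = current - 0.5 * gradient
    }
    estimate
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("CacheIterationSketch"))
    try println(f"estimate after 20 rounds: ${iterate(sc, 20)}%.3f")
    finally sc.stop()
  }
}
```

Each round launches new jobs over `points`; with the cache in place those jobs read the in-memory partitions, and the estimate approaches the mean of the data (500.5).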

Tip: Spark's Cluster Manager currently has implementations for YARN, Standalone, Mesos, and Kubernetes. YARN and Standalone are the ones most commonly used in enterprises.

Environment Setup

Download: https://www.apache.org/dyn/closer.lua/spark/spark-2.4.5/spark-2.4.5-bin-without-hadoop.tgz

Spark On Yarn

Hadoop Environment

  • Set the CentOS process and open-file limits (optional)
[root@CentOS ~]# vi /etc/security/limits.conf

* soft nofile 204800
* hard nofile 204800
* soft nproc 204800
* hard nproc 204800

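These limits tune Linux performance by raising the maximums; reboot CentOS for the change to take effect.

  • Configure the hostname (takes effect after a reboot)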
[root@CentOS ~]# vi /etc/hostname
CentOS
[root@CentOS ~]# reboot

On CentOS 6, edit /etc/sysconfig/network instead.

  • Configure the IP mapping
[root@CentOS ~]# vi /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.73.135 CentOS
  • Firewall service
# Stop the service temporarily
[root@CentOS ~]# systemctl stop firewalld
[root@CentOS ~]# firewall-cmd --state
not running
# Disable start on boot
[root@CentOS ~]# systemctl disable firewalld

On CentOS 6: service iptables stop (stops the service) / chkconfig iptables off (disables start on boot)

  • Install JDK 1.8+
[root@CentOS ~]# rpm -ivh jdk-8u171-linux-x64.rpm 
[root@CentOS ~]# ls -l /usr/java/
total 4
lrwxrwxrwx. 1 root root   16 Mar 26 00:56 default -> /usr/java/latest
drwxr-xr-x. 9 root root 4096 Mar 26 00:56 jdk1.8.0_171-amd64
lrwxrwxrwx. 1 root root   28 Mar 26 00:56 latest -> /usr/java/jdk1.8.0_171-amd64
[root@CentOS ~]# vi .bashrc 
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin
CLASSPATH=.
export JAVA_HOME
export PATH
export CLASSPATH
[root@CentOS ~]# source ~/.bashrc
  • Configure passwordless SSH
[root@CentOS ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): 
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
4b:29:93:1c:7f:06:93:67:fc:c5:ed:27:9b:83:26:c0 root@CentOS
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|         o   . . |
|      . + +   o .|
|     . = * . . . |
|      = E o . . o|
|       + =   . +.|
|        . . o +  |
|           o   . |
|                 |
+-----------------+
[root@CentOS ~]# ssh-copy-id CentOS
The authenticity of host 'centos (192.168.40.128)' can't be established.
RSA key fingerprint is 3f:86:41:46:f2:05:33:31:5d:b6:11:45:9c:64:12:8e.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'centos,192.168.40.128' (RSA) to the list of known hosts.
root@centos's password: 
Now try logging into the machine, with "ssh 'CentOS'", and check in:

  .ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.
[root@CentOS ~]# ssh root@CentOS
Last login: Tue Mar 26 01:03:52 2019 from 192.168.40.1
[root@CentOS ~]# exit
logout
Connection to CentOS closed.
  • Configure HDFS and YARN

Extract hadoop-2.9.2.tar.gz into /usr, then edit the [core|hdfs|yarn|mapred]-site.xml configuration files.

[root@CentOS ~]# vi /usr/hadoop-2.9.2/etc/hadoop/core-site.xml


<property>
    <name>fs.defaultFS</name>
    <value>hdfs://CentOS:9000</value>
</property>

<property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hadoop-2.9.2/hadoop-${user.name}</value>
</property>

[root@CentOS ~]# vi /usr/hadoop-2.9.2/etc/hadoop/hdfs-site.xml


<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>

<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>CentOS:50090</value>
</property>

<property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
</property>

<property>
    <name>dfs.datanode.handler.count</name>
    <value>6</value>
</property>

[root@CentOS ~]# vi /usr/hadoop-2.9.2/etc/hadoop/yarn-site.xml


<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>

<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>CentOS</value>
</property>

<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>

<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>

[root@CentOS ~]# vi /usr/hadoop-2.9.2/etc/hadoop/mapred-site.xml


<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
  • Configure the Hadoop environment variables
[root@CentOS ~]# vi .bashrc
JAVA_HOME=/usr/java/latest
HADOOP_HOME=/usr/hadoop-2.9.2
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
CLASSPATH=.
export JAVA_HOME
export CLASSPATH
export PATH
export HADOOP_HOME
[root@CentOS ~]# source .bashrc
  • Start the Hadoop services
[root@CentOS ~]# hdfs namenode -format # creates the fsimage file needed for initialization
[root@CentOS ~]# start-dfs.sh
[root@CentOS ~]# start-yarn.sh
[root@CentOS ~]# jps
122690 NodeManager
122374 SecondaryNameNode
122201 DataNode
122539 ResourceManager
122058 NameNode
123036 Jps

Visit http://CentOS:8088 and http://centos:50070/

Spark Environment

Download spark-2.4.5-bin-without-hadoop.tgz, extract it into /usr, rename the Spark directory to spark-2.4.5, and then edit the spark-env.sh and spark-defaults.conf files.

  • Extract and install Spark
[root@CentOS ~]# tar -zxf spark-2.4.5-bin-without-hadoop.tgz -C /usr/
[root@CentOS ~]# mv /usr/spark-2.4.5-bin-without-hadoop/ /usr/spark-2.4.5
[root@CentOS ~]# tree -L 1 /usr/spark-2.4.5/
/usr/spark-2.4.5/
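├── bin  # user-facing commands such as spark-submit and spark-shell
├── conf # Spark configuration directory
├── data
├── examples # official examples shipped with Spark
├── jars
├── kubernetes
├── LICENSE
├── licenses
├── NOTICE
├── python
├── R
├── README.md
├── RELEASE
├── sbin # cluster start/stop scripts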
└── yarn

  • Configure Spark
[root@CentOS ~]# cd /usr/spark-2.4.5/
[root@CentOS spark-2.4.5]# mv conf/spark-env.sh.template conf/spark-env.sh
[root@CentOS spark-2.4.5]# vi conf/spark-env.sh 
# Options read in YARN client/cluster mode
# - SPARK_CONF_DIR, Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - YARN_CONF_DIR, to point Spark towards YARN configuration files when you use YARN
# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)

HADOOP_CONF_DIR=/usr/hadoop-2.9.2/etc/hadoop
YARN_CONF_DIR=/usr/hadoop-2.9.2/etc/hadoop
SPARK_EXECUTOR_CORES=2
SPARK_EXECUTOR_MEMORY=1G
SPARK_DRIVER_MEMORY=1G
LD_LIBRARY_PATH=/usr/hadoop-2.9.2/lib/native

export HADOOP_CONF_DIR
export YARN_CONF_DIR
export SPARK_EXECUTOR_CORES
export SPARK_DRIVER_MEMORY
export SPARK_EXECUTOR_MEMORY
export LD_LIBRARY_PATH

export SPARK_DIST_CLASSPATH=$(hadoop classpath):$SPARK_DIST_CLASSPATH
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs:///spark-logs"
[root@CentOS spark-2.4.5]# mv conf/spark-defaults.conf.template conf/spark-defaults.conf
[root@CentOS spark-2.4.5]# vi conf/spark-defaults.conf 
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs:///spark-logs

First create the spark-logs directory on HDFS; the Spark history server stores historical job data there.

[root@CentOS ~]# hdfs dfs -mkdir /spark-logs
  • Start the Spark history server
[root@CentOS spark-2.4.5]# ./sbin/start-history-server.sh
[root@CentOS spark-2.4.5]# jps
124528 HistoryServer
122690 NodeManager
122374 SecondaryNameNode
122201 DataNode
122539 ResourceManager
122058 NameNode
124574 Jps

  • Open http://<host IP>:18080 to view the Spark History Server

Testing the Environment

[root@CentOS spark-2.4.5]# ./bin/spark-submit \
							--master yarn \
							--deploy-mode client \
							--class org.apache.spark.examples.SparkPi \
							--num-executors 2 \
							--executor-cores 3 \
							/usr/spark-2.4.5/examples/jars/spark-examples_2.11-2.4.5.jar

Result:

19/04/21 03:30:39 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 6609 ms on CentOS (executor 1) (1/2)
19/04/21 03:30:39 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 6403 ms on CentOS (executor 1) (2/2)
19/04/21 03:30:39 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
19/04/21 03:30:39 INFO scheduler.DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 29.116 s
19/04/21 03:30:40 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 30.317103 s
`Pi is roughly 3.141915709578548`
19/04/21 03:30:40 INFO server.AbstractConnector: Stopped Spark@41035930{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
19/04/21 03:30:40 INFO ui.SparkUI: Stopped Spark web UI at http://CentOS:4040
19/04/21 03:30:40 INFO cluster.YarnClientSchedulerBackend: Interrupting monitor thread
19/04/21 03:30:40 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors
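Parameter          Description
--master           name of the resource manager to connect to: yarn
--deploy-mode      deployment mode, either client or cluster; determines whether the Driver program runs remotely
--class            name of the main class to run
--num-executors    number of Executor processes the job needs
--executor-cores   maximum number of CPU cores each Executor may use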
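The same settings can also be expressed as configuration properties. The sketch below builds a `SparkConf` with the property keys that correspond to the flags in the table; the object name is hypothetical, and the values simply mirror the command shown earlier. It only constructs and prints the configuration and does not contact a cluster.

```scala
import org.apache.spark.SparkConf

object YarnSubmitConfSketch {
  // Property keys corresponding to the spark-submit flags listed above.
  def build(): SparkConf =
    new SparkConf(false)                        // false: skip loading external defaults
      .setMaster("yarn")                        // --master yarn
      .set("spark.submit.deployMode", "client") // --deploy-mode client
      .set("spark.executor.instances", "2")     // --num-executors 2
      .set("spark.executor.cores", "3")         // --executor-cores 3

  def main(args: Array[String]): Unit =
    println(build().toDebugString)
}
```

Flags given on the command line are usually the more convenient choice, since they keep cluster-specific values out of the application code.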

Spark shell

[root@CentOS spark-2.4.5]# ./bin/spark-shell --master yarn --deploy-mode client  --executor-cores 4 --num-executors 3
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/04/07 18:42:20 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Spark context Web UI available at http://CentOS:4040
Spark context available as 'sc' (master = yarn, app id = application_1586255024224_0002).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231)
Type in expressions to have them evaluated.
Type :help for more information.

scala> var lines=sc.textFile("hdfs:///words/src")
lines: org.apache.spark.rdd.RDD[String] = hdfs:///words/src MapPartitionsRDD[1] at textFile at <console>:24

scala> lines.flatMap(line=>line.split("\\s+")).groupBy(word=>word).map(t=>(t._1,t._2.size)).sortBy(t=>t._2,false,3).collect

res4: Array[(String, Int)] = Array((good,2), (day,2), (this,1), (is,1), (come,1), (baby,1), (up,1), (a,1), (on,1), (demo,1), (study,1))

scala> lines.flatMap(line=>line.split("\\s+")).map(word=>(word,1)).reduceByKey(_+_).sortBy(t=>t._2,false,3).saveAsTextFile("hdfs:///words/results")

Spark Standalone

Hadoop Environment

  • Set the CentOS process and open-file limits (optional)
[root@CentOS ~]# vi /etc/security/limits.conf

* soft nofile 204800
* hard nofile 204800
* soft nproc 204800
* hard nproc 204800

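These limits tune Linux performance by raising the maximums; reboot CentOS for the change to take effect.

  • Configure the hostname (takes effect after a reboot)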
[root@CentOS ~]# vi /etc/hostname
CentOS
[root@CentOS ~]# reboot
  • Configure the IP mapping
[root@CentOS ~]# vi /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.52.135 CentOS
  • Firewall service
# Stop the service temporarily
[root@CentOS ~]# systemctl stop firewalld
[root@CentOS ~]# firewall-cmd --state
not running
# Disable start on boot
[root@CentOS ~]# systemctl disable firewalld
  • Install JDK 1.8+
[root@CentOS ~]# rpm -ivh jdk-8u171-linux-x64.rpm 
[root@CentOS ~]# ls -l /usr/java/
total 4
lrwxrwxrwx. 1 root root   16 Mar 26 00:56 default -> /usr/java/latest
drwxr-xr-x. 9 root root 4096 Mar 26 00:56 jdk1.8.0_171-amd64
lrwxrwxrwx. 1 root root   28 Mar 26 00:56 latest -> /usr/java/jdk1.8.0_171-amd64
[root@CentOS ~]# vi .bashrc 
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin
CLASSPATH=.
export JAVA_HOME
export PATH
export CLASSPATH
[root@CentOS ~]# source ~/.bashrc
  • Configure passwordless SSH
[root@CentOS ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): 
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
4b:29:93:1c:7f:06:93:67:fc:c5:ed:27:9b:83:26:c0 root@CentOS
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|         o   . . |
|      . + +   o .|
|     . = * . . . |
|      = E o . . o|
|       + =   . +.|
|        . . o +  |
|           o   . |
|                 |
+-----------------+
[root@CentOS ~]# ssh-copy-id CentOS
The authenticity of host 'centos (192.168.40.128)' can't be established.
RSA key fingerprint is 3f:86:41:46:f2:05:33:31:5d:b6:11:45:9c:64:12:8e.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'centos,192.168.40.128' (RSA) to the list of known hosts.
root@centos's password: 
Now try logging into the machine, with "ssh 'CentOS'", and check in:

  .ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.
[root@CentOS ~]# ssh root@CentOS
Last login: Tue Mar 26 01:03:52 2019 from 192.168.40.1
[root@CentOS ~]# exit
logout
Connection to CentOS closed.
  • Configure HDFS

Extract hadoop-2.9.2.tar.gz into /usr, then edit the [core|hdfs]-site.xml configuration files.

[root@CentOS ~]# vi /usr/hadoop-2.9.2/etc/hadoop/core-site.xml


<property>
    <name>fs.defaultFS</name>
    <value>hdfs://CentOS:9000</value>
</property>

<property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hadoop-2.9.2/hadoop-${user.name}</value>
</property>

[root@CentOS ~]# vi /usr/hadoop-2.9.2/etc/hadoop/hdfs-site.xml



<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>

<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>CentOS:50090</value>
</property>

<property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
</property>

<property>
    <name>dfs.datanode.handler.count</name>
    <value>6</value>
</property>

[root@CentOS ~]# vi /usr/hadoop-2.9.2/etc/hadoop/slaves

CentOS
  • Configure the Hadoop environment variables
[root@CentOS ~]# vi .bashrc
JAVA_HOME=/usr/java/latest
HADOOP_HOME=/usr/hadoop-2.9.2
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
CLASSPATH=.
export JAVA_HOME
export CLASSPATH
export PATH
export HADOOP_HOME
[root@CentOS ~]# source .bashrc
  • Start the Hadoop services
[root@CentOS ~]# hdfs namenode -format # creates the fsimage file needed for initialization
[root@CentOS ~]# start-dfs.sh
[root@CentOS ~]# jps
122374 SecondaryNameNode
122201 DataNode
122058 NameNode
123036 Jps

Visit http://centos:50070/

Spark Environment

Download spark-2.4.5-bin-without-hadoop.tgz, extract it into /usr, rename the Spark directory to spark-2.4.5, and then edit the spark-env.sh and spark-defaults.conf files.

  • Extract and install Spark
[root@CentOS ~]# tar -zxf spark-2.4.5-bin-without-hadoop.tgz -C /usr/
[root@CentOS ~]# mv /usr/spark-2.4.5-bin-without-hadoop/ /usr/spark-2.4.5
[root@CentOS ~]# tree -L 1 /usr/spark-2.4.5/
/usr/spark-2.4.5/
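├── bin  # user-facing commands such as spark-submit and spark-shell
├── conf # Spark configuration directory
├── data
├── examples # official examples shipped with Spark
├── jars
├── kubernetes
├── LICENSE
├── licenses
├── NOTICE
├── python
├── R
├── README.md
├── RELEASE
├── sbin # cluster start/stop scripts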
└── yarn
  • Configure Spark
[root@CentOS ~]# cd /usr/spark-2.4.5/
[root@CentOS spark-2.4.5]# mv conf/spark-env.sh.template conf/spark-env.sh
[root@CentOS spark-2.4.5]# vi conf/spark-env.sh 
# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_DAEMON_CLASSPATH, to set the classpath for all daemons
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers

SPARK_MASTER_HOST=CentOS
SPARK_MASTER_PORT=7077
SPARK_WORKER_CORES=4
SPARK_WORKER_INSTANCES=2
SPARK_WORKER_MEMORY=2g

export SPARK_MASTER_HOST
export SPARK_MASTER_PORT
export SPARK_WORKER_CORES
export SPARK_WORKER_MEMORY
export SPARK_WORKER_INSTANCES

export LD_LIBRARY_PATH=/usr/hadoop-2.9.2/lib/native
export SPARK_DIST_CLASSPATH=$(hadoop classpath):$SPARK_DIST_CLASSPATH
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs:///spark-logs"
[root@CentOS spark-2.4.5]# mv conf/spark-defaults.conf.template conf/spark-defaults.conf
[root@CentOS spark-2.4.5]# vi conf/spark-defaults.conf 
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs:///spark-logs

First create the spark-logs directory on HDFS; the Spark history server stores historical job data there.

[root@CentOS ~]# hdfs dfs -mkdir /spark-logs
  • Start the Spark history server
[root@CentOS spark-2.4.5]# ./sbin/start-history-server.sh
[root@CentOS spark-2.4.5]# jps
124528 HistoryServer
122690 NodeManager
122374 SecondaryNameNode
122201 DataNode
122539 ResourceManager
122058 NameNode
124574 Jps

  • Open http://<host IP>:18080 to view the Spark History Server
  • Configure the Spark worker nodes
[root@CentOS spark-2.4.5]# mv conf/slaves.template conf/slaves
[root@CentOS spark-2.4.5]# vi conf/slaves
[root@CentOS spark-2.4.5]#

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# A Spark Worker will be started on each of the machines listed below.
CentOS
  • Start Spark's own compute services (Master and Workers)
[root@CentOS spark-2.4.5]# ./sbin/start-all.sh 
starting org.apache.spark.deploy.master.Master, logging to /usr/spark-2.4.5/logs/spark-root-org.apache.spark.deploy.master.Master-1-CentOS.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /usr/spark-2.4.5/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-CentOS.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /usr/spark-2.4.5/logs/spark-root-org.apache.spark.deploy.worker.Worker-2-CentOS.out
[root@CentOS spark-2.4.5]# jps
7908 Worker
7525 HistoryServer
8165 Jps
122374 SecondaryNameNode
7751 Master
122201 DataNode
122058 NameNode
7854 Worker

The Master web UI is then available at http://CentOS:8080

Testing the Environment

[root@CentOS spark-2.4.5]# ./bin/spark-submit \
							--master spark://CentOS:7077 \
							--deploy-mode client \
							--class org.apache.spark.examples.SparkPi \
							--total-executor-cores 6 \
							/usr/spark-2.4.5/examples/jars/spark-examples_2.11-2.4.5.jar

Result:

19/04/21 03:30:39 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 6609 ms on CentOS (executor 1) (1/2)
19/04/21 03:30:39 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 6403 ms on CentOS (executor 1) (2/2)
19/04/21 03:30:39 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
19/04/21 03:30:39 INFO scheduler.DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 29.116 s
19/04/21 03:30:40 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 30.317103 s
`Pi is roughly 3.141915709578548`
19/04/21 03:30:40 INFO server.AbstractConnector: Stopped Spark@41035930{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
19/04/21 03:30:40 INFO ui.SparkUI: Stopped Spark web UI at http://CentOS:4040
19/04/21 03:30:40 INFO cluster.YarnClientSchedulerBackend: Interrupting monitor thread
19/04/21 03:30:40 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors
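Parameter                Description
--master                 address of the resource manager to connect to: spark://CentOS:7077
--deploy-mode            deployment mode, either client or cluster; determines whether the Driver program runs remotely
--class                  name of the main class to run
--total-executor-cores   total number of cores (task threads) the job may use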

Spark Shell

[root@CentOS spark-2.4.5]# ./bin/spark-shell --master spark://CentOS:7077 --total-executor-cores 6
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://CentOS:4040
Spark context available as 'sc' (master = spark://CentOS:7077, app id = app-20200207140419-0003).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sc.textFile("hdfs:///demo/words").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).sortBy(_._2,true).saveAsTextFile("hdfs:///demo/results")

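FAQ

How does Spark relate to Apache Hadoop?

Spark is a fast, general-purpose processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's Standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and newer workloads such as streaming, interactive queries with Spark SQL, and machine learning.

Does my data need to fit in memory to use Spark?

No. Spark's operators spill data to disk when it does not fit in memory, so Spark runs well on data of any size. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the storage level of the RDD (resilient distributed dataset).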

http://spark.apache.org/faq.html
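The storage-level choice mentioned in the second answer can be sketched like this. The example assumes Spark 2.4.x on the classpath and a local master; the object name and the data are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object StorageLevelSketch {
  def run(sc: SparkContext): (Long, StorageLevel) = {
    val numbers = sc.parallelize(1 to 100000)
    // MEMORY_AND_DISK: partitions that do not fit in memory are written to disk.
    // MEMORY_ONLY (what cache() uses) would drop them and recompute on demand.
    val squares = numbers.map(n => n.toLong * n).persist(StorageLevel.MEMORY_AND_DISK)
    (squares.count(), squares.getStorageLevel)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("StorageLevelSketch"))
    try {
      val (count, level) = run(sc)
      println(s"count = $count, storage level = ${level.description}")
    } finally sc.stop()
  }
}
```

With a dataset this small nothing is actually spilled; the point is only where the policy is declared.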

Spark RDD in Detail

Reference: http://spark.apache.org/docs/latest/rdd-programming-guide.html

At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.


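An RDD is created either from a file in the Hadoop file system (or any other Hadoop-supported file system) or from an existing Scala collection in the driver program, and is then transformed by calling RDD operators. Users may also ask Spark to persist an RDD in memory so that it can be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.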
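The split between transformation operators and the actions that actually trigger computation can be observed with an accumulator. This is a sketch assuming Spark 2.4.x and a failure-free local run (task retries could update the accumulator more than once); the object and variable names are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Returns (map calls seen before the action, result of the action, map calls seen after).
  def run(sc: SparkContext): (Long, Int, Long) = {
    val evaluated = sc.longAccumulator("evaluated")
    // Transformation: only recorded in the lineage, nothing runs yet.
    val doubled = sc.parallelize(1 to 10).map { x => evaluated.add(1); x * 2 }
    val before = evaluated.sum
    // Action: submits a job, which runs the map function on every element.
    val total = doubled.reduce(_ + _)
    val after = evaluated.sum
    (before, total, after)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("LazyEvaluationSketch"))
    try println(run(sc))
    finally sc.stop()
  }
}
```

Before `reduce` is called the accumulator is still zero, which shows that defining `doubled` did not process any data.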

Development Environment

  • Add the Maven dependencies

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.4.5</version>
</dependency>

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.9.2</version>
</dependency>
  • Scala compiler plugin

<plugin>
  <groupId>net.alchim31.maven</groupId>
  <artifactId>scala-maven-plugin</artifactId>
  <version>4.0.1</version>
  <executions>
    <execution>
      <id>scala-compile-first</id>
      <phase>process-resources</phase>
      <goals>
        <goal>add-source</goal>
        <goal>compile</goal>
      </goals>
    </execution>
  </executions>
</plugin>
  • Fat jar packaging plugin

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.4.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <filters>
          <filter>
            <artifact>*:*</artifact>
            <excludes>
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
      </configuration>
    </execution>
  </executions>
</plugin>
  • JDK compiler version plugin (optional)
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <version>3.2</version>
  <configuration>
    <source>1.8</source>
    <target>1.8</target>
    <encoding>UTF-8</encoding>
  </configuration>
  <executions>
    <execution>
      <phase>compile</phase>
      <goals>
        <goal>compile</goal>
      </goals>
    </execution>
  </executions>
</plugin>
  • Write the Driver
package com.baizhi.deploy

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkWordCountApplication1 {

  def main(args: Array[String]): Unit = {
    // 1. Create the SparkContext
    val conf = new SparkConf()
      .setAppName("SparkWordCountApplication")
      .setMaster("spark://CentOS:7077")
    val sc = new SparkContext(conf)

    // 2. Create the RDD (covered in detail later)
    val lineRDD: RDD[String] = sc.textFile("hdfs:///words/src")

    // 3. Describe the computation with RDD transformation operators (covered in detail later)
    val finalRDD: RDD[(String, Int)] = lineRDD.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey((v1, v2) => v1 + v2)
      .sortBy(t => t._2, false, 3)

    // 4. Call an action operator on finalRDD to trigger the job (covered in detail later)
    finalRDD.saveAsTextFile("hdfs:///words/results") // the output directory must not exist yet

    // 5. Stop the SparkContext
    sc.stop()
  }
}
  • Build the fat jar with mvn package and upload it to CentOS
  • Submit the job with spark-submit
[root@CentOS spark-2.4.5]# ./bin/spark-submit --master spark://CentOS:7077 --deploy-mode client --class  com.baizhi.deploy.SparkWordCountApplication1 --name SparkWordCountApplication --total-executor-cores 6 /root/spark-rdd-1.0-SNAPSHOT.jar

Spark also provides a way to test locally:

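import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkWordCountApplication2 {

  def main(args: Array[String]): Unit = {
    // 1. Create the SparkContext; local[6] runs Spark in-process with 6 threads
    val conf = new SparkConf()
      .setAppName("SparkWordCountApplication")
      .setMaster("local[6]")
    val sc = new SparkContext(conf)
    // Set the log level
    sc.setLogLevel("ERROR")

    // 2. Create the RDD (covered in detail later)
    val lineRDD: RDD[String] = sc.textFile("file:///D:/data/word")

    // 3. Describe the computation with RDD transformation operators (covered in detail later)
    val finalRDD: RDD[(String, Int)] = lineRDD.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey((v1, v2) => v1 + v2)
      .sortBy(t => t._2, false, 3)

    // 4. Call an action operator on finalRDD to trigger the job (covered in detail later)
    finalRDD.saveAsTextFile("file:///D:/data/results") // the output directory must not exist yet

    // 5. Stop the SparkContext
    sc.stop()
  }
}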

Add a log4j.properties file to the resources directory:

log4j.rootLogger = FATAL,stdout

log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target = System.out
log4j.appender.stdout.layout = org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern = %p %d{yyyy-MM-dd HH:mm:ss} %c %m%n

Creating RDDs

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.


Parallelized Collections

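A parallelized collection is created by calling SparkContext's parallelize or makeRDD method on an existing collection (a Scala Seq) in the Driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create a parallelized collection holding the numbers 1 to 5: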

scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26

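A parallelized collection accepts a partition-count argument that sets the parallelism of the computation. Spark runs one task for each partition. When the user does not specify it, the SparkContext chooses the number of partitions automatically from the resources allocated to the application. For example: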

[root@CentOS spark-2.4.5]# ./bin/spark-shell --master spark://CentOS:7077 --total-executor-cores 6
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://CentOS:4040
Spark context available as 'sc' (master = spark://CentOS:7077, app id = app-20200208013551-0006).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 

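With 6 cores allocated, the system automatically uses 6 partitions when parallelizing a collection. Users can also set the number of partitions manually: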

scala> val distData = sc.parallelize(data,10)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:26

scala> distData.getNumPartitions
res1: Int = 10
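To make the effect of the partition count concrete, the following plain-Scala sketch cuts a local collection into contiguous, nearly equal ranges. It is a simplified model written for illustration, under the assumption that a parallelized collection is sliced this way; `SliceSketch` and `slice` are hypothetical names, not Spark API.

```scala
// Simplified model for illustration; not Spark source code.
object SliceSketch {
  // Cut `data` into `numSlices` contiguous ranges of nearly equal size.
  def slice[T](data: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    require(numSlices >= 1, "numSlices must be positive")
    (0 until numSlices).map { i =>
      val start = ((i.toLong * data.length) / numSlices).toInt
      val end = (((i + 1).toLong * data.length) / numSlices).toInt
      data.slice(start, end)
    }
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(1, 2, 3, 4, 5)
    // Three partitions over five elements.
    slice(data, 3).foreach(p => println(p.mkString("[", ",", "]")))
    // Ten partitions over five elements: some partitions stay empty.
    println(slice(data, 10).count(_.nonEmpty))
  }
}
```

Asking for more partitions than there are elements leaves some partitions empty, so a very large partition count on a small collection mostly adds scheduling overhead.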

External Datasets

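Spark can create distributed datasets from any storage source supported by Hadoop, including the local file system, HDFS, HBase, Amazon S3, and relational databases.

  • Local file system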
scala> sc.textFile("file:///root/t_word").collect
res6: Array[String] = Array(this is a demo, hello spark, "good good study ", "day day up ", come on baby)
  • Reading from HDFS

textFile

Converts a text file into an RDD[String]; each line of text becomes one element of the RDD.

scala> sc.textFile("hdfs:///demo/words").collect
res7: Array[String] = Array(this is a demo, hello spark, "good good study ", "day day up ", come on baby)

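textFile also accepts an optional number of partitions, but it must be greater than or equal to the number of file system blocks, so when the block count is unknown the argument can simply be omitted.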

wholeTextFiles

Converts files into an RDD[(String, String)] in which each tuple represents one file: _1 is the file path and _2 is the file content.

scala> sc.wholeTextFiles("hdfs:///demo/words",1).collect
res26: Array[(String, String)] =
Array((hdfs://CentOS:9000/demo/words/t_word,"this is a demo
hello spark
good good study
day day up
come on baby
"))
scala> sc.wholeTextFiles("hdfs:///demo/words",1).map(t=>t._2).flatMap(context=>context.split("\n")).collect
res25: Array[String] = Array(this is a demo, hello spark, "good good study ", "day day up ", come on baby)
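The difference between the two methods shows up clearly on a directory with several files. The sketch below assumes Spark 2.4.x on the classpath and a local master; it writes two small files into a temporary directory (the file names and contents are made up for the example, and the directory is not cleaned up afterwards).

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object WholeTextFilesSketch {
  // Returns (elements produced by textFile, elements produced by wholeTextFiles).
  def compare(sc: SparkContext): (Long, Long) = {
    val dir = Files.createTempDirectory("words")
    Files.write(dir.resolve("a.txt"),
      "this is a demo\nhello spark\n".getBytes(StandardCharsets.UTF_8))
    Files.write(dir.resolve("b.txt"),
      "come on baby\n".getBytes(StandardCharsets.UTF_8))
    val uri = dir.toUri.toString
    val lines = sc.textFile(uri).count()       // one element per line
    val files = sc.wholeTextFiles(uri).count() // one (path, content) pair per file
    (lines, files)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("WholeTextFilesSketch"))
    try println(compare(sc))
    finally sc.stop()
  }
}
```

wholeTextFiles loads each file as a single value, so it is a better fit for many small files than for a few very large ones.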

newAPIHadoopRDD

MySQL

<dependency>
  <groupId>mysql</groupId>
  <artifactId>mysql-connector-java</artifactId>
  <version>5.1.38</version>
</dependency>
object SparkNewHadoopAPIMySQL {
  // Driver
  def main(args: Array[String]): Unit = {

    //1.创建SparkContext
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("SparkWordCountApplication")
    val sc = new SparkContext(conf)


    val hadoopConfig = new Configuration()

   DBConfiguration.configureDB(hadoopConfig, //配置数据库的链接参数
      "com.mysql.jdbc.Driver",
      "jdbc:mysql://localhost:3306/test",
      "root",
      "root"
    )
    //设置查询相关属性
    hadoopConfig.set(DBConfiguration.INPUT_QUERY,"select id,name,password,birthDay from t_user")
    hadoopConfig.set(DBConfiguration.INPUT_COUNT_QUERY,"select count(id) from t_user")
    hadoopConfig.set(DBConfiguration.INPUT_CLASS_PROPERTY,"com.baizhi.createrdd.UserDBWritable")

    // Read the external data source through the InputFormat provided by Hadoop
    val jdbcRDD:RDD[(LongWritable,UserDBWritable)] = sc.newAPIHadoopRDD(
      hadoopConfig, // Hadoop configuration
      classOf[DBInputFormat[UserDBWritable]], // input format class
      classOf[LongWritable], // key type read by the Mapper
      classOf[UserDBWritable] // value type read by the Mapper
    )

    jdbcRDD.map(t=>(t._2.id,t._2.name,t._2.password,t._2.birthDay))
           .collect() // action: pulls the remote data to the Driver; normally used only for tests on small data sets
           .foreach(t=>println(t))

    // jdbcRDD.foreach(t=>println(t)) // action executed on the remote side: OK

    // jdbcRDD.collect().foreach(t=>println(t)) // error: UserDBWritable and LongWritable cannot be serialized
    // Stop the SparkContext
    sc.stop()
  }
}
class UserDBWritable extends DBWritable {
  var id:Int=_
  var name:String=_
  var password:String=_
  var birthDay:Date=_
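  // Used mainly by DBOutputFormat; this example only reads, so the method can stay empty
  override def write(preparedStatement: PreparedStatement): Unit = {}

  // With DBInputFormat, the fields of the result set must be copied into the member variables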
  override def readFields(resultSet: ResultSet): Unit = {
    id=resultSet.getInt("id")
    name=resultSet.getString("name")
    password=resultSet.getString("password")
    birthDay=resultSet.getDate("birthDay")
  }
}
HBase

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-auth</artifactId>
  <version>2.9.2</version>
</dependency>
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-client</artifactId>
  <version>1.2.4</version>
</dependency>

<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-server</artifactId>
  <version>1.2.4</version>
</dependency>
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HConstants
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
import org.apache.hadoop.hbase.protobuf.ProtobufUtil
import org.apache.hadoop.hbase.util.{Base64, Bytes}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object SparkNewHadoopAPIHbase {
  // Driver
  def main(args: Array[String]): Unit = {

    // Create the SparkContext
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("SparkWordCountApplication")
    val sc = new SparkContext(conf)


    val hadoopConf = new Configuration()
    hadoopConf.set(HConstants.ZOOKEEPER_QUORUM,"CentOS") // HBase connection parameter
    hadoopConf.set(TableInputFormat.INPUT_TABLE,"baizhi:t_user")

    val scan = new Scan()              // build the scan (query)
    val pro = ProtobufUtil.toScan(scan)
    hadoopConf.set(TableInputFormat.SCAN,Base64.encodeBytes(pro.toByteArray))

    val hbaseRDD:RDD[(ImmutableBytesWritable,Result)] = sc.newAPIHadoopRDD(
      hadoopConf, // Hadoop configuration
      classOf[TableInputFormat], // input format
      classOf[ImmutableBytesWritable], // Mapper key type
      classOf[Result] // Mapper value type
    )

    hbaseRDD.map(t=>{
      val rowKey = Bytes.toString(t._1.get())
      val result = t._2
      val name = Bytes.toString(result.getValue("cf1".getBytes(), "name".getBytes()))
      (rowKey,name)
    }).foreach(t=> println(t))

    // Stop the SparkContext
    sc.stop()
  }
}

RDD Operations

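RDDs support two types of operations: transformations, which create a new RDD from an existing one, and actions, which run a computation and usually return the result to the Driver. All transformations in Spark are lazy: they are not executed immediately and only record the transformation logic applied to the current RDD. The transformations are actually computed only when an action requires a result to be returned to the Driver program. This design lets Spark run more efficiently.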

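By default, each transformed RDD may be recomputed every time you run an action on it. However, you can also keep an RDD in memory with the persist (or cache) method, in which case Spark keeps the elements on the cluster so that the next query can access them much faster.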

scala> var rdd1=sc.textFile("hdfs:///words/src").map(line=>line.split(" ").length)
rdd1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[117] at map at <console>:24

scala> rdd1.cache
res54: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[117] at map at <console>:24

scala> rdd1.reduce(_+_)
res55: Int = 15                                                                 

scala> rdd1.reduce(_+_)
res56: Int = 15

Spark also supports persisting RDDs on disk or replicating them across multiple nodes. For example, calling persist(StorageLevel.DISK_ONLY_2) stores the RDD on disk with two replicas.
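To make the effect of cache on recomputation visible, the sketch below counts how often the map function runs across two actions, with and without caching. CacheDemo and mapCalls are hypothetical names, the input is two in-memory lines rather than an HDFS file, and the accumulator serves purely as an evaluation counter (its updates inside a transformation are only reliable when no task is retried).

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheDemo {
  // Returns how many times the map function ran across two actions on the same RDD.
  def mapCalls(sc: SparkContext, cached: Boolean): Long = {
    val calls = sc.longAccumulator("mapCalls")
    val lengths = sc.parallelize(Seq("this is a demo", "hello spark"), 2).map { line =>
      calls.add(1)              // side effect used only to count evaluations
      line.split(" ").length
    }
    if (cached) lengths.cache()
    lengths.reduce(_ + _)       // first action: computes the RDD
    lengths.reduce(_ + _)       // second action: recomputes unless the RDD is cached
    calls.value
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("CacheDemo")
    val sc = new SparkContext(conf)
    try {
      println(mapCalls(sc, cached = false))
      println(mapCalls(sc, cached = true))
    } finally {
      sc.stop()
    }
  }
}
```

In a failure-free local run this would be expected to print 4 (both lines evaluated by each of the two actions) and then 2 (the second action reads the cached partitions).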

Transformations

Reference: http://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations

√map(func)

Return a new distributed dataset formed by passing each element of the source through a function func.

Converts an RDD[U] into an RDD[T]. The user supplies an anonymous function func: U => T that is applied to every element.

scala> var rdd:RDD[String]=sc.makeRDD(List("a","b","c","a"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[120] at makeRDD at <console>:25

scala> val mapRDD:RDD[(String,Int)] = rdd.map(w => (w, 1))
mapRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[121] at map at <console>:26
√filter(func)

Return a new dataset formed by selecting those elements of the source on which func returns true.

Filters the elements of an RDD[U] and produces a new RDD[U]. The user supplies func: U => Boolean, and only the elements for which it returns true are kept.

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[122] at makeRDD at <console>:25

scala> val mapRDD:RDD[Int]=rdd.filter(num=> num %2 == 0)
mapRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[123] at filter at <console>:26

scala> mapRDD.collect
res63: Array[Int] = Array(2, 4)
√flatMap(func)

Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

Like map, it converts an RDD[U] into an RDD[T], but the user supplies a function func: U => Seq[T] whose results are flattened into the output.

scala> var rdd:RDD[String]=sc.makeRDD(List("this is","good good"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[124] at makeRDD at <console>:25

scala> var flatMapRDD:RDD[(String,Int)]=rdd.flatMap(line=> for(i<- line.split("\\s+")) yield (i,1))
flatMapRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[125] at flatMap at <console>:26

scala> var flatMapRDD:RDD[(String,Int)]=rdd.flatMap( line=>  line.split("\\s+").map((_,1)))
flatMapRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[126] at flatMap at <console>:26

scala> flatMapRDD.collect
res64: Array[(String, Int)] = Array((this,1), (is,1), (good,1), (good,1))

√mapPartitions(func)

Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.

Like map, but the input of the function is the full data of one partition, so the user supplies a per-partition conversion function func: Iterator[T] => Iterator[U].

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[128] at makeRDD at <console>:25

scala> var mapPartitionsRDD=rdd.mapPartitions(values => values.map(n=>(n,n%2==0)))
mapPartitionsRDD: org.apache.spark.rdd.RDD[(Int, Boolean)] = MapPartitionsRDD[129] at mapPartitions at <console>:26

scala> mapPartitionsRDD.collect
res70: Array[(Int, Boolean)] = Array((1,false), (2,true), (3,false), (4,true), (5,false))

√mapPartitionsWithIndex(func)

Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.

Like mapPartitions, but it also passes in the index of the partition that the elements belong to, so func: (Int, Iterator[T]) => Iterator[U].

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6),2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[139] at makeRDD at <console>:25

scala> var mapPartitionsWithIndexRDD=rdd.mapPartitionsWithIndex((p,values) => values.map(n=>(n,p)))
mapPartitionsWithIndexRDD: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[140] at mapPartitionsWithIndex at <console>:26

scala> mapPartitionsWithIndexRDD.collect
res77: Array[(Int, Int)] = Array((1,0), (2,0), (3,0), (4,1), (5,1), (6,1))
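The partition indices in this output follow from how a local collection is cut into slices. The plain-Scala sketch below models an even-slicing rule in which slice i covers positions i*len/n until (i+1)*len/n. SliceModel and slice are hypothetical names, no Spark dependency is used, and Spark's internal slicing may differ between versions.

```scala
object SliceModel {
  // Hypothetical helper: split `data` into `numSlices` contiguous slices,
  // slice i covering positions [i*len/numSlices, (i+1)*len/numSlices).
  def slice[T](data: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    require(numSlices >= 1, "numSlices must be positive")
    val len = data.length.toLong
    (0 until numSlices).map { i =>
      val start = (i * len / numSlices).toInt
      val end = ((i + 1) * len / numSlices).toInt
      data.slice(start, end)
    }
  }

  def main(args: Array[String]): Unit = {
    // Tag every element with the index of the slice it landed in, mirroring
    // mapPartitionsWithIndex((p, values) => values.map(n => (n, p))).
    val tagged = slice(List(1, 2, 3, 4, 5, 6), 2).zipWithIndex.flatMap {
      case (values, p) => values.map(n => (n, p))
    }
    println(tagged.mkString(", ")) // (1,0), (2,0), (3,0), (4,1), (5,1), (6,1)
  }
}
```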

sample(withReplacement, fraction, seed)

Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.

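Draws a sample from the RDD. withReplacement controls whether an element may be sampled more than once, fraction controls the approximate proportion of elements sampled, and seed initializes the random number generator used for sampling.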

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[150] at makeRDD at <console>:25

scala> var simpleRDD:RDD[Int]=rdd.sample(false,0.5d,1L)
simpleRDD: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[151] at sample at <console>:26

scala> simpleRDD.collect
res91: Array[Int] = Array(1, 5, 6)

A different seed changes the resulting sample!
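The role of the seed can be illustrated without a cluster. The following plain-Scala sketch models sampling without replacement as an independent keep/drop decision per element; SampleModel and bernoulliSample are hypothetical names, and Spark's own samplers are seeded per partition, so the elements chosen here will not match the spark-shell output.

```scala
import scala.util.Random

object SampleModel {
  // Hypothetical helper: keep each element independently with probability `fraction`,
  // using a generator initialised with `seed`.
  def bernoulliSample[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    data.filter(_ => rng.nextDouble() < fraction)
  }

  def main(args: Array[String]): Unit = {
    val data = 1 to 6
    // The same seed reproduces the same sample.
    println(bernoulliSample(data, 0.5, 1L) == bernoulliSample(data, 0.5, 1L)) // true
    // A different seed may select different elements.
    println(bernoulliSample(data, 0.5, 2L))
  }
}
```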

union(otherDataset)

Return a new dataset that contains the union of the elements in the source dataset and the argument.

Merges the elements of two RDDs of the same type. Duplicates are kept, as the repeated 6 in the output shows.

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[154] at makeRDD at <console>:25

scala> var rdd2:RDD[Int]=sc.makeRDD(List(6,7))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[155] at makeRDD at <console>:25

scala> rdd.union(rdd2).collect
res95: Array[Int] = Array(1, 2, 3, 4, 5, 6, 6, 7)
intersection(otherDataset)

Return a new RDD that contains the intersection of elements in the source dataset and the argument.

Computes the intersection of the elements of two RDDs of the same type.

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[154] at makeRDD at <console>:25

scala> var rdd2:RDD[Int]=sc.makeRDD(List(6,7))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[155] at makeRDD at <console>:25

scala> rdd.intersection(rdd2).collect
res100: Array[Int] = Array(6)
distinct([numPartitions])

Return a new dataset that contains the distinct elements of the source dataset.

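Removes duplicate elements from the RDD. numPartitions is optional and sets the number of partitions of the result; when deduplication shrinks the dataset substantially, you can pass a smaller numPartitions to reduce the partition count.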

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[154] at makeRDD at <console>:25

scala> rdd.distinct(3).collect
res106: Array[Int] = Array(6, 3, 4, 1, 5, 2)
√join(otherDataset, [numPartitions])

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

When called on an RDD[(K,V)] and an RDD[(K,W)], it returns a new RDD[(K,(V,W))] (an inner join by default). Outer joins are available through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

scala> var userRDD:RDD[(Int,String)]=sc.makeRDD(List((1,"zhangsan"),(2,"lisi")))
userRDD: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[204] at makeRDD at <console>:25

scala> case class OrderItem(name:String,price:Double,count:Int)
defined class OrderItem

scala> var orderItemRDD:RDD[(Int,OrderItem)]=sc.makeRDD(List((1,OrderItem("apple",4.5,2))))
orderItemRDD: org.apache.spark.rdd.RDD[(Int, OrderItem)] = ParallelCollectionRDD[206] at makeRDD at <console>:27

scala> userRDD.join(orderItemRDD).collect
res107: Array[(Int, (String, OrderItem))] = Array((1,(zhangsan,OrderItem(apple,4.5,2))))

scala> userRDD.leftOuterJoin(orderItemRDD).collect
res108: Array[(Int, (String, Option[OrderItem]))] = Array((1,(zhangsan,Some(OrderItem(apple,4.5,2)))), (2,(lisi,None)))
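The difference between the inner join and the left outer join can be reproduced on local collections. The sketch below is a plain-Scala model of the result shapes only; JoinModel, join, and leftOuterJoin are hypothetical helpers, the order item is simplified to a String, and a real Spark join shuffles both sides by key rather than looping.

```scala
object JoinModel {
  // Inner join: one output pair for every matching (left, right) combination of a key.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for {
      (k, v) <- left
      (k2, w) <- right
      if k == k2
    } yield (k, (v, w))

  // Left outer join: left keys without a match on the right are kept with None.
  def leftOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] =
    left.flatMap { case (k, v) =>
      val matches = right.collect { case (k2, w) if k2 == k => w }
      if (matches.isEmpty) Seq((k, (v, Option.empty[W])))
      else matches.map(w => (k, (v, Option(w))))
    }

  def main(args: Array[String]): Unit = {
    val users = Seq(1 -> "zhangsan", 2 -> "lisi")
    val orders = Seq(1 -> "apple")
    println(join(users, orders))          // List((1,(zhangsan,apple)))
    println(leftOuterJoin(users, orders)) // List((1,(zhangsan,Some(apple))), (2,(lisi,None)))
  }
}
```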
cogroup(otherDataset, [numPartitions]) (good to know)

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.

scala> var userRDD:RDD[(Int,String)]=sc.makeRDD(List((1,"zhangsan"),(2,"lisi")))
userRDD: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[204] at makeRDD at <console>:25

scala> var orderItemRDD:RDD[(Int,OrderItem)]=sc.makeRDD(List((1,OrderItem("apple",4.5,2)),(1,OrderItem("pear",1.5,2))))
orderItemRDD: org.apache.spark.rdd.RDD[(Int, OrderItem)] = ParallelCollectionRDD[215] at makeRDD at <console>:27

scala> userRDD.cogroup(orderItemRDD).collect
res110: Array[(Int, (Iterable[String], Iterable[OrderItem]))] = Array((1,(CompactBuffer(zhangsan),CompactBuffer(OrderItem(apple,4.5,2), OrderItem(pear,1.5,2)))), (2,(CompactBuffer(lisi),CompactBuffer())))

scala> userRDD.groupWith(orderItemRDD).collect
res119: Array[(Int, (Iterable[String], Iterable[OrderItem]))] = Array((1,(CompactBuffer(zhangsan),CompactBuffer(OrderItem(apple,4.5,2), OrderItem(pear,1.5,2)))), (2,(CompactBuffer(lisi),CompactBuffer())))

cartesian(otherDataset) (good to know)

When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).

Computes the Cartesian product of the two datasets.

scala> var rdd1:RDD[Int]=sc.makeRDD(List(1,2,4))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[238] at makeRDD at <console>:25

scala> var rdd2:RDD[String]=sc.makeRDD(List("a","b","c"))
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[239] at makeRDD at <console>:25

scala> rdd1.cartesian(rdd2).collect
res120: Array[(Int, String)] = Array((1,a), (1,b), (1,c), (2,a), (2,b), (2,c), (4,a), (4,b), (4,c))

coalesce(numPartitions)

Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.

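After filtering out a large share of the data, you can use coalesce to shrink the number of partitions of the RDD. Called with only numPartitions, as here, it can only reduce the partition count, never increase it.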

scala> var rdd1:RDD[Int]=sc.makeRDD(0 to 100)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[252] at makeRDD at <console>:25

scala> rdd1.getNumPartitions
res129: Int = 6

scala> rdd1.filter(n=> n%2 == 0).coalesce(3).getNumPartitions
res127: Int = 3

scala> rdd1.filter(n=> n%2 == 0).coalesce(12).getNumPartitions
res128: Int = 6
repartition(numPartitions)

Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.

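Similar to coalesce, but this operator can either increase or decrease the number of partitions of the RDD, always at the cost of a full shuffle.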

scala> var rdd1:RDD[Int]=sc.makeRDD(0 to 100)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[252] at makeRDD at <console>:25

scala> rdd1.getNumPartitions
res129: Int = 6

scala> rdd1.filter(n=> n%2 == 0).repartition(12).getNumPartitions
res130: Int = 12

scala> rdd1.filter(n=> n%2 == 0).repartition(3).getNumPartitions
res131: Int = 3
repartitionAndSortWithinPartitions(partitioner) (good to know)

Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.

This operator partitions the RDD with the user-supplied partitioner and then sorts the records inside each partition by key.

scala> case class User(name:String,deptNo:Int)
defined class User

import org.apache.spark.Partitioner

var empRDD:RDD[User]= sc.parallelize(List(User("张三",1),User("lisi",2),User("wangwu",1)))

empRDD.map(t => (t.deptNo, t.name)).repartitionAndSortWithinPartitions(new Partitioner {
  override def numPartitions: Int = 4

  override def getPartition(key: Any): Int = {
    // parentheses are required: % binds tighter than &
    (key.hashCode() & Integer.MAX_VALUE) % numPartitions
  }
}).mapPartitionsWithIndex((p,values)=> {
  //println(p+"\t"+values.mkString("|"))
  values.map(v=>(p,v))
}).collect()
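A custom getPartition that combines & and % needs explicit parentheses, because in Scala % binds tighter than &. The sketch below compares both spellings; PartitionIndexDemo and its two methods are hypothetical names. With numPartitions = 4 the two happen to agree, since Integer.MAX_VALUE % 4 is 3 and masking with 3 equals taking the remainder by 4, but with 3 partitions the unparenthesised form can only ever return 0 or 1.

```scala
object PartitionIndexDemo {
  // Parsed as hashCode & (Integer.MAX_VALUE % numPartitions)
  def withoutParens(hashCode: Int, numPartitions: Int): Int =
    hashCode & Integer.MAX_VALUE % numPartitions

  // Clear the sign bit first, then take the remainder
  def withParens(hashCode: Int, numPartitions: Int): Int =
    (hashCode & Integer.MAX_VALUE) % numPartitions

  def main(args: Array[String]): Unit = {
    val numPartitions = 3
    val keys = 0 to 8
    // Only partitions 0 and 1 are ever chosen, partition 2 stays empty
    println(keys.map(withoutParens(_, numPartitions)).distinct.sorted)
    // All three partitions are used
    println(keys.map(withParens(_, numPartitions)).distinct.sorted)
  }
}
```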

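Question

1. When two very large files need to be joined, what optimization strategies are available?
[Image: Spark in-memory computing, figure 4]

√xxxByKey operators (must master)

Spark provides a family of xxxByKey operators dedicated to datasets of type RDD[(K,V)].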

  • groupByKey([numPartitions])

When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.

Similar to the MapReduce computation model: converts an RDD[(K, V)] into an RDD[(K, Iterable[V])].

scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> lines.flatMap(_.split("\\s+")).map((_,1)).groupByKey.collect
res3: Array[(String, Iterable[Int])] = Array((this,CompactBuffer(1)), (is,CompactBuffer(1)), (good,CompactBuffer(1, 1)))
  • groupBy(f:(k,v)=> T)
scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> lines.flatMap(_.split("\\s+")).map((_,1)).groupBy(t=>t._1)
res5: org.apache.spark.rdd.RDD[(String, Iterable[(String, Int)])] = ShuffledRDD[18] at groupBy at <console>:26

scala> lines.flatMap(_.split("\\s+")).map((_,1)).groupBy(t=>t._1).map(t=>(t._1,t._2.size)).collect
res6: Array[(String, Int)] = Array((this,1), (is,1), (good,2))
  • reduceByKey(func, [numPartitions])

When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> lines.flatMap(_.split("\\s+")).map((_,1)).reduceByKey(_+_).collect
res8: Array[(String, Int)] = Array((this,1), (is,1), (good,2))
  • aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions])

When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral “zero” value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)(_+_,_+_).collect
res9: Array[(String, Int)] = Array((this,1), (is,1), (good,2))
  • sortByKey([ascending], [numPartitions])

When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)(_+_,_+_).sortByKey(true).collect
res13: Array[(String, Int)] = Array((good,2), (is,1), (this,1))

scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)(_+_,_+_).sortByKey(false).collect
res14: Array[(String, Int)] = Array((this,1), (is,1), (good,2))
  • sortBy(T=>U,ascending,[numPartitions])
scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)(_+_,_+_).sortBy(_._2,false).collect
res18: Array[(String, Int)] = Array((good,2), (this,1), (is,1))

scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)(_+_,_+_).sortBy(t=>t._2,true).collect
res19: Array[(String, Int)] = Array((this,1), (is,1), (good,2))
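The two functions passed to aggregateByKey play different roles: seqOp folds values inside one partition, and combOp merges the partial results of different partitions. The plain-Scala sketch below models that two-step flow on data that is already split into partitions; AggregateByKeyModel and its aggregateByKey helper are hypothetical and only mimic the semantics, not Spark's shuffle machinery. The result type (sum, count) deliberately differs from the input value type Int.

```scala
object AggregateByKeyModel {
  // Hypothetical helper modelling aggregateByKey over pre-partitioned data.
  def aggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]], zeroValue: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] = {
    // Step 1 (inside each partition): fold the values of a key with seqOp, starting from zeroValue.
    val perPartition: Seq[Map[K, U]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zeroValue), v))
      }
    }
    // Step 2 (across partitions): merge the partial results of the same key with combOp.
    perPartition.flatten.foldLeft(Map.empty[K, U]) { case (acc, (k, u)) =>
      acc.updated(k, acc.get(k).map(prev => combOp(prev, u)).getOrElse(u))
    }
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq("this" -> 1, "good" -> 1), Seq("is" -> 1, "good" -> 1))
    val stats = aggregateByKey(partitions, (0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),   // add the value, count one more record
      (a, b) => (a._1 + b._1, a._2 + b._2))   // add partial sums and partial counts
    println(stats("good")) // (2,2)
  }
}
```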

Actions

Each Spark job is triggered by exactly one action. An action either writes the data of an RDD to an external system or returns it to the Driver program.

reduce(func)

Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

This operator aggregates the remote data and returns the result to the Driver. The example below counts the number of words in a file.

scala> sc.textFile("hdfs:///words/src").map(_.split("\\s+").length).reduce(_+_)
res56: Int = 13
collect()

Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

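Transfers the data of a remote RDD to the Driver. Use collect only in test environments or when the RDD is very small; otherwise the Driver may run out of memory because the data is too large.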

scala> sc.textFile("hdfs:///words/src").collect
res58: Array[String] = Array(this is a demo, good good study, day day up, come on baby)

It is typically used for testing, turning a distributed RDD into a local Array.

√foreach(func)

Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.

The function runs on the executors, so with println the output appears wherever the tasks run rather than being returned to the Driver.

scala> sc.textFile("file:///root/t_word").foreach(line=>println(line))
count()

Return the number of elements in the dataset.

Returns the number of elements in the RDD to the Driver.

scala> sc.textFile("file:///root/t_word").count()
res7: Long = 5
first()|take(n)

Return the first element of the dataset (similar to take(1)). take(n) Return an array with the first n elements of the dataset.

scala> sc.textFile("file:///root/t_word").first
res9: String = this is a demo

scala> sc.textFile("file:///root/t_word").take(1)
res10: Array[String] = Array(this is a demo)

scala> sc.textFile("file:///root/t_word").take(2)
res11: Array[String] = Array(this is a demo, hello spark)
takeSample(withReplacement, num, [seed])

Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.

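Randomly samples num elements from the RDD and returns them to the Driver program as an array. This is the key difference from the sample transformation, which returns a new RDD.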

scala> sc.textFile("file:///root/t_word").takeSample(false,2)
res20: Array[String] = Array("good good study ", hello spark)

takeOrdered(n, [ordering])

Return the first n elements of the RDD using either their natural order or a custom comparator.

Returns the first n elements of the RDD in sorted order; the user can supply the comparison rule.

scala> case class User(name:String,deptNo:Int,salary:Double)
defined class User

scala> var userRDD=sc.parallelize(List(User("zs",1,1000.0),User("ls",2,1500.0),User("ww",2,1000.0)))
userRDD: org.apache.spark.rdd.RDD[User] = ParallelCollectionRDD[51] at parallelize at <console>:26

scala> userRDD.takeOrdered
   def takeOrdered(num: Int)(implicit ord: Ordering[User]): Array[User]

scala> userRDD.takeOrdered(3)
<console>:26: error: No implicit Ordering defined for User.
       userRDD.takeOrdered(3)

scala>  implicit var userOrder=new Ordering[User]{
     |      override def compare(x: User, y: User): Int = {
     |        if(x.deptNo!=y.deptNo){
     |          x.deptNo.compareTo(y.deptNo)
     |        }else{
     |          x.salary.compareTo(y.salary) * -1
     |        }
     |      }
     |    }
userOrder: Ordering[User] = $anon$1@7066f4bc

scala> userRDD.takeOrdered(3)
res23: Array[User] = Array(User(zs,1,1000.0), User(ls,2,1500.0), User(ww,2,1000.0))
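On a local collection, takeOrdered(n) behaves like sorting by the implicit Ordering and keeping the first n elements. The plain-Scala sketch below shows the same ordering (deptNo ascending, then salary descending) written with Integer.compare and java.lang.Double.compare, where swapping the arguments reverses the direction. TakeOrderedModel and its takeOrdered helper are hypothetical names for illustration.

```scala
object TakeOrderedModel {
  case class User(name: String, deptNo: Int, salary: Double)

  // deptNo ascending, then salary descending (arguments swapped in the second comparison)
  implicit val userOrdering: Ordering[User] = new Ordering[User] {
    override def compare(x: User, y: User): Int = {
      val byDept = Integer.compare(x.deptNo, y.deptNo)
      if (byDept != 0) byDept else java.lang.Double.compare(y.salary, x.salary)
    }
  }

  // Hypothetical local equivalent of takeOrdered(n): sort, then keep the first n
  def takeOrdered(users: Seq[User], n: Int)(implicit ord: Ordering[User]): Seq[User] =
    users.sorted(ord).take(n)

  def main(args: Array[String]): Unit = {
    val users = List(User("ww", 2, 1000.0), User("ls", 2, 1500.0), User("zs", 1, 1000.0))
    println(takeOrdered(users, 2)) // List(User(zs,1,1000.0), User(ls,2,1500.0))
  }
}
```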
√saveAsTextFile(path)

Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.

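Spark calls toString on each element and writes it to the file as one line of text. The example below first formats each (word, count) pair as a tab-separated string.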

scala> sc.textFile("file:///root/t_word").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).sortBy(_._1,true,1).map(t=> t._1+"\t"+t._2).saveAsTextFile("hdfs:///demo/results02")
saveAsSequenceFile(path)

Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop’s Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).

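This method is only available on RDD[(K,V)], and both K and V must implement the Writable interface. When programming in Scala, Spark provides implicit conversions that automatically turn types such as Int, Double, and String into Writable.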

scala> sc.textFile("file:///root/t_word").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).sortBy(_._1,true,1).saveAsSequenceFile("hdfs:///demo/results03")
scala> sc.sequenceFile[String,Int]("hdfs:///demo/results03").collect
res29: Array[(String, Int)] = Array((a,1), (baby,1), (come,1), (day,2), (demo,1), (good,2), (hello,1), (is,1), (on,1), (spark,1), (study,1), (this,1), (up,1))

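Shared Variables

When a transformation on an RDD uses a variable defined in the Driver, the variable is sent over the network to the compute nodes before they run that transformation. If a compute node then modifies its copy, the change is not visible to the variable defined on the Driver side.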

scala> var i:Int=0
i: Int = 0

scala> sc.textFile("file:///root/t_word").foreach(line=> i=i+1)
                                                                                
scala> print(i)
0

√Broadcast variables

Question:

When a very large dataset is joined with a small one, can the join operator be used directly? If not, why?

//100GB
var orderItems=List("001 apple 2 4.5","002 pear 1 2.0","001 瓜子 1 7.0")
//10MB
var users=List("001 zhangsan","002 lisi","003 王五")

var rdd1:RDD[(String,String)] =sc.makeRDD(orderItems).map(line=>(line.split(" ")(0),line))
var rdd2:RDD[(String,String)] =sc.makeRDD(users).map(line=>(line.split(" ")(0),line))

rdd1.join(rdd2).collect().foreach(println)

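A join triggers a shuffle, so the 100 GB of data is transferred between the compute nodes to complete the join, which makes both the network cost and the memory cost high. Instead, the small dataset can be defined as a variable in the Driver and the join completed inside a map operation.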

scala> var users=List("001 zhangsan","002 lisi","003 王五").map(line=>line.split(" ")).map(ts=>ts(0)->ts(1)).toMap
users: scala.collection.immutable.Map[String,String] = Map(001 -> zhangsan, 002 -> lisi, 003 -> 王五)

scala> var orderItems=List("001 apple 2 4.5","002 pear 1 2.0","001 瓜子 1 7.0")
orderItems: List[String] = List(001 apple 2 4.5, 002 pear 1 2.0, 001 瓜子 1 7.0)

scala> var rdd1:RDD[(String,String)] =sc.makeRDD(orderItems).map(line=>(line.split(" ")(0),line))
rdd1: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[89] at map at <console>:32

scala> rdd1.map(t=> t._2+"\t"+users.get(t._1).getOrElse("未知")).collect()
res33: Array[String] = Array(001 apple 2 4.5    zhangsan, 002 pear 1 2.0       lisi, 001 瓜子 1 7.0	zhangsan)

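The approach above has a drawback: the users variable is captured in the closure and shipped from the Driver with every task that runs the map operator. The value is not large, but the compute nodes end up downloading it repeatedly. Spark introduced broadcast variables to avoid exactly this kind of unnecessary repeated copying.

Early in the run, Spark notifies all compute nodes of the variables to be broadcast. Each compute node downloads a broadcast variable before computing with it and caches it, so other threads on the same node can use the variable without downloading it again.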

//100GB
var orderItems=List("001 apple 2 4.5","002 pear 1 2.0","001 瓜子 1 7.0")
//10MB, declared as a Map
var users:Map[String,String]=List("001 zhangsan","002 lisi","003 王五").map(line=>line.split(" ")).map(ts=>ts(0)->ts(1)).toMap

// Declare the broadcast variable; read the broadcast value through its value property
val ub = sc.broadcast(users)

var rdd1:RDD[(String,String)] =sc.makeRDD(orderItems).map(line=>(line.split(" ")(0),line))

rdd1.map(t=> t._2+"\t"+ub.value.get(t._1).getOrElse("未知")).collect().foreach(println)

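Accumulators

Spark's Accumulator lets multiple nodes operate on one shared variable. An Accumulator only supports adding, but it allows many tasks to update the same variable in parallel. Tasks can only add to an Accumulator and cannot read its value; only the Driver program can read it.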

scala> val accum = sc.longAccumulator("mycount")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 1075, name: Some(mycount), value: 0)

scala> sc.parallelize(Array(1, 2, 3, 4),6).foreach(x => accum.add(x))

scala> accum.value
res36: Long = 10

Writing Data out of Spark

Writing to HDFS

scala> sc.textFile("file:///root/t_word").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).sortBy(_._1,true,1).saveAsSequenceFile("hdfs:///demo/results03")

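The saveAsXxx methods all write results to HDFS or the local file system. To write results to a third-party database, use the foreach operator that Spark provides.

√Writing with foreach

Scenario 1: a connection is opened and closed for every element, so writes are very slow (but the job does run successfully)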

sc.textFile("file:///root/t_word")
.flatMap(_.split(" "))
.map((_,1))
.reduceByKey(_+_)
.sortBy(_._1,true,3)
.foreach(tuple=>{ // database write
  // 1. open a connection
  // 2. insert the record
  // 3. close the connection
})

Scenario 2: incorrect, because a connection (pool) cannot be serialized (the job fails)

// 1. Define the Connection
var conn=... // defined in the Driver
sc.textFile("file:///root/t_word")
.flatMap(_.split(" "))
.map((_,1))
.reduceByKey(_+_)
.sortBy(_._1,true,3)
.foreach(tuple=>{ // database write
  // 2. insert the record
})
 // 3. close the connection

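Scenario 3: one connection per partition? (Not bad, but not optimal.) One JVM may run several partitions, which means one JVM creates several connections and wastes resources. Would a singleton object help?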

sc.textFile("file:///root/t_word")
.flatMap(_.split(" "))
.map((_,1))
.reduceByKey(_+_)
.sortBy(_._1,true,3)
.foreachPartition(values=>{
  // open a connection
  // write the data of this partition
  // close the connection
})

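Create the connection inside a singleton object. If a compute node is assigned several partitions, the JVM singleton guarantees that the connection is created only once in the whole JVM.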

val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("SparkWordCountApplication")
val sc = new SparkContext(conf)

sc.textFile("hdfs://CentOS:9000/demo/words/")
.flatMap(_.split(" "))
.map((_,1))
.reduceByKey(_+_)
.sortBy(_._1,true,3)
.foreachPartition(values=>{
  HbaseSink.writeToHbase("baizhi:t_word",values.toList)
})

sc.stop()
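
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.{HConstants, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Put}
import scala.collection.JavaConverters._

object HbaseSink {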
  lazy val conn:Connection=createConnection()

  def createConnection(): Connection = {
    val hadoopConf = new Configuration()
    hadoopConf.set(HConstants.ZOOKEEPER_QUORUM,"CentOS")
    ConnectionFactory.createConnection(hadoopConf)
  }

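  /**
   * Writes (word, count) pairs to an HBase table in one batch.
   * @param tableName target table, e.g. "baizhi:t_word"
   * @param values    (word, count) pairs to write
   */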
  def writeToHbase(tableName: String, values: List[(String, Int)]): Unit = {
    val bufferedMutator = conn.getBufferedMutator(TableName.valueOf(tableName))

    val puts: List[Put] = values.map(t => {
      val put = new Put(t._1.getBytes())
      put.addColumn("cf1".getBytes(), "count".getBytes(), t._2.toString.getBytes())
      put
    })
    // Write in one batch
    bufferedMutator.mutate(puts.asJava)
    bufferedMutator.flush()
    bufferedMutator.close()
  }

  sys.addShutdownHook({
     println("JVM is shutting down!")
      if(conn!=null){
        conn.close()
      }
  })

}
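The trade-off between the three scenarios comes down to how many connections get opened. The plain-Scala sketch below counts them with a stand-in connection class; ConnectionCostModel, FakeConnection, and the three strategy methods are hypothetical names, nested sequences stand in for partitions, and no real database or Spark job is involved.

```scala
import java.util.concurrent.atomic.AtomicInteger

object ConnectionCostModel {
  // Stand-in for a real connection: it only counts how many instances were created.
  final class FakeConnection(counter: AtomicInteger) {
    counter.incrementAndGet()
    def insert(row: String): Unit = ()
    def close(): Unit = ()
  }

  // Scenario 1: open and close a connection for every element.
  def perElement(partitions: Seq[Seq[String]]): Int = {
    val counter = new AtomicInteger
    partitions.foreach(_.foreach { row =>
      val conn = new FakeConnection(counter)
      conn.insert(row)
      conn.close()
    })
    counter.get
  }

  // Scenario 3: one connection per partition.
  def perPartition(partitions: Seq[Seq[String]]): Int = {
    val counter = new AtomicInteger
    partitions.foreach { rows =>
      val conn = new FakeConnection(counter)
      rows.foreach(conn.insert)
      conn.close()
    }
    counter.get
  }

  // Singleton style: a lazily created connection shared by all partitions in the JVM.
  def sharedSingleton(partitions: Seq[Seq[String]]): Int = {
    val counter = new AtomicInteger
    lazy val conn = new FakeConnection(counter) // created on first use only
    partitions.foreach(rows => rows.foreach(conn.insert))
    counter.get
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq("a", "b"), Seq("c", "d"), Seq("e", "f"))
    println(perElement(partitions))      // 6
    println(perPartition(partitions))    // 3
    println(sharedSingleton(partitions)) // 1
  }
}
```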

Advanced RDD (interview topics)

Analyzing WordCount

sc.textFile("hdfs:///words/t_word") //RDD0
   .flatMap(_.split(" "))                //RDD1
   .map((_,1))                           //RDD2
   .reduceByKey(_+_)                     //RDD3  finalRDD
   .collect                              //Array, submits the job

[Image: Spark in-memory computing, figure 5]

What are the characteristics of an RDD?

* Internally, each RDD is characterized by five main properties:
*
*  - A list of partitions
*  - A function for computing each split
*  - A list of dependencies on other RDDs
*  - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
*  - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
*    an HDFS file)
*
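  • An RDD is a read-only, partitioned, distributed dataset; its number of partitions equals its parallelism
  • Each partition is computed independently, with the computation kept local to the partition's data wherever possible
  • RDDs are read-only datasets that depend on one another
  • For key-value RDDs, a partitioning strategy can be specified [optional]
  • Based on where the data resides, the best location is chosen so that the computation is local [optional]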

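RDD Fault Tolerance

To understand how the DAGScheduler divides a job into stages, you first need the term lineage, often called the bloodline of an RDD. Before looking at what RDD lineage is, consider the evolution of the programmer.

[Image: Spark in-memory computing, figure 6]

The figure above depicts how a programmer evolved, and RDD transformations can be understood in roughly the same way. A Spark computation is essentially a series of transformations on RDDs. Because an RDD is an immutable, read-only collection, every transformation takes the previous RDD as its input, so the lineage of an RDD describes the dependencies between RDDs. To keep RDD data robust, each RDD remembers through its lineage how it was derived from other RDDs. Spark classifies the relationships between RDDs into wide dependencies and narrow dependencies, and uses the dependencies stored in the lineage to make RDD computation fault tolerant. Spark currently provides three fault-tolerance strategies: recomputing from the RDD dependencies (no user intervention needed), caching the RDD (temporary cache), and checkpointing the RDD (persistent storage).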

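RDD Caching

Caching is one means of fault tolerance for RDD computation. When RDD data is lost, the program can quickly recover the current RDD from the cache instead of tracing back through all parent RDDs and recomputing them. When an RDD is used several times, the user can therefore consider caching it to make the program run more efficiently.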

scala> var finalRDD=sc.textFile("hdfs:///words/src").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
finalRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[25] at reduceByKey at <console>:24

scala> finalRDD.cache
res7: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[25] at reduceByKey at <console>:24

scala> finalRDD.collect
res8: Array[(String, Int)] = Array((this,1), (is,1), (day,2), (come,1), (hello,1), (baby,1), (up,1), (spark,1), (a,1), (on,1), (demo,1), (good,2), (study,1))

scala> finalRDD.collect
res9: Array[(String, Int)] = Array((this,1), (is,1), (day,2), (come,1), (hello,1), (baby,1), (up,1), (spark,1), (a,1), (on,1), (demo,1), (good,2), (study,1))

The user can call the unpersist method to clear the cache

scala> finalRDD.unpersist()
res11: org.apache.spark.rdd.RDD[(String, Int)] @scala.reflect.internal.annotations.uncheckedBounds = ShuffledRDD[25] at reduceByKey at <console>:24

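Besides cache, Spark offers finer-grained caching options, so the user can pick a caching strategy that suits the memory available on the cluster. The persist method takes the storage level to use.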

RDD#persist(StorageLevel.MEMORY_ONLY)

Spark currently supports the following storage levels:

object StorageLevel {
  val NONE = new StorageLevel(false, false, false, false)
  val DISK_ONLY = new StorageLevel(true, false, false, false) // disk only
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2) // disk only, two replicas
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false) // serialize first, then keep in memory: costs CPU, saves memory
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true) // recommended choice
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  val OFF_HEAP = new StorageLevel(true, true, true, false, 1)
...

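So how do you choose?

By default, MEMORY_ONLY gives the best performance, provided that memory is large enough to hold all the data of the RDD with room to spare. No serialization or deserialization takes place, which avoids that overhead; subsequent operators on the RDD work purely on in-memory data and never read from disk files, so they are fast; and no replica of the data has to be made and sent to other nodes. Note, however, that in real production environments the cases where this strategy can be used directly are limited: if the RDD holds a lot of data (say, billions of records), this storage level leads to a JVM OutOfMemoryError.

If MEMORY_ONLY runs out of memory, try MEMORY_ONLY_SER. This level serializes the RDD data before keeping it in memory, so each partition is just a byte array, which greatly reduces the number of objects and lowers memory usage. The extra cost compared with MEMORY_ONLY is mainly serialization and deserialization. Subsequent operators still work purely in memory, so overall performance remains fairly high. The same problem can still occur, though: if the RDD holds too much data, an OOM error is still possible.

Do not spill to disk unless computing the data in memory is very expensive, or the computation filters out a large amount of data and only the relatively important part needs to be kept in memory. Otherwise, computing from data stored on disk is slow and performance drops sharply.

Levels with the _2 suffix replicate every piece of data and send the copy to another node. Data replication and network transfer add considerable overhead, so these levels are not recommended unless the job requires high availability.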

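Checkpoint Mechanism

Caching helps RDD recovery, but if the cache is lost the system still has to recompute the RDD. For RDDs with a long lineage, where computation is time-consuming, the user can try the checkpoint mechanism to store the computed RDD. The biggest difference from caching is that a checkpointed RDD is persisted directly to a file system, HDFS being the usual recommendation, and checkpoint data is not cleaned up automatically. Note that checkpoint first only marks the RDD during the computation; once the job has finished, a separate pass performs the checkpoint on the marked RDD, which means the marked RDD and everything it depends on are computed again.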

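sc.setCheckpointDir("hdfs://CentOS:9000/checkpoints")

val rdd1 = sc.textFile("hdfs://CentOS:9000/demo/words/")
.map(line => {
  println(line) // side effect that runs every time the line is computed
  line          // return the line so rdd1 holds the text rather than Unit
})

// Mark the current RDD for checkpointing
rdd1.checkpoint()

rdd1.collect()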

For this reason checkpoint is usually combined with cache, so that the RDD is computed only once.

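import org.apache.spark.storage.StorageLevel

sc.setCheckpointDir("hdfs://CentOS:9000/checkpoints")

val rdd1 = sc.textFile("hdfs://CentOS:9000/demo/words/")
.map(line => {
  println(line) // side effect that runs every time the line is computed
  line          // return the line so rdd1 holds the text rather than Unit
})

rdd1.persist(StorageLevel.MEMORY_AND_DISK) // cache first
// Mark the current RDD for checkpointing
rdd1.checkpoint()
rdd1.collect()
rdd1.unpersist() // drop the cache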
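The cost difference between checkpointing with and without a cache can be mimicked without a cluster. The following is a toy model in plain Scala, not Spark code: `ToyRdd`, `computeCount` and `checkpointData` are hypothetical names, and the model only assumes the behaviour described above, namely that checkpoint marks the RDD and that the data is materialized again after the action unless it is already cached.

```scala
object CheckpointCostSketch {
  // Toy stand-in for an RDD; `compute` plays the role of the lineage.
  final class ToyRdd[A](compute: () => Seq[A]) {
    var computeCount = 0                      // how many times the lineage ran
    var checkpointData: Option[Seq[A]] = None // stands in for the checkpoint files
    private var cacheEnabled = false
    private var cached: Option[Seq[A]] = None
    private var marked = false

    def persist(): Unit = { cacheEnabled = true }
    def checkpoint(): Unit = { marked = true } // only marks, nothing is written yet

    // Returns cached data if present, otherwise runs the lineage.
    private def materialize(): Seq[A] = cached.getOrElse {
      computeCount += 1
      val data = compute()
      if (cacheEnabled) cached = Some(data)
      data
    }

    // The action produces its result first; a marked RDD is then
    // materialized a second time to write the checkpoint.
    def collect(): Seq[A] = {
      val result = materialize()
      if (marked && checkpointData.isEmpty) checkpointData = Some(materialize())
      result
    }
  }

  def main(args: Array[String]): Unit = {
    val plain = new ToyRdd(() => Seq("a", "b"))
    plain.checkpoint()
    plain.collect()

    val withCache = new ToyRdd(() => Seq("a", "b"))
    withCache.persist()
    withCache.checkpoint()
    withCache.collect()

    println(s"without cache: ${plain.computeCount}, with cache: ${withCache.computeCount}")
  }
}
```

In this model the uncached RDD runs its lineage twice and the cached one runs it once, which is the reason for calling persist before checkpoint.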

Task Computation: Source Code Analysis

Theoretical Background

sc.textFile("hdfs:///demo/words/t_word") //RDD0
   .flatMap(_.split(" "))                //RDD1
   .map((_,1))                           //RDD2
   .reduceByKey(_+_)                     //RDD3  finalRDD
   .collect                              //Array, submits the job

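Spark In-Memory Computing (image 7)

Analyzing the code above, we can see that before executing a job Spark builds a DAG of the job from the transformations between RDDs and divides the job into several stages. Stages are divided according to the dependencies between RDDs. Spark classifies a transformation between two RDDs as either a ShuffleDependency (wide dependency) or a NarrowDependency (narrow dependency). When RDDs are connected by narrow dependencies, Spark groups their operators into one stage. When it meets a wide dependency, it starts a new stage.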

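Telling Wide and Narrow Dependencies Apart

Spark In-Memory Computing (image 8)

Wide dependency: one partition of the parent RDD feeds multiple partitions of the child RDD. Any such fan-out counts as a wide dependency. ShuffleDependency

Narrow dependency: one partition of a parent RDD (there may be several parent RDDs) feeds only one partition of the child RDD. OneToOneDependency|RangeDependency|PruneDependency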

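Before submitting a job, Spark starts from the finalRDD and works backwards to find all the RDDs it depends on and the dependencies between them. A narrow dependency is merged into the current stage. A wide dependency starts a new stage.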
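The backward walk can be sketched on a toy dependency graph in plain Scala. This is a simplified illustration and not the DAGScheduler implementation: `Node`, `Narrow`, `Wide` and `stages` are hypothetical names, and the sketch ignores cached partitions and parents shared by several children.

```scala
object StageSplitSketch {
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep // stays in the same stage
  final case class Wide(parent: Node) extends Dep   // shuffle boundary

  // Hypothetical stand-in for an RDD: a name plus its parent dependencies.
  final case class Node(name: String, deps: List[Dep] = Nil)

  // Walks back from `last`. RDDs reachable through narrow dependencies join
  // the current stage; each wide dependency opens a parent stage.
  // Stages are returned in execution order (parents first).
  def stages(last: Node): List[List[String]] = {
    var members = List.empty[String]
    var parentStages = List.empty[List[String]]
    var pending = List(last) // explicit stack instead of recursion within a stage
    while (pending.nonEmpty) {
      val node = pending.head
      pending = pending.tail
      members = node.name :: members
      node.deps.foreach {
        case Narrow(p) => pending = p :: pending
        case Wide(p)   => parentStages = parentStages ++ stages(p)
      }
    }
    parentStages :+ members
  }

  def main(args: Array[String]): Unit = {
    // Same shape as the word count lineage shown earlier.
    val rdd0 = Node("textFile")
    val rdd1 = Node("flatMap", List(Narrow(rdd0)))
    val rdd2 = Node("map", List(Narrow(rdd1)))
    val rdd3 = Node("reduceByKey", List(Wide(rdd2)))
    stages(rdd3).zipWithIndex.foreach { case (s, i) =>
      println(s"stage $i: ${s.mkString(" -> ")}")
    }
  }
}
```

For the word count lineage the sketch yields two stages: textFile, flatMap and map in the first, and reduceByKey in the second.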

Spark In-Memory Computing (image 9)

getMissingParentStages

private def getMissingParentStages(stage: Stage): List[Stage] = {
    val missing = new HashSet[Stage]
    val visited = new HashSet[RDD[_]]
    // We are manually maintaining a stack here to prevent StackOverflowError
    // caused by recursively visiting
    val waitingForVisit = new ArrayStack[RDD[_]]
    def visit(rdd: RDD[_]) {
      if (!visited(rdd)) {
        visited += rdd
        val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
        if (rddHasUncachedPartitions) {
          for (dep <- rdd.dependencies) {
            dep match {
              case shufDep: ShuffleDependency[_, _, _] =>
                val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
                if (!mapStage.isAvailable) {
                  missing += mapStage
                }
              case narrowDep: NarrowDependency[_] =>
                waitingForVisit.push(narrowDep.rdd)
            }
          }
        }
      }
    }
    waitingForVisit.push(stage.rdd)
    while (waitingForVisit.nonEmpty) {
      visit(waitingForVisit.pop())
    }
    missing.toList
  }

When it meets a wide dependency, the system automatically creates a ShuffleMapStage.

submitMissingTasks

  private def submitMissingTasks(stage: Stage, jobId: Int) {
    
        // partitions that still need to be computed
        val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
        ...
        // compute the preferred locations
      val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
        stage match {
          case s: ShuffleMapStage =>
            partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
          case s: ResultStage =>
            partitionsToCompute.map { id =>
              val p = s.partitions(id)
              (id, getPreferredLocs(stage.rdd, p))
            }.toMap
        }
      } catch {
        case NonFatal(e) =>
          stage.makeNewStageAttempt(partitionsToCompute.size)
          listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
          abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
          runningStages -= stage
          return
      }
    // map each partition to a task (these form the TaskSet)
    val tasks: Seq[Task[_]] = try {
      val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
      stage match {
        case stage: ShuffleMapStage =>
          stage.pendingPartitions.clear()
          partitionsToCompute.map { id =>
            val locs = taskIdToLocations(id)
            val part = partitions(id)
            stage.pendingPartitions += id
            new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
              taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
              Option(sc.applicationId), sc.applicationAttemptId, stage.rdd.isBarrier())
          }

        case stage: ResultStage =>
          partitionsToCompute.map { id =>
            val p: Int = stage.partitions(id)
            val part = partitions(p)
            val locs = taskIdToLocations(id)
            new ResultTask(stage.id, stage.latestInfo.attemptNumber,
              taskBinary, part, locs, id, properties, serializedTaskMetrics,
              Option(jobId), Option(sc.applicationId), sc.applicationAttemptId,
              stage.rdd.isBarrier())
          }
      }
    } catch {
      case NonFatal(e) =>
        abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
        runningStages -= stage
        return
    }
    // call taskScheduler#submitTasks with the TaskSet
    if (tasks.size > 0) {
      logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
        s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
      taskScheduler.submitTasks(new TaskSet(
        tasks.toArray, stage.id, stage.latestInfo.attemptNumber, jobId, properties))
    } 
    ...
  } 

Summary keywords: reverse derivation, finalRDD, ResultStage, ShuffleMapStage, ShuffleMapTask, ResultTask, ShuffleDependency, NarrowDependency, DAGScheduler, TaskScheduler, SchedulerBackend, DAGSchedulerEventProcessLoop

JAR Dependency Issues

1. Use --packages or --jars to resolve dependencies

[root@CentOS ~]# spark-submit  --master spark://CentOS:7077 --deploy-mode client --class com.baizhi.outputs.SparkWordCountApplication --name RedisSinkDemo --total-executor-cores 6 --packages redis.clients:jedis:2.9.2  /root/original-spark-rdd-1.0-SNAPSHOT.jar

2. Use a fat-jar plugin to bundle the required dependencies into the application JAR

[root@CentOS ~]# spark-submit  --master spark://CentOS:7077 --deploy-mode client --class com.baizhi.outputs.SparkWordCountApplication --name RedisSinkDemo --total-executor-cores 6 /root/spark-rdd-1.0-SNAPSHOT.jar

3. Integrating MySQL needs extra care

  • Add the MySQL driver JAR to the HADOOP_CLASSPATH classpath
  • Setting spark.executor.extraClassPath and spark.driver.extraClassPath resolves the MySQL dependency
[root@CentOS ~]#  spark-submit  --master spark://CentOS:7077 --deploy-mode client --class com.baizhi.inputs.SparkMySQLUserQueryApplication  --name MysqLReadDemo --total-executor-cores 6 --conf spark.driver.extraClassPath=/root/mysql-connector-java-5.1.49.jar --conf  spark.executor.extraClassPath=/root/mysql-connector-java-5.1.49.jar  /root/original-spark-rdd-1.0-SNAPSHOT.jar

If that feels tedious, you can also set these parameters in spark-defaults.conf:

spark.executor.extraClassPath=/root/.ivy2/jars/* 
spark.driver.extraClassPath=/root/.ivy2/jars/*
[root@CentOS ~]#  spark-submit  --master spark://CentOS:7077 --deploy-mode client --class com.baizhi.inputs.SparkMySQLUserQueryApplication  --name MysqLReadDemo --total-executor-cores 6 --packages mysql:mysql-connector-java:5.1.38 /root/original-spark-rdd-1.0-SNAPSHOT.jar
