- Shared variable categories
Shared variables fall into two kinds: broadcast variables and accumulators
- Official documentation explanation of shared variables
Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function
In other words, when the function runs on a remote cluster node (e.g. a YARN node),
it works against its own copies of every variable it references
These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program
Note:
val mapVar = new HashMap() // defined on the driver program side
val rdd = sc.textFile("...")
rdd.map(x => {
  ...mapVar... // referenced on the executor side
})
If the executor side uses a variable/constant defined on the driver side, then,
because the function passed to map is ultimately turned into tasks executed on executors,
and tasks run in parallel, every task holds its own copy of mapVar
example:
a job runs 1000 tasks and the variable to be copied is 10 MB,
so the copies take 1000 * 10 MB = 10 GB in total
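A runnable sketch of the note above (the map contents and RDD data are made up for illustration): mapVar lives on the driver and is captured by the closure, so every task receives its own copy.
import scala.collection.mutable
val mapVar = mutable.HashMap("k1" -> "v1") // driver side
val rdd = sc.parallelize(Seq("k1", "k2"))
rdd.map(x => mapVar.getOrElse(x, "miss")).collect() // Array(v1, miss); the closure ships mapVar to each task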
Supporting general, read-write shared variables across tasks would be inefficient
However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators
- Broadcast Variables
3.1 Use case
When a Spark job or ETL process needs the same read-only data, such as a small lookup table, on every executor,
broadcasting it avoids shipping a copy with every task;
the typical example is joining a large table against a small one (see 3.3)
3.2 Official documentation description
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks
That is, the read-only variable is cached once per machine (i.e. per executor;
tasks within an executor run in parallel, and a worker node typically runs one executor per application),
instead of one copy inside every task
They can be used, for example, to give every node a copy of a large input dataset in an efficient manner
Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost
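A minimal sketch of the API described above, matching the pattern shown in the official docs:
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value // Array(1, 2, 3), readable on every executor without a per-task copy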
3.3 Broadcast variable examples
(1) join operation
Suppose there are two tables, table1 and table2
table1:
<1, record1> <2, record2> <3, record3>
table2:
<1, input1> <2, input2> <4, input4>
Their join result is
<1, (record1, input1)>
<2, (record2, input2)>
Because this is a join, a shuffle takes place, which brings a problem:
the shuffle copies data across executors, so it incurs network I/O overhead
The workaround:
if the small table's entries (e.g. <2, input2>) are distributed as a broadcast variable and kept in executor memory,
each record such as <1, record1> is probed against the broadcast variable; records that match are emitted, and records that do not are skipped
Drawback:
only works when the broadcast variable is not too large (it must fit in executor memory)
Applies to:
joins between a large table and a small table; also one way to mitigate data skew
3.4 Implementing a basic join in code (with shuffle)
3.4.1 Code
val g5 = sc.parallelize(Array(("12", "henry"), ("11", "kaka"), ("99", "ronaldo")))
  .map(x => (x._1, x))
// before the map, f11 is an RDD[(String, String, String)],
// but according to the join signature the argument must be an RDD[(K, W)],
// so a map keys it by the first field;
// the type then becomes RDD[(String, (String, String, String))]
val f11 = sc.parallelize(Array(("12", "Lauern", "Arsenal"), ("9", "messi", "Barcelona")))
  .map(x => (x._1, x))
// after the join the type is RDD[(String, ((String, String), (String, String, String)))]
g5.join(f11)
  .map(x => x._1 + "," + x._2._1._2 + "," + x._2._2._3)
  .collect
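With this sample data, only the key "12" appears in both RDDs, so the collect should return Array(12,henry,Arsenal).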
3.4.2 Stage graph after running the code
The graph above shows that the join does involve a shuffle,
because
(1) the first two stages show shuffle writes of 281 B and 296 B respectively
(2) the last stage shows a shuffle read of 577.0 B
3.5 Implementing the join with a broadcast variable
3.5.1 Code
import org.apache.spark.SparkContext

def broadcastJoin(sc: SparkContext) = {
  // g5 is assumed to be the small table, so it is collected to the driver as a Map
  val g5 = sc.parallelize(Array(("12", "henry"), ("11", "kaka"), ("99", "ronaldo"))).collectAsMap()
  val f11 = sc.parallelize(Array(("12", "Lauern", "Arsenal"), ("9", "messi", "Barcelona"))).map(x => (x._1, x))
  // the broadcast must be created on the driver side
  val g5Broadcast = sc.broadcast(g5)
  f11.mapPartitions(partition => {
    // fetch the broadcast variable's content
    val g5Stus = g5Broadcast.value
    // keep only records in this partition whose key appears in the broadcast map,
    // yielding (key, name from g5, second field of the f11 record)
    for ((key, value) <- partition if g5Stus.contains(key))
      yield (key, g5Stus.getOrElse(key, ""), value._2)
  })
}
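A hypothetical invocation that prints the joined records on the driver:
broadcastJoin(sc).collect().foreach(println) // expected: (12,henry,Lauern)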
3.5.1.1 yield
For each iteration of your for loop, yield generates a value which will be remembered. It's like the for loop has a buffer you can't see, and for each iteration of your for loop, another item is added to that buffer. When your for loop finishes running, it will return this collection of all the yielded values. The type of the collection that is returned is the same type that you were iterating over, so a Map yields a Map, a List yields a List, and so on
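A minimal illustration of that behavior (sample values made up):
val xs = List(1, 2, 3)
val doubled = for (x <- xs) yield x * 2 // List(2, 4, 6): a List yields a List
val m = Map("a" -> 1, "b" -> 2)
val bumped = for ((k, v) <- m) yield (k, v + 1) // Map(a -> 2, b -> 3): a Map yields a Map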
3.5.2 Stage graph after running the code
(1) The graph shows that broadcastJoin involves no shuffle, since no new stage is created
(2) Likewise, inspecting the UI after execution,
the Shuffle Read and Shuffle Write columns of the Completed Stages table are empty
- Accumulators
4.1 Accumulator definition
Accumulators are variables that are only “added” to through an associative and commutative operation and can therefore be efficiently supported in parallel
4.2 Source code comment (for longAccumulator)
Create and register a long accumulator, which starts with 0 and accumulates inputs by add
i.e. the accumulator's initial value is 0, and inputs are accumulated via the add method
4.3 Usage
Typically used for counting/accumulation, e.g. to learn how many records a Spark/ETL job processed in total
and how many were dropped (the ratio of dirty data)
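A minimal sketch of that pattern (the sample data and the "dirty record" rule below are made up for illustration):
val totalCount = sc.longAccumulator("total")
val dirtyCount = sc.longAccumulator("dirty")
val lines = sc.parallelize(Seq("1,ok", "bad-line", "2,ok"))
val clean = lines.filter { line =>
  totalCount.add(1)
  val isDirty = !line.contains(",") // hypothetical dirty-data rule
  if (isDirty) dirtyCount.add(1)
  !isDirty
}
clean.count() // an action must run before the accumulator values are populated
println(s"total=${totalCount.value}, dirty=${dirtyCount.value}") // total=3, dirty=1
Note that these accumulators are updated inside a transformation, so task retries may over-count; only updates performed inside actions are guaranteed to be applied exactly once.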