Spark-Core (Shared Variables)

1.Shared Variables

When a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and updates made to them on the remote machine are not propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.

val value = new HashMap()   // driver-side variable referenced inside the closure
val rdd = ...               // some RDD
rdd.foreach(x => {
  value...                  // each task works on its own copy of value
})
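
As a concrete illustration of the copy semantics described above, here is a minimal sketch (assuming an existing SparkContext sc) showing that updates made inside tasks are not visible back in the driver:

var counter = 0
sc.parallelize(1 to 100).foreach(x => counter += x)   // each task increments its own copy of counter
println(counter)   // on a cluster this typically still prints 0; use an accumulator instead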

2.Accumulators

Accumulators only support an add operation: tasks running on the cluster can add to them, but only the driver program can read the accumulated value.

scala> val accum = sc.longAccumulator("My Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(My Accumulator), value: 0)

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s

scala> accum.value
res2: Long = 10
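
The same accumulator works inside a standalone application, for example to count blank records on the side while the main job runs. Below is a minimal sketch with made-up sample data; the object name and values are illustrative only:

import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorApp {

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AccumulatorApp").setMaster("local[2]"))

    // side-channel counter for blank lines; tasks may only add to it
    val blankLines = sc.longAccumulator("blank lines")

    val lines = sc.parallelize(Seq("a", "", "b", "", "c"))
    val nonEmpty = lines.filter { line =>
      if (line.isEmpty) blankLines.add(1)
      line.nonEmpty
    }

    println(nonEmpty.count())   // 3; this action also triggers the accumulator updates
    println(blankLines.value)   // 2; only the driver reads the value
    // note: updates made inside transformations can be applied more than once if a task is re-executed

    sc.stop()
  }
}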

3.Broadcast Variables

val value = new HashMap()   // roughly 10M of data in the driver
val rdd = ...               // an RDD that runs as 1000 tasks
rdd.foreach(x => {
  value...                  // 1000 tasks * 10M = 10G shipped over the network
})

Suppose the variable value above holds about 10M of data and the RDD below runs as 1000 tasks. During the computation those 10M are sent to every task, so roughly 10G of data ends up being shipped, which is clearly wasteful. A broadcast variable avoids this: Spark keeps one copy per machine, so all tasks on the same machine read that single copy, and a large amount of duplicated data is eliminated.

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
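
Tying this back to the HashMap sketch above, the usual pattern is to broadcast the lookup structure once and read it inside the tasks. A minimal sketch, assuming the same spark-shell session (sc) and made-up data:

// broadcast the lookup map once per machine instead of shipping it with every task
val lookup = sc.broadcast(Map("601" -> "张三", "602" -> "李四"))

sc.parallelize(Seq("601", "602", "603"))
  .map(id => (id, lookup.value.getOrElse(id, "unknown")))   // tasks read the machine-local copy
  .foreach(println)

lookup.unpersist()   // optionally release the executor-side copies when no longer needed

The standalone application below first implements a regular join:
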
package core

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastApp {

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf()
    sparkConf.setAppName("BroadcastApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    commonJoin(sc)
    sc.stop()
  }

  def commonJoin(sc: SparkContext) = {

    val info1 = sc.parallelize(Array(("601", "张三"), ("602", "李四")))
    val info2 = sc.parallelize(Array(("601", "东南","20"), ("603", "科大","21"), ("604", "浙大","22")))
        .map(x => (x._1,x))

    info1.join(info2).map(x =>{
      x._1 + "," + x._2._1 + "," + x._2._2._2
    }).foreach(println)

  }

}

The code above performs a regular, shuffle-based join.
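
Only the key "601" appears in both RDDs, so with the sample data the job should print a single line along the lines of:

601,张三,东南

The full listing below adds a broadcastJoin that produces the same matching without a shuffle: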

package core

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastApp {

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf()
    sparkConf.setAppName("BroadcastApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

//    commonJoin(sc)
    broadcastJoin(sc) // broadcast + map replaces the shuffle of a join, but the broadcast (small) table must not be too large
    Thread.sleep(2000000) // keep the application alive so the Spark web UI can be inspected

    sc.stop()

  }

  def broadcastJoin(sc: SparkContext) = {
    // small table ==> broadcast it
    val info1 = sc.parallelize(Array(("601", "张三"), ("602", "李四"))).collectAsMap()
    // collectAsMap() returns a Map; broadcasting a Map rather than an RDD of pairs makes per-record lookups cheap
    val info1Broadcast = sc.broadcast(info1)
    // large table
    val info2 = sc.parallelize(Array(("601", "东南","20"), ("603", "科大","21"), ("604", "浙大","22")))
      .map(x => (x._1,x))

    info2.mapPartitions(x =>{
      val broadcastMap = info1Broadcast.value
      for((key,value) <- x if (broadcastMap.contains(key)))
        // yield inside a for comprehension collects each produced element and returns the collection when the loop finishes.
        // In Scala a for comprehension has a return value whose type follows the collection being iterated: iterating a Map yields a Map, a List yields a List, and so on.
        yield (key,broadcastMap.get(key).getOrElse(""),value._2)
    }).foreach(println)

  }

  def commonJoin(sc: SparkContext) = {

    val info1 = sc.parallelize(Array(("601", "张三"), ("602", "李四")))
    val info2 = sc.parallelize(Array(("601", "东南","20"), ("603", "科大","21"), ("604", "浙大","22")))
        .map(x => (x._1,x))

    info1.join(info2).map(x =>{
      x._1 + "," + x._2._1 + "," + x._2._2._2
    }).foreach(println)

  }

}

Once the small table has been broadcast, the join operator is no longer needed: each record read from the large table is matched against the broadcast copy of the small table. The small table is then shipped once per machine instead of once per task, which reduces both network transfer and memory consumption.
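
With the sample data above, broadcastJoin should print a single tuple, (601,张三,东南), and because the lookup happens inside mapPartitions against the broadcast map, the Spark web UI (kept reachable by the Thread.sleep) should show no shuffle for this job, unlike commonJoin.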
