When a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and updates to them on the remote machines are not propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
val value = new HashMap()   // a variable defined on the driver
val rdd = …                 // some RDD
rdd.foreach(x => {
  value…                    // each task works on its own copy; the driver's map never sees the updates
})
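To make the pitfall concrete, here is a minimal sketch (assuming an existing SparkContext named sc): the counter mutated inside foreach is a per-task copy, so the driver never sees the updates.

// Minimal sketch of the closure-copy pitfall; assumes an existing SparkContext `sc`.
var counter = 0
val nums = sc.parallelize(1 to 100)

// On a cluster, each task increments its own deserialized copy of `counter`.
nums.foreach(x => counter += x)

// Back on the driver the variable is typically still 0, not 5050
// (in local mode the exact behavior is undefined; hence accumulators).
println(s"counter = $counter")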
### 2. Accumulators
Accumulators are shared variables that only support an add operation: tasks running on the cluster can add to them, but only the driver program can read their value.
scala> val accum = sc.longAccumulator("My Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(My Accumulator), value: 0)
scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s
scala> accum.value
res2: Long = 10
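Outside the shell, a typical use is counting side information, such as malformed input, while the main transformation runs. The following is a minimal sketch with made-up data; the object name AccumulatorApp and the sample lines are illustrative only.

package core

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative example: count blank lines as a side effect while splitting records.
object AccumulatorApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AccumulatorApp").setMaster("local[2]"))

    val blankLines = sc.longAccumulator("blank lines")            // named, so it shows up in the web UI
    val lines = sc.parallelize(Seq("a,b", "", "c,d", "", "e,f"))  // made-up input

    val fields = lines.flatMap { line =>
      if (line.isEmpty) blankLines.add(1)   // tasks can only add; they cannot read the value
      line.split(",")
    }

    // Updates made inside a transformation only become reliable after an action has run
    // (and may be re-applied if a task is retried).
    fields.count()
    println(s"blank lines = ${blankLines.value}")  // only the driver reads the value; prints 2 here
    sc.stop()
  }
}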
### 3. Broadcast Variables
val value = new HashMap()   // about 10 MB of data on the driver
val rdd = …                 // an RDD that runs as 1000 tasks
rdd.foreach(x => {
  value…                    // 1000 tasks * 10 MB = 10 GB shipped over the network
})
Suppose the variable value above holds 10 MB of data and the RDD below runs as 1000 tasks. During computation the 10 MB would be shipped to every task, roughly 10 GB of traffic in total, which is clearly wasteful. A broadcast variable avoids this: Spark keeps one copy per machine rather than one per task, so all tasks on the same machine read the same cached copy, eliminating a great deal of duplicated data.
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
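Inside a job the data should always be read through broadcastVar.value, so that executors use the cached copy; when it is no longer needed, the copies can be released. A small sketch continuing the shell session above:

// Read the broadcast data through .value inside the closure.
val scaled = sc.parallelize(1 to 3).map(i => broadcastVar.value(i - 1) * 10).collect()
// scaled: Array(10, 20, 30)

// Release the copies cached on the executors once the data is no longer needed.
broadcastVar.unpersist()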
package core

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastApp {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    sparkConf.setAppName("BroadcastApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    commonJoin(sc)

    sc.stop()
  }

  def commonJoin(sc: SparkContext) = {
    // (id, name)
    val info1 = sc.parallelize(Array(("601", "张三"), ("602", "李四")))
    // (id, (id, school, age))
    val info2 = sc.parallelize(Array(("601", "东南", "20"), ("603", "科大", "21"), ("604", "浙大", "22")))
      .map(x => (x._1, x))

    // join shuffles both RDDs by key; each output line is id,name,school
    info1.join(info2).map(x => {
      x._1 + "," + x._2._1 + "," + x._2._2._2
    }).foreach(println)
  }
}
The above is an ordinary join: both RDDs are shuffled by key before the matching rows are combined. The only key present in both datasets is 601, so the job prints 601,张三,东南.
package core

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastApp {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    sparkConf.setAppName("BroadcastApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    // commonJoin(sc)
    // broadcast + mapPartitions replaces the shuffle of a join,
    // but the broadcast (small-table) side must not be too large
    broadcastJoin(sc)

    Thread.sleep(2000000) // keep the application alive so the job can be inspected in the web UI
    sc.stop()
  }

  def broadcastJoin(sc: SparkContext) = {
    // small side ==> collect to the driver and broadcast
    val info1 = sc.parallelize(Array(("601", "张三"), ("602", "李四"))).collectAsMap()
    // collectAsMap returns a Map; broadcasting a Map is usually the most convenient form for lookups
    val info1Broadcast = sc.broadcast(info1)

    // large side
    val info2 = sc.parallelize(Array(("601", "东南", "20"), ("603", "科大", "21"), ("604", "浙大", "22")))
      .map(x => (x._1, x))

    info2.mapPartitions(x => {
      val broadcastMap = info1Broadcast.value
      // A for comprehension with yield collects each generated element and returns the collection;
      // iterating a Map yields a Map, iterating a List yields a List, and so on.
      // Here x is an Iterator, so the result is an Iterator, which is what mapPartitions expects.
      for ((key, value) <- x if broadcastMap.contains(key))
        yield (key, broadcastMap.getOrElse(key, ""), value._2)
    }).foreach(println)
  }

  def commonJoin(sc: SparkContext) = {
    val info1 = sc.parallelize(Array(("601", "张三"), ("602", "李四")))
    val info2 = sc.parallelize(Array(("601", "东南", "20"), ("603", "科大", "21"), ("604", "浙大", "22")))
      .map(x => (x._1, x))

    info1.join(info2).map(x => {
      x._1 + "," + x._2._1 + "," + x._2._2._2
    }).foreach(println)
  }
}
Once the small table has been broadcast, no join is needed at all: each record read from the large table is simply looked up in the broadcast map. Instead of one copy of the small table per task, there is one copy per machine, which cuts both network transfer and memory consumption.
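For comparison, the DataFrame API expresses the same idea with a broadcast hint, after which the optimizer performs a broadcast hash join instead of a shuffled join. A minimal sketch, assuming a SparkSession named spark and the same sample data (the names here are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("BroadcastHintApp").master("local[2]").getOrCreate()
import spark.implicits._

val students = Seq(("601", "张三"), ("602", "李四")).toDF("id", "name")  // small side
val schools  = Seq(("601", "东南", "20"), ("603", "科大", "21"), ("604", "浙大", "22")).toDF("id", "school", "age")  // large side

// The broadcast() hint tells the optimizer to ship `students` to every executor
// and join without shuffling `schools`.
schools.join(broadcast(students), Seq("id")).show()

spark.stop()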