(RDD) Broadcast Variables

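1) Suppose a job has 10,000 tasks and each task carries its own copy of a 100 MB variable. That adds up to roughly 1 TB of data shipped and held in memory, which is prohibitive.

    So: 10,000 tasks ==> 100 executors. A broadcast variable is sent to each executor, and all tasks running on that executor share the single copy.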
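
The saving can be checked with a quick back-of-the-envelope calculation. The sketch below is plain Scala (no Spark needed) and uses only the figures from the text; the variable names are illustrative, and it assumes the executors hold exactly one copy each.

```scala
// Figures taken from the example above.
val tasks = 10000L        // tasks in the job
val executors = 100L      // executors that run those tasks
val variableMB = 100L     // size of the shared variable, in MB

// Without broadcast: one copy travels with every task.
val perTaskTotalMB = tasks * variableMB
// With broadcast: one copy per executor, shared by all of its tasks.
val perExecutorTotalMB = executors * variableMB

println(s"per-task copies:     $perTaskTotalMB MB")
println(s"per-executor copies: $perExecutorTotalMB MB")
println(s"reduction factor:    ${perTaskTotalMB / perExecutorTotalMB}x")
```

Under these assumptions the total drops from 1,000,000 MB to 10,000 MB, a factor of 100.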


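2) Use case

    Map join: broadcast the small table to every executor so the join runs on the map side, without shuffling the large table.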

    BroadcastJoin = MapJoin 
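
A minimal sketch of a map-side join with the RDD API might look like the following. The table contents and the names `smallTable`, `largeTable`, and `smallBc` are hypothetical, and the local master setting is only for illustration; this is one way to express the idea, not the only one.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; inside spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("broadcast-map-join").getOrCreate()
val sc = spark.sparkContext

// Hypothetical small table: country code -> country name.
val smallTable = Map("CN" -> "China", "US" -> "United States")
// Hypothetical large table: (user, country code).
val largeTable = sc.parallelize(Seq(("alice", "US"), ("bob", "CN"), ("carol", "FR")))

// Send the small table to each executor once.
val smallBc = sc.broadcast(smallTable)

// Inner join on the map side: look up each key in the broadcast map
// and drop rows without a match. The large table is never shuffled.
val joined = largeTable.flatMap { case (user, code) =>
  smallBc.value.get(code).map(name => (user, code, name))
}

joined.collect().foreach(println)
```

Here `carol` has no matching country code in the small table, so her row is dropped, as in an inner join. This approach only makes sense when the small table fits comfortably in each executor's memory.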


3) Explanation

    Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

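    In other words, a broadcast variable is a read-only value cached once per machine rather than once per task: each executor holds one copy that its tasks can read directly, and Spark distributes that copy to the nodes efficiently to keep communication cost down.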


4) Usage

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
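
The transcript above reads the value on the driver. In practice the value is usually read inside a task, and the broadcast can be released once it is no longer needed. The following is a small sketch of that pattern; the names `bc` and `scaled` are illustrative, and the local master setting is an assumption for running it outside a cluster.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; inside spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("broadcast-usage").getOrCreate()
val sc = spark.sparkContext

val bc = sc.broadcast(Array(1, 2, 3))

// Each task reads the executor-local copy through .value inside the closure.
val scaled = sc.parallelize(Seq(0, 1, 2)).map(i => bc.value(i) * 10).collect()
println(scaled.mkString(", "))

// Remove the cached copies from the executors; the value is re-sent if used again.
bc.unpersist()
// Release all resources; the broadcast variable cannot be used after this call.
bc.destroy()
```

Referencing `bc.value` inside the closure, rather than the underlying array directly, is what lets Spark ship only the lightweight broadcast handle with each task.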


 
 
 
