Spark Broadcast Variables

这篇文章详细的介绍了spark广播变量,值得一看

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-broadcast.html

在此只摘录其中的Example

Let’s start with an introductory example to check out how to use broadcast variables and build your initial understanding.

You’re going to use a static mapping of interesting projects with their websites, i.e. Map[String, String] that the tasks, i.e. closures (anonymous functions) in transformations, use.

scala> val pws = Map("Apache Spark" -> "http://spark.apache.org/", "Scala" -> "http://www.scala-lang.org/")
pws: scala.collection.immutable.Map[String,String] = Map(Apache Spark -> http://spark.apache.org/, Scala -> http://www.scala-lang.org/)

scala> val websites = sc.parallelize(Seq("Apache Spark", "Scala")).map(pws).collect
...
websites: Array[String] = Array(http://spark.apache.org/, http://www.scala-lang.org/)

It works, but is very ineffective as the pws map is sent over the wire to executors while it could have been there already. If there were more tasks that need the pws map, you could improve their performance by minimizing the number of bytes that are going to be sent over the network for task execution.

Enter broadcast variables.

val pwsB = sc.broadcast(pws)
val websites = sc.parallelize(Seq("Apache Spark", "Scala")).map(pwsB.value).collect
// websites: Array[String] = Array(http://spark.apache.org/, http://www.scala-lang.org/)

Semantically, the two computations - with and without the broadcast value - are exactly the same, but the broadcast-based one wins performance-wise when there are more executors spawned to execute many tasks that use pws map.

总结

通过这篇文章可以知道,如果在driver中定义一个普通的变量,也是可以在不同的task中传递的,只不过是通过拷贝一个副本的方式传递。为了提高性能通过定义广播变量,在每个机器上只生成一个只读变量,共享给这个机器上所有的task。

你可能感兴趣的:(Spark Broadcast Variables)