Spark Broadcast Variables

1. Basic Usage

import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("BroadcastApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val data = sc.parallelize(Array(1, 2, 3, 4, 5))
    //val broadcastData = sc.broadcast(data)
    //java.lang.IllegalArgumentException: requirement failed: Can not directly broadcast RDDs; instead, call collect() and broadcast the result.
    //An RDD cannot be broadcast directly; collect() it first and broadcast the result:
    val broadcastData = sc.broadcast(data.collect())
    broadcastData.value.foreach(println(_)) // 1 2 3 4 5
  }
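
The real payoff comes when a small lookup table is referenced inside a transformation: broadcasting ships one copy per executor instead of one copy per task closure. A minimal sketch (the lookup map and sample codes below are invented for illustration):

val countryNames = Map("CN" -> "China", "US" -> "United States")
val bcNames = sc.broadcast(countryNames)

val codes = sc.parallelize(Seq("CN", "US", "CN"))
codes.map(code => bcNames.value.getOrElse(code, "unknown"))
     .collect().foreach(println) // China, United States, China

bcNames.unpersist() // optionally drop the cached copies on the executors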

2. How Broadcasting Works

  • Source analysis of SparkContext.broadcast
Two points the Scaladoc calls out: the variable is read-only, and it is sent to each node only once.

 /**
   * Broadcast a read-only variable to the cluster, returning a
   * [[org.apache.spark.broadcast.Broadcast]] object for reading it in distributed functions.
   * The variable will be sent to each cluster only once.
   *
   * @param value value to broadcast to the Spark nodes
   * @return `Broadcast` object, a read-only variable cached on each machine
   */
  def broadcast[T: ClassTag](value: T): Broadcast[T] = {
    assertNotStopped()
    require(!classOf[RDD[_]].isAssignableFrom(classTag[T].runtimeClass),
      "Can not directly broadcast RDDs; instead, call collect() and broadcast the result.")
    // delegates to BroadcastManager.newBroadcast
    val bc = env.broadcastManager.newBroadcast[T](value, isLocal)
    val callSite = getCallSite
    logInfo("Created broadcast " + bc.id + " from " + callSite.shortForm)
    cleaner.foreach(_.registerBroadcastForCleanup(bc))
    bc
  }

BroadcastManager assigns the broadcast its id and hands off to the factory:

def newBroadcast[T: ClassTag](value_ : T, isLocal: Boolean): Broadcast[T] = {
    broadcastFactory.newBroadcast[T](value_, isLocal, nextBroadcastId.getAndIncrement())
  }

The factory in use is TorrentBroadcastFactory, so this creates a TorrentBroadcast:

override def newBroadcast[T: ClassTag](value_ : T, isLocal: Boolean, id: Long): Broadcast[T] = {
    new TorrentBroadcast[T](value_, id)
  }
  


In short: the driver splits the serialized object into chunks (4 MB by default) and stores them in its BlockManager. An executor first checks its own BlockManager; on a miss it fetches the chunks from the driver and/or other executors, caches them locally, and from then on serves them to peers. Requests are therefore not concentrated on the driver, which avoids a single point of failure and heavy network and disk I/O on the driver side.

/**
 * A BitTorrent-like implementation of [[org.apache.spark.broadcast.Broadcast]].
 *
 * The mechanism is as follows:
 *
 * The driver divides the serialized object into small chunks and
 * stores those chunks in the BlockManager of the driver.
 *
 * On each executor, the executor first attempts to fetch the object from its BlockManager. If
 * it does not exist, it then uses remote fetches to fetch the small chunks from the driver and/or
 * other executors if available. Once it gets the chunks, it puts the chunks in its own
 * BlockManager, ready for other executors to fetch from.
 *
 * This prevents the driver from being the bottleneck in sending out multiple copies of the
 * broadcast data (one per executor).
 *
 * When initialized, TorrentBroadcast objects read SparkEnv.get.conf.
 *
 * @param obj object to broadcast
 * @param id A unique identifier for the broadcast variable.
 */
private[spark] class TorrentBroadcast[T: ClassTag](obj: T, id: Long)
  extends Broadcast[T](id) with Logging with Serializable {

  /**
   * Value of the broadcast object on executors. This is reconstructed by [[readBroadcastBlock]],
   * which builds this value by reading blocks from the driver and/or other executors.
   *
   * On the driver, if the value is required, it is read lazily from the block manager.
   */
  // Lazily evaluated: the first access tries the local BlockManager and, on a miss,
  // falls back to fetching the blocks remotely (see readBlocks below).
  @transient private lazy val _value: T = readBroadcastBlock()

  /** The compression codec to use, or None if compression is disabled */
  @transient private var compressionCodec: Option[CompressionCodec] = _

  /** Size of each block. Default value is 4MB. This value is only read by the broadcaster. */
  @transient private var blockSize: Int = _

  private def setConf(conf: SparkConf) {
    compressionCodec = if (conf.getBoolean("spark.broadcast.compress", true)) {
      Some(CompressionCodec.createCodec(conf))
    } else {
      None
    }
    // Note: use getSizeAsKb (not bytes) to maintain compatibility if no units are provided
    blockSize = conf.getSizeAsKb("spark.broadcast.blockSize", "4m").toInt * 1024
    checksumEnabled = conf.getBoolean("spark.broadcast.checksum", true)
  }
  setConf(SparkEnv.get.conf)
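
Since setConf reads these keys from SparkEnv.get.conf, they can be tuned on the application's SparkConf before the SparkContext is created. A small sketch using the three settings shown above:

val tunedConf = new SparkConf()
  .setAppName("BroadcastTuning").setMaster("local[2]")
  .set("spark.broadcast.compress", "true") // compress serialized chunks (default: true)
  .set("spark.broadcast.blockSize", "4m")  // chunk size (default: 4m)
  .set("spark.broadcast.checksum", "true") // verify chunk integrity (default: true)
val tunedSc = new SparkContext(tunedConf)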



  /**
   * Divide the object into multiple blocks and put those blocks in the block manager.
   * (i.e. store the object in the local BlockManager)
   *
   * @param value the object to divide
   * @return number of blocks this broadcast variable is divided into
   */
  private def writeBlocks(value: T): Int = {
    import StorageLevel._
    // Store a copy of the broadcast variable in the driver so that tasks run on the driver
    // do not create a duplicate copy of the broadcast variable's value.
    val blockManager = SparkEnv.get.blockManager

    if (!blockManager.putSingle(broadcastId, value, MEMORY_AND_DISK, tellMaster = false)) {
      throw new SparkException(s"Failed to store $broadcastId in BlockManager")
    }
    // ... (the rest of the method splits the serialized value into chunks and
    // stores each chunk as a separate piece; elided here)
  }
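
To make the chunking concrete, here is a simplified illustration (not Spark's actual blockifyObject, which works on serialization streams) of cutting a serialized byte array into blockSize pieces:

import java.nio.ByteBuffer

def blockify(bytes: Array[Byte], blockSize: Int): Array[ByteBuffer] =
  bytes.grouped(blockSize)                 // slices of at most blockSize bytes
       .map(chunk => ByteBuffer.wrap(chunk)) // wrap each slice without copying
       .toArray

// e.g. a 10 MB payload with the default 4 MB block size yields 3 chunks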


/** Fetch torrent blocks from the driver and/or other executors. */
  private def readBlocks(): Array[BlockData] = {
    // Fetch chunks of data. Note that all these chunks are stored in the BlockManager and reported
    // to the driver, so other executors can pull these chunks from this executor as well.
    val blocks = new Array[BlockData](numBlocks)
    val bm = SparkEnv.get.blockManager
    ......
      // We found the block from remote executors/driver's BlockManager, so put the block
      // in this executor's BlockManager (serialized, memory and disk).
      if (!bm.putBytes(pieceId, b, StorageLevel.MEMORY_AND_DISK_SER, tellMaster = true)) {
        throw new SparkException(
          s"Failed to store $pieceId of $broadcastId in local BlockManager")
      }

Finally the fetched chunks are reassembled into the original object, which is cached locally so that other tasks on this executor do not need to re-fetch it:

val obj = TorrentBroadcast.unBlockifyObject[T](
                blocks.map(_.toInputStream()), SparkEnv.get.serializer, compressionCodec)
              // Store the merged copy in BlockManager so other tasks on this executor don't
              // need to re-fetch it.
              val storageLevel = StorageLevel.MEMORY_AND_DISK
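
Putting the pieces together, the first read on an executor follows a try-local-then-remote pattern. A simplified sketch of what readBroadcastBlock does (locking, checksum verification, and error cleanup omitted; names taken from the snippets above):

def readBroadcastBlockSketch(): T = {
  val bm = SparkEnv.get.blockManager
  bm.getLocalValues(broadcastId) match {
    case Some(blockResult) =>
      // Already cached on this executor: serve it from the local BlockManager.
      blockResult.data.next().asInstanceOf[T]
    case None =>
      // Cache miss: pull the chunks from the driver and/or peers, reassemble,
      // and cache the merged value for later tasks on this executor.
      val blocks = readBlocks()
      val obj = TorrentBroadcast.unBlockifyObject[T](
        blocks.map(_.toInputStream()), SparkEnv.get.serializer, compressionCodec)
      bm.putSingle(broadcastId, obj, StorageLevel.MEMORY_AND_DISK, tellMaster = false)
      obj
  }
}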



Finally, the broadcast factory itself is initialized once, by SparkContext or Executor, before any broadcast is used:

// Called by SparkContext or Executor before using Broadcast
  private def initialize() {
    synchronized {
      if (!initialized) {
        broadcastFactory = new TorrentBroadcastFactory
        broadcastFactory.initialize(isDriver, conf, securityManager)
        initialized = true
      }
    }
  }

3. Summary

  1. Read-only and immutable => data consistency is guaranteed
  2. Cached once per machine rather than shipped with every task, which cuts network I/O
  3. The broadcast value is deserialized and cached before a task runs
  4. Typical use cases (see the sketch below):
    1) the broadcast object is small
    2) it is shared across multiple stages
    3) one executor runs many tasks that all read it
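
A classic scenario that ticks all three boxes is a map-side join: broadcast a small dimension table and join it against a large RDD without a shuffle. A minimal sketch (table contents invented for illustration):

val categories = Map(1 -> "electronics", 2 -> "books") // small, read-only
val bcCategories = sc.broadcast(categories)

val orders = sc.parallelize(Seq((1, 99.0), (2, 15.5), (1, 42.0)))
val joined = orders.map { case (categoryId, amount) =>
  (bcCategories.value.getOrElse(categoryId, "unknown"), amount) // no shuffle needed
}
joined.collect().foreach(println)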
