Spark Core — How BlockManager Writes Data, and the Cache Eviction Mechanism

How BlockManager Writes Data

  In the previous post we analyzed how BlockManager reads data, which breaks down into local reads and remote fetches. Now let's look at how it writes data, which mainly goes through the doPut() method:

doPut()
 private def doPut(
      blockId: BlockId,
      data: BlockValues,
      level: StorageLevel,
      tellMaster: Boolean = true,
      effectiveStorageLevel: Option[StorageLevel] = None)
    : Seq[(BlockId, BlockStatus)] = {

    // The block must not be null, and the storage level must be valid
    require(blockId != null, "BlockId is null")
    require(level != null && level.isValid, "StorageLevel is null or invalid")
    effectiveStorageLevel.foreach { level =>
      require(level != null && level.isValid, "Effective StorageLevel is null or invalid")
    }

    // Return value
    // Records information about the written data: a mapping from BlockId to BlockStatus
    val updatedBlocks = new ArrayBuffer[(BlockId, BlockStatus)]

    /* Remember the block's storage level so that we can correctly drop it to disk if it needs
     * to be dropped right after it got put into memory. Note, however, that other threads will
     * not be able to get() this block until we call markReady on its BlockInfo. */
    // Create a BlockInfo for the block being written and put it into the blockInfo map.
    // The block's storage level is recorded so that, if memory runs short while new data
    // is being stored, in-memory data can be correctly dropped to disk to free up memory.
    val putBlockInfo = {
      val tinfo = new BlockInfo(level, tellMaster)
      // Do atomically !
      val oldBlockOpt = blockInfo.putIfAbsent(blockId, tinfo)
      if (oldBlockOpt.isDefined) {
        if (oldBlockOpt.get.waitForReady()) {
          logWarning(s"Block $blockId already exists on this machine; not re-adding it")
          return updatedBlocks
        }
        // TODO: So the block info exists - but previous attempt to load it (?) failed.
        // What do we do now ? Retry on it ?
        oldBlockOpt.get
      } else {
        tinfo
      }
    }

    val startTimeMs = System.currentTimeMillis

    /* If we're storing values and we need to replicate the data, we'll want access to the values,
     * but because our put will read the whole iterator, there will be no values left. For the
     * case where the put serializes data, we'll remember the bytes, above; but for the case where
     * it doesn't, such as deserialized storage, let's rely on the put returning an Iterator. */
    var valuesAfterPut: Iterator[Any] = null

    // Ditto for the bytes after the put
    var bytesAfterPut: ByteBuffer = null

    // Size of the block in bytes
    var size = 0L

    // The level we actually use to put the block
    val putLevel = effectiveStorageLevel.getOrElse(level)

    // If we're storing bytes, then initiate the replication before storing them locally.
    // This is faster as data is already serialized and ready to send.
    val replicationFuture = data match {
      case b: ByteBufferValues if putLevel.replication > 1 =>
        // Duplicate doesn't copy the bytes, but just creates a wrapper
        val bufferView = b.buffer.duplicate()
        Future {
          // This is a blocking action and should run in futureExecutionContext which is a cached
          // thread pool
          replicate(blockId, bufferView, putLevel)
        }(futureExecutionContext)
      case _ => null
    }

    // Lock the BlockInfo to synchronize concurrent access from multiple threads
    putBlockInfo.synchronized {
      logTrace("Put for block %s took %s to get into synchronized block"
        .format(blockId, Utils.getUsedTimeMs(startTimeMs)))

      var marked = false
      try {
        // First, choose a BlockStore based on the storage level: MemoryStore, DiskStore, etc.
        val (returnValues, blockStore: BlockStore) = {
          if (putLevel.useMemory) {
            // Put it in memory first, even if it also has useDisk set to true;
            // We will drop it to disk later if the memory store can't hold it.
            (true, memoryStore)
          } else if (putLevel.useOffHeap) {
            // Use external block store
            (false, externalBlockStore)
          } else if (putLevel.useDisk) {
            // Don't get back the bytes from put unless we replicate them
            (putLevel.replication > 1, diskStore)
          } else {
            assert(putLevel == StorageLevel.NONE)
            throw new BlockException(
              blockId, s"Attempted to put block $blockId without specifying storage level!")
          }
        }

        // Actually put the values
        // Put the data into the chosen store, dispatching on the type of the data
        val result = data match {
          case IteratorValues(iterator) =>
            blockStore.putIterator(blockId, iterator, putLevel, returnValues)
          case ArrayValues(array) =>
            blockStore.putArray(blockId, array, putLevel, returnValues)
          case ByteBufferValues(bytes) =>
            bytes.rewind()
            blockStore.putBytes(blockId, bytes, putLevel)
        }
        size = result.size
        result.data match {
          case Left (newIterator) if putLevel.useMemory => valuesAfterPut = newIterator
          case Right (newBytes) => bytesAfterPut = newBytes
          case _ =>
        }

        // Keep track of which blocks are dropped from memory
        if (putLevel.useMemory) {
          result.droppedBlocks.foreach { updatedBlocks += _ }
        }
        // Get the current BlockStatus for this block
        val putBlockStatus = getCurrentBlockStatus(blockId, putBlockInfo)
        if (putBlockStatus.storageLevel != StorageLevel.NONE) {
          // Now that the block is in either the memory, externalBlockStore, or disk store,
          // let other threads read it, and tell the master about it.
          marked = true
          putBlockInfo.markReady(size)
          if (tellMaster) {
            // Call reportBlockStatus to report the newly written block to the
            // BlockManagerMasterEndpoint, keeping block metadata in sync; in plain terms,
            // it updates the block status info held on the BlockManagerMaster
            reportBlockStatus(blockId, putBlockInfo, putBlockStatus)
          }
          updatedBlocks += ((blockId, putBlockStatus))
        }
      } finally {
        // If we failed in putting the block to memory/disk, notify other possible readers
        // that it has failed, and then remove it from the block info map.
        if (!marked) {
          // Note that the remove must happen before markFailure otherwise another thread
          // could've inserted a new BlockInfo before we remove it.
          blockInfo.remove(blockId)
          putBlockInfo.markFailure()
          logWarning(s"Putting block $blockId failed")
        }
      }
    }
    logDebug("Put block %s locally took %s".format(blockId, Utils.getUsedTimeMs(startTimeMs)))

    // Either we're storing bytes and we asynchronously started replication, or we're storing
    // values and need to serialize and replicate them now:
    // If the storage level has a _2 suffix (e.g. MEMORY_AND_DISK_2), the block
    // needs to be replicated and shipped to other nodes.
    if (putLevel.replication > 1) {
      data match {
        case ByteBufferValues(bytes) =>
          if (replicationFuture != null) {
            Await.ready(replicationFuture, Duration.Inf)
          }
        case _ =>
          val remoteStartTime = System.currentTimeMillis
          // Serialize the block if not already done
          if (bytesAfterPut == null) {
            if (valuesAfterPut == null) {
              throw new SparkException(
                "Underlying put returned neither an Iterator nor bytes! This shouldn't happen.")
            }
            bytesAfterPut = dataSerialize(blockId, valuesAfterPut)
          }
          // Call replicate to perform the replication
          replicate(blockId, bytesAfterPut, putLevel)
          logDebug("Put block %s remotely took %s"
            .format(blockId, Utils.getUsedTimeMs(remoteStartTime)))
      }
    }

    BlockManager.dispose(bytesAfterPut)

    if (putLevel.replication > 1) {
      logDebug("Putting block %s with replication took %s"
        .format(blockId, Utils.getUsedTimeMs(startTimeMs)))
    } else {
      logDebug("Putting block %s without replication took %s"
        .format(blockId, Utils.getUsedTimeMs(startTimeMs)))
    }

    updatedBlocks
  }

  First, a BlockInfo is created for the block being written, and the block's storage level is recorded; later, when the data is stored into local memory and memory runs short, some memory has to be freed, and this storage level decides whether the evicted data is simply deleted or persisted to disk (analyzed further below). The write itself happens inside a synchronized block on the BlockInfo, so concurrent writers cannot corrupt the data. Next, a BlockStore is chosen according to the storage level (MemoryStore, DiskStore, or externalBlockStore), and the data is put into that store according to its type. The block's BlockStatus is then obtained, and reportBlockStatus sends the new block's status to the BlockManagerMasterEndpoint, which keeps the block metadata in sync. The final check matters: if the storage level is one of the _2 variants, the block must be replicated to other nodes. This is handled by replicate, which essentially picks a random peer node and sends the serialized data over.
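  The peer-selection idea inside replicate boils down to shuffling the known peers and pushing the serialized bytes to as many of them as the replication factor requires. Below is a minimal sketch of that idea, not Spark's actual code: PeerId and uploadToPeer are hypothetical stand-ins for BlockManagerId and the block transfer service.

import java.nio.ByteBuffer
import scala.util.Random

// Hypothetical stand-in for Spark's BlockManagerId, just for illustration.
case class PeerId(host: String, port: Int)

// Hypothetical stub standing in for the real network upload.
def uploadToPeer(peer: PeerId, blockName: String, data: ByteBuffer): Unit =
  println(s"uploading $blockName (${data.remaining()} bytes) to $peer")

// One copy already lives locally, so push (replication - 1) copies to random peers.
def replicateSketch(blockName: String, data: ByteBuffer,
    replication: Int, peers: Seq[PeerId]): Unit = {
  val targets = Random.shuffle(peers).take(replication - 1)
  targets.foreach { peer =>
    // duplicate() shares the underlying bytes but gives each upload its own position/limit
    uploadToPeer(peer, blockName, data.duplicate())
  }
}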
  Now let's dig into the actual write, taking MemoryStore as the example, because when MemoryStore writes data and memory is insufficient, it first tries to evict the blocks that have been cached the longest: if an evicted block's storage level includes Disk, the block is written to disk; if it is memory-only, it is removed entirely. Writing data into memory ultimately goes through the tryToPut() method, so let's analyze it:

tryToPut()
private def tryToPut(
      blockId: BlockId,
      value: () => Any,
      size: Long,
      deserialized: Boolean,
      droppedBlocks: mutable.Buffer[(BlockId, BlockStatus)]): Boolean = {
    // Synchronize on the memory manager so that no other thread writes at the same time
    memoryManager.synchronized {
      // Release the pending unroll memory held by this task
      releasePendingUnrollMemoryForThisTask()
      // Check whether there is enough memory to hold the data (this may evict other blocks)
      val enoughMemory = memoryManager.acquireStorageMemory(blockId, size, droppedBlocks)
      if (enoughMemory) {
        // We acquired enough memory for the block, so go ahead and put it
        val entry = new MemoryEntry(value(), size, deserialized)
        // Write the entry into the in-memory map
        entries.synchronized {
          entries.put(blockId, entry)
        }
        val valuesOrBytes = if (deserialized) "values" else "bytes"
        logInfo("Block %s stored as %s in memory (estimated size %s, free %s)".format(
          blockId, valuesOrBytes, Utils.bytesToString(size), Utils.bytesToString(blocksMemoryUsed)))
      } else {
        // Not enough memory: if the storage level includes Disk, the data is dropped to
        // disk instead. Note that `data` is lazy, so it is only materialized when needed.
        lazy val data = if (deserialized) {
          Left(value().asInstanceOf[Array[Any]])
        } else {
          Right(value().asInstanceOf[ByteBuffer].duplicate())
        }
        // We could not get enough memory, so drop this block from memory; dropFromMemory
        // persists it to disk if its level includes Disk, otherwise the block is not stored
        val droppedBlockStatus = blockManager.dropFromMemory(blockId, () => data)
        droppedBlockStatus.foreach { status => droppedBlocks += ((blockId, status)) }
      }
      enoughMemory
    }
  }

  When writing data into memory, we first synchronize, then release the pending unroll memory, and then try to acquire enough memory to hold the data by calling acquireStorageMemory. If the current free memory is insufficient, some cached blocks are evicted to free space and the request is checked again; if it still does not fit, the put fails and false is returned. The eviction itself is done by evictBlocksToFreeSpace. Once enough memory has been acquired, a MemoryEntry is created for the data and written into the in-memory map.
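  The acquisition step can be pictured roughly like this. A minimal sketch, not the real MemoryManager: ToyStoragePool and its fields are made up for illustration, and the eviction stub stands in for MemoryStore.evictBlocksToFreeSpace.

// Rough sketch of the idea behind acquireStorageMemory (simplified, not Spark's code).
class ToyStoragePool(val poolSize: Long) {
  private var memoryUsed = 0L
  def memoryFree: Long = poolSize - memoryUsed

  // Stand-in for MemoryStore.evictBlocksToFreeSpace: pretend we evicted enough
  // least-recently-used entries and released their memory.
  private def evictBlocksToFreeSpace(space: Long): Boolean = {
    val freed = math.min(space, memoryUsed)
    memoryUsed -= freed
    freed >= space
  }

  def acquireStorageMemory(numBytes: Long): Boolean = synchronized {
    // If the request does not fit, ask the store to evict just enough blocks.
    val bytesToFree = math.max(0L, numBytes - memoryFree)
    if (bytesToFree > 0) {
      evictBlocksToFreeSpace(bytesToFree)
    }
    // The put succeeds only if the request now fits into free memory.
    val enoughMemory = numBytes <= memoryFree
    if (enoughMemory) memoryUsed += numBytes
    enoughMemory
  }
}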
  Let's now look at how eviction actually frees up enough memory:

evictBlocksToFreeSpace()
private[spark] def evictBlocksToFreeSpace(
      blockId: Option[BlockId],
      space: Long,
      droppedBlocks: mutable.Buffer[(BlockId, BlockStatus)]): Boolean = {
    assert(space > 0)
    memoryManager.synchronized {
      var freedMemory = 0L
      // Get the id of the RDD the block being stored belongs to (if it is an RDD block)
      val rddToAdd = blockId.flatMap(getRddId)
      // Blocks selected for eviction
      val selectedBlocks = new ArrayBuffer[BlockId]
      entries.synchronized {
        // Iterate over the MemoryEntry map held in memory
        val iterator = entries.entrySet().iterator()
        while (freedMemory < space && iterator.hasNext) {
          // The LinkedHashMap iterator yields the entries that have been cached longest first
          val pair = iterator.next()
          val blockId = pair.getKey
          // Never evict blocks belonging to the same RDD we are currently storing
          if (rddToAdd.isEmpty || rddToAdd != getRddId(blockId)) {
            // Mark this block for eviction
            selectedBlocks += blockId
            // and count its size toward the memory we expect to free
            freedMemory += pair.getValue.size
          }
        }
      }

      if (freedMemory >= space) {
        logInfo(s"${selectedBlocks.size} blocks selected for dropping")
        for (blockId <- selectedBlocks) {
          // Look up the MemoryEntry for this block
          val entry = entries.synchronized { entries.get(blockId) }
          // This should never be null as only one task should be dropping
          // blocks and removing entries. However the check is still here for
          // future safety.
          if (entry != null) {
            val data = if (entry.deserialized) {
              Left(entry.value.asInstanceOf[Array[Any]])
            } else {
              Right(entry.value.asInstanceOf[ByteBuffer].duplicate())
            }
            // Drop the old block: dropFromMemory checks whether the victim's storage
            // level includes Disk; if it is memory-only the block is removed for good,
            // otherwise it is persisted to disk first
            val droppedBlockStatus = blockManager.dropFromMemory(blockId, data)
            droppedBlockStatus.foreach { status => droppedBlocks += ((blockId, status)) }
          }
        }
        true
      } else {
        blockId.foreach { id =>
          logInfo(s"Will not store $id as it would require dropping another block " +
            "from the same RDD")
        }
        false
      }
    }
  }

  When memory cannot hold the new data, part of the existing data is evicted. The eviction policy works off entries, a LinkedHashMap, and a LinkedHashMap keeps its entries in order: the oldest sit at the head and the newest at the tail, so iterating from the head visits candidates in an LRU-like order, which is exactly how victims are selected here. During eviction the storage level of each victim is checked: if it includes Disk, the block is persisted to disk; otherwise it is discarded entirely. This is a relevant Spark tuning point, since an RDD cached without a disk-backed level can lose its cached data (which then has to be recomputed).
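  A tiny demo of why iterating a LinkedHashMap yields eviction candidates oldest-first (shown here with accessOrder = true, where a get() also moves an entry to the tail; for blocks that are written once, plain insertion order behaves the same way):

// Entries come back least recently used first when iterating a LinkedHashMap
// constructed with accessOrder = true.
val entries = new java.util.LinkedHashMap[String, Long](32, 0.75f, true)
entries.put("rdd_0_0", 100L)
entries.put("rdd_1_0", 200L)
entries.put("rdd_2_0", 300L)

entries.get("rdd_0_0") // touching rdd_0_0 moves it to the tail

val it = entries.keySet().iterator()
while (it.hasNext) {
  println(it.next()) // prints rdd_1_0, rdd_2_0, rdd_0_0
}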
  To summarize: when BlockManager writes data, it likewise stores it according to the storage level. We walked through the in-memory path (MemoryStore); the other stores work analogously. Writing into memory comes with a cache eviction mechanism: when memory runs short, the blocks that have been cached longest are evicted (and written to disk if their storage level includes Disk), which can cost an RDD its cached data. So for RDDs that are important or reused often, it is best to persist them with a level that includes Disk.
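  For example (a hypothetical pipeline; sc is a SparkContext and the input path is a placeholder):

import org.apache.spark.storage.StorageLevel

// A frequently reused RDD: persist it with a disk-backed level so that cache
// eviction spills it to disk instead of discarding it outright.
val events = sc.textFile("/data/events").map(_.split(","))
events.persist(StorageLevel.MEMORY_AND_DISK)

// Use MEMORY_AND_DISK_2 instead if you also want a replica on another node.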
