Spark Source Code: CacheManager

Introduction to CacheManager
1. CacheManager manages Spark's caching; a cached RDD can live in memory or on disk (see the usage sketch after this list);
2. CacheManager reads and writes the cached data through the BlockManager;
3. When a Task runs, the RDD's compute method is invoked to produce the data, and compute in turn calls the iterator method of its parent RDD;
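For example, caching is switched on from user code with persist()/cache(), and the chosen StorageLevel decides whether the cache is memory-based, disk-based, or both. A minimal, hypothetical sketch (sc is an existing SparkContext and the input path is a placeholder):

  import org.apache.spark.storage.StorageLevel

  // Hypothetical driver-side snippet, not taken from the Spark source.
  val words = sc.textFile("hdfs:///data/input.txt").flatMap(_.split(" "))
  words.persist(StorageLevel.MEMORY_AND_DISK)  // memory-based cache that spills to disk when memory is short
  words.count()  // first action materializes the cache, partition by partition, via the CacheManager
  words.count()  // later actions read the cached blocks through the BlockManager instead of recomputing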

CacheManager Source Code Walkthrough
Since we are discussing the CacheManager, let's start with how an RDD obtains its data. Take MapPartitionsRDD as an example: in its compute method it fetches the parent RDD's data via firstParent[T].iterator(split, context).

private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false)
  extends RDD[U](prev) {

  override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None

  override def getPartitions: Array[Partition] = firstParent[T].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    f(context, split.index, firstParent[T].iterator(split, context))
}
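
For context, this is roughly how a transformation such as map wires the user function into a MapPartitionsRDD (a lightly simplified sketch; closure cleaning and scope bookkeeping are omitted). The compute method shown above then feeds that function the parent's iterator:

  // Simplified sketch of RDD.map: the user function is applied inside the partition
  // function that MapPartitionsRDD stores as `f`.
  def map[U: ClassTag](f: T => U): RDD[U] =
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(f))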

Now let's look at how RDD.iterator obtains the data. In this method we can see that it first checks whether this RDD's data has been cached: if it has, the CacheManager is used to fetch the data; if not, computeOrReadCheckpoint is called to obtain the data.

  /**
   * Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
   * This should ''not'' be called by users directly, but is available for implementors of custom
   * subclasses of RDD.
   */
  final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    // Check whether this RDD's data has been cached
    if (storageLevel != StorageLevel.NONE) {
      // If it is cached, fetch the data through the CacheManager
      SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
    } else {
      // Otherwise read the data from the checkpoint or recompute the RDD's data
      computeOrReadCheckpoint(split, context)
    }
  }
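
storageLevel is StorageLevel.NONE until the user calls persist() or cache(), which is why only persisted RDDs take the CacheManager branch above. A condensed sketch of RDD.persist (simplified; the level-change check and ContextCleaner registration are omitted):

  // Simplified sketch: persist() records the chosen level in `storageLevel`,
  // which iterator() compares against StorageLevel.NONE.
  def persist(newLevel: StorageLevel): this.type = {
    sc.persistRDD(this)   // register the RDD with the SparkContext (done once)
    storageLevel = newLevel
    this
  }

  /** Persist with the default storage level, MEMORY_ONLY. */
  def cache(): this.type = persist(StorageLevel.MEMORY_ONLY)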

Let's first look at how the CacheManager fetches data. Enter CacheManager's getOrCompute method, whose job is to fetch the RDD's data from the cache and to recompute it when the cache has nothing.
From this method we can see that the CacheManager does indeed go through the BlockManager: blockManager.get reads the block either locally or remotely.
If data comes back, it is returned directly, wrapped in an InterruptibleIterator.
If blockManager.get finds nothing, the RDD's data has to be recomputed. As the source below shows, a loading lock is first acquired for the given partition (see the acquireLockForPartition sketch after the code), then rdd.computeOrReadCheckpoint(partition, context) recomputes the data, the result is written into the cache, and finally the RDD's data is returned.

  /** Gets or computes an RDD partition. Used by RDD.iterator() when an RDD is cached. */
  // Fetch the RDD's data from the cache; recompute it if the cache has nothing
  def getOrCompute[T](
      rdd: RDD[T],
      partition: Partition,
      context: TaskContext,
      storageLevel: StorageLevel): Iterator[T] = {

    val key = RDDBlockId(rdd.id, partition.index)
    logDebug(s"Looking for partition $key")
    // The CacheManager fetches the data through the BlockManager
    blockManager.get(key) match {
      case Some(blockResult) =>
        // Partition is already materialized, so just return its values
        // If the block exists, return it directly
        val existingMetrics = context.taskMetrics
          .getInputMetricsForReadMethod(blockResult.readMethod)
        existingMetrics.incBytesRead(blockResult.bytes)

        val iter = blockResult.data.asInstanceOf[Iterator[T]]
        new InterruptibleIterator[T](context, iter) {
          override def next(): T = {
            existingMetrics.incRecordsRead(1)
            delegate.next()
          }
        }
      case None =>
        // If nothing is cached, recompute the data and store the result in the cache
        // Acquire a lock for loading this partition
        // If another thread already holds the lock, wait for it to finish return its results
        // Acquire the loading lock for this partition
        val storedValues = acquireLockForPartition[T](key)
        if (storedValues.isDefined) {
          return new InterruptibleIterator[T](context, storedValues.get)
        }

        // Otherwise, we have to load the partition ourselves
        try {
          logInfo(s"Partition $key not found, computing it")
          // Recompute the RDD's data
          val computedValues = rdd.computeOrReadCheckpoint(partition, context)

          // If the task is running locally, do not persist the result
          // If the task is running locally, return the result directly without persisting it
          if (context.isRunningLocally) {
            return computedValues
          }

          // Otherwise, cache the values and keep track of any updates in block statuses
          // Put the result into the cache
          val updatedBlocks = new ArrayBuffer[(BlockId, BlockStatus)]
          val cachedValues = putInBlockManager(key, computedValues, storageLevel, updatedBlocks)
          val metrics = context.taskMetrics
          val lastUpdatedBlocks = metrics.updatedBlocks.getOrElse(Seq[(BlockId, BlockStatus)]())
          metrics.updatedBlocks = Some(lastUpdatedBlocks ++ updatedBlocks.toSeq)
          new InterruptibleIterator(context, cachedValues)

        } finally {
          loading.synchronized {
            loading.remove(key)
            loading.notifyAll()
          }
        }
    }
  }

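The partition lock mentioned above is coordinated through the loading hash set that the finally block clears and notifies. A condensed sketch of acquireLockForPartition, lightly simplified from the same CacheManager (logging and the exception handling around wait() are trimmed):

  /**
   * Try to acquire the loading lock for the given block. Returns None if we got the lock
   * (the caller must compute and cache the partition), or the values produced by another
   * thread that was already loading the same block.
   */
  private def acquireLockForPartition[T](id: RDDBlockId): Option[Iterator[T]] = {
    loading.synchronized {
      if (!loading.contains(id)) {
        loading.add(id)      // lock is free: we own it now
        None
      } else {
        // Another thread is loading this block; wait until it calls notifyAll() in getOrCompute's finally block
        while (loading.contains(id)) {
          loading.wait()
        }
        val values = blockManager.get(id)
        if (values.isEmpty) {
          // The block may have failed to load or been evicted right after being put;
          // take the lock ourselves so the caller recomputes it.
          loading.add(id)
        }
        values.map(_.data.asInstanceOf[Iterator[T]])
      }
    }
  }
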
CacheManager's getOrCompute just called rdd.computeOrReadCheckpoint(partition, context) to obtain the RDD's data, so let's now see what that method does.
As the source below shows, computeOrReadCheckpoint first checks whether this RDD has been checkpointed and materialized. If it has, the data is read straight from the checkpoint through the first parent, which at that point is the CheckpointRDD; if not, the RDD has to be recomputed, and recomputing simply means calling compute, which for most RDDs pulls data from the parent's iterator (the CoalescedRDD override shown after it is one example).

  /**
   * Compute an RDD partition or read it from a checkpoint if the RDD is checkpointing.
   */
  private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
  {
    // Check whether this RDD's data has been checkpointed and materialized
    if (isCheckpointedAndMaterialized) {
      firstParent[T].iterator(split, context)
    } else {
      // If not checkpointed, recompute the data via this RDD's compute method
      compute(split, context)
    }
  }
  
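  // Example compute implementation: the override from CoalescedRDD, which simply
  // chains together the iterators of its parent partitions.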
  override def compute(partition: Partition, context: TaskContext): Iterator[T] = {
    partition.asInstanceOf[CoalescedRDDPartition].parents.iterator.flatMap { parentPartition =>
      firstParent[T].iterator(parentPartition, context)
    }
  }  
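
For completeness, checkpointing is enabled from the driver. A minimal, hypothetical sketch (sc, rawRdd, and the directory path are placeholders, not taken from this article's source):

  // After the action below, isCheckpointedAndMaterialized becomes true for `cleaned`,
  // so computeOrReadCheckpoint reads the checkpointed data instead of recomputing it.
  sc.setCheckpointDir("hdfs:///tmp/checkpoints")
  val cleaned = rawRdd.map(_.trim).cache()   // caching avoids computing the lineage twice
  cleaned.checkpoint()                       // marks the RDD; the checkpoint is written by the next action
  cleaned.count()                            // action: computes, caches, and materializes the checkpoint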

Finally, let's look at how the CacheManager writes data into the cache.
As the source below shows, putInBlockManager first determines the effective storage level. If that level does not use memory, the iterator is handed straight to blockManager.putIterator, which ultimately stores the data through blockManager.doPut() (already covered in the Spark源码之BlockManager post). If the level does use memory, blockManager.memoryStore.unrollSafely() tries to unroll the partition into memory step by step. If there is enough space, the unrolled array is cached directly; if not, the code checks whether the level also allows disk: if it does, a disk-only storage level is substituted and putInBlockManager is called once more to store the data on disk, otherwise the computed values are returned without being cached.

  private def putInBlockManager[T](
      key: BlockId,
      values: Iterator[T],
      level: StorageLevel,
      updatedBlocks: ArrayBuffer[(BlockId, BlockStatus)],
      effectiveStorageLevel: Option[StorageLevel] = None): Iterator[T] = {

    // Determine the effective storage level
    val putLevel = effectiveStorageLevel.getOrElse(level)
    if (!putLevel.useMemory) {
      /*
       * This RDD is not to be cached in memory, so we can just pass the computed values as an
       * iterator directly to the BlockManager rather than first fully unrolling it in memory.
       */
      // The level does not use memory: stream the iterator straight to the BlockManager
      updatedBlocks ++=
        blockManager.putIterator(key, values, level, tellMaster = true, effectiveStorageLevel)
      blockManager.get(key) match {
        case Some(v) => v.data.asInstanceOf[Iterator[T]]
        case None =>
          logInfo(s"Failure to store $key")
          throw new BlockException(key, s"Block manager failed to return cached value for $key!")
      }
    } else {
      /*
       * This RDD is to be cached in memory. In this case we cannot pass the computed values
       * to the BlockManager as an iterator and expect to read it back later. This is because
       * we may end up dropping a partition from memory store before getting it back.
       *
       * In addition, we must be careful to not unroll the entire partition in memory at once.
       * Otherwise, we may cause an OOM exception if the JVM does not have enough space for this
       * single partition. Instead, we unroll the values cautiously, potentially aborting and
       * dropping the partition to disk if applicable.
       */
      // The level uses memory: unroll the values cautiously into the MemoryStore
      blockManager.memoryStore.unrollSafely(key, values, updatedBlocks) match {
        case Left(arr) =>
          // We have successfully unrolled the entire partition, so cache it in memory
          updatedBlocks ++=
            blockManager.putArray(key, arr, level, tellMaster = true, effectiveStorageLevel)
          arr.iterator.asInstanceOf[Iterator[T]]
        case Right(it) =>
          // There is not enough space to cache this partition in memory
          val returnValues = it.asInstanceOf[Iterator[T]]
          if (putLevel.useDisk) {
            logWarning(s"Persisting partition $key to disk instead.")
            val diskOnlyLevel = StorageLevel(useDisk = true, useMemory = false,
              useOffHeap = false, deserialized = false, putLevel.replication)
            putInBlockManager[T](key, returnValues, level, updatedBlocks, Some(diskOnlyLevel))
          } else {
            returnValues
          }
      }
    }
  }
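
As a quick illustration (not part of CacheManager itself), the branch taken above is decided by the level's useMemory flag; the common built-in levels map onto it as follows:

  import org.apache.spark.storage.StorageLevel

  StorageLevel.DISK_ONLY.useMemory        // false -> putIterator path, streamed straight to the BlockManager
  StorageLevel.MEMORY_ONLY.useMemory      // true  -> unrollSafely path; dropped if the partition does not fit
  StorageLevel.MEMORY_AND_DISK.useMemory  // true  -> unrollSafely path; falls back to a disk-only level on overflow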
