Spark Source Code Reading Notes, RDD (6): RDD Dependencies

What kinds of dependencies does an RDD have?

An RDD has two kinds of dependencies: the narrow dependency and the wide dependency. The figure below illustrates the difference:

[Figure 1: narrow vs. wide dependencies]

Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD.

Wide dependency: multiple child partitions depend on the same partition of the parent RDD.
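Before diving into the source, here is a minimal sketch (it assumes a running SparkContext named sc) showing which kind of dependency typical transformations produce:

// map is a narrow transformation, reduceByKey is a wide (shuffle) one
val pairs   = sc.parallelize(1 to 10).map(x => (x % 3, x))
val reduced = pairs.reduceByKey(_ + _)

println(pairs.dependencies)   // e.g. List(org.apache.spark.OneToOneDependency@...)
println(reduced.dependencies) // e.g. List(org.apache.spark.ShuffleDependency@...)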


How are narrow and wide dependencies implemented in the source code?

All dependencies extend the abstract class Dependency[T] in the package org.apache.spark. Its source is:

// Base class for all dependencies
// Serializable is an interface extending java.io.Serializable, so a Dependency can be serialized
abstract class Dependency[T] extends Serializable {
  def rdd: RDD[T]
}
Here rdd is the parent RDD that the dependency points to. A narrow dependency comes in two concrete forms: the one-to-one dependency and the range (many-to-one) dependency.

Both extend the abstract class NarrowDependency.

/**
 * :: DeveloperApi ::
 * Base class for dependencies where each partition of the child RDD depends on a small number
 * of partitions of the parent RDD. Narrow dependencies allow for pipelined execution.
 */
@DeveloperApi
abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
  /**
   * Get the parent partitions for a child partition.
   * @param partitionId a partition of the child RDD
   * @return the partitions of the parent RDD that the child partition depends upon
   */
  def getParents(partitionId: Int): Seq[Int]
  // override rdd to return the parent RDD passed to the constructor
  override def rdd: RDD[T] = _rdd
}
One-to-one narrow dependency

This is implemented by the OneToOneDependency class. Source code:

/**
 * :: DeveloperApi ::
 * Represents a one-to-one dependency between partitions of the parent and child RDDs.
 */
@DeveloperApi
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}
getParents is overridden so that, given the partition id of a child partition, it returns a list containing exactly the same id: in a one-to-one dependency, partition i of the child RDD depends only on partition i of the parent RDD.
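A minimal sketch of that behaviour (parentRdd below stands for any existing RDD, e.g. sc.parallelize(1 to 10)):

import org.apache.spark.OneToOneDependency

val dep = new OneToOneDependency(parentRdd) // parentRdd: an assumed, already existing RDD
dep.getParents(3)                           // List(3): child partition 3 depends only on parent partition 3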


Range (many-to-one) narrow dependency

This is implemented by the RangeDependency class. Source code:

/**
 * :: DeveloperApi ::
 * Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
 * @param rdd the parent RDD
 * @param inStart the start of the range in the parent RDD
 * @param outStart the start of the range in the child RDD
 * @param length the length of the range
 */
@DeveloperApi
class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
  extends NarrowDependency[T](rdd) {

  override def getParents(partitionId: Int): List[Int] = {
    if (partitionId >= outStart && partitionId < outStart + length) {
      List(partitionId - outStart + inStart)
    } else {
      Nil
    }
  }
}
The figure below helps explain the range dependency:

[Figure 2: RangeDependency as used by UnionRDD]


RangeDependency is currently only used by UnionRDD; these are the dependencies UnionRDD imports:

import org.apache.spark.{Dependency, Partition, RangeDependency, SparkContext, TaskContext}
Now look at the UnionRDD class itself (excerpt):

class UnionRDD[T: ClassTag](
    sc: SparkContext,
    var rdds: Seq[RDD[T]])
  extends RDD[T](sc, Nil) {  // Nil since we implement getDependencies

  override def getPartitions: Array[Partition] = {
    val array = new Array[Partition](rdds.map(_.partitions.length).sum)
    var pos = 0
    for ((rdd, rddIndex) <- rdds.zipWithIndex; split <- rdd.partitions) {
      array(pos) = new UnionPartition(pos, rdd, rddIndex, split.index)
      pos += 1
    }
    array
  }
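The excerpt above omits getDependencies, which is where the RangeDependency objects are actually built. Roughly (paraphrased from the Spark source, so details may vary by version; ArrayBuffer comes from scala.collection.mutable), it walks the parent RDDs and assigns each one a contiguous range of the child's partitions:

  override def getDependencies: Seq[Dependency[_]] = {
    val deps = new ArrayBuffer[Dependency[_]]
    var pos = 0
    for (rdd <- rdds) {
      // parent partitions 0 until rdd.partitions.length map to
      // child partitions pos until pos + rdd.partitions.length
      deps += new RangeDependency(rdd, 0, pos, rdd.partitions.length)
      pos += rdd.partitions.length
    }
    deps
  }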

Now let's tie the figure and the code together. The UnionRDD constructor takes var rdds: Seq[RDD[T]], i.e. all of the parent RDDs collected into one sequence. Looking again at RangeDependency's constructor,

its parameters are (rdd: RDD[T], inStart: Int, outStart: Int, length: Int), where:

rdd: the parent RDD

inStart: the starting index of the range within the parent RDD's partitions (always 0 when built by UnionRDD)

outStart: the starting index of the same range within the child UnionRDD's partitions

length: the number of partitions this parent RDD contributes; summed over all parents it equals the length of the partition array built in getPartitions above
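As a quick worked example (a self-contained sketch; the two parents and their partition counts are made up for illustration):

// Stand-alone sketch of the RangeDependency.getParents arithmetic.
// Assume parent A has 3 partitions and parent B has 4, unioned into a 7-partition child:
//   child partitions 0..2 come from A -> RangeDependency(A, inStart = 0, outStart = 0, length = 3)
//   child partitions 3..6 come from B -> RangeDependency(B, inStart = 0, outStart = 3, length = 4)
def getParents(partitionId: Int, inStart: Int, outStart: Int, length: Int): List[Int] =
  if (partitionId >= outStart && partitionId < outStart + length)
    List(partitionId - outStart + inStart)
  else
    Nil

getParents(1, inStart = 0, outStart = 0, length = 3) // List(1): child partition 1 is partition 1 of A
getParents(5, inStart = 0, outStart = 3, length = 4) // List(2): child partition 5 is partition 2 of B
getParents(5, inStart = 0, outStart = 0, length = 3) // Nil: child partition 5 does not come from A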


----------------------------------------Wide dependency--------------------------------------------------

There is only one implementation of the wide dependency: ShuffleDependency. A child partition may depend on all partitions of the parent RDD, so a shuffle is required. Source code:

/**
 * :: DeveloperApi ::
 * Represents a dependency on the output of a shuffle stage. Note that in the case of shuffle,
 * the RDD is transient since we don't need it on the executor side.
 * @param _rdd the parent RDD
 * @param partitioner partitioner used to partition the shuffle output
 * @param serializer [[org.apache.spark.serializer.Serializer Serializer]] to use. If set to None,
 *                   the default serializer, as specified by `spark.serializer` config option, will
 *                   be used.
 * @param keyOrdering key ordering for RDD's shuffles
 * @param aggregator map/reduce-side aggregator for RDD's shuffle   
 * @param mapSideCombine whether to perform partial aggregation (also known as map-side combine)
 */
@DeveloperApi
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Option[Serializer] = None,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,
    val mapSideCombine: Boolean = false)
  extends Dependency[Product2[K, V]] {
  // cast the parent rdd to RDD[Product2[K, V]]
  override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]
  // runtime class name of the key type
  private[spark] val keyClassName: String = reflect.classTag[K].runtimeClass.getName
  // runtime class name of the value type
  private[spark] val valueClassName: String = reflect.classTag[V].runtimeClass.getName
  // Note: It's possible that the combiner class tag is null, if the combineByKey
  // methods in PairRDDFunctions are used instead of combineByKeyWithClassTag.
  private[spark] val combinerClassName: Option[String] =
    Option(reflect.classTag[C]).map(_.runtimeClass.getName)

  // a unique id for this shuffle, obtained from the SparkContext
  val shuffleId: Int = _rdd.context.newShuffleId()

  // register this shuffle with the shuffleManager
  val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
    shuffleId, _rdd.partitions.length, this)

  _rdd.sparkContext.cleaner.foreach(_.registerShuffleForCleanup(this))
}
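As a sketch of how this surfaces in everyday code (again assuming a SparkContext named sc), a shuffle transformation such as reduceByKey installs a ShuffleDependency as the resulting RDD's only dependency:

import org.apache.spark.ShuffleDependency

val counts = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1))).reduceByKey(_ + _)
val shuffleDep = counts.dependencies.head.asInstanceOf[ShuffleDependency[String, Int, Int]]

println(shuffleDep.shuffleId)      // unique id obtained from newShuffleId()
println(shuffleDep.partitioner)    // a HashPartitioner by default
println(shuffleDep.mapSideCombine) // true: reduceByKey combines on the map side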
Shuffle itself is a complex process, so I plan to read and summarize it in detail in a later chapter. The figure below gives an overview:

[Figure 3: ShuffleDependency overview]


Now let's see how an RDD locates its dependencies. First, two helper functions used internally:

  /** Returns the first parent RDD */
  protected[spark] def firstParent[U: ClassTag]: RDD[U] = {
    dependencies.head.rdd.asInstanceOf[RDD[U]]
  }
  /** Returns the jth parent RDD: e.g. rdd.parent[T](0) is equivalent to rdd.firstParent[T] */
  protected[spark] def parent[U: ClassTag](j: Int) = {
    dependencies(j).rdd.asInstanceOf[RDD[U]]
  }

Next, the dependencies method itself:

  /**
   * Get the list of dependencies of this RDD, taking into account whether the
   * RDD is checkpointed or not.
   */
  final def dependencies: Seq[Dependency[_]] = {
    // if this RDD has been checkpointed, it depends only on the checkpoint RDD
    checkpointRDD.map(r => List(new OneToOneDependency(r))).getOrElse {
      // not checkpointed
      if (dependencies_ == null) {
        // dependencies not computed yet: call the subclass's getDependencies
        // (this is done only once; the result is cached in dependencies_)
        dependencies_ = getDependencies
      }
      // return the cached dependencies
      dependencies_
    }
  }


dependencies is a sequence that records this RDD's dependencies; each element is a Dependency pointing back to one parent RDD.
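To see the checkpoint branch of dependencies in action, here is a small sketch (assumes a SparkContext sc; /tmp/ckpt is just an example path):

sc.setCheckpointDir("/tmp/ckpt")
val doubled = sc.parallelize(1 to 100).map(_ * 2)
println(doubled.dependencies) // OneToOneDependency on the parallelized parent

doubled.checkpoint()
doubled.count()               // an action is needed to materialize the checkpoint
println(doubled.dependencies) // now a single OneToOneDependency on the checkpoint RDD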
