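What kinds of dependencies do RDDs have?
RDDs have two kinds of dependencies: narrow dependencies and wide dependencies. The figure below illustrates them:
Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD.
Wide dependency: multiple partitions of the child RDD depend on the same partition of the parent RDD.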
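The two definitions can be checked mechanically. The following is a minimal sketch in plain Scala, not Spark code: `DependencyKindSketch` and `isNarrow` are hypothetical names, and a dependency is modeled simply as, for each child partition, the list of parent partition indices it reads.

```scala
object DependencyKindSketch {
  // childToParents(c) lists the parent partition indices that child partition c reads.
  // Hypothetical helper: true when every parent partition is used by at most one child.
  def isNarrow(childToParents: Seq[Seq[Int]]): Boolean = {
    val usesPerParent = childToParents
      .flatMap(_.distinct)   // count each (child, parent) pair once
      .groupBy(identity)     // parent index -> all of its uses
      .values
      .map(_.size)
    usesPerParent.forall(_ <= 1)
  }

  def main(args: Array[String]): Unit = {
    val mapLike = Seq(Seq(0), Seq(1), Seq(2))         // child i reads parent i
    val shuffleLike = Seq(Seq(0, 1, 2), Seq(0, 1, 2)) // every child reads every parent
    println(isNarrow(mapLike))     // true
    println(isNarrow(shuffleLike)) // false
  }
}
```

Note that a layout such as `Seq(Seq(0, 1), Seq(2))`, where one child reads several parents but no parent is shared, still counts as narrow under the definition above.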
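How are narrow and wide dependencies represented in the source code?
Every dependency extends the abstract class Dependency[T] in the package org.apache.spark. Its source is:
// Base class of all dependencies
// Serializable is the trait that extends java.io.Serializable, so a Dependency can be serialized
abstract class Dependency[T] extends Serializable {
  def rdd: RDD[T]
}
rdd is the parent RDD that is depended on. Narrow dependencies come in two concrete forms: one-to-one dependencies, and many-to-one dependencies in which several parent RDDs feed one child RDD.
Both forms extend the abstract class NarrowDependency.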
/**
 * :: DeveloperApi ::
 * Base class for dependencies where each partition of the child RDD depends on a small number
 * of partitions of the parent RDD. Narrow dependencies allow for pipelined execution.
 */
@DeveloperApi
abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
  /**
   * Get the parent partitions for a child partition.
   * @param partitionId a partition of the child RDD
   * @return the partitions of the parent RDD that the child partition depends upon
   */
  def getParents(partitionId: Int): Seq[Int]

  // Override the rdd method to return the parent RDD passed to the constructor
  override def rdd: RDD[T] = _rdd
}
The one-to-one narrow dependency is implemented by the OneToOneDependency class. Its source is:
/**
 * :: DeveloperApi ::
 * Represents a one-to-one dependency between partitions of the parent and child RDDs.
 */
@DeveloperApi
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}
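The class overrides getParents. The argument is the partition ID of the child RDD, and the method returns a list containing that same ID, so child partition i depends only on parent partition i. A partition ID is unique within its RDD.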
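The one-to-one mapping can be tried without a Spark installation. This is a minimal sketch under that assumption: `NarrowDep` and `OneToOneDep` are hypothetical stand-ins for the Spark classes, with the parent RDD reference left out.

```scala
object OneToOneSketch {
  // Hypothetical stand-in for NarrowDependency; the parent RDD reference is omitted.
  abstract class NarrowDep {
    def getParents(partitionId: Int): Seq[Int]
  }

  // Child partition i reads exactly parent partition i.
  class OneToOneDep extends NarrowDep {
    override def getParents(partitionId: Int): List[Int] = List(partitionId)
  }

  def main(args: Array[String]): Unit = {
    val dep = new OneToOneDep
    (0 until 3).foreach { child =>
      println(s"child partition $child <- parent partitions ${dep.getParents(child)}")
    }
  }
}
```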
The many-to-one narrow dependency is implemented by the RangeDependency class. Its source is:
/**
 * :: DeveloperApi ::
 * Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
 * @param rdd the parent RDD
 * @param inStart the start of the range in the parent RDD
 * @param outStart the start of the range in the child RDD
 * @param length the length of the range
 */
@DeveloperApi
class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
  extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = {
    if (partitionId >= outStart && partitionId < outStart + length) {
      List(partitionId - outStart + inStart)
    } else {
      Nil
    }
  }
}
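For the many-to-one case, start with the figure below:
RangeDependency is currently used only by UnionRDD, which imports it as follows:
import org.apache.spark.{Dependency, Partition, RangeDependency, SparkContext, TaskContext}
Now look at the UnionRDD class itself. Its source is: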
class UnionRDD[T: ClassTag](
    sc: SparkContext,
    var rdds: Seq[RDD[T]])
  extends RDD[T](sc, Nil) { // Nil since we implement getDependencies
  override def getPartitions: Array[Partition] = {
    val array = new Array[Partition](rdds.map(_.partitions.length).sum)
    var pos = 0
    for ((rdd, rddIndex) <- rdds.zipWithIndex; split <- rdd.partitions) {
      array(pos) = new UnionPartition(pos, rdd, rddIndex, split.index)
      pos += 1
    }
    array
  }
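Now let's tie the figure and the code together. UnionRDD takes var rdds: Seq[RDD[T]] as input.
This is a Seq[RDD[T]], which means all the RDDs being unioned are held in one sequence. Next, look at the parameters of RangeDependency:
(rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
where:
rdd: the parent RDD
inStart: the start of the partition range in the parent RDD
outStart: the start of the corresponding partition range in the UnionRDD
length: the number of partitions in the range, which for UnionRDD is the partition count of that one parent RDD (the Array built in getPartitions has the combined length of all parents)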
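The parameter list above can be made concrete with a small experiment. The sketch below is plain Scala rather than Spark code: `RangeDep` copies the getParents logic shown earlier, and `unionDeps` is a hypothetical helper that assumes each parent's partitions occupy one consecutive block of the union, starting at parent partition 0.

```scala
object RangeDependencySketch {
  // Same range check as RangeDependency.getParents, without the RDD reference.
  final case class RangeDep(inStart: Int, outStart: Int, length: Int) {
    def getParents(partitionId: Int): List[Int] =
      if (partitionId >= outStart && partitionId < outStart + length) {
        List(partitionId - outStart + inStart)
      } else {
        Nil
      }
  }

  // Hypothetical helper: one RangeDep per parent, given each parent's partition count.
  // The running sum of the counts gives each parent's starting offset in the union.
  def unionDeps(parentPartitionCounts: Seq[Int]): Seq[RangeDep] = {
    val starts = parentPartitionCounts.scanLeft(0)(_ + _)
    parentPartitionCounts.zip(starts).map { case (count, start) =>
      RangeDep(inStart = 0, outStart = start, length = count)
    }
  }

  def main(args: Array[String]): Unit = {
    // Parent A has 3 partitions, parent B has 2, so the union has partitions 0 to 4.
    val deps = unionDeps(Seq(3, 2))
    (0 until 5).foreach { child =>
      println(s"union partition $child -> ${deps.map(_.getParents(child))}")
    }
  }
}
```

With parents of 3 and 2 partitions, union partition 4 falls outside the first range and maps to partition 1 of the second parent, which is why each union partition has exactly one parent partition.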
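----------------------------------------Wide dependency--------------------------------------------------
Wide dependencies have a single implementation, ShuffleDependency. The child RDD depends on all partitions of the parent RDD, which requires a shuffle. Its source is: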
/**
 * :: DeveloperApi ::
 * Represents a dependency on the output of a shuffle stage. Note that in the case of shuffle,
 * the RDD is transient since we don't need it on the executor side.
 * @param _rdd the parent RDD
 * @param partitioner partitioner used to partition the shuffle output
 * @param serializer [[org.apache.spark.serializer.Serializer Serializer]] to use. If set to None,
 *                   the default serializer, as specified by `spark.serializer` config option, will
 *                   be used.
 * @param keyOrdering key ordering for RDD's shuffles
 * @param aggregator map/reduce-side aggregator for RDD's shuffle
 * @param mapSideCombine whether to perform partial aggregation (also known as map-side combine)
 */
@DeveloperApi
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Option[Serializer] = None,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,
    val mapSideCombine: Boolean = false)
  extends Dependency[Product2[K, V]] {
  // Cast rdd to RDD[Product2[K, V]]
  override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]
  // Runtime class name of the key type
  private[spark] val keyClassName: String = reflect.classTag[K].runtimeClass.getName
  // Runtime class name of the value type
  private[spark] val valueClassName: String = reflect.classTag[V].runtimeClass.getName
  // Note: It's possible that the combiner class tag is null, if the combineByKey
  // methods in PairRDDFunctions are used instead of combineByKeyWithClassTag.
  private[spark] val combinerClassName: Option[String] =
    Option(reflect.classTag[C]).map(_.runtimeClass.getName)
  // shuffleId: a new ID obtained from the SparkContext that identifies this shuffle
  val shuffleId: Int = _rdd.context.newShuffleId()
  // Register this shuffle with the shuffle manager
  val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
    shuffleId, _rdd.partitions.length, this)
  _rdd.sparkContext.cleaner.foreach(_.registerShuffleForCleanup(this))
}
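To see why a shuffle makes each child partition depend on many parent partitions, here is a minimal sketch in plain Scala. It assumes a hash-style partitioner that sends a record to partition `key.hashCode mod numPartitions`; `ShuffleSketch`, `targetPartition` and `parentsOfEachChild` are hypothetical names, not Spark APIs.

```scala
object ShuffleSketch {
  // Keep the result in [0, mod) even when the hash code is negative.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Assumed hash-style placement of a key into a child partition.
  def targetPartition(key: Any, numPartitions: Int): Int =
    nonNegativeMod(key.hashCode, numPartitions)

  // For each child partition, the set of parent partitions that send it at least one record.
  def parentsOfEachChild[K, V](parents: Seq[Seq[(K, V)]], numChildren: Int): Map[Int, Set[Int]] = {
    val childAndParent = for {
      (records, parentIdx) <- parents.zipWithIndex
      (key, _) <- records
    } yield targetPartition(key, numChildren) -> parentIdx
    childAndParent.groupBy(_._1).map { case (child, pairs) => child -> pairs.map(_._2).toSet }
  }

  def main(args: Array[String]): Unit = {
    // Two parent partitions holding (key, value) records.
    val parents = Seq(Seq(1 -> "a", 2 -> "b"), Seq(3 -> "c", 4 -> "d"))
    println(parentsOfEachChild(parents, numChildren = 2))
  }
}
```

In this example both child partitions receive records from both parent partitions, so no child can start until every parent partition has been computed.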
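Shuffle is a complex process, so I plan to read through it and summarize it in detail in a later chapter. The figure below gives an overview:
Now let's see how an RDD finds its dependencies. First, two functions used internally: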
/** Returns the first parent RDD */
protected[spark] def firstParent[U: ClassTag]: RDD[U] = {
  dependencies.head.rdd.asInstanceOf[RDD[U]]
}
/** Returns the jth parent RDD: e.g. rdd.parent[T](0) is equivalent to rdd.firstParent[T] */
protected[spark] def parent[U: ClassTag](j: Int) = {
  dependencies(j).rdd.asInstanceOf[RDD[U]]
}
Next, look at the dependencies function:
/**
 * Get the list of dependencies of this RDD, taking into account whether the
 * RDD is checkpointed or not
 */
final def dependencies: Seq[Dependency[_]] = {
  // If the RDD has been checkpointed, it depends only on the checkpoint RDD
  checkpointRDD.map(r => List(new OneToOneDependency(r))).getOrElse {
    // Not checkpointed
    if (dependencies_ == null) {
      // Dependencies not computed yet: call the subclass's getDependencies (it is called only once)
      dependencies_ = getDependencies
    }
    // Dependencies already computed: return the cached value
    dependencies_
  }
}
dependencies is a sequence that records the RDD's dependency relationships.
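The lookup order in dependencies (checkpoint first, then the cached value, then a single call to getDependencies) can be sketched outside Spark. In this minimal model, `Node`, `markCheckpointed` and `computeCount` are hypothetical, and a dependency is reduced to the name of a parent.

```scala
object DependenciesCacheSketch {
  // Hypothetical miniature of an RDD: dependencies are just parent names.
  class Node(parents: Seq[String]) {
    var computeCount = 0                          // how many times getDependencies ran
    private var cached: Seq[String] = null        // plays the role of dependencies_
    private var checkpoint: Option[String] = None // plays the role of checkpointRDD

    protected def getDependencies: Seq[String] = {
      computeCount += 1
      parents
    }

    def markCheckpointed(name: String): Unit = {
      checkpoint = Some(name)
    }

    final def dependencies: Seq[String] =
      checkpoint.map(c => List(c)).getOrElse {
        if (cached == null) {
          cached = getDependencies // first access only
        }
        cached
      }
  }

  def main(args: Array[String]): Unit = {
    val node = new Node(Seq("a", "b"))
    println(node.dependencies)
    println(node.dependencies)
    println(node.computeCount) // 1
    node.markCheckpointed("checkpoint")
    println(node.dependencies)
  }
}
```

Once the node is marked as checkpointed, its original parents are no longer reported, which mirrors how a checkpointed RDD depends only on its checkpoint RDD.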