An RDD's Dependency describes exactly this relationship between a child RDD and its parent RDD. There are two kinds: Narrow Dependency and Shuffle Dependency.
/**
 * Base class for dependencies.
 */
abstract class Dependency[T](val rdd: RDD[T]) extends Serializable
The edge from C to D in Stage 2 is a narrow dependency.
/**
 * Base class for dependencies where each partition of the parent RDD is used by at most one
 * partition of the child RDD. Narrow dependencies allow for pipelined execution.
 */
abstract class NarrowDependency[T](rdd: RDD[T]) extends Dependency(rdd) {
  /**
   * Get the parent partitions for a child partition.
   * @param partitionId a partition of the child RDD
   * @return the partitions of the parent RDD that the child partition depends upon
   */
  def getParents(partitionId: Int): Seq[Int]
}
Narrow dependencies allow pipelined execution. It is easy to see that operations such as map and filter are narrow dependencies.
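To make "pipelined execution" concrete, here is a minimal sketch that models one partition as a plain Scala `Iterator`. It is not Spark code: `PipelineSketch`, `run`, and `trace` are names made up for this illustration. The assumption is only that each operator wraps the iterator of the previous one, so every element passes through map and then filter before the next element is read.

```scala
import scala.collection.mutable.ArrayBuffer

// Standalone sketch: a partition is modelled as a plain Scala Iterator.
// Nothing here is Spark API; all names are illustrative only.
object PipelineSketch {
  // Records the order in which the operators touch each element.
  val trace = ArrayBuffer.empty[String]

  // Chains map and filter lazily over one partition, then drains it.
  def run(partition: Iterator[Int]): List[Int] = {
    trace.clear()
    partition
      .map { x => trace += s"map($x)"; x * 10 }
      .filter { x => trace += s"filter($x)"; x > 10 }
      .toList
  }

  def main(args: Array[String]): Unit = {
    println(run(Iterator(1, 2, 3)))
    // The trace interleaves map and filter per element instead of
    // running all the maps first and all the filters afterwards.
    println(trace.mkString(", "))
  }
}
```

The interleaved trace is the point: no intermediate collection is built between the two operators, which is possible because each child partition only needs its own parent partition.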
/**
 * Represents a one-to-one dependency between partitions of the parent and child RDDs.
 */
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int) = List(partitionId)
}
/**
 * Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
 * @param rdd the parent RDD
 * @param inStart the start of the range in the parent RDD
 * @param outStart the start of the range in the child RDD
 * @param length the length of the range
 */
class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
  extends NarrowDependency[T](rdd) {

  override def getParents(partitionId: Int) = {
    if (partitionId >= outStart && partitionId < outStart + length) {
      List(partitionId - outStart + inStart)
    } else {
      Nil
    }
  }
}
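The index arithmetic in the two `getParents` implementations above is easier to follow with numbers. The sketch below is a standalone model with the parent RDD left out. `NarrowDep`, `OneToOne`, and `RangeDep` are simplified stand-ins, not Spark's classes, and the layout (a child built by concatenating a 3-partition parent A and a 2-partition parent B, as a union would) is an assumption chosen for illustration.

```scala
// Standalone model of the getParents logic; the parent RDD is omitted.
object NarrowDependencySketch {
  trait NarrowDep {
    def getParents(partitionId: Int): Seq[Int]
  }

  // Child partition i depends on parent partition i.
  final class OneToOne extends NarrowDep {
    override def getParents(partitionId: Int): Seq[Int] = List(partitionId)
  }

  // Child partitions [outStart, outStart + length) map onto
  // parent partitions [inStart, inStart + length).
  final class RangeDep(inStart: Int, outStart: Int, length: Int) extends NarrowDep {
    override def getParents(partitionId: Int): Seq[Int] =
      if (partitionId >= outStart && partitionId < outStart + length)
        List(partitionId - outStart + inStart)
      else Nil
  }

  // Hypothetical layout: child partitions 0-2 come from A, 3-4 come from B.
  val fromA = new RangeDep(inStart = 0, outStart = 0, length = 3)
  val fromB = new RangeDep(inStart = 0, outStart = 3, length = 2)

  def main(args: Array[String]): Unit = {
    for (child <- 0 until 5)
      println(s"child $child -> A${fromA.getParents(child)} B${fromB.getParents(child)}")
  }
}
```

Each child partition gets exactly one parent partition from exactly one of the two dependencies, which is what keeps the dependency narrow even though the child has two parents.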
Shuffle Dependency is awkward to translate literally, so we will call it a Wide Dependency instead.
A wide dependency represents a dependency on the output of a shuffle stage.
partitioner: a shuffle needs a partitioner, which can be either a hash partitioner or a custom one.
/**
 * Represents a dependency on the output of a shuffle stage.
 * @param rdd the parent RDD
 * @param partitioner partitioner used to partition the shuffle output
 * @param serializerClass class name of the serializer to use
 */
class ShuffleDependency[K, V](
    @transient rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializerClass: String = null)
  extends Dependency(rdd.asInstanceOf[RDD[Product2[K, V]]]) {

  val shuffleId: Int = rdd.context.newShuffleId()
}
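As a rough illustration of what the partitioner does during a shuffle, the sketch below assigns each key to an output partition by hashing it. It is a standalone model, not Spark's `Partitioner` API: `HashPartitionSketch`, `nonNegativeMod`, and `getPartition` are names used here for illustration, and sending `null` keys to partition 0 is an assumption of this sketch.

```scala
// Standalone sketch of hash partitioning; not Spark's Partitioner class.
object HashPartitionSketch {
  // Keeps the result in [0, mod) even when x is negative.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Maps a key to one of numPartitions output partitions.
  // Assumption of this sketch: null keys go to partition 0.
  def getPartition(key: Any, numPartitions: Int): Int =
    if (key == null) 0 else nonNegativeMod(key.hashCode, numPartitions)

  def main(args: Array[String]): Unit = {
    val records = Seq("a" -> 1, "b" -> 2, "a" -> 3, "c" -> 4)
    // Records with the same key always land in the same partition,
    // regardless of which parent partition they came from.
    val buckets = records.groupBy { case (k, _) => getPartition(k, 3) }
    buckets.toSeq.sortBy(_._1).foreach { case (p, rs) => println(s"partition $p: $rs") }
  }
}
```

Because a key's target partition depends only on the key, any parent partition can contribute records to any child partition. That is why this dependency cannot be pipelined the way a narrow one can.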