Contents:
1. The essence of RDD dependency relationships;
2. The data-flow view under dependency relationships;
3. Analysis of classic RDD dependencies;
4. RDD dependency internals in the source code;
========== The Essence of RDD Dependency Relationships ==========
There are two kinds of dependencies: narrow and wide.
A narrow dependency means that each partition of the parent RDD is used by at most one partition of the child RDD.
A wide dependency means that multiple child partitions depend on the same parent partition; in other words, one partition of the parent RDD is used by multiple partitions of the child RDD.
Wide dependencies generally produce a shuffle; operations such as groupByKey, reduceByKey and sortByKey all create wide dependencies.
Operations such as map and filter create narrow dependencies.
Summary: if a partition of the parent RDD is used by only one partition of the child RDD, the dependency is narrow; otherwise it is wide. Equivalently, if the number of parent partitions that a child partition depends on does not change as the RDD's data size grows, the dependency is narrow; otherwise it is wide.
Special note on join: there are two cases. If each partition of the join only joins with a known, fixed set of parent partitions (for example, when both inputs are already partitioned by the same partitioner), the dependency involves a fixed number of partitions, so the join is a narrow dependency and no shuffle is produced. Any other join is a wide dependency and produces a shuffle. Corollary: narrow dependencies include not only one-to-one dependencies but also dependencies on a fixed number of parent partitions (that is, the number of parent partitions depended on does not change with the RDD's data size).
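A minimal sketch (not from the original text, assuming a spark-shell session where sc is predefined) that makes the narrow/wide distinction visible through RDD.dependencies and toDebugString:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(1 to 100, 4).map(x => (x % 10, x))

// map: each child partition reads exactly one parent partition -> narrow dependency
println(pairs.map(identity).dependencies)       // List(org.apache.spark.OneToOneDependency@...)

// reduceByKey: a child partition may read every parent partition -> wide dependency (shuffle)
println(pairs.reduceByKey(_ + _).dependencies)  // List(org.apache.spark.ShuffleDependency@...)

// join of two RDDs that already share the same partitioner: each child partition depends on
// a fixed number of parent partitions, so the join itself introduces no new shuffle
val a = pairs.partitionBy(new HashPartitioner(4))
val b = pairs.mapValues(_ * 10).partitionBy(new HashPartitioner(4))
println(a.join(b).toDebugString)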
RDDs are grouped into stages based on these dependency relationships.
Why does the example above split into 3 stages rather than one?
If everything were one stage, a large amount of intermediate data would be produced; that data would have to be kept around, so memory could not be released.
Moreover, each partition is computed independently of the others.
For example, in the figure above, if one task were assigned to each of the final three partitions, then the first record of G would come from the first partition of B, which in turn comes from all three partitions of A; G also depends on all four partitions of F, and on everything upstream of F. That single task would be enormous: whenever a shuffle-level dependency is hit, every partition of the depended-on RDDs would have to be computed inside that one task.
The only supposed advantage of a single stage is tracing back through the lineage in the pipeline style, but that does not work in practice.
The core problem in both explanations above is that pipelining is impossible across a shuffle (wide) dependency.
A single stage would waste memory, cause repeated computation, make tasks too large, and be hard to manage.
=> The improved approach:
Traverse the lineage from the last RDD backwards; whenever a shuffle dependency is encountered, break the lineage there and start a new stage. Narrow dependencies stay within the current stage. This is exactly the partitioning shown in the figure above.
The number of tasks in each stage is determined by the number of partitions of the last RDD in that stage.
The tasks in the final stage are of type ResultTask; the tasks in all earlier stages are of type ShuffleMapTask.
The operator that represents a stage is always the last computation step of that stage.
All narrow dependencies can be pipelined.
Note: the Spark operators roughly equivalent to the Mapper and Reducer operations of Hadoop MapReduce are map and reduceByKey, as the sketch below shows.
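A minimal word-count sketch (assuming spark-shell): map plays the role of the Mapper, reduceByKey the role of the Reducer, and the shuffle between them is exactly where the stage boundary is drawn.

val words  = sc.parallelize(Seq("spark hadoop spark", "hadoop spark"), 2)
val counts = words.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

println(counts.toDebugString)        // the indentation change marks the ShuffleMapStage / ResultStage split
println(counts.partitions.length)    // number of tasks in the final (Result) stage
counts.collect().foreach(println)    // (spark,3), (hadoop,2)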
========== The Data-Flow View under Dependency Relationships ==========
A recap using the earlier example:
On the surface it looks as if the data is flowing; in essence it is the operators that flow (functional programming).
"Operators flow" carries two layers of meaning (see the sketch after this list):
1. The data does not move; the code moves to the data.
2. Why do operators flow (pipeline) within a single Stage? First, operator fusion: when the functional program is finally executed, the functions are expanded and the multiple operators inside a Stage are merged into one large operator (which contains the computation logic that every operator in that Stage applies to the data). Second, because transformations are lazy, this merging can be deferred to the very end; ultimately the executors do the computing, and before the fused operator is handed to the cluster's executors, the Spark framework's DAGScheduler optimizes the operators (data-locality-aware pipelining).
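A minimal sketch (assuming spark-shell) of what fusing operators inside a stage means conceptually: chained narrow transformations are applied lazily, record by record, in a single pass over each partition, roughly equivalent to one hand-written mapPartitions call that composes the functions.

val nums    = sc.parallelize(1 to 10, 2)
val chained = nums.map(_ + 1).filter(_ % 2 == 0).map(_ * 10)   // three narrow ops, one stage

// Conceptually the same single pass, written by hand:
val fused = nums.mapPartitions { iter =>
  iter.map(_ + 1).filter(_ % 2 == 0).map(_ * 10)               // iterator pipeline, no intermediate RDD materialized
}

println(chained.collect().toSeq == fused.collect().toSeq)      // true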
========== Classic Operators ==========
CoGroup
/**
 * :: Experimental ::
 * Generic function to combine the elements for each key using a custom set of aggregation
 * functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C
 * Note that V and C can be different -- for example, one might group an RDD of type
 * (Int, Int) into an RDD of type (Int, Seq[Int]). Users provide three functions:
 *
 *  - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
 *  - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
 *  - `mergeCombiners`, to combine two C's into a single one.
 *
 * In addition, users can control the partitioning of the output RDD, and whether to perform
 * map-side aggregation (if a mapper can produce multiple items with the same key).
 */
@Experimental
def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("Default partitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](
    self.context.clean(createCombiner),
    self.context.clean(mergeValue),
    self.context.clean(mergeCombiners))
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitions(iter => {
      val context = TaskContext.get()
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, C](self, partitioner)
      .setSerializer(serializer)
      .setAggregator(aggregator)
      .setMapSideCombine(mapSideCombine)
  }
}
combineByKeyWithClassTag is a shuffle-level operation on (K, V) pairs: unless the RDD is already partitioned by the target partitioner, it produces a shuffle dependency (a ShuffledRDD).
reduceByKey necessarily computes while it fetches: with mapSideCombine enabled, values are merged as they are read rather than being materialized first.
mergeCombiners is the function that merges the partial results (the C values) coming from different partitions on the reduce side.
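A minimal usage sketch (assuming spark-shell) that computes a per-key average with combineByKey, which in the version quoted above delegates to combineByKeyWithClassTag; it shows the roles of the three user-supplied functions.

val scores = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 4.0)))

val avg = scores.combineByKey(
  (v: Double) => (v, 1),                                                    // createCombiner: first value of a key -> C
  (c: (Double, Int), v: Double) => (c._1 + v, c._2 + 1),                    // mergeValue: fold another V into the local C (map side)
  (c1: (Double, Int), c2: (Double, Int)) => (c1._1 + c2._1, c1._2 + c2._2)  // mergeCombiners: merge C values across partitions (reduce side)
).mapValues { case (sum, count) => sum / count }

avg.collect().foreach(println)   // (a,2.0), (b,4.0)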
/**
 * :: DeveloperApi ::
 * Base class for dependencies.
 */
@DeveloperApi
abstract class Dependency[T] extends Serializable {
  def rdd: RDD[T]
}

/**
 * :: DeveloperApi ::
 * Base class for dependencies where each partition of the child RDD depends on a small number
 * of partitions of the parent RDD. Narrow dependencies allow for pipelined execution.
 */
@DeveloperApi
abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
  /**
   * Get the parent partitions for a child partition.
   * @param partitionId a partition of the child RDD
   * @return the partitions of the parent RDD that the child partition depends upon
   */
  def getParents(partitionId: Int): Seq[Int]

  override def rdd: RDD[T] = _rdd
}
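Purely as an illustration (this class is hypothetical, not part of Spark): a NarrowDependency subclass in which child partition i depends on the fixed pair of parent partitions 2*i and 2*i + 1 -- the "one-to-a-fixed-number" case discussed earlier; coalesce(n, shuffle = false) builds this kind of narrow dependency internally.

import org.apache.spark.NarrowDependency
import org.apache.spark.rdd.RDD

// Hypothetical example: each child partition reads exactly two parent partitions, and the
// number of parents never changes with the data size, so the dependency is still narrow.
class PairedNarrowDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): Seq[Int] =
    Seq(2 * partitionId, 2 * partitionId + 1)
}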
The two concrete forms of narrow dependency:
/**
 * :: DeveloperApi ::
 * Represents a one-to-one dependency between partitions of the parent and child RDDs.
 */
@DeveloperApi
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}

/**
 * :: DeveloperApi ::
 * Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
 * @param rdd the parent RDD
 * @param inStart the start of the range in the parent RDD
 * @param outStart the start of the range in the child RDD
 * @param length the length of the range
 */
@DeveloperApi
class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
  extends NarrowDependency[T](rdd) {

  override def getParents(partitionId: Int): List[Int] = {
    if (partitionId >= outStart && partitionId < outStart + length) {
      List(partitionId - outStart + inStart)
    } else {
      Nil
    }
  }
}
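A minimal sketch (assuming spark-shell) of where these two classes show up: map produces a OneToOneDependency, while union produces one RangeDependency per input RDD (each input keeps its partitions, shifted into a range of the output).

val x = sc.parallelize(1 to 4, 2)
val y = sc.parallelize(5 to 8, 2)

println(x.map(_ * 2).dependencies)   // List(org.apache.spark.OneToOneDependency@...)
println(x.union(y).dependencies)     // List(org.apache.spark.RangeDependency@..., org.apache.spark.RangeDependency@...)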
In a shuffle (wide) dependency, a child partition may depend on all partitions of the parent:
/**
 * :: DeveloperApi ::
 * Represents a dependency on the output of a shuffle stage. Note that in the case of shuffle,
 * the RDD is transient since we don't need it on the executor side.
 *
 * @param _rdd the parent RDD
 * @param partitioner partitioner used to partition the shuffle output
 * @param serializer [[org.apache.spark.serializer.Serializer Serializer]] to use. If set to None,
 *                   the default serializer, as specified by `spark.serializer` config option, will
 *                   be used.
 * @param keyOrdering key ordering for RDD's shuffles
 * @param aggregator map/reduce-side aggregator for RDD's shuffle
 * @param mapSideCombine whether to perform partial aggregation (also known as map-side combine)
 */
@DeveloperApi
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Option[Serializer] = None,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,
    val mapSideCombine: Boolean = false)
  extends Dependency[Product2[K, V]] {

  override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]

  private[spark] val keyClassName: String = reflect.classTag[K].runtimeClass.getName
  private[spark] val valueClassName: String = reflect.classTag[V].runtimeClass.getName
  // Note: It's possible that the combiner class tag is null, if the combineByKey
  // methods in PairRDDFunctions are used instead of combineByKeyWithClassTag.
  private[spark] val combinerClassName: Option[String] =
    Option(reflect.classTag[C]).map(_.runtimeClass.getName)

  val shuffleId: Int = _rdd.context.newShuffleId()

  val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
    shuffleId, _rdd.partitions.size, this)

  _rdd.sparkContext.cleaner.foreach(_.registerShuffleForCleanup(this))
}
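A minimal sketch (assuming spark-shell) that pulls the ShuffleDependency out of a reduceByKey result and inspects a few of the fields defined above.

import org.apache.spark.ShuffleDependency

val wordCounts = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3))).reduceByKey(_ + _)
val dep = wordCounts.dependencies.head.asInstanceOf[ShuffleDependency[String, Int, Int]]

println(dep.partitioner)       // e.g. org.apache.spark.HashPartitioner@...
println(dep.mapSideCombine)    // true for reduceByKey
println(dep.shuffleId)         // unique id registered with the shuffle manager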
Homework:
Write a blog post thoroughly demystifying RDD dependency relationships.
This article comes from the "一枝花傲寒" blog; please do not repost.