1 Program Entry Point
import org.apache.spark.{SparkConf, SparkContext}

val conf: SparkConf = new SparkConf().setAppName("SparkJob_Demo").setMaster("local[*]")
val sparkContext: SparkContext = new SparkContext(conf)

sparkContext.parallelize(List("aaa", "bbb", "ccc", "ddd"), 2)
  .repartition(4)
  .collect()
2 Stepping into the Source Code
2.1 Tracing parallelize
- Step into org.apache.spark.SparkContext.scala
def parallelize[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  assertNotStopped()
  new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}
This creates a ParallelCollectionRDD.
- Step into org.apache.spark.rdd.ParallelCollectionRDD.scala
private[spark] class ParallelCollectionRDD[T: ClassTag](
    sc: SparkContext,
    @transient private val data: Seq[T],
    numSlices: Int,
    locationPrefs: Map[Int, Seq[String]])
  extends RDD[T](sc, Nil) {
}
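As a quick check (a minimal sketch, assuming the sparkContext created in the entry program above; this snippet is not part of the original walkthrough), the partition count of the RDD returned by parallelize matches the numSlices argument:

// Hypothetical driver snippet: numSlices = 2 yields 2 partitions.
val rdd = sparkContext.parallelize(List("aaa", "bbb", "ccc", "ddd"), 2)
println(rdd.getNumPartitions)  // 2
println(rdd.toDebugString)     // lineage rooted at a ParallelCollectionRDD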
2.2 Tracing repartition
- Step into org.apache.spark.rdd.RDD.scala
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}
repartition simply calls coalesce with the shuffle parameter set to true.
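In other words (a minimal sketch assuming the sparkContext from the entry program), the repartition(4) call in the demo is equivalent to an explicit coalesce with shuffling enabled:

// Both lines build the same lineage, since repartition delegates to
// coalesce(numPartitions, shuffle = true).
val viaRepartition = sparkContext.parallelize(List("aaa", "bbb", "ccc", "ddd"), 2).repartition(4)
val viaCoalesce    = sparkContext.parallelize(List("aaa", "bbb", "ccc", "ddd"), 2).coalesce(4, shuffle = true)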
2.2.1 Tracing coalesce
- Step into org.apache.spark.rdd.RDD.scala
def coalesce(numPartitions: Int, shuffle: Boolean = false,
             partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
            (implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = (new Random(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]

    // include a shuffle step so that our upstream tasks are still distributed
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
      numPartitions,
      partitionCoalescer).values
  } else {
    new CoalescedRDD(this, numPartitions, partitionCoalescer)
  }
}
Here shuffle is true, so the if branch runs: the data is first re-keyed and shuffled through a ShuffledRDD, and a CoalescedRDD is then built on top of it.
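To make the key assignment concrete, here is a standalone sketch of what distributePartition does for one input partition (plain Scala with illustrative names, not Spark's internal code): starting from a random position seeded by the partition index, it tags each element with a consecutive Int key, and the HashPartitioner later maps that key to an output partition by modulo.

import scala.util.Random

// Illustrative re-implementation of the round-robin tagging above.
val numPartitions = 4
def distribute[T](index: Int, items: Iterator[T]): Iterator[(Int, T)] = {
  var position = new Random(index).nextInt(numPartitions)
  items.map { t =>
    position = position + 1
    (position, t)  // HashPartitioner will later compute position % numPartitions
  }
}

// With the demo's 2 input slices: partition 0 holds "aaa", "bbb"; partition 1 holds "ccc", "ddd".
println(distribute(0, Iterator("aaa", "bbb")).toList)
println(distribute(1, Iterator("ccc", "ddd")).toList)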
- Step into org.apache.spark.rdd.CoalescedRDD.scala
private[spark] class CoalescedRDD[T: ClassTag](
    @transient var prev: RDD[T],
    maxPartitions: Int,
    partitionCoalescer: Option[PartitionCoalescer] = None)
  extends RDD[T](prev.context, Nil) {  // Nil since we implement getDependencies
}
The constructor parameters of CoalescedRDD are (a sketch of the shuffle = false case follows the list):
- prev (RDD[T]): the parent RDD, i.e. the previous RDD in the chain, here the ShuffledRDD
- maxPartitions (Int): the number of partitions after coalescing
- partitionCoalescer (Option[PartitionCoalescer]): the partition coalescer used by coalesce
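For contrast (a minimal sketch assuming the sparkContext from the entry program), when shuffle is false the else branch wraps the parent RDD in a CoalescedRDD directly, and no ShuffledRDD appears in the lineage:

// With shuffle = false (the default), coalesce only adds a CoalescedRDD node.
val narrowed = sparkContext.parallelize(1 to 8, 4).coalesce(2)
println(narrowed.getNumPartitions)  // 2
println(narrowed.toDebugString)     // only a CoalescedRDD over the ParallelCollectionRDD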
- Step into org.apache.spark.rdd.ShuffledRDD.scala
class ShuffledRDD[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient var prev: RDD[_ <: Product2[K, V]],
    part: Partitioner)
  extends RDD[(K, C)](prev.context, Nil) {
}
The constructor parameters of ShuffledRDD are:
- prev: the parent RDD, i.e. the previous RDD in the chain
- part: the Partitioner used to partition the RDD
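ShuffledRDD is normally built indirectly; as a usage-level sketch (assuming the sparkContext from the entry program), explicitly hash-partitioning a pair RDD with partitionBy produces an RDD backed by a ShuffledRDD:

import org.apache.spark.HashPartitioner

// partitionBy introduces a shuffle when the partitioner changes.
val pairs    = sparkContext.parallelize(Seq((1, "a"), (2, "b"), (3, "c")), 2)
val shuffled = pairs.partitionBy(new HashPartitioner(4))
println(shuffled.getNumPartitions)  // 4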
- Step into org.apache.spark.rdd.RDD.scala
private[spark] def mapPartitionsWithIndexInternal[U: ClassTag](
    f: (Int, Iterator[T]) => Iterator[U],
    preservesPartitioning: Boolean = false,
    isOrderSensitive: Boolean = false): RDD[U] = withScope {
  new MapPartitionsRDD(
    this,
    (context: TaskContext, index: Int, iter: Iterator[T]) => f(index, iter),
    preservesPartitioning = preservesPartitioning,
    isOrderSensitive = isOrderSensitive)
}
This builds and returns a MapPartitionsRDD (the public mapPartitionsWithIndex called in the coalesce snippet above constructs a MapPartitionsRDD in the same way), so the parent RDD of the ShuffledRDD above is this MapPartitionsRDD.
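As a usage-level illustration (a minimal sketch assuming the sparkContext from the entry program), the public mapPartitionsWithIndex exposes the same (index, iterator) shape that distributePartition relies on:

// Tag every element with the index of the partition it lives in.
val tagged = sparkContext.parallelize(List("aaa", "bbb", "ccc", "ddd"), 2)
  .mapPartitionsWithIndex((index, iter) => iter.map(t => (index, t)))
println(tagged.collect().toList)  // e.g. List((0,aaa), (0,bbb), (1,ccc), (1,ddd))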
- Then step into org.apache.spark.rdd.MapPartitionsRDD.scala
private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    var prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false,
    isFromBarrier: Boolean = false,
    isOrderSensitive: Boolean = false)
  extends RDD[U](prev) {
}
Here, too, there is a prev parameter, the parent RDD of the MapPartitionsRDD. The prev passed in above is this, i.e. the RDD on which mapPartitionsWithIndexInternal was called, which is the ParallelCollectionRDD; therefore the parent RDD of the MapPartitionsRDD is the ParallelCollectionRDD.
This yields the following processing chain:
ParallelCollectionRDD -> MapPartitionsRDD -> ShuffledRDD -> CoalescedRDD
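The chain can also be inspected at runtime (a minimal sketch assuming the sparkContext from the entry program; exact RDD names can vary slightly across Spark versions, and the trailing .values call adds one more MapPartitionsRDD on top of the CoalescedRDD):

// toDebugString prints the lineage that repartition builds.
val chained = sparkContext.parallelize(List("aaa", "bbb", "ccc", "ddd"), 2).repartition(4)
println(chained.toDebugString)
// roughly: MapPartitionsRDD <- CoalescedRDD <- ShuffledRDD <- MapPartitionsRDD <- ParallelCollectionRDD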
2.3 Tracing collect
- Step into org.apache.spark.rdd.RDD.scala
def collect(): Array[T] = withScope {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}
This calls SparkContext.runJob, which submits and runs the job.
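collect is thus a thin wrapper over SparkContext.runJob. As a minimal sketch (assuming the sparkContext from the entry program), runJob can also be called directly with a function applied to each partition's iterator:

// Run a job that computes the size of each partition; one array element per partition.
val rdd = sparkContext.parallelize(List("aaa", "bbb", "ccc", "ddd"), 2)
val sizes: Array[Int] = sparkContext.runJob(rdd, (iter: Iterator[String]) => iter.size)
println(sizes.toList)  // e.g. List(2, 2)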
3 Summary
- Transformation operators are chained together with the . operator;
- Each transformation operator produces a corresponding transformation RDD (MapPartitionsRDD, ShuffledRDD, CoalescedRDD, etc.);
- Each of these RDDs links to its parent RDD through its prev parameter, ultimately forming a processing chain (illustrated by the sketch below).
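To illustrate the prev-linking idea in isolation (a conceptual sketch in plain Scala with made-up names, not Spark code): each node holds a reference to its parent, and walking those references recovers the full chain, just like an RDD's lineage.

final case class Node(name: String, prev: Option[Node]) {
  // Walk the prev references back to the source and list the names in order.
  def lineage: List[String] = prev.map(_.lineage).getOrElse(Nil) :+ name
}

val source    = Node("ParallelCollectionRDD", None)
val mapped    = Node("MapPartitionsRDD", Some(source))
val shuffled  = Node("ShuffledRDD", Some(mapped))
val coalesced = Node("CoalescedRDD", Some(shuffled))
println(coalesced.lineage.mkString(" -> "))
// ParallelCollectionRDD -> MapPartitionsRDD -> ShuffledRDD -> CoalescedRDD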