Spark Source Code: Building the Processing Chain

1 Program Entry

    import org.apache.spark.{SparkConf, SparkContext}

    val conf: SparkConf = new SparkConf().setAppName("SparkJob_Demo").setMaster("local[*]")
    val sparkContext: SparkContext = new SparkContext(conf)

    sparkContext.parallelize(List("aaa", "bbb", "ccc", "ddd"), 2)  // build a ParallelCollectionRDD with 2 partitions
      .repartition(4)                                              // transformation: redistribute into 4 partitions
      .collect()                                                   // action: trigger the job and fetch the results

2 Stepping Into the Source Code

2.1 Stepping into parallelize

  • Open org.apache.spark.SparkContext.scala
  def parallelize[T: ClassTag](
      seq: Seq[T],
      numSlices: Int = defaultParallelism): RDD[T] = withScope {
    assertNotStopped()
    new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
  }

parallelize creates a ParallelCollectionRDD.

  • Open org.apache.spark.rdd.ParallelCollectionRDD.scala
private[spark] class ParallelCollectionRDD[T: ClassTag](
    sc: SparkContext,
    @transient private val data: Seq[T],
    numSlices: Int,
    locationPrefs: Map[Int, Seq[String]])
    extends RDD[T](sc, Nil) {

}
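
As a quick check, the RDD returned by parallelize is indeed a ParallelCollectionRDD with numSlices partitions. A minimal sketch, reusing the sparkContext from section 1 (the printed runtime class name is an implementation detail):

  val rdd = sparkContext.parallelize(List("aaa", "bbb", "ccc", "ddd"), 2)
  println(rdd.getClass.getSimpleName)  // ParallelCollectionRDD
  println(rdd.getNumPartitions)        // 2, i.e. numSlices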

2.2 Stepping into repartition

  • Open org.apache.spark.rdd.RDD.scala
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }

repartition calls coalesce with the shuffle parameter set to true.
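
So, from user code, the repartition(4) in section 1 is just shorthand for a shuffled coalesce. A minimal sketch, assuming any RDD named rdd:

  val byRepartition = rdd.repartition(4)
  val byCoalesce = rdd.coalesce(4, shuffle = true)  // identical, per the source above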

2.2.1 Stepping into coalesce

  • Open org.apache.spark.rdd.RDD.scala
  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = (new Random(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }

Since shuffle is true here, the if branch runs and a CoalescedRDD is built on top of a ShuffledRDD.
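
Both branches are reachable from user code. A small sketch, reusing the sparkContext from section 1:

  val base = sparkContext.parallelize(List("aaa", "bbb", "ccc", "ddd"), 4)
  val narrowed = base.coalesce(2)                  // else branch: a CoalescedRDD directly, no shuffle
  val shuffled = base.coalesce(4, shuffle = true)  // if branch: a CoalescedRDD on top of a ShuffledRDD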

  • Open org.apache.spark.rdd.CoalescedRDD.scala
private[spark] class CoalescedRDD[T: ClassTag](
    @transient var prev: RDD[T],
    maxPartitions: Int,
    partitionCoalescer: Option[PartitionCoalescer] = None)
  extends RDD[T](prev.context, Nil) {  // Nil since we implement getDependencies

}

CoalescedRDD takes the following constructor parameters:

  1. prev (RDD[T]): the parent RDD, i.e. the previous RDD in the chain; here it is the ShuffledRDD
  2. maxPartitions (Int): the number of partitions of the coalesced RDD
  3. partitionCoalescer (Option[PartitionCoalescer]): the partition coalescer used by coalesce
  • Open org.apache.spark.rdd.ShuffledRDD.scala
class ShuffledRDD[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient var prev: RDD[_ <: Product2[K, V]],
    part: Partitioner)
  extends RDD[(K, C)](prev.context, Nil) {

}

prev: the parent RDD, i.e. the previous RDD in the chain
part: the Partitioner used to partition the RDD; here it is the HashPartitioner(numPartitions) created in coalesce
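
As the comment inside distributePartition notes, the hash code of an Int key is the key itself, so HashPartitioner simply takes the key modulo the partition count. A tiny illustration (4 stands in for numPartitions):

  import org.apache.spark.HashPartitioner

  val partitioner = new HashPartitioner(4)
  println(partitioner.getPartition(5))  // 1, i.e. 5 % 4
  println(partitioner.getPartition(8))  // 0, i.e. 8 % 4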

  • Open org.apache.spark.rdd.RDD.scala
  private[spark] def mapPartitionsWithIndexInternal[U: ClassTag](
      f: (Int, Iterator[T]) => Iterator[U],
      preservesPartitioning: Boolean = false,
      isOrderSensitive: Boolean = false): RDD[U] = withScope {
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => f(index, iter),
      preservesPartitioning = preservesPartitioning,
      isOrderSensitive = isOrderSensitive)
  }

This builds and returns a MapPartitionsRDD (mapPartitionsWithIndex, which coalesce calls in the snippet above, constructs one in the same way), so the parent RDD of the ShuffledRDD is this MapPartitionsRDD.
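
The same tagging logic can be written against the public API, which makes it easier to see what distributePartition produces before the shuffle. A sketch, reusing the sparkContext from section 1 (the exact keys depend on the random starting position chosen per input partition):

  import scala.util.Random

  val tagged = sparkContext.parallelize(List("aaa", "bbb", "ccc", "ddd"), 2)
    .mapPartitionsWithIndex { (index, items) =>
      var position = new Random(index).nextInt(4)  // 4 = target numPartitions
      items.map { t =>
        position = position + 1
        (position, t)  // (key later routed by HashPartitioner, original element)
      }
    }
  tagged.collect().foreach(println)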

  • Next, open org.apache.spark.rdd.MapPartitionsRDD.scala
private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    var prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false,
    isFromBarrier: Boolean = false,
    isOrderSensitive: Boolean = false)
  extends RDD[U](prev) {

}

Here too there is a prev parameter, the MapPartitionsRDD's parent RDD. The value passed in above is this, i.e. the RDD on which mapPartitionsWithIndexInternal was called, which is the ParallelCollectionRDD. The MapPartitionsRDD's parent RDD is therefore the ParallelCollectionRDD.
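
The prev link is also visible from the driver through the public dependencies field. A minimal sketch, reusing the sparkContext from section 1:

  val mapped = sparkContext
    .parallelize(List("aaa", "bbb", "ccc", "ddd"), 2)
    .mapPartitionsWithIndex((index, items) => items)  // builds a MapPartitionsRDD
  println(mapped.dependencies.head.rdd)               // the prev link: prints the ParallelCollectionRDD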

This gives us the processing chain:
ParallelCollectionRDD -> MapPartitionsRDD -> ShuffledRDD -> CoalescedRDD
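
The chain can be confirmed from the driver with toDebugString, which prints an RDD's lineage. A sketch, reusing the sparkContext from section 1 (the exact formatting varies by Spark version; note that the .values call at the end of coalesce adds one more MapPartitionsRDD above the CoalescedRDD):

  val chained = sparkContext
    .parallelize(List("aaa", "bbb", "ccc", "ddd"), 2)
    .repartition(4)
  println(chained.toDebugString)  // prints the lineage of the RDDs built above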

2.3 Stepping into collect

  • Open org.apache.spark.rdd.RDD.scala
  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

collect calls SparkContext.runJob to start running the job.
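
collect is thus just runJob plus Array.concat, and the same call can be made directly. A sketch, assuming an RDD[String] named rdd and the sparkContext from section 1:

  val perPartition: Array[Array[String]] =
    sparkContext.runJob(rdd, (iter: Iterator[String]) => iter.toArray)
  val all: Array[String] = Array.concat(perPartition: _*)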

3 Summary

  1. Transformation operators are chained together with the dot operator (.);
  2. Each Transformation operator produces a corresponding transformation RDD (MapPartitionsRDD, ShuffledRDD, etc.);
  3. Each of these RDDs links to its parent RDD through its prev parameter, which is what builds up the processing chain.
