Spark Source Code: Building the Processing Chain

1 Program Entry

    import org.apache.spark.{SparkConf, SparkContext}

    val conf: SparkConf = new SparkConf().setAppName("SparkJob_Demo").setMaster("local[*]")
    val sparkContext: SparkContext = new SparkContext(conf)

    sparkContext.parallelize(List("aaa", "bbb", "ccc", "ddd"), 2)  // build a ParallelCollectionRDD with 2 partitions
      .repartition(4)                                              // transformation: redistribute into 4 partitions
      .collect()                                                   // action: trigger the job and fetch the results

2 Stepping Into the Source Code

2.1 Stepping into parallelize

  • Open org.apache.spark.SparkContext.scala
  def parallelize[T: ClassTag](
      seq: Seq[T],
      numSlices: Int = defaultParallelism): RDD[T] = withScope {
    assertNotStopped()
    new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
  }

parallelize creates a ParallelCollectionRDD.

  • Open org.apache.spark.rdd.ParallelCollectionRDD.scala
private[spark] class ParallelCollectionRDD[T: ClassTag](
    sc: SparkContext,
    @transient private val data: Seq[T],
    numSlices: Int,
    locationPrefs: Map[Int, Seq[String]])
    extends RDD[T](sc, Nil) {

}
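
As a quick check, the RDD returned by parallelize is indeed a ParallelCollectionRDD with numSlices partitions. A minimal sketch, reusing the sparkContext from section 1 (the printed runtime class name is an implementation detail):

  val rdd = sparkContext.parallelize(List("aaa", "bbb", "ccc", "ddd"), 2)
  println(rdd.getClass.getSimpleName)  // ParallelCollectionRDD
  println(rdd.getNumPartitions)        // 2, i.e. numSlices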

2.2 Stepping into repartition

  • Open org.apache.spark.rdd.RDD.scala
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }

repartition calls coalesce with the shuffle parameter set to true.
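
So, from user code, the repartition(4) in section 1 is just shorthand for a shuffled coalesce. A minimal sketch, assuming any RDD named rdd:

  val byRepartition = rdd.repartition(4)
  val byCoalesce = rdd.coalesce(4, shuffle = true)  // identical, per the source above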

2.2.1 Stepping into coalesce

  • Open org.apache.spark.rdd.RDD.scala
  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = (new Random(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }

Since shuffle is true here, the if branch runs and a CoalescedRDD is built on top of a ShuffledRDD.
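
Both branches are reachable from user code. A small sketch, reusing the sparkContext from section 1:

  val base = sparkContext.parallelize(List("aaa", "bbb", "ccc", "ddd"), 4)
  val narrowed = base.coalesce(2)                  // else branch: a CoalescedRDD directly, no shuffle
  val shuffled = base.coalesce(4, shuffle = true)  // if branch: a CoalescedRDD on top of a ShuffledRDD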

  • Open org.apache.spark.rdd.CoalescedRDD.scala
private[spark] class CoalescedRDD[T: ClassTag](
    @transient var prev: RDD[T],
    maxPartitions: Int,
    partitionCoalescer: Option[PartitionCoalescer] = None)
  extends RDD[T](prev.context, Nil) {  // Nil since we implement getDependencies

}

CoalescedRDD takes the following constructor parameters:

  1. prev (RDD[T]): the parent RDD, i.e. the previous RDD in the chain; here it is the ShuffledRDD
  2. maxPartitions (Int): the number of partitions of the coalesced RDD
  3. partitionCoalescer (Option[PartitionCoalescer]): the partition coalescer used by coalesce
  • Open org.apache.spark.rdd.ShuffledRDD.scala
class ShuffledRDD[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient var prev: RDD[_ <: Product2[K, V]],
    part: Partitioner)
  extends RDD[(K, C)](prev.context, Nil) {

}

prev: the parent RDD, i.e. the previous RDD in the chain
part: the Partitioner used to partition the RDD; here it is the HashPartitioner(numPartitions) created in coalesce
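
As the comment inside distributePartition notes, the hash code of an Int key is the key itself, so HashPartitioner simply takes the key modulo the partition count. A tiny illustration (4 stands in for numPartitions):

  import org.apache.spark.HashPartitioner

  val partitioner = new HashPartitioner(4)
  println(partitioner.getPartition(5))  // 1, i.e. 5 % 4
  println(partitioner.getPartition(8))  // 0, i.e. 8 % 4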

  • Open org.apache.spark.rdd.RDD.scala
  private[spark] def mapPartitionsWithIndexInternal[U: ClassTag](
      f: (Int, Iterator[T]) => Iterator[U],
      preservesPartitioning: Boolean = false,
      isOrderSensitive: Boolean = false): RDD[U] = withScope {
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => f(index, iter),
      preservesPartitioning = preservesPartitioning,
      isOrderSensitive = isOrderSensitive)
  }

This builds and returns a MapPartitionsRDD (mapPartitionsWithIndex, which coalesce calls in the snippet above, constructs one in the same way), so the parent RDD of the ShuffledRDD is this MapPartitionsRDD.
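
The same tagging logic can be written against the public API, which makes it easier to see what distributePartition produces before the shuffle. A sketch, reusing the sparkContext from section 1 (the exact keys depend on the random starting position chosen per input partition):

  import scala.util.Random

  val tagged = sparkContext.parallelize(List("aaa", "bbb", "ccc", "ddd"), 2)
    .mapPartitionsWithIndex { (index, items) =>
      var position = new Random(index).nextInt(4)  // 4 = target numPartitions
      items.map { t =>
        position = position + 1
        (position, t)  // (key later routed by HashPartitioner, original element)
      }
    }
  tagged.collect().foreach(println)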

  • Next, open org.apache.spark.rdd.MapPartitionsRDD.scala
private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    var prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false,
    isFromBarrier: Boolean = false,
    isOrderSensitive: Boolean = false)
  extends RDD[U](prev) {

}

Here too there is a prev parameter, the MapPartitionsRDD's parent RDD. The value passed in above is this, i.e. the RDD on which mapPartitionsWithIndexInternal was called, which is the ParallelCollectionRDD. The MapPartitionsRDD's parent RDD is therefore the ParallelCollectionRDD.
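
The prev link is also visible from the driver through the public dependencies field. A minimal sketch, reusing the sparkContext from section 1:

  val mapped = sparkContext
    .parallelize(List("aaa", "bbb", "ccc", "ddd"), 2)
    .mapPartitionsWithIndex((index, items) => items)  // builds a MapPartitionsRDD
  println(mapped.dependencies.head.rdd)               // the prev link: prints the ParallelCollectionRDD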

This gives us the processing chain:
ParallelCollectionRDD -> MapPartitionsRDD -> ShuffledRDD -> CoalescedRDD
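
The chain can be confirmed from the driver with toDebugString, which prints an RDD's lineage. A sketch, reusing the sparkContext from section 1 (the exact formatting varies by Spark version; note that the .values call at the end of coalesce adds one more MapPartitionsRDD above the CoalescedRDD):

  val chained = sparkContext
    .parallelize(List("aaa", "bbb", "ccc", "ddd"), 2)
    .repartition(4)
  println(chained.toDebugString)  // prints the lineage of the RDDs built above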

2.3 Stepping into collect

  • Open org.apache.spark.rdd.RDD.scala
  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

collect calls SparkContext.runJob to start running the job.
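
collect is thus just runJob plus Array.concat, and the same call can be made directly. A sketch, assuming an RDD[String] named rdd and the sparkContext from section 1:

  val perPartition: Array[Array[String]] =
    sparkContext.runJob(rdd, (iter: Iterator[String]) => iter.toArray)
  val all: Array[String] = Array.concat(perPartition: _*)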

3 Summary

  1. Transformation operators are chained together with the dot operator (.);
  2. Each Transformation operator produces a corresponding transformation RDD (MapPartitionsRDD, ShuffledRDD, etc.);
  3. Each of these RDDs links to its parent RDD through its prev parameter, which is what builds up the processing chain.
