Spark 2.2.0 source code reading --- spark core package --- partial/rdd

1. Goals of this article and other notes:

    This article introduces the classes in the partial and rdd packages.

2. Data structures in the partial package

private[spark] trait ApproximateEvaluator[U, R] {
  def merge(outputId: Int, taskResult: U): Unit
  def currentResult(): R
}

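The two methods of this interface incrementally merge the results of different tasks. merge is called once each time a task finishes, and currentResult returns the estimate based on whatever has been merged so far.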
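To make the merge/currentResult contract concrete, here is a minimal sketch that compiles without Spark. The trait is a simplified public copy of the one above; NaiveCountEvaluator is a hypothetical class (not part of Spark) that extrapolates linearly and computes no confidence bounds.

```scala
// Simplified, public copy of the trait so the sketch compiles without Spark.
trait ApproximateEvaluator[U, R] {
  def merge(outputId: Int, taskResult: U): Unit
  def currentResult(): R
}

// Hypothetical evaluator: sums per-task counts and scales the sum by the
// fraction of tasks that have reported (naive extrapolation, no bounds).
class NaiveCountEvaluator(totalOutputs: Int)
  extends ApproximateEvaluator[Long, Double] {

  private var outputsMerged = 0
  private var sum = 0L

  // Called once per finished task.
  override def merge(outputId: Int, taskResult: Long): Unit = {
    outputsMerged += 1
    sum += taskResult
  }

  override def currentResult(): Double =
    if (outputsMerged == 0) 0.0
    else if (outputsMerged == totalOutputs) sum.toDouble // exact result
    else sum.toDouble * totalOutputs / outputsMerged     // extrapolated result
}
```

With 4 tasks in total, merging the results of 2 of them yields an extrapolated estimate; once all 4 are merged the result is the exact sum.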

private[spark] class ApproximateActionListener[T, U, R](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    evaluator: ApproximateEvaluator[U, R],
    timeout: Long)
  extends JobListener {

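Returns a result within the given timeout. The result may cover all partitions, or only the partitions that finished within the timeout (hence the package name partial).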
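The "wait until everything finished or the deadline passed" behaviour can be sketched with plain JVM monitors. TimedCollector below is a hypothetical stand-in for the listener, not Spark's implementation: it sums Long results and reports whether the value is final.

```scala
// Hypothetical stand-in for the listener: collects task results and lets the
// caller wait until either all tasks finished or the timeout elapsed.
class TimedCollector(totalTasks: Int, timeoutMs: Long) {
  private val startTime = System.currentTimeMillis()
  private var finishedTasks = 0
  private var sum = 0L

  // Called by the scheduling side whenever a task finishes.
  def taskSucceeded(index: Int, result: Long): Unit = synchronized {
    sum += result
    finishedTasks += 1
    if (finishedTasks == totalTasks) notifyAll() // wake up the waiting caller
  }

  // Blocks until completion or the deadline; returns (sum so far, isFinal).
  def awaitResult(): (Long, Boolean) = synchronized {
    val finishTime = startTime + timeoutMs
    var remaining = finishTime - System.currentTimeMillis()
    while (finishedTasks < totalTasks && remaining > 0) {
      wait(remaining)
      remaining = finishTime - System.currentTimeMillis()
    }
    (sum, finishedTasks == totalTasks)
  }
}
```

If only some tasks have reported when the timeout expires, the caller gets the partial sum with the flag set to false.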

class BoundedDouble(val mean: Double, val confidence: Double, val low: Double, val high: Double) {

This class wraps mean/confidence/low/high. Its equals method returns true only when all four values are equal.

private[spark] class CountEvaluator(totalOutputs: Int, confidence: Double)
  extends ApproximateEvaluator[Long, BoundedDouble] {

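Roughly equivalent to returning the element count: it merges the per-task counts and reports the result as a BoundedDouble.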

private[spark] class GroupedCountEvaluator[T : ClassTag](totalOutputs: Int, confidence: Double)
  extends ApproximateEvaluator[OpenHashMap[T, Long], Map[T, BoundedDouble]] {

Accumulates the counts separately for each key.

private[spark] class MeanEvaluator(totalOutputs: Int, confidence: Double)
  extends ApproximateEvaluator[StatCounter, BoundedDouble] {

Returns the mean of the results.

private[spark] class SumEvaluator(totalOutputs: Int, confidence: Double)
  extends ApproximateEvaluator[StatCounter, BoundedDouble] {

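Returns the sum of the elements.

3. Data structures in the rdd package

This package contains many concrete RDD implementations that all follow the same pattern, so only one or two of them are examined here.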

private[spark] class BlockRDDPartition(val blockId: BlockId, idx: Int) extends Partition {
  val index = idx
}

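The partition type of BlockRDD. blockId identifies the block that this partition refers to, i.e. where its data comes from. idx is the index of this partition within the BlockRDD.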

private[spark]
class BlockRDD[T: ClassTag](sc: SparkContext, @transient val blockIds: Array[BlockId])
  extends RDD[T](sc, Nil) {
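@transient val blockIds: Array[BlockId] is the source of the data blocks. BlockRDD has no parent RDD (its dependency list is Nil), so the blocks play the role that a parent RDD's data would otherwise play. A concrete RDD implementation typically overrides three methods: getPartitions / compute / getPreferredLocations

override def getPartitions: Array[Partition] = {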
  assertValid()
  (0 until blockIds.length).map { i =>
    new BlockRDDPartition(blockIds(i), i).asInstanceOf[Partition]
  }.toArray
}
Generates all partitions of the current RDD, one per block ID.

override def compute(split: Partition, context: TaskContext): Iterator[T] = {
  assertValid()
  val blockManager = SparkEnv.get.blockManager
  val blockId = split.asInstanceOf[BlockRDDPartition].blockId
  blockManager.get[T](blockId) match {
    case Some(block) => block.data.asInstanceOf[Iterator[T]]
    case None =>
      throw new Exception(s"Could not compute split, block $blockId of RDD $id not found")
  }
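}
split is the partition of this RDD to compute (it carries the ID of the block to read), and context is the context of the task that computes it. If the BlockManager does not have the block, an exception is thrown.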
 
  
override def getPreferredLocations(split: Partition): Seq[String] = {
  assertValid()
  _locations(split.asInstanceOf[BlockRDDPartition].blockId)
}
Returns the preferred locations of the partition, i.e. where the block it refers to is physically stored.
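The three-method pattern can be tried out without a cluster. The sketch below uses hypothetical, heavily simplified stand-ins (MiniPartition, MiniRDD, MiniBlockRDD); plain maps replace the BlockManager and the block location lookup, and there is no TaskContext.

```scala
// Hypothetical, heavily simplified stand-ins for Spark's Partition and RDD.
trait MiniPartition { def index: Int }

abstract class MiniRDD[T] {
  def getPartitions: Array[MiniPartition]
  def compute(split: MiniPartition): Iterator[T]
  def getPreferredLocations(split: MiniPartition): Seq[String] = Nil
  // Local helper: evaluate every partition in index order.
  def collect(): Seq[T] = getPartitions.toSeq.flatMap(p => compute(p))
}

class MiniBlockPartition(val blockId: String, idx: Int) extends MiniPartition {
  val index: Int = idx
}

// `store` and `locations` are plain maps in this sketch.
class MiniBlockRDD[T](
    blockIds: Array[String],
    store: Map[String, Seq[T]],
    locations: Map[String, Seq[String]]) extends MiniRDD[T] {

  // One partition per block ID.
  override def getPartitions: Array[MiniPartition] =
    blockIds.indices.map { i =>
      new MiniBlockPartition(blockIds(i), i): MiniPartition
    }.toArray

  // Look up the block that the partition refers to and iterate over it.
  override def compute(split: MiniPartition): Iterator[T] = {
    val blockId = split.asInstanceOf[MiniBlockPartition].blockId
    store.get(blockId) match {
      case Some(data) => data.iterator
      case None => throw new NoSuchElementException(s"block $blockId not found")
    }
  }

  override def getPreferredLocations(split: MiniPartition): Seq[String] =
    locations.getOrElse(split.asInstanceOf[MiniBlockPartition].blockId, Nil)
}
```

The real BlockRDD differs in many details; the sketch only shows how the partition object carries the information that compute and getPreferredLocations need.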

 
  
private[spark] case class NarrowCoGroupSplitDep(
    @transient rdd: RDD[_],
    @transient splitIndex: Int,
    var split: Partition
  ) extends Serializable {

  @throws(classOf[IOException])
  private def writeObject(oos: ObjectOutputStream): Unit = Utils.tryOrIOException {
    // Update the reference to parent split at the time of task serialization
    split = rdd.partitions(splitIndex)
    oos.defaultWriteObject()
  }
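}
Holds exactly one partition, the parent partition of a narrow dependency. writeObject refreshes this reference from the parent RDD at the moment the task is serialized.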
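The writeObject hook used above is a standard Java serialization feature: a private writeObject(ObjectOutputStream) method is invoked when the object is written. A minimal sketch, with a hypothetical RefreshOnWrite class whose field is refreshed just before it is written:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.io.{ObjectInputStream, ObjectOutputStream}

// Hypothetical holder: `current` is refreshed from a non-serialized source
// right before Java serialization writes the object.
class RefreshOnWrite(
    @transient private val source: () => String,
    var current: String) extends Serializable {

  private def writeObject(oos: ObjectOutputStream): Unit = {
    current = source()       // update the field at serialization time
    oos.defaultWriteObject() // then write the non-transient fields
  }
}

object RefreshOnWriteDemo {
  // Serialize to bytes and read the object back.
  def roundTrip(obj: RefreshOnWrite): RefreshOnWrite = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    in.readObject().asInstanceOf[RefreshOnWrite]
  }
}
```

The deserialized copy sees the value that was current when serialization happened, not the value passed to the constructor. Its transient `source` field is null after deserialization, which mirrors why rdd and splitIndex are marked @transient above.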
private[spark] class CoGroupPartition(
    override val index: Int, val narrowDeps: Array[Option[NarrowCoGroupSplitDep]])
  extends Partition with Serializable {
  override def hashCode(): Int = index
  override def equals(other: Any): Boolean = super.equals(other)
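}
Represents one partition of CoGroupedRDD. narrowDeps has one entry per parent RDD: the parent partition when the dependency is narrow, or None when the parent is reached through a shuffle.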
 
  
class CoGroupedRDD[K: ClassTag](
    @transient var rdds: Seq[RDD[_ <: Product2[K, _]]],
    part: Partitioner)
  extends RDD[(K, Array[Iterable[_]])](rdds.head.context, Nil) {
 
  
 
This is the RDD produced by the cogroup operator.
  
override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_] =>
    if (rdd.partitioner == Some(part)) {
      logDebug("Adding one-to-one dependency with " + rdd)
      new OneToOneDependency(rdd)
    } else {
      logDebug("Adding shuffle dependency with " + rdd)
      new ShuffleDependency[K, Any, CoGroupCombiner](
        rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
    }
  }
}
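Returns the dependencies (lineage) of this RDD: a one-to-one dependency for a parent that already uses the same partitioner, and a shuffle dependency otherwise.

override def getPartitions: Array[Partition] = {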
  val array = new Array[Partition](part.numPartitions)
  for (i <- 0 until array.length) {
    // Each CoGroupPartition will have a dependency per contributing RDD
    array(i) = new CoGroupPartition(i, rdds.zipWithIndex.map { case (rdd, j) =>
      // Assume each RDD contributed a single dependency, and get it
      dependencies(j) match {
        case s: ShuffleDependency[_, _, _] =>
          None
        case _ =>
          Some(new NarrowCoGroupSplitDep(rdd, i, rdd.partitions(i)))
      }
    }.toArray)
  }
  array
}
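Generates the partitions of the current CoGroupedRDD.

override def compute(s: Partition, context: TaskContext): Iterator[(K, Array[Iterable[_]])] = {
Computes the data of partition s. Each output element pairs a key with one collection of values per parent RDD.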
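The result shape of cogroup is easiest to see on local collections. The sketch below is a hypothetical helper for two inputs only and has nothing to do with how CoGroupedRDD actually reads its parents; it only reproduces the grouping semantics.

```scala
object LocalCoGroup {
  // For every key present in either input, collect all values from `left`
  // and all values from `right` that belong to that key.
  def cogroup[K, V, W](
      left: Seq[(K, V)],
      right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val l: Map[K, Seq[V]] =
      left.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    val r: Map[K, Seq[W]] =
      right.groupBy(_._1).map { case (k, kws) => k -> kws.map(_._2) }
    (l.keySet ++ r.keySet).map { k =>
      k -> ((l.getOrElse(k, Seq.empty[V]), r.getOrElse(k, Seq.empty[W])))
    }.toMap
  }
}
```

A key that appears in only one input still shows up in the result, with an empty collection on the other side.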
 
  

class DoubleRDDFunctions(self: RDD[Double]) extends Logging with Serializable {
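This class provides functions such as sum, mean, variance and standard deviation. They become available on an RDD[Double] through an implicit conversion.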
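The "extra methods through implicit conversion" mechanism can be sketched on a local Seq[Double]. DoubleSeqFunctions is a hypothetical wrapper written for this illustration; it computes the population variance and returns NaN for an empty sequence.

```scala
object DoubleSeqSyntax {
  // Hypothetical wrapper, similar in spirit to DoubleRDDFunctions but for a
  // local Seq. The implicit class makes the methods callable on any Seq[Double].
  implicit class DoubleSeqFunctions(val self: Seq[Double]) extends AnyVal {
    def mean: Double = self.sum / self.size

    // Population variance: average squared distance from the mean.
    def variance: Double = {
      val m = mean
      self.map(x => (x - m) * (x - m)).sum / self.size
    }

    def stdev: Double = math.sqrt(variance)
  }
}
```

After `import DoubleSeqSyntax._`, an expression such as `Seq(1.0, 2.0, 3.0, 4.0).mean` compiles even though Seq itself has no mean method, because the compiler wraps the Seq in DoubleSeqFunctions.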
 
  
class OrderedRDDFunctions[K : Ordering : ClassTag,
                          V: ClassTag,
                          P <: Product2[K, V] : ClassTag] @DeveloperApi() (
    self: RDD[P])
  extends Logging with Serializable {
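The sorting helper for RDDs (for example sortByKey). It also extends RDD through an implicit conversion and is meant for data in key-value pair form whose keys have an Ordering.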
 
  
class PairRDDFunctions[K, V](self: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)
  extends Logging with Serializable {
Dedicated to key-value data. It provides the operators for pair RDDs, which become available on an RDD[(K, V)] through an implicit conversion.
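The same enrichment idea applies to key-value data. The sketch below defines a hypothetical PairSeqFunctions wrapper for a local Seq[(K, V)]; the method names echo the RDD operators, but the implementations are plain collection code, and sortByKey is only callable when an Ordering[K] is in scope.

```scala
object PairSeqSyntax {
  // Hypothetical wrapper for local key-value sequences.
  implicit class PairSeqFunctions[K, V](self: Seq[(K, V)]) {
    // Merge all values that share a key using `f`.
    def reduceByKey(f: (V, V) => V): Map[K, V] =
      self.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

    // Sort by key; requires an implicit Ordering for the key type.
    def sortByKey(ascending: Boolean = true)(
        implicit ord: Ordering[K]): Seq[(K, V)] = {
      val sorted = self.sortBy(_._1)
      if (ascending) sorted else sorted.reverse
    }
  }
}
```

Requiring the Ordering only on sortByKey keeps reduceByKey usable for key types that cannot be ordered.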
 
  
private[spark] class ReliableRDDCheckpointData[T: ClassTag](@transient private val rdd: RDD[T])
  extends RDDCheckpointData[T](rdd) with Logging {
Writes the RDD's data to the checkpoint location in an external storage system such as HDFS.
 
  

 
  
 
  
