This post mainly walks through the classes under the partial and rdd packages.
private[spark] trait ApproximateEvaluator[U, R] {
  def merge(outputId: Int, taskResult: U): Unit
  def currentResult(): R
}
The two methods of this trait incrementally merge the results produced by different tasks: merge is called once each time a task finishes, and currentResult returns the estimate built from whatever has been merged so far.
private[spark] class ApproximateActionListener[T, U, R](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    evaluator: ApproximateEvaluator[U, R],
    timeout: Long)
  extends JobListener {
It returns a result within the given timeout. That result may cover all partitions, or only the partitions that finished within the timeout, which is why the package is named partial.
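As a usage sketch of how this surfaces to users (the timeout and confidence values below are arbitrary assumptions), approximate actions such as RDD.countApprox return a PartialResult[BoundedDouble] built by exactly this machinery:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.partial.{BoundedDouble, PartialResult}

object CountApproxDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("count-approx").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 1000000, numSlices = 100)

    // Ask for an answer within 200 ms at 95% confidence; the result may be
    // based only on the partitions that finished in time.
    val partial: PartialResult[BoundedDouble] = rdd.countApprox(timeout = 200, confidence = 0.95)

    val estimate: BoundedDouble = partial.initialValue   // possibly partial estimate
    println(s"mean=${estimate.mean} low=${estimate.low} high=${estimate.high}")

    val finalValue: BoundedDouble = partial.getFinalValue() // blocks until the whole job completes
    println(s"final: ${finalValue.mean}")
    sc.stop()
  }
}

initialValue gives whatever estimate is available when the timeout expires; getFinalValue() blocks until every partition has been merged.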
class BoundedDouble(val mean: Double, val confidence: Double, val low: Double, val high: Double) {
This class wraps mean/confidence/low/high; its equals returns true only when all four fields are equal.
private[spark] class CountEvaluator(totalOutputs: Int, confidence: Double) extends ApproximateEvaluator[Long, BoundedDouble] {
Estimates the total element count, returned as a BoundedDouble.
private[spark] class GroupedCountEvaluator[T : ClassTag](totalOutputs: Int, confidence: Double)
  extends ApproximateEvaluator[OpenHashMap[T, Long], Map[T, BoundedDouble]] {
Accumulates counts per key and produces an approximate count for each distinct key.
private[spark] class MeanEvaluator(totalOutputs: Int, confidence: Double) extends ApproximateEvaluator[StatCounter, BoundedDouble] {
Estimates the mean of the elements.
private[spark] class SumEvaluator(totalOutputs: Int, confidence: Double) extends ApproximateEvaluator[StatCounter, BoundedDouble] {
Estimates the sum of the elements.
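To make the merge/currentResult contract concrete, here is a minimal hand-rolled counter in the same spirit. It is only a sketch: it does not extend the private[spark] trait, and unlike Spark's CountEvaluator it extrapolates linearly without computing a real confidence interval.

// Hypothetical illustration of the evaluator contract, not Spark's implementation.
class NaiveCountEvaluator(totalOutputs: Int) {
  private var outputsMerged = 0
  private var sum = 0L

  // Called once per finished task with that task's partial count.
  def merge(outputId: Int, taskResult: Long): Unit = {
    outputsMerged += 1
    sum += taskResult
  }

  // Linearly extrapolate from the partitions seen so far.
  def currentResult(): Double =
    if (outputsMerged == 0) 0.0
    else sum.toDouble * totalOutputs / outputsMerged
}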
Because the rdd package contains many concrete RDD implementations that all follow the same pattern, only one or two of them are examined here.
private[spark] class BlockRDDPartition(val blockId: BlockId, idx: Int) extends Partition {
  val index = idx
}
The partition type of BlockRDD. blockId identifies the parent block that this partition refers to; idx is the index of this partition within the BlockRDD.
private[spark] class BlockRDD[T: ClassTag](sc: SparkContext, @transient val blockIds: Array[BlockId]) extends RDD[T](sc, Nil) {
@transient val blockIds: Array[BlockId] is the source of the data blocks and can be thought of as the parent RDD's data. Every concrete RDD implementation overrides three methods: getPartitions / compute / getPreferredLocations.
override def getPartitions: Array[Partition] = {
  assertValid()
  (0 until blockIds.length).map { i =>
    new BlockRDDPartition(blockIds(i), i).asInstanceOf[Partition]
  }.toArray
}
Generates all of this RDD's partitions, one BlockRDDPartition per block id.
override def compute(split: Partition, context: TaskContext): Iterator[T] = {
  assertValid()
  val blockManager = SparkEnv.get.blockManager
  val blockId = split.asInstanceOf[BlockRDDPartition].blockId
  blockManager.get[T](blockId) match {
    case Some(block) => block.data.asInstanceOf[Iterator[T]]
    case None =>
      throw new Exception(s"Could not compute split, block $blockId of RDD $id not found")
  }
}
split is the partition to compute and context is the context of the task running it; compute fetches the block from the BlockManager and returns its data as an iterator.
override def getPreferredLocations(split: Partition): Seq[String] = {
  assertValid()
  _locations(split.asInstanceOf[BlockRDDPartition].blockId)
}
Returns the preferred locations of the partition, i.e. the hosts where the block's data physically resides.
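The same three-method pattern applies to any custom RDD. Below is a hypothetical sketch (RangePartition and SimpleRangeRDD are invented names for illustration) that emits the integers [0, n) across numParts partitions:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: one contiguous integer range per partition.
class RangePartition(override val index: Int, val start: Int, val end: Int) extends Partition

// Hypothetical RDD emitting the integers [0, n) split into numParts partitions.
class SimpleRangeRDD(sc: SparkContext, n: Int, numParts: Int) extends RDD[Int](sc, Nil) {

  override def getPartitions: Array[Partition] = {
    (0 until numParts).map { i =>
      val start = (i * n.toLong / numParts).toInt
      val end = ((i + 1) * n.toLong / numParts).toInt
      new RangePartition(i, start, end): Partition
    }.toArray
  }

  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }

  // No locality preference for purely in-memory ranges.
  override def getPreferredLocations(split: Partition): Seq[String] = Nil
}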
private[spark] case class NarrowCoGroupSplitDep(
    @transient rdd: RDD[_],
    @transient splitIndex: Int,
    var split: Partition
  ) extends Serializable {

  @throws(classOf[IOException])
  private def writeObject(oos: ObjectOutputStream): Unit = Utils.tryOrIOException {
    // Update the reference to parent split at the time of task serialization
    split = rdd.partitions(splitIndex)
    oos.defaultWriteObject()
  }
}
It holds a single partition of a parent RDD; writeObject refreshes that reference to the parent split when the task is serialized.
private[spark] class CoGroupPartition(
    override val index: Int,
    val narrowDeps: Array[Option[NarrowCoGroupSplitDep]])
  extends Partition with Serializable {
  override def hashCode(): Int = index
  override def equals(other: Any): Boolean = super.equals(other)
}
This is the partition type of CoGroupedRDD. narrowDeps holds one entry per parent RDD: Some(NarrowCoGroupSplitDep) for a narrow parent, None for a shuffle parent.
class CoGroupedRDD[K: ClassTag](
    @transient var rdds: Seq[RDD[_ <: Product2[K, _]]],
    part: Partitioner)
  extends RDD[(K, Array[Iterable[_]])](rdds.head.context, Nil) {
This is the RDD produced by the cogroup operator.
override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_] =>
    if (rdd.partitioner == Some(part)) {
      logDebug("Adding one-to-one dependency with " + rdd)
      new OneToOneDependency(rdd)
    } else {
      logDebug("Adding shuffle dependency with " + rdd)
      new ShuffleDependency[K, Any, CoGroupCombiner](
        rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
    }
  }
}
Returns this RDD's dependencies (its lineage): a OneToOneDependency for a parent already partitioned by part, otherwise a ShuffleDependency.
override def getPartitions: Array[Partition] = {
  val array = new Array[Partition](part.numPartitions)
  for (i <- 0 until array.length) {
    // Each CoGroupPartition will have a dependency per contributing RDD
    array(i) = new CoGroupPartition(i, rdds.zipWithIndex.map { case (rdd, j) =>
      // Assume each RDD contributed a single dependency, and get it
      dependencies(j) match {
        case s: ShuffleDependency[_, _, _] => None
        case _ => Some(new NarrowCoGroupSplitDep(rdd, i, rdd.partitions(i)))
      }
    }.toArray)
  }
  array
}
Generates this CoGroupedRDD's own partitions: one CoGroupPartition per output partition, each carrying a NarrowCoGroupSplitDep for every narrow-dependency parent.
override def compute(s: Partition, context: TaskContext): Iterator[(K, Array[Iterable[_]])] = {
Computes the data of partition s, reading narrow parents directly and shuffle parents through the shuffle reader.
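For reference, a small usage sketch of the cogroup operator that produces such a CoGroupedRDD (the sample data is made up):

import org.apache.spark.{SparkConf, SparkContext}

object CoGroupDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cogroup-demo").setMaster("local[*]"))

    val scores = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val labels = sc.parallelize(Seq(("a", "x"), ("c", "y")))

    // For each key, group the values from both RDDs side by side.
    val grouped = scores.cogroup(labels)
    grouped.collect().foreach { case (k, (vs, ws)) =>
      println(s"$k -> values=${vs.toList} labels=${ws.toList}")
    }
    // e.g. a -> values=List(1, 3) labels=List(x)
    //      b -> values=List(2)    labels=List()
    //      c -> values=List()     labels=List(y)
    sc.stop()
  }
}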
class DoubleRDDFunctions(self: RDD[Double]) extends Logging with Serializable {
Provides operations such as sum, mean, variance, and standard deviation; they become available on RDD[Double] through an implicit conversion.
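A brief sketch of that implicit conversion in action on an RDD[Double] (the numbers are arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

object DoubleStatsDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("double-stats").setMaster("local[*]"))
    val xs = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))

    // These methods come from DoubleRDDFunctions via the implicit conversion in the RDD object.
    println(xs.sum())       // 10.0
    println(xs.mean())      // 2.5
    println(xs.variance())  // 1.25
    println(xs.stdev())     // ~1.118
    sc.stop()
  }
}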
class OrderedRDDFunctions[K : Ordering : ClassTag, V: ClassTag, P <: Product2[K, V] : ClassTag] @DeveloperApi() (
    self: RDD[P])
  extends Logging with Serializable {
The sorting functionality for RDDs (e.g. sortByKey), also added to RDDs through an implicit conversion; it targets key-value data.
class PairRDDFunctions[K, V](self: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)
  extends Logging with Serializable {
Dedicated to key-value data; it provides the pair operators and is made available on RDDs through an implicit conversion.
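A short sketch showing both of these classes kicking in through implicit conversions on a pair RDD (the sample data is made up):

import org.apache.spark.{SparkConf, SparkContext}

object PairOpsDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pair-ops").setMaster("local[*]"))
    val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("a", 3)))

    // reduceByKey comes from PairRDDFunctions.
    val summed = pairs.reduceByKey(_ + _)

    // sortByKey comes from OrderedRDDFunctions.
    summed.sortByKey().collect().foreach(println)
    // (a,4)
    // (b,2)
    sc.stop()
  }
}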
private[spark] class ReliableRDDCheckpointData[T: ClassTag](@transient private val rdd: RDD[T])
  extends RDDCheckpointData[T](rdd) with Logging {
This is where the RDD's data is written to the checkpoint location in an external storage system such as HDFS.
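A hedged usage sketch of reliable checkpointing (the checkpoint directory is a placeholder; in production it would typically point at an HDFS path):

import org.apache.spark.{SparkConf, SparkContext}

object CheckpointDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-demo").setMaster("local[*]"))

    // Placeholder directory for illustration only.
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    val rdd = sc.parallelize(1 to 100).map(_ * 2)
    rdd.checkpoint()            // mark for checkpointing; nothing is written yet
    rdd.count()                 // the first action triggers the checkpoint write
    println(rdd.isCheckpointed) // true once the data has been materialized
    sc.stop()
  }
}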