Spark RDD Operators: A Source Code Walkthrough

Based on the Spark 1.5.0 source code, this article briefly introduces the functionality, underlying principles, and call interfaces of Spark's RDD operators. Only the basic operators defined on a plain RDD are covered here; the extended operators on RDD[(K, V)] (i.e. the methods in PairRDDFunctions) are not included and will be covered in a follow-up article.

  • ++
    Returns an RDD containing the elements of this RDD and `other` combined; duplicate elements are kept.
  /**
   * Return the union of this RDD and another one. Any identical elements will appear multiple
   * times (use `.distinct()` to eliminate them).
   */
  def ++(other: RDD[T]): RDD[T] = withScope {
    this.union(other)
  }
  • aggregate
    Aggregates the elements of the RDD. A zero value must be supplied as the initial accumulator (elements are accumulated, so an initial value is required). Elements are first aggregated within each partition with seqOp (folding elements of type T into a partition-level result of type U), then the per-partition results are merged across partitions with combOp, and the final value of type U is returned. A short usage sketch follows the excerpt below.
  /**
   * Aggregate the elements of each partition, and then the results for all the partitions, using
   * given combine functions and a neutral "zero value". This function can return a different result
   * type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into an U
   * and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are
   * allowed to modify and return their first argument instead of creating a new U to avoid memory
   * allocation.
   */
  def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U
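    A minimal usage sketch (my own example, not from the Spark source), assuming a SparkContext `sc` such as the one provided by spark-shell; it computes a sum and a count in a single pass:
  val nums = sc.parallelize(1 to 100, 4)

  // seqOp folds an Int into a (sum, count) accumulator inside each partition;
  // combOp merges the per-partition accumulators.
  val (sum, count) = nums.aggregate((0L, 0L))(
    (acc, x) => (acc._1 + x, acc._2 + 1),
    (a, b) => (a._1 + b._1, a._2 + b._2))

  val avg = sum.toDouble / count   // 50.5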
  • cache
    Caches the RDD in memory (persist with the default MEMORY_ONLY storage level).
  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
  def cache(): this.type = persist()
  • cartesian
    Returns an RDD whose elements are the Cartesian product of the elements of the two RDDs.
    For example:
    RDD1 = {a, c}, RDD2 = {b, d}  =>  RDD3 = {(a,b), (a,d), (c,b), (c,d)}
  /**
   * Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
   * elements (a, b) where a is in `this` and b is in `other`.
   */
  def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
    new CartesianRDD(sc, this, other)
  }
  • checkpoint
    Marks the RDD for checkpointing: its contents will be written to a distributed file system (such as HDFS) and its dependency chain (lineage) will be removed. This is a classic space-for-time trade-off: the data is stored so that, on failure, it does not have to be recomputed through the lineage. Its semantics differ from those of persist. A short usage sketch follows the excerpt below.
  /**
   * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
   * directory set with `SparkContext#setCheckpointDir` and all references to its parent
   * RDDs will be removed. This function must be called before any job has been
   * executed on this RDD. It is strongly recommended that this RDD is persisted in
   * memory, otherwise saving it on a file will require recomputation.
   */
  def checkpoint(): Unit = RDDCheckpointData.synchronized {
    // NOTE: we use a global lock here due to complexities downstream with ensuring
    // children RDD partitions point to the correct parent partitions. In the future
    // we should revisit this consideration.
    if (context.checkpointDir.isEmpty) {
      throw new SparkException("Checkpoint directory has not been set in the SparkContext")
    } else if (checkpointData.isEmpty) {
      checkpointData = Some(new ReliableRDDCheckpointData(this))
    }
  }
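    A minimal usage sketch (my own example, not from the Spark source), assuming `sc` and a checkpoint directory of your own (the HDFS path below is hypothetical):
  sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

  val rdd = sc.parallelize(1 to 1000).map(_ * 2)
  rdd.cache()        // recommended, so the data is not recomputed when it is written out
  rdd.checkpoint()   // only marks the RDD; the data is materialized by the next job
  rdd.count()        // runs a job, writes the checkpoint, and truncates the lineage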
  • coalesce
    Repartitions the elements of the RDD.
    When the partition count is reduced moderately, e.g. 1000 -> 100, Spark treats every 10 of the original partitions as one new partition; the parallelism becomes 100 and no shuffle is triggered.
    When the partition count is reduced drastically, e.g. 1000 -> 1, the parallelism of the computation drops to 1. To avoid this loss of parallelism, set shuffle to true: the computation before the shuffle and the computation after it then run in separate stages, with parallelism 1000 and 1 respectively.
    When the partition count is increased, a shuffle is unavoidable, so shuffle must be set to true. A short usage sketch follows the excerpt below.
  /**
   * Return a new RDD that is reduced into `numPartitions` partitions.
   *
   * This results in a narrow dependency, e.g. if you go from 1000 partitions
   * to 100 partitions, there will not be a shuffle, instead each of the 100
   * new partitions will claim 10 of the current partitions.
   *
   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
   * this may result in your computation taking place on fewer nodes than
   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
   * you can pass shuffle = true. This will add a shuffle step, but means the
   * current upstream partitions will be executed in parallel (per whatever
   * the current partitioning is).
   *
   * Note: With shuffle = true, you can actually coalesce to a larger number
   * of partitions. This is useful if you have a small number of partitions,
   * say 100, potentially with a few partitions being abnormally large. Calling
   * coalesce(1000, shuffle = true) will result in 1000 partitions with the
   * data distributed using a hash partitioner.
   */
  def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
      : RDD[T]
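    A minimal usage sketch (my own example, not from the Spark source), assuming `sc`:
  val wide = sc.parallelize(1 to 100000, 1000)

  val narrowed  = wide.coalesce(100)                   // narrow dependency, no shuffle
  val collapsed = wide.coalesce(1, shuffle = true)     // upstream stage still runs with 1000 tasks,
                                                       // then a shuffle brings the data to 1 partition
  val widened   = wide.coalesce(2000, shuffle = true)  // growing the partition count requires shuffle = true

  println(narrowed.partitions.length)    // 100
  println(collapsed.partitions.length)   // 1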
  • collect
    Returns an array containing all the elements of the RDD. When the RDD has a very large number of elements, the driver node may run out of memory.
  /**
   * Return an array that contains all of the elements in this RDD.
   */
  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

  /**
   * Return an RDD that contains all matching values by applying `f`.
   */
  def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    filter(cleanF.isDefinedAt).map(cleanF)
  }
  • count
    Returns the number of elements in the RDD.
  /**
   * Return the number of elements in the RDD.
   */
  def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
  • countByValue
    Returns a local map from each distinct element of the RDD to the number of times it occurs.
  /**
   * Return the count of each unique value in this RDD as a local map of (value, count) pairs.
   *
   * Note that this method should only be used if the resulting map is expected to be small, as
   * the whole thing is loaded into the driver's memory.
   * To handle very large results, consider using rdd.map(x => (x, 1L)).reduceByKey(_ + _), which
   * returns an RDD[T, Long] instead of a map.
   */
  def countByValue()(implicit ord: Ordering[T] = null): Map[T, Long] = withScope {
    map(value => (value, null)).countByKey()
  }
  • dependencies
    Returns the RDD's list of dependencies, taking checkpointing into account: once the RDD has been checkpointed, the lineage starts from the checkpoint.
  /**
   * Get the list of dependencies of this RDD, taking into account whether the
   * RDD is checkpointed or not.
   */
  final def dependencies: Seq[Dependency[_]] = {
    checkpointRDD.map(r => List(new OneToOneDependency(r))).getOrElse {
      if (dependencies_ == null) {
        dependencies_ = getDependencies
      }
      dependencies_
    }
  }
  • distinct
    Returns a new RDD containing the distinct elements of this RDD (duplicates removed).
  /**
   * Return a new RDD containing the distinct elements in this RDD.
   */
  def distinct(): RDD[T] = withScope {
    distinct(partitions.length)
  }

  def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
  }
  • filter
    Filters the RDD, returning a new RDD containing only the elements that satisfy the predicate f.
  /**
   * Return a new RDD containing only the elements that satisfy a predicate.
   */
  def filter(f: T => Boolean): RDD[T] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[T, T](
      this,
      (context, pid, iter) => iter.filter(cleanF),
      preservesPartitioning = true)
  }
  • filterWith
    Filters the RDD like filter, but the predicate also receives a per-partition argument produced by applying constructA to the partition index, so elements in different partitions can be filtered differently.
    Deprecated; see mapPartitionsWithIndex.
  /**
   * Filters this RDD with p, where p takes an additional parameter of type A. This
   * additional parameter is produced by constructA, which is called in each
   * partition with the index of that partition.
   */
  @deprecated("use mapPartitionsWithIndex and filter", "1.0.0")
  def filterWith[A](constructA: Int => A)(p: (T, A) => Boolean): RDD[T] = withScope {
    val cleanP = sc.clean(p)
    val cleanA = sc.clean(constructA)
    mapPartitionsWithIndex((index, iter) => {
      val a = cleanA(index)
      iter.filter(t => cleanP(t, a))
    }, preservesPartitioning = true)
  }
  • first
    Returns the first element of the RDD.
    Throws an exception if the RDD is empty.
  /**
   * Return the first element in this RDD.
   */
  def first(): T = withScope {
    take(1) match {
      case Array(t) => t
      case _ => throw new UnsupportedOperationException("empty collection")
    }
  }
  • firstParent
    Returns the first parent RDD in this RDD's dependency list.
  /** Returns the first parent RDD */
  protected[spark] def firstParent[U: ClassTag]: RDD[U] = {
    dependencies.head.rdd.asInstanceOf[RDD[U]]
  }
  • flatMap
    Produces a collection for each element of the RDD and returns an RDD made up of all the elements of all those collections (map, then flatten).
  • flatMapWith
    Like flatMap, but the collection-producing function also receives a per-partition argument produced by applying constructA to the partition index, so elements in different partitions can be handled differently.
    Deprecated; see mapPartitionsWithIndex.
  /**
   * Return a new RDD by first applying a function to all elements of this
   * RDD, and then flattening the results.
   */
  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
  }

  /**
   * FlatMaps f over this RDD, where f takes an additional parameter of type A. This
   * additional parameter is produced by constructA, which is called in each
   * partition with the index of that partition.
   */
  @deprecated("use mapPartitionsWithIndex and flatMap", "1.0.0")
  def flatMapWith[A, U: ClassTag]
      (constructA: Int => A, preservesPartitioning: Boolean = false)
      (f: (T, A) => Seq[U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    val cleanA = sc.clean(constructA)
    mapPartitionsWithIndex((index, iter) => {
      val a = cleanA(index)
      iter.flatMap(t => cleanF(t, a))
    }, preservesPartitioning)
  }
  • fold
    Similar in purpose to aggregate: aggregates the elements of the RDD.
    fold uses a single function op both to aggregate the elements within each partition and to merge the per-partition results (all values involved are of type T), and returns the aggregated result. A short usage sketch follows the excerpt below.
  /**
   * Aggregate the elements of each partition, and then the results for all the partitions, using a
   * given associative and commutative function and a neutral "zero value". The function
   * op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object
   * allocation; however, it should not modify t2.
   *
   * This behaves somewhat differently from fold operations implemented for non-distributed
   * collections in functional languages like Scala. This fold operation may be applied to
   * partitions individually, and then fold those results into the final result, rather than
   * apply the fold to each element sequentially in some defined ordering. For functions
   * that are not commutative, the result may differ from that of a fold applied to a
   * non-distributed collection.
   */
  def fold(zeroValue: T)(op: (T, T) => T): T = withScope {
    // Clone the zero value since we will also be serializing it as part of tasks
    var jobResult = Utils.clone(zeroValue, sc.env.closureSerializer.newInstance())
    val cleanOp = sc.clean(op)
    val foldPartition = (iter: Iterator[T]) => iter.fold(zeroValue)(cleanOp)
    val mergeResult = (index: Int, taskResult: T) => jobResult = op(jobResult, taskResult)
    sc.runJob(this, foldPartition, mergeResult)
    jobResult
  }
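    A minimal usage sketch (my own example, not from the Spark source), assuming `sc`. Note that the zero value is folded in once per partition and once more when the partition results are merged, so it must be a true identity element for op:
  val nums = sc.parallelize(Seq(1, 2, 3, 4, 5), 2)
  val total = nums.fold(0)(_ + _)   // 15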
  • foreach
    Applies the function f to every element of the RDD and returns Unit.
    It is run for its side effects, such as updating external state or printing data.
    Note: data printed on a worker node goes to that worker's own terminal and is not visible on the master/driver.
  • foreachPartition
    Applies f to the element iterator of each partition of the RDD and returns Unit.
    It is run for its side effects, such as updating external state or printing data.
    Note: data printed on a worker node goes to that worker's own terminal and is not visible on the master/driver.
  • foreachWith
    Applies f to every element of the RDD and returns Unit. f also receives a per-partition argument produced by applying constructA to the partition index, so elements in different partitions can be treated differently.
    Deprecated; see mapPartitionsWithIndex.
  /**
   * Applies a function f to all elements of this RDD.
   */
  def foreach(f: T => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
  }

  /**
   * Applies a function f to each partition of this RDD.
   */
  def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
  }

  /**
   * Applies f to each element of this RDD, where f takes an additional parameter of type A.
   * This additional parameter is produced by constructA, which is called in each
   * partition with the index of that partition.
   */
  @deprecated("use mapPartitionsWithIndex and foreach", "1.0.0")
  def foreachWith[A](constructA: Int => A)(f: (T, A) => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    val cleanA = sc.clean(constructA)
    mapPartitionsWithIndex { (index, iter) =>
      val a = cleanA(index)
      iter.map(t => {cleanF(t, a); t})
    }
  }
  • getCheckpointFile
    Returns the path of the directory where this RDD's checkpoint data is stored (None if the RDD is not reliably checkpointed).
  /**
   * Gets the name of the directory to which this RDD was checkpointed.
   * This is not defined if the RDD is checkpointed locally.
   */
  def getCheckpointFile: Option[String] = {
    checkpointData match {
      case Some(reliable: ReliableRDDCheckpointData[T]) => reliable.getCheckpointDir
      case _ => None
    }
  }
  • getDependencies
    Returns the sequence of dependencies describing how this RDD depends on its parent RDDs (the dependency chain); implemented by subclasses.
  /**
   * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   */
  protected def getDependencies: Seq[Dependency[_]] = deps
  • getPartitions
    Returns the array of partition objects of this RDD; implemented by subclasses.
  /**
   * Implemented by subclasses to return the set of partitions in this RDD. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   */
  protected def getPartitions: Array[Partition]
  • getPreferredLocations
    Returns the preferred locations of the partition `split` in the cluster (its data blocks are typically replicated on several nodes for fault tolerance).
  /**
   * Optionally overridden by subclasses to specify placement preferences.
   */
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
  • getStorageLevel
    Returns the RDD's current storage level (StorageLevel.NONE if none has been set).
  /** Get the RDD's current storage level, or StorageLevel.NONE if none is set. */
  def getStorageLevel: StorageLevel = storageLevel
  • glom
    Gathers the elements of each partition into an array; returns an RDD in which every partition contains exactly one element, namely that array.
  /**
   * Return an RDD created by coalescing all elements within each partition into an array.
   */
  def glom(): RDD[Array[T]] = withScope {
    new MapPartitionsRDD[Array[T], T](this, (context, pid, iter) => Iterator(iter.toArray))
  }
  • groupBy
    Groups the elements of the RDD by the key computed by f; elements that map to the same key are collected into one group. Returns an RDD of (key, collection of elements) pairs.
    For example, grouping key-value pairs by key:
    1 -> a
    1 -> b
    2 -> c
    2 -> b
    3 -> 2
    is transformed into:
    1 -> (a, b)
    2 -> (c, b)
    3 -> (2)
    Because every value has to be buffered, the whole operation is expensive. For cumulative computations such as sum or average, only a running accumulator is needed rather than all the values, so aggregateByKey or reduceByKey should be preferred. A short usage sketch follows the excerpts below.
  /**
   * Return an RDD of grouped items. Each group consists of a key and a sequence of elements
   * mapping to that key. The ordering of elements within each group is not guaranteed, and
   * may even differ each time the resulting RDD is evaluated.
   *
   * Note: This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
   * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
   */
  def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
    groupBy[K](f, defaultPartitioner(this))
  }

  def groupBy[K](
      f: T => K,
      numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
    groupBy(f, new HashPartitioner(numPartitions))
  }

  def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
      : RDD[(K, Iterable[T])] = withScope {
    val cleanF = sc.clean(f)
    this.map(t => (cleanF(t), t)).groupByKey(p)
  }
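    A minimal usage sketch (my own example, not from the Spark source), assuming `sc`; it groups words by their first letter:
  val words = sc.parallelize(Seq("apple", "banana", "avocado", "blueberry", "cherry"))

  val byInitial = words.groupBy(_.head)   // RDD[(Char, Iterable[String])]
  // The order of groups and of elements within a group is not guaranteed.
  byInitial.collect().foreach { case (k, vs) => println(s"$k -> ${vs.mkString(", ")}") }
  // a -> apple, avocado
  // b -> banana, blueberry
  // c -> cherry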
  • intersection
    Returns an RDD containing the elements common to both RDDs; the result contains no duplicates.
  /**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did.
   *
   * Note that this method performs a shuffle internally.
   */
  def intersection(other: RDD[T]): RDD[T] = withScope {
    this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
        .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
        .keys
  }

  def intersection(
      other: RDD[T],
      partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    this.map(v => (v, null)).cogroup(other.map(v => (v, null)), partitioner)
        .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
        .keys
  }

  def intersection(other: RDD[T], numPartitions: Int): RDD[T] = withScope {
    intersection(other, new HashPartitioner(numPartitions))
  }
  • isCheckpointed
    Returns whether this RDD has been checkpointed (either reliably or locally).
  /**
   * Return whether this RDD is marked for checkpointing, either reliably or locally.
   */
  def isCheckpointed: Boolean = checkpointData.exists(_.isCheckpointed)
  • isEmpty
    Returns whether the RDD contains no elements.
  /**
   * @note due to complications in the internal implementation, this method will raise an
   * exception if called on an RDD of `Nothing` or `Null`. This may be come up in practice
   * because, for example, the type of `parallelize(Seq())` is `RDD[Nothing]`.
   * (`parallelize(Seq())` should be avoided anyway in favor of `parallelize(Seq[T]())`.)
   * @return true if and only if the RDD contains no elements at all. Note that an RDD
   * may be empty even when it has at least 1 partition.
   */
  def isEmpty(): Boolean = withScope {
    partitions.length == 0 || take(1).length == 0
  }
  • keyBy
    Uses f to compute a key for each element of the RDD and returns an RDD of (key, element) pairs.
  /**
   * Creates tuples of the elements in this RDD by applying `f`.
   */
  def keyBy[K](f: T => K): RDD[(K, T)] = withScope {
    val cleanedF = sc.clean(f)
    map(x => (cleanedF(x), x))
  }
  • map
    Returns a new RDD whose elements are the results of applying f to each element of this RDD.
  /**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }
  • mapPartitions
    Applies f to the element iterator of each partition of the RDD to produce a new iterator, and returns an RDD whose partitions are built from those new iterators. A short usage sketch follows the excerpt below.
    preservesPartitioning indicates whether the returned RDD keeps the partitioner. It may be set to true only when the RDD is a key-value RDD and f does not modify the keys. A non key-value RDD generally has no partitioner, and once the keys of a key-value RDD are modified the elements no longer satisfy the partitioner's placement; in those cases it must remain false, meaning the returned RDD is not considered partitioned by any partitioner.
  /**
   * Return a new RDD by applying a function to each partition of this RDD.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
   * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
   */
  def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
      preservesPartitioning)
  }
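    A minimal usage sketch (my own example, not from the Spark source), assuming `sc`. The typical use case is amortizing expensive per-partition setup (a connection, a parser, ...) over all the elements of the partition:
  val lines = sc.parallelize(Seq("1", "2", "3", "4"), 2)

  val parsed = lines.mapPartitions { iter =>
    // Imagine expensive setup here, executed once per partition rather than once per element.
    val parse: String => Int = _.toInt
    iter.map(parse)
  }
  parsed.collect()   // Array(1, 2, 3, 4)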
  • mapPartitionsWithIndex
    One of the most important operators in Spark.
    Applies f to the element iterator of each partition of the RDD to produce a new iterator, and returns an RDD whose partitions are built from those new iterators.
    f also receives the partition index as an argument, so it can treat different partitions differently. A short usage sketch follows the excerpt below.
    preservesPartitioning has the same meaning as in mapPartitions: set it to true only for a key-value RDD whose keys f does not modify.
  /**
   * Return a new RDD by applying a function to each partition of this RDD, while tracking the index
   * of the original partition.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
   * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
   */
  def mapPartitionsWithIndex[U: ClassTag](
      f: (Int, Iterator[T]) => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter),
      preservesPartitioning)
  }
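    A minimal usage sketch (my own example, not from the Spark source), assuming `sc`; it tags every element with the partition it lives in:
  val data = sc.parallelize(1 to 8, 4)

  val tagged = data.mapPartitionsWithIndex { (pid, iter) =>
    iter.map(x => (pid, x))
  }
  tagged.collect()   // Array((0,1), (0,2), (1,3), (1,4), (2,5), (2,6), (3,7), (3,8))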
  • max
    Returns the largest element of the RDD.
  /**
   * Returns the max of this RDD as defined by the implicit Ordering[T].
   * @return the maximum element of the RDD
   */
  def max()(implicit ord: Ordering[T]): T = withScope {
    this.reduce(ord.max)
  }
  • min
    Returns the smallest element of the RDD.
  /**
   * Returns the min of this RDD as defined by the implicit Ordering[T].
   * @return the minimum element of the RDD
   */
  def min()(implicit ord: Ordering[T]): T = withScope {
    this.reduce(ord.min)
  }
  • parent
    Returns the j-th parent RDD in this RDD's dependency list.
  /** Returns the jth parent RDD: e.g. rdd.parent[T](0) is equivalent to rdd.firstParent[T] */
  protected[spark] def parent[U: ClassTag](j: Int) = {
    dependencies(j).rdd.asInstanceOf[RDD[U]]
  }
  • partitions
    Returns the array of partition objects of this RDD, taking checkpointing into account.
  /**
   * Get the array of partitions of this RDD, taking into account whether the
   * RDD is checkpointed or not.
   */
  final def partitions: Array[Partition] = {
    checkpointRDD.map(_.partitions).getOrElse {
      if (partitions_ == null) {
        partitions_ = getPartitions
      }
      partitions_
    }
  }
  • persist
    Persists the RDD at the given storage level (the no-argument version defaults to MEMORY_ONLY).
  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

  /**
   * Set this RDD's storage level to persist its values across operations after the first time
   * it is computed. This can only be used to assign a new storage level if the RDD does not
   * have a storage level set yet. Local checkpointing is an exception.
   */
  def persist(newLevel: StorageLevel): this.type = {
    if (isLocallyCheckpointed) {
      // This means the user previously called localCheckpoint(), which should have already
      // marked this RDD for persisting. Here we should override the old storage level with
      // one that is explicitly requested by the user (after adapting it to use disk).
      persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
    } else {
      persist(newLevel, allowOverride = false)
    }
  }
  • preferredLocations
    Returns the preferred locations of the partition `split` in the cluster, taking checkpointing into account.
  /**
   * Get the preferred locations of a partition, taking into account whether the
   * RDD is checkpointed.
   */
  final def preferredLocations(split: Partition): Seq[String] = {
    checkpointRDD.map(_.getPreferredLocations(split)).getOrElse {
      getPreferredLocations(split)
    }
  }
  • reduce
    Reduces the elements of the RDD with the binary function f (which must be commutative and associative) and returns the result; throws an exception if the RDD is empty.
  /**
   * Reduces the elements of this RDD using the specified commutative and
   * associative binary operator.
   */
  def reduce(f: (T, T) => T): T = withScope {
    val cleanF = sc.clean(f)
    val reducePartition: Iterator[T] => Option[T] = iter => {
      if (iter.hasNext) {
        Some(iter.reduceLeft(cleanF))
      } else {
        None
      }
    }
    var jobResult: Option[T] = None
    val mergeResult = (index: Int, taskResult: Option[T]) => {
      if (taskResult.isDefined) {
        jobResult = jobResult match {
          case Some(value) => Some(f(value, taskResult.get))
          case None => taskResult
        }
      }
    }
    sc.runJob(this, reducePartition, mergeResult)
    // Get the final result out of our Option, or throw an exception if the RDD was empty
    jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
  }
  • repartition
    Repartitions the RDD into exactly numPartitions partitions; this always involves a shuffle.
    When the target partition count is smaller than the current one, consider coalesce instead, which can avoid the shuffle.
  /**
   * Return a new RDD that has exactly numPartitions partitions.
   *
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   *
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
   */
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }
  • sample
    Returns an RDD made up of sampled elements. Sampling can be done with or without replacement: with replacement, fraction >= 0 and gives the expected number of times each element is sampled; without replacement, fraction lies in [0, 1] and gives the probability that each element is sampled. A short usage sketch follows the excerpt below.
  /**
   * Return a sampled subset of this RDD.
   *
   * @param withReplacement can elements be sampled multiple times (replaced when sampled out)
   * @param fraction expected size of the sample as a fraction of this RDD's size
   *   without replacement: probability that each element is chosen; fraction must be [0, 1]
   *   with replacement: expected number of times each element is chosen; fraction must be >= 0
   * @param seed seed for the random number generator
   */
  def sample(
      withReplacement: Boolean,
      fraction: Double,
      seed: Long = Utils.random.nextLong): RDD[T] = withScope {
    require(fraction >= 0.0, "Negative fraction value: " + fraction)
    if (withReplacement) {
      new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
    } else {
      new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
    }
  }
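    A minimal usage sketch (my own example, not from the Spark source), assuming `sc`. The sample size is only approximate:
  val data = sc.parallelize(1 to 10000)

  val without = data.sample(withReplacement = false, fraction = 0.01)   // each element kept with probability 0.01
  val withRep = data.sample(withReplacement = true, fraction = 2.0)     // each element drawn about twice on average

  println(without.count())   // roughly 100
  println(withRep.count())   // roughly 20000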
  • saveAsObjectFile
    Saves the RDD at path in serialized binary form (a SequenceFile of serialized objects).
  /**
   * Save this RDD as a SequenceFile of serialized objects.
   */
  def saveAsObjectFile(path: String): Unit = withScope {
    this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
      .map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))
      .saveAsSequenceFile(path)
  }
  • saveAsTextFile
    Saves the RDD at path as text, using the string representation of each element.
  /**
   * Save this RDD as a text file, using string representations of elements.
   */
  def saveAsTextFile(path: String): Unit = withScope {
    // https://issues.apache.org/jira/browse/SPARK-2075
    //
    // NullWritable is a `Comparable` in Hadoop 1.+, so the compiler cannot find an implicit
    // Ordering for it and will use the default `null`. However, it's a `Comparable[NullWritable]`
    // in Hadoop 2.+, so the compiler will call the implicit `Ordering.ordered` method to create an
    // Ordering for `NullWritable`. That's why the compiler will generate different anonymous
    // classes for `saveAsTextFile` in Hadoop 1.+ and Hadoop 2.+.
    //
    // Therefore, here we provide an explicit Ordering `null` to make sure the compiler generate
    // same bytecodes for `saveAsTextFile`.
    val nullWritableClassTag = implicitly[ClassTag[NullWritable]]
    val textClassTag = implicitly[ClassTag[Text]]
    val r = this.mapPartitions { iter =>
      val text = new Text()
      iter.map { x =>
        text.set(x.toString)
        (NullWritable.get(), text)
      }
    }
    RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
      .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
  }

  def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit
  • sortBy
    Returns the RDD sorted by the value that f computes for each element.
    By default the sort is ascending and the number of partitions is unchanged. A short usage sketch follows the excerpt below.
  /**
   * Return this RDD sorted by the given key function.
   */
  def sortBy[K](
      f: (T) => K,
      ascending: Boolean = true,
      numPartitions: Int = this.partitions.length)
      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
    this.keyBy[K](f)
        .sortByKey(ascending, numPartitions)
        .values
  }
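    A minimal usage sketch (my own example, not from the Spark source), assuming `sc`:
  val people = sc.parallelize(Seq(("alice", 31), ("bob", 25), ("carol", 40)))

  people.sortBy(_._2).collect()                      // ascending by age: bob, alice, carol
  people.sortBy(_._2, ascending = false).collect()   // descending by age: carol, alice, bob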
  • subtract
    Returns an RDD of the elements that exist in this RDD but not in `other`.
  /**
   * Return an RDD with the elements from `this` that are not in `other`.
   *
   * Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
   * RDD will be <= us.
   */
  def subtract(other: RDD[T]): RDD[T] = withScope {
    subtract(other, partitioner.getOrElse(new HashPartitioner(partitions.length)))
  }

  /**
   * Return an RDD with the elements from `this` that are not in `other`.
   */
  def subtract(other: RDD[T], numPartitions: Int): RDD[T] = withScope {
    subtract(other, new HashPartitioner(numPartitions))
  }

  /**
   * Return an RDD with the elements from `this` that are not in `other`.
   */
  def subtract(
      other: RDD[T],
      p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
  • take
    Returns an array of the first num elements of the RDD; no sorting is applied.
  /**
   * Take the first num elements of the RDD. It works by first scanning one partition, and use the
   * results from that partition to estimate the number of additional partitions needed to satisfy
   * the limit.
   *
   * @note due to complications in the internal implementation, this method will raise
   * an exception if called on an RDD of `Nothing` or `Null`.
   */
  def take(num: Int): Array[T]
  • takeOrdered
    Returns an array of the num smallest elements of the RDD, i.e. the first num elements after sorting in ascending order.
  /**
   * Returns the first k (smallest) elements from this RDD as defined by the specified
   * implicit Ordering[T] and maintains the ordering. This does the opposite of [[top]].
   * For example:
   * {{{
   *   sc.parallelize(Seq(10, 4, 2, 12, 3)).takeOrdered(1)
   *   // returns Array(2)
   *
   *   sc.parallelize(Seq(2, 3, 4, 5, 6)).takeOrdered(2)
   *   // returns Array(2, 3)
   * }}}
   *
   * @param num k, the number of elements to return
   * @param ord the implicit ordering for T
   * @return an array of top elements
   */
  def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
    if (num == 0) {
      Array.empty
    } else {
      val mapRDDs = mapPartitions { items =>
        // Priority keeps the largest elements, so let's reverse the ordering.
        val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
        queue ++= util.collection.Utils.takeOrdered(items, num)(ord)
        Iterator.single(queue)
      }
      if (mapRDDs.partitions.length == 0) {
        Array.empty
      } else {
        mapRDDs.reduce { (queue1, queue2) =>
          queue1 ++= queue2
          queue1
        }.toArray.sorted(ord)
      }
    }
  }
  • takeSample
    Returns an array of num elements sampled from the RDD, with or without replacement. Unlike sample, the result has a fixed size and is collected to the driver as an array.
  /**
   * Return a fixed-size sampled subset of this RDD in an array
   *
   * @param withReplacement whether sampling is done with replacement
   * @param num size of the returned sample
   * @param seed seed for the random number generator
   * @return sample of specified size in an array
   */
  // TODO: rewrite this without return statements so we can wrap it in a scope
  def takeSample(
      withReplacement: Boolean,
      num: Int,
      seed: Long = Utils.random.nextLong): Array[T]
  • toLocalIterator
    Returns an iterator over all the elements of the RDD. It consumes at most as much memory as the largest partition, but it runs one Spark job per partition, so the RDD should usually be cached first.
  /**
   * Return an iterator that contains all of the elements in this RDD.
   *
   * The iterator will consume as much memory as the largest partition in this RDD.
   *
   * Note: this results in multiple Spark jobs, and if the input RDD is the result
   * of a wide transformation (e.g. join with different partitioners), to avoid
   * recomputing the input RDD should be cached first.
   */
  def toLocalIterator: Iterator[T] = withScope {
    def collectPartition(p: Int): Array[T] = {
      sc.runJob(this, (iter: Iterator[T]) => iter.toArray, Seq(p)).head
    }
    (0 until partitions.length).iterator.flatMap(i => collectPartition(i))
  }
  • top
    Returns an array of the top num elements of the RDD, i.e. the first num elements after sorting in descending order.
  /**
   * Returns the top k (largest) elements from this RDD as defined by the specified
   * implicit Ordering[T]. This does the opposite of [[takeOrdered]]. For example:
   * {{{
   *   sc.parallelize(Seq(10, 4, 2, 12, 3)).top(1)
   *   // returns Array(12)
   *
   *   sc.parallelize(Seq(2, 3, 4, 5, 6)).top(2)
   *   // returns Array(6, 5)
   * }}}
   *
   * @param num k, the number of top elements to return
   * @param ord the implicit ordering for T
   * @return an array of top elements
   */
  def top(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
    takeOrdered(num)(ord.reverse)
  }
  • treeAggregate
    Like aggregate, but merges the per-partition results in a multi-level tree pattern (with the suggested depth, default 2) instead of sending them all directly to the driver.
  /**
   * Aggregates the elements of this RDD in a multi-level tree pattern.
   *
   * @param depth suggested depth of the tree (default: 2)
   * @see [[org.apache.spark.rdd.RDD#aggregate]]
   */
  def treeAggregate[U: ClassTag](zeroValue: U)(
      seqOp: (U, T) => U,
      combOp: (U, U) => U,
      depth: Int = 2): U
  • treeReduce
    Like reduce, but merges the per-partition results in a multi-level tree pattern (with the suggested depth, default 2).
  /**
   * Reduces the elements of this RDD in a multi-level tree pattern.
   *
   * @param depth suggested depth of the tree (default: 2)
   * @see [[org.apache.spark.rdd.RDD#reduce]]
   */
  def treeReduce(f: (T, T) => T, depth: Int = 2): T
  • union
    Returns an RDD containing the union of the elements of the two RDDs; duplicates are not removed.
  /**
   * Return the union of this RDD and another one. Any identical elements will appear multiple
   * times (use `.distinct()` to eliminate them).
   */
  def union(other: RDD[T]): RDD[T] = withScope {
    if (partitioner.isDefined && other.partitioner == partitioner) {
      new PartitionerAwareUnionRDD(sc, Array(this, other))
    } else {
      new UnionRDD(sc, Array(this, other))
    }
  }
  • unpersist
    Marks the RDD as non-persistent and removes all of its blocks from memory and disk.
  /**
   * Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
   *
   * @param blocking Whether to block until all blocks are deleted.
   * @return This RDD.
   */
  def unpersist(blocking: Boolean = true): this.type = {
    logInfo("Removing RDD " + id + " from persistence list")
    sc.unpersistRDD(id, blocking)
    storageLevel = StorageLevel.NONE
    this
  }
  • zip
    Pairs up the elements of this RDD with the elements of other one by one and returns an RDD of the resulting pairs.
    The two RDDs must have the same number of partitions and the same number of elements in every corresponding partition.
    The name is apt: the elements of the two RDDs interlock like the teeth of a zipper. A short usage sketch follows the excerpt below.
  /**
   * Zips this RDD with another one, returning key-value pairs with the first element in each RDD,
   * second element in each RDD, etc. Assumes that the two RDDs have the *same number of
   * partitions* and the *same number of elements in each partition* (e.g. one was made through
   * a map on the other).
   */
  def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
    zipPartitions(other, preservesPartitioning = false) { (thisIter, otherIter) =>
      new Iterator[(T, U)] {
        def hasNext: Boolean = (thisIter.hasNext, otherIter.hasNext) match {
          case (true, true) => true
          case (false, false) => false
          case _ => throw new SparkException("Can only zip RDDs with " +
            "same number of elements in each partition")
        }
        def next(): (T, U) = (thisIter.next(), otherIter.next())
      }
    }
  }
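    A minimal usage sketch (my own example, not from the Spark source), assuming `sc`. Both RDDs have 2 partitions with matching element counts; otherwise the job fails at runtime:
  val a = sc.parallelize(Seq(1, 2, 3), 2)
  val b = sc.parallelize(Seq("a", "b", "c"), 2)

  a.zip(b).collect()   // Array((1,a), (2,b), (3,c))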
  • zipPartitions
    Pairs up the partition iterators of two or more RDDs and applies a function to them to produce a new iterator; returns an RDD whose partitions are built from those iterators.
    The RDDs must have the same number of partitions, but the corresponding partitions do not need to contain the same number of elements.
  /**
   * Zip this RDD's partitions with one (or more) RDD(s) and return a new RDD by
   * applying a function to the zipped partitions. Assumes that all the RDDs have the
   * *same number of partitions*, but does *not* require them to have the same number
   * of elements in each partition.
   */
  def zipPartitions[B: ClassTag, V: ClassTag]
      (rdd2: RDD[B], preservesPartitioning: Boolean)
      (f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V] = withScope {
    new ZippedPartitionsRDD2(sc, sc.clean(f), this, rdd2, preservesPartitioning)
  }

  def zipPartitions[B: ClassTag, V: ClassTag]
      (rdd2: RDD[B])
      (f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V] = withScope {
    zipPartitions(rdd2, preservesPartitioning = false)(f)
  }

  def zipPartitions[B: ClassTag, C: ClassTag, V: ClassTag]
      (rdd2: RDD[B], rdd3: RDD[C], preservesPartitioning: Boolean)
      (f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V]): RDD[V] = withScope {
    new ZippedPartitionsRDD3(sc, sc.clean(f), this, rdd2, rdd3, preservesPartitioning)
  }

  def zipPartitions[B: ClassTag, C: ClassTag, V: ClassTag]
      (rdd2: RDD[B], rdd3: RDD[C])
      (f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V]): RDD[V] = withScope {
    zipPartitions(rdd2, rdd3, preservesPartitioning = false)(f)
  }
  • zipWithIndex
    Pairs each element of the RDD with its index (a 0-based sequence of natural numbers) and returns an RDD of the resulting (element, index) pairs.
    Because the starting index of every partition must be determined first, Spark counts the number of elements in each partition and accumulates the counts by partition id; this triggers an extra job when the RDD has more than one partition.
  /**
   * Zips this RDD with its element indices. The ordering is first based on the partition index
   * and then the ordering of items within each partition. So the first item in the first
   * partition gets index 0, and the last item in the last partition receives the largest index.
   *
   * This is similar to Scala's zipWithIndex but it uses Long instead of Int as the index type.
   * This method needs to trigger a spark job when this RDD contains more than one partitions.
   *
   * Note that some RDDs, such as those returned by groupBy(), do not guarantee order of
   * elements in a partition. The index assigned to each element is therefore not guaranteed,
   * and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee
   * the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
   */
  def zipWithIndex(): RDD[(T, Long)] = withScope {
    new ZippedWithIndexRDD(this)
  }
  • zipWithUniqueId
    Pairs each element of the RDD with a unique Long id and returns an RDD of the resulting (element, id) pairs.
    In the k-th partition (partition ids start at 0) the ids are assigned as follows:
    k, k+n, k+2n, k+3n, k+4n, k+5n, ... (where n is the number of partitions)
    Because no per-partition starting offsets need to be computed, this operator does not trigger an extra job. A short usage sketch follows the excerpt below.
  /**
   * Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k,
   * 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method
   * won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
   *
   * Note that some RDDs, such as those returned by groupBy(), do not guarantee order of
   * elements in a partition. The unique ID assigned to each element is therefore not guaranteed,
   * and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee
   * the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
   */
  def zipWithUniqueId(): RDD[(T, Long)] = withScope {
    val n = this.partitions.length.toLong
    this.mapPartitionsWithIndex { case (k, iter) =>
      iter.zipWithIndex.map { case (item, i) =>
        (item, i * n + k)
      }
    }
  }
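    A minimal usage sketch (my own example, not from the Spark source), assuming `sc`, with 2 partitions (n = 2); it contrasts zipWithIndex and zipWithUniqueId:
  val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)   // partition 0: a, b   partition 1: c, d

  rdd.zipWithIndex().collect()
  // Array((a,0), (b,1), (c,2), (d,3))   -- contiguous indices, triggers an extra job

  rdd.zipWithUniqueId().collect()
  // Array((a,0), (b,2), (c,1), (d,3))   -- ids follow k, k+n, k+2n, ...; no extra job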

About the Author

Tang Lizhe (唐黎哲) is a graduate student at the National Key Laboratory for Parallel and Distributed Processing (PDL), National University of Defense Technology. He has been working with Spark since enrolling in 2014, plans to contribute code to Spark-related open-source communities during the remainder of his graduate studies, and intends to continue research in this area after graduation.
Email: [email protected]
