reduceByKey and groupByKey

groupByKey

  • Open org.apache.spark.rdd.PairRDDFunctions.scala
  /**
   * Group the values for each key in the RDD into a single sequence. Hash-partitions the
   * resulting RDD with the existing partitioner/parallelism level. The ordering of elements
   * within each group is not guaranteed, and may even differ each time the resulting RDD is
   * evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   */
  def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(defaultPartitioner(self))
  }

Group the values for each key in the RDD into a single sequence, and hash-partition the resulting RDD using the existing partitioner / parallelism level. The ordering of elements within each group is not guaranteed and may even differ each time the resulting RDD is evaluated.

Note: this operation can be very expensive. If you are grouping only in order to perform an aggregation over each key (such as a sum or average), use PairRDDFunctions.aggregateByKey or PairRDDFunctions.reduceByKey instead for much better performance.
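
For example, a per-key sum can be written either way; both produce the same result, but the reduceByKey version shuffles far less data. A minimal sketch, assuming pairs is an RDD[(String, Int)]:

  // groupByKey shuffles every individual value and only sums on the reducer side:
  val sumsViaGroup = pairs.groupByKey().mapValues(_.sum)

  // reduceByKey pre-aggregates within each map task before the shuffle,
  // so only one partial sum per key per partition crosses the network:
  val sumsViaReduce = pairs.reduceByKey(_ + _)

The no-argument groupByKey() above delegates to the overload that takes a Partitioner: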

  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }

The call passes mapSideCombine = false, meaning no map-side (mapper-side) aggregation is performed: as the source comment explains, combining here would not reduce the amount of data shuffled, and inserting all map-side data into a hash table would only create more objects in the old generation.
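
A minimal sketch of what those three functions do, using scala.collection.mutable.ArrayBuffer in place of Spark's internal CompactBuffer (pairs is again a hypothetical RDD[(String, Int)]):

  import scala.collection.mutable.ArrayBuffer

  // createCombiner: the first value seen for a key starts a new buffer
  val createCombiner = (v: Int) => ArrayBuffer(v)
  // mergeValue: later values of the same key within a partition are appended
  val mergeValue = (buf: ArrayBuffer[Int], v: Int) => buf += v
  // mergeCombiners: buffers for the same key from different partitions are concatenated
  val mergeCombiners = (b1: ArrayBuffer[Int], b2: ArrayBuffer[Int]) => b1 ++= b2

  // Roughly equivalent grouping via the public combineByKey API. Note that
  // combineByKey defaults to mapSideCombine = true, while groupByKey above
  // disables it explicitly.
  val grouped = pairs.combineByKey(createCombiner, mergeValue, mergeCombiners)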

reduceByKey

  • Open org.apache.spark.rdd.PairRDDFunctions.scala
  /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
   * parallelism level.
   */
  def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    reduceByKey(defaultPartitioner(self), func)
  }

Merge the values for each key using an associative and commutative reduce function. The merging is first performed locally on each mapper, similar to a "combiner" in MapReduce, and the merged results are then sent to the reducers.
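
A typical use is word count: the per-word counts are partially summed inside each map task before the shuffle. A sketch, assuming sc is an existing SparkContext and the input path is a placeholder:

  val counts = sc.textFile("input.txt")    // placeholder path
    .flatMap(line => line.split("\\s+"))   // split lines into words
    .map(word => (word, 1))                // one count per occurrence
    .reduceByKey(_ + _)                    // partial sums on mappers, final sums on reducers

The single-argument version above delegates to the overload that takes a Partitioner: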

  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
  }

No mapSideCombine argument is passed here, so it keeps its default value of true, meaning map-side aggregation is performed.
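
Spelled out through the public combineByKey API, the same wiring looks like this: createCombiner is the identity function and the reduce function serves as both mergeValue and mergeCombiners, with map-side combine left at its default of true (the data and sc are hypothetical):

  // Hypothetical data; sc is an existing SparkContext.
  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

  val viaCombine = pairs.combineByKey(
    (v: Int) => v,                  // createCombiner: the first value for a key is used as-is
    (c: Int, v: Int) => c + v,      // mergeValue: fold values into the per-partition partial result
    (c1: Int, c2: Int) => c1 + c2)  // mergeCombiners: merge partial results across partitions

  val viaReduce = pairs.reduceByKey(_ + _)  // produces the same result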
