在spark中,我们知道一切的操作都是基于RDD的。在使用中,RDD有一种非常特殊也是非常实用的format——pair RDD,即RDD的每一行是(key, value)的格式。这种格式很像Python的字典类型,便于针对key进行一些处理。
针对pair RDD这样的特殊形式,spark中定义了许多方便的操作,今天主要介绍一下reduceByKey和groupByKey,因为在接下来讲解《在spark中如何实现SQL中的group_concat功能?》时会用到这两个operations。
首先,看一看spark官网[1]是怎么解释的:
reduceByKey(func, numPartitions=None)
Merge the values for each key using an associative reduce function. This will also perform the merginglocally on each mapper before sending results to a reducer, similarly to a “combiner” in MapReduce. Output will be hash-partitioned with numPartitions partitions, or the default parallelism level if numPartitions is not specified.
也就是,reduceByKey用于对每个key对应的多个value进行merge操作,最重要的是它能够在本地先进行merge操作,并且merge操作可以通过函数自定义。
groupByKey(numPartitions=None)
Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with numPartitions partitions. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will provide much better performance.
也就是,groupByKey也是对每个key进行操作,但只生成一个sequence。需要特别注意“Note”中的话,它告诉我们:如果需要对sequence进行aggregation操作(注意,groupByKey本身不能自定义操作函数),那么,选择reduceByKey/aggregateByKey更好。这是因为groupByKey不能自定义函数,我们需要先用groupByKey生成RDD,然后才能对此RDD通过map进行自定义函数操作。
为了更好的理解上面这段话,下面我们使用两种不同的方式去计算单词的个数[2]:
[java] view plain copy
- val words = Array("one", "two", "two", "three", "three", "three")
-
- val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
-
- val wordCountsWithReduce = wordPairsRDD.reduceByKey(_ + _)
-
- val wordCountsWithGroup = wordPairsRDD.groupByKey().map(t => (t._1, t._2.sum))
上面得到的wordCountsWithReduce和wordCountsWithGroup是完全一样的,但是,它们的内部运算过程是不同的。
(1)当采用reduceByKeyt时,Spark可以在每个分区移动数据之前将待输出数据与一个共用的key结合。借助下图可以理解在reduceByKey里究竟发生了什么。 注意在数据对被搬移前同一机器上同样的key是怎样被组合的(reduceByKey中的lamdba函数)。然后lamdba函数在每个区上被再次调用来将所有值reduce成一个最终结果。整个过程如下:
(2)当采用groupByKey时,由于它不接收函数,spark只能先将所有的键值对(key-value pair)都移动,这样的后果是集群节点之间的开销很大,导致传输延时。整个过程如下:
因此,在对大数据进行复杂计算时,reduceByKey优于groupByKey。
另外,如果仅仅是group处理,那么以下函数应该优先于 groupByKey :
(1)、combineByKey 组合数据,但是组合之后的数据类型与输入时值的类型不一样。
(2)、foldByKey合并每一个 key 的所有值,在级联函数和“零值”中使用。
最后,对reduceByKey中的func做一些介绍:
如果是用Python写的spark,那么有一个库非常实用:operator[3],其中可以用的函数包括:大小比较函数,逻辑操作函数,数学运算函数,序列操作函数等等。这些函数可以直接通过“from operator import *”进行调用,直接把函数名作为参数传递给reduceByKey即可。如下:
[python] view plain copy
- from operator import add
- rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
- sorted(rdd.reduceByKey(add).collect())
-
- [('a', 2), ('b', 1)]
下面是附加源码更加详细的解释
转自:https://blog.csdn.net/ZMC921/article/details/75098903
一、首先他们都是要经过shuffle的,groupByKey在方法shuffle之间不会合并原样进行shuffle,。reduceByKey进行shuffle之前会先做合并,这样就减少了shuffle的io传送,所以效率高一点。
案例:
[plain] view plain copy
- object GroupyKeyAndReduceByKeyDemo {
- def main(args: Array[String]): Unit = {
- Logger.getLogger("org").setLevel(Level.WARN)
- val config = new SparkConf().setAppName("GroupyKeyAndReduceByKeyDemo").setMaster("local")
- val sc = new SparkContext(config)
- val arr = Array("val config", "val arr")
- val socketDS = sc.parallelize(arr).flatMap(_.split(" ")).map((_, 1))
- //groupByKey 和reduceByKey 的区别:
- //他们都是要经过shuffle的,groupByKey在方法shuffle之间不会合并原样进行shuffle,
- //reduceByKey进行shuffle之前会先做合并,这样就减少了shuffle的io传送,所以效率高一点
- socketDS.groupByKey().map(tuple => (tuple._1, tuple._2.sum)).foreach(x => {
- println(x._1 + " " + x._2)
- })
- println("----------------------")
- socketDS.reduceByKey(_ + _).foreach(x => {
- println(x._1 + " " + x._2)
- })
- sc.stop()
- }
- }
二 、首先groupByKey有三种
查看源码groupByKey()实现了 groupByKey(defaultPartitioner(self))
[java] view plain copy
- /**
- * Group the values for each key in the RDD into a single sequence. Hash-partitions the
- * resulting RDD with the existing partitioner/parallelism level. The ordering of elements
- * within each group is not guaranteed, and may even differ each time the resulting RDD is
- * evaluated.
- *
- * @note This operation may be very expensive. If you are grouping in order to perform an
- * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
- * or `PairRDDFunctions.reduceByKey` will provide much better performance.
- */
- def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
- groupByKey(defaultPartitioner(self))
- }
查看源码 groupByKey(numPartitions: Int) 实现了 groupByKey(new HashPartitioner(numPartitions))
[java] view plain copy
- /**
- * Group the values for each key in the RDD into a single sequence. Hash-partitions the
- * resulting RDD with into `numPartitions` partitions. The ordering of elements within
- * each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.
- *
- * @note This operation may be very expensive. If you are grouping in order to perform an
- * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
- * or `PairRDDFunctions.reduceByKey` will provide much better performance.
- *
- * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
- * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
- */
- def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {
- groupByKey(new HashPartitioner(numPartitions))
- }
其实上面两个都是实现了groupByKey(partitioner: Partitioner)
[java] view plain copy
- /**
- * Group the values for each key in the RDD into a single sequence. Allows controlling the
- * partitioning of the resulting key-value pair RDD by passing a Partitioner.
- * The ordering of elements within each group is not guaranteed, and may even differ
- * each time the resulting RDD is evaluated.
- *
- * @note This operation may be very expensive. If you are grouping in order to perform an
- * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
- * or `PairRDDFunctions.reduceByKey` will provide much better performance.
- *
- * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
- * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
- */
- def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
- // groupByKey shouldn't use map side combine because map side combine does not
- // reduce the amount of data shuffled and requires all map side data be inserted
- // into a hash table, leading to more objects in the old gen.
- val createCombiner = (v: V) => CompactBuffer(v)
- val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
- val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
- val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
- createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
- bufs.asInstanceOf[RDD[(K, Iterable[V])]]
- }
而groupByKey(partitioner: Partitioner)有实现了combineByKeyWithClassTag,所以可以说groupByKey其实底层都是combineByKeyWithClassTag的实现,只是实现的方式不同。
三、再查看reduceByKey也有三种方式
[java] view plain copy
- /**
- * Merge the values for each key using an associative and commutative reduce function. This will
- * also perform the merging locally on each mapper before sending results to a reducer, similarly
- * to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
- * parallelism level.
- */
- def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
- reduceByKey(defaultPartitioner(self), func)
- }
[java] view plain copy
- /**
- * Merge the values for each key using an associative and commutative reduce function. This will
- * also perform the merging locally on each mapper before sending results to a reducer, similarly
- * to a "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.
- */
- def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {
- reduceByKey(new HashPartitioner(numPartitions), func)
- }
[java] view plain copy
- /**
- * Merge the values for each key using an associative and commutative reduce function. This will
- * also perform the merging locally on each mapper before sending results to a reducer, similarly
- * to a "combiner" in MapReduce.
- */
- def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
- combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
- }
通过查看这三种reduceByKey不难发现,前两种是最后一种的实现。而最后一种是又实现了combineByKeyWithClassTag。
### groupByKey是这样实现的
combineByKeyWithClassTag[CompactBuffer[V]](createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
### reduceByKey是这样实现的
combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
对比上面发现,groupByKey设置了mapSideCombine = false,在map端不进行合并,那就是在shuffle前不合并。而reduceByKey没有设置
难道reduceByKey默认合并吗????
四、接下来,我们仔细看一下combineByKeyWithClassTag
[java] view plain copy
- /**
- * :: Experimental ::
- * Generic function to combine the elements for each key using a custom set of aggregation
- * functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C
- *
- * Users provide three functions:
- *
- * - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
- * - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
- * - `mergeCombiners`, to combine two C's into a single one.
- *
- * In addition, users can control the partitioning of the output RDD, and whether to perform
- * map-side aggregation (if a mapper can produce multiple items with the same key).
- *
- * @note V and C can be different -- for example, one might group an RDD of type
- * (Int, Int) into an RDD of type (Int, Seq[Int]).
- */
- @Experimental
- def combineByKeyWithClassTag[C](
- createCombiner: V => C,
- mergeValue: (C, V) => C,
- mergeCombiners: (C, C) => C,
- partitioner: Partitioner,
- mapSideCombine: Boolean = true,
- serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
- require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
- if (keyClass.isArray) {
- if (mapSideCombine) {
- throw new SparkException("Cannot use map-side combining with array keys.")
- }
- if (partitioner.isInstanceOf[HashPartitioner]) {
- throw new SparkException("HashPartitioner cannot partition array keys.")
- }
- }
- val aggregator = new Aggregator[K, V, C](
- self.context.clean(createCombiner),
- self.context.clean(mergeValue),
- self.context.clean(mergeCombiners))
- if (self.partitioner == Some(partitioner)) {
- self.mapPartitions(iter => {
- val context = TaskContext.get()
- new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
- }, preservesPartitioning = true)
- } else {
- new ShuffledRDD[K, V, C](self, partitioner)
- .setSerializer(serializer)
- .setAggregator(aggregator)
- .setMapSideCombine(mapSideCombine)
- }
- }
通过查看combineByKeyWithClassTag的,发现reduceByKey默认在map端进行合并,那就是在shuffle前进行合并,如果合并了一些数据,那在shuffle时进行溢写则减少了磁盘IO,所以reduceByKey会快一些。