groupByKey and reduceByKey are two of the most frequently used functions in Spark.
Under normal circumstances both give the correct, identical result, but reduceByKey is better suited to large datasets, and the usual advice has been to avoid groupByKey where possible (this was the common recommendation in earlier days). Why? Because when Spark executes reduceByKey, it first combines the data within each partition and only then moves it across the network, whereas groupByKey moves the data first and combines afterwards, so the amount of data shuffled is considerably larger.
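To make the claim about data movement concrete, here is a minimal pure-Python simulation (no Spark involved; the partition layout and all names are made up for illustration) of the two shuffle strategies for a small word count. The reduceByKey-style path ships one partial sum per key per partition, while the groupByKey-style path ships every raw record:

from collections import defaultdict

# Two simulated partitions of (word, 1) records.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

def local_reduce(part):
    # Map-side combine: sum values per key inside one partition.
    acc = defaultdict(int)
    for k, v in part:
        acc[k] += v
    return list(acc.items())

# reduceByKey-style: only the per-partition partial sums cross the "network".
shuffled_small = [rec for part in partitions for rec in local_reduce(part)]
print(len(shuffled_small))    # 4 records shuffled (2 keys x 2 partitions)

# groupByKey-style: every raw record crosses the "network" before any combining.
shuffled_big = [rec for part in partitions for rec in part]
print(len(shuffled_big))      # 6 records shuffled

# Reduce side: merge the partial sums; both strategies end at the same answer.
final = defaultdict(int)
for k, v in shuffled_small:
    final[k] += v
print(sorted(final.items()))  # [('a', 3), ('b', 3)]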
Looking at the current source (Spark 2.4):
The core of both groupByKey and reduceByKey is a call to combineByKey; the difference is:
groupByKey implements the same logic as combineByKey, except that groupByKey comes with three built-in combining functions, whereas combineByKey lets you supply your own.
reduceByKey likewise delegates to combineByKey, but the functions it passes along are fairly simple, so what it can compute is limited. If you have requirements of your own, you need to write the three combining functions yourself according to combineByKey's contract (a worked example follows the combineByKey source at the end of this post).
So when you use these functions it is worth reading the source; it is quite interesting.
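Stripped of the spill-to-disk machinery, that relationship shows up directly at the API level. Below is a minimal, self-contained sketch (it assumes a local pyspark installation; the helper names are mine, not Spark's) that reproduces reduceByKey and groupByKey with explicit combineByKey calls:

from operator import add
from pyspark import SparkContext

sc = SparkContext("local[2]", "combineByKey-equivalence")
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])

# reduceByKey(f) is combineByKey with an identity createCombiner and f for both
# merge steps -- exactly the `combineByKey(lambda x: x, func, func, ...)` call
# in the source below.
print(sorted(rdd.reduceByKey(add).collect()))                     # [('a', 3), ('b', 1)]
print(sorted(rdd.combineByKey(lambda v: v, add, add).collect()))  # [('a', 3), ('b', 1)]

# groupByKey() is combineByKey with three built-in list-building functions.
def create_combiner(v):     # V -> C: wrap the first value seen for a key in a list
    return [v]

def merge_value(xs, v):     # merge one more V into an existing list
    xs.append(v)
    return xs

def merge_combiners(a, b):  # merge two partial lists from different partitions
    a.extend(b)
    return a

print(sorted(rdd.groupByKey().mapValues(list).collect()))
print(sorted(rdd.combineByKey(create_combiner, merge_value, merge_combiners).collect()))
# Both print [('a', [1, 2]), ('b', [1])]; the order inside a list can depend on partitioning.
sc.stop()

The actual implementations, reproduced below, do the same thing but route everything through an external merger that can spill to disk when a partition does not fit in memory.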
def reduceByKey(self, func, numPartitions=None, partitionFunc=portable_hash):
    """
    Merge the values for each key using an associative and commutative reduce function.

    This will also perform the merging locally on each mapper before
    sending results to a reducer, similarly to a "combiner" in MapReduce.

    Output will be partitioned with C{numPartitions} partitions, or
    the default parallelism level if C{numPartitions} is not specified.
    Default partitioner is hash-partition.

    >>> from operator import add
    >>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
    >>> sorted(rdd.reduceByKey(add).collect())
    [('a', 2), ('b', 1)]
    """
    return self.combineByKey(lambda x: x, func, func, numPartitions, partitionFunc)
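Note the wording of the docstring: func must be associative and commutative, precisely because it is applied twice, once map-side within each partition and once reduce-side across the partitions' partial results. A quick sketch of what goes wrong otherwise (the exact output of the second call depends on how the records happen to be partitioned, which is the point):

from operator import add, sub
from pyspark import SparkContext

sc = SparkContext("local[4]", "reduceByKey-associativity")
rdd = sc.parallelize([("k", 1), ("k", 2), ("k", 3), ("k", 4)], 4)

# Addition is associative and commutative, so the result is stable: [('k', 10)].
print(rdd.reduceByKey(add).collect())

# Subtraction is neither, so the per-partition partial results are combined in an
# order Spark does not guarantee; the answer can change with the partitioning.
print(rdd.reduceByKey(sub).collect())
sc.stop()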
def groupByKey(self, numPartitions=None, partitionFunc=portable_hash):
    """
    Group the values for each key in the RDD into a single sequence.
    Hash-partitions the resulting RDD with numPartitions partitions.

    .. note:: If you are grouping in order to perform an aggregation (such as a
        sum or average) over each key, using reduceByKey or aggregateByKey will
        provide much better performance.

    >>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
    >>> sorted(rdd.groupByKey().mapValues(len).collect())
    [('a', 2), ('b', 1)]
    >>> sorted(rdd.groupByKey().mapValues(list).collect())
    [('a', [1, 1]), ('b', [1])]
    """
    def createCombiner(x):
        return [x]

    def mergeValue(xs, x):
        xs.append(x)
        return xs

    def mergeCombiners(a, b):
        a.extend(b)
        return a

    memory = self._memory_limit()
    serializer = self._jrdd_deserializer
    agg = Aggregator(createCombiner, mergeValue, mergeCombiners)

    def combine(iterator):
        merger = ExternalMerger(agg, memory * 0.9, serializer)
        merger.mergeValues(iterator)
        return merger.items()

    locally_combined = self.mapPartitions(combine, preservesPartitioning=True)
    shuffled = locally_combined.partitionBy(numPartitions, partitionFunc)

    def groupByKey(it):
        merger = ExternalGroupBy(agg, memory, serializer)
        merger.mergeCombiners(it)
        return merger.items()

    return shuffled.mapPartitions(groupByKey, True).mapValues(ResultIterable)
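The .. note:: in that docstring is the practical takeaway: if all you need per key is an aggregate, let the map-side combine do the heavy lifting instead of shipping every value and aggregating afterwards. A small sketch of the two formulations (same result, very different shuffle volume on a large dataset; names are mine):

from operator import add
from pyspark import SparkContext

sc = SparkContext("local[2]", "groupByKey-vs-reduceByKey")
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

# groupByKey ships every (key, value) record, then sums the grouped values;
# the grouped values arrive as a ResultIterable, so aggregate or materialize them.
grouped_sum = pairs.groupByKey().mapValues(sum)

# reduceByKey pre-aggregates inside each partition (via combineByKey, as above),
# so only one partial sum per key per partition is shuffled.
reduced_sum = pairs.reduceByKey(add)

print(sorted(grouped_sum.collect()))  # [('a', 2), ('b', 1)]
print(sorted(reduced_sum.collect()))  # [('a', 2), ('b', 1)]
sc.stop()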
def combineByKey(self, createCombiner, mergeValue, mergeCombiners,
                 numPartitions=None, partitionFunc=portable_hash):
    """
    Generic function to combine the elements for each key using a custom
    set of aggregation functions.

    Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined
    type" C.

    Users provide three functions:

        - C{createCombiner}, which turns a V into a C (e.g., creates
          a one-element list)
        - C{mergeValue}, to merge a V into a C (e.g., adds it to the end of
          a list)
        - C{mergeCombiners}, to combine two C's into a single one (e.g., merges
          the lists)

    To avoid memory allocation, both mergeValue and mergeCombiners are allowed to
    modify and return their first argument instead of creating a new C.

    In addition, users can control the partitioning of the output RDD.

    .. note:: V and C can be different -- for example, one might group an RDD of type
        (Int, Int) into an RDD of type (Int, List[Int]).

    >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
    >>> def to_list(a):
    ...     return [a]
    ...
    >>> def append(a, b):
    ...     a.append(b)
    ...     return a
    ...
    >>> def extend(a, b):
    ...     a.extend(b)
    ...     return a
    ...
    >>> sorted(x.combineByKey(to_list, append, extend).collect())
    [('a', [1, 2]), ('b', [1])]
    """
    if numPartitions is None:
        numPartitions = self._defaultReducePartitions()

    serializer = self.ctx.serializer
    memory = self._memory_limit()
    agg = Aggregator(createCombiner, mergeValue, mergeCombiners)

    def combineLocally(iterator):
        merger = ExternalMerger(agg, memory * 0.9, serializer)
        merger.mergeValues(iterator)
        return merger.items()

    locally_combined = self.mapPartitions(combineLocally, preservesPartitioning=True)
    shuffled = locally_combined.partitionBy(numPartitions, partitionFunc)

    def _mergeCombiners(iterator):
        merger = ExternalMerger(agg, memory, serializer)
        merger.mergeCombiners(iterator)
        return merger.items()

    return shuffled.mapPartitions(_mergeCombiners, preservesPartitioning=True)
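As promised above, here is what "writing the three functions yourself" looks like: a per-key average. reduceByKey alone cannot express it cleanly, because the combined type C, a running (sum, count) pair, differs from the value type V. A sketch assuming a local pyspark setup; the function and variable names are mine:

from pyspark import SparkContext

sc = SparkContext("local[2]", "combineByKey-average")
scores = sc.parallelize([("a", 10.0), ("a", 20.0), ("b", 5.0), ("b", 15.0), ("b", 25.0)])

def create_combiner(v):     # V -> C: start a (sum, count) pair from the first value
    return (v, 1)

def merge_value(acc, v):    # fold one more V into an existing (sum, count)
    return (acc[0] + v, acc[1] + 1)

def merge_combiners(a, b):  # merge two partial (sum, count) pairs from different partitions
    return (a[0] + b[0], a[1] + b[1])

sum_count = scores.combineByKey(create_combiner, merge_value, merge_combiners)
averages = sum_count.mapValues(lambda p: p[0] / p[1])
print(sorted(averages.collect()))  # [('a', 15.0), ('b', 15.0)]
sc.stop()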
A more detailed analysis will follow in a later post.