Differences between groupByKey, reduceByKey, and combineByKey in Spark, and how to use each

This post covers three shuffle operations that are used all the time in Spark: groupByKey, reduceByKey, and combineByKey. The first two also have variants without "Key" in the name, which let you define the key yourself during the RDD transformation; those were used in the earlier post on computing TF-IDF.
Below, the three APIs are introduced one by one, using a word-count demo as the example.

1. groupByKey

First, the official description:

groupByKey([numTasks])	When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. 
Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. 
Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.

As you can see, this transformation operates on an RDD of (key, value) pairs and returns another pair RDD in which each value is an iterable: values sharing the same key are grouped together, and for every key you get an iterable over its values.
For example, suppose we have the counts of three words:

(我,1),(爱,1),(中国,2), (中国,3)

After groupByKey, we get:

(我,[1]),(爱,[1]),(中国,[2, 3])
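
A minimal sketch of this toy example in PySpark (assuming an existing SparkContext named sc; mapValues(list) is only there to make the iterables printable):

pairs = sc.parallelize([('我', 1), ('爱', 1), ('中国', 2), ('中国', 3)])
grouped = pairs.groupByKey()                 # values for each key become an iterable
print(grouped.mapValues(list).collect())     # e.g. [('我', [1]), ('爱', [1]), ('中国', [2, 3])]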

As the official documentation says, if the goal is to aggregate after grouping, reduceByKey or aggregateByKey is much more efficient. Why? Because groupByKey does not combine values with the same key inside each partition: even if, in the example above, the two 中国 pairs sit in the same partition, they are not merged on the map side but are sent over the network as-is to the next RDD. Since we have to merge by key anyway, combining the records that share a key within a partition first and only then shipping the partial results would move far less data. The larger the dataset, the larger this network overhead becomes, and it can turn into the performance bottleneck of the shuffle.
We can observe this process by adding some print statements to a program:

# -*- coding: utf-8 -*-

"""

 @Time    : 2019/2/21 19:35
 @Author  : MaCan ([email protected])
 @File    : groupByKey_test.py
"""

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

import time
import jieba

def print_out(line):
    # line is (key, iterable_of_values); turn the iterable into a list so it prints readably
    print(line[0], list(line[1]))

if __name__ == '__main__':
    start_t = time.perf_counter()
    conf = SparkConf().setAppName('text_trans').setMaster("local[*]")
    sc = SparkContext(conf=conf)
    sc.setLogLevel(logLevel='ERROR')
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    try:
        # split each line of 'test' on a space, keep the second field and segment it with jieba
        data = spark.read.text('test').rdd\
                                     .flatMap(lambda x: jieba.cut(x[0].split(' ')[1]))
        print(data.collect())
        print('map to pair:')
        data = data.map(lambda x: (x, 1))
        print(data.collect())
        print('group by key')
        data = data\
            .groupByKey()               # values with the same key are collected into an iterable
        print(data.collect())
        data.map(print_out).collect()   # print each key together with its list of values
    except Exception as e:
        print(e)
    finally:
        spark.stop()
        print(time.perf_counter() - start_t)

Output:

C:\Users\C\Anaconda3\python.exe C:/workspace/python/py_tools/spark_work/groupByKey_test.py
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2019-02-21 19:45:09 WARN  SparkConf:66 - In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
[Stage 0:>                                                          (0 + 1) / 1]Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\C\AppData\Local\Temp\jieba.cache
Loading model cost 0.529 seconds.
Prefix dict has been built succesfully.
['我', '来到', '北京', '清华大学', '他', '来到', '了', '网易', '杭研', '大厦', '我', '来到', '北京', '清华大学', '他', '来到', '了', '网易', '杭研', '大厦', '我', '来到', '北京', '清华大学', ',', '我', '来到', '北京', '清华大学']
map to pair:
[Stage 1:>                                                          (0 + 1) / 1]Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\C\AppData\Local\Temp\jieba.cache
Loading model cost 0.528 seconds.
Prefix dict has been built succesfully.
[('我', 1), ('来到', 1), ('北京', 1), ('清华大学', 1), ('他', 1), ('来到', 1), ('了', 1), ('网易', 1), ('杭研', 1), ('大厦', 1), ('我', 1), ('来到', 1), ('北京', 1), ('清华大学', 1), ('他', 1), ('来到', 1), ('了', 1), ('网易', 1), ('杭研', 1), ('大厦', 1), ('我', 1), ('来到', 1), ('北京', 1), ('清华大学', 1), (',', 1), ('我', 1), ('来到', 1), ('北京', 1), ('清华大学', 1)]
group by key
[Stage 2:>                                                          (0 + 1) / 1]Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\C\AppData\Local\Temp\jieba.cache
Loading model cost 0.537 seconds.
Prefix dict has been built succesfully.
[('我', <pyspark.resultiterable.ResultIterable object at 0x...>), ('来到', <pyspark.resultiterable.ResultIterable object at 0x...>), ('北京', <pyspark.resultiterable.ResultIterable object at 0x...>), ('清华大学', <pyspark.resultiterable.ResultIterable object at 0x...>), ('他', <pyspark.resultiterable.ResultIterable object at 0x...>), ('了', <pyspark.resultiterable.ResultIterable object at 0x...>), ('网易', <pyspark.resultiterable.ResultIterable object at 0x...>), ('杭研', <pyspark.resultiterable.ResultIterable object at 0x...>), ('大厦', <pyspark.resultiterable.ResultIterable object at 0x...>), (',', <pyspark.resultiterable.ResultIterable object at 0x...>)]
[Stage 5:>                                                          (0 + 1) / 1]我 [1, 1, 1, 1]
来到 [1, 1, 1, 1, 1, 1]
北京 [1, 1, 1, 1]
清华大学 [1, 1, 1, 1]
他 [1, 1]
了 [1, 1]
网易 [1, 1]
杭研 [1, 1]
大厦 [1, 1]
, [1]
13.095626099999999

Process finished with exit code 0

As you can see, each record in the grouped result is a pair whose value is an iterable of 1s; to get the word counts we still need an extra pass that sums each value.
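
A minimal sketch of that extra step (here pairs stands for the (word, 1) pair RDD built in the script above):

# sum the grouped values to obtain the word counts
counts = pairs.groupByKey().mapValues(sum)
print(counts.collect())   # e.g. [('我', 4), ('来到', 6), ...]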

2. reduceByKey

First, the official documentation:

When called on a dataset of (K, V) pairs, 
returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, 
which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

Compared with groupByKey, reduceByKey adds an optimization: records with the same key are first reduced inside each partition, and the partial results are then reduced across partitions. Its arguments also differ from groupByKey's: the user-supplied func decides how values are combined, which gives flexible control over what the result looks like. For word counting, we simply add up the counts of identical words.
Code:

# -*- coding: utf-8 -*-

"""

 @Time    : 2019/2/21 19:35
 @Author  : MaCan ([email protected])
 @File    : groupByKey_test.py
"""

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

import time
import jieba

if __name__ == '__main__':
    start_t = time.perf_counter()
    conf = SparkConf().setAppName('text_trans').setMaster("local[*]")
    sc = SparkContext(conf=conf)
    sc.setLogLevel(logLevel='ERROR')
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    try:
        data = spark.read.text('test').rdd\
                                     .flatMap(lambda x: jieba.cut(x[0].split(' ')[1]))
        print(data.collect())
        print('map to pair:')
        data = data.map(lambda x: (x, 1))
        print(data.collect())
        print('reduce by key')
        data = data\
            .reduceByKey(lambda x, y: x + y)   # sum the counts for each key; same-key values are pre-combined within each partition
        print(data.collect())

    except Exception as e:
        print(e)
    finally:
        spark.stop()
        print(time.perf_counter() - start_t)

The output shows:

C:\Users\C\Anaconda3\python.exe C:/workspace/python/py_tools/spark_work/groupByKey_test.py
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2019-02-21 19:55:25 WARN  SparkConf:66 - In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
[Stage 0:>                                                          (0 + 1) / 1]Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\C\AppData\Local\Temp\jieba.cache
Loading model cost 0.563 seconds.
Prefix dict has been built succesfully.
['我', '来到', '北京', '清华大学', '他', '来到', '了', '网易', '杭研', '大厦', '我', '来到', '北京', '清华大学', '他', '来到', '了', '网易', '杭研', '大厦', '我', '来到', '北京', '清华大学', ',', '我', '来到', '北京', '清华大学']
map to pair:
[Stage 1:>                                                          (0 + 1) / 1]Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\C\AppData\Local\Temp\jieba.cache
Loading model cost 0.544 seconds.
Prefix dict has been built succesfully.
[('我', 1), ('来到', 1), ('北京', 1), ('清华大学', 1), ('他', 1), ('来到', 1), ('了', 1), ('网易', 1), ('杭研', 1), ('大厦', 1), ('我', 1), ('来到', 1), ('北京', 1), ('清华大学', 1), ('他', 1), ('来到', 1), ('了', 1), ('网易', 1), ('杭研', 1), ('大厦', 1), ('我', 1), ('来到', 1), ('北京', 1), ('清华大学', 1), (',', 1), ('我', 1), ('来到', 1), ('北京', 1), ('清华大学', 1)]
reduce by key
[Stage 2:>                                                          (0 + 1) / 1]Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\C\AppData\Local\Temp\jieba.cache
Loading model cost 0.534 seconds.
Prefix dict has been built succesfully.
[('我', 4), ('来到', 6), ('北京', 4), ('清华大学', 4), ('他', 2), ('了', 2), ('网易', 2), ('杭研', 2), ('大厦', 2), (',', 1)]
[Stage 5:>                                                          (0 + 1) / 1]我 4
来到 6
北京 4
清华大学 4
他 2
了 2
网易 2
杭研 2
大厦 2
, 1
13.069456200000001

Process finished with exit code 0

The earlier part of the output is the same as above, so only the output of the final stage is repeated here:

[('我', 4), ('来到', 6), ('北京', 4), ('清华大学', 4), ('他', 2), ('了', 2), ('网易', 2), ('杭研', 2), ('大厦', 2), (',', 1)]

As you can see, the result of reduceByKey is exactly the word count we want; no extra map step is needed to sum the values.
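
As a small aside (not from the original script), the hand-written lambda can be swapped for the standard library's operator.add, a common idiom for this kind of reduce (again assuming pairs is a (word, 1) pair RDD):

from operator import add

counts = pairs.reduceByKey(add)   # same word count as the lambda version
print(counts.collect())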

3. aggregateByKey

The performance note in the groupByKey documentation recommends not only reduceByKey (section 2) but also aggregateByKey. So what does that one do?

    def aggregateByKey(self, zeroValue, seqFunc, combFunc, numPartitions=None,
                       partitionFunc=portable_hash):

From the signature you can see that, compared with groupByKey, it takes three extra arguments: zeroValue, seqFunc, and combFunc.
It is more flexible than reduceByKey and more efficient than groupByKey. reduceByKey only has a combine function, so it can only specify how two values are merged. What if we want the same result as groupByKey, i.e. for each key an iterable of all its values? My first idea was to write it like this:

# -*- coding: utf-8 -*-

"""

 @Time    : 2019/2/21 19:35
 @Author  : MaCan ([email protected])
 @File    : groupByKey_test.py
"""

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

import time
import jieba

def comb_func(x, y):
    # merge two partial results into one list; each argument is either a raw
    # value (an int) or a list built by an earlier merge
    t = []
    if type(x) == list:
        t.extend(x)
    else:
        t.append(x)
    if type(y) == list:
        t.extend(y)
    else:
        t.append(y)
    return t


if __name__ == '__main__':
    start_t = time.perf_counter()
    conf = SparkConf().setAppName('text_trans').setMaster("local[*]")
    sc = SparkContext(conf=conf)
    sc.setLogLevel(logLevel='ERROR')
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    try:
        data = spark.read.text('test').rdd \
            .flatMap(lambda x: jieba.cut(x[0].split(' ')[1]))
        print(data.collect())
        print('map to pair:')
        data = data.map(lambda x: (x, 1))
        print(data.collect())
        print('reduce by key')

        data = data \
            .reduceByKey(comb_func)   # collect all values of a key into a list
        print(data.collect())

    except Exception as e:
        print(e)
    finally:
        spark.stop()
        print(time.perf_counter() - start_t)

Here the reduce step only defines a merge function: each call builds a new [] and appends the incoming data to it.
But think about it: this means every reduce call allocates a fresh []. The approach works, but all that unnecessary allocation is bound to hurt performance. So what can we do? We can use aggregateByKey. Its arguments are used as follows:
zeroValue: the initial value of the accumulator
seqFunc: combines the accumulator with the values of a key inside a partition; the first argument is the accumulator, the second is a value coming from the RDD
combFunc: as in reduceByKey, defines how the accumulators of the same key are merged across partitions
Here is the code for our word-count data, which produces the same result as groupByKey:

# -*- coding: utf-8 -*-

"""

 @Time    : 2019/2/21 19:35
 @Author  : MaCan ([email protected])
 @File    : groupByKey_test.py
"""

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

import time
import jieba

def seq_func(x, v):
    # x is the accumulator (a list); v is either a single value (when used as
    # seqFunc) or another accumulator list (when used as combFunc)
    if type(v) is not list:
        x.append(v)
    else:
        x.extend(v)
    return x



if __name__ == '__main__':
    start_t = time.perf_counter()
    conf = SparkConf().setAppName('text_trans').setMaster("local[*]")
    sc = SparkContext(conf=conf)
    sc.setLogLevel(logLevel='ERROR')
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    try:
        data = spark.read.text('test').rdd \
            .flatMap(lambda x: jieba.cut(x[0].split(' ')[1]))
        print(data.collect())
        print('map to pair:')
        data = data.map(lambda x: (x, 1))
        print(data.collect())
        print('aggregate by key')

        # seq_func serves as both seqFunc and combFunc thanks to its type check
        data = data.aggregateByKey([], seq_func, seq_func)
        print(data.collect())

    except Exception as e:
        print(e)
    finally:
        spark.stop()
        print(time.perf_counter() - start_t)

This removes the need to build a new list on every merge, cutting out the unnecessary allocations, and at the same time it can return a value of a different type from the one in the input pairs.
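
To make the point about the value type concrete, here is a small sketch (not from the original post, using a hypothetical pairs RDD of (key, int) records) that collects the distinct values of each key into a set, so the accumulator type differs from the value type:

# per-key distinct values via aggregateByKey
distinct_vals = pairs.aggregateByKey(
    set(),                     # zeroValue: the initial accumulator for a key
    lambda s, v: s | {v},      # seqFunc: fold one value into the accumulator within a partition
    lambda s1, s2: s1 | s2)    # combFunc: merge the per-partition accumulators
print(distinct_vals.collect())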

4. combineByKey

Like aggregateByKey, it can return a value of a different type from the input (and, as the example above suggests, so can reduceByKey). Its logic works as follows:
It walks through every element of the RDD, so each element's key is either one it has not seen before or one that already appeared in an earlier element.
If the key is new, combineByKey calls a function named createCombiner() to create the initial accumulator for that key. This happens the first time the key shows up in each partition.
If the key has already been seen while processing the current partition, mergeValue() is used to merge the key's current accumulator with the new value.
Because every partition is processed independently, the same key can end up with several accumulators globally. If a key has accumulators in two or more partitions, mergeCombiners() is used to merge the per-partition results.
(The description above is paraphrased from Learning Spark, Chinese edition 《Spark快速大数据分析》.)
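
The classic illustration of these three functions, adapted from Learning Spark (not from this post's script, using a hypothetical pairs RDD of (key, number) records), is a per-key average:

# per-key average with combineByKey; the accumulator is a (sum, count) tuple
sum_count = pairs.combineByKey(
    lambda v: (v, 1),                           # createCombiner: first value of a key in a partition
    lambda acc, v: (acc[0] + v, acc[1] + 1),    # mergeValue: fold another value into the accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]))    # mergeCombiners: merge accumulators across partitions
averages = sum_count.mapValues(lambda t: t[0] / t[1])
print(averages.collect())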
OK, now that we know how it works, let's write some code and do our word count with it.
A small reminder: combineByKey also merges values with the same key inside a partition. In that respect it differs from groupByKey and behaves like the other two. There is also a foldByKey, which I will not cover here...

# -*- coding: utf-8 -*-

"""

 @Time    : 2019/2/21 19:35
 @Author  : MaCan ([email protected])
 @File    : groupByKey_test.py
"""

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

import time
import jieba

def create_combiner(v):
    # called the first time a key is seen in a partition: start a list with the value
    t = [v]
    print(t)
    return t


def merge_value(v, x):
    # v is the accumulator list for the key, x is a new value from the same partition
    print(x, v)
    v.append(x)
    return v


def merge_combiner(x, y):
    # merge the accumulator lists built in different partitions
    x.extend(y)
    return x


if __name__ == '__main__':
    start_t = time.perf_counter()
    conf = SparkConf().setAppName('text_trans').setMaster("local[*]")
    sc = SparkContext(conf=conf)
    sc.setLogLevel(logLevel='ERROR')
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    try:
        data = spark.read.text('test').rdd \
            .flatMap(lambda x: jieba.cut(x[0].split(' ')[1]))
        print(data.collect())
        print('map to pair:')
        data = data.map(lambda x: (x, 1))
        print(data.collect())
        print('combine by key')
        # word count: keep the first 1 as the combiner, then add counts within and across partitions
        data = data.combineByKey(lambda v: v, lambda x, v: x + v, lambda x, y: x + y)
        print(data.collect())
        # data = data.combineByKey(create_combiner, merge_value, merge_combiner)
        # print(data.collect())
    except Exception as e:
        print(e)
    finally:
        spark.stop()
        print(time.perf_counter() - start_t)

If what you actually want is the groupByKey-style result, uncomment the two commented lines above and comment out the two lines right above them:

# data = data.combineByKey(create_combiner, merge_value, merge_combiner)
# print(data.collect())

5. Summary

Forget groupByKey; any of the other three will do!
