Aggregating key-value pairs in Spark

Using key-value pair operations to compute per-key sums and means

All of the operations here work on the values of key-value pairs: the lambdas passed to mapValues and reduceByKey receive only the value part of each pair, never the key. So for the pair x = ('a', (4, 2)), the lambda's argument is the value (4, 2), and x[0] == 4 rather than x[0] == 'a'.
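As a minimal sketch of this point (assuming a SparkContext named sc is already available):

pair = sc.parallelize([('a', (4, 2))])
print(pair.mapValues(lambda x: x[0]).collect())  # [('a', 4)]: x is the value (4, 2), not the key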
The full example:

import sys

from pyspark import SparkContext

if __name__ == "__main__":
    master = "local"
    if len(sys.argv) == 2:
        master = sys.argv[1]
    try:
        sc.stop()  # stop any existing SparkContext (useful when rerunning in a notebook)
    except NameError:
        pass
    sc = SparkContext(master, 'test')
    RDD1 = sc.parallelize((('a', 2),
                           ('b', 3),
                           ('c', 3),
                           ('d', 1),
                           ('a', 2),
                           ('b', 3),
                           ('c', 3),
                           ('d', 1)))
    print(RDD1.collect())
    # Wrap each value v as (v, 1), then sum values and counts per key,
    # yielding a (sum, count) pair for every key
    RDD1 = RDD1.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
    print(RDD1.collect())
    # Divide sum by count to get the mean value per key
    RDD1 = RDD1.mapValues(lambda x: (x[0] / x[1]))
    print(RDD1.collect())

Output:

[('a', 2), ('b', 3), ('c', 3), ('d', 1), ('a', 2), ('b', 3), ('c', 3), ('d', 1)]
[('a', (4, 2)), ('b', (6, 2)), ('c', (6, 2)), ('d', (2, 2))]
[('a', 2.0), ('b', 3.0), ('c', 3.0), ('d', 1.0)]
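For comparison, the same per-key (sum, count) accumulation can also be written with aggregateByKey, which starts from a zero value instead of wrapping each value via mapValues. This is a sketch of an alternative, not part of the original example, and again assumes the SparkContext sc:

pairs = sc.parallelize((('a', 2), ('b', 3), ('a', 2), ('b', 3)))
# the zero value (0, 0) holds (running sum, running count)
sums_counts = pairs.aggregateByKey((0, 0),
                                   lambda acc, v: (acc[0] + v, acc[1] + 1),  # fold one value into the accumulator
                                   lambda a, b: (a[0] + b[0], a[1] + b[1]))  # merge accumulators across partitions
print(sums_counts.mapValues(lambda x: x[0] / x[1]).collect())  # [('a', 2.0), ('b', 3.0)] (order may vary)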

This shows the difference: mapValues transforms the value of each pair independently, while reduceByKey combines the values of pairs that share a key. If, after building the (sum, count) pairs, we instead tried to perform the division with reduceByKey:

RDD1 = RDD1.reduceByKey(lambda x, y: (x[0] / x[1]))

the result would be:

[('a', (4, 2)), ('b', (6, 2)), ('c', (6, 2)), ('d', (2, 2))]

unchanged, because after the first reduceByKey each key holds exactly one value, so the reduce function is never invoked and no division takes place.
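A quick way to confirm this, as a small sketch (again assuming sc):

once = sc.parallelize([('a', (4, 2)), ('b', (6, 2))])
# every key occurs exactly once, so the lambda below never runs
print(once.reduceByKey(lambda x, y: 'merged').collect())  # [('a', (4, 2)), ('b', (6, 2))]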
