Using Key-Value Pairs to Compute Per-Key Sums and Means
All of the operations here work on key-value pairs. Note that mapValues passes only the value to its function: for a pair x = ('a', (4, 2)), the pair itself has x[0] == 'a', but inside the mapValues lambda, x is the value (4, 2), so there x[0] == 4 rather than 'a'.
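A quick plain-Python illustration of that indexing (no Spark needed):

pair = ('a', (4, 2))
print(pair[0])   # 'a'  -- the key
print(pair[1])   # (4, 2)  -- the value that mapValues hands to its function
value = pair[1]
print(value[0], value[1])  # 4 2  -- what x[0] and x[1] mean inside the lambda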
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    master = "local"
    if len(sys.argv) == 2:
        master = sys.argv[1]
    # Stop any leftover SparkContext (e.g. when re-running in a notebook)
    try:
        sc.stop()
    except NameError:
        pass
    sc = SparkContext(master, 'test')
    RDD1 = sc.parallelize((('a', 2),
                           ('b', 3),
                           ('c', 3),
                           ('d', 1),
                           ('a', 2),
                           ('b', 3),
                           ('c', 3),
                           ('d', 1)))
    print(RDD1.collect())
    # For each key, build (sum of values, count of occurrences)
    RDD1 = RDD1.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
    print(RDD1.collect())
    # Divide each key's sum by its count to get the per-key mean
    RDD1 = RDD1.mapValues(lambda x: (x[0] / x[1]))
    print(RDD1.collect())
Output:
[('a', 2), ('b', 3), ('c', 3), ('d', 1), ('a', 2), ('b', 3), ('c', 3), ('d', 1)]
[('a', (4, 2)), ('b', (6, 2)), ('c', (6, 2)), ('d', (2, 2))]
[('a', 2.0), ('b', 3.0), ('c', 3.0), ('d', 1.0)]
Here you can see the division of labor: mapValues transforms the value of each key-value pair on its own, while reduceByKey is an operation between pairs that share the same key. The final step,

RDD1 = RDD1.mapValues(lambda x: (x[0] / x[1]))

takes the intermediate result

[('a', (4, 2)), ('b', (6, 2)), ('c', (6, 2)), ('d', (2, 2))]

and divides each per-key sum by its count, which yields the means shown in the last line of output.
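To make that contrast concrete, here is a small sketch (assuming the same sc as in the script above): reduceByKey merges values across pairs that share a key, while mapValues transforms each pair's value independently.

pairs = sc.parallelize([('a', 1), ('a', 2), ('b', 5)])
# reduceByKey combines values between pairs with the same key: 'a' -> 1 + 2
print(pairs.reduceByKey(lambda x, y: x + y).collect())  # [('a', 3), ('b', 5)] (order may vary)
# mapValues transforms each value on its own; keys and pairing are untouched
print(pairs.mapValues(lambda v: v * 10).collect())  # [('a', 10), ('a', 20), ('b', 50)]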
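As a side note, the same per-key mean can also be computed with aggregateByKey, which folds the (value, 1) wrapping and the pairwise merging into a single call. A minimal sketch, assuming the same sc and input data as above:

data = sc.parallelize([('a', 2), ('b', 3), ('c', 3), ('d', 1),
                       ('a', 2), ('b', 3), ('c', 3), ('d', 1)])
# zeroValue (0, 0) is the starting (sum, count) accumulator for each key;
# the first lambda folds one value into an accumulator;
# the second merges two accumulators from different partitions.
sums_counts = data.aggregateByKey((0, 0),
                                  lambda acc, v: (acc[0] + v, acc[1] + 1),
                                  lambda a, b: (a[0] + b[0], a[1] + b[1]))
print(sums_counts.mapValues(lambda t: t[0] / t[1]).collect())
# Expected: [('a', 2.0), ('b', 3.0), ('c', 3.0), ('d', 1.0)] (order may vary)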