reduceByKey
reduceByKey groups (key, value) pairs by key and merges the values within each group using the supplied function.
A concrete example:
from pyspark import SparkConf, SparkContext
import os
# Tell PySpark where the Python interpreter is
os.environ['PYSPARK_PYTHON'] = "C:/Python310/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark")
sc = SparkContext(conf=conf)
# Prepare an RDD with the sample data
rdd = sc.parallelize([('男', 99), ('女', 99), ('男', 88), ('女', 66)])
# Sum the scores within the '男' (male) and '女' (female) groups
rdd2 = rdd.reduceByKey(lambda a, b: a + b)
print(rdd2.collect())
sc.stop()
Output:
[('男', 187), ('女', 165)]
Note: reduceByKey groups the tuples by their first element ('男' or '女'), then applies the given function across the second elements of all tuples within each group.
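To make the semantics concrete, here is a minimal pure-Python sketch of the same group-then-reduce logic (an illustration of what reduceByKey computes, not how Spark implements it internally):

from functools import reduce

pairs = [('男', 99), ('女', 99), ('男', 88), ('女', 66)]

# First group the values by key
groups = {}
for key, value in pairs:
    groups.setdefault(key, []).append(value)

# Then fold each group's values with the same lambda passed to reduceByKey
result = [(key, reduce(lambda a, b: a + b, values)) for key, values in groups.items()]
print(result)  # [('男', 187), ('女', 165)]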
The word file hello.txt contains:
itheima itheima itcast itheima
spark python spark python itheima
itheima itcast itcast itheima python
python python spark pyspark pyspark
itheima python pyspark itcast spark
from pyspark import SparkConf, SparkContext
import os
# 1. Tell PySpark where the Python interpreter is
os.environ['PYSPARK_PYTHON'] = "C:/Python310/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark")
sc = SparkContext(conf=conf)
# 2. Read the data file
rdd = sc.textFile("./hello.txt")
# 3. Extract all the words, un-nesting as we go: each x passed to the lambda is one
#    line (a string), x.split(" ") returns a list of words, and flatMap flattens those lists
word_rdd = rdd.flatMap(lambda x: x.split(" "))
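# (For comparison: rdd.map(lambda x: x.split(" ")) would yield one list per line,
#  e.g. [['itheima', 'itheima', ...], ...]; flatMap instead flattens those lists
#  into a single RDD of individual words.)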
# 4. Convert every word into a two-element tuple: the word as key, 1 as value
word_tuple_list = word_rdd.map(lambda word: (word, 1))
# 5. Group by key and sum the counts
result_add = word_tuple_list.reduceByKey(lambda a, b: a + b)
print(result_add.collect())
sc.stop()
Output:
[('itcast', 4), ('python', 6), ('itheima', 7), ('spark', 4), ('pyspark', 3)]
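As a sanity check, the same counts can be reproduced without Spark using only the standard library; a minimal sketch, assuming hello.txt has exactly the contents shown above:

from collections import Counter

# Read the file and split on whitespace to get the flat word list
with open("./hello.txt", "r", encoding="utf-8") as f:
    words = f.read().split()

print(Counter(words))
# Expected counts: itheima 7, python 6, itcast 4, spark 4, pyspark 3,
# matching the reduceByKey result above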