[Spark Starter Project] Computing the Average, Maximum, and Minimum Height of Male and Female Students

Project requirements

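Compute the average, maximum, and minimum height separately for male and female students. Each record has the format (ID, sex, height), for example: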

1 M 174
2 F 165
3 M 180
4 M 176
5 F 160
6 M 188
7 F 170

Workflow

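  1. Initialize the Spark configuration.
  2. Read the height data file with the textFile method.
  3. Each element of the resulting RDD is one line of the text file. Use map to turn each line into a key-value pair, where the key is the sex on that line and the value is the height.
  4. Use filter to split the pairs into the female and male heights, heightsFemale and heightsMale.
  5. To compute the averages, use count() to get the number of female and male students.
  6. Use reduceByKey to aggregate the heights for each sex, with a different aggregation function for each statistic: a sum (divided by the count from step 5) for the average, max for the maximum, and min for the minimum.

from pyspark import SparkContext, SparkConf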

# set sparkcontext
conf = SparkConf().setMaster("local[*]").setAppName("My App")
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")

def split(x):
    heights = x.split()
    # sex height
    return (heights[1], int(heights[2]))

# Compute the count, average, max, and min height for females and males
rdd = sc.textFile('data/height')
heights = rdd.map(split)

heightsFemale = heights.filter(lambda x: x[0] == 'F').cache()
numsFemale = heightsFemale.count()
avgFemale = heightsFemale.reduceByKey(
    lambda x, y: x+y).collect()[0][1]/numsFemale
maxFemale = heightsFemale.reduceByKey(lambda x, y: max(x, y)).collect()[0]
minFemale = heightsFemale.reduceByKey(lambda x, y: min(x, y)).collect()[0]

heightsMale = heights.filter(lambda x: x[0] == 'M').cache()
numsMale = heightsMale.count()
avgMale = heightsMale.reduceByKey(
    lambda x, y: x+y).collect()[0][1]/numsMale
maxMale = heightsMale.reduceByKey(lambda x, y: max(x, y)).collect()[0]
minMale = heightsMale.reduceByKey(lambda x, y: min(x, y)).collect()[0]

print('Female:')
print('total people', numsFemale)
print('avg height', avgFemale)
print('max height', maxFemale)
print('min height', minFemale)

print()

print('Male:')
print('total people', numsMale)
print('avg height', avgMale)
print('max height', maxMale)
print('min height', minMale)

# stop spark
sc.stop()
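
The code above runs a separate job for the count, the sum, the maximum, and the minimum of each sex. One possible alternative is to fold all four statistics into a single accumulator per key. The sketch below is plain Python so that it runs without Spark: the names ZERO, seq_op, comb_op, and summarize are hypothetical, and the two lists only stand in for RDD partitions.

```python
# Accumulator per sex: (count, total, max, min). Hypothetical layout.
ZERO = (0, 0, float("-inf"), float("inf"))

def seq_op(acc, height):
    """Fold one height into a partition-local accumulator."""
    count, total, highest, lowest = acc
    return (count + 1, total + height, max(highest, height), min(lowest, height))

def comb_op(a, b):
    """Merge two accumulators that came from different partitions."""
    return (a[0] + b[0], a[1] + b[1], max(a[2], b[2]), min(a[3], b[3]))

def summarize(partitions):
    """Simulate an aggregate-by-key over lists of (sex, height) pairs."""
    merged = {}
    for part in partitions:
        local = {}
        for sex, height in part:
            local[sex] = seq_op(local.get(sex, ZERO), height)
        for sex, acc in local.items():
            merged[sex] = comb_op(merged.get(sex, ZERO), acc)
    # Turn (count, total, max, min) into readable statistics
    return {sex: {"count": c, "avg": t / c, "max": hi, "min": lo}
            for sex, (c, t, hi, lo) in merged.items()}

# Sample data from the article, split into two pretend partitions
partitions = [
    [("M", 174), ("F", 165), ("M", 180), ("M", 176)],
    [("F", 160), ("M", 188), ("F", 170)],
]
stats = summarize(partitions)
print(stats["F"])
print(stats["M"])
```

With PySpark, the same two functions could presumably be passed as heights.aggregateByKey(ZERO, seq_op, comb_op), which would yield one accumulator per sex in a single pass instead of the separate reduceByKey jobs.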

Output

Female:
total people 3
avg height 165.0
max height ('F', 170)
min height ('F', 160)

Male:
total people 4
avg height 179.5
max height ('M', 188)
min height ('M', 174)
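
Note that the max and min lines print (key, value) tuples because collect()[0] returns the whole pair, while the average indexes into the pair with [0][1]. The figures themselves can be cross-checked on the sample data without Spark. The following is a minimal sketch using only built-ins; the helper name check is hypothetical.

```python
# Sample records from the article: (ID, sex, height)
records = [
    (1, "M", 174), (2, "F", 165), (3, "M", 180), (4, "M", 176),
    (5, "F", 160), (6, "M", 188), (7, "F", 170),
]

def check(sex):
    """Return (count, average, max, min) for one sex using built-ins only."""
    values = [h for _, s, h in records if s == sex]
    return len(values), sum(values) / len(values), max(values), min(values)

print("Female:", check("F"))  # (3, 165.0, 170, 160)
print("Male:", check("M"))    # (4, 179.5, 188, 174)
```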
