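Computing an average
- Input: products.csv, a file containing product data
- Goal: compute the average price of the products in each category
- Read the file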
val rdd = sc textFile { "file:///root/products.csv" }
val rdd1 = rdd map { _ split "," }
Data format:
Array(Array(1, 2, Quest Q64 10 FT. x 10 FT. Slant Leg Instant U, "", 59.98, http://images.acmesports.sports/Quest+Q64+10+FT.+x+10+FT.+Slant+Leg+Instant+Up+Canopy),
Array(2, 2, Under Armour Men's Highlight MC Football Clea, "", 129.99, http://images.acmesports.sports/Under+Armour+Men%27s+Highlight+MC+Football+Cleat),
Array(3, 2, Under Armour Men's Renegade D Mid Football Cl, "", 89.99, http://images.acmesports.sports/Under+Armour+Men%27s+Renegade+D+Mid+Football+Cleat))
Note: the second column (product category) and the fifth column (price) are the data we need. Every element of each Array is a String.
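As a quick check of how one line turns into fields, the sketch below splits a single sample line in plain Scala, without Spark. The line is rebuilt from the first record shown above, on the assumption that the empty fourth field appears in the file as nothing between two commas. The names sampleLine, fields, category and priceText are illustrative only.

```scala
// Sample line rebuilt from the first record above (assumption: the empty
// fourth field is written as nothing between two commas).
val sampleLine =
  "1,2,Quest Q64 10 FT. x 10 FT. Slant Leg Instant U,,59.98," +
    "http://images.acmesports.sports/Quest+Q64+10+FT.+x+10+FT.+Slant+Leg+Instant+Up+Canopy"

// split(",") returns Array[String]; every field is still text at this point.
val fields: Array[String] = sampleLine.split(",")

val category: String = fields(1)  // second column: product category
val priceText: String = fields(4) // fifth column: price, still a String

println(s"fields=${fields.length} category=$category priceText=$priceText")
```

Empty fields in the middle of a line are kept by split, so the column positions stay stable. Only trailing empty fields are dropped.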
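- Extract the second and fifth columns and convert the fifth to Double. Rows with an empty price must be filtered out first, otherwise toDouble throws a NumberFormatException when the job runs
val rdd2 = rdd1 filter { _(4).length > 0 } map { line => (line(1), (line(4).toDouble, 1)) }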
Data format:
Array[(String, (Double, Int))] = Array((2,(59.98,1)), (2,(129.99,1)), (2,(89.99,1)))
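The filter-then-map step can be tried on an ordinary Scala collection. This is a minimal sketch, not the Spark job itself: sampleRows is a hand-made stand-in for the split lines, and its second row (with an empty price) is hypothetical, added only to show the filter at work.

```scala
// Already-split rows. The second row is hypothetical and has an empty price.
val sampleRows: Seq[Array[String]] = Seq(
  Array("1", "2", "Quest Q64 10 FT. x 10 FT. Slant Leg Instant U", "", "59.98"),
  Array("0", "2", "hypothetical product without a price", "", ""),
  Array("2", "2", "Under Armour Men's Highlight MC Football Clea", "", "129.99")
)

val samplePairs: Seq[(String, (Double, Int))] =
  sampleRows
    .filter(_(4).nonEmpty)                // drop rows whose price is empty
    .map(r => (r(1), (r(4).toDouble, 1))) // (category, (price, count of 1))

samplePairs.foreach(println)
```

Pairing each price with a count of 1 is what later allows the sum and the number of items to be accumulated together in a single pass.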
- Sum the prices and the counts for each category
val rdd3 = rdd2 reduceByKey { (x, y) => (x._1 + y._1, x._2 + y._2) }
Data format:
Array[(String, (Double, Int))] = Array((4,(6248.65,24)), (19,(2678.739999999999,24)))
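To see what the merge function does to the (price, count) pairs, here is a local sketch that imitates reduceByKey with groupBy and reduce on a plain collection. It uses the three sample pairs shown earlier. The names merge, pricePairs and totals are illustrative, and the groupBy approach is only a stand-in for Spark's distributed implementation.

```scala
val pricePairs: Seq[(String, (Double, Int))] =
  Seq(("2", (59.98, 1)), ("2", (129.99, 1)), ("2", (89.99, 1)))

// The same merge logic that is passed to reduceByKey:
// add the prices together and add the counts together.
def merge(x: (Double, Int), y: (Double, Int)): (Double, Int) =
  (x._1 + y._1, x._2 + y._2)

// Local stand-in for reduceByKey: group by key, then reduce each group's values.
val totals: Map[String, (Double, Int)] =
  pricePairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).reduce(merge)) }

totals.foreach(println)
```

Because merge is associative and commutative, partial sums can be combined in any order, which is the property reduceByKey relies on.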
- Divide each category's total price by its count to get the average
val rdd4 = rdd3 map { x => (x._1, x._2._1 / x._2._2) }
Data format:
Array[(String, Double)] = Array((4,260.36041666666665), (19,111.61416666666662))
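The division can be checked on the totals shown above for category 4. The sketch below uses a pattern match instead of the x._2._1 accessors, which some readers may find easier to follow. It is an alternative spelling for illustration, not the form used in this walkthrough.

```scala
// Totals for category 4, taken from the data format shown earlier.
val categoryTotals: Seq[(String, (Double, Int))] = Seq(("4", (6248.65, 24)))

// sum is a Double and count is an Int, so sum / count is a Double division.
val categoryAverages: Seq[(String, Double)] =
  categoryTotals.map { case (category, (sum, count)) => (category, sum / count) }

categoryAverages.foreach(println)
```

Keeping the running sum as a Double matters here: if both operands were Int, the division would truncate.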
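- Sort by average price in descending order: swap each pair so that the average becomes the key, sort by key, then swap back
val rdd5 = rdd4 map { _.swap } sortByKey { false } map { _.swap }
Result: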
Array[(String, Double)] = Array((10,405.1120833333332), (32,289.57291666666646))
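The swap, sort, swap pattern can be compared with sorting directly on the second element. The sketch below does both on a local collection, using the four averages that appear in the outputs above. On an RDD, sortBy(_._2, ascending = false) would be the analogous direct form. The names sampleAverages, viaSwap and direct are illustrative.

```scala
val sampleAverages: Seq[(String, Double)] = Seq(
  ("4", 260.36041666666665),
  ("19", 111.61416666666662),
  ("10", 405.1120833333332),
  ("32", 289.57291666666646)
)

// Mirror of the RDD version: make the average the key, sort descending, swap back.
val viaSwap: Seq[(String, Double)] =
  sampleAverages.map(_.swap).sortBy(_._1)(Ordering[Double].reverse).map(_.swap)

// Direct form: sort on the second element without swapping.
val direct: Seq[(String, Double)] =
  sampleAverages.sortBy(_._2)(Ordering[Double].reverse)

direct.foreach(println)
```

Both forms give the same ordering. The direct form avoids two extra map passes.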
- The whole computation as a single chained expression
val rdd = sc.textFile("file:///root/products.csv").map(_.split(",")).filter(_(4).length > 0).
  map(x => (x(1), (x(4).toDouble, 1))).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).
  map(x => (x._1, x._2._1 / x._2._2)).map(_.swap).sortByKey(false).map(_.swap)