Spark RDD Exercises


  • Word count
    • Read the file
    val rdd = sc.textFile("file:///root/customer.csv")  // read a local file
    val rdd = sc.textFile("hdfs:///temp/root/customer.csv")  // read a file on HDFS
    
    • Apply a transformation to the RDD; every transformation returns a new RDD
    val rdd1 = rdd flatMap { _ split "," }    // _ is a placeholder for one line of input, which is split on commas
    or
    val rdd1 = rdd.flatMap(x => x.split(","))
    
    Data at this point:
    Array[String] = Array(Johnson, Rachel, Novato)
    
    Continue transforming:
    val rdd2 = rdd1 map { (_, 1) }
    or
    val rdd2 = rdd1.map(x => (x, 1))
    
    Data at this point:
    Array[(String, Int)] = Array((Johnson,1), (Rachel,1), (Novato,1))
    
    • Merge identical words with reduceByKey
    val rdd3 = rdd2 reduceByKey { _ + _ }
    or
    val rdd3 = rdd2.reduceByKey((x, y) => x + y)
    
    Result:
    Array[(String, Int)] = Array((Vela,1), (Blaise,2), (Ginny,1))
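
    What reduceByKey does can be dry-run without a cluster. A minimal sketch in plain Scala collections (the helper name reduceByKey here is our own stand-in, not Spark's API; groupBy plus reduce approximates Spark's per-partition merging while ignoring partitioning):

    ```scala
    object ReduceByKeyDemo {
      // Merge all values that share a key, combining them with f,
      // mimicking Spark's rdd.reduceByKey(f) on an in-memory Seq.
      def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
        pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).reduce(f)) }

      def main(args: Array[String]): Unit = {
        val counts = reduceByKey(Seq(("Blaise", 1), ("Vela", 1), ("Blaise", 1), ("Ginny", 1)))(_ + _)
        println(counts)
      }
    }
    ```

    Note that f must be associative (and, in Spark, commutative), because Spark applies it in an arbitrary order across partitions.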
    
    • Sort the results
    val rdd4 = rdd3 map { _.swap } sortByKey { false } map { _.swap }    // swap key and value, sort by count descending, swap back
    or
    val rdd4 = rdd3.map(x => x.swap).sortByKey(false).map(x => x.swap)
    
    Result:
    Array[(String, Int)] = Array((USA,5862), (Canada,1373), (Mexico,956))
    
    • Combined into a single expression
    val rdd = sc textFile { "file:///root/customer.csv" } flatMap { _ split "," } map { (_, 1) } reduceByKey { _ + _ } map { _.swap } sortByKey { false } map { _.swap }
    or
    val rdd = sc.textFile("file:///root/customer.csv").flatMap(line => line.split(",")).
      map(word => (word, 1)).reduceByKey((x, y) => x + y).
      map(e => e.swap).sortByKey(false).map(e => e.swap)
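
    The whole word-count chain can also be dry-run on plain Scala collections before touching a cluster, since flatMap and map behave the same there; below, groupBy plus sum stands in for reduceByKey, and sortBy(-count) stands in for the swap/sortByKey(false)/swap trick (a sketch, not Spark code):

    ```scala
    object WordCountSketch {
      // Simulates the RDD word-count pipeline on an in-memory Seq of CSV lines.
      def wordCount(lines: Seq[String]): Seq[(String, Int)] =
        lines
          .flatMap(_.split(","))                          // one element per field
          .map((_, 1))                                    // pair each word with 1
          .groupBy(_._1)                                  // simulate reduceByKey
          .map { case (w, ps) => (w, ps.map(_._2).sum) }
          .toSeq
          .sortBy(-_._2)                                  // descending by count

      def main(args: Array[String]): Unit =
        println(wordCount(Seq("Johnson,USA", "Rachel,USA", "Novato,Canada")))
    }
    ```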
    

  • Computing an average
    • products.csv is a file containing product data
    • Goal: compute the average price for each product category
    • Read the file
     val rdd = sc textFile { "file:///root/products.csv" }
    
    • Transform the RDD
    val rdd1 = rdd map { _ split "," }
     
    Data at this point:
    Array(Array(1, 2, Quest Q64 10 FT. x 10 FT. Slant Leg Instant U, "", 59.98, http://images.acmesports.sports/Quest+Q64+10+FT.+x+10+FT.+Slant+Leg+Instant+Up+Canopy), 
          Array(2, 2, Under Armour Men's Highlight MC Football Clea, "", 129.99, http://images.acmesports.sports/Under+Armour+Men%27s+Highlight+MC+Football+Cleat), 
          Array(3, 2, Under Armour Men's Renegade D Mid Football Cl, "", 89.99, http://images.acmesports.sports/Under+Armour+Men%27s+Renegade+D+Mid+Football+Cleat))
    

    Note: we need the second column (product category) and the fifth column (price); every element of each Array is a String.

    • Extract the second and fifth columns, convert the fifth to Double, and filter out rows whose price is empty, which would otherwise cause an error during computation
    val rdd2 = rdd1 filter { _(4).length > 0 } map { line => (line(1), (line(4).toDouble, 1)) }
    
    Data at this point:
    Array[(String, (Double, Int))] = Array((2,(59.98,1)), (2,(129.99,1)), (2,(89.99,1)))
    
    • Aggregate with reduceByKey
    val rdd3 = rdd2 reduceByKey { (x, y) => (x._1 + y._1, x._2 + y._2) }
    
    Data at this point:
    Array[(String, (Double, Int))] = Array((4,(6248.65,24)), (19,(2678.739999999999,24)))
    
    • Compute the average
    val rdd4 = rdd3 map { x => (x._1, x._2._1 / x._2._2) }
    
    Data at this point:
    Array[(String, Double)] = Array((4,260.36041666666665), (19,111.61416666666662))
    
    • Sort the results
    val rdd5 = rdd4 map { _.swap } sortByKey { false } map { _.swap }
    
    Result:
    Array[(String, Double)] = Array((10,405.1120833333332), (32,289.57291666666646))
    
    • Combined into a single expression
    val rdd = sc textFile { "file:///root/products.csv" } map { _ split "," } filter { _(4).length > 0 } map { x => (x(1), (x(4).toDouble, 1)) } reduceByKey { (x, y) => (x._1 + y._1, x._2 + y._2) } map { x => (x._1, x._2._1 / x._2._2) } map { _.swap } sortByKey { false } map { _.swap }
    or
    val rdd = sc.textFile("file:///root/products.csv").map(_.split(",")).filter(_(4).length > 0).
      map(x => (x(1), (x(4).toDouble, 1))).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).
      map(x => (x._1, x._2._1 / x._2._2)).map(_.swap).sortByKey(false).map(_.swap)
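
    The sum-and-count trick above (carry a (price, 1) pair per row, add both components per key, then divide) can likewise be dry-run on plain Scala collections; the sample rows below are made-up placeholders, and groupBy stands in for reduceByKey:

    ```scala
    object AvgPriceSketch {
      // Simulates the RDD pipeline: keep (category, price) for rows with a
      // non-empty price, sum prices and counts per key, then divide.
      def avgPrice(rows: Seq[Array[String]]): Map[String, Double] =
        rows
          .filter(_(4).nonEmpty)                        // drop rows with empty price
          .map(r => (r(1), (r(4).toDouble, 1)))         // (category, (price, 1))
          .groupBy(_._1)                                // simulate reduceByKey
          .map { case (cat, ps) =>
            val (sum, n) = ps.map(_._2).reduce((x, y) => (x._1 + y._1, x._2 + y._2))
            (cat, sum / n)
          }

      def main(args: Array[String]): Unit = {
        val rows = Seq(
          Array("1", "2", "Canopy",  "", "59.98",  "url1"),
          Array("2", "2", "Cleat",   "", "129.99", "url2"),
          Array("3", "3", "NoPrice", "", "",       "url3"))
        println(avgPrice(rows))
      }
    }
    ```

    Carrying the count alongside the sum is what makes the average correct per key; averaging averages directly (e.g. with a plain mean per partition) would be wrong when partitions hold different numbers of rows.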
    
