Spark RDD Exercises


  • Word count
    • Read the file
    val rdd = sc.textFile("file:///root/customer.csv")  // read a local file
    val rdd = sc.textFile("hdfs:///temp/root/customer.csv")  // read a file on HDFS
    
    • Apply a transformation to the RDD; every transformation returns a new RDD
    val rdd1 = rdd flatMap { _ split "," }    // _ is a placeholder for one line of input, which is split on commas
    or
    val rdd1 = rdd.flatMap(x => x.split(","))
    
    Data at this point:
    Array[String] = Array(Johnson, Rachel, Novato)
    
    Continue transforming:
    val rdd2 = rdd1 map { (_, 1) }
    or
    val rdd2 = rdd1.map(x => (x, 1))
    
    Data at this point:
    Array[(String, Int)] = Array((Johnson,1), (Rachel,1), (Novato,1))
    
    • Merge identical words with reduceByKey
    val rdd3 = rdd2 reduceByKey { _ + _ }
    or
    val rdd3 = rdd2.reduceByKey((x, y) => x + y)
    
    Result:
    Array[(String, Int)] = Array((Vela,1), (Blaise,2), (Ginny,1))
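
    What reduceByKey does can be dry-run without a cluster. A minimal sketch in plain Scala collections (the helper name reduceByKey here is our own stand-in, not Spark's API; groupBy plus reduce approximates Spark's per-partition merging while ignoring partitioning):

    ```scala
    object ReduceByKeyDemo {
      // Merge all values that share a key, combining them with f,
      // mimicking Spark's rdd.reduceByKey(f) on an in-memory Seq.
      def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
        pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).reduce(f)) }

      def main(args: Array[String]): Unit = {
        val counts = reduceByKey(Seq(("Blaise", 1), ("Vela", 1), ("Blaise", 1), ("Ginny", 1)))(_ + _)
        println(counts)
      }
    }
    ```

    Note that f must be associative (and, in Spark, commutative), because Spark applies it in an arbitrary order across partitions.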
    
    • Sort the results
    val rdd4 = rdd3 map { _.swap } sortByKey { false } map { _.swap }    // swap key and value, sort by count descending, swap back
    or
    val rdd4 = rdd3.map(x => x.swap).sortByKey(false).map(x => x.swap)
    
    Result:
    Array[(String, Int)] = Array((USA,5862), (Canada,1373), (Mexico,956))
    
    • Combined into a single expression
    val rdd = sc textFile { "file:///root/customer.csv" } flatMap { _ split "," } map { (_, 1) } reduceByKey { _ + _ } map { _.swap } sortByKey { false } map { _.swap }
    or
    val rdd = sc.textFile("file:///root/customer.csv").flatMap(line => line.split(",")).
      map(word => (word, 1)).reduceByKey((x, y) => x + y).
      map(e => e.swap).sortByKey(false).map(e => e.swap)
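
    The whole word-count chain can also be dry-run on plain Scala collections before touching a cluster, since flatMap and map behave the same there; below, groupBy plus sum stands in for reduceByKey, and sortBy(-count) stands in for the swap/sortByKey(false)/swap trick (a sketch, not Spark code):

    ```scala
    object WordCountSketch {
      // Simulates the RDD word-count pipeline on an in-memory Seq of CSV lines.
      def wordCount(lines: Seq[String]): Seq[(String, Int)] =
        lines
          .flatMap(_.split(","))                          // one element per field
          .map((_, 1))                                    // pair each word with 1
          .groupBy(_._1)                                  // simulate reduceByKey
          .map { case (w, ps) => (w, ps.map(_._2).sum) }
          .toSeq
          .sortBy(-_._2)                                  // descending by count

      def main(args: Array[String]): Unit =
        println(wordCount(Seq("Johnson,USA", "Rachel,USA", "Novato,Canada")))
    }
    ```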
    

  • Computing an average
    • products.csv is a file containing product data
    • Goal: compute the average price for each product category
    • Read the file
     val rdd = sc textFile { "file:///root/products.csv" }
    
    • Transform the RDD
    val rdd1 = rdd map { _ split "," }
     
    Data at this point:
    Array(Array(1, 2, Quest Q64 10 FT. x 10 FT. Slant Leg Instant U, "", 59.98, http://images.acmesports.sports/Quest+Q64+10+FT.+x+10+FT.+Slant+Leg+Instant+Up+Canopy), 
          Array(2, 2, Under Armour Men's Highlight MC Football Clea, "", 129.99, http://images.acmesports.sports/Under+Armour+Men%27s+Highlight+MC+Football+Cleat), 
          Array(3, 2, Under Armour Men's Renegade D Mid Football Cl, "", 89.99, http://images.acmesports.sports/Under+Armour+Men%27s+Renegade+D+Mid+Football+Cleat))
    

    Note: we need the second column (product category) and the fifth column (price); every element of each Array is a String.

    • Extract the second and fifth columns, convert the fifth to Double, and filter out rows whose price is empty, which would otherwise cause an error during computation
    val rdd2 = rdd1 filter { _(4).length > 0 } map { line => (line(1), (line(4).toDouble, 1)) }
    
    Data at this point:
    Array[(String, (Double, Int))] = Array((2,(59.98,1)), (2,(129.99,1)), (2,(89.99,1)))
    
    • Aggregate with reduceByKey
    val rdd3 = rdd2 reduceByKey { (x, y) => (x._1 + y._1, x._2 + y._2) }
    
    Data at this point:
    Array[(String, (Double, Int))] = Array((4,(6248.65,24)), (19,(2678.739999999999,24)))
    
    • Compute the average
    val rdd4 = rdd3 map { x => (x._1, x._2._1 / x._2._2) }
    
    Data at this point:
    Array[(String, Double)] = Array((4,260.36041666666665), (19,111.61416666666662))
    
    • Sort the results
    val rdd5 = rdd4 map { _.swap } sortByKey { false } map { _.swap }
    
    Result:
    Array[(String, Double)] = Array((10,405.1120833333332), (32,289.57291666666646))
    
    • Combined into a single expression
    val rdd = sc textFile { "file:///root/products.csv" } map { _ split "," } filter { _(4).length > 0 } map { x => (x(1), (x(4).toDouble, 1)) } reduceByKey { (x, y) => (x._1 + y._1, x._2 + y._2) } map { x => (x._1, x._2._1 / x._2._2) } map { _.swap } sortByKey { false } map { _.swap }
    or
    val rdd = sc.textFile("file:///root/products.csv").map(_.split(",")).filter(_(4).length > 0).
      map(x => (x(1), (x(4).toDouble, 1))).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).
      map(x => (x._1, x._2._1 / x._2._2)).map(_.swap).sortByKey(false).map(_.swap)
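
    The sum-and-count trick above (carry a (price, 1) pair per row, add both components per key, then divide) can likewise be dry-run on plain Scala collections; the sample rows below are made-up placeholders, and groupBy stands in for reduceByKey:

    ```scala
    object AvgPriceSketch {
      // Simulates the RDD pipeline: keep (category, price) for rows with a
      // non-empty price, sum prices and counts per key, then divide.
      def avgPrice(rows: Seq[Array[String]]): Map[String, Double] =
        rows
          .filter(_(4).nonEmpty)                        // drop rows with empty price
          .map(r => (r(1), (r(4).toDouble, 1)))         // (category, (price, 1))
          .groupBy(_._1)                                // simulate reduceByKey
          .map { case (cat, ps) =>
            val (sum, n) = ps.map(_._2).reduce((x, y) => (x._1 + y._1, x._2 + y._2))
            (cat, sum / n)
          }

      def main(args: Array[String]): Unit = {
        val rows = Seq(
          Array("1", "2", "Canopy",  "", "59.98",  "url1"),
          Array("2", "2", "Cleat",   "", "129.99", "url2"),
          Array("3", "3", "NoPrice", "", "",       "url3"))
        println(avgPrice(rows))
      }
    }
    ```

    Carrying the count alongside the sum is what makes the average correct per key; averaging averages directly (e.g. with a plain mean per partition) would be wrong when partitions hold different numbers of rows.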
    
