04 Spark: Key-Value RDD Transformation Operators

Key-Value RDD Transformation Operators

Table of Contents

  • Key-Value RDD Transformation Operators
    • 1. partitionBy(partitioner)
    • 2. reduceByKey(func, [numTasks])
    • 3. groupByKey()
    • 4. aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
    • 5. foldByKey(zeroValue)(func)
    • 6. combineByKey[C]
    • 7. sortByKey
    • 8. mapValues
    • 9. join(otherDataSet, [numTasks])
    • 10. cogroup(otherDataSet, [numTasks])

1. partitionBy(partitioner)

  1. Purpose: Partitions the RDD with the given partitioner. If the RDD's current partitioner is already equal to the new one, no repartitioning is done; otherwise a ShuffledRDD is created, i.e. a shuffle takes place (see the sketch in 3. below).

  2. Example:

    scala> val rdd = sc.makeRDD(Array((1,"a"), (2,"b"), (3,"c"), (4,"d")))
    rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[121] at makeRDD at <console>:24
    
    scala> rdd.partitions.length
    res59: Int = 4
    
    scala> val newRdd = rdd.partitionBy(new org.apache.spark.HashPartitioner(3))
    newRdd: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[122] at partitionBy at <console>:26
    
    scala> newRdd.partitions.length
    res60: Int = 3
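
  3. Note: partitionBy returns the original RDD unchanged when the new partitioner equals the current one; HashPartitioner instances compare equal when their number of partitions matches. A minimal sketch, reusing newRdd from the example above:

    // Calling partitionBy again with an equal partitioner does not shuffle;
    // the same RDD object is returned instead of a new ShuffledRDD.
    val again = newRdd.partitionBy(new org.apache.spark.HashPartitioner(3))
    println(again eq newRdd)   // expected: true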
    

2. reduceByKey(func, [numTasks])

  1. Purpose: Called on an RDD of (K, V) pairs, it returns an RDD of (K, V) pairs in which the values of each key are aggregated using the given reduce function. The number of reduce tasks can be set via the optional second argument (see 3. below).

  2. Example:

    scala> val rdd = sc.makeRDD(Array(("male",1), ("female",3), ("female",2), ("male",5)))
    rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[123] at makeRDD at <console>:24
    
    scala> val newRdd = rdd.reduceByKey(_+_)
    newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[124] at reduceByKey at <console>:26
    
    scala> newRdd.collect
    res61: Array[(String, Int)] = Array((female,5), (male,6))
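
  3. Note: a minimal sketch of the optional numTasks argument, reusing rdd from the example above; it sets the number of reduce tasks and therefore the number of partitions of the result:

    // The second argument controls how many partitions the shuffled result has.
    val reducedIn2 = rdd.reduceByKey(_ + _, 2)
    println(reducedIn2.partitions.length)   // expected: 2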
    

3. groupByKey()

  1. Purpose: Groups the values of each key (see the note in 3. below).

  2. Example:

    scala> val rdd1 = sc.makeRDD(Array("hello","world","hello","spark","hello","scala"))
    rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[126] at makeRDD at <console>:24
    
    scala> val rdd2 = rdd1.map((_,1))
    rdd2: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[127] at map at <console>:26
    
    scala> val rdd3 = rdd2.groupByKey
    rdd3: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[128] at groupByKey at <console>:28
    
    scala> rdd3.collect
    res62: Array[(String, Iterable[Int])] = Array((spark,CompactBuffer(1)), (scala,CompactBuffer(1)), (hello,CompactBuffer(1, 1, 1)), (world,CompactBuffer(1)))
    
    scala> val newRdd = rdd3.map(t => (t._1, t._2.sum))
    newRdd: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[129] at map at <console>:30
    
    scala> newRdd.collect
    res63: Array[(String, Int)] = Array((spark,1), (scala,1), (hello,3), (world,1))
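
  3. Note: when the grouped values are immediately reduced (as with the sum above), reduceByKey is usually preferred, because it combines values on the map side before the shuffle instead of materializing the per-key groups. A minimal sketch reusing rdd2 from the example above:

    // Same word count; values are pre-aggregated within each partition
    // before the shuffle, so less data moves across the network.
    val counts = rdd2.reduceByKey(_ + _)
    // counts.collect gives the same result as newRdd above.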
    

4. aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])

  1. Purpose: Aggregates the values of each key, using the given combine functions and a neutral zero value.

    1.1 zeroValue: the initial value given to each key within each partition.
    1.2 seqOp: the function used to fold each value into the accumulator, starting from the zero value; it merges within a single partition.
    1.3 combOp: the function used to merge the per-partition results; it merges across partitions (a walkthrough follows in 3. below).

  2. Example:

    // Create a pair RDD; take the maximum value of each key within every
    // partition, then add the per-partition maxima together
    
    scala> val rdd = sc.makeRDD(Array(("a",3),("a",2),("c",4),("b",3),("c",6),("c",8)),2)
    rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[130] at makeRDD at <console>:24
    
    scala> val newRdd = rdd.aggregateByKey(Int.MinValue)(math.max(_,_), _+_)
    newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[132] at aggregateByKey at <console>:26
    
    scala> newRdd.collect
    res65: Array[(String, Int)] = Array((b,3), (a,3), (c,12))
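
  3. Walkthrough: with 2 partitions, makeRDD typically places the first three pairs in partition 0 and the last three in partition 1, which explains the result above. A minimal sketch for inspecting the partitions, reusing rdd from the example:

    // partition 0: (a,3), (a,2), (c,4)  -> per-key max (seqOp): a=3, c=4
    // partition 1: (b,3), (c,6), (c,8)  -> per-key max (seqOp): b=3, c=8
    // combOp then adds the per-partition maxima: a=3, b=3, c=4+8=12
    rdd.glom.collect.zipWithIndex.foreach { case (part, i) =>
      println(s"partition $i: " + part.mkString(", "))
    }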
    

5. foldByKey(zeroValue)(func)

  1. Purpose: A simplified form of aggregateByKey in which seqOp and combOp are the same function (see the note in 3. below).

  2. Example:

    scala> val rdd = sc.makeRDD(Array(("a",3),("a",2),("c",4),("b",3),("c",6),("c",8)))
    rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[133] at makeRDD at <console>:24
    
    scala> val newRdd = rdd.foldByKey(0)(_+_)
    newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[134] at foldByKey at <console>:26
    
    scala> newRdd.collect
    res66: Array[(String, Int)] = Array((a,5), (b,3), (c,18)) 
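
  3. Note: with 0 as the zero value and addition as the function, foldByKey behaves like reduceByKey(_ + _). The zero value is applied once per key in each partition, so it should normally be the identity element of the function (0 for addition, 1 for multiplication). A minimal sketch reusing rdd from the example above:

    // Same aggregation expressed with reduceByKey; it yields the same
    // totals as newRdd above: (a,5), (b,3), (c,18)
    val viaReduce = rdd.reduceByKey(_ + _)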
    

6. combineByKey[C]

  1. Purpose: For each key K, merges its values of type V into a combined value of type C, yielding an RDD[(K, C)] (the three functions involved are spelled out in 3. below).

  2. Example:

    // Create a pair RDD and compute the average value of each key: first count
    // the occurrences of each key and sum its values, then divide the two.
    
    scala> val rdd = sc.makeRDD(Array(("a",88),("b",95),("a",91),("b",93),("a",95),("b",98)),2)
    rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[150] at makeRDD at <console>:24
    
    scala> val newRdd = rdd.combineByKey((_,1),(acc:(Int,Int),v) => (acc._1+v, acc._2+1), (acc1:(Int,Int), acc2:(Int,Int)) => (acc1._1+acc2._1, acc1._2+acc2._2))
    newRdd: org.apache.spark.rdd.RDD[(String, (Int, Int))] = ShuffledRDD[151] at combineByKey at <console>:26
    
    scala> newRdd.collect
    res71: Array[(String, (Int, Int))] = Array((b,(286,3)), (a,(274,3)))
    
    scala> val resRdd = newRdd.map(item=>(item._1, item._2._1.toInt / item._2._2))
    resRdd: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[152] at map at <console>:28
    
    scala> resRdd.collect
    res72: Array[(String, Int)] = Array((b,95), (a,91))
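
  3. Note: combineByKey takes three functions. A minimal sketch of the same average with the parameters named explicitly, reusing rdd from the example above; dividing as Double also avoids the truncation of the integer division used for resRdd:

    // createCombiner: builds the initial (sum, count) from the first value seen for a key
    val createCombiner = (v: Int) => (v, 1)
    // mergeValue: folds another value of the same key into (sum, count) within one partition
    val mergeValue = (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1)
    // mergeCombiners: merges the (sum, count) pairs produced by different partitions
    val mergeCombiners = (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)

    val avg = rdd.combineByKey(createCombiner, mergeValue, mergeCombiners)
      .mapValues { case (sum, count) => sum.toDouble / count }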
    

7. sortByKey

  1. Purpose: Called on an RDD of (K, V) pairs where K must implement the Ordered trait (or have an implicit Ordering in scope), it returns an RDD of (K, V) pairs sorted by key (descending order is shown in 3. below).

  2. Example:

    scala> val rdd = sc.makeRDD(Array((1,"a"),(5,"e"),(4,"c"),(2,"b"),(10,"s")))
    rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[135] at makeRDD at <console>:24
    
    scala> val newRdd = rdd.sortByKey()
    newRdd: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[138] at sortByKey at <console>:26
    
    scala> newRdd.collect
    res67: Array[(Int, String)] = Array((1,a), (2,b), (4,c), (5,e), (10,s))
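
  3. Note: a minimal sketch of descending order, reusing rdd from the example above:

    // Pass ascending = false to sort keys in descending order.
    val desc = rdd.sortByKey(ascending = false)
    // desc.collect => Array((10,s), (5,e), (4,c), (2,b), (1,a))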
    

8. mapValues

  1. Purpose: For an RDD of (K, V) pairs, applies a function to the values only, leaving the keys unchanged (see the note in 3. below).

  2. Example:

    scala> val rdd = sc.makeRDD(Array((1,"a"),(5,"e"),(4,"c"),(2,"b"),(10,"s")))
    rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[139] at makeRDD at <console>:24
    
    scala> val newRdd = rdd.mapValues("<" + _ + ">")
    newRdd: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[140] at mapValues at <console>:26
    
    scala> newRdd.collect
    res68: Array[(Int, String)] = Array((1,<a>), (5,<e>), (4,<c>), (2,<b>), (10,<s>))
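
  3. Note: because the keys cannot change, mapValues preserves the RDD's partitioner, whereas an equivalent map over the whole tuple drops it. A minimal sketch reusing rdd from the example above:

    // Produces the same pairs, but Spark can no longer assume the keys are
    // unchanged, so any existing partitioner is discarded.
    val viaMap = rdd.map { case (k, v) => (k, "<" + v + ">") }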
    

9. join(otherDataSet, [numTasks])

  1. Purpose: Inner join. Called on RDDs of type (K, V) and (K, W), it returns an RDD of (K, (V, W)) containing all pairs of elements for each key present in both RDDs (see 3. below).

  2. Example:

    scala> var rdd1 = sc.parallelize(Array((1, "a"), (1, "b"), (2, "c")))
    rdd1: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[141] at parallelize at <console>:24
    
    scala> var rdd2 = sc.parallelize(Array((1, "aa"), (3, "bb"), (2, "cc")))
    rdd2: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[142] at parallelize at <console>:24
    
    scala> var newRdd = rdd1.join(rdd2)
    newRdd: org.apache.spark.rdd.RDD[(Int, (String, String))] = MapPartitionsRDD[145] at join at <console>:28
    
    scala> newRdd.collect
    res69: Array[(Int, (String, String))] = Array((1,(a,aa)), (1,(b,aa)), (2,(c,cc)))
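
  3. Note: in the example above, key 3 exists only in rdd2 and is therefore absent from the result. A minimal sketch of keeping every key of the left side by using leftOuterJoin, where missing right-side values become None:

    // Keys of rdd1 without a match in rdd2 are kept, paired with None.
    val left = rdd1.leftOuterJoin(rdd2)   // RDD[(Int, (String, Option[String]))]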
    

10. cogroup(otherDataSet, [numTasks])

  1. Purpose: Called on RDDs of type (K, V) and (K, W), it returns an RDD of type (K, (Iterable[V], Iterable[W])) (see 3. below for how this relates to join).

  2. Example:

    scala> val rdd1 = sc.parallelize(Array((1, 10),(2, 20),(1, 100),(3, 30)),1)
    rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[146] at parallelize at <console>:24
    
    scala>  val rdd2 = sc.parallelize(Array((1, "a"),(2, "b"),(1, "aa"),(3, "c")),1)
    rdd2: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[147] at parallelize at <console>:24
    
    scala> val newRdd = rdd1.cogroup(rdd2)
    newRdd: org.apache.spark.rdd.RDD[(Int, (Iterable[Int], Iterable[String]))] = MapPartitionsRDD[149] at cogroup at <console>:28
    
    scala> newRdd.collect
    res70: Array[(Int, (Iterable[Int], Iterable[String]))] = Array((1,(CompactBuffer(10, 100),CompactBuffer(a, aa))), (3,(CompactBuffer(30),CompactBuffer(c))), (2,(CompactBuffer(20),CompactBuffer(b))))
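
  3. Note: cogroup keeps every key that appears in either RDD; a key missing from one side simply gets an empty Iterable. A minimal sketch showing that an inner join can be derived from the cogrouped result, reusing newRdd from the example above:

    // Pairing up the two Iterables per key reproduces rdd1.join(rdd2);
    // keys with an empty side contribute nothing.
    val joined = newRdd.flatMapValues { case (vs, ws) =>
      for (v <- vs; w <- ws) yield (v, w)
    }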
    
