A Detailed Look at Spark's combineByKey

Spark's combineByKey operator is relatively hard to understand, so I am recording my analysis of it here for future reference.

1. Source code of Spark's combineByKey

  /**
   * Simplified version of combineByKeyWithClassTag that hash-partitions the resulting RDD using the
   * existing partitioner/parallelism level. This method is here for backward compatibility. It
   * does not provide combiner classtag information to the shuffle.
   *
   * @see `combineByKeyWithClassTag`
   */
  def combineByKey[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners)(null)
  }

First, a look at the three parameters above:

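  • Users provide three functions:
  • createCombiner, which turns a V into a C (e.g., creates a one-element list)
    This function receives the first value seen for a key within a partition. It can apply extra processing to that value (for example a type conversion) and returns the result. This step works like an initialization.
  • mergeValue, to merge a V into a C (e.g., adds it to the end of a list)
    This function merges a value V into the combiner C that already exists for the same key (initially the result of createCombiner). It runs within each partition.
  • mergeCombiners, to combine two C's into a single one.
    This function merges two combiners C that belong to the same key. It runs across partitions.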
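The division of labour between the three functions can be sketched in plain Scala, without Spark. This is only an illustrative model, not Spark's implementation: `CombineByKeySketch`, `simulate` and `averages` are hypothetical names, and each inner `Seq` stands in for one partition. The per-key average is used as the example because its combiner type C, a (sum, count) pair, differs from the value type V.

```scala
// Pure-Scala model of the combineByKey contract (no Spark required).
// All names here are hypothetical and exist only for illustration.
object CombineByKeySketch {

  def simulate[K, V, C](
      partitions: Seq[Seq[(K, V)]],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {

    // Step 1: inside each partition, build one combiner per key.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.get(k) match {
          // First value of this key in this partition: initialise the combiner.
          case None    => acc.updated(k, createCombiner(v))
          // Key already seen in this partition: fold the value in.
          case Some(c) => acc.updated(k, mergeValue(c, v))
        }
      }
    }

    // Step 2: merge the per-partition combiners of the same key.
    perPartition.foldLeft(Map.empty[K, C]) { (acc, partial) =>
      partial.foldLeft(acc) { case (merged, (k, c)) =>
        merged.get(k) match {
          case None       => merged.updated(k, c)
          case Some(prev) => merged.updated(k, mergeCombiners(prev, c))
        }
      }
    }
  }

  // Example: per-key average, with C = (sum, count).
  def averages(partitions: Seq[Seq[(String, Int)]]): Map[String, Double] =
    simulate[String, Int, (Int, Int)](
      partitions,
      v => (v, 1),                          // createCombiner
      (c, v) => (c._1 + v, c._2 + 1),       // mergeValue
      (a, b) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners
    ).map { case (k, (sum, count)) => (k, sum.toDouble / count) }
}
```

Calling `CombineByKeySketch.averages` on three partitions holding (A,2),(A,1) / (A,3),(B,1) / (B,2),(C,1) gives 2.0 for A, 1.5 for B and 1.0 for C. In this model mergeValue is only reached when a key appears more than once in the same partition, and mergeCombiners only when a key appears in more than one partition.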

2. A concrete example:

 import org.apache.spark.{SparkConf, SparkContext}

 def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local[*]")
    conf.setAppName("CombineByKeyJob")
    val sc = new SparkContext(conf)
    // Six key-value pairs distributed over 3 partitions
    val rdd = sc.parallelize(Array(("A",2),("A",1),("A",3),("B",1),("B",2),("C",1)),3)
    // Print the number of partitions
    val partitions = rdd.getNumPartitions
    println("partitions:"+partitions)
    val collect: Array[(String, String)] = rdd.combineByKey(
      (v: Int) => v.toString + "_",              // createCombiner: first value of a key in a partition
      (c: String, v: Int) => c + "@" + v,        // mergeValue: within the same partition
      (c1: String, c2: String) => c1 + "$" + c2  // mergeCombiners: across partitions
    ).collect
    for (elem <- collect) {
      println(elem)
    }
  }

Output:
(B,1_$2_)
(C,1_)
(A,2_@1$3_)
Printing the contents of each RDD partition:

 val tuples: Array[(String, List[(String, Int)])] = rdd.mapPartitionsWithIndex((index, iter) => {
      // Gather every element of this partition under a single "part_<index>" key.
      // Prepending with :: leaves the list in reverse iteration order.
      val elems = iter.foldLeft(List.empty[(String, Int)])((acc, elem) => elem :: acc)
      // An empty partition produces no entry at all.
      if (elems.isEmpty) Iterator.empty else Iterator(("part_" + index, elems))
    }).collect()
    for (elem <- tuples) {
      println((elem._1, "-->", elem._2))
    }

Contents of the RDD partitions (each list is in reverse order because elements are prepended):
(part_0,-->,List((A,1), (A,2)))
(part_1,-->,List((B,1), (A,3)))
(part_2,-->,List((C,1), (B,2)))

Output of combineByKey (repeated for comparison):
(B,1_$2_)
(C,1_)
(A,2_@1$3_)

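Analysis of the results
1. The elements with key A are spread over two partitions, so all three functions are executed. In part_0, createCombiner(2) yields "2_" and mergeValue("2_", 1) yields "2_@1". In part_1, createCombiner(3) yields "3_". mergeCombiners then joins the two partial results, so the result for A is (A,2_@1$3_).
2. The elements with key B are also spread over two partitions, but each partition holds only one of them. createCombiner and mergeCombiners are executed, while mergeValue is never called. The result for B is (B,1_$2_).
3. C exists in only one partition, so only createCombiner is executed. The result for C is (C,1_).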
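The three cases can be checked with a small Spark-free trace. This is a sketch under stated assumptions: `CombineTrace` and `run` are hypothetical names, the partition layout is copied by hand from the printed partitions (in iteration order, i.e. the printed lists reversed), and partitions are merged in index order, which a real shuffle does not guarantee.

```scala
import scala.collection.mutable

// Hypothetical helper that replays the example and records which of the
// three functions is invoked for each key. Not part of Spark.
object CombineTrace {
  // Partition layout of the example, in iteration order (assumed from the printout).
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq(("A", 2), ("A", 1)),
    Seq(("A", 3), ("B", 1)),
    Seq(("B", 2), ("C", 1))
  )

  // Returns (combined result per key, function names called per key, in order).
  def run(): (Map[String, String], Map[String, List[String]]) = {
    val calls = mutable.Map.empty[String, List[String]]
    def log(k: String, fn: String): Unit = {
      calls(k) = calls.getOrElse(k, Nil) :+ fn
    }

    // Within each partition: createCombiner for a new key, mergeValue otherwise.
    val partials: Seq[Map[String, String]] = partitions.map { part =>
      part.foldLeft(Map.empty[String, String]) { case (acc, (k, v)) =>
        acc.get(k) match {
          case None =>
            log(k, "createCombiner")
            acc.updated(k, v.toString + "_")
          case Some(c) =>
            log(k, "mergeValue")
            acc.updated(k, c + "@" + v)
        }
      }
    }

    // Across partitions: mergeCombiners for keys present in more than one partition.
    val result = partials.foldLeft(Map.empty[String, String]) { (acc, partial) =>
      partial.foldLeft(acc) { case (merged, (k, c)) =>
        merged.get(k) match {
          case None => merged.updated(k, c)
          case Some(prev) =>
            log(k, "mergeCombiners")
            merged.updated(k, prev + "$" + c)
        }
      }
    }
    (result, calls.toMap)
  }
}
```

Under these assumptions, `CombineTrace.run()` reproduces the three results above. The trace also shows that for key A createCombiner is called twice, once in each partition that contains A, before mergeCombiners joins the two partial strings.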
