Spark's combineByKey operator is relatively hard to understand, so I am recording my analysis here for later review.
/**
 * Simplified version of combineByKeyWithClassTag that hash-partitions the resulting RDD using the
 * existing partitioner/parallelism level. This method is here for backward compatibility. It
 * does not provide combiner classtag information to the shuffle.
 *
 * @see `combineByKeyWithClassTag`
 */
def combineByKey[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C): RDD[(K, C)] = self.withScope {
  combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners)(null)
}
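The comment above says this simplified form reuses the existing partitioner/parallelism level. For comparison, here is a hedged sketch of the overload that takes an explicit partition count (the overload shape is recalled from the Spark API rather than quoted, so treat it as an assumption); the three functions themselves are explained next:

// Assumes an existing pair RDD, e.g. pairs: RDD[(String, Int)], on an active SparkContext.
val combined = pairs.combineByKey(
  (v: Int) => v + "_",                        // createCombiner
  (c: String, v: Int) => c + "@" + v,         // mergeValue
  (c1: String, c2: String) => c1 + "$" + c2,  // mergeCombiners
  2)                                          // numPartitions for the shuffled result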
First, a quick look at the three parameters:
- createCombiner, which turns a V into a C (e.g., creates a one-element list)
- mergeValue, to merge a V into a C (e.g., adds it to the end of a list)
- mergeCombiners, to combine two C's into a single one

A small driver program that tags each of the three steps with a distinct separator:

import org.apache.spark.{SparkConf, SparkContext}

object CombineByKeyJob {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local[*]")
    conf.setAppName("CombineByKeyJob")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(Array(("A", 2), ("A", 1), ("A", 3), ("B", 1), ("B", 2), ("C", 1)), 3)
    // print the number of partitions
    val partitions = rdd.getNumPartitions
    println("partitions:" + partitions)
    val collect: Array[(String, String)] = rdd.combineByKey(
      (v: Int) => v + "_",                       // createCombiner: first value of a key in a partition
      (c: String, v: Int) => c + "@" + v,        // mergeValue: further values within the same partition
      (c1: String, c2: String) => c1 + "$" + c2  // mergeCombiners: combiners from different partitions
    ).collect
    for (elem <- collect) {
      println(elem)
      // println(elem._1, "--", elem._2)
    }
  }
}
Output:
(B,1_$2_)
(C,1_)
(A,2_@1$3_)
Printing the contents of each RDD partition:
val tuples: Array[(String, List[(String, Int)])] = rdd.mapPartitionsWithIndex((index, iter) => {
  // collect this partition's elements under a "part_<index>" key
  val part_map = scala.collection.mutable.Map[String, List[(String, Int)]]()
  val part_name = "part_" + index
  while (iter.hasNext) {
    val elem = iter.next()
    if (part_map.contains(part_name)) {
      // prepend, so each list ends up in reverse iteration order
      part_map(part_name) = elem :: part_map(part_name)
    } else {
      part_map(part_name) = List(elem)
    }
  }
  part_map.iterator
}).collect()
for (elem <- tuples) {
  println(elem._1, "-->", elem._2)
}
Partition contents:
(part_0,-->,List((A,1), (A,2)))
(part_1,-->,List((B,1), (A,3)))
(part_2,-->,List((C,1), (B,2)))
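As an aside, a shorter way to inspect partition contents is RDD.glom, which groups each partition into an array; a small sketch (unlike the prepend-based version above, element order here simply follows each partition's iteration order):

// Sketch: glom() turns each partition into an Array, so partition
// boundaries become visible after collect().
rdd.glom().collect().zipWithIndex.foreach { case (part, i) =>
  println("part_" + i + " --> " + part.toList)
}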
Result analysis:
1. Key A appears in two partitions, and part_0 holds more than one A, so all three functions run: createCombiner and mergeValue within part_0, then mergeCombiners across partitions. A's result is (A,2_@1$3_).
2. Key B also appears in two partitions, but each partition holds only one B, so createCombiner and mergeCombiners run while mergeValue is never called. B's result is (B,1_$2_).
3. Key C appears in only one partition with a single element, so only createCombiner runs. C's result is (C,1_).
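To make this flow concrete, below is a minimal plain-Scala sketch of the logic (not Spark's actual implementation, which does map-side combining inside the shuffle machinery); the partition layout is hard-coded to match the one printed above:

object CombineByKeySimulation {
  def main(args: Array[String]): Unit = {
    // Partition layout hard-coded to mirror the 3-partition RDD above,
    // in iteration order (the order the elements were parallelized in).
    val partitions = Seq(
      Seq(("A", 2), ("A", 1)),
      Seq(("A", 3), ("B", 1)),
      Seq(("B", 2), ("C", 1)))

    val createCombiner = (v: Int) => v + "_"
    val mergeValue = (c: String, v: Int) => c + "@" + v
    val mergeCombiners = (c1: String, c2: String) => c1 + "$" + c2

    // Step 1: within each partition, build one combiner per key.
    val perPartition: Seq[Map[String, String]] = partitions.map { part =>
      part.foldLeft(Map.empty[String, String]) { case (acc, (k, v)) =>
        acc.get(k) match {
          case None    => acc + (k -> createCombiner(v)) // first value of k in this partition
          case Some(c) => acc + (k -> mergeValue(c, v))  // later values of k in this partition
        }
      }
    }

    // Step 2: across partitions, merge the per-key combiners.
    val merged: Map[String, String] = perPartition.flatten
      .groupBy(_._1)
      .map { case (k, kvs) => k -> kvs.map(_._2).reduce(mergeCombiners) }

    merged.foreach(println) // prints (order may vary): (A,2_@1$3_), (B,1_$2_), (C,1_)
  }
}

Running this reproduces the same strings as the Spark job, which is a handy way to double-check which of the three functions fires for each key.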