mapValues, flatMapValues, sortByKey, combineByKey, foldByKey, groupByKey, reduceByKey, aggregateByKey, cogroup, join, leftOuterJoin, rightOuterJoin
Of course, a pair RDD can also use all of the ordinary RDD transformation operators; see https://blog.csdn.net/qq_23146763/article/details/100988127 for an introduction.
mapValues: applies a map function to every value of the pair RDD without changing the keys.
//imports used by all of the examples in this section
import org.apache.spark.{SparkConf, SparkContext}
val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val list = List(("zhangsan", 22), ("lisi", 20), ("wangwu", 23))
val rdd = sc.parallelize(list)
val mapValuesRDD = rdd.mapValues(_ + 2)
mapValuesRDD.foreach(println)
sc.stop()
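Expected output (the line order may vary because foreach runs on multiple local threads):
(zhangsan,24)
(lisi,22)
(wangwu,25)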
flatMapValues: applies a flatMap function to every value of the pair RDD without changing the keys.
val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val list = List(("zhangsan", "GD SZ"), ("lisi", "HN YY"), ("wangwu", "JS NJ"))
val rdd = sc.parallelize(list)
val flatMapValuesRDD = rdd.flatMapValues(v => v.split(" "))
flatMapValuesRDD.foreach(println)
sc.stop()
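Expected output (line order may vary): each value is split on spaces and every token is paired with the original key:
(zhangsan,GD)
(zhangsan,SZ)
(lisi,HN)
(lisi,YY)
(wangwu,JS)
(wangwu,NJ)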
1. sortByKey: if the key type has an ordering, returns an RDD of (K, V) pairs sorted by key; ascending = true sorts in ascending order, false in descending order; numPartitions sets the number of partitions to raise the job's parallelism.
2. Set the parallelism to 1 if you want a single, globally ordered output.
val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val arr = List((1, 20), (1, 10), (2, 20), (2, 10), (3, 20), (3, 10))
val rdd = sc.parallelize(arr)
val sortByKeyRDD = rdd.sortByKey(true, 1)
sortByKeyRDD.foreach(println)
sc.stop()
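Expected output with a single partition (keys in ascending order; the relative order of values that share a key is not guaranteed):
(1,20)
(1,10)
(2,20)
(2,10)
(3,20)
(3,10)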
combineByKey: merges the values of each key, and the result type may differ from the value type of the input.
combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine)
createCombiner: combiner-creation function, called the first time a key is encountered; it turns that key's V-typed value into a C-typed value (V => C), where C is the accumulator (often a collection) type.
Note: this happens the first time each key appears in each partition.
mergeValue: value-merging function, applied independently within each partition; when the same key is seen again, it merges the C-typed value produced by createCombiner with the new V-typed value into a C-typed value ((C, V) => C).
mergeCombiners: combiner-merging function, which merges the per-partition results of each key; it merges two C-typed values into one ((C, C) => C).
partitioner: an existing or custom partitioner; the default is HashPartitioner.
mapSideCombine: whether to combine on the map side; the default is true.
val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
//simple aggregation: sum the values of each key
val input = sc.parallelize(List(("coffee", 1), ("coffee", 2), ("panda", 4), ("panda", 5)))
val result = input.combineByKey(
(v: Int) => v,                          //createCombiner: the first value of a key becomes the accumulator
(acc: Int, v: Int) => acc + v,          //mergeValue: fold further values into the accumulator
(acc1: Int, acc2: Int) => acc1 + acc2   //mergeCombiners: merge the per-partition accumulators
)
result.foreach(println)
//compute the average value of each key
val result2 = input.combineByKey(
(v) => (v, 1),
(acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
(acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
).map { case (key, value) => (key, value._1 / value._2.toFloat) }
result2.collectAsMap().map(println)
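//Expected results: the simple aggregation gives (coffee,3) and (panda,9); the average gives (coffee,1.5) and (panda,4.5).
//The calls above only pass the first three parameters. A minimal sketch (not in the original example) that also
//supplies the partitioner and mapSideCombine parameters described earlier:
import org.apache.spark.HashPartitioner
val avgWithPartitioner = input.combineByKey(
(v: Int) => (v, 1),                                     //createCombiner: the first value of a key in a partition starts a (sum, count) pair
(acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),  //mergeValue: fold further values of the key into the pair
(acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2), //mergeCombiners: merge the per-partition pairs
new HashPartitioner(2),                                 //partitioner: hash the keys into 2 partitions
mapSideCombine = true                                   //combine on the map side before shuffling (the default)
)
avgWithPartitioner.mapValues { case (sum, count) => sum / count.toFloat }.foreach(println)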
sc.stop()
foldByKey, groupByKey, and reduceByKey are all ultimately implemented on top of combineByKey.
Effect of foldByKey: the values of each key are folded together with func, starting from zeroValue.
zeroValue: initializes the accumulation of V. It is implemented through combineByKey's createCombiner, which maps the first V of a key in each partition to func(zeroValue, V); in the example below this means that first value becomes 2 + V.
func: the values are then merged by key with func (implemented through combineByKey's mergeValue and mergeCombiners, which here are the same function).
val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val people = List(("Mobin", 2), ("Mobin", 1), ("Lucy", 2), ("Amy", 1), ("Lucy", 3))
val rdd = sc.parallelize(people)
val foldByKeyRDD = rdd.foldByKey(2)((_ + _))
foldByKeyRDD.foreach(println)
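//Extra check (not in the original example): zeroValue is applied once per key per partition, so the result depends on partitioning.
//A single-partition copy gives a deterministic result: (Mobin,5), (Lucy,7), (Amy,3); with local[*]'s default partitioning,
//a key whose values land in two partitions picks up the 2 twice (e.g. Mobin could come out as 7 instead of 5).
val singlePartitionRDD = sc.parallelize(people, 1).foldByKey(2)(_ + _)
singlePartitionRDD.foreach(println)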
sc.stop()
groupByKey: groups the values by key and returns an RDD of (K, Iterable[V]); numPartitions sets the number of partitions to raise the job's parallelism.
val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val seq = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 2))
val rdd = sc.parallelize(seq)
val groupRDD = rdd.groupByKey(3)
groupRDD.foreach(println)
sc.stop()
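Expected output (Spark prints the Iterable as a CompactBuffer; line order and the order of values inside each buffer may vary):
(a,CompactBuffer(1, 2, 3))
(b,CompactBuffer(1, 2))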
reduceByKey: groups the values by key and aggregates them with the given func; numPartitions sets the number of partitions to raise the job's parallelism.
val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val seq = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 2))
val rdd = sc.parallelize(seq)
val reduceRdd = rdd.reduceByKey((v1, v2) => v1 + v2, 3)
reduceRdd.foreach(println)
sc.stop()
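Expected output (line order may vary):
(a,6)
(b,3)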
1. Concept
aggregateByKey aggregates the values of each key in a pair RDD, likewise using a neutral initial value (zeroValue) during the aggregation.
2. Differences from aggregate
1. aggregate applies the zero value once per partition and once more when merging the partition results, so its result depends on the number of partitions; aggregateByKey applies the zero value only within each partition, once per key.
2. aggregate folds all of the elements in each partition (and then the partition results) into a single value; aggregateByKey aggregates the values of matching keys in a pair RDD and returns a pair RDD.
3. Explanation of the example
1. Group by key. In seq, each value is compared with the current accumulator (initialized with zeroValue) and the smaller one is kept; in comb, the per-partition results of each key are summed, so the resulting pair RDD contains one (K, V) pair per key.
val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val rdd = sc.parallelize(Seq((1, 2), (1, 3), (1, 4), (2, 5)))
def seq(a: Int, b: Int): Int = {
println("seq: " + a + "\t " + b)
math.min(a, b)
}
def comb(a: Int, b: Int): Int = {
println("comb: " + a + "\t " + b)
a + b
}
rdd.aggregateByKey(3, 2)(seq, comb).foreach(println)
sc.stop()
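Walking through the example: the second argument (2) only sets the number of output partitions, while the zero value 3 is applied per key in each input partition, so the result depends on how the input is split. If, for instance, the four pairs land in two input partitions, [(1,2),(1,3)] and [(1,4),(2,5)], then seq gives min(3,2)=2 and min(2,3)=2 for key 1 in the first partition, min(3,4)=3 for key 1 and min(3,5)=3 for key 2 in the second, and comb sums key 1's results to 2+3=5, producing:
(1,5)
(2,3)
With more input partitions (e.g. one element per partition), key 1 would be initialized with 3 in three partitions and could come out as (1,8) instead.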
cogroup: for two RDDs such as (K, V) and (K, W), first groups the elements of each RDD by key separately, then returns an RDD of (K, (Iterable[V], Iterable[W])); numPartitions sets the number of partitions to raise the job's parallelism.
val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val arr1 = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr2 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))
val rdd1 = sc.parallelize(arr1, 3)
val rdd2 = sc.parallelize(arr2, 3)
val cogroupRDD = rdd1.cogroup(rdd2)
cogroupRDD.foreach(println)
sc.stop()
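Expected output (line order may vary; the order of values inside each CompactBuffer depends on how the data was partitioned):
(A,(CompactBuffer(1, 2),CompactBuffer(A1, A2)))
(B,(CompactBuffer(2, 3),CompactBuffer(B1, B2)))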
join: inner join. Only keys that exist in both pair RDDs appear in the output. When a key has multiple values, the resulting pair RDD contains every combination of the corresponding values from the two input RDDs.
val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val arr1 = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr2 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))
val rdd1 = sc.parallelize(arr1, 3)
val rdd2 = sc.parallelize(arr2, 3)
val joinRDD = rdd1.join(rdd2)
joinRDD.foreach(println)
sc.stop()
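Expected output (line order may vary): each of a key's values on the left is paired with each of its values on the right:
(A,(1,A1))
(A,(1,A2))
(A,(2,A1))
(A,(2,A2))
(B,(2,B1))
(B,(2,B2))
(B,(3,B1))
(B,(3,B2))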
leftOuterJoin: left outer join. Every key of the source pair RDD has records in the output; values from the second RDD are wrapped in Option, so a key with no matching record in the second RDD gets None.
val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val arr1 = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr2 = List(("A", "A1"), ("C", "C1"), ("A", "A2"), ("C", "C2"))
val rdd1 = sc.parallelize(arr1, 3)
val rdd2 = sc.parallelize(arr2, 3)
val leftOuterJoinRDD = rdd1.leftOuterJoin(rdd2)
leftOuterJoinRDD.foreach(println)
sc.stop()
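Expected output (line order may vary): key C from rdd2 is dropped, and key B, which has no match in rdd2, gets None:
(A,(1,Some(A1)))
(A,(1,Some(A2)))
(A,(2,Some(A1)))
(A,(2,Some(A2)))
(B,(2,None))
(B,(3,None))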
rightOuterJoin: right outer join. Every key of the second pair RDD has records in the output; values from the source RDD are wrapped in Option, so a key with no matching record in the source RDD gets None.
val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val arr1 = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr2 = List(("A", "A1"), ("C", "C1"), ("A", "A2"), ("C", "C2"))
val rdd1 = sc.parallelize(arr1, 3)
val rdd2 = sc.parallelize(arr2, 3)
val rightOuterJoinRDD = rdd1.rightOuterJoin(rdd2)
rightOuterJoinRDD.foreach(println)
sc.stop()
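Expected output (line order may vary): key B from rdd1 is dropped, and key C, which has no match in rdd1, gets None:
(A,(Some(1),A1))
(A,(Some(1),A2))
(A,(Some(2),A1))
(A,(Some(2),A2))
(C,(None,C1))
(C,(None,C2))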