Two commonly used partition-level Spark operators, mapPartitions and mapPartitionsWithIndex:
val rdd = sc.parallelize(List(1, 2, 3, 4, 5, 6), 3)
// mapPartitions operator
rdd.mapPartitions { iterator =>
  // set up per-partition resources here (e.g. open a connection): this runs once per partition
  val value = 10 // placeholder resource; in practice this would be a connection, client, etc.
  iterator.map(_ * value)
}
// mapPartitionsWithIndex operator
val partitionIndex = (index: Int, iter: Iterator[Int]) => {
  iter.map(item => "index:" + index + ": value: " + item)
}
rdd.mapPartitionsWithIndex(partitionIndex, true).foreach(println(_))
/**
index:0: value: 1
index:0: value: 2
index:1: value: 3
index:1: value: 4
index:2: value: 5
index:2: value: 6
*/
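The point of mapPartitions is that per-partition setup runs once per partition rather than once per element. A minimal plain-Scala sketch of that benefit (no Spark; `openConnection` is a hypothetical stand-in for any expensive resource, and `grouped` simulates the 3 partitions):

```scala
// Count how many times "setup" runs when it is done per partition.
var connectionsOpened = 0
def openConnection(): Int => Int = {
  connectionsOpened += 1 // setup cost paid here
  x => x * 10            // the work the "resource" performs
}

val data = List(1, 2, 3, 4, 5, 6)
// Simulate 3 partitions of 2 elements each, as in the example above.
val result = data.grouped(2).flatMap { partition =>
  val connection = openConnection() // once per partition, not per element
  partition.map(connection)
}.toList
// connectionsOpened is 3 (one per partition), not 6 (one per element)
```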
3.filterByRange(lower: K, upper: K): RDD[P]: filters a key-value RDD by key range; both the lower and upper bounds are inclusive.
val rdd = sc.parallelize(List((2, 21), (9, 2), (5, 3), (6, 3), (3, 21), (10, 21)), 2)
rdd.filterByRange(3, 9).collect()
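filterByRange keeps exactly the pairs an inclusive key-range predicate would keep; a plain-Scala sketch of the same filter over the same data (no Spark required):

```scala
// Same (key, value) pairs as the RDD above.
val data = List((2, 21), (9, 2), (5, 3), (6, 3), (3, 21), (10, 21))
// Inclusive range: keys 3 and 9 survive, keys 2 and 10 are dropped.
val kept = data.filter { case (k, _) => k >= 3 && k <= 9 }
```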
4.flatMapValues[U](f: V => TraversableOnce[U]): RDD[(K, U)]: applies f to each value to produce a collection of new values, then pairs each of them with the original key.
val rdd = sc.parallelize(List((2, "a b c"), (5, "q w e"), (2, "x y z"), (6, "t y")), 2)
rdd.flatMapValues(_.split(" ")).collect()
/**
Array((2,a), (2,b), (2,c), (5,q), (5,w), (5,e), (2,x), (2,y), (2,z), (6,t), (6,y))
*/
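The same flattening can be sketched with plain Scala collections: split each value, then re-pair every piece with its key:

```scala
// Same pairs as the RDD above.
val data = List((2, "a b c"), (5, "q w e"), (2, "x y z"), (6, "t y"))
// flatMapValues semantics: one output pair per piece, key unchanged.
val flattened = data.flatMap { case (k, v) => v.split(" ").map(piece => (k, piece)) }
```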
5.combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, ...): RDD[(K, C)]: a fairly low-level operator on which groupByKey and reduceByKey are built. Before the shuffle, each partition aggregates locally by key, producing an intermediate value of type C for each key in the partition; after the shuffle, the per-partition Cs are merged key by key. Three functions drive the process: within a partition, when a key is seen for the first time, createCombiner turns its value V into a C; when the key appears again in the same partition, mergeValue folds the new V into the existing C; finally, if a key appears in two or more partitions, mergeCombiners merges that key's per-partition Cs into one.
reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]: aggregates the values of each key with func, first within each partition and then across partitions; the result has the same type as the value.
foldByKey(zeroValue: V, ...)(func: (V, V) => V): RDD[(K, V)]: implemented on top of combineByKey. Within each partition, the first value seen for a key is combined with zeroValue (the createCombiner step, V => func(zeroValue, V)); each subsequent value for that key is folded in with func (the mergeValue step); finally, the per-partition results are merged key by key with func (the mergeCombiners step).
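The three-function flow described above can be simulated with plain Scala collections. This mirrors the combineByKey contract (createCombiner on first sight, mergeValue on repeats, mergeCombiners across partitions), not Spark's actual shuffle machinery:

```scala
def createCombiner(v: Int): List[Int] = List(v)
def mergeValue(c: List[Int], v: Int): List[Int] = c :+ v
def mergeCombiners(a: List[Int], b: List[Int]): List[Int] = a ++ b

// Two "partitions" holding the same data as the example below.
val partitions = List(
  List(("a", 21), ("b1", 2), ("a", 22), ("b2", 1)),
  List(("a", 23), ("c1", 1), ("a", 24), ("c2", 1))
)
// Map side: local aggregation within each partition.
val localMaps = partitions.map { part =>
  part.foldLeft(Map.empty[String, List[Int]]) { case (acc, (k, v)) =>
    acc.get(k) match {
      case None    => acc + (k -> createCombiner(v)) // key seen for the first time
      case Some(c) => acc + (k -> mergeValue(c, v))  // key seen again in this partition
    }
  }
}
// Reduce side: merge the per-partition combiners key by key.
val combined = localMaps.reduce { (m1, m2) =>
  m2.foldLeft(m1) { case (acc, (k, c)) =>
    acc + (k -> acc.get(k).map(mergeCombiners(_, c)).getOrElse(c))
  }
}
// Only "a" appears in both partitions, so only "a" exercises mergeCombiners.
```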
/**
* Called the first time a key is seen within a partition: turns its first value V into the combiner C
*
* @param value
* @return
*/
def createCombiner(value: Int): List[Int] = {
println("create value:" + value)
List(value)
}
/**
* Called when a key reappears within a partition: folds the new V into the key's existing combiner C
*
* @param list
* @param value
* @return
*/
def mergeValue(list: List[Int], value: Int): List[Int] = {
println("merge value:" + value)
list :+ value
}
/**
 * Called when a key appears in two or more partitions: merges that key's per-partition combiners C
*
* @param a
* @param b
* @return
*/
def mergeCombiners(a: List[Int], b: List[Int]): List[Int] = {
println("a:" + a.toBuffer + "\tb:" + b.toBuffer)
a ++ b
}
// rdd data
val rdd = sc.parallelize(List(("a", 21), ("b1", 2), ("a", 22), ("b2", 1), ("a", 23), ("c1", 1), ("a", 24), ("c2", 1)), 2)
// combineByKey
val rdd2 = rdd.combineByKey(createCombiner, mergeValue, mergeCombiners)
println("combineByKey result:" + rdd2.collect().toBuffer)
// reduceByKey
val rdd3 = rdd.reduceByKey((pre: Int, after: Int) => (pre + after))
println("reduceByKey result:"+rdd3.collect().toBuffer)
// foldByKey
val rdd4= rdd.foldByKey(100)(_+_)
println("foldByKey result:"+rdd4.collect().toBuffer)
/**
create value:21
create value:2
merge value:22
create value:1
create value:23
create value:1
merge value:24
create value:1
a:ArrayBuffer(21, 22)	b:ArrayBuffer(23, 24)
combineByKey result:ArrayBuffer((b2,List(1)), (c1,List(1)), (a,List(21, 22, 23, 24)), (c2,List(1)), (b1,List(2)))
reduceByKey result:ArrayBuffer((b2,1), (c1,1), (a,90), (c2,1), (b1,1))
foldByKey result:ArrayBuffer((b2,101), (c1,101), (a,290), (c2,101), (b1,101))
*/
6.aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]: aggregates values by key, following the same pattern as the aggregate operator. Within each partition, seqOp combines zeroValue with the first value of each key and then folds in that key's remaining values; across partitions, combOp merges the per-partition results key by key, and zeroValue does not participate again.
import scala.collection.mutable.ArrayBuffer

/**
* Within a partition, folds each value into the key's zeroValue accumulator
*
* @param zeroValue
* @param value
* @return
*/
def seqOp(zeroValue: ArrayBuffer[Int], value: Int): ArrayBuffer[Int] = {
println("zeroValue:" + zeroValue + "\tvalue:" + value)
zeroValue += value
}
/**
* Merges the per-partition results key by key
*
* @param a
* @param b
* @return
*/
def combOp(a: ArrayBuffer[Int], b: ArrayBuffer[Int]): ArrayBuffer[Int] = {
println("a:" + a + "\tb:" + b)
a ++ b
}
val rdd = sc.parallelize(List(("a", 21), ("b1", 2), ("a", 22), ("b2", 1), ("a", 23), ("c1", 1), ("a", 24),("c2", 1)), 2)
val rdd2 = rdd.aggregateByKey(ArrayBuffer[Int](88))(seqOp, combOp)
println(rdd2.collect().toBuffer)
/**
zeroValue:ArrayBuffer(88) value:21
zeroValue:ArrayBuffer(88) value:2
zeroValue:ArrayBuffer(88, 21) value:22
zeroValue:ArrayBuffer(88) value:1
zeroValue:ArrayBuffer(88) value:23
zeroValue:ArrayBuffer(88) value:1
zeroValue:ArrayBuffer(88, 23) value:24
zeroValue:ArrayBuffer(88) value:1
a:ArrayBuffer(88, 21, 22) b:ArrayBuffer(88, 23, 24)
ArrayBuffer((b2,ArrayBuffer(88, 1)), (c1,ArrayBuffer(88, 1)), (a,ArrayBuffer(88, 21, 22, 88, 23, 24)), (c2,ArrayBuffer(88, 1)), (b1,ArrayBuffer(88, 2)))
Note: zeroValue is folded in once per key per partition; it takes no part in the cross-partition merge, which is why a key spanning both partitions ("a") carries two 88s.
*/
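The two-88s behavior can be reproduced without Spark: each partition starts every key from a fresh copy of zeroValue via seqOp, and the cross-partition merge (combOp) adds no further copy. A sketch under those assumptions, not Spark's implementation:

```scala
import scala.collection.mutable.ArrayBuffer

def freshZero(): ArrayBuffer[Int] = ArrayBuffer(88)          // fresh zeroValue copy per key per partition
def seqOp(acc: ArrayBuffer[Int], v: Int): ArrayBuffer[Int] = acc += v
def combOp(a: ArrayBuffer[Int], b: ArrayBuffer[Int]): ArrayBuffer[Int] = a ++ b

// Same data, same 2-partition split as the example above.
val partitions = List(
  List(("a", 21), ("b1", 2), ("a", 22), ("b2", 1)),
  List(("a", 23), ("c1", 1), ("a", 24), ("c2", 1))
)
// Within each partition: start each key from freshZero(), fold values with seqOp.
val localMaps = partitions.map { part =>
  part.foldLeft(Map.empty[String, ArrayBuffer[Int]]) { case (acc, (k, v)) =>
    acc + (k -> seqOp(acc.getOrElse(k, freshZero()), v))
  }
}
// Across partitions: merge with combOp only; no new zeroValue is introduced.
val merged = localMaps.reduce { (m1, m2) =>
  m2.foldLeft(m1) { case (acc, (k, c)) =>
    acc + (k -> acc.get(k).map(combOp(_, c)).getOrElse(c))
  }
}
// "a" appeared in both partitions, so its result contains 88 twice.
```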
7.coalesce(numPartitions: Int, shuffle: Boolean = false): RDD[T]: changes the number of partitions; no shuffle by default, so without a shuffle it can only reduce the partition count;
repartition(numPartitions: Int): RDD[T]: changes the number of partitions with a full shuffle (equivalent to coalesce(numPartitions, shuffle = true))
rdd.partitions.length // number of partitions
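The difference can be illustrated with plain Scala. This is a conceptual round-robin redistribution, not Spark's actual partitioning algorithm: growing the partition count requires spreading every element to a new partition, which is exactly what the shuffle does:

```scala
val data = List(1, 2, 3, 4, 5, 6)

// "Shuffle": assign each element to one of n partitions by round-robin on its index.
def repartitionLike(elems: List[Int], n: Int): Map[Int, List[Int]] =
  elems.zipWithIndex
    .groupBy { case (_, i) => i % n }
    .map { case (p, pairs) => p -> pairs.map(_._1) }

// Growing from the original partition count is only possible with such a shuffle;
// shuffle-free coalesce can merely merge existing partitions (shrink the count).
val grown = repartitionLike(data, 4)
```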
9.keyBy[K](f: T => K): RDD[(K, T)]: derives a key from each element with f and pairs it with the original element as the value, producing a tuple RDD
keys: RDD[K]: returns an RDD of just the keys
values: RDD[V]: returns an RDD of just the values
val rdd = sc.parallelize(List("abc", "abcd", "ab", "bcd", "bc", "bcde"), 2)
// keyBy
val rdd2 = rdd.keyBy(_.size)
println(rdd2.collect().toBuffer)
// keys
val keys = rdd2.keys
println(keys.collect().toBuffer)
// values
val values = rdd2.values
println(values.collect().toBuffer)
/**
ArrayBuffer((3,abc), (4,abcd), (2,ab), (3,bcd), (2,bc), (4,bcde))
ArrayBuffer(3, 4, 2, 3, 2, 4)
ArrayBuffer(abc, abcd, ab, bcd, bc, bcde)
*/
To be continued.