1. map:
Applies a function to every element of the RDD.
val array = Array(1, 2, 3, 4, 5)
val rdd = sc.parallelize(array).map(r => 2*r)
2. filter:
Keeps only the elements that satisfy the predicate.
val array = Array(1, 2, 3, 4, 5)
val rdd = sc.parallelize(array).filter(r => r > 2)
3. flatMap:
Flattens nested collections: each input element is mapped to a sequence, and the sequences are concatenated into a single RDD.
val array = Array(Array(1,2), Array(3,4), Array(5,6))
sc.parallelize(array).flatMap(r => r)
4. reduceByKey:
Like a GROUP BY followed by an aggregation: values that share the same key (the first element of each pair) are combined.
val array = Array(Array(1,2), Array(3,4), Array(5,6))
sc.parallelize(array).flatMap(r => r).map(r => (r, 1)).reduceByKey((a, b) => a + b)
5. reduce:
Use reduce when each record is a single value rather than a key-value pair; it folds the entire RDD down to one result.
val array = Array(Array(1,2), Array(3,4), Array(5,6))
sc.parallelize(array).flatMap(r => r).reduce((a, b) => a + b)
6. mapPartitions:
Processes the RDD one partition at a time: the outer function is invoked once per partition, and the inner logic iterates over that partition's rows.
val array = Array(Array(1,2), Array(3,4), Array(5,6))
val rdd = sc.parallelize(array).flatMap(r => r).mapPartitions(rows => rows.map(row => 3*row))
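The example above could just as well use map; the practical benefit of mapPartitions is amortizing per-partition setup cost. A minimal sketch, assuming some resource that is expensive to build (the parser val here is only a stand-in):
val rdd2 = sc.parallelize(Seq("1", "2", "3")).mapPartitions { rows =>
  val parser = (s: String) => s.toInt  // stand-in for an expensive, reusable resource
  rows.map(parser)                     // built once per partition, reused for every row
}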
7. mapPartitionsWithIndex:
Use this when different partitions need different handling; the partition index is passed in alongside the row iterator.
val array = Array(Array(1,2), Array(3,4), Array(5,6))
sc.parallelize(array).flatMap(r => r).mapPartitionsWithIndex((index, rows) => rows.map(row => {if(index == 2) 3*row else 4*row}))
8. sample:
Draws a random sample from the RDD; the arguments are withReplacement, the expected fraction of the data to keep, and a random seed.
val array = Array(Array(1,2), Array(3,4), Array(5,6))
val rdd = sc.parallelize(array).flatMap(r => r).sample(false, 0.5, 10000)
9. union:
Merges two RDDs into one (duplicates are kept).
val array = Array(Array(1,2), Array(3,4), Array(5,6))
val rdd = sc.parallelize(array).flatMap(r => r).union(sc.parallelize(Array(1, 2, 3, 10, 20, 30)))
10. intersection:
Returns the elements common to both RDDs.
val array = Array(Array(1,2), Array(3,4), Array(5,6))
sc.parallelize(array).flatMap(r => r).intersection(sc.parallelize(Array(1, 2, 3, 10, 20, 30)))
11. distinct:
Removes duplicate elements.
val array = Array(Array(1,2), Array(3,4), Array(5,6))
val rdd = sc.parallelize(array).flatMap(r => r).distinct
12. groupByKey:
Returns pairs whose second element is an Iterable of all the values for that key; the Iterable can then be processed further. For plain aggregations, reduceByKey is usually preferable because it combines values map-side before the shuffle.
val array = Array(Array(1,2), Array(3,4), Array(5,6))
val rdd1 = sc.parallelize(array).flatMap(r => r).map(r => (r, 1)).groupByKey()
val rdd2 = sc.parallelize(array).flatMap(r => r).map(r => (r, 1)).groupByKey().map(r => (r._1, r._2.toArray.sum)).collect
13. aggregateByKey:
var data = sc.parallelize(List((1,3),(1,2),(1, 4),(2,3)))
def seq(a: Int, b: Int): Int = {
  println("seq: " + a + "\t " + b)
  math.max(a, b)
}
def comb(a: Int, b: Int): Int = {
  println("comb: " + a + "\t " + b)
  a + b
}
data.aggregateByKey(1)(seq, comb).collect
The code above executes as follows: within each partition, seq folds each key's values starting from the zero value 1, keeping the maximum; comb then merges the per-partition partial results for the same key by adding them. In short, seq aggregates values inside a partition and comb combines results across partitions.
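As a concrete illustration (assuming all four pairs land in a single partition): for key 1, seq folds the zero value 1 with 3, 2, and 4, keeping the maximum, 4; for key 2 it yields 3; comb is never invoked, so collect returns Array((1,4), (2,3)). With several partitions, comb would add the per-partition maxima, so the result depends on how the data is partitioned.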
14. sortByKey:
Sorts the RDD's pairs by key, in either ascending or descending order.
val array = Array(Array(1,2), Array(3,4), Array(5,6))
sc.parallelize(array).flatMap(r => r).map(r => (r, 1)).sortByKey().collect()
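sortByKey sorts ascending by default; pass false to sort in descending order:
sc.parallelize(array).flatMap(r => r).map(r => (r, 1)).sortByKey(false).collect()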
15. join:
Joins the values that share the same key: (K, V) and (K, W) => (K, (V, W)).
val array = Array(Array(1,2), Array(3,4), Array(5,6))
val rdd = sc.parallelize(array).flatMap(r => r).map(r => (r, 1)).join(sc.parallelize(Array(1,2,3,4)).map(r => (r, 2*r))).collect
16. cogroup:
Groups the values from both RDDs by key: (K, V) and (K, W) => (K, (Iterable[V], Iterable[W])).
val array = Array(Array(1,2), Array(3,4), Array(5,6))
val rdd = sc.parallelize(array).flatMap(r => r).map(r => (r, 1)).cogroup(sc.parallelize(Array(1,2,3,4)).map(r => (r, 2*r))).collect
Returns (element order may vary):
Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(CompactBuffer(1),CompactBuffer(2))), (2,(CompactBuffer(1),CompactBuffer(4))), (3,(CompactBuffer(1),CompactBuffer(6))), (4,(CompactBuffer(1),CompactBuffer(8))), (5,(CompactBuffer(1),CompactBuffer())), (6,(CompactBuffer(1),CompactBuffer())))
17. cartesian:
Returns the Cartesian product of the two RDDs.
val array = Array(Array(1,2), Array(3,4), Array(5,6))
val rdd = sc.parallelize(array).flatMap(r => r).map(r => (r, 1)).cartesian(sc.parallelize(Array(1,2,3,4)).map(r => (r, 2*r))).collect
18. pipe:
Pipes each partition of the RDD through an external shell command: each element is written to the command's stdin as a line, and the command's stdout lines form the resulting RDD.
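A minimal sketch, assuming a Unix-like environment with grep available on the PATH:
val words = sc.parallelize(Seq("spark", "hadoop", "spark sql"))
val piped = words.pipe("grep spark")  // the command is spawned once per partition
piped.collect()  // Array(spark, spark sql)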
19. coalesce:
Reduces the number of partitions; typically used after filtering a large dataset, before running further operations, to avoid keeping many near-empty partitions.
val array = Array(Array(1,2), Array(3,4), Array(5,6))
val rdd = sc.parallelize(array).flatMap(r => r).map(r => (r, 2*r)).coalesce(1)
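To confirm the effect, check the partition count:
rdd.partitions.length  // 1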
20. repartition:
Reshuffles the data so that it is rebalanced across the requested number of partitions.
val array = Array(Array(1,2), Array(3,4), Array(5,6))
val rdd = sc.parallelize(array).flatMap(r => r).map(r => (r, 2*r)).repartition(1)
21. repartitionAndSortWithinPartitions:
Repartitions the RDD according to the given partitioner and sorts the records by key within each resulting partition; this is more efficient than calling repartition and then sorting, because the sort is pushed into the shuffle machinery.
import org.apache.spark.HashPartitioner
val array = Array(Array(1,2), Array(3,4), Array(5,6))
val c = new HashPartitioner(5)
val rdd = sc.parallelize(array).flatMap(r => r).map(r => (r, 2*r)).repartitionAndSortWithinPartitions(c)