RDD算子 转换算子

RDD 中的所有转换都是延迟加载的,也就是说,它们并不会直接计算结果。相反的,它们只是记住这些应用到基础数据集(例如一个文件)上的转换动作。只有当发生一个要求返回结果给 Driver 的动作时,这些转换才会真正运行。这种设计让 Spark 更加有效率地运行。
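下面用一段最小的 Scala 代码示意这种延迟加载(仅为示意,假设在 spark-shell 中运行,sc 由环境提供):

// 以下两步都是转换算子,只记录计算逻辑,并不会真正执行
val data = sc.parallelize(1 to 100)
val doubled = data.map(_ * 2)
// reduce 是行动算子,调用它时上面的转换才会真正被计算
val result = doubled.reduce(_ + _)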

常用的Transformation

map,filter,flatMap,mapPartitions,mapPartitionsWithIndex,sample,takeSample,union,intersection,distinct,partitionBy,reduceByKey,groupByKey,combineByKey,aggregateByKey,foldByKey,sortByKey,sortBy,join,cogroup,cartesian,pipe,coalesce,repartition,repartitionAndSortWithinPartitions,glom,mapValues,subtract

map(func):返回一个新的RDD,该RDD由每一个输入元素经过func函数转换后组成

map 将原 RDD 的每个数据项通过用户自定义函数 f 映射为一个新的元素。早期版本源码中,map 算子相当于创建一个新的 MappedRDD(this, sc.clean(f));较新的版本中则统一返回 MapPartitionsRDD。


scala> val rdd = sc.makeRDD(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at makeRDD at :24
scala> rdd.map(_*2).collect
res4: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)

filter(func):返回一个新的RDD,该RDD由经过func函数计算后返回值为true的输入元素组成


scala> val rdd = sc.makeRDD(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at makeRDD at :24
scala> rdd.filter(_%2==0).collect
res5: Array[Int] = Array(2, 4, 6, 8, 10)

flatMap(func):类似于 map,但是每一个输入元素可以被映射为 0 个或多个输出元素(所以 func 应该返回一个序列,而不是单一元素)

flatMap 将原 RDD 中的每个元素通过函数 f 转化为一个集合,再把这些集合中的元素拍平合并成一个新的 RDD。早期版本内部创建 FlatMappedRDD(this, sc.clean(f)),较新的版本中同样返回 MapPartitionsRDD(如下面示例的输出所示)。


scala> val rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at :24
scala> val flatMap = rdd.flatMap(1 to _)
flatMap: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[10] at flatMap at :26
scala> flatMap.collect
res6: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

mapPartitions(func):类似于 map,但独立地在 RDD 的每一个分片(分区)上运行,因此在类型为 T 的 RDD 上运行时,func 的函数类型必须是 Iterator[T] => Iterator[U]。假设有 N 个元素,有 M 个分区,那么 map 的函数将被调用 N 次,而 mapPartitions 的函数被调用 M 次,一次处理一个分区的全部数据


#表示划分为4个分区
scala> val rdd = sc.parallelize(1 to 10, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at :24
scala> rdd.partitions.size
res13: Int = 4
# items表示一个分区中的数据
scala> rdd.mapPartitions(items => items.map(_*2)).collect
res14: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)

mapPartitionsWithIndex(func):类似于 mapPartitions,但 func 带有一个整数参数表示分片的索引值,因此在类型为 T 的 RDD 上运行时,func 的函数类型必须是 (Int, Iterator[T]) => Iterator[U]


scala> val rdd = sc.parallelize(1 to 10, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[22] at parallelize at :24
scala> rdd.mapPartitionsWithIndex((index,items) => Iterator(index + ":" + items.toList)).collect
res18: Array[String] = Array(0:List(1, 2), 1:List(3, 4, 5), 2:List(6, 7), 3:List(8, 9, 10))

sample(withReplacement, fraction, seed):以指定的随机种子随机抽样出比例约为 fraction 的数据,withReplacement 表示抽出的数据是否放回,true 为有放回的抽样,false 为无放回的抽样,seed 用于指定随机数生成器的种子。下面的例子从 RDD 中随机且有放回地抽出约 40% 的数据,随机种子值为 2。


scala> val rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[44] at parallelize at :24
scala> rdd.sample(true,0.4,2).collect
res36: Array[Int] = Array(1, 2, 2)

takeSample(withReplacement, num, [seed]):和 sample 的区别是,takeSample 按指定个数抽样,并把最终的结果集合直接以数组形式返回给 Driver(因此属于行动算子),而不是返回一个新的 RDD。
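一个简单的 takeSample 示意(仅作演示,第二个参数是抽样个数而不是比例):

val rdd = sc.parallelize(1 to 10)
// 无放回地抽取 3 个元素,返回本地数组 Array[Int] 而不是 RDD
val sampled = rdd.takeSample(false, 3, 1)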

union(otherDataset): 对源 RDD 和参数 RDD 求并集后返回一个新的 RDD


scala> val rdd1 = sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[54] at parallelize at :24
scala> val rdd2 = sc.parallelize(5 to 10)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[55] at parallelize at :24
scala> rdd1.union(rdd2).collect
res44: Array[Int] = Array(1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10)

intersection(otherDataset):是数据交集,返回一个新的数据集,包含两个数据集的交集数据。


scala> val rdd1 = sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[73] at parallelize at :24
scala> val rdd2 = sc.parallelize(3 to 7)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[74] at parallelize at :24
scala> rdd1.intersection(rdd2).collect
res50: Array[Int] = Array(4, 3, 5)

distinct([numTasks]):数据去重,返回一个新的数据集,去除源数据集中的重复元素,可选的 numTasks 参数用于设置去重任务的并行数量。


scala> val rdd = sc.parallelize(List(1,2,3,4,4,2,1))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[82] at parallelize at :24
scala> rdd.distinct.collect
res51: Array[Int] = Array(4, 2, 1, 3)

partitionBy:对 (K,V) 形式的 RDD 进行分区操作,如果原 RDD 的分区器(partitioner)和传入的分区器一致就不重新分区,否则会产生 shuffle,生成 ShuffledRDD。


scala> val rdd = sc.parallelize(Array((1,"aaa"),(2,"bbb"),(3,"ccc"),(4,"ddd")),4)
rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[90] at parallelize at :24
scala> rdd.collect
res55: Array[(Int, String)] = Array((1,aaa), (2,bbb), (3,ccc), (4,ddd))
scala> val rdd1 = rdd.partitionBy(new org.apache.spark.HashPartitioner(2))
rdd1: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[91] at partitionBy at :26
scala> rdd1.collect
res56: Array[(Int, String)] = Array((2,bbb), (4,ddd), (1,aaa), (3,ccc))

reduceByKey(func,[numTasks]):在一个(K,V)的 RDD 上调用,返回一个(K,V)的 RDD,使用指定的 reduce 函数,将相同 key 的值聚合到一起,reduce 任务的个数可以通过第二个可选的参数来设置

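下面给出 reduceByKey 的一个最小示意(以词频统计为例,sc 由 spark-shell 提供,输出顺序可能不同):

val words = Array("one","two","two","three","three","three")
val wordPairsRDD = sc.parallelize(words).map((_, 1))
// 使用 reduceByKey 将相同单词的计数相加,得到 (单词, 词频)
val wordCounts = wordPairsRDD.reduceByKey(_ + _)
wordCounts.collect   // 结果形如 Array((two,2), (one,1), (three,3))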

groupByKey:也是对每个 key 进行操作,但只是把同一个 key 的所有 value 收集成一个序列(Iterable),不做聚合计算。


scala> val words = Array("one","two","two","three","three","three")
words: Array[String] = Array(one, two, two, three, three, three)
scala> val wordPairsRDD = sc.parallelize(words).map((_,1))
wordPairsRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[100] at map at :26
scala> val group = wordPairsRDD.groupByKey
group: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[101] at groupByKey at :28
scala> group.collect
res61: Array[(String, Iterable[Int])] = Array((two,CompactBuffer(1, 1)), (one,CompactBuffer(1)), (three,CompactBuffer(1, 1, 1)))

combineByKey[C](
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C)

对相同的 K,把 V 合并成一个集合。
createCombiner:combineByKey() 会遍历分区中的所有元素,因此每个元素的键要么还没有遇到过,要么就和之前的某个元素的键相同。如果这是一个新的键,combineByKey() 会使用一个叫作 createCombiner() 的函数来创建那个键对应的累加器的初始值。
mergeValue:如果这是一个在处理当前分区之前已经遇到的键,它会使用 mergeValue() 方法将该键的累加器对应的当前值与这个新的值进行合并。
mergeCombiners:由于每个分区都是独立处理的,因此对于同一个键可以有多个累加器。如果有两个或者更多的分区都有对应同一个键的累加器,就需要使用用户提供的 mergeCombiners() 方法将各个分区的结果进行合并。


scala> val scores = Array(("Fred",88),("Fred",95),("Fred",91),("Wilma",93),("Wilma",98))
scores: Array[(String, Int)] = Array((Fred,88), (Fred,95), (Fred,91), (Wilma,93), (Wilma,98))
scala> val input = sc.makeRDD(scores)
input: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[113] at makeRDD at :26
scala> val combine = input.combineByKey(
     | v=>(v,1),
     | (c:(Int,Int),v)=>(c._1+v,c._2+1),
     | (c1:(Int,Int),c2:(Int,Int))=>(c1._1+c2._1,c1._2+c2._2))
combine: org.apache.spark.rdd.RDD[(String, (Int, Int))] = ShuffledRDD[114] at combineByKey at :28
scala> combine.map{case (key,value) => (key,value._1/value._2.toDouble)}.collect
res75: Array[(String, Double)] = Array((Wilma,95.5), (Fred,91.33333333333333))

aggregateByKey(zeroValue: U, [partitioner: Partitioner])(seqOp: (U, V) => U, combOp: (U, U) => U)

在 (K,V) 形式的 RDD 中,按 key 将 value 进行分组合并。合并时,在每个分区内先将每个 value 和初始值 zeroValue 作为 seqOp 函数的参数进行计算,得到各 key 在该分区内的中间结果;然后再按 key 把各分区的中间结果交给 combOp 函数合并(先合并前两个结果,再把返回值和下一个结果继续合并,以此类推),最后将 key 与计算结果作为新的 (K,V) 对输出。简言之,seqOp 函数用于在每一个分区中用初始值逐步迭代 value,combOp 函数用于合并各个分区的结果


scala> val rdd = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),3)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[12] at parallelize at :24
scala> val agg = rdd.aggregateByKey(0)(math.max(_,_),_+_)
agg: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[13] at aggregateByKey at :26
scala> agg.collect()
res7: Array[(Int, Int)] = Array((3,8), (1,7), (2,3))
scala> agg.partitions.size
res8: Int = 3
scala> val rdd = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),1)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[10] at parallelize at :24
scala> val agg = rdd.aggregateByKey(0)(math.max(_,_),_+_).collect()
agg: Array[(Int, Int)] = Array((1,4), (3,8), (2,3))

foldByKey(zeroValue: V)(func:(V, V) => V): RDD[(K, V)]

aggregateByKey 的简化操作,seqOp 和 combOp 使用同一个函数

scala> val rdd = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),3)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[91] at parallelize at :24
scala> val agg = rdd.foldByKey(0)(_+_)
agg: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[92] at foldByKey at :26
scala> agg.collect()
res61: Array[(Int, Int)] = Array((3,14), (1,9), (2,3))

sortByKey([ascending],[numTasks]):在一个(K,V)的 RDD 上调用,K 必须实现Ordered 接口,返回一个按照 key 进行排序的(K,V)的 RDD

scala> val rdd = sc.parallelize(Array((3,"aa"),(6,"cc"),(2,"bb"),(1,"dd")))
rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[14] at parallelize at :24
scala> rdd.sortByKey(true).collect()
res9: Array[(Int, String)] = Array((1,dd), (2,bb), (3,aa), (6,cc))
scala> rdd.sortByKey(false).collect()
res10: Array[(Int, String)] = Array((6,cc), (3,aa), (2,bb), (1,dd))

sortBy(func,[ascending],[numTasks]):与 sortByKey 类似,但是更灵活,可以用func 先对数据进行处理,按照处理后的数据比较结果排序。

scala> val rdd = sc.parallelize(List(1,2,3,4))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[21] at parallelize at :24
scala> rdd.sortBy(x => x).collect()
res11: Array[Int] = Array(1, 2, 3, 4)
scala> rdd.sortBy(x => x%3).collect()
res12: Array[Int] = Array(3, 4, 1, 2)

join(otherDataset, [numTasks]):在类型为(K,V)和(K,W)的 RDD 上调用,返回一个相同 key 对应的所有元素对在一起的(K,(V,W))的 RDD


scala> val rdd = sc.parallelize(Array((1,"a"),(2,"b"),(3,"c")))
rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[32] at parallelize at :24
scala> val rdd1 = sc.parallelize(Array((1,4),(2,5),(3,6)))
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[33] at parallelize at :24
scala> rdd.join(rdd1).collect()
res13: Array[(Int, (String, Int))] = Array((1,(a,4)), (2,(b,5)), (3,(c,6)))

cogroup(otherDataset,[numTasks]):在类型为 (K,V) 和 (K,W) 的 RDD 上调用,返回一个 (K,(Iterable[V],Iterable[W])) 类型的 RDD


scala> val rdd = sc.parallelize(Array((1,"a"),(2,"b"),(3,"c")))
rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[37] at parallelize at :24
scala> val rdd1 = sc.parallelize(Array((1,4),(2,5),(3,6)))
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[38] at parallelize at :24
scala> rdd.cogroup(rdd1).collect()
res14: Array[(Int, (Iterable[String], Iterable[Int]))] = Array((1,(CompactBuffer(a),CompactBuffer(4))), (2,(CompactBuffer(b),CompactBuffer(5))), (3,(CompactBuffer(c),CompactBuffer(6))))
scala> val rdd2 = sc.parallelize(Array((4,4),(2,5),(3,6)))
rdd2: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[41] at parallelize at :24
scala> rdd.cogroup(rdd2).collect()
res15: Array[(Int, (Iterable[String], Iterable[Int]))] = Array((4,(CompactBuffer(),CompactBuffer(4))), (1,(CompactBuffer(a),CompactBuffer())), (2,(CompactBuffer(b),CompactBuffer(5))), (3,(CompactBuffer(c),CompactBuffer(6))))
scala> val rdd3 = sc.parallelize(Array((1,"a"),(1,"d"),(2,"b"),(3,"c")))
rdd3: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[44] at parallelize at :24
scala> rdd3.cogroup(rdd2).collect()
res16: Array[(Int, (Iterable[String], Iterable[Int]))] = Array((4,(CompactBuffer(),CompactBuffer(4))), (1,(CompactBuffer(d, a),CompactBuffer())), (2,(CompactBuffer(b),CompactBuffer(5))), (3,(CompactBuffer(c),CompactBuffer(6))))

cartesian(otherDataset):笛卡尔积,返回两个 RDD 中所有元素两两组合而成的 (T, U) 对

scala> val rdd1 = sc.parallelize(1 to 3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[47] at parallelize at :24
scala> val rdd2 = sc.parallelize(2 to 5)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[48] at parallelize at :24
scala> rdd1.cartesian(rdd2).collect()
res17: Array[(Int, Int)] = Array((1,2), (1,3), (1,4), (1,5), (2,2), (2,3), (2,4), (2,5), (3,2), (3,3), (3,4), (3,5))

pipe(command, [envVars]):对于每个分区,都执行一个 perl 或者 shell 脚本,分区内的元素通过标准输入传给脚本,脚本标准输出的每一行构成返回的新 RDD 的元素

注意:shell  脚本需要集群中的所有节点都能访问到
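一个示意用法(脚本路径为假设的示例,需替换为实际可用、且对所有节点可见的脚本):

val rdd = sc.parallelize(List("hello", "world", "spark"), 2)
// "/path/to/echo.sh" 仅为假设的脚本路径,脚本从标准输入逐行读取分区数据
val piped = rdd.pipe("/path/to/echo.sh")
piped.collect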

coalesce(numPartitions):缩减分区数(默认不进行 shuffle),常用于大数据集过滤后,提高小数据集的执行效率。

scala> val rdd = sc.parallelize(1 to 16,4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[54] at parallelize at :24
scala> rdd.partitions.size
res20: Int = 4
scala> val coalesceRDD = rdd.coalesce(3)
coalesceRDD: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[55] at coalesce at :26
scala> coalesceRDD.partitions.size
res21: Int = 3

repartition(numPartitions):根据指定的分区数,重新通过网络随机洗牌(shuffle)所有数据,内部等价于 coalesce(numPartitions, shuffle = true)。

scala> val rdd = sc.parallelize(1 to 16,4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[56] at parallelize at :24
scala> rdd.partitions.size
res22: Int = 4
scala> val rerdd = rdd.repartition(2)
rerdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[60] at repartition at :26
scala> rerdd.partitions.size
res23: Int = 2
scala> val rerdd = rdd.repartition(4)
rerdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[64] at repartition at :26
scala> rerdd.partitions.size
res24: Int = 4

repartitionAndSortWithinPartitions(partitioner)

repartitionAndSortWithinPartitions 函数是 repartition 函数的变种。与 repartition 不同的是,它按给定的 partitioner 重新分区,并在 shuffle 的同时在每个分区内部按 key 排序,因此比先 repartition 再在分区内排序的性能要高。
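一个最小示意(按 HashPartitioner 重新分成 2 个分区,并在每个分区内部按 key 排序,仅作演示):

val rdd = sc.parallelize(Array((3,"c"),(1,"a"),(4,"d"),(2,"b")), 2)
// 在 shuffle 的同时完成分区内按 key 排序
val sorted = rdd.repartitionAndSortWithinPartitions(new org.apache.spark.HashPartitioner(2))
sorted.glom().collect   // 每个分区内部的 (K,V) 已按 key 升序排列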

glom:将每一个分区的元素收集成一个数组,形成新的 RDD,类型是 RDD[Array[T]]

scala> val rdd = sc.parallelize(1 to 16,4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[65] at parallelize at :24
scala> rdd.glom().collect()
res25: Array[Array[Int]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8), Array(9, 10, 11, 12), Array(13, 14, 15, 16))

mapValues:针对 (K,V) 形式的 RDD,只对 V 进行操作,key 保持不变

scala> val rdd3 = sc.parallelize(Array((1,"a"),(1,"d"),(2,"b"),(3,"c")))
rdd3: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[67] at parallelize at :24
scala> rdd3.mapValues(_+"|||").collect()
res26: Array[(Int, String)] = Array((1,a|||), (1,d|||), (2,b|||), (3,c|||))

subtract:计算差集的一种函数,去掉源 RDD 中与参数 RDD 相同的元素,只保留存在于源 RDD 而不在参数 RDD 中的元素


scala> val rdd = sc.parallelize(3 to 8)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[70] at parallelize at :24
scala> val rdd1 = sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[71] at parallelize at :24
scala> rdd.subtract(rdd1).collect()
res27: Array[Int] = Array(8, 6, 7)
