A Summary of Common Spark Operators

----The difference between map and flatMap in Spark:
map applies the given function to every input element and returns exactly one output object per element.
flatMap combines two steps, "map first, then flatten":
Step 1: like map, apply the given function to every input element, producing one collection per element.
Step 2: flatten all of the returned collections into a single flat sequence of elements.

scala> rdd5.map(t=>{t.split("\t")}).collect
res9: Array[Array[String]] = Array(Array(75, 2018-09-17, BK181713017, 小一),
                                   Array(75, 2018-09-17, BK181913016, 小二),
                                   Array(75, 2018-09-17, BK181913062, 小四))

scala> rdd5.flatMap(t=>{t.split("\t")}).collect
res8: Array[String] =
    Array(75, 2018-09-17, BK181713017, 小一, 75, 2018-09-17, BK181913016, 小二, 75, 2018-09-17, BK181913062, 小四, 75, 2018-09-17, BK181913007)
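
Note that rdd5 itself is never defined in this note; a self-contained sketch of the same contrast, using a hypothetical two-line RDD, looks like this:

scala> val lines = sc.parallelize(Seq("a\tb", "c\td"))
scala> lines.map(_.split("\t")).collect       // Array(Array(a, b), Array(c, d)) -- one array per input line
scala> lines.flatMap(_.split("\t")).collect   // Array(a, b, c, d) -- the arrays are flattened into one sequence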



scala> val a = sc.parallelize(1 to 10, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at :24

scala> a.glom.collect
res0: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10))
----The map operator
    ----x: each element of the RDD
scala> a.map(x=>(x*2)).glom.collect
res1: Array[Array[Int]] = Array(Array(2, 4, 6), Array(8, 10, 12), Array(14, 16, 18, 20))

The difference between map and mapPartitions is the granularity of invocation:
map applies its function to every element of the RDD, while mapPartitions applies its function once per partition (receiving an iterator over that partition's elements).
For parallelize(1 to 10, 3), the map function is invoked 10 times, whereas the mapPartitions function is invoked only 3 times, as the sketch below makes visible.
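
A minimal sketch that makes the invocation counts visible with accumulators (this assumes the same a = parallelize(1 to 10, 3) built above and Spark 2.x's sc.longAccumulator; the accumulator names are just illustrative):

scala> val mapCalls = sc.longAccumulator("mapCalls")
scala> a.map { x => mapCalls.add(1); x * 2 }.collect
scala> mapCalls.value        // 10 -- the function ran once per element

scala> val partCalls = sc.longAccumulator("partCalls")
scala> a.mapPartitions { it => partCalls.add(1); it.map(_ * 2) }.collect
scala> partCalls.value       // 3 -- the function ran once per partition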

----The mapPartitions operator
    ----x: the iterator over a single partition's elements
    ----y: each element within that partition
scala> a.mapPartitions(x=>(x.map(y=>(y*2)))).glom.collect
res2: Array[Array[Int]] = Array(Array(2, 4, 6), Array(8, 10, 12), Array(14, 16, 18, 20))

----The mapPartitionsWithIndex operator
    ----index: the index of the partition
    ----x: the iterator over a single partition's elements
    ----y: each element within that partition

scala> a.mapPartitionsWithIndex((index,x)=>x.map(y=>(index,y))).glom.collect
res4: Array[Array[(Int, Int)]] = Array(Array((0,1), (0,2), (0,3)), Array((1,4), (1,5), (1,6)), Array((2,7), (2,8), (2,9), (2,10)))

----The union operator: concatenates two RDDs (duplicates are kept, and the partitions of both inputs are preserved)

scala> val b = sc.makeRDD(1 to 5,2)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at makeRDD at :24

scala> b.glom.collect
res5: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4, 5))

----union produces a new RDD

scala> a.union(b)
res6: org.apache.spark.rdd.RDD[Int] = UnionRDD[12] at union at :28

scala> res6.glom.collect
res7: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10), Array(1, 2), Array(3, 4, 5))

----intersection: the set-intersection operator (returns the elements present in both RDDs, deduplicated)

scala> a.intersection(b)
res8: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[19] at intersection at :28

scala> res8.collect
res9: Array[Int] = Array(3, 4, 1, 5, 2)

scala> res8.glom.collect
res10: Array[Array[Int]] = Array(Array(3), Array(4, 1), Array(5, 2))

----subtract: the set-difference operator
----removes from the first RDD every element that also appears in the second RDD, keeping only the elements unique to the first

scala> val rdd1 = sc.parallelize(1 to 7)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[104] at parallelize at :24

scala> val rdd2 = sc.makeRDD(3 to 6)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[106] at makeRDD at :24
                                                                                                                              
scala> rdd1.subtract(rdd2).collect
res55: Array[Int] = Array(1, 2, 7)

----The sortBy operator: ascending (true) by default; pass false for descending order
----sortBy signature:

    def sortBy[K](
          f: (T) => K,
          ascending: Boolean = true,                       // sort ascending? defaults to true
          numPartitions: Int = this.partitions.length)     // optionally redefine the number of partitions
          (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]

scala> val c = a.intersection(b)
c: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[27] at intersection at :27

scala> c.collect
res12: Array[Int] = Array(3, 4, 1, 5, 2)

----sortBy: with no flag, the default is true (ascending)

scala> c.sortBy(x=>x).collect
res14: Array[Int] = Array(1, 2, 3, 4, 5)

----Passing true explicitly gives the same result

scala> c.sortBy(x=>x,true).collect
res15: Array[Int] = Array(1, 2, 3, 4, 5)

----sortBy with false: descending order

scala> c.sortBy(x=>x,false).collect
res16: Array[Int] = Array(5, 4, 3, 2, 1)

----distinct: the deduplication operator
----(deduplication can only shrink the data, so you can pass a parameter to redefine the default number of partitions)

scala> val c = a.union(b)
c: org.apache.spark.rdd.RDD[Int] = UnionRDD[48] at union at :27

scala> c.collect
res17: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5)

scala> c.distinct.collect
res18: Array[Int] = Array(10, 5, 1, 6, 7, 2, 3, 8, 4, 9)

scala> c.distinct.glom.collect
res19: Array[Array[Int]] = Array(Array(10, 5), Array(1, 6), Array(7, 2), Array(3, 8), Array(4, 9))

scala> c.glom.collect
res20: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10), Array(1, 2), Array(3, 4, 5))

----distinct(n): the argument is the number of partitions the deduplicated data is spread across

scala> c.distinct(1).glom.collect
res21: Array[Array[Int]] = Array(Array(4, 1, 6, 3, 7, 9, 8, 10, 5, 2))

scala> c.distinct(2).glom.collect
res22: Array[Array[Int]] = Array(Array(4, 6, 8, 10, 2), Array(1, 3, 7, 9, 5))

----You can also chain sortBy to sort the deduplicated elements; sortBy range-partitions the result, so each partition holds a contiguous sorted range

scala> c.distinct(2).sortBy(x=>x,true).glom.collect
res27: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5), Array(6, 7, 8, 9, 10))

----union on pair RDDs (2-tuples built with map)

scala> val pair1 = a.map(x=>(x,1))
pair1: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[86] at map at :25

scala> pair1.collect
res28: Array[(Int, Int)] = Array((1,1), (2,1), (3,1), (4,1), (5,1), (6,1), (7,1), (8,1), (9,1), (10,1))

scala> val pair2 = b.map(x=>(x,2))
pair2: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[87] at map at :25

scala> pair2.collect
res29: Array[(Int, Int)] = Array((1,2), (2,2), (3,2), (4,2), (5,2))

scala> pair1.union(pair2)
res30: org.apache.spark.rdd.RDD[(Int, Int)] = UnionRDD[88] at union at :28

scala> res30.collect
res31: Array[(Int, Int)] = Array((1,1), (2,1), (3,1), (4,1), (5,1), (6,1), (7,1), (8,1), (9,1), (10,1), (1,2), (2,2), (3,2), (4,2), (5,2))

----join on pair RDDs (2-tuples built with map)

scala> pair1.join(pair2).collect
res32: Array[(Int, (Int, Int))] = Array((4,(1,2)), (1,(1,2)), (5,(1,2)), (2,(1,2)), (3,(1,2)))

scala> pair1.join(pair2).glom.collect
res6: Array[Array[(Int, (Int, Int))]] = Array(Array((4,(1,2))), Array((1,(1,2)), (5,(1,2))), 
                                              Array((2,(1,2))), Array((3,(1,2))))

----join with an explicit number of partitions

scala> pair1.join(pair2,2).glom.collect
res7: Array[Array[(Int, (Int, Int))]] = Array(Array((4,(1,2)), (2,(1,2))), 
                                              Array((1,(1,2)), (3,(1,2)), (5,(1,2))))

----The difference between reduceByKey and groupByKey in Spark
1. reduceByKey merges the values of each key with a user-supplied function,
and, most importantly, it performs this merge locally within each partition (map side) before the shuffle.
2. groupByKey also operates per key, but it shuffles every key-value pair without any local merging
and only produces a sequence of values per key. groupByKey itself takes no user function,
so you first build the grouped RDD and then apply your own function to it with map.
Because nothing is merged locally, far more data moves between cluster nodes, adding shuffle cost and transfer latency; see the sketch below.
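
A minimal sketch of the two styles for summing the values of each key; pairs here is just an illustrative RDD, not one defined elsewhere in this note:

scala> val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

scala> pairs.reduceByKey(_ + _).collect
       // values are pre-merged within each partition before the shuffle; yields (a,2) and (b,1)

scala> pairs.groupByKey().map { case (k, vs) => (k, vs.sum) }.collect
       // every pair is shuffled first, then summed afterwards; same result, heavier shuffle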

----groupByKey: groups the values whose keys are equal
----signature:

	def groupByKey(): RDD[(K, Iterable[V])]
	def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
	def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
	
scala> val pair3 = pair1.union(pair2)
pair3: org.apache.spark.rdd.RDD[(Int, Int)] = UnionRDD[91] at union at :27

scala> pair3.groupByKey.collect
res36: Array[(Int, Iterable[Int])] = Array((10,CompactBuffer(1)), (5,CompactBuffer(1, 2)), (1,CompactBuffer(1, 2)), 
        (6,CompactBuffer(1)), (7,CompactBuffer(1)), (2,CompactBuffer(1, 2)), (3,CompactBuffer(1, 2)), (8,CompactBuffer(1)), (4,CompactBuffer(1, 2)), (9,CompactBuffer(1)))

----reduceByKey(): values whose keys are equal are combined with the given function (here, added together)

scala> pair3.reduceByKey((x,y)=>(x+y)).collect
res37: Array[(Int, Int)] = Array((10,1), (5,3), (1,3), (6,1), (7,1), (2,3), (3,3), (8,1), (4,3), (9,1))

----groupBy on a pair RDD: x._1 refers to the first element of the tuple (the key)

scala> pair3.groupBy(x=>{ if(x._1%2==0) "偶数" else "奇数"}).collect
res40: Array[(String, Iterable[(Int, Int)])] = Array((偶数,CompactBuffer((2,1), (4,1), (6,1), (8,1), (10,1), (2,2), (4,2))), 
    (奇数,CompactBuffer((1,1), (3,1), (5,1), (7,1), (9,1), (1,2), (3,2), (5,2))))

----groupBy on a pair RDD: x._2 refers to the second element of the tuple (the value)

scala> pair3.groupBy(x=>{ if(x._2%2==0) "偶数" else "奇数"}).collect
res41: Array[(String, Iterable[(Int, Int)])] = Array((偶数,CompactBuffer((1,2), (2,2), (3,2), (4,2), (5,2))), 
    (奇数,CompactBuffer((1,1), (2,1), (3,1), (4,1), (5,1), (6,1), (7,1), (8,1), (9,1), (10,1))))

----groupBy on a plain RDD of values

scala> a.groupBy(x=>{if(x%2==0) "偶数" else "奇数"}).collect
res39: Array[(String, Iterable[Int])] = Array((偶数,CompactBuffer(2, 4, 6, 8, 10)), (奇数,CompactBuffer(1, 3, 5, 7, 9)))

----aggregate: a curried operator
----On a plain Scala collection, aggregate folds over all of the elements with a single zero value.
----In Spark, the first function (seqOp) folds the elements of each partition separately, and the second function (combOp) then merges the per-partition results; the zero value is applied once per partition and once more in the final combine.
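
A sketch of that contrast (the plain Scala call is shown only for comparison, assuming Scala 2.11/2.12 where aggregate is still available on collections; on a sequential collection the zero value is used exactly once):

(1 to 10).aggregate(5)(_ + _, _ + _)                    // 60 = 5 + 55
sc.parallelize(1 to 10, 3).aggregate(5)(_ + _, _ + _)   // 75 = 55 + 3*5 (one zero per partition) + 5 (final combine)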

scala> a.collect
res42: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> a.glom.collect
res43: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10))

scala> a.aggregate(5)((x,y)=>(x+y),(x,y)=>(x+y))
res45: Int = 75

----The most compact way to write it

scala> a.aggregate(5)(_+_,_+_)
res50: Int = 75
                                                               
scala> a.aggregate(5)((x,y)=>{println(x,y,"one");x+y},(x,y)=>{println(x,y,"two");x+y})
    (5,4,one)       5+4=9
    (9,5,one)       9+5=14
    (14,6,one)      14+6=20   ----Array(4, 5, 6)
    (5,7,one)       5+7=12
    (5,1,one)       5+1=6
    (6,2,one)       6+2=8
    (8,3,one)       8+3=11    ----Array(1, 2, 3)
    (12,8,one)      12+8=20
    (20,9,one)      20+9=29
    (29,10,one)     29+10=39  ----Array(7, 8, 9, 10)
    (5,11,two)      5+11=16
    (16,20,two)     16+20=36
    (36,39,two)     36+39=75  
res49: Int = 75     



scala> val aa = sc.parallelize(List(("a",1),("b",4),("c",3),("d",8),("b",3),("a",3),("b",7),("c",4),("b",9)),3)
aa: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[101] at parallelize at :24

scala> aa.collect
res51: Array[(String, Int)] = Array((a,1), (b,4), (c,3), (d,8), (b,3), (a,3), (b,7), (c,4), (b,9))

scala> aa.glom.collect
res52: Array[Array[(String, Int)]] = Array(Array((a,1), (b,4), (c,3)), Array((d,8), (b,3), (a,3)), Array((b,7), (c,4), (b,9)))

scala> aa.aggregateByKey(0)(math.max(_,_),_+_)
res56: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[103] at aggregateByKey at :26

scala> res56.collect
res57: Array[(String, Int)] = Array((c,7), (d,8), (a,4), (b,16))
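
The call above is aggregateByKey: within each partition the seqOp (math.max) folds each key's values starting from the zero value 0, and across partitions the combOp (_+_) sums the per-partition maxima, which is why b ends up as 4 + 3 + 9 = 16. A minimal plain-Scala sketch that reproduces those steps by hand, using the partition layout shown by aa.glom.collect:

    // Per-partition contents, copied from aa.glom.collect above
    val parts = Seq(
      Seq(("a", 1), ("b", 4), ("c", 3)),
      Seq(("d", 8), ("b", 3), ("a", 3)),
      Seq(("b", 7), ("c", 4), ("b", 9)))

    // seqOp within each partition: fold each key's values with math.max, starting from the zero value 0
    val perPartition = parts.map(_.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).foldLeft(0)(math.max(_, _)) })

    // combOp across partitions: sum the per-partition maxima for each key
    val combined = perPartition.flatten.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
    // combined: Map(a -> 4, b -> 16, c -> 7, d -> 8)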

----Putting it together
----sortBy and sortByKey on a plain RDD

scala> val aaaa = sc.makeRDD(List(3,9,2,6,1,5))
aaaa: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[105] at makeRDD at :24

scala> aaaa.collect
res59: Array[Int] = Array(3, 9, 2, 6, 1, 5)

scala> aaaa.sortBy(x=>x, true).collect
res60: Array[Int] = Array(1, 2, 3, 5, 6, 9)

scala> aaaa.glom.collect
res61: Array[Array[Int]] = Array(Array(3), Array(9, 2), Array(6), Array(1, 5))

scala> aaaa.sortBy(x=>x, true).glom.collect
res62: Array[Array[Int]] = Array(Array(1, 2), Array(3), Array(5, 6), Array(9))

----sortBy and sortByKey on pair RDDs

scala> val pair3 = pair1.union(pair2)
pair3: org.apache.spark.rdd.RDD[(Int, Int)] = UnionRDD[91] at union at :27

scala> pair3.collect
res63: Array[(Int, Int)] = Array((1,1), (2,1), (3,1), (4,1), (5,1), (6,1), (7,1), (8,1), (9,1), (10,1), (1,2), (2,2), (3,2), (4,2), (5,2))

scala> pair3.sortBy(x=>x, true).collect
res64: Array[(Int, Int)] = Array((1,1), (1,2), (2,1), (2,2), (3,1), (3,2), (4,1), (4,2), (5,1), (5,2), (6,1), (7,1), (8,1), (9,1), (10,1))

scala> pair3.sortBy(x=>x, false).collect
res65: Array[(Int, Int)] = Array((10,1), (9,1), (8,1), (7,1), (6,1), (5,2), (5,1), (4,2), (4,1), (3,2), (3,1), (2,2), (2,1), (1,2), (1,1))

----Press Tab after sortByKey to see its usage: a Boolean for the sort order and a number of partitions

scala> pair3.sortByKey
   def sortByKey(ascending: Boolean,numPartitions: Int): org.apache.spark.rdd.RDD[(Int, Int)]

scala> pair3.sortByKey(false).collect
res1: Array[(Int, Int)] = Array((10,1), (9,1), (8,1), (7,1), (6,1), (5,1), (5,2), (4,1), (4,2), (3,1), (3,2), (2,1), (2,2), (1,1), (1,2))

----Without an explicit partition count, the default partitioning is used: 4 partitions

scala> pair3.sortByKey(false).glom.collect
res4: Array[Array[(Int, Int)]] = Array(Array((10,1), (9,1), (8,1)), 
                                Array((7,1), (6,1), (5,1), (5,2)), 
                                Array((4,1), (4,2), (3,1), (3,2)), 
                                Array((2,1), (2,2), (1,1), (1,2)))

    ----With an explicit partition count of 2
scala> pair3.sortByKey(false,2).glom.collect
res2: Array[Array[(Int, Int)]] = Array(Array((10,1), (9,1), (8,1), (7,1), (6,1), (5,1), (5,2)), 
                                Array((4,1), (4,2), (3,1), (3,2), (2,1), (2,2), (1,1), (1,2)))

----pipe runs a shell command once per partition, feeding that partition's elements to the command's stdin as lines
----pipe("head -n 1"): the first line of each partition
----pipe("tail -n 1"): the last line of each partition

scala> pair1.glom.collect
res8: Array[Array[(Int, Int)]] = Array(Array((1,1), (2,1)), Array((3,1), (4,1), (5,1)), Array((6,1), (7,1)), Array((8,1), (9,1), (10,1)))

scala> pair1.pipe("head -n 1").collect
res9: Array[String] = Array((1,1), (3,1), (6,1), (8,1))

scala> pair1.pipe("tail -n 1").collect
res10: Array[String] = Array((2,1), (5,1), (7,1), (10,1))

scala> pair1.pipe("tail -n 2").collect
res11: Array[String] = Array((1,1), (2,1), (4,1), (5,1), (6,1), (7,1), (9,1), (10,1))

----coalesce: merges partitions to change the partition count (repartitioning)
----coalesce signature:

   def coalesce(numPartitions: Int, shuffle: Boolean = false,
        partitionCoalescer: Option[PartitionCoalescer] = Option.empty)

scala> val rdd = sc.parallelize(1 to 17, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[92] at parallelize at :24

----partitions.size confirms the 4 partitions requested above

scala> rdd.partitions.size
res48: Int = 4       

----Now coalesce it down to 3 partitions

scala> val coalesceRDD = rdd.coalesce(3)
coalesceRDD: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[95] at coalesce at :25

scala> coalesceRDD.partitions.size
res50: Int = 3

----Notice that the first two partitions are unchanged; only the last two partitions were merged into one (no shuffle)

scala> coalesceRDD.glom.collect
res52: Array[Array[Int]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8), Array(9, 10, 11, 12, 13, 14, 15, 16, 17))

----repartition: repartitions the RDD (reshuffles all of the data)
----repartition source:

 def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
        coalesce(numPartitions, shuffle = true)
    }

scala> val rdd = sc.parallelize(1 to 16 , 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[97] at parallelize at :24

scala> rdd.glom.collect
res53: Array[Array[Int]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8), Array(9, 10, 11, 12), Array(13, 14, 15, 16))

scala> val rerdd = rdd.repartition(3)
rerdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[102] at repartition at :25

scala> rerdd.glom.collect
res54: Array[Array[Int]] = Array(Array(3, 6, 11, 14), Array(1, 4, 7, 9, 12, 15), Array(2, 5, 8, 10, 13, 16))

----The difference between coalesce and repartition:
1. coalesce repartitions the RDD and lets you choose whether to shuffle,
via the parameter shuffle: Boolean, which defaults to false.
2. repartition simply calls coalesce with shuffle = true, so it always shuffles (see the sketch below).
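
A minimal sketch of the two calls side by side (reusing the 4-partition rdd above; the variable names are just illustrative):

    val narrowed   = rdd.coalesce(2)                   // merges existing partitions, no shuffle
    val widened    = rdd.coalesce(6)                   // without a shuffle the count cannot grow, so this stays at 4 partitions
    val reshuffled = rdd.coalesce(6, shuffle = true)   // full shuffle; equivalent to rdd.repartition(6)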
