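----Difference between map and flatMap in Spark:
map applies the given function to each input element and returns exactly one object per input element.
flatMap combines two operations, i.e. "map first, then flatten":
Operation 1: as with map, apply the given function to each input element, producing one object (here, a collection) per element.
Operation 2: flatten all of those collections into a single collection of their elements.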
scala> rdd5.map(t=>{t.split("\t")}).collect
res9: Array[Array[String]] = Array(Array(75, 2018-09-17, BK181713017, 小一),
Array(75, 2018-09-17, BK181913016, 小二),
Array(75, 2018-09-17, BK181913062, 小四))
scala> rdd5.flatMap(t=>{t.split("\t")}).collect
res8: Array[String] =
Array(75, 2018-09-17, BK181713017, 小一,75, 2018-09-17, BK181913016, 小二,75, 2018-09-17, BK181913062, 小四, 75, 2018-09-17, BK181913007)
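The "map first, then flatten" behaviour can also be checked without a cluster. The following is only a sketch on a plain Scala List rather than an RDD, and the two sample rows are hypothetical stand-ins for the tab-separated records in rdd5.

```scala
object MapVsFlatMap {
  // Hypothetical tab-separated rows (assumption: same shape as the records in rdd5).
  val rows: List[String] = List("75\t2018-09-17\tA001", "75\t2018-09-17\tA002")

  // map: exactly one output object (a list of fields) per input row.
  val mapped: List[List[String]] = rows.map(_.split("\t").toList)

  // flatMap: the same split per row, then all per-row results flattened into one list.
  val flatMapped: List[String] = rows.flatMap(_.split("\t").toList)

  def main(args: Array[String]): Unit = {
    println(mapped)
    println(flatMapped)
    // flatMap gives the same result as map followed by flatten.
    println(flatMapped == mapped.flatten)
  }
}
```

The nesting level is the only difference: map keeps one inner list per row, while flatMap returns the fields of all rows in a single list.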
scala> val a = sc.parallelize(1 to 10, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at :24
scala> a.glom.collect
res0: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10))
----map operator
----x: each element of the RDD
scala> a.map(x=>(x*2)).glom.collect
res1: Array[Array[Int]] = Array(Array(2, 4, 6), Array(8, 10, 12), Array(14, 16, 18, 20))
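map and mapPartitions differ mainly in call granularity:
the function passed to map is applied to every element of the RDD, whereas the function passed to mapPartitions is applied once to each partition.
For example, with parallelize(1 to 10, 3), the map function is called 10 times, while the mapPartitions function is called 3 times.
----mapPartitions operator
----x: an iterator over the elements of one partition
----y: each element within that partition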
scala> a.mapPartitions(x=>(x.map(y=>(y*2)))).glom.collect
res2: Array[Array[Int]] = Array(Array(2, 4, 6), Array(8, 10, 12), Array(14, 16, 18, 20))
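The difference in call granularity (10 calls versus 3 calls) can be made visible by counting invocations. This is a minimal sketch that models the three partitions as plain Scala sequences; it does not use Spark, and the counters and the helper doublePartition are names introduced here for illustration only.

```scala
object CallGranularity {
  // Partition layout copied from the a.glom.collect output; plain Scala, not an RDD.
  val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9, 10))

  var mapCalls = 0           // how often the map-style function runs
  var mapPartitionsCalls = 0 // how often the mapPartitions-style function runs

  // map style: the function is invoked once per element.
  val viaMap: Seq[Seq[Int]] =
    partitions.map(_.map { x => mapCalls += 1; x * 2 })

  // mapPartitions style: the function is invoked once per partition and
  // receives an iterator over that partition's elements.
  def doublePartition(it: Iterator[Int]): Iterator[Int] = {
    mapPartitionsCalls += 1
    it.map(_ * 2)
  }
  val viaMapPartitions: Seq[Seq[Int]] =
    partitions.map(p => doublePartition(p.iterator).toList)

  def main(args: Array[String]): Unit = {
    println(s"map-style calls: $mapCalls, mapPartitions-style calls: $mapPartitionsCalls")
    println(viaMap == viaMapPartitions)
  }
}
```

Both variants produce the same doubled values. Only the number of function invocations differs, which matters when the function has per-call setup cost.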
----mapPartitionsWithIndex operator
----index: the index of each partition
----x: an iterator over the elements of one partition
----y: each element within that partition
scala> a.mapPartitionsWithIndex((index,x)=>x.map(y=>(index,y))).glom.collect
res4: Array[Array[(Int, Int)]] = Array(Array((0,1), (0,2), (0,3)), Array((1,4), (1,5), (1,6)), Array((2,7), (2,8), (2,9), (2,10)))
----union operator: concatenates two RDDs (duplicates are kept)
scala> val b = sc.makeRDD(1 to 5,2)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at makeRDD at :24
scala> b.glom.collect
res5: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4, 5))
----union returns a new RDD
scala> a.union(b)
res6: org.apache.spark.rdd.RDD[Int] = UnionRDD[12] at union at :28
scala> res6.glom.collect
res7: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10), Array(1, 2), Array(3, 4, 5))
----intersection: set-intersection operator
scala> a.intersection(b)
res8: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[19] at intersection at :28
scala> res8.collect
res9: Array[Int] = Array(3, 4, 1, 5, 2)
scala> res8.glom.collect
res10: Array[Array[Int]] = Array(Array(3), Array(4, 1), Array(5, 2))
----subtract: set-difference operator
----Removes from the first RDD every element that also appears in the second RDD, keeping only the elements unique to the first
scala> val rdd1 = sc.parallelize(1 to 7)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[104] at parallelize at :24
scala> val rdd2 = sc.makeRDD(3 to 6)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[106] at makeRDD at :24
scala> rdd1.subtract(rdd2).collect
res55: Array[Int] = Array(1, 2, 7)
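----sortBy operator: ascending defaults to true (ascending order); pass false for descending order
----sortBy source:
def sortBy[K](
    f: (T) => K,
    ascending: Boolean = true,                   // sort direction; defaults to true (ascending)
    numPartitions: Int = this.partitions.length) // number of partitions in the result
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]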
scala> val c = a.intersection(b)
c: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[27] at intersection at :27
scala> c.collect
res12: Array[Int] = Array(3, 4, 1, 5, 2)
----sortBy: when the second argument is omitted it defaults to true (ascending)
scala> c.sortBy(x=>x).collect
res14: Array[Int] = Array(1, 2, 3, 4, 5)
----Passing true explicitly gives the same result
scala> c.sortBy(x=>x,true).collect
res15: Array[Int] = Array(1, 2, 3, 4, 5)
----sortBy: false (descending)
scala> c.sortBy(x=>x,false).collect
res16: Array[Int] = Array(5, 4, 3, 2, 1)
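----distinct: deduplication operator
----(deduplication usually shrinks the data, so an optional argument lets you set the number of partitions of the result)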
scala> val c = a.union(b)
c: org.apache.spark.rdd.RDD[Int] = UnionRDD[48] at union at :27
scala> c.collect
res17: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5)
scala> c.distinct.collect
res18: Array[Int] = Array(10, 5, 1, 6, 7, 2, 3, 8, 4, 9)
scala> c.distinct.glom.collect
res19: Array[Array[Int]] = Array(Array(10, 5), Array(1, 6), Array(7, 2), Array(3, 8), Array(4, 9))
scala> c.glom.collect
res20: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10), Array(1, 2), Array(3, 4, 5))
----distinct(n): the argument is the number of partitions the deduplicated data is placed into
scala> c.distinct(1).glom.collect
res21: Array[Array[Int]] = Array(Array(4, 1, 6, 3, 7, 9, 8, 10, 5, 2))
scala> c.distinct(2).glom.collect
res22: Array[Array[Int]] = Array(Array(4, 6, 8, 10, 2), Array(1, 3, 7, 9, 5))
----sortBy can be chained on as well; it sorts the whole RDD, so each partition ends up holding a contiguous, ordered range
scala> c.distinct(2).sortBy(x=>x,true).glom.collect
res27: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5), Array(6, 7, 8, 9, 10))
----union on pair (Tuple2) RDDs
scala> val pair1 = a.map(x=>(x,1))
pair1: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[86] at map at :25
scala> pair1.collect
res28: Array[(Int, Int)] = Array((1,1), (2,1), (3,1), (4,1), (5,1), (6,1), (7,1), (8,1), (9,1), (10,1))
scala> val pair2 = b.map(x=>(x,2))
pair2: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[87] at map at :25
scala> pair2.collect
res29: Array[(Int, Int)] = Array((1,2), (2,2), (3,2), (4,2), (5,2))
scala> pair1.union(pair2)
res30: org.apache.spark.rdd.RDD[(Int, Int)] = UnionRDD[88] at union at :28
scala> res30.collect
res31: Array[(Int, Int)] = Array((1,1), (2,1), (3,1), (4,1), (5,1), (6,1), (7,1), (8,1), (9,1), (10,1), (1,2), (2,2), (3,2), (4,2), (5,2))
----join on pair (Tuple2) RDDs: an inner join by key, so only keys present in both RDDs appear
scala> pair1.join(pair2).collect
res32: Array[(Int, (Int, Int))] = Array((4,(1,2)), (1,(1,2)), (5,(1,2)), (2,(1,2)), (3,(1,2)))
scala> pair1.join(pair2).glom.collect
res6: Array[Array[(Int, (Int, Int))]] = Array(Array((4,(1,2))), Array((1,(1,2)), (5,(1,2))),
Array((2,(1,2))), Array((3,(1,2))))
----join with an explicit number of partitions
scala> pair1.join(pair2,2).glom.collect
res7: Array[Array[(Int, (Int, Int))]] = Array(Array((4,(1,2)), (2,(1,2))),
Array((1,(1,2)), (3,(1,2)), (5,(1,2))))
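----Difference between reduceByKey and groupByKey in Spark
1. reduceByKey merges the multiple values that belong to each key.
Most importantly, it merges locally (within each partition) before the shuffle, and the merge operation is a user-supplied function.
2. groupByKey also operates per key, but it moves every key-value pair across the shuffle without any local merge
and only produces a sequence (an Iterable) of values per key. groupByKey itself takes no merge function;
you first build the grouped RDD with groupByKey and then apply a custom function to it with map.
Because there is no local merge, more data travels between cluster nodes, which increases network overhead and latency.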
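The effect of the local merge on shuffle volume can be illustrated with a small model. The sketch below uses plain Scala collections, not Spark; the two partitions, their contents and the helper localMerge are hypothetical, and "shuffled" simply means the records that would have to leave their partition.

```scala
object ShuffleVolume {
  // Hypothetical pair data split over two partitions (assumption: not taken from the transcript).
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq("a" -> 1, "b" -> 1, "a" -> 1),
    Seq("a" -> 1, "b" -> 1, "b" -> 1)
  )

  // groupByKey style: every record crosses the shuffle; values are combined only afterwards.
  val shuffledByGroup: Seq[(String, Int)] = partitions.flatten
  val grouped: Map[String, Int] =
    shuffledByGroup.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }

  // reduceByKey style: merge inside each partition first, so at most one record
  // per key and partition crosses the shuffle.
  def localMerge(p: Seq[(String, Int)]): Seq[(String, Int)] =
    p.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(_ + _) }.toSeq
  val shuffledByReduce: Seq[(String, Int)] = partitions.flatMap(localMerge)
  val reduced: Map[String, Int] =
    shuffledByReduce.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(_ + _) }

  def main(args: Array[String]): Unit = {
    println(s"records shuffled without local merge: ${shuffledByGroup.size}")
    println(s"records shuffled with local merge: ${shuffledByReduce.size}")
    println(grouped == reduced)
  }
}
```

Both paths arrive at the same per-key sums, but the path with a local merge moves 4 records instead of 6 in this toy data. The gap grows with the number of repeated keys per partition.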
----groupByKey: groups the values whose keys are equal
----Source (signature in PairRDDFunctions):
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
scala> val pair3 = pair1.union(pair2)
pair3: org.apache.spark.rdd.RDD[(Int, Int)] = UnionRDD[91] at union at :27
scala> pair3.groupByKey.collect
res36: Array[(Int, Iterable[Int])] = Array((10,CompactBuffer(1)), (5,CompactBuffer(1, 2)), (1,CompactBuffer(1, 2)),
(6,CompactBuffer(1)), (7,CompactBuffer(1)), (2,CompactBuffer(1, 2)), (3,CompactBuffer(1, 2)), (8,CompactBuffer(1)), (4,CompactBuffer(1, 2)), (9,CompactBuffer(1)))
----reduceByKey(): merges the values of equal keys with the given function (here, addition)
scala> pair3.reduceByKey((x,y)=>(x+y)).collect
res37: Array[(Int, Int)] = Array((10,1), (5,3), (1,3), (6,1), (7,1), (2,3), (3,3), (8,1), (4,3), (9,1))
----groupBy on a pair RDD: x._1 is the first element of the tuple (the labels "偶数" and "奇数" mean "even" and "odd")
scala> pair3.groupBy(x=>{ if(x._1%2==0) "偶数" else "奇数"}).collect
res40: Array[(String, Iterable[(Int, Int)])] = Array((偶数,CompactBuffer((2,1), (4,1), (6,1), (8,1), (10,1), (2,2), (4,2))),
(奇数,CompactBuffer((1,1), (3,1), (5,1), (7,1), (9,1), (1,2), (3,2), (5,2))))
----groupBy on a pair RDD: x._2 is the second element of the tuple
scala> pair3.groupBy(x=>{ if(x._2%2==0) "偶数" else "奇数"}).collect
res41: Array[(String, Iterable[(Int, Int)])] = Array((偶数,CompactBuffer((1,2), (2,2), (3,2), (4,2), (5,2))),
(奇数,CompactBuffer((1,1), (2,1), (3,1), (4,1), (5,1), (6,1), (7,1), (8,1), (9,1), (10,1))))
----groupBy on an RDD of plain values
scala> a.groupBy(x=>{if(x%2==0) "偶数" else "奇数"}).collect
res39: Array[(String, Iterable[Int])] = Array((偶数,CompactBuffer(2, 4, 6, 8, 10)), (奇数,CompactBuffer(1, 3, 5, 7, 9)))
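----aggregate: a curried function (its arguments are passed in separate parameter lists)
----On a plain Scala collection, such a fold is computed over all elements together
----In Spark, the first function (seqOp) is computed over the elements of each partition, starting from the zero value; the second (combOp) then merges the per-partition results, again starting from the zero value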
scala> a.collect
res42: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> a.glom.collect
res43: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10))
scala> a.aggregate(5)((x,y)=>(x+y),(x,y)=>(x+y))
res45: Int = 75
----Shortest form
scala> a.aggregate(5)(_+_,_+_)
res50: Int = 75
scala> a.aggregate(5)((x,y)=>{println(x,y,"one");x+y},(x,y)=>{println(x,y,"two");x+y})
(5,4,one) 5+4=9
(9,5,one) 9+5=14
(14,6,one) 14+6=20 ----Array(4, 5, 6)
(5,7,one) 5+7=12
(5,1,one) 5+1=6
(6,2,one) 6+2=8
(8,3,one) 8+3=11 ----Array(1, 2, 3)
(12,8,one) 12+8=20
(20,9,one) 20+9=29
(29,10,one) 29+10=39 ----Array(7, 8, 9, 10)
(5,11,two) 5+11=16
(16,20,two) 16+20=36
(36,39,two) 36+39=75
res49: Int = 75
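The trace above can be reproduced step by step with ordinary folds. This is a sketch on plain Scala collections that mirrors the partition layout from a.glom.collect; it is meant to show where the zero value 5 enters the computation, not how Spark schedules the work.

```scala
object AggregateSketch {
  // Partition layout copied from the a.glom.collect output; plain Scala, not an RDD.
  val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9, 10))
  val zero = 5

  // First function: fold the elements of each partition, starting from the zero value.
  val perPartition: Seq[Int] = partitions.map(_.foldLeft(zero)(_ + _))

  // Second function: merge the per-partition results, starting from the zero value again.
  val result: Int = perPartition.foldLeft(zero)(_ + _)

  def main(args: Array[String]): Unit = {
    println(perPartition) // 11, 20 and 39, as in the "one" lines of the trace
    println(result)       // 75, as in the "two" lines of the trace
  }
}
```

The zero value is therefore added once per partition and once more in the final merge: 55 + 4 * 5 = 75. With a zero value of 0 the result would be the plain sum, 55.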
scala> val aa = sc.parallelize(List(("a",1),("b",4),("c",3),("d",8),("b",3),("a",3),("b",7),("c",4),("b",9)),3)
aa: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[101] at parallelize at :24
scala> aa.collect
res51: Array[(String, Int)] = Array((a,1), (b,4), (c,3), (d,8), (b,3), (a,3), (b,7), (c,4), (b,9))
scala> aa.glom.collect
res52: Array[Array[(String, Int)]] = Array(Array((a,1), (b,4), (c,3)), Array((d,8), (b,3), (a,3)), Array((b,7), (c,4), (b,9)))
scala> aa.aggregateByKey(0)(math.max(_,_),_+_)
res56: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[103] at aggregateByKey at :26
scala> res56.collect
res57: Array[(String, Int)] = Array((c,7), (d,8), (a,4), (b,16))
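The aggregateByKey(0)(math.max(_,_),_+_) call above takes the maximum value per key inside each partition and then sums those maxima across partitions. The following sketch replays that on plain Scala collections, using the partition layout from aa.glom.collect; the helper seqWithin is a name introduced here, not part of Spark.

```scala
object AggregateByKeySketch {
  // Partition layout copied from the aa.glom.collect output; plain Scala, not an RDD.
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq("a" -> 1, "b" -> 4, "c" -> 3),
    Seq("d" -> 8, "b" -> 3, "a" -> 3),
    Seq("b" -> 7, "c" -> 4, "b" -> 9)
  )
  val zero = 0

  // First function, per key within one partition: running maximum starting from zero.
  def seqWithin(p: Seq[(String, Int)]): Map[String, Int] =
    p.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).foldLeft(zero)((acc, v) => math.max(acc, v))
    }

  val perPartition: Seq[Map[String, Int]] = partitions.map(seqWithin)

  // Second function, per key across partitions: sum of the per-partition maxima.
  val result: Map[String, Int] =
    perPartition.flatten.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(_ + _) }

  def main(args: Array[String]): Unit = println(result)
}
```

For key b the per-partition maxima are 4, 3 and 9, which sum to 16 and match the (b,16) entry in the transcript.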
----Putting it together
----sortBy on an RDD of plain values (sortByKey is only available on pair RDDs)
scala> val aaaa = sc.makeRDD(List(3,9,2,6,1,5))
aaaa: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[105] at makeRDD at :24
scala> aaaa.collect
res59: Array[Int] = Array(3, 9, 2, 6, 1, 5)
scala> aaaa.sortBy(x=>x, true).collect
res60: Array[Int] = Array(1, 2, 3, 5, 6, 9)
scala> aaaa.glom.collect
res61: Array[Array[Int]] = Array(Array(3), Array(9, 2), Array(6), Array(1, 5))
scala> aaaa.sortBy(x=>x, true).glom.collect
res62: Array[Array[Int]] = Array(Array(1, 2), Array(3), Array(5, 6), Array(9))
----sortBy and sortByKey on a pair (Tuple2) RDD
scala> val pair3 = pair1.union(pair2)
pair3: org.apache.spark.rdd.RDD[(Int, Int)] = UnionRDD[91] at union at :27
scala> pair3.collect
res63: Array[(Int, Int)] = Array((1,1), (2,1), (3,1), (4,1), (5,1), (6,1), (7,1), (8,1), (9,1), (10,1), (1,2), (2,2), (3,2), (4,2), (5,2))
scala> pair3.sortBy(x=>x, true).collect
res64: Array[(Int, Int)] = Array((1,1), (1,2), (2,1), (2,2), (3,1), (3,2), (4,1), (4,2), (5,1), (5,2), (6,1), (7,1), (8,1), (9,1), (10,1))
scala> pair3.sortBy(x=>x, false).collect
res65: Array[(Int, Int)] = Array((10,1), (9,1), (8,1), (7,1), (6,1), (5,2), (5,1), (4,2), (4,1), (3,2), (3,1), (2,2), (2,1), (1,2), (1,1))
----sortByKey + Tab shows the signature: a Boolean (ascending) and a number of partitions
scala> pair3.sortByKey
def sortByKey(ascending: Boolean,numPartitions: Int): org.apache.spark.rdd.RDD[(Int, Int)]
scala> pair3.sortByKey(false).collect
res1: Array[(Int, Int)] = Array((10,1), (9,1), (8,1), (7,1), (6,1), (5,1), (5,2), (4,1), (4,2), (3,1), (3,2), (2,1), (2,2), (1,1), (1,2))
----Without numPartitions, the default partitioning is used: 4 partitions here
scala> pair3.sortByKey(false).glom.collect
res4: Array[Array[(Int, Int)]] = Array(Array((10,1), (9,1), (8,1)),
Array((7,1), (6,1), (5,1), (5,2)),
Array((4,1), (4,2), (3,1), (3,2)),
Array((2,1), (2,2), (1,1), (1,2)))
----With numPartitions = 2
scala> pair3.sortByKey(false,2).glom.collect
res2: Array[Array[(Int, Int)]] = Array(Array((10,1), (9,1), (8,1), (7,1), (6,1), (5,1), (5,2)),
Array((4,1), (4,2), (3,1), (3,2), (2,1), (2,2), (1,1), (1,2)))
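----pipe("head -n 1"): pipes the elements of each partition through the external command; here it keeps the first line of each partition
----pipe("tail -n 1"): here it keeps the last line of each partition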
scala> pair1.glom.collect
res8: Array[Array[(Int, Int)]] = Array(Array((1,1), (2,1)), Array((3,1), (4,1), (5,1)), Array((6,1), (7,1)), Array((8,1), (9,1), (10,1)))
scala> pair1.pipe("head -n 1").collect
res9: Array[String] = Array((1,1), (3,1), (6,1), (8,1))
scala> pair1.pipe("tail -n 1").collect
res10: Array[String] = Array((2,1), (5,1), (7,1), (10,1))
scala> pair1.pipe("tail -n 2").collect
res11: Array[String] = Array((1,1), (2,1), (4,1), (5,1), (6,1), (7,1), (9,1), (10,1))
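The pipe results above follow from the command running once per partition, with each element written as one line of text. The sketch below imitates that on plain Scala collections: take and takeRight are stand-ins for head -n and tail -n, no external process is started, and the partition layout is copied from pair1.glom.collect.

```scala
object PipeSketch {
  // Partition layout copied from the pair1.glom.collect output; plain Scala, not an RDD.
  val partitions: Vector[Vector[(Int, Int)]] = Vector(
    Vector((1, 1), (2, 1)),
    Vector((3, 1), (4, 1), (5, 1)),
    Vector((6, 1), (7, 1)),
    Vector((8, 1), (9, 1), (10, 1))
  )

  // Each element becomes one line of text (its toString), grouped by partition.
  val lines: Vector[Vector[String]] = partitions.map(_.map(_.toString))

  // Stand-in for "head -n 1": the first line of every partition.
  val head1: Vector[String] = lines.flatMap(_.take(1))

  // Stand-in for "tail -n 2": the last two lines of every partition.
  val tail2: Vector[String] = lines.flatMap(_.takeRight(2))

  def main(args: Array[String]): Unit = {
    println(head1)
    println(tail2)
  }
}
```

Because the command sees one partition at a time, "head -n 1" yields one line per partition (four lines here), not one line for the whole data set.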
----coalesce: reduces the number of partitions (re-partitioning)
----coalesce source:
def coalesce(numPartitions: Int, shuffle: Boolean = false,
partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
scala> val rdd = sc.parallelize(1 to 17, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[92] at parallelize at :24
----The RDD has the 4 partitions requested in parallelize
scala> rdd.partitions.size
res48: Int = 4
----Now merge them into 3 partitions
scala> val coalesceRDD = rdd.coalesce(3)
coalesceRDD: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[95] at coalesce at :25
scala> coalesceRDD.partitions.size
res50: Int = 3
----The first two partitions are unchanged; only the last two partitions have been merged into one
scala> coalesceRDD.glom.collect
res52: Array[Array[Int]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8), Array(9, 10, 11, 12, 13, 14, 15, 16, 17))
----repartition: re-partitioning (reshuffles all of the data)
----repartition source:
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}
scala> val rdd = sc.parallelize(1 to 16 , 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[97] at parallelize at :24
scala> rdd.glom.collect
res53: Array[Array[Int]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8), Array(9, 10, 11, 12), Array(13, 14, 15, 16))
scala> val rerdd = rdd.repartition(3)
rerdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[102] at repartition at :25
scala> rerdd.glom.collect
res54: Array[Array[Int]] = Array(Array(3, 6, 11, 14), Array(1, 4, 7, 9, 12, 15), Array(2, 5, 8, 10, 13, 16))
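----Difference between coalesce and repartition:
1. coalesce re-partitions and lets you choose whether to shuffle,
via the parameter shuffle: Boolean (false by default).
2. repartition simply calls coalesce with shuffle = true, so it always shuffles.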
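The practical difference between the two modes is whether individual elements change partitions. The sketch below is a simplified model on plain Scala collections and is not Spark's implementation: mergeContiguous and redistribute are hypothetical helpers, the contiguous-range grouping was chosen because it reproduces the coalesce(3) output shown earlier, and the round-robin redistribution only approximates what a shuffle does.

```scala
object CoalesceSketch {
  // The 4 partitions of parallelize(1 to 17, 4) as shown in the transcript; plain Scala, not an RDD.
  val parents: Vector[Vector[Int]] =
    Vector((1 to 4).toVector, (5 to 8).toVector, (9 to 12).toVector, (13 to 17).toVector)

  // Without shuffle: whole parent partitions are merged; no element leaves its group.
  def mergeContiguous[A](ps: Vector[Vector[A]], n: Int): Vector[Vector[A]] =
    Vector.tabulate(n) { i =>
      val start = i * ps.length / n
      val end = (i + 1) * ps.length / n
      ps.slice(start, end).flatten
    }

  // With shuffle: every element may move; modelled here as round-robin over all elements.
  def redistribute[A](ps: Vector[Vector[A]], n: Int): Vector[Vector[A]] = {
    val indexed = ps.flatten.zipWithIndex
    Vector.tabulate(n)(i => indexed.collect { case (x, j) if j % n == i => x })
  }

  val coalesced: Vector[Vector[Int]] = mergeContiguous(parents, 3)
  val repartitioned: Vector[Vector[Int]] = redistribute(parents, 3)

  def main(args: Array[String]): Unit = {
    println(coalesced)
    println(repartitioned.map(_.size))
  }
}
```

In this model the merge without shuffle leaves partitions of size 4, 4 and 9, while the redistribution gives sizes 6, 6 and 5. That is the usual trade-off: skipping the shuffle is cheaper but can leave partitions unbalanced.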