Summary of Common Spark Operators

1.map, flatMap, distinct
map: applies the given function to every element of the RDD and produces a new element for each;
input and output partitions correspond one to one,
i.e. the output RDD has exactly as many partitions as the input RDD.
flatMap: like map, but all resulting elements are flattened into a single collection.
distinct: removes duplicate elements from the RDD.
Note: when flatMap is applied to an RDD[String], each String is treated as a sequence of characters (see the second flatMap call below).
scala> val rdd =sc.textFile("/input/input1.txt")

scala> val rdd1 = rdd.map(x=>x.split(" "))
scala> rdd1.collect
Array[Array[String]] = Array(Array(hello, world), Array(how, are, you?), Array(ni, hao), Array(hello, tom))

scala> val rdd2 = rdd1.flatMap(x=>x) // flatten the nested arrays
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at flatMap at :28

scala> rdd2.collect
res1: Array[String] = Array(hello, world, how, are, you?, ni, hao, hello, tom)  

scala> rdd2.flatMap(x=>x).collect
res3: Array[Char] = Array(h, e, l, l, o, w, o, r, l, d, h, o, w, a, r, e, y, o, u, ?, n, i, h, a, o, h, e, l, l, o, t, o, m)

scala> val rdd3 = rdd2.distinct
rdd3: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at distinct at :30

scala> rdd3.collect
res4: Array[String] = Array(are, tom, how, you?, hello, hao, world, ni) 

2.coalesce and repartition
  coalesce: changes the number of partitions of an RDD
  repartition: repartitions an RDD
  coalesce: changes the partition count and returns a new RDD.
  It takes two parameters: the target number of partitions, and a Boolean shuffle flag that defaults to false.
  If the new partition count is smaller than the current one, shuffle can stay false; shrinking the partition count works by default.
  If the new partition count is larger than the current one, shuffle must be set to true.
  Typical use: after a filter or other pruning operation, the data in each partition of the new RDD shrinks sharply, so consider repartitioning.

   repartition's underlying source code:
   def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }
   Under the hood it simply calls coalesce with shuffle = true.

   Check the default number of partitions of the RDD:
   scala> rdd.partitions.size
          res4: Int = 2

  
   Default is 2 partitions; shrinking works. Change the partition count and get a new RDD:
    scala> val rdd4 = rdd.coalesce(1)
    rdd4: org.apache.spark.rdd.RDD[String] = CoalescedRDD[8] at coalesce at :26

    scala> rdd4.partitions.size
    res10: Int = 1

   Default is 2 partitions; growing without shuffle does NOT work:
    scala> val rdd5 = rdd.coalesce(5)
    rdd5: org.apache.spark.rdd.RDD[String] = CoalescedRDD[9] at coalesce at :26

    scala> rdd5.partitions.size
    res12: Int = 2

   Default is 2 partitions; growing works once the shuffle flag is set to true:
    scala> val rdd5 = rdd.coalesce(5,true)
    rdd5: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[13] at coalesce at :26

    scala> rdd5.partitions.size
    res13: Int = 5

   Repartition with repartition, which can both increase and decrease the partition count:
    scala> val rdd6 = rdd5.repartition(8)
    rdd6: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at repartition at :34

    scala> rdd6.partitions.size
    res6: Int = 8

    ******* Changing the number of partitions changes the number of tasks ********
    1) textFile can set the partition count when loading a file; to change it afterwards, use either of the two methods above.
    2) Typical scenario: after cleaning/filtering, the data shrinks and glom shows partitions that are empty or nearly empty; repartitioning solves this, as sketched below.
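    A minimal sketch of that scenario (the numbers and partition counts are made up for illustration):

    // 100 partitions before filtering
    val big = sc.parallelize(1 to 100000, 100)
    // a very selective filter: only 100 elements survive, spread thinly over 100 partitions
    val small = big.filter(_ % 1000 == 0)
    small.partitions.size        // still 100
    // shrink the partition count; no shuffle is needed when decreasing
    val compacted = small.coalesce(2)
    compacted.partitions.size    // 2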

    3.randomSplit:
    def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]
    Description: splits the RDD randomly according to the given weights and returns an array of that many RDDs.
    Example application: a total-order sort, as done in Hadoop.
    scala> val rdd = sc.parallelize(List(1,2,3,4,5,6,7))
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at :24

    0.7 + 0.1 + 0.2 = 1: the 7 elements of rdd are distributed across three RDDs according to these weights; the weights should add up to 1 (Spark normalizes them if they do not).
    scala> val rdd1 = rdd.randomSplit(Array(0.7,0.1,0.2))
    rdd1: Array[org.apache.spark.rdd.RDD[Int]] = Array(MapPartitionsRDD[1] at randomSplit at :26, MapPartitionsRDD[2] at randomSplit at :26, MapPartitionsRDD[3] at randomSplit at :26)

    scala> rdd1(0).collect
    res0: Array[Int] = Array(1, 5)                                                  

    scala> rdd1(1).collect
    res1: Array[Int] = Array()

    scala> rdd1(2).collect
    res2: Array[Int] = Array(2, 3, 4, 6, 7)   
    
    In short: the RDD's elements are randomly re-split into new RDDs according to the weights.

4.glom Description: returns the data items of each partition, as one array per partition.
val rdd = sc.parallelize(List(1,2,3,4,5,6,7,8,9))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at :24

scala> rdd.glom.collect
res0: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5, 6, 7, 8, 9))


5.union: union of two RDDs, without removing duplicates
scala> val rdd = sc.parallelize(Array(9,8,7,6,5,4,3,2))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at :24

val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at :24

scala> rdd.union(rdd1).collect
res3: Array[Int] = Array(9, 8, 7, 6, 5, 4, 3, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9)

scala> rdd1.union(rdd).collect
res4: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 8, 7, 6, 5, 4, 3, 2)
(collect returns the result as an Array.)

6.subtract: set difference
scala> rdd.subtract(rdd1).collect   //rdd-rdd1
res8: Array[Int] = Array()

scala> rdd1.subtract(rdd).collect  // rdd1 - rdd
res9: Array[Int] = Array(1)

7.intersection: intersection, duplicates removed
scala> rdd.intersection(rdd1).collect
res10: Array[Int] = Array(4, 6, 8, 7, 9, 3, 5, 2)

scala> rdd1.intersection(rdd).collect
res11: Array[Int] = Array(4, 6, 3, 7, 9, 8, 5, 2)


*********************************************************************
val list = List(1,2,3)
// :: prepends an element to the head of a list, producing a new list; in x :: list, x is added to the head of list
 println(4 :: list)  // prints: List(4, 1, 2, 3)   equivalent to list.::(4)

// :+ appends an element to the tail of a list: list :+ x
    println(list :+ 6)  // prints: List(1, 2, 3, 6)

// +: prepends an element to the head of a list
 val list2 = "A"+:"B"+:Nil // Nil is the empty List, defined as List[Nothing]
   println(list2)  // prints: List(A, B)

// ::: concatenates two Lists: list ::: list2
    println(list ::: list2) // prints: List(1, 2, 3, A, B)

// ++ concatenates two collections: list ++ list2
    println(list ++ list2) // prints: List(1, 2, 3, A, B)


*********************************************************************
8.mapPartitions
Description: operates on each partition as a whole.
It is similar to map, but map processes every element of the RDD individually,
while mapPartitions (and foreachPartition) receives an iterator over each partition of the RDD.

val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at :24

import org.apache.spark.{SparkConf, SparkContext}

object MapPartitions {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("MapPartitions")
    val sc   = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    val s = sc.parallelize(1 to 9, 3)
    // apply myfunc to the iterator of each partition and print the resulting pairs
    s.mapPartitions(myfunc).collect().foreach(println)
    sc.stop()
  }

  // Pairs up neighbouring elements within a partition, e.g. 1,2,3 -> (1,2), (2,3);
  // assumes each partition is non-empty.
  def myfunc[T](iter: Iterator[T]): Iterator[(T, T)] = {
    var list = List[(T, T)]()
    var res1 = iter.next()
    while (iter.hasNext) {
      val res2 = iter.next()
      list = (res1, res2) :: list
      res1 = res2
    }
    list.iterator
  }
}

9.mapPartitionsWithIndex
Similar to mapPartitions, but the function takes two parameters:
the first is the partition index, the second is an iterator over all items in that partition.
The output is an iterator containing the items after applying whatever transformation the function encodes.
val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
scala> x.glom.collect
res11: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10))


def myfunc(index: Int, iter: Iterator[Int]) : Iterator[String] = {
  iter.map(x => index + "," + x)
}
Note: the iterator's element type (here Iterator[Int]) must match the element type of the RDD.
x.mapPartitionsWithIndex(myfunc).collect()
res10: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9, 2,10)

10.zip: combines two RDDs into a new RDD of pairs
def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)]
Description: 1. the two RDDs may have different element types;
      2. both RDDs must have the same number of partitions;
      3. each pair of corresponding partitions must contain the same number of elements;
       if the counts differ, the job fails (see the failure sketch below) with, e.g.: TaskSetManager: Lost task 0.0 in stage 11.0 (TID 29, 192.168.179.131, executor 1):
 org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition
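A minimal sketch of requirement 3 (the RDDs left and right are made up for illustration); working examples follow after it:

// same number of partitions, but 5 elements vs 4
val left  = sc.parallelize(1 to 5, 2)
val right = sc.parallelize(Seq("a", "b", "c", "d"), 2)
// left.zip(right).collect
// throws org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition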

scala> val rdd = sc.parallelize(1 to 10,3)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[34] at parallelize at :24

val rdd1 = sc.parallelize(List("a","b","c","d","e","f","g","h","i","j"),3)
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[38] at parallelize at :24


scala> rdd.zip(rdd1).collect     //(rdd,rdd1)
res14: Array[(Int, String)] = Array((1,a), (2,b), (3,c), (4,d), (5,e), (6,f), (7,g), (8,h), (9,i), (10,j))

scala> rdd1.zip(rdd).collect    //(rdd1,rdd)
res15: Array[(String, Int)] = Array((a,1), (b,2), (c,3), (d,4), (e,5), (f,6), (g,7), (h,8), (i,9), (j,10))


11.zipPartitions Requirement: all zipped RDDs must have the same number of partitions; the transcript below assumes the three RDDs sketched next.
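The transcript uses rdd1, rdd2 and rdd3, which are not defined in this excerpt; a hedged reconstruction consistent with the output shown would be:

    val rdd1 = sc.makeRDD(1 to 5, 2)
    val rdd2 = sc.makeRDD(Seq("A", "B", "C", "D", "E"), 2)
    val rdd3 = sc.makeRDD(Seq("a", "b", "c", "d", "e"), 2)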
scala> var rdd4 = rdd1.zipPartitions(rdd2,rdd3){
     |       (rdd1Iter,rdd2Iter,rdd3Iter) => {
     |         var result = List[String]()
     |         while(rdd1Iter.hasNext && rdd2Iter.hasNext && rdd3Iter.hasNext) {
     |           result::=(rdd1Iter.next() + "_" + rdd2Iter.next() + "_" + rdd3Iter.next())
     |         }
     |         result.iterator
     |       }
     |     }
rdd4: org.apache.spark.rdd.RDD[String] = ZippedPartitionsRDD3[33] at zipPartitions at :27
 
scala> rdd4.collect
res23: Array[String] = Array(2_B_b, 1_A_a, 5_E_e, 4_D_d, 3_C_c)

12.zipWithIndex
def zipWithIndex(): RDD[(T, Long)]
Pairs each element of the RDD with its index, producing a new RDD[(T, Long)].

scala> val rdd1 = sc.parallelize(List("a","b","c","d","e","f","g","h","i","j"),3)
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[38] at parallelize at :24

scala> rdd1.zipWithIndex.collect
res31: Array[(String, Long)] = Array((a,0), (b,1), (c,2), (d,3), (e,4), (f,5), (g,6), (h,7), (i,8), (j,9))

scala> rdd1.zipWithIndex.glom.collect
res32: Array[Array[(String, Long)]] = Array(Array((a,0), (b,1), (c,2)), Array((d,3), (e,4), (f,5)), Array((g,6), (h,7), (i,8), (j,9)))

13.zipWithUniqueId
def zipWithUniqueId(): RDD[(T, Long)]
    This function pairs each element of the RDD with a unique Long ID, forming key/value pairs. The ID is generated as follows:
    the first element of each partition gets an ID equal to that partition's index;
    the N-th element of a partition gets the previous element's ID plus the total number of partitions.
    Unlike zipWithIndex, the ID is unique but is not necessarily the element's position in the RDD.
    
    scala> val rdd = sc.parallelize(List(1,2,3,4,5),2)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[21] at parallelize at :24

    scala> rdd.glom.collect
    res25: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4, 5))                   
        
    scala> val rdd2 = rdd.zipWithUniqueId()
    rdd2: org.apache.spark.rdd.RDD[(Int, Long)] = MapPartitionsRDD[23] at zipWithUniqueId at :26

    scala> rdd2.collect
    res26: Array[(Int, Long)] = Array((1,0), (2,2), (3,1), (4,3), (5,5))
         Rule:
         step1: the first element of partition 0 gets ID 0; the first element of partition 1 gets ID 1
         step2: the second element of partition 0 gets 0 + 2 = 2
         step3: the second element of partition 1 gets 1 + 2 = 3; its third element gets 3 + 2 = 5

         
         With 2 partitions:
         res25: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4, 5))
                assigned IDs:                  0  2          1  3  5

         With 3 partitions:
         scala> val rdd = sc.parallelize(List(1,2,3,4,5),3)

         res37: Array[Array[Int]] = Array(Array(1), Array(2, 3), Array(4, 5))
                assigned IDs:                  0          1  4          2  5

         res38: Array[(Int, Long)] = Array((1,0), (2,1), (3,4), (4,2), (5,5))

*************************************************************************************************************************************************


14.reduceByKey
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
Description: merges the values that share the same key.
scala> val rdd = sc.parallelize(List("cat","dog","bear","frog","fish","chichen"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[47] at parallelize at :24

scala> rdd.map(x=>(x.length,x)).reduceByKey(_+_).collect
res35: Array[(Int, String)] = Array((4,bearfrogfish), (3,catdog), (7,chichen))

15.keyBy
def keyBy[K](f: T => K): RDD[(K, T)]
Description: uses the return value of f as the key and pairs it with each element of the RDD, producing a pair RDD (RDD[(K, T)]).
scala> val rdd = sc.parallelize(List("cat","dog","bear","frog","fish","chichen"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[47] at parallelize at :24

scala> rdd.keyBy(x=>x+"1").collect
res38: Array[(String, String)] = Array((cat1,cat), (dog1,dog), (bear1,bear), (frog1,frog), (fish1,fish), (chichen1,chichen))

 16.groupByKey()
  def groupByKey(): RDD[(K, Iterable[V])]
  Description: groups values by the same key; the return type is RDD[(K, Iterable[V])].

scala> val rdd = sc.parallelize(List((1,"a"),(1,"f"),(2,"b"),(2,"c"),(3,"d")))
rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[0] at parallelize at :24


scala> rdd.groupByKey.collect
res0: Array[(Int, Iterable[String])] = Array((1,CompactBuffer(a, f)), (3,CompactBuffer(d)), (2,CompactBuffer(b, c)))

17.keys
    def keys: RDD[K]
    Description: returns an RDD of the keys of a pair RDD.
scala> rdd.keys.collect
res1: Array[Int] = Array(1, 1, 2, 2, 3)

18.values
 def values: RDD[V]
 Description: returns an RDD of the values of a pair RDD.

scala> rdd.values.collect
res2: Array[String] = Array(a, f, b, c, d)

19.sortByKey
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.size): RDD[P]
Description: sorts by key; the default is ascending (ascending: Boolean = true).

scala> val rdd = sc.parallelize(List("one","two","three","four","five"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[7] at parallelize at :24

scala> val rdd1 = sc.parallelize(1 to rdd.count.toInt)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at :26

scala> rdd.zip(rdd1).collect
res4: Array[(String, Int)] = Array((one,1), (two,2), (three,3), (four,4), (five,5))

scala> rdd.zip(rdd1).map(_.swap).sortByKey().collect
res6: Array[(Int, String)] = Array((1,one), (2,two), (3,three), (4,four), (5,five))

scala> rdd.zip(rdd1).map(_.swap).sortByKey(false).collect
res7: Array[(Int, String)] = Array((5,five), (4,four), (3,three), (2,two), (1,one))

20.partitionBy
def partitionBy(partitioner: Partitioner): RDD[(K, V)]
Description: repartitions a pair RDD using the given Partitioner, as sketched below.
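A minimal sketch using the built-in HashPartitioner (the pair RDD here is made up for illustration):

import org.apache.spark.HashPartitioner

val pairs  = sc.parallelize(List((1, "a"), (2, "b"), (3, "c"), (4, "d")), 2)
val byHash = pairs.partitionBy(new HashPartitioner(3))
byHash.partitions.size   // 3
// each key lands in partition key.hashCode % 3: key 3 -> partition 0, keys 1 and 4 -> partition 1, key 2 -> partition 2
byHash.glom.collect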

**************************************************************************************************************

Aggregation operations
1.mapValues[Pair]
def mapValues[U](f: V => U): RDD[(K, U)]
Description: turns RDD[(K, V)] into RDD[(K, U)] by applying f: V => U to each value; keys are unchanged.

scala> val rdd = sc.parallelize(List("one","two","three","four","five"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[23] at parallelize at

scala> val rdd2 = rdd.map(x=>(x.length,x)).mapValues(_+"_").collect
rdd2: Array[(Int, String)] = Array((3,one_), (3,two_), (5,three_), (4,four_), (4,five_))

2.flatMapValues[Pair]
 def flatMapValues[U](f: V => TraversableOnce[U]): RDD[(K, U)]
 Description: like mapValues, but flattens the collection returned by f, pairing each produced element with the original key.
scala>  val rdd = sc.parallelize(List("one","two","three","four","five"))

scala> rdd.map(x=>(x.length,x)).flatMapValues("_"+_+"_").collect
res18: Array[(Int, Char)] = Array((3,_), (3,o), (3,n), (3,e), (3,_), (3,_), (3,t), (3,w), (3,o), (3,_), (5,_), (5,t), (5,h), (5,r), (5,e), (5,e), (5,_), (4,_), (4,f), (4,o), (4,u), (4,r), (4,_), (4,_), (4,f), (4,i), (4,v), (4,e), (4,_))

3.subtractByKey[Pair]
    def subtractByKey[W: ClassTag](other: RDD[(K, W)]): RDD[(K, V)]
    Description: removes the elements whose key also appears in the other RDD.
    val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
    val b = a.keyBy(_.length)
    val c = sc.parallelize(List("ant", "falcon", "squid"), 2)
    val d = c.keyBy(_.length)
    b.subtractByKey(d).collect
    res15: Array[(Int, String)] = Array((4,lion))
4.combineByKey[Pair]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]
The three important function parameters:
createCombiner: V => C, takes the current value and may apply some extra operation (e.g. a type conversion) before returning it (this is like an initialization step)
mergeValue: (C, V) => C, merges a value V into the previously built combiner C (performed within each partition)
mergeCombiners: (C, C) => C, merges two combiners C (performed across partitions)

In other words: createCombiner fires the first time a key is seen in a partition;
mergeValue fires when the same key is seen again within that partition;
mergeCombiners merges the values of the same key coming from different partitions.

scala> val a = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[30] at parallelize at :24

scala> val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at :24

scala> val c = b.zip(a)
c: org.apache.spark.rdd.RDD[(Int, String)] = ZippedPartitionsRDD2[32] at zip at :28

scala> c.glom.collect
res19: Array[Array[(Int, String)]] = Array(Array((1,dog), (1,cat), (2,gnu)), Array((2,salmon), (2,rabbit), (1,turkey)), Array((2,wolf), (2,bear), (2,bee)))

scala> c.combineByKey(List(_),(x:List[String],y:String)=>y::x,(x:List[String],y:List[String])=>x:::y).collect
res23: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)), (2,List(gnu, rabbit, salmon, bee, bear, wolf)))

5.foldByKey[Pair]
def foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)]
Description: similar to reduceByKey, but the curried first argument supplies an initial value zeroValue.

scala>  val rdd = sc.parallelize(List("one","two","three","four","five"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[36] at parallelize at :24

scala> rdd.map(x=>(x.length,x)).foldByKey("@")(_+_).collect
res27: Array[(Int, String)] = Array((4,@fourfive), (3,@onetwo), (5,@three))

6.reduceByKeyLocally:
This is an action.
    The function combines the values of each key in an RDD[K, V] using the given function and returns the result as a Map[K, V] on the driver, rather than as an RDD[K, V].
    def reduceByKeyLocally(func: (V, V) => V): Map[K, V]
    scala> val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
    a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at :24

    scala> val b = a.map(x => (x.length, x))
    b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[1] at map at :26
    
    scala> b.glom.collect
    Array[Array[(Int, String)]] = Array(Array((3,dog), (3,cat)), Array((3,owl), (3,gnu), (3,ant)))    
    
    scala> b.reduceByKeyLocally(_+_)
    res0: scala.collection.Map[Int,String] = Map(3 -> dogcatowlgnuant) 

 7.join
 def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
 Description: inner join on the key; produces a pair (K, (V, W)) for every combination of values whose keys match in both RDDs.
 scala> val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[43] at parallelize at :24

scala> val b = a.keyBy(_.length)
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[44] at keyBy at :26

scala> val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
c: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[45] at parallelize at :24

scala> val d = c.keyBy(_.length)
d: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[46] at keyBy at :26

scala> b.join(d).collect
res28: Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (3,(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), (3,(rat,cat)), (3,(rat,gnu)), (3,(rat,bee)))

scala> val b = a.keyBy(_.length).collect
b: Array[(Int, String)] = Array((3,dog), (6,salmon), (6,salmon), (3,rat), (8,elephant))
8.rightOuterJoin
       Description: joins two RDDs, keeping every key of the second (other) RDD; keys missing from the first RDD yield None (right outer join).
9.leftOuterJoin
       Description: joins two RDDs, keeping every key of the first (source) RDD; keys missing from the other RDD yield None (left outer join).
10.cogroup
       Description: groups the values sharing the same key from both RDDs together; keys from either RDD are kept (a full outer grouping). See the sketch below.
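A minimal sketch reusing a and c from the join example above (b and d are rebuilt as pair RDDs here, since b was last redefined as an Array; the comments state the result types):

val b = a.keyBy(_.length)
val d = c.keyBy(_.length)
b.leftOuterJoin(d).collect    // RDD[(Int, (String, Option[String]))]: key 8 (elephant) appears as (8,(elephant,None))
b.rightOuterJoin(d).collect   // RDD[(Int, (Option[String], String))]: every key of d is kept, e.g. key 4 pairs with None on b's side
b.cogroup(d).collect          // RDD[(Int, (Iterable[String], Iterable[String]))]: one entry per key occurring in either RDD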


Actions
-------------------------------------
    Definition: an action triggers a Job by calling the runJob() method,
    for example: collect, count.

    1.first(): T returns the first element of the RDD, without sorting
      count(): Long returns the number of elements in the RDD
      reduce(f:(T,T)=>T): T combines the elements of the RDD pairwise using the function f
      collect(): Array[T] returns the RDD's elements as an array (see the sketch below)
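    A minimal sketch of these four actions (the values are made up; the expected results are shown as comments):

    val nums = sc.parallelize(List(3, 1, 4, 1, 5))
    nums.first            // 3  (first element, no ordering applied)
    nums.count            // 5
    nums.reduce(_ + _)    // 14
    nums.collect          // Array(3, 1, 4, 1, 5)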
    
    2.take(num:Int): Array[T] returns the elements at indices 0 to num-1, without sorting
    scala> var rdd1 = sc.makeRDD(Seq(10,4,3,12,3))
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at makeRDD at :24

    scala> rdd1.take(1)
    res7: Array[Int] = Array(10)

    3.top(num:Int): Array[T] returns the first num elements under the default (descending) ordering, or under a specified ordering
    scala> rdd1.top(3)
    res8: Array[Int] = Array(12, 10, 4)                                             

    4.takeOrdered(num:Int): Array[T] is like top, but returns the elements in the opposite (ascending) order
    scala> rdd1.takeOrdered(3)
    res9: Array[Int] = Array(3, 3, 4)

        5.sortBy[K](f: T => K) sorts the RDD by the key computed by f (sortBy itself is a transformation; the RDD below holds key/value pairs)
    scala> var rdd1 = sc.makeRDD(Array(("A",2),("A",1),("B",6),("B",3),("B",7)))
    rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[7] at makeRDD at :24
    
    Sort by the whole (key, value) pair, ascending:
    scala> rdd1.sortBy(x=>x).collect
    res10: Array[(String, Int)] = Array((A,1), (A,2), (B,3), (B,6), (B,7))          
    
    Sort by value, descending:
    scala> rdd1.sortBy(x=>x._2,false).collect
    res13: Array[(String, Int)] = Array((B,7), (B,6), (B,3), (A,2), (A,1))

    
    6.foreach
    Description: the function runs on the executors, so its output is produced on the executor nodes rather than on the driver.
    In local mode:
    $ > spark-shell 
    scala> val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
    a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[6] at parallelize at :24

    The printed output:
    scala> a.foreach(println)
    dog
    tiger
    lion
    cat
    spider
    eagle
    scala> :q

    $> spark-shell --master spark://master:7077
    scala> val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
    a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at :24

    scala> a.foreach(println)
    No output appears on the driver.
    
    Check the web UI at master:4040:
    Running Applications --> Executor Summary --> Logs --> stdout log page for app-20190727155306-0007/2 shows:  cat spider eagle
    
    Summary:
    in local spark-shell all the daemon processes run in a single JVM, so the output of foreach(println) is visible there.
    
    Printing the elements of an RDD
    Another common idiom is to try to print an RDD's elements with rdd.foreach(println) or rdd.map(println).
    On a single machine this generates the expected output and prints all of the RDD's elements.
    In cluster mode, however, the stdout being written to is the executors' stdout,
    not the driver's, so the driver's stdout shows nothing. To print all elements on the driver,
    first bring the RDD to the driver node with collect(): rdd.collect().foreach(println).
    This can make the driver run out of memory, though, because collect() pulls the entire RDD onto a single machine;
    if you only need to print a few elements, a safer approach is take(): rdd.take(100).foreach(println).
    
    
    7.aggregate
       def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U
       Description: aggregate combines the elements of the RDD. seqOp first folds the T-typed elements of each partition into a value of type U,
         then combOp merges the per-partition U values into a single U.
         Note that both seqOp and combOp start from zeroValue, whose type is U.
        val z = sc.parallelize(List(1,2,3,4,5,6), 2)
        z.aggregate(0)(math.max(_, _), _ + _) 
        
        res40: Int = 9
        Explanation:
        step1: partition 0 first folds math.max over [0, 1, 2, 3] (the leading 0 is zeroValue); result: 3
        step2: partition 1 folds math.max over [0, 4, 5, 6]; result: 6
        stepN: the N-th partition would fold math.max in the same way
        final step: combOp (_ + _) is applied to zeroValue and all partition results: 0 + 3 + 6 = 9
    
    Change the initial value to 5:
    z.aggregate(5)(math.max(_, _), _ + _)

    What is the result?
    
    
    Explanation:
        // This example returns 16 since the initial value is 5
        // reduce of partition 0 will be max(5, 1, 2, 3) = 5
        // reduce of partition 1 will be max(5, 4, 5, 6) = 6
        // final reduce across partitions will be 5 + 5 + 6 = 16
        // note the final reduce include the initial value
    
    
    Another example:
    scala> val z = sc.parallelize(List("12","23","345","4567"),2)
    z: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[3] at parallelize at :24

    scala> z.glom.collect
    res11: Array[Array[String]] = Array(Array(12, 23), Array(345, 4567))

    scala> z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
    
    res12: String = 42

    scala> z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
    res13: String = 24
    
        Explanation:
         step1: partition 0 runs math.max(x.length, y.length).toString:
            res0 = math.max("".length, "12".length).toString = "2"
            res1 = math.max(res0.length, "23".length).toString = "2"
            partition 0 finally returns "2"
         step2: partition 1 runs math.max(x.length, y.length).toString:
            res2 = math.max("".length, "345".length).toString = "3"
            res3 = math.max(res2.length, "4567".length).toString = "4"
            partition 1 finally returns "4"
         step3: finally (x,y) => x + y concatenates the partition results, giving "24" or "42"
            (the partitions run on different executors, so the order in which their results arrive can differ; compare the task times in the web UI)
            
    scala> z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
           res14: String = 11
         
            step1: partition 0 runs math.min(x.length, y.length).toString:
            res0 = math.min("".length, "12".length).toString = "0"
            res1 = math.min(res0.length, "23".length).toString = "1"
            partition 0 finally returns "1"
         step2: partition 1 runs math.min(x.length, y.length).toString:
            res2 = math.min("".length, "345".length).toString = "0"
            res3 = math.min(res2.length, "4567".length).toString = "1"
            partition 1 finally returns "1"
         step3: finally (x,y) => x + y concatenates zeroValue "" with the two partition results: "" + "1" + "1" = "11" either way
         
    scala> z.aggregate("12")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
    
    res15: String = 1211
            step1: partition 0 runs math.min(x.length, y.length).toString:
            res0 = math.min("12".length, "12".length).toString = "2"
            res1 = math.min(res0.length, "23".length).toString = "1"
            partition 0 finally returns "1"
         step2: partition 1 runs math.min(x.length, y.length).toString:
            res2 = math.min("12".length, "345".length).toString = "2"
            res3 = math.min(res2.length, "4567".length).toString = "1"
            partition 1 finally returns "1"
         step3: finally (x,y) => x + y starts from zeroValue "12" and appends the partition results: "12" + "1" + "1" = "1211" either way
         
    8.fold
        def fold(zeroValue: T)(op: (T, T) => T): T
        Description: fold can be seen as a simplified aggregate in which seqOp and combOp are the same function op.
     val a = sc.parallelize(List(1,2,3), 3)
     a.fold(0)(_ + _)
     res59: Int = 6
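     A brief equivalence sketch (this comparison is an addition, not part of the original example):
     a.aggregate(0)(_ + _, _ + _)   // also 6: fold is aggregate with the same function for seqOp and combOp
     // like aggregate, fold applies zeroValue once per partition and once in the final merge,
     // so a.fold(1)(_ + _) with 3 partitions gives 6 + 3*1 + 1 = 10, not 7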
    
