Spark Transformation Operators (Explained with Examples)

Typing out and practicing each transformation operator a few times will give you a much deeper understanding.

Transformations are lazy: nothing is computed until an action operator is called on the resulting RDD.
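
For example, here is a minimal sketch of that laziness (the appName and RDD name are my own illustration; the same SparkConf/SparkContext imports apply to every snippet below). The map below does no work until count(), an action, triggers the job:

  import org.apache.spark.{SparkConf, SparkContext}

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lazy-demo").setMaster("local")
    val sc = new SparkContext(conf)
    // only the lineage is built here; no Spark job has run yet
    val lazyRDD = sc.parallelize(1 to 5).map { x => println("computing " + x); x * 2 }
    // the action triggers the actual computation, so the "computing ..." lines appear only now
    println(lazyRDD.count())
  }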

1. map(func)

Description: returns a new RDD formed by passing each element of the source RDD through the function func.

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("chunchun").setMaster("local")
    val sc = new SparkContext(conf)
    val arr = Array(1,2,3,4,5,6)
    val numRDD = sc.parallelize(arr)
    val resultRDD = numRDD.map(x => x * x)
    resultRDD.foreach(println)
  }
Result:
1
4
9
16
25
36

2. filter(func)

Description: returns a new RDD consisting of those elements of the source RDD for which func returns true.

 def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("chunchun").setMaster("local")
    val sc = new SparkContext(conf)
    val arr = Array(1, 2, 3, 4, 5, 6)
    // parallelize() creates an RDD from a local collection
    val numRDD = sc.parallelize(arr)
    // keep only the elements for which the predicate returns true (the even numbers)
    val resultRDD = numRDD.filter(_ % 2 == 0)
    resultRDD.foreach(println)
  }
Result:
2
4
6

3. flatMap(func)

Description: similar to map, but each input element can be mapped to zero or more output elements (so func should return a sequence rather than a single element).

 def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("chun").setMaster("local")

    val sc = new SparkContext(conf)
    val words = Array("hello python","hello hadoop","hello spark")
    val wordRDD = sc.parallelize(words)
    wordRDD.flatMap(_.split(" ")).collect.foreach(println)

  }
Result:
hello
python
hello
hadoop
hello
spark

4. mapPartitions(func)

Description: similar to map, but runs independently on each partition of the RDD, so when running on an RDD of type T, func must be of type Iterator[T] => Iterator[U].

 def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("chun").setMaster("local")
    val sc = new SparkContext(conf)
    val array = Array(1, 2, 1, 2, 2, 3, 4, 5, 6, 7, 8, 9)
    val arrayRDD = sc.parallelize(array)
    // func is called once per partition and receives an iterator over that partition's elements
    arrayRDD.mapPartitions(elements => {
      val result = new ArrayBuffer[Int]()   // requires: import scala.collection.mutable.ArrayBuffer
      elements.foreach(e => {
        result += e
      })
      result.iterator
    }).foreach(println)
  }
Result (every element passes through unchanged; printed one per line):
1 2 1 2 2 3 4 5 6 7 8 9

5. mapPartitionsWithIndex(func)

Description: similar to mapPartitions, but func also takes an integer parameter giving the partition index, so when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U].

def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("chunchun").setMaster("local")
    val sc = new SparkContext(conf)
    val arrayRDD = sc.parallelize(Array(1,2,3,4,5,6,7,8,9), 2)  // 2 is the number of partitions
    arrayRDD.mapPartitionsWithIndex((index, elements) => {
      println("partition index:" + index)
      val result = new ArrayBuffer[Int]()   // requires: import scala.collection.mutable.ArrayBuffer
      elements.foreach(e => {
        result += e
      })
      result.iterator
    }).foreach(println)
  }
Result:
partition index:0
1
2
3
4

partition index:1
5
6
7
8
9

6. sample(withReplacement, fraction, seed)

Description: samples a fraction of the data, with or without replacement, using the given random number generator seed.

 def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("chunchun").setMaster("local")
    val sc = new SparkContext(conf)
    val arrayRDD = sc.parallelize(1 to 10000)
    val sampleRDD = arrayRDD.sample(true, 0.001)    // true: sample with replacement
    println(sampleRDD.count())
  }

Result: 10 (the exact count varies between runs, since sampling is random)

 def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("chunchun").setMaster("local")
    val sc = new SparkContext(conf)
    val arrayRDD = sc.parallelize(1 to 10000)
    val sampleRDD = arrayRDD.sample(false, 0.001)  // false: sample without replacement
    println(sampleRDD.count())
  }

Result: 9 (again approximate; sampling is random)
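
A small extra sketch (my own addition, not from the original): the optional third argument fixes the random seed, which makes the sample reproducible across runs.

    val arrayRDD = sc.parallelize(1 to 10000)
    // with a fixed seed the same elements are sampled every run
    val seededSample = arrayRDD.sample(false, 0.001, seed = 42L)
    println(seededSample.count())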

7. union(otherDataset)

Description: returns a new RDD that is the union of the source RDD and the argument RDD.

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("chunchun").setMaster("local")

    val sc = new SparkContext(conf)

    val rdd1 = sc.parallelize(1 to 10)
    val rdd2 = sc.parallelize(11 to 20)
    val resultRDD = rdd1.union(rdd2)
    resultRDD.foreach(print)
  }
Result (all 20 elements, printed consecutively by print):
1234567891011121314151617181920

8. intersection(otherDataset)

Description: returns a new RDD that is the intersection of the source RDD and the argument RDD.

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("chunchun").setMaster("local")

    val sc = new SparkContext(conf)

    val rdd1 = sc.parallelize(Array(1,3,5,7,8))
    val rdd2 = sc.parallelize(Array(3,5,7))
    val resultRDD = rdd1.intersection(rdd2)
    resultRDD.foreach(println)
  }
Result:
3
7
5

9. distinct([numTasks])

Description: removes duplicate elements from the source RDD and returns a new RDD.

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("chunchun").setMaster("local")

    val sc = new SparkContext(conf)

    val arr = Array(Tuple3("max","math",90),("max","english",85),("mike","math",100))

    val scoreRDD = sc.parallelize(arr)
    val studentNumber = scoreRDD.map(_._1).distinct().collect()
    println(studentNumber.mkString(","))
  }
Result:
max,mike

10. groupByKey([numTasks])

Description: when called on an RDD of (K, V) pairs, returns an RDD of (K, Iterable[V]) pairs.

 def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("chunchun").setMaster("local")
    val sc = new SparkContext(conf)
    val arr = Array("chun1 chun2 chun3 chun1 chun1 chun2", "chun1")
    val arrRDD = sc.parallelize(arr)
    // split into words, pair each word with 1, then group all values for the same key
    val resultRDD = arrRDD.flatMap(_.split(" ")).map((_, 1)).groupByKey()
    //resultRDD.foreach(println)
    resultRDD.foreach(element => {
      println(element._1 + " " + element._2.size)
    })
  }
Result:
chun1 4
chun3 1
chun2 2
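
A small follow-up sketch (my own addition): the grouped Iterable values can be reduced afterwards, e.g. summed with mapValues, which yields the same word counts that reduceByKey produces in the next example.

    // sum the grouped 1s per key: (chun1,4), (chun3,1), (chun2,2)
    resultRDD.mapValues(_.sum).collect().foreach(println)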

11. reduceByKey(func, [numTasks])

Description: when called on an RDD of (K, V) pairs, returns an RDD of (K, V) pairs where the values for each key are aggregated using the given reduce function. Like groupByKey, the number of reduce tasks is configurable through an optional second parameter.

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("chunchun").setMaster("local")
    val sc = new SparkContext(conf)
    val arr = Array("chun1 chun2 chun3 chun1 chun1 chun2", "chun1")
    val arrRDD = sc.parallelize(arr)
    // classic word count: sum the 1s for each word
    arrRDD.flatMap(_.split(" ")).map(x => (x, 1)).reduceByKey(_ + _).collect.foreach(println)
  }
Result:
(chun1,4)
(chun3,1)
(chun2,2)
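
As the description notes, the number of reduce tasks can be supplied as a second argument; a hedged sketch (the value 4 is only an illustration):

    // same word count, but the shuffled result is spread over 4 partitions
    val counted = arrRDD.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _, 4)
    println(counted.getNumPartitions)   // 4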

12. aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])

Description: when called on an RDD of (K, V) pairs, returns an RDD of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Like groupByKey, the number of reduce tasks is configurable through an optional second parameter.

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("chunchun").setMaster("local")
    val sc = new SparkContext(conf)
    val data = List((1,3), (1,4), (2,3), (3,6), (1,2), (3,8))
    val rdd = sc.parallelize(data)
    // seqOp takes the max of each key's values within a partition; combOp sums the per-partition maxima
    rdd.aggregateByKey(0)(math.max(_, _), _ + _).collect().foreach(println)
  }
Result (with a single local partition only seqOp runs, so each key keeps its maximum value):
(1,4)
(3,8)
(2,3)

13. sortByKey([ascending], [numTasks])

Description: when called on an RDD of (K, V) pairs where K implements Ordered, returns an RDD of (K, V) pairs sorted by key.

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("chunchun").setMaster("local")
    val sc = new SparkContext(conf)
    val scores = Array(Tuple2("mike",80), ("max",90), ("bob",100))
    val scoresRDD = sc.parallelize(scores)
    val sortByKeyRDD = scoresRDD
      .map(x => (x._2, x._1))   // swap to (score, name) so we can sort by the score
      .sortByKey(false)         // false = descending order
      .map(x => (x._2, x._1))   // swap back to (name, score)
    sortByKeyRDD.collect.foreach(println)
  }
Result:
(bob,100)
(max,90)
(mike,80)

14. join(otherDataset, [numTasks])

Description: when called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are available through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

 def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("chunchun")
      .setMaster("local")
    val sc = new SparkContext(conf)

    // student info
    val students = Array(
      Tuple2(1,"max"),
      Tuple2(2,"mike"),
      Tuple2(3,"bob")
    )
    // scores
    val scores = Array(
      Tuple2(1,90),
      Tuple2(2,120),
      Tuple2(3,80)
    )

    val stuRDD = sc.parallelize(students)
    val scoresRDD = sc.parallelize(scores)

    // joining two sets of (k, v) pairs returns (k, (v, w))
    val resultRDD = stuRDD.join(scoresRDD).sortByKey()
    resultRDD.foreach(x => {
      println("id:" + x._1 + " name:" + x._2._1 + " score:" + x._2._2)
      println("=========================")
    })
  }
Result:

id:1 name:max score:90
=========================
id:2 name:mike score:120
=========================
id:3 name:bob score:80
=========================
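
The example above is an inner join. As a hedged sketch of the outer-join variants mentioned in the description, here is a leftOuterJoin on the same data (the extra student with id 4 is my own illustration):

    val moreStudents = sc.parallelize(students :+ Tuple2(4, "tom"))
    // leftOuterJoin keeps every student; a missing score shows up as None
    moreStudents.leftOuterJoin(scoresRDD).sortByKey().collect().foreach(println)
    // prints (one per line): (1,(max,Some(90))) (2,(mike,Some(120))) (3,(bob,Some(80))) (4,(tom,None))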

15. cogroup(otherDataset, [numTasks])

Description: when called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable[V], Iterable[W])) tuples.

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("chunchun1").setMaster("local")

    val sc = new SparkContext(conf)

    // student info
    val students = Array(("class1","max"),("class1","mike"),("class2","bob"))

    // scores
    val scores = Array(("class1",90),("class1",120),("class2",80))

    val stuRDD = sc.parallelize(students)
    val scoresRDD = sc.parallelize(scores)

    val resultRDD = stuRDD.cogroup(scoresRDD).sortByKey()
    resultRDD.foreach(x =>{
      println("class:"+x._1)
      x._2._1.foreach(println)
      x._2._2.foreach(println)  // remove this line to show only the names
      println("===========")
    })
  }
Result:

class:class1
max
mike
90
120
===========
class:class2
bob
80
===========

16. cartesian(otherDataset)

Description: when called on datasets of types T and U, returns a dataset of (T, U) pairs (the Cartesian product of all elements).

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("chunchun").setMaster("local")
    val sc = new SparkContext(conf)

    val arr1 = sc.parallelize(Array(1,3,5))
    val arr2 = sc.parallelize(Array(2,4,6))
    arr1.cartesian(arr2).collect().foreach(println)
  }
Result:
(1,2)
(1,4)
(1,6)
(3,2)
(3,4)
(3,6)
(5,2)
(5,4)
(5,6)

17. pipe(command, [envVars])

Description: pipes each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin, and lines written to its stdout are returned as an RDD of strings.
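
The original has no example for pipe; here is a minimal sketch, assuming a Unix-like environment where grep is on the PATH (the RDD names are illustrative):

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("pipe-demo").setMaster("local")
    val sc = new SparkContext(conf)
    val linesRDD = sc.parallelize(Array("hello spark", "hello hadoop", "bye"))
    // each element is written to grep's stdin; every line grep writes to stdout becomes an element of pipedRDD
    val pipedRDD = linesRDD.pipe("grep hello")
    pipedRDD.collect().foreach(println)   // hello spark, hello hadoop
  }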

18. coalesce(numPartitions)

Description: decreases the number of partitions in the RDD to numPartitions; useful for running operations more efficiently after filtering down a large dataset.

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("chunchun").setMaster("local")
    val sc = new SparkContext(conf)

    val rdd1 = sc.parallelize(1 to 20, 10)
    println(rdd1.partitions.length) // 10

    val rdd2 = rdd1.coalesce(15, true)
    println(rdd2.partitions.length) // 15

    val rdd3 = rdd1.repartition(15)
    println(rdd3.partitions.length) // 15

    val rdd4 = rdd1.coalesce(15, false) // without a shuffle, coalesce cannot increase the partition count
    println(rdd4.partitions.length) // 10

    val rdd5 = rdd1.coalesce(2, false)
    println(rdd5.partitions.length) // 2
    rdd5.foreach(print) // partition 0: 12345678910  partition 1: 11121314151617181920

    val rdd6 = rdd1.coalesce(2, true)
    println(rdd6.partitions.length) // 2
    rdd6.foreach(print) // partition 0: 135791113151719  partition 1: 2468101214161820
  }

19. repartition(numPartitions)

Description: reshuffles the data in the RDD randomly to create either more or fewer partitions and balances the data across them; this always performs a shuffle.


repartition vs. coalesce

Both redistribute an RDD's partitions; repartition is simply coalesce with shuffle set to true. (Assume the RDD has N partitions and needs to be repartitioned into M.)
1) If N < M, the partition count has to grow, so shuffle must be set to true; without a shuffle the partition count cannot increase.
2) If N > M and N and M are of similar magnitude (say N is 1000 and M is 100), several of the N partitions can simply be merged into each of the final M partitions. In this case shuffle can be set to false; with shuffle = false the parent and child RDDs have a narrow dependency, and coalesce has no effect when M > N.
3) If N > M and the two differ drastically, then with shuffle = false the parent and child RDDs are in a narrow dependency and sit in the same stage, which may leave the Spark job with too little parallelism and hurt performance. In particular, when M is 1, it can be worth setting shuffle = true so that the operations before the coalesce still run with good parallelism.
In short: with shuffle = false, passing a target larger than the current number of partitions leaves the partition count unchanged; without a shuffle you cannot increase the number of partitions of an RDD.
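
A small sketch of the dependency difference described above (my own illustration): toDebugString prints the RDD lineage, and a ShuffledRDD only shows up when a shuffle is involved.

    val rdd = sc.parallelize(1 to 20, 10)
    // narrow dependency: no ShuffledRDD appears in the lineage
    println(rdd.coalesce(2, false).toDebugString)
    // wide dependency: repartition (coalesce with shuffle = true) introduces a ShuffledRDD
    println(rdd.repartition(2).toDebugString)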

20. repartitionAndSortWithinPartitions(partitioner)

Description: repartitions the RDD according to the given partitioner and, within each resulting partition, sorts the records by their keys. This is more efficient than calling repartition and then sorting within each partition, because the sorting can be pushed down into the shuffle machinery.

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("chunchun").setMaster("local")
    val sc = new SparkContext(conf)
    val arrayRDD = sc.parallelize(Array(1,2,3,4,5,6,7,8,9), 3) // 3 is the number of partitions
    // Note: this snippet only prints each partition's contents (same technique as in example 5);
    // it does not itself call repartitionAndSortWithinPartitions.
    arrayRDD.mapPartitionsWithIndex((index, elements) => {  // index is the partition index, elements its data
      println("partition index:" + index)
      val result = new ArrayBuffer[Int]()   // requires: import scala.collection.mutable.ArrayBuffer
      elements.foreach(e => {
        result += e
      })
      result.iterator
    }).foreach(println)
  }
Result:
partition index:0
1
2
3
partition index:1
4
5
6
partition index:2
7
8
9
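
Since the snippet above only inspects partition contents, here is a hedged sketch of the operator itself on a (key, value) RDD, using a HashPartitioner with 2 partitions (the data and names are my own illustration):

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Array((3, "c"), (1, "a"), (2, "b"), (6, "f"), (4, "d"), (5, "e")))
    // hash-partition by key into 2 partitions and sort each partition by key during the shuffle itself
    val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))
    sorted.mapPartitionsWithIndex((index, it) =>
      it.map(kv => "partition " + index + ": " + kv)
    ).collect().foreach(println)
    // partition 0 holds the even keys in order (2, 4, 6); partition 1 holds the odd keys (1, 3, 5)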
