Spark RDD Basics: Exercises (Part 1)

Spark RDD exercises

1. Creating an RDD with parallelize

  • Create an RDD by parallelizing a local collection
    scala> var rdd1 = sc.parallelize(List(123,32,44,55,66,77,88,999))
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

  • Check the RDD's partition count
    scala> rdd1.partitions.length
    res6: Int = 4

  • Specify the number of partitions at creation time
    scala> var rdd1 = sc.parallelize(List(123,234,345,456,567,678,789,890),3)
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24
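    When no partition count is given, parallelize falls back to the scheduler's default parallelism, which is why the first RDD above reported 4 partitions. A minimal sketch confirming the explicit setting (the variable name rdd3 is illustrative):
    scala> val rdd3 = sc.parallelize(1 to 100, 3)        // explicit partition count
    scala> rdd3.partitions.length                        // => 3
    scala> sc.parallelize(1 to 100).partitions.length    // falls back to the default parallelism (4 in the transcript above)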

2. sortBy, filter

  • Multiply every element of rdd1 by 2, then sort
    Create the RDD
    scala> val rdd1 = sc.parallelize(List(1,2,9,4,22,6))
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:24
    Multiply each element by 2
    scala> val rdd2 = rdd1.map(x => x*2).collect
    rdd2: Array[Int] = Array(2, 4, 18, 8, 44, 12)
    Sort descending
    scala> val rdd3 = rdd1.map(_*2).sortBy(x => x,false).collect
    rdd3: Array[Int] = Array(44, 18, 12, 8, 4, 2)
    Sort ascending with 3 partitions
    scala> val rdd3 = rdd1.map(_*2).sortBy(x => x,true,3).collect
    rdd3: Array[Int] = Array(2, 4, 8, 12, 18, 44)
    Knowledge points:
    collect is an action operator: it turns the RDD's data into a local array, pulling the data from the remote cluster back to the driver, so it should only be used when the result fits in driver memory.
    sortBy is implemented in org.apache.spark.rdd.RDD. Its first parameter is a function that takes an element of type T (the RDD's element type) and returns the key to sort by; the second parameter, ascending, decides whether the sorted RDD is in ascending or descending order and defaults to true (ascending); the third parameter, numPartitions, sets the number of partitions of the sorted RDD and defaults to the partition count of the input, i.e. this.partitions.size. A short sketch with a non-identity key function follows at the end of this section.

  • Keep only the elements greater than 50
    Create the RDD
    scala> val rdd1 = sc.parallelize(List(1,4,6,88,99))
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
    Filter with placeholder syntax
    scala> val rdd2 = rdd1.filter(_>50).collect
    rdd2: Array[Int] = Array(88, 99)
    Filter with an explicit lambda
    scala> val rdd2 = rdd1.filter(x => x>50).collect
    rdd2: Array[Int] = Array(88, 99)

  • Keep only the even elements
    scala> val rdd2 = rdd1.filter(x => x%2==0).collect
    rdd2: Array[Int] = Array(4, 6, 88)
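    As referenced in the sortBy knowledge points above, a minimal sketch of sortBy with a non-identity key function and of chaining it with filter and map; the (name, score) pairs and variable names are illustrative:
    scala> val scores = sc.parallelize(List(("a",3),("b",1),("c",2)))
    // sort by the second tuple field, descending, into 2 partitions
    scala> scores.sortBy(t => t._2, false, 2).collect   // => Array((a,3), (c,2), (b,1))
    // filter and map compose before the sort: keep scores above 1, double them, sort ascending
    scala> scores.filter(_._2 > 1).map(t => (t._1, t._2 * 2)).sortBy(_._2).collect   // => Array((c,4), (a,6))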

3. split, flatMap

  • Split on spaces
    Create the RDD
    scala> val rdd1 = sc.parallelize(Array("x h y","s w","c d s"))
    rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[4] at parallelize at <console>:24
    Split each string on spaces
    scala> rdd1.map(x => x.split(" ")).collect
    res0: Array[Array[String]] = Array(Array(x, h, y), Array(s, w), Array(c, d, s))

  • Split and flatten
    scala> val rdd2 = rdd1.flatMap(x => x.split(" ")).collect
    rdd2: Array[String] = Array(x, h, y, s, w, c, d, s)
    Create a nested collection
    scala> val rdd1 = sc.parallelize(Array(Array("x u","h o n g","y u"),Array("x i a o","m i n g")))
    rdd1: org.apache.spark.rdd.RDD[Array[String]] = ParallelCollectionRDD[8] at parallelize at <console>:24
    Flatten (version 1: map over the outer RDD, flatten only inside each array)
    scala> val rdd2 = rdd1.map(x => x.flatMap(x => x.split(" "))).collect
    rdd2: Array[Array[String]] = Array(Array(x, u, h, o, n, g, y, u), Array(x, i, a, o, m, i, n, g))
    Flatten (version 2: flatMap the outer RDD as well, fully flattening the result)
    scala> val rdd2 = rdd1.flatMap(x => x.flatMap(x => x.split(" "))).collect
    rdd2: Array[String] = Array(x, u, h, o, n, g, y, u, x, i, a, o, m, i, n, g)
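    The difference between map and flatMap above is whether the per-element collections are concatenated: flatMap is effectively map followed by a flatten. A minimal sketch of that equivalence (the variable name lines is illustrative):
    scala> val lines = sc.parallelize(Array("x h y","s w"))
    scala> lines.flatMap(_.split(" ")).collect               // => Array(x, h, y, s, w)
    scala> lines.map(_.split(" ")).flatMap(x => x).collect   // same result: flatMap = map + flatten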

4. union, intersection, distinct

  • Union (first create the two input RDDs)
    Create RDD 1
    scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,6))
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
    Create RDD 2
    scala> val rdd2 = sc.parallelize(List(3,4,5,6,7,8))
    rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24

  • union (duplicates are kept)
    scala> val rdd3 = rdd1.union(rdd2)
    rdd3: org.apache.spark.rdd.RDD[Int] = UnionRDD[2] at union at <console>:27
    View the result
    scala> rdd3.collect
    res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 3, 4, 5, 6, 7, 8)

  • Intersection
    intersection returns the elements common to both RDDs
    scala> val rdd4 = rdd1.intersection(rdd2).collect
    rdd4: Array[Int] = Array(4, 5, 6, 3)

  • Deduplicate with distinct
    scala> val rdd5 = rdd1.union(rdd2).distinct.collect
    rdd5: Array[Int] = Array(8, 1, 2, 3, 4, 5, 6, 7)
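    intersection and distinct both shuffle the data, which is why the collected results above are not in their original order. A small sketch that sorts the results for a stable view; distinct also accepts a partition count (sketch only, reusing rdd1 and rdd2 from above):
    scala> rdd1.intersection(rdd2).sortBy(x => x).collect        // => Array(3, 4, 5, 6)
    scala> rdd1.union(rdd2).distinct(2).sortBy(x => x).collect   // => Array(1, 2, 3, 4, 5, 6, 7, 8)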

5. groupByKey, reduceByKey, sortByKey

  • groupByKey
    Create the RDD
    scala> val rdd1 = sc.parallelize(Array(("xhy1",10),("xhy2",30),("xhy1",20),("xhy2",90)))
    rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[22] at parallelize at <console>:24
    Group by key
    scala> val rdd2 = rdd1.groupByKey().collect
    rdd2: Array[(String, Iterable[Int])] = Array((xhy1,CompactBuffer(10, 20)), (xhy2,CompactBuffer(30, 90)))
    Iterate over the grouped result
    rdd2.foreach(score => {println(score._1);score._2.foreach(x => println(x))})
    xhy1
    10
    20
    xhy2
    30
    90
    Knowledge point:
    groupByKey also operates per key, but it only gathers each key's values into a sequence; groupByKey itself cannot take a custom aggregation function. If you then need to aggregate that sequence, reduceByKey or aggregateByKey is usually the better choice: with groupByKey you must first materialize the grouped RDD and then apply the custom function in a separate map step, whereas reduceByKey can combine values within each partition before the shuffle. A comparison sketch follows at the end of this section.

  • reduceByKey
    scala> val rdd1 = sc.parallelize(List(("tom",1),("jerry",2),("kitty",3)))
    rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:24
    scala> val rdd2 = sc.parallelize(List(("jerry",9),("tom",8),("shuke",7)))
    rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[1] at parallelize at <console>:24
    scala> val rdd3 = rdd1.union(rdd2)
    rdd3: org.apache.spark.rdd.RDD[(String, Int)] = UnionRDD[2] at union at <console>:27
    scala> val rdd4 = rdd3.reduceByKey(_+_)
    rdd4: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[3] at reduceByKey at <console>:25
    scala> rdd4.collect
    res0: Array[(String, Int)] = Array((tom,9), (shuke,7), (kitty,3), (jerry,11))

  • sortByKey
    scala> val rdd5 = rdd4.map(t=>(t._2,t._1)).sortByKey(true).map(t=>(t._2,t._1)).collect
    rdd5: Array[(String, Int)] = Array((kitty,3), (shuke,7), (tom,9), (jerry,11))
    Knowledge point:
    sortByKey sorts a key-value RDD by its keys; true sorts ascending and false descending. Here the (key, value) pairs are swapped first so the sort runs on the counts, then swapped back. See the comparison sketch below.
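    As referenced in the knowledge points above, a minimal sketch contrasting groupByKey followed by a map with reduceByKey, and showing sortByKey in descending order; the scores RDD is illustrative and the element order of the unsorted results may vary:
    scala> val scores = sc.parallelize(Array(("xhy1",10),("xhy2",30),("xhy1",20),("xhy2",90)))
    // groupByKey: all values are shuffled, and the sum is applied afterwards in a separate map
    scala> scores.groupByKey().map(t => (t._1, t._2.sum)).collect   // e.g. Array((xhy1,30), (xhy2,120)), order may vary
    // reduceByKey: same result, but values are pre-combined within each partition before the shuffle
    scala> scores.reduceByKey(_ + _).collect                        // e.g. Array((xhy1,30), (xhy2,120)), order may vary
    // sortByKey(false): sort by key in descending order
    scala> scores.reduceByKey(_ + _).sortByKey(false).collect       // => Array((xhy2,120), (xhy1,30))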

6. join, leftOuterJoin, rightOuterJoin

  • join
    scala> val rdd1 = sc.parallelize(List(("tom",1),("jerry",2),("kitty",3)))
    rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[18] at parallelize at <console>:24
    scala> val rdd2 = sc.parallelize(List(("jerry",9),("tom",8),("shuke",7)))
    rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[19] at parallelize at <console>:24
    scala> val rdd3 = rdd1.join(rdd2).collect
    rdd3: Array[(String, (Int, Int))] = Array((tom,(1,8)), (jerry,(2,9)))

  • leftOuterJoin
    scala> val rdd1 = sc.parallelize(List(("tom",1),("jerry",2),("kitty",3)))
    rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[18] at parallelize at <console>:24
    scala> val rdd2 = sc.parallelize(List(("jerry",9),("tom",8),("shuke",7)))
    rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[19] at parallelize at <console>:24
    scala> val rdd3 = rdd1.leftOuterJoin(rdd2).collect
    rdd3: Array[(String, (Int, Option[Int]))] = Array((tom,(1,Some(8))), (jerry,(2,Some(9))), (kitty,(3,None)))
    leftOuterJoin behaves like a SQL LEFT OUTER JOIN: every key from the left RDD is kept, and a key with no match on the right gets None for the right-hand value. See the sketch at the end of this section for unwrapping the Option.

  • rightOuterJoin
    scala> val rdd3 = rdd1.rightOuterJoin(rdd2).collect
    rdd3: Array[(String, (Option[Int], Int))] = Array((tom,(Some(1),8)), (jerry,(Some(2),9)), (shuke,(None,7)))
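    As noted under leftOuterJoin, the outer joins wrap the possibly-missing side in Option. A minimal sketch of unwrapping those values with getOrElse, reusing rdd1 and rdd2 from above (the default value 0 is an illustrative choice):
    scala> rdd1.leftOuterJoin(rdd2).map { case (k, (l, r)) => (k, l, r.getOrElse(0)) }.collect
    // e.g. Array((tom,1,8), (jerry,2,9), (kitty,3,0)), element order may vary
    scala> rdd1.rightOuterJoin(rdd2).map { case (k, (l, r)) => (k, l.getOrElse(0), r) }.collect
    // e.g. Array((tom,1,8), (jerry,2,9), (shuke,0,7)), element order may vary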

7. reduce, top, take, first

  • reduce
    scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,6),2)
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[32] at parallelize at <console>:24
    scala> val rdd2 = rdd1.reduce(_+_)
    rdd2: Int = 21
  • Take the 2 largest elements (top sorts descending first)
    scala> rdd1.top(2)
    res2: Array[Int] = Array(6, 5)
  • Take the first 2 elements
    scala> rdd1.take(2)
    res3: Array[Int] = Array(1, 2)
  • Take the first element
    scala> rdd1.first
    res4: Int = 1
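    A few related points, sketched on the same rdd1 (assumed still in scope): the function passed to reduce should be associative and commutative, because partition-level partial results are combined in no fixed order; top takes the largest elements, while takeOrdered is its ascending counterpart:
    scala> rdd1.reduce((a, b) => math.max(a, b))   // => 6; max is associative and commutative
    scala> rdd1.takeOrdered(2)                     // the 2 smallest elements => Array(1, 2)
    scala> rdd1.max                                // => 6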

8. cogroup

scala> val rdd1 = sc.parallelize(Array((1,"a"),(2,"b"),(3,"c")))
rdd1: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[34] at parallelize at <console>:24

scala> val rdd2 = sc.parallelize(Array((1,100),(2,97),(3,100)))
rdd2: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[35] at parallelize at <console>:24

scala> val rdd3 = rdd1.cogroup(rdd2).collect
rdd3: Array[(Int, (Iterable[String], Iterable[Int]))] = Array((1,(CompactBuffer(a),CompactBuffer(100))), (2,(CompactBuffer(b),CompactBuffer(97))), (3,(CompactBuffer(c),CompactBuffer(100))))
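cogroup groups the values for each key from both RDDs into a pair of Iterables, which is also the building block behind the join family. A minimal sketch of flattening the cogrouped result into joined pairs, reusing rdd1 and rdd2 from above:
scala> rdd1.cogroup(rdd2).flatMap { case (k, (as, bs)) => for (a <- as; b <- bs) yield (k, (a, b)) }.collect
// e.g. Array((1,(a,100)), (2,(b,97)), (3,(c,100))), element order may vary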
