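Value-type RDD transformation operators (Part 2): filter, sample, distinct, coalesce, repartition, sortBy, pipe

RDDs fall broadly into Value types and Key-Value types, and the Value type also covers the double-Value type (operations that combine two RDDs). What follows is a summary of the transformation operators for Value-type RDDs: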

7. filter(func)

8. sample(withReplacement,fraction,seed)

9. distinct(num)

10. coalesce(numPartitions) (shuffle optional)

11. repartition(numPartitions) (always shuffles)

12. sortBy(func,[ascending],[numPartitions])

13. pipe(command,[envVars])


7. filter(func)

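Filters the RDD: func is applied to every element of the source RDD, and the new RDD keeps only the elements for which func returns true.

e.g. create an RDD of strings and filter it into a new RDD that keeps only the strings containing "xiao"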

scala> val rdd7 = sc.makeRDD(Array("xiaowang","xiaoli","wangwu","zhaoliu"))
rdd7: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[26] at makeRDD at <console>:24

scala> val filter = rdd7.filter(_.contains("xiao"))
filter: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[27] at filter at <console>:26

scala> filter.collect
res23: Array[String] = Array(xiaowang, xiaoli)

 

8. sample(withReplacement,fraction,seed)

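Randomly samples elements from the source RDD.

The random numbers are generated from the given seed. withReplacement=true samples with replacement; withReplacement=false samples without replacement.

ps: with a fixed seed and a fixed algorithm, the generated numbers are fixed too, which is what makes them pseudo-random. Setting seed to System.currentTimeMillis (the current time) gives a different sequence on each run, although the numbers are still pseudo-random rather than truly random. If seed is omitted, sample picks a random one by default.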
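To make the point about seeds concrete, here is a small sketch. It uses plain scala.util.Random rather than Spark's internal generator, and the names SeedDemo and draws are made up for illustration, so treat it as an analogy for how sample behaves, not as Spark's code.

```scala
import scala.util.Random

object SeedDemo {
  // Draw n pseudo-random doubles from a generator initialised with `seed`.
  def draws(seed: Long, n: Int): Seq[Double] = {
    val rng = new Random(seed)
    Seq.fill(n)(rng.nextDouble())
  }

  def main(args: Array[String]): Unit = {
    // Fixed seed: the two sequences are identical, run after run.
    println(draws(2L, 3) == draws(2L, 3))
    // Time-based seed: still pseudo-random, but it differs between runs.
    println(draws(System.currentTimeMillis(), 3))
  }
}
```

The same reasoning explains why rdd.sample(false, 0.5, 2) returns the same elements every time it is run on the same RDD, while changing the seed changes the selection.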

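Use case: checking for data skew. An RDD is partitioned; suppose its data is 1,1,1,1,1,1,1,2,3. Split into three partitions by value, it is clearly skewed: the seven 1s land in partition 1, while 2 and 3 each get a partition to themselves. Sampling without replacement and counting the sampled values shows that 1 dominates the data, so the seven 1s can be spread out and repartitioned to avoid the skew.

Here is the source of sample: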

def sample(
      withReplacement: Boolean,
      fraction: Double,
      seed: Long = Utils.random.nextLong): RDD[T] = {
    require(fraction >= 0,
      s"Fraction must be nonnegative, but got ${fraction}")

    withScope {
      require(fraction >= 0.0, "Negative fraction value: " + fraction)
      if (withReplacement) {
        new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
      } else {
        new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
      }
    }
  }

As the source shows, when withReplacement=true the sampler is a PoissonSampler built from fraction; when it is false (no replacement), the sampler is a BernoulliSampler.

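(1) Sampling with replacement

For sampling with replacement, fraction is the expected number of times each element is drawn, so it may be greater than 1 (the example below uses 3). The actual number of repetitions of a given element varies and will not always equal fraction.

(2) Sampling without replacement

A Bernoulli trial has only two outcomes, 1 and 0. Each element of the collection gets its own trial: 1 means the element is drawn, 0 means it is left out.

Set fraction to a value in [0,1], for example 0.4. When sampling without replacement, a random number in [0,1) is generated from the seed for each element and compared with 0.4: the element is drawn if the number is below 0.4 and skipped otherwise, so each element is kept with probability 0.4.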
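The two sampling modes can be sketched in plain Scala as follows. This is a simplified model, not Spark's BernoulliSampler or PoissonSampler: it uses scala.util.Random and Knuth's method for Poisson counts, and the names SamplingSketch, bernoulli, poisson and poissonCount are made up for illustration. Its output for a given seed will therefore not match the spark-shell results.

```scala
import scala.util.Random

object SamplingSketch {
  // Without replacement: one Bernoulli trial per element;
  // the element is kept when the draw falls below `fraction`.
  def bernoulli[T](items: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    items.filter(_ => rng.nextDouble() < fraction)
  }

  // Knuth's method: a Poisson-distributed count with mean `lambda`
  // (adequate for small lambda, which is all this sketch needs).
  private def poissonCount(lambda: Double, rng: Random): Int = {
    val limit = math.exp(-lambda)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) { k += 1; p *= rng.nextDouble() }
    k
  }

  // With replacement: each element is emitted a Poisson(fraction)-distributed
  // number of times, so `fraction` is the expected repeat count.
  def poisson[T](items: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    items.flatMap(x => Seq.fill(poissonCount(fraction, rng))(x))
  }
}
```

With this model, bernoulli(1 to 10, 0.0, 2L) is empty and bernoulli(1 to 10, 1.0, 2L) returns all ten elements, which mirrors the two boundary cases of sampling without replacement. poisson(1 to 10, 3.0, 2L) returns roughly 30 elements, with individual elements repeated a varying number of times.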

e.g. create an RDD and compare sampling with and without replacement

(1) With replacement: fraction is the expected number of times each element is drawn

scala> val rdd8 = sc.makeRDD(1 to 10)
rdd8: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at <console>:24

scala> rdd8.sample(true,3,2).collect
res0: Array[Int] = Array(1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10)

scala> rdd8.sample(true,0.4,2).collect
res1: Array[Int] = Array(1, 4, 5, 6, 6, 7, 9)

(2) Without replacement: fraction lies between 0 and 1; 0 draws nothing and 1 draws every element

scala> rdd8.sample(false , 0 , 2).collect
res4: Array[Int] = Array()

scala> rdd8.sample(false , 1 , 2).collect
res5: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> rdd8.sample(false , 0.5 , 2).collect
res6: Array[Int] = Array(2, 3, 5, 8, 10)

 

9. distinct(num)

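Removes duplicates from the source RDD. num sets the number of partitions of the resulting RDD; when num is omitted, the result has the same number of partitions as the source RDD.

e.g. deduplicate without num and check the partition count before and after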

scala> val rdd9 = sc.makeRDD(List(1,2,1,3,2,9,6,1))
rdd9: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[28] at makeRDD at <console>:24

scala> rdd9.glom.collect
res15: Array[Array[Int]] = Array(Array(1), Array(2), Array(1), Array(3), Array(2), Array(9), Array(6), Array(1))

scala> rdd9.distinct.collect
res16: Array[Int] = Array(1, 9, 2, 3, 6)

scala> rdd9.distinct.glom.collect
res17: Array[Array[Int]] = Array(Array(), Array(1, 9), Array(2), Array(3), Array(), Array(), Array(6), Array())

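Without num, the partition count is the same before and after deduplication: 8 in both cases.

e.g. deduplicate with num and check the partition count before and after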

scala> rdd9.distinct(3).collect
res18: Array[Int] = Array(6, 3, 9, 1, 2)

scala> rdd9.distinct(3).glom.collect
res19: Array[Array[Int]] = Array(Array(6, 3, 9), Array(1), Array(2))

With num set to 3, the deduplicated RDD has 3 partitions instead of the 8 partitions of the source RDD.
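The placement of values in the output above (6, 3, 9 together, then 1, then 2) is what hash partitioning produces. As far as I know, distinct is built on a reduceByKey with a hash partitioner, so each surviving value goes to partition hashCode mod numPartitions. The sketch below models that in plain Scala; DistinctSketch, partitionOf and distinctInto are made-up names, and it ignores the order of elements inside a partition.

```scala
object DistinctSketch {
  // Hash-partitioner rule: non-negative remainder of the key's hashCode.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Deduplicate, then place each surviving value into its hash bucket.
  def distinctInto[T](items: Seq[T], numPartitions: Int): Vector[Set[T]] = {
    val buckets = items.distinct.groupBy(x => partitionOf(x, numPartitions))
    Vector.tabulate(numPartitions)(i => buckets.getOrElse(i, Seq.empty[T]).toSet)
  }
}
```

For the data used above, DistinctSketch.distinctInto(List(1,2,1,3,2,9,6,1), 3) gives the buckets {3, 6, 9}, {1}, {2}, and with 8 partitions the values 1 and 9 share bucket 1 while buckets 0, 4, 5 and 7 stay empty, the same layout as the glom output.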

 

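10. coalesce(numPartitions) (shuffle optional)

Changes the number of partitions of an RDD. It is mainly used to reduce the partition count. With the default shuffle = false it merges existing partitions without moving individual elements, so it cannot increase the partition count; passing shuffle = true redistributes the data and can even out skewed partitions.

e.g. take an RDD with 4 partitions and shrink it to 2 partitions with coalesce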

scala> val rdd10 = sc.makeRDD(1 to 16 , 4)
rdd10: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[44] at makeRDD at <console>:24

scala> rdd10.partitions.size
res20: Int = 4

scala> rdd10.coalesce(2).partitions.size
res21: Int = 2

scala> rdd10.coalesce(2).collect
res22: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)

The data is unchanged; the partition count went from 4 to 2.

 

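11. repartition(numPartitions) (always shuffles, more even data distribution)

Like coalesce, it changes the number of partitions of an RDD and leaves the data itself unchanged.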

scala> val rdd11 = sc.makeRDD(1 to 16,4)
rdd11: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[47] at makeRDD at <console>:24

scala> rdd11.repartition(1).partitions.size
res23: Int = 1

scala> rdd11.repartition(1).collect
res24: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)

ps: both operators change the partition count; the difference between coalesce and repartition is easiest to see in the source

# coalesce source:
 def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty) 

# repartition source:
 def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope 
  {
    coalesce(numPartitions, shuffle = true)
  }

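The source shows that coalesce lets you choose whether to shuffle (break up and regroup the data) when changing the partition count, through the shuffle parameter, which defaults to false.

repartition simply calls coalesce with shuffle = true, so it always shuffles.

Comparing the partition contents, repartition spreads the data more evenly across partitions after the shuffle, which helps avoid data skew: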

scala> rdd10.coalesce(3).glom.collect
res25: Array[Array[Int]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8), Array(9, 10, 11, 12, 13, 14, 15, 16))

scala> rdd10.repartition(3).glom.collect
res26: Array[Array[Int]] = Array(Array(3, 7, 10, 13, 16), Array(1, 4, 5, 8, 11, 14), Array(2, 6, 9, 12, 15))
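The difference between the two outputs can be modelled in plain Scala. This is a deliberately simplified sketch, not Spark's coalescing algorithm (which also considers data locality and may group partitions differently, as the 4-to-3 result above shows), and the names RepartitionSketch, mergeWhole and dealOut are made up for illustration. It only captures the core idea: without a shuffle whole partitions are glued together, with a shuffle every element is redistributed individually.

```scala
object RepartitionSketch {
  type Partitions[T] = Vector[Vector[T]]

  // No shuffle: whole parent partitions are concatenated into groups;
  // elements never move individually, and the count can only shrink.
  def mergeWhole[T](parts: Partitions[T], n: Int): Partitions[T] = {
    val target = math.min(n, parts.size)
    val grouped = parts.zipWithIndex.groupBy { case (_, i) => i * target / parts.size }
    Vector.tabulate(target)(g => grouped.getOrElse(g, Vector.empty).flatMap(_._1))
  }

  // Shuffle: every element is dealt out individually, round-robin,
  // so the resulting partition sizes differ by at most one.
  def dealOut[T](parts: Partitions[T], n: Int): Partitions[T] = {
    val byTarget = parts.flatten.zipWithIndex.groupBy { case (_, i) => i % n }
    Vector.tabulate(n)(t => byTarget.getOrElse(t, Vector.empty).map(_._1))
  }
}
```

Applied to four partitions holding 1 to 16, mergeWhole with n = 2 yields two partitions of eight consecutive numbers each, while dealOut with n = 3 yields partitions of sizes 6, 5 and 5 with the numbers interleaved, similar in spirit to the repartition(3) output.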

 

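12. sortBy(func,[ascending],[numPartitions])

Applies func to each element of the RDD to compute a sort key, then sorts the elements by that key. The order is ascending by default (ascending = true). The optional third argument sets the number of partitions of the result. In the example below, rdd12 is an RDD[Int] holding 2, 4, 5 and 7; its definition is not shown.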

scala> rdd12.sortBy(x => x).collect
res27: Array[Int] = Array(2, 4, 5, 7)

scala> rdd12.sortBy(x => x%3).collect
res29: Array[Int] = Array(4, 7, 2, 5)
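Sorting by a derived key, in both directions, can be tried out on a local collection. The sketch below uses Scala's own Seq.sortBy, not the RDD operator, and SortBySketch is a made-up name; on an RDD the descending variant would be written rdd12.sortBy(x => x % 3, ascending = false). Note that the local sort is stable, whereas the relative order of elements with equal keys in a distributed sort should not be relied on.

```scala
object SortBySketch {
  val data: Seq[Int] = Seq(2, 4, 5, 7)

  // Ascending by the derived key x % 3 (keys: 2 -> 2, 4 -> 1, 5 -> 2, 7 -> 1).
  val byRemainder: Seq[Int] = data.sortBy(x => x % 3)

  // Descending by the same key, using a reversed ordering.
  val byRemainderDesc: Seq[Int] = data.sortBy(x => x % 3)(Ordering[Int].reverse)
}
```

Here byRemainder is Seq(4, 7, 2, 5), the same order as the spark-shell result above, and byRemainderDesc is Seq(2, 5, 4, 7).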

 

13. pipe(command,[envVars])

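Pipe: runs a shell script once for each partition and returns the script's output as a new RDD.

First write a script pipe.sh that prints "AA" and then prefixes every line it reads with >>>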

#! /bin/bash
echo "AA"
while read LINE; do
   echo ">>>"${LINE}
done

The code:

scala> val rdd13 = sc.makeRDD(List("1","2","3","4"),3)
rdd13: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[79] at makeRDD at <console>:24

scala> rdd13.pipe("/opt/module/spark/pipe.sh").collect
res31: Array[String] = Array(AA, >>>1, AA, >>>2, AA, >>>3, >>>4)

scala> rdd13.glom.collect
res32: Array[Array[String]] = Array(Array(1), Array(2), Array(3, 4))

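The result shows that the script runs once per partition: "AA" appears three times, once for each of the three partitions.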
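The per-partition behaviour can be imitated outside Spark with scala.sys.process. This is only a sketch under a few assumptions: a POSIX sh is available on the PATH, the script is inlined as a string instead of read from pipe.sh, and PipeSketch, pipePartition and pipeAll are made-up names. Each partition's elements are written to the command's stdin, one per line, and every line of stdout becomes an output element.

```scala
import java.io.ByteArrayInputStream
import scala.sys.process._

object PipeSketch {
  // Inline stand-in for pipe.sh, so the sketch needs no file on disk.
  val script: String = """echo "AA"; while read LINE; do echo ">>>${LINE}"; done"""

  // Run the command once for ONE partition: elements go to stdin,
  // one per line; each stdout line becomes an output element.
  def pipePartition(elems: Seq[String]): Seq[String] = {
    val stdin = new ByteArrayInputStream(elems.map(_ + "\n").mkString.getBytes("UTF-8"))
    val out = (Process(Seq("sh", "-c", script)) #< stdin).!!
    out.split("\n").toSeq.filter(_.nonEmpty)
  }

  // One process per partition, outputs concatenated in partition order.
  def pipeAll(partitions: Seq[Seq[String]]): Seq[String] =
    partitions.flatMap(pipePartition)
}
```

Calling PipeSketch.pipeAll(Seq(Seq("1"), Seq("2"), Seq("3", "4"))) starts three processes, so "AA" is printed three times and the partition holding 3 and 4 contributes a single "AA" followed by both prefixed lines.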
