RDDs fall broadly into Value types and Key-Value types, where the Value types further include double-Value types. What follows is a walkthrough of the transformation operators for Value-type RDDs:
7. filter(func)
8. sample(withReplacement,fraction,seed)
9. distinct(num)
10. coalesce(numPartitions) (shuffle optional)
11. repartition(numPartitions) (shuffle by default)
12. sortBy(func,[ascending],[numTasks])
13. pipe(command,[envVars])
Filter: evaluates each element of the source RDD with func and keeps the elements for which func returns true.
e.g.: create a string RDD and filter out a new RDD of the elements containing "xiao"
scala> val rdd7 = sc.makeRDD(Array("xiaowang","xiaoli","wangwu","zhaoliu"))
rdd7: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[26] at makeRDD at <console>:24
scala> val filter = rdd7.filter(_.contains("xiao"))
filter: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[27] at filter at <console>:26
scala> filter.collect
res23: Array[String] = Array(xiaowang, xiaoli)
Randomly samples elements from the source RDD.
Random numbers are generated from the given seed; withReplacement=true samples with replacement, false samples without replacement.
ps: with a fixed seed and a fixed algorithm, the generated numbers are fixed as well, i.e. pseudo-random; pass seed = System.currentTimeMillis (the current time) to get a different sample on every run.
Use case: checking for data skew. RDDs are partitioned, so take an RDD holding the data 1,1,1,1,1,1,1,2,3 split across three partitions: the seven 1s land in partition 1 while 2 and 3 each occupy a partition of their own, an obvious skew. A without-replacement sample is enough to reveal that the value 1 is far too frequent, so the seven 1s can be scattered by repartitioning to avoid the skew; a sketch of this check follows.
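A minimal sketch of that check, assuming a running spark-shell with its SparkContext sc (the data and the fraction are illustrative):
// sample without replacement, then count how often each value appears
val skewed = sc.makeRDD(Seq(1, 1, 1, 1, 1, 1, 1, 2, 3), 3)
val sampleCounts = skewed.sample(false, 0.5, System.currentTimeMillis)
  .map(x => (x, 1))
  .reduceByKey(_ + _)
  .collect()
sampleCounts.foreach(println)   // (1, n) typically dwarfs the other counts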
Here is the source of sample:
def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T] = {
  require(fraction >= 0,
    s"Fraction must be nonnegative, but got ${fraction}")
  withScope {
    require(fraction >= 0.0, "Negative fraction value: " + fraction)
    if (withReplacement) {
      new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
    } else {
      new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
    }
  }
}
As the source shows, when withReplacement=true a Poisson sampler is applied to fraction; when false (sampling without replacement) a Bernoulli sampler is used instead.
(1) Sampling with replacement
With replacement, fraction is the expected number of times each element is selected and may exceed 1 (the actual number of repetitions is not guaranteed to equal fraction).
(2) Sampling without replacement
A Bernoulli trial has only two outcomes, 1 and 0: each element of the collection is judged independently, 1 meaning the element is selected and 0 meaning it is not.
fraction must lie in [0, 1], say 0.4: for each element, the seeded generator produces a random value that is compared with 0.4; if the value is below 0.4 the element is selected, otherwise it is not.
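A toy sketch of that Bernoulli decision in plain Scala (this mimics the idea only; it is not Spark's actual sampler):
import scala.util.Random
val fraction = 0.4
val rng = new Random(2L)   // fixed seed => the same picks on every run
val kept = (1 to 10).filter(_ => rng.nextDouble() < fraction)
// each element is kept independently with probability ~fraction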
e.g.: create an RDD and compare sampling with and without replacement
(1) With replacement: fraction is the expected number of repetitions per element
scala> val rdd8 = sc.makeRDD(1 to 10)
rdd8: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at <console>:24
scala> rdd8.sample(true,3,2).collect
res0: Array[Int] = Array(1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10)
scala> rdd8.sample(true,0.4,2).collect
res1: Array[Int] = Array(1, 4, 5, 6, 6, 7, 9)
(2) Without replacement: fraction lies between 0 and 1; 0 selects nothing, 1 selects everything
scala> rdd8.sample(false , 0 , 2).collect
res4: Array[Int] = Array()
scala> rdd8.sample(false , 1 , 2).collect
res5: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> rdd8.sample(false , 0.5 , 2).collect
res6: Array[Int] = Array(2, 3, 5, 8, 10)
Deduplicates the source RDD. num sets the number of partitions of the deduplicated RDD; without num, the result keeps the same partition count as the source RDD.
e.g.: deduplicate without num and inspect the partition counts before and after
scala> val rdd9 = sc.makeRDD(List(1,2,1,3,2,9,6,1))
rdd9: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[28] at makeRDD at <console>:24
scala> rdd9.glom.collect
res15: Array[Array[Int]] = Array(Array(1), Array(2), Array(1), Array(3), Array(2), Array(9), Array(6), Array(1))
scala> rdd9.distinct.collect
res16: Array[Int] = Array(1, 9, 2, 3, 6)
scala> rdd9.distinct.glom.collect
res17: Array[Array[Int]] = Array(Array(), Array(1, 9), Array(2), Array(3), Array(), Array(), Array(6), Array())
Without num, the partition count is unchanged by distinct: 8 partitions before and after.
e.g.: deduplicate with num and inspect the partition counts before and after
scala> rdd9.distinct(3).collect
res18: Array[Int] = Array(6, 3, 9, 1, 2)
scala> rdd9.distinct(3).glom.collect
res19: Array[Array[Int]] = Array(Array(6, 3, 9), Array(1), Array(2))
With num = 3, deduplication reduced the partition count from the source RDD's 8 down to 3; a sketch of why distinct can do this follows.
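Internally, distinct is a map/reduceByKey round trip (this mirrors Spark's implementation, shown here as a hedged sketch); roughly, rdd9.distinct(3) boils down to:
val deduped = rdd9.map(x => (x, null))
  .reduceByKey((x, _) => x, 3)   // 3 = partition count after the shuffle
  .map(_._1)
deduped.collect                  // same elements as rdd9.distinct(3)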
Repartitions the RDD by changing its number of partitions; typically used to shrink the partition count.
e.g.: create an RDD with 4 partitions and shrink it to 2 with coalesce
scala> val rdd10 = sc.makeRDD(1 to 16 , 4)
rdd10: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[44] at makeRDD at <console>:24
scala> rdd10.partitions.size
res20: Int = 4
scala> rdd10.coalesce(2).partitions.size
res21: Int = 2
scala> rdd10.coalesce(2).collect
res22: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
The data is unchanged; the partition count dropped from 4 to 2.
Similar to coalesce: repartitions the RDD while leaving the data unchanged.
scala> val rdd11 = sc.makeRDD(1 to 16,4)
rdd11: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[47] at makeRDD at <console>:24
scala> rdd11.repartition(1).partitions.size
res23: Int = 1
scala> rdd11.repartition(1).collect
res24: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
ps: both modify the partition count; the difference between coalesce and repartition shows up in the source
# coalesce source:
def coalesce(numPartitions: Int, shuffle: Boolean = false,
             partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
            (implicit ord: Ordering[T] = null): RDD[T] = withScope {
  ...
}
# repartition source:
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}
The source shows that coalesce lets you choose whether to shuffle (i.e. scatter and regroup the data) via the shuffle parameter, false or true;
repartition, on the other hand, simply calls coalesce with shuffle = true.
Comparing the partition contents, repartition's shuffle spreads the data across partitions far more evenly, guarding against data skew (see the sketch after the output below):
scala> rdd10.coalesce(3).glom.collect
res25: Array[Array[Int]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8), Array(9, 10, 11, 12, 13, 14, 15, 16))
scala> rdd10.repartition(3).glom.collect
res26: Array[Array[Int]] = Array(Array(3, 7, 10, 13, 16), Array(1, 4, 5, 8, 11, 14), Array(2, 6, 9, 12, 15))
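Since repartition(n) is just coalesce(n, shuffle = true), asking coalesce itself to shuffle yields the same balanced layout:
rdd10.coalesce(3, shuffle = true).glom.collect
// partitions come back roughly equal in size, as with repartition(3)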
Applies func to each element to compute a sort key, then sorts the RDD; ascending order by default. The optional numTasks sets the number of tasks. (rdd12 below is an RDD holding the elements 2, 4, 5, 7.)
scala> rdd12.sortBy(x => x).collect
res27: Array[Int] = Array(2, 4, 5, 7)
scala> rdd12.sortBy(x => x%3).collect
res29: Array[Int] = Array(4, 7, 2, 5)
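ascending and the task count can also be passed explicitly; for instance, sorting in descending order over 2 partitions (parameter names as in the Spark API):
rdd12.sortBy(x => x, ascending = false, numPartitions = 2).collect
// expected: Array(7, 5, 4, 2)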
Pipe: runs a shell script once per partition and returns an RDD of the script's output.
First write a pipe.sh script that prints "AA" and then prefixes every line it reads with >>>:
#! /bin/bash
echo "AA"
while read LINE; do
echo ">>>"${LINE}
done
The code:
scala> val rdd13 = sc.makeRDD(List("1","2","3","4"),3)
rdd13: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[79] at makeRDD at <console>:24
scala> rdd13.pipe("/opt/module/spark/pipe.sh").collect
res31: Array[String] = Array(AA, >>>1, AA, >>>2, AA, >>>3, >>>4)
scala> rdd13.glom.collect
res32: Array[Array[String]] = Array(Array(1), Array(2), Array(3, 4))
The output shows that the script runs once per partition: "AA" is printed three times for the three partitions, and the two elements of the last partition share a single run.
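pipe can also hand environment variables to the script via the envVars parameter from the signature above; a hedged one-liner (MY_VAR is a made-up variable that pipe.sh would have to read itself):
rdd13.pipe("/opt/module/spark/pipe.sh", Map("MY_VAR" -> "hello")).collect
// each per-partition run of the script sees MY_VAR in its environment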