In the previous post we analyzed the simpler transformation operators; in this one we look at the more commonly used ones.
1. The groupBy operator
groupBy, as its name suggests, groups the elements of an RDD, but we have to supply the grouping function ourselves. It differs from groupByKey, which groups a pair RDD directly by its keys.
Source:
def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
  groupBy[K](f, defaultPartitioner(this))
}

def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
    : RDD[(K, Iterable[T])] = withScope {
  val cleanF = sc.clean(f)
  this.map(t => (cleanF(t), t)).groupByKey(p)
}
The source shows that groupBy is still built on groupByKey: a map first turns each element t into the pair (f(t), t), so the result of the grouping function becomes the key, and groupByKey then groups by that key, producing a result of type RDD[(K, Iterable[T])].
Usage:
val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("BOKE")
val sc = new SparkContext(conf)
val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9))
val value: RDD[(Boolean, Iterable[Int])] = rdd.groupBy((x: Int) => x % 2 == 0)
println(value.count())
value.mapValues(_.max).foreach(println)   // max within each group
value.foreach((iter: (Boolean, Iterable[Int])) => {
  println(iter._2.toList.mkString(","))
})
After the groupBy above we can compute per-group statistics such as max, min and count, or simply pass the grouped data on to the next operation.
2. The groupByKey operator
groupByKey is similar to groupBy in that it groups data, but it groups a pair RDD directly by its keys. The return type is RDD[(K, Iterable[V])].
Source:
def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(defaultPartitioner(self))
}

def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
Usage:
val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("BOKE")
val sc = new SparkContext(conf)
val rdd: RDD[String] = sc.makeRDD(List("spark", "hello", "hadoop","hello"))
rdd.map((_: String,1)).groupByKey().
map((x: (String, Iterable[Int])) =>(x._1,x._2.sum))
.foreach(println)
The example above is the classic word count.
Here groupByKey simply groups by key: it takes no function argument and performs no within-partition (map-side) aggregation, so as the data volume grows the shuffle may become large enough to cause OOM.
3. The reduceByKey operator
reduceByKey is a group-and-aggregate operator for pair RDDs. Its biggest difference from groupByKey is that it takes an aggregation function and supports within-partition combining, which reduces the amount of data shuffled.
Source:
/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.
*/
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {
reduceByKey(new HashPartitioner(numPartitions), func)
}
/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
* parallelism level.
*/
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
reduceByKey(defaultPartitioner(self), func)
}
/**
 * Merge the values for each key using an associative and commutative reduce function. This will
 * also perform the merging locally on each mapper before sending results to a reducer, similarly
 * to a "combiner" in MapReduce.
 */
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
reduceByKey works as follows: before the shuffle, the supplied function is first applied within each partition (map-side combine), and the partial results are then aggregated across partitions after the shuffle. Note that only one function is passed in, so the within-partition and between-partition aggregation functions are the same. This reduces the amount of data shuffled between partitions and, with it, the risk of OOM.
The return type of reduceByKey is RDD[(K, V)], where K is the key and V is the aggregated value for that key.
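A minimal usage sketch (word count with reduceByKey; the sample data here is my own, not from the original post):
val words: RDD[String] = sc.makeRDD(List("spark", "hello", "hadoop", "hello"))
words.map((_, 1)).reduceByKey(_ + _).foreach(println)
// expected pairs (print order is not guaranteed): (spark,1) (hadoop,1) (hello,2)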
4. The aggregateByKey operator
aggregateByKey is similar to reduceByKey: both are group-and-aggregate operators. The difference is that aggregateByKey takes an explicit initial value and separate functions for within-partition and between-partition aggregation.
The source is as follows:
/**
* Aggregate the values of each key, using given combine functions and a neutral "zero value".
* This function can return a different result type, U, than the type of the values in this RDD,
* V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
* as in scala.TraversableOnce. The former operation is used for merging values within a
* partition, and the latter is used for merging values between partitions. To avoid memory
* allocation, both of these functions are allowed to modify and return their first argument
* instead of creating a new U.
*/
def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp)
}
/**
* Aggregate the values of each key, using given combine functions and a neutral "zero value".
* This function can return a different result type, U, than the type of the values in this RDD,
* V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
* as in scala.TraversableOnce. The former operation is used for merging values within a
* partition, and the latter is used for merging values between partitions. To avoid memory
* allocation, both of these functions are allowed to modify and return their first argument
* instead of creating a new U.
*/
def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int)(seqOp: (U, V) => U,
combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
aggregateByKey(zeroValue, new HashPartitioner(numPartitions))(seqOp, combOp)
}
/**
* Aggregate the values of each key, using given combine functions and a neutral "zero value".
* This function can return a different result type, U, than the type of the values in this RDD,
* V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
* as in scala.TraversableOnce. The former operation is used for merging values within a
* partition, and the latter is used for merging values between partitions. To avoid memory
* allocation, both of these functions are allowed to modify and return their first argument
* instead of creating a new U.
*/
def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
// Serialize the zero value to a byte array so that we can get a new clone of it on each key
val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
val zeroArray = new Array[Byte](zeroBuffer.limit)
zeroBuffer.get(zeroArray)
lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))
// We will clean the combiner closure later in `combineByKey`
val cleanedSeqOp = self.context.clean(seqOp)
combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
cleanedSeqOp, combOp, partitioner)
}
As the source shows, aggregateByKey is called with two parameter lists. In the first list (zeroValue: U, numPartitions: Int), zeroValue is the initial value and numPartitions is the number of partitions.
The second parameter list is (seqOp: (U, V) => U, combOp: (U, U) => U):
seqOp is the within-partition aggregation function and combOp is the between-partition aggregation function.
Usage is as follows:
val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("BOKE")
val sc = new SparkContext(conf)
// groupByKey
val rdd: RDD[String] = sc.makeRDD(List("spark", "hadoop","spark", "hadoop", "hello", "hadoop", "hello"))
rdd.map((_: String,1)).groupByKey().
map((x: (String, Iterable[Int])) =>(x._1,x._2.sum))
.foreach(println)
// reduceByKey
rdd.map((_:String,1)).reduceByKey((_: Int)+(_: Int)).foreach(println)
// aggregateByKey
rdd.map((_:String,1)).aggregateByKey(1,2)(_+_*2,_+_).foreach(println)
output:
(hello,6)
(spark,6)
(hadoop,9)
5. The foldByKey operator
foldByKey sits between aggregateByKey and reduceByKey. Like aggregateByKey it is curried: the first parameter list takes the initial value (and optionally the number of partitions), and the second takes a single aggregation function that is used both within and between partitions (which is why that list has only one parameter).
Source:
/**
* Merge the values for each key using an associative function and a neutral "zero value" which
* may be added to the result an arbitrary number of times, and must not change the result
* (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication.).
*/
def foldByKey(zeroValue: V, numPartitions: Int)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
foldByKey(zeroValue, new HashPartitioner(numPartitions))(func)
}
/**
* Merge the values for each key using an associative function and a neutral "zero value" which
* may be added to the result an arbitrary number of times, and must not change the result
* (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication.).
*/
def foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
foldByKey(zeroValue, defaultPartitioner(self))(func)
}
/**
* Merge the values for each key using an associative function and a neutral "zero value" which
* may be added to the result an arbitrary number of times, and must not change the result
* (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication.).
*/
def foldByKey(
zeroValue: V,
partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
// Serialize the zero value to a byte array so that we can get a new clone of it on each key
val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
val zeroArray = new Array[Byte](zeroBuffer.limit)
zeroBuffer.get(zeroArray)
// When deserializing, use a lazy val to create just one instance of the serializer per task
lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
val createZero = () => cachedSerializer.deserialize[V](ByteBuffer.wrap(zeroArray))
val cleanedFunc = self.context.clean(func)
combineByKeyWithClassTag[V]((v: V) => cleanedFunc(createZero(), v),
cleanedFunc, cleanedFunc, partitioner)
}
Usage:
val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("BOKE")
val sc = new SparkContext(conf)
// groupByKey
val rdd: RDD[String] = sc.makeRDD(List("spark", "hadoop","spark", "hadoop", "hello", "hadoop", "hello"))
rdd.map((_: String,1)).groupByKey().
map((x: (String, Iterable[Int])) =>(x._1,x._2.sum))
.foreach(println)
// reduceByKey
rdd.map((_:String,1)).reduceByKey((_: Int)+(_: Int)).foreach(println)
// aggregateByKey
rdd.map((_:String,1)).aggregateByKey(0,2)(_+_*2,_+_).foreach(println)
// foldByKey
rdd.map((_:String,1)).foldByKey(0,2)(_+_).foreach(println)
output:
(hello,2)
(spark,2)
(hadoop,3)
6. The combineByKey operator
combineByKey is useful when the value type does not match what the aggregation needs: it lets us convert the first value seen for each key into a different structure, and the within-partition and between-partition combine functions can differ. We have to specify the intermediate data structure ourselves.
Let's look at the source first:
/**
 * Generic function to combine the elements for each key using a custom set of aggregation
 * functions. This method is here for backward compatibility. It does not provide combiner
 * classtag information to the shuffle.
 *
 * @see `combineByKeyWithClassTag`
 */
def combineByKey[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null): RDD[(K, C)] = self.withScope {
  combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,
    partitioner, mapSideCombine, serializer)(null)
}

/**
 * Simplified version of combineByKeyWithClassTag that hash-partitions the output RDD.
 * This method is here for backward compatibility. It does not provide combiner
 * classtag information to the shuffle.
 *
 * @see `combineByKeyWithClassTag`
 */
def combineByKey[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    numPartitions: Int): RDD[(K, C)] = self.withScope {
  combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners, numPartitions)(null)
}
As the source shows, three parameters are required: createCombiner, mergeValue and mergeCombiners; numPartitions (the number of partitions) is optional.
Usage:
val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("BOKE")
val sc = new SparkContext(conf)
// groupByKey
val rdd: RDD[String] = sc.makeRDD(List("spark", "hadoop","spark", "hadoop", "hello", "hadoop", "hello"))
val rdd1: RDD[(String, Double)] = sc.makeRDD(List(("Tom", 88.0), ("Tom", 90.0), ("jack", 86.0), ("jack", 85.0), ("jack", 82.0), ("Tom", 90.0)))
rdd.map((_: String,1)).groupByKey().
map((x: (String, Iterable[Int])) =>(x._1,x._2.sum))
.foreach(println)
// reduceByKey
rdd.map((_:String,1)).reduceByKey((_: Int)+(_: Int)).foreach(println)
// aggregateByKey
rdd.map((_:String,1)).aggregateByKey(0,2)((_: Int)+(_: Int)*2,(_: Int)+(_: Int)).foreach(println)
// foldByKey
rdd.map((_:String,1)).foldByKey(0,2)((_: Int)+(_: Int)).foreach(println)
// combineByKey
rdd.map((_: String, 1)).combineByKey((v: Int) => v.toDouble, (x: Double, y: Int) => x + y, (x: Double, y: Double) => x + y).foreach(println)   // word count with a Double combiner type
val value: RDD[(String, Double)] = rdd1.combineByKey(
  (score: Double) => (1, score),                                             // first value for a key -> (count, sum)
  (c: (Int, Double), score: Double) => (c._1 + 1, c._2 + score),             // fold another value into (count, sum) within a partition
  (c1: (Int, Double), c2: (Int, Double)) => (c1._1 + c2._1, c1._2 + c2._2)   // merge per-partition (count, sum) pairs
).map { case (name, (count, sum)) => (name, sum / count) }                   // average score per key
value.foreach(println)
output:
(jack,84.33333333333333)
(Tom,89.33333333333333)
Above we covered five grouping (and aggregation) operators; here is a summary:

Operator | Purpose | In common | Differences |
---|---|---|---|
groupByKey | grouping | works on pair RDDs | Only groups, with no aggregation; all data is shuffled across partitions, so the shuffle can be large and the OOM risk is relatively high. |
reduceByKey | group + aggregate | works on pair RDDs; aggregates within and between partitions | Takes a single parameter (the aggregation function) and is not curried; the same function is used within and between partitions. It aggregates within each partition before the shuffle and then aggregates the partial results across partitions, reducing the shuffled data volume; the first value acts as the initial value. It is built on combineByKey. |
aggregateByKey | group + aggregate | works on pair RDDs; aggregates within and between partitions | Curried, with two parameter lists: the first takes the initial value and the partition count, the second takes two functions, the within-partition and the between-partition aggregation functions. Like reduceByKey, it aggregates within partitions first and then across partitions; the main difference is the explicit initial value, which solves aggregation problems in particular scenarios. |
foldByKey | group + aggregate | works on pair RDDs; aggregates within and between partitions | Like aggregateByKey, it is curried: the first parameter list takes the initial value and the partition count, the second takes a single function used both within and between partitions. It sits between aggregateByKey and reduceByKey. |
combineByKey | group + aggregate | works on pair RDDs; aggregates within and between partitions | A lower-level operator; not curried. It takes three main parameters: createCombiner, mergeValue and mergeCombiners. createCombiner converts the first value seen for a key into the intermediate structure for the aggregation (think of it as drawing the table before filling it in), mergeValue folds subsequent values into that structure within a partition, and mergeCombiners merges the per-partition results. |
7. The sortBy and sortByKey operators
sortBy and sortByKey are summarized together because they are so alike: sortBy is implemented on top of sortByKey.
sortByKey, as the name suggests, sorts a pair RDD by its keys and is very simple to use. Let's look at the source first.
Source:
/**
* Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
* `collect` or `save` on the resulting RDD will return or output an ordered list of records
* (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
* order of the keys).
*/
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
: RDD[(K, V)] = self.withScope
{
val part = new RangePartitioner(numPartitions, self, ascending)
new ShuffledRDD[K, V, V](self, part)
.setKeyOrdering(if (ascending) ordering else ordering.reverse)
}
/**
* Return this RDD sorted by the given key function.
*/
def sortBy[K](
f: (T) => K,
ascending: Boolean = true,
numPartitions: Int = this.partitions.length)
(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
this.keyBy[K](f)
.sortByKey(ascending, numPartitions)
.values
}
As the source shows, sortByKey takes the sort order and the number of partitions and works on pair RDDs, sorting by key; sortBy first derives a sort key with the supplied function (via keyBy) and then calls sortByKey. Both are simple operators.
Here is a quick usage demo:
// TODO: create the execution environment
val conf = new SparkConf().setMaster("local[*]").setAppName("create")
val sc = new SparkContext(conf)
val rdd: RDD[(String, Double)] = sc.makeRDD(List(("Tom", 88.0), ("Tom", 90.0), ("jack", 86.0), ("Toni", 85.0), ("jack", 82.0), ("Tom", 93.0)),2)
rdd.sortByKey(ascending = true).foreach(println)
rdd.sortBy(_._1,ascending = false).foreach(println)
output:
// sortByKey
(Tom,88.0)
(Toni,85.0)
(Tom,90.0)
(jack,86.0)
(Tom,93.0)
(jack,82.0)
// sortBy
(Tom,88.0)
(Tom,90.0)
(Tom,93.0)
(jack,86.0)
(jack,82.0)
(Toni,85.0)
To summarize: sortByKey sorts a pair RDD by key using a RangePartitioner, while sortBy first builds a key with the supplied function and then delegates to sortByKey. (Note that foreach prints from the workers partition by partition, so the console order above is not necessarily the global sorted order.)
8. The cogroup, join and cartesian operators
Source of cogroup and join:
/**
* For each key k in `this` or `other1` or `other2` or `other3`,
* return a resulting RDD that contains a tuple with the list of values
* for that key in `this`, `other1`, `other2` and `other3`.
*/
def cogroup[W1, W2, W3](other1: RDD[(K, W1)],
other2: RDD[(K, W2)],
other3: RDD[(K, W3)],
partitioner: Partitioner)
: RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))] = self.withScope {
if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
throw new SparkException("HashPartitioner cannot partition array keys.")
}
val cg = new CoGroupedRDD[K](Seq(self, other1, other2, other3), partitioner)
cg.mapValues { case Array(vs, w1s, w2s, w3s) =>
(vs.asInstanceOf[Iterable[V]],
w1s.asInstanceOf[Iterable[W1]],
w2s.asInstanceOf[Iterable[W2]],
w3s.asInstanceOf[Iterable[W3]])
}
}
/**
* Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
* pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
* (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
*/
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
this.cogroup(other, partitioner).flatMapValues( pair =>
for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
)
}
join is built on cogroup: for each key it iterates over the two iterables and yields their pairings, as in
for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w). When one of the iterables is empty, that key produces no output, so join only keeps keys that exist in both RDDs and drops the rest, just like an inner join in MySQL. Likewise, leftOuterJoin and rightOuterJoin behave like MySQL's left and right outer joins.
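A small sketch of leftOuterJoin (my own example, not from the original post): keys missing on the right side are kept, with the right-hand value wrapped in an Option:
val left: RDD[(String, Int)] = sc.makeRDD(List(("a", 1), ("b", 2)))
val right: RDD[(String, Int)] = sc.makeRDD(List(("a", 10)))
left.leftOuterJoin(right).foreach(println)
// (a,(1,Some(10)))
// (b,(2,None))
The cartesian operator below, by contrast, pairs every element of one RDD with every element of the other: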
/**
* Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
* elements (a, b) where a is in `this` and b is in `other`.
*/
def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
new CartesianRDD(sc, this, other)
}
private[spark]
class CartesianRDD[T: ClassTag, U: ClassTag](
sc: SparkContext,
var rdd1 : RDD[T],
var rdd2 : RDD[U])
extends RDD[(T, U)](sc, Nil)
with Serializable {
val numPartitionsInRdd2 = rdd2.partitions.length
override def getPartitions: Array[Partition] = {
// create the cross product split
val array = new Array[Partition](rdd1.partitions.length * rdd2.partitions.length)
for (s1 <- rdd1.partitions; s2 <- rdd2.partitions) {
val idx = s1.index * numPartitionsInRdd2 + s2.index
array(idx) = new CartesianPartition(idx, rdd1, rdd2, s1.index, s2.index)
}
array
}
override def getPreferredLocations(split: Partition): Seq[String] = {
val currSplit = split.asInstanceOf[CartesianPartition]
(rdd1.preferredLocations(currSplit.s1) ++ rdd2.preferredLocations(currSplit.s2)).distinct
}
override def compute(split: Partition, context: TaskContext): Iterator[(T, U)] = {
val currSplit = split.asInstanceOf[CartesianPartition]
for (x <- rdd1.iterator(currSplit.s1, context);
y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
}
override def getDependencies: Seq[Dependency[_]] = List(
new NarrowDependency(rdd1) {
def getParents(id: Int): Seq[Int] = List(id / numPartitionsInRdd2)
},
new NarrowDependency(rdd2) {
def getParents(id: Int): Seq[Int] = List(id % numPartitionsInRdd2)
}
)
override def clearDependencies(): Unit = {
super.clearDependencies()
rdd1 = null
rdd2 = null
}
}
The key part is:
override def compute(split: Partition, context: TaskContext): Iterator[(T, U)] = {
val currSplit = split.asInstanceOf[CartesianPartition]
for (x <- rdd1.iterator(currSplit.s1, context);
y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
}
Usage:
val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("BOKE")
val sc = new SparkContext(conf)
val rdd1: RDD[(String, Double)] = sc.makeRDD(List(("Tom", 88.0), ("Tom", 90.0), ("jack", 86.0), ("Toni", 85.0), ("jack", 82.0), ("Tom", 93.0)))
val rdd2: RDD[(String, Double)] = sc.makeRDD(List(("Tom", 85.0), ("jack", 82.0), ("Toni", 93.0)))
val value: RDD[(String, (Iterable[Double], Iterable[Double]))] = rdd1.cogroup(rdd2)
value.foreach(println)
// (Toni,(CompactBuffer(85.0),CompactBuffer(93.0)))
// (Tom,(CompactBuffer(88.0, 90.0, 93.0),CompactBuffer(85.0)))
// (jack,(CompactBuffer(86.0, 82.0),CompactBuffer(82.0)))
val value1: RDD[(String, (Double, Double))] = rdd1.join(rdd2)
value1.foreach(println)
// (jack,(86.0,82.0))
// (Toni,(85.0,93.0))
// (jack,(82.0,82.0))
// (Tom,(88.0,85.0))
// (Tom,(90.0,85.0))
// (Tom,(93.0,85.0))
val value2: RDD[((String, Double), (String, Double))] = rdd1.cartesian(rdd2)
value2.foreach(println)
// ((Tom,88.0),(Tom,85.0))
// ((Tom,88.0),(jack,82.0))
// ((Tom,88.0),(Toni,93.0))
// ((Tom,90.0),(Tom,85.0))
// ((Tom,90.0),(jack,82.0))
// ((Tom,90.0),(Toni,93.0))
// ((jack,86.0),(Tom,85.0))
// ((jack,86.0),(jack,82.0))
// ((jack,86.0),(Toni,93.0))
9. The pipe operator
pipe runs the RDD through an external program; the argument is simply a shell command or the path of an external script. The official docs describe it as: "Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings."
This operator is fairly obscure and I have not used it myself; I will fill in the full source and usage once I understand it better.
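In the meantime, here is a minimal sketch of what usage might look like (my own example, assuming the tr command is available on every worker):
val lines: RDD[String] = sc.makeRDD(List("spark", "hadoop", "hello"))
// each element is written to the command's stdin; each stdout line becomes an output element
val upper: RDD[String] = lines.pipe("tr 'a-z' 'A-Z'")
upper.foreach(println)   // SPARK, HADOOP, HELLO (in some order)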
10. The coalesce, repartition and repartitionAndSortWithinPartitions operators
coalesce, repartition and repartitionAndSortWithinPartitions all operate on partitions. coalesce and repartition are pure repartitioning operators, while repartitionAndSortWithinPartitions additionally sorts the data within each resulting partition.
Since the principles behind these three operators are fairly simple, let's look at the source first:
coalesce:
/**
* Return a new RDD that is reduced into `numPartitions` partitions.
*
* This results in a narrow dependency, e.g. if you go from 1000 partitions
* to 100 partitions, there will not be a shuffle, instead each of the 100
* new partitions will claim 10 of the current partitions. If a larger number
* of partitions is requested, it will stay at the current number of partitions.
*
* However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
* this may result in your computation taking place on fewer nodes than
* you like (e.g. one node in the case of numPartitions = 1). To avoid this,
* you can pass shuffle = true. This will add a shuffle step, but means the
* current upstream partitions will be executed in parallel (per whatever
* the current partitioning is).
*
* @note With shuffle = true, you can actually coalesce to a larger number
* of partitions. This is useful if you have a small number of partitions,
* say 100, potentially with a few partitions being abnormally large. Calling
* coalesce(1000, shuffle = true) will result in 1000 partitions with the
* data distributed using a hash partitioner. The optional partition coalescer
* passed in must be serializable.
*/
def coalesce(numPartitions: Int, shuffle: Boolean = false,
partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
(implicit ord: Ordering[T] = null)
: RDD[T] = withScope {
require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
if (shuffle) {
/** Distributes elements evenly across output partitions, starting from a random partition. */
val distributePartition = (index: Int, items: Iterator[T]) => {
var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
items.map { t =>
// Note that the hash code of the key will just be the key itself. The HashPartitioner
// will mod it with the number of total partitions.
position = position + 1
(position, t)
}
} : Iterator[(Int, T)]
// include a shuffle step so that our upstream tasks are still distributed
new CoalescedRDD(
new ShuffledRDD[Int, T, T](
mapPartitionsWithIndexInternal(distributePartition, isOrderSensitive = true),
new HashPartitioner(numPartitions)),
numPartitions,
partitionCoalescer).values
} else {
new CoalescedRDD(this, numPartitions, partitionCoalescer)
}
}
repartition:
/**
* Return a new RDD that has exactly numPartitions partitions.
*
* Can increase or decrease the level of parallelism in this RDD. Internally, this uses
* a shuffle to redistribute data.
*
* If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
* which can avoid performing a shuffle.
*/
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
coalesce(numPartitions, shuffle = true)
}
repartitionAndSortWithinPartitions:
def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
new ShuffledRDD[K, V, V](self, partitioner).setKeyOrdering(ordering)
}
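A minimal sketch of repartitionAndSortWithinPartitions (my own example; it requires importing org.apache.spark.HashPartitioner): it repartitions by key and sorts within each output partition in a single shuffle:
val pairs: RDD[(String, Int)] = sc.makeRDD(List(("b", 2), ("a", 1), ("c", 3), ("a", 4)))
pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))
  .foreachPartition(it => println(it.toList))   // each printed list is sorted by key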
Back to repartition: as its source shows, it is simply coalesce with shuffle = true.
Now let's look at when to use which. Assume an RDD has N partitions that we want to repartition into M partitions (a sketch illustrating these cases follows the summary below):
N < M: the N partitions usually hold unevenly distributed data, and a HashPartitioner is used to redistribute it into M partitions. In this case shuffle must be set to true, because the repartitioning is a wide dependency and a shuffle will occur; use coalesce(shuffle = true) or simply repartition().
N > M, with N and M reasonably close (say N is 1000 and M is 100): several of the N partitions can be merged into each new partition until M partitions remain. This is a narrow dependency, so coalesce(shuffle = false) works.
N > M, with N and M far apart: if shuffle is set to false, the parent and child RDDs form a narrow dependency and sit in the same stage, which may leave the Spark job with too little parallelism and hurt performance. In particular, when M is 1, set shuffle to true so that the operations before the coalesce still run with better parallelism.
Summary
If the requested number of partitions is larger than the current one and shuffle is false, the number of partitions stays unchanged; in other words, without a shuffle you cannot increase the number of partitions of an RDD. (This part is drawn from: Spark学习-Coalesce()方法和rePartition()方法.)
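A minimal sketch of these cases (my own example; the partition counts are arbitrary):
val nums: RDD[Int] = sc.makeRDD(1 to 100, 8)
val fewer  = nums.coalesce(2)                   // 8 -> 2: narrow dependency, no shuffle
val more   = nums.coalesce(16, shuffle = true)  // growing the partition count needs shuffle = true
val noOp   = nums.coalesce(16)                  // shuffle = false: stays at 8 partitions
val evened = nums.repartition(4)                // same as coalesce(4, shuffle = true)
println(Seq(fewer, more, noOp, evened).map(_.getNumPartitions))   // List(2, 16, 8, 4)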
11. The mapValues operator
mapValues is a standard pair-RDD operator. It is very much like map: both are one-to-one, non-shuffle (narrow) dependencies; the only difference is that mapValues applies the function to the values of the key-value pairs. The source and usage are both very simple, so here we only show the source:
/**
* Pass each value in the key-value pair RDD through a map function without changing the keys;
* this also retains the original RDD's partitioning.
*/
def mapValues[U](f: V => U): RDD[(K, U)] = self.withScope {
val cleanF = self.context.clean(f)
new MapPartitionsRDD[(K, U), (K, V)](self,
(context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },
preservesPartitioning = true)
}
In (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) } we can see that when the iterator is mapped, k is left unchanged and only v is turned into a new value by the function we passed in.
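A quick usage sketch (my own example): the keys stay unchanged while every value is doubled:
val kv: RDD[(String, Int)] = sc.makeRDD(List(("Tom", 1), ("jack", 3)))
kv.mapValues(_ * 2).foreach(println)
// (Tom,2)
// (jack,6)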
12. The flatMapValues operator
flatMapValues is a pair-RDD operator very much like flatMap: it flattens the values. For each key-value pair, the function we pass in produces a collection of new values, and the key is repeated for each of them, yielding a series of pairs that share the same key.
Let's look at the source:
/**
* Pass each value in the key-value pair RDD through a flatMap function without changing the
* keys; this also retains the original RDD's partitioning.
*/
def flatMapValues[U](f: V => TraversableOnce[U]): RDD[(K, U)] = self.withScope {
val cleanF = self.context.clean(f)
new MapPartitionsRDD[(K, U), (K, V)](self,
(context, pid, iter) => iter.flatMap { case (k, v) =>
cleanF(v).map(x => (k, x))
},
preservesPartitioning = true)
}
Usage:
val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("BOKE")
val sc = new SparkContext(conf)
val rdd2: RDD[(String, Int)] = sc.makeRDD(List(("Tom", 1), ("Tom", 2), ("jack", 3)))
rdd2.flatMapValues(_ to 5).foreach(println)
// (jack,3)
// (Tom,1)
// (Tom,2)
// (Tom,2)
// (jack,4)
// (Tom,3)
// (Tom,3)
// (Tom,4)
// (jack,5)
// (Tom,5)
// (Tom,4)
// (Tom,5)
13. The keys and values operators
keys and values simply return the keys or the values of a pair RDD as a new RDD (RDD[K] and RDD[V] respectively).
The usage and source are so simple that they are mostly omitted here; a tiny sketch follows below.
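A minimal sketch (my own example):
val pairRdd: RDD[(String, Int)] = sc.makeRDD(List(("Tom", 1), ("jack", 3)))
pairRdd.keys.foreach(println)     // Tom, jack (order not guaranteed)
pairRdd.values.foreach(println)   // 1, 3 (order not guaranteed)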
That's all on transformation operators for now; I will keep filling things in as I learn more. Next up are the action operators. Bye!