This post mainly explores how the commonly used RDD operators call one another.
distinct — this operator has two overloads: the parameterless one fills in the current partition count and then calls the parameterized one.
def distinct(): RDD[T] = withScope {
distinct(partitions.length)
}
The implementation is simple: map each element into a pair, then reduceByKey so that identical keys collapse into one, and finally map back to the original elements.
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}
groupBy — this operator has three overloads. The first, shown below, specifies neither a partition count nor a partitioner:
def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
groupBy[K](f, defaultPartitioner(this))
}
The second specifies a partition count and chooses a HashPartitioner:
def groupBy[K](
f: T => K,
numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
groupBy(f, new HashPartitioner(numPartitions))
}
The third specifies a Partitioner and relies on map and groupByKey; in fact the first two overloads both end up calling this one.
def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
: RDD[(K, Iterable[T])] = withScope {
val cleanF = sc.clean(f)
this.map(t => (cleanF(t), t)).groupByKey(p)
}
The source carries a note for this operator:
@note This operation may be very expensive. If you are grouping in order to perform an
aggregation (such as a sum or average) over each key, using PairRDDFunctions.aggregateByKey
or PairRDDFunctions.reduceByKey will provide much better performance.
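To make the note concrete, here is a hedged comparison (the pair data is made up): if all you need is a per-key sum, reduceByKey combines values on the map side before the shuffle, whereas groupByKey first ships every value across the network.
// Hypothetical example: summing values per key.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val viaGroup  = pairs.groupByKey().mapValues(_.sum)   // shuffles every value
val viaReduce = pairs.reduceByKey(_ + _)              // combines map-side first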
take has only one implementation, but several other operators call it, so they are grouped here.
First, the source of take; the slightly fiddly part is just computing the scan factors.
def take(num: Int): Array[T] = withScope {
val scaleUpFactor = Math.max(conf.getInt("spark.rdd.limit.scaleUpFactor", 4), 2)
if (num == 0) {
new Array[T](0)
} else {
val buf = new ArrayBuffer[T]
val totalParts = this.partitions.length
var partsScanned = 0
while (buf.size < num && partsScanned < totalParts) {
// The number of partitions to try in this iteration. It is ok for this number to be
// greater than totalParts because we actually cap it at totalParts in runJob.
var numPartsToTry = 1L
if (partsScanned > 0) {
// If we didn't find any rows after the previous iteration, quadruple and retry.
// Otherwise, interpolate the number of partitions we need to try, but overestimate
// it by 50%. We also cap the estimation in the end.
if (buf.isEmpty) {
numPartsToTry = partsScanned * scaleUpFactor
} else {
// the left side of max is >=1 whenever partsScanned >= 2
numPartsToTry = Math.max((1.5 * num * partsScanned / buf.size).toInt - partsScanned, 1)
numPartsToTry = Math.min(numPartsToTry, partsScanned * scaleUpFactor)
}
}
val left = num - buf.size
val p = partsScanned.until(math.min(partsScanned + numPartsToTry, totalParts).toInt)
val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p)
res.foreach(buf ++= _.take(num - buf.size))
partsScanned += p.size
}
buf.toArray
}
}
There are two notes worth paying attention to here:
@note This method should only be used if the resulting array is expected to be small, as
all the data is loaded into the driver's memory.
That is, the result taken from the RDD should be small, because it is all loaded into the driver's memory.
@note Due to complications in the internal implementation, this method will raise
an exception if called on an RDD of `Nothing` or `Null`.
This is about the element type: an RDD of `Nothing` or `Null` raises an exception, since the source above only guards against num being 0 and does nothing about the element type.
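A short usage sketch of take (assuming sc; the data and partition count are made up):
// Hypothetical example: take scans partitions incrementally and
// returns the first `num` elements to the driver.
val nums = sc.parallelize(1 to 1000, 10)
nums.take(3)    // Array(1, 2, 3)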
first simply calls take with an argument of 1:
def first(): T = withScope {
take(1) match {
case Array(t) => t
case _ => throw new UnsupportedOperationException("empty collection")
}
}
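A minimal sketch of both branches (assuming sc; emptyRDD is used only to trigger the failure case):
// Hypothetical example: first is take(1) unwrapped.
sc.parallelize(Seq(5, 6, 7)).first()   // 5
// sc.emptyRDD[Int].first()            // throws UnsupportedOperationException("empty collection")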
takeOrdered sorts by the given ordering and then takes the first num elements.
The default is the natural ordering, ascending (smallest first).
def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
if (num == 0) {
Array.empty
} else {
val mapRDDs = mapPartitions { items =>
// Priority keeps the largest elements, so let's reverse the ordering.
val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
queue ++= util.collection.Utils.takeOrdered(items, num)(ord)
Iterator.single(queue)
}
if (mapRDDs.partitions.length == 0) {
Array.empty
} else {
mapRDDs.reduce { (queue1, queue2) =>
queue1 ++= queue2
queue1
}.toArray.sorted(ord)
}
}
}
top uses the reverse of the natural ordering, descending (largest first):
def top(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
takeOrdered(num)(ord.reverse)
}
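A usage sketch contrasting the two (assuming sc; the numbers are made up):
// Hypothetical example: takeOrdered is ascending, top is descending.
val nums = sc.parallelize(Seq(3, 1, 4, 1, 5, 9, 2, 6))
nums.takeOrdered(3)   // Array(1, 1, 2)
nums.top(3)           // Array(9, 6, 5)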
reduce is a true classic among operators.
It merges the data according to the given function.
def reduce(f: (T, T) => T): T = withScope {
val cleanF = sc.clean(f)
val reducePartition: Iterator[T] => Option[T] = iter => {
if (iter.hasNext) {
Some(iter.reduceLeft(cleanF))
} else {
None
}
}
var jobResult: Option[T] = None
val mergeResult = (index: Int, taskResult: Option[T]) => {
if (taskResult.isDefined) {
jobResult = jobResult match {
case Some(value) => Some(f(value, taskResult.get))
case None => taskResult
}
}
}
sc.runJob(this, reducePartition, mergeResult)
// Get the final result out of our Option, or throw an exception if the RDD was empty
jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
}
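A minimal usage sketch (assuming sc):
// Hypothetical example: reduce folds within each partition first,
// then merges the per-partition results on the driver via mergeResult.
val nums = sc.parallelize(1 to 10)
nums.reduce(_ + _)    // 55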
With reduce, the following two operators are easy to implement:
def max()(implicit ord: Ordering[T]): T = withScope {
this.reduce(ord.max)
}
ord.max is just a wrapper around the comparison:
/** Return `x` if `x` >= `y`, otherwise `y`. */
def max(x: T, y: T): T = if (gteq(x, y)) x else y
def min()(implicit ord: Ordering[T]): T = withScope {
this.reduce(ord.min)
}
ord.min likewise wraps the comparison:
/** Return `x` if `x` <= `y`, otherwise `y`. */
def min(x: T, y: T): T = if (lteq(x, y)) x else y
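A quick sketch of both (assuming sc; the numbers are made up):
// Hypothetical example: max and min are just reduce with ord.max / ord.min.
val nums = sc.parallelize(Seq(7, 2, 9, 4))
nums.max()   // 9
nums.min()   // 2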
Having jumped ahead to reduce, map of course cannot be left out.
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
Finally there is flatMap, the operator that shows up in the classic wordcount example.
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}
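Since wordcount was just mentioned, here is the classic sketch showing flatMap, map, and reduceByKey working together (assuming sc; the input lines are made up rather than read from a file):
// Hypothetical word count example.
val lines = sc.parallelize(Seq("hello spark", "hello world"))
val counts = lines
  .flatMap(line => line.split(" "))   // one line -> many words
  .map(word => (word, 1))             // word -> (word, 1)
  .reduceByKey(_ + _)                 // sum the 1s per word
counts.collect()
// e.g. Array(("hello", 2), ("spark", 1), ("world", 1))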
This was only a quick look at a handful of operators, but it is easy to notice that the transformation operators mentioned here all seem to end up calling the same-named method on iter, while the action operators all end up calling runJob. So what is iter, and what is runJob? Those questions are left for a later post.