Spark Source Code Study: Common RDD Operators (1)

Table of Contents

  • Preface
  • distinct
  • groupBy
  • take
    • take()
    • first()
  • takeOrdered
    • takeOrdered()
    • top()
  • reduce
    • reduce()
    • max()
    • min()
  • map
  • flatMap
  • Summary

Preface

This post mainly explores how the common RDD operators call one another.

distinct

This operator has two overloads: the no-argument version fills in the current partition count and then calls the parameterized one.

  def distinct(): RDD[T] = withScope {
    distinct(partitions.length)
  }

The implementation is fairly simple: first map each element into a pair, then reduceByKey so that identical keys are merged into one, and finally map back to recover the original elements.

  def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
  }
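
A minimal sketch of this equivalence, assuming a spark-shell session where sc is the predefined SparkContext:

  // spark-shell sketch: sc is the predefined SparkContext
  val nums = sc.parallelize(Seq(1, 2, 2, 3, 3, 3), 2)

  // Built-in distinct
  nums.distinct().collect().sorted                                                // Array(1, 2, 3)

  // Hand-written equivalent of the source implementation:
  // pair up, collapse identical keys, then map back to the element
  nums.map(x => (x, null)).reduceByKey((x, _) => x).map(_._1).collect().sorted    // Array(1, 2, 3)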

groupBy

This operator has three overloads. The first, shown below, specifies neither the number of partitions nor a partitioner:

  def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
    groupBy[K](f, defaultPartitioner(this))
  }

The second specifies the number of partitions and uses a HashPartitioner:

  def groupBy[K](
      f: T => K,
      numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
    groupBy(f, new HashPartitioner(numPartitions))
  }

The third specifies the partitioner directly and is built from map and groupByKey; in fact, the first two overloads both end up calling this one.

  def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
      : RDD[(K, Iterable[T])] = withScope {
    val cleanF = sc.clean(f)
    this.map(t => (cleanF(t), t)).groupByKey(p)
  }

The source carries a note for this operator:

@note This operation may be very expensive. If you are grouping in order to perform an
aggregation (such as a sum or average) over each key, using PairRDDFunctions.aggregateByKey
or PairRDDFunctions.reduceByKey will provide much better performance.
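
To make the note concrete, here is a hedged spark-shell sketch (sc predefined) contrasting grouping-then-summing with reduceByKey; both give the same result, but reduceByKey combines partial sums on the map side instead of shuffling whole groups:

  val scores = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

  // groupBy shuffles and buffers every value of a key before summing
  scores.groupBy(_._1).mapValues(_.map(_._2).sum).collect()    // Array((a,3), (b,3)), order may vary

  // reduceByKey aggregates within each partition first, then shuffles the partial sums
  scores.reduceByKey(_ + _).collect()                          // Array((a,3), (b,3)), order may vary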

take

take has only one implementation, but several other operators call it, so they are examined together here.

take()

First, the source of take; most of the complexity lies in computing the scan factors, i.e. how many partitions to try in the next iteration.

def take(num: Int): Array[T] = withScope {
    val scaleUpFactor = Math.max(conf.getInt("spark.rdd.limit.scaleUpFactor", 4), 2)
    if (num == 0) {
      new Array[T](0)
    } else {
      val buf = new ArrayBuffer[T]
      val totalParts = this.partitions.length
      var partsScanned = 0
      while (buf.size < num && partsScanned < totalParts) {
        // The number of partitions to try in this iteration. It is ok for this number to be
        // greater than totalParts because we actually cap it at totalParts in runJob.
        var numPartsToTry = 1L
        if (partsScanned > 0) {
          // If we didn't find any rows after the previous iteration, quadruple and retry.
          // Otherwise, interpolate the number of partitions we need to try, but overestimate
          // it by 50%. We also cap the estimation in the end.
          if (buf.isEmpty) {
            numPartsToTry = partsScanned * scaleUpFactor
          } else {
            // the left side of max is >=1 whenever partsScanned >= 2
            numPartsToTry = Math.max((1.5 * num * partsScanned / buf.size).toInt - partsScanned, 1)
            numPartsToTry = Math.min(numPartsToTry, partsScanned * scaleUpFactor)
          }
        }

        val left = num - buf.size
        val p = partsScanned.until(math.min(partsScanned + numPartsToTry, totalParts).toInt)
        val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p)

        res.foreach(buf ++= _.take(num - buf.size))
        partsScanned += p.size
      }

      buf.toArray
    }
  }

Two notes in the source deserve attention:

  @note This method should only be used if the resulting array is expected to be small, as
  all the data is loaded into the driver's memory.
  In other words, the result should be kept small, because it is collected into the driver's memory.

  @note Due to complications in the internal implementation, this method will raise
  an exception if called on an RDD of `Nothing` or `Null`.
  That is, an RDD whose element type is `Nothing` or `Null` raises an exception; the source
  itself only guards against num == 0, not against these element types.
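
A small spark-shell sketch of take (sc predefined); note that both results end up as local arrays on the driver:

  val data = sc.parallelize(1 to 100, 10)
  data.take(5)    // Array(1, 2, 3, 4, 5): at most num elements, pulled to the driver
  data.take(0)    // Array(): the num == 0 branch returns early without running a job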

first()

It simply calls take() with an argument of 1 and unpacks the result:

  def first(): T = withScope {
    take(1) match {
      case Array(t) => t
      case _ => throw new UnsupportedOperationException("empty collection")
    }
  }
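
A quick sketch of both branches of the match (spark-shell, sc predefined):

  sc.parallelize(Seq(42)).first()             // 42, i.e. take(1) unpacked
  // On an empty RDD, take(1) returns Array(), so the catch-all case throws:
  // sc.parallelize(Seq.empty[Int]).first()   // UnsupportedOperationException: empty collection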

takeOrdered

This is essentially take applied after sorting by the specified ordering.

takeOrdered()

By default it uses the natural ordering, ascending (smallest first):

def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
    if (num == 0) {
      Array.empty
    } else {
      val mapRDDs = mapPartitions { items =>
        // Priority keeps the largest elements, so let's reverse the ordering.
        val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
        queue ++= util.collection.Utils.takeOrdered(items, num)(ord)
        Iterator.single(queue)
      }
      if (mapRDDs.partitions.length == 0) {
        Array.empty
      } else {
        mapRDDs.reduce { (queue1, queue2) =>
          queue1 ++= queue2
          queue1
        }.toArray.sorted(ord)
      }
    }
  }

top()

The ordering is the reverse of the natural one, descending (largest first):

  def top(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
    takeOrdered(num)(ord.reverse)
  }
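
A combined spark-shell sketch of takeOrdered and top (sc predefined):

  val xs = sc.parallelize(Seq(5, 1, 4, 2, 3))
  xs.takeOrdered(3)                           // Array(1, 2, 3): the 3 smallest, ascending
  xs.top(3)                                   // Array(5, 4, 3): the 3 largest, descending
  xs.takeOrdered(3)(Ordering[Int].reverse)    // Array(5, 4, 3): same as top(3), which just reverses ord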

reduce

This is one of the most classic operators.

reduce()

It merges the data according to the given function:

  def reduce(f: (T, T) => T): T = withScope {
    val cleanF = sc.clean(f)
    val reducePartition: Iterator[T] => Option[T] = iter => {
      if (iter.hasNext) {
        Some(iter.reduceLeft(cleanF))
      } else {
        None
      }
    }
    var jobResult: Option[T] = None
    val mergeResult = (index: Int, taskResult: Option[T]) => {
      if (taskResult.isDefined) {
        jobResult = jobResult match {
          case Some(value) => Some(f(value, taskResult.get))
          case None => taskResult
        }
      }
    }
    sc.runJob(this, reducePartition, mergeResult)
    // Get the final result out of our Option, or throw an exception if the RDD was empty
    jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
  }
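
A minimal spark-shell sketch (sc predefined): each partition is reduced locally by reducePartition, and the partial results are folded together by mergeResult on the driver:

  val nums = sc.parallelize(1 to 10, 4)
  nums.reduce(_ + _)    // 55: four partition-local sums, merged on the driver
  // On an empty RDD, jobResult stays None and the final getOrElse throws
  // UnsupportedOperationException("empty collection")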

The following two operators are easily implemented on top of reduce.

max()

  def max()(implicit ord: Ordering[T]): T = withScope {
    this.reduce(ord.max)
  }

ord.max is simply a wrapper around the comparison expression:

  /** Return `x` if `x` >= `y`, otherwise `y`. */
  def max(x: T, y: T): T = if (gteq(x, y)) x else y

min()

  def min()(implicit ord: Ordering[T]): T = withScope {
    this.reduce(ord.min)
  }

ord.min likewise wraps the comparison expression:

  /** Return `x` if `x` <= `y`, otherwise `y`. */
  def min(x: T, y: T): T = if (lteq(x, y)) x else y
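
A short spark-shell sketch (sc predefined) showing the delegation to reduce:

  val vals = sc.parallelize(Seq(3, 7, 1, 9))
  vals.max()                                        // 9
  vals.min()                                        // 1
  // Equivalent to reducing with the Ordering's pairwise max:
  vals.reduce((a, b) => Ordering[Int].max(a, b))    // 9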

map

Having just jumped ahead to reduce, map of course cannot be left out.

  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }
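
map is a transformation: it only wraps this RDD in a MapPartitionsRDD, and nothing executes until an action triggers runJob. A minimal spark-shell sketch (sc predefined):

  val doubled = sc.parallelize(Seq(1, 2, 3)).map(_ * 2)    // lazy: just builds a MapPartitionsRDD
  doubled.collect()                                        // Array(2, 4, 6): the action runs the job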

flatMap

Finally there is flatMap, the operator that shows up in the classic word count example.

  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
  }
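
And the classic word count itself, as a hedged spark-shell sketch (sc predefined) tying flatMap, map and reduceByKey together:

  val lines = sc.parallelize(Seq("hello spark", "hello world"))
  lines.flatMap(_.split(" "))          // one element per word
       .map((_, 1))
       .reduceByKey(_ + _)
       .collect()                      // e.g. Array((hello,2), (spark,1), (world,1)), order may vary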

Summary

This was only a quick look at a handful of operators, but a pattern already emerges: the transformation operators covered here all seem to end up calling the same-named method on iter, while the action operators all end up calling runJob. So what exactly is iter, and what is runJob? Those questions are left for a later post.
