黑猴子的家: Spark RDD Action Operators

1. reduce

(1) Principle
Aggregates all elements of the RDD with the function f. The function must be commutative and associative so that it can be computed correctly in parallel.

(2) Source code

def reduce(f: (T, T) => T): T = withScope {
  val cleanF = sc.clean(f)
  val reducePartition: Iterator[T] => Option[T] = iter => {
    if (iter.hasNext) {
      Some(iter.reduceLeft(cleanF))
    } else {
      None
    }
  }
  var jobResult: Option[T] = None
  val mergeResult = (index: Int, taskResult: Option[T]) => {
    if (taskResult.isDefined) {
      jobResult = jobResult match {
        case Some(value) => Some(f(value, taskResult.get))
        case None => taskResult
      }
    }
  }
  sc.runJob(this, reducePartition, mergeResult)
  // Get the final result out of our Option, or throw an exception if the RDD was empty
  jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
}

(3) Example

scala> val rdd = sc.makeRDD(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at <console>:24

scala> rdd.reduce(_+_)
res0: Int = 55                                                                  

scala> val arrayrdd = sc.makeRDD(Array(("a",1),("a",3),("c",3),("d",5)))
arrayrdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[1] at makeRDD at <console>:24

scala> arrayrdd.reduce((x,y) => (x._1+y._1,x._2+y._2))
res1: (String, Int) = (cdaa,12)
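
Note: the sum 12 above is stable, but the concatenated string ("cdaa") can differ between runs, because reduce merges partition results in whatever order they arrive and string concatenation is not commutative. A minimal sketch of the distinction, against the same shell session:

// Commutative and associative: the result is the same for any partitioning.
val totalCount = arrayrdd.map(_._2).reduce(_ + _)          // 1 + 3 + 3 + 5 = 12

// Not commutative: the string part depends on partition merge order,
// so it may come back as "aacd", "cdaa", etc.
val merged = arrayrdd.reduce((x, y) => (x._1 + y._1, x._2 + y._2))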

2. collect

(1) Principle
Returns all elements of the dataset to the driver program as an array.

(2) Source code

def collect(): Array[T] = withScope {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}

(3) Example

scala> val rdd = sc.makeRDD(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at makeRDD at <console>:24

scala> rdd.collect
res2: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
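
Because collect materializes the entire RDD on the driver, it can exhaust driver memory on large datasets. For iterating over large results, toLocalIterator pulls one partition at a time instead; a minimal sketch:

// toLocalIterator streams the RDD back partition by partition rather than all at once.
val it: Iterator[Int] = rdd.toLocalIterator
it.take(3).foreach(println)   // prints 1, 2, 3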

3. count

(1) Principle
Returns the number of elements in the RDD.

(2) Source code

def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

(3) Example

scala> val rdd = sc.makeRDD(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at makeRDD at <console>:24

scala> rdd.count
res3: Long = 10

4. first

(1) Principle
Returns the first element of the RDD (similar to take(1)).

(2) Source code

def first(): T = withScope {
  take(1) match {
    case Array(t) => t
    case _ => throw new UnsupportedOperationException("empty collection")
  }
}

(3) Example

scala> val rdd = sc.makeRDD(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at makeRDD at <console>:24

scala> rdd.collect
res5: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> rdd.first
res4: Int = 1

5. take

(1) Principle
Returns an array consisting of the first n elements of the dataset, in partition order.

(2) Source code

def take(num: Int): Array[T] = withScope {
  val scaleUpFactor = Math.max(conf.getInt("spark.rdd.limit.scaleUpFactor", 4), 2)
  if (num == 0) {
    new Array[T](0)
  } else {
    val buf = new ArrayBuffer[T]
    val totalParts = this.partitions.length
    var partsScanned = 0
    while (buf.size < num && partsScanned < totalParts) {
      // The number of partitions to try in this iteration. It is ok for this number to be
      // greater than totalParts because we actually cap it at totalParts in runJob.
      var numPartsToTry = 1L
      if (partsScanned > 0) {
        // If we didn't find any rows after the previous iteration, quadruple and retry.
        // Otherwise, interpolate the number of partitions we need to try, but overestimate
        // it by 50%. We also cap the estimation in the end.
        if (buf.isEmpty) {
          numPartsToTry = partsScanned * scaleUpFactor
        } else {
          // the left side of max is >=1 whenever partsScanned >= 2
          numPartsToTry = Math.max((1.5 * num * partsScanned / buf.size).toInt - partsScanned, 1)
          numPartsToTry = Math.min(numPartsToTry, partsScanned * scaleUpFactor)
        }
      }

      val left = num - buf.size
      val p = partsScanned.until(math.min(partsScanned + numPartsToTry, totalParts).toInt)
      val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p)

      res.foreach(buf ++= _.take(num - buf.size))
      partsScanned += p.size
    }

    buf.toArray
  }
}

(3) Example

scala> val rdd = sc.makeRDD(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at makeRDD at <console>:24

scala> rdd.collect
res6: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> rdd.take(5)
res7: Array[Int] = Array(1, 2, 3, 4, 5)
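
Note that take(n) scans partitions incrementally and returns elements in partition order, not sorted order; takeOrdered (below) is the sorted variant. A small sketch, assuming the same spark-shell session:

// take returns the first elements it finds, in partition order.
val unordered = sc.makeRDD(Seq(10, 4, 3, 14, 6), 2)
unordered.take(2)   // likely Array(10, 4): the first two elements of the first partition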

6. takeSample

(1) Principle
Returns an array of num elements randomly sampled from the dataset, either with or without replacement (withReplacement); seed specifies the seed for the random number generator.

(2) Source code

def takeSample(
    withReplacement: Boolean,
    num: Int,
    seed: Long = Utils.random.nextLong): Array[T] = withScope {
  val numStDev = 10.0

  require(num >= 0, "Negative number of elements requested")
  require(num <= (Int.MaxValue - (numStDev * math.sqrt(Int.MaxValue)).toInt),
    "Cannot support a sample size > Int.MaxValue - " +
    s"$numStDev * math.sqrt(Int.MaxValue)")

  if (num == 0) {
    new Array[T](0)
  } else {
    val initialCount = this.count()
    if (initialCount == 0) {
      new Array[T](0)
    } else {
      val rand = new Random(seed)
      if (!withReplacement && num >= initialCount) {
        Utils.randomizeInPlace(this.collect(), rand)
      } else {
        val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
          withReplacement)
        var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()

        // If the first sample didn't turn out large enough, keep trying to take samples;
        // this shouldn't happen often because we use a big multiplier for the initial size
        var numIters = 0
        while (samples.length < num) {
          logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
          samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
          numIters += 1
        }
        Utils.randomizeInPlace(samples, rand).take(num)
      }
    }
  }
}

(3) Example

scala> val rdd = sc.makeRDD(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at makeRDD at <console>:24

scala> rdd.collect
res8: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> rdd.takeSample(true,5,3)
res9: Array[Int] = Array(3, 5, 5, 9, 7)
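
With withReplacement = true, as above, the same element can be drawn more than once (hence the two 5s). A sketch of sampling without replacement; the exact values depend on the seed and Spark version:

// withReplacement = false draws each element at most once, so there are no duplicates
// and the result size is capped at the RDD's size.
val noDup = rdd.takeSample(withReplacement = false, num = 5, seed = 3L)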

7. takeOrdered

(1) Principle
Returns the first num elements of the RDD sorted in ascending order by the implicit Ordering (the counterpart of top, which returns the largest num elements).

(2) Source code

def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
  if (num == 0) {
    Array.empty
  } else {
    val mapRDDs = mapPartitions { items =>
      // Priority keeps the largest elements, so let's reverse the ordering.
      val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
      queue ++= util.collection.Utils.takeOrdered(items, num)(ord)
      Iterator.single(queue)
    }
    if (mapRDDs.partitions.length == 0) {
      Array.empty
    } else {
      mapRDDs.reduce { (queue1, queue2) =>
        queue1 ++= queue2
        queue1
      }.toArray.sorted(ord)
    }
  }
}

(3) Example

scala> val rdd = sc.makeRDD(Seq(10,4,3,14,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at makeRDD at <console>:24

scala> rdd.top(2)
res10: Array[Int] = Array(14, 10)

scala> rdd.takeOrdered(2)
res11: Array[Int] = Array(3, 4)
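
Since takeOrdered uses the implicit Ordering, passing a reversed Ordering makes it behave like top. A minimal sketch:

// Equivalent to rdd.top(2): the two largest elements.
rdd.takeOrdered(2)(Ordering[Int].reverse)   // Array(14, 10)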

8. aggregate

(1) Principle
aggregate folds the elements of each partition with seqOp, starting from the initial value (zeroValue), and then merges the per-partition results with combOp, again starting from zeroValue. The return type does not need to match the element type of the RDD.

(2) Source code

def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U = withScope {
  // Clone the zero value since we will also be serializing it as part of tasks
  var jobResult = Utils.clone(zeroValue, sc.env.serializer.newInstance())
  val cleanSeqOp = sc.clean(seqOp)
  val cleanCombOp = sc.clean(combOp)
  val aggregatePartition = (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)
  val mergeResult = (index: Int, taskResult: U) => jobResult = combOp(jobResult, taskResult)
  sc.runJob(this, aggregatePartition, mergeResult)
  jobResult
}

(3) Example

scala> val rdd = sc.makeRDD(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at makeRDD at <console>:24

scala> rdd.aggregate(1)(
     | { (x:Int,y:Int) => x+y },
     | { (a:Int,b:Int) => a+b }
     | )
res12: Int = 58

scala> rdd.aggregate(1)(
     | { (x:Int,y:Int) => x+y },
     | { (a:Int,b:Int) => a*b }
     | )
res13: Int = 656

scala> rdd.aggregate(1)(
     | { (x:Int,y:Int) => x*y },
     | { (a:Int,b:Int) => a+b }
     | )
res14: Int = 30361
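
Why 58 rather than 56 in the first call? With two partitions, zeroValue is applied once inside each partition by seqOp and once more by combOp on the driver, so the extra 1 is counted three times. A worked sketch of that call:

// rdd = 1 to 10 in 2 partitions: (1,2,3,4,5) and (6,7,8,9,10)
// seqOp per partition, starting from zeroValue = 1:
//   partition 0: 1 + (1+2+3+4+5)  = 16
//   partition 1: 1 + (6+7+8+9+10) = 41
// combOp on the driver, starting again from zeroValue = 1:
//   1 + 16 + 41 = 58
rdd.aggregate(1)(_ + _, _ + _)   // 58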

9. fold

(1) Principle
A simplified form of aggregate in which seqOp and combOp are the same function.

(2) Source code

def fold(zeroValue: T)(op: (T, T) => T): T = withScope {
  // Clone the zero value since we will also be serializing it as part of tasks
  var jobResult = Utils.clone(zeroValue, sc.env.closureSerializer.newInstance())
  val cleanOp = sc.clean(op)
  val foldPartition = (iter: Iterator[T]) => iter.fold(zeroValue)(cleanOp)
  val mergeResult = (index: Int, taskResult: T) => jobResult = op(jobResult, taskResult)
  sc.runJob(this, foldPartition, mergeResult)
  jobResult
}

(3) Example

scala> val rdd = sc.makeRDD(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at makeRDD at <console>:24

scala> rdd.fold(1)(_+_)
res15: Int = 58

scala> rdd.aggregate(1)(
     | {(x:Int,y:Int) => x+y },
     | {(a:Int,b:Int) => a+b }
     | )
res16: Int = 58
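
As with aggregate, fold applies zeroValue once per partition and once in the driver-side merge, which is why the result is 58 (55 + 3) with two partitions. To get the plain sum, use the identity element of the operation:

// 0 is the identity for +, so the extra per-partition contributions vanish.
rdd.fold(0)(_ + _)   // 55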

10. saveAsTextFile

(1) Principle
Saves the elements of the dataset as a text file to HDFS or another supported file system. Spark calls toString on each element to convert it into a line of text in the file.

(2) Source code

def saveAsTextFile(path: String): Unit = withScope {
  // https://issues.apache.org/jira/browse/SPARK-2075
  //
  // NullWritable is a `Comparable` in Hadoop 1.+, so the compiler cannot find an implicit
  // Ordering for it and will use the default `null`. However, it's a `Comparable[NullWritable]`
  // in Hadoop 2.+, so the compiler will call the implicit `Ordering.ordered` method to create an
  // Ordering for `NullWritable`. That's why the compiler will generate different anonymous
  // classes for `saveAsTextFile` in Hadoop 1.+ and Hadoop 2.+.
  //
  // Therefore, here we provide an explicit Ordering `null` to make sure the compiler generate
  // same bytecodes for `saveAsTextFile`.
  val nullWritableClassTag = implicitly[ClassTag[NullWritable]]
  val textClassTag = implicitly[ClassTag[Text]]
  val r = this.mapPartitions { iter =>
    val text = new Text()
    iter.map { x =>
      text.set(x.toString)
      (NullWritable.get(), text)
    }
  }
  RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
    .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
}

(3) Example

scala> val rdd = sc.makeRDD(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at makeRDD at <console>:24

scala> rdd.saveAsTextFile("/opt/module/rdd")

[root@hadoop102 rdd]# cat part-0000*
1
2
3
4
5
6
7
8
9
10
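
The directory can be read back with sc.textFile, where each line becomes a String element. A minimal sketch:

// Read the text files back and restore the Int type.
val lines = sc.textFile("/opt/module/rdd")
lines.map(_.toInt).collect()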

11. saveAsSequenceFile

(1) Principle
Saves the elements of the dataset in Hadoop SequenceFile format to the given directory, which can be on HDFS or any other Hadoop-supported file system.

(2) Source code

def saveAsSequenceFile(
    path: String,
    codec: Option[Class[_ <: CompressionCodec]] = None): Unit = self.withScope {
  def anyToWritable[U <% Writable](u: U): Writable = u

  // TODO We cannot force the return type of `anyToWritable` be same as keyWritableClass and
  // valueWritableClass at the compile time. To implement that, we need to add type parameters to
  // SequenceFileRDDFunctions. however, SequenceFileRDDFunctions is a public class so it will be a
  // breaking change.
  val convertKey = self.keyClass != keyWritableClass
  val convertValue = self.valueClass != valueWritableClass

  logInfo("Saving as sequence file of type (" + keyWritableClass.getSimpleName + "," +
    valueWritableClass.getSimpleName + ")" )
  val format = classOf[SequenceFileOutputFormat[Writable, Writable]]
  val jobConf = new JobConf(self.context.hadoopConfiguration)
  if (!convertKey && !convertValue) {
    self.saveAsHadoopFile(path, keyWritableClass, valueWritableClass, format, jobConf, codec)
  } else if (!convertKey && convertValue) {
    self.map(x => (x._1, anyToWritable(x._2))).saveAsHadoopFile(
      path, keyWritableClass, valueWritableClass, format, jobConf, codec)
  } else if (convertKey && !convertValue) {
    self.map(x => (anyToWritable(x._1), x._2)).saveAsHadoopFile(
      path, keyWritableClass, valueWritableClass, format, jobConf, codec)
  } else if (convertKey && convertValue) {
    self.map(x => (anyToWritable(x._1), anyToWritable(x._2))).saveAsHadoopFile(
      path, keyWritableClass, valueWritableClass, format, jobConf, codec)
  }
}

(3) Example

scala> val rdd = sc.makeRDD(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),3)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[14] at makeRDD at <console>:24

scala> rdd.saveAsSequenceFile("/opt/module/seqFile")
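
The output can be read back with sc.sequenceFile, giving the key and value types explicitly. A minimal sketch, relying on the Writable-conversion implicits that spark-shell provides by default:

// Read the SequenceFile back as an RDD[(Int, Int)].
val pairs = sc.sequenceFile[Int, Int]("/opt/module/seqFile")
pairs.collect()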

12. saveAsObjectFile

(1) Principle
Serializes the elements of the RDD as objects and stores them in a file.

(2) Source code

def saveAsObjectFile(path: String): Unit = withScope {
  this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
    .map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))
    .saveAsSequenceFile(path)
}

(3) Example

scala> val rdd = sc.makeRDD(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)))
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[16] at makeRDD at <console>:24

scala> rdd.saveAsObjectFile("/opt/module/obj")
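
The counterpart for reading is sc.objectFile, parameterized with the element type. A minimal sketch:

// Deserialize the objects back into an RDD[(Int, Int)].
val pairs = sc.objectFile[(Int, Int)]("/opt/module/obj")
pairs.collect()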

13. countByKey

(1) Principle
For an RDD of (K, V) pairs, returns a (K, Long) map giving the number of elements for each key.

(2) Source code

def countByKey(): Map[K, Long] = self.withScope {
  self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
}

(3) Example

scala> val rdd = sc.makeRDD(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),3)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[19] at makeRDD at <console>:24

scala> rdd.countByKey
res23: scala.collection.Map[Int,Long] = Map(3 -> 2, 1 -> 3, 2 -> 1)
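
Because countByKey collects the per-key counts to the driver (note the collect().toMap in the source above), it is only suitable when the number of distinct keys is small. For large key sets, keep the counts distributed; a sketch:

// Same counts, but left as a distributed RDD[(Int, Long)] instead of a driver-side Map.
val counts = rdd.mapValues(_ => 1L).reduceByKey(_ + _)
counts.collect()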

14. foreach

(1) Principle
Runs the function func on each element of the dataset, typically for side effects such as updating an accumulator or writing to external storage.

(2) Source code

def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}

(3) Example

scala> val rdd = sc.makeRDD(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[23] at makeRDD at <console>:24

scala> val sum = sc.accumulator(0)
warning: there were two deprecation warnings; re-run with -deprecation for details
sum: org.apache.spark.Accumulator[Int] = 0

scala> rdd.foreach(sum+=_)

scala> sum.value
res27: Int = 55

scala> rdd collect
warning: there was one feature warning; re-run with -feature for details
res28: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> rdd.collect.foreach(println)
1
2
3
4
5
6
7
8
9
10
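
Two notes on the example above. First, foreach runs on the executors, so println inside foreach writes to executor logs rather than the driver console, which is why the last step collects before printing. Second, sc.accumulator is deprecated in Spark 2.x (hence the warning); a sketch with its replacement, sc.longAccumulator:

// LongAccumulator is the non-deprecated way to sum values from executors.
val acc = sc.longAccumulator("sum")
rdd.foreach(x => acc.add(x))
acc.value   // 55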
