Spark-core Transformation Operators (9)

Transformations Operators Explained, Part 2



  In the previous post we mainly looked at the simpler transformation operators; in this one we analyze the more commonly used ones.

1. The groupBy operator

  As its name suggests, groupBy is a grouping operator, but we have to supply the grouping function ourselves. It differs from groupByKey, which groups directly by the key.
Source:

  def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
    groupBy[K](f, defaultPartitioner(this))
  }
  def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
      : RDD[(K, Iterable[T])] = withScope {
    val cleanF = sc.clean(f)
    this.map(t => (cleanF(t), t)).groupByKey(p)
  }

The source shows that groupBy is still built on groupByKey: map first tags each element with the key produced by our function, turning it into a key-value pair, and groupByKey then groups by that key, yielding a result of type RDD[(K, Iterable[T])].
Usage:

    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("BOKE")
    val sc = new SparkContext(conf)
    val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9))
    val value: RDD[(Boolean, Iterable[Int])] = rdd.groupBy((x: Int) => x % 2 == 0)
    println(value.count())
    value.mapValues((iter: Iterable[Int]) => iter.max).foreach(println)   // max within each group
    value.foreach((iter: (Boolean, Iterable[Int])) => {
      println(iter._2.toList.mkString(","))
    })

After the groupBy above we can compute per-group statistics such as max, min or count,
or simply pass the grouped result on to the next operation.

2. The groupByKey operator

  groupByKey is similar to groupBy in that it groups data, but groupByKey groups a key-value RDD directly by its keys. The return type is RDD[(K, Iterable[V])].
Source:

  def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(defaultPartitioner(self))
  }
  
  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }

Usage:

      val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("BOKE")
      val sc = new SparkContext(conf)
      val rdd: RDD[String] = sc.makeRDD(List("spark", "hello", "hadoop", "hello"))
      rdd.map((_: String, 1))
        .groupByKey()
        .map((x: (String, Iterable[Int])) => (x._1, x._2.sum))
        .foreach(println)

The example above is the classic word-count program.
groupByKey here simply groups by key: it takes no parameters and performs no intra-partition (map-side) aggregation, so as the data volume grows the shuffle may run out of memory (OOM).

3. The reduceByKey operator

  reduceByKey is a grouping-and-aggregation operator for key-value RDDs. Its biggest difference from groupByKey is that it takes an aggregation function and aggregates within each partition before the shuffle, which reduces the amount of data shuffled.
Source:

  /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.
   */
  def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {
    reduceByKey(new HashPartitioner(numPartitions), func)
  }

  /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
   * parallelism level.
   */
  def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    reduceByKey(defaultPartitioner(self), func)
  }
  /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce.
   */
  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
  }

The execution flow of reduceByKey works like this: before the shuffle, the supplied function is used to aggregate within each partition, and then again to aggregate between partitions. Note that only one function is passed in, so the intra-partition and inter-partition aggregation functions are identical. This reduces the amount of data shuffled between partitions and, accordingly, the risk of OOM.
The return type of reduceByKey is RDD[(K, V)], where K is the key we grouped on and V is the aggregated value.
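
A minimal usage sketch (assuming the same local SparkContext sc as in the earlier examples; the data is made up for illustration):

    // reduceByKey word count: the single function is used both within and between partitions
    val words = sc.makeRDD(List("spark", "hello", "hadoop", "hello"))
    words.map(w => (w, 1))       // build (word, 1) pairs
      .reduceByKey(_ + _)        // combine inside each partition first, then merge across partitions
      .foreach(println)          // e.g. (hello,2) (spark,1) (hadoop,1) -- printed order not guaranteed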


4. The aggregateByKey operator

  aggregateByKey is similar to reduceByKey: both group and aggregate. The differences are:

  1. aggregateByKey is a curried operator, which simply means it takes two parameter lists. The first parameter list holds the zero value, i.e. the initial value used for intra-partition aggregation. In reduceByKey the first value of each key effectively serves as the initial value, whereas aggregateByKey lets you define it yourself.
  2. aggregateByKey allows the intra-partition and inter-partition functions to differ: the second parameter list takes two aggregation functions.

Source:

  /**
   * Aggregate the values of each key, using given combine functions and a neutral "zero value".
   * This function can return a different result type, U, than the type of the values in this RDD,
   * V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
   * as in scala.TraversableOnce. The former operation is used for merging values within a
   * partition, and the latter is used for merging values between partitions. To avoid memory
   * allocation, both of these functions are allowed to modify and return their first argument
   * instead of creating a new U.
   */
  def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp)
  }
  /**
   * Aggregate the values of each key, using given combine functions and a neutral "zero value".
   * This function can return a different result type, U, than the type of the values in this RDD,
   * V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
   * as in scala.TraversableOnce. The former operation is used for merging values within a
   * partition, and the latter is used for merging values between partitions. To avoid memory
   * allocation, both of these functions are allowed to modify and return their first argument
   * instead of creating a new U.
   */
  def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    aggregateByKey(zeroValue, new HashPartitioner(numPartitions))(seqOp, combOp)
  }

  /**
   * Aggregate the values of each key, using given combine functions and a neutral "zero value".
   * This function can return a different result type, U, than the type of the values in this RDD,
   * V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
   * as in scala.TraversableOnce. The former operation is used for merging values within a
   * partition, and the latter is used for merging values between partitions. To avoid memory
   * allocation, both of these functions are allowed to modify and return their first argument
   * instead of creating a new U.
   */
  def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    // Serialize the zero value to a byte array so that we can get a new clone of it on each key
    val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
    val zeroArray = new Array[Byte](zeroBuffer.limit)
    zeroBuffer.get(zeroArray)

    lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
    val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))

    // We will clean the combiner closure later in `combineByKey`
    val cleanedSeqOp = self.context.clean(seqOp)
    combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
      cleanedSeqOp, combOp, partitioner)
  }

As the source shows, the first parameter list of aggregateByKey is (zeroValue: U, numPartitions: Int): zeroValue is the initial value and numPartitions is the number of partitions.
The second parameter list is (seqOp: (U, V) => U, combOp: (U, U) => U),
where seqOp is the intra-partition aggregation function and combOp is the inter-partition aggregation function.
Usage:

      val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("BOKE")
      val sc = new SparkContext(conf)

      // groupByKey
      val rdd: RDD[String] = sc.makeRDD(List("spark", "hadoop", "spark", "hadoop", "hello", "hadoop", "hello"))
      rdd.map((_: String, 1))
        .groupByKey()
        .map((x: (String, Iterable[Int])) => (x._1, x._2.sum))
        .foreach(println)
      // reduceByKey
      rdd.map((_: String, 1)).reduceByKey((_: Int) + (_: Int)).foreach(println)
      // aggregateByKey
      rdd.map((_: String, 1)).aggregateByKey(1, 2)(_ + _ * 2, _ + _).foreach(println)

output:
  (hello,6)
  (spark,6)
  (hadoop,9)

Note that the aggregateByKey result depends on how the input is partitioned: the zero value (1 here) is applied once per key in every partition that contains that key, so the same code can print different numbers under a different partitioning.
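
Because the two functions are separate, aggregateByKey can also express aggregations that reduceByKey cannot, such as taking the maximum inside each partition and summing those maxima between partitions. A small sketch (data and partition count made up for illustration):

    // Different seqOp and combOp: max inside each partition, sum of the maxima across partitions
    val pairs = sc.makeRDD(List(("a", 3), ("a", 5), ("b", 2), ("b", 7)), 2)
    pairs
      .aggregateByKey(0)((u, v) => math.max(u, v), (u1, u2) => u1 + u2)
      .foreach(println)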


5. The foldByKey operator

  foldByKey sits between aggregateByKey and reduceByKey. Like aggregateByKey it is curried: the first parameter list is the initial value, and the second parameter list is the aggregation function used both within and between partitions (since the intra-partition and inter-partition functions are the same, the second list has only one parameter).
Source:


  /**
   * Merge the values for each key using an associative function and a neutral "zero value" which
   * may be added to the result an arbitrary number of times, and must not change the result
   * (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication.).
   */
  def foldByKey(zeroValue: V, numPartitions: Int)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    foldByKey(zeroValue, new HashPartitioner(numPartitions))(func)
  }

  /**
   * Merge the values for each key using an associative function and a neutral "zero value" which
   * may be added to the result an arbitrary number of times, and must not change the result
   * (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication.).
   */
  def foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    foldByKey(zeroValue, defaultPartitioner(self))(func)
  }
  /**
   * Merge the values for each key using an associative function and a neutral "zero value" which
   * may be added to the result an arbitrary number of times, and must not change the result
   * (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication.).
   */
  def foldByKey(
      zeroValue: V,
      partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    // Serialize the zero value to a byte array so that we can get a new clone of it on each key
    val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
    val zeroArray = new Array[Byte](zeroBuffer.limit)
    zeroBuffer.get(zeroArray)

    // When deserializing, use a lazy val to create just one instance of the serializer per task
    lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
    val createZero = () => cachedSerializer.deserialize[V](ByteBuffer.wrap(zeroArray))

    val cleanedFunc = self.context.clean(func)
    combineByKeyWithClassTag[V]((v: V) => cleanedFunc(createZero(), v),
      cleanedFunc, cleanedFunc, partitioner)
  }

Usage:

      val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("BOKE")
      val sc = new SparkContext(conf)

      // groupByKey
      val rdd: RDD[String] = sc.makeRDD(List("spark", "hadoop", "spark", "hadoop", "hello", "hadoop", "hello"))
      rdd.map((_: String, 1))
        .groupByKey()
        .map((x: (String, Iterable[Int])) => (x._1, x._2.sum))
        .foreach(println)
      // reduceByKey
      rdd.map((_: String, 1)).reduceByKey((_: Int) + (_: Int)).foreach(println)
      // aggregateByKey
      rdd.map((_: String, 1)).aggregateByKey(0, 2)(_ + _ * 2, _ + _).foreach(println)
      // foldByKey
      rdd.map((_: String, 1)).foldByKey(0, 2)(_ + _).foreach(println)

output:
(hello,2)
(spark,2)
(hadoop,3)
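
As with aggregateByKey, the zero value of foldByKey is folded in once per key in every partition that contains that key, so a non-zero initial value makes the result depend on how the data is partitioned. A small illustrative sketch (data made up):

    // The zero value 10 is applied once per key per partition, so these "counts"
    // come out larger than the real word counts and vary with the partitioning.
    val wc = sc.makeRDD(List("spark", "spark", "hello"), 2)
    wc.map(w => (w, 1))
      .foldByKey(10)(_ + _)
      .foreach(println)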


6. The combineByKey operator

  combineByKey lets the first value of each key be converted into a different structure when the existing one does not fit the computation, and the intra-partition and inter-partition rules can differ. When aggregating, you specify the target data structure yourself.
Let's look at the source first:


  /**
   * Generic function to combine the elements for each key using a custom set of aggregation
   * functions. This method is here for backward compatibility. It does not provide combiner
   * classtag information to the shuffle.
   *
   * @see `combineByKeyWithClassTag`
   */
  def combineByKey[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,
      partitioner, mapSideCombine, serializer)(null)
  }

  /**
   * Simplified version of combineByKeyWithClassTag that hash-partitions the output RDD.
   * This method is here for backward compatibility. It does not provide combiner
   * classtag information to the shuffle.
   *
   * @see `combineByKeyWithClassTag`
   */
  def combineByKey[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      numPartitions: Int): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners, numPartitions)(null)
  }

From the source we can see that three parameters are required, createCombiner, mergeValue and mergeCombiners, plus an optional numPartitions (number of partitions).

  • createCombiner: V => C — takes the current value as its argument; here we can perform extra work on it (such as a type conversion) and return it (this step acts like initialization).
  • mergeValue: (C, V) => C — merges an element V into the previously built C (from createCombiner); this runs inside each partition.
  • mergeCombiners: (C, C) => C — merges two C values; this runs between partitions.

combineByKey is a fairly central operator, since the other grouping-and-aggregation operators are built on it underneath. To make it clearer, we use computing an average as the example:
      val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("BOKE")
      val sc = new SparkContext(conf)

      // groupByKey
      val rdd: RDD[String] = sc.makeRDD(List("spark", "hadoop", "spark", "hadoop", "hello", "hadoop", "hello"))
      val rdd1: RDD[(String, Double)] = sc.makeRDD(List(("Tom", 88.0), ("Tom", 90.0), ("jack", 86.0), ("jack", 85.0), ("jack", 82.0), ("Tom", 90.0)))
      rdd.map((_: String, 1))
        .groupByKey()
        .map((x: (String, Iterable[Int])) => (x._1, x._2.sum))
        .foreach(println)
      // reduceByKey
      rdd.map((_: String, 1)).reduceByKey((_: Int) + (_: Int)).foreach(println)
      // aggregateByKey
      rdd.map((_: String, 1)).aggregateByKey(0, 2)((_: Int) + (_: Int) * 2, (_: Int) + (_: Int)).foreach(println)
      // foldByKey
      rdd.map((_: String, 1)).foldByKey(0, 2)((_: Int) + (_: Int)).foreach(println)
      // combineByKey as a word count: the combiner type C is simply Int
      rdd.map((_: String, 1))
        .combineByKey((v: Int) => v, (c: Int, v: Int) => c + v, (c1: Int, c2: Int) => c1 + c2)
        .foreach(println)
      // combineByKey computing an average score: C is (count, sum)
      val value: RDD[(String, Double)] = rdd1.combineByKey(
        (score: Double) => (1, score),
        (c: (Int, Double), newValue: Double) => (c._1 + 1, c._2 + newValue),
        (c1: (Int, Double), c2: (Int, Double)) => (c1._1 + c2._1, c1._2 + c2._2)
      ).map((x: (String, (Int, Double))) => (x._1, x._2._2 / x._2._1))
      value.foreach(println)
output:
		(jack,84.33333333333333)
		(Tom,89.33333333333333)

Having covered these five grouping (and aggregation) operators, here is a summary:

  • groupByKey — grouping only. Works on key-value RDDs. It has no aggregation capability: all data is shuffled and grouped across partitions, which can produce a large shuffle and carries a higher risk of OOM.
  • reduceByKey — grouping and aggregation. Works on key-value RDDs and aggregates both within and between partitions. It takes a single main parameter (the aggregation function) and is not curried; the same function is used inside and between partitions. It aggregates within each partition before the shuffle and then merges those results between partitions, which reduces the shuffle volume. The first value of each key acts as the initial value. Its underlying implementation is combineByKey.
  • aggregateByKey — grouping and aggregation. Works on key-value RDDs and aggregates both within and between partitions. Its parameters are curried into two lists: the first holds the initial value and the number of partitions, the second holds two aggregation functions, one for within partitions and one for between partitions. Like reduceByKey it aggregates within partitions first and then between partitions; the main difference is the explicit initial value, which helps in specific aggregation scenarios.
  • foldByKey — grouping and aggregation. Works on key-value RDDs and aggregates both within and between partitions. Like aggregateByKey its parameters are curried: the first list is the initial value and the number of partitions, the second list is a single aggregation function used both within and between partitions. It sits between aggregateByKey and reduceByKey in capability.
  • combineByKey — grouping and aggregation. Works on key-value RDDs and aggregates both within and between partitions. combineByKey is the most general of the five and is not curried; it takes three main parameters, createCombiner, mergeValue and mergeCombiners. createCombiner converts the first value of each key into the structure used for aggregation (think of it as drawing the table before filling it in), mergeValue folds subsequent values into that structure within a partition, and mergeCombiners merges the per-partition results between partitions.



7. The sortBy and sortByKey operators
  We cover sortBy and sortByKey together because they are so alike: sortBy is implemented on top of sortByKey.
sortByKey, as the name says, sorts by key and is very simple to use. Let's look at the source first.
Source:


  /**
   * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
   * `collect` or `save` on the resulting RDD will return or output an ordered list of records
   * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
   * order of the keys).
   */
  // TODO: this currently doesn't work on P other than Tuple2!
  def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)] = self.withScope
  {
    val part = new RangePartitioner(numPartitions, self, ascending)
    new ShuffledRDD[K, V, V](self, part)
      .setKeyOrdering(if (ascending) ordering else ordering.reverse)
  }

  /**
   * Return this RDD sorted by the given key function.
   */
  def sortBy[K](
      f: (T) => K,
      ascending: Boolean = true,
      numPartitions: Int = this.partitions.length)
      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
    this.keyBy[K](f)
        .sortByKey(ascending, numPartitions)
        .values
  }

As the source shows, sortByKey's parameters are just the sort order and the number of partitions; it works on key-value RDDs and sorts by key. sortBy first extracts a sort key via keyBy and then calls sortByKey. Both are simple operators.
A quick usage demonstration:

    // TODO create the execution environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("create")
    val sc = new SparkContext(conf)

    val rdd: RDD[(String, Double)] = sc.makeRDD(List(("Tom", 88.0), ("Tom", 90.0), ("jack", 86.0), ("Toni", 85.0), ("jack", 82.0), ("Tom", 93.0)),2)
    rdd.sortByKey(ascending = true).foreach(println)
    rdd.sortBy(_._1,ascending = false).foreach(println)
output:
// sortByKey
(Tom,88.0)
(Toni,85.0)
(Tom,90.0)
(jack,86.0)
(Tom,93.0)
(jack,82.0)
// sortBy
(Tom,88.0)
(Tom,90.0)
(Tom,93.0)
(jack,86.0)
(jack,82.0)
(Toni,85.0)

Note that the printed order above does not look globally sorted: foreach prints each partition's records in parallel, so the console order does not necessarily reflect the sort; collect the RDD first if you want to see the global order. Now a summary of the two operators:

  1. sortBy: works on ordinary RDDs; you must supply a key-extraction function saying which value to sort by, which makes it more flexible. The sort order and the number of partitions can be specified. (See the sketch below.)
  2. sortByKey: works only on RDDs of pairs and sorts by the key, i.e. the first element of each tuple. It is less flexible than sortBy but more convenient to use; the sort order and the number of partitions can also be specified.
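
Since sortBy works on ordinary RDDs and lets you pick the sort key freely, here is a minimal sketch sorting plain integers and then sorting pairs by value instead of key (data made up); collect() is used so the printed order reflects the global sort:

    // sortBy on a non key-value RDD, and on pairs sorted by the value rather than the key
    val nums = sc.makeRDD(List(3, 1, 4, 1, 5), 2)
    nums.sortBy(x => x).collect().foreach(println)                      // 1, 1, 3, 4, 5

    val grades = sc.makeRDD(List(("Tom", 88.0), ("jack", 86.0), ("Toni", 85.0)))
    grades.sortBy(_._2, ascending = false).collect().foreach(println)  // highest score first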


8. The cogroup, join and cartesian operators

  cogroup, join and cartesian all combine RDDs; their main use is merging multiple RDDs.
  • cogroup
    cogroup can combine several RDDs at once. It merges by key: for each key there are as many CompactBuffers as there are participating RDDs; values coming from the same RDD end up in the same CompactBuffer, and values from different RDDs end up in different ones.

Source:


  /**
   * For each key k in `this` or `other1` or `other2` or `other3`,
   * return a resulting RDD that contains a tuple with the list of values
   * for that key in `this`, `other1`, `other2` and `other3`.
   */
  def cogroup[W1, W2, W3](other1: RDD[(K, W1)],
      other2: RDD[(K, W2)],
      other3: RDD[(K, W3)],
      partitioner: Partitioner)
      : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))] = self.withScope {
    if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
    val cg = new CoGroupedRDD[K](Seq(self, other1, other2, other3), partitioner)
    cg.mapValues { case Array(vs, w1s, w2s, w3s) =>
       (vs.asInstanceOf[Iterable[V]],
         w1s.asInstanceOf[Iterable[W1]],
         w2s.asInstanceOf[Iterable[W2]],
         w3s.asInstanceOf[Iterable[W3]])
    }
  }
  • join
    join only merges two RDDs. It is implemented on top of cogroup, post-processing the cogroup result. Let's look at the source first.

  /**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
   */
  def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues( pair =>
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
    )
  }

On top of the cogroup result, join iterates over the two iterables and yields their combinations:
for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w). If either iterable is empty, that key produces nothing, so join only keeps keys that exist in both RDDs and drops the rest, just like an inner join in MySQL. Similarly, leftOuterJoin and rightOuterJoin behave like MySQL's left and right outer joins.
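
For keys that exist in only one of the two RDDs, leftOuterJoin and rightOuterJoin keep that side and wrap the missing side in an Option. A minimal sketch (data made up):

    // leftOuterJoin keeps every key from the left RDD; missing right-side values become None
    val left  = sc.makeRDD(List(("Tom", 88.0), ("jack", 86.0)))
    val right = sc.makeRDD(List(("Tom", 85.0)))
    left.leftOuterJoin(right).foreach(println)
    //    (Tom,(88.0,Some(85.0)))
    //    (jack,(86.0,None))   -- printed order not guaranteed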

  • cartesian
    cartesian is the Cartesian product: no key matching is involved; every element of one RDD is paired with every element of the other. The source makes this clearest:

  /**
   * Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
   * elements (a, b) where a is in `this` and b is in `other`.
   */
  def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
    new CartesianRDD(sc, this, other)
  }


private[spark]
class CartesianRDD[T: ClassTag, U: ClassTag](
    sc: SparkContext,
    var rdd1 : RDD[T],
    var rdd2 : RDD[U])
  extends RDD[(T, U)](sc, Nil)
  with Serializable {

  val numPartitionsInRdd2 = rdd2.partitions.length

  override def getPartitions: Array[Partition] = {
    // create the cross product split
    val array = new Array[Partition](rdd1.partitions.length * rdd2.partitions.length)
    for (s1 <- rdd1.partitions; s2 <- rdd2.partitions) {
      val idx = s1.index * numPartitionsInRdd2 + s2.index
      array(idx) = new CartesianPartition(idx, rdd1, rdd2, s1.index, s2.index)
    }
    array
  }

  override def getPreferredLocations(split: Partition): Seq[String] = {
    val currSplit = split.asInstanceOf[CartesianPartition]
    (rdd1.preferredLocations(currSplit.s1) ++ rdd2.preferredLocations(currSplit.s2)).distinct
  }

  override def compute(split: Partition, context: TaskContext): Iterator[(T, U)] = {
    val currSplit = split.asInstanceOf[CartesianPartition]
    for (x <- rdd1.iterator(currSplit.s1, context);
         y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
  }

  override def getDependencies: Seq[Dependency[_]] = List(
    new NarrowDependency(rdd1) {
      def getParents(id: Int): Seq[Int] = List(id / numPartitionsInRdd2)
    },
    new NarrowDependency(rdd2) {
      def getParents(id: Int): Seq[Int] = List(id % numPartitionsInRdd2)
    }
  )

  override def clearDependencies(): Unit = {
    super.clearDependencies()
    rdd1 = null
    rdd2 = null
  }
}

The key part is:

  override def compute(split: Partition, context: TaskContext): Iterator[(T, U)] = {
    val currSplit = split.asInstanceOf[CartesianPartition]
    for (x <- rdd1.iterator(currSplit.s1, context);
         y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
  }

Usage:

    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("BOKE")
    val sc = new SparkContext(conf)

    val rdd1: RDD[(String, Double)] = sc.makeRDD(List(("Tom", 88.0), ("Tom", 90.0), ("jack", 86.0), ("Toni", 85.0), ("jack", 82.0), ("Tom", 93.0)))
    val rdd2: RDD[(String, Double)] = sc.makeRDD(List(("Tom", 85.0), ("jack", 82.0), ("Toni", 93.0)))
    val value: RDD[(String, (Iterable[Double], Iterable[Double]))] = rdd1.cogroup(rdd2)
    value.foreach(println)
            //    (Toni,(CompactBuffer(85.0),CompactBuffer(93.0)))
            //    (Tom,(CompactBuffer(88.0, 90.0, 93.0),CompactBuffer(85.0)))
            //    (jack,(CompactBuffer(86.0, 82.0),CompactBuffer(82.0)))
    val value1: RDD[(String, (Double, Double))] = rdd1.join(rdd2)
    value1.foreach(println)
            //    (jack,(86.0,82.0))
            //    (Toni,(85.0,93.0))
            //    (jack,(82.0,82.0))
            //    (Tom,(88.0,85.0))
            //    (Tom,(90.0,85.0))
            //    (Tom,(93.0,85.0))
    val value2: RDD[((String, Double), (String, Double))] = rdd1.cartesian(rdd2)
    value2.foreach(println)
          //    ((Tom,88.0),(Tom,85.0))
          //    ((Tom,88.0),(jack,82.0))
          //    ((Tom,88.0),(Toni,93.0))
          //    ((Tom,90.0),(Tom,85.0))
          //    ((Tom,90.0),(jack,82.0))
          //    ((Tom,90.0),(Toni,93.0))
          //    ((jack,86.0),(Tom,85.0))
          //    ((jack,86.0),(jack,82.0))
          //    ((jack,86.0),(Toni,93.0))


9. The pipe operator

  pipe runs the RDD through an external program; you pass in a shell command or the path of an external script. The official description is: "Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings."
This operator is fairly obscure and I have not used it myself, so my understanding is shallow; I will fill in the full source and usage once I have studied it more deeply.
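
For reference, the call shape is roughly as follows; this is only a minimal sketch, assuming a Unix-like environment where the cat command is available:

    // Each element of a partition is written to cat's stdin; each line of its stdout
    // becomes an element of the resulting RDD[String]. A script path also works here.
    val lines = sc.makeRDD(List("spark", "hadoop", "hello"))
    val piped = lines.pipe("cat")
    piped.foreach(println)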

10. The coalesce, repartition and repartitionAndSortWithinPartitions operators

  All three operators manipulate partitions. coalesce and repartition simply repartition the data, while repartitionAndSortWithinPartitions additionally sorts the records inside each partition.
Since the principles are straightforward, let's look at the source first:

coalesce:


  /**
   * Return a new RDD that is reduced into `numPartitions` partitions.
   *
   * This results in a narrow dependency, e.g. if you go from 1000 partitions
   * to 100 partitions, there will not be a shuffle, instead each of the 100
   * new partitions will claim 10 of the current partitions. If a larger number
   * of partitions is requested, it will stay at the current number of partitions.
   *
   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
   * this may result in your computation taking place on fewer nodes than
   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
   * you can pass shuffle = true. This will add a shuffle step, but means the
   * current upstream partitions will be executed in parallel (per whatever
   * the current partitioning is).
   *
   * @note With shuffle = true, you can actually coalesce to a larger number
   * of partitions. This is useful if you have a small number of partitions,
   * say 100, potentially with a few partitions being abnormally large. Calling
   * coalesce(1000, shuffle = true) will result in 1000 partitions with the
   * data distributed using a hash partitioner. The optional partition coalescer
   * passed in must be serializable.
   */
  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](
          mapPartitionsWithIndexInternal(distributePartition, isOrderSensitive = true),
          new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }

repartition:


  /**
   * Return a new RDD that has exactly numPartitions partitions.
   *
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   *
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
   */
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }

repartitionAndSortWithinPartitions:

  def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
    new ShuffledRDD[K, V, V](self, partitioner).setKeyOrdering(ordering)
  }

As the source shows, repartition is simply coalesce with shuffle = true.
Now the typical usage scenarios:

Suppose an RDD has N partitions and needs to be repartitioned into M partitions:

    N < M: the N partitions usually hold unevenly distributed data, and a HashPartitioner is used to redistribute it into M partitions. In this case shuffle must be set to true, because the repartitioning is a wide dependency and a shuffle will occur; use coalesce(shuffle = true), or simply repartition().

    N > M with N and M of similar magnitude (say N is 1000 and M is 100): several of the N partitions can be merged into each new partition, finally producing M partitions. This is a narrow dependency, so coalesce(shuffle = false) is enough.

    N > M with a huge gap between them: if shuffle is set to false, the parent and child RDDs are narrowly dependent and sit in the same stage, which may leave the Spark job with too little parallelism and hurt performance. In particular, when M is 1, set shuffle to true so that the operations before the coalesce keep a better degree of parallelism.

Summary
  If the requested number of partitions is larger than the current one and shuffle is false, the number of partitions does not change; in other words, without a shuffle you cannot increase the number of partitions of an RDD. (This part is based on: Spark学习-Coalesce()方法和rePartition()方法.)
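
A minimal sketch of the effect on partition counts (assuming the local sc from the earlier examples):

    // coalesce without shuffle can only reduce the partition count;
    // repartition (coalesce with shuffle = true) can also increase it.
    val data = sc.makeRDD(1 to 100, 8)
    println(data.coalesce(2).getNumPartitions)                   // 2  (narrow dependency, no shuffle)
    println(data.coalesce(16).getNumPartitions)                  // stays at 8: cannot grow without a shuffle
    println(data.coalesce(16, shuffle = true).getNumPartitions)  // 16
    println(data.repartition(16).getNumPartitions)               // 16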

11. The mapValues operator

  mapValues is a standard key-value operator. It is much like map: both are narrow, one-to-one, shuffle-free operators; the only difference is that mapValues applies the function to the values of a key-value RDD only. The source and usage are both very simple; the source is shown below:


  /**
   * Pass each value in the key-value pair RDD through a map function without changing the keys;
   * this also retains the original RDD's partitioning.
   */
  def mapValues[U](f: V => U): RDD[(K, U)] = self.withScope {
    val cleanF = self.context.clean(f)
    new MapPartitionsRDD[(K, U), (K, V)](self,
      (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },
      preservesPartitioning = true)
  }

In (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) } we can see that when the iterator is mapped, k is left unchanged and only v is turned into a new value by the function we pass in.
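
For completeness, a one-line usage sketch (assuming the sc from earlier examples; data made up):

    // Keys are untouched; only the values go through the function
    val grades = sc.makeRDD(List(("Tom", 88.0), ("jack", 86.0)))
    grades.mapValues(_ + 1).foreach(println)   // (Tom,89.0), (jack,87.0) -- order not guaranteed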

12. The flatMapValues operator

  flatMapValues is a key-value operator much like flatMap: it flattens the values. For each value, the function we pass in produces a series of key-value pairs that all keep the original key.
The source:


  /**
   * Pass each value in the key-value pair RDD through a flatMap function without changing the
   * keys; this also retains the original RDD's partitioning.
   */
  def flatMapValues[U](f: V => TraversableOnce[U]): RDD[(K, U)] = self.withScope {
    val cleanF = self.context.clean(f)
    new MapPartitionsRDD[(K, U), (K, V)](self,
      (context, pid, iter) => iter.flatMap { case (k, v) =>
        cleanF(v).map(x => (k, x))
      },
      preservesPartitioning = true)
  }

Usage:

    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("BOKE")
    val sc = new SparkContext(conf)
    val rdd2: RDD[(String, Int)] = sc.makeRDD(List(("Tom", 1), ("Tom", 2), ("jack", 3)))
    rdd2.flatMapValues(_ to 5).foreach(println)
          //    (jack,3)
          //    (Tom,1)
          //    (Tom,2)
          //    (Tom,2)
          //    (jack,4)
          //    (Tom,3)
          //    (Tom,3)
          //    (Tom,4)
          //    (jack,5)
          //    (Tom,5)
          //    (Tom,4)
          //    (Tom,5)


13. The keys and values operators

  keys and values simply return the keys or the values of a key-value RDD, forming a new RDD[K] or RDD[V] respectively.
The usage and source are trivial, so we won't dwell on them.
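
For quick reference, a two-line sketch (assuming the sc from earlier examples; data made up):

    // keys gives RDD[K], values gives RDD[V]
    val pairs = sc.makeRDD(List(("Tom", 1), ("jack", 3)))
    pairs.keys.foreach(println)     // Tom, jack
    pairs.values.foreach(println)   // 1, 3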



That is it for the transformations operators for now; I will keep filling this in as I learn more. Next up are the action operators. Bye for now!

