Understanding Shuffle in Spark, with an Introduction to repartition and coalesce and Their Use Cases

1 Shuffle operations

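Certain operations in Spark trigger an event known as the shuffle. The shuffle is Spark's mechanism for redistributing data so that it is grouped differently across partitions. This typically involves copying data across executors and machines, which makes the shuffle a complex and costly operation.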

2 Background

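To understand what happens during a shuffle, consider the reduceByKey operation. reduceByKey produces a new RDD in which all values for a single key are combined into a tuple: the key, and the result of applying a reduce function to all values associated with that key. The challenge is that the values for a given key generally do not all live in the same partition, or even on the same node. Each task processes one partition, so to compute the final result for each key Spark has to read from every partition, find the values that belong to the key, and pull them together in one place. That all-to-all transfer is the shuffle.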
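The reduceByKey walk-through can be sketched with plain Scala collections. This is only a model under stated assumptions: each inner Seq stands in for one partition, `ShuffleModel` and `reduceByKeyModel` are hypothetical names, and nothing here is Spark's actual implementation (which also combines values on the map side and writes shuffle files to disk).

```scala
object ShuffleModel {
  // Each inner Seq stands in for one input partition of (key, value) records.
  type Partition[K, V] = Seq[(K, V)]

  /** Models reduceByKey: route every record to an output partition chosen by
    * the key's hash, then reduce the values of each key inside that partition. */
  def reduceByKeyModel[K, V](input: Seq[Partition[K, V]], numOutput: Int)(
      f: (V, V) => V): Seq[Map[K, V]] = {
    require(numOutput > 0, "numOutput must be positive")

    // Non-negative modulo, so negative hash codes still map to a valid index.
    def target(key: K): Int = ((key.hashCode % numOutput) + numOutput) % numOutput

    // "Shuffle": every output partition pulls its records from ALL input partitions.
    val shuffled: Seq[Seq[(K, V)]] =
      (0 until numOutput).map(p => input.flatten.filter { case (k, _) => target(k) == p })

    // Reduce side: combine all values that share a key.
    shuffled.map(_.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) })
  }

  def main(args: Array[String]): Unit = {
    // The values for "a" are spread over all three input partitions.
    val input = Seq(Seq("a" -> 1, "b" -> 2), Seq("a" -> 3, "c" -> 4), Seq("b" -> 5, "a" -> 6))
    val output = reduceByKeyModel(input, numOutput = 2)(_ + _)
    output.zipWithIndex.foreach { case (part, i) => println(s"partition $i: $part") }
  }
}
```

Which output partition a key lands in depends on its hash code, but every key ends up in exactly one output partition, which is what lets the reduce run locally afterwards.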

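Although the set of elements in each partition after a shuffle is deterministic, as is the ordering of the partitions themselves, the ordering of the elements within a partition is not. If you need predictably ordered data after a shuffle, you can use: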

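  1. mapPartitions to sort each partition using, for example, .sorted
  2. repartitionAndSortWithinPartitions to sort partitions efficiently while repartitioning at the same time
  3. sortBy to create a globally ordered RDD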
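The difference between sorting within partitions (options 1 and 2) and sorting globally (option 3) can be illustrated with a small plain-Scala model. The object and method names are hypothetical, and splitting the sorted data into equal-sized chunks is an assumption of the sketch; Spark's sortBy chooses its ranges by sampling the data.

```scala
object SortAfterShuffleModel {
  /** Options 1 and 2 in miniature: order is restored inside each partition only. */
  def sortWithinPartitions(parts: Seq[Seq[Int]]): Seq[Seq[Int]] =
    parts.map(_.sorted)

  /** Option 3 in miniature: a total order, split into contiguous ranges.
    * Equal-sized chunks are an assumption of this model. */
  def sortGlobally(parts: Seq[Seq[Int]], numOutput: Int): Seq[Seq[Int]] = {
    require(numOutput > 0, "numOutput must be positive")
    val all = parts.flatten.sorted
    val chunk = math.max(1, math.ceil(all.size.toDouble / numOutput).toInt)
    all.grouped(chunk).toList
  }

  def main(args: Array[String]): Unit = {
    val shuffled = Seq(Seq(5, 1, 9), Seq(4, 8, 2))
    println(sortWithinPartitions(shuffled))         // each partition sorted on its own
    println(sortGlobally(shuffled, numOutput = 2))  // concatenating the partitions is sorted
  }
}
```

With per-partition sorting, reading the partitions one after another still yields unordered data; only the global sort guarantees that partition i holds values no larger than those in partition i + 1.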

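Operations that can cause a shuffle include repartitioning operations such as repartition and coalesce (coalesce only when called with shuffle = true), ByKey operations (except for counting) such as groupByKey and reduceByKey, and join operations such as cogroup and join.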

3 Performance impact


A shuffle incurs additional overhead from both disk I/O and garbage collection.

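A shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected.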


4 repartition and coalesce: introduction and use cases

  • coalesce
  /**
   * Return a new RDD that is reduced into `numPartitions` partitions.
   * This results in a narrow dependency, e.g. if you go from 1000 partitions
   * to 100 partitions, there will not be a shuffle, instead each of the 100
   * new partitions will claim 10 of the current partitions. If a larger number
   * of partitions is requested, it will stay at the current number of partitions.
   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
   * this may result in your computation taking place on fewer nodes than
   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
   * you can pass shuffle = true. This will add a shuffle step, but means the
   * current upstream partitions will be executed in parallel (per whatever
   * the current partitioning is).
   * @note With shuffle = true, you can actually coalesce to a larger number
   * of partitions. This is useful if you have a small number of partitions,
   * say 100, potentially with a few partitions being abnormally large. Calling
   * coalesce(1000, shuffle = true) will result in 1000 partitions with the
   * data distributed using a hash partitioner. The optional partition coalescer
   * passed in must be serializable.
   */
  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = (new Random(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }



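Use case: merging small files. For example, suppose an RDD passes through several filter operations, so that a partition that started with 100 records ends up with only 10. With 100 such partitions, writing the result inevitably produces many small files, so we add a coalesce at the end of the pipeline to merge them.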
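The small-file scenario can be modelled with plain Scala collections to show why a coalesce without a shuffle is cheap: whole parent partitions are concatenated and no record is routed individually. `CoalesceModel` is a hypothetical name, and assigning contiguous parents to each output partition is a simplifying assumption; Spark's own coalescer also takes data locality into account.

```scala
object CoalesceModel {
  /** Narrow coalesce in miniature: each output partition is the concatenation
    * of whole parent partitions, so no record is re-routed on its own. */
  def coalesce[T](parents: Seq[Seq[T]], numPartitions: Int): Seq[Seq[T]] = {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    // Asking for more partitions than exist keeps the current count (no shuffle).
    val n = math.min(numPartitions, parents.size)
    parents.zipWithIndex
      // Assumption of this model: parent i goes to group floor(i * n / parentCount).
      .groupBy { case (_, i) => (i.toLong * n / parents.size).toInt }
      .toSeq
      .sortBy(_._1)
      .map { case (_, group) => group.sortBy(_._2).flatMap(_._1) }
  }

  def main(args: Array[String]): Unit = {
    // 100 partitions of 100 records each; the filter keeps 10 records per partition.
    val filtered = (0 until 100).map(p => (0 until 100).map(p * 100 + _).filter(_ % 10 == 0))
    val merged = coalesce(filtered, 10)
    println(s"before: ${filtered.size} partitions, after: ${merged.size} partitions")
    println(s"records per merged partition: ${merged.map(_.size).distinct}")
  }
}
```

In the model, 100 partitions of 10 records become 10 partitions of 100 records, which corresponds to 10 output files instead of 100.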

  • repartition
  /**
   * Return a new RDD that has exactly numPartitions partitions.
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
   */
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }

The source shows that repartition simply calls coalesce with shuffle = true (in coalesce itself, the parameter defaults to shuffle: Boolean = false).


Use case: handling data skew. Increasing the number of partitions reduces the amount of data each task has to process.
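How the shuffle = true path evens out skewed partitions can be sketched in plain Scala by mirroring the distributePartition logic from the source: each input partition starts at a seeded random position and hands out its elements round-robin, and the position modulo the partition count picks the output partition. `RepartitionModel` is a hypothetical name, and the model ignores everything a real shuffle involves (serialization, disk, network).

```scala
import scala.util.Random

object RepartitionModel {
  /** Models coalesce(numPartitions, shuffle = true): round-robin keys, then mod. */
  def repartition[T](parents: Seq[Seq[T]], numPartitions: Int): Seq[Seq[T]] = {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")

    // Map side: tag each element with an increasing position, starting from a
    // random offset that is seeded by the input partition index.
    val keyed: Seq[(Int, T)] = parents.zipWithIndex.flatMap { case (items, index) =>
      var position = new Random(index).nextInt(numPartitions)
      items.map { t =>
        position = position + 1
        (position, t)
      }
    }

    // Reduce side: an Int key hashes to itself, so the target is position mod n.
    val byTarget = keyed.groupBy { case (position, _) => position % numPartitions }
    (0 until numPartitions).map(p => byTarget.getOrElse(p, Seq.empty).map(_._2))
  }

  def main(args: Array[String]): Unit = {
    // One abnormally large partition next to two tiny ones.
    val skewed = Seq(Seq.range(0, 1000), Seq(1000), Seq(1001))
    println(s"before: ${skewed.map(_.size)}")
    println(s"after:  ${repartition(skewed, 4).map(_.size)}")
  }
}
```

Because the elements of the large partition are dealt out one at a time, the output sizes differ by at most a few records, regardless of how uneven the input was.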

Note: within a stage, the number of tasks equals the number of partitions.
