[Spark] Spark的shuffle写过程

相关博客链接：https://blog.csdn.net/hkl15111093042/article/details/94376896
https://www.jianshu.com/p/1d714f0c5e07
简答：【Spark shuffle处于一个宽依赖，可以实现类似混洗的功能，可将相同的Key分发至同一个Reducer进行处理（即将多个节点上的同一类数据汇集到某一节点进行计算）】

Shuffle的重要性：无论是在Hadoop还是Spark中，Shuffle阶段会涉及到磁盘的读写和网络传输，因此Shuffle的性能高低直接影响整个程序的性能和吞吐量。

1.Hadoop MapReduce中的shuffle

在mapreduce中，shuffle是连接Map和Reduce之间的桥梁，Map操作的输出要到Reduce中必须经过shuffle操作。

image.png

mapreduce中的shuffle过程：
首先就是加载文件路径：
Path inPath=new Path(args[0])
FileInputFormat.addInputPath(job,inPath)
之后的过程大概如下所示：

image.png

【说明：】
partitioner：决定map的数据写到哪一个分区块中，被哪一个reduce获取；
sort：保证map的输出结果按照指定的排序方式进入到reduce中；
combiner：说明数据在map端是否需要进行一次规约(reduce)；
group：将同一个reduce中，相同的key规约到同一个组中；

【mapreduce中shuffle的调优点：】设置combiner，减少从map到reduce的数量量和io开销。

注意，另外的面试点：
？？mapper和reducer的数量由什么决定，mapper的数量不可以指定，reducer的数量可以指定

2.Spark中Shuffle的写操作

2.1 基于哈希的shuffle写操作

Spark避免了Hadoop中多余的排序（即在Reduce之前获取的数据要经过排序），提供基于哈希的Shuffle写操作。

1.【原理：】
<1> 每一个Mapper会根据Reducer的数量创建出对应数量的bucket（bucket的数量就是M*R）；
<2> Mapper生成的结果会根据设置的Partition算法填充到每个bucket中，这里的bucket是抽象的概念，在该机制中每个bucket对应一个文件，
<3>当Reduce启动时，会根据任务的编号和所依赖的Mapper的编号，从远端或者本地获取相应的bucket作为Reduce任务的输入

image.png

2.【HashShuffleManager】
博客链接：https://blog.csdn.net/weixin_39216383/article/details/81174950

image.png

这里的M*R个，相当于executor数 * 每个executor上的task数目

image.png

优化点：如果我们使用HashShuffleManager，可以设置spark.shuffle.consolidateFiles，该参数默认置为false，将其设置为true即可开启优化机制。

image.png

3.【HashShuffleManager源码流程】
见【Spark】HashShuffleManager

2.2 基于排序的shuffle写操作

1.【HashShuffle的不足】：

<1> 过程中创建的文件数是M*R，M是当前Shuffle Map Task的任务数，而R是后续任务的任务数。
如M和R都为1000，最终会产生1M个文件，对于文件系统负担非常大，同时在shuffle数据量不大而文件数量很多的情况下，随机写会降低I/O性能。
<2> 写文件时，每一个Writer Handler默认要100KB内存，虽然Shuffle写是分时运行的，其内存所需是C * F *100KB，其中C为Spark集群中运行的核数，F为后续任务数(即Reduce数目)，缓存占用内存也不小

2.【原理：】

区别：
每个Shuffle Map Task不会为后续的每个Task创建单独的文件，而是会将所有的结果写在同一个文件中，对应的生成一个Index文件进行索引。
最终临时文件的数量就是2 * reduce task num

运行机制：
a.普通运行机制 b.bypass运行机制
当shuffle read task的数量小于等于spark.shuffle.sort.bypassMergeThreshold参数的值时（默认为200），就会启用bypass机制

image.png

https://www.cnblogs.com/itboys/p/9201750.html

3.【BypassMergeSortShuffleWriter】

BypassMergeSortShuffleWriter：
未定义aggregator并且没有开启mapside combine，分区数小于spark.shuffle.sort.bypassMergeThreshold指定的分区数（默认200），则直接每个reduce分区写一个文件，然后再合并，避免为溢写文件合并而两次序列化和反序列化。
源码流程：
（1）前几步与HashShuffleWriter一致，获取rdd与dependency，并根据SparkEnv获取对应的ShuffleManager

image.png

（2）其中manager.getWriter方法的第一个参数 dep.shuffleHandle

val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
    shuffleId, _rdd.partitions.length, this)

registerShuffle方法的具体实现

 override def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
    if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
      // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
      // need map-side aggregation, then write numPartitions files directly and just concatenate
      // them at the end. This avoids doing serialization and deserialization twice to merge
      // together the spilled files, which would happen with the normal code path. The downside is
      // having multiple files open at a time and thus more memory allocated to buffers.
      new BypassMergeSortShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
      // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
      new SerializedShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else {
      // Otherwise, buffer map outputs in a deserialized form:
      new BaseShuffleHandle(shuffleId, numMaps, dependency)
    }
  }

其中使用BypassMergeSortShuffleHandle的条件：
a、未开启mapSideCombine
b、分区数小于
["spark.shuffle.sort.bypassMergeThreshold]

因为定义了map端的合并，无法避免要进行排序；
如果分区数小于线程数，就无需排序，假如大于的话也不能忽略排序。

image.png

（3）BypassMergeSortShuffleWriter的write方法
a、先根据分区数初始化partitionWriters数组

image.png

b、将数据迭代写入
key的hash值，对partition的数目取余数

image.png

写完提交关闭

image.png

c、合并文件，写DataFile和indexFile，同时更新MapStatus作为结果返回。

partitionLengths：long型数组，表示每个分区的数据长度，用于在indexFile中，表示每个分区的开始的索引，知道索引后，可以去dataFile中获取数据

4.【SortShuffleWriter】

与ByPass机制的不同：
并不会像bypass机制一样，直接写很多文件（按照每个Map Task对应分区数写对应数量的文件）再进行合并，而是采用了溢写磁盘的方式产生临时文件，并进行合并。
源码流程：
（1）前几步与bypass一致，直到调用registerShuflle方法返回ShuffleHandle实例的时候：
在不满足BypassMergeSortShuffleHandle和SerializedShuffleHandle的时候

image.png

根据BaseShuffleHandle匹配，返回ShuffleWriter为

image.png

（2）根据获取到的rdd和dependecny，是否有mapSideCombine，是否定义aggregator，来决定构建不同的ExternalSorter（第二种情况中，并不会在意每个分区中是否根据key进行排序，如果是sortByKey算子也是会在reduce端进行排序。）

image.png

ExternalSorter类的注释上有这么一段：
需要在map端combine，则使用PartitionedAppendOnlyMap写内存数据buffer；否则的话使用PartitionedPairBuffer。

image.png

（3）调用sorter的insertAll方法，将数据先写入到buffer中，有需要的话再溢写到磁盘中。
（4）insertAll方法中，

  def insertAll(records: Iterator[Product2[K, V]]): Unit = {
    // TODO: stop combining if we find that the reduction factor isn't high
    val shouldCombine = aggregator.isDefined

    if (shouldCombine) {
      // Combine values in-memory first using our AppendOnlyMap
      val mergeValue = aggregator.get.mergeValue
      val createCombiner = aggregator.get.createCombiner
      var kv: Product2[K, V] = null
      val update = (hadValue: Boolean, oldValue: C) => {
        if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
      }
      while (records.hasNext) {
        addElementsRead()
        kv = records.next()
        map.changeValue((getPartition(kv._1), kv._1), update)
        maybeSpillCollection(usingMap = true)
      }
    } else {
      // Stick values into our buffer
      while (records.hasNext) {
        addElementsRead()
        val kv = records.next()
        buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
        maybeSpillCollection(usingMap = false)
      }
    }
  }

首先根据是否定义aggregator，决定写内存数据的buffer类型，将数据迭代写入，同时会判断是否需要溢写到磁盘上maybeSpillCollection，溢写条件判断在maybeSpill方法中，
最终的条件就是当前的内存占用大于等于初始化的内存阈值[spark.shuffle.spill.initialMemoryThreshold]，默认值是5 * 1024 * 1024或者当缓存的条数达到一定的量就进行溢写

image.png

maybeSpill过程描述：每写入32条数据检查1次，向内存管理器申请执行内存（granted代表内存真正分配的），如果内存真正占用超过了新的阈值，那么就进行溢写。

（5）溢写文件调用的是ExternalSorter的spill方法

 override protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = {
    val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)
    val spillFile = spillMemoryIteratorToDisk(inMemoryIterator)
    spills += spillFile
  }

下图方法返回分区的写迭代器，会根据是否进行mapSideCombine有两种实现

image.png

a、PartitionedPairBuffer
的partitionedDestructiveSortedIterator，返回的仅仅是基于partitionId进行排序的一个迭代器
b、PartitionedAppendOnlyMap的
partitionedDestructiveSortedIterator返回的迭代器排序是基于(PartitionId,key)为key的排序迭代器

（6）写入内存buffer(或进一步溢写磁盘)后，合并DataFile及创建IndexFile

image.png

过程分析：
构建blockId，并调用writePartitionedFile进入将数据写入磁盘的动作

val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
      val partitionLengths = sorter.writePartitionedFile(blockId, tmp)

writePartitionedFile：

image.png

a、如果没有溢写文件，无需合并的步骤直接构建迭代器，根据分区按照文件块写文件。构建迭代器的时候跟spill里的方式一样
b、如果有溢写文件，其迭代器 this.partitionedIterator 会根据是否有溢写，是否需要排序分了三种情况：

  def partitionedIterator: Iterator[(Int, Iterator[Product2[K, C]])] = {
    val usingMap = aggregator.isDefined
    val collection: WritablePartitionedPairCollection[K, C] = if (usingMap) map else buffer
    if (spills.isEmpty) {
      // Special case: if we have only in-memory data, we don't need to merge streams, and perhaps
      // we don't even need to sort by anything other than partition ID
      if (!ordering.isDefined) {
        // The user hasn't requested sorted keys, so only sort by partition ID, not key
        groupByPartition(destructiveIterator(collection.partitionedDestructiveSortedIterator(None)))
      } else {
        // We do need to sort by both partition ID and key
        groupByPartition(destructiveIterator(
          collection.partitionedDestructiveSortedIterator(Some(keyComparator))))
      }
    } else {
      // Merge spilled and in-memory data
      merge(spills, destructiveIterator(
        collection.partitionedDestructiveSortedIterator(comparator)))
    }
  }

其中第三种情况，有溢写的话，需要合并溢写数据和内存数据：
定义了aggregator，要进行基于aggregator的合并；
没有定义aggregator，但是有顺序（如sortByKey），需要对数据排序但不需要合并merge；
既没有aggregator，也没有顺序，直接返回。

  private def merge(spills: Seq[SpilledFile], inMemory: Iterator[((Int, K), C)])
      : Iterator[(Int, Iterator[Product2[K, C]])] = {
    val readers = spills.map(new SpillReader(_))
    val inMemBuffered = inMemory.buffered
    (0 until numPartitions).iterator.map { p =>
      val inMemIterator = new IteratorForPartition(p, inMemBuffered)
      val iterators = readers.map(_.readNextPartition()) ++ Seq(inMemIterator)
      if (aggregator.isDefined) {
        // Perform partial aggregation across partitions
        (p, mergeWithAggregation(
          iterators, aggregator.get.mergeCombiners, keyComparator, ordering.isDefined))
      } else if (ordering.isDefined) {
        // No aggregator given, but we have an ordering (e.g. used by reduce tasks in sortByKey);
        // sort the elements without trying to merge them
        (p, mergeSort(iterators, ordering.get))
      } else {
        (p, iterators.iterator.flatten)
      }
    }
  }

最后，遍历每个MapTask的所有数据迭代器，将数据写入数据文件；接着写索引文件和构建MapStatus。

5.【UnsafeShuffleWriter】"钨丝计划"

执行过程图与SortShuffleWriter基本一致，也是先写内存buffer，再溢写到磁盘。
源码流程：
（1）前几步与bypass一致，直到调用SortShuffleManager.registerShuflle方法返回ShuffleHandle实例的时候：

image.png

在排除不是bypassMergeSortShuffleHandle的基础上，满足 1.对象需要支持序列化，2.不能定义aggregator，3.partition数目小于等于[MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE] (16777216个)
注意：不能大于该数目的原因是

image.png

根据ShuffleHandle类型，返回UnsafeShuffleWriter实现类。
（2）UnsafeShuffleWriter的write方法

  boolean success = false;
    try {
      while (records.hasNext()) {
        insertRecordIntoSorter(records.next());
      }
      closeAndWriteOutput();
      success = true;
    } finally {
      if (sorter != null) {
        try {
          sorter.cleanupResources();
        } catch (Exception e) {
          // Only throw this error if we won't be masking another
          // error.
          if (success) {
            throw e;
          } else {
            logger.error("In addition to a failure during writing, we failed during " +
                         "cleanup.", e);
          }
        }
      }
    }

其中insertRecordIntoSorter方法，
定义一个会清空的serBuffer，大小为1024*1024（就是1M），将数据写入到序列化流中并且刷新。

image.png

最后的sorter.insertRecord，该方法内会判断是否需要溢写文件：
内存中的record数目，是否大于设定的[numElementsForSpillThreshold]，由属性 spark.shuffle.spill.numElementsForceSpillThreshold指定

  public void insertRecord(Object recordBase, long recordOffset, int length, int partitionId)
    throws IOException {

    // for tests
    assert(inMemSorter != null);
    if (inMemSorter.numRecords() >= numElementsForSpillThreshold) {
      logger.info("Spilling data because number of spilledRecords crossed the threshold " +
        numElementsForSpillThreshold);
    // 需要溢写操作，调用spill函数
      spill();
    }
    // 判断是否需要将数组增大，以便接受数据，如果空间不足，则触发spill
    growPointerArrayIfNecessary();
    // Need 4 bytes to store the record length.
    final int required = length + 4;
    // 如果空间不足，就重新申请新的page，假如资源不足触发spill
    // 已经根据申请的字节数，修改了currentPage和pageCursor的值
    acquireNewPageIfNecessary(required);

    assert(currentPage != null);
    /**
    * 获取内存地址，通过一个内存地址（off-heap）或者一个对象引用的偏移（on-heap）追踪
    * 寻址方式对比：
    * on-heap：先找到对象，然后再找索引
    * off-heap：根据地址找到索引
    */
    final Object base = currentPage.getBaseObject();
    final long recordAddress = taskMemoryManager.encodePageNumberAndOffset(currentPage, pageCursor);
    Platform.putInt(base, pageCursor, length);
    pageCursor += 4;
    Platform.copyMemory(recordBase, recordOffset, base, pageCursor, length);
    pageCursor += length;
    inMemSorter.insertRecord(recordAddress, partitionId);
  }

on-heap和off-heap两种模式是由下图决定：

image.png

满足的条件为：

image.png

（3）再分析下spill方法，调用的是自身ShuffleExternalSorter的spill方法，其中进一步调用的是writeSortedFile方法。
（方法太复杂，会另写文章介绍具体过程）

（4）在writeSortedFile方法中，会先进行排序，再将数据写入磁盘
（5）不需要溢写，会先判断资源是否写一条记录，不足够时会先进行资源扩展，然后将数据写到内存里面。如果资源不够会触发溢写操作。
（6）在合并文件时，先将内存残余刷写到磁盘中，接着进行文件合并。
spill聚合文件的时候分为高效和低效两种方式，spark.shuffle.unsafe.fastMergeEnabled为true时，并且没有开启压缩或者压缩方式为snappy|LZF，可以采用非常高效的合并transferTo；否则只能使用fileStream-based。
（7）最后产生indexFile并且更新mapstatus

[Spark] Spark的shuffle写过程

1.Hadoop MapReduce中的shuffle

2.Spark中Shuffle的写操作

2.1 基于哈希的shuffle写操作

2.2 基于排序的shuffle写操作

1.【HashShuffle的不足】：

2.【原理：】

3.【BypassMergeSortShuffleWriter】

4.【SortShuffleWriter】

5.【UnsafeShuffleWriter】"钨丝计划"

你可能感兴趣的:([Spark] Spark的shuffle写过程)