A Detailed Source-Code Walkthrough of the Spark Shuffle Mechanism

1.ShuffleManager

When Spark initializes SparkEnv, the create() method instantiates the ShuffleManager:

// Let the user specify short names for shuffle managers
val shortShuffleMgrNames = Map(
  "sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName,
  "tungsten-sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName)
val shuffleMgrName = conf.get(config.SHUFFLE_MANAGER)
val shuffleMgrClass =
  shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase(Locale.ROOT), shuffleMgrName)
val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)

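Two short names are accepted, sort and tungsten-sort, and both resolve to the same class, SortShuffleManager, which is then instantiated by reflection. ShuffleManager is a trait whose core methods are: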

private[spark] trait ShuffleManager {

  /**
   * Register a shuffle and return a handle to it.
   */
  def registerShuffle[K, V, C](
      shuffleId: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle

  /** Get a writer for a given partition. Called on executors by map tasks. */
  def getWriter[K, V](
      handle: ShuffleHandle,
      mapId: Long,
      context: TaskContext,
      metrics: ShuffleWriteMetricsReporter): ShuffleWriter[K, V]

  /**
   * Get a reader for a range of reduce partitions. Called on executors by reduce tasks.
   */
  def getReader[K, C](
      handle: ShuffleHandle,
      startPartition: Int,
      endPartition: Int,
      context: TaskContext,
      metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C]

  ...
}
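The short-name lookup in SparkEnv.create() can be sketched in plain Java. This is only an illustration: the short names and target classes below are hypothetical stand-ins (JDK classes, so it runs without Spark), and it assumes a no-argument constructor, which is not necessarily how Spark's instantiateClass picks a constructor.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class ShortNameResolver {
    // Hypothetical short names; the targets are JDK classes so the sketch runs on its own.
    private static final Map<String, String> SHORT_NAMES = new HashMap<>();
    static {
        SHORT_NAMES.put("list", "java.util.ArrayList");
        SHORT_NAMES.put("linked", "java.util.LinkedList");
    }

    /** Map a short name (case-insensitively) to a class name, or pass the input through. */
    static String resolve(String configured) {
        return SHORT_NAMES.getOrDefault(configured.toLowerCase(Locale.ROOT), configured);
    }

    /** Instantiate the resolved class through its no-argument constructor (an assumption). */
    static Object instantiate(String configured) {
        try {
            return Class.forName(resolve(configured)).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + configured, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(instantiate("LIST").getClass().getName());
        System.out.println(instantiate("java.util.TreeMap").getClass().getName());
    }
}
```

A value that is not a known short name is treated as a fully qualified class name, which is what lets users plug in their own implementation.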

2.SortShuffleManager

SortShuffleManager is the only built-in implementation of ShuffleManager. It implements the three methods above as follows:

2.1 registerShuffle

/**
 * Obtains a [[ShuffleHandle]] to pass to tasks.
 */
override def registerShuffle[K, V, C](
    shuffleId: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
  // 1. First check whether the bypass merge-sort path applies
  if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
    // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
    // need map-side aggregation, then write numPartitions files directly and just concatenate
    // them at the end. This avoids doing serialization and deserialization twice to merge
    // together the spilled files, which would happen with the normal code path. The downside is
    // having multiple files open at a time and thus more memory allocated to buffers.
    new BypassMergeSortShuffleHandle[K, V](
      shuffleId, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    // 2. Otherwise check whether the serialized path can be used
  } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
    // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
    new SerializedShuffleHandle[K, V](
      shuffleId, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else {
    // Otherwise, buffer map outputs in a deserialized form:
    new BaseShuffleHandle(shuffleId, dependency)
  }
}

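1.First, check whether the shuffle qualifies for BypassMergeSort. Two conditions must hold: the shuffle dependency has no map-side aggregation, and the number of partitions is not greater than spark.shuffle.sort.bypassMergeThreshold (default 200). If both hold, a BypassMergeSortShuffleHandle is returned and the bypass merge-sort shuffle path is used: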

def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
  // We cannot bypass sorting if we need to do map-side aggregation.
  if (dep.mapSideCombine) {
    false
  } else {
    // The default value is 200
    val bypassMergeThreshold: Int = conf.get(config.SHUFFLE_SORT_BYPASS_MERGE_THRESHOLD)
    dep.partitioner.numPartitions <= bypassMergeThreshold
  }
}

2.If those conditions are not met, canUseSerializedShuffle() is checked. If all three of its conditions hold, a SerializedShuffleHandle is returned and the tungsten-sort shuffle path is used:

def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {
  val shufId = dependency.shuffleId
  val numPartitions = dependency.partitioner.numPartitions
  // The serializer must support relocation of serialized objects
  if (!dependency.serializer.supportsRelocationOfSerializedObjects) {
    log.debug(s"Can't use serialized shuffle for shuffle $shufId because the serializer, " +
      s"${dependency.serializer.getClass.getName}, does not support object relocation")
    false
    // There must be no map-side aggregation
  } else if (dependency.mapSideCombine) {
    log.debug(s"Can't use serialized shuffle for shuffle $shufId because we need to do " +
      s"map-side aggregation")
    false
    // The number of partitions must not exceed 16777215 + 1
  } else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {
    log.debug(s"Can't use serialized shuffle for shuffle $shufId because it has more than " +
      s"$MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE partitions")
    false
  } else {
    log.debug(s"Can use serialized shuffle for shuffle $shufId")
    true
  }
}

3.If neither check passes, a BaseShuffleHandle is returned and the basic sort shuffle path is used.
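The three-way decision above can be condensed into one small function. The sketch below is a standalone restatement of the checks in shouldBypassMergeSort() and canUseSerializedShuffle(); the enum, the parameter names, and the constant name are illustrative and do not exist in Spark.

```java
final class HandleSelection {
    enum Handle { BYPASS_MERGE_SORT, SERIALIZED, BASE }

    // 16777215 + 1, the serialized-mode partition limit quoted in the walkthrough.
    static final int MAX_PARTITIONS_SERIALIZED = 16777215 + 1;

    static Handle choose(boolean mapSideCombine,
                         int numPartitions,
                         boolean serializerSupportsRelocation,
                         int bypassThreshold) {
        // Check 1: no map-side aggregation and few enough partitions.
        if (!mapSideCombine && numPartitions <= bypassThreshold) {
            return Handle.BYPASS_MERGE_SORT;
        }
        // Check 2: relocatable serializer, no map-side aggregation, partition count in range.
        if (serializerSupportsRelocation && !mapSideCombine
                && numPartitions <= MAX_PARTITIONS_SERIALIZED) {
            return Handle.SERIALIZED;
        }
        // Fallback: the basic sort shuffle.
        return Handle.BASE;
    }

    public static void main(String[] args) {
        System.out.println(choose(false, 200, false, 200));
        System.out.println(choose(false, 201, true, 200));
        System.out.println(choose(true, 10, true, 200));
    }
}
```

Note that the order of the checks matters: a shuffle with few partitions takes the bypass path even when its serializer would also allow the serialized path.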

2.2 getReader

/**
 * Get a reader for a range of reduce partitions (startPartition to endPartition-1, inclusive).
 * Called on executors by reduce tasks.
 */
override def getReader[K, C](
    handle: ShuffleHandle,
    startPartition: Int,
    endPartition: Int,
    context: TaskContext,
    metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C] = {
  val blocksByAddress = SparkEnv.get.mapOutputTracker.getMapSizesByExecutorId(
    handle.shuffleId, startPartition, endPartition)
  new BlockStoreShuffleReader(
    handle.asInstanceOf[BaseShuffleHandle[K, _, C]], blocksByAddress, context, metrics,
    shouldBatchFetch = canUseBatchFetch(startPartition, endPartition, context))
}

This returns a BlockStoreShuffleReader.

2.3 getWriter

/** Get a writer for a given partition. Called on executors by map tasks. */
override def getWriter[K, V](
    handle: ShuffleHandle,
    mapId: Long,
    context: TaskContext,
    metrics: ShuffleWriteMetricsReporter): ShuffleWriter[K, V] = {
  val mapTaskIds = taskIdMapsForShuffle.computeIfAbsent(
    handle.shuffleId, _ => new OpenHashSet[Long](16))
  mapTaskIds.synchronized { mapTaskIds.add(context.taskAttemptId()) }
  val env = SparkEnv.get
  // Pick a ShuffleWriter according to the handle type
  handle match {
    case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
      new UnsafeShuffleWriter(
        env.blockManager,
        context.taskMemoryManager(),
        unsafeShuffleHandle,
        mapId,
        context,
        env.conf,
        metrics,
        shuffleExecutorComponents)
    case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
      new BypassMergeSortShuffleWriter(
        env.blockManager,
        bypassMergeSortHandle,
        mapId,
        env.conf,
        metrics,
        shuffleExecutorComponents)
    case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
      new SortShuffleWriter(
        shuffleBlockResolver, other, mapId, context, shuffleExecutorComponents)
  }
}

getWriter picks a ShuffleWriter based on the handle type: UnsafeShuffleWriter for a SerializedShuffleHandle, BypassMergeSortShuffleWriter for a BypassMergeSortShuffleHandle, and SortShuffleWriter otherwise.

3.Implementations of the three Writers

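As described above, BypassMergeSortShuffleWriter is used when the bypass conditions hold. Otherwise, UnsafeShuffleWriter is used when all three of the following hold: the serializer supports relocation, there is no map-side aggregation, and the number of partitions does not exceed 16777215+1. In all remaining cases SortShuffleWriter is used.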

3.1 BypassMergeSortShuffleWriter

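BypassMergeSortShuffleWriter extends ShuffleWriter and is implemented in Java. It writes one temporary file per partition on the map side, merges them into a single output file, and produces an index file that records where each partition starts. Its write() method: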

@Override
public void write(Iterator<Product2<K, V>> records) throws IOException {
  assert (partitionWriters == null);
  // Create a ShuffleMapOutputWriter
  ShuffleMapOutputWriter mapOutputWriter = shuffleExecutorComponents
    .createMapOutputWriter(shuffleId, mapId, numPartitions);
  try {
    // If there are no records
    if (!records.hasNext()) {
      // Get the written length of every partition
      partitionLengths = mapOutputWriter.commitAllPartitions();
      // Update mapStatus
      mapStatus = MapStatus$.MODULE$.apply(
        blockManager.shuffleServerId(), partitionLengths, mapId);
      return;
    }
    final SerializerInstance serInstance = serializer.newInstance();
    final long openStartTime = System.nanoTime();
    // Create one DiskBlockObjectWriter and one FileSegment slot per partition
    partitionWriters = new DiskBlockObjectWriter[numPartitions];
    partitionWriterSegments = new FileSegment[numPartitions];
    // For each partition
    for (int i = 0; i < numPartitions; i++) {
      // Create a temporary block
      final Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile =
        blockManager.diskBlockManager().createTempShuffleBlock();
      // Get the temp block's file and id
      final File file = tempShuffleBlockIdPlusFile._2();
      final BlockId blockId = tempShuffleBlockIdPlusFile._1();
      // Create a DiskBlockObjectWriter for this partition
      partitionWriters[i] =
        blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, writeMetrics);
    }
    // Creating the file to write to and creating a disk writer both involve interacting with
    // the disk, and can take a long time in aggregate when we open many files, so should be
    // included in the shuffle write time.
    writeMetrics.incWriteTime(System.nanoTime() - openStartTime);

    // If there are records
    while (records.hasNext()) {
      final Product2<K, V> record = records.next();
      final K key = record._1();
      // Write each record, by key, to the file of its target partition
      partitionWriters[partitioner.getPartition(key)].write(key, record._2());
    }

    for (int i = 0; i < numPartitions; i++) {
      try (DiskBlockObjectWriter writer = partitionWriters[i]) {
        // Commit
        partitionWriterSegments[i] = writer.commitAndGet();
      }
    }

    // Merge all partition files into a single file
    partitionLengths = writePartitionedData(mapOutputWriter);
    // Update mapStatus
    mapStatus = MapStatus$.MODULE$.apply(
      blockManager.shuffleServerId(), partitionLengths, mapId);
  } catch (Exception e) {
    try {
      mapOutputWriter.abort(e);
    } catch (Exception e2) {
      logger.error("Failed to abort the writer after failing to write map output.", e2);
      e.addSuppressed(e2);
    }
    throw e;
  }
}

The merge method writePartitionedData() is shown below. By default it uses zero-copy transfer to merge the files:

private long[] writePartitionedData(ShuffleMapOutputWriter mapOutputWriter) throws IOException {
  // Track location of the partition starts in the output file
  if (partitionWriters != null) {
    // Start time
    final long writeStartTime = System.nanoTime();
    try {
      for (int i = 0; i < numPartitions; i++) {
        // Get the file of each partition
        final File file = partitionWriterSegments[i].file();
        ShufflePartitionWriter writer = mapOutputWriter.getPartitionWriter(i);
        if (file.exists()) {
          // Use zero-copy transfer
          if (transferToEnabled) {
            // Using WritableByteChannelWrapper to make resource closing consistent between
            // this implementation and UnsafeShuffleWriter.
            Optional<WritableByteChannelWrapper> maybeOutputChannel = writer.openChannelWrapper();
            // This path calls Utils.copyFileStreamNIO, which ends up in
            // FileChannel.transferTo to copy the file
            if (maybeOutputChannel.isPresent()) {
              writePartitionedDataWithChannel(file, maybeOutputChannel.get());
            } else {
              writePartitionedDataWithStream(file, writer);
            }
          } else {
            // Otherwise copy through streams
            writePartitionedDataWithStream(file, writer);
          }
          if (!file.delete()) {
            logger.error("Unable to delete file for partition {}", i);
          }
        }
      }
    } finally {
      writeMetrics.incWriteTime(System.nanoTime() - writeStartTime);
    }
    partitionWriters = null;
  }
  return mapOutputWriter.commitAllPartitions();
}
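To make the zero-copy merge concrete, here is a minimal standalone sketch that concatenates per-partition files with FileChannel.transferTo and returns the per-partition lengths. It is not Spark's implementation: the class and the helper methods tempFile() and readAll() are hypothetical, and error handling is reduced to wrapping IOException.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

final class PartitionFileConcat {
    /** Append each partition file to `output` in order and return the per-partition lengths. */
    static long[] concatenate(List<Path> partitionFiles, Path output) {
        long[] lengths = new long[partitionFiles.size()];
        try (FileChannel out = FileChannel.open(output, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            for (int i = 0; i < partitionFiles.size(); i++) {
                try (FileChannel in = FileChannel.open(partitionFiles.get(i),
                        StandardOpenOption.READ)) {
                    long size = in.size();
                    long copied = 0;
                    // transferTo may move fewer bytes than requested, so loop until done.
                    while (copied < size) {
                        copied += in.transferTo(copied, size - copied, out);
                    }
                    lengths[i] = size;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lengths;
    }

    /** Hypothetical demo helper: write `content` to a fresh temp file. */
    static Path tempFile(String content) {
        try {
            Path p = Files.createTempFile("part", ".bin");
            Files.write(p, content.getBytes(StandardCharsets.UTF_8));
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Hypothetical demo helper: read a whole file as a UTF-8 string. */
    static String readAll(Path p) {
        try {
            return new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path out = tempFile("");
        long[] lengths = concatenate(Arrays.asList(tempFile("aaa"), tempFile("cc")), out);
        System.out.println(Arrays.toString(lengths) + " " + readAll(out));
    }
}
```

The returned lengths are enough to build an index: the start of partition i is the sum of the lengths of partitions 0 to i-1.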

3.2 UnsafeShuffleWriter

UnsafeShuffleWriter also extends ShuffleWriter and is implemented in Java. Its write() method:

@Override
public void write(scala.collection.Iterator<Product2<K, V>> records) throws IOException {
  // Keep track of success so we know if we encountered an exception
  // We do this rather than a standard try/catch/re-throw to handle
  // generic throwables.
  boolean success = false;
  try {
    while (records.hasNext()) {
      // Insert the record into the ShuffleExternalSorter for external sorting
      insertRecordIntoSorter(records.next());
    }
    // Merge and write the output file
    closeAndWriteOutput();
    success = true;
  } finally {
    if (sorter != null) {
      try {
        sorter.cleanupResources();
      } catch (Exception e) {
        // Only throw this error if we won't be masking another
        // error.
        if (success) {
          throw e;
        } else {
          logger.error("In addition to a failure during writing, we failed during " +
            "cleanup.", e);
        }
      }
    }
  }
}

Two methods do the main work here:

3.2.1 insertRecordIntoSorter()

@VisibleForTesting
void insertRecordIntoSorter(Product2<K, V> record) throws IOException {
  assert(sorter != null);
  // Get the key and its partition
  final K key = record._1();
  final int partitionId = partitioner.getPartition(key);
  // Reset the buffer
  serBuffer.reset();
  // Write the key and value into the buffer
  serOutputStream.writeKey(key, OBJECT_CLASS_TAG);
  serOutputStream.writeValue(record._2(), OBJECT_CLASS_TAG);
  serOutputStream.flush();

  // Size of the serialized record
  final int serializedRecordSize = serBuffer.size();
  assert (serializedRecordSize > 0);

  // Hand the serialized bytes to the ShuffleExternalSorter
  sorter.insertRecord(
    serBuffer.getBuf(), Platform.BYTE_ARRAY_OFFSET, serializedRecordSize, partitionId);
}

This method serializes the record and hands the serialized bytes to the external sorter through insertRecord(), shown below:

public void insertRecord(Object recordBase, long recordOffset, int length, int partitionId)
    throws IOException {
  // for tests
  assert(inMemSorter != null);
  // If the record count has reached the spill threshold, spill to disk right away
  if (inMemSorter.numRecords() >= numElementsForSpillThreshold) {
    logger.info("Spilling data because number of spilledRecords crossed the threshold " +
      numElementsForSpillThreshold);
    spill();
  }

  // Checks whether there is enough space to insert an additional record in to the sort pointer
  // array and grows the array if additional space is required. If the required space cannot be
  // obtained, then the in-memory data will be spilled to disk.
  growPointerArrayIfNecessary();
  final int uaoSize = UnsafeAlignedOffset.getUaoSize();
  // Need 4 or 8 bytes to store the record length.
  final int required = length + uaoSize;
  // If more memory is needed, request a new page from the TaskMemoryManager
  acquireNewPageIfNecessary(required);

  assert(currentPage != null);
  final Object base = currentPage.getBaseObject();
  // Given a memory page and offset within that page, encode this address into a 64-bit long.
  // This address will remain valid as long as the corresponding page has not been freed.
  final long recordAddress = taskMemoryManager.encodePageNumberAndOffset(currentPage, pageCursor);
  // Write the length
  UnsafeAlignedOffset.putSize(base, pageCursor, length);
  // Advance the cursor
  pageCursor += uaoSize;
  // Write the record bytes
  Platform.copyMemory(recordBase, recordOffset, base, pageCursor, length);
  // Advance the cursor
  pageCursor += length;
  // Pass the encoded logical address and the partition id to the ShuffleInMemorySorter
  inMemSorter.insertRecord(recordAddress, partitionId);
}

Buffering and spilling here do not rely on any higher-level data structure; the code manipulates memory directly.

The growPointerArrayIfNecessary() method:

/**
 * Checks whether there is enough space to insert an additional record in to the sort pointer
 * array and grows the array if additional space is required. If the required space cannot be
 * obtained, then the in-memory data will be spilled to disk.
 */
private void growPointerArrayIfNecessary() throws IOException {
  assert(inMemSorter != null);
  // If there is no room for another record
  if (!inMemSorter.hasSpaceForAnotherRecord()) {
    // Current memory usage of the pointer array
    long used = inMemSorter.getMemoryUsage();
    LongArray array;
    try {
      // could trigger spilling
      // Allocate an array with twice the current capacity
      array = allocateArray(used / 8 * 2);
    } catch (TooLargePageException e) {
      // The pointer array is too big to fix in a single page, spill.
      // (spill() is shown further down.) For reference, PackedRecordPointer defines
      // static final int MAXIMUM_PAGE_SIZE_BYTES = 1 << 27;  // 128 megabytes
      spill();
      return;
    } catch (SparkOutOfMemoryError e) {
      // should have trigger spilling
      if (!inMemSorter.hasSpaceForAnotherRecord()) {
        logger.error("Unable to grow the pointer array");
        throw e;
      }
      return;
    }
    // check if spilling is triggered or not
    if (inMemSorter.hasSpaceForAnotherRecord()) {
      // Space is available again, so growing is unnecessary; free the new array
      freeArray(array);
    } else {
      // Otherwise copy the old array into the new one
      inMemSorter.expandPointerArray(array);
    }
  }
}

The spill() method:

@Override
public long spill(long size, MemoryConsumer trigger) throws IOException {
  if (trigger != this || inMemSorter == null || inMemSorter.numRecords() == 0) {
    return 0L;
  }

  logger.info("Thread {} spilling sort data of {} to disk ({} {} so far)",
    Thread.currentThread().getId(),
    Utils.bytesToString(getMemoryUsage()),
    spills.size(),
    spills.size() > 1 ? " times" : " time");

  // Sorts the in-memory records and writes the sorted records to an on-disk file.
  // This method does not free the sort data structures.
  writeSortedFile(false);
  final long spillSize = freeMemory();
  // Reset the ShuffleInMemorySorter
  inMemSorter.reset();
  // Reset the in-memory sorter's pointer array only after freeing up the memory pages holding the
  // records. Otherwise, if the task is over allocated memory, then without freeing the memory
  // pages, we might not be able to get memory for the pointer array.
  taskContext.taskMetrics().incMemoryBytesSpilled(spillSize);
  return spillSize;
}

The writeSortedFile() method:

private void writeSortedFile(boolean isLastFile) {
  // This call performs the actual sort.
  // It returns an iterator over the sorted records
  final ShuffleInMemorySorter.ShuffleSorterIterator sortedRecords =
    inMemSorter.getSortedIterator();

  // If there are no sorted records, so we don't need to create an empty spill file.
  if (!sortedRecords.hasNext()) {
    return;
  }

  final ShuffleWriteMetricsReporter writeMetricsToUse;

  // true means this is the final output file; false means a spill file
  if (isLastFile) {
    // We're writing the final non-spill file, so we _do_ want to count this as shuffle bytes.
    writeMetricsToUse = writeMetrics;
  } else {
    // We're spilling, so bytes written should be counted towards spill rather than write.
    // Create a dummy WriteMetrics object to absorb these metrics, since we don't want to count
    // them towards shuffle bytes written.
    writeMetricsToUse = new ShuffleWriteMetrics();
  }

  // Small writes to DiskBlockObjectWriter will be fairly inefficient. Since there doesn't seem to
  // be an API to directly transfer bytes from managed memory to the disk writer, we buffer
  // data through a byte array. This array does not need to be large enough to hold a single
  // record;
  // Create a byte buffer array (1 MB)
  final byte[] writeBuffer = new byte[diskWriteBufferSize];

  // Because this output will be read during shuffle, its compression codec must be controlled by
  // spark.shuffle.compress instead of spark.shuffle.spill.compress, so we need to use
  // createTempShuffleBlock here; see SPARK-3426 for more details.
  final Tuple2<TempShuffleBlockId, File> spilledFileInfo =
    blockManager.diskBlockManager().createTempShuffleBlock();
  // Get the file and the block id
  final File file = spilledFileInfo._2();
  final TempShuffleBlockId blockId = spilledFileInfo._1();
  final SpillInfo spillInfo = new SpillInfo(numPartitions, file, blockId);

  // Unfortunately, we need a serializer instance in order to construct a DiskBlockObjectWriter.
  // Our write path doesn't actually use this serializer (since we end up calling the `write()`
  // OutputStream methods), but DiskBlockObjectWriter still calls some methods on it. To work
  // around this, we pass a dummy no-op serializer.
  final SerializerInstance ser = DummySerializerInstance.INSTANCE;

  int currentPartition = -1;
  final FileSegment committedSegment;
  try (DiskBlockObjectWriter writer =
      blockManager.getDiskWriter(blockId, file, ser, fileBufferSizeBytes, writeMetricsToUse)) {

    final int uaoSize = UnsafeAlignedOffset.getUaoSize();
    // Iterate over the sorted records
    while (sortedRecords.hasNext()) {
      sortedRecords.loadNext();
      final int partition = sortedRecords.packedRecordPointer.getPartitionId();
      assert (partition >= currentPartition);
      if (partition != currentPartition) {
        // Switch to the new partition
        // Commit the current partition and record its length first
        if (currentPartition != -1) {
          final FileSegment fileSegment = writer.commitAndGet();
          spillInfo.partitionLengths[currentPartition] = fileSegment.length();
        }
        // Then move on to the next partition
        currentPartition = partition;
      }

      // Get the pointer, then resolve the page and the offset within the page from it
      final long recordPointer = sortedRecords.packedRecordPointer.getRecordPointer();
      final Object recordPage = taskMemoryManager.getPage(recordPointer);
      final long recordOffsetInPage = taskMemoryManager.getOffsetInPage(recordPointer);
      // Number of bytes still to copy for this record
      int dataRemaining = UnsafeAlignedOffset.getSize(recordPage, recordOffsetInPage);
      long recordReadPosition = recordOffsetInPage + uaoSize; // skip over record length
      while (dataRemaining > 0) {
        final int toTransfer = Math.min(diskWriteBufferSize, dataRemaining);
        // Copy the data into the buffer array
        Platform.copyMemory(
          recordPage, recordReadPosition, writeBuffer, Platform.BYTE_ARRAY_OFFSET, toTransfer);
        // Pass the buffer on to the DiskBlockObjectWriter
        writer.write(writeBuffer, 0, toTransfer);
        // Update the read position
        recordReadPosition += toTransfer;
        // Update the remaining byte count
        dataRemaining -= toTransfer;
      }
      writer.recordWritten();
    }

    // Commit
    committedSegment = writer.commitAndGet();
  }
  // If `writeSortedFile()` was called from `closeAndGetSpills()` and no records were inserted,
  // then the file might be empty. Note that it might be better to avoid calling
  // writeSortedFile() in that case.
  // Add the file to the list of spills
  if (currentPartition != -1) {
    spillInfo.partitionLengths[currentPartition] = committedSegment.length();
    spills.add(spillInfo);
  }

  // For a spill file, update the spill metrics
  if (!isLastFile) {
    writeMetrics.incRecordsWritten(
      ((ShuffleWriteMetrics)writeMetricsToUse).recordsWritten());
    taskContext.taskMetrics().incDiskBytesSpilled(
      ((ShuffleWriteMetrics)writeMetricsToUse).bytesWritten());
  }
}

The encodePageNumberAndOffset() method:

public long encodePageNumberAndOffset(MemoryBlock page, long offsetInPage) {
  // With off-heap memory the offset is an absolute address that may need all 64 bits.
  // Because of the page size limit, subtracting the page's base address turns it into
  // a relative offset
  if (tungstenMemoryMode == MemoryMode.OFF_HEAP) {
    // In off-heap mode, an offset is an absolute address that may require a full 64 bits to
    // encode. Due to our page size limitation, though, we can convert this into an offset that's
    // relative to the page's base offset; this relative offset will fit in 51 bits.
    offsetInPage -= page.getBaseOffset();
  }
  return encodePageNumberAndOffset(page.pageNumber, offsetInPage);
}

@VisibleForTesting
public static long encodePageNumberAndOffset(int pageNumber, long offsetInPage) {
  assert (pageNumber >= 0) : "encodePageNumberAndOffset called with invalid page";
  // The upper 13 bits hold the page number and the lower 51 bits hold the offset:
  // shift the page number left by 51 bits, then OR it with the offset ANDed with
  // 0x7FFFFFFFFFFFFL, a mask whose lower 51 bits are all 1
  return (((long) pageNumber) <
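The 13-bit/51-bit layout described in the comments can be tried out in isolation. The sketch below follows that description only (page number shifted left by 51 bits, offset masked with 0x7FFFFFFFFFFFFL); the class, constant, and method names are made up for the example.

```java
final class PageAddressCodec {
    static final int OFFSET_BITS = 51;                        // low 51 bits: offset in page
    static final long OFFSET_MASK = (1L << OFFSET_BITS) - 1;  // equals 0x7FFFFFFFFFFFFL

    /** Pack a page number (upper 13 bits) and an in-page offset (lower 51 bits). */
    static long encode(int pageNumber, long offsetInPage) {
        return (((long) pageNumber) << OFFSET_BITS) | (offsetInPage & OFFSET_MASK);
    }

    /** Recover the page number; the unsigned shift keeps page numbers >= 4096 positive. */
    static int pageNumber(long address) {
        return (int) (address >>> OFFSET_BITS);
    }

    /** Recover the offset within the page. */
    static long offsetInPage(long address) {
        return address & OFFSET_MASK;
    }

    public static void main(String[] args) {
        long address = encode(5, 1024L);
        System.out.println(pageNumber(address) + " / " + offsetInPage(address));
    }
}
```

Packing both values into one long is what allows the sorter to keep a plain array of longs instead of an array of objects.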

The insertRecord() method of ShuffleInMemorySorter:

public void insertRecord(long recordPointer, int partitionId) {
  if (!hasSpaceForAnotherRecord()) {
    throw new IllegalStateException("There is no space for new record");
  }
  array.set(pos, PackedRecordPointer.packPointer(recordPointer, partitionId));
  pos++;
}

The PackedRecordPointer.packPointer() method:

public static long packPointer(long recordPointer, int partitionId) {
  assert (partitionId <= MAXIMUM_PARTITION_ID);
  // Note that without word alignment we can address 2^27 bytes = 128 megabytes per page.
  // Also note that this relies on some internals of how TaskMemoryManager encodes its addresses.
  // Shift the page number right by 24 bits and join it with the lower 27 bits, which
  // compresses the logical address to 40 bits
  final long pageNumber = (recordPointer & MASK_LONG_UPPER_13_BITS) >>> 24;
  final long compressedAddress = pageNumber | (recordPointer & MASK_LONG_LOWER_27_BITS);
  // Put the partition id in the upper 24 bits
  return (((long) partitionId) << 40) | compressedAddress;
}
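A round trip shows that nothing is lost when the 64-bit address is squeezed into 40 bits, as long as the in-page offset fits in 27 bits. This sketch mirrors the bit operations of packPointer() above; the unpacking methods and all names are written for this example and are only assumed to resemble what Spark does.

```java
final class PackedPointerSketch {
    static final long UPPER_13 = ~((1L << 51) - 1);  // bits 51..63: page number
    static final long LOWER_27 = (1L << 27) - 1;     // bits 0..26: offset within a 128 MB page

    /** [24-bit partition id][13-bit page number][27-bit offset] */
    static long pack(long recordPointer, int partitionId) {
        long pageNumber = (recordPointer & UPPER_13) >>> 24;  // page number lands in bits 27..39
        long compressed = pageNumber | (recordPointer & LOWER_27);
        return (((long) partitionId) << 40) | compressed;
    }

    /** The partition id sits in the upper 24 bits. */
    static int partitionId(long packed) {
        return (int) (packed >>> 40);
    }

    /** Rebuild the 64-bit (13 + 51 bit) address from the compressed 40-bit form. */
    static long recordPointer(long packed) {
        long pageNumber = (packed << 24) & UPPER_13;  // move bits 27..39 back to 51..63
        long offset = packed & LOWER_27;
        return pageNumber | offset;
    }

    public static void main(String[] args) {
        long address = (7L << 51) | 12345L;
        long packed = pack(address, 42);
        System.out.println(partitionId(packed) + " " + (recordPointer(packed) == address));
    }
}
```

The 24-bit partition field is also where the 16777215+1 partition limit of the serialized path comes from.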

The getSortedIterator() method:

public ShuffleSorterIterator getSortedIterator() {
  int offset = 0;
  // Sort the in-memory records by partition id with radix sort. Radix sort is much
  // faster, but it needs extra memory to be reserved when pointers are added
  if (useRadixSort) {
    offset = RadixSort.sort(
      array, pos,
      PackedRecordPointer.PARTITION_ID_START_BYTE_INDEX,
      PackedRecordPointer.PARTITION_ID_END_BYTE_INDEX, false, false);
    // Otherwise sort with TimSort
  } else {
    MemoryBlock unused = new MemoryBlock(
      array.getBaseObject(),
      array.getBaseOffset() + pos * 8L,
      (array.size() - pos) * 8L);
    LongArray buffer = new LongArray(unused);
    Sorter<PackedRecordPointer, LongArray> sorter =
      new Sorter<>(new ShuffleSortDataFormat(buffer));

    sorter.sort(array, 0, pos, SORT_COMPARATOR);
  }
  return new ShuffleSorterIterator(pos, array, offset);
}
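Because the partition id occupies the upper 24 bits of each packed pointer, sorting only has to look at that field; records are never deserialized or compared by key. A minimal sketch of the idea, using the JDK's object-array sort (a TimSort in OpenJDK) rather than Spark's own Sorter or RadixSort; all names are illustrative.

```java
import java.util.Arrays;
import java.util.Comparator;

final class PartitionSortSketch {
    /** Build a packed pointer: 24-bit partition id above a 40-bit compressed address. */
    static long pack(int partitionId, long compressedAddress) {
        return (((long) partitionId) << 40) | (compressedAddress & ((1L << 40) - 1));
    }

    static int partitionId(long packed) {
        return (int) (packed >>> 40);
    }

    /** Sort packed pointers by partition id only; ties keep their insertion order. */
    static long[] sortByPartition(long[] packed) {
        Long[] boxed = Arrays.stream(packed).boxed().toArray(Long[]::new);
        Arrays.sort(boxed, Comparator.comparingInt((Long p) -> partitionId(p)));
        long[] sorted = new long[boxed.length];
        for (int i = 0; i < boxed.length; i++) {
            sorted[i] = boxed[i];
        }
        return sorted;
    }

    public static void main(String[] args) {
        long[] sorted = sortByPartition(new long[] {pack(2, 100), pack(0, 200), pack(1, 7)});
        for (long p : sorted) {
            System.out.println(partitionId(p));
        }
    }
}
```

Boxing the longs is only a convenience for the example; avoiding exactly that kind of object overhead is the point of operating on a raw long array.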

3.2.2 closeAndWriteOutput()

@VisibleForTesting
void closeAndWriteOutput() throws IOException {
  assert(sorter != null);
  updatePeakMemoryUsed();
  serBuffer = null;
  serOutputStream = null;
  // Get the spill files
  final SpillInfo[] spills = sorter.closeAndGetSpills();
  sorter = null;
  final long[] partitionLengths;
  try {
    // Merge the spill files
    partitionLengths = mergeSpills(spills);
  } finally {
    // Delete the spill files
    for (SpillInfo spill : spills) {
      if (spill.file.exists() && !spill.file.delete()) {
        logger.error("Error while deleting spill file {}", spill.file.getPath());
      }
    }
  }
  // Update mapStatus
  mapStatus = MapStatus$.MODULE$.apply(
    blockManager.shuffleServerId(), partitionLengths, mapId);
}

The mergeSpills() method:

private long[] mergeSpills(SpillInfo[] spills) throws IOException {
  long[] partitionLengths;
  // With no spill files, create an empty output
  if (spills.length == 0) {
    final ShuffleMapOutputWriter mapWriter = shuffleExecutorComponents
        .createMapOutputWriter(shuffleId, mapId, partitioner.numPartitions());
    return mapWriter.commitAllPartitions();
    // With exactly one spill file, use it as the output
  } else if (spills.length == 1) {
    Optional<SingleSpillShuffleMapOutputWriter> maybeSingleFileWriter =
        shuffleExecutorComponents.createSingleFileMapOutputWriter(shuffleId, mapId);
    if (maybeSingleFileWriter.isPresent()) {
      // Here, we don't need to perform any metrics updates because the bytes written to this
      // output file would have already been counted as shuffle bytes written.
      partitionLengths = spills[0].partitionLengths;
      maybeSingleFileWriter.get().transferMapSpillFile(spills[0].file, partitionLengths);
    } else {
      partitionLengths = mergeSpillsUsingStandardWriter(spills);
    }
    // With several spill files, merge them into the output, using either NIO or BIO
  } else {
    partitionLengths = mergeSpillsUsingStandardWriter(spills);
  }
  return partitionLengths;
}
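Since every spill file is already grouped by partition, and this path never sorts by key, merging several spills can be pictured as a partition-by-partition concatenation. The sketch below shows only that idea, with in-memory byte arrays standing in for spill files; it is an assumption about the general shape of the multi-spill merge, not a transcription of mergeSpillsUsingStandardWriter().

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

final class SpillMergeSketch {
    /**
     * spills[s][p] holds the bytes that spill s wrote for partition p (a hypothetical
     * in-memory stand-in for the per-partition segments of a spill file).
     * Returns the length of each partition in the merged output.
     */
    static long[] merge(byte[][][] spills, int numPartitions, ByteArrayOutputStream out) {
        long[] partitionLengths = new long[numPartitions];
        for (int p = 0; p < numPartitions; p++) {
            // For partition p, append its segment from every spill in turn.
            for (byte[][] spill : spills) {
                out.write(spill[p], 0, spill[p].length);
                partitionLengths[p] += spill[p].length;
            }
        }
        return partitionLengths;
    }

    public static void main(String[] args) {
        byte[][][] spills = {
            {{1, 1}, {}, {3}},
            {{1}, {2, 2}, {}}
        };
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        System.out.println(Arrays.toString(merge(spills, 3, out)));
        System.out.println(Arrays.toString(out.toByteArray()));
    }
}
```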

3.3 SortShuffleWriter

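SortShuffleWriter sorts in memory using a PartitionedAppendOnlyMap or a PartitionedPairBuffer and spills to files when the memory limit is exceeded. When it produces the final output file, it merge-sorts all earlier spill files together with the data still in memory and, when an aggregator is defined, combines elements that share a key using the user-defined function. The entry point is write():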

override def write(records: Iterator[Product2[K, V]]): Unit = {
  // Create an external sorter. With map-side pre-aggregation, pass the aggregator and
  // keyOrdering; otherwise neither is needed
  sorter = if (dep.mapSideCombine) {
    new ExternalSorter[K, V, C](
      context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
  } else {
    // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
    // care whether the keys get sorted in each partition; that will be done on the reduce side
    // if the operation being run is sortByKey.
    new ExternalSorter[K, V, V](
      context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
  }
  // Insert the records into the ExternalSorter
  sorter.insertAll(records)

  // Don't bother including the time to open the merged output file in the shuffle write time,
  // because it just opens a single file, so is typically too fast to measure accurately
  // (see SPARK-3570).
  // Create the output writer
  val mapOutputWriter = shuffleExecutorComponents.createMapOutputWriter(
    dep.shuffleId, mapId, dep.partitioner.numPartitions)
  // Write the externally sorted data to the writer
  sorter.writePartitionedMapOutput(dep.shuffleId, mapId, mapOutputWriter)
  val partitionLengths = mapOutputWriter.commitAllPartitions()
  // Update mapStatus
  mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths, mapId)
}

The insertAll() method:

def insertAll(records: Iterator[Product2[K, V]]): Unit = {
  // TODO: stop combining if we find that the reduction factor isn't high
  val shouldCombine = aggregator.isDefined

  // Is map-side aggregation needed?
  if (shouldCombine) {
    // Combine values in-memory first using our AppendOnlyMap
    // mergeValue() folds a new value into the current aggregate
    val mergeValue = aggregator.get.mergeValue
    // createCombiner() creates the initial aggregate
    val createCombiner = aggregator.get.createCombiner
    var kv: Product2[K, V] = null
    // If the key already has an aggregate, merge into it; otherwise create the initial one
    val update = (hadValue: Boolean, oldValue: C) => {
      if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
    }
    while (records.hasNext) {
      // Increment the count of records read
      addElementsRead()
      kv = records.next()
      // map is a PartitionedAppendOnlyMap: (partition, key) is the key, the aggregate the value
      map.changeValue((getPartition(kv._1), kv._1), update)
      // Spill to disk if needed
      maybeSpillCollection(usingMap = true)
    }
    // No map-side aggregation
  } else {
    // Stick values into our buffer
    while (records.hasNext) {
      addElementsRead()
      val kv = records.next()
      // buffer is a PartitionedPairBuffer: insert the partition, the key and the value
      buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
      // Spill to disk if needed
      maybeSpillCollection(usingMap = false)
    }
  }
}

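The method's main job is to decide, as records are inserted, whether map-side pre-aggregation is needed; each case stores the records in a different data structure.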
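The aggregation branch boils down to "create a combiner for a new (partition, key), otherwise merge into the existing one". The following sketch reproduces that update rule with a plain LinkedHashMap and integer sums; the hash partitioner, the key type, and every name in it are assumptions made for the example rather than Spark classes.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.LinkedHashMap;
import java.util.Map;

final class MapSideCombineSketch {
    /** Composite key (partition, key); the partitioner is a hypothetical hash partitioner. */
    static SimpleImmutableEntry<Integer, String> key(String k, int numPartitions) {
        return new SimpleImmutableEntry<>(Math.floorMod(k.hashCode(), numPartitions), k);
    }

    /** Sum values per (partition, key), mimicking createCombiner and mergeValue. */
    static Map<SimpleImmutableEntry<Integer, String>, Integer> combine(
            String[] keys, int[] values, int numPartitions) {
        Map<SimpleImmutableEntry<Integer, String>, Integer> map = new LinkedHashMap<>();
        for (int i = 0; i < keys.length; i++) {
            int v = values[i];
            // No previous aggregate: createCombiner(v) is just v here.
            // Existing aggregate: mergeValue(old, v) is old + v here.
            map.compute(key(keys[i], numPartitions),
                (ignored, old) -> old == null ? v : old + v);
        }
        return map;
    }

    public static void main(String[] args) {
        System.out.println(combine(new String[] {"a", "b", "a"}, new int[] {1, 2, 3}, 4));
    }
}
```

Keying the map by (partition, key) rather than by key alone is what later allows the contents to be sorted by partition first.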

maybeSpillCollection() calls maybeSpill() to check whether a spill is needed. If a spill happens, a fresh map or buffer is created and buffering starts over:

private def maybeSpillCollection(usingMap: Boolean): Unit = {
  var estimatedSize = 0L
  if (usingMap) {
    estimatedSize = map.estimateSize()
    // Check whether a spill is needed
    if (maybeSpill(map, estimatedSize)) {
      map = new PartitionedAppendOnlyMap[K, C]
    }
  } else {
    estimatedSize = buffer.estimateSize()
    // Check whether a spill is needed
    if (maybeSpill(buffer, estimatedSize)) {
      buffer = new PartitionedPairBuffer[K, C]
    }
  }

  if (estimatedSize > _peakMemoryUsedBytes) {
    _peakMemoryUsedBytes = estimatedSize
  }
}

protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
  var shouldSpill = false
  // Only when the number of records read is a multiple of 32 and the estimated size of
  // the map or buffer has reached the threshold (5 MB by default)
  if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
    // Claim up to double our current memory from the shuffle memory pool
    val amountToRequest = 2 * currentMemory - myMemoryThreshold
    val granted = acquireMemory(amountToRequest)
    // Raise the threshold by the granted amount
    myMemoryThreshold += granted
    // If we were granted too little memory to grow further (either tryToAcquire returned 0,
    // or we already had more memory than myMemoryThreshold), spill the current collection
    shouldSpill = currentMemory >= myMemoryThreshold
  }
  // Even if shouldSpill is false, force a spill once the number of records read exceeds
  // numElementsForceSpillThreshold (Integer.MAX_VALUE by default)
  shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
  // Actually spill
  if (shouldSpill) {
    // Increment the spill count
    _spillCount += 1
    logSpillage(currentMemory)
    // Spill the buffered collection
    spill(collection)
    _elementsRead = 0
    _memoryBytesSpilled += currentMemory
    // Release the memory
    releaseMemory()
  }
  shouldSpill
}
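The threshold logic in maybeSpill() is easier to follow with concrete numbers. Below is a small simulation under stated assumptions: a single hypothetical memory pool with a fixed number of free bytes, memory released down to the initial threshold after a spill, and the forced-spill check on the record count left out. None of the names come from Spark.

```java
final class SpillDecisionSketch {
    private final long initialThreshold;
    private long threshold;   // plays the role of myMemoryThreshold
    private long poolFree;    // bytes still available in the hypothetical shared pool
    int spillCount = 0;

    SpillDecisionSketch(long initialThreshold, long poolFree) {
        this.initialThreshold = initialThreshold;
        this.threshold = initialThreshold;
        this.poolFree = poolFree;
    }

    /** Returns true when the caller should spill its in-memory collection. */
    boolean maybeSpill(long elementsRead, long currentMemory) {
        boolean shouldSpill = false;
        // Only check every 32 records, and only once the current threshold is reached.
        if (elementsRead % 32 == 0 && currentMemory >= threshold) {
            // Try to grow to twice the current size.
            long request = 2 * currentMemory - threshold;
            long granted = Math.min(request, poolFree);
            poolFree -= granted;
            threshold += granted;
            // Granted too little to keep growing: spill.
            shouldSpill = currentMemory >= threshold;
        }
        if (shouldSpill) {
            spillCount++;
            // Assumption: after a spill, everything above the initial threshold is returned.
            poolFree += threshold - initialThreshold;
            threshold = initialThreshold;
        }
        return shouldSpill;
    }

    long threshold() { return threshold; }

    long poolFree() { return poolFree; }

    public static void main(String[] args) {
        SpillDecisionSketch s = new SpillDecisionSketch(5, 10);
        System.out.println(s.maybeSpill(32, 6));
        System.out.println(s.maybeSpill(64, 12));
        System.out.println(s.maybeSpill(96, 16));
    }
}
```

Checking only every 32 records keeps the cost of size estimation and memory requests off the per-record hot path.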

maybeSpill() calls spill() to do the actual spilling:

override protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = {
  // Sort with the given comparator and get an iterator over the sorted result
  val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)
  // Spill the iterator's data to a file on disk
  val spillFile = spillMemoryIteratorToDisk(inMemoryIterator)
  // spills is an ArrayBuffer that records every spill file
  spills += spillFile
}

The spillMemoryIteratorToDisk() method:

private[this] def spillMemoryIteratorToDisk(inMemoryIterator: WritablePartitionedIterator)
    : SpilledFile = {
  // Because these files may be read during shuffle, their compression must be controlled by
  // spark.shuffle.compress instead of spark.shuffle.spill.compress, so we need to use
  // createTempShuffleBlock here; see SPARK-3426 for more context.
  // Create a temporary block
  val (blockId, file) = diskBlockManager.createTempShuffleBlock()

  // These variables are reset after each flush
  var objectsWritten: Long = 0
  val spillMetrics: ShuffleWriteMetrics = new ShuffleWriteMetrics
  // Create the DiskBlockObjectWriter for the spill file
  val writer: DiskBlockObjectWriter =
    blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, spillMetrics)

  // List of batch sizes (bytes) in the order they are written to disk
  val batchSizes = new ArrayBuffer[Long]

  // How many elements we have in each partition
  val elementsPerPartition = new Array[Long](numPartitions)

  // Flush the disk writer's contents to disk, and update relevant variables.
  // The writer is committed at the end of this process.
  def flush(): Unit = {
    val segment = writer.commitAndGet()
    batchSizes += segment.length
    _diskBytesSpilled += segment.length
    objectsWritten = 0
  }

  var success = false
  try {
    // Iterate over the records in the map or buffer
    while (inMemoryIterator.hasNext) {
      val partitionId = inMemoryIterator.nextPartition()
      require(partitionId >= 0 && partitionId < numPartitions,
        s"partition Id: ${partitionId} should be in the range [0, ${numPartitions})")
      // Write the record and update the counters
      inMemoryIterator.writeNext(writer)
      elementsPerPartition(partitionId) += 1
      objectsWritten += 1

      // Once serializerBatchSize records (10000 by default) are written, flush the batch
      if (objectsWritten == serializerBatchSize) {
        flush()
      }
    }
    // After the loop, flush whatever is left
    if (objectsWritten > 0) {
      flush()
    } else {
      writer.revertPartialWritesAndClose()
    }
    success = true
  } finally {
    if (success) {
      writer.close()
    } else {
      // This code path only happens if an exception was thrown above before we set success;
      // close our stuff and let the exception be thrown further
      writer.revertPartialWritesAndClose()
      if (file.exists()) {
        if (!file.delete()) {
          logWarning(s"Error deleting ${file}")
        }
      }
    }
  }

  // Return the spill file
  SpilledFile(file, blockId, batchSizes.toArray, elementsPerPartition)
}

Next comes the sort-and-merge step, which calls ExternalSorter.writePartitionedMapOutput():

def writePartitionedMapOutput(
    shuffleId: Int,
    mapId: Long,
    mapOutputWriter: ShuffleMapOutputWriter): Unit = {
  var nextPartitionId = 0
  // No spill has happened
  if (spills.isEmpty) {
    // Case where we only have in-memory data
    val collection = if (aggregator.isDefined) map else buffer
    // Sort with the given comparator
    val it = collection.destructiveSortedWritablePartitionedIterator(comparator)
    while (it.hasNext()) {
      val partitionId = it.nextPartition()
      var partitionWriter: ShufflePartitionWriter = null
      var partitionPairsWriter: ShufflePartitionPairsWriter = null
      TryUtils.tryWithSafeFinally {
        partitionWriter = mapOutputWriter.getPartitionWriter(partitionId)
        val blockId = ShuffleBlockId(shuffleId, mapId, partitionId)
        partitionPairsWriter = new ShufflePartitionPairsWriter(
          partitionWriter,
          serializerManager,
          serInstance,
          blockId,
          context.taskMetrics().shuffleWriteMetrics)
        // Write out the records of the current partition one by one
        while (it.hasNext && it.nextPartition() == partitionId) {
          it.writeNext(partitionPairsWriter)
        }
      } {
        if (partitionPairsWriter != null) {
          partitionPairsWriter.close()
        }
      }
      nextPartitionId = partitionId + 1
    }
  // A spill has happened: merge-sort the spill files with the in-memory data,
  // then write the result partition by partition through ShufflePartitionPairsWriter
  } else {
    // We must perform merge-sort; get an iterator by partition and write everything directly.
    // The merge sort happens inside partitionedIterator
    for ((id, elements) <- this.partitionedIterator) {
      val blockId = ShuffleBlockId(shuffleId, mapId, id)
      var partitionWriter: ShufflePartitionWriter = null
      var partitionPairsWriter: ShufflePartitionPairsWriter = null
      TryUtils.tryWithSafeFinally {
        partitionWriter = mapOutputWriter.getPartitionWriter(id)
        partitionPairsWriter = new ShufflePartitionPairsWriter(
          partitionWriter,
          serializerManager,
          serInstance,
          blockId,
          context.taskMetrics().shuffleWriteMetrics)
        if (elements.hasNext) {
          for (elem <- elements) {
            partitionPairsWriter.write(elem._1, elem._2)
          }
        }
      } {
        if (partitionPairsWriter != null) {
          partitionPairsWriter.close()
        }
      }
      nextPartitionId = id + 1
    }
  }

  context.taskMetrics().incMemoryBytesSpilled(memoryBytesSpilled)
  context.taskMetrics().incDiskBytesSpilled(diskBytesSpilled)
  context.taskMetrics().incPeakExecutionMemory(peakMemoryUsedBytes)
}
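The in-memory branch above relies on the records already being sorted by partition ID, so each partition forms one contiguous run that can be drained with a nested loop. The following is a minimal sketch of that pattern only, not Spark's code: PartitionedWriteSketch and writeByPartition are hypothetical names, and a Vector per partition stands in for the per-partition writer.

```scala
object PartitionedWriteSketch {
  /**
   * Splits records that are already sorted by partition ID into one run per
   * partition. Partitions without records keep an empty run.
   * Hypothetical helper for illustration; not part of Spark.
   */
  def writeByPartition[K, V](
      sorted: Iterator[((Int, K), V)],
      numPartitions: Int): Array[Vector[(K, V)]] = {
    val out = Array.fill(numPartitions)(Vector.empty[(K, V)])
    val it = sorted.buffered
    while (it.hasNext) {
      // Peek at the partition ID of the next record without consuming it.
      val partitionId = it.head._1._1
      val run = Vector.newBuilder[(K, V)]
      // Drain every record that belongs to the current partition.
      while (it.hasNext && it.head._1._1 == partitionId) {
        val ((_, k), v) = it.next()
        run += ((k, v))
      }
      out(partitionId) = run.result()
    }
    out
  }
}
```

Because the input is sorted, the outer loop visits each non-empty partition exactly once, which is why the real method can open and close one partition writer per iteration.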

The partitionedIterator() method:

def partitionedIterator: Iterator[(Int, Iterator[Product2[K, C]])] = {
  val usingMap = aggregator.isDefined
  val collection: WritablePartitionedPairCollection[K, C] = if (usingMap) map else buffer
  if (spills.isEmpty) {
    // Special case: if we have only in-memory data, we don't need to merge streams, and perhaps
    // we don't even need to sort by anything other than partition ID
    // No spill and no ordering requested: sort by partition ID only
    if (ordering.isEmpty) {
      // The user hasn't requested sorted keys, so only sort by partition ID, not key
      groupByPartition(destructiveIterator(collection.partitionedDestructiveSortedIterator(None)))
    // No spill but an ordering is requested: sort by partition ID first, then by key
    } else {
      // We do need to sort by both partition ID and key
      groupByPartition(destructiveIterator(
        collection.partitionedDestructiveSortedIterator(Some(keyComparator))))
    }
  } else {
    // Merge spilled and in-memory data
    // With spills, merge-sort the spill files together with the in-memory data
    merge(spills, destructiveIterator(
      collection.partitionedDestructiveSortedIterator(comparator)))
  }
}
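The two in-memory cases differ only in the comparator applied to the (partition ID, key) pairs. Here is a rough sketch of that difference, assuming plain Scala collections in place of Spark's internal buffers; PartitionKeyOrderingSketch and its members are hypothetical names introduced for illustration.

```scala
object PartitionKeyOrderingSketch {
  /** Compares by partition ID only; keys inside a partition are not compared. */
  def partitionOnly[K]: Ordering[(Int, K)] = Ordering.by[(Int, K), Int](_._1)

  /** Compares by partition ID first and falls back to the key ordering on ties. */
  def partitionThenKey[K](keyOrd: Ordering[K]): Ordering[(Int, K)] =
    new Ordering[(Int, K)] {
      def compare(a: (Int, K), b: (Int, K)): Int = {
        val byPartition = Integer.compare(a._1, b._1)
        if (byPartition != 0) byPartition else keyOrd.compare(a._2, b._2)
      }
    }

  /** Picks the comparator the same way the Option argument does in the method above. */
  def sortRecords[K, V](
      records: Seq[((Int, K), V)],
      keyOrd: Option[Ordering[K]]): Seq[((Int, K), V)] = {
    val ord = keyOrd.map(partitionThenKey(_)).getOrElse(partitionOnly[K])
    records.sortBy(_._1)(ord)
  }
}
```

Passing None leaves the keys within each partition in their incoming order in this sketch, because sortBy on a Seq is stable; passing Some(ordering) additionally sorts the keys inside every partition.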

The merge method is as follows:

private def merge(spills: Seq[SpilledFile], inMemory: Iterator[((Int, K), C)])
    : Iterator[(Int, Iterator[Product2[K, C]])] = {
  // Open a reader for each spill file
  val readers = spills.map(new SpillReader(_))
  val inMemBuffered = inMemory.buffered
  // Iterate over the partitions
  (0 until numPartitions).iterator.map { p =>
    val inMemIterator = new IteratorForPartition(p, inMemBuffered)
    // Combine the spill-file iterators with the in-memory iterator for this partition
    val iterators = readers.map(_.readNextPartition()) ++ Seq(inMemIterator)
    // With an aggregator: aggregate within the partition, comparing keys with keyComparator
    if (aggregator.isDefined) {
      // Perform partial aggregation across partitions
      (p, mergeWithAggregation(
        iterators, aggregator.get.mergeCombiners, keyComparator, ordering.isDefined))
    // No aggregator but an ordering is defined: merge-sort by that ordering
    } else if (ordering.isDefined) {
      // No aggregator given, but we have an ordering (e.g. used by reduce tasks in sortByKey);
      // sort the elements without trying to merge them
      (p, mergeSort(iterators, ordering.get))
    // Neither aggregator nor ordering: just concatenate the iterators
    } else {
      (p, iterators.iterator.flatten)
    }
  }
}
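The ordering-only branch above merges several already-sorted iterators (one per spill file plus the in-memory one) into a single sorted stream. A common way to do this is a k-way merge over a priority queue keyed on each iterator's head. The following is a minimal sketch of that technique under that assumption, not a copy of Spark's mergeSort; MergeSortSketch and mergeSorted are hypothetical names.

```scala
import scala.collection.mutable

object MergeSortSketch {
  /** Lazily merges iterators that are each sorted by key into one sorted iterator. */
  def mergeSorted[K, C](
      iterators: Seq[Iterator[(K, C)]],
      ord: Ordering[K]): Iterator[(K, C)] = {
    type Iter = BufferedIterator[(K, C)]
    // PriorityQueue dequeues its greatest element, so the ordering is reversed
    // to make the iterator with the smallest head key come out first.
    val heap = mutable.PriorityQueue.empty[Iter](
      Ordering.by[Iter, K](_.head._1)(ord).reverse)
    // Only non-empty iterators may enter the heap, since ordering reads `head`.
    iterators.map(_.buffered).filter(_.hasNext).foreach(it => heap.enqueue(it))

    new Iterator[(K, C)] {
      def hasNext: Boolean = heap.nonEmpty
      def next(): (K, C) = {
        if (!hasNext) throw new NoSuchElementException("merge exhausted")
        val first = heap.dequeue()
        val record = first.next()
        // Re-insert the iterator if it still has records left.
        if (first.hasNext) heap.enqueue(first)
        record
      }
    }
  }
}
```

Only one record per input iterator is held in memory at a time, which is the property that makes this kind of merge suitable for data that was spilled precisely because it did not fit in memory.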

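The write() method then calls commitAllPartitions() to output the data. This in turn calls writeIndexFileAndCommit(), which writes the index file and commits the data file, as shown below: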

def writeIndexFileAndCommit(
    shuffleId: Int,
    mapId: Long,
    lengths: Array[Long],
    dataTmp: File): Unit = {
  // Resolve the index file and create a temporary index file next to it
  val indexFile = getIndexFile(shuffleId, mapId)
  val indexTmp = Utils.tempFileWith(indexFile)
  try {
    // Resolve the shuffle data file
    val dataFile = getDataFile(shuffleId, mapId)
    // There is only one IndexShuffleBlockResolver per executor, this synchronization make sure
    // the following check and rename are atomic.
    synchronized {
      // Check whether a matching index file and data file already exist
      val existingLengths = checkIndexAndDataFile(indexFile, dataFile, lengths.length)
      if (existingLengths != null) {
        // Another attempt for the same task has already written our map outputs successfully,
        // so just use the existing partition lengths and delete our temporary map outputs.
        // In other words, the shuffle write is already done, so the temporary data file is deleted
        System.arraycopy(existingLengths, 0, lengths, 0, lengths.length)
        if (dataTmp != null && dataTmp.exists()) {
          dataTmp.delete()
        }
      } else {
        // This is the first successful attempt in writing the map outputs for this task,
        // so override any existing index and data files with the ones we wrote.
        // Open a buffered output stream on the temporary index file
        val out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(indexTmp)))
        Utils.tryWithSafeFinally {
          // We take in lengths of each block, need to convert it to offsets.
          // Accumulate the partition lengths into offsets and write them to the temporary index file
          var offset = 0L
          out.writeLong(offset)
          for (length <- lengths) {
            offset += length
            out.writeLong(offset)
          }
        } {
          out.close()
        }

        // Delete any index file left over from an earlier attempt
        if (indexFile.exists()) {
          indexFile.delete()
        }
        // Delete any data file left over from an earlier attempt
        if (dataFile.exists()) {
          dataFile.delete()
        }
        // Rename the temporary files to their final names
        if (!indexTmp.renameTo(indexFile)) {
          throw new IOException("fail to rename file " + indexTmp + " to " + indexFile)
        }
        if (dataTmp != null && dataTmp.exists() && !dataTmp.renameTo(dataFile)) {
          throw new IOException("fail to rename file " + dataTmp + " to " + dataFile)
        }
      }
    }
  } finally {
    if (indexTmp.exists() && !indexTmp.delete()) {
      logError(s"Failed to delete temporary index file at ${indexTmp.getAbsolutePath}")
    }
  }
}
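The index written above is a sequence of cumulative offsets: a leading 0 followed by one running total per partition, so partition i occupies the byte range between entries i and i + 1 of the data file. The sketch below illustrates that layout and how a reader could use it to locate a partition. It works on an in-memory byte array rather than a real file, and ShuffleIndexSketch, encodeIndex and locate are hypothetical names, not Spark APIs.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object ShuffleIndexSketch {
  /** Encodes partition lengths as numPartitions + 1 cumulative offsets (8 bytes each). */
  def encodeIndex(lengths: Array[Long]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    try {
      var offset = 0L
      out.writeLong(offset)
      for (length <- lengths) {
        offset += length
        out.writeLong(offset)
      }
    } finally {
      out.close()
    }
    bytes.toByteArray
  }

  /** Returns (startOffset, length) of one reduce partition inside the data file. */
  def locate(index: Array[Byte], reduceId: Int): (Long, Long) = {
    val in = new DataInputStream(new ByteArrayInputStream(index))
    try {
      // Each offset is one long, so entry i starts at byte i * 8.
      in.skipBytes(reduceId * 8)
      val start = in.readLong()
      val end = in.readLong()
      (start, end - start)
    } finally {
      in.close()
    }
  }
}
```

For example, lengths of 10, 0 and 5 bytes encode to the offsets 0, 10, 10, 15, so the empty middle partition is represented by two equal neighbouring offsets rather than by a missing entry.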
