Shuffle Read and Write Operations (Part 1)

Below is the runTask method of ShuffleMapTask. This method mainly delegates to the write method of HashShuffleWriter to perform the actual write.

  /**
   * Runs the map task: deserializes the RDD and its ShuffleDependency, then writes
   * this partition's records through a ShuffleWriter and returns a MapStatus.
   */
  override def runTask(context: TaskContext): MapStatus = {
    // Deserialize the RDD using the broadcast variable.
    // Start time of the deserialization
    val deserializeStartTime = System.currentTimeMillis()
    // Obtain a closureSerializer instance
    val ser = SparkEnv.get.closureSerializer.newInstance()
    // Use the closure serializer to deserialize the RDD and the ShuffleDependency from taskBinary
    val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
      ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
    // Record how long the executor spent on deserialization
    _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime

    metrics = Some(context.taskMetrics)
    var writer: ShuffleWriter[Any, Any] = null
    try {
      // Obtain the shuffleManager
      val manager = SparkEnv.get.shuffleManager
      // Get a ShuffleWriter for the shuffle identified by dep.shuffleHandle.
      // partitionId is the partition of the current RDD that this task processes,
      // i.e. the write operates on a single partition.
      writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
      // Compute the partition via rdd.iterator() and hand its records to writer.write()
      writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
      // Stop the writer and return the resulting MapStatus
      writer.stop(success = true).get
    } catch {
      case e: Exception =>
        try {
          if (writer != null) {
            writer.stop(success = false)
          }
        } catch {
          case e: Exception =>
            log.debug("Could not stop writer", e)
        }
        throw e
    }
  }
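
For context, here is a hedged sketch of the kind of job that drives this code path: any wide transformation such as reduceByKey creates a ShuffleDependency, and on Spark 1.x the hash-based writer shown above is used when spark.shuffle.manager is set to hash. The application name and data below are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object HashShuffleWriteExample {
  def main(args: Array[String]): Unit = {
    // Select the hash-based shuffle manager (Spark 1.x); "sort" is the default there
    val conf = new SparkConf()
      .setAppName("hash-shuffle-write-example") // hypothetical app name
      .setMaster("local[2]")
      .set("spark.shuffle.manager", "hash")
    val sc = new SparkContext(conf)

    // reduceByKey creates a ShuffleDependency; each ShuffleMapTask then writes its
    // partition through the writer.write() call shown in runTask above
    val counts = sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 2)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach(println)
    sc.stop()
  }
}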

The code of the write method of HashShuffleWriter is shown below:

  /**
   * Write a bunch of records to this task's output.
   *
   * Two things happen here:
   * 1) Decide whether map-side aggregation is needed; if so, combine the records
   *    first and then perform the subsequent writes on the combined output.
   * 2) Use the partitioner's getPartition function to decide which file each
   *    record is written to.
   */
  override def write(records: Iterator[Product2[K, V]]): Unit = {
    // Check whether an aggregator is defined, i.e. whether map-side combine is required
    val iter = if (dep.aggregator.isDefined) {
      if (dep.mapSideCombine) {
        // For an operation like reduceByKey the aggregation is split in two:
        // part of the combining happens here in the ShuffleMapTask, and the rest
        // happens on the reduce side in the ResultTask
        dep.aggregator.get.combineValuesByKey(records, context)
      } else {
        records
      }
    } else {
      require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
      records
    }
    // Use getPartition to decide which file each record is written to
    for (elem <- iter) {
      // elem is a (key, value) pair; the partitioner maps its key to a bucket id
      val bucketId = dep.partitioner.getPartition(elem._1)
      // The writers come from FileShuffleBlockResolver.forMapTask; bucketId selects the
      // target file, and (elem._1, elem._2) is the key/value pair being written
      shuffle.writers(bucketId).write(elem._1, elem._2)
    }
  }
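
To make the bucketing concrete, here is a minimal sketch of how a hash partitioner maps a key to a bucket id, assuming the usual key.hashCode modulo numPartitions scheme; the helper name nonNegativeMod is used purely for illustration:

object BucketIdSketch {
  // Keep the result non-negative even when hashCode is negative
  private def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Maps a key to one of numBuckets buckets, mirroring what
  // dep.partitioner.getPartition(elem._1) does for a hash partitioner
  def bucketId(key: Any, numBuckets: Int): Int =
    if (key == null) 0 else nonNegativeMod(key.hashCode, numBuckets)

  def main(args: Array[String]): Unit = {
    // Records with the same key land in the same bucket, hence the same file
    Seq("a", "b", "a", "c").foreach { k =>
      println(s"key=$k -> bucket=${bucketId(k, numBuckets = 3)}")
    }
  }
}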

The class-level documentation of FileShuffleBlockResolver explains the design:

/**
 * Manages assigning disk-based block writers to shuffle tasks. Each shuffle task gets one file
 * per reducer (this set of files is called a ShuffleFileGroup).
 *
 * As an optimization to reduce the number of physical shuffle files produced, multiple shuffle
 * blocks are aggregated into the same file. There is one "combined shuffle file" per reducer
 * per concurrently executing shuffle task. As soon as a task finishes writing to its shuffle
 * files, it releases them for another task.
 *
 * Regarding the implementation of this feature, shuffle files are identified by a 3-tuple:
 *   - shuffleId: The unique id given to the entire shuffle stage.
 *   - bucketId: The id of the output partition (i.e., reducer id)
 *   - fileId: The unique id identifying a group of "combined shuffle files." Only one task at a
 *       time owns a particular fileId, and this id is returned to a pool when the task finishes.
 * Each shuffle file is then mapped to a FileSegment, which is a 3-tuple (file, offset, length)
 * that specifies where in a given file the actual block data is located.
 *
 * Shuffle file metadata is stored in a space-efficient manner. Rather than simply mapping
 * ShuffleBlockIds directly to FileSegments, each ShuffleFileGroup maintains a list of offsets for
 * each block stored in each file. In order to find the location of a shuffle block, we search the
 * files within a ShuffleFileGroup associated with the block's reducer.
 */
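
Here is a minimal sketch of the metadata scheme described above, using hypothetical case classes to model how one reducer's combined file plus a per-map-task offset list resolves to a FileSegment; the names are illustrative and are not the actual Spark classes:

import java.io.File

// Illustrative model of the doc comment above, not the real Spark implementation
final case class ShuffleBlockKey(shuffleId: Int, mapId: Int, bucketId: Int)
final case class FileSegment(file: File, offset: Long, length: Long)

// One reducer's combined file inside a ShuffleFileGroup: instead of storing a
// FileSegment per block, keep the ordered offsets and lengths of the blocks written
// by each map task and derive the segment on demand.
final class CombinedReducerFile(file: File) {
  private var offsets = Vector.empty[Long]   // offsets(i) = start of the i-th map output
  private var lengths = Vector.empty[Long]   // lengths(i) = size of the i-th map output
  private var mapIdToIndex = Map.empty[Int, Int]

  def recordMapOutput(mapId: Int, offset: Long, length: Long): Unit = {
    mapIdToIndex += mapId -> offsets.size
    offsets :+= offset
    lengths :+= length
  }

  // Resolve the segment for one map task's block in this reducer's file
  def segmentFor(mapId: Int): Option[FileSegment] =
    mapIdToIndex.get(mapId).map(i => FileSegment(file, offsets(i), lengths(i)))
}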

The forMapTask method of this class is shown below:

/**
   * Get a ShuffleWriterGroup for the given map task, which will register it as complete
   * when the writers are closed successfully.
   *
   * mapId is the id of the RDD partition produced by this map task.
   */
  def forMapTask(shuffleId: Int, mapId: Int, numBuckets: Int, serializer: Serializer,
      writeMetrics: ShuffleWriteMetrics): ShuffleWriterGroup = {
    new ShuffleWriterGroup {
      shuffleStates.putIfAbsent(shuffleId, new ShuffleState(numBuckets))
      private val shuffleState = shuffleStates(shuffleId)
      private var fileGroup: ShuffleFileGroup = null

      val openStartTime = System.nanoTime
      val serializerInstance = serializer.newInstance()
      // If consolidateShuffleFiles is true (default is false), tasks reuse a shared file group
      // with one file per reducer instead of each task creating numBuckets files of its own
      val writers: Array[DiskBlockObjectWriter] = if (consolidateShuffleFiles) {
        fileGroup = getUnusedFileGroup() // obtain a file group that is not currently in use
        Array.tabulate[DiskBlockObjectWriter](numBuckets) { bucketId =>
          // mapId corresponds to the RDD partition id
          val blockId = ShuffleBlockId(shuffleId, mapId, bucketId)
          blockManager.getDiskWriter(blockId, fileGroup(bucketId), serializerInstance, bufferSize,
            writeMetrics)
        }
      } else {
        Array.tabulate[DiskBlockObjectWriter](numBuckets) { bucketId =>
          // mapId corresponds to the RDD partition id
          val blockId = ShuffleBlockId(shuffleId, mapId, bucketId)
          // Resolve the block's file on disk, then write to a temporary file next to it
          val blockFile = blockManager.diskBlockManager.getFile(blockId)
          val tmp = Utils.tempFileWith(blockFile)
          blockManager.getDiskWriter(blockId, tmp, serializerInstance, bufferSize, writeMetrics)
        }
      }
      // Creating the file to write to and creating a disk writer both involve interacting with
      // the disk, so should be included in the shuffle write time.
      writeMetrics.incShuffleWriteTime(System.nanoTime - openStartTime)

      override def releaseWriters(success: Boolean) {
        if (consolidateShuffleFiles) {
          if (success) {
            val offsets = writers.map(_.fileSegment().offset)
            val lengths = writers.map(_.fileSegment().length)
            // Record this map task's output offsets and lengths in the file group
            fileGroup.recordMapOutput(mapId, offsets, lengths)
          }
          recycleFileGroup(fileGroup)
        } else {
          // Mark this map task (identified by its partition id) as completed
          shuffleState.completedMapTasks.add(mapId)
        }
      }
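
A small back-of-the-envelope sketch of why consolidation matters for hash shuffle: without it every map task writes numBuckets files, whereas with consolidation file groups are reused across tasks running in the same slot, so the file count drops roughly from M * R to C * R. The figures below are made-up illustrative values:

object HashShuffleFileCount {
  def main(args: Array[String]): Unit = {
    val mapTasks = 1000   // M: number of map tasks in the stage (illustrative)
    val reducers = 200    // R: numBuckets, i.e. reduce-side partitions (illustrative)
    val cores    = 16     // C: concurrently executing map tasks (illustrative)

    // consolidateShuffleFiles = false: every map task opens one file per reducer
    val withoutConsolidation = mapTasks * reducers

    // consolidateShuffleFiles = true: at most one file per reducer
    // per concurrently executing shuffle task, since file groups are recycled
    val withConsolidation = cores * reducers

    println(s"without consolidation: $withoutConsolidation files")
    println(s"with consolidation:    $withConsolidation files")
  }
}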
