Lesson 25: Spark Hash Shuffle Source Code Walkthrough and Analysis


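Spark 2.1.x no longer includes the Hash Shuffle implementation, so why study the HashShuffle source code at all? There are three reasons. First, many production environments still run Spark 1.5.x and actually use Hash Shuffle. Second, Hash Shuffle is the foundation for understanding the later Sort Shuffle. Third, in production, when no sorting is required and the data volume is moderate, Hash Shuffle performs comparatively well; Spark's shuffle is pluggable, so the implementation can be selected through configuration.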

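This lesson walks through Hash Shuffle based on Spark 1.5.2. Spark 1.6.3 is a maintenance release of the Spark 1.6 line; production deployments on Spark 1.x will eventually move to Spark 1.6.3, which is the last release of the 1.x series and also its most stable and capable one. Spark 2.1.0 is the latest Spark release at the time of writing and can be trialed in production environments.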

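The shuffle process consists of the Mapper side, the Reducer side, and the network transfer between them. The Mapper side writes its output to local disk, and the Reducer side fetches that data over the network. The Mapper first buffers data in memory; by default the buffer is 32 KB. Moving data from that in-memory buffer to local disk is the write phase of the shuffle.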

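Two Stages are involved here. The tasks of the upstream Stage are ShuffleMapTasks; the tasks of the downstream Stage may be ShuffleMapTasks or ResultTasks, depending on whether that Stage is the final Stage of the job. A ShuffleMapTask splits the data of the RDD it processes into a number of buckets, each backed by a buffer. How a task splits its data is determined by the partitioner, and every ShuffleMapTask belongs to one specific Stage.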

 ShuffleWriter:

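Let us look at ShuffleMapTask in Spark 1.5.2. It creates a ShuffleWriter, which is responsible for writing the buffered data to local disk. While doing so, the ShuffleWriter has another very important job: reporting to the Spark Driver where the data was written (runTask returns a MapStatus for this purpose). When a Task in the next Stage needs the output of the previous Stage, it asks the Driver for the location of that data, and the Driver tells it where the data lives. In the read path shown later in this lesson, that lookup goes through mapOutputTracker rather than blockManagerMaster. The core of ShuffleMapTask is runTask, whose source is as follows: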

    override def runTask(context: TaskContext): MapStatus = {
      // Deserialize the RDD using the broadcast variable.
      val deserializeStartTime = System.currentTimeMillis()
      val ser = SparkEnv.get.closureSerializer.newInstance()
      val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
        ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
      _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime

      metrics = Some(context.taskMetrics)
      var writer: ShuffleWriter[Any, Any] = null
      try {
        val manager = SparkEnv.get.shuffleManager
        writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
        writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
        writer.stop(success = true).get
      } catch {
        case e: Exception =>
          try {
            if (writer != null) {
              writer.stop(success = false)
            }
          } catch {
            case e: Exception =>
              log.debug("Could not stop writer", e)
          }
          throw e
      }
    }

 

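SparkEnv.get.shuffleManager determines which shuffle implementation is used. In SparkEnv.scala of Spark 1.5.2 there are three choices: HashShuffleManager, SortShuffleManager, and UnsafeShuffleManager. In Spark 1.5.2 the default has already become SortShuffleManager. The implementation can be selected through the spark.shuffle.manager setting when building the SparkConf.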

    val shortShuffleMgrNames = Map(
      "hash" -> "org.apache.spark.shuffle.hash.HashShuffleManager",
      "sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager",
      "tungsten-sort" -> "org.apache.spark.shuffle.unsafe.UnsafeShuffleManager")
    val shuffleMgrName = conf.get("spark.shuffle.manager", "sort")
    val shuffleMgrClass = shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase, shuffleMgrName)
    val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)
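The name lookup above can be tried out without a Spark installation. The following is a minimal sketch, assuming the three aliases from the Spark 1.5.2 snippet; the object name `ShuffleManagerNames` and the helper `resolve` are hypothetical and exist only for illustration.

```scala
object ShuffleManagerNames {
  // Short aliases copied from the SparkEnv snippet above (Spark 1.5.2).
  val shortShuffleMgrNames: Map[String, String] = Map(
    "hash" -> "org.apache.spark.shuffle.hash.HashShuffleManager",
    "sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager",
    "tungsten-sort" -> "org.apache.spark.shuffle.unsafe.UnsafeShuffleManager")

  // Hypothetical helper: `configured` stands for the raw value of
  // spark.shuffle.manager, or None when the setting is absent.
  def resolve(configured: Option[String]): String = {
    val name = configured.getOrElse("sort") // same default as conf.get(..., "sort")
    // Aliases are matched case-insensitively; anything else is treated
    // as a fully qualified class name and returned unchanged.
    shortShuffleMgrNames.getOrElse(name.toLowerCase, name)
  }
}
```

Because unknown names fall through unchanged, a custom ShuffleManager can be plugged in by giving its fully qualified class name, which is what makes the shuffle pluggable.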

 

Next, look at the getWriter method in HashShuffleManager.scala:

    override def getWriter[K, V](handle: ShuffleHandle, mapId: Int, context: TaskContext)
        : ShuffleWriter[K, V] = {
      new HashShuffleWriter(
        shuffleBlockResolver, handle.asInstanceOf[BaseShuffleHandle[K, V, _]], mapId, context)
    }

 

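getWriter creates an instance of HashShuffleWriter. To see how the data is actually written, we have to look at the HashShuffleWriter class, which necessarily provides a write method.

The write method of HashShuffleWriter first checks whether aggregation should be performed on the Mapper side, that is, the local reduce (map-side combine) of the MapReduce model. If so, it iterates over the records and aggregates them by key, as happens for reduceByKey. How the values are aggregated depends on the function passed to reduceByKey, for example addition or multiplication. reduceByKey aggregates the data in an in-memory buffer before writing it to local blocks. Local aggregation reduces the volume of data written to disk, the number of disk I/O operations, the amount of data sent over the network, and the number of fetches a Reduce Task has to make from the Mapper Tasks, so the benefit is substantial.

The source of HashShuffleWriter's write method is as follows: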

    override def write(records: Iterator[Product2[K, V]]): Unit = {
      val iter = if (dep.aggregator.isDefined) {
        if (dep.mapSideCombine) {
          dep.aggregator.get.combineValuesByKey(records, context)
        } else {
          records
        }
      } else {
        require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
        records
      }

      for (elem <- iter) {
        val bucketId = dep.partitioner.getPartition(elem._1)
        shuffle.writers(bucketId).write(elem._1, elem._2)
      }
    }

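The write code shows that when map-side combine is enabled, aggregation is completed in memory first; for a summing reduceByKey, for example, each value is added to the running total for its key, since records are processed sequentially. Only afterwards is data written to local files: partitioner.getPartition computes the bucket for each record, and records are written bucket by bucket. As the figure below shows, bucketId can be thought of as a handle to the in-memory buffer of one output partition; it is passed to shuffle.writers(bucketId).write(elem._1, elem._2) to write the record.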

  

Figure 7-5: HashShuffle
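The two steps of write, optional map-side combine followed by bucketing, can be modeled in plain Scala. This is only a sketch under simplifying assumptions: `combineLocally` is a hypothetical stand-in for Aggregator.combineValuesByKey in the reduceByKey case, buckets are in-memory buffers instead of disk writers, and null keys are not handled.

```scala
import scala.collection.mutable

object HashShuffleWriteSketch {
  // Same arithmetic as Utils.nonNegativeMod in the Spark source.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Hypothetical stand-in for map-side combine: keeps one running value per key.
  def combineLocally[K, V](records: Iterator[(K, V)], func: (V, V) => V): Iterator[(K, V)] = {
    val combined = mutable.LinkedHashMap.empty[K, V]
    for ((k, v) <- records) {
      combined(k) = combined.get(k).map(func(_, v)).getOrElse(v)
    }
    combined.iterator
  }

  // Returns, per bucket, the records that would go to that bucket's writer.
  def writeToBuckets[K, V](records: Iterator[(K, V)], numBuckets: Int): Vector[Vector[(K, V)]] = {
    val buckets = Vector.fill(numBuckets)(mutable.ArrayBuffer.empty[(K, V)])
    for ((k, v) <- records) {
      val bucketId = nonNegativeMod(k.hashCode, numBuckets)
      buckets(bucketId) += ((k, v))
    }
    buckets.map(_.toVector)
  }
}
```

With four input records for the keys "a", "b", "a", "a" and addition as the function, the combine step leaves two records, so only two writes reach the buckets instead of four.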

 

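Now look at the call to shuffleBlockResolver.forMapTask used by HashShuffleWriter, and at the ShuffleWriterGroup returned by the forMapTask method of the FileShuffleBlockResolver class. When file consolidation is enabled, the output of many different Tasks destined for the same bucket is appended to the same file; the ShuffleWriterGroup manages this. Inside, the consolidateShuffleFiles flag decides whether consolidation takes place: if it is enabled, the writers are backed by a fileGroup; otherwise each writer gets its own file through getFile.

The source of FileShuffleBlockResolver's forMapTask method is as follows: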

    def forMapTask(shuffleId: Int, mapId: Int, numBuckets: Int, serializer: Serializer,
        writeMetrics: ShuffleWriteMetrics): ShuffleWriterGroup = {
      new ShuffleWriterGroup {
        shuffleStates.putIfAbsent(shuffleId, new ShuffleState(numBuckets))
        private val shuffleState = shuffleStates(shuffleId)
        private var fileGroup: ShuffleFileGroup = null

        val openStartTime = System.nanoTime
        val serializerInstance = serializer.newInstance()
        val writers: Array[DiskBlockObjectWriter] = if (consolidateShuffleFiles) {
          fileGroup = getUnusedFileGroup()
          Array.tabulate[DiskBlockObjectWriter](numBuckets) { bucketId =>
            val blockId = ShuffleBlockId(shuffleId, mapId, bucketId)
            blockManager.getDiskWriter(blockId, fileGroup(bucketId), serializerInstance, bufferSize,
              writeMetrics)
          }
        } else {
          Array.tabulate[DiskBlockObjectWriter](numBuckets) { bucketId =>
            val blockId = ShuffleBlockId(shuffleId, mapId, bucketId)
            val blockFile = blockManager.diskBlockManager.getFile(blockId)
            // Because of previous failures, the shuffle file may already exist on this machine.
            // If so, remove it.
            if (blockFile.exists) {
              if (blockFile.delete()) {
                logInfo(s"Removed existing shuffle file $blockFile")
              } else {
                logWarning(s"Failed to remove existing shuffle file $blockFile")
              }
            }
            blockManager.getDiskWriter(blockId, blockFile, serializerInstance, bufferSize,
              writeMetrics)
          }
        }
        // Creating the file to write to and creating a disk writer both involve interacting with
        // the disk, so should be included in the shuffle write time.
        writeMetrics.incShuffleWriteTime(System.nanoTime - openStartTime)

        override def releaseWriters(success: Boolean) {
          if (consolidateShuffleFiles) {
            if (success) {
              val offsets = writers.map(_.fileSegment().offset)
              val lengths = writers.map(_.fileSegment().length)
              fileGroup.recordMapOutput(mapId, offsets, lengths)
            }
            recycleFileGroup(fileGroup)
          } else {
            shuffleState.completedMapTasks.add(mapId)
          }
        }

        private def getUnusedFileGroup(): ShuffleFileGroup = {
          val fileGroup = shuffleState.unusedFileGroups.poll()
          if (fileGroup != null) fileGroup else newFileGroup()
        }

        private def newFileGroup(): ShuffleFileGroup = {
          val fileId = shuffleState.nextFileId.getAndIncrement()
          val files = Array.tabulate[File](numBuckets) { bucketId =>
            val filename = physicalFileName(shuffleId, bucketId, fileId)
            blockManager.diskBlockManager.getFile(filename)
          }
          val fileGroup = new ShuffleFileGroup(shuffleId, fileId, files)
          shuffleState.allFileGroups.add(fileGroup)
          fileGroup
        }

        private def recycleFileGroup(group: ShuffleFileGroup) {
          shuffleState.unusedFileGroups.add(group)
        }
      }
    }
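The reuse pattern behind getUnusedFileGroup and recycleFileGroup is a simple pool: take a released group if one exists, otherwise create a new one. The sketch below models just that pattern; `FileGroup`, `Pool`, and the file naming scheme are hypothetical simplifications, not Spark's ShuffleFileGroup or physicalFileName.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicInteger

object FileGroupPoolSketch {
  // Hypothetical stand-in for ShuffleFileGroup: one file name per bucket.
  final case class FileGroup(fileId: Int, files: Vector[String])

  class Pool(shuffleId: Int, numBuckets: Int) {
    private val nextFileId = new AtomicInteger(0)
    private val unused = new ConcurrentLinkedQueue[FileGroup]()

    // Reuse a released group if there is one; otherwise create numBuckets new files.
    def acquire(): FileGroup = {
      val group = unused.poll() // returns null when the queue is empty
      if (group != null) group else newGroup()
    }

    // Called when a map task finishes, so that a later task can append to the same files.
    def release(group: FileGroup): Unit = { unused.add(group); () }

    def groupsCreated: Int = nextFileId.get()

    private def newGroup(): FileGroup = {
      val fileId = nextFileId.getAndIncrement()
      // Illustrative naming only.
      FileGroup(fileId, Vector.tabulate(numBuckets)(b => s"shuffle_${shuffleId}_${b}_$fileId"))
    }
  }
}
```

In this model, map tasks that run one after another share a single group, so the number of groups (and therefore files) grows with the number of tasks running at the same time rather than with the total number of map tasks.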

 

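Whether or not consolidateShuffleFiles is enabled, the data ultimately has to be written, and the writing goes through blockManager: blockManager.getDiskWriter returns a writer that puts the data on local disk, which is ordinary I/O.

The source of getDiskWriter in BlockManager.scala is as follows: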

    def getDiskWriter(
        blockId: BlockId,
        file: File,
        serializerInstance: SerializerInstance,
        bufferSize: Int,
        writeMetrics: ShuffleWriteMetrics): DiskBlockObjectWriter = {
      val compressStream: OutputStream => OutputStream = wrapForCompression(blockId, _)
      val syncWrites = conf.getBoolean("spark.shuffle.sync", false)
      new DiskBlockObjectWriter(blockId, file, serializerInstance, bufferSize, compressStream,
        syncWrites, writeMetrics)
    }

 

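Back in HashShuffleWriter.scala, note the arguments passed to shuffleBlockResolver.forMapTask: the first is the shuffleId, the second the mapId, the third the number of output splits, the fourth the serializer, and the fifth the metrics object that records basic write statistics.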

    private val shuffle = shuffleBlockResolver.forMapTask(dep.shuffleId, mapId, numOutputSplits, ser,
      writeMetrics)

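HashShuffleWriter.scala first calls forMapTask and then performs the write. Suppose reduceByKey aggregates locally and one key has 10,000 values. Without aggregation, 10,000 records would be written, each with objOut.writeKey(key) for the key and objOut.writeValue(value) for the value; with local aggregation, the 10,000 values are combined into one and only a single record is written. The writing part of the source is as follows: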

    for (elem <- iter) {
      val bucketId = dep.partitioner.getPartition(elem._1)
      shuffle.writers(bucketId).write(elem._1, elem._2)
    }

Stepping into the write method of DiskBlockObjectWriter.scala, we reach the disk-level write:

    def write(key: Any, value: Any) {
      if (!initialized) {
        open()
      }

      objOut.writeKey(key)
      objOut.writeValue(value)
      recordWritten()
    }
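The shape of this write method, open lazily on the first record, then write key and value through a buffer, can be imitated with plain java.io classes. This is a rough sketch, not Spark's DiskBlockObjectWriter: `LazyKeyValueWriter` is a hypothetical class, it writes to an in-memory sink instead of a file, supports only String keys and Int values, and uses the 32 KB buffer size mentioned earlier as its default.

```scala
import java.io.{BufferedOutputStream, ByteArrayOutputStream, DataOutputStream}

object LazyWriterSketch {
  // Hypothetical simplified writer: the stream is created on the first write.
  class LazyKeyValueWriter(sink: ByteArrayOutputStream, bufferSize: Int = 32 * 1024) {
    private var out: DataOutputStream = null
    private var records = 0L

    def initialized: Boolean = out != null

    def write(key: String, value: Int): Unit = {
      if (!initialized) {
        // Equivalent of open(): wrap the sink in a buffer of bufferSize bytes.
        out = new DataOutputStream(new BufferedOutputStream(sink, bufferSize))
      }
      out.writeUTF(key)   // key first
      out.writeInt(value) // then value
      records += 1        // equivalent of recordWritten()
    }

    def recordsWritten: Long = records

    // Closing flushes whatever is still sitting in the buffer.
    def close(): Unit = if (initialized) out.close()
  }
}
```

Until the buffer fills up or the writer is closed, nothing reaches the sink, which is why the buffer size directly affects how often the underlying file is touched.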

 

In shuffle.writers(bucketId).write(elem._1, elem._2) from the write method of HashShuffleWriter.scala, writers is indexed by bucketId, which determines where the record is written:

    private[spark] trait ShuffleWriterGroup {
      val writers: Array[DiskBlockObjectWriter]

      /** @param success Indicates all writes were successful. If false, no blocks will be recorded. */
      def releaseWriters(success: Boolean)
    }

 

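HashShuffle keeps bucket buffers in memory and files on local disk, so tuning must take both memory and disk I/O into account. Looking again at the write code in HashShuffleWriter.scala, shuffle.writers(bucketId).write(elem._1, elem._2) writes each record to the local file associated with its bucketId. The write takes two arguments, elem._1 and elem._2: since iter yields key-value elements, elem._1 is the key and elem._2 is the value: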

    for (elem <- iter) {
      val bucketId = dep.partitioner.getPartition(elem._1)
      shuffle.writers(bucketId).write(elem._1, elem._2)
    }

Following getPartition, records are distributed to different bucketIds based on their key. In Partitioner.scala, getPartition is an abstract method with no implementation:

    abstract class Partitioner extends Serializable {
      def numPartitions: Int
      def getPartition(key: Any): Int
    }

 

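Next, look at the getPartition method of the HashPartitioner class in Partitioner.scala, which takes the key as its argument.

By default, Spark's parallelism is inherited from one Stage to the next: for example, if the upstream Stage runs 4 parallel tasks, the downstream Stage will also have 4 unless a different number of partitions is specified.

The source of HashPartitioner, including its getPartition method, is as follows: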

    class HashPartitioner(partitions: Int) extends Partitioner {
      require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

      def numPartitions: Int = partitions

      def getPartition(key: Any): Int = key match {
        case null => 0
        case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
      }

      override def equals(other: Any): Boolean = other match {
        case h: HashPartitioner =>
          h.numPartitions == numPartitions
        case _ =>
          false
      }

      override def hashCode: Int = numPartitions
    }

 

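getPartition in HashPartitioner calls nonNegativeMod with two arguments, the key's hashCode and the number of partitions numPartitions. It is an ordinary modulo operation, adjusted so that the result is never negative: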

    def nonNegativeMod(x: Int, mod: Int): Int = {
      val rawMod = x % mod
      rawMod + (if (rawMod < 0) mod else 0)
    }
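The adjustment matters because the JVM's % operator keeps the sign of the dividend, and hashCode can be negative. A small sketch makes this concrete; `PartitionSketch` is a hypothetical wrapper, while the two method bodies follow the listings above.

```scala
object PartitionSketch {
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod // may be negative when x is negative
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Mirrors HashPartitioner.getPartition: null keys always go to partition 0.
  def getPartition(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case _ => nonNegativeMod(key.hashCode, numPartitions)
  }
}
```

For example, -7 % 4 evaluates to -3, which would be an invalid bucket index; nonNegativeMod(-7, 4) shifts it to 1.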

 

 
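With the writer covered, we now turn to the reader.

The key part of HashShuffleReader.scala is its read method. It first creates a ShuffleBlockFetcherIterator. An important tuning parameter appears here, spark.reducer.maxSizeInFlight, which caps how much data may be requested at one time. In Spark 1.5.2 the default is 48 MB. If memory is plentiful and enough of it is allocated to shuffle (shuffle takes 20% of memory by default), try increasing this parameter, for example to 96 MB or higher. A larger value reduces the number of fetch rounds, which matters because establishing new network connections is expensive.

The read method of HashShuffleReader.scala checks whether an aggregator is defined and whether mapSideCombine is set, and handles the aggregating and non-aggregating cases separately. The reducer uses HashShuffleReader to fetch data from remote nodes and then aggregates it; whether the aggregated data is subsequently grouped, reduced, or processed in some other way is up to the developer.

The source of the read method in HashShuffleReader.scala is as follows: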
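As a configuration sketch for Spark 1.5.x, the settings discussed in this lesson could be applied on a SparkConf as shown here. The property names are the ones named in the text and code (spark.shuffle.manager, spark.reducer.maxSizeInFlight) plus spark.shuffle.memoryFraction for the 20% shuffle memory share; the values and the application name are illustrative assumptions, not recommendations, and the snippet needs the Spark 1.5.x libraries on the classpath.

```scala
import org.apache.spark.SparkConf

object ShuffleTuningSketch {
  // Illustrative values only; measure on the actual workload before adopting any of them.
  val conf: SparkConf = new SparkConf()
    .setAppName("hash-shuffle-tuning-sketch")    // hypothetical application name
    .set("spark.shuffle.manager", "hash")        // select HashShuffleManager
    .set("spark.reducer.maxSizeInFlight", "96m") // default in 1.5.2 is 48m
    .set("spark.shuffle.memoryFraction", "0.3")  // default is 0.2
}
```

The same properties can also be passed on the command line with --conf key=value when submitting the application.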

    override def read(): Iterator[Product2[K, C]] = {
      val blockFetcherItr = new ShuffleBlockFetcherIterator(
        context,
        blockManager.shuffleClient,
        blockManager,
        mapOutputTracker.getMapSizesByExecutorId(handle.shuffleId, startPartition),
        // Note: we use getSizeAsMb when no suffix is provided for backwards compatibility
        SparkEnv.get.conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024)

      // Wrap the streams for compression based on configuration
      val wrappedStreams = blockFetcherItr.map { case (blockId, inputStream) =>
        blockManager.wrapForCompression(blockId, inputStream)
      }

      val ser = Serializer.getSerializer(dep.serializer)
      val serializerInstance = ser.newInstance()

      // Create a key/value iterator for each stream
      val recordIter = wrappedStreams.flatMap { wrappedStream =>
        // Note: the asKeyValueIterator below wraps a key/value iterator inside of a
        // NextIterator. The NextIterator makes sure that close() is called on the
        // underlying InputStream when all records have been read.
        serializerInstance.deserializeStream(wrappedStream).asKeyValueIterator
      }

      // Update the context task metrics for each record read.
      val readMetrics = context.taskMetrics.createShuffleReadMetricsForDependency()
      val metricIter = CompletionIterator[(Any, Any), Iterator[(Any, Any)]](
        recordIter.map(record => {
          readMetrics.incRecordsRead(1)
          record
        }),
        context.taskMetrics().updateShuffleReadMetrics())

      // An interruptible iterator must be used here in order to support task cancellation
      val interruptibleIter = new InterruptibleIterator[(Any, Any)](context, metricIter)

      val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) {
        if (dep.mapSideCombine) {
          // We are reading values that are already combined
          val combinedKeyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, C)]]
          dep.aggregator.get.combineCombinersByKey(combinedKeyValuesIterator, context)
        } else {
          // We don't know the value type, but also don't care -- the dependency *should*
          // have made sure its compatible w/ this aggregator, which will convert the value
          // type to the combined type C
          val keyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, Nothing)]]
          dep.aggregator.get.combineValuesByKey(keyValuesIterator, context)
        }
      } else {
        require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
        interruptibleIter.asInstanceOf[Iterator[Product2[K, C]]]
      }

      // Sort the output if there is a sort ordering defined.
      dep.keyOrdering match {
        case Some(keyOrd: Ordering[K]) =>
          // Create an ExternalSorter to sort the data. Note that if spark.shuffle.spill is disabled,
          // the ExternalSorter won't spill to disk.
          val sorter = new ExternalSorter[K, C, C](ordering = Some(keyOrd), serializer = Some(ser))
          sorter.insertAll(aggregatedIter)
          context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled)
          context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled)
          context.internalMetricsToAccumulators(
            InternalAccumulator.PEAK_EXECUTION_MEMORY).add(sorter.peakMemoryUsedBytes)
          sorter.iterator
        case None =>
          aggregatedIter
      }
    }

 

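Since aggregation has come up, let us look more closely at reduceByKey. Compared with Hadoop MapReduce, reduceByKey has one drawback: in Hadoop MapReduce the map and reduce functions can be fully customized, and their business logic can differ from each other regardless of the type of job. On the other hand, reduceByKey has the advantage of combining cleanly with the operators of the preceding Stage (the Mapper side) and with the operators of the following Stage (the reduce side).

Here is the reduceByKey method in PairRDDFunctions.scala: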

    def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
      combineByKey[V]((v: V) => v, func, func, partitioner)
    }

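reduceByKey calls combineByKey internally. The parameters of combineByKey are:

 The first parameter, `createCombiner`, creates the combiner: it turns a V into a C, for example by creating a one-element list.

 The second parameter, `mergeValue`, merges a V into a C, for example by appending the element to the end of the list.

 The third parameter, `mergeCombiners`, merges two Cs into one.

 The fourth parameter, partitioner, specifies the partitioner.

 mapSideCombine defaults to true, so aggregation is performed on the Mapper side by default. Note that the key type must not be an array. From the point of view of reduceByKey, the second and third parameters are the same function.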

    def combineByKey[C](createCombiner: V => C,
        mergeValue: (C, V) => C,
        mergeCombiners: (C, C) => C,
        partitioner: Partitioner,
        mapSideCombine: Boolean = true,
        serializer: Serializer = null): RDD[(K, C)] = self.withScope {
      // (method body not shown in this excerpt)
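How the three functions cooperate across the map and reduce sides can be illustrated on ordinary collections. The sketch below is a simplified model, not Spark's Aggregator: the function names echo combineValuesByKey and combineCombinersByKey from the listings above, but they work on in-memory maps, and each inner Seq stands for the records of one map-side partition.

```scala
object CombineByKeySketch {
  // Map side: fold the raw values of one partition into per-key combiners.
  def combineValuesByKey[K, V, C](records: Iterator[(K, V)],
      createCombiner: V => C, mergeValue: (C, V) => C): Map[K, C] =
    records.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(mergeValue(_, v)).getOrElse(createCombiner(v)))
    }

  // Reduce side: merge the combiners that arrive from different map outputs.
  def combineCombinersByKey[K, C](partials: Iterator[(K, C)],
      mergeCombiners: (C, C) => C): Map[K, C] =
    partials.foldLeft(Map.empty[K, C]) { case (acc, (k, c)) =>
      acc.updated(k, acc.get(k).map(mergeCombiners(_, c)).getOrElse(c))
    }

  // reduceByKey(func) corresponds to combineByKey(v => v, func, func).
  def reduceByKey[K, V](partitions: Seq[Seq[(K, V)]], func: (V, V) => V): Map[K, V] = {
    val mapOutputs = partitions.map(p => combineValuesByKey[K, V, V](p.iterator, v => v, func))
    combineCombinersByKey[K, V](mapOutputs.iterator.flatMap(_.iterator), func)
  }
}
```

With a summing function, the partitions Seq("a" -> 1, "b" -> 2, "a" -> 3) and Seq("a" -> 4, "b" -> 5) are first reduced to one record per key per partition, and the reduce side then merges those partial sums.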

 

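Back in the read method of HashShuffleReader.scala: fetching data on the reducer side requires network communication, so when does it happen? It is carried out by the ShuffleBlockFetcherIterator created in the read method: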

    final class ShuffleBlockFetcherIterator(
        context: TaskContext,
        shuffleClient: ShuffleClient,
        blockManager: BlockManager,
        blocksByAddress: Seq[(BlockManagerId, Seq[(BlockId, Long)])],
        maxBytesInFlight: Long)
      extends Iterator[(BlockId, InputStream)] with Logging {

 

Here is the initialize method of ShuffleBlockFetcherIterator.scala:

    private[this] def initialize(): Unit = {
      // Add a task completion callback (called in both success case and failure case) to cleanup.
      context.addTaskCompletionListener(_ => cleanup())

      // Split local and remote blocks.
      val remoteRequests = splitLocalRemoteBlocks()
      // Add the remote requests into our queue in a random order
      fetchRequests ++= Utils.randomize(remoteRequests)

      // Send out initial requests for blocks, up to our maxBytesInFlight
      while (fetchRequests.nonEmpty &&
        (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
        sendRequest(fetchRequests.dequeue())
      }

      val numFetches = remoteRequests.size - fetchRequests.size
      logInfo("Started " + numFetches + " remote fetches in" + Utils.getUsedTimeMs(startTime))

      // Get Local Blocks
      fetchLocalBlocks()
      logDebug("Got local blocks in " + Utils.getUsedTimeMs(startTime))
    }
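The request loop in initialize can be modeled in isolation to see how maxBytesInFlight throttles the fetches. In this sketch, `FetchRequest` and `Fetcher` are hypothetical simplifications; the while condition is the one from the listing above, and `onResult` is an assumed callback that stands for the moment a response arrives and frees capacity.

```scala
import scala.collection.mutable

object InFlightSketch {
  // Hypothetical simplified request: a remote address and the total size of its blocks.
  final case class FetchRequest(address: String, size: Long)

  class Fetcher(requests: Seq[FetchRequest], maxBytesInFlight: Long) {
    private val fetchRequests = mutable.Queue(requests: _*)
    var bytesInFlight = 0L
    val sent = mutable.ArrayBuffer.empty[FetchRequest]

    // Same condition as in initialize(): always send when nothing is in flight,
    // otherwise only while the next request still fits under the cap.
    def sendWhileRoom(): Unit = {
      while (fetchRequests.nonEmpty &&
        (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
        val req = fetchRequests.dequeue()
        bytesInFlight += req.size
        sent += req
      }
    }

    // Assumed callback: a response has arrived, so capacity is released and more may be sent.
    def onResult(req: FetchRequest): Unit = {
      bytesInFlight -= req.size
      sendWhileRoom()
    }

    def pending: Int = fetchRequests.size
  }
}
```

With three 20 MB requests and a 48 MB cap, two requests go out immediately and the third waits until one of the first two has completed. The bytesInFlight == 0 clause guarantees progress even when a single request is larger than the cap.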

The initialize method of ShuffleBlockFetcherIterator.scala loops, sending fetch requests while at most 48 MB (by default) is in flight. One line in it is sendRequest(fetchRequests.dequeue()); here is the sendRequest method of ShuffleBlockFetcherIterator.scala:

    private[this] def sendRequest(req: FetchRequest) {
      logDebug("Sending request for %d blocks (%s) from %s".format(
        req.blocks.size, Utils.bytesToString(req.size), req.address.hostPort))
      bytesInFlight += req.size

      // so we can look up the size of each blockID
      val sizeMap = req.blocks.map { case (blockId, size) => (blockId.toString, size) }.toMap
      val blockIds = req.blocks.map(_._1.toString)

      val address = req.address
      shuffleClient.fetchBlocks(address.host, address.port, address.executorId, blockIds.toArray,
        new BlockFetchingListener {
          override def onBlockFetchSuccess(blockId: String, buf: ManagedBuffer): Unit = {
            // Only add the buffer to results queue if the iterator is not zombie,
            // i.e. cleanup() has not been called yet.
            if (!isZombie) {
              // Increment the ref count because we need to pass this to a different thread.
              // This needs to be released after use.
              buf.retain()
              results.put(new SuccessFetchResult(BlockId(blockId), address, sizeMap(blockId), buf))
              shuffleMetrics.incRemoteBytesRead(buf.size)
              shuffleMetrics.incRemoteBlocksFetched(1)
            }
            logTrace("Got remote block " + blockId + " after " + Utils.getUsedTimeMs(startTime))
          }

          override def onBlockFetchFailure(blockId: String, e: Throwable): Unit = {
            logError(s"Failed to get block(s) from ${req.address.host}:${req.address.port}", e)
            results.put(new FailureFetchResult(BlockId(blockId), address, e))
          }
        }
      )
    }

 

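The call shuffleClient.fetchBlocks(address.host, address.port, address.executorId, blockIds.toArray, ...) in sendRequest carries the host name and port of the remote machine, the executorId, and the blockIds. Once blocks arrive, the BlockFetchingListener handles the results.

Next is the fetchBlocks method of ShuffleClient.java, which fetches blocks from a remote node asynchronously and reports the outcome through the listener: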

      public abstract void fetchBlocks(
          String host,
          int port,
          String execId,
          String[] blockIds,
          BlockFetchingListener listener);
    }

 

fetchBlocks is overridden in BlockTransferService.scala, but still as an abstract declaration with no implementation:

    override def fetchBlocks(
        host: String,
        port: Int,
        execId: String,
        blockIds: Array[String],
        listener: BlockFetchingListener): Unit

 

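The concrete subclass of BlockTransferService is NettyBlockTransferService. Netty is a network communication framework built on NIO and is widely used by internet companies for inter-process communication. The point is that Spark has a communication framework underneath, on which data requests and transfers are built. The source of fetchBlocks in NettyBlockTransferService.scala is as follows: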

    override def fetchBlocks(
        host: String,
        port: Int,
        execId: String,
        blockIds: Array[String],
        listener: BlockFetchingListener): Unit = {
      logTrace(s"Fetch blocks from $host:$port (executor id $execId)")
      try {
        val blockFetchStarter = new RetryingBlockFetcher.BlockFetchStarter {
          override def createAndStart(blockIds: Array[String], listener: BlockFetchingListener) {
            val client = clientFactory.createClient(host, port)
            new OneForOneBlockFetcher(client, appId, execId, blockIds.toArray, listener).start()
          }
        }

        val maxRetries = transportConf.maxIORetries()
        if (maxRetries > 0) {
          // Note this Fetcher will correctly handle maxRetries == 0; we avoid it just in case there's
          // a bug in this code. We should remove the if statement once we're sure of the stability.
          new RetryingBlockFetcher(transportConf, blockFetchStarter, blockIds, listener).start()
        } else {
          blockFetchStarter.createAndStart(blockIds, listener)
        }
      } catch {
        case e: Exception =>
          logError("Exception while beginning fetchBlocks", e)
          blockIds.foreach(listener.onBlockFetchFailure(_, e))
      }
    }

In fetchBlocks, maxRetries is the maximum number of retries, and createAndStart is where the underlying Netty-based fetch is started:

    ...
    val blockFetchStarter = new RetryingBlockFetcher.BlockFetchStarter {
    ...
    val maxRetries = transportConf.maxIORetries()
    ...
      override def createAndStart(blockIds: Array[String], listener: BlockFetchingListener) {
        val client = clientFactory.createClient(host, port)
        new OneForOneBlockFetcher(client, appId, execId, blockIds.toArray, listener).start()
      }
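The idea of a bounded number of retries around a fetch can be shown with a small, synchronous helper. This is a hypothetical sketch of the general technique, not the logic of RetryingBlockFetcher, which works asynchronously through listeners; here `fetchWithRetries` simply reruns a function when it throws an IOException.

```scala
import java.io.IOException

object RetrySketch {
  // Hypothetical helper: runs `attempt` once, then up to maxRetries more times
  // if it fails with an IOException. The argument passed to `attempt` is the
  // zero-based attempt number.
  def fetchWithRetries[T](maxRetries: Int)(attempt: Int => T): T = {
    require(maxRetries >= 0, "maxRetries must not be negative")
    var tries = 0
    var result: Option[T] = None
    var lastError: Throwable = null
    while (result.isEmpty && tries <= maxRetries) {
      try {
        result = Some(attempt(tries))
      } catch {
        case e: IOException => lastError = e // remember the failure and try again
      }
      tries += 1
    }
    // Either a result was produced, or every attempt failed and lastError is set.
    result.getOrElse(throw lastError)
  }
}
```

With maxRetries set to 0 the helper degenerates to a single attempt, which corresponds to the else branch in fetchBlocks that calls createAndStart directly.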

 

new OneForOneBlockFetcher(client, appId, execId, blockIds.toArray, listener).start() begins the actual communication. Here is the source of start in OneForOneBlockFetcher.java, which uses RPC:

    public void start() {
      if (blockIds.length == 0) {
        throw new IllegalArgumentException("Zero-sized blockIds array");
      }

      client.sendRpc(openMessage.toByteArray(), new RpcResponseCallback() {
        @Override
        public void onSuccess(byte[] response) {
          try {
            streamHandle = (StreamHandle) BlockTransferMessage.Decoder.fromByteArray(response);
            logger.trace("Successfully opened blocks {}, preparing to fetch chunks.", streamHandle);

            // Immediately request all chunks -- we expect that the total size of the request is
            // reasonable due to higher level chunking in [[ShuffleBlockFetcherIterator]].
            for (int i = 0; i < streamHandle.numChunks; i++) {
              client.fetchChunk(streamHandle.streamId, i, chunkCallback);
            }
          } catch (Exception e) {
            logger.error("Failed while starting block fetches after success", e);
            failRemainingBlocks(blockIds, e);
          }
        }

        @Override
        public void onFailure(Throwable e) {
          logger.error("Failed while starting block fetches", e);
          failRemainingBlocks(blockIds, e);
        }
      });
    }

 

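Looking further into how the fetched data is processed, a HashMap-like data structure is typically used to aggregate it.

Summary:

1. Storing and fetching shuffle data is ordinary Scala and Java programming built on a familiar idea: buffer the data, then write it to disk. Because Spark is a distributed system, however, the process cooperates with, and is controlled by, the management components on the Driver: when data is written, the Driver is told where it is, and when the next stage needs to read it, it asks the Driver, which reports exactly where the data lives. The actual data transfer uses the Netty-based RPC framework and amounts to basic I/O.

2. If the Reducer side runs short of memory and spills to disk, the cost is doubled. On the Mapper side the data is written to disk once regardless of whether memory is sufficient. On the Reducer side, if memory is insufficient, data is spilled to disk and then has to be read back from disk when it is computed on. An important tuning parameter here is the shuffle memory fraction: it defaults to 20% and can be raised to around 30%, but not too far, because persist also needs memory, and the more memory shuffle takes, the less is left for persisted data.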

 




