Lesson 25: Spark Hash Shuffle Source Code Walkthrough and Analysis
Spark 2.1.x no longer ships the Hash Shuffle implementation, so why do we still walk through the Hash Shuffle source code? There are three reasons. First, many production environments are still running Spark 1.5.x and are therefore actually using Hash Shuffle. Second, Hash Shuffle is the foundation for the later Sort Shuffle. Third, when sorting is not required and the data volume is not that large, Hash Shuffle performs quite well; the Spark shuffle manager is pluggable, so we can select it through configuration.
In this section we explain Hash Shuffle based on Spark 1.5.2. Spark 1.6.3 is a maintenance release in the Spark 1.6 line; if you use Spark 1.x in production you will eventually move to Spark 1.6.3, the last and most stable release of the 1.x series. Spark 2.1.0 is the latest version at the time of writing and can be tried out in production.
The shuffle process consists of the mapper side, the reducer side, and the network transfer between them. The mapper side writes its data to local disk, and the reducer side fetches that data over the network. The mapper first buffers the data in memory; by default this buffer is 32 KB, and flushing the data from memory to local disk is the write phase of the shuffle.
There are two Stages here: the upstream Stage runs ShuffleMapTasks, while the downstream Stage runs either ShuffleMapTasks or ResultTasks, depending on whether it is the final Stage of the job. A ShuffleMapTask splits the RDD data it processes into a number of buckets, that is, one buffer per downstream partition. How a task splits its data is determined by the partitioner, and a ShuffleMapTask always belongs to a specific Stage.
Looking at ShuffleMapTask in Spark 1.5.2, it creates a ShuffleWriter that is responsible for writing the buffered data to local disk. While writing to local disk, the ShuffleWriter also does something very important: it reports back to the Spark Driver where the data was written. When the next Stage needs the output of the previous Stage, it asks the Driver (BlockManagerMaster) for the location information, and the Driver (BlockManagerMaster) tells the tasks of the next Stage where the data they need lives. The core of ShuffleMapTask is runTask, whose source code is as follows:
override def runTask(context: TaskContext): MapStatus = {
  // Deserialize the RDD using the broadcast variable.
  val deserializeStartTime = System.currentTimeMillis()
  val ser = SparkEnv.get.closureSerializer.newInstance()
  val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
  _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime

  metrics = Some(context.taskMetrics)
  var writer: ShuffleWriter[Any, Any] = null
  try {
    val manager = SparkEnv.get.shuffleManager
    writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
    writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
    writer.stop(success = true).get
  } catch {
    case e: Exception =>
      try {
        if (writer != null) {
          writer.stop(success = false)
        }
      } catch {
        case e: Exception =>
          log.debug("Could not stop writer", e)
      }
      throw e
  }
}
SparkEnv.get.shuffleManager is where the shuffle implementation is obtained. Looking at shuffleManager in SparkEnv.scala, Spark 1.5.2 offers three implementations: HashShuffleManager, SortShuffleManager, and UnsafeShuffleManager; in Spark 1.5.2 the default has already become SortShuffleManager. The implementation can be chosen when configuring SparkConf; a configuration sketch follows the snippet below.
val shortShuffleMgrNames = Map(
  "hash" -> "org.apache.spark.shuffle.hash.HashShuffleManager",
  "sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager",
  "tungsten-sort" -> "org.apache.spark.shuffle.unsafe.UnsafeShuffleManager")
val shuffleMgrName = conf.get("spark.shuffle.manager", "sort")
val shuffleMgrClass = shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase, shuffleMgrName)
val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)
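The following is a minimal sketch, not taken from the Spark source, of how an application can select Hash Shuffle explicitly in Spark 1.5.x; the application name and master URL are placeholders.
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: pick Hash Shuffle explicitly in Spark 1.5.x. The short name
// "hash" is resolved to org.apache.spark.shuffle.hash.HashShuffleManager
// through the shortShuffleMgrNames map shown above.
object HashShuffleDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("HashShuffleDemo") // placeholder application name
      .setMaster("local[4]")         // placeholder master URL
      .set("spark.shuffle.manager", "hash")
    val sc = new SparkContext(conf)
    // ... job logic ...
    sc.stop()
  }
}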
Let us look at the getWriter method of HashShuffleManager.scala:
override def getWriter[K, V](handle: ShuffleHandle, mapId: Int, context: TaskContext)
    : ShuffleWriter[K, V] = {
  new HashShuffleWriter(
    shuffleBlockResolver, handle.asInstanceOf[BaseShuffleHandle[K, V, _]], mapId, context)
}
getWriter creates an instance of HashShuffleWriter. To see how the data is actually written, we must look at the HashShuffleWriter class, which in turn must provide a write method.
The write method of HashShuffleWriter first checks whether aggregation should happen on the mapper side, that is, whether a local reduce in the MapReduce sense (map-side combine) is required. If so, it iterates over the records in the buffer and aggregates them; reduceByKey is a typical example. How the aggregation is performed depends on the function passed to reduceByKey, for example addition or multiplication. reduceByKey aggregates the data in the buffer and only then writes it to the local block. The benefit of this local aggregation is significant: it reduces the amount of data written to disk, the number of disk I/O operations, the amount of data transferred over the network, and the number of times a reduce task has to fetch data from a mapper task. A small example is sketched below.
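The following is a minimal sketch with made-up data: the (_ + _) function passed to reduceByKey is exactly the operator that the map-side combine applies before the shuffle write.
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch with made-up data: reduceByKey(_ + _) uses map-side combine,
// so each mapper writes at most one record per key per bucket instead of one
// record per input element.
object MapSideCombineDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MapSideCombineDemo").setMaster("local[2]"))

    val words = sc.parallelize(Seq("spark", "shuffle", "spark", "hash", "spark"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _) // (_ + _) is the combine operator

    counts.collect().foreach(println) // (spark,3), (shuffle,1), (hash,1)
    sc.stop()
  }
}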
The source code of HashShuffleWriter's write method is as follows:
override def write(records: Iterator[Product2[K, V]]): Unit = {
  val iter = if (dep.aggregator.isDefined) {
    if (dep.mapSideCombine) {
      dep.aggregator.get.combineValuesByKey(records, context)
    } else {
      records
    }
  } else {
    require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
    records
  }

  for (elem <- iter) {
    val bucketId = dep.partitioner.getPartition(elem._1)
    shuffle.writers(bucketId).write(elem._1, elem._2)
  }
}
From the write code of HashShuffleWriter we can see that, if map-side combine is enabled, the aggregation is completed in memory first; for example, with reduceByKey the values are accumulated in place because the task executes linearly. Only afterwards is the data written to local files: partitioner.getPartition determines the target partition, and the data is written bucket by bucket. As shown in the figure below, the bucketId can be thought of as a handle to the in-memory bucket; we pass the bucketId in and call shuffle.writers(bucketId).write(elem._1, elem._2) to write the data.
Figure 7-5 Hash Shuffle
Now look at the shuffleBlockResolver.forMapTask call in HashShuffleWriter's write path, and the ShuffleWriterGroup built inside the forMapTask method of FileShuffleBlockResolver. When the file consolidation mechanism is enabled, the outputs that different tasks produce for the same reduce partition are appended to the same file, and that is what the ShuffleWriterGroup manages. Inside there is a check on consolidateShuffleFiles that decides whether consolidation is used: if it is enabled, the writers are obtained from a fileGroup; otherwise getFile is called to create a separate block file. A rough file-count comparison is sketched below.
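As a rough, back-of-the-envelope illustration (not taken from this source file, and with hypothetical numbers), the sketch below compares the number of shuffle files produced without and with consolidation, assuming M map tasks, R reduce partitions, and C map tasks running concurrently on an executor.
// Hypothetical numbers, only to show the arithmetic behind consolidation.
object ShuffleFileCount {
  def main(args: Array[String]): Unit = {
    val mapTasks = 1000        // M: map tasks in the upstream Stage
    val reducePartitions = 200 // R: reduce partitions (buckets)
    val concurrentTasks = 16   // C: map tasks an executor can run concurrently

    // Plain hash shuffle: every map task writes one file per reduce partition.
    val withoutConsolidation = mapTasks * reducePartitions     // M * R = 200,000 files
    // With consolidateShuffleFiles: concurrently running tasks on an executor reuse
    // a file group, so the count is driven by the concurrency level, not by M.
    val withConsolidation = concurrentTasks * reducePartitions // roughly C * R = 3,200 files

    println(s"without consolidation: $withoutConsolidation files")
    println(s"with consolidation:    about $withConsolidation files")
  }
}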
The source code of FileShuffleBlockResolver's forMapTask method is as follows:
def forMapTask(shuffleId: Int, mapId: Int, numBuckets: Int, serializer: Serializer,
    writeMetrics: ShuffleWriteMetrics): ShuffleWriterGroup = {
  new ShuffleWriterGroup {
    shuffleStates.putIfAbsent(shuffleId, new ShuffleState(numBuckets))
    private val shuffleState = shuffleStates(shuffleId)
    private var fileGroup: ShuffleFileGroup = null

    val openStartTime = System.nanoTime
    val serializerInstance = serializer.newInstance()
    val writers: Array[DiskBlockObjectWriter] = if (consolidateShuffleFiles) {
      fileGroup = getUnusedFileGroup()
      Array.tabulate[DiskBlockObjectWriter](numBuckets) { bucketId =>
        val blockId = ShuffleBlockId(shuffleId, mapId, bucketId)
        blockManager.getDiskWriter(blockId, fileGroup(bucketId), serializerInstance, bufferSize,
          writeMetrics)
      }
    } else {
      Array.tabulate[DiskBlockObjectWriter](numBuckets) { bucketId =>
        val blockId = ShuffleBlockId(shuffleId, mapId, bucketId)
        val blockFile = blockManager.diskBlockManager.getFile(blockId)
        // Because of previous failures, the shuffle file may already exist on this machine.
        // If so, remove it.
        if (blockFile.exists) {
          if (blockFile.delete()) {
            logInfo(s"Removed existing shuffle file $blockFile")
          } else {
            logWarning(s"Failed to remove existing shuffle file $blockFile")
          }
        }
        blockManager.getDiskWriter(blockId, blockFile, serializerInstance, bufferSize,
          writeMetrics)
      }
    }
    // Creating the file to write to and creating a disk writer both involve interacting with
    // the disk, so should be included in the shuffle write time.
    writeMetrics.incShuffleWriteTime(System.nanoTime - openStartTime)

    override def releaseWriters(success: Boolean) {
      if (consolidateShuffleFiles) {
        if (success) {
          val offsets = writers.map(_.fileSegment().offset)
          val lengths = writers.map(_.fileSegment().length)
          fileGroup.recordMapOutput(mapId, offsets, lengths)
        }
        recycleFileGroup(fileGroup)
      } else {
        shuffleState.completedMapTasks.add(mapId)
      }
    }

    private def getUnusedFileGroup(): ShuffleFileGroup = {
      val fileGroup = shuffleState.unusedFileGroups.poll()
      if (fileGroup != null) fileGroup else newFileGroup()
    }

    private def newFileGroup(): ShuffleFileGroup = {
      val fileId = shuffleState.nextFileId.getAndIncrement()
      val files = Array.tabulate[File](numBuckets) { bucketId =>
        val filename = physicalFileName(shuffleId, bucketId, fileId)
        blockManager.diskBlockManager.getFile(filename)
      }
      val fileGroup = new ShuffleFileGroup(shuffleId, fileId, files)
      shuffleState.allFileGroups.add(fileGroup)
      fileGroup
    }

    private def recycleFileGroup(group: ShuffleFileGroup) {
      shuffleState.unusedFileGroups.add(group)
    }
  }
}
Whether or not consolidateShuffleFiles is enabled, the data eventually has to be written, and writing goes through the BlockManager: blockManager.getDiskWriter writes the data to local disk, which is just ordinary I/O.
The getDiskWriter source in BlockManager.scala is as follows:
def getDiskWriter(
    blockId: BlockId,
    file: File,
    serializerInstance: SerializerInstance,
    bufferSize: Int,
    writeMetrics: ShuffleWriteMetrics): DiskBlockObjectWriter = {
  val compressStream: OutputStream => OutputStream = wrapForCompression(blockId, _)
  val syncWrites = conf.getBoolean("spark.shuffle.sync", false)
  new DiskBlockObjectWriter(blockId, file, serializerInstance, bufferSize, compressStream,
    syncWrites, writeMetrics)
}
Back in HashShuffleWriter.scala, note the parameters passed to shuffleBlockResolver.forMapTask: the first is the shuffleId, the second is the mapId, the third is the number of output splits, the fourth is the serializer, and the fifth is the metrics object used to record basic statistics.
private val shuffle = shuffleBlockResolver.forMapTask(dep.shuffleId, mapId, numOutputSplits, ser,
  writeMetrics)
So HashShuffleWriter.scala first calls forMapTask and then performs the writes. Take reduceByKey with local aggregation as an example: suppose one key has 10,000 values. Without aggregation we would have to write 10,000 records via objOut.writeKey(key) and objOut.writeValue(value); with local aggregation the 10,000 values are combined first, so only one record has to be written. The write loop is as follows:
for (elem <- iter) {
  val bucketId = dep.partitioner.getPartition(elem._1)
  shuffle.writers(bucketId).write(elem._1, elem._2)
}
Stepping into DiskBlockObjectWriter.scala, its write method is the disk-level write:
def write(key: Any, value: Any) {
  if (!initialized) {
    open()
  }

  objOut.writeKey(key)
  objOut.writeValue(value)
  recordWritten()
}
Looking again at shuffle.writers(bucketId).write(elem._1, elem._2) in HashShuffleWriter's write method: writers is indexed by the bucketId we pass in, which specifies exactly where the data should be written.
private[spark] trait ShuffleWriterGroup {
  val writers: Array[DiskBlockObjectWriter]

  /** @param success Indicates all writes were successful. If false, no blocks will be recorded. */
  def releaseWriters(success: Boolean)
}
Hash Shuffle keeps bucket buffers in memory and block files on local disk, so when tuning we need to pay attention both to memory usage and to disk I/O. Looking once more at the write loop in HashShuffleWriter.scala, shuffle.writers(bucketId).write(elem._1, elem._2) writes the data to the local file associated with that bucketId. The write call takes two arguments, elem._1 and elem._2: since iter yields the records themselves, elem._1 is the key and elem._2 is the value:
for (elem <- iter) {
  val bucketId = dep.partitioner.getPartition(elem._1)
  shuffle.writers(bucketId).write(elem._1, elem._2)
}
Following the getPartition call, records are dispatched to different bucketIds based on their keys. In Partitioner.scala, getPartition is an abstract method with no concrete implementation:
abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}
Now look at the getPartition method of the HashPartitioner class in Partitioner.scala, which takes the key as input.
Note that Spark's default parallelism is inherited from one Stage to the next: if the upstream Stage has 4 parallel tasks, the downstream Stage will also have 4.
The source code of HashPartitioner's getPartition method is as follows:
class HashPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }

  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }

  override def hashCode: Int = numPartitions
}
HashPartitioner's getPartition calls nonNegativeMod, which takes two parameters, the key's hashCode and the number of partitions numPartitions, and is just an ordinary modulo operation forced to be non-negative; a small worked example follows the snippet:
def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}
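A small worked example may help: Scala's % keeps the sign of the dividend, so a negative hashCode would produce a negative remainder, and nonNegativeMod folds it back into the valid bucket range. The snippet below is a standalone sketch that re-implements the same computation; it is not the Spark class itself.
// Standalone sketch that re-implements the same computation (not the Spark source).
object NonNegativeModDemo {
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  def main(args: Array[String]): Unit = {
    val numPartitions = 4
    // A key's hashCode may be negative, and % in Scala keeps the sign of the dividend:
    println(-7 % numPartitions)                // -3, not a valid bucket id
    println(nonNegativeMod(-7, numPartitions)) // -3 + 4 = 1, a valid bucket in [0, 4)
    println(nonNegativeMod(10, numPartitions)) // 10 % 4 = 2
  }
}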
Turning to HashShuffleReader.scala, the key method is read. It first creates a ShuffleBlockFetcherIterator, and here there is a very important tuning parameter, spark.reducer.maxSizeInFlight, which controls the maximum amount of data that may be in flight at one time; in Spark 1.5.2 the default is 48 MB. If you have enough memory and have allocated enough shuffle memory (shuffle takes 20% of the memory by default), you can try increasing this parameter, for example to 96 MB or even higher. The benefit of a larger value is fewer fetch rounds, because establishing new network connections is quite expensive. A configuration sketch follows.
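The following is a minimal sketch of raising spark.reducer.maxSizeInFlight; the value is illustrative, not a recommendation taken from the source, and the application name is a placeholder.
import org.apache.spark.SparkConf

// Illustrative value only: allow up to 96 MB of fetched data in flight
// (the Spark 1.5.2 default for spark.reducer.maxSizeInFlight is 48m).
object ShuffleReadTuning {
  def buildConf(): SparkConf = new SparkConf()
    .setAppName("ShuffleReadTuning")             // placeholder application name
    .set("spark.reducer.maxSizeInFlight", "96m") // fewer fetch rounds, more memory per reducer
}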
HashShuffleReader's read method then checks mapSideCombine to decide whether aggregation is needed, handling the aggregated and non-aggregated cases separately. On the reducer side, HashShuffleReader fetches the data from remote nodes and aggregates it once it arrives; whether the aggregated result is then grouped, reduced, or processed in some other way is up to the developer.
The source code of HashShuffleReader's read method is as follows:
override def read(): Iterator[Product2[K, C]] = {
  val blockFetcherItr = new ShuffleBlockFetcherIterator(
    context,
    blockManager.shuffleClient,
    blockManager,
    mapOutputTracker.getMapSizesByExecutorId(handle.shuffleId, startPartition),
    // Note: we use getSizeAsMb when no suffix is provided for backwards compatibility
    SparkEnv.get.conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024)

  // Wrap the streams for compression based on configuration
  val wrappedStreams = blockFetcherItr.map { case (blockId, inputStream) =>
    blockManager.wrapForCompression(blockId, inputStream)
  }

  val ser = Serializer.getSerializer(dep.serializer)
  val serializerInstance = ser.newInstance()

  // Create a key/value iterator for each stream
  val recordIter = wrappedStreams.flatMap { wrappedStream =>
    // Note: the asKeyValueIterator below wraps a key/value iterator inside of a
    // NextIterator. The NextIterator makes sure that close() is called on the
    // underlying InputStream when all records have been read.
    serializerInstance.deserializeStream(wrappedStream).asKeyValueIterator
  }

  // Update the context task metrics for each record read.
  val readMetrics = context.taskMetrics.createShuffleReadMetricsForDependency()
  val metricIter = CompletionIterator[(Any, Any), Iterator[(Any, Any)]](
    recordIter.map(record => {
      readMetrics.incRecordsRead(1)
      record
    }),
    context.taskMetrics().updateShuffleReadMetrics())

  // An interruptible iterator must be used here in order to support task cancellation
  val interruptibleIter = new InterruptibleIterator[(Any, Any)](context, metricIter)

  val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) {
    if (dep.mapSideCombine) {
      // We are reading values that are already combined
      val combinedKeyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, C)]]
      dep.aggregator.get.combineCombinersByKey(combinedKeyValuesIterator, context)
    } else {
      // We don't know the value type, but also don't care -- the dependency *should*
      // have made sure its compatible w/ this aggregator, which will convert the value
      // type to the combined type C
      val keyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, Nothing)]]
      dep.aggregator.get.combineValuesByKey(keyValuesIterator, context)
    }
  } else {
    require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
    interruptibleIter.asInstanceOf[Iterator[Product2[K, C]]]
  }

  // Sort the output if there is a sort ordering defined.
  dep.keyOrdering match {
    case Some(keyOrd: Ordering[K]) =>
      // Create an ExternalSorter to sort the data. Note that if spark.shuffle.spill is disabled,
      // the ExternalSorter won't spill to disk.
      val sorter = new ExternalSorter[K, C, C](ordering = Some(keyOrd), serializer = Some(ser))
      sorter.insertAll(aggregatedIter)
      context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled)
      context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled)
      context.internalMetricsToAccumulators(
        InternalAccumulator.PEAK_EXECUTION_MEMORY).add(sorter.peakMemoryUsedBytes)
      sorter.iterator
    case None =>
      aggregatedIter
  }
}
Since aggregation has come up, let us take a closer look at reduceByKey. Compared with Hadoop MapReduce it has a limitation: in Hadoop MapReduce, whatever the business logic is, both the map side and the reduce side can be fully customized and can implement completely different logic. On the other hand, reduceByKey has the advantage that it composes naturally with the operators of the previous Stage (the mapper-side operators) and with those of the next Stage (the actual reduce operator).
Look at the reduceByKey method in PairRDDFunctions.scala:
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKey[V]((v: V) => v, func, func, partitioner)
}
reduceByKey calls combineByKey internally.
In the combineByKey method:
The first parameter, `createCombiner`, is the so-called combiner; for example it can build a single-element list, turning a value of type V into type C.
The second parameter, `mergeValue`, appends an element to that list, merging a V into an existing C.
The third parameter, `mergeCombiners`, merges two C values into one.
The fourth parameter, `partitioner`, specifies the partitioner.
Note that mapSideCombine defaults to true, so aggregation happens on the mapper side by default, and the key type must not be an array. Seen from reduceByKey, the second and third parameters are the same function. A worked example follows the signature below.
def combineByKey[C](createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null): RDD[(K, C)] = self.withScope {
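To make the parameters concrete, here is a minimal sketch with made-up data that computes per-key averages with combineByKey; it also shows that reduceByKey is just the special case where createCombiner is the identity and mergeValue and mergeCombiners are the same function.
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch with made-up data: per-key average via combineByKey,
// plus reduceByKey as the special case.
object CombineByKeyDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CombineByKeyDemo").setMaster("local[2]"))
    val scores = sc.parallelize(Seq(("a", 10), ("a", 20), ("b", 5)))

    val sumCount = scores.combineByKey(
      (v: Int) => (v, 1),                                                 // createCombiner: V => C, C = (sum, count)
      (c: (Int, Int), v: Int) => (c._1 + v, c._2 + 1),                    // mergeValue: fold one more V into a C
      (c1: (Int, Int), c2: (Int, Int)) => (c1._1 + c2._1, c1._2 + c2._2)) // mergeCombiners: merge two partial Cs

    val averages = sumCount.mapValues { case (sum, count) => sum.toDouble / count }
    averages.collect().foreach(println) // (a,15.0), (b,5.0)

    scores.reduceByKey(_ + _).collect().foreach(println) // (a,30), (b,5)
    sc.stop()
  }
}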
Going back to the read method of HashShuffleReader.scala: fetching data on the reducer side requires network communication, so when does that communication actually happen? It is carried out by the ShuffleBlockFetcherIterator created in read:
final class ShuffleBlockFetcherIterator(
    context: TaskContext,
    shuffleClient: ShuffleClient,
    blockManager: BlockManager,
    blocksByAddress: Seq[(BlockManagerId, Seq[(BlockId, Long)])],
    maxBytesInFlight: Long)
  extends Iterator[(BlockId, InputStream)] with Logging {
Let us look at the initialize method of ShuffleBlockFetcherIterator.scala:
private[this] def initialize(): Unit = {
  // Add a task completion callback (called in both success case and failure case) to cleanup.
  context.addTaskCompletionListener(_ => cleanup())

  // Split local and remote blocks.
  val remoteRequests = splitLocalRemoteBlocks()
  // Add the remote requests into our queue in a random order
  fetchRequests ++= Utils.randomize(remoteRequests)

  // Send out initial requests for blocks, up to our maxBytesInFlight
  while (fetchRequests.nonEmpty &&
    (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
    sendRequest(fetchRequests.dequeue())
  }

  val numFetches = remoteRequests.size - fetchRequests.size
  logInfo("Started " + numFetches + " remote fetches in" + Utils.getUsedTimeMs(startTime))

  // Get Local Blocks
  fetchLocalBlocks()
  logDebug("Got local blocks in " + Utils.getUsedTimeMs(startTime))
}
The initialize method loops over the queued requests and sends them out to fetch data, keeping at most 48 MB (maxBytesInFlight) in flight at a time. Following the sendRequest(fetchRequests.dequeue()) call, let us look at the sendRequest method of ShuffleBlockFetcherIterator.scala:
private[this] def sendRequest(req: FetchRequest) {
  logDebug("Sending request for %d blocks (%s) from %s".format(
    req.blocks.size, Utils.bytesToString(req.size), req.address.hostPort))
  bytesInFlight += req.size

  // so we can look up the size of each blockID
  val sizeMap = req.blocks.map { case (blockId, size) => (blockId.toString, size) }.toMap
  val blockIds = req.blocks.map(_._1.toString)

  val address = req.address
  shuffleClient.fetchBlocks(address.host, address.port, address.executorId, blockIds.toArray,
    new BlockFetchingListener {
      override def onBlockFetchSuccess(blockId: String, buf: ManagedBuffer): Unit = {
        // Only add the buffer to results queue if the iterator is not zombie,
        // i.e. cleanup() has not been called yet.
        if (!isZombie) {
          // Increment the ref count because we need to pass this to a different thread.
          // This needs to be released after use.
          buf.retain()
          results.put(new SuccessFetchResult(BlockId(blockId), address, sizeMap(blockId), buf))
          shuffleMetrics.incRemoteBytesRead(buf.size)
          shuffleMetrics.incRemoteBlocksFetched(1)
        }
        logTrace("Got remote block " + blockId + " after " + Utils.getUsedTimeMs(startTime))
      }

      override def onBlockFetchFailure(blockId: String, e: Throwable): Unit = {
        logError(s"Failed to get block(s) from ${req.address.host}:${req.address.port}", e)
        results.put(new FailureFetchResult(BlockId(blockId), address, e))
      }
    }
  )
}
In sendRequest, the call shuffleClient.fetchBlocks(address.host, address.port, address.executorId, blockIds.toArray, ...) carries the host and port of the remote machine, the executorId, the blockIds, and so on; once the blocks arrive, the BlockFetchingListener processes the result.
Let us look at the fetchBlocks method of ShuffleClient.java, which fetches a sequence of blocks from a remote node asynchronously:
public abstract void fetchBlocks(
    String host,
    int port,
    String execId,
    String[] blockIds,
    BlockFetchingListener listener);
Following fetchBlocks down, the next level is BlockTransferService.scala, which still leaves the method without a concrete implementation:
override def fetchBlocks(
    host: String,
    port: Int,
    execId: String,
    blockIds: Array[String],
    listener: BlockFetchingListener): Unit
Continuing to the subclass of BlockTransferService, the concrete implementation is NettyBlockTransferService.scala. Netty does network communication based on the NIO model, and internet companies generally use Netty for communication between different processes. The point is that there is a communication framework underneath, and the data requests and transfers are built on top of it. The fetchBlocks source in NettyBlockTransferService.scala is as follows:
override def fetchBlocks(
    host: String,
    port: Int,
    execId: String,
    blockIds: Array[String],
    listener: BlockFetchingListener): Unit = {
  logTrace(s"Fetch blocks from $host:$port (executor id $execId)")
  try {
    val blockFetchStarter = new RetryingBlockFetcher.BlockFetchStarter {
      override def createAndStart(blockIds: Array[String], listener: BlockFetchingListener) {
        val client = clientFactory.createClient(host, port)
        new OneForOneBlockFetcher(client, appId, execId, blockIds.toArray, listener).start()
      }
    }

    val maxRetries = transportConf.maxIORetries()
    if (maxRetries > 0) {
      // Note this Fetcher will correctly handle maxRetries == 0; we avoid it just in case there's
      // a bug in this code. We should remove the if statement once we're sure of the stability.
      new RetryingBlockFetcher(transportConf, blockFetchStarter, blockIds, listener).start()
    } else {
      blockFetchStarter.createAndStart(blockIds, listener)
    }
  } catch {
    case e: Exception =>
      logError("Exception while beginning fetchBlocks", e)
      blockIds.foreach(listener.onBlockFetchFailure(_, e))
  }
}
In fetchBlocks, maxRetries is the maximum number of retries, and createAndStart is where the underlying Netty work actually happens:
……
val blockFetchStarter = new RetryingBlockFetcher.BlockFetchStarter {
  ……
  val maxRetries = transportConf.maxIORetries()
  ……
  override def createAndStart(blockIds: Array[String], listener: BlockFetchingListener) {
    val client = clientFactory.createClient(host, port)
    new OneForOneBlockFetcher(client, appId, execId, blockIds.toArray, listener).start()
  }
The call new OneForOneBlockFetcher(client, appId, execId, blockIds.toArray, listener).start() kicks off the communication. Let us look at the start method of OneForOneBlockFetcher.java, which performs the RPC:
public void start() {
  if (blockIds.length == 0) {
    throw new IllegalArgumentException("Zero-sized blockIds array");
  }

  client.sendRpc(openMessage.toByteArray(), new RpcResponseCallback() {
    @Override
    public void onSuccess(byte[] response) {
      try {
        streamHandle = (StreamHandle) BlockTransferMessage.Decoder.fromByteArray(response);
        logger.trace("Successfully opened blocks {}, preparing to fetch chunks.", streamHandle);

        // Immediately request all chunks -- we expect that the total size of the request is
        // reasonable due to higher level chunking in [[ShuffleBlockFetcherIterator]].
        for (int i = 0; i < streamHandle.numChunks; i++) {
          client.fetchChunk(streamHandle.streamId, i, chunkCallback);
        }
      } catch (Exception e) {
        logger.error("Failed while starting block fetches after success", e);
        failRemainingBlocks(blockIds, e);
      }
    }

    @Override
    public void onFailure(Throwable e) {
      logger.error("Failed while starting block fetches", e);
      failRemainingBlocks(blockIds, e);
    }
  });
}
If you dig further into the fetch path, a HashMap-style data structure is generally used to aggregate the fetched data.
To summarize:
1. Storing and fetching shuffle data is just ordinary Scala and Java programming, with the familiar idea of buffering in memory and then writing to disk. The difference is that Spark is a distributed system, so the tasks have to cooperate with, or rather are controlled by, the Driver: when writing, they tell the Driver where the data was written, and when the next stage needs to read, it asks the Driver, which tells it exactly where the data lives. The actual reading goes through the Netty-based RPC framework and is just basic I/O.
2. If the reducer side runs out of memory and has to spill to disk, the cost is doubled. When the mapper side runs out of memory it writes to disk anyway, and it only writes once; but if the reducer side spills to disk, it has to read that data back from disk again when computing. A very important tuning knob here is to increase the shuffle memory a little: shuffle memory takes 20% by default and can be raised to around 30%, but it cannot be too large either, because persist also needs room, and the space left for persisted data would otherwise shrink. A configuration sketch follows.
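As a closing illustration of that last tuning point, the following is a minimal sketch assuming the legacy Spark 1.x memory model; the exact fractions are illustrative only, and the application name is a placeholder.
import org.apache.spark.SparkConf

// Illustrative values for the legacy Spark 1.x memory model discussed above.
object ShuffleMemoryTuning {
  def buildConf(): SparkConf = new SparkConf()
    .setAppName("ShuffleMemoryTuning")          // placeholder application name
    .set("spark.shuffle.memoryFraction", "0.3") // default 0.2: memory used for shuffle aggregation
    .set("spark.storage.memoryFraction", "0.5") // leave room for persisted (cached) RDDs
}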