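Every computation runs inside a task on an executor of the corresponding machine, so writing the result also starts on the executor side. As the earlier articles showed, a Task is ultimately executed in the executor's TaskRunner. When the computation finishes and resultSize is larger than what Akka can transfer in one message, the result is stored in a block, and the driver's TaskScheduler later fetches that block from the executor's BlockManager. The executor's TaskRunner stores the data with this code: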
val serializedResult: ByteBuffer = {
  if (maxResultSize > 0 && resultSize > maxResultSize) {
    logWarning(s"Finished $taskName (TID $taskId). Result is larger than maxResultSize " +
      s"(${Utils.bytesToString(resultSize)} > ${Utils.bytesToString(maxResultSize)}), " +
      s"dropping it.")
    ser.serialize(new IndirectTaskResult[Any](TaskResultBlockId(taskId), resultSize))
  } else if (resultSize >= akkaFrameSize - AkkaUtils.reservedSizeBytes) {
    val blockId = TaskResultBlockId(taskId)
    env.blockManager.putBytes(
      blockId, serializedDirectResult, StorageLevel.MEMORY_AND_DISK_SER)
    logInfo(
      s"Finished $taskName (TID $taskId). $resultSize bytes result sent via BlockManager)")
    ser.serialize(new IndirectTaskResult[Any](blockId, resultSize))
  } else {
    logInfo(s"Finished $taskName (TID $taskId). $resultSize bytes result sent to driver")
    serializedDirectResult
  }
}

execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)
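As the code shows, a small result is returned to the driver directly. A large result is placed into a block through
env.blockManager.putBytes(
  blockId, serializedDirectResult, StorageLevel.MEMORY_AND_DISK_SER)
and waits there for the driver to fetch it. This calls the blockManager's putBytes method directly (this env belongs to the slave side, so the blockManager is the slave-side one as well). Here is putBytes: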
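The three-way decision in TaskRunner can be sketched on its own. This is a simplified model, not Spark code: the object ResultRouting, the Route type, and the parameter names frameSize and reservedBytes are hypothetical stand-ins for the values used in the snippet above.

```scala
// Simplified model of the executor-side decision. The types here are
// hypothetical stand-ins for illustration, not Spark classes.
object ResultRouting {
  sealed trait Route
  case object Dropped extends Route         // larger than maxResultSize: only a reference is sent
  case object ViaBlockManager extends Route // too large for one message: stored as a block
  case object Direct extends Route          // small enough: sent inline with the status update

  def route(resultSize: Long, maxResultSize: Long, frameSize: Long, reservedBytes: Long): Route =
    if (maxResultSize > 0 && resultSize > maxResultSize) Dropped
    else if (resultSize >= frameSize - reservedBytes) ViaBlockManager
    else Direct
}
```

Note that a maxResultSize of 0 disables the first check, so an oversized result then falls through to the block path.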
def putBytes(
    blockId: BlockId,
    bytes: ByteBuffer,
    level: StorageLevel,
    tellMaster: Boolean = true,
    effectiveStorageLevel: Option[StorageLevel] = None): Seq[(BlockId, BlockStatus)] = {
  require(bytes != null, "Bytes is null")
  doPut(blockId, ByteBufferValues(bytes), level, tellMaster, effectiveStorageLevel)
}
It simply delegates to doPut. That method is quite long and worth reading slowly; here are the key parts:
private def doPut(
    blockId: BlockId,
    data: BlockValues,
    level: StorageLevel,
    tellMaster: Boolean = true,
    effectiveStorageLevel: Option[StorageLevel] = None)
  : Seq[(BlockId, BlockStatus)] = {
  ...
  val updatedBlocks = new ArrayBuffer[(BlockId, BlockStatus)]

  // Get the BlockInfo: reuse it if it already exists, otherwise create a new one
  val putBlockInfo = {
    val tinfo = new BlockInfo(level, tellMaster)
    // Do atomically !
    val oldBlockOpt = blockInfo.putIfAbsent(blockId, tinfo)
    if (oldBlockOpt.isDefined) {
      if (oldBlockOpt.get.waitForReady()) {
        logWarning(s"Block $blockId already exists on this machine; not re-adding it")
        return updatedBlocks
      }
      // TODO: So the block info exists - but previous attempt to load it (?) failed.
      // What do we do now ? Retry on it ?
      oldBlockOpt.get
    } else {
      tinfo
    }
  }
  ...
  putBlockInfo.synchronized {
    ...
    // Based on the storage level, decide between memoryStore, diskStore, or externalBlockStore
    val (returnValues, blockStore: BlockStore) = {
      if (putLevel.useMemory) {
        // Put it in memory first, even if it also has useDisk set to true;
        // We will drop it to disk later if the memory store can't hold it.
        (true, memoryStore)
      } else if (putLevel.useOffHeap) {
        // Use external block store
        (false, externalBlockStore)
      } else if (putLevel.useDisk) {
        // Don't get back the bytes from put unless we replicate them
        (putLevel.replication > 1, diskStore)
      } else {
        assert(putLevel == StorageLevel.NONE)
        throw new BlockException(
          blockId, s"Attempted to put block $blockId without specifying storage level!")
      }
    }
    ...
    // Depending on the value type, store the data in memory or on disk
    val result = data match {
      case IteratorValues(iterator) =>
        blockStore.putIterator(blockId, iterator, putLevel, returnValues)
      case ArrayValues(array) =>
        blockStore.putArray(blockId, array, putLevel, returnValues)
      case ByteBufferValues(bytes) =>
        bytes.rewind()
        blockStore.putBytes(blockId, bytes, putLevel)
    ...
    val putBlockStatus = getCurrentBlockStatus(blockId, putBlockInfo)
    if (putBlockStatus.storageLevel != StorageLevel.NONE) {
      // Now that the block is in either the memory, externalBlockStore, or disk store,
      // let other threads read it, and tell the master about it.
      marked = true
      putBlockInfo.markReady(size)
      if (tellMaster) {
        reportBlockStatus(blockId, putBlockInfo, putBlockStatus)
      }
      updatedBlocks += ((blockId, putBlockStatus))
    }
    ...
  }
  ...
}
As the code shows, doPut calls the putBytes method (or putIterator / putArray) of whichever store was selected.
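The store selection inside doPut depends only on a few flags of the storage level, so it can be isolated in a small sketch. The Level case class and the Store objects below are hypothetical reductions written for illustration; they are not Spark's StorageLevel or BlockStore types.

```scala
object StoreSelection {
  // Hypothetical, reduced storage level: only the fields the selection inspects.
  final case class Level(useMemory: Boolean, useOffHeap: Boolean, useDisk: Boolean, replication: Int = 1)

  sealed trait Store
  case object Memory extends Store
  case object External extends Store
  case object Disk extends Store

  // Returns (returnValues, store), following the same priority order as doPut:
  // memory first, then off-heap, then disk.
  def choose(level: Level): (Boolean, Store) =
    if (level.useMemory) (true, Memory)
    else if (level.useOffHeap) (false, External)
    else if (level.useDisk) (level.replication > 1, Disk)
    else throw new IllegalArgumentException("no storage level specified")
}
```

Under this model, a level with both useMemory and useDisk set still goes to the memory store first, which matches the comment in doPut about dropping to disk later.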
Let's take memoryStore as the example and look at its putBytes method:
override def putBytes(blockId: BlockId, _bytes: ByteBuffer, level: StorageLevel): PutResult = {
  // Work on a duplicate - since the original input might be used elsewhere.
  val bytes = _bytes.duplicate()
  bytes.rewind()
  // For a serialized level such as MEMORY_ONLY_SER this flag is false; from the StorageLevel definition:
  // val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  if (level.deserialized) {
    val values = blockManager.dataDeserialize(blockId, bytes)
    // Store the data
    putIterator(blockId, values, level, returnValues = true)
  } else {
    val droppedBlocks = new ArrayBuffer[(BlockId, BlockStatus)]
    // Store the data
    tryToPut(blockId, bytes, bytes.limit, deserialized = false, droppedBlocks)
    PutResult(bytes.limit(), Right(bytes.duplicate()), droppedBlocks)
  }
}
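As shown, with a serialized in-memory level (deserialized = false, as in MEMORY_ONLY_SER) the data is stored through tryToPut. Here is what that method does: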
private def tryToPut(
    blockId: BlockId,
    value: Any,
    size: Long,
    deserialized: Boolean,
    droppedBlocks: mutable.Buffer[(BlockId, BlockStatus)]): Boolean = {
  tryToPut(blockId, () => value, size, deserialized, droppedBlocks)
}

private def tryToPut(
    blockId: BlockId,
    value: () => Any,
    size: Long,
    deserialized: Boolean,
    droppedBlocks: mutable.Buffer[(BlockId, BlockStatus)]): Boolean = {

  /* TODO: Its possible to optimize the locking by locking entries only when selecting blocks
   * to be dropped. Once the to-be-dropped blocks have been selected, and lock on entries has
   * been released, it must be ensured that those to-be-dropped blocks are not double counted
   * for freeing up more space for another block that needs to be put. Only then the actually
   * dropping of blocks (and writing to disk if necessary) can proceed in parallel. */

  memoryManager.synchronized {
    // Note: if we have previously unrolled this block successfully, then pending unroll
    // memory should be non-zero. This is the amount that we already reserved during the
    // unrolling process. In this case, we can just reuse this space to cache our block.
    // The synchronization on `memoryManager` here guarantees that the release and acquire
    // happen atomically. This relies on the assumption that all memory acquisitions are
    // synchronized on the same lock.
    releasePendingUnrollMemoryForThisTask()
    // Check whether there is enough memory
    val enoughMemory = memoryManager.acquireStorageMemory(blockId, size, droppedBlocks)
    if (enoughMemory) {
      // We acquired enough memory for the block, so go ahead and put it
      // With enough memory, store it directly; putting it into entries is what "in memory" means
      val entry = new MemoryEntry(value(), size, deserialized)
      entries.synchronized {
        entries.put(blockId, entry)
      }
      val valuesOrBytes = if (deserialized) "values" else "bytes"
      logInfo("Block %s stored as %s in memory (estimated size %s, free %s)".format(
        blockId, valuesOrBytes, Utils.bytesToString(size), Utils.bytesToString(blocksMemoryUsed)))
    } else {
      // Tell the block manager that we couldn't put it in memory so that it can drop it to
      // disk if the block allows disk storage.
      // Without enough memory, check whether the block may go to disk;
      // with a serialized level, deserialized is false
      lazy val data = if (deserialized) {
        Left(value().asInstanceOf[Array[Any]])
      } else {
        Right(value().asInstanceOf[ByteBuffer].duplicate())
      }
      val droppedBlockStatus = blockManager.dropFromMemory(blockId, () => data)
      droppedBlockStatus.foreach { status => droppedBlocks += ((blockId, status)) }
    }
    enoughMemory
  }
}
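To make "being in memory just means being in entries" concrete, here is a minimal sketch of a size-bounded in-memory store. It is an illustration under simplifying assumptions: the TinyMemoryStore object, its Store class, and the String block ids are hypothetical, and eviction, unroll memory, and the drop-to-disk path are all left out.

```scala
import java.nio.ByteBuffer

object TinyMemoryStore {
  // Hypothetical store: a map of block id -> bytes plus simple size accounting.
  final class Store(maxBytes: Long) {
    private val entries = new java.util.LinkedHashMap[String, ByteBuffer]()
    private var used = 0L

    // Returns true if the block fit in memory. Duplicate ids are rejected
    // to keep the accounting simple (an assumption of this sketch).
    def tryToPut(blockId: String, bytes: ByteBuffer): Boolean = entries.synchronized {
      val size = bytes.limit().toLong
      if (entries.containsKey(blockId) || used + size > maxBytes) {
        false
      } else {
        entries.put(blockId, bytes.duplicate())
        used += size
        true
      }
    }

    // Hands out a duplicate so callers cannot move the stored buffer's position.
    def get(blockId: String): Option[ByteBuffer] = entries.synchronized {
      Option(entries.get(blockId)).map(_.duplicate())
    }
  }
}
```

In this sketch a failed put simply returns false; the real method instead asks the block manager to drop the block to disk when the storage level allows it.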
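At this point, if memory was sufficient, the data sits in entries. Next, how is data read back from the BlockManager, for example when the driver needs to pull a result back from the executor? On the driver side the entry point is statusUpdate in TaskSchedulerImpl, which calls: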
taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
This is the method that actually fetches the data. Let's look at it:
def enqueueSuccessfulTask(
    taskSetManager: TaskSetManager, tid: Long, serializedData: ByteBuffer) {
  getTaskResultExecutor.execute(new Runnable {
    override def run(): Unit = Utils.logUncaughtExceptions {
      try {
        val (result, size) = serializer.get().deserialize[TaskResult[_]](serializedData) match {
          case directResult: DirectTaskResult[_] =>
            if (!taskSetManager.canFetchMoreResults(serializedData.limit())) {
              return
            }
            // deserialize "value" without holding any lock so that it won't block other threads.
            // We should call it here, so that when it's called again in
            // "TaskSetManager.handleSuccessfulTask", it does not need to deserialize the value.
            directResult.value()
            (directResult, serializedData.limit())
          case IndirectTaskResult(blockId, size) =>
            if (!taskSetManager.canFetchMoreResults(size)) {
              // dropped by executor if size is larger than maxResultSize
              sparkEnv.blockManager.master.removeBlock(blockId)
              return
            }
            logDebug("Fetching indirect task result for TID %s".format(tid))
            scheduler.handleTaskGettingResult(taskSetManager, tid)
            val serializedTaskResult = sparkEnv.blockManager.getRemoteBytes(blockId)
            if (!serializedTaskResult.isDefined) {
              /* We won't be able to get the task result if the machine that ran the task failed
               * between when the task ended and when we tried to fetch the result, or if the
               * block manager had to flush the result. */
              scheduler.handleFailedTask(
                taskSetManager, tid, TaskState.FINISHED, TaskResultLost)
              return
            }
            val deserializedResult = serializer.get().deserialize[DirectTaskResult[_]](
              serializedTaskResult.get)
            sparkEnv.blockManager.master.removeBlock(blockId)
            (deserializedResult, size)
        }

        result.metrics.setResultSize(size)
        scheduler.handleSuccessfulTask(taskSetManager, tid, result)
      } catch {
        case cnf: ClassNotFoundException =>
          val loader = Thread.currentThread.getContextClassLoader
          taskSetManager.abort("ClassNotFound with classloader: " + loader)
        // Matching NonFatal so we don't catch the ControlThrowable from the "return" above.
        case NonFatal(ex) =>
          logError("Exception while getting task result", ex)
          taskSetManager.abort("Exception while getting task result: %s".format(ex))
      }
    }
  })
}
As you can see, when the result is an IndirectTaskResult, it is fetched from the blockManager by blockId:
val serializedTaskResult = sparkEnv.blockManager.getRemoteBytes(blockId)
The blockManager here should still be the driver-side BlockManager. Now look at the getRemoteBytes method:
def getRemoteBytes(blockId: BlockId): Option[ByteBuffer] = {
  logDebug(s"Getting remote block $blockId as bytes")
  doGetRemote(blockId, asBlockResult = false).asInstanceOf[Option[ByteBuffer]]
}

private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): Option[Any] = {
  require(blockId != null, "BlockId is null")
  val locations = Random.shuffle(master.getLocations(blockId))
  var numFetchFailures = 0
  for (loc <- locations) {
    logDebug(s"Getting remote block $blockId from $loc")
    val data = try {
      blockTransferService.fetchBlockSync(
        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
    } catch {
      case NonFatal(e) =>
        numFetchFailures += 1
        if (numFetchFailures == locations.size) {
          // An exception is thrown while fetching this block from all locations
          throw new BlockFetchException(s"Failed to fetch block from" +
            s" ${locations.size} locations. Most recent failure cause:", e)
        } else {
          // This location failed, so we retry fetch from a different one by returning null here
          logWarning(s"Failed to fetch remote block $blockId " +
            s"from $loc (failed attempt $numFetchFailures)", e)
          null
        }
    }

    if (data != null) {
      if (asBlockResult) {
        return Some(new BlockResult(
          dataDeserialize(blockId, data),
          DataReadMethod.Network,
          data.limit()))
      } else {
        return Some(data)
      }
    }
    logDebug(s"The value of block $blockId is null")
  }
  logDebug(s"Block $blockId not found")
  None
}
First, it uses:
val locations = Random.shuffle(master.getLocations(blockId))
to get all locations of the blockId from the master, and then tries those locations one by one. Each fetch is done with:
blockTransferService.fetchBlockSync(
  loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
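This pulls the data back from the executor through the blockTransferService and exposes it as an NIO ByteBuffer. That is roughly the whole path for storing a result and fetching it back. The walkthrough used the serialized in-memory case as the example; the other storage paths can be understood by tracing the same flow.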
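The retry logic of doGetRemote (shuffle the locations, try each one, and fail only when every location has thrown) can be sketched independently of Spark. The object RemoteFetch, the function fetchFromAny, and its fetch parameter are hypothetical names for this illustration; fetch stands in for the network call and may throw or return None.

```scala
import scala.util.Random
import scala.util.control.NonFatal

object RemoteFetch {
  // Tries the locations in random order and returns the first successful fetch.
  // Throws only if every location failed with an exception; returns None if
  // no location had the data.
  def fetchFromAny[A](locations: Seq[String], rng: Random = new Random())(
      fetch: String => Option[A]): Option[A] = {
    val shuffled = rng.shuffle(locations)
    val it = shuffled.iterator
    var failures = 0
    var result: Option[A] = None
    while (result.isEmpty && it.hasNext) {
      val loc = it.next()
      result =
        try fetch(loc)
        catch {
          case NonFatal(e) =>
            failures += 1
            if (failures == shuffled.size) {
              throw new RuntimeException(s"Failed to fetch from ${shuffled.size} locations", e)
            }
            None // this location failed, fall through to the next one
        }
    }
    result
  }
}
```

Shuffling spreads the fetch load across replicas, and counting failures lets the caller distinguish "block not found" (None) from "every location was unreachable" (an exception).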