(1)发生时机:
与基于Tuple的key操作相关,包括 reduceByKey / groupByKey / sortByKey / countByKey / join / cogroup 等算子
(2)特点:
[1] 在Spark早期版本中,bucket缓存非常重要:需要将一个ShuffleMapTask的所有数据全部写入内存之后,才会刷新到磁盘。但这也引发了一个问题:若Map一端数据过多,很容易造成内存溢出(OOM)。所以在之后的版本中进行了优化:默认bucket缓存是100KB,当已写入的数据达到阈值后,就将内存中的数据一点点地刷新到磁盘,避免了容易发生的OOM;但若缓存设置过小,又会引发过多的磁盘I/O操作(参数设置可参考下方的示意代码)。
[2] 与MapReduce完全不一样的是:MapReduce必须在Map阶段将所有数据写入本地磁盘文件后,才能启动Reduce来拉取数据,因为MR要实现默认的按Key排序,只有先得到所有数据才能排序。Spark默认情况下不会对数据排序,因此ShuffleMapTask每写入一点数据,ResultTask就可以拉取一点数据,然后在本地执行自定义的聚合函数和算子计算。优点是速度快;缺点是数据不像MR那样天然按Key排好序,使用上不如MR的计算模型方便。
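下面是一个最小的示意代码(假设已有可用的集群环境,输入路径为演示用占位,参数值仅为示意):reduceByKey这类基于key的算子会产生ShuffleDependency从而触发Shuffle;同时展示了如何调整上文提到的map端shuffle写缓冲(参数名与默认值随版本有变化,较早版本为 spark.shuffle.file.buffer.kb)。

```scala
import org.apache.spark.{SparkConf, SparkContext}

// 示意:调整 map 端 shuffle 写缓冲大小(值仅为演示,非推荐配置)
val conf = new SparkConf()
  .setAppName("shuffle-demo")
  .set("spark.shuffle.file.buffer", "64k")
val sc = new SparkContext(conf)

// reduceByKey 等基于 key 的算子会触发 Shuffle,map 端先做本地聚合,再按 key 分发到 reduce 端
val wordCounts = sc.textFile("hdfs:///tmp/input") // 路径为演示用占位
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

wordCounts.take(10).foreach(println)
```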
(3)默认的Shuffle:每个ShuffleMapTask都会为每个ResultTask单独创建一个本地文件,文件总数为 ShuffleMapTask数量 × ResultTask数量,Task较多时会在本地磁盘上产生大量小文件。
(4)优化后的Shuffle
在Spark较新版本中引入了consolidation机制,即ShuffleGroup的概念:第一批ShuffleMapTask将数据写入n个本地文件(n = ResultTask数量 × CPU core数量),之后批次的ShuffleMapTask不再新建文件,而是直接将数据写入之前已创建的对应本地文件中,相当于多个ShuffleMapTask的输出在一组共享文件中合并,从而减少本地磁盘上的文件数量(文件数对比见下方的示意估算)。
【防裂说明】假设:当前节点有2个CPU core,运行了4个ShuffleMapTask,==因此只能同时并行其中的2个(即分成2批、每批2个来运行)==
<1>并行的ShuffleMapTask,写入的文件一定是不同的:当一批并行ShuffleMapTask完成后,新一批的ShuffleMapTask启动并行时,使用consolidation机制复用上一批内存缓存和文件。
<2>一个ShuffleGroup中的每个文件,都存储了多个ShuffleMapTask的数据,每个ShuffleMapTask的数据称为一个Segment。此外,还会通过一些索引来标记每个ShuffleMapTask的输出在ShuffleBlockFile中的索引及偏移量,以区分不同ShuffleMapTask的数据。
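按上面的假设(2个CPU core、4个ShuffleMapTask),再假设下游有3个ResultTask,可以用下面这段小代码粗略估算consolidation前后的本地shuffle文件数(纯示意的算术,不是Spark API):

```scala
// 示意性估算:consolidation 前后本地 shuffle 文件数(均为假设值)
val numMapTasks    = 4 // 本节点上运行的 ShuffleMapTask 总数
val numCores       = 2 // 可同时并行的 CPU core 数
val numResultTasks = 3 // 下游 ResultTask 数量(假设)

// 优化前:每个 ShuffleMapTask 各写一组文件 -> 4 * 3 = 12 个文件
val filesWithoutConsolidation = numMapTasks * numResultTasks
// 优化后:每个 core 上的多批 ShuffleMapTask 复用同一组文件 -> 2 * 3 = 6 个文件
val filesWithConsolidation = numCores * numResultTasks

println(s"优化前 $filesWithoutConsolidation 个文件,优化后 $filesWithConsolidation 个文件")
```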
(5)相关源码:
<1> ShuffleWrite相关:HashShuffleWriter / FileShuffleBlockManager 【注:源码Spark2.3 与Spark1.6完全不同】
<2> ShuffleRead入口:ShuffledRDD.compute,具体拉取逻辑在HashShuffleReader / ShuffleBlockFetcherIterator中【注:源码Spark2.3与Spark1.6完全不同】
<3>调优参数:spark.reducer.maxSizeInFlight(对应代码中的maxBytesInFlight,控制reduce端同时“在途”拉取的数据量上限)
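下面给出一个设置该调优参数的最小示意(假设在构造SparkConf时设置,数值仅为演示;默认值随版本可能不同):

```scala
import org.apache.spark.SparkConf

// 示意:调大 reduce 端单次“在途”拉取数据量的上限(默认约 48m,此处仅为演示值)
val conf = new SparkConf()
  .set("spark.reducer.maxSizeInFlight", "96m")
```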
override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
  val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
  SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
    .read()
    .asInstanceOf[Iterator[(K, C)]]
}
/**
 * 拉取的关键组件ShuffleBlockFetcherIterator,开始拉取ResultTask对应的多份数据
 */
private[this] def initialize(): Unit = {
  // Add a task completion callback (called in both success case and failure case) to cleanup.
  context.addTaskCompletionListener(_ => cleanup())
  // Split local and remote blocks.
  val remoteRequests = splitLocalRemoteBlocks()
  // Add the remote requests into our queue in a random order
  fetchRequests ++= Utils.randomize(remoteRequests)
  assert((0 == reqsInFlight) == (0 == bytesInFlight),
    "expected reqsInFlight = 0 but found reqsInFlight = " + reqsInFlight +
    ", expected bytesInFlight = 0 but found bytesInFlight = " + bytesInFlight)
  // Send out initial requests for blocks, up to our maxBytesInFlight
  fetchUpToMaxBytes()
  val numFetches = remoteRequests.size - fetchRequests.size
  logInfo("Started " + numFetches + " remote fetches in" + Utils.getUsedTimeMs(startTime))
  // Get Local Blocks 即数据本地化
  fetchLocalBlocks()
  logDebug("Got local blocks in " + Utils.getUsedTimeMs(startTime))
}
/**
* 循环往复,只要发现还有数据没有拉取完,就发送请求到远程去拉取数据
*/
private def fetchUpToMaxBytes(): Unit = {
  // Send fetch requests up to maxBytesInFlight(由 spark.reducer.maxSizeInFlight 调优参数控制).
  // If you cannot fetch from a remote host immediately, defer the request until the next time
  // it can be processed.
  // Process any outstanding deferred fetch requests if possible.
  if (deferredFetchRequests.nonEmpty) {
    for ((remoteAddress, defReqQueue) <- deferredFetchRequests) {
      while (isRemoteBlockFetchable(defReqQueue) &&
          !isRemoteAddressMaxedOut(remoteAddress, defReqQueue.front)) {
        val request = defReqQueue.dequeue()
        logDebug(s"Processing deferred fetch request for $remoteAddress with "
          + s"${request.blocks.length} blocks")
        send(remoteAddress, request)
        if (defReqQueue.isEmpty) {
          deferredFetchRequests -= remoteAddress
        }
      }
    }
  }

  // Process any regular fetch requests if possible.
  while (isRemoteBlockFetchable(fetchRequests)) {
    val request = fetchRequests.dequeue()
    val remoteAddress = request.address
    if (isRemoteAddressMaxedOut(remoteAddress, request)) {
      logDebug(s"Deferring fetch request for $remoteAddress with ${request.blocks.size} blocks")
      val defReqQueue = deferredFetchRequests.getOrElse(remoteAddress, new Queue[FetchRequest]())
      defReqQueue.enqueue(request)
      deferredFetchRequests(remoteAddress) = defReqQueue
    } else {
      send(remoteAddress, request) // 复杂的实现,去远程获取数据
    }
  }
}
【防裂说明】每个节点上都有BlockManager,包括几个关键组件:
组件 | 功能 |
---|---|
DiskStore | 负责对磁盘上的数据进行读写 |
MemoryStore | 负责对内存中的数据进行读写 |
BlockTransferService | 负责对远程其他节点的BlockManager管理的数据读写 |
ConnectionManager | 负责建立当前BlockManager到远程其他节点的BlockManager的网络连接 |
(1)BlockManagerMaster负责管理各个节点上的BlockManager“内部管理的数据”的元数据,比如增删改的变更操作都会维护在这里。
(2)每个BlockManager创建初始化之后,首先会向BlockManagerMaster进行注册,此时BlockManagerMaster会为其创建对应的BlockManagerInfo。
(3)使用BlockManager“写”操作时,例如RDD运行过程中的一些数据,或手动指定了persist(),优先会将数据写入内存中;当内存不够时,会使用自己的算法将内存中的部分数据写入磁盘。此外,如果persist()指定了replica,那么会使用BlockTransferService将数据replicate一份到其他节点的BlockManager上。
(4)使用BlockManager“读”操作时,例如ShuffleRead,如果能从本地读取数据,就利用DiskStore或MemoryStore从本地读取;如果本地没有数据,则会使用ConnectionManager向数据所在节点的BlockManager建立连接,然后使用BlockTransferService从远程BlockManager读取数据。
(5)只要使用了BlockManager进行增删改操作,就必须将Block的BlockStatus上报至BlockManagerMaster,修改其内部对应的BlockManagerInfo的BlockStatus元数据。
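结合上面的读写流程,下面给一个最小示意(假设已有SparkContext `sc`,输入路径为演示用占位):指定带副本的StorageLevel后,BlockManager在本地写入Block之后,还会通过BlockTransferService把Block复制到其他节点的BlockManager。

```scala
import org.apache.spark.storage.StorageLevel

// 假设 sc 已创建;路径仅为演示用占位
val rdd = sc.textFile("hdfs:///tmp/input").map(line => (line.length, line))

// MEMORY_AND_DISK_2 中的 "_2" 表示 2 份副本:
// 本地 BlockManager 写入后,会把 Block 再 replicate 到另一个节点的 BlockManager
rdd.persist(StorageLevel.MEMORY_AND_DISK_2)

rdd.count() // 第一次触发计算,Block 写入 BlockManager,并上报 BlockStatus 给 BlockManagerMaster
rdd.count() // 第二次直接从本地(必要时远程)BlockManager 读取已缓存的 Block
```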
【源码】
(1)BlockManagerMaster.scala中只是定义了一些操作入口,BlockManagerMasterEndpoint才是真正起作用的组件
/**
* BlockManagerMasterEndpoint is an [[ThreadSafeRpcEndpoint]] on the master node to track statuses
* of all slaves' block managers.
*/
private[spark]
class BlockManagerMasterEndpoint(
    override val rpcEnv: RpcEnv,
val isLocal: Boolean,
conf: SparkConf,
listenerBus: LiveListenerBus)
extends ThreadSafeRpcEndpoint with Logging {
// Mapping from block manager id to the block manager's information.
// BlockManagerMaster要负责维护每个BlockManager的BlockManagerInfo
private val blockManagerInfo = new mutable.HashMap[BlockManagerId, BlockManagerInfo]
// Mapping from executor ID to block manager ID.
// 每个Executor与一个BlockManager相关联
private val blockManagerIdByExecutor = new mutable.HashMap[String, BlockManagerId]
// Mapping from block id to the set of block managers that have the block.
private val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]]
private val askThreadPool = ThreadUtils.newDaemonCachedThreadPool("block-manager-ask-thread-pool")
// ... ...
/**
* 注册BlockManager
* Returns the BlockManagerId with topology information populated, if available.
*/
private def register(
    idWithoutTopologyInfo: BlockManagerId,
    maxOnHeapMemSize: Long,
    maxOffHeapMemSize: Long,
    slaveEndpoint: RpcEndpointRef): BlockManagerId = {
  // the dummy id is not expected to contain the topology information.
  // we get that info here and respond back with a more fleshed out block manager id
  val id = BlockManagerId(
    idWithoutTopologyInfo.executorId,
    idWithoutTopologyInfo.host,
    idWithoutTopologyInfo.port,
    topologyMapper.getTopologyForHost(idWithoutTopologyInfo.host))
  val time = System.currentTimeMillis()
  // 检查已注册的元数据,没有则新注册
  if (!blockManagerInfo.contains(id)) {
    // 根据BlockManager对应的ExecutorId找到对应的BlockManagerInfo
    // 安全判断:如果blockManagerInfo map里没有该BlockManagerId,则同步的blockManagerIdByExecutor map里也必须没有
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(oldId) =>
        // A block manager of the same executor already exists, so remove it (assumed dead)
        logError("Got two different block manager registrations on same executor - "
          + s" will replace old one $oldId with new one $id")
        removeExecutor(id.executorId) // blockManagerIdByExecutor.get(execId).foreach(removeBlockManager)
      case None =>
    }
    logInfo("Registering block manager %s with %s RAM, %s".format(
      id.hostPort, Utils.bytesToString(maxOnHeapMemSize + maxOffHeapMemSize), id))
    // 往blockManagerIdByExecutor map中保存一份executorId到blockManagerId的映射
    blockManagerIdByExecutor(id.executorId) = id
    // 为blockManagerId创建一份BlockManagerInfo,并往blockManagerInfo map中保存一份ID -> INFO的映射
    blockManagerInfo(id) = new BlockManagerInfo(
      id, System.currentTimeMillis(), maxOnHeapMemSize, maxOffHeapMemSize, slaveEndpoint)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(time, id, maxOnHeapMemSize + maxOffHeapMemSize,
    Some(maxOnHeapMemSize), Some(maxOffHeapMemSize)))
  id
}
private def removeBlockManager(blockManagerId: BlockManagerId) {
  // 尝试根据blockManagerId获取对应的BlockManagerInfo
  val info = blockManagerInfo(blockManagerId)
  // Remove the block manager from blockManagerIdByExecutor.
  blockManagerIdByExecutor -= blockManagerId.executorId
  // Remove it from blockManagerInfo and remove all the blocks.
  blockManagerInfo.remove(blockManagerId)
  val iterator = info.blocks.keySet.iterator
  // 遍历BlockManagerInfo内部所有Block对应的BlockId
  while (iterator.hasNext) {
    // 清空BlockManagerInfo内部的Block的BlockStatus
    val blockId = iterator.next
    val locations = blockLocations.get(blockId)
    locations -= blockManagerId
    // De-register the block if none of the block managers have it. Otherwise, if pro-active
    // replication is enabled, and a block is either an RDD or a test block (the latter is used
    // for unit testing), we send a message to a randomly chosen executor location to replicate
    // the given block. Note that we ignore other block types (such as broadcast/shuffle blocks
    // etc.) as replication doesn't make much sense in that context.
    if (locations.size == 0) {
      blockLocations.remove(blockId)
      logWarning(s"No more replicas available for $blockId !")
    } else if (proactivelyReplicate && (blockId.isRDD || blockId.isInstanceOf[TestBlockId])) {
      // As a heuristic, assume single executor failure to find out the number of replicas that
      // existed before failure
      val maxReplicas = locations.size + 1
      val i = (new Random(blockId.hashCode)).nextInt(locations.size)
      val blockLocations = locations.toSeq
      val candidateBMId = blockLocations(i)
      blockManagerInfo.get(candidateBMId).foreach { bm =>
        val remainingLocations = locations.toSeq.filter(bm => bm != candidateBMId)
        val replicateMsg = ReplicateBlock(blockId, remainingLocations, maxReplicas)
        bm.slaveEndpoint.ask[Boolean](replicateMsg)
      }
    }
  }
  listenerBus.post(SparkListenerBlockManagerRemoved(System.currentTimeMillis(), blockManagerId))
  logInfo(s"Removing block manager $blockManagerId")
}
/**
 * 更新BlockInfo:每个BlockManager上,如果Block发生变化,都要发送updateBlockInfo请求至BlockManagerMaster进行更新
*/
private def updateBlockInfo(
    blockManagerId: BlockManagerId,
    blockId: BlockId,
    storageLevel: StorageLevel,
    memSize: Long,
    diskSize: Long): Boolean = {
  if (!blockManagerInfo.contains(blockManagerId)) {
    if (blockManagerId.isDriver && !isLocal) {
      // We intentionally do not register the master (except in local mode),
      // so we should not indicate failure.
      return true
    } else {
      return false
    }
  }
  if (blockId == null) {
    blockManagerInfo(blockManagerId).updateLastSeenMs()
    return true
  }
  blockManagerInfo(blockManagerId).updateBlockInfo(blockId, storageLevel, memSize, diskSize)
  // 每一个Block可能会在多个BlockManager上
  // 因为如果StorageLevel设置为_2这种级别,就需要将Block复制一份副本,放到其他BlockManager上
  // blockLocations map 保存了每个blockId对应的BlockManagerId的Set(自动去重的多个)
  var locations: mutable.HashSet[BlockManagerId] = null
  if (blockLocations.containsKey(blockId)) {
    locations = blockLocations.get(blockId)
  } else {
    locations = new mutable.HashSet[BlockManagerId]
    blockLocations.put(blockId, locations)
  }
  if (storageLevel.isValid) {
    locations.add(blockManagerId)
  } else {
    locations.remove(blockManagerId)
  }
  // Remove the block from master tracking if it has been removed on all slaves.
  if (locations.size == 0) {
    blockLocations.remove(blockId)
  }
  true
}
// ... ...
}
/**
* 每一个BlockManager的元数据结构BlockManagerInfo
*/
private[spark] class BlockManagerInfo(
val blockManagerId: BlockManagerId,
timeMs: Long,
val maxOnHeapMem: Long,
val maxOffHeapMem: Long,
val slaveEndpoint: RpcEndpointRef)
extends Logging {
val maxMem = maxOnHeapMem + maxOffHeapMem
  private var _lastSeenMs: Long = timeMs
  private var _remainingMem: Long = maxMem
// Mapping from block id to its status.
private val _blocks = new JHashMap[BlockId, BlockStatus]
// Cached blocks held by this BlockManager. This does not include broadcast blocks.
private val _cachedBlocks = new mutable.HashSet[BlockId]
}
@DeveloperApi
case class BlockStatus(storageLevel: StorageLevel, memSize: Long, diskSize: Long) {
def isCached: Boolean = memSize + diskSize > 0
}
def updateBlockInfo(
    blockId: BlockId,
    storageLevel: StorageLevel,
    memSize: Long,
    diskSize: Long) {
  updateLastSeenMs()
  val blockExists = _blocks.containsKey(blockId)
  var originalMemSize: Long = 0
  var originalDiskSize: Long = 0
  var originalLevel: StorageLevel = StorageLevel.NONE
  if (blockExists) {
    // The block exists on the slave already.
    val blockStatus: BlockStatus = _blocks.get(blockId)
    originalLevel = blockStatus.storageLevel
    originalMemSize = blockStatus.memSize
    originalDiskSize = blockStatus.diskSize
    // 判断如果storageLevel是基于内存,那么就给剩余内存数量加上当前的内存
    if (originalLevel.useMemory) {
      _remainingMem += originalMemSize
    }
  }
  if (storageLevel.isValid) {
    /* isValid means it is either stored in-memory or on-disk.
     * The memSize here indicates the data size in or dropped from memory,
     * externalBlockStoreSize here indicates the data size in or dropped from externalBlockStore,
     * and the diskSize here indicates the data size in or dropped to disk.
     * They can be both larger than 0, when a block is dropped from memory to disk.
     * Therefore, a safe way to set BlockStatus is to set its info in accurate modes. */
    var blockStatus: BlockStatus = null
    if (storageLevel.useMemory) {
      blockStatus = BlockStatus(storageLevel, memSize = memSize, diskSize = 0)
      _blocks.put(blockId, blockStatus)
      _remainingMem -= memSize
      if (blockExists) {
        logInfo(s"Updated $blockId in memory on ${blockManagerId.hostPort}" +
          s" (current size: ${Utils.bytesToString(memSize)}," +
          s" original size: ${Utils.bytesToString(originalMemSize)}," +
          s" free: ${Utils.bytesToString(_remainingMem)})")
      } else {
        logInfo(s"Added $blockId in memory on ${blockManagerId.hostPort}" +
          s" (size: ${Utils.bytesToString(memSize)}," +
          s" free: ${Utils.bytesToString(_remainingMem)})")
      }
    }
    if (storageLevel.useDisk) {
      blockStatus = BlockStatus(storageLevel, memSize = 0, diskSize = diskSize)
      _blocks.put(blockId, blockStatus)
      if (blockExists) {
        logInfo(s"Updated $blockId on disk on ${blockManagerId.hostPort}" +
          s" (current size: ${Utils.bytesToString(diskSize)}," +
          s" original size: ${Utils.bytesToString(originalDiskSize)})")
      } else {
        logInfo(s"Added $blockId on disk on ${blockManagerId.hostPort}" +
          s" (size: ${Utils.bytesToString(diskSize)})")
      }
    }
    if (!blockId.isBroadcast && blockStatus.isCached) {
      _cachedBlocks += blockId
    }
  } else if (blockExists) {
    // If isValid is not true, drop the block.
    _blocks.remove(blockId)
    _cachedBlocks -= blockId
    if (originalLevel.useMemory) {
      logInfo(s"Removed $blockId on ${blockManagerId.hostPort} in memory" +
        s" (size: ${Utils.bytesToString(originalMemSize)}," +
        s" free: ${Utils.bytesToString(_remainingMem)})")
    }
    if (originalLevel.useDisk) {
      logInfo(s"Removed $blockId on ${blockManagerId.hostPort} on disk" +
        s" (size: ${Utils.bytesToString(originalDiskSize)})")
    }
  }
}
/**
* Manager running on every node (driver and executors) which provides interfaces for putting and
* retrieving blocks both locally and remotely into various stores (memory, disk, and off-heap).
*
* Note that [[initialize()]] must be called before the BlockManager is usable.
*/
private[spark] class BlockManager(
executorId: String,
rpcEnv: RpcEnv,
val master: BlockManagerMaster,
val serializerManager: SerializerManager,
val conf: SparkConf,
memoryManager: MemoryManager,
mapOutputTracker: MapOutputTracker,
shuffleManager: ShuffleManager,
val blockTransferService: BlockTransferService,
securityManager: SecurityManager,
numUsableCores: Int)
extends BlockDataManager with BlockEvictionHandler with Logging {
private[spark] val externalShuffleServiceEnabled =
conf.getBoolean("spark.shuffle.service.enabled", false)
val diskBlockManager = {
  // Only perform cleanup if an external service is not serving our shuffle files.
  val deleteFilesOnStop =
    !externalShuffleServiceEnabled || executorId == SparkContext.DRIVER_IDENTIFIER
  new DiskBlockManager(conf, deleteFilesOnStop)
}
// Visible for testing
private[storage] val blockInfoManager = new BlockInfoManager
private val futureExecutionContext = ExecutionContext.fromExecutorService(
  ThreadUtils.newDaemonCachedThreadPool("block-manager-future", 128))
// Actual storage of where blocks are kept
private[spark] val memoryStore =
  new MemoryStore(conf, blockInfoManager, serializerManager, memoryManager, this)
private[spark] val diskStore = new DiskStore(conf, diskBlockManager, securityManager)
memoryManager.setMemoryStore(memoryStore)
// ... ...
/**
* Initializes the BlockManager with the given appId. This is not performed in the constructor as
* the appId may not be known at BlockManager instantiation time (in particular for the driver,
* where it is only learned after registration with the TaskScheduler).
*
* This method initializes the BlockTransferService and ShuffleClient, registers with the
* BlockManagerMaster, starts the BlockManagerWorker endpoint, and registers with a local shuffle
* service if configured.
*/
def initialize(appId: String): Unit = {
  // 首先初始化用于远程数据传输的BlockTransferService
  blockTransferService.init(this)
  shuffleClient.init(appId)
  blockReplicationPolicy = {
    val priorityClass = conf.get(
      "spark.storage.replication.policy", classOf[RandomBlockReplicationPolicy].getName)
    val clazz = Utils.classForName(priorityClass)
    val ret = clazz.newInstance.asInstanceOf[BlockReplicationPolicy]
    logInfo(s"Using $priorityClass for block replication policy")
    ret
  }
  // 为当前这个BlockManager创建唯一的BlockManagerId
  // 从BlockManagerId的初始化即可看出,一个BlockManager是通过一个节点上的一个Executor来唯一标识的
  val id =
    BlockManagerId(executorId, blockTransferService.hostName, blockTransferService.port, None)
  // 发送BlockManager的注册消息
  val idFromMaster = master.registerBlockManager(
    id,
    maxOnHeapMemory,
    maxOffHeapMemory,
    slaveEndpoint)
  blockManagerId = if (idFromMaster != null) idFromMaster else id
  shuffleServerId = if (externalShuffleServiceEnabled) {
    logInfo(s"external shuffle service port = $externalShuffleServicePort")
    BlockManagerId(executorId, blockTransferService.hostName, externalShuffleServicePort)
  } else {
    blockManagerId
  }
  // Register Executors' configuration with the local shuffle service, if one should exist.
  if (externalShuffleServiceEnabled && !blockManagerId.isDriver) {
    registerWithExternalShuffleServer()
  }
  logInfo(s"Initialized BlockManager: $blockManagerId")
}
/**
* Get block from the local block manager as serialized bytes.
*
* Must be called while holding a read lock on the block.
* Releases the read lock upon exception; keeps the read lock upon successful return.
*/
private def doGetLocalBytes(blockId: BlockId, info: BlockInfo): BlockData = {
  val level = info.level
  logDebug(s"Level for block $blockId is $level")
  // In order, try to read the serialized bytes from memory, then from disk, then fall back to
  // serializing in-memory objects, and, finally, throw an exception if the block does not exist.
  if (level.deserialized) {
    // Try to avoid expensive serialization by reading a pre-serialized copy from disk:
    if (level.useDisk && diskStore.contains(blockId)) {
      // Note: we purposely do not try to put the block back into memory here. Since this branch
      // handles deserialized blocks, this block may only be cached in memory as objects, not
      // serialized bytes. Because the caller only requested bytes, it doesn't make sense to
      // cache the block's deserialized objects since that caching may not have a payoff.
      // DiskStore底层使用Java NIO进行读写操作
      diskStore.getBytes(blockId)
    } else if (level.useMemory && memoryStore.contains(blockId)) {
      // The block was not found on disk, so serialize an in-memory copy:
      // 关键:MemoryStore中用entries维护Block在内存中的数据,即
      // private val entries = new LinkedHashMap[BlockId, MemoryEntry[_]](32, 0.75f, true),
      // 其getBytes / getValues 会对entries做多线程并发访问的同步
      new ByteBufferBlockData(serializerManager.dataSerializeWithExplicitClassTag(
        blockId, memoryStore.getValues(blockId).get, info.classTag), true)
    } else {
      handleLocalReadFailure(blockId)
    }
  } else { // storage level is serialized
    if (level.useMemory && memoryStore.contains(blockId)) {
      new ByteBufferBlockData(memoryStore.getBytes(blockId).get, false)
    } else if (level.useDisk && diskStore.contains(blockId)) {
      val diskData = diskStore.getBytes(blockId)
      maybeCacheDiskBytesInMemory(info, blockId, level, diskData)
        .map(new ByteBufferBlockData(_, false))
        .getOrElse(diskData)
    } else {
      handleLocalReadFailure(blockId)
    }
  }
}
/**
* Get block from remote block managers as serialized bytes.
*/
def getRemoteBytes(blockId: BlockId): Option[ChunkedByteBuffer] = {
  logDebug(s"Getting remote block $blockId")
  require(blockId != null, "BlockId is null")
  var runningFailureCount = 0
  var totalFailureCount = 0
  // Because all the remote blocks are registered in driver, it is not necessary to ask
  // all the slave executors to get block status.
  val locationsAndStatus = master.getLocationsAndStatus(blockId)
  val blockSize = locationsAndStatus.map { b =>
    b.status.diskSize.max(b.status.memSize)
  }.getOrElse(0L)
  val blockLocations = locationsAndStatus.map(_.locations).getOrElse(Seq.empty)
  // If the block size is above the threshold, we should pass our FileManager to
  // BlockTransferService, which will leverage it to spill the block; if not, then passed-in
  // null value means the block will be persisted in memory.
  val tempFileManager = if (blockSize > maxRemoteBlockToMem) {
    remoteBlockTempFileManager
  } else {
    null
  }
  val locations = sortLocations(blockLocations)
  val maxFetchFailures = locations.size
  var locationIterator = locations.iterator
  while (locationIterator.hasNext) {
    val loc = locationIterator.next()
    logDebug(s"Getting remote block $blockId from $loc")
    val data = try {
      blockTransferService.fetchBlockSync(
        loc.host, loc.port, loc.executorId, blockId.toString, tempFileManager).nioByteBuffer()
    } catch {
      case NonFatal(e) =>
        runningFailureCount += 1
        totalFailureCount += 1
        if (totalFailureCount >= maxFetchFailures) {
          // Give up trying anymore locations. Either we've tried all of the original locations,
          // or we've refreshed the list of locations from the master, and have still
          // hit failures after trying locations from the refreshed list.
          logWarning(s"Failed to fetch block after $totalFailureCount fetch failures. " +
            s"Most recent failure cause:", e)
          return None
        }
        logWarning(s"Failed to fetch remote block $blockId " +
          s"from $loc (failed attempt $runningFailureCount)", e)
        // If there is a large number of executors then locations list can contain a
        // large number of stale entries causing a large number of retries that may
        // take a significant amount of time. To get rid of these stale entries
        // we refresh the block locations after a certain number of fetch failures
        if (runningFailureCount >= maxFailuresBeforeLocationRefresh) {
          locationIterator = sortLocations(master.getLocations(blockId)).iterator
          logDebug(s"Refreshed locations from the driver " +
            s"after ${runningFailureCount} fetch failures.")
          runningFailureCount = 0
        }
        // This location failed, so we retry fetch from a different one by returning null here
        null
    }
    if (data != null) {
      return Some(new ChunkedByteBuffer(data))
    }
    logDebug(s"The value of block $blockId is null")
  }
  logDebug(s"Block $blockId not found")
  None
}
/**
* Put the given bytes according to the given level in one of the block stores, replicating
* the values if necessary.
*
* If the block already exists, this method will not overwrite it.
*
* '''Important!''' Callers must not mutate or release the data buffer underlying `bytes`. Doing
* so may corrupt or change the data stored by the `BlockManager`.
*
* @param keepReadLock if true, this method will hold the read lock when it returns (even if the
* block already exists). If false, this method will hold no locks when it
* returns.
* @return true if the block was already present or if the put succeeded, false otherwise.
*/
private def doPutBytes[T](
blockId: BlockId,
bytes: ChunkedByteBuffer,
level: StorageLevel,
classTag: ClassTag[T],
tellMaster: Boolean = true,
keepReadLock: Boolean = false): Boolean = {
doPut(blockId, level, classTag, tellMaster = tellMaster, keepReadLock = keepReadLock) { info =>
val startTimeMs = System.currentTimeMillis
// Since we're storing bytes, initiate the replication before storing them locally.
// This is faster as data is already serialized and ready to send.
    val replicationFuture = if (level.replication > 1) {
Future {
// This is a blocking action and should run in futureExecutionContext which is a cached
// thread pool. The ByteBufferBlockData wrapper is not disposed of to avoid releasing
// buffers that are owned by the caller.
replicate(blockId, new ByteBufferBlockData(bytes, false), level, classTag)
}(futureExecutionContext)
} else {
null
}
val size = bytes.size
if (level.useMemory) {
// Put it in memory first, even if it also has useDisk set to true;
// We will drop it to disk later if the memory store can't hold it.
val putSucceeded = if (level.deserialized) {
val values =
serializerManager.dataDeserializeStream(blockId, bytes.toInputStream())(classTag)
memoryStore.putIteratorAsValues(blockId, values, classTag) match {
case Right(_) => true
case Left(iter) =>
// If putting deserialized values in memory failed, we will put the bytes directly to
// disk, so we don't need this iterator and can close it to free resources earlier.
iter.close()
false
}
} else {
val memoryMode = level.memoryMode
memoryStore.putBytes(blockId, size, memoryMode, () => {
if (memoryMode == MemoryMode.OFF_HEAP &&
bytes.chunks.exists(buffer => !buffer.isDirect)) {
bytes.copy(Platform.allocateDirectBuffer)
} else {
bytes
}
})
}
if (!putSucceeded && level.useDisk) {
logWarning(s "Persisting block $blockId to disk instead." )
diskStore.putBytes(blockId, bytes)
}
} else if (level.useDisk) {
diskStore.putBytes(blockId, bytes)
}
val putBlockStatus = getCurrentBlockStatus(blockId, info)
val blockWasSuccessfullyStored = putBlockStatus.storageLevel.isValid
if (blockWasSuccessfullyStored) {
// Now that the block is in either the memory or disk store,
// tell the master about it.
info.size = size
if (tellMaster && info.tellMaster) {
reportBlockStatus(blockId, putBlockStatus)
}
addUpdatedBlockStatusToTaskMetrics(blockId, putBlockStatus)
}
logDebug( "Put block %s locally took %s" .format(blockId, Utils.getUsedTimeMs(startTimeMs)))
if (level.replication > 1) {
// Wait for asynchronous replication to finish
try {
ThreadUtils.awaitReady(replicationFuture, Duration.Inf)
} catch {
case NonFatal(t) =>
throw new Exception("Error occurred while waiting for replication to finish", t)
}
}
if (blockWasSuccessfullyStored) {
None
} else {
Some(bytes)
}
}.isEmpty
}
/**
* Put the given block according to the given level in one of the block stores, replicating
* the values if necessary.
*
* If the block already exists, this method will not overwrite it.
*
* @param keepReadLock if true, this method will hold the read lock when it returns (even if the
* block already exists). If false, this method will hold no locks when it
* returns.
* @return None if the block was already present or if the put succeeded, or Some(iterator)
* if the put failed.
*/
private def doPutIterator[T](
    blockId: BlockId,
    iterator: () => Iterator[T],
level: StorageLevel,
classTag: ClassTag[T],
tellMaster: Boolean = true,
keepReadLock: Boolean = false): Option[PartiallyUnrolledIterator[T]] = {
doPut(blockId, level, classTag, tellMaster = tellMaster, keepReadLock = keepReadLock) { info =>
val startTimeMs = System.currentTimeMillis
var iteratorFromFailedMemoryStorePut: Option[PartiallyUnrolledIterator[T]] = None
// Size of the block in bytes
var size = 0L
if (level.useMemory) {
// Put it in memory first, even if it also has useDisk set to true;
// We will drop it to disk later if the memory store can't hold it.
if (level.deserialized) {
memoryStore.putIteratorAsValues(blockId, iterator(), classTag) match {
case Right(s) =>
size = s
case Left(iter) =>
// Not enough space to unroll this block; drop to disk if applicable
if (level.useDisk) {
logWarning(s "Persisting block $blockId to disk instead." )
diskStore.put(blockId) { channel =>
val out = Channels.newOutputStream(channel)
serializerManager.dataSerializeStream(blockId, out, iter)(classTag)
}
size = diskStore.getSize(blockId)
} else {
iteratorFromFailedMemoryStorePut = Some(iter)
}
}
} else { // !level.deserialized
memoryStore.putIteratorAsBytes(blockId, iterator(), classTag, level.memoryMode) match {
case Right(s) =>
size = s
case Left(partiallySerializedValues) =>
// Not enough space to unroll this block; drop to disk if applicable
if (level.useDisk) {
logWarning(s "Persisting block $blockId to disk instead." )
diskStore.put(blockId) { channel =>
val out = Channels.newOutputStream(channel)
partiallySerializedValues.finishWritingToStream(out)
}
size = diskStore.getSize(blockId)
} else {
iteratorFromFailedMemoryStorePut = Some(partiallySerializedValues.valuesIterator)
}
}
}
} else if (level.useDisk) {
diskStore.put(blockId) { channel =>
val out = Channels.newOutputStream(channel)
serializerManager.dataSerializeStream(blockId, out, iterator())(classTag)
}
size = diskStore.getSize(blockId)
}
val putBlockStatus = getCurrentBlockStatus(blockId, info)
val blockWasSuccessfullyStored = putBlockStatus.storageLevel.isValid
if (blockWasSuccessfullyStored) {
// Now that the block is in either the memory or disk store, tell the master about it.
info.size = size
if (tellMaster && info.tellMaster) {
reportBlockStatus(blockId, putBlockStatus)
}
addUpdatedBlockStatusToTaskMetrics(blockId, putBlockStatus)
logDebug( "Put block %s locally took %s" .format(blockId, Utils.getUsedTimeMs(startTimeMs)))
if (level.replication > 1 ) {
val remoteStartTime = System.currentTimeMillis
val bytesToReplicate = doGetLocalBytes(blockId, info)
// [SPARK-16550] Erase the typed classTag when using default serialization, since
// NettyBlockRpcServer crashes when deserializing repl-defined classes.
        // TODO(ekl) remove this once the classloader issue on the remote end is fixed.
val remoteClassTag = if (!serializerManager.canUseKryo(classTag)) {
scala.reflect.classTag[Any]
} else {
classTag
}
try {
replicate(blockId, bytesToReplicate, level, remoteClassTag)
} finally {
bytesToReplicate.dispose()
}
logDebug( "Put block %s remotely took %s"
.format(blockId, Utils.getUsedTimeMs(remoteStartTime)))
}
}
assert(blockWasSuccessfullyStored == iteratorFromFailedMemoryStorePut.isEmpty)
iteratorFromFailedMemoryStorePut
}
}
// ... ...
}
/**
* Tracks metadata for an individual block.
*
* Instances of this class are _not_ thread-safe and are protected by locks in the
* [[BlockInfoManager]].
*
* @param level the block's storage level. This is the requested persistence level, not the
* effective storage level of the block (i.e. if this is MEMORY_AND_DISK, then this
* does not imply that the block is actually resident in memory).
* @param classTag the block's [[ClassTag]], used to select the serializer
* @param tellMaster whether state changes for this block should be reported to the master. This
* is true for most blocks, but is false for broadcast blocks.
*/
private[storage] class BlockInfo(
val level: StorageLevel,
val classTag: ClassTag[_],
val tellMaster: Boolean)
/**
* Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
* This should ''not'' be called by users directly, but is available for implementors of custom
* subclasses of RDD.
*/
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  // 若storageLevel不为NONE,也就是之前持久化过该RDD,那么就不会直接从父RDD执行算子来计算新的partition
  // 而是优先尝试通过BlockManager获取已持久化的数据
  if (storageLevel != StorageLevel.NONE) {
    getOrCompute(split, context)
  } else {
    // 否则,直接计算,或尝试从Checkpoint的持久化数据中读取
    computeOrReadCheckpoint(split, context)
  }
}
/**
* Gets or computes an RDD partition. Used by RDD.iterator() when an RDD is cached.
*/
private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
  val blockId = RDDBlockId(id, partition.index)
  var readCachedBlock = true
  // This method is called on executors, so we need call SparkEnv.get instead of sc.env.
  SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
    readCachedBlock = false
    computeOrReadCheckpoint(partition, context)
  }) match {
    case Left(blockResult) =>
      if (readCachedBlock) {
        val existingMetrics = context.taskMetrics().inputMetrics
        existingMetrics.incBytesRead(blockResult.bytes)
        new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
          override def next(): T = {
            existingMetrics.incRecordsRead(1)
            delegate.next()
          }
        }
      } else {
        new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
      }
    case Right(iter) =>
      new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
  }
}
/**
* Compute an RDD partition or read it from a checkpoint if the RDD is checkpointing.
*/
private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
{
  if (isCheckpointedAndMaterialized) {
    firstParent[T].iterator(split, context)
  } else {
    compute(split, context)
  }
}
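下面用一小段示意代码串一下上面三个方法的走向(假设已有SparkContext `sc`,checkpoint目录与输入路径均为演示用占位):persist过的RDD走getOrCompute,经由BlockManager.getOrElseUpdate读取或写入Block;checkpoint物化之后,computeOrReadCheckpoint会直接读取checkpoint数据而不再重算。

```scala
import org.apache.spark.storage.StorageLevel

// 假设 sc 已创建,目录/路径仅为演示用占位
sc.setCheckpointDir("hdfs:///tmp/ckpt")

val words = sc.textFile("hdfs:///tmp/input").flatMap(_.split("\\s+"))

val cached = words.persist(StorageLevel.MEMORY_ONLY)
cached.count() // storageLevel != NONE:走 getOrCompute,第一次未命中,计算后写入 BlockManager
cached.count() // 再次访问时 getOrElseUpdate 直接命中已缓存的 Block

val checkpointed = words.map(_.toUpperCase)
checkpointed.checkpoint()
checkpointed.count() // 触发计算并物化 checkpoint
checkpointed.count() // 之后 computeOrReadCheckpoint 直接读取 checkpoint 的数据
```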
/**
* Retrieve the given block if it exists, otherwise call the provided `makeIterator` method
* to compute the block, persist it, and return its values.
*
* @return either a BlockResult if the block was successfully cached, or an iterator if the block
* could not be cached.
*/
def getOrElseUpdate[T](
    blockId: BlockId,
    level: StorageLevel,
    classTag: ClassTag[T],
    makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
  // Attempt to read the block from local or remote storage. If it's present, then we don't need
  // to go through the local-get-or-put path.
  get[T](blockId)(classTag) match {
    case Some(block) =>
      return Left(block)
    case _ =>
      // Need to compute the block.
  }
  // Initially we hold no locks on this block.
  doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) match {
    case None =>
      // doPut() didn't hand work back to us, so the block already existed or was successfully
      // stored. Therefore, we now hold a read lock on the block.
      val blockResult = getLocalValues(blockId).getOrElse {
        // Since we held a read lock between the doPut() and get() calls, the block should not
        // have been evicted, so get() not returning the block indicates some internal error.
        releaseLock(blockId)
        throw new SparkException(s"get() failed for block $blockId even though we held a lock")
      }
      // We already hold a read lock on the block from the doPut() call and getLocalValues()
      // acquires the lock again, so we need to call releaseLock() here so that the net number
      // of lock acquisitions is 1 (since the caller will only call release() once).
      releaseLock(blockId)
      Left(blockResult)
    case Some(iter) =>
      // The put failed, likely because the data was too large to fit in memory and could not be
      // dropped to disk. Therefore, we need to pass the input iterator back to the caller so
      // that they can decide what to do with the values (e.g. process them without caching).
      Right(iter)
  }
}