To avoid the disk I/O that makes Hadoop's reading and writing of intermediate data a performance bottleneck, Spark preferentially keeps configuration information, intermediate computation results, and other data in memory, which greatly improves execution efficiency. This data can also be placed on disk or in an external storage system.
The block manager, BlockManager, is the core component of Spark's storage system. Both the driver and every executor create a BlockManager. Its primary constructor is shown below:
/**
* Manager running on every node (driver and executors) which provides interfaces for putting and
* retrieving blocks both locally and remotely into various stores (memory, disk, and off-heap).
*
* Note that #initialize() must be called before the BlockManager is usable.
*/
private[spark] class BlockManager(
executorId: String,
rpcEnv: RpcEnv,
val master: BlockManagerMaster,
defaultSerializer: Serializer,
val conf: SparkConf,
memoryManager: MemoryManager,
mapOutputTracker: MapOutputTracker,
shuffleManager: ShuffleManager,
blockTransferService: BlockTransferService,
securityManager: SecurityManager,
numUsableCores: Int)
extends BlockDataManager with Logging
The main components of BlockManager are:
1. BlockManagerMaster: the BlockManagerMaster on the driver manages all of the BlockManagers that live on the executors;
2. DiskBlockManager: the disk block manager;
3. blockInfo: a cache mapping each BlockId to its BlockInfo;
4. ExecutionContext: an ExecutionContext backed by a ThreadPoolExecutor; its threads are named with the prefix block-manager-future, and at most 128 threads can be created;
5. MemoryStore: memory storage, which keeps blocks in memory either as arrays of Java objects or as serialized ByteBuffers;
6. DiskStore: disk storage;
7. ExternalBlockStore: stores BlockManager blocks in external storage, internally managed through TachyonBlockManager;
8. ShuffleClient: the shuffle client; by default it is the BlockTransferService, but when spark.shuffle.service.enabled is set to true an external shuffle service is used instead;
9. BlockManagerSlaveEndpoint: a BlockManagerSlaveEndpoint is registered and a reference to it is returned (a NettyRpcEndpointRef in the default Netty mode);
10. metadataCleaner: the cleaner for non-broadcast blocks;
11. broadcastCleaner: the cleaner for broadcast blocks;
12. CompressionCodec: the compression codec implementation.
The corresponding fields are declared as follows:
val diskBlockManager = new DiskBlockManager(this, conf)
private val blockInfo = new TimeStampedHashMap[BlockId, BlockInfo]
private val futureExecutionContext = ExecutionContext.fromExecutorService(
  ThreadUtils.newDaemonCachedThreadPool("block-manager-future", 128))
// Actual storage of where blocks are kept
private var externalBlockStoreInitialized = false
private[spark] val memoryStore = new MemoryStore(this, memoryManager)
private[spark] val diskStore = new DiskStore(this, diskBlockManager)
private[spark] lazy val externalBlockStore: ExternalBlockStore = {
externalBlockStoreInitialized = true
new ExternalBlockStore(this, executorId)
}
memoryManager.setMemoryStore(memoryStore)
private[spark]
val externalShuffleServiceEnabled = conf.getBoolean("spark.shuffle.service.enabled", false)
// Client to read other executors' shuffle files. This is either an external service, or just the
// standard BlockTransferService to directly connect to other Executors.
private[spark] val shuffleClient = if (externalShuffleServiceEnabled) {
val transConf = SparkTransportConf.fromSparkConf(conf, "shuffle", numUsableCores)
new ExternalShuffleClient(transConf, securityManager, securityManager.isAuthenticationEnabled(),
securityManager.isSaslEncryptionEnabled())
} else {
blockTransferService
}
// Register a [[RpcEndpoint]] with a name and return its [[RpcEndpointRef]].
private val slaveEndpoint = rpcEnv.setupEndpoint(
"BlockManagerEndpoint" + BlockManager.ID_GENERATOR.next,
new BlockManagerSlaveEndpoint(rpcEnv, this, mapOutputTracker))
private val metadataCleaner = new MetadataCleaner(
MetadataCleanerType.BLOCK_MANAGER, this.dropOldNonBroadcastBlocks, conf)
private val broadcastCleaner = new MetadataCleaner(
MetadataCleanerType.BROADCAST_VARS, this.dropOldBroadcastBlocks, conf)
/* The compression codec to use. Note that the "lazy" val is necessary because we want to delay
* the initialization of the compression codec until it is first used. The reason is that a Spark
* program could be using a user-defined codec in a third party jar, which is loaded in
* Executor.updateDependencies. When the BlockManager is initialized, user level jars hasn't been
* loaded yet. */
private lazy val compressionCodec: CompressionCodec = CompressionCodec.createCodec(conf)
Before a BlockManager can be used it must be initialized, and this initialization cannot happen inside the constructor, because the application ID may not yet be known at construction time. The call sites of initialize are sketched after the code below.
/**
* Initializes the BlockManager with the given appId. This is not performed in the constructor as
* the appId may not be known at BlockManager instantiation time (in particular for the driver,
* where it is only learned after registration with the TaskScheduler).
*
* This method initializes the BlockTransferService and ShuffleClient, registers with the
* BlockManagerMaster, starts the BlockManagerWorker endpoint, and registers with a local shuffle
* service if configured.
*/
def initialize(appId: String): Unit = {
blockTransferService.init(this)
shuffleClient.init(appId) // the default shuffleClient is the BlockTransferService
blockManagerId = BlockManagerId(
executorId, blockTransferService.hostName, blockTransferService.port)
// If an external shuffle service is enabled, create a separate BlockManagerId for it; otherwise reuse this BlockManager's BlockManagerId
shuffleServerId = if (externalShuffleServiceEnabled) {
logInfo(s"external shuffle service port = $externalShuffleServicePort")
BlockManagerId(executorId, blockTransferService.hostName, externalShuffleServicePort)
} else {
blockManagerId
}
// Register the BlockManagerId with the BlockManagerMaster
master.registerBlockManager(blockManagerId, maxMemory, slaveEndpoint)
// If an external shuffle service is enabled and this is an executor's BlockManager, also register with the local external shuffle service
// Register Executors' configuration with the local shuffle service, if one should exist.
if (externalShuffleServiceEnabled && !blockManagerId.isDriver) {
registerWithExternalShuffleServer()
}
}
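Roughly, the two call sites in Spark 1.x look as follows (a simplified, non-runnable illustration of where initialize is invoked; the surrounding start-up code is omitted):
// Driver side: SparkContext initializes the BlockManager only after the
// application id has been obtained from the TaskScheduler (simplified).
_env.blockManager.initialize(_applicationId)
// Executor side: the Executor initializes its BlockManager with the app id
// carried in SparkConf (simplified, non-local mode).
env.blockManager.initialize(conf.getAppId)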
The dropFromMemory method is used to evict a given block from memory when memory is insufficient. Its processing steps are:
1. Look up the blockId in blockInfo; if it is present, continue, otherwise return None;
2. Check whether the block can be dropped (that is, whether its write completed successfully); if not, return None;
3. Check whether the block is still registered; if it has already been dropped, return None;
4. Get the block's StorageLevel; if the StorageLevel allows disk storage and the DiskStore does not yet contain the block, call DiskStore's putArray or putBytes to write the block to disk;
5. Remove the block from the MemoryStore;
6. Call getCurrentBlockStatus to obtain the block's latest status; if the block's tellMaster attribute is true, call reportBlockStatus to report the status to the BlockManagerMasterEndpoint;
7. If the block was not kept on disk, remove the blockId from blockInfo;
8. Return the block's status.
/**
* Drop a block from memory, possibly putting it on disk if applicable. Called when the memory
* store reaches its limit and needs to free up space.
*
* If `data` is not put on disk, it won't be created.
*
* Return the block status if the given block has been updated, else None.
*/
def dropFromMemory(
blockId: BlockId,
data: () => Either[Array[Any], ByteBuffer]): Option[BlockStatus] = {
logInfo(s"Dropping block $blockId from memory")
val info = blockInfo.get(blockId).orNull
// If the block has not already been dropped
if (info != null) {
info.synchronized {
// required ? As of now, this will be invoked only for blocks which are ready
// But in case this changes in future, adding for consistency sake.
if (!info.waitForReady()) {
// If we get here, the block write failed.
logWarning(s"Block $blockId was marked as failure. Nothing to drop")
return None
} else if (blockInfo.get(blockId).isEmpty) {
logWarning(s"Block $blockId was already dropped.")
return None
}
var blockIsUpdated = false
val level = info.level
// Drop to disk, if storage level requires
if (level.useDisk && !diskStore.contains(blockId)) {
logInfo(s"Writing block $blockId to disk")
data() match {
case Left(elements) =>
diskStore.putArray(blockId, elements, level, returnValues = false)
case Right(bytes) =>
diskStore.putBytes(blockId, bytes, level)
}
blockIsUpdated = true
}
// Actually drop from memory store
val droppedMemorySize =
if (memoryStore.contains(blockId)) memoryStore.getSize(blockId) else 0L
val blockIsRemoved = memoryStore.remove(blockId)
if (blockIsRemoved) {
blockIsUpdated = true
} else {
logWarning(s"Block $blockId could not be dropped from memory as it does not exist")
}
val status = getCurrentBlockStatus(blockId, info)
if (info.tellMaster) {
reportBlockStatus(blockId, info, status, droppedMemorySize)
}
if (!level.useDisk) {
// The block is completely gone from this node; forget it so we can put() it again later.
blockInfo.remove(blockId)
}
if (blockIsUpdated) {
return Some(status)
}
}
}
None
}
The reportBlockStatus method reports a block's status to the BlockManagerMasterEndpoint and re-registers the BlockManager when necessary. Its processing steps are:
1. Call tryToReportBlockStatus, which invokes BlockManagerMaster's updateBlockInfo method to send a message to the BlockManagerMasterEndpoint updating the block's memory usage, disk usage, storage level, and related information.
2. If this BlockManager has not yet registered with the BlockManagerMasterEndpoint, call asyncReregister to register it.
/**
* Tell the master about the current storage status of a block. This will send a block update
* message reflecting the current status, *not* the desired storage level in its block info.
* For example, a block with MEMORY_AND_DISK set might have fallen out to be only on disk.
*
* droppedMemorySize exists to account for when the block is dropped from memory to disk (so
* it is still valid). This ensures that update in master will compensate for the increase in
* memory on slave.
*/
private def reportBlockStatus(
blockId: BlockId,
info: BlockInfo,
status: BlockStatus,
droppedMemorySize: Long = 0L): Unit = {
val needReregister = !tryToReportBlockStatus(blockId, info, status, droppedMemorySize)
if (needReregister) {
logInfo(s"Got told to re-register updating block $blockId")
// Re-registering will report our new block for free.
asyncReregister()
}
logDebug(s"Told master about block $blockId")
}
/**
* Actually send a UpdateBlockInfo message. Returns the master's response,
* which will be true if the block was successfully recorded and false if
* the slave needs to re-register.
*/
private def tryToReportBlockStatus(
blockId: BlockId,
info: BlockInfo,
status: BlockStatus,
droppedMemorySize: Long = 0L): Boolean = {
if (info.tellMaster) {
val storageLevel = status.storageLevel
val inMemSize = Math.max(status.memSize, droppedMemorySize)
val inExternalBlockStoreSize = status.externalBlockStoreSize
val onDiskSize = status.diskSize
master.updateBlockInfo(
blockManagerId, blockId, storageLevel, inMemSize, onDiskSize, inExternalBlockStoreSize)
} else {
true
}
}
The putSingle method writes a block consisting of a single object into the storage system; it delegates to putIterator:
/**
* Write a block consisting of a single object.
*/
def putSingle(
blockId: BlockId,
value: Any,
level: StorageLevel,
tellMaster: Boolean = true): Seq[(BlockId, BlockStatus)] = {
putIterator(blockId, Iterator(value), level, tellMaster)
}
def putIterator(
blockId: BlockId,
values: Iterator[Any],
level: StorageLevel,
tellMaster: Boolean = true,
effectiveStorageLevel: Option[StorageLevel] = None): Seq[(BlockId, BlockStatus)] = {
require(values != null, "Values is null")
doPut(blockId, IteratorValues(values), level, tellMaster, effectiveStorageLevel)
}
The putBytes method writes a block made up of serialized bytes into the storage system. A usage sketch of these put methods follows the code.
/**
* Put a new block of serialized bytes to the block manager.
* Return a list of blocks updated as a result of this put.
*/
def putBytes(
blockId: BlockId,
bytes: ByteBuffer,
level: StorageLevel,
tellMaster: Boolean = true,
effectiveStorageLevel: Option[StorageLevel] = None): Seq[(BlockId, BlockStatus)] = {
require(bytes != null, "Bytes is null")
doPut(blockId, ByteBufferValues(bytes), level, tellMaster, effectiveStorageLevel)
}
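A hedged usage sketch of the put/get methods is shown below. BlockManager is private[spark], so code like this would live inside the org.apache.spark package (for example in a test suite); the block id and payload are illustrative.
import org.apache.spark.SparkEnv
import org.apache.spark.storage.{StorageLevel, TestBlockId}

// Obtain the local BlockManager from the current SparkEnv.
val bm = SparkEnv.get.blockManager
val id = TestBlockId("demo")

// Store a single object with MEMORY_AND_DISK, then read it back.
bm.putSingle(id, Array.fill(1000)(1), StorageLevel.MEMORY_AND_DISK)
bm.get(id) match {
  case Some(result) => println(s"read ${result.bytes} bytes via ${result.readMethod}")
  case None => println("block not found")
}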
The doPut method is where a block is actually written into the storage system. It decides whether the block goes to memory, disk, or an external storage system and, for fault tolerance, replicates the data to other nodes. A simplified sketch of this control flow is given below.
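The full doPut method is long, so it is not reproduced here. The following is a minimal, self-contained sketch of the decision it makes, using stand-in types rather than Spark's internal APIs; the real method also handles effectiveStorageLevel overrides, pending-put bookkeeping, asynchronous replication futures, and failure cleanup.
// Stand-in types for illustration only.
case class PutLevel(useMemory: Boolean, useOffHeap: Boolean, useDisk: Boolean, replication: Int)

// doPut first picks the primary store according to the storage level:
// memory is preferred, then external (off-heap) storage, then disk.
def choosePrimaryStore(level: PutLevel): String =
  if (level.useMemory) "MemoryStore"
  else if (level.useOffHeap) "ExternalBlockStore"
  else if (level.useDisk) "DiskStore"
  else "none"

def doPutSketch(blockId: String, level: PutLevel, tellMaster: Boolean): Unit = {
  println(s"putting $blockId into ${choosePrimaryStore(level)}")
  if (level.replication > 1) {
    // For fault tolerance the serialized bytes are also replicated to peer BlockManagers.
    println(s"replicating $blockId to ${level.replication - 1} peer(s)")
  }
  // Finally the new BlockStatus is computed and, when tellMaster is true,
  // reported to the BlockManagerMasterEndpoint via reportBlockStatus.
  if (tellMaster) println(s"reporting status of $blockId to the master")
}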
The getDiskWriter method creates a DiskBlockObjectWriter. The spark.shuffle.sync property determines whether writes are synchronous; by default they are asynchronous.
/**
* A short circuited method to get a block writer that can write data directly to disk.
* The Block will be appended to the File specified by filename. Callers should handle error
* cases.
*/
def getDiskWriter(
blockId: BlockId,
file: File,
serializerInstance: SerializerInstance,
bufferSize: Int,
writeMetrics: ShuffleWriteMetrics): DiskBlockObjectWriter = {
val compressStream: OutputStream => OutputStream = wrapForCompression(blockId, _)
val syncWrites = conf.getBoolean("spark.shuffle.sync", false)
new DiskBlockObjectWriter(file, serializerInstance, bufferSize, compressStream,
syncWrites, writeMetrics, blockId)
}
The getBlockData method retrieves a block's data from the local node. Its processing is as follows:
1. If the block is the output of a ShuffleMapTask, the intermediate results of multiple partitions were written into the same file; the ShuffleBlockResolver obtained from the shuffle manager (for sort-based shuffle, IndexShuffleBlockResolver) handles this case through its getBlockData method;
2. If the block is the output of a ResultTask, the doGetLocal method is used to read the local intermediate result.
/**
* Interface to get local block data. Throws an exception if the block cannot be found or
* cannot be read successfully.
*/
override def getBlockData(blockId: BlockId): ManagedBuffer = {
if (blockId.isShuffle) {
shuffleManager.shuffleBlockResolver.getBlockData(blockId.asInstanceOf[ShuffleBlockId])
} else {
val blockBytesOpt = doGetLocal(blockId, asBlockResult = false)
.asInstanceOf[Option[ByteBuffer]]
if (blockBytesOpt.isDefined) {
val buffer = blockBytesOpt.get
new NioManagedBuffer(buffer)
} else {
throw new BlockNotFoundException(blockId.toString)
}
}
}
When a reduce task runs on the same node as the map task, no remote fetch is needed; the doGetLocal method retrieves the intermediate results locally, from memory, disk, or external storage. A sketch of the lookup order that the elided branches follow appears after the code.
private def doGetLocal(blockId: BlockId, asBlockResult: Boolean): Option[Any] = {
val info = blockInfo.get(blockId).orNull
if (info != null) {
info.synchronized {
if (blockInfo.get(blockId).isEmpty) {
logWarning(s"Block $blockId had been removed")
return None
}
// If another thread is writing the block, wait for it to become ready.
if (!info.waitForReady()) {
// If we get here, the block write failed.
logWarning(s"Block $blockId was marked as failure.")
return None
}
val level = info.level
logDebug(s"Level for block $blockId is $level")
// Look for the block in memory
if (level.useMemory) {
...
}
// Look for the block in external block store
if (level.useOffHeap) {
...
}
// Look for block on disk, potentially storing it back in memory if required
if (level.useDisk) {
...
}
}
} else {
logDebug(s"Block $blockId not registered locally")
}
None
}
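The elided branches check the stores in a fixed order. Below is a small self-contained sketch of that order (illustrative only; the real branches return a BlockResult or ByteBuffer, and a block read from disk may be re-cached into memory when the level also allows memory).
case class GetLevel(useMemory: Boolean, useOffHeap: Boolean, useDisk: Boolean)

def lookupOrder(level: GetLevel): Seq[String] = Seq(
  level.useMemory  -> "MemoryStore",
  level.useOffHeap -> "ExternalBlockStore",
  level.useDisk    -> "DiskStore (result may be re-cached into MemoryStore)"
).collect { case (enabled, store) if enabled => store }

// For MEMORY_AND_DISK the order is MemoryStore first, then DiskStore:
println(lookupOrder(GetLevel(useMemory = true, useOffHeap = false, useDisk = true)))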
The doGetRemote method fetches block data from a remote node. Its processing steps are:
1. Send a GetLocations message to the BlockManagerMasterEndpoint to obtain the BlockManagerIds that hold the block. If the block is replicated more than once, several BlockManagerIds are returned; they are shuffled so that the data is not always fetched from the same BlockManager.
2. Iterate over the shuffled locations and fetch the block synchronously through blockTransferService.fetchBlockSync; if a fetch fails, try the next location, and throw a BlockFetchException only after all locations have failed.
The relevant code is as follows:
/** Get locations of the blockId from the driver */
def getLocations(blockId: BlockId): Seq[BlockManagerId] = {
driverEndpoint.askWithRetry[Seq[BlockManagerId]](GetLocations(blockId))
}
private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): Option[Any] = {
require(blockId != null, "BlockId is null")
val locations = Random.shuffle(master.getLocations(blockId))
var numFetchFailures = 0
for (loc <- locations) {
logDebug(s"Getting remote block $blockId from $loc")
val data = try {
blockTransferService.fetchBlockSync(
loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
} catch {
case NonFatal(e) =>
numFetchFailures += 1
if (numFetchFailures == locations.size) {
// An exception is thrown while fetching this block from all locations
throw new BlockFetchException(s"Failed to fetch block from" +
s" ${locations.size} locations. Most recent failure cause:", e)
} else {
// This location failed, so we retry fetch from a different one by returning null here
logWarning(s"Failed to fetch remote block $blockId " +
s"from $loc (failed attempt $numFetchFailures)", e)
null
}
}
if (data != null) {
if (asBlockResult) {
return Some(new BlockResult(
dataDeserialize(blockId, data),
DataReadMethod.Network,
data.limit()))
} else {
return Some(data)
}
}
logDebug(s"The value of block $blockId is null")
}
logDebug(s"Block $blockId not found")
None
}
The get method retrieves a block by its BlockId: it first looks for the block locally and, if it is not found, fetches it from a remote node.
/**
* Get a block from the block manager (either local or remote).
*/
def get(blockId: BlockId): Option[BlockResult] = {
val local = getLocal(blockId)
if (local.isDefined) {
logInfo(s"Found block $blockId locally")
return local
}
val remote = getRemote(blockId)
if (remote.isDefined) {
logInfo(s"Found block $blockId remotely")
return remote
}
None
}
If the data written to the storage system is serialized, it must be deserialized when read back. The dataSerializeStream method serializes an iterator of values into an output stream, wrapping the stream with compressionCodec (via wrapForCompression) for compression. A self-contained analogue of this wrap-then-serialize pattern follows the code below.
/** Serializes into a stream. */
def dataSerializeStream(
blockId: BlockId,
outputStream: OutputStream,
values: Iterator[Any]): Unit = {
val byteStream = new BufferedOutputStream(outputStream)
val ser = defaultSerializer.newInstance()
ser.serializeStream(wrapForCompression(blockId, byteStream)).writeAll(values).close()
}
/**
* Wrap an output stream for compression if block compression is enabled for its block type
*/
def wrapForCompression(blockId: BlockId, s: OutputStream): OutputStream = {
if (shouldCompress(blockId)) compressionCodec.compressedOutputStream(s) else s
}
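The same pattern can be illustrated with plain JDK streams, where GZIP and Java serialization stand in for Spark's CompressionCodec and Serializer (a self-contained analogue, not Spark code):
import java.io._
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Serialize values into an output stream that is first wrapped for compression.
def writeCompressed(values: Iterator[AnyRef], out: OutputStream): Unit = {
  val oos = new ObjectOutputStream(new GZIPOutputStream(new BufferedOutputStream(out)))
  try values.foreach(oos.writeObject) finally oos.close()
}

// Read back: unwrap the compression first, then deserialize until the stream ends.
def readCompressed(in: InputStream): List[AnyRef] = {
  val ois = new ObjectInputStream(new GZIPInputStream(new BufferedInputStream(in)))
  try {
    val buf = scala.collection.mutable.ListBuffer.empty[AnyRef]
    try { while (true) buf += ois.readObject() } catch { case _: EOFException => }
    buf.toList
  } finally ois.close()
}

val bytes = new ByteArrayOutputStream()
writeCompressed(Iterator("a", "b", "c"), bytes)
println(readCompressed(new ByteArrayInputStream(bytes.toByteArray))) // List(a, b, c)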
To use memory and disk efficiently, metadataCleaner and broadcastCleaner periodically remove stale non-broadcast and broadcast block information from blockInfo, respectively.
metadataCleaner's cleanup function is dropOldNonBroadcastBlocks, and broadcastCleaner's is dropOldBroadcastBlocks. Both delegate to dropOldBlocks, which iterates over blockInfo and removes long-unused blocks from the MemoryStore, DiskStore, and ExternalBlockStore.
private val metadataCleaner = new MetadataCleaner(
MetadataCleanerType.BLOCK_MANAGER, this.dropOldNonBroadcastBlocks, conf)
private val broadcastCleaner = new MetadataCleaner(
MetadataCleanerType.BROADCAST_VARS, this.dropOldBroadcastBlocks, conf)
private def dropOldNonBroadcastBlocks(cleanupTime: Long): Unit = {
logInfo(s"Dropping non broadcast blocks older than $cleanupTime")
dropOldBlocks(cleanupTime, !_.isBroadcast)
}
private def dropOldBroadcastBlocks(cleanupTime: Long): Unit = {
logInfo(s"Dropping broadcast blocks older than $cleanupTime")
dropOldBlocks(cleanupTime, _.isBroadcast)
}
private def dropOldBlocks(cleanupTime: Long, shouldDrop: (BlockId => Boolean)): Unit = {
val iterator = blockInfo.getEntrySet.iterator
while (iterator.hasNext) {
val entry = iterator.next()
val (id, info, time) = (entry.getKey, entry.getValue.value, entry.getValue.timestamp)
if (time < cleanupTime && shouldDrop(id)) {
info.synchronized {
val level = info.level
if (level.useMemory) { memoryStore.remove(id) }
if (level.useDisk) { diskStore.remove(id) }
if (level.useOffHeap) { externalBlockStore.remove(id) }
iterator.remove()
logInfo(s"Dropped block $id")
}
val status = getCurrentBlockStatus(id, info)
reportBlockStatus(id, info, status)
}
}
}
To save storage space, blocks sometimes need to be compressed. The codec is selected by the spark.io.compression.codec configuration property; the default is snappy, which sacrifices a little compression ratio for a large gain in compression speed. A usage sketch follows the code below.
private[spark] object CompressionCodec {
private val configKey = "spark.io.compression.codec"
private[spark] def supportsConcatenationOfSerializedStreams(codec: CompressionCodec): Boolean = {
codec.isInstanceOf[SnappyCompressionCodec] || codec.isInstanceOf[LZFCompressionCodec]
}
private val shortCompressionCodecNames = Map(
"lz4" -> classOf[LZ4CompressionCodec].getName,
"lzf" -> classOf[LZFCompressionCodec].getName,
"snappy" -> classOf[SnappyCompressionCodec].getName)
def getCodecName(conf: SparkConf): String = {
conf.get(configKey, DEFAULT_COMPRESSION_CODEC)
}
def createCodec(conf: SparkConf): CompressionCodec = {
createCodec(conf, getCodecName(conf))
}
def createCodec(conf: SparkConf, codecName: String): CompressionCodec = {
val codecClass = shortCompressionCodecNames.getOrElse(codecName.toLowerCase, codecName)
val codec = try {
val ctor = Utils.classForName(codecClass).getConstructor(classOf[SparkConf])
Some(ctor.newInstance(conf).asInstanceOf[CompressionCodec])
} catch {
case e: ClassNotFoundException => None
case e: IllegalArgumentException => None
}
codec.getOrElse(throw new IllegalArgumentException(s"Codec [$codecName] is not available. " +
s"Consider setting $configKey=$FALLBACK_COMPRESSION_CODEC"))
}
/**
* Return the short version of the given codec name.
* If it is already a short name, just return it.
*/
def getShortName(codecName: String): String = {
if (shortCompressionCodecNames.contains(codecName)) {
codecName
} else {
shortCompressionCodecNames
.collectFirst { case (k, v) if v == codecName => k }
.getOrElse { throw new IllegalArgumentException(s"No short name for codec $codecName.") }
}
}
val FALLBACK_COMPRESSION_CODEC = "lzf"
val DEFAULT_COMPRESSION_CODEC = "snappy"
val ALL_COMPRESSION_CODECS = shortCompressionCodecNames.values.toSeq
}
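A hedged usage sketch of this object is shown below. Since CompressionCodec is private[spark], such code would live under the org.apache.spark package; the choice of "lz4" and the payload are illustrative.
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.apache.spark.SparkConf
import org.apache.spark.io.CompressionCodec

// Resolve the codec from a short name, as BlockManager's lazy compressionCodec does.
val conf = new SparkConf().set("spark.io.compression.codec", "lz4")
val codec = CompressionCodec.createCodec(conf)

// Wrap streams the same way wrapForCompression does.
val rawOut = new ByteArrayOutputStream()
val out = codec.compressedOutputStream(rawOut)
out.write("hello block".getBytes("UTF-8"))
out.close()

val in = codec.compressedInputStream(new ByteArrayInputStream(rawOut.toByteArray))
val buf = new Array[Byte](64)
val n = in.read(buf)                    // payload is tiny, so a single read suffices here
println(new String(buf, 0, n, "UTF-8")) // hello block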
DiskBlockObjectWriter is used to write out the intermediate results of Spark tasks. It extends OutputStream, does not support concurrent writes, and may only be opened once. Its key piece is the fileSegment method, which creates a FileSegment recording the offset and length of the data this writer has written (derived from its initial and final positions within the file). A sketch of the writer's lifecycle follows the code below.
/**
* Returns the file segment of committed data that this Writer has written.
* This is only valid after commitAndClose() has been called.
*/
def fileSegment(): FileSegment = {
if (!commitAndCloseHasBeenCalled) {
throw new IllegalStateException(
"fileSegment() is only valid after commitAndClose() has been called")
}
new FileSegment(file, initialPosition, finalPosition - initialPosition)
}
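A hedged sketch of the writer's lifecycle is shown below. The record-writing call itself is elided because its exact signature changed across Spark 1.x versions; blockId, serializer, and writeMetrics are assumed to already be in scope, and, as with the other sketches, this would live inside the org.apache.spark package.
// Resolve the backing file via the DiskBlockManager, then obtain a writer.
val file = blockManager.diskBlockManager.getFile(blockId)
val writer = blockManager.getDiskWriter(
  blockId, file, serializer.newInstance(), 32 * 1024, writeMetrics)
// ... append serialized records through the writer ...
writer.commitAndClose()            // flush buffers and record the final position
val segment = writer.fileSegment() // FileSegment(file, offset, length) of what was written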
Reference: 《深入理解Spark核心思想与源码分析》 (Understanding the Core Ideas and Source Code Analysis of Spark)