In the previous post we analyzed how BlockManager reads data, which breaks down into local reads and remote fetches. In this post we turn to the write path, which centers on the doPut() method:
private def doPut(
    blockId: BlockId,
    data: BlockValues,
    level: StorageLevel,
    tellMaster: Boolean = true,
    effectiveStorageLevel: Option[StorageLevel] = None)
  : Seq[(BlockId, BlockStatus)] = {

  // The block must be non-null and its storage level must be valid
  require(blockId != null, "BlockId is null")
  require(level != null && level.isValid, "StorageLevel is null or invalid")
  effectiveStorageLevel.foreach { level =>
    require(level != null && level.isValid, "Effective StorageLevel is null or invalid")
  }

  // Return value
  // Collects the status of everything written: a mapping from BlockId to BlockStatus
  val updatedBlocks = new ArrayBuffer[(BlockId, BlockStatus)]

  /* Remember the block's storage level so that we can correctly drop it to disk if it needs
   * to be dropped right after it got put into memory. Note, however, that other threads will
   * not be able to get() this block until we call markReady on its BlockInfo. */
  // Create a BlockInfo for the block being written and register it in the blockInfo map.
  // Recording the storage level up front matters: if memory runs short while new data is
  // being stored, cached blocks may have to be dropped to disk to free memory.
  val putBlockInfo = {
    val tinfo = new BlockInfo(level, tellMaster)
    // Do atomically!
    val oldBlockOpt = blockInfo.putIfAbsent(blockId, tinfo)
    if (oldBlockOpt.isDefined) {
      if (oldBlockOpt.get.waitForReady()) {
        logWarning(s"Block $blockId already exists on this machine; not re-adding it")
        return updatedBlocks
      }
      // TODO: So the block info exists - but previous attempt to load it (?) failed.
      // What do we do now ? Retry on it ?
      oldBlockOpt.get
    } else {
      tinfo
    }
  }

  val startTimeMs = System.currentTimeMillis

  /* If we're storing values and we need to replicate the data, we'll want access to the values,
   * but because our put will read the whole iterator, there will be no values left. For the
   * case where the put serializes data, we'll remember the bytes, above; but for the case where
   * it doesn't, such as deserialized storage, let's rely on the put returning an Iterator. */
  var valuesAfterPut: Iterator[Any] = null

  // Ditto for the bytes after the put
  var bytesAfterPut: ByteBuffer = null

  // Size of the block in bytes
  var size = 0L

  // The level we actually use to put the block
  val putLevel = effectiveStorageLevel.getOrElse(level)

  // If we're storing bytes, then initiate the replication before storing them locally.
  // This is faster as data is already serialized and ready to send.
  val replicationFuture = data match {
    case b: ByteBufferValues if putLevel.replication > 1 =>
      // Duplicate doesn't copy the bytes, but just creates a wrapper
      val bufferView = b.buffer.duplicate()
      Future {
        // This is a blocking action and should run in futureExecutionContext which is a cached
        // thread pool
        replicate(blockId, bufferView, putLevel)
      }(futureExecutionContext)
    case _ => null
  }

  // Lock the BlockInfo so that concurrent threads accessing this block are synchronized
  putBlockInfo.synchronized {
    logTrace("Put for block %s took %s to get into synchronized block"
      .format(blockId, Utils.getUsedTimeMs(startTimeMs)))

    var marked = false
    try {
      // First pick a BlockStore (MemoryStore, DiskStore, ...) based on the storage level
      val (returnValues, blockStore: BlockStore) = {
        if (putLevel.useMemory) {
          // Put it in memory first, even if it also has useDisk set to true;
          // We will drop it to disk later if the memory store can't hold it.
          (true, memoryStore)
        } else if (putLevel.useOffHeap) {
          // Use external block store
          (false, externalBlockStore)
        } else if (putLevel.useDisk) {
          // Don't get back the bytes from put unless we replicate them
          (putLevel.replication > 1, diskStore)
        } else {
          assert(putLevel == StorageLevel.NONE)
          throw new BlockException(
            blockId, s"Attempted to put block $blockId without specifying storage level!")
        }
      }

      // Actually put the values
      // Dispatch on the shape of the data and hand it to the chosen store
      val result = data match {
        case IteratorValues(iterator) =>
          blockStore.putIterator(blockId, iterator, putLevel, returnValues)
        case ArrayValues(array) =>
          blockStore.putArray(blockId, array, putLevel, returnValues)
        case ByteBufferValues(bytes) =>
          bytes.rewind()
          blockStore.putBytes(blockId, bytes, putLevel)
      }
      size = result.size
      result.data match {
        case Left(newIterator) if putLevel.useMemory => valuesAfterPut = newIterator
        case Right(newBytes) => bytesAfterPut = newBytes
        case _ =>
      }

      // Keep track of which blocks are dropped from memory
      if (putLevel.useMemory) {
        result.droppedBlocks.foreach { updatedBlocks += _ }
      }

      // Look up the BlockStatus corresponding to this block
      val putBlockStatus = getCurrentBlockStatus(blockId, putBlockInfo)
      if (putBlockStatus.storageLevel != StorageLevel.NONE) {
        // Now that the block is in either the memory, externalBlockStore, or disk store,
        // let other threads read it, and tell the master about it.
        marked = true
        putBlockInfo.markReady(size)
        if (tellMaster) {
          // reportBlockStatus sends the status of the newly written block to the
          // BlockManagerMasterEndpoint, which keeps the block metadata maintained on
          // the BlockManagerMaster in sync
          reportBlockStatus(blockId, putBlockInfo, putBlockStatus)
        }
        updatedBlocks += ((blockId, putBlockStatus))
      }
    } finally {
      // If we failed in putting the block to memory/disk, notify other possible readers
      // that it has failed, and then remove it from the block info map.
      if (!marked) {
        // Note that the remove must happen before markFailure otherwise another thread
        // could've inserted a new BlockInfo before we remove it.
        blockInfo.remove(blockId)
        putBlockInfo.markFailure()
        logWarning(s"Putting block $blockId failed")
      }
    }
  }
  logDebug("Put block %s locally took %s".format(blockId, Utils.getUsedTimeMs(startTimeMs)))

  // Either we're storing bytes and we asynchronously started replication, or we're storing
  // values and need to serialize and replicate them now:
  // If the storage level carries a _2 suffix (e.g. MEMORY_AND_DISK_2), the block must be
  // replicated and shipped to another node.
  if (putLevel.replication > 1) {
    data match {
      case ByteBufferValues(bytes) =>
        if (replicationFuture != null) {
          Await.ready(replicationFuture, Duration.Inf)
        }
      case _ =>
        val remoteStartTime = System.currentTimeMillis
        // Serialize the block if not already done
        if (bytesAfterPut == null) {
          if (valuesAfterPut == null) {
            throw new SparkException(
              "Underlying put returned neither an Iterator nor bytes! This shouldn't happen.")
          }
          bytesAfterPut = dataSerialize(blockId, valuesAfterPut)
        }
        // Copy the block to a peer by calling replicate
        replicate(blockId, bytesAfterPut, putLevel)
        logDebug("Put block %s remotely took %s"
          .format(blockId, Utils.getUsedTimeMs(remoteStartTime)))
    }
  }

  BlockManager.dispose(bytesAfterPut)

  if (putLevel.replication > 1) {
    logDebug("Putting block %s with replication took %s"
      .format(blockId, Utils.getUsedTimeMs(startTimeMs)))
  } else {
    logDebug("Putting block %s without replication took %s"
      .format(blockId, Utils.getUsedTimeMs(startTimeMs)))
  }

  updatedBlocks
}
doPut() first creates a BlockInfo for the block being written, recording its storage level; later, when data is stored into local memory and memory runs short, this level decides whether an evicted block is discarded or persisted to disk (analyzed below). The write itself happens inside a lock on the BlockInfo, so concurrent writers cannot corrupt the block mid-write. A BlockStore is then chosen according to the storage level, such as MemoryStore, DiskStore, or externalBlockStore, and the data is handed to that store in whatever form it arrived (iterator, array, or bytes). After the put, the block's BlockStatus is computed, and reportBlockStatus sends it to the BlockManagerMasterEndpoint so that the master's block metadata stays in sync. The final branch is important: if the storage level is one of the replicated variants with a _2 suffix, the block must also be replicated to another node; this is done by replicate(), which picks a random peer and ships the serialized bytes over.
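To see this write path from the user side, here is a minimal sketch (the object name and data are my own) that persists an RDD at a replicated level; count() materializes the partitions, and each cached block then travels through doPut() with putLevel.replication == 2. Note that in local mode there is no peer BlockManager to receive the copy, so on a single JVM this only exercises the API; on a real cluster the second replica lands on another executor.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ReplicatedPutDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ReplicatedPutDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // MEMORY_AND_DISK_2 is the "_2" suffix discussed above: replication = 2
    val rdd = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_AND_DISK_2)
    rdd.count() // materializes the partitions and caches each block via doPut()
    sc.stop()
  }
}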
Next, let's analyze the actual write into a store, taking MemoryStore as the example, because it has the most interesting behavior: when memory is insufficient, it first tries to evict RDD blocks that have been cached the longest; if an evicted block's storage level includes disk, the block is written to disk, and if it is memory-only, it is removed entirely. Writing data into memory ultimately goes through the tryToPut() method, which we analyze now:
private def tryToPut(
    blockId: BlockId,
    value: () => Any,
    size: Long,
    deserialized: Boolean,
    droppedBlocks: mutable.Buffer[(BlockId, BlockStatus)]): Boolean = {
  // Synchronize on the memory manager so that concurrent writers don't race
  memoryManager.synchronized {
    // Release the pending unroll memory held by this task
    releasePendingUnrollMemoryForThisTask()
    // Ask the memory manager whether there is enough storage memory for this block;
    // this may evict other blocks to make room
    val enoughMemory = memoryManager.acquireStorageMemory(blockId, size, droppedBlocks)
    if (enoughMemory) {
      // We acquired enough memory for the block, so go ahead and put it
      val entry = new MemoryEntry(value(), size, deserialized)
      // Write the entry into the in-memory map
      entries.synchronized {
        entries.put(blockId, entry)
      }
      val valuesOrBytes = if (deserialized) "values" else "bytes"
      logInfo("Block %s stored as %s in memory (estimated size %s, free %s)".format(
        blockId, valuesOrBytes, Utils.bytesToString(size), Utils.bytesToString(blocksMemoryUsed)))
    } else {
      // Not enough memory: if the storage level includes disk, the block is written
      // there instead. Note that `data` is lazy, so the value is only materialized
      // if dropFromMemory actually needs it.
      lazy val data = if (deserialized) {
        Left(value().asInstanceOf[Array[Any]])
      } else {
        Right(value().asInstanceOf[ByteBuffer].duplicate())
      }
      // Drop the block from memory, spilling to disk if the level allows it
      val droppedBlockStatus = blockManager.dropFromMemory(blockId, () => data)
      droppedBlockStatus.foreach { status => droppedBlocks += ((blockId, status)) }
    }
    enoughMemory
  }
}
When writing data into memory, tryToPut first takes the lock (synchronizing on memoryManager) so that concurrent writers cannot interfere, then releases this task's pending unroll memory, and then tries to acquire enough memory for the data by calling acquireStorageMemory. If the currently free memory is insufficient, the memory manager evicts some existing blocks and checks again, repeating until either enough space is freed or eviction can free no more; internally this relies on the evictBlocksToFreeSpace method. Once enough memory has been acquired, the data is wrapped in a MemoryEntry and written into the in-memory entries map.
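The acquire-then-evict handshake is easier to see stripped of Spark's bookkeeping. The sketch below is my own simplification, not Spark's MemoryManager: the real storage-memory accounting also covers unroll memory and the execution/storage split, and the real eviction is the evictBlocksToFreeSpace shown next.

// A deliberately simplified model of acquireStorageMemory (hypothetical code)
class StorageMemorySketch(maxStorageMemory: Long) {
  private var used = 0L
  private def free: Long = maxStorageMemory - used

  // Stand-in for MemoryStore.evictBlocksToFreeSpace: pretend we can always
  // reclaim up to `used` bytes by dropping old blocks
  private def evictToFree(space: Long): Unit = {
    val freed = math.min(space, used)
    used -= freed
  }

  def acquireStorageMemory(size: Long): Boolean = synchronized {
    if (size > maxStorageMemory) {
      false // could never fit, give up immediately
    } else {
      if (free < size) evictToFree(size - free) // evict old blocks first
      val enough = free >= size                 // re-check after eviction
      if (enough) used += size
      enough
    }
  }
}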
Let's look concretely at how enough memory is obtained when space runs short:
private[spark] def evictBlocksToFreeSpace(
    blockId: Option[BlockId],
    space: Long,
    droppedBlocks: mutable.Buffer[(BlockId, BlockStatus)]): Boolean = {
  assert(space > 0)
  memoryManager.synchronized {
    var freedMemory = 0L
    // The RDD (if any) that the incoming block belongs to
    val rddToAdd = blockId.flatMap(getRddId)
    // The blocks chosen for eviction
    val selectedBlocks = new ArrayBuffer[BlockId]
    entries.synchronized {
      // Iterate over the MemoryEntry map, starting from the blocks that have
      // gone longest without being used
      val iterator = entries.entrySet().iterator()
      while (freedMemory < space && iterator.hasNext) {
        val pair = iterator.next()
        val blockId = pair.getKey
        // Never evict blocks belonging to the same RDD we are trying to add
        if (rddToAdd.isEmpty || rddToAdd != getRddId(blockId)) {
          // Add the block to the eviction candidates
          selectedBlocks += blockId
          // Count its size towards the memory that would be freed
          freedMemory += pair.getValue.size
        }
      }
    }

    if (freedMemory >= space) {
      logInfo(s"${selectedBlocks.size} blocks selected for dropping")
      for (blockId <- selectedBlocks) {
        // Look up the MemoryEntry for the candidate block
        val entry = entries.synchronized { entries.get(blockId) }
        // This should never be null as only one task should be dropping
        // blocks and removing entries. However the check is still here for
        // future safety.
        if (entry != null) {
          val data = if (entry.deserialized) {
            Left(entry.value.asInstanceOf[Array[Any]])
          } else {
            Right(entry.value.asInstanceOf[ByteBuffer].duplicate())
          }
          // Drop the old block. dropFromMemory checks whether the victim's storage
          // level includes disk: if it is memory-only the data is discarded outright,
          // otherwise it is persisted to disk.
          val droppedBlockStatus = blockManager.dropFromMemory(blockId, data)
          droppedBlockStatus.foreach { status => droppedBlocks += ((blockId, status)) }
        }
      }
      true
    } else {
      blockId.foreach { id =>
        logInfo(s"Will not store $id as it would require dropping another block " +
          "from the same RDD")
      }
      false
    }
  }
}
When memory cannot hold the new block, some existing data is evicted. The eviction policy works off entries, a java.util.LinkedHashMap, whose iteration order is well defined: the blocks that have gone longest without being used come out of the iterator first, while recently used ones sit at the tail, essentially the LRU cache-eviction strategy. During eviction, each victim's storage level is checked: if it includes disk, the block is written out to disk; otherwise the cached data is discarded outright and must be recomputed from the RDD's lineage if it is needed again. This is a classic Spark tuning point: an RDD cached without disk persistence can silently lose its cached data.
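This head-to-tail ordering is standard java.util.LinkedHashMap behavior, so it can be demonstrated on its own. The block names below are made up; the third constructor argument enables access ordering (which, as far as I can tell, is how MemoryStore constructs its entries map), making iteration run from least- to most-recently used:

import java.util.LinkedHashMap

object LruOrderDemo {
  def main(args: Array[String]): Unit = {
    // accessOrder = true: iteration order is least-recently used first
    val entries = new LinkedHashMap[String, Long](32, 0.75f, true)
    entries.put("rdd_1_0", 100L)
    entries.put("rdd_1_1", 200L)
    entries.put("rdd_2_0", 300L)
    entries.get("rdd_1_0") // touching an entry moves it to the tail

    val it = entries.entrySet().iterator()
    while (it.hasNext) {
      // Prints rdd_1_1, rdd_2_0, rdd_1_0: the long-unused blocks come out first
      println(it.next().getKey)
    }
  }
}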
To sum up: when BlockManager writes data, it also stores it according to the persistence level. We walked through the in-memory case (MemoryStore); the other stores behave similarly. Writing into memory comes with a cache-eviction mechanism: when memory is insufficient, the RDD blocks that have been sitting in memory longest are dropped (written to disk first if their storage level includes disk), which means a cached RDD can lose its in-memory data. For important or frequently reused RDDs, it is therefore best to choose a persistence level that includes disk.