通过之前的文章Spark源码解读之Shuffle原理剖析与源码分析我们知道,一个Shuffle操作被DAGScheduler划分为两个stage,第一个stage是ShuffleMapTask,第二个是ResultTask。ShuffleMapTask会产生临时计算结果,这些数据会被ResultTask作为输入而读取。
原文地址:原文链接
那么ShuffleMapTask的计算结果是如何被ResultTask取得的呢?过程如下:
每一个ShuffleMapTask都会用一个MapStatus来保存计算结果。MapStatus是由BlockManagerId和ByeteSize构成,BlockManagerId表示这些计算的中间结果的实际数据在哪个BlockManager,ByteSize表示不同reduceid所要读取的数据的大小。
private[spark] sealed trait MapStatus {
/** Location where this task was run. */
def location: BlockManagerId
/**
* Estimated size for the reduce block, in bytes.
*
* If a block is non-empty, then this method MUST return a non-zero size. This invariant is
* necessary for correctness, since block fetchers are allowed to skip zero-size blocks.
*/
//不同reduceID所要读取的数据的大小
def getSizeForBlock(reduceId: Int): Long
}
Shuffle的写入过程如下:
ShuffleMapTask.runTask ----> HashShuffleWriter.writer ----> BlockObjectWriter.writer
ShuffleMapTask中runTask方法源码如下:
override def runTask(context: TaskContext): MapStatus = {
//使用广播变量反序列化RDD
// Deserialize the RDD using the broadcast variable.
val ser = SparkEnv.get.closureSerializer.newInstance()
val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
metrics = Some(context.taskMetrics)
var writer: ShuffleWriter[Any, Any] = null
try {
//获取ShuffleManager,从ShuffleManager中获取ShuffleWriter
val manager = SparkEnv.get.shuffleManager
writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
//首先调用rdd的iterator方法,并且传入了当前task要处理那个partition,然后执行我们定义的函数
//处理返回的数据都是通过ShuffleWriter,经过HashPartitioner进行分区之后,写入了自己对应的bucket
writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
//最后返回结果,MapStatus
//MapStatus里面封装了ShffleMapTask计算后的数据,存储在哪里,其实就是BlockManager的信息
//BlockManager是spark底层内存,数据,磁盘数据管理的组件
return writer.stop(success = true).get
} catch {
case e: Exception =>
try {
if (writer != null) {
writer.stop(success = false)
}
} catch {
case e: Exception =>
log.debug("Could not stop writer", e)
}
throw e
}
}
在HashShuffleWriter.writer中主要处理两件事:
HashShuffleWriter中的writer方法源码如下:
/** Write a bunch of records to this task's output */
/**
* 将每个ShuffleMapTask计算出来的新的RDD的partition数据,写入本地磁盘
* @param records
*/
override def write(records: Iterator[_ <: Product2[K, V]]): Unit = {
//判断是否需要进行本地,如果是reduceByKey这种操作,则要进行聚合操作
//即dep.aggregator.isDefined为true
//dep.mapSideCombine也为true
val iter = if (dep.aggregator.isDefined) {
if (dep.mapSideCombine) {
//这里进行本地聚合操作,比如本地有(hello,1),(hello,1)
//则可以聚合成(hello,2)
dep.aggregator.get.combineValuesByKey(records, context)
} else {
records
}
} else {
require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
records
}
//如果需要本地聚合,则先进行聚合
//然后遍历数据,对每一个数据,进行partition操作,默认的是HashPartitioner,并且生成bucketId
//也就表示这数据要写入哪一个bucket
for (elem <- iter) {
//计算bucketId
val bucketId = dep.partitioner.getPartition(elem._1)
//调用shuffleBlockManager.forMapTask()方法生成bucketId对应的writer,然后用writer将数据写入bucket
//DiskBlockObjectWriter负责将数据真正写入磁盘
shuffle.writers(bucketId).write(elem)
}
}
在上面writer方法中,使用到的Shuffle由ShuffleBlockManager中的forMapTask函数生成,该方法源码如下:
/**
* Get a ShuffleWriterGroup for the given map task, which will register it as complete
* when the writers are closed successfully
*/
/**
* 给每一个map task生成 一个ShuffleWriterGroup
*/
def forMapTask(shuffleId: Int, mapId: Int, numBuckets: Int, serializer: Serializer,
writeMetrics: ShuffleWriteMetrics) = {
new ShuffleWriterGroup {
shuffleStates.putIfAbsent(shuffleId, new ShuffleState(numBuckets))
private val shuffleState = shuffleStates(shuffleId)
private var fileGroup: ShuffleFileGroup = null
val openStartTime = System.nanoTime
//判断是否开启了consolidate优化,如果开启了,就不会为每一个bucket获取一个输出文件
//而是为每一个bucket获取一个ShuffleGroup的write
val writers: Array[BlockObjectWriter] = if (consolidateShuffleFiles) {
fileGroup = getUnusedFileGroup()
Array.tabulate[BlockObjectWriter](numBuckets) { bucketId =>
//首先生成一个唯一的blockId,然后用bucketId来调用ShuffleFileGroup的apply函数来获取一个writer
val blockId = ShuffleBlockId(shuffleId, mapId, bucketId)
//使用blockManager.getDiskWriter()函数来获取一个writer
//实际上在开启优化配置后,对一个bucketId,不再是像之前一样获取一个独立的ShuffleBlockFile的writer
//而是获取ShuffleFileGroup中的一个writer
//这样就实现了多个ShufffleMapTask的输出文件的合并
blockManager.getDiskWriter(blockId, fileGroup(bucketId), serializer, bufferSize,
writeMetrics)
}
} else {
//如果没有进行shuffle优化配置,也会针对每一个shuffleMapTask创建一个ShuffleBlockFile
Array.tabulate[BlockObjectWriter](numBuckets) { bucketId =>
val blockId = ShuffleBlockId(shuffleId, mapId, bucketId)
val blockFile = blockManager.diskBlockManager.getFile(blockId)
// Because of previous failures, the shuffle file may already exist on this machine.
// If so, remove it.
//如果ShuffleBlockFile存在,则进行删除
if (blockFile.exists) {
if (blockFile.delete()) {
logInfo(s"Removed existing shuffle file $blockFile")
} else {
logWarning(s"Failed to remove existing shuffle file $blockFile")
}
}
//写入磁盘中
blockManager.getDiskWriter(blockId, blockFile, serializer, bufferSize, writeMetrics)
}
}
// Creating the file to write to and creating a disk writer both involve interacting with
// the disk, so should be included in the shuffle write time.
writeMetrics.incShuffleWriteTime(System.nanoTime - openStartTime)
override def releaseWriters(success: Boolean) {
if (consolidateShuffleFiles) {
if (success) {
val offsets = writers.map(_.fileSegment().offset)
val lengths = writers.map(_.fileSegment().length)
fileGroup.recordMapOutput(mapId, offsets, lengths)
}
recycleFileGroup(fileGroup)
} else {
shuffleState.completedMapTasks.add(mapId)
}
}
private def getUnusedFileGroup(): ShuffleFileGroup = {
val fileGroup = shuffleState.unusedFileGroups.poll()
if (fileGroup != null) fileGroup else newFileGroup()
}
private def newFileGroup(): ShuffleFileGroup = {
val fileId = shuffleState.nextFileId.getAndIncrement()
val files = Array.tabulate[File](numBuckets) { bucketId =>
val filename = physicalFileName(shuffleId, bucketId, fileId)
blockManager.diskBlockManager.getFile(filename)
}
val fileGroup = new ShuffleFileGroup(shuffleId, fileId, files)
shuffleState.allFileGroups.add(fileGroup)
fileGroup
}
private def recycleFileGroup(group: ShuffleFileGroup) {
shuffleState.unusedFileGroups.add(group)
}
}
}
在上面的源码中涉及到Shuffle的优化原理,细节可以查看上篇文章Spark源码解读之Shuffle原理剖析与源码分析
在gieFile方法中负责将Shuffle需要写入的数据映射为一个文件。
/** Looks up a file by hashing it into one of our local subdirectories. */
// This method should be kept in sync with
// org.apache.spark.network.shuffle.StandaloneShuffleBlockManager#getFile().
//负责将三元组(shuffle_id,map_id,reduce_id)映射到文件名
def getFile(filename: String): File = {
// Figure out which local directory it hashes to, and which subdirectory in that
val hash = Utils.nonNegativeHash(filename)
val dirId = hash % localDirs.length
val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
// Create the subdirectory if it doesn't already exist
var subDir = subDirs(dirId)(subDirId)
if (subDir == null) {
subDir = subDirs(dirId).synchronized {
val old = subDirs(dirId)(subDirId)
if (old != null) {
old
} else {
val newDir = new File(localDirs(dirId), "%02x".format(subDirId))
if (!newDir.exists() && !newDir.mkdir()) {
throw new IOException(s"Failed to create local dir in $newDir.")
}
subDirs(dirId)(subDirId) = newDir
newDir
}
}
}
new File(subDir, filename)
}
最后使用DiskBlockObjectWriter.writer负责将数据真正写入磁盘中。
override def write(value: Any) {
if (!initialized) {
open()
}
objOut.writeObject(value)
numRecordsWritten += 1
writeMetrics.incShuffleRecordsWritten(1)
if (numRecordsWritten % 32 == 0) {
updateBytesWritten()
}
}
Shuffle结果的读取过程如下所示:
ShuffleRDD.compute ---> HashShuffleRead.read ---> BlockStoreShuffleFetcher.fetch ---> BlockManager.getMultiple
ShuffleRDD的compute函数是读取ShuffleMapTask计算结果的出发点。compute源码如下:
/**
*shuffle的入口
*/
override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
//这里会调用shuffleManager.getReader()来获取一个HashShuffleReader
//然后调用它的reader方法来拉取resultTask需要聚合的数据
val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
.read()
.asInstanceOf[Iterator[(K, C)]]
}
在这里使用HashShuffleReader调用reader方法获取合并后的数据,源码如下所示:
/** Read the combined key-values for this reduce task */
override def read(): Iterator[Product2[K, C]] = {
val ser = Serializer.getSerializer(dep.serializer)
//通过BlockStoreShuffleFetcher的fetch方法来从DAGScheduler的MapOutputTrackerMaster中获取
//自己需要的数据的信息,然后底层再通过对应的BlockManager拉取需要的数据
val iter = BlockStoreShuffleFetcher.fetch(handle.shuffleId, startPartition, context, ser)
val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) {
if (dep.mapSideCombine) {
new InterruptibleIterator(context, dep.aggregator.get.combineCombinersByKey(iter, context))
} else {
new InterruptibleIterator(context, dep.aggregator.get.combineValuesByKey(iter, context))
}
} else {
require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
// Convert the Product2s to pairs since this is what downstream RDDs currently expect
iter.asInstanceOf[Iterator[Product2[K, C]]].map(pair => (pair._1, pair._2))
}
// Sort the output if there is a sort ordering defined.
dep.keyOrdering match {
case Some(keyOrd: Ordering[K]) =>
// Create an ExternalSorter to sort the data. Note that if spark.shuffle.spill is disabled,
// the ExternalSorter won't spill to disk.
val sorter = new ExternalSorter[K, C, C](ordering = Some(keyOrd), serializer = Some(ser))
sorter.insertAll(aggregatedIter)
context.taskMetrics.incMemoryBytesSpilled(sorter.memoryBytesSpilled)
context.taskMetrics.incDiskBytesSpilled(sorter.diskBytesSpilled)
sorter.iterator
case None =>
aggregatedIter
}
}
在reader函数中调用BlockStoreShuffleFetcher的fetch方法去获取MapStatus,最后通过BlockManager去真正获取数据。源码如下:
private[hash] object BlockStoreShuffleFetcher extends Logging {
def fetch[T](
shuffleId: Int,
reduceId: Int,
context: TaskContext,
serializer: Serializer)
: Iterator[T] =
{
logDebug("Fetching outputs for shuffle %d, reduce %d".format(shuffleId, reduceId))
val blockManager = SparkEnv.get.blockManager
val startTime = System.currentTimeMillis
//获取一个全局的MapOutputTracker,并且调用其getServerStatuses方法
//注意这里传入了两个参数,shuffleId和reduceId
//shuffle有两个stage参与,因此shuffleId代表表示上一个stage,使用这个参数来获取
//上一个stage的ShuffleMapTask shuffle write输出的MapStatus数据信息
//在获取到MapStatus之后,还要使用reduceId来拉取当前stage需要获取的之前stage的ShuffleMapTask的输出文件信息
//这个getServerStatuses方法是需要走网络通信的,因为它要连接Driver上的DAGScheduler来获取MapOutputTracker上的数据信息
val statuses = SparkEnv.get.mapOutputTracker.getServerStatuses(shuffleId, reduceId)
logDebug("Fetching map output location for shuffle %d, reduce %d took %d ms".format(
shuffleId, reduceId, System.currentTimeMillis - startTime))
val splitsByAddress = new HashMap[BlockManagerId, ArrayBuffer[(Int, Long)]]
for (((address, size), index) <- statuses.zipWithIndex) {
splitsByAddress.getOrElseUpdate(address, ArrayBuffer()) += ((index, size))
}
val blocksByAddress: Seq[(BlockManagerId, Seq[(BlockId, Long)])] = splitsByAddress.toSeq.map {
case (address, splits) =>
(address, splits.map(s => (ShuffleBlockId(shuffleId, s._1, reduceId), s._2)))
}
def unpackBlock(blockPair: (BlockId, Try[Iterator[Any]])) : Iterator[T] = {
val blockId = blockPair._1
val blockOption = blockPair._2
blockOption match {
case Success(block) => {
block.asInstanceOf[Iterator[T]]
}
case Failure(e) => {
blockId match {
case ShuffleBlockId(shufId, mapId, _) =>
val address = statuses(mapId.toInt)._1
throw new FetchFailedException(address, shufId.toInt, mapId.toInt, reduceId, e)
case _ =>
throw new SparkException(
"Failed to get block " + blockId + ", which is not a shuffle block", e)
}
}
}
}
val blockFetcherItr = new ShuffleBlockFetcherIterator(
context,
SparkEnv.get.blockManager.shuffleClient,
blockManager,
blocksByAddress,
serializer,
SparkEnv.get.conf.getLong("spark.reducer.maxMbInFlight", 48) * 1024 * 1024)
val itr = blockFetcherItr.flatMap(unpackBlock)
val completionIter = CompletionIterator[T, Iterator[T]](itr, {
context.taskMetrics.updateShuffleReadMetrics()
})
new InterruptibleIterator[T](context, completionIter) {
val readMetrics = context.taskMetrics.createShuffleReadMetricsForDependency()
override def next(): T = {
readMetrics.incRecordsRead(1)
delegate.next()
}
}
}
}
在MapOutputTracker中调用getServerStatuses在Executor中获取ShuffleMapTask输出结果数据的所在的URL和Size,源码如下:
/**
* Called from executors to get the server URIs and output sizes of the map outputs of
* a given shuffle.
*/
def getServerStatuses(shuffleId: Int, reduceId: Int): Array[(BlockManagerId, Long)] = {
val statuses = mapStatuses.get(shuffleId).orNull
if (statuses == null) {
logInfo("Don't have map outputs for shuffle " + shuffleId + ", fetching them")
var fetchedStatuses: Array[MapStatus] = null
fetching.synchronized {
// Someone else is fetching it; wait for them to be done
//等待抓取数据
while (fetching.contains(shuffleId)) {
try {
fetching.wait()
} catch {
case e: InterruptedException =>
}
}
// Either while we waited the fetch happened successfully, or
// someone fetched it in between the get and the fetching.synchronized.
fetchedStatuses = mapStatuses.get(shuffleId).orNull
if (fetchedStatuses == null) {
// We have to do the fetch, get others to wait for us.
fetching += shuffleId
}
}
if (fetchedStatuses == null) {
// We won the race to fetch the output locs; do so
logInfo("Doing the fetch; tracker actor = " + trackerActor)
// This try-finally prevents hangs due to timeouts:
try {
val fetchedBytes =
askTracker(GetMapOutputStatuses(shuffleId)).asInstanceOf[Array[Byte]]
fetchedStatuses = MapOutputTracker.deserializeMapStatuses(fetchedBytes)
logInfo("Got the output locations")
mapStatuses.put(shuffleId, fetchedStatuses)
} finally {
fetching.synchronized {
fetching -= shuffleId
fetching.notifyAll()
}
}
}
if (fetchedStatuses != null) {
fetchedStatuses.synchronized {
return MapOutputTracker.convertMapStatuses(shuffleId, reduceId, fetchedStatuses)
}
} else {
logError("Missing all output locations for shuffle " + shuffleId)
throw new MetadataFetchFailedException(
shuffleId, reduceId, "Missing all output locations for shuffle " + shuffleId)
}
} else {
statuses.synchronized {
return MapOutputTracker.convertMapStatuses(shuffleId, reduceId, statuses)
}
}
}
一个ShuffleMapTask会生成一个MapStatus,在MapStatus中含有当前ShuffleMapTask产生的数据落到各个Partition中的大小。如果大小为0,则表示该分区中没有数据产生。每一个分区中的数据大小使用一个byte来表示的,但是一个byte最多只能表示255,如何表示更大的size呢?这里就使用到了巧妙的转换,使用1.1作为对数底,可以将28,转换为1.1256。MapStatus中的compressSize和decompressSize的作用,就是将数据的大小用另一种进制来表示,这样就可以让表达的空间从0至255转换为0至35903328256,单个存储的大小可以高达近35GB。
源码如下:
/**
* Compress a size in bytes to 8 bits for efficient reporting of map output sizes.
* We do this by encoding the log base 1.1 of the size as an integer, which can support
* sizes up to 35 GB with at most 10% error.
*/
def compressSize(size: Long): Byte = {
if (size == 0) {
0
} else if (size <= 1L) {
1
} else {
math.min(255, math.ceil(math.log(size) / math.log(LOG_BASE)).toInt).toByte
}
}
/**
* Decompress an 8-bit encoded block size, using the reverse operation of compressSize.
*/
def decompressSize(compressedSize: Byte): Long = {
if (compressedSize == 0) {
0
} else {
math.pow(LOG_BASE, compressedSize & 0xFF).toLong
}
}
ShuffleId唯一标识了一个job中的stage,这一个stage是作为ReduceTask所在Stage的直接上游。需要遍历该Stage中每一个Task产生的mapStatus来获知是否有当前ResultTask需要读取的数据。
在BlockManager中首先会调用initialize函数进行初始化,初始化BlockTransferService 和 ShuffleClient,向BlockManagerMaster进行注册,并且在BlockManagerWorker中注册本地的Shuffle service。如果所要获取的文件落在本地,则调用getLocal获取,否则调用getRemote远程拉取。initialize函数源码如下:
/**
* Initializes the BlockManager with the given appId. This is not performed in the constructor as
* the appId may not be known at BlockManager instantiation time (in particular for the driver,
* where it is only learned after registration with the TaskScheduler).
*
* This method initializes the BlockTransferService and ShuffleClient, registers with the
* BlockManagerMaster, starts the BlockManagerWorker actor, and registers with a local shuffle
* service if configured.
*/
def initialize(appId: String): Unit = {
blockTransferService.init(this)
shuffleClient.init(appId)
blockManagerId = BlockManagerId(
executorId, blockTransferService.hostName, blockTransferService.port)
shuffleServerId = if (externalShuffleServiceEnabled) {
BlockManagerId(executorId, blockTransferService.hostName, externalShuffleServicePort)
} else {
blockManagerId
}
master.registerBlockManager(blockManagerId, maxMemory, slaveActor)
// Register Executors' configuration with the local shuffle service, if one should exist.
if (externalShuffleServiceEnabled && !blockManagerId.isDriver) {
registerWithExternalShuffleServer()
}
}
Shuffle操作会消耗大量的内存,具体体现在下面几个方面:
在上面我们剖析了Shuffle的存储过程,对于Spark,它首先会将RDD缓存在内存中,其次磁盘等,那么它的存取过程是怎样的呢?下面我们来看看Spark的存储系统的框架图:
以上框架图主要包含以下几个模块:
在SparkEnv中初始化过程源码如下:
//创建各个子模块
val blockManagerMaster = new BlockManagerMaster(registerOrLookup(
"BlockManagerMaster",
new BlockManagerMasterActor(isLocal, conf, listenerBus)), conf, isDriver)
// NB: blockManager is not valid until initialize() is called later.
val blockManager = new BlockManager(executorId, actorSystem, blockManagerMaster,
serializer, conf, mapOutputTracker, shuffleManager, blockTransferService, securityManager,
numUsableCores)
val broadcastManager = new BroadcastManager(isDriver, conf, securityManager)
val cacheManager = new CacheManager(blockManager)
在registerOrLookup函数中,如果当前节点是Driver则创建这个Actor,否则建立到Driver的连接,取得BlockManagerMaster的Actor。
def registerOrLookup(name: String, newActor: => Actor): ActorRef = {
//如果当前节点是Driver则创建 Actor
if (isDriver) {
logInfo("Registering " + name)
actorSystem.actorOf(Props(newActor), name = name)
//否则建立到Driver连接,取得BlockManagerMaster
} else {
AkkaUtils.makeDriverRef(name, conf, actorSystem)
}
}
数据写入过程简述如下:
RDD.iterator方法源码如下:
/**
* Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
* This should ''not'' be called by users directly, but is available for implementors of custom
* subclasses of RDD.
*/
//与子Storage子系统交互的入口
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
if (storageLevel != StorageLevel.NONE) {
SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
} else {
computeOrReadCheckpoint(split, context)
}
}
CacheManager.getOrCompute源码如下:
def getOrCompute[T](
rdd: RDD[T],
partition: Partition,
context: TaskContext,
storageLevel: StorageLevel): Iterator[T] = {
val key = RDDBlockId(rdd.id, partition.index)
logDebug(s"Looking for partition $key")
blockManager.get(key) match {
case Some(blockResult) =>
// Partition is already materialized, so just return its values
val inputMetrics = blockResult.inputMetrics
val existingMetrics = context.taskMetrics
.getInputMetricsForReadMethod(inputMetrics.readMethod)
existingMetrics.incBytesRead(inputMetrics.bytesRead)
val iter = blockResult.data.asInstanceOf[Iterator[T]]
new InterruptibleIterator[T](context, iter) {
override def next(): T = {
existingMetrics.incRecordsRead(1)
delegate.next()
}
}
case None =>
// Acquire a lock for loading this partition
// If another thread already holds the lock, wait for it to finish return its results
val storedValues = acquireLockForPartition[T](key)
if (storedValues.isDefined) {
return new InterruptibleIterator[T](context, storedValues.get)
}
// Otherwise, we have to load the partition ourselves
try {
logInfo(s"Partition $key not found, computing it")
//判断是否进行了checkPoint操作,如果没有则进行计算
val computedValues = rdd.computeOrReadCheckpoint(partition, context)
// If the task is running locally, do not persist the result
if (context.isRunningLocally) {
return computedValues
}
// Otherwise, cache the values and keep track of any updates in block statuses
val updatedBlocks = new ArrayBuffer[(BlockId, BlockStatus)]
//缓存计算结果,默认MEMORY_AND_DISK级别
val cachedValues = putInBlockManager(key, computedValues, storageLevel, updatedBlocks)
val metrics = context.taskMetrics
val lastUpdatedBlocks = metrics.updatedBlocks.getOrElse(Seq[(BlockId, BlockStatus)]())
metrics.updatedBlocks = Some(lastUpdatedBlocks ++ updatedBlocks.toSeq)
new InterruptibleIterator(context, cachedValues)
} finally {
loading.synchronized {
loading.remove(key)
loading.notifyAll()
}
}
}
}
putInBlockManager方法源码如下:
/**
* Cache the values of a partition, keeping track of any updates in the storage statuses of
* other blocks along the way.
*
* The effective storage level refers to the level that actually specifies BlockManager put
* behavior, not the level originally specified by the user. This is mainly for forcing a
* MEMORY_AND_DISK partition to disk if there is not enough room to unroll the partition,
* while preserving the the original semantics of the RDD as specified by the application.
*/
private def putInBlockManager[T](
key: BlockId,
values: Iterator[T],
level: StorageLevel,
updatedBlocks: ArrayBuffer[(BlockId, BlockStatus)],
effectiveStorageLevel: Option[StorageLevel] = None): Iterator[T] = {
val putLevel = effectiveStorageLevel.getOrElse(level)
//如果没有缓存到内存中,则进行计算,并且作为BlockManager的一个iterator,而不是展现在内存中
if (!putLevel.useMemory) {
/*
* This RDD is not to be cached in memory, so we can just pass the computed values as an
* iterator directly to the BlockManager rather than first fully unrolling it in memory.
*/
updatedBlocks ++=
blockManager.putIterator(key, values, level, tellMaster = true, effectiveStorageLevel)
blockManager.get(key) match {
case Some(v) => v.data.asInstanceOf[Iterator[T]]
case None =>
logInfo(s"Failure to store $key")
throw new BlockException(key, s"Block manager failed to return cached value for $key!")
}
} else {
/*
* This RDD is to be cached in memory. In this case we cannot pass the computed values
* to the BlockManager as an iterator and expect to read it back later. This is because
* we may end up dropping a partition from memory store before getting it back.
*
* In addition, we must be careful to not unroll the entire partition in memory at once.
* Otherwise, we may cause an OOM exception if the JVM does not have enough space for this
* single partition. Instead, we unroll the values cautiously, potentially aborting and
* dropping the partition to disk if applicable.
*/
//如果RDD缓存到内存中了,这时不需要进行计算,需要读取缓存的RDD之后返回,否则可能因为在读取返回之前将其删除导致RDD
//丢失。另外,不能将整个partition展现在内存中,否则可能会出现OOM,可进行适当刷新数据到磁盘上
blockManager.memoryStore.unrollSafely(key, values, updatedBlocks) match {
case Left(arr) =>
// We have successfully unrolled the entire partition, so cache it in memory
updatedBlocks ++=
blockManager.putArray(key, arr, level, tellMaster = true, effectiveStorageLevel)
arr.iterator.asInstanceOf[Iterator[T]]
case Right(it) =>
// There is not enough space to cache this partition in memory
//没有足够的内存写入磁盘
val returnValues = it.asInstanceOf[Iterator[T]]
if (putLevel.useDisk) {
logWarning(s"Persisting partition $key to disk instead.")
val diskOnlyLevel = StorageLevel(useDisk = true, useMemory = false,
useOffHeap = false, deserialized = false, putLevel.replication)
putInBlockManager[T](key, returnValues, level, updatedBlocks, Some(diskOnlyLevel))
} else {
returnValues
}
}
}
}
这时进入BlockManager中在putArray中调用doPut方法:
/**
* Put a new block of values to the block manager.
* Return a list of blocks updated as a result of this put.
*/
def putArray(
blockId: BlockId,
values: Array[Any],
level: StorageLevel,
tellMaster: Boolean = true,
effectiveStorageLevel: Option[StorageLevel] = None): Seq[(BlockId, BlockStatus)] = {
require(values != null, "Values is null")
//调用doPut方法
doPut(blockId, ArrayValues(values), level, tellMaster, effectiveStorageLevel)
}
在doPut方法中,如果replicate大于1,则调用replicate方法进行备份,然后缓存数据到内存,tachyon或者磁盘中,最后向Master报告每一个Block的信息。
// If we're storing bytes, then initiate the replication before storing them locally.
// This is faster as data is already serialized and ready to send.
val replicationFuture = data match {
//如果备份数目大于1,调用replicate函数将数据备份到其他节点
case b: ByteBufferValues if putLevel.replication > 1 =>
// Duplicate doesn't copy the bytes, but just creates a wrapper
val bufferView = b.buffer.duplicate()
Future { replicate(blockId, bufferView, putLevel) }
case _ => null
}
// Keep track of which blocks are dropped from memory
if (putLevel.useMemory) {
result.droppedBlocks.foreach { updatedBlocks += _ }
}
val putBlockStatus = getCurrentBlockStatus(blockId, putBlockInfo)
if (putBlockStatus.storageLevel != StorageLevel.NONE) {
// Now that the block is in either the memory, tachyon, or disk store,
// let other threads read it, and tell the master about it.
marked = true
putBlockInfo.markReady(size)
if (tellMaster) {
//将数据缓存到内存,tachyon或者磁盘上之后向master报告并且 写入每一个block的信息
reportBlockStatus(blockId, putBlockInfo, putBlockStatus)
}
updatedBlocks += ((blockId, putBlockStatus))
}
reportBlockStatus方法源码如下:
/**
* Tell the master about the current storage status of a block. This will send a block update
* message reflecting the current status, *not* the desired storage level in its block info.
* For example, a block with MEMORY_AND_DISK set might have fallen out to be only on disk.
*
* droppedMemorySize exists to account for when the block is dropped from memory to disk (so
* it is still valid). This ensures that update in master will compensate for the increase in
* memory on slave.
*/
private def reportBlockStatus(
blockId: BlockId,
info: BlockInfo,
status: BlockStatus,
droppedMemorySize: Long = 0L): Unit = {
val needReregister = !tryToReportBlockStatus(blockId, info, status, droppedMemorySize)
if (needReregister) {
logInfo(s"Got told to re-register updating block $blockId")
// Re-registering will report our new block for free.
asyncReregister()
}
logDebug(s"Told master about block $blockId")
}
数据读取的入口是get函数,首先尝试从本地获取数据,如果数据不在本地则从远程获取:
/**
* Get a block from the block manager (either local or remote).
*/
def get(blockId: BlockId): Option[BlockResult] = {
//首先尝试从本地获取,如果数据在本地返回,否则从远程拉取数据
val local = getLocal(blockId)
if (local.isDefined) {
logInfo(s"Found block $blockId locally")
return local
}
val remote = getRemote(blockId)
if (remote.isDefined) {
logInfo(s"Found block $blockId remotely")
return remote
}
None
}
获取本地数据时,首先尝试从内存中获取,接着到堆外内存中尝试或者,最后尝试去磁盘中读取数据。
远程获取数据调用路径为getRemote ---> doGetRemote ---> BlockTransferService.fetchBlockSync
fetchBlockSync方法的源码如下,通过BlockFetchingListener监视器来得知获取数据是否成功:
/**
* A special case of [[fetchBlocks]], as it fetches only one block and is blocking.
*
* It is also only available after [[init]] is invoked.
*/
def fetchBlockSync(host: String, port: Int, execId: String, blockId: String): ManagedBuffer = {
// A monitor for the thread to wait on.
val result = Promise[ManagedBuffer]()
fetchBlocks(host, port, execId, Array(blockId),
new BlockFetchingListener {
//获取数据失败
override def onBlockFetchFailure(blockId: String, exception: Throwable): Unit = {
result.failure(exception)
}
//获取数据成功
override def onBlockFetchSuccess(blockId: String, data: ManagedBuffer): Unit = {
val ret = ByteBuffer.allocate(data.size.toInt)
ret.put(data.nioByteBuffer())
ret.flip()
result.success(new NioManagedBuffer(ret))
}
})
Await.result(result.future, Duration.Inf)
}
至此,关于Spark的存储机制的源码剖析结束,如有任何问题,欢迎留言讨论。