Outline of this post:
- Block read flow
- Reading from memory
- Reading from disk
- Reading from remote nodes
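1. Block read flow
BlockManager's get method is the entry point for reading a block, and it branches into a local read and a remote read. The local read goes through getLocalValues, which, depending on the storage level, reads the data with MemoryStore.getValues (or MemoryStore.getBytes for serialized blocks) or with DiskStore.getBytes.
The remote read goes through getRemoteValues, which calls fetchBlockSync on BlockTransferService, the remote data transfer service, to fetch the data.
The sections below walk through the complete read flow in that order: memory, disk, then remote.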
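The two branches of get can be pictured with a minimal sketch. It assumes the local read is tried first and the remote read only on a miss; Result, localBlocks and remoteBlocks are hypothetical stand-ins, not Spark types.

```scala
object BlockReadSketch {
  // Hypothetical result type: the values plus how they were read.
  final case class Result(values: Iterator[Any], readMethod: String)

  // Hypothetical stand-ins for the local stores and for blocks held by other nodes.
  val localBlocks: Map[String, Seq[Any]] = Map("rdd_0_0" -> Seq(1, 2, 3))
  val remoteBlocks: Map[String, Seq[Any]] = Map("rdd_0_1" -> Seq(4, 5))

  def getLocalValues(blockId: String): Option[Result] =
    localBlocks.get(blockId).map(v => Result(v.iterator, "local"))

  def getRemoteValues(blockId: String): Option[Result] =
    remoteBlocks.get(blockId).map(v => Result(v.iterator, "network"))

  // orElse takes its argument by name, so the remote read runs only on a local miss.
  def get(blockId: String): Option[Result] =
    getLocalValues(blockId).orElse(getRemoteValues(blockId))
}
```

Calling BlockReadSketch.get("rdd_0_1") misses locally and falls through to the remote branch.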
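2. Reading from memory
getLocalValues reads in-memory data differently depending on whether the block is cached in deserialized form. If it is, it calls MemoryStore's getValues; if the block is stored as serialized bytes, it calls MemoryStore's getBytes and deserializes the result.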
BlockManager # getLocalValues:
if (level.useMemory && memoryStore.contains(blockId)) {
  // Deserialized storage level: read the values straight from memory
  val iter: Iterator[Any] = if (level.deserialized) {
    memoryStore.getValues(blockId).get
  } else {
    // Otherwise read the serialized bytes and deserialize them
    serializerManager.dataDeserializeStream(
      blockId, memoryStore.getBytes(blockId).get.toInputStream())(info.classTag)
  }
  // We need to capture the current taskId in case the iterator completion is triggered
  // from a different thread which does not have TaskContext set; see SPARK-18406 for
  // discussion.
  // Return the data along with the block size, read method, etc.
  val ci = CompletionIterator[Any, Iterator[Any]](iter, {
    releaseLock(blockId, taskAttemptId)
  })
  Some(new BlockResult(ci, DataReadMethod.Memory, info.size))
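The CompletionIterator wrapper is what ties releasing the lock to the end of iteration. The following is a minimal sketch of that idea, not Spark's implementation; CompletingIterator and readAll are hypothetical names, and a boolean flag stands in for the block lock.

```scala
object CompletionSketch {
  // Wraps an iterator and runs `onComplete` exactly once, when it is exhausted.
  final class CompletingIterator[A](underlying: Iterator[A], onComplete: () => Unit)
      extends Iterator[A] {
    private var completed = false

    override def hasNext: Boolean = {
      val more = underlying.hasNext
      if (!more && !completed) {
        completed = true
        onComplete()
      }
      more
    }

    override def next(): A = underlying.next()
  }

  // Reads every value of a "block" and reports whether the lock was released afterwards.
  def readAll(values: Seq[Int]): (List[Int], Boolean) = {
    var lockReleased = false // stands in for the block's read lock
    val it = new CompletingIterator[Int](values.iterator, () => lockReleased = true)
    val read = it.toList
    (read, lockReleased)
  }
}
```

The caller never releases the lock explicitly; consuming the iterator to the end is enough.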
In MemoryStore, both getValues and getBytes look up the in-memory block by its BlockId.
MemoryStore # getValues:
def getValues(blockId: BlockId): Option[Iterator[_]] = {
  val entry = entries.synchronized { entries.get(blockId) }
  entry match {
    case null => None
    case e: SerializedMemoryEntry[_] =>
      throw new IllegalArgumentException("should only call getValues on deserialized blocks")
    case DeserializedMemoryEntry(values, _, _) =>
      val x = Some(values)
      x.map(_.iterator)
  }
}
MemoryStore # getBytes:
def getBytes(blockId: BlockId): Option[ChunkedByteBuffer] = {
  val entry = entries.synchronized { entries.get(blockId) }
  entry match {
    case null => None
    case e: DeserializedMemoryEntry[_] =>
      throw new IllegalArgumentException("should only call getBytes on serialized blocks")
    case SerializedMemoryEntry(bytes, _, _) => Some(bytes)
  }
}
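A look at entries shows that it is simply a LinkedHashMap, so every block cached in memory lives in that map.
private val entries = new LinkedHashMap[BlockId, MemoryEntry[_]](32, 0.75f, true)
The third constructor argument (accessOrder = true) makes the map iterate in access order rather than insertion order: the least recently accessed entry comes first and the most recently accessed one last. When memory runs short, eviction walks the map in that order, so the least recently used blocks are the first candidates for removal.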
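The effect of the third constructor argument is easy to check with java.util.LinkedHashMap directly. In this small sketch, orderAfterTouch is a hypothetical helper that inserts three keys, reads the first one, and returns the resulting iteration order.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap}

object AccessOrderSketch {
  // Inserts a, b, c, then reads "a", and returns the keys in iteration order.
  def orderAfterTouch(accessOrder: Boolean): List[String] = {
    val entries = new JLinkedHashMap[String, Integer](32, 0.75f, accessOrder)
    entries.put("a", Integer.valueOf(1))
    entries.put("b", Integer.valueOf(2))
    entries.put("c", Integer.valueOf(3))
    entries.get("a") // with accessOrder = true, a read moves the entry to the end
    val it = entries.keySet().iterator()
    var keys = List.empty[String]
    while (it.hasNext) keys = keys :+ it.next()
    keys
  }
}
```

With accessOrder = true the order is b, c, a, so "b" is now the oldest entry; with accessOrder = false the map keeps the insertion order a, b, c.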
3. Reading from disk
In getLocalValues, when the storage level uses disk, the data is read with DiskStore's getBytes method.
BlockManager # getLocalValues:
else if (level.useDisk && diskStore.contains(blockId)) {
  // Read from disk; data is stored on disk in serialized form, so what comes back is serialized too
  val diskData = diskStore.getBytes(blockId)
  val iterToReturn: Iterator[Any] = {
    if (level.deserialized) {
      // If the storage level is deserialized, deserialize first, then cache the values
      // in memory if level.useMemory is set
      val diskValues = serializerManager.dataDeserializeStream(
        blockId,
        diskData.toInputStream())(info.classTag)
      maybeCacheDiskValuesInMemory(info, blockId, level, diskValues)
    } else {
      // Otherwise cache the serialized bytes in memory if level.useMemory is set
      val stream = maybeCacheDiskBytesInMemory(info, blockId, level, diskData)
        .map { _.toInputStream(dispose = false) }
        .getOrElse { diskData.toInputStream() }
      // The returned data still has to be deserialized
      serializerManager.dataDeserializeStream(blockId, stream)(info.classTag)
    }
  }
  val ci = CompletionIterator[Any, Iterator[Any]](iterToReturn, {
    releaseLockAndDispose(blockId, diskData, taskAttemptId)
  })
  // Return the data along with the block size, read method, etc.
  Some(new BlockResult(ci, DataReadMethod.Disk, info.size))
The key part is DiskStore # getBytes:
def getBytes(blockId: BlockId): BlockData = {
  val file = diskManager.getFile(blockId.name)
  val blockSize = getSize(blockId)
  securityManager.getIOEncryptionKey() match {
    case Some(key) =>
      // Encrypted blocks cannot be memory mapped; return a special object that does decryption
      // and provides InputStream / FileRegion implementations for reading the data.
      new EncryptedBlockData(file, blockSize, conf, key)
    case _ =>
      new DiskBlockData(minMemoryMapBytes, maxMemoryMapBytes, file, blockSize)
  }
}
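3.1 Locating the disk storage directory
DiskStore relies on DiskBlockManager to maintain the mapping between blocks and their files on disk, which is how a block ends up stored in a file.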
val file = diskManager.getFile(blockId.name)
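DiskBlockManager takes the local root directories configured through LOCAL_DIRS (YARN mode), or through SPARK_LOCAL_DIRS or spark.local.dir (other modes, defaulting to System.getProperty("java.io.tmpdir")). Several roots may be configured, separated by commas. Under each configured root it creates a root directory in which DiskStore keeps its blocks, so there may be several of these as well:
- …/blockmgr-UUID.randomUUID.toString (YARN mode)
- …/spark-UUID.randomUUID.toString/blockmgr-UUID.randomUUID.toString (other modes)
In addition, DiskBlockManager uses conf.getInt("spark.diskStore.subDirectories", 64) subdirectories under each root directory to hold block files. Each block is hashed by its name to one of these subdirectories and stored in a file named after the block.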
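The name-to-subdirectory mapping can be sketched as follows. This mirrors what DiskBlockManager.getFile is understood to do, but the exact hashing and the two-digit hexadecimal directory names are assumptions here, and SubDirSketch, nonNegativeHash and locate are illustrative names.

```scala
import java.io.File

object SubDirSketch {
  // Non-negative hash of the file name (Int.MinValue has no positive counterpart).
  def nonNegativeHash(name: String): Int = {
    val h = name.hashCode
    if (h != Int.MinValue) math.abs(h) else 0
  }

  // Pick a root directory first, then one of its numbered subdirectories.
  def locate(roots: Array[File], subDirsPerRoot: Int, name: String): File = {
    val hash = nonNegativeHash(name)
    val rootIdx = hash % roots.length
    val subIdx = (hash / roots.length) % subDirsPerRoot
    new File(new File(roots(rootIdx), "%02x".format(subIdx)), name)
  }
}
```

Because the result depends only on the block name and the directory layout, the same block always resolves to the same file, and blocks spread across roots and subdirectories without any lookup table.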
For the details, see how DiskBlockManager's localDirs field is initialized:
private[spark] val localDirs: Array[File] = createLocalDirs(conf)
DiskBlockManager # createLocalDirs :
/**
 * Create local directories for storing block data. These directories are
 * located inside configured local directories and won't
 * be deleted on JVM exit when using the external shuffle service.
 */
private def createLocalDirs(conf: SparkConf): Array[File] = {
  Utils.getConfiguredLocalDirs(conf).flatMap { rootDir =>
    try {
      val localDir = Utils.createDirectory(rootDir, "blockmgr")
      logInfo(s"Created local directory at $localDir")
      Some(localDir)
    } catch {
      case e: IOException =>
        logError(s"Failed to create local dir in $rootDir. Ignoring this directory.", e)
        None
    }
  }
}
Utils # getConfiguredLocalDirs:
def getConfiguredLocalDirs(conf: SparkConf): Array[String] = {
  val shuffleServiceEnabled = conf.getBoolean("spark.shuffle.service.enabled", false)
  if (isRunningInYarnContainer(conf)) {
    // If we are in yarn mode, systems can have different disk layouts so we must set it
    // to what Yarn on this system said was available. Note this assumes that Yarn has
    // created the directories already, and that they are secured so that only the
    // user has access to them.
    getYarnLocalDirs(conf).split(",")
  } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
    conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
  } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
    conf.getenv("SPARK_LOCAL_DIRS").split(",")
  } else if (conf.getenv("MESOS_DIRECTORY") != null && !shuffleServiceEnabled) {
    // Mesos already creates a directory per Mesos task. Spark should use that directory
    // instead so all temporary files are automatically cleaned up when the Mesos task ends.
    // Note that we don't want this if the shuffle service is enabled because we want to
    // continue to serve shuffle files after the executors that wrote them have already exited.
    Array(conf.getenv("MESOS_DIRECTORY"))
  } else {
    if (conf.getenv("MESOS_DIRECTORY") != null && shuffleServiceEnabled) {
      logInfo("MESOS_DIRECTORY available but not using provided Mesos sandbox because " +
        "spark.shuffle.service.enabled is enabled.")
    }
    // In non-Yarn mode (or for the driver in yarn-client mode), we cannot trust the user
    // configuration to point to a secure directory. So create a subdirectory with restricted
    // permissions under each listed directory.
    conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")).split(",")
  }
}
3.2 Reading a block from its file
See the DiskBlockData source:
private class DiskBlockData(
    minMemoryMapBytes: Long,
    maxMemoryMapBytes: Long,
    file: File,
    blockSize: Long) extends BlockData {

  override def toInputStream(): InputStream = new FileInputStream(file)

  /**
   * Returns a Netty-friendly wrapper for the block's data.
   *
   * Please see `ManagedBuffer.convertToNetty()` for more details.
   */
  override def toNetty(): AnyRef = new DefaultFileRegion(file, 0, size)

  override def toChunkedByteBuffer(allocator: (Int) => ByteBuffer): ChunkedByteBuffer = {
    Utils.tryWithResource(open()) { channel =>
      var remaining = blockSize
      val chunks = new ListBuffer[ByteBuffer]()
      while (remaining > 0) {
        val chunkSize = math.min(remaining, maxMemoryMapBytes)
        val chunk = allocator(chunkSize.toInt)
        remaining -= chunkSize
        JavaUtils.readFully(channel, chunk)
        chunk.flip()
        chunks += chunk
      }
      new ChunkedByteBuffer(chunks.toArray)
    }
  }

  override def toByteBuffer(): ByteBuffer = {
    require(blockSize < maxMemoryMapBytes,
      s"can't create a byte buffer of size $blockSize" +
      s" since it exceeds ${Utils.bytesToString(maxMemoryMapBytes)}.")
    Utils.tryWithResource(open()) { channel =>
      if (blockSize < minMemoryMapBytes) {
        // For small files, directly read rather than memory map.
        val buf = ByteBuffer.allocate(blockSize.toInt)
        JavaUtils.readFully(channel, buf)
        buf.flip()
        buf
      } else {
        channel.map(MapMode.READ_ONLY, 0, file.length)
      }
    }
  }

  override def size: Long = blockSize

  override def dispose(): Unit = {}

  private def open() = new FileInputStream(file).getChannel
}
DiskBlockData exposes three ways to read the data: toInputStream, toChunkedByteBuffer, and toByteBuffer.
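The choice toByteBuffer makes between a plain read and a memory map can be tried out with the JDK alone. This is a sketch under simplifying assumptions: readFile and roundTrip are hypothetical helpers, the threshold is passed in explicitly rather than read from Spark's configuration, and a temporary file plays the role of the block file.

```scala
import java.io.File
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.channels.FileChannel.MapMode
import java.nio.file.{Files, StandardOpenOption}

object DiskReadSketch {
  // Reads a whole file: plain read below the threshold, memory mapping otherwise.
  def readFile(file: File, minMemoryMapBytes: Long): ByteBuffer = {
    val channel = FileChannel.open(file.toPath, StandardOpenOption.READ)
    try {
      val size = channel.size()
      if (size < minMemoryMapBytes) {
        val buf = ByteBuffer.allocate(size.toInt)
        // Keep reading until the buffer is full or the channel reports end of file.
        while (buf.hasRemaining && channel.read(buf) >= 0) {}
        buf.flip()
        buf
      } else {
        channel.map(MapMode.READ_ONLY, 0, size)
      }
    } finally {
      channel.close()
    }
  }

  // Writes a temporary file and reads it back through both code paths.
  def roundTrip(payload: Array[Byte]): (Array[Byte], Array[Byte]) = {
    val file = File.createTempFile("block-sketch", ".bin")
    try {
      Files.write(file.toPath, payload)
      def toArray(b: ByteBuffer): Array[Byte] = {
        val out = new Array[Byte](b.remaining())
        b.get(out)
        out
      }
      val direct = toArray(readFile(file, Long.MaxValue)) // always below the threshold
      val mapped = toArray(readFile(file, 0L))            // always memory mapped
      (direct, mapped)
    } finally {
      file.delete()
    }
  }
}
```

Both paths return the same bytes; the threshold only decides whether the cost of setting up a mapping is worth paying for a given file size.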
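4. Reading from remote nodes
To read data from a remote node, Spark relies on its Netty-based RPC framework. Two classes matter here:
- NettyBlockTransferService: implements the data access interface used by the shuffle and storage modules. When it receives a read or write command, it sends a message over the Netty RPC framework to the target node, asking it to perform the operation.
- NettyBlockRpcServer: an RPC listener is started when an Executor starts, and incoming messages are handed to this class for processing. There are two kinds of message: OpenBlocks for reading data and UploadBlock for writing data.
4.1 Finding the block's locations
The entry point is BlockManager # getRemoteValues, which calls getRemoteBytes. getRemoteBytes in turn calls getLocationsAndStatus, which sends a GetLocationsAndStatus message to BlockManagerMasterEndpoint to ask for the block's locations and status.
BlockManager # getRemoteValues: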
/**
 * Get block from remote block managers.
 *
 * This does not acquire a lock on this block in this JVM.
 */
private def getRemoteValues[T: ClassTag](blockId: BlockId): Option[BlockResult] = {
  val ct = implicitly[ClassTag[T]]
  getRemoteBytes(blockId).map { data =>
    val values =
      serializerManager.dataDeserializeStream(blockId, data.toInputStream(dispose = true))(ct)
    new BlockResult(values, DataReadMethod.Network, data.size)
  }
}
BlockManagerMaster # getLocationsAndStatus:
/** Get locations as well as status of the blockId from the driver */
def getLocationsAndStatus(blockId: BlockId): Option[BlockLocationsAndStatus] = {
  driverEndpoint.askSync[Option[BlockLocationsAndStatus]](
    GetLocationsAndStatus(blockId))
}
Once it has the list of block locations, BlockManager's getRemoteBytes calls BlockTransferService's fetchBlockSync.
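Since a block may have several locations, a natural pattern is to try them one after another until a fetch succeeds. The sketch below illustrates only that pattern; it assumes a failed fetch throws, Location and fetchFromAny are hypothetical names, and the real getRemoteBytes carries more bookkeeping than is shown here.

```scala
import scala.util.{Success, Try}

object LocationFallbackSketch {
  // Hypothetical address of a node that holds a copy of the block.
  final case class Location(host: String, port: Int)

  // Tries each location in order and returns the first successful fetch.
  // The iterator is lazy, so later locations are not contacted after a success.
  def fetchFromAny(
      locations: Seq[Location],
      fetch: Location => Array[Byte]): Option[Array[Byte]] =
    locations.iterator
      .map(loc => Try(fetch(loc)))
      .collectFirst { case Success(bytes) => bytes }
}
```

If every location fails, the result is None and the caller sees the block as unavailable.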
4.2 Sending an OpenBlocks message to the node that holds the block
BlockTransferService's fetchBlockSync calls fetchBlocks, which is implemented by NettyBlockTransferService.
BlockTransferService # fetchBlockSync:
/**
 * A special case of [[fetchBlocks]], as it fetches only one block and is blocking.
 *
 * It is also only available after [[init]] is invoked.
 */
def fetchBlockSync(
    host: String,
    port: Int,
    execId: String,
    blockId: String,
    tempFileManager: TempFileManager): ManagedBuffer = {
  // A monitor for the thread to wait on.
  val result = Promise[ManagedBuffer]()
  fetchBlocks(host, port, execId, Array(blockId),
    new BlockFetchingListener {
      override def onBlockFetchFailure(blockId: String, exception: Throwable): Unit = {
        result.failure(exception)
      }
      override def onBlockFetchSuccess(blockId: String, data: ManagedBuffer): Unit = {
        data match {
          case f: FileSegmentManagedBuffer =>
            result.success(f)
          case _ =>
            val ret = ByteBuffer.allocate(data.size.toInt)
            ret.put(data.nioByteBuffer())
            ret.flip()
            result.success(new NioManagedBuffer(ret))
        }
      }
    }, tempFileManager)
  ThreadUtils.awaitResult(result.future, Duration.Inf)
}
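fetchBlockSync turns a callback-based fetch into a blocking call by completing a Promise from the listener and waiting on its future. The same pattern in isolation looks roughly like this; Listener and fetchAsync are hypothetical stand-ins for BlockFetchingListener and fetchBlocks, and the five-second timeout is an arbitrary choice for the sketch (the source waits with Duration.Inf).

```scala
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

object SyncFetchSketch {
  // Hypothetical callback interface, standing in for BlockFetchingListener.
  trait Listener {
    def onSuccess(blockId: String, data: Array[Byte]): Unit
    def onFailure(blockId: String, error: Throwable): Unit
  }

  // Hypothetical asynchronous fetch: invokes the listener from another thread.
  def fetchAsync(blockId: String, listener: Listener): Unit = {
    val t = new Thread(new Runnable {
      override def run(): Unit =
        if (blockId.nonEmpty) listener.onSuccess(blockId, blockId.getBytes("UTF-8"))
        else listener.onFailure(blockId, new IllegalArgumentException("empty block id"))
    })
    t.setDaemon(true)
    t.start()
  }

  // Blocks the caller until one of the callbacks completes the promise.
  def fetchSync(blockId: String): Array[Byte] = {
    val result = Promise[Array[Byte]]()
    fetchAsync(blockId, new Listener {
      override def onSuccess(blockId: String, data: Array[Byte]): Unit = result.success(data)
      override def onFailure(blockId: String, error: Throwable): Unit = result.failure(error)
    })
    Await.result(result.future, 5.seconds)
  }
}
```

A failure reported through the listener resurfaces in the waiting thread as an exception thrown by Await.result.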
NettyBlockTransferService # fetchBlocks:
override def fetchBlocks(
    host: String,
    port: Int,
    execId: String,
    blockIds: Array[String],
    listener: BlockFetchingListener,
    tempFileManager: TempFileManager): Unit = {
  logTrace(s"Fetch blocks from $host:$port (executor id $execId)")
  try {
    val blockFetchStarter = new RetryingBlockFetcher.BlockFetchStarter {
      override def createAndStart(
          blockIds: Array[String],
          listener: BlockFetchingListener): Unit = {
        // Create a client for the remote node's host and port
        val client = clientFactory.createClient(host, port)
        // Use that client to send the read request to the node
        new OneForOneBlockFetcher(client, appId, execId, blockIds, listener,
          transportConf, tempFileManager).start()
      }
    }
    val maxRetries = transportConf.maxIORetries()
    if (maxRetries > 0) {
      // Note this Fetcher will correctly handle maxRetries == 0; we avoid it just in case there's
      // a bug in this code. We should remove the if statement once we're sure of the stability.
      new RetryingBlockFetcher(transportConf, blockFetchStarter, blockIds, listener).start()
    } else {
      blockFetchStarter.createAndStart(blockIds, listener)
    }
  } catch {
    case e: Exception =>
      logError("Exception while beginning fetchBlocks", e)
      blockIds.foreach(listener.onBlockFetchFailure(_, e))
  }
}
fetchBlocks creates a TransportClient for the remote node's host and port, then uses that client to send the read request to the node.
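When maxRetries is greater than zero, the fetch is wrapped in a RetryingBlockFetcher. The general shape of such a wrapper can be sketched as below, assuming that only I/O failures are retried; withRetries is a hypothetical helper, and the real fetcher is more involved than this (it is asynchronous, for one).

```scala
import java.io.IOException

object RetrySketch {
  // Runs `attempt` once, then retries up to `maxRetries` more times on IOException.
  // `attempt` receives the zero-based attempt number.
  @annotation.tailrec
  def withRetries[A](maxRetries: Int, attemptNo: Int = 0)(attempt: Int => A): A = {
    val outcome: Either[IOException, A] =
      try Right(attempt(attemptNo))
      catch { case e: IOException => Left(e) }
    outcome match {
      case Right(value) => value
      case Left(e) if attemptNo >= maxRetries => throw e // retries exhausted
      case Left(_) => withRetries(maxRetries, attemptNo + 1)(attempt)
    }
  }
}
```

With maxRetries = 0 the wrapper degenerates to a single attempt, which matches the remark in the source comment that the retrying fetcher also handles that case.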
The message itself is sent in OneForOneBlockFetcher's start method.
public void start() {
  if (blockIds.length == 0) {
    throw new IllegalArgumentException("Zero-sized blockIds array");
  }
  client.sendRpc(openMessage.toByteBuffer(), new RpcResponseCallback() {
    @Override
    public void onSuccess(ByteBuffer response) {
      ...
    }
    @Override
    public void onFailure(Throwable e) {
      ...
    }
  });
}
openMessage is of type OpenBlocks.
this.openMessage = new OpenBlocks(appId, execId, blockIds);
4.3 The remote node responds and transfers the requested blocks
The remote node listens for messages; when one arrives, NettyBlockRpcServer matches on its type.
override def receive(
    client: TransportClient,
    rpcMessage: ByteBuffer,
    responseContext: RpcResponseCallback): Unit = {
  val message = BlockTransferMessage.Decoder.fromByteBuffer(rpcMessage)
  logTrace(s"Received request: $message")
  message match {
    case openBlocks: OpenBlocks =>
      val blocksNum = openBlocks.blockIds.length
      val blocks = for (i <- (0 until blocksNum).view)
        yield blockManager.getBlockData(BlockId.apply(openBlocks.blockIds(i)))
      // Register the ManagedBuffers as a stream to be transferred over Netty
      val streamId = streamManager.registerStream(appId, blocks.iterator.asJava)
      logTrace(s"Registered streamId $streamId with $blocksNum buffers")
      responseContext.onSuccess(new StreamHandle(streamId, blocksNum).toByteBuffer)
    case uploadBlock: UploadBlock =>
      // StorageLevel and ClassTag are serialized as bytes using our JavaSerializer.
      val (level: StorageLevel, classTag: ClassTag[_]) = {
        serializer
          .newInstance()
          .deserialize(ByteBuffer.wrap(uploadBlock.metadata))
          .asInstanceOf[(StorageLevel, ClassTag[_])]
      }
      val data = new NioManagedBuffer(ByteBuffer.wrap(uploadBlock.blockData))
      val blockId = BlockId(uploadBlock.blockId)
      blockManager.putBlockData(blockId, data, level, classTag)
      responseContext.onSuccess(ByteBuffer.allocate(0))
  }
}
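As the source above shows, when the message is an OpenBlocks, BlockManager's getBlockData is called to read the data on that node. Each block is wrapped in a ManagedBuffer and then sent back over the Netty channel to the requesting node, which completes the transfer.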
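The shape of this exchange, where the reply to OpenBlocks is only a stream handle and the data follows chunk by chunk, can be modelled in a few lines. Everything in this sketch is a simplified stand-in: the message and reply types merely share names with the Spark classes, Server is hypothetical, and an in-memory map replaces the BlockManager and the Netty stream manager.

```scala
import java.util.concurrent.atomic.AtomicLong
import scala.collection.concurrent.TrieMap

object RpcDispatchSketch {
  sealed trait Message
  final case class OpenBlocks(blockIds: Seq[String]) extends Message
  final case class UploadBlock(blockId: String, data: Array[Byte]) extends Message

  sealed trait Reply
  final case class StreamHandle(streamId: Long, numChunks: Int) extends Reply
  case object Ack extends Reply

  final class Server(initial: Map[String, Array[Byte]]) {
    private val store = TrieMap(initial.toSeq: _*)
    private val streams = TrieMap.empty[Long, Iterator[Array[Byte]]]
    private val nextStreamId = new AtomicLong(0)

    def receive(message: Message): Reply = message match {
      case OpenBlocks(ids) =>
        // Blocks are looked up lazily; the reply carries only a handle, not the data.
        val chunks = ids.iterator.map(id => store(id))
        val streamId = nextStreamId.getAndIncrement()
        streams.put(streamId, chunks)
        StreamHandle(streamId, ids.length)
      case UploadBlock(id, data) =>
        store.put(id, data)
        Ack
    }

    // The requester pulls chunks one at a time using the handle it received.
    def nextChunk(streamId: Long): Option[Array[Byte]] =
      streams.get(streamId).filter(_.hasNext).map(_.next())
  }
}
```

Keeping the handle separate from the data lets the server answer the RPC immediately and stream each block only when the requester asks for it.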
BlockManager # getBlockData:
/**
 * Interface to get local block data. Throws an exception if the block cannot be found or
 * cannot be read successfully.
 */
override def getBlockData(blockId: BlockId): ManagedBuffer = {
  if (blockId.isShuffle) {
    shuffleManager.shuffleBlockResolver.getBlockData(blockId.asInstanceOf[ShuffleBlockId])
  } else {
    getLocalBytes(blockId) match {
      case Some(blockData) =>
        new BlockManagerManagedBuffer(blockInfoManager, blockId, blockData, true)
      case None =>
        // If this block manager receives a request for a block that it doesn't have then it's
        // likely that the master has outdated block statuses for this block. Therefore, we send
        // an RPC so that this block is marked as being unavailable from this block manager.
        reportBlockStatus(blockId, BlockStatus.empty)
        throw new BlockNotFoundException(blockId.toString)
    }
  }
}