Spark Storage Analysis - Storage Architecture

We analyze Spark's storage management from two angles:

1. The definition of a Block in Spark storage management

2. The BlockManager storage architecture

The Definition of a Block in Spark Storage Management

Spark's storage module is responsible for all storage in a Spark computation, both disk-based and memory-based. The smallest unit of storage management is the Block; an RDD block, for instance, corresponds to one partition of an RDD. Blocks are looked up and managed through a unique Block ID: a block is uniquely identified within a Spark application, and across multiple running applications. As the code shows, BlockId has nine concrete subclasses: RDDBlockId, ShuffleBlockId, ShuffleDataBlockId, ShuffleIndexBlockId, BroadcastBlockId, TaskResultBlockId, StreamBlockId, TempLocalBlockId, and TempShuffleBlockId (a short usage sketch follows the definitions):

@DeveloperApi
case class RDDBlockId(rddId: Int, splitIndex: Int) extends BlockId {
  override def name: String = "rdd_" + rddId + "_" + splitIndex
}
 
// Format of the shuffle block ids (including data and index) should be kept in sync with
// org.apache.spark.network.shuffle.ExternalShuffleBlockResolver#getBlockData().
@DeveloperApi
case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId {
  override def name: String = "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId
}
 
@DeveloperApi
case class ShuffleDataBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId {
  override def name: String = "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId + ".data"
}
 
@DeveloperApi
case class ShuffleIndexBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId {
  override def name: String = "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId + ".index"
}
 
@DeveloperApi
case class BroadcastBlockId(broadcastId: Long, field: String = "") extends BlockId {
  override def name: String = "broadcast_" + broadcastId + (if (field == "") "" else "_" + field)
}
 
@DeveloperApi
case class TaskResultBlockId(taskId: Long) extends BlockId {
  override def name: String = "taskresult_" + taskId
}
 
@DeveloperApi
case class StreamBlockId(streamId: Int, uniqueId: Long) extends BlockId {
  override def name: String = "input-" + streamId + "-" + uniqueId
}
 
/** Id associated with temporary local data managed as blocks. Not serializable. */
private[spark] case class TempLocalBlockId(id: UUID) extends BlockId {
  override def name: String = "temp_local_" + id
}
 
/** Id associated with temporary shuffle data managed as blocks. Not serializable. */
private[spark] case class TempShuffleBlockId(id: UUID) extends BlockId {
  override def name: String = "temp_shuffle_" + id
}


The BlockManager Storage Architecture

A BlockManager manages the block data on each Driver and Executor, which may be stored locally or remotely. Its operations include looking up a block and saving a block to a designated store: memory, disk, or off-heap. For the actual memory and disk access, BlockManager relies on two backing stores, MemoryStore and DiskStore.
On the Driver and on every Executor there is one BlockManager and one BlockManagerMaster. BlockManager is the interface through which other modules interact with the storage module, while BlockManagerMaster is the management interface for blocks: it talks to the driver-side BlockManagerMasterEndpoint through an RpcEndpointRef, as shown in the figure below:

[Figure 1: BlockManager storage architecture]
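To observe this machinery from the driver's side, here is a minimal runnable sketch (using the DeveloperApi method SparkContext.getRDDStorageInfo): each cached partition becomes one rdd_<rddId>_<splitIndex> block in an executor's BlockManager, and the driver obtains the aggregated view through BlockManagerMaster:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(
  new SparkConf().setAppName("block-demo").setMaster("local[2]"))

// Each cached partition becomes one block, named rdd_<rddId>_<splitIndex>,
// in the executor-side BlockManager.
val rdd = sc.parallelize(1 to 1000, numSlices = 4)
  .persist(StorageLevel.MEMORY_AND_DISK)
rdd.count() // materialize the blocks

// The driver asks BlockManagerMaster, which aggregates the block status
// reports from every BlockManager, for cluster-wide storage information.
sc.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.id}: ${info.numCachedPartitions}/${info.numPartitions} " +
    s"partitions cached, ${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
}
sc.stop()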

Analysis of the Storage Implementation Classes

BlockManager: runs on the Driver and on every Executor; puts and gets blocks against the backing stores and reports block status to the master.

BlockManagerMaster: held on every node; forwards registration, block updates, and location queries to the driver-side endpoint.

BlockManagerMasterEndpoint: the RPC endpoint that lives on the Driver; tracks all registered BlockManagers and the locations of every block.

RpcEndpointRef: the remote reference through which BlockManagerMaster sends messages to BlockManagerMasterEndpoint.

BlockManagerInfo: the per-BlockManager metadata the master keeps, such as its blocks, memory usage, and last heartbeat.

MemoryStore: stores blocks in memory, either as deserialized Java objects or as serialized byte buffers.

DiskStore: stores blocks as files on local disk.
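Which of these backing stores a given block lands in is expressed by its StorageLevel. A short sketch of the standard levels and the custom constructor:

import org.apache.spark.storage.StorageLevel

// Built-in levels select which backing store(s) the BlockManager uses.
val memOnly = StorageLevel.MEMORY_ONLY     // MemoryStore only, deserialized
val memDisk = StorageLevel.MEMORY_AND_DISK // spill to DiskStore when memory is full
val offHeap = StorageLevel.OFF_HEAP        // off-heap memory

// A custom level: memory plus disk, serialized, replicated to two nodes.
val custom = StorageLevel(
  useDisk = true, useMemory = true, useOffHeap = false,
  deserialized = false, replication = 2)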


Reference: Spark Block Storage Management Analysis



