Spark 2.2.0 source code reading --- spark core package --- storage

1. Goal of this article and other notes:

    This article introduces the classes in the storage package.

2. Data structures in the storage package

sealed abstract class BlockId { 
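Represents the identifier of a data block. Concrete subclasses include RDDBlockId, ShuffleBlockId, BroadcastBlockId, TaskResultBlockId, StreamBlockId, TempLocalBlockId, TempShuffleBlockId, and so on.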
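
To make the idea of a block identifier concrete, here is a minimal sketch of such a sealed hierarchy. The classes below are simplified stand-ins written for this note, not the Spark classes themselves; the name formats follow the usual `rdd_<rddId>_<split>` style but should be treated as an assumption, and `describe` is a hypothetical helper.

```scala
// Simplified stand-ins for illustration only; not the real Spark classes.
object BlockIdSketch {
  sealed abstract class BlockId {
    // Globally unique name of the block
    def name: String
    def isRDD: Boolean = isInstanceOf[RDDBlockId]
    def isShuffle: Boolean = isInstanceOf[ShuffleBlockId]
  }

  case class RDDBlockId(rddId: Int, splitIndex: Int) extends BlockId {
    override def name: String = s"rdd_${rddId}_$splitIndex"
  }

  case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId {
    override def name: String = s"shuffle_${shuffleId}_${mapId}_$reduceId"
  }

  case class BroadcastBlockId(broadcastId: Long) extends BlockId {
    override def name: String = s"broadcast_$broadcastId"
  }

  // Hypothetical helper: because the hierarchy is sealed, the compiler
  // can check that this match covers every subclass.
  def describe(id: BlockId): String = id match {
    case RDDBlockId(rdd, split)  => s"partition $split of RDD $rdd"
    case ShuffleBlockId(s, m, r) => s"shuffle $s, map $m, reduce $r"
    case BroadcastBlockId(b)     => s"broadcast variable $b"
  }

  def main(args: Array[String]): Unit = {
    val id = RDDBlockId(3, 7)
    println(s"${id.name}: ${describe(id)}")
  }
}
```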

private[storage] class BlockInfo(
    val level: StorageLevel,
    val classTag: ClassTag[_],
    val tellMaster: Boolean) {

Maintains and tracks the metadata of a single block.


private[storage] class BlockInfoManager extends Logging {

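This manager tracks the synchronization state between tasks (identified by taskAttemptId) and the blocks (BlockId) they read and write. Each BlockInfo encapsulates the read/write lock state of its block.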
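
The read/write bookkeeping can be sketched roughly as follows. This is a toy model under stated assumptions: `ToyBlockInfoManager`, `LockState`, and the `tryLock...` methods are hypothetical names, blocks are keyed by plain strings, and lock attempts fail immediately instead of waiting, which the real manager may do differently.

```scala
import scala.collection.mutable

// Toy model of per-block read/write locks held by task attempts.
object BlockLockSketch {
  val NoWriter: Long = -1L

  // Per-block lock state: many readers or a single writer
  final class LockState {
    var readerCount: Int = 0
    var writerTask: Long = NoWriter
  }

  final class ToyBlockInfoManager {
    private val infos = mutable.HashMap.empty[String, LockState]
    // Blocks on which each task attempt currently holds a read lock
    private val readLocksByTask =
      mutable.HashMap.empty[Long, mutable.ListBuffer[String]]

    def register(blockId: String): Unit = synchronized {
      infos.getOrElseUpdate(blockId, new LockState)
    }

    // Non-blocking attempt; fails when a writer holds the block
    def tryLockForReading(taskAttemptId: Long, blockId: String): Boolean = synchronized {
      infos.get(blockId) match {
        case Some(s) if s.writerTask == NoWriter =>
          s.readerCount += 1
          readLocksByTask.getOrElseUpdate(taskAttemptId, mutable.ListBuffer.empty[String]) += blockId
          true
        case _ => false
      }
    }

    // Non-blocking attempt; fails when any reader or writer holds the block
    def tryLockForWriting(taskAttemptId: Long, blockId: String): Boolean = synchronized {
      infos.get(blockId) match {
        case Some(s) if s.writerTask == NoWriter && s.readerCount == 0 =>
          s.writerTask = taskAttemptId
          true
        case _ => false
      }
    }

    // Release whichever lock this task attempt holds on the block
    def unlock(taskAttemptId: Long, blockId: String): Unit = synchronized {
      infos.get(blockId).foreach { s =>
        if (s.writerTask == taskAttemptId) {
          s.writerTask = NoWriter
        } else if (readLocksByTask.get(taskAttemptId).exists(_.contains(blockId))) {
          readLocksByTask(taskAttemptId) -= blockId
          s.readerCount -= 1
        }
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val m = new ToyBlockInfoManager
    m.register("rdd_0_0")
    println(m.tryLockForReading(1L, "rdd_0_0"))
    println(m.tryLockForWriting(2L, "rdd_0_0"))
  }
}
```

Keying the read locks by task attempt is what lets a manager like this release everything a task still holds when the task finishes.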

private[spark] trait BlockData {

An abstraction over a block's data: it hides how the data is stored and provides methods for accessing the underlying data.

private[spark] class ByteBufferBlockData(
    val buffer: ChunkedByteBuffer,
    val shouldDispose: Boolean) extends BlockData {

An implementation of BlockData backed by a byte buffer.

private[spark] class BlockManager(
    executorId: String,
    rpcEnv: RpcEnv,
    val master: BlockManagerMaster,
    val serializerManager: SerializerManager,
    val conf: SparkConf,
    memoryManager: MemoryManager,
    mapOutputTracker: MapOutputTracker,
    shuffleManager: ShuffleManager,
    val blockTransferService: BlockTransferService,
    securityManager: SecurityManager,
    numUsableCores: Int)
  extends BlockDataManager with BlockEvictionHandler with Logging {

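Provides the interface for putting and getting blocks, both locally and remotely. Blocks can be stored in three ways: in memory, on disk, and off-heap. The data itself is stored through memoryStore / diskStore.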
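
As a rough illustration of how a put can be routed between a memory store and a disk store, consider the toy below. Everything in it is an assumption made for this note: `Level` is a reduced stand-in for a storage level, `ToyBlockManager` keeps "disk" blocks in a second map rather than in files, and each block id is expected to be put at most once.

```scala
import scala.collection.mutable

// Toy two-tier store: try memory first, fall back to "disk".
object TieredStoreSketch {
  // Hypothetical, reduced storage level: which tiers may hold the block
  final case class Level(useMemory: Boolean, useDisk: Boolean)

  final class ToyBlockManager(maxMemoryBytes: Long) {
    private val memoryStore = mutable.HashMap.empty[String, Array[Byte]]
    // Stands in for files on disk
    private val diskStore = mutable.HashMap.empty[String, Array[Byte]]
    private var memoryUsed = 0L

    // Returns the tier that accepted the block, if any.
    // Assumes each block id is put at most once.
    def put(blockId: String, bytes: Array[Byte], level: Level): Option[String] = {
      if (level.useMemory && memoryUsed + bytes.length <= maxMemoryBytes) {
        memoryStore(blockId) = bytes
        memoryUsed += bytes.length
        Some("memory")
      } else if (level.useDisk) {
        diskStore(blockId) = bytes
        Some("disk")
      } else {
        None
      }
    }

    // Look in memory first, then on disk
    def get(blockId: String): Option[Array[Byte]] =
      memoryStore.get(blockId).orElse(diskStore.get(blockId))
  }

  def main(args: Array[String]): Unit = {
    val bm = new ToyBlockManager(maxMemoryBytes = 10L)
    val level = Level(useMemory = true, useDisk = true)
    println(bm.put("a", new Array[Byte](8), level))
    println(bm.put("b", new Array[Byte](8), level))
  }
}
```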

class BlockManagerId private (
    private var executorId_ : String,
    private var host_ : String,
    private var port_ : Int,
    private var topologyInfo_ : Option[String])
  extends Externalizable {

The unique identifier of a BlockManager.


private[storage] class BlockManagerManagedBuffer(
    blockInfoManager: BlockInfoManager,
    blockId: BlockId,
    data: BlockData,
    dispose: Boolean) extends ManagedBuffer {

Under the hood it still calls the methods of BlockData; it is essentially a wrapper around BlockData.

class BlockManagerMaster(
    var driverEndpoint: RpcEndpointRef,
    conf: SparkConf,
    isDriver: Boolean)
  extends Logging {
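
All methods in this class are implemented through RpcEndpointRef and are used to communicate with the driver. They cover the driver-side state of block managers, blocks, RDDs, broadcast variables, and so on.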

private[spark]
class BlockManagerMasterEndpoint(
    override val rpcEnv: RpcEnv,
    val isLocal: Boolean,
    conf: SparkConf,
    listenerBus: LiveListenerBus)
  extends ThreadSafeRpcEndpoint with Logging {

This class exists only on the driver; it manages the BlockManagers running on the slaves.

sealed trait ToBlockManagerSlave

Messages sent from the master to the slave nodes.

sealed trait ToBlockManagerMaster

Messages sent from the slaves to the master node.

private[storage]
class BlockManagerSlaveEndpoint(
    override val rpcEnv: RpcEnv,
    blockManager: BlockManager,
    mapOutputTracker: MapOutputTracker)
  extends ThreadSafeRpcEndpoint with Logging {

Mainly receives messages sent by the master, typically to remove blocks, replicate blocks, fetch block information, and so on.
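
The master-to-slave message flow can be pictured as a sealed family of messages plus a handler that pattern-matches on them. The sketch below is only an assumption-laden toy: `ToSlave`, `ToySlaveEndpoint`, and its `receive` method are hypothetical, there is no RPC layer, and blocks are tracked as a set of names.

```scala
import scala.collection.mutable

// Toy message handling on the slave side; no real RPC involved.
object SlaveMessagesSketch {
  sealed trait ToSlave
  case class RemoveBlock(blockId: String) extends ToSlave
  case class RemoveRdd(rddId: Int) extends ToSlave

  final class ToySlaveEndpoint(blocks: mutable.Set[String]) {
    // Returns the number of blocks removed by the message
    def receive(msg: ToSlave): Int = msg match {
      case RemoveBlock(id) =>
        if (blocks.remove(id)) 1 else 0
      case RemoveRdd(rddId) =>
        // Copy the matches first so the set is not mutated while iterating
        val victims = blocks.filter(_.startsWith(s"rdd_${rddId}_")).toList
        victims.foreach(id => blocks.remove(id))
        victims.size
    }
  }

  def main(args: Array[String]): Unit = {
    val endpoint = new ToySlaveEndpoint(mutable.Set("rdd_1_0", "rdd_1_1", "rdd_2_0"))
    println(endpoint.receive(RemoveRdd(1)))
  }
}
```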

private[spark] class BlockManagerSource(val blockManager: BlockManager)
    extends Source {

A metrics source: it exposes information taken from the BlockManager.

trait BlockReplicationPolicy {

The replication policy. It has a single method, prioritize, which returns the BlockManagers to replicate to, in priority order.
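
A prioritize-style method could look like the sketch below. The signature is deliberately reduced and should not be read as the real one: peers are plain strings instead of BlockManagerId, and `ToyReplicationPolicy` and `RandomPolicy` are names invented for this example. The random ordering is just one possible strategy.

```scala
import scala.util.Random

// Toy replication policy: choose replication targets in random order.
object ReplicationPolicySketch {
  // Reduced, hypothetical signature; peers are strings here
  trait ToyReplicationPolicy {
    def prioritize(
        peers: Seq[String],
        alreadyReplicatedTo: Set[String],
        numReplicas: Int): List[String]
  }

  // Skips peers that already hold a replica, shuffles the rest,
  // and returns at most numReplicas of them in priority order
  final class RandomPolicy(random: Random) extends ToyReplicationPolicy {
    override def prioritize(
        peers: Seq[String],
        alreadyReplicatedTo: Set[String],
        numReplicas: Int): List[String] = {
      val candidates = peers.filterNot(alreadyReplicatedTo.contains)
      random.shuffle(candidates).take(numReplicas).toList
    }
  }

  def main(args: Array[String]): Unit = {
    val policy = new RandomPolicy(new Random(42))
    println(policy.prioritize(Seq("bm-a", "bm-b", "bm-c", "bm-d"), Set("bm-b"), 2))
  }
}
```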

case class BlockUpdatedInfo(
    blockManagerId: BlockManagerId,
    blockId: BlockId,
    storageLevel: StorageLevel,
    memSize: Long,
    diskSize: Long)

Information about a block update.

private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolean) 
extends Logging {

Maintains the mapping between logical blocks and their physical data on disk.

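
One common way to map a logical block name to a physical file is to hash the name into a local directory and a sub-directory. The sketch below shows that idea only; `fileFor` and `nonNegativeHash` are hypothetical helpers, the two-hex-digit sub-directory naming is an assumption, and nothing is created on disk.

```scala
import java.io.File

// Toy mapping from a block file name to a path under several local dirs.
object DiskLayoutSketch {
  // Non-negative hash of a name (Int.MinValue has no positive counterpart)
  def nonNegativeHash(s: String): Int = {
    val h = s.hashCode
    if (h == Int.MinValue) 0 else math.abs(h)
  }

  // Resolve to <localDir>/<two-hex-digit subdir>/<filename>
  def fileFor(
      localDirs: IndexedSeq[File],
      subDirsPerLocalDir: Int,
      filename: String): File = {
    val hash = nonNegativeHash(filename)
    val dirId = hash % localDirs.length
    val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
    val subDir = new File(localDirs(dirId), "%02x".format(subDirId))
    new File(subDir, filename)
  }

  def main(args: Array[String]): Unit = {
    val dirs = IndexedSeq(new File("/tmp/d0"), new File("/tmp/d1"))
    println(fileFor(dirs, 64, "rdd_3_7"))
  }
}
```

Because the path is a pure function of the file name, the same block always resolves to the same file without any lookup table, and files spread across the configured directories.
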
private[spark] class DiskBlockObjectWriter(

Writes JVM objects to disk and supports appending.

private[spark] class DiskStore(

Stores the BlockManager's blocks on disk.

class RDDInfo(
    val id: Int,
    var name: String,
    val numPartitions: Int,
    var storageLevel: StorageLevel,
    val parentIds: Seq[Int],
    val callSite: String = "",
    val scope: Option[RDDOperationScope] = None)
  extends Ordered[RDDInfo] {

It has only three methods: isCached, toString, and compare.
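
The three methods can be pictured with the reduced class below. It is a sketch under assumptions: `ToyRDDInfo` keeps only a few fields, the "cached" condition (some stored bytes and at least one cached partition) and the ordering by id are my reading of how such a class would behave, and the toString format is invented.

```scala
// Toy version of an RDD info record that can be sorted by id.
object RddInfoSketch {
  class ToyRDDInfo(val id: Int, val name: String, val numPartitions: Int)
      extends Ordered[ToyRDDInfo] {
    var numCachedPartitions = 0
    var memSize = 0L
    var diskSize = 0L

    // Assumed rule: cached when some bytes are stored and a partition is cached
    def isCached: Boolean = (memSize + diskSize > 0) && numCachedPartitions > 0

    // Hypothetical format, for display only
    override def toString: String =
      s"""RDD "$name" ($id): $numCachedPartitions/$numPartitions partitions cached"""

    // Order by RDD id
    override def compare(that: ToyRDDInfo): Int = Integer.compare(this.id, that.id)
  }

  def main(args: Array[String]): Unit = {
    val infos = Seq(new ToyRDDInfo(2, "b", 4), new ToyRDDInfo(1, "a", 4))
    infos.sorted.foreach(println)
  }
}
```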

private[spark]
final class ShuffleBlockFetcherIterator(
    context: TaskContext,
    shuffleClient: ShuffleClient,
    blockManager: BlockManager,
    blocksByAddress: Iterator[(BlockManagerId, Seq[(BlockId, Long)])],
    streamWrapper: (BlockId, InputStream) => InputStream,
    maxBytesInFlight: Long,
    maxReqsInFlight: Int,
    maxBlocksInFlightPerAddress: Int,
    maxReqSizeShuffleToMem: Long,
    detectCorrupt: Boolean)
  extends Iterator[(BlockId, InputStream)] with TempFileManager with Logging {
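
Fetches remote blocks and places the results in a blocking queue held inside this structure, then hands them out one by one as the iterator is consumed.
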
case class FetchRequest(address: BlockManagerId, blocks: Seq[(BlockId, Long)]) {
  val size = blocks.map(_._2).sum
}

A fetch request: the remote BlockManagerId together with the blocks to fetch from it.
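
To show how blocks for one address might be grouped into requests, here is a sketch that closes a request once its accumulated size reaches a target. `split` and `targetRequestSize` are hypothetical names, strings stand in for BlockManagerId and BlockId, and the exact batching rule used by the real iterator may differ.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy batching of (blockId, size) pairs into fetch requests.
object FetchRequestSketch {
  // Same shape as the case class above, with strings as stand-ins
  case class FetchRequest(address: String, blocks: Seq[(String, Long)]) {
    val size: Long = blocks.map(_._2).sum
  }

  // Close a request as soon as its size reaches targetRequestSize;
  // whatever is left over goes into a final, smaller request
  def split(
      address: String,
      blocks: Seq[(String, Long)],
      targetRequestSize: Long): Seq[FetchRequest] = {
    val requests = ArrayBuffer.empty[FetchRequest]
    var current = ArrayBuffer.empty[(String, Long)]
    var currentSize = 0L
    for ((blockId, size) <- blocks if size > 0) { // empty blocks need no fetch
      current += ((blockId, size))
      currentSize += size
      if (currentSize >= targetRequestSize) {
        requests += FetchRequest(address, current.toList)
        current = ArrayBuffer.empty[(String, Long)]
        currentSize = 0L
      }
    }
    if (current.nonEmpty) requests += FetchRequest(address, current.toList)
    requests.toList
  }

  def main(args: Array[String]): Unit = {
    val blocks = Seq(("a", 40L), ("b", 70L), ("c", 0L), ("d", 30L), ("e", 10L))
    split("host-1", blocks, targetRequestSize = 100L).foreach(println)
  }
}
```

Splitting into several smaller requests lets fetches from different hosts proceed in parallel instead of waiting on one large transfer.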

private[storage] sealed trait FetchResult {
  val blockId: BlockId
  val address: BlockManagerId
}

The result of a fetch. There are two implementations.

private[storage] case class SuccessFetchResult(
    blockId: BlockId,
    address: BlockManagerId,
    size: Long,
    buf: ManagedBuffer,
    isNetworkReqDone: Boolean) extends FetchResult {
  require(buf != null)
  require(size >= 0)
}

The fetch succeeded; carries the data.

private[storage] case class FailureFetchResult(
    blockId: BlockId,
    address: BlockManagerId,
    e: Throwable)
  extends FetchResult

The fetch failed; carries the exception.
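
Putting the pieces together, an iterator that drains results from a blocking queue and branches on success or failure can be sketched as follows. The types are reduced stand-ins (`Success`, `Failure`, `ToyFetcherIterator` are hypothetical), the producer side that performs the network fetch is left out, and a failure is simply rethrown.

```scala
import java.util.concurrent.LinkedBlockingQueue

// Toy consumer side of a fetch pipeline built on a blocking queue.
object FetchResultSketch {
  sealed trait FetchResult { def blockId: String }
  case class Success(blockId: String, data: Array[Byte]) extends FetchResult
  case class Failure(blockId: String, e: Throwable) extends FetchResult

  // Iterates over a known number of results arriving on the queue
  final class ToyFetcherIterator(
      results: LinkedBlockingQueue[FetchResult],
      numBlocksToFetch: Int)
    extends Iterator[(String, Array[Byte])] {

    private var numBlocksProcessed = 0

    override def hasNext: Boolean = numBlocksProcessed < numBlocksToFetch

    override def next(): (String, Array[Byte]) = {
      if (!hasNext) throw new NoSuchElementException
      numBlocksProcessed += 1
      results.take() match { // blocks until a result is available
        case Success(id, data) => (id, data)
        case Failure(id, e) => throw new RuntimeException(s"Failed to fetch $id", e)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val queue = new LinkedBlockingQueue[FetchResult]()
    queue.put(Success("shuffle_0_0_0", Array[Byte](1, 2, 3)))
    val it = new ToyFetcherIterator(queue, numBlocksToFetch = 1)
    it.foreach { case (id, data) => println(s"$id: ${data.length} bytes") }
  }
}
```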

private[spark] class StorageStatus(
    val blockManagerId: BlockManagerId,
    val maxMemory: Long,
    val maxOnHeapMem: Option[Long],
    val maxOffHeapMem: Option[Long]) {

Holds the status of the blocks belonging to the current BlockManager, along with its memory and disk usage.
