段智华

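Spark 2.1.0: The New-Generation Tungsten Memory Management Model and Its Implementation Classes

9.2.2 The Memory Management Model and Its Implementation Classes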

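Spark 1.6, released on January 4, 2016, introduced a new memory management model, the unified memory management model. Spark 1.5 and earlier versions use the static memory management model. For the unified model, see the design document at https://issues.apache.org/jira/secure/attachment/12765646/unified-memory-management-spark-10000.pdf, which describes the candidate designs in detail along with the advantages and disadvantages of each. Alexey Grishchenko's blog post at http://0x0fff.com/spark-memory-management/ also analyzes the Spark memory management model in depth and covers both the static and the dynamic (unified) models in detail.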

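To address the shortcomings of the existing memory model, which relies on JVM-managed memory, Project Tungsten designed a new memory management mechanism. Under it, Spark operations can work directly on allocated binary data instead of JVM objects. This avoids unnecessary serialization and deserialization during data processing, and managing memory off-heap reduces GC overhead.

Project Tungsten manages memory through sun.misc.Unsafe (as the name suggests, a tool that must not be abused). For details on sun.misc.Unsafe and its use, see the API documentation at http://www.docjar.com/docs/api/sun/misc/Unsafe.html. This section focuses on how the memory management model is implemented in Project Tungsten.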

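1. Overview of the Project Tungsten memory model

The main class structure of the Project Tungsten memory management model is shown in the following figure:

Figure 9-1: Main class structure of the memory management model

In the figure, the base class MemoryManager abstracts over the static and unified memory management models, which are implemented by the two concrete subclasses StaticMemoryManager and UnifiedMemoryManager. How memory is actually allocated is determined by the MemoryManager member tungstenMemoryMode: allocation is carried out by the base type MemoryAllocator, whose two concrete implementations UnsafeMemoryAllocator and HeapMemoryAllocator correspond to the off-heap and on-heap memory modes. MemoryAllocator provides two member functions, allocate and free, to allocate and release memory, and allocated memory is represented as a MemoryBlock.

In addition, memory is divided by purpose into two main parts, Storage and Execution, managed by the two concrete MemoryPool subclasses StorageMemoryPool and ExecutionMemoryPool. Besides these two parts, total memory also includes OtherMemory, which is reserved for the system.

The relationship between the memory categories and the main classes that manage them is shown in Figure 9-2:

Figure 9-2: Memory categories and the main classes that manage them

Each Executor process running on a Worker (an abstract description; in practice this is the concrete ExecutorBackend subclass for the deployment mode in use) has its memory managed by one MemoryManager, so MemoryManager and JVM are in a 1:1 relationship in the figure.

Storage memory is managed by StorageMemoryPool. Execution memory is divided by memory mode (MemoryMode) into on-heap and off-heap, managed by onHeapExecutionMemoryPool and offHeapExecutionMemoryPool respectively. These pools control memory only by tracking how much is in use; they do not allocate or release memory themselves.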

2. MemoryManager implementation and source code analysis

MemoryManager currently has two concrete implementations. Starting with Spark 1.6, the unified memory management model is the default. The model is selected by the configuration property "spark.memory.useLegacyMode", and the controlling code is in the SparkEnv class:

// The memory management model of Spark 1.5 and earlier is selected when
// "spark.memory.useLegacyMode" is true. The current default is false.
val useLegacyMemoryManager = conf.getBoolean("spark.memory.useLegacyMode", false)
val memoryManager: MemoryManager =
  if (useLegacyMemoryManager) {
    // Use the static memory management model
    new StaticMemoryManager(conf, numUsableCores)
  } else {
    // Use the unified memory management model
    UnifiedMemoryManager(conf, numUsableCores)
  }

The code above selects the memory management model. We now turn to the memory management source itself, starting with the MemoryManager class comment:

/**
 * An abstract memory manager that enforces how memory is shared between execution and
 * storage.
 * In this context, execution memory refers to that used for computation in shuffles, joins,
 * sorts and aggregations, while storage memory refers to that used for caching and
 * propagating internal data across the cluster. There exists one MemoryManager per JVM.
 *
 * Annotation: MemoryManager and JVM process are 1:1, that is, the memory of one JVM
 * process is managed by one MemoryManager.
 */
private[spark] abstract class MemoryManager(
……

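MemoryManager provides the following main interfaces for acquiring and releasing memory:

1) Storage memory: acquireStorageMemory, acquireUnrollMemory, releaseStorageMemory, and releaseUnrollMemory.

2) Execution memory: acquireExecutionMemory and releaseExecutionMemory.

The concrete acquire and release logic is provided by the MemoryManager subclasses.

The main difference between the two subclasses, StaticMemoryManager and UnifiedMemoryManager, is whether the boundary between Storage and Execution memory is static or can move dynamically. The implementation details of each are described briefly below.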

The StaticMemoryManager class comment reads:

/**
 * A [[MemoryManager]] that statically partitions the heap space into disjoint regions.
 *
 * The sizes of the execution and storage regions are determined through
 * `spark.shuffle.memoryFraction` and `spark.storage.memoryFraction` respectively. The two
 * regions are cleanly separated such that neither usage can borrow memory from the other.
 *
 * Annotation: a memory manager that fixes the boundary between Storage and Execution
 * statically. Because the boundary is static, neither side can borrow spare memory
 * from the other.
 */
private[spark] class StaticMemoryManager(

In the static model, the size of each memory region can be read from the following methods and members:

1) maxUnrollMemory: memory available for unrolling, 0.2 of the maximum Storage memory.

2) getMaxStorageMemory: returns the maximum memory assigned to Storage.

3) getMaxExecutionMemory: returns the maximum memory assigned to Execution.

getMaxStorageMemory computes the maximum Storage memory as follows:

private def getMaxStorageMemory(conf: SparkConf): Long = {
  val systemMaxMemory = conf.getLong("spark.testing.memory", Runtime.getRuntime.maxMemory)
  val memoryFraction = conf.getDouble("spark.storage.memoryFraction", 0.6)
  val safetyFraction = conf.getDouble("spark.storage.safetyFraction", 0.9)
  (systemMaxMemory * memoryFraction * safetyFraction).toLong
}

The property "spark.storage.memoryFraction" is the fraction of total memory (excluding memory reserved for the system) given to Storage, and "spark.storage.safetyFraction" is the safety factor applied to Storage memory.

The getMaxExecutionMemory method likewise names the configuration properties for Execution memory: as with Storage, a fraction of total memory (0.2) and a corresponding safety factor.

The memory left over after the 0.6 + 0.2 taken by Storage and Execution serves as reserved system memory.
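As a worked example of these fractions, the following Java sketch reproduces the arithmetic for a given heap size. The class and method names are hypothetical. The 0.6 and 0.9 values come from getMaxStorageMemory above and the 0.2 Execution fraction comes from the text; the Execution safety factor of 0.8 (spark.shuffle.safetyFraction) is assumed here from the Spark 1.x defaults.

```java
// Hypothetical helper (not part of Spark): reproduces the static-model arithmetic.
final class StaticRegionSketch {
    static final double STORAGE_FRACTION = 0.6;  // spark.storage.memoryFraction
    static final double STORAGE_SAFETY = 0.9;    // spark.storage.safetyFraction
    static final double SHUFFLE_FRACTION = 0.2;  // spark.shuffle.memoryFraction
    static final double SHUFFLE_SAFETY = 0.8;    // spark.shuffle.safetyFraction (assumed default)
    static final double UNROLL_FRACTION = 0.2;   // share of Storage usable for unrolling

    static long maxStorage(long heapBytes) {
        return (long) (heapBytes * STORAGE_FRACTION * STORAGE_SAFETY);
    }

    static long maxExecution(long heapBytes) {
        return (long) (heapBytes * SHUFFLE_FRACTION * SHUFFLE_SAFETY);
    }

    static long maxUnroll(long heapBytes) {
        return (long) (maxStorage(heapBytes) * UNROLL_FRACTION);
    }

    public static void main(String[] args) {
        long heap = 1000L * 1024 * 1024; // 1000 MB, chosen only for round numbers
        System.out.println("storage bytes:   " + maxStorage(heap));
        System.out.println("execution bytes: " + maxExecution(heap));
        System.out.println("unroll bytes:    " + maxUnroll(heap));
    }
}
```

With a 1000 MB heap this gives roughly 540 MB for Storage and 160 MB for Execution, so once the safety factors are applied the two regions cover about 70% of the heap rather than the nominal 80%.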

Having looked briefly at the static model through StaticMemoryManager, we move on to the unified memory management model, again starting with the class comment:

/**
 * A [[MemoryManager]] that enforces a soft boundary between execution and storage such
 * that either side can borrow memory from the other.
 *
 * The region shared between execution and storage is a fraction of (the total heap space -
 * 300MB) configurable through `spark.memory.fraction` (default 0.75). The position of the
 * boundary within this space is further determined by `spark.memory.storageFraction`
 * (default 0.5). This means the size of the storage region is 0.75 * 0.5 = 0.375 of the heap
 * space by default.
 *
 * Storage can borrow as much execution memory as is free until execution reclaims its space.
 * When this happens, cached blocks will be evicted from memory until sufficient borrowed
 * memory is released to satisfy the execution memory request.
 *
 * Similarly, execution can borrow as much storage memory as is free. However, execution
 * memory is *never* evicted by storage due to the complexities involved in implementing
 * this. The implication is that attempts to cache blocks may fail if execution has already eaten
 * up most of the storage space, in which case the new blocks will be evicted immediately
 * according to their respective storage levels.
 *
 * @param storageRegionSize Size of the storage region, in bytes.
 *                          This region is not statically reserved; execution can borrow
 *                          from it if necessary. Cached blocks can be evicted only if
 *                          actual storage memory usage exceeds this region.
 *
 * Annotation: UnifiedMemoryManager is a concrete MemoryManager subclass that implements a
 * soft (dynamic) boundary between Storage and Execution, meaning the two can borrow
 * memory from each other.
 * The shared region is set by `spark.memory.fraction`; the split between the two is set
 * by `spark.memory.storageFraction`.
 * Borrowing:
 * 1) Storage can borrow free Execution memory, but it is reclaimed when Execution needs
 *    memory (cached blocks are evicted until the execution request is satisfied).
 * 2) Similarly, Execution can borrow free Storage memory. However, because of the
 *    implementation complexity (see the official design document
 *    unified-memory-management-spark-10000.pdf mentioned earlier), memory borrowed by
 *    Execution is never reclaimed on behalf of Storage.
 */
private[spark] class UnifiedMemoryManager private[memory] (
……

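Like StaticMemoryManager, UnifiedMemoryManager implements the MemoryManager interfaces for acquiring and releasing memory. The StaticMemoryManager implementation is relatively simple, whereas UnifiedMemoryManager must handle dynamic borrowing and is more involved. For details, see the official unified memory management design document and the source code, for example how each Task is guaranteed a minimum allocation (at least 1/2N of the pool, where N is the number of currently active Tasks; the maximum number of concurrent Tasks is the number of cores assigned to the Executor divided by the number of cores used per Task).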
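The per-Task bounds can be illustrated with a small Java sketch. All names are hypothetical and this is not the ExecutionMemoryPool code. The 1/2N lower bound and the cores-per-Task division come from the paragraph above; the 1/N upper bound is an assumption added here based on how ExecutionMemoryPool is usually described.

```java
// Hypothetical helper (not Spark code): per-task bounds on Execution memory.
final class TaskShareSketch {
    /** Guaranteed minimum: 1/(2N) of the current pool size for N active tasks. */
    static long minPerTask(long poolSize, int activeTasks) {
        return poolSize / (2L * activeTasks);
    }

    /** Assumed cap: 1/N of the maximum pool size for N active tasks. */
    static long maxPerTask(long maxPoolSize, int activeTasks) {
        return maxPoolSize / activeTasks;
    }

    /** Upper bound on N: executor cores divided by cores per task. */
    static int maxConcurrentTasks(int executorCores, int coresPerTask) {
        return executorCores / coresPerTask;
    }

    public static void main(String[] args) {
        int n = maxConcurrentTasks(8, 2);
        long pool = 1200L * 1024 * 1024;
        System.out.println("tasks=" + n
            + " min=" + minPerTask(pool, n)
            + " max=" + maxPerTask(pool, n));
    }
}
```

The point of the lower bound is that a Task arriving late is not starved by Tasks that grabbed memory earlier; the exact blocking and spilling behavior is in the Spark source.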

The following briefly analyzes the configuration related to Storage and Execution memory in the unified model.

In UnifiedMemoryManager's getMaxMemory method, the default value of spark.memory.fraction is 0.75 in Spark 1.6 and 0.6 in Spark 2.1.

getMaxMemory in UnifiedMemoryManager.scala:

/**
 * Return the total amount of memory shared between execution and storage, in bytes.
 */
private def getMaxMemory(conf: SparkConf): Long = {
  val systemMemory = conf.getLong("spark.testing.memory", Runtime.getRuntime.maxMemory)
  // Memory reserved for the system; 300 MB by default.
  val reservedMemory = conf.getLong("spark.testing.reservedMemory",
    if (conf.contains("spark.testing")) 0 else RESERVED_SYSTEM_MEMORY_BYTES)
  // The minimum required memory is currently 300 MB * 1.5 = 450 MB;
  // if this is not met, an exception is thrown and startup aborts.
  val minSystemMemory = (reservedMemory * 1.5).ceil.toLong
  if (systemMemory < minSystemMemory) {
    throw new IllegalArgumentException(s"System memory $systemMemory must " +
      s"be at least $minSystemMemory. Please increase heap size using the --driver-memory " +
      s"option or spark.driver.memory in Spark configuration.")
  }
  // SPARK-12759 Check executor memory to fail fast if memory is insufficient
  if (conf.contains("spark.executor.memory")) {
    val executorMemory = conf.getSizeAsBytes("spark.executor.memory")
    if (executorMemory < minSystemMemory) {
      throw new IllegalArgumentException(s"Executor memory $executorMemory must be at least " +
        s"$minSystemMemory. Please increase executor memory using the " +
        s"--executor-memory option or spark.executor.memory in Spark configuration.")
    }
  }
  val usableMemory = systemMemory - reservedMemory
  // The shared Execution/Storage fraction defaults to 0.6: Execution and Storage
  // together get 0.6 of usable memory, and user memory gets (1 - 0.6) = 0.4.
  val memoryFraction = conf.getDouble("spark.memory.fraction", 0.6)
  (usableMemory * memoryFraction).toLong
}

Although Execution and Storage share memory, there is still an initial boundary between them. See the apply factory method of the UnifiedMemoryManager companion object:

UnifiedMemoryManager.scala:

def apply(conf: SparkConf, numCores: Int): UnifiedMemoryManager = {
  val maxMemory = getMaxMemory(conf)
  new UnifiedMemoryManager(
    conf,
    maxMemory = maxMemory,
    // The property "spark.memory.storageFraction" sets the initial boundary of the
    // memory shared by Execution and Storage; by default each starts with half.
    storageRegionSize =
      (maxMemory * conf.getDouble("spark.memory.storageFraction", 0.5)).toLong,
    numCores = numCores)
}
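Taken together, getMaxMemory and the apply factory amount to a short calculation. The Java sketch below walks through it for a 1 GiB heap, assuming the Spark 2.1 defaults quoted above (300 MB reserved, spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5). The class and method names are hypothetical, and the spark.testing and executor-memory branches are left out.

```java
// Hypothetical helper (not Spark code): the unified-model sizing arithmetic.
final class UnifiedMemorySketch {
    static final long RESERVED_BYTES = 300L * 1024 * 1024; // reserved system memory

    /** Memory shared by Execution and Storage, in bytes. */
    static long maxMemory(long systemMemory, double memoryFraction) {
        long minSystemMemory = (long) Math.ceil(RESERVED_BYTES * 1.5);
        if (systemMemory < minSystemMemory) {
            throw new IllegalArgumentException(
                "System memory " + systemMemory + " must be at least " + minSystemMemory);
        }
        long usable = systemMemory - RESERVED_BYTES;
        return (long) (usable * memoryFraction);
    }

    /** Initial size of the Storage region inside the shared memory. */
    static long storageRegionSize(long maxMemory, double storageFraction) {
        return (long) (maxMemory * storageFraction);
    }

    public static void main(String[] args) {
        long heap = 1024L * 1024 * 1024; // 1 GiB
        long shared = maxMemory(heap, 0.6);
        System.out.println("shared bytes:         " + shared);
        System.out.println("storage region bytes: " + storageRegionSize(shared, 0.5));
    }
}
```

For a 1 GiB heap the shared region is about 434 MB, of which about 217 MB is the initial Storage region. A heap below 450 MB is rejected outright.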

Note that the Execution memory discussed so far is the on-heap part. Project Tungsten also introduces off-heap memory, whose size is set in the base class MemoryManager:

MemoryManager.scala:

// 1. Set the initial size of the on-heap Storage memory pool
onHeapStorageMemoryPool.incrementPoolSize(onHeapStorageMemory)
// 2. Set the initial size of the on-heap Execution memory pool
onHeapExecutionMemoryPool.incrementPoolSize(onHeapExecutionMemory)

protected[this] val maxOffHeapMemory = conf.getSizeAsBytes("spark.memory.offHeap.size", 0)
protected[this] val offHeapStorageMemory =
  (maxOffHeapMemory * conf.getDouble("spark.memory.storageFraction", 0.5)).toLong
// 3. Compute and set the initial sizes of the off-heap memory pools
offHeapExecutionMemoryPool.incrementPoolSize(maxOffHeapMemory - offHeapStorageMemory)
offHeapStorageMemoryPool.incrementPoolSize(offHeapStorageMemory)

To use off-heap memory, changing the initial size of the off-heap pools (such as offHeapExecutionMemoryPool), which defaults to 0, is not enough: the corresponding switch must also be turned on. See how MemoryManager sets the memory mode, which in turn determines the concrete MemoryAllocator subclass used for allocation:

MemoryManager.scala:

final val tungstenMemoryMode: MemoryMode = {
  // To use the off-heap memory mode, turn on the switch with the
  // "spark.memory.offHeap.enabled" property, then set the off-heap memory
  // size with the "spark.memory.offHeap.size" property.
  if (conf.getBoolean("spark.memory.offHeap.enabled", false)) {
    require(conf.getSizeAsBytes("spark.memory.offHeap.size", 0) > 0,
      "spark.memory.offHeap.size must be > 0 when spark.memory.offHeap.enabled == true")
    MemoryMode.OFF_HEAP
  } else {
    MemoryMode.ON_HEAP
  }
}

As Figure 9-2 shows, Execution memory is managed by one of two memory pools depending on the memory mode (on-heap or off-heap). This is visible in the method that acquires Execution memory (its comment describes how memory is assigned to a Task; interested readers can consult the source). The key code is:

UnifiedMemoryManager.scala:

override private[memory] def acquireExecutionMemory(
    numBytes: Long,
    taskAttemptId: Long,
    memoryMode: MemoryMode): Long = synchronized {
  assertInvariants()
  assert(numBytes >= 0)
  val (executionPool, storagePool, storageRegionSize, maxMemory) = memoryMode match {
    case MemoryMode.ON_HEAP => (
      // In on-heap mode, onHeapExecutionMemoryPool manages the memory
      onHeapExecutionMemoryPool,
      onHeapStorageMemoryPool,
      onHeapStorageRegionSize,
      maxHeapMemory)
    case MemoryMode.OFF_HEAP => (
      // In off-heap mode, offHeapExecutionMemoryPool manages the memory
      offHeapExecutionMemoryPool,
      offHeapStorageMemoryPool,
      offHeapStorageMemory,
      maxOffHeapMemory)
  }
  ……

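MemoryMode is an either-or choice. When the off-heap mode is enabled, the Storage fraction (the "spark.memory.storageFraction" property) can therefore be set somewhat higher, although during allocation Storage can also borrow memory from the on-heap Execution region.

For the memory pools, reading the Spark memory management source will deepen understanding. The memory pool classes are shown below:

Figure 9-3: Memory pool class diagram

The pools are controlled mainly through the pool size and the amount of memory in use. Under the unified model, borrowing must also be handled (for the key code, see UnifiedMemoryManager's calls to the shrinkPoolToFreeSpace method of StorageMemoryPool).

The above is a brief analysis of the two memory management models used with Tungsten. Next we examine the internal organization of the memory management model.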

3. How the memory management model encapsulates memory descriptions

For more on Project Tungsten, see https://github.com/hustnn/TungstenSecret, which includes very detailed diagrams of the Page Table.

We now work up from the most basic source code to see how memory descriptions are encapsulated. There are two parts, the encapsulation of a memory address and of a memory block, corresponding to MemoryLocation and MemoryBlock.

To manage the on-heap and off-heap modes uniformly, Project Tungsten introduces a unified address representation: the MemoryLocation class represents an address in either mode.

Start with the class comment in MemoryLocation.java:

/**
 * A memory location. Tracked either by a memory address (with off-heap allocation),
 * or by an offset from a JVM object (in-heap allocation).
 */
public class MemoryLocation {

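In off-heap mode, a memory address can be described by a 64-bit absolute address. In on-heap mode, however, GC may reorganize heap memory, so an address must be located through a reference to an object on the heap plus an offset within that object.

MemoryLocation therefore defines two member variables:

MemoryLocation.java:

@Nullable
Object obj;
long offset;

For the two memory modes, they mean:

1) off-heap mode: obj is null, and the address is uniquely identified by the 64-bit offset.

2) on-heap mode: obj is a reference to the object on the heap, and offset is the offset of the data within that object.

MemoryLocation can thus uniformly locate a memory address in either off-heap or on-heap mode.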
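The two-field scheme can be mimicked in plain Java. The sketch below is a hypothetical stand-in, not the Spark class: it uses a byte[] as the base object and a plain array index as the offset, whereas Tungsten passes Unsafe-style offsets that include the array header. Reading off-heap memory is deliberately unsupported because the sketch avoids sun.misc.Unsafe.

```java
// Hypothetical stand-in for MemoryLocation (not the Spark class).
final class LocationSketch {
    final Object obj;   // base object; null means off-heap
    final long offset;  // index inside obj, or an absolute address when obj is null

    private LocationSketch(Object obj, long offset) {
        this.obj = obj;
        this.offset = offset;
    }

    static LocationSketch onHeap(byte[] base, long offsetInBase) {
        if (base == null) {
            throw new IllegalArgumentException("an on-heap location needs a base object");
        }
        return new LocationSketch(base, offsetInBase);
    }

    static LocationSketch offHeap(long absoluteAddress) {
        return new LocationSketch(null, absoluteAddress);
    }

    boolean isOffHeap() {
        return obj == null;
    }

    /** Reads one byte; only on-heap is supported in this sketch. */
    byte readByte() {
        if (isOffHeap()) {
            throw new UnsupportedOperationException("off-heap access needs Unsafe");
        }
        return ((byte[]) obj)[(int) offset];
    }

    public static void main(String[] args) {
        LocationSketch on = onHeap(new byte[] {10, 20, 30}, 1);
        System.out.println("on-heap byte: " + on.readByte());
        System.out.println("off-heap? " + offHeap(4096L).isOffHeap());
    }
}
```

Because the on-heap form holds a reference rather than a raw address, it stays valid even if the garbage collector moves the base object.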

MemoryBlock is the subclass of MemoryLocation. As the name suggests, it represents a block of memory. In both off-heap and on-heap mode, Project Tungsten stores data in a contiguous region of memory, so GC overhead is reduced even in on-heap mode. The MemoryBlock class comment reads:

MemoryBlock.java:

/**
 * A consecutive block of memory, starting at a {@link MemoryLocation} with a fixed size.
 *
 * Annotation: a contiguous memory block that extends MemoryLocation (which describes the
 * address) and adds the size of the block.
 */
public class MemoryBlock extends MemoryLocation {

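Code can be reused in two ways: inheritance and composition. MemoryBlock currently uses inheritance to carry the address information of the block. It could equally have been implemented with composition, holding the block's address together with the block's own size.

Besides what it inherits from MemoryLocation, MemoryBlock has the following notable members:

1) private final long length: the length of the memory block.

2) public int pageNumber: the page number assigned to the block.

3) public static MemoryBlock fromLongArray(final long[] array): converts a long array into a MemoryBlock.

With memory blocks in place, the next question is how to organize them. Project Tungsten adopts an approach similar to operating-system memory management and uses a Page Table. The Page Table approach is analyzed next.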

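4. How memory is organized and managed in the memory management model

Spark is a framework that processes data at partition granularity: each partition is handled by one task (Task). Memory organization and management can therefore be understood through TaskMemoryManager, which corresponds one-to-one with a Task.

The relationship between TaskMemoryManager and MemoryManager is shown in the following figure:

Figure 9-4: Relationship between TaskMemoryManager and MemoryManager

In the figure, each MemoryConsumer is an entity that uses (consumes) memory blocks during processing. A MemoryConsumer acquires or releases memory, that is, memory blocks, from MemoryManager through the interfaces provided by TaskMemoryManager. TaskMemoryManager manages all MemoryConsumers and organizes the memory blocks they have acquired, which it does by means of a Page Table.

Start with the class comment. The original comment is long, so only a brief summary is given here: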

/**
 * Manages the memory allocated by an individual task.
 * ……
 * How a memory address is represented in each memory mode:
 * 1. off-heap: directly, as a 64-bit memory address.
 * 2. on-heap: as a base object plus a 64-bit offset within that object.
 * The wrapper class MemoryBlock represents memory blocks uniformly:
 * 1. off-heap: the MemoryBlock's base object is null and the offset is the 64-bit
 *    absolute address.
 * 2. on-heap: the MemoryBlock's base object holds the object reference (which can be
 *    looked up in pageTable by page index), and the offset is the offset of the data
 *    within that object.
 *
 * Both modes are encoded into one external format: 13-bit pageNumber + 51-bit offset.
 */
public class TaskMemoryManager {

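TaskMemoryManager is analyzed below from three angles: encoding and decoding of memory addresses, organization and management of the Page Table, and allocation and release of memory.

First, address encoding and decoding. As the TaskMemoryManager class comment states, the off-heap and on-heap modes expose the same encoded format: a 13-bit pageNumber (page number) and a 51-bit offset. The encoding is illustrated below:

Figure 9-5: Page encoding

The encoding and decoding interfaces in TaskMemoryManager are encodePageNumberAndOffset, decodePageNumber, and decodeOffset. Their source and analysis follow.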

encodePageNumberAndOffset in TaskMemoryManager.java:

/**
 * Given a memory page and offset within that page, encode this address into a 64-bit long.
 * This address will remain valid as long as the corresponding page has not been freed.
 *
 * @param page a data page allocated by {@link TaskMemoryManager#allocatePage}
 * @param offsetInPage an offset in this page which incorporates the base offset. In other words,
 *                     this should be the value that you would pass as the base offset into an
 *                     UNSAFE call (e.g. page.baseOffset() + something).
 * @return an encoded page address.
 *
 * Annotation: encodes an address within a Page.
 * on-heap: offsetInPage is the offset relative to the base object.
 * off-heap: offsetInPage is an absolute address, so before encoding it must be converted
 *           to an offset relative to the absolute base address of the Page (MemoryBlock).
 * The resulting offset and the page number are then packed into the 13 + 51 bits of a
 * 64-bit value.
 */
public long encodePageNumberAndOffset(MemoryBlock page, long offsetInPage) {
  if (tungstenMemoryMode == MemoryMode.OFF_HEAP) {
    // In off-heap mode offsetInPage is a 64-bit absolute address, which has to fit
    // into the 51 bits available in the page encoding. It is therefore converted
    // to an address relative to the page, that is, an offset within the page.
    offsetInPage -= page.getBaseOffset();
  }
  return encodePageNumberAndOffset(page.pageNumber, offsetInPage);
}
……
@VisibleForTesting
public static long encodePageNumberAndOffset(int pageNumber, long offsetInPage) {
  assert (pageNumber != -1) : "encodePageNumberAndOffset called with invalid page";
  // Pack the 13-bit page number and the 51-bit in-page offset into a 64-bit encoded address.
  return (((long) pageNumber) << OFFSET_BITS) | (offsetInPage & MASK_LONG_LOWER_51_BITS);
}

The pageNumber leads to the Page, and the Page stores the start address of its memory block (or, in on-heap mode, the base offset within the object). An encoded address can therefore be decoded back to the original address by looking up the Page.

decodePageNumber and decodeOffset in TaskMemoryManager.java:

@VisibleForTesting
public static int decodePageNumber(long pagePlusOffsetAddress) {
  // Extract the page number from the encoded address
  return (int) (pagePlusOffsetAddress >>> OFFSET_BITS);
}

private static long decodeOffset(long pagePlusOffsetAddress) {
  // Extract the offset from the encoded address with the 51-bit mask, i.e. the low 51 bits
  return (pagePlusOffsetAddress & MASK_LONG_LOWER_51_BITS);
}
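To see the 13 + 51 bit layout in action, the following self-contained Java sketch repeats the encode and decode arithmetic and checks that a round trip returns the original values. The class name is hypothetical, and the constants are derived here from the 13-bit page number described in the text rather than copied from Spark.

```java
// Hypothetical stand-alone copy of the 13 + 51 bit address layout (not Spark code).
final class PageAddressSketch {
    static final int PAGE_NUMBER_BITS = 13;
    static final int OFFSET_BITS = 64 - PAGE_NUMBER_BITS;            // 51
    static final long MASK_LONG_LOWER_51_BITS = (1L << OFFSET_BITS) - 1;

    /** Packs the page number into the high 13 bits and the offset into the low 51 bits. */
    static long encode(int pageNumber, long offsetInPage) {
        return (((long) pageNumber) << OFFSET_BITS) | (offsetInPage & MASK_LONG_LOWER_51_BITS);
    }

    /** Unsigned shift so that page numbers using the top bit stay positive. */
    static int decodePageNumber(long address) {
        return (int) (address >>> OFFSET_BITS);
    }

    static long decodeOffset(long address) {
        return address & MASK_LONG_LOWER_51_BITS;
    }

    public static void main(String[] args) {
        long address = encode(42, 1024L);
        System.out.println("page=" + decodePageNumber(address)
            + " offset=" + decodeOffset(address));
    }
}
```

The unsigned shift `>>>` matters: the highest page numbers set the sign bit of the long, and a signed shift would turn them into negative values.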

TaskMemoryManager also provides an interface for obtaining the base object in on-heap mode. Its source and analysis follow.

getPage in TaskMemoryManager.java:

/**
 * Get the page associated with an address encoded by
 * {@link TaskMemoryManager#encodePageNumberAndOffset(MemoryBlock, long)}
 *
 * Annotation: returns the base object associated with the encoded address.
 */
public Object getPage(long pagePlusOffsetAddress) {
  if (tungstenMemoryMode == MemoryMode.ON_HEAP) {
    // First extract the page number from the address
    final int pageNumber = decodePageNumber(pagePlusOffsetAddress);
    assert (pageNumber >= 0 && pageNumber < PAGE_TABLE_SIZE);
    // Look up the memory block in pageTable by page number
    final MemoryBlock page = pageTable[pageNumber];
    assert (page != null);
    assert (page.getBaseObject() != null);
    // Return the base object of the memory block
    return page.getBaseObject();
  } else {
    // In off-heap mode a MemoryBlock only needs an absolute address, so the base object is null
    return null;
  }
}

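Next comes the organization and management of the Page Table. The following figure sketches how memory is organized and managed through the Page Table:

Figure 9-6: Page Table organization and management

In the figure, the right-hand side shows the allocated memory blocks, that is, the Pages to be managed. TaskMemoryManager stores memory blocks in the Page Table. The variable allocatedPages records, at the index equal to each Page Number, a bit whose value of 1 indicates that the corresponding Page Table slot already holds a memory block. Whenever a memory block is allocated, a position (page number) whose bit is 0 is taken from allocatedPages and used as the block's slot in the Page Table.

Put simply, a 1 or 0 at each position in allocatedPages indicates whether the same position in the Page Table already holds a memory block (Page).

The memory blocks stored in the Page Table are exactly the allocated blocks shown on the right.

To resolve an encoded page address (Page Encode), first extract the Page Number and use it to look up the memory block (MemoryBlock, or Page) in the Page Table. Then use the offset in the encoded address (its meaning in each memory mode is shown in the figure) to locate the position within that block. In off-heap mode the offset is relative to the block's absolute base address (as analyzed earlier, the block's own information also depends on the memory mode); in on-heap mode the offset is relative to the block's base object.

The relevant source consists of two TaskMemoryManager member variables:

TaskMemoryManager.java:

// The Page Table in the figure
private final MemoryBlock[] pageTable = new MemoryBlock[PAGE_TABLE_SIZE];

// allocatedPages in the figure
private final BitSet allocatedPages = new BitSet(PAGE_TABLE_SIZE);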
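The interplay of the two members can be shown with a minimal Java sketch of the slot bookkeeping alone. The class and method names are hypothetical, and plain Object stands in for MemoryBlock; the table size of 8192 is assumed from the 13-bit page number.

```java
import java.util.BitSet;

// Hypothetical sketch of the slot bookkeeping only (not Spark code).
final class PageSlotSketch {
    static final int PAGE_TABLE_SIZE = 1 << 13; // 8192 slots for a 13-bit page number

    private final Object[] pageTable = new Object[PAGE_TABLE_SIZE];
    private final BitSet allocatedPages = new BitSet(PAGE_TABLE_SIZE);

    /** Stores a block in the first free slot and returns its page number. */
    synchronized int put(Object block) {
        int pageNumber = allocatedPages.nextClearBit(0);
        if (pageNumber >= PAGE_TABLE_SIZE) {
            throw new IllegalStateException("page table is full");
        }
        allocatedPages.set(pageNumber);
        pageTable[pageNumber] = block;
        return pageNumber;
    }

    /** Clears the slot so that the page number can be reused. */
    synchronized void remove(int pageNumber) {
        if (!allocatedPages.get(pageNumber)) {
            throw new IllegalStateException("page " + pageNumber + " is not allocated");
        }
        pageTable[pageNumber] = null;
        allocatedPages.clear(pageNumber);
    }

    synchronized Object get(int pageNumber) {
        return pageTable[pageNumber];
    }

    public static void main(String[] args) {
        PageSlotSketch table = new PageSlotSketch();
        int first = table.put("block-a");
        int second = table.put("block-b");
        table.remove(first);
        System.out.println("reused slot: " + table.put("block-c") + ", other slot: " + second);
    }
}
```

Since nextClearBit(0) always scans from the start, freed page numbers are reused before higher ones are handed out.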

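The page number and offset used in Page Table organization were described in detail in the previous part. The concrete management operations are tied to memory allocation and release, so the management details are examined through that code.

We now analyze the allocation and release code in TaskMemoryManager, mainly the allocatePage and freePage methods. For how allocatePage acquires memory internally and the spill strategy applied while doing so, readers can dig further, for example by reading the source of acquireExecutionMemory.

allocatePage in TaskMemoryManager.java: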

/**
 * Allocate a block of memory that will be tracked in the MemoryManager's page table; this is
 * intended for allocating large blocks of Tungsten memory that will be shared between operators.
 *
 * Returns `null` if there was not enough memory to allocate the page. May return a page that
 * contains fewer bytes than requested, so callers should verify the size of returned pages.
 *
 * Annotation: allocates a block of memory and tracks it in the Page Table (which actually
 * lives in TaskMemoryManager rather than MemoryManager); the memory comes from the
 * Execution region. Project Tungsten memory has off-heap and on-heap modes, and
 * tungstenMemoryMode (set in MemoryManager) determines which MemoryAllocator subclass
 * performs the allocation.
 */
public MemoryBlock allocatePage(long size, MemoryConsumer consumer) {
  assert(consumer != null);
  assert(consumer.getMode() == tungstenMemoryMode);
  // Limit on page size
  if (size > MAXIMUM_PAGE_SIZE_BYTES) {
    throw new IllegalArgumentException(
      "Cannot allocate a page with more than " + MAXIMUM_PAGE_SIZE_BYTES + " bytes");
  }
  // Acquire the requested amount of memory
  long acquired = acquireExecutionMemory(size, consumer);
  if (acquired <= 0) {
    return null;
  }

  final int pageNumber;
  synchronized (this) {
    // Find a page number that is not yet in use
    pageNumber = allocatedPages.nextClearBit(0);
    if (pageNumber >= PAGE_TABLE_SIZE) {
      releaseExecutionMemory(acquired, consumer);
      throw new IllegalStateException(
        "Have already allocated a maximum of " + PAGE_TABLE_SIZE + " pages");
    }
    // Mark the page number as in use (set the bit at that position)
    allocatedPages.set(pageNumber);
  }
  MemoryBlock page = null;
  try {
    // Now actually allocate memory through MemoryAllocator.
    // Note: acquireExecutionMemory, which goes through ExecutionMemoryPool, only
    // controls the amount of memory in use; it does not allocate anything.
    // Interested readers can look at the call sites of acquireExecutionMemory: some
    // pass a memory mode different from tungstenMemoryMode, in which case no real
    // allocation takes place.
    page = memoryManager.tungstenMemoryAllocator().allocate(acquired);
  } catch (OutOfMemoryError e) {
    logger.warn("Failed to allocate a page ({} bytes), try again.", acquired);
    // there is no enough memory actually, it means the actual free memory is smaller than
    // MemoryManager thought, we should keep the acquired memory.
    synchronized (this) {
      acquiredButNotUsed += acquired;
      allocatedPages.clear(pageNumber);
    }
    // this could trigger spilling to free some pages.
    return allocatePage(size, consumer);
  }

  // Once the block is allocated, set its pageNumber,
  // i.e. its position in the Page Table that manages it.
  page.pageNumber = pageNumber;
  pageTable[pageNumber] = page;
  if (logger.isTraceEnabled()) {
    logger.trace("Allocate page number {} ({} bytes)", pageNumber, acquired);
  }
  return page;
}

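Here MAXIMUM_PAGE_SIZE_BYTES is the limit on the amount of data in one page. MemoryBlock's interface for converting a long array into a MemoryBlock shows that a contiguous memory block is currently obtained from a long array, so the block size is also bounded by the maximum array length.

Whether further limits apply to the amount of data in a page during actual processing depends on the processing details. The next section walks through the source of a concrete processing flow, which covers this point.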
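The array-length bound can be turned into a number. The sketch below assumes the limit is defined as the largest long[] length (2^31 - 1 elements) times 8 bytes per element, which is how the constant is understood to be defined in Spark 2.1; the class and helper names are hypothetical.

```java
// Hypothetical sketch of the page-size limit implied by a long[] backing array.
final class PageSizeLimitSketch {
    // Assumed definition: (2^31 - 1) array elements, 8 bytes per long.
    static final long MAXIMUM_PAGE_SIZE_BYTES = ((1L << 31) - 1) * 8L;

    static boolean fitsInOnePage(long sizeBytes) {
        return sizeBytes <= MAXIMUM_PAGE_SIZE_BYTES;
    }

    public static void main(String[] args) {
        System.out.println("limit bytes: " + MAXIMUM_PAGE_SIZE_BYTES);
        System.out.println("16 GiB fits? " + fitsInOnePage(16L * 1024 * 1024 * 1024));
    }
}
```

The limit works out to 8 bytes short of 16 GiB, so a request for exactly 16 GiB is rejected.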

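Because allocation involves many details, the main steps are:

1) Call acquireExecutionMemory to request memory from ExecutionMemoryPool (handled by the unified or static implementation, whichever is in use). This step mainly checks whether available memory can satisfy the request and updates the pool's available-memory information (in practice, the amount of memory in use) according to the result.

2) Find a free position in the current Page Table to hold the requested memory block (MemoryBlock, or Page).

3) With the first two steps done, actually allocate the memory block through MemoryAllocator.

4) Put the allocated memory block into the Page Table.

Throughout this process, the use of the two member variables allocatedPages and pageTable is the key to how the Page Table is organized and managed.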
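The four steps can be condensed into a simplified, hypothetical Java model. It is not the Spark implementation: a single counter stands in for ExecutionMemoryPool, a long[] stands in for the block returned by MemoryAllocator, there is no spilling or retry on OutOfMemoryError, and requested sizes are assumed small enough for an int-sized array.

```java
import java.util.BitSet;

// Hypothetical, simplified model of the four allocation steps (not Spark code).
final class AllocateFlowSketch {
    private static final int PAGE_TABLE_SIZE = 1 << 13;

    private final long poolSize;   // stand-in for the ExecutionMemoryPool capacity
    private long used;             // bytes currently accounted as in use
    private final long[][] pageTable = new long[PAGE_TABLE_SIZE][];
    private final long[] grantedBytes = new long[PAGE_TABLE_SIZE];
    private final BitSet allocatedPages = new BitSet(PAGE_TABLE_SIZE);

    AllocateFlowSketch(long poolSize) {
        this.poolSize = poolSize;
    }

    /** Returns the page number, or -1 when the pool has nothing left to grant. */
    synchronized int allocatePage(long sizeBytes) {
        // Step 1: accounting only; grant at most what the pool still has.
        long granted = Math.min(sizeBytes, poolSize - used);
        if (granted <= 0) {
            return -1;
        }
        used += granted;
        // Step 2: find a free slot in the page table.
        int pageNumber = allocatedPages.nextClearBit(0);
        if (pageNumber >= PAGE_TABLE_SIZE) {
            used -= granted; // roll back the accounting
            throw new IllegalStateException("page table is full");
        }
        allocatedPages.set(pageNumber);
        // Step 3: the real allocation, here a long[] rounded up to whole 8-byte words.
        long[] block = new long[(int) ((granted + 7) / 8)];
        // Step 4: record the block in the page table.
        pageTable[pageNumber] = block;
        grantedBytes[pageNumber] = granted;
        return pageNumber;
    }

    /** Reverses the steps: clear the slot, drop the block, return the bytes to the pool. */
    synchronized void freePage(int pageNumber) {
        if (!allocatedPages.get(pageNumber)) {
            throw new IllegalStateException("page " + pageNumber + " is not allocated");
        }
        pageTable[pageNumber] = null;
        allocatedPages.clear(pageNumber);
        used -= grantedBytes[pageNumber];
        grantedBytes[pageNumber] = 0;
    }

    synchronized long used() {
        return used;
    }

    public static void main(String[] args) {
        AllocateFlowSketch manager = new AllocateFlowSketch(100);
        int page = manager.allocatePage(64);
        System.out.println("page=" + page + " used=" + manager.used());
        manager.freePage(page);
        System.out.println("used after free=" + manager.used());
    }
}
```

Keeping the accounting (step 1) separate from the real allocation (step 3) is what lets the pools limit memory use without ever touching memory themselves.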

The freePage source is analyzed next:

freePage in TaskMemoryManager.java:

/**
 * Free a block of memory allocated via {@link TaskMemoryManager#allocatePage}.
 *
 * Annotation: updates the Page Table information, releases the Page's memory through
 * MemoryAllocator, and finally has MemoryManager adjust the amount of memory in use in
 * ExecutionMemoryPool (that is, release it).
 */
public void freePage(MemoryBlock page, MemoryConsumer consumer) {
  // First confirm that the block being freed is managed by the Page Table,
  // i.e. its page number must be valid
  assert (page.pageNumber != -1) :
    "Called freePage() on memory that wasn't allocated with allocatePage()";
  assert(allocatedPages.get(page.pageNumber));
  pageTable[page.pageNumber] = null;
  // allocatedPages controls whether each pageTable slot is available;
  // freeing and allocating can run concurrently, so access is synchronized
  synchronized (this) {
    allocatedPages.clear(page.pageNumber);
  }
  if (logger.isTraceEnabled()) {
    logger.trace("Freed page number {} ({} bytes)", page.pageNumber, page.size());
  }
  // Actually release the block through the MemoryAllocator for the current memory mode
  long pageSize = page.size();
  memoryManager.tungstenMemoryAllocator().free(page);
  // Release the corresponding amount in ExecutionMemoryPool;
  // compare with the earlier analysis of acquireExecutionMemory
  releaseExecutionMemory(pageSize, consumer);
}

The logic for freeing a Page mirrors that for allocating one; for the most part the steps simply run in reverse.

This post was compiled from the video course by 王家林 (Wang Jialin), founder of DT大数据梦工厂.
