RDD Memory Management

Memory Management
• Three options for persistent RDDs
  – in-memory storage as deserialized Java objects
  – in-memory storage as serialized data
  – on-disk storage
• LRU eviction policy at the level of RDDs
  – when there's not enough memory, evict a partition from the least recently accessed RDD




Checkpointing
• Checkpoint RDDs to prevent long lineage chains during fault recovery
• Simpler to checkpoint than shared memory
  – read-only nature of RDDs
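In the Spark API, checkpointing is requested explicitly on an RDD. The following is a minimal runnable sketch; the checkpoint directory, master URL, and the artificially long map chain are illustrative, not prescribed by the paper:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("checkpoint-demo").setMaster("local[*]"))

    // A reliable directory (HDFS on a real cluster) must be set before checkpointing.
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    // Build a long lineage by chaining many transformations.
    var rdd = sc.parallelize(1 to 1000)
    for (_ <- 1 to 100) rdd = rdd.map(_ + 1)

    // Mark the RDD for checkpointing: once materialized, its lineage is
    // replaced by a pointer to the data on stable storage, so recovery
    // no longer has to replay the whole chain of maps.
    rdd.checkpoint()
    rdd.count() // forces evaluation, which also writes the checkpoint

    println(rdd.toDebugString) // lineage now bottoms out in a checkpoint
    sc.stop()
  }
}
```

Because RDDs are read-only, this write can happen asynchronously in the background; there is no need to stop the program or capture a consistent snapshot, as a mutable shared-memory system would require.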






Spark provides three options for storage of persistent RDDs:

  • in-memory storage as deserialized Java objects: the fastest option, because the Java VM can access each RDD element natively;
  • in-memory storage as serialized data: lets users choose a more memory-efficient representation than Java object graphs when space is limited, at the cost of lower performance;
  • on-disk storage: useful for RDDs that are too large to keep in RAM but costly to recompute on each use.
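In the Spark API these three options map to storage levels passed to persist(). A minimal sketch, assuming an existing SparkContext named sc and an illustrative input path:

```scala
import org.apache.spark.storage.StorageLevel

// Assuming an existing SparkContext `sc`; the HDFS path is illustrative.
val asObjects = sc.textFile("hdfs://host/data/input")
  .persist(StorageLevel.MEMORY_ONLY)      // option 1: deserialized objects in memory

val asBytes = sc.textFile("hdfs://host/data/input")
  .persist(StorageLevel.MEMORY_ONLY_SER)  // option 2: serialized bytes in memory

val onDisk = sc.textFile("hdfs://host/data/input")
  .persist(StorageLevel.DISK_ONLY)        // option 3: spill to disk only
```

Note that a storage level can only be assigned once per RDD, which is why the sketch uses three separate RDDs rather than re-persisting one.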

To manage the limited memory available, we use an LRU eviction policy at the level of RDDs. When a new RDD partition is computed but there is not enough space to store it, we evict a partition from the least recently accessed RDD, unless this is the same RDD as the one with the new partition. In that case, we keep the old partition in memory to prevent cycling partitions from the same RDD in and out. This is important because most operations will run tasks over an entire RDD, so it is quite likely that the partition already in memory will be needed in the future. We found this default policy to work well in all our applications so far, but we also give users further control via a “persistence priority” for each RDD.
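The eviction rule can be paraphrased in a short, self-contained sketch. The class and names below are invented for illustration and are not Spark's internal BlockManager code; the per-RDD persistence priority is omitted for brevity:

```scala
import scala.collection.mutable

// Hypothetical model of RDD-level LRU eviction (illustration only).
case class PartitionId(rddId: Int, index: Int)

class RddLruCache(capacity: Int) {
  private val cached = mutable.Set.empty[PartitionId]
  // rddId -> logical time of the most recent access to any of its partitions
  private val lastAccess = mutable.Map.empty[Int, Long]
  private var clock = 0L

  def access(p: PartitionId): Unit = { clock += 1; lastAccess(p.rddId) = clock }

  /** Store a newly computed partition, evicting older partitions if needed. */
  def store(p: PartitionId): Unit = {
    while (cached.size >= capacity) {
      // Candidate victims come from the least recently accessed RDD,
      // excluding the RDD that owns the incoming partition, so that one
      // RDD's partitions do not cycle in and out of memory.
      val candidates = lastAccess.toSeq.filter { case (rddId, _) =>
        rddId != p.rddId && cached.exists(_.rddId == rddId)
      }
      if (candidates.isEmpty) return // only the new RDD itself is cached: keep old data
      val victimRdd = candidates.minBy(_._2)._1
      cached -= cached.find(_.rddId == victimRdd).get
    }
    cached += p
    access(p)
  }
}
```

The key design choice is that recency is tracked per RDD rather than per partition: since most operations scan an entire RDD, any partition of a recently used RDD is likely to be needed again soon.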

Finally, each instance of Spark on a cluster currently has its own separate memory space. In future work, we plan to investigate sharing RDDs across instances of Spark through a unified memory manager.

