SparkCore: RDD Persistence Strategies and the persist and cache Operators


    • 1. Introduction to RDD Persistence
    • 2. The persist() and cache() operators
      • 2.1 cache source code
      • 2.2 StorageLevel
      • 2.2 Using StorageLevel
      • 2.3 How to choose a StorageLevel
      • 2.4 Removing cached data with RDD.unpersist()

Official docs: RDD Persistence

1. Introduction to RDD Persistence

  • One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.

  • You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.

scala> val lines = sc.textFile("hdfs://")
scala> lines.cache


scala> lines.collect
res1: Array[String] = Array(hello spark, hello mr, hello yarn, hello hive, hello spark)
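
The reuse described above can be observed by counting how often a transformation actually runs. The following is a minimal sketch, assuming Spark is on the classpath and a local master is acceptable; the object name `CacheReuseSketch`, the helper `mapCalls`, and the accumulator name are illustrative and not part of the original session.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheReuseSketch {
  // Runs two actions over a 5-element RDD and returns how many times
  // the map function was invoked in total (hypothetical helper).
  def mapCalls(sc: SparkContext, useCache: Boolean): Long = {
    val calls = sc.longAccumulator("mapCalls")
    val lines = Seq("hello spark", "hello mr", "hello yarn", "hello hive", "hello spark")
    val mapped = sc.parallelize(lines, 2).map { line => calls.add(1); line.toUpperCase }

    if (useCache) mapped.cache() // only marks the RDD; nothing is computed yet
    mapped.count()               // first action: computes the partitions (and caches them if marked)
    mapped.count()               // second action: reads the cache, or recomputes if not cached
    if (useCache) mapped.unpersist(blocking = true)
    calls.value
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("cache-reuse-sketch")
    val sc = new SparkContext(conf)
    try {
      println(s"without cache: ${mapCalls(sc, useCache = false)} map calls")
      println(s"with cache:    ${mapCalls(sc, useCache = true)} map calls")
    } finally sc.stop()
  }
}
```

In a local run with no task retries and no cache eviction, one would expect 10 map calls without caching and 5 with it, since the second action then reads the stored partitions instead of re-running the map.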



2.1 cache source code

Looking at the source of cache(): under the hood it simply calls persist(), whose default storage level is MEMORY_ONLY.

  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
  def cache(): this.type = persist()

  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
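
The delegation chain can also be checked from user code, since an RDD reports its assigned level through `getStorageLevel`. This is a small sketch under the assumption of a local Spark setup; `CacheLevelSketch` and `levels` are hypothetical names introduced only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheLevelSketch {
  // Returns the levels reported by RDDs marked via cache(), persist(),
  // and persist(MEMORY_ONLY) respectively (hypothetical helper).
  def levels(sc: SparkContext): (StorageLevel, StorageLevel, StorageLevel) = {
    val viaCache   = sc.parallelize(1 to 5).cache()
    val viaPersist = sc.parallelize(1 to 5).persist()
    val explicit   = sc.parallelize(1 to 5).persist(StorageLevel.MEMORY_ONLY)
    (viaCache.getStorageLevel, viaPersist.getStorageLevel, explicit.getStorageLevel)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("cache-level-sketch")
    val sc = new SparkContext(conf)
    try {
      val (a, b, c) = levels(sc)
      // All three should describe the same level, MEMORY_ONLY.
      println(s"cache(): ${a.description}; persist(): ${b.description}; explicit: ${c.description}")
    } finally sc.stop()
  }
}
```

Note that once a level has been assigned, calling persist() with a different level throws an UnsupportedOperationException; the RDD has to be unpersisted first, as the session in the StorageLevel usage section does.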

2.2 StorageLevel

/**
 * Various [[org.apache.spark.storage.StorageLevel]] defined and utility functions for creating
 * new storage levels.
 */
object StorageLevel {
  val NONE = new StorageLevel(false, false, false, false)
  val DISK_ONLY = new StorageLevel(true, false, false, false)
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  val OFF_HEAP = new StorageLevel(true, true, true, false, 1)

Clicking through to the StorageLevel class behind `new` shows what the trailing false, true, and 2 arguments mean:

class StorageLevel private(
    private var _useDisk: Boolean,      // whether to use disk
    private var _useMemory: Boolean,    // whether to use memory
    private var _useOffHeap: Boolean,   // whether to use off-heap memory
    private var _deserialized: Boolean, // whether data is kept deserialized (as Java objects)
    private var _replication: Int = 1)  // number of replicas
  extends Externalizable {}
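
Because the constructor is private, user code reads these five fields through the public accessors on a level. The sketch below prints them for a few of the predefined constants; it assumes only that spark-core is on the classpath, and `StorageLevelFlagsSketch` and `describe` are illustrative names.

```scala
import org.apache.spark.storage.StorageLevel

object StorageLevelFlagsSketch {
  // Renders the five constructor fields of a level via its public accessors.
  def describe(name: String, level: StorageLevel): String =
    s"$name: disk=${level.useDisk}, memory=${level.useMemory}, offHeap=${level.useOffHeap}, " +
      s"deserialized=${level.deserialized}, replication=${level.replication}"

  def main(args: Array[String]): Unit = {
    Seq(
      "MEMORY_ONLY"           -> StorageLevel.MEMORY_ONLY,
      "MEMORY_ONLY_SER"       -> StorageLevel.MEMORY_ONLY_SER,
      "DISK_ONLY_2"           -> StorageLevel.DISK_ONLY_2,
      "MEMORY_AND_DISK_SER_2" -> StorageLevel.MEMORY_AND_DISK_SER_2
    ).foreach { case (name, level) => println(describe(name, level)) }
  }
}
```

Reading the output next to the constant definitions above makes the argument order (disk, memory, off-heap, deserialized, replication) easy to verify.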

2.2 Using StorageLevel

scala> lines.unpersist(true)  // drop the cache; the earlier cache entry no longer shows in the UI
scala> import org.apache.spark.storage.StorageLevel  // import the StorageLevel class
scala> lines.persist(StorageLevel.MEMORY_AND_DISK_SER_2)
scala> lines.count
19/08/02 04:26:21 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
19/08/02 04:26:21 WARN BlockManager: Block rdd_1_1 replicated to only 0 peer(s) instead of 1 peers
19/08/02 04:26:21 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
19/08/02 04:26:21 WARN BlockManager: Block rdd_1_0 replicated to only 0 peer(s) instead of 1 peers
res4: Long = 5

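In the UI, the storage level has now changed to memory, serialized, with 2 replicas. (The warnings above indicate that no peer was available to hold the second replica.)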

2.3 How to choose a StorageLevel

Spark’s storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:

  • If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.

  • If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access. (Java and Scala)

  • Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.

  • Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.
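
The selection process above can be written down as a small decision function. This is one possible reading of the guidance, not an official rule: the `Workload` fields and the function `chooseLevelName` are hypothetical, and the choice of MEMORY_AND_DISK_SER for the spill case is an assumption. The code is plain Scala and returns the level's name as a string.

```scala
object StorageLevelChoiceSketch {
  // Hypothetical description of a workload; none of these names come from Spark.
  final case class Workload(
      fitsInMemoryDeserialized: Boolean, // fits comfortably with MEMORY_ONLY
      fitsInMemorySerialized: Boolean,   // fits once objects are serialized
      expensiveToRecompute: Boolean,     // costly functions, or heavy filtering upstream
      needsFastRecovery: Boolean)        // e.g. serving requests from a web application

  def chooseLevelName(w: Workload): String = {
    val base =
      if (w.fitsInMemoryDeserialized) "MEMORY_ONLY"  // most CPU-efficient option
      else if (w.fitsInMemorySerialized) "MEMORY_ONLY_SER"
      else if (w.expensiveToRecompute) "MEMORY_AND_DISK_SER" // spill only when recomputing is costly
      else "MEMORY_ONLY_SER" // otherwise let missing partitions be recomputed
    if (w.needsFastRecovery) base + "_2" else base // replicated variant
  }

  def main(args: Array[String]): Unit = {
    println(chooseLevelName(Workload(true, true, false, false)))
    println(chooseLevelName(Workload(false, false, true, true)))
  }
}
```

The returned names match the constants listed in the StorageLevel object, so they can be turned into real levels with `StorageLevel.fromString(name)` before being passed to persist().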

2.4 Removing cached data with RDD.unpersist()

Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.

At the end of your program, before sc.stop(), it is good practice to call RDD.unpersist() to release the cache.
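
One way to make sure the cache is always released is to pair persist() with unpersist() in a try/finally block. The following is a minimal sketch assuming a local Spark setup; `UnpersistSketch` and `run` are illustrative names, and the data is made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object UnpersistSketch {
  // Caches an RDD, runs two actions on it, then releases it.
  // Returns the storage level seen while cached and after unpersist.
  def run(sc: SparkContext): (StorageLevel, StorageLevel) = {
    val data = sc.parallelize(1 to 1000).map(_ * 2).persist(StorageLevel.MEMORY_ONLY)
    val whileCached =
      try {
        data.count()       // first action fills the cache
        data.reduce(_ + _) // second action reads from it
        data.getStorageLevel
      } finally {
        data.unpersist(blocking = true) // release the blocks even if an action throws
      }
    (whileCached, data.getStorageLevel)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("unpersist-sketch")
    val sc = new SparkContext(conf)
    try {
      val (before, after) = run(sc)
      println(s"while cached: ${before.description}; after unpersist: ${after.description}")
    } finally sc.stop() // stop the context last, after the cache has been released
  }
}
```

Passing `blocking = true` makes the call wait until the blocks are actually removed, which matches the `unpersist(true)` call used in the earlier session.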
