You will often see it said of Spark that cache is a special case of persist.
Let's look at the source code to see what this statement actually means:
\spark-1.5.0\core\src\main\scala\org\apache\spark\rdd\RDD.scala
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()

/**
 * Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
 *
 * @param blocking Whether to block until all blocks are deleted.
 * @return This RDD.
 */
def unpersist(blocking: Boolean = true): this.type = {
  logInfo("Removing RDD " + id + " from persistence list")
  sc.unpersistRDD(id, blocking)
  storageLevel = StorageLevel.NONE
  this
}
From the source above it is clear that calling cache() is simply a call to persist() with the storage level fixed at StorageLevel.MEMORY_ONLY, which is why cache is said to be a special case of persist.
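To see the equivalence in practice, here is a minimal sketch (the object name, application name, local master URL, and variable names are all illustrative, not from the Spark source):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheVsPersist {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-vs-persist").setMaster("local[*]"))
    val nums = sc.parallelize(1 to 1000000)

    // These two calls are equivalent: cache() just delegates to persist(StorageLevel.MEMORY_ONLY).
    val cached    = nums.map(_ * 2).cache()
    val persisted = nums.map(_ * 2).persist(StorageLevel.MEMORY_ONLY)

    cached.count()      // the first action computes the partitions and stores them in memory
    cached.count()      // later actions read the cached blocks instead of recomputing the lineage

    cached.unpersist()  // remove the blocks once they are no longer needed
    persisted.unpersist()
    sc.stop()
  }
}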
While we are at it, let's also look at the available storage (cache) levels.
First, the source location:
\spark-1.5.0\core\src\main\scala\org\apache\spark\storage\StorageLevel.scala
/**
 * Various [[org.apache.spark.storage.StorageLevel]] defined and utility functions for creating
 * new storage levels.
 */
object StorageLevel {
  val NONE = new StorageLevel(false, false, false, false)
  val DISK_ONLY = new StorageLevel(true, false, false, false)
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  val OFF_HEAP = new StorageLevel(false, false, true, false) // dedicated to Tachyon
  ...
// This class takes five parameters; compare the arguments passed in by the companion object above.
class StorageLevel private(
    private var _useDisk: Boolean,
    private var _useMemory: Boolean,
    private var _useOffHeap: Boolean,    // whether to keep the RDD in Tachyon's distributed in-memory storage
    private var _deserialized: Boolean,  // serialized storage (e.g. with Kryo) saves memory but costs extra CPU
    private var _replication: Int = 1)
  extends Externalizable {

  // TODO: Also add fields for caching priority, dataset ID, and flushing.
  private def this(flags: Int, replication: Int) {
    this((flags & 8) != 0, (flags & 4) != 0, (flags & 2) != 0, (flags & 1) != 0, replication)
  }

  def this() = this(false, true, false, false)  // For deserialization

  def useDisk: Boolean = _useDisk
  def useMemory: Boolean = _useMemory
  def useOffHeap: Boolean = _useOffHeap
  def deserialized: Boolean = _deserialized
  def replication: Int = _replication
  ...
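As a quick worked example of the private (flags, replication) constructor quoted above, each storage property occupies one bit of the flags mask. The snippet below is only an illustration of that bit decoding, not a Spark API call:

// disk = 8, memory = 4, off-heap = 2, deserialized = 1
val flags = 12                        // 8 | 4
val useDisk      = (flags & 8) != 0   // true
val useMemory    = (flags & 4) != 0   // true
val useOffHeap   = (flags & 2) != 0   // false
val deserialized = (flags & 1) != 0   // false
// flags = 12 therefore describes MEMORY_AND_DISK_SER: memory + disk, stored serialized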
So we can also call persist() with a StorageLevel of our own choosing to get exactly the caching level we need.
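For example, a job whose cached data does not fit comfortably in memory might pick MEMORY_AND_DISK_SER instead of the default. A minimal sketch, assuming sc is an existing SparkContext and the input path is hypothetical:

import org.apache.spark.storage.StorageLevel

// Keep partitions serialized in memory and spill whatever does not fit to disk:
// smaller memory footprint than MEMORY_ONLY, at the cost of extra (de)serialization CPU.
val words = sc.textFile("hdfs:///path/to/input")   // hypothetical input path
  .flatMap(_.split("\\s+"))
  .persist(StorageLevel.MEMORY_AND_DISK_SER)

words.count()             // materializes and stores the partitions at the chosen level
words.distinct().count()  // reuses the stored blocks instead of re-reading the input

words.unpersist()

Note that a storage level can only be assigned once per RDD: calling persist with a different level on an already-persisted RDD throws an exception, so choose the level before the first action.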