Application
one driver program + a set of executors on the cluster
creating a SparkContext == starting an application
is spark-shell an application? yes: it creates a SparkContext (sc) for you
spark-shell is typically run from a gateway (client) node
application1: 1 driver + 10 executors
application2: 1 driver + 10 executors
can they share? no: executors belong to one application; data is shared across applications only via external storage
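A minimal sketch of what "one application" means in code (the object and app names here are illustrative, not from the notes):

import org.apache.spark.{SparkConf, SparkContext}

object Application1 {
  def main(args: Array[String]): Unit = {
    // one SparkContext == one application: its own driver + its own executors
    val conf = new SparkConf().setAppName("application1")
    val sc = new SparkContext(conf)
    // jobs run here; tasks execute only in this application's executors
    sc.stop() // application ends; its executors are released
  }
}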
application ==> n jobs (one per action) ==> each job: n stages (cut at shuffle boundaries) ==> each stage: n tasks
one partition == one task (a task processes exactly one partition)
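Sketch of the hierarchy in spark-shell (sc is provided by the shell; input.txt is a hypothetical path):

val rdd = sc.textFile("input.txt")
val counts = rdd.flatMap(_.split("\t")).map((_, 1)).reduceByKey(_ + _)
counts.count()   // action #1 ==> job 1, cut into 2 stages at the shuffle
counts.collect() // action #2 ==> job 2
// each stage runs one task per partition of its RDD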
textFile("")............ count
textFile("")............ count
textFile("")............ count
textFile("").cache
cache lazy === transformation
unpersist eager
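The re-read vs. cache point above as a runnable sketch (path is hypothetical):

val lines = sc.textFile("input.txt")
lines.count()     // job 1: reads from the source
lines.count()     // job 2: reads from the source again (nothing cached yet)
lines.cache()     // lazy: only marks the RDD, stores nothing
lines.count()     // job 3: reads the source once more AND fills the cache
lines.count()     // job 4: served from the cached blocks
lines.unpersist() // eager: blocks are dropped immediately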
// from Spark's RDD.scala:
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
def cache(): this.type = persist()
class StorageLevel private(
private var _useDisk: Boolean,
private var _useMemory: Boolean,
private var _useOffHeap: Boolean,
private var _deserialized: Boolean,
private var _replication: Int = 1)
val MEMORY_ONLY = new StorageLevel(false, true, false, true)
// useDisk = false, useMemory = true, useOffHeap = false, deserialized = true, replication = 1 (default)
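Usage sketch for a non-default level (MEMORY_AND_DISK is a real StorageLevel constant; the path is hypothetical):

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("input.txt")
// MEMORY_AND_DISK == new StorageLevel(true, true, false, true): spill to disk what memory cannot hold
lines.persist(StorageLevel.MEMORY_AND_DISK)
lines.count()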
Lineage
textFile ==> xx ==> yy ==> zz
map filter map .....
lineage describes how an RDD is computed from its parent RDD(s)
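The chain can be inspected with toDebugString (xx/yy/zz below are stand-ins matching the notes; path hypothetical):

val zz = sc.textFile("input.txt")
  .map(_.trim)        // xx
  .filter(_.nonEmpty) // yy
  .map(_.length)      // zz
println(zz.toDebugString) // prints the lineage back to textFile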
Dependency
narrow dependency
each partition of the parent RDD is used by at most one partition of the child RDD
=> can be pipelined within a single stage
wide dependency (shuffle dependency)
a partition of the parent RDD is used by multiple partitions of the child RDD
e.g. the *ByKey operators (reduceByKey, groupByKey, ...)
e.g. join when the inputs are not co-partitioned
shuffle (wide dependency) ==> stage boundary
lines.flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_).collect
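The same word count with the stage cut marked (assuming lines is the RDD[String] from textFile above):

val words  = lines.flatMap(_.split("\t")) // narrow ==> pipelined, stage 0
val pairs  = words.map((_, 1))            // narrow ==> still stage 0
val counts = pairs.reduceByKey(_ + _)     // wide ==> shuffle, new stage 1
counts.collect()                          // action ==> 1 job, 2 stages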