15-SparkCore02

Application

An application consists of one driver program plus a set of executors on the cluster.

One SparkContext per application: creating a SparkContext is what starts an application.

Is spark-shell an application? Yes: it creates a SparkContext on startup, acting as a gateway (client) that submits work to the cluster.

application1: 1 driver + 10 executors

application2: 1 driver + 10 executors

Do they share executors? No: each application gets its own executors, and cached data is not shared across applications.

application ==> n jobs ==> n stages ==> n tasks   (each action submits one job; a job is cut into stages at shuffle boundaries)

partition == task: within a stage, each partition is processed by exactly one task.
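A minimal sketch of this hierarchy (the object name, master setting, and input path are placeholders, not from the original notes):

import org.apache.spark.{SparkConf, SparkContext}

object JobStageTaskDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("JobStageTaskDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // 4 partitions => each stage over this RDD runs 4 tasks
    val lines = sc.textFile("input.txt", 4)  // placeholder path

    lines.count()                                 // action => job 1: no shuffle, 1 stage
    lines.map((_, 1)).reduceByKey(_ + _).count()  // action => job 2: one shuffle, 2 stages

    sc.stop()
  }
}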

textFile("")............ count

textFile("")............ count

textFile("")............ count

textFile("").cache

cache is lazy, like a transformation: it only takes effect when the first action runs.

unpersist is eager, like an action: the cached data is removed immediately.
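Putting the three points above together in one sketch (assumes an existing SparkContext named sc; the path is a placeholder):

val lines = sc.textFile("input.txt")  // placeholder path

lines.cache()      // lazy, like a transformation: nothing is read yet
lines.count()      // 1st action: reads the file and fills the cache
lines.count()      // later actions read from the cache;
lines.count()      // the source file is not scanned again
lines.unpersist()  // eager: cached blocks are dropped immediately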

def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

def cache(): this.type = persist()

class StorageLevel private(
    private var _useDisk: Boolean,
    private var _useMemory: Boolean,
    private var _useOffHeap: Boolean,
    private var _deserialized: Boolean,
    private var _replication: Int = 1)

val MEMORY_ONLY = new StorageLevel(false, true, false, true)   // useDisk=false, useMemory=true, useOffHeap=false, deserialized=true
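So cache() is just persist(MEMORY_ONLY); other levels flip the flags above. A sketch (rdd is assumed to exist; a storage level can be set only once per RDD):

import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.MEMORY_AND_DISK)     // (true, true, false, true): spill what does not fit in memory to disk
// rdd.persist(StorageLevel.MEMORY_ONLY_SER)  // (false, true, false, false): serialized, less memory, more CPU
// rdd.persist(StorageLevel.DISK_ONLY)        // (true, false, false, false)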

Lineage

textFile ==map==> xx ==filter==> yy ==map==> zz ==> .....

Lineage describes how an RDD is computed from its parent RDD(s). Spark records this chain of transformations rather than the data itself, so a lost partition can be recomputed from its parents.
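toDebugString prints this lineage; a sketch using the word-count chain from the end of these notes (placeholder path):

val result = sc.textFile("input.txt")
  .flatMap(_.split("\t"))
  .map((_, 1))
  .reduceByKey(_ + _)

println(result.toDebugString)
// The output lists the parent RDDs bottom-up; a new indentation level
// marks a shuffle boundary, i.e. a new stage.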

Dependency

Narrow dependency

Each partition of the parent RDD is used by at most one partition of the child RDD.

Narrow dependencies can be pipelined: consecutive ones run together in a single stage.

Wide dependency

A partition of the parent RDD is used by multiple partitions of the child RDD.

Typical of the *ByKey operators (reduceByKey, groupByKey, ...)

and of join when the inputs are not co-partitioned.

shuffle ==> stage: every wide dependency requires a shuffle, and the job is cut into a new stage at each shuffle.
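The dependency type is visible on the RDD itself; a small sketch (names here are illustrative):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val mapped  = pairs.map(identity)       // narrow dependency
val reduced = pairs.reduceByKey(_ + _)  // wide dependency

println(mapped.dependencies)   // List(org.apache.spark.OneToOneDependency@...)
println(reduced.dependencies)  // List(org.apache.spark.ShuffleDependency@...)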

lines.flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_).collect

collect triggers a single job; the shuffle introduced by reduceByKey splits it into two stages.
