Reading notes on Resilient Distributed Datasets

1. Formally, an RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) a dataset in stable storage or (2) other existing RDDs.

 

2. RDDs are lazily evaluated: nothing is actually computed until an action is triggered.
[Image 1]
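To make the lazy behaviour concrete, here is a minimal sketch in plain Python. It is not Spark's API: `LazyDataset`, `from_source`, and the shared `log` list are hypothetical names made up for this note. The only point is that transformations record a recipe, and work happens when an action such as `collect` is called.

```python
from typing import Callable, Iterable, List


class LazyDataset:
    """Toy stand-in for an RDD: stores how to compute records, not the records."""

    def __init__(self, compute: Callable[[], Iterable], log: List[str]):
        self._compute = compute  # deferred recipe for producing the records
        self._log = log          # shared log, so we can see when work happens

    @classmethod
    def from_source(cls, records: List, log: List[str]) -> "LazyDataset":
        # Plays the role of "a dataset in stable storage".
        def compute():
            log.append("read source")
            return iter(records)
        return cls(compute, log)

    def map(self, fn: Callable) -> "LazyDataset":
        # Transformation: builds a new dataset and runs nothing yet.
        def compute():
            parent = self._compute()
            self._log.append("map")
            return (fn(x) for x in parent)
        return LazyDataset(compute, self._log)

    def filter(self, pred: Callable) -> "LazyDataset":
        # Transformation: also deferred.
        def compute():
            parent = self._compute()
            self._log.append("filter")
            return (x for x in parent if pred(x))
        return LazyDataset(compute, self._log)

    def collect(self) -> List:
        # Action: this is the moment the whole chain is evaluated.
        return list(self._compute())


log: List[str] = []
ds = LazyDataset.from_source([1, 2, 3, 4], log).map(lambda x: x * 10).filter(lambda x: x > 15)
work_before_action = list(log)   # still empty: no action has run
result = ds.collect()            # triggers source read, map, and filter
```

After `collect()` the log holds the three steps in order, while `work_before_action` stays empty, which is the behaviour point 2 describes.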
 

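3. Dependencies between RDDs are either narrow dependencies or wide dependencies. In a narrow dependency each parent partition is used by at most one child partition; in a wide dependency several child partitions may depend on it. The figure makes the distinction easy to see.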


[Image 2]
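The narrow/wide distinction can also be written as a small check. The sketch below is plain Python, not Spark code. It assumes a dependency is described by a hypothetical dictionary mapping each child partition index to the parent partition indices it reads, and it treats the dependency as narrow when every parent partition feeds at most one child partition.

```python
from collections import Counter
from typing import Dict, List


def classify_dependency(child_to_parents: Dict[int, List[int]]) -> str:
    """Return "narrow" if every parent partition feeds at most one child partition."""
    uses: Counter = Counter()
    for parents in child_to_parents.values():
        for parent in set(parents):  # count each child once per parent
            uses[parent] += 1
    return "narrow" if all(n <= 1 for n in uses.values()) else "wide"


# map-like: child partition i reads only parent partition i
map_like = {0: [0], 1: [1], 2: [2]}

# coalesce-like: a child may read several parents, but no parent is shared
coalesce_like = {0: [0, 1], 1: [2]}

# shuffle-like (for example a group-by on unpartitioned data):
# every child partition reads every parent partition
shuffle_like = {0: [0, 1, 2], 1: [0, 1, 2]}

kinds = [classify_dependency(d) for d in (map_like, coalesce_like, shuffle_like)]
```

Note that the coalesce-like case is still narrow: what matters is how many children each parent partition feeds, not how many parents a child reads.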

 

4. Spark's scheduler turns the program logic and its RDDs into a DAG and executes it in stages.


[Image 3]
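As a rough illustration of how stages fall out of the DAG, the sketch below groups RDDs so that narrow dependencies stay inside a stage and wide dependencies become stage boundaries. It is a simplification in plain Python, not Spark's scheduler: the `Lineage` format, the function name, and the example graph with RDDs `A` to `G` are assumptions made for this note, and every RDD must appear as a key in the dictionary.

```python
from typing import Dict, List, Tuple

# Each RDD name maps to its parents as (parent_name, dependency_kind) pairs.
Lineage = Dict[str, List[Tuple[str, str]]]


def split_into_stages(lineage: Lineage) -> List[List[str]]:
    """Group RDDs so that only narrow dependencies stay inside a stage."""
    root = {name: name for name in lineage}

    def find(name: str) -> str:
        # Union-find lookup with path halving.
        while root[name] != name:
            root[name] = root[root[name]]
            name = root[name]
        return name

    # Merge parent and child whenever the dependency is narrow.
    for child, parents in lineage.items():
        for parent, kind in parents:
            if kind == "narrow":
                root[find(parent)] = find(child)

    # Collect members per group, in order of first appearance in the dict.
    stages: Dict[str, List[str]] = {}
    for name in lineage:
        stages.setdefault(find(name), []).append(name)
    return [sorted(members) for members in stages.values()]


# Hypothetical job: B is a group-by of A (wide), D is a map of C (narrow),
# F is a union of D and E (narrow), and G joins B (narrow, B is already
# partitioned) with F (wide).
lineage: Lineage = {
    "A": [],
    "B": [("A", "wide")],
    "C": [],
    "D": [("C", "narrow")],
    "E": [],
    "F": [("D", "narrow"), ("E", "narrow")],
    "G": [("B", "narrow"), ("F", "wide")],
}

stages = split_into_stages(lineage)
# stages == [["A"], ["B", "G"], ["C", "D", "E", "F"]]
```

The returned list only says which RDDs share a stage. It does not give an execution order; a real scheduler would additionally run a stage only after the stages it depends on have finished.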
 

 

 


 

 

 
