1.定义
RDD是只读的记录分区的集合,是一种基于工作集的应用抽象
创建RDD的方式有两种:
从驱动程序中的集合中并行创建
从外部数据集创建
2.底层存储原理
每个RDD的数据以Block的形式存储在多个机器上,对于每个Executor都会启动一个BlockManagerSlave,并且管理一部分Block,Driver节点上通过BlockManagerMaster保存Block的元数据,BlockManagerSlave生成Block之后会向BlockManagerMaster注册,然后通过它来管理RDD和Block的关系,如果RDD不需要存储,BlockManagerMaster向BlockManagerSlave发送删除指令删除相应的Block
BlockManager来管理RDD的物理分区,每个Block对应节点上一个数据块,存储位置可以是磁盘或内存,RDD中的Partition是一个逻辑数据块,对应相应的Block,在代码中一个RDD相当于数据的元数据结构,保存数据的分区以及逻辑结构映射关系还有RDD之前的依赖关系
在每个节点上都是通过BlockManager来管理Block,部分源码如下:
/**
* Manager running on every node (driver and executors) which provides interfaces for putting and
* retrieving blocks both locally and remotely into various stores (memory, disk, and off-heap).
* BlockManager在每个节点上运行管理Block(Driver和Executors),它提供一个接口检索本地和远程的存储变量,如memory、disk、off- heap
* Note that [[initialize()]] must be called before the BlockManager is usable.
* 使用BlockManager前必须先初始化
*/
private[spark] class BlockManager(
executorId: String,
rpcEnv: RpcEnv,
val master: BlockManagerMaster,
val serializerManager: SerializerManager,
val conf: SparkConf,
memoryManager: MemoryManager,
mapOutputTracker: MapOutputTracker,
shuffleManager: ShuffleManager,
val blockTransferService: BlockTransferService,
securityManager: SecurityManager,
numUsableCores: Int)
extends BlockDataManager with BlockEvictionHandler with Logging {
RDD.scala源码如下:
/**
* A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
* partitioned collection of elements that can be operated on in parallel. This class contains the
* basic operations available on all RDDs, such as `map`, `filter`, and `persist`. In addition,
* [[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
* pairs, such as `groupByKey` and `join`;
* [[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations available only on RDDs of
* Doubles; and
* [[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains operations available on RDDs that
* can be saved as SequenceFiles.
* All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]
* through implicit.
*
* Internally, each RDD is characterized by five main properties:
* 5个主要特性
* - A list of partitions 一个分区的列表
* - A function for computing each split 每个分片有一个计算函数
* - A list of dependencies on other RDDs 一个依赖于其他RDD的列表
* 可选的,一个key-value类型的RDD分区器
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
* 每个分区都有一个分区位置列表
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
* an HDFS file)
*
* All of the scheduling and execution in Spark is done based on these methods, allowing each RDD
* to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for
* reading data from a new storage system) by overriding these functions. Please refer to the
* Spark paper
* for more details on RDD internals.
*/
abstract class RDD[T: ClassTag](
@transient private var _sc: SparkContext, //入口
@transient private var deps: Seq[Dependency[_]]
) extends Serializable with Logging {
3.RDD五大特性
1.A list of partitions
RDD可以有多个分区,每个分区被一个Task处理,可以在创建RDD时指定分区个数,如果不指定,则为分配到的CPU核数
2.A function for computing each split
每个分区有一个compute函数,对具体的分片进行计算,分片是并行的
3.A list of dependencies on other RDDs
一个依赖于其他RDD的列表,RDD之间的依赖有两种:Narrow Dependency和Wide Dependency
Narrow Dependency指每一个父RDD的Partition最多被子RDD的一个Partition使用
Wide Dependency指多个子RDD的Partition依赖于同一个父RDD的Partition
4.Optionally, a Partitioner for key-value RDDs
每个key-value形式的RDD都有Partitioner属性,决定RDD如何分区,Partition的个数决定每个Stage的Task个数
5.Optionally, a list of preferred locations to compute each split on
每个分区都有一个优先位置列表,Spark在进行任务调度的时候,会优先将任务分配到处理数据的数据块所在的位置,符合数据本地性
RDD的源码文件中通过4个方法和一个属性对应了这5大特性:
/**
* :: DeveloperApi ::
* Implemented by subclasses to compute a given partition.
* 通过子类实现给定分区的计算
*/
@DeveloperApi
def compute(split: Partition, context: TaskContext): Iterator[T]
/**
* Implemented by subclasses to return the set of partitions in this RDD. This method will only
* be called once, so it is safe to implement a time-consuming computation in it.
* 通过子类实现,返回一个RDD的分区列表,该方法仅仅被调用一次,因此在其中执行耗时的操作是安全的
*
* The partitions in this array must satisfy the following property:
* 在这个数组中的分区必须符合以下属性
* `rdd.partitions.zipWithIndex.forall { case (partition, index) => partition.index == index }`
*/
protected def getPartitions: Array[Partition]
/**
* Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
* be called once, so it is safe to implement a time-consuming computation in it.
* 通过子类实现该RDD是如何依赖父RDDs的,该方法仅仅被调用一次,因此在其中执行耗时的操作是安全的
*/
protected def getDependencies: Seq[Dependency[_]] = deps
/**
* Optionally overridden by subclasses to specify placement preferences.
* 可选的,通过子类覆盖指定优先位置
*/
protected def getPreferredLocations(split: Partition): Seq[String] = Nil
/** Optionally overridden by subclasses to specify how they are partitioned.
* 可选的,通过子类覆盖指定分区
* */
@transient val partitioner: Option[Partitioner] = None
TaskContext是读取执行任务的环境,可以调用内部的函数访问正在运行任务的环境信息,Partitioner定义了如何在key-value类型的RDD元素中用key分区,Partition时一个RDD的分区标识符,源码如下:
/**
* An identifier for a partition in an RDD.
*/
trait Partition extends Serializable {
/**
* Get the partition's index within its parent RDD
* 获取父RDD的分区索引
*/
def index: Int
// A better default implementation of HashCode
// 最好默认实现HashCode
override def hashCode(): Int = index
override def equals(other: Any): Boolean = super.equals(other)
}