RDD:弹性分布式数据集

1.RDD解析
分布式:数据的来源
数据集:数据的类型 & 计算逻辑的封装(类似数据模型)
弹性:


  • 抽象类
abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging

  • 不可变:计算逻辑不可变
  • 可分区:提高数据处理能力
  • 并行计算:多任务同时执行

2.RDD属性

@DeveloperApi
  def compute(split: Partition, context: TaskContext): Iterator[T]

  /**
   * Implemented by subclasses to return the set of partitions in this RDD. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   *
   * The partitions in this array must satisfy the following property:
   *   `rdd.partitions.zipWithIndex.forall { case (partition, index) => partition.index == index }`
   */
  protected def getPartitions: Array[Partition]

  /**
   * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   */
  protected def getDependencies: Seq[Dependency[_]] = deps

  /**
   * Optionally overridden by subclasses to specify placement preferences.
   */
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil

  /** Optionally overridden by subclasses to specify how they are partitioned. */
  @transient val partitioner: Option[Partitioner] = None

1.一组分区(Partition)即数据集的基本组成单位: protected def getPartitions: Array[Partition]
2.一个计算每个分区的函数: def compute(split: Partition, context: TaskContext): Iterator[T]
3.RDD之间的依赖关系 : protected def getDependencies: Seq[Dependency[_]] = deps
4.一个Partitioner,即RDD的分片函数: @transient val partitioner: Option[Partitioner] = None
5.一个列表,存储存取每个Partition的优先位置:将计算发往数据位置。 protected def getPreferredLocations(split: Partition): Seq[String] = Nil

RDD表示只读的分区的数据集,对RDD进行改动,只能通过RDD的转换操作,由一个RDD得到一个新的RDD,新的RDD包含了从其他RDD衍生所必需的信息。RDDs之间存在依赖,RDD的执行是按照血缘关系延时计算的。如果血缘关系较长,可以通过持久化RDD来切断血缘关系。

你可能感兴趣的:(RDD:弹性分布式数据集)