Demystifying RDD Creation Internals (DT大数据梦工厂)

Topics:

1. The several ways to create an RDD;

2. Hands-on RDD creation;

3. RDD creation internals

========== The Several Ways to Create an RDD ==========

Why are there several ways to create an RDD?

Because Spark computes over data stored on different media.

Does Spark depend on Hadoop?

Not at all; they are only related when Spark runs on top of Hadoop and uses Hadoop as its data source.

If all you care about is the computation itself, there is no need to learn Hadoop.

Spark can run on top of Hadoop, on top of other distributed file systems, or locally.

The first RDD represents the source of a Spark application's input data. Only after the first RDD has been created can transformations be applied, chaining RDD operators to implement the algorithm.

There are basically three ways to create an RDD (in practice there are far more, well over 300 if you count every data source):

1. Create an RDD from a collection in the program;

2. Create an RDD from the local file system;

3. Create an RDD from HDFS;

Other common ways include:

4. From a database (Oracle, MySQL);

5. From NoSQL (HBase);

6. From S3;

7. From a data stream;

The seven major ways to create an RDD

This lesson demonstrates the first three.

What is the practical point of creating an RDD from a collection? Testing!

What is the main use of creating an RDD from the local file system? A collection can hold only a limited amount of data, whereas a local file lets you test against large volumes of data.

Creating an RDD from HDFS is the creation method most commonly used in production environments.
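As a quick preview, here is a minimal sketch of the three creation calls (assuming an existing SparkContext named sc, like the one created in the examples below; the file path and HDFS URI are placeholders, not real paths from this lesson):

// 1. From a collection in the program
val fromCollection = sc.parallelize(1 to 100)
// 2. From the local file system (placeholder path)
val fromLocalFile = sc.textFile("file:///tmp/README.md")
// 3. From HDFS (placeholder URI)
val fromHdfs = sc.textFile("hdfs://Master:9000/data/README.md")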

Hadoop + Spark is the most promising combination in the big data field.

~~~ 1. Creating an RDD from a Collection ~~~

package com.dt.spark.cores

import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by 威 on 2016/2/6.
  */
object RDDBaseOnCollection {
  def main(args: Array[String]) {
    /**
      * 1. Create a Scala collection and turn it into an RDD
      */
    val conf = new SparkConf()              // create the SparkConf object
    conf.setAppName("RDDBaseOnCollection")  // set the application name, shown in the monitoring UI while the program runs
    conf.setMaster("local")                 // run locally; no Spark cluster installation is required
    val sc = new SparkContext(conf)         // create the SparkContext, passing in the SparkConf instance to customize Spark's runtime parameters and configuration

    val numbers = 1 to 100
    val rdd = sc.parallelize(numbers)

    val sum = rdd.reduce(_ + _)             // 1+2=3, 3+3=6, 6+4=10, ...

    println("1+2+...+99+100=" + sum)

  }
}

Output:

16/02/06 14:18:01 INFO DAGScheduler: Job 0 finished: reduce at RDDBaseOnCollection.scala:21, took 1.192071 s

1+2+...+99+100=5050

16/02/06 14:18:01 INFO SparkContext: Invoking stop() from shutdown hook

This verifies that Spark can also serve as a single-machine data-processing program: it can run on smart devices such as phones, tablets, and TVs, as well as on PCs and servers. In that sense, Spark can run on virtually any device.

On a single machine, multithreading can be used to simulate distributed execution.
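For example, a minimal sketch of that idea (the app name and the thread count 4 are illustrative choices): setting the master to local[N] runs the driver and N task threads inside a single JVM, while local[*] uses all available cores.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("LocalThreadsDemo").setMaster("local[4]") // 4 worker threads in one JVM
val sc = new SparkContext(conf)
println(sc.defaultParallelism)  // 4 under this configuration
sc.stop()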

In local mode a failed task is simply a failure: MAX_LOCAL_TASK_FAILURES limits the number of task attempts to 1, so tasks are not retried on failure.

private def createTaskScheduler(
    sc: SparkContext,
    master: String): (SchedulerBackend, TaskScheduler) = {
  import SparkMasterRegex._

  // When running locally, don't try to re-execute tasks on failure.
  val MAX_LOCAL_TASK_FAILURES = 1

  master match {
    case "local" =>
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalBackend(sc.getConf, scheduler, 1)
      scheduler.initialize(backend)
      (backend, scheduler)

    case LOCAL_N_REGEX(threads) =>
      def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
      // local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
      val threadCount = if (threads == "*") localCpuCount else threads.toInt
      if (threadCount <= 0) {
        throw new SparkException(s"Asked to run locally with $threadCount threads")
      }
      // ... (remaining master patterns omitted)

Now run the same code on a cluster:

./spark-shell --master spark://Master:7077,Worker1:7077,Worker2:7077

scala> val numbers = 1 to 100

numbers: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)

scala> val rdd = sc.parallelize(numbers)

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:29

scala> val sum = rdd.reduce(_+_)

16/02/06 14:42:06 INFO spark.SparkContext: Starting job: reduce at <console>:31

16/02/06 14:42:06 INFO scheduler.DAGScheduler: Got job 0 (reduce at <console>:31) with 24 output partitions

16/02/06 14:42:06 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (reduce at <console>:31)

16/02/06 14:42:06 INFO scheduler.DAGScheduler: Parents of final stage: List()

16/02/06 14:42:06 INFO scheduler.DAGScheduler: Missing parents: List()

16/02/06 14:42:06 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (ParallelCollectionRDD[0] at parallelize at <console>:29), which has no missing parents

16/02/06 14:42:07 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1216.0 B, free 1216.0 B)

16/02/06 14:42:07 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 858.0 B, free 2.0 KB)

16/02/06 14:42:07 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.145.131:54442 (size: 858.0 B, free: 1247.2 MB)

16/02/06 14:42:07 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006

16/02/06 14:42:07 INFO scheduler.DAGScheduler: Submitting 24 missing tasks from ResultStage 0 (ParallelCollectionRDD[0] at parallelize at <console>:29)

16/02/06 14:42:07 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 24 tasks

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, Worker2, partition 0,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, Master, partition 1,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, Worker1, partition 2,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, Worker2, partition 3,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, Master, partition 4,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, Worker1, partition 5,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, Worker2, partition 6,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, Master, partition 7,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, Worker1, partition 8,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 9.0 in stage 0.0 (TID 9, Worker2, partition 9,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 10.0 in stage 0.0 (TID 10, Master, partition 10,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 11.0 in stage 0.0 (TID 11, Worker1, partition 11,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 12.0 in stage 0.0 (TID 12, Worker2, partition 12,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 13.0 in stage 0.0 (TID 13, Master, partition 13,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 14.0 in stage 0.0 (TID 14, Worker1, partition 14,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 15.0 in stage 0.0 (TID 15, Worker2, partition 15,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 16.0 in stage 0.0 (TID 16, Master, partition 16,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 17.0 in stage 0.0 (TID 17, Worker1, partition 17,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 18.0 in stage 0.0 (TID 18, Worker2, partition 18,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 19.0 in stage 0.0 (TID 19, Master, partition 19,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 20.0 in stage 0.0 (TID 20, Worker1, partition 20,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 21.0 in stage 0.0 (TID 21, Worker2, partition 21,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 22.0 in stage 0.0 (TID 22, Master, partition 22,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 23.0 in stage 0.0 (TID 23, Worker1, partition 23,PROCESS_LOCAL, 2135 bytes)

16/02/06 14:42:08 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on Master:57909 (size: 858.0 B, free: 511.1 MB)

16/02/06 14:42:08 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on Worker1:55725 (size: 858.0 B, free: 511.1 MB)

16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 2201 ms on Master (1/24)

16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 22.0 in stage 0.0 (TID 22) in 2325 ms on Master (2/24)

16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 16.0 in stage 0.0 (TID 16) in 2330 ms on Master (3/24)

16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 19.0 in stage 0.0 (TID 19) in 2332 ms on Master (4/24)

16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 2349 ms on Master (5/24)

16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 10.0 in stage 0.0 (TID 10) in 2349 ms on Master (6/24)

16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 2354 ms on Master (7/24)

16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 13.0 in stage 0.0 (TID 13) in 2347 ms on Master (8/24)

16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 14.0 in stage 0.0 (TID 14) in 2598 ms on Worker1 (9/24)

16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 2606 ms on Worker1 (10/24)

16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 23.0 in stage 0.0 (TID 23) in 2593 ms on Worker1 (11/24)

16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 17.0 in stage 0.0 (TID 17) in 2598 ms on Worker1 (12/24)

16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 20.0 in stage 0.0 (TID 20) in 2606 ms on Worker1 (13/24)

16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 8.0 in stage 0.0 (TID 8) in 2620 ms on Worker1 (14/24)

16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 11.0 in stage 0.0 (TID 11) in 2621 ms on Worker1 (15/24)

16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 2629 ms on Worker1 (16/24)

16/02/06 14:42:09 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on Worker2:53946 (size: 858.0 B, free: 511.1 MB)

16/02/06 14:42:10 INFO scheduler.TaskSetManager: Finished task 21.0 in stage 0.0 (TID 21) in 3189 ms on Worker2 (17/24)

16/02/06 14:42:10 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 3206 ms on Worker2 (18/24)

16/02/06 14:42:10 INFO scheduler.TaskSetManager: Finished task 18.0 in stage 0.0 (TID 18) in 3196 ms on Worker2 (19/24)

16/02/06 14:42:10 INFO scheduler.TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 3212 ms on Worker2 (20/24)

16/02/06 14:42:10 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 3265 ms on Worker2 (21/24)

16/02/06 14:42:10 INFO scheduler.TaskSetManager: Finished task 15.0 in stage 0.0 (TID 15) in 3208 ms on Worker2 (22/24)

16/02/06 14:42:10 INFO scheduler.TaskSetManager: Finished task 12.0 in stage 0.0 (TID 12) in 3211 ms on Worker2 (23/24)

16/02/06 14:42:10 INFO scheduler.TaskSetManager: Finished task 9.0 in stage 0.0 (TID 9) in 3218 ms on Worker2 (24/24)

16/02/06 14:42:10 INFO scheduler.DAGScheduler: ResultStage 0 (reduce at <console>:31) finished in 3.271 s

16/02/06 14:42:10 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool

16/02/06 14:42:10 INFO scheduler.DAGScheduler: Job 0 finished: reduce at <console>:31, took 4.025308 s

sum: Int = 5050

Here 24 cores are used, because there are 3 machines with 8 cores each. If you do not specify the number of cores, Spark uses all of them. As the screenshot below shows, Spark maximizes the use of your cores, but without proper configuration it can also consume a great deal of memory.

There is only one Stage: reduce is an action that does not produce a new RDD and involves no Shuffle, so the job consists of a single Stage with 24 tasks in total.
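To illustrate (a small sketch reusing the rdd from the shell session above): reduce returns a plain value rather than a new RDD, and the lineage contains no shuffle dependency, hence a single stage.

println(rdd.toDebugString)        // only a ParallelCollectionRDD in the lineage, no ShuffledRDD
val sum: Int = rdd.reduce(_ + _)  // action: returns an Int, not an RDD, so no new stage boundary is introduced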

You can set the degree of parallelism:

val rdd = sc.parallelize(numbers, 10)  // the second argument sets the number of partitions (slices)

val sum = rdd.reduce(_+_)

scala> val sum = rdd.reduce(_+_)

16/02/06 14:53:27 INFO spark.SparkContext: Starting job: reduce at <console>:31

16/02/06 14:53:27 INFO scheduler.DAGScheduler: Got job 1 (reduce at <console>:31) with 10 output partitions

16/02/06 14:53:27 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (reduce at <console>:31)

16/02/06 14:53:27 INFO scheduler.DAGScheduler: Parents of final stage: List()

16/02/06 14:53:27 INFO scheduler.DAGScheduler: Missing parents: List()

16/02/06 14:53:27 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (ParallelCollectionRDD[1] at parallelize at <console>:29), which has no missing parents

16/02/06 14:53:27 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 1216.0 B, free 3.2 KB)

16/02/06 14:53:27 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 858.0 B, free 4.1 KB)

16/02/06 14:53:27 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.145.131:54442 (size: 858.0 B, free: 1247.2 MB)

16/02/06 14:53:27 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006

16/02/06 14:53:27 INFO scheduler.DAGScheduler: Submitting 10 missing tasks from ResultStage 1 (ParallelCollectionRDD[1] at parallelize at <console>:29)

16/02/06 14:53:27 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 10 tasks

16/02/06 14:53:27 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 24, Master, partition 0,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:53:27 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 25, Worker1, partition 1,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:53:27 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 26, Worker2, partition 2,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:53:27 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 1.0 (TID 27, Master, partition 3,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:53:27 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 1.0 (TID 28, Worker1, partition 4,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:53:27 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 1.0 (TID 29, Worker2, partition 5,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:53:27 INFO scheduler.TaskSetManager: Starting task 6.0 in stage 1.0 (TID 30, Master, partition 6,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:53:27 INFO scheduler.TaskSetManager: Starting task 7.0 in stage 1.0 (TID 31, Worker1, partition 7,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:53:27 INFO scheduler.TaskSetManager: Starting task 8.0 in stage 1.0 (TID 32, Worker2, partition 8,PROCESS_LOCAL, 2078 bytes)

16/02/06 14:53:27 INFO scheduler.TaskSetManager: Starting task 9.0 in stage 1.0 (TID 33, Master, partition 9,PROCESS_LOCAL, 2135 bytes)

16/02/06 14:53:27 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on Worker1:55725 (size: 858.0 B, free: 511.1 MB)

16/02/06 14:53:27 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 25) in 113 ms on Worker1 (1/10)

16/02/06 14:53:27 INFO scheduler.TaskSetManager: Finished task 7.0 in stage 1.0 (TID 31) in 114 ms on Worker1 (2/10)

16/02/06 14:53:27 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 1.0 (TID 28) in 117 ms on Worker1 (3/10)

16/02/06 14:53:27 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on Master:57909 (size: 858.0 B, free: 511.1 MB)

16/02/06 14:53:27 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 1.0 (TID 27) in 185 ms on Master (4/10)

16/02/06 14:53:27 INFO scheduler.TaskSetManager: Finished task 9.0 in stage 1.0 (TID 33) in 186 ms on Master (5/10)

16/02/06 14:53:27 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 24) in 202 ms on Master (6/10)

16/02/06 14:53:27 INFO scheduler.TaskSetManager: Finished task 6.0 in stage 1.0 (TID 30) in 210 ms on Master (7/10)

16/02/06 14:53:28 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on Worker2:53946 (size: 858.0 B, free: 511.1 MB)

16/02/06 14:53:28 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 1.0 (TID 26) in 808 ms on Worker2 (8/10)

16/02/06 14:53:28 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 1.0 (TID 29) in 812 ms on Worker2 (9/10)

16/02/06 14:53:28 INFO scheduler.TaskSetManager: Finished task 8.0 in stage 1.0 (TID 32) in 814 ms on Worker2 (10/10)

16/02/06 14:53:28 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool

16/02/06 14:53:28 INFO scheduler.DAGScheduler: ResultStage 1 (reduce at <console>:31) finished in 0.815 s

16/02/06 14:53:28 INFO scheduler.DAGScheduler: Job 1 finished: reduce at <console>:31, took 0.834312 s

Now there are only 10 units of parallelism (10 tasks).

How much parallelism is actually appropriate in Spark?

As a rule of thumb, each core can carry 2 to 4 partitions; with 32 cores, somewhere between 64 and 128 partitions is reasonable. This is not determined by the data size; it depends on how much memory each task needs for its partition and on the CPU time it consumes.

By default, you can set the parallelism to 2 to 4 times the number of cores.
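A sketch of applying this rule of thumb (the figures assume an 8-core machine and the app name and chosen values are illustrative): the parallelism can be set globally via spark.default.parallelism or per RDD via the numSlices argument.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ParallelismDemo")
  .setMaster("local[8]")                   // assume 8 cores
  .set("spark.default.parallelism", "24")  // 2~4 partitions per core => 16~32; 24 chosen here
val sc = new SparkContext(conf)

val rdd1 = sc.parallelize(1 to 100)        // uses spark.default.parallelism => 24 partitions
val rdd2 = sc.parallelize(1 to 100, 32)    // an explicit numSlices overrides the default
println(rdd1.partitions.length + " " + rdd2.partitions.length)  // 24 32
sc.stop()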

Let's look at how the partitioning above works internally.

Press Ctrl+Shift+N (search for a class by name in IntelliJ IDEA) and look up ParallelCollectionRDD:

private[spark] class ParallelCollectionRDD[T: ClassTag](
    sc: SparkContext,
    @transient private val data: Seq[T],
    numSlices: Int,
    locationPrefs: Map[Int, Seq[String]])
    extends RDD[T](sc, Nil) {
  // TODO: Right now, each split sends along its full data, even if later down the RDD chain it gets
  // cached. It might be worthwhile to write the data to a file in the DFS and read it in the split
  // instead.

  // UPDATE: A parallel collection can be checkpointed to HDFS, which achieves this goal.

numSlices is the degree of parallelism, i.e. the number of slices (partitions).

locationPrefs holds location preferences: when 1 to 100 ran on the 3 machines just now, there were 24 slices, and each slice has a concrete location managed by the BlockManager.

At runtime the DAGScheduler decides where each task will run.

override def getPartitions: Array[Partition] = {
  val slices = ParallelCollectionRDD.slice(data, numSlices).toArray
  slices.indices.map(i => new ParallelCollectionPartition(id, i, slices(i))).toArray
}

slices becomes an array: the data is sliced into these partitions. It is the most basic underlying data structure.

override def getPreferredLocations(s: Partition): Seq[String] = {
  locationPrefs.getOrElse(s.index, Nil)
}
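To see the slicing in action, here is a small sketch using glom(), which gathers each partition's elements into an array so the slices can be inspected (the exact boundaries come from ParallelCollectionRDD.slice):

val rdd = sc.parallelize(1 to 100, 4)
rdd.glom().collect().foreach(slice => println(slice.mkString(",")))
// expected: four contiguous ranges, roughly 1..25, 26..50, 51..75, 76..100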

~~~ 2. Creating an RDD by Reading a Local File ~~~

package com.dt.spark.cores

import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by 威 on 2016/2/6.
  */
object RDDBaseOnLocalFile {
  def main(args: Array[String]) {
    /**
      * 1. Read a local file to create an RDD
      */
    val conf = new SparkConf()              // create the SparkConf object
    conf.setAppName("RDDBaseOnLocalFile")   // set the application name, shown in the monitoring UI while the program runs
    conf.setMaster("local")                 // run locally; no Spark cluster installation is required
    val sc = new SparkContext(conf)         // create the SparkContext, passing in the SparkConf instance to customize Spark's runtime parameters and configuration

    val rdd = sc.textFile("F:/安装文件/操作系统/spark-1.6.0-bin-hadoop2.6/README.md")

    // compute the total length of all lines
    val linesLength = rdd.map { line => line.length }

    val sum = linesLength.reduce(_ + _)

    println("The total characters of file is =" + sum)

  }
}

Output:

16/02/06 15:19:43 INFO DAGScheduler: Job 0 finished: reduce at RDDBaseOnLocalFile.scala:23, took 2.369687 s

The total characters of file is =3264

16/02/06 15:19:43 INFO SparkContext: Invoking stop() from shutdown hook

override def getPartitions: Array[Partition] = {
  val jobConf = getJobConf()
  // add the credentials here as this can be called before SparkContext initialized
  SparkHadoopUtil.get.addCredentials(jobConf)
  val inputFormat = getInputFormat(jobConf)
  val inputSplits = inputFormat.getSplits(jobConf, minPartitions)    // logical splits
  val array = new Array[Partition](inputSplits.size)
  for (i <- 0 until inputSplits.size) {
    array(i) = new HadoopPartition(id, i, inputSplits(i))
  }
  array
}
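A brief sketch of how minPartitions feeds into the split computation above (the path is the same README.md used on the cluster below; the actual partition count still depends on the file size, block size, and input format, so treat minPartitions as a hint):

val lines4 = sc.textFile("/historyserverforSpark/README.md", 4)  // minPartitions = 4
println(lines4.partitions.length)  // typically 4 for a small splittable text file, but not guaranteed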

Then run it on the cluster:

scala> val lines = sc.textFile("/historyserverforSpark/README.md", 1)

16/02/06 15:34:02 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 212.8 KB, free 212.8 KB)

16/02/06 15:34:02 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 19.7 KB, free 232.4 KB)

16/02/06 15:34:02 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.145.131:54442 (size: 19.7 KB, free: 1247.2 MB)

16/02/06 15:34:02 INFO spark.SparkContext: Created broadcast 2 from textFile at <console>:27

lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:27

scala> val linesLength = lines.map{line=>line.length}

linesLength: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[4] at map at <console>:29

scala> val sum = linesLength.reduce(_+_)

16/02/06 15:35:07 INFO mapred.FileInputFormat: Total input paths to process : 1

16/02/06 15:35:08 INFO spark.SparkContext: Starting job: reduce at <console>:31

16/02/06 15:35:08 INFO scheduler.DAGScheduler: Got job 2 (reduce at <console>:31) with 1 output partitions

16/02/06 15:35:08 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 (reduce at <console>:31)

16/02/06 15:35:08 INFO scheduler.DAGScheduler: Parents of final stage: List()

16/02/06 15:35:08 INFO scheduler.DAGScheduler: Missing parents: List()

16/02/06 15:35:08 INFO scheduler.DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[4] at map at <console>:29), which has no missing parents

16/02/06 15:35:08 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 3.3 KB, free 235.7 KB)

16/02/06 15:35:08 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1964.0 B, free 237.6 KB)

16/02/06 15:35:08 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.145.131:54442 (size: 1964.0 B, free: 1247.2 MB)

16/02/06 15:35:08 INFO spark.SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006

16/02/06 15:35:08 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (MapPartitionsRDD[4] at map at <console>:29)

16/02/06 15:35:08 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 1 tasks

16/02/06 15:35:09 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 34, Worker1, partition 0,NODE_LOCAL, 2152 bytes)

16/02/06 15:35:09 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on Worker1:55725 (size: 1964.0 B, free: 511.1 MB)

16/02/06 15:35:09 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on Worker1:55725 (size: 19.7 KB, free: 511.1 MB)

16/02/06 15:35:12 INFO scheduler.DAGScheduler: ResultStage 2 (reduce at <console>:31) finished in 3.718 s

16/02/06 15:35:12 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 34) in 3716 ms on Worker1 (1/1)

16/02/06 15:35:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool

16/02/06 15:35:12 INFO scheduler.DAGScheduler: Job 2 finished: reduce at <console>:31, took 4.366041 s

sum: Int = 3264

Middleware such as Tachyon can also extract data from MySQL or Oracle to build an RDD.

Alternatively, MySQL or Oracle data can first be imported into Hive.
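For completeness, a hedged sketch of the DB-based creation mentioned in the list above, using Spark's built-in JdbcRDD against a hypothetical MySQL table (the URL, credentials, table, and column names are illustrative assumptions; the SQL must contain two ? placeholders that bind the partition bounds, and the MySQL JDBC driver must be on the classpath):

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// hypothetical connection and table; adjust URL/credentials/SQL to your environment
val jdbcRdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:mysql://Master:3306/test", "user", "password"),
  "SELECT id, name FROM users WHERE id >= ? AND id <= ?",
  1, 1000, 3,                                               // lowerBound, upperBound, numPartitions
  (rs: ResultSet) => (rs.getInt("id"), rs.getString("name"))
)
println(jdbcRdd.count())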

Instructor Wang Jialin's (王家林) contact card:

The leading Spark expert in China

Sina Weibo: http://weibo.com/ilovepains

WeChat official account: DT_Spark

Blog: http://blog.sina.com.cn/ilovepains

Mobile: 18610086859

QQ: 1740415547

Email: [email protected]


