RDD Persistence, Broadcast Variables, and Accumulators (DT大数据梦工厂)

Contents:

1. RDD persistence in practice;

2. Spark broadcast in practice;

3. Spark accumulators in practice;

Persistence in practice covers several aspects:

1. How to save results;

2. Using cache and persist when implementing an algorithm;

3. checkpoint

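Broadcast:

Crucial when building algorithms: it reduces the amount of data sent over the network, makes memory use more efficient, and speeds up the program.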

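Accumulators:

A globally shared variable. Code running on an executor can only modify (add to) an accumulator and cannot read its value; the value can be read only on the driver.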
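
As a rough illustration of the "modify only on executors, read only on the driver" rule, here is a plain-Scala sketch that needs no Spark. `AccumulatorSketch`, `AddOnly`, and `SumAccumulator` are hypothetical names made up for these notes, and threads stand in for tasks; this is not Spark's accumulator implementation.

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical stand-in for an accumulator; not Spark's own class.
object AccumulatorSketch {
  // The view handed to tasks: they can add, but there is no way to read.
  trait AddOnly { def add(n: Long): Unit }

  final class SumAccumulator extends AddOnly {
    private val total = new AtomicLong(0L)
    def add(n: Long): Unit = { total.addAndGet(n); () }
    // Only the "driver", which holds the full object, can read the value.
    def value: Long = total.get()
  }

  // Each partition is processed by one thread that only sees the AddOnly view.
  def runTasks(partitions: Seq[Seq[Int]]): Long = {
    val acc = new SumAccumulator
    val view: AddOnly = acc
    val threads = partitions.map { part =>
      new Thread(new Runnable {
        def run(): Unit = part.foreach(x => view.add(x.toLong))
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    acc.value // read on the "driver" after all tasks have finished
  }
}
```

Calling `AccumulatorSketch.runTasks(Seq(1 to 33, 34 to 66, 67 to 100))` adds the numbers 1 to 100 from three threads and reads the total once at the end.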

========== Action============

collect, count, saveAsTextFile, foreach, countByKey

We verify these in the Spark shell, because a defining trait of an action is that it produces output.

~~~reduce~~~

scala> val numbers = sc.parallelize(1 to 100,3)

numbers: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:27

scala> numbers.reduce(_+_)

16/02/08 12:26:31 INFO spark.SparkContext: Starting job: reduce at <console>:30

16/02/08 12:26:31 INFO scheduler.DAGScheduler: Got job 1 (reduce at <console>:30) with 3 output partitions

16/02/08 12:26:31 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (reduce at <console>:30)

16/02/08 12:26:31 INFO scheduler.DAGScheduler: Parents of final stage: List()

16/02/08 12:26:31 INFO scheduler.DAGScheduler: Missing parents: List()

16/02/08 12:26:31 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (ParallelCollectionRDD[1] at parallelize at <console>:27), which has no missing parents

16/02/08 12:26:31 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 1216.0 B, free 3.2 KB)

16/02/08 12:26:31 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 858.0 B, free 4.1 KB)

16/02/08 12:26:31 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.145.132:39834 (size: 858.0 B, free: 1247.2 MB)

16/02/08 12:26:31 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006

16/02/08 12:26:31 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 1 (ParallelCollectionRDD[1] at parallelize at <console>:27)

16/02/08 12:26:31 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 3 tasks

16/02/08 12:26:31 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 24, Master, partition 0,PROCESS_LOCAL, 2078 bytes)

16/02/08 12:26:31 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 25, Worker2, partition 1,PROCESS_LOCAL, 2078 bytes)

16/02/08 12:26:31 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 26, Worker1, partition 2,PROCESS_LOCAL, 2135 bytes)

16/02/08 12:26:31 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on Worker2:36686 (size: 858.0 B, free: 511.1 MB)

16/02/08 12:26:31 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on Master:59109 (size: 858.0 B, free: 511.1 MB)

16/02/08 12:26:31 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on Worker1:51526 (size: 858.0 B, free: 511.1 MB)

16/02/08 12:26:31 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 25) in 107 ms on Worker2 (1/3)

16/02/08 12:26:31 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 24) in 135 ms on Master (2/3)

16/02/08 12:26:31 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 1.0 (TID 26) in 181 ms on Worker1 (3/3)

16/02/08 12:26:31 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool

16/02/08 12:26:31 INFO scheduler.DAGScheduler: ResultStage 1 (reduce at <console>:30) finished in 0.199 s

16/02/08 12:26:31 INFO scheduler.DAGScheduler: Job 1 finished: reduce at <console>:30, took 0.327446 s

res2: Int = 5050

reduce: the result of the previous computation is passed as the first argument of the next call.
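
A plain-Scala sketch of this behavior, assuming three slices stand in for the three partitions (the slicing below is not Spark's exact partitioning). `ReduceSketch` and `tracedSum` are hypothetical names. Since partial results from different partitions are combined afterwards, the function passed to Spark's reduce should be associative and commutative.

```scala
object ReduceSketch {
  // Three slices of 1..100, standing in for the 3 partitions in the session above.
  val partitions: List[Seq[Int]] = (1 to 100).grouped(34).toList

  // Records every call so we can see the running result arrive as the first argument.
  def tracedSum(xs: Seq[Int]): (Int, List[(Int, Int)]) = {
    var calls = List.empty[(Int, Int)]
    val total = xs.reduceLeft { (acc, x) =>
      calls = (acc, x) :: calls
      acc + x
    }
    (total, calls.reverse)
  }

  // Reduce inside each partition first, then combine the partial results.
  val partials: List[Int] = partitions.map(_.reduceLeft(_ + _))
  val total: Int = partials.reduceLeft(_ + _)
}
```

For example, `ReduceSketch.tracedSum(Seq(1, 2, 3, 4))` records the calls (1,2), (3,3), (6,4): each intermediate sum becomes the first argument of the next call.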

~~~ collect~~~

scala> val results = numbers.map(_*2)

results: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at map at <console>:29

scala> val data = results.collect

16/02/08 12:31:43 INFO spark.SparkContext: Starting job: collect at <console>:31

16/02/08 12:31:43 INFO scheduler.DAGScheduler: Got job 2 (collect at <console>:31) with 3 output partitions

16/02/08 12:31:43 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 (collect at <console>:31)

16/02/08 12:31:43 INFO scheduler.DAGScheduler: Parents of final stage: List()

16/02/08 12:31:43 INFO scheduler.DAGScheduler: Missing parents: List()

16/02/08 12:31:43 INFO scheduler.DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[2] at map at <console>:29), which has no missing parents

16/02/08 12:31:43 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 1952.0 B, free 1952.0 B)

16/02/08 12:31:43 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1212.0 B, free 3.1 KB)

16/02/08 12:31:43 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.145.132:39834 (size: 1212.0 B, free: 1247.2 MB)

16/02/08 12:31:43 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006

16/02/08 12:31:43 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 2 (MapPartitionsRDD[2] at map at <console>:29)

16/02/08 12:31:43 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 3 tasks

16/02/08 12:31:43 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 27, Master, partition 0,PROCESS_LOCAL, 2078 bytes)

16/02/08 12:31:43 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 2.0 (TID 28, Worker2, partition 1,PROCESS_LOCAL, 2078 bytes)

16/02/08 12:31:43 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 2.0 (TID 29, Worker1, partition 2,PROCESS_LOCAL, 2135 bytes)

16/02/08 12:31:44 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on Worker1:51526 (size: 1212.0 B, free: 511.1 MB)

16/02/08 12:31:44 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on Master:59109 (size: 1212.0 B, free: 511.1 MB)

16/02/08 12:31:45 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 27) in 1074 ms on Master (1/3)

16/02/08 12:32:27 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 2.0 (TID 29) in 43345 ms on Worker1 (2/3)

16/02/08 12:32:28 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on Worker2:36686 (size: 1212.0 B, free: 511.1 MB)

16/02/08 12:32:28 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 2.0 (TID 28) in 44523 ms on Worker2 (3/3)

16/02/08 12:32:28 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool

16/02/08 12:32:28 INFO scheduler.DAGScheduler: ResultStage 2 (collect at <console>:31) finished in 44.541 s

16/02/08 12:32:28 INFO scheduler.DAGScheduler: Job 2 finished: collect at <console>:31, took 44.623573 s

data: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100, 102, 104, 106, 108, 110, 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142, 144, 146, 148, 150, 152, 154, 156, 158, 160, 162, 164, 166, 168, 170, 172, 174, 176, 178, 180, 182, 184, 186, 188, 190, 192, 194, 196, 198, 200)

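With collect, the results computed on each executor are gathered to the driver. To see the results in the terminal they must first be brought back to the driver, which is what collect does. In the source code, look at runJob: every action-level operation triggers sc.runJob.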

/**
 * Return an array that contains all of the elements in this RDD.
 */
def collect(): Array[T] = withScope {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}
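
To make the shape of this method concrete, here is a plain-Scala sketch with no Spark dependency. The `runJob` below is a hypothetical stand-in that simply applies a function to each partition's iterator and returns one result per partition; the real `sc.runJob` schedules tasks on the cluster.

```scala
import scala.reflect.ClassTag

object CollectSketch {
  // Hypothetical stand-in for sc.runJob: one result per partition, in partition order.
  def runJob[U: ClassTag](partitions: Seq[Seq[Int]], func: Iterator[Int] => U): Array[U] =
    partitions.map(p => func(p.iterator)).toArray

  // Same two steps as the excerpt above: turn each partition into an array,
  // then concatenate the per-partition arrays on the "driver".
  def collect(partitions: Seq[Seq[Int]]): Array[Int] = {
    val results = runJob(partitions, (iter: Iterator[Int]) => iter.toArray)
    Array.concat(results: _*)
  }
}
```

The whole result ends up in one local array, which is why collect is only safe when the data fits in the driver's memory.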

~~~ count~~~

scala> numbers.count

16/02/08 12:38:37 INFO spark.SparkContext: Starting job: count at <console>:30

16/02/08 12:38:37 INFO scheduler.DAGScheduler: Got job 3 (count at <console>:30) with 3 output partitions

16/02/08 12:38:37 INFO scheduler.DAGScheduler: Final stage: ResultStage 3 (count at <console>:30)

16/02/08 12:38:37 INFO scheduler.DAGScheduler: Parents of final stage: List()

16/02/08 12:38:37 INFO scheduler.DAGScheduler: Missing parents: List()

16/02/08 12:38:37 INFO scheduler.DAGScheduler: Submitting ResultStage 3 (ParallelCollectionRDD[1] at parallelize at <console>:27), which has no missing parents

16/02/08 12:38:37 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 1096.0 B, free 4.2 KB)

16/02/08 12:38:37 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 804.0 B, free 4.9 KB)

16/02/08 12:38:37 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.145.132:39834 (size: 804.0 B, free: 1247.2 MB)

16/02/08 12:38:37 INFO spark.SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006

16/02/08 12:38:37 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 3 (ParallelCollectionRDD[1] at parallelize at <console>:27)

16/02/08 12:38:37 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 3 tasks

16/02/08 12:38:37 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 30, Worker1, partition 0,PROCESS_LOCAL, 2078 bytes)

16/02/08 12:38:37 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 3.0 (TID 31, Master, partition 1,PROCESS_LOCAL, 2078 bytes)

16/02/08 12:38:37 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 3.0 (TID 32, Worker2, partition 2,PROCESS_LOCAL, 2135 bytes)

16/02/08 12:38:37 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on Master:59109 (size: 804.0 B, free: 511.1 MB)

16/02/08 12:38:37 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on Worker2:36686 (size: 804.0 B, free: 511.1 MB)

16/02/08 12:38:37 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 3.0 (TID 32) in 102 ms on Worker2 (1/3)

16/02/08 12:38:37 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 3.0 (TID 31) in 148 ms on Master (2/3)

16/02/08 12:38:40 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on Worker1:51526 (size: 804.0 B, free: 511.1 MB)

16/02/08 12:38:41 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 30) in 3628 ms on Worker1 (3/3)

16/02/08 12:38:41 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool

16/02/08 12:38:41 INFO scheduler.DAGScheduler: ResultStage 3 (count at <console>:30) finished in 3.629 s

16/02/08 12:38:41 INFO scheduler.DAGScheduler: Job 3 finished: count at <console>:30, took 3.734941 s

res4: Long = 100

/**
 * Return the number of elements in the RDD.
 */
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

~~~ take~~~

scala> val topN = numbers.take(5)

16/02/08 12:39:48 INFO spark.SparkContext: Starting job: take at <console>:29

16/02/08 12:39:48 INFO scheduler.DAGScheduler: Got job 4 (take at <console>:29) with 1 output partitions

16/02/08 12:39:48 INFO scheduler.DAGScheduler: Final stage: ResultStage 4 (take at <console>:29)

16/02/08 12:39:48 INFO scheduler.DAGScheduler: Parents of final stage: List()

16/02/08 12:39:48 INFO scheduler.DAGScheduler: Missing parents: List()

16/02/08 12:39:48 INFO scheduler.DAGScheduler: Submitting ResultStage 4 (ParallelCollectionRDD[1] at parallelize at <console>:27), which has no missing parents

16/02/08 12:39:48 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 1288.0 B, free 6.2 KB)

16/02/08 12:39:48 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 871.0 B, free 7.1 KB)

16/02/08 12:39:48 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on 192.168.145.132:39834 (size: 871.0 B, free: 1247.2 MB)

16/02/08 12:39:48 INFO spark.SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1006

16/02/08 12:39:48 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (ParallelCollectionRDD[1] at parallelize at <console>:27)

16/02/08 12:39:48 INFO scheduler.TaskSchedulerImpl: Adding task set 4.0 with 1 tasks

16/02/08 12:39:48 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 33, Worker2, partition 0,PROCESS_LOCAL, 2078 bytes)

16/02/08 12:39:48 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on Worker2:36686 (size: 871.0 B, free: 511.1 MB)

16/02/08 12:39:49 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 33) in 93 ms on Worker2 (1/1)

16/02/08 12:39:49 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool

16/02/08 12:39:49 INFO scheduler.DAGScheduler: ResultStage 4 (take at <console>:29) finished in 0.092 s

16/02/08 12:39:49 INFO scheduler.DAGScheduler: Job 4 finished: take at <console>:29, took 0.132127 s

topN: Array[Int] = Array(1, 2, 3, 4, 5)
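
The log above reports "with 1 output partitions": take(5) was satisfied from a single partition instead of all three. A simplified plain-Scala sketch of that idea scans partitions one at a time until enough elements are gathered. `TakeSketch` is a hypothetical name, and Spark's actual strategy for deciding how many partitions to scan next is more elaborate than this.

```scala
import scala.collection.mutable.ListBuffer

object TakeSketch {
  // Returns the taken elements together with the number of partitions scanned.
  def take[T](partitions: Seq[Seq[T]], n: Int): (List[T], Int) = {
    val buf = ListBuffer.empty[T]
    var scanned = 0
    val it = partitions.iterator
    // Stop as soon as n elements are available; later partitions are never touched.
    while (buf.size < n && it.hasNext) {
      buf ++= it.next().take(n - buf.size)
      scanned += 1
    }
    (buf.toList, scanned)
  }
}
```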

~~~ countByKey~~~

scala> val scores = Array(Tuple2(1,100),Tuple2(2,95),Tuple2(3,70),Tuple2(1,77),Tuple2(3,78))

scores: Array[(Int, Int)] = Array((1,100), (2,95), (3,70), (1,77), (3,78))

scala> val haha = sc.parallelize(scores,3)

haha: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[3] at parallelize at <console>:29

scala> val data = haha.countByKey

16/02/08 12:47:53 INFO spark.SparkContext: Starting job: countByKey at <console>:31

16/02/08 12:47:53 INFO scheduler.DAGScheduler: Registering RDD 4 (countByKey at <console>:31)

16/02/08 12:47:53 INFO scheduler.DAGScheduler: Got job 5 (countByKey at <console>:31) with 3 output partitions

16/02/08 12:47:53 INFO scheduler.DAGScheduler: Final stage: ResultStage 6 (countByKey at <console>:31)

16/02/08 12:47:53 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 5)

16/02/08 12:47:53 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 5)

16/02/08 12:47:53 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 5 (MapPartitionsRDD[4] at countByKey at <console>:31), which has no missing parents

16/02/08 12:47:54 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 2.6 KB, free 9.7 KB)

16/02/08 12:47:54 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 1574.0 B, free 11.2 KB)

16/02/08 12:47:54 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.145.132:39834 (size: 1574.0 B, free: 1247.2 MB)

16/02/08 12:47:54 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006

16/02/08 12:47:54 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ShuffleMapStage 5 (MapPartitionsRDD[4] at countByKey at <console>:31)

16/02/08 12:47:54 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 3 tasks

16/02/08 12:47:54 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 5.0 (TID 34, Worker2, partition 0,PROCESS_LOCAL, 2183 bytes)

16/02/08 12:47:54 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 5.0 (TID 35, Master, partition 1,PROCESS_LOCAL, 2199 bytes)

16/02/08 12:47:54 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 5.0 (TID 36, Worker1, partition 2,PROCESS_LOCAL, 2199 bytes)

16/02/08 12:47:54 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on Worker2:36686 (size: 1574.0 B, free: 511.1 MB)

16/02/08 12:47:54 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on Master:59109 (size: 1574.0 B, free: 511.1 MB)

16/02/08 12:47:56 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on Worker1:51526 (size: 1574.0 B, free: 511.1 MB)

16/02/08 12:47:57 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 5.0 (TID 35) in 3330 ms on Master (1/3)

16/02/08 12:47:59 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 5.0 (TID 34) in 5726 ms on Worker2 (2/3)

16/02/08 12:47:59 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 5.0 (TID 36) in 5698 ms on Worker1 (3/3)

16/02/08 12:47:59 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool

16/02/08 12:47:59 INFO scheduler.DAGScheduler: ShuffleMapStage 5 (countByKey at <console>:31) finished in 5.759 s

16/02/08 12:47:59 INFO scheduler.DAGScheduler: looking for newly runnable stages

16/02/08 12:47:59 INFO scheduler.DAGScheduler: running: Set()

16/02/08 12:47:59 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 6)

16/02/08 12:47:59 INFO scheduler.DAGScheduler: failed: Set()

16/02/08 12:47:59 INFO scheduler.DAGScheduler: Submitting ResultStage 6 (ShuffledRDD[5] at countByKey at <console>:31), which has no missing parents

16/02/08 12:48:00 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 2.6 KB, free 13.9 KB)

16/02/08 12:48:00 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 1593.0 B, free 15.4 KB)

16/02/08 12:48:00 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on 192.168.145.132:39834 (size: 1593.0 B, free: 1247.2 MB)

16/02/08 12:48:00 INFO spark.SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1006

16/02/08 12:48:00 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 6 (ShuffledRDD[5] at countByKey at <console>:31)

16/02/08 12:48:00 INFO scheduler.TaskSchedulerImpl: Adding task set 6.0 with 3 tasks

16/02/08 12:48:00 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 6.0 (TID 37, Worker2, partition 1,NODE_LOCAL, 1894 bytes)

16/02/08 12:48:00 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 6.0 (TID 38, Worker1, partition 0,NODE_LOCAL, 1894 bytes)

16/02/08 12:48:00 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 6.0 (TID 39, Master, partition 2,NODE_LOCAL, 1894 bytes)

16/02/08 12:48:00 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on Master:59109 (size: 1593.0 B, free: 511.1 MB)

16/02/08 12:48:00 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on Worker2:36686 (size: 1593.0 B, free: 511.1 MB)

16/02/08 12:48:00 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on Worker1:51526 (size: 1593.0 B, free: 511.1 MB)

16/02/08 12:48:00 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to Master:37061

16/02/08 12:48:00 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to Worker2:41346

16/02/08 12:48:00 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 181 bytes

16/02/08 12:48:00 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 181 bytes

16/02/08 12:48:01 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 6.0 (TID 39) in 679 ms on Master (1/3)

16/02/08 12:48:01 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to Worker1:40369

16/02/08 12:48:01 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 6.0 (TID 37) in 1564 ms on Worker2 (2/3)

16/02/08 12:48:02 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 6.0 (TID 38) in 1948 ms on Worker1 (3/3)

16/02/08 12:48:02 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool

16/02/08 12:48:02 INFO scheduler.DAGScheduler: ResultStage 6 (countByKey at <console>:31) finished in 2.020 s

16/02/08 12:48:02 INFO scheduler.DAGScheduler: Job 5 finished: countByKey at <console>:31, took 8.543698 s

data: scala.collection.Map[Int,Long] = Map(3 -> 2, 1 -> 2, 2 -> 1)

/**
 * Count the number of elements for each key, collecting the results to a local Map.
 *
 * Note that this method should only be used if the resulting map is expected to be small, as
 * the whole thing is loaded into the driver's memory.
 * To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which
 * returns an RDD[T, Long] instead of a map.
 */
def countByKey(): Map[K, Long] = self.withScope {
  self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
}
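
The same pipeline can be mimicked on a local collection. This is a plain-Scala sketch, not Spark code: `CountByKeySketch` is a hypothetical name, and `groupBy` stands in for the shuffle that `reduceByKey` performs on the cluster.

```scala
object CountByKeySketch {
  // The same scores as in the shell session above.
  val scores: Seq[(Int, Int)] = Seq((1, 100), (2, 95), (3, 70), (1, 77), (3, 78))

  def countByKey[K, V](pairs: Seq[(K, V)]): Map[K, Long] =
    pairs
      .map { case (k, _) => (k, 1L) }   // like mapValues(_ => 1L)
      .groupBy(_._1)                    // local stand-in for the shuffle
      .map { case (k, ones) => (k, ones.map(_._2).reduce(_ + _)) } // like reduceByKey(_ + _)
}
```

`CountByKeySketch.countByKey(CountByKeySketch.scores)` yields the same counts as the session: keys 1 and 3 appear twice, key 2 once.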

~~~ saveAsTextFile~~~

/**
 * Save this RDD as a text file, using string representations of elements.
 */
def saveAsTextFile(path: String): Unit = withScope {
  // https://issues.apache.org/jira/browse/SPARK-2075
  //
  // NullWritable is a `Comparable` in Hadoop 1.+, so the compiler cannot find an implicit
  // Ordering for it and will use the default `null`. However, it's a `Comparable[NullWritable]`
  // in Hadoop 2.+, so the compiler will call the implicit `Ordering.ordered` method to create an
  // Ordering for `NullWritable`. That's why the compiler will generate different anonymous
  // classes for `saveAsTextFile` in Hadoop 1.+ and Hadoop 2.+.
  //
  // Therefore, here we provide an explicit Ordering `null` to make sure the compiler generate
  // same bytecodes for `saveAsTextFile`.
  val nullWritableClassTag = implicitly[ClassTag[NullWritable]]
  val textClassTag = implicitly[ClassTag[Text]]
  val r = this.mapPartitions { iter =>
    val text = new Text()
    iter.map { x =>
      text.set(x.toString)
      (NullWritable.get(), text)
    }
  }
  RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
    .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
}

========== Persist============

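Spark keeps data in memory, which suits fast iterative computation. In-memory data is also at high risk of being lost, so things can go wrong, and this is where fault tolerance comes in.

RDDs have lineage: if a later RDD fails, it is recomputed from the steps that came before it. If no earlier point was cached or persisted, the computation has to start again from the beginning.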

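When to persist:

1. When a particular step is especially time-consuming to compute;

2. When the computation chain is especially long;

3. The RDD being checkpointed must also be persisted (checkpoint is lazy: when a checkpoint is found, a new job is triggered to write it; if the checkpointed step was not persisted, it has to be computed again, so persist before checkpointing);

4. After a shuffle, persist (a shuffle moves data over the network, and any transfer carries the risk of data loss; persisting protects efficiency);

5. Before a shuffle, persist (by default the framework already persists this data to local disk for us);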

/**
 * Set this RDD's storage level to persist its values across operations after the first time
 * it is computed. This can only be used to assign a new storage level if the RDD does not
 * have a storage level set yet. Local checkpointing is an exception.
 */
def persist(newLevel: StorageLevel): this.type = {
  if (isLocallyCheckpointed) {
    // This means the user previously called localCheckpoint(), which should have already
    // marked this RDD for persisting. Here we should override the old storage level with
    // one that is explicitly requested by the user (after adapting it to use disk).
    persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
  } else {
    persist(newLevel, allowOverride = false)
  }
}

/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()

val NONE = new StorageLevel(false, false, false, false)
val DISK_ONLY = new StorageLevel(true, false, false, false)
val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
val MEMORY_ONLY = new StorageLevel(false, true, false, true)    // memory only; may OOM
val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)     // serialized
val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)  // serialized, with two replicas
val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)   // prefer memory; use disk only when memory is insufficient
val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)  // same as the previous level, with two replicas
val OFF_HEAP = new StorageLevel(false, false, true, false)  // TACHYON

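cache is a special case of persist: it is persist at the MEMORY_ONLY level.

Why serialize: it reduces the size of the data and helps prevent OOM.

The downside of serialization: the data must be deserialized every time it is used, and deserialization costs CPU.

Two replicas cost extra space or time, but if one machine fails they save recomputation time: a classic trade of space for time.

MEMORY_AND_DISK is the safest and MEMORY_ONLY is the fastest. Use memory whenever you can, and use two in-memory replicas whenever you can.

MEMORY_AND_DISK greatly reduces the likelihood of OOM.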
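
To read the constructor arguments in the list above, it helps to name them. The sketch below is plain Scala and assumes the argument order (use disk, use memory, use off-heap, deserialized, replication) that the listed constants follow. `StorageLevelSketch`, `Level`, and `describe` are hypothetical names for illustration, not Spark's `StorageLevel` class.

```scala
object StorageLevelSketch {
  // Hypothetical mirror of the five flags; replication defaults to 1.
  final case class Level(useDisk: Boolean, useMemory: Boolean, useOffHeap: Boolean,
                         deserialized: Boolean, replication: Int = 1)

  val MEMORY_ONLY = Level(false, true, false, true)
  val MEMORY_ONLY_SER = Level(false, true, false, false)
  val MEMORY_AND_DISK_2 = Level(true, true, false, true, 2)
  val DISK_ONLY = Level(true, false, false, false)

  // Spells out where the data lives, in what form, and how many copies exist.
  def describe(l: Level): String = {
    val where = List(
      if (l.useMemory) Some("memory") else None,
      if (l.useDisk) Some("disk") else None,
      if (l.useOffHeap) Some("off-heap") else None
    ).flatten.mkString("+")
    val form = if (l.deserialized) "deserialized objects" else "serialized bytes"
    s"$where, $form, ${l.replication} replica(s)"
  }
}
```

Read this way, the `_SER` levels differ from their plain counterparts only in the fourth flag, and the `_2` levels only in the replication count.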

scala> val x = sc.textFile("/historyserverforSpark/README.md", 3).flatMap(_.split(" ")).map(word=>(word,1)).reduceByKey(_+_,1)

16/02/08 13:15:11 INFO storage.MemoryStore: Block broadcast_12 stored as values in memory (estimated size 212.8 KB, free 688.3 KB)

16/02/08 13:15:11 INFO storage.MemoryStore: Block broadcast_12_piece0 stored as bytes in memory (estimated size 19.7 KB, free 708.0 KB)

16/02/08 13:15:11 INFO storage.BlockManagerInfo: Added broadcast_12_piece0 in memory on 192.168.145.132:39834 (size: 19.7 KB, free: 1247.2 MB)

16/02/08 13:15:11 INFO spark.SparkContext: Created broadcast 12 from textFile at <console>:27

x: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[23] at reduceByKey at <console>:27

scala> x.collect

16/02/08 13:15:17 INFO spark.SparkContext: Starting job: collect at <console>:30

16/02/08 13:15:17 INFO mapred.FileInputFormat: Total input paths to process : 1

16/02/08 13:15:17 INFO scheduler.DAGScheduler: Registering RDD 22 (map at <console>:27)

16/02/08 13:15:17 INFO scheduler.DAGScheduler: Got job 7 (collect at <console>:30) with 1 output partitions

16/02/08 13:15:17 INFO scheduler.DAGScheduler: Final stage: ResultStage 10 (collect at <console>:30)

16/02/08 13:15:17 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 9)

16/02/08 13:15:17 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 9)

16/02/08 13:15:17 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 9 (MapPartitionsRDD[22] at map at <console>:27), which has no missing parents

16/02/08 13:15:17 INFO storage.MemoryStore: Block broadcast_13 stored as values in memory (estimated size 4.1 KB, free 712.1 KB)

16/02/08 13:15:17 INFO storage.MemoryStore: Block broadcast_13_piece0 stored as bytes in memory (estimated size 2.3 KB, free 714.4 KB)

16/02/08 13:15:17 INFO storage.BlockManagerInfo: Added broadcast_13_piece0 in memory on 192.168.145.132:39834 (size: 2.3 KB, free: 1247.2 MB)

16/02/08 13:15:17 INFO spark.SparkContext: Created broadcast 13 from broadcast at DAGScheduler.scala:1006

16/02/08 13:15:17 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ShuffleMapStage 9 (MapPartitionsRDD[22] at map at <console>:27)

16/02/08 13:15:17 INFO scheduler.TaskSchedulerImpl: Adding task set 9.0 with 3 tasks

16/02/08 13:15:17 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 9.0 (TID 44, Worker1, partition 0,NODE_LOCAL, 2141 bytes)

16/02/08 13:15:17 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 9.0 (TID 45, Master, partition 1,NODE_LOCAL, 2141 bytes)

16/02/08 13:15:17 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 9.0 (TID 46, Worker1, partition 2,NODE_LOCAL, 2141 bytes)

16/02/08 13:15:17 INFO storage.BlockManagerInfo: Added broadcast_13_piece0 in memory on Master:59109 (size: 2.3 KB, free: 511.1 MB)

16/02/08 13:15:17 INFO storage.BlockManagerInfo: Added broadcast_13_piece0 in memory on Worker1:51526 (size: 2.3 KB, free: 511.1 MB)

16/02/08 13:15:17 INFO storage.BlockManagerInfo: Added broadcast_12_piece0 in memory on Master:59109 (size: 19.7 KB, free: 511.1 MB)

16/02/08 13:15:18 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 9.0 (TID 45) in 232 ms on Master (1/3)

16/02/08 13:15:18 INFO storage.BlockManagerInfo: Added broadcast_12_piece0 in memory on Worker1:51526 (size: 19.7 KB, free: 511.1 MB)

16/02/08 13:15:18 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 9.0 (TID 44) in 847 ms on Worker1 (2/3)

16/02/08 13:15:18 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 9.0 (TID 46) in 880 ms on Worker1 (3/3)

16/02/08 13:15:18 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 9.0, whose tasks have all completed, from pool

16/02/08 13:15:18 INFO scheduler.DAGScheduler: ShuffleMapStage 9 (map at <console>:27) finished in 0.876 s

16/02/08 13:15:18 INFO scheduler.DAGScheduler: looking for newly runnable stages

16/02/08 13:15:18 INFO scheduler.DAGScheduler: running: Set()

16/02/08 13:15:18 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 10)

16/02/08 13:15:18 INFO scheduler.DAGScheduler: failed: Set()

16/02/08 13:15:18 INFO scheduler.DAGScheduler: Submitting ResultStage 10 (ShuffledRDD[23] at reduceByKey at <console>:27), which has no missing parents

16/02/08 13:15:18 INFO storage.MemoryStore: Block broadcast_14 stored as values in memory (estimated size 2.6 KB, free 717.0 KB)

16/02/08 13:15:18 INFO storage.MemoryStore: Block broadcast_14_piece0 stored as bytes in memory (estimated size 1580.0 B, free 718.5 KB)

16/02/08 13:15:18 INFO storage.BlockManagerInfo: Added broadcast_14_piece0 in memory on 192.168.145.132:39834 (size: 1580.0 B, free: 1247.2 MB)

16/02/08 13:15:18 INFO spark.SparkContext: Created broadcast 14 from broadcast at DAGScheduler.scala:1006

16/02/08 13:15:18 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 10 (ShuffledRDD[23] at reduceByKey at <console>:27)

16/02/08 13:15:18 INFO scheduler.TaskSchedulerImpl: Adding task set 10.0 with 1 tasks

16/02/08 13:15:18 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 10.0 (TID 47, Master, partition 0,NODE_LOCAL, 1894 bytes)

16/02/08 13:15:18 INFO storage.BlockManagerInfo: Added broadcast_14_piece0 in memory on Master:59109 (size: 1580.0 B, free: 511.1 MB)

16/02/08 13:15:18 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to Master:37061

16/02/08 13:15:18 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 2 is 166 bytes

16/02/08 13:15:20 INFO scheduler.DAGScheduler: ResultStage 10 (collect at <console>:30) finished in 1.588 s

16/02/08 13:15:20 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 10.0 (TID 47) in 1588 ms on Master (1/1)

16/02/08 13:15:20 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, whose tasks have all completed, from pool

16/02/08 13:15:20 INFO scheduler.DAGScheduler: Job 7 finished: collect at <console>:30, took 2.586912 s

res11: Array[(String, Int)] = Array((package,1), (For,2), (Programs,1), (processing.,1), (Because,1), (The,1), (cluster.,1), (its,1), ([run,1), (APIs,1), (computation,1), (Try,1), (have,1), (through,1), (several,1), (This,2), (graph,1), (Hive,2), (storage,1), (["Specifying,1), (To,2), ("yarn",1), (page](http://spark.apache.org/documentation.html),1), (Once,1), (prefer,1), (SparkPi,2), (engine,1), (version,1), (file,1), (documentation,,1), (processing,,1), (the,21), (are,1), (systems.,1), (params,1), (not,1), (different,1), (refer,2), (Interactive,2), (R,,1), (given.,1), (if,4), (build,3), (when,1), (be,2), (Tests,1), (Apache,1), (./bin/run-example,2), (programs,,1), (including,3), (Spark.,1), (package.,1), (1000).count(),1), (Versions,1), (HDFS,1), (Data.,1), (>>>,1), (programming,1), (...

scala> x.cache    (Note: cache is called here!)

res12: x.type = ShuffledRDD[23] at reduceByKey at <console>:27

scala> x.collect

16/02/08 13:15:29 INFO spark.SparkContext: Starting job: collect at <console>:30

16/02/08 13:15:29 INFO scheduler.DAGScheduler: Got job 8 (collect at <console>:30) with 1 output partitions

16/02/08 13:15:29 INFO scheduler.DAGScheduler: Final stage: ResultStage 12 (collect at <console>:30)

16/02/08 13:15:29 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 11)

16/02/08 13:15:29 INFO scheduler.DAGScheduler: Missing parents: List()

16/02/08 13:15:29 INFO scheduler.DAGScheduler: Submitting ResultStage 12 (ShuffledRDD[23] at reduceByKey at <console>:27), which has no missing parents

16/02/08 13:15:29 INFO storage.MemoryStore: Block broadcast_15 stored as values in memory (estimated size 2.6 KB, free 721.1 KB)

16/02/08 13:15:29 INFO storage.MemoryStore: Block broadcast_15_piece0 stored as bytes in memory (estimated size 1580.0 B, free 722.7 KB)

16/02/08 13:15:29 INFO storage.BlockManagerInfo: Added broadcast_15_piece0 in memory on 192.168.145.132:39834 (size: 1580.0 B, free: 1247.2 MB)

16/02/08 13:15:29 INFO spark.SparkContext: Created broadcast 15 from broadcast at DAGScheduler.scala:1006

16/02/08 13:15:29 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 12 (ShuffledRDD[23] at reduceByKey at <console>:27)

16/02/08 13:15:29 INFO scheduler.TaskSchedulerImpl: Adding task set 12.0 with 1 tasks

16/02/08 13:15:29 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 12.0 (TID 48, Master, partition 0,NODE_LOCAL, 1894 bytes)

16/02/08 13:15:29 INFO storage.BlockManagerInfo: Added broadcast_15_piece0 in memory on Master:59109 (size: 1580.0 B, free: 511.1 MB)

16/02/08 13:15:29 INFO storage.BlockManagerInfo: Added rdd_23_0 in memory on Master:59109 (size: 22.7 KB, free: 511.1 MB)

16/02/08 13:15:29 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 12.0 (TID 48) in 183 ms on Master (1/1)

16/02/08 13:15:29 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 12.0, whose tasks have all completed, from pool

16/02/08 13:15:29 INFO scheduler.DAGScheduler: ResultStage 12 (collect at <console>:30) finished in 0.182 s

16/02/08 13:15:29 INFO scheduler.DAGScheduler: Job 8 finished: collect at <console>:30, took 0.292463 s

res13: Array[(String, Int)] = Array((package,1), (For,2), (Programs,1), (processing.,1), (Because,1), (The,1), (cluster.,1), (its,1), ([run,1), (APIs,1), (computation,1), (Try,1), (have,1), (through,1), (several,1), (This,2), (graph,1), (Hive,2), (storage,1), (["Specifying,1), (To,2), ("yarn",1), (page](http://spark.apache.org/documentation.html),1), (Once,1), (prefer,1), (SparkPi,2), (engine,1), (version,1), (file,1), (documentation,,1), (processing,,1), (the,21), (are,1), (systems.,1), (params,1), (not,1), (different,1), (refer,2), (Interactive,2), (R,,1), (given.,1), (if,4), (build,3), (when,1), (be,2), (Tests,1), (Apache,1), (./bin/run-example,2), (programs,,1), (including,3), (Spark.,1), (package.,1), (1000).count(),1), (Versions,1), (HDFS,1), (Data.,1), (>>>,1), (programming,1), (...

scala>

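Note: do not chain another operator directly onto cache; keep the cached RDD itself in a variable and reuse it, otherwise the computation is triggered again.

      cache is not an action.

      To clear the cache, call unpersist, which clears it immediately.

      persist is lazy; unpersist is eager.

      Each executor has a local directory (local.directory in these notes, presumably the one set through spark.local.dir) where persisted data is written when it goes to local disk.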
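The notes above on cache, persist and unpersist can be sketched as a small standalone program. This is only an illustration under assumptions: the object name PersistSketch, its run helper, the local[2] master and the sample words are invented here and do not come from the session above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Hypothetical helper: returns (storage level while persisted,
  // storage level after unpersist, word counts).
  def run(sc: SparkContext): (StorageLevel, StorageLevel, Map[String, Int]) = {
    val words = sc.parallelize(Seq("spark", "hive", "spark", "hdfs"), 2)
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

    // persist is lazy: it only marks the RDD, nothing is computed yet.
    // Keep the reference to the persisted RDD and reuse that reference.
    counts.persist(StorageLevel.MEMORY_AND_DISK)
    val levelWhilePersisted = counts.getStorageLevel

    counts.count()                       // first action computes and stores the partitions
    val result = counts.collect().toMap  // second action can reuse the stored partitions

    // unpersist is eager; blocking = true waits until the blocks are dropped.
    counts.unpersist(blocking = true)
    (levelWhilePersisted, counts.getStorageLevel, result)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PersistSketch").setMaster("local[2]")
    val sc = SparkContext.getOrCreate(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

MEMORY_AND_DISK is chosen here only so that partitions which do not fit in memory spill to the executor's local disk; plain cache() would use the memory-only level instead.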

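========== Broadcast============

Why use broadcast?

Discussion:

1. Every task gets its own copy of the data it works with; with one million tasks, that means one million copies. In functional programming variables are immutable, hence the copy: without it, someone else changing the state would break things. With small data this does not matter. With large data it leads straight to OOM!

2. Joining two tables: a join normally involves a shuffle, which broadcasting the smaller table can avoid;

3. Synchronization: whatever is broadcast is read-only and cannot be modified.

4. Transfer performance improves greatly.

A broadcast variable is globally unique. It is broadcast to the local storage of each Worker.

Summary:

A broadcast variable is a global, read-only, in-memory variable that the driver sends to all executors allocated to the current application.

The threads in an executor's thread pool share this global variable. This greatly reduces network transfer (otherwise every task would have to transfer the variable once) and saves a great deal of memory; it also implicitly improves effective CPU usage.

How it works:

There are two approaches:

Approach 1:

The driver sends the variable to the executor once per task, so an executor ends up with as many copies as it has tasks. This wastes network transfer. It also takes a lot of memory: if the variable is fairly large, OOM is very likely.

Approach 2:

The driver broadcasts the variable into the executor's memory, and all tasks share that single broadcast copy, which greatly reduces network transfer and memory consumption.

The variable lives as long as the application does and is destroyed when the application (the SparkContext) is destroyed.

The first approach generally has no advantage.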
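Point 2 of the discussion and the second approach can be illustrated with a map-side lookup against a broadcast table. The sketch below rests on assumptions: the object name BroadcastJoinSketch, the run helper, the sample (userId, cityId) records and the city table are all hypothetical and only serve to show the pattern.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastJoinSketch {
  // Hypothetical data: (userId, cityId) records looked up in a small city table.
  def run(sc: SparkContext): Array[(Int, String)] = {
    val cityNames = Map(1 -> "Beijing", 2 -> "Shanghai") // small table held by the driver
    val cityNamesBroadcast = sc.broadcast(cityNames)      // read-only copy shared by the tasks of an executor

    val visits = sc.parallelize(Seq((100, 1), (101, 2), (102, 3)), 3)

    // Map-side lookup instead of joining two RDDs: no shuffle is needed,
    // and tasks read the broadcast value rather than carrying their own copy.
    visits.map { case (userId, cityId) =>
      (userId, cityNamesBroadcast.value.getOrElse(cityId, "unknown"))
    }.collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BroadcastJoinSketch").setMaster("local[2]")
    val sc = SparkContext.getOrCreate(conf)
    try run(sc).foreach(println) finally sc.stop()
  }
}
```

This only pays off when the lookup table is small enough to fit comfortably in each executor's memory; for two large tables an ordinary join with its shuffle is still the usual choice.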

Extra: executors can share data with one another through HDFS or Tachyon.

Hands-on:

scala> val number = 10

number: Int = 10

scala> val numberBroadcast = sc.broadcast(number)

16/02/08 14:43:45 INFO storage.MemoryStore: Block broadcast_16 stored as values in memory (estimated size 40.0 B, free 232.5 KB)

16/02/08 14:43:45 INFO storage.MemoryStore: Block broadcast_16_piece0 stored as bytes in memory (estimated size 97.0 B, free 232.6 KB)

16/02/08 14:43:45 INFO storage.BlockManagerInfo: Added broadcast_16_piece0 in memory on 192.168.145.132:39834 (size: 97.0 B, free: 1247.2 MB)

16/02/08 14:43:45 INFO spark.SparkContext: Created broadcast 16 from broadcast at <console>:29

numberBroadcast: org.apache.spark.broadcast.Broadcast[Int] = Broadcast(16)

scala> val data = sc.parallelize(1 to 10000,3)

data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[24] at parallelize at <console>:27

scala> val bn = data.map(_*numberBroadcast.value)

bn: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[25] at map at <console>:33

scala> bn.collect

16/02/08 14:48:27 INFO spark.SparkContext: Starting job: collect at <console>:36

16/02/08 14:48:27 INFO scheduler.DAGScheduler: Got job 9 (collect at <console>:36) with 3 output partitions

16/02/08 14:48:27 INFO scheduler.DAGScheduler: Final stage: ResultStage 13 (collect at <console>:36)

16/02/08 14:48:27 INFO scheduler.DAGScheduler: Parents of final stage: List()

16/02/08 14:48:27 INFO scheduler.DAGScheduler: Missing parents: List()

16/02/08 14:48:27 INFO scheduler.DAGScheduler: Submitting ResultStage 13 (MapPartitionsRDD[25] at map at <console>:33), which has no missing parents

16/02/08 14:48:27 INFO storage.MemoryStore: Block broadcast_17 stored as values in memory (estimated size 6.9 KB, free 239.6 KB)

16/02/08 14:48:27 INFO storage.MemoryStore: Block broadcast_17_piece0 stored as bytes in memory (estimated size 3.0 KB, free 242.5 KB)

16/02/08 14:48:27 INFO storage.BlockManagerInfo: Added broadcast_17_piece0 in memory on 192.168.145.132:39834 (size: 3.0 KB, free: 1247.2 MB)

16/02/08 14:48:27 INFO spark.SparkContext: Created broadcast 17 from broadcast at DAGScheduler.scala:1006

16/02/08 14:48:27 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 13 (MapPartitionsRDD[25] at map at <console>:33)

16/02/08 14:48:27 INFO scheduler.TaskSchedulerImpl: Adding task set 13.0 with 3 tasks

16/02/08 14:48:28 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 13.0 (TID 49, Worker1, partition 0,PROCESS_LOCAL, 2078 bytes)

16/02/08 14:48:28 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 13.0 (TID 50, Master, partition 1,PROCESS_LOCAL, 2078 bytes)

16/02/08 14:48:28 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 13.0 (TID 51, Worker2, partition 2,PROCESS_LOCAL, 2135 bytes)

16/02/08 14:48:28 INFO storage.BlockManagerInfo: Added broadcast_17_piece0 in memory on Master:59109 (size: 3.0 KB, free: 511.1 MB)

16/02/08 14:48:28 INFO storage.BlockManagerInfo: Added broadcast_17_piece0 in memory on Worker2:36686 (size: 3.0 KB, free: 511.1 MB)

16/02/08 14:48:28 INFO storage.BlockManagerInfo: Added broadcast_16_piece0 in memory on Worker2:36686 (size: 97.0 B, free: 511.1 MB)

16/02/08 14:48:28 INFO storage.BlockManagerInfo: Added broadcast_16_piece0 in memory on Master:59109 (size: 97.0 B, free: 511.1 MB)

16/02/08 14:48:28 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 13.0 (TID 51) in 865 ms on Worker2 (1/3)

16/02/08 14:48:28 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 13.0 (TID 50) in 884 ms on Master (2/3)

16/02/08 14:48:29 INFO storage.BlockManagerInfo: Added broadcast_17_piece0 in memory on Worker1:51526 (size: 3.0 KB, free: 511.1 MB)

16/02/08 14:48:35 INFO storage.BlockManagerInfo: Added broadcast_16_piece0 in memory on Worker1:51526 (size: 97.0 B, free: 511.1 MB)

16/02/08 14:48:35 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 13.0 (TID 49) in 7726 ms on Worker1 (3/3)

16/02/08 14:48:35 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 13.0, whose tasks have all completed, from pool

16/02/08 14:48:35 INFO scheduler.DAGScheduler: ResultStage 13 (collect at <console>:36) finished in 7.726 s

16/02/08 14:48:35 INFO scheduler.DAGScheduler: Job 9 finished: collect at <console>:36, took 7.824894 s

res14: Array[Int] = Array(10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, 800, 810, 820, 830, 840, 850, 860, 870, 880, 890, 900, 910, 920, 930, 940, 950, 960, 970, 980, 990, 1000, 1010, 1020, 1030, 1040, 1050, 1060, 1070, 1080, 1090, 1100, 1110, 1120, 1130, 1140, 1150, 1160, 1170, 1180, 1190, 1200, 1210, 1220, 1230, 1240, 1250, 1260, 1270, 1280, 1290, 1300, 1310, 1320, 1330, 1340, 1350, 1360, 1370, 1380, 1390, 1400, 1410, 1420, 1430, 1440, 1450, 1460, 147...

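========== Accumulator============

Why use accumulators?

With global variables and local variables available, isn't that already enough? Why are accumulators still needed?

An accumulator is global in scope; tasks running in executors can only modify it, and only the driver can read it. This is essential when recording the state of the cluster, especially state that must be globally unique.

Straight to the code: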

scala> val sum = sc.accumulator(0)

sum: org.apache.spark.Accumulator[Int] = 0

scala> val data = sc.parallelize(1 to 100,3)

data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[26] at parallelize at <console>:27

scala> data.foreach(item=>sum+=item)

16/02/08 15:06:55 INFO spark.SparkContext: Starting job: foreach at <console>:32

16/02/08 15:06:56 INFO scheduler.DAGScheduler: Got job 0 (foreach at <console>:32) with 3 output partitions

16/02/08 15:06:56 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (foreach at <console>:32)

16/02/08 15:06:56 INFO scheduler.DAGScheduler: Parents of final stage: List()

16/02/08 15:06:56 INFO scheduler.DAGScheduler: Missing parents: List()

16/02/08 15:06:56 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (ParallelCollectionRDD[0] at parallelize at <console>:27), which has no missing parents

16/02/08 15:06:57 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 5.1 KB, free 5.1 KB)

16/02/08 15:06:57 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.3 KB, free 7.4 KB)

16/02/08 15:06:57 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.145.132:48569 (size: 2.3 KB, free: 1247.2 MB)

16/02/08 15:06:57 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006

16/02/08 15:06:57 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 0 (ParallelCollectionRDD[0] at parallelize at <console>:27)

16/02/08 15:06:57 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 3 tasks

16/02/08 15:06:57 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, Worker1, partition 0,PROCESS_LOCAL, 2078 bytes)

16/02/08 15:06:57 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, Master, partition 1,PROCESS_LOCAL, 2078 bytes)

16/02/08 15:06:57 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, Worker2, partition 2,PROCESS_LOCAL, 2135 bytes)

16/02/08 15:06:58 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on Worker2:52845 (size: 2.3 KB, free: 511.1 MB)

16/02/08 15:06:58 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on Master:40522 (size: 2.3 KB, free: 511.1 MB)

16/02/08 15:07:02 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 3729 ms on Worker2 (1/3)

16/02/08 15:07:02 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 4415 ms on Master (2/3)

16/02/08 15:07:02 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on Worker1:51014 (size: 2.3 KB, free: 511.1 MB)

16/02/08 15:07:07 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 9662 ms on Worker1 (3/3)

16/02/08 15:07:07 INFO scheduler.DAGScheduler: ResultStage 0 (foreach at <console>:32) finished in 9.676 s

16/02/08 15:07:07 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool

16/02/08 15:07:07 INFO scheduler.DAGScheduler: Job 0 finished: foreach at <console>:32, took 11.582405 s

scala> print(sum)

5050

scala> data.foreach(item=>sum+=item)

16/02/08 15:09:22 INFO spark.SparkContext: Starting job: foreach at <console>:32

16/02/08 15:09:22 INFO scheduler.DAGScheduler: Got job 1 (foreach at <console>:32) with 3 output partitions

16/02/08 15:09:22 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (foreach at <console>:32)

16/02/08 15:09:22 INFO scheduler.DAGScheduler: Parents of final stage: List()

16/02/08 15:09:22 INFO scheduler.DAGScheduler: Missing parents: List()

16/02/08 15:09:22 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (ParallelCollectionRDD[0] at parallelize at <console>:27), which has no missing parents

16/02/08 15:09:22 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 5.1 KB, free 12.5 KB)

16/02/08 15:09:22 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 14.8 KB)

16/02/08 15:09:22 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.145.132:48569 (size: 2.3 KB, free: 1247.2 MB)

16/02/08 15:09:22 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006

16/02/08 15:09:22 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 1 (ParallelCollectionRDD[0] at parallelize at <console>:27)

16/02/08 15:09:22 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 3 tasks

16/02/08 15:09:22 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 3, Worker1, partition 0,PROCESS_LOCAL, 2078 bytes)

16/02/08 15:09:22 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 4, Master, partition 1,PROCESS_LOCAL, 2078 bytes)

16/02/08 15:09:22 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 5, Worker2, partition 2,PROCESS_LOCAL, 2135 bytes)

16/02/08 15:09:22 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on Master:40522 (size: 2.3 KB, free: 511.1 MB)

16/02/08 15:09:22 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on Worker1:51014 (size: 2.3 KB, free: 511.1 MB)

16/02/08 15:09:22 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on Worker2:52845 (size: 2.3 KB, free: 511.1 MB)

16/02/08 15:09:22 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 4) in 199 ms on Master (1/3)

16/02/08 15:09:22 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 1.0 (TID 5) in 254 ms on Worker2 (2/3)

16/02/08 15:09:22 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 3) in 258 ms on Worker1 (3/3)

16/02/08 15:09:22 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool

16/02/08 15:09:22 INFO scheduler.DAGScheduler: ResultStage 1 (foreach at <console>:32) finished in 0.264 s

16/02/08 15:09:22 INFO scheduler.DAGScheduler: Job 1 finished: foreach at <console>:32, took 0.350268 s

scala> print(sum)

10100

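Because the accumulator is globally unique, it keeps accumulating across jobs: each run only adds to it and never decreases it. Updates made in executors do not overwrite one another, because the accumulator is protected by a lock.

At run time the tasks execute concurrently anyway. Each computation marks its operation, and every read returns the latest state.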
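The same accumulate-twice experiment can be written as a standalone program. Note the assumption: this sketch uses sc.longAccumulator, which is the accumulator API of Spark 2.0 and later, whereas the session above uses the older sc.accumulator(0). The object name AccumulatorSketch and the run helper are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorSketch {
  // Returns the accumulator value as read on the driver after one pass and after two passes.
  def run(sc: SparkContext): (Long, Long) = {
    val total = sc.longAccumulator("total") // assumes Spark 2.0+; the session above uses sc.accumulator(0)
    val data = sc.parallelize(1 to 100, 3)

    data.foreach(item => total.add(item))   // tasks only add to the accumulator
    val afterFirstPass: Long = total.value  // the driver reads the merged value

    data.foreach(item => total.add(item))   // a second job keeps adding to the same accumulator
    val afterSecondPass: Long = total.value

    (afterFirstPass, afterSecondPass)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("AccumulatorSketch").setMaster("local[2]")
    val sc = SparkContext.getOrCreate(conf)
    try println(run(sc)) finally sc.stop() // prints (5050,10100)
  }
}
```

The updates happen inside foreach, which is an action, so each successful task contributes its additions once; that is why the two reads match the 5050 and 10100 seen in the session.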

/**
 * Create an [[org.apache.spark.Accumulator]] variable of a given type, which tasks can "add"
 * values to using the `+=` method. Only the driver can access the accumulator's `value`.
 */
def accumulator[T](initialValue: T)(implicit param: AccumulatorParam[T]): Accumulator[T] =
{
  val acc = new Accumulator(initialValue, param)
  cleaner.foreach(_.registerAccumulatorForCleanup(acc))
  acc
}

Homework:

1. Clear the cache manually, removing the persisted content;

2. Read the accumulator source code yourself and work out how it operates;

Instructor Wang Jialin's contact card:

China's foremost Spark expert

Sina Weibo: http://weibo.com/ilovepains

WeChat official account: DT_Spark

Blog: http://blog.sina.com.cn/ilovepains

Mobile: 18610086859

QQ: 1740415547

Email: [email protected]


This article comes from the "一枝花傲寒" blog; please do not repost it.
