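Contents:
1. RDD persistence in practice;
2. Spark broadcast variables in practice;
3. Spark accumulators in practice;
Persistence in practice covers:
1. how to save results;
2. using cache and persist when implementing algorithms;
3. checkpoint
Broadcast:
Essential when building algorithms: it reduces the volume of data sent over the network, uses memory more efficiently, and speeds up the program
Accumulator:
A global shared variable: code running in an executor can only update the accumulator and cannot read it; its value can be read only in the driver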
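The accumulator rule above (executors write, the driver reads) can be sketched as follows. This is a minimal sketch meant to be pasted into spark-shell. It assumes Spark 2.0 or later, where `sc.longAccumulator` is available (on 1.x the older `sc.accumulator` call plays the same role). The accumulator name, the app name and the `local[2]` master are illustrative assumptions, not part of these notes.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if one exists; otherwise start a local one
// (app name and master are assumptions made for this sketch).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("accumulator-sketch").setMaster("local[2]"))

// Driver side: create a named accumulator that starts at 0 (Spark 2.0+ API).
val evenCount = sc.longAccumulator("evenCount")

// Executor side: tasks only add to the accumulator, they never read it.
// foreach is an action, so the updates are applied when this job runs.
sc.parallelize(1 to 100, 3).foreach { n =>
  if (n % 2 == 0) evenCount.add(1L)
}

// Driver side: the merged value is read here, after the job has finished.
val evens: Long = evenCount.sum
println(s"even numbers seen: $evens") // 50 even numbers in 1 to 100
```

Updating the accumulator inside an action rather than inside a transformation such as map keeps the count reliable, because a transformation may be recomputed and would then add its updates again.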
========== Action============
collect, count, saveAsTextFile, foreach, countByKey
We verify these in the spark-shell, because actions have one defining trait: they produce output
~~~reduce~~~
scala> val numbers = sc.parallelize(1 to 100,3)
numbers: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:27
scala> numbers.reduce(_+_)
16/02/08 12:26:31 INFO spark.SparkContext: Starting job: reduce at <console>:30
16/02/08 12:26:31 INFO scheduler.DAGScheduler: Got job 1 (reduce at <console>:30) with 3 output partitions
16/02/08 12:26:31 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (reduce at <console>:30)
16/02/08 12:26:31 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/02/08 12:26:31 INFO scheduler.DAGScheduler: Missing parents: List()
16/02/08 12:26:31 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (ParallelCollectionRDD[1] at parallelize at <console>:27), which has no missing parents
16/02/08 12:26:31 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 1216.0 B, free 3.2 KB)
16/02/08 12:26:31 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 858.0 B, free 4.1 KB)
16/02/08 12:26:31 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.145.132:39834 (size: 858.0 B, free: 1247.2 MB)
16/02/08 12:26:31 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/02/08 12:26:31 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 1 (ParallelCollectionRDD[1] at parallelize at <console>:27)
16/02/08 12:26:31 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 3 tasks
16/02/08 12:26:31 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 24, Master, partition 0,PROCESS_LOCAL, 2078 bytes)
16/02/08 12:26:31 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 25, Worker2, partition 1,PROCESS_LOCAL, 2078 bytes)
16/02/08 12:26:31 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 26, Worker1, partition 2,PROCESS_LOCAL, 2135 bytes)
16/02/08 12:26:31 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on Worker2:36686 (size: 858.0 B, free: 511.1 MB)
16/02/08 12:26:31 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on Master:59109 (size: 858.0 B, free: 511.1 MB)
16/02/08 12:26:31 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on Worker1:51526 (size: 858.0 B, free: 511.1 MB)
16/02/08 12:26:31 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 25) in 107 ms on Worker2 (1/3)
16/02/08 12:26:31 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 24) in 135 ms on Master (2/3)
16/02/08 12:26:31 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 1.0 (TID 26) in 181 ms on Worker1 (3/3)
16/02/08 12:26:31 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/02/08 12:26:31 INFO scheduler.DAGScheduler: ResultStage 1 (reduce at <console>:30) finished in 0.199 s
16/02/08 12:26:31 INFO scheduler.DAGScheduler: Job 1 finished: reduce at <console>:30, took 0.327446 s
res2: Int = 5050
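reduce: the result of the previous call is passed as the first argument of the next call. Partitions are reduced in parallel and their partial results are then merged, so the function must be associative and commutative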
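As a small illustration of reduce with two functions that are both associative and commutative, here is a sketch for spark-shell. The app name and the `local[2]` master are assumptions that only matter when no context exists yet.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context, or start a local one (assumed settings).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("reduce-sketch").setMaster("local[2]"))

val numbers = sc.parallelize(1 to 100, 3)

// Each partition is folded locally, then the three partial results are merged.
val total = numbers.reduce(_ + _)                       // 5050
val largest = numbers.reduce((a, b) => math.max(a, b))  // 100

println(s"total=$total largest=$largest")
```

A function such as subtraction would not be a safe choice here, because the result would depend on how the elements are split across partitions.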
~~~ collect~~~
scala> val results = numbers.map(_*2)
results: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at map at <console>:29
scala> val data = results.collect
16/02/08 12:31:43 INFO spark.SparkContext: Starting job: collect at <console>:31
16/02/08 12:31:43 INFO scheduler.DAGScheduler: Got job 2 (collect at <console>:31) with 3 output partitions
16/02/08 12:31:43 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 (collect at <console>:31)
16/02/08 12:31:43 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/02/08 12:31:43 INFO scheduler.DAGScheduler: Missing parents: List()
16/02/08 12:31:43 INFO scheduler.DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[2] at map at <console>:29), which has no missing parents
16/02/08 12:31:43 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 1952.0 B, free 1952.0 B)
16/02/08 12:31:43 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1212.0 B, free 3.1 KB)
16/02/08 12:31:43 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.145.132:39834 (size: 1212.0 B, free: 1247.2 MB)
16/02/08 12:31:43 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
16/02/08 12:31:43 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 2 (MapPartitionsRDD[2] at map at <console>:29)
16/02/08 12:31:43 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 3 tasks
16/02/08 12:31:43 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 27, Master, partition 0,PROCESS_LOCAL, 2078 bytes)
16/02/08 12:31:43 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 2.0 (TID 28, Worker2, partition 1,PROCESS_LOCAL, 2078 bytes)
16/02/08 12:31:43 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 2.0 (TID 29, Worker1, partition 2,PROCESS_LOCAL, 2135 bytes)
16/02/08 12:31:44 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on Worker1:51526 (size: 1212.0 B, free: 511.1 MB)
16/02/08 12:31:44 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on Master:59109 (size: 1212.0 B, free: 511.1 MB)
16/02/08 12:31:45 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 27) in 1074 ms on Master (1/3)
16/02/08 12:32:27 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 2.0 (TID 29) in 43345 ms on Worker1 (2/3)
16/02/08 12:32:28 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on Worker2:36686 (size: 1212.0 B, free: 511.1 MB)
16/02/08 12:32:28 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 2.0 (TID 28) in 44523 ms on Worker2 (3/3)
16/02/08 12:32:28 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
16/02/08 12:32:28 INFO scheduler.DAGScheduler: ResultStage 2 (collect at <console>:31) finished in 44.541 s
16/02/08 12:32:28 INFO scheduler.DAGScheduler: Job 2 finished: collect at <console>:31, took 44.623573 s
data: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100, 102, 104, 106, 108, 110, 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142, 144, 146, 148, 150, 152, 154, 156, 158, 160, 162, 164, 166, 168, 170, 172, 174, 176, 178, 180, 182, 184, 186, 188, 190, 192, 194, 196, 198, 200)
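With collect, the results computed by each executor are gathered on the driver; to see an RDD's full contents in the terminal you have to collect it. In the source code below, note runJob: every action-level operation triggers sc.runJob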
/**
* Return an array that contains all of the elements in this RDD.
*/
def collect(): Array[T] = withScope {
val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
Array.concat(results: _*)
}
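To make the source above concrete, the following sketch calls the public `SparkContext.runJob` overload directly and rebuilds what collect returns. It is an illustration only, assuming spark-shell (or a local context with the assumed settings shown); the variable names are made up for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context, or start a local one (assumed settings).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("collect-sketch").setMaster("local[2]"))

val doubled = sc.parallelize(1 to 10, 3).map(_ * 2)

// One array per partition, returned to the driver in partition order.
val perPartition: Array[Array[Int]] =
  sc.runJob(doubled, (iter: Iterator[Int]) => iter.toArray)

// Concatenating the per-partition arrays is the same step collect performs.
val viaRunJob: Array[Int] = Array.concat(perPartition: _*)
val viaCollect: Array[Int] = doubled.collect()

println(viaRunJob.mkString(", "))
```

Because everything ends up in one driver-side array, this is only appropriate when the result is small enough for the driver's memory.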
~~~ count~~~
scala> numbers.count
16/02/08 12:38:37 INFO spark.SparkContext: Starting job: count at <console>:30
16/02/08 12:38:37 INFO scheduler.DAGScheduler: Got job 3 (count at <console>:30) with 3 output partitions
16/02/08 12:38:37 INFO scheduler.DAGScheduler: Final stage: ResultStage 3 (count at <console>:30)
16/02/08 12:38:37 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/02/08 12:38:37 INFO scheduler.DAGScheduler: Missing parents: List()
16/02/08 12:38:37 INFO scheduler.DAGScheduler: Submitting ResultStage 3 (ParallelCollectionRDD[1] at parallelize at <console>:27), which has no missing parents
16/02/08 12:38:37 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 1096.0 B, free 4.2 KB)
16/02/08 12:38:37 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 804.0 B, free 4.9 KB)
16/02/08 12:38:37 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.145.132:39834 (size: 804.0 B, free: 1247.2 MB)
16/02/08 12:38:37 INFO spark.SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006
16/02/08 12:38:37 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 3 (ParallelCollectionRDD[1] at parallelize at <console>:27)
16/02/08 12:38:37 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 3 tasks
16/02/08 12:38:37 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 30, Worker1, partition 0,PROCESS_LOCAL, 2078 bytes)
16/02/08 12:38:37 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 3.0 (TID 31, Master, partition 1,PROCESS_LOCAL, 2078 bytes)
16/02/08 12:38:37 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 3.0 (TID 32, Worker2, partition 2,PROCESS_LOCAL, 2135 bytes)
16/02/08 12:38:37 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on Master:59109 (size: 804.0 B, free: 511.1 MB)
16/02/08 12:38:37 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on Worker2:36686 (size: 804.0 B, free: 511.1 MB)
16/02/08 12:38:37 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 3.0 (TID 32) in 102 ms on Worker2 (1/3)
16/02/08 12:38:37 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 3.0 (TID 31) in 148 ms on Master (2/3)
16/02/08 12:38:40 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on Worker1:51526 (size: 804.0 B, free: 511.1 MB)
16/02/08 12:38:41 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 30) in 3628 ms on Worker1 (3/3)
16/02/08 12:38:41 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
16/02/08 12:38:41 INFO scheduler.DAGScheduler: ResultStage 3 (count at <console>:30) finished in 3.629 s
16/02/08 12:38:41 INFO scheduler.DAGScheduler: Job 3 finished: count at <console>:30, took 3.734941 s
res4: Long = 100
/**
* Return the number of elements in the RDD.
*/
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
~~~ take~~~
scala> val topN = numbers.take(5)
16/02/08 12:39:48 INFO spark.SparkContext: Starting job: take at <console>:29
16/02/08 12:39:48 INFO scheduler.DAGScheduler: Got job 4 (take at <console>:29) with 1 output partitions
16/02/08 12:39:48 INFO scheduler.DAGScheduler: Final stage: ResultStage 4 (take at <console>:29)
16/02/08 12:39:48 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/02/08 12:39:48 INFO scheduler.DAGScheduler: Missing parents: List()
16/02/08 12:39:48 INFO scheduler.DAGScheduler: Submitting ResultStage 4 (ParallelCollectionRDD[1] at parallelize at <console>:27), which has no missing parents
16/02/08 12:39:48 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 1288.0 B, free 6.2 KB)
16/02/08 12:39:48 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 871.0 B, free 7.1 KB)
16/02/08 12:39:48 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on 192.168.145.132:39834 (size: 871.0 B, free: 1247.2 MB)
16/02/08 12:39:48 INFO spark.SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1006
16/02/08 12:39:48 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (ParallelCollectionRDD[1] at parallelize at <console>:27)
16/02/08 12:39:48 INFO scheduler.TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
16/02/08 12:39:48 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 33, Worker2, partition 0,PROCESS_LOCAL, 2078 bytes)
16/02/08 12:39:48 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on Worker2:36686 (size: 871.0 B, free: 511.1 MB)
16/02/08 12:39:49 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 33) in 93 ms on Worker2 (1/1)
16/02/08 12:39:49 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
16/02/08 12:39:49 INFO scheduler.DAGScheduler: ResultStage 4 (take at <console>:29) finished in 0.092 s
16/02/08 12:39:49 INFO scheduler.DAGScheduler: Job 4 finished: take at <console>:29, took 0.132127 s
topN: Array[Int] = Array(1, 2, 3, 4, 5)
~~~ countByKey~~~
scala> val scores = Array(Tuple2(1,100),Tuple2(2,95),Tuple2(3,70),Tuple2(1,77),Tuple2(3,78))
scores: Array[(Int, Int)] = Array((1,100), (2,95), (3,70), (1,77), (3,78))
scala> val haha = sc.parallelize(scores,3)
haha: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[3] at parallelize at <console>:29
scala> val data = haha.countByKey
16/02/08 12:47:53 INFO spark.SparkContext: Starting job: countByKey at <console>:31
16/02/08 12:47:53 INFO scheduler.DAGScheduler: Registering RDD 4 (countByKey at <console>:31)
16/02/08 12:47:53 INFO scheduler.DAGScheduler: Got job 5 (countByKey at <console>:31) with 3 output partitions
16/02/08 12:47:53 INFO scheduler.DAGScheduler: Final stage: ResultStage 6 (countByKey at <console>:31)
16/02/08 12:47:53 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 5)
16/02/08 12:47:53 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 5)
16/02/08 12:47:53 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 5 (MapPartitionsRDD[4] at countByKey at <console>:31), which has no missing parents
16/02/08 12:47:54 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 2.6 KB, free 9.7 KB)
16/02/08 12:47:54 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 1574.0 B, free 11.2 KB)
16/02/08 12:47:54 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.145.132:39834 (size: 1574.0 B, free: 1247.2 MB)
16/02/08 12:47:54 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006
16/02/08 12:47:54 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ShuffleMapStage 5 (MapPartitionsRDD[4] at countByKey at <console>:31)
16/02/08 12:47:54 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 3 tasks
16/02/08 12:47:54 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 5.0 (TID 34, Worker2, partition 0,PROCESS_LOCAL, 2183 bytes)
16/02/08 12:47:54 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 5.0 (TID 35, Master, partition 1,PROCESS_LOCAL, 2199 bytes)
16/02/08 12:47:54 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 5.0 (TID 36, Worker1, partition 2,PROCESS_LOCAL, 2199 bytes)
16/02/08 12:47:54 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on Worker2:36686 (size: 1574.0 B, free: 511.1 MB)
16/02/08 12:47:54 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on Master:59109 (size: 1574.0 B, free: 511.1 MB)
16/02/08 12:47:56 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on Worker1:51526 (size: 1574.0 B, free: 511.1 MB)
16/02/08 12:47:57 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 5.0 (TID 35) in 3330 ms on Master (1/3)
16/02/08 12:47:59 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 5.0 (TID 34) in 5726 ms on Worker2 (2/3)
16/02/08 12:47:59 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 5.0 (TID 36) in 5698 ms on Worker1 (3/3)
16/02/08 12:47:59 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
16/02/08 12:47:59 INFO scheduler.DAGScheduler: ShuffleMapStage 5 (countByKey at <console>:31) finished in 5.759 s
16/02/08 12:47:59 INFO scheduler.DAGScheduler: looking for newly runnable stages
16/02/08 12:47:59 INFO scheduler.DAGScheduler: running: Set()
16/02/08 12:47:59 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 6)
16/02/08 12:47:59 INFO scheduler.DAGScheduler: failed: Set()
16/02/08 12:47:59 INFO scheduler.DAGScheduler: Submitting ResultStage 6 (ShuffledRDD[5] at countByKey at <console>:31), which has no missing parents
16/02/08 12:48:00 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 2.6 KB, free 13.9 KB)
16/02/08 12:48:00 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 1593.0 B, free 15.4 KB)
16/02/08 12:48:00 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on 192.168.145.132:39834 (size: 1593.0 B, free: 1247.2 MB)
16/02/08 12:48:00 INFO spark.SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1006
16/02/08 12:48:00 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 6 (ShuffledRDD[5] at countByKey at <console>:31)
16/02/08 12:48:00 INFO scheduler.TaskSchedulerImpl: Adding task set 6.0 with 3 tasks
16/02/08 12:48:00 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 6.0 (TID 37, Worker2, partition 1,NODE_LOCAL, 1894 bytes)
16/02/08 12:48:00 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 6.0 (TID 38, Worker1, partition 0,NODE_LOCAL, 1894 bytes)
16/02/08 12:48:00 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 6.0 (TID 39, Master, partition 2,NODE_LOCAL, 1894 bytes)
16/02/08 12:48:00 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on Master:59109 (size: 1593.0 B, free: 511.1 MB)
16/02/08 12:48:00 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on Worker2:36686 (size: 1593.0 B, free: 511.1 MB)
16/02/08 12:48:00 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on Worker1:51526 (size: 1593.0 B, free: 511.1 MB)
16/02/08 12:48:00 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to Master:37061
16/02/08 12:48:00 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to Worker2:41346
16/02/08 12:48:00 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 181 bytes
16/02/08 12:48:00 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 181 bytes
16/02/08 12:48:01 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 6.0 (TID 39) in 679 ms on Master (1/3)
16/02/08 12:48:01 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to Worker1:40369
16/02/08 12:48:01 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 6.0 (TID 37) in 1564 ms on Worker2 (2/3)
16/02/08 12:48:02 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 6.0 (TID 38) in 1948 ms on Worker1 (3/3)
16/02/08 12:48:02 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool
16/02/08 12:48:02 INFO scheduler.DAGScheduler: ResultStage 6 (countByKey at <console>:31) finished in 2.020 s
16/02/08 12:48:02 INFO scheduler.DAGScheduler: Job 5 finished: countByKey at <console>:31, took 8.543698 s
data: scala.collection.Map[Int,Long] = Map(3 -> 2, 1 -> 2, 2 -> 1)
/**
* Count the number of elements for each key, collecting the results to a local Map.
*
* Note that this method should only be used if the resulting map is expected to be small, as
* the whole thing is loaded into the driver's memory.
* To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which
* returns an RDD[T, Long] instead of a map.
*/
def countByKey(): Map[K, Long] = self.withScope {
self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
}
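The doc comment above suggests `mapValues(_ => 1L).reduceByKey(_ + _)` for large results. The sketch below puts both variants side by side on the same scores data; it assumes spark-shell (or a local context with the assumed settings shown), and the variable names are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context, or start a local one (assumed settings).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("countByKey-sketch").setMaster("local[2]"))

val scores = sc.parallelize(Seq((1, 100), (2, 95), (3, 70), (1, 77), (3, 78)), 3)

// Action: the whole result map is loaded into the driver's memory.
val localCounts: scala.collection.Map[Int, Long] = scores.countByKey()

// Transformation: the counts stay distributed as an RDD of (key, count) pairs.
val distributedCounts = scores.mapValues(_ => 1L).reduceByKey(_ + _)

// Pull back only as much as is needed, for example a couple of entries.
val sample: Array[(Int, Long)] = distributedCounts.take(2)
println(localCounts)
```

The second form pays off when the number of distinct keys is large, since nothing forces the full map onto the driver.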
~~~ saveAsTextFile~~~
/**
* Save this RDD as a text file, using string representations of elements.
*/
def saveAsTextFile(path: String): Unit = withScope {
// https://issues.apache.org/jira/browse/SPARK-2075
//
// NullWritable is a `Comparable` in Hadoop 1.+, so the compiler cannot find an implicit
// Ordering for it and will use the default `null`. However, it's a `Comparable[NullWritable]`
// in Hadoop 2.+, so the compiler will call the implicit `Ordering.ordered` method to create an
// Ordering for `NullWritable`. That's why the compiler will generate different anonymous
// classes for `saveAsTextFile` in Hadoop 1.+ and Hadoop 2.+.
//
// Therefore, here we provide an explicit Ordering `null` to make sure the compiler generate
// same bytecodes for `saveAsTextFile`.
val nullWritableClassTag = implicitly[ClassTag[NullWritable]]
val textClassTag = implicitly[ClassTag[Text]]
val r = this.mapPartitions { iter =>
val text = new Text()
iter.map { x =>
text.set(x.toString)
(NullWritable.get(), text)
}
}
RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
.saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
}
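A possible way to try saveAsTextFile is sketched below. The output location is a hypothetical temporary path created for this example; a path without a scheme is resolved against the configured default file system, so on a cluster it may end up on HDFS rather than on local disk. The target directory must not exist yet, which is why the sketch writes to a fresh subdirectory.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context, or start a local one (assumed settings).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("save-sketch").setMaster("local[2]"))

// Hypothetical output location: a not-yet-existing subdirectory of a temp dir.
val outDir = Files.createTempDirectory("spark-save-sketch").resolve("numbers").toString

// Writes one part file per partition, one element per line via toString.
sc.parallelize(1 to 100, 3).saveAsTextFile(outDir)

// Read the files back to check the round trip; line order across part files
// is not relied on, so the values are sorted before comparing.
val restored: Array[Int] = sc.textFile(outDir).map(_.toInt).collect().sorted
println(s"wrote and re-read ${restored.length} lines under $outDir")
```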
========== Persist============
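Spark keeps data in memory, which suits fast iteration. The risk is also high: in-memory data is easily lost, so fault tolerance comes into play.
RDDs have lineage: if a later RDD is lost, it is recomputed from the preceding steps. If no earlier point was cached or persisted, the computation starts again from the beginning.
When to persist:
1. a step is especially time-consuming to compute;
2. the computation chain is especially long;
3. the RDD being checkpointed must also be persisted (checkpoint is lazy; when an action finds the checkpoint mark it triggers a new job, and if the checkpointed step was not persisted it is computed again, so persist before checkpoint);
4. after a shuffle (shuffle data crosses the network, where it risks being lost; persisting avoids repeating the shuffle);
5. before a shuffle (here the framework already persists the data to local disk by default);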
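Point 3 in the list above (persist before checkpoint) can be sketched as follows. The checkpoint directory is a hypothetical temporary directory that is only suitable for local experiments; on a cluster it would normally be an HDFS path. The app name, master and sample data are likewise assumptions.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context, or start a local one (assumed settings).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("checkpoint-sketch").setMaster("local[2]"))

// Hypothetical checkpoint location for a local experiment.
sc.setCheckpointDir(Files.createTempDirectory("spark-ckpt-sketch").toString)

val wordLengths = sc.parallelize(Seq("spark", "rdd", "checkpoint"), 2)
  .map(w => (w, w.length))

wordLengths.persist()     // mark for caching so the checkpoint job can reuse the data
wordLengths.checkpoint()  // lazy: only marks the RDD, nothing is written yet

// The first action computes and caches the RDD; the follow-up checkpoint job
// then writes the cached partitions instead of recomputing them.
val n = wordLengths.count()
val done = wordLengths.isCheckpointed

println(s"count=$n checkpointed=$done")
```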
/**
* Set this RDD's storage level to persist its values across operations after the first time
* it is computed. This can only be used to assign a new storage level if the RDD does not
* have a storage level set yet. Local checkpointing is an exception.
*/
def persist(newLevel: StorageLevel): this.type = {
if (isLocallyCheckpointed) {
// This means the user previously called localCheckpoint(), which should have already
// marked this RDD for persisting. Here we should override the old storage level with
// one that is explicitly requested by the user (after adapting it to use disk).
persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
} else {
persist(newLevel, allowOverride = false)
}
}
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()
val NONE = new StorageLevel(false, false, false, false)
val DISK_ONLY = new StorageLevel(true, false, false, false)
val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
val MEMORY_ONLY = new StorageLevel(false, true, false, true) // memory only; may OOM
val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false) // serialized
val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2) // serialized, with two replicas
val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false) // memory first, disk only when memory runs out
val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2) // same as the previous level, with two replicas
val OFF_HEAP = new StorageLevel(false, false, true, false) //TACHYON
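cache is a special case of persist: it is persist at the MEMORY_ONLY level
Why serialize: it shrinks the data and helps prevent OOM
The downside of serialization: every use of the data requires deserialization, and deserialization costs CPU
Two replicas cost extra space and write time, but if one machine fails the other copy saves recomputation: a classic space-for-time trade
MEMORY_AND_DISK is the safest and MEMORY_ONLY is the fastest; use memory whenever you can, and two in-memory replicas when you can afford them
MEMORY_AND_DISK greatly reduces the likelihood of OOM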
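Choosing one of the levels discussed above explicitly might look like the sketch below. It assumes spark-shell (or a local context with the assumed settings shown); the tiny word list stands in for real data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Reuse the shell's context, or start a local one (assumed settings).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("persist-sketch").setMaster("local[2]"))

val counts = sc.parallelize(Seq("a", "b", "a", "c", "a"), 2)
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Serialized, memory first, spilling to disk when memory runs out.
counts.persist(StorageLevel.MEMORY_AND_DISK_SER)

val first = counts.collect()   // computes the RDD and stores its partitions
val second = counts.collect()  // reads the stored partitions, no new shuffle

val level = counts.getStorageLevel
println(level.description)
```

Once a level has been set, it cannot be changed by calling persist again with a different level; the RDD has to be unpersisted first.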
scala> val x = sc.textFile("/historyserverforSpark/README.md", 3).flatMap(_.split(" ")).map(word=>(word,1)).reduceByKey(_+_,1)
16/02/08 13:15:11 INFO storage.MemoryStore: Block broadcast_12 stored as values in memory (estimated size 212.8 KB, free 688.3 KB)
16/02/08 13:15:11 INFO storage.MemoryStore: Block broadcast_12_piece0 stored as bytes in memory (estimated size 19.7 KB, free 708.0 KB)
16/02/08 13:15:11 INFO storage.BlockManagerInfo: Added broadcast_12_piece0 in memory on 192.168.145.132:39834 (size: 19.7 KB, free: 1247.2 MB)
16/02/08 13:15:11 INFO spark.SparkContext: Created broadcast 12 from textFile at <console>:27
x: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[23] at reduceByKey at <console>:27
scala> x.collect
16/02/08 13:15:17 INFO spark.SparkContext: Starting job: collect at <console>:30
16/02/08 13:15:17 INFO mapred.FileInputFormat: Total input paths to process : 1
16/02/08 13:15:17 INFO scheduler.DAGScheduler: Registering RDD 22 (map at <console>:27)
16/02/08 13:15:17 INFO scheduler.DAGScheduler: Got job 7 (collect at <console>:30) with 1 output partitions
16/02/08 13:15:17 INFO scheduler.DAGScheduler: Final stage: ResultStage 10 (collect at <console>:30)
16/02/08 13:15:17 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 9)
16/02/08 13:15:17 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 9)
16/02/08 13:15:17 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 9 (MapPartitionsRDD[22] at map at <console>:27), which has no missing parents
16/02/08 13:15:17 INFO storage.MemoryStore: Block broadcast_13 stored as values in memory (estimated size 4.1 KB, free 712.1 KB)
16/02/08 13:15:17 INFO storage.MemoryStore: Block broadcast_13_piece0 stored as bytes in memory (estimated size 2.3 KB, free 714.4 KB)
16/02/08 13:15:17 INFO storage.BlockManagerInfo: Added broadcast_13_piece0 in memory on 192.168.145.132:39834 (size: 2.3 KB, free: 1247.2 MB)
16/02/08 13:15:17 INFO spark.SparkContext: Created broadcast 13 from broadcast at DAGScheduler.scala:1006
16/02/08 13:15:17 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ShuffleMapStage 9 (MapPartitionsRDD[22] at map at <console>:27)
16/02/08 13:15:17 INFO scheduler.TaskSchedulerImpl: Adding task set 9.0 with 3 tasks
16/02/08 13:15:17 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 9.0 (TID 44, Worker1, partition 0,NODE_LOCAL, 2141 bytes)
16/02/08 13:15:17 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 9.0 (TID 45, Master, partition 1,NODE_LOCAL, 2141 bytes)
16/02/08 13:15:17 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 9.0 (TID 46, Worker1, partition 2,NODE_LOCAL, 2141 bytes)
16/02/08 13:15:17 INFO storage.BlockManagerInfo: Added broadcast_13_piece0 in memory on Master:59109 (size: 2.3 KB, free: 511.1 MB)
16/02/08 13:15:17 INFO storage.BlockManagerInfo: Added broadcast_13_piece0 in memory on Worker1:51526 (size: 2.3 KB, free: 511.1 MB)
16/02/08 13:15:17 INFO storage.BlockManagerInfo: Added broadcast_12_piece0 in memory on Master:59109 (size: 19.7 KB, free: 511.1 MB)
16/02/08 13:15:18 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 9.0 (TID 45) in 232 ms on Master (1/3)
16/02/08 13:15:18 INFO storage.BlockManagerInfo: Added broadcast_12_piece0 in memory on Worker1:51526 (size: 19.7 KB, free: 511.1 MB)
16/02/08 13:15:18 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 9.0 (TID 44) in 847 ms on Worker1 (2/3)
16/02/08 13:15:18 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 9.0 (TID 46) in 880 ms on Worker1 (3/3)
16/02/08 13:15:18 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 9.0, whose tasks have all completed, from pool
16/02/08 13:15:18 INFO scheduler.DAGScheduler: ShuffleMapStage 9 (map at <console>:27) finished in 0.876 s
16/02/08 13:15:18 INFO scheduler.DAGScheduler: looking for newly runnable stages
16/02/08 13:15:18 INFO scheduler.DAGScheduler: running: Set()
16/02/08 13:15:18 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 10)
16/02/08 13:15:18 INFO scheduler.DAGScheduler: failed: Set()
16/02/08 13:15:18 INFO scheduler.DAGScheduler: Submitting ResultStage 10 (ShuffledRDD[23] at reduceByKey at <console>:27), which has no missing parents
16/02/08 13:15:18 INFO storage.MemoryStore: Block broadcast_14 stored as values in memory (estimated size 2.6 KB, free 717.0 KB)
16/02/08 13:15:18 INFO storage.MemoryStore: Block broadcast_14_piece0 stored as bytes in memory (estimated size 1580.0 B, free 718.5 KB)
16/02/08 13:15:18 INFO storage.BlockManagerInfo: Added broadcast_14_piece0 in memory on 192.168.145.132:39834 (size: 1580.0 B, free: 1247.2 MB)
16/02/08 13:15:18 INFO spark.SparkContext: Created broadcast 14 from broadcast at DAGScheduler.scala:1006
16/02/08 13:15:18 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 10 (ShuffledRDD[23] at reduceByKey at <console>:27)
16/02/08 13:15:18 INFO scheduler.TaskSchedulerImpl: Adding task set 10.0 with 1 tasks
16/02/08 13:15:18 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 10.0 (TID 47, Master, partition 0,NODE_LOCAL, 1894 bytes)
16/02/08 13:15:18 INFO storage.BlockManagerInfo: Added broadcast_14_piece0 in memory on Master:59109 (size: 1580.0 B, free: 511.1 MB)
16/02/08 13:15:18 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to Master:37061
16/02/08 13:15:18 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 2 is 166 bytes
16/02/08 13:15:20 INFO scheduler.DAGScheduler: ResultStage 10 (collect at <console>:30) finished in 1.588 s
16/02/08 13:15:20 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 10.0 (TID 47) in 1588 ms on Master (1/1)
16/02/08 13:15:20 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, whose tasks have all completed, from pool
16/02/08 13:15:20 INFO scheduler.DAGScheduler: Job 7 finished: collect at <console>:30, took 2.586912 s
res11: Array[(String, Int)] = Array((package,1), (For,2), (Programs,1), (processing.,1), (Because,1), (The,1), (cluster.,1), (its,1), ([run,1), (APIs,1), (computation,1), (Try,1), (have,1), (through,1), (several,1), (This,2), (graph,1), (Hive,2), (storage,1), (["Specifying,1), (To,2), ("yarn",1), (page](http://spark.apache.org/documentation.html),1), (Once,1), (prefer,1), (SparkPi,2), (engine,1), (version,1), (file,1), (documentation,,1), (processing,,1), (the,21), (are,1), (systems.,1), (params,1), (not,1), (different,1), (refer,2), (Interactive,2), (R,,1), (given.,1), (if,4), (build,3), (when,1), (be,2), (Tests,1), (Apache,1), (./bin/run-example,2), (programs,,1), (including,3), (Spark.,1), (package.,1), (1000).count(),1), (Versions,1), (HDFS,1), (Data.,1), (>>>,1), (programming,1), (...
scala> x.cache   // note: x is now marked for caching
res12: x.type = ShuffledRDD[23] at reduceByKey at <console>:27
scala> x.collect
16/02/08 13:15:29 INFO spark.SparkContext: Starting job: collect at <console>:30
16/02/08 13:15:29 INFO scheduler.DAGScheduler: Got job 8 (collect at <console>:30) with 1 output partitions
16/02/08 13:15:29 INFO scheduler.DAGScheduler: Final stage: ResultStage 12 (collect at <console>:30)
16/02/08 13:15:29 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 11)
16/02/08 13:15:29 INFO scheduler.DAGScheduler: Missing parents: List()
16/02/08 13:15:29 INFO scheduler.DAGScheduler: Submitting ResultStage 12 (ShuffledRDD[23] at reduceByKey at <console>:27), which has no missing parents
16/02/08 13:15:29 INFO storage.MemoryStore: Block broadcast_15 stored as values in memory (estimated size 2.6 KB, free 721.1 KB)
16/02/08 13:15:29 INFO storage.MemoryStore: Block broadcast_15_piece0 stored as bytes in memory (estimated size 1580.0 B, free 722.7 KB)
16/02/08 13:15:29 INFO storage.BlockManagerInfo: Added broadcast_15_piece0 in memory on 192.168.145.132:39834 (size: 1580.0 B, free: 1247.2 MB)
16/02/08 13:15:29 INFO spark.SparkContext: Created broadcast 15 from broadcast at DAGScheduler.scala:1006
16/02/08 13:15:29 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 12 (ShuffledRDD[23] at reduceByKey at <console>:27)
16/02/08 13:15:29 INFO scheduler.TaskSchedulerImpl: Adding task set 12.0 with 1 tasks
16/02/08 13:15:29 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 12.0 (TID 48, Master, partition 0,NODE_LOCAL, 1894 bytes)
16/02/08 13:15:29 INFO storage.BlockManagerInfo: Added broadcast_15_piece0 in memory on Master:59109 (size: 1580.0 B, free: 511.1 MB)
16/02/08 13:15:29 INFO storage.BlockManagerInfo: Added rdd_23_0 in memory on Master:59109 (size: 22.7 KB, free: 511.1 MB)
16/02/08 13:15:29 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 12.0 (TID 48) in 183 ms on Master (1/1)
16/02/08 13:15:29 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 12.0, whose tasks have all completed, from pool
16/02/08 13:15:29 INFO scheduler.DAGScheduler: ResultStage 12 (collect at <console>:30) finished in 0.182 s
16/02/08 13:15:29 INFO scheduler.DAGScheduler: Job 8 finished: collect at <console>:30, took 0.292463 s
res13: Array[(String, Int)] = Array((package,1), (For,2), (Programs,1), (processing.,1), (Because,1), (The,1), (cluster.,1), (its,1), ([run,1), (APIs,1), (computation,1), (Try,1), (have,1), (through,1), (several,1), (This,2), (graph,1), (Hive,2), (storage,1), (["Specifying,1), (To,2), ("yarn",1), (page](http://spark.apache.org/documentation.html),1), (Once,1), (prefer,1), (SparkPi,2), (engine,1), (version,1), (file,1), (documentation,,1), (processing,,1), (the,21), (are,1), (systems.,1), (params,1), (not,1), (different,1), (refer,2), (Interactive,2), (R,,1), (given.,1), (if,4), (build,3), (when,1), (be,2), (Tests,1), (Apache,1), (./bin/run-example,2), (programs,,1), (including,3), (Spark.,1), (package.,1), (1000).count(),1), (Versions,1), (HDFS,1), (Data.,1), (>>>,1), (programming,1), (...
scala>
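Note: do not chain other operators directly after cache(); otherwise the computation is triggered again.
cache is not an action; it is lazy, and the data is cached only when an action first computes the RDD.
To clear the cache, call unpersist, which clears it immediately.
persist is lazy; unpersist is eager.
Each executor has a local directory (set by spark.local.dir) where data persisted to disk is stored.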
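The lazy/eager difference between persist and unpersist can be checked with a small program. The following is only a minimal sketch, assuming a local master (local[2]) and a made-up word list; the object name PersistSketch and its run helper are hypothetical names used for illustration, not part of the session above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical helper object, for illustration only.
object PersistSketch {
  // Returns (storage level after persist, storage level after unpersist, number of distinct words).
  def run(sc: SparkContext): (StorageLevel, StorageLevel, Long) = {
    val words  = sc.parallelize(Seq("a", "b", "a", "c"), 2)
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // persist is lazy: it only marks the RDD, no job runs here.
    counts.persist(StorageLevel.MEMORY_AND_DISK)
    val marked = counts.getStorageLevel

    // The first action computes the RDD and stores its partitions.
    val n = counts.count()
    // A later action can reuse the stored partitions.
    counts.collect()

    // unpersist is eager: the blocks are dropped as part of this call.
    counts.unpersist(blocking = true)
    val cleared = counts.getStorageLevel

    (marked, cleared, n)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PersistSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

Keeping the persisted RDD in its own val (counts) and running actions on that val is what lets later jobs reuse the cached partitions.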
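========== Broadcast============
Why use broadcast?
Discussion:
1. Without broadcast, every Task that processes data gets its own copy of the variable: 1 million tasks means 1 million copies. In functional programming variables are immutable, so they have to be copied; without copying, someone else changing the state would break everything. With a small amount of data this does not matter. With a large amount it ends in an OOM!
2. Joining two tables: a join normally causes a shuffle, which broadcasting the smaller table can avoid;
3. Synchronization: what has been broadcast is read-only and cannot be modified.
4. Transfer performance improves greatly.
A broadcast variable is globally unique. It is broadcast to local storage on the Workers.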
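Summary:
A broadcast variable is a memory-level, global, read-only variable sent by the driver to all executors allocated to the current application.
The threads in an executor's thread pool share this global variable. This greatly reduces network transfer (otherwise every task would have to transfer the variable once) and saves a lot of memory; it also implicitly increases the share of CPU time spent on useful work.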
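How it works:
There are two approaches:
Approach 1:
The driver sends the variable to the executor once per Task, so there are as many copies as there are Tasks. This wastes network transfer and uses a lot of memory; if the variable is large, an OOM is very likely.
Approach 2:
The driver broadcasts the variable into executor memory, and all Tasks on that executor share the single broadcast copy, which greatly reduces network transfer and memory consumption.
The variable exists as long as the application exists and is destroyed when the application (the SparkContext) is destroyed.
Approach 1 generally has no advantage.
Extra: ways to share data between executors: HDFS or Tachyon
Hands-on: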
scala> val number = 10
number: Int = 10
scala> val numberBroadcast = sc.broadcast(number)
16/02/08 14:43:45 INFO storage.MemoryStore: Block broadcast_16 stored as values in memory (estimated size 40.0 B, free 232.5 KB)
16/02/08 14:43:45 INFO storage.MemoryStore: Block broadcast_16_piece0 stored as bytes in memory (estimated size 97.0 B, free 232.6 KB)
16/02/08 14:43:45 INFO storage.BlockManagerInfo: Added broadcast_16_piece0 in memory on 192.168.145.132:39834 (size: 97.0 B, free: 1247.2 MB)
16/02/08 14:43:45 INFO spark.SparkContext: Created broadcast 16 from broadcast at <console>:29
numberBroadcast: org.apache.spark.broadcast.Broadcast[Int] = Broadcast(16)
scala> val data = sc.parallelize(1 to 10000,3)
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[24] at parallelize at <console>:27
scala> val bn = data.map(_*numberBroadcast.value)
bn: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[25] at map at <console>:33
scala> bn.collect
16/02/08 14:48:27 INFO spark.SparkContext: Starting job: collect at <console>:36
16/02/08 14:48:27 INFO scheduler.DAGScheduler: Got job 9 (collect at <console>:36) with 3 output partitions
16/02/08 14:48:27 INFO scheduler.DAGScheduler: Final stage: ResultStage 13 (collect at <console>:36)
16/02/08 14:48:27 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/02/08 14:48:27 INFO scheduler.DAGScheduler: Missing parents: List()
16/02/08 14:48:27 INFO scheduler.DAGScheduler: Submitting ResultStage 13 (MapPartitionsRDD[25] at map at <console>:33), which has no missing parents
16/02/08 14:48:27 INFO storage.MemoryStore: Block broadcast_17 stored as values in memory (estimated size 6.9 KB, free 239.6 KB)
16/02/08 14:48:27 INFO storage.MemoryStore: Block broadcast_17_piece0 stored as bytes in memory (estimated size 3.0 KB, free 242.5 KB)
16/02/08 14:48:27 INFO storage.BlockManagerInfo: Added broadcast_17_piece0 in memory on 192.168.145.132:39834 (size: 3.0 KB, free: 1247.2 MB)
16/02/08 14:48:27 INFO spark.SparkContext: Created broadcast 17 from broadcast at DAGScheduler.scala:1006
16/02/08 14:48:27 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 13 (MapPartitionsRDD[25] at map at <console>:33)
16/02/08 14:48:27 INFO scheduler.TaskSchedulerImpl: Adding task set 13.0 with 3 tasks
16/02/08 14:48:28 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 13.0 (TID 49, Worker1, partition 0,PROCESS_LOCAL, 2078 bytes)
16/02/08 14:48:28 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 13.0 (TID 50, Master, partition 1,PROCESS_LOCAL, 2078 bytes)
16/02/08 14:48:28 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 13.0 (TID 51, Worker2, partition 2,PROCESS_LOCAL, 2135 bytes)
16/02/08 14:48:28 INFO storage.BlockManagerInfo: Added broadcast_17_piece0 in memory on Master:59109 (size: 3.0 KB, free: 511.1 MB)
16/02/08 14:48:28 INFO storage.BlockManagerInfo: Added broadcast_17_piece0 in memory on Worker2:36686 (size: 3.0 KB, free: 511.1 MB)
16/02/08 14:48:28 INFO storage.BlockManagerInfo: Added broadcast_16_piece0 in memory on Worker2:36686 (size: 97.0 B, free: 511.1 MB)
16/02/08 14:48:28 INFO storage.BlockManagerInfo: Added broadcast_16_piece0 in memory on Master:59109 (size: 97.0 B, free: 511.1 MB)
16/02/08 14:48:28 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 13.0 (TID 51) in 865 ms on Worker2 (1/3)
16/02/08 14:48:28 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 13.0 (TID 50) in 884 ms on Master (2/3)
16/02/08 14:48:29 INFO storage.BlockManagerInfo: Added broadcast_17_piece0 in memory on Worker1:51526 (size: 3.0 KB, free: 511.1 MB)
16/02/08 14:48:35 INFO storage.BlockManagerInfo: Added broadcast_16_piece0 in memory on Worker1:51526 (size: 97.0 B, free: 511.1 MB)
16/02/08 14:48:35 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 13.0 (TID 49) in 7726 ms on Worker1 (3/3)
16/02/08 14:48:35 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 13.0, whose tasks have all completed, from pool
16/02/08 14:48:35 INFO scheduler.DAGScheduler: ResultStage 13 (collect at <console>:36) finished in 7.726 s
16/02/08 14:48:35 INFO scheduler.DAGScheduler: Job 9 finished: collect at <console>:36, took 7.824894 s
res14: Array[Int] = Array(10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, 800, 810, 820, 830, 840, 850, 860, 870, 880, 890, 900, 910, 920, 930, 940, 950, 960, 970, 980, 990, 1000, 1010, 1020, 1030, 1040, 1050, 1060, 1070, 1080, 1090, 1100, 1110, 1120, 1130, 1140, 1150, 1160, 1170, 1180, 1190, 1200, 1210, 1220, 1230, 1240, 1250, 1260, 1270, 1280, 1290, 1300, 1310, 1320, 1330, 1340, 1350, 1360, 1370, 1380, 1390, 1400, 1410, 1420, 1430, 1440, 1450, 1460, 147...
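The join case from the discussion (point 2) can be sketched with a broadcast lookup table. This is a minimal sketch, not the author's code: the userNames map, the orders data, and the BroadcastJoinSketch object are all made up for illustration, and it assumes the small table fits comfortably in executor memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, for illustration only.
object BroadcastJoinSketch {
  // Joins (userId, amount) records against a small in-memory table without a shuffle.
  def run(sc: SparkContext): Array[(Int, String, Double)] = {
    // Small lookup table (assumed data); it is shipped once per executor, not once per task.
    val userNames = Map(1 -> "alice", 2 -> "bob")
    val namesBc = sc.broadcast(userNames)

    // The large side of the join (assumed data).
    val orders = sc.parallelize(Seq((1, 9.5), (2, 3.0), (1, 1.25), (3, 7.0)), 2)

    // Each task reads the shared read-only copy through .value;
    // records whose key is missing from the table are dropped (inner join).
    orders.flatMap { case (userId, amount) =>
      namesBc.value.get(userId).map(name => (userId, name, amount)).toList
    }.collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BroadcastJoinSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try run(sc).foreach(println) finally sc.stop()
  }
}
```

Because the lookup happens inside flatMap, the job stays in a single stage and no reduceByKey-style shuffle is scheduled.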
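========== Accumulator============
Why do we need accumulators?
With global variables and local variables, isn't that already enough? Why are accumulators still needed?
The defining features of an accumulator: it is global, tasks in the executors can only modify it, and only the driver can read it. This is essential for recording the state of the cluster, especially state that must be globally unique.
Straight to the code: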
scala> val sum = sc.accumulator(0)
sum: org.apache.spark.Accumulator[Int] = 0
scala> val data = sc.parallelize(1 to 100,3)
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[26] at parallelize at <console>:27
scala> data.foreach(item=>sum+=item)
16/02/08 15:06:55 INFO spark.SparkContext: Starting job: foreach at <console>:32
16/02/08 15:06:56 INFO scheduler.DAGScheduler: Got job 0 (foreach at <console>:32) with 3 output partitions
16/02/08 15:06:56 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (foreach at <console>:32)
16/02/08 15:06:56 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/02/08 15:06:56 INFO scheduler.DAGScheduler: Missing parents: List()
16/02/08 15:06:56 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (ParallelCollectionRDD[0] at parallelize at <console>:27), which has no missing parents
16/02/08 15:06:57 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 5.1 KB, free 5.1 KB)
16/02/08 15:06:57 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.3 KB, free 7.4 KB)
16/02/08 15:06:57 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.145.132:48569 (size: 2.3 KB, free: 1247.2 MB)
16/02/08 15:06:57 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
16/02/08 15:06:57 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 0 (ParallelCollectionRDD[0] at parallelize at <console>:27)
16/02/08 15:06:57 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 3 tasks
16/02/08 15:06:57 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, Worker1, partition 0,PROCESS_LOCAL, 2078 bytes)
16/02/08 15:06:57 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, Master, partition 1,PROCESS_LOCAL, 2078 bytes)
16/02/08 15:06:57 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, Worker2, partition 2,PROCESS_LOCAL, 2135 bytes)
16/02/08 15:06:58 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on Worker2:52845 (size: 2.3 KB, free: 511.1 MB)
16/02/08 15:06:58 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on Master:40522 (size: 2.3 KB, free: 511.1 MB)
16/02/08 15:07:02 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 3729 ms on Worker2 (1/3)
16/02/08 15:07:02 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 4415 ms on Master (2/3)
16/02/08 15:07:02 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on Worker1:51014 (size: 2.3 KB, free: 511.1 MB)
16/02/08 15:07:07 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 9662 ms on Worker1 (3/3)
16/02/08 15:07:07 INFO scheduler.DAGScheduler: ResultStage 0 (foreach at <console>:32) finished in 9.676 s
16/02/08 15:07:07 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/02/08 15:07:07 INFO scheduler.DAGScheduler: Job 0 finished: foreach at <console>:32, took 11.582405 s
scala> print(sum)
5050
scala> data.foreach(item=>sum+=item)
16/02/08 15:09:22 INFO spark.SparkContext: Starting job: foreach at <console>:32
16/02/08 15:09:22 INFO scheduler.DAGScheduler: Got job 1 (foreach at <console>:32) with 3 output partitions
16/02/08 15:09:22 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (foreach at <console>:32)
16/02/08 15:09:22 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/02/08 15:09:22 INFO scheduler.DAGScheduler: Missing parents: List()
16/02/08 15:09:22 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (ParallelCollectionRDD[0] at parallelize at <console>:27), which has no missing parents
16/02/08 15:09:22 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 5.1 KB, free 12.5 KB)
16/02/08 15:09:22 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 14.8 KB)
16/02/08 15:09:22 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.145.132:48569 (size: 2.3 KB, free: 1247.2 MB)
16/02/08 15:09:22 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/02/08 15:09:22 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 1 (ParallelCollectionRDD[0] at parallelize at <console>:27)
16/02/08 15:09:22 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 3 tasks
16/02/08 15:09:22 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 3, Worker1, partition 0,PROCESS_LOCAL, 2078 bytes)
16/02/08 15:09:22 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 4, Master, partition 1,PROCESS_LOCAL, 2078 bytes)
16/02/08 15:09:22 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 5, Worker2, partition 2,PROCESS_LOCAL, 2135 bytes)
16/02/08 15:09:22 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on Master:40522 (size: 2.3 KB, free: 511.1 MB)
16/02/08 15:09:22 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on Worker1:51014 (size: 2.3 KB, free: 511.1 MB)
16/02/08 15:09:22 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on Worker2:52845 (size: 2.3 KB, free: 511.1 MB)
16/02/08 15:09:22 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 4) in 199 ms on Master (1/3)
16/02/08 15:09:22 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 1.0 (TID 5) in 254 ms on Worker2 (2/3)
16/02/08 15:09:22 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 3) in 258 ms on Worker1 (3/3)
16/02/08 15:09:22 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/02/08 15:09:22 INFO scheduler.DAGScheduler: ResultStage 1 (foreach at <console>:32) finished in 0.264 s
16/02/08 15:09:22 INFO scheduler.DAGScheduler: Job 1 finished: foreach at <console>:32, took 0.350268 s
scala> print(sum)
10100
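Because the accumulator is globally unique, it can keep being added to: every operation only increases it, never decreases it. Updates from tasks on different executors do not overwrite each other: each task adds to its own local copy, and the driver merges the per-task values as the tasks finish.
The tasks run concurrently anyway. Every computation records its update, and each read on the driver returns the latest merged state.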
/**
 * Create an [[org.apache.spark.Accumulator]] variable of a given type, which tasks can "add"
 * values to using the `+=` method. Only the driver can access the accumulator's `value`.
 */
def accumulator[T](initialValue: T)(implicit param: AccumulatorParam[T]): Accumulator[T] = {
  val acc = new Accumulator(initialValue, param)
  cleaner.foreach(_.registerAccumulatorForCleanup(acc))
  acc
}
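A typical use of the write-in-tasks, read-on-driver rule is counting bad records while processing data. The sketch below assumes the Spark 1.x API used in this session (sc.accumulator with a name); Spark 2.0 and later offer sc.longAccumulator instead. The input strings, the accumulator names, and the AccumulatorSketch object are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, for illustration only (Spark 1.x accumulator API).
object AccumulatorSketch {
  // Returns (sum of the lines that parse as Int, number of lines that do not).
  def run(sc: SparkContext): (Int, Int) = {
    val parsedSum  = sc.accumulator(0, "parsedSum")
    val badRecords = sc.accumulator(0, "badRecords")

    // Assumed input: three numeric lines and two malformed ones.
    val lines = sc.parallelize(Seq("1", "2", "x", "4", "oops"), 3)

    // Tasks only add to the accumulators; they never read them.
    lines.foreach { s =>
      try {
        parsedSum += s.toInt
      } catch {
        case _: NumberFormatException => badRecords += 1
      }
    }

    // Only the driver reads the merged values, after the action has finished.
    (parsedSum.value, badRecords.value)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("AccumulatorSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

The updates are done inside an action (foreach) on purpose: updates made inside transformations may be applied more than once if a task is re-executed.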
Homework:
1. Manually clear the cache, that is, remove persisted content;
2. Read the accumulator source code yourself and work out how it operates;
Instructor: 王家林
Sina Weibo: http://weibo.com/ilovepains
WeChat official account: DT_Spark
Blog: http://blog.sina.com.cn/ilovepains
This article comes from the "一枝花傲寒" blog; please do not repost.