Getting Started with Spark: Computing the Median

The data is as follows:

1 2 3 4 5 6 8 9 11 12 13 15 18 20 22 23 25 27 29

The Spark program is as follows:

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.control.Breaks._
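/**
 * Created by xuyao on 15-7-24.
 * Computes the median of data that is stored in a distributed fashion.
 * The data is split into K buckets; the number of values in each bucket is counted,
 * and then the total number of values.
 * From the per-bucket counts and the total, we can tell which bucket holds the median
 * and the median's offset inside that bucket.
 * Finally, the median is read out of that bucket.
 */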
object Median {
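   def main(args: Array[String]): Unit = {
     val conf = new SparkConf().setAppName("Median")
     val sc = new SparkContext(conf)
     // textFile yields strings, so split each line on spaces and convert the tokens to Int
     val data = sc.textFile("data").flatMap(x => x.split(' ')).map(x => x.toInt)
     // Key each value by x / 4, so each bucket covers 4 consecutive integers
     // (8 buckets for this small sample, as the output shows)
     val mappeddata = data.map(x => (x / 4, x)).sortByKey()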
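     // p_count holds the number of values in each bucket
     val p_count = data.map(x => (x / 4, 1)).reduceByKey(_ + _).sortByKey()
     p_count.foreach(println)
     // p_count is an RDD and cannot be indexed like a Map, so collectAsMap brings it
     // to the driver as a Scala collection
     val scala_p_count = p_count.collectAsMap()
     // Look up a count by bucket key
     println(scala_p_count(0))
     // sum_count is the total number of values. p_count.count() would not work here,
     // because it returns the number of (bucket, count) pairs.
     val sum_count = p_count.map(x => x._2).sum().toInt
     println(sum_count)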
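     var temp = 0   // running count up to and including the bucket that holds the median
     var temp2 = 0  // running count over all buckets before that bucket
     var index = 0  // key of the bucket that holds the median
     var mid = 0
     if (sum_count % 2 != 0) {
        mid = sum_count / 2 + 1 // 1-based rank of the median in the whole data set
     }
     else {
        mid = sum_count / 2 // even count: this picks the lower of the two middle values
     }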
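     // Walk the buckets in ascending key order. Looking keys up as 0 until count would
     // fail when a bucket is empty, because that key is then missing from the map.
     val sortedKeys = scala_p_count.keys.toSeq.sorted
     breakable {
       for (i <- sortedKeys) {
         temp = temp + scala_p_count(i)
         temp2 = temp - scala_p_count(i)
         if (temp >= mid) {
           index = i
           break
         }
       }
     }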
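     println(mid + " " + index + " " + temp + " " + temp2)
     // Rank of the median inside its bucket (1-based)
     val offset = mid - temp2
     // takeOrdered(n) returns the n smallest elements in ascending order. The pairs are
     // compared by key and then by value, and every key left after the filter equals index.
     val result = mappeddata.filter(x => x._1 == index).takeOrdered(offset)
     println(result(offset - 1)._2)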
     sc.stop()
  }

}
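
The bucket logic can also be checked without a cluster. The following is a minimal sketch in plain Scala, not the program above: `BucketMedianSketch` and `bucketMedian` are hypothetical names, a local `Seq` stands in for the RDD, and `groupBy` stands in for `reduceByKey`. It assumes non-empty input of non-negative integers, as in the sample data.

```scala
object BucketMedianSketch {
  // Hypothetical helper: returns the lower median of a non-empty sequence
  // of non-negative integers, using buckets of the given width.
  def bucketMedian(data: Seq[Int], width: Int): Int = {
    require(data.nonEmpty && width > 0)
    // Count the values per bucket and order the buckets by key.
    val counts: Seq[(Int, Int)] =
      data.groupBy(_ / width).map { case (k, vs) => (k, vs.size) }.toSeq.sortBy(_._1)
    val total = counts.map(_._2).sum
    // 1-based rank of the median; for an even total this is the lower middle.
    val mid = (total + 1) / 2
    // Running totals: cumulative(i) counts the values in the first i + 1 buckets.
    val cumulative = counts.map(_._2).scanLeft(0)(_ + _).tail
    val pos = cumulative.indexWhere(_ >= mid)
    val before = cumulative(pos) - counts(pos)._2 // values in earlier buckets
    val offset = mid - before                      // 1-based rank inside the bucket
    val bucketKey = counts(pos)._1
    // Only the selected bucket is sorted, never the whole data set.
    data.filter(_ / width == bucketKey).sorted.apply(offset - 1)
  }

  def main(args: Array[String]): Unit = {
    val sample = Seq(1, 2, 3, 4, 5, 6, 8, 9, 11, 12, 13, 15, 18, 20, 22, 23, 25, 27, 29)
    println(bucketMedian(sample, 4)) // prints 12
  }
}
```

Because the sketch walks the buckets that actually exist, it also handles gaps between bucket keys, for example `Seq(1, 2, 3, 40)` with width 4.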

Running Median produces the following output:

/usr/lib/jvm/java-7-sun/bin/java -Dspark.master=local -Didea.launcher.port=7535 -Didea.launcher.bin.path=/opt/idea/bin -Dfile.encoding=UTF-8 -classpath /usr/lib/jvm/java-7-sun/jre/lib/jfr.jar:/usr/lib/jvm/java-7-sun/jre/lib/javaws.jar:/usr/lib/jvm/java-7-sun/jre/lib/resources.jar:/usr/lib/jvm/java-7-sun/jre/lib/plugin.jar:/usr/lib/jvm/java-7-sun/jre/lib/jfxrt.jar:/usr/lib/jvm/java-7-sun/jre/lib/jsse.jar:/usr/lib/jvm/java-7-sun/jre/lib/charsets.jar:/usr/lib/jvm/java-7-sun/jre/lib/deploy.jar:/usr/lib/jvm/java-7-sun/jre/lib/management-agent.jar:/usr/lib/jvm/java-7-sun/jre/lib/rt.jar:/usr/lib/jvm/java-7-sun/jre/lib/jce.jar:/usr/lib/jvm/java-7-sun/jre/lib/ext/sunpkcs11.jar:/usr/lib/jvm/java-7-sun/jre/lib/ext/sunjce_provider.jar:/usr/lib/jvm/java-7-sun/jre/lib/ext/sunec.jar:/usr/lib/jvm/java-7-sun/jre/lib/ext/dnsns.jar:/usr/lib/jvm/java-7-sun/jre/lib/ext/zipfs.jar:/usr/lib/jvm/java-7-sun/jre/lib/ext/localedata.jar:/opt/IdeaProjects/SparkTest/target/scala-2.10/classes:/home/xuyao/.sbt/boot/scala-2.10.4/lib/scala-library.jar:/home/xuyao/spark/lib/spark-assembly-1.4.0-hadoop2.4.0.jar:/opt/idea/lib/idea_rt.jar com.intellij.rt.execution.application.AppMain Median
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/07/29 12:43:28 INFO SparkContext: Running Spark version 1.4.0
15/07/29 12:43:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/07/29 12:43:29 WARN Utils: Your hostname, hadoop resolves to a loopback address: 127.0.1.1; using 192.168.73.129 instead (on interface eth0)
15/07/29 12:43:29 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/07/29 12:43:29 INFO SecurityManager: Changing view acls to: xuyao
15/07/29 12:43:29 INFO SecurityManager: Changing modify acls to: xuyao
15/07/29 12:43:29 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(xuyao); users with modify permissions: Set(xuyao)
15/07/29 12:43:30 INFO Slf4jLogger: Slf4jLogger started
15/07/29 12:43:31 INFO Remoting: Starting remoting
15/07/29 12:43:32 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.73.129:58364]
15/07/29 12:43:32 INFO Utils: Successfully started service 'sparkDriver' on port 58364.
15/07/29 12:43:32 INFO SparkEnv: Registering MapOutputTracker
15/07/29 12:43:33 INFO SparkEnv: Registering BlockManagerMaster
15/07/29 12:43:33 INFO DiskBlockManager: Created local directory at /tmp/spark-329d9ad9-4ed6-4a79-97f3-254cab1a13b8/blockmgr-f9da5521-a9c0-4801-bffb-3a92f089d1cd
15/07/29 12:43:33 INFO MemoryStore: MemoryStore started with capacity 131.6 MB
15/07/29 12:43:33 INFO HttpFileServer: HTTP File server directory is /tmp/spark-329d9ad9-4ed6-4a79-97f3-254cab1a13b8/httpd-fd2adba3-06b9-4035-9c2b-6733e379207a
15/07/29 12:43:33 INFO HttpServer: Starting HTTP Server
15/07/29 12:43:33 INFO Utils: Successfully started service 'HTTP file server' on port 58175.
15/07/29 12:43:33 INFO SparkEnv: Registering OutputCommitCoordinator
15/07/29 12:43:38 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/07/29 12:43:38 INFO SparkUI: Started SparkUI at http://192.168.73.129:4040
15/07/29 12:43:39 INFO Executor: Starting executor ID driver on host localhost
15/07/29 12:43:39 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 56974.
15/07/29 12:43:39 INFO NettyBlockTransferService: Server created on 56974
15/07/29 12:43:39 INFO BlockManagerMaster: Trying to register BlockManager
15/07/29 12:43:39 INFO BlockManagerMasterEndpoint: Registering block manager localhost:56974 with 131.6 MB RAM, BlockManagerId(driver, localhost, 56974)
15/07/29 12:43:39 INFO BlockManagerMaster: Registered BlockManager
15/07/29 12:43:40 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
15/07/29 12:43:41 INFO MemoryStore: ensureFreeSpace(137512) called with curMem=0, maxMem=137948037
15/07/29 12:43:41 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 134.3 KB, free 131.4 MB)
15/07/29 12:43:41 INFO MemoryStore: ensureFreeSpace(12633) called with curMem=137512, maxMem=137948037
15/07/29 12:43:41 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 12.3 KB, free 131.4 MB)
15/07/29 12:43:41 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:56974 (size: 12.3 KB, free: 131.5 MB)
15/07/29 12:43:41 INFO SparkContext: Created broadcast 0 from textFile at Median.scala:15
15/07/29 12:43:41 INFO FileInputFormat: Total input paths to process : 1
15/07/29 12:43:41 INFO SparkContext: Starting job: foreach at Median.scala:20
15/07/29 12:43:41 INFO DAGScheduler: Registering RDD 6 (map at Median.scala:19)
15/07/29 12:43:41 INFO DAGScheduler: Registering RDD 7 (reduceByKey at Median.scala:19)
15/07/29 12:43:41 INFO DAGScheduler: Got job 0 (foreach at Median.scala:20) with 1 output partitions (allowLocal=false)
15/07/29 12:43:41 INFO DAGScheduler: Final stage: ResultStage 2(foreach at Median.scala:20)
15/07/29 12:43:41 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1)
15/07/29 12:43:41 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1)
15/07/29 12:43:41 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[6] at map at Median.scala:19), which has no missing parents
15/07/29 12:43:41 INFO MemoryStore: ensureFreeSpace(4168) called with curMem=150145, maxMem=137948037
15/07/29 12:43:41 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.1 KB, free 131.4 MB)
15/07/29 12:43:41 INFO MemoryStore: ensureFreeSpace(2376) called with curMem=154313, maxMem=137948037
15/07/29 12:43:41 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 131.4 MB)
15/07/29 12:43:41 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:56974 (size: 2.3 KB, free: 131.5 MB)
15/07/29 12:43:41 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:874
15/07/29 12:43:41 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[6] at map at Median.scala:19)
15/07/29 12:43:41 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/07/29 12:43:42 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1399 bytes)
15/07/29 12:43:42 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/07/29 12:43:42 INFO HadoopRDD: Input split: file:/opt/IdeaProjects/SparkTest/data:0+49
15/07/29 12:43:42 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/07/29 12:43:42 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/07/29 12:43:42 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/07/29 12:43:42 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/07/29 12:43:42 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/07/29 12:43:42 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2001 bytes result sent to driver
15/07/29 12:43:42 INFO DAGScheduler: ShuffleMapStage 0 (map at Median.scala:19) finished in 0.435 s
15/07/29 12:43:42 INFO DAGScheduler: looking for newly runnable stages
15/07/29 12:43:42 INFO DAGScheduler: running: Set()
15/07/29 12:43:42 INFO DAGScheduler: waiting: Set(ShuffleMapStage 1, ResultStage 2)
15/07/29 12:43:42 INFO DAGScheduler: failed: Set()
15/07/29 12:43:42 INFO DAGScheduler: Missing parents for ShuffleMapStage 1: List()
15/07/29 12:43:42 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 411 ms on localhost (1/1)
15/07/29 12:43:42 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
15/07/29 12:43:42 INFO DAGScheduler: Missing parents for ResultStage 2: List(ShuffleMapStage 1)
15/07/29 12:43:42 INFO DAGScheduler: Submitting ShuffleMapStage 1 (ShuffledRDD[7] at reduceByKey at Median.scala:19), which is now runnable
15/07/29 12:43:42 INFO MemoryStore: ensureFreeSpace(2608) called with curMem=156689, maxMem=137948037
15/07/29 12:43:42 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.5 KB, free 131.4 MB)
15/07/29 12:43:42 INFO MemoryStore: ensureFreeSpace(1586) called with curMem=159297, maxMem=137948037
15/07/29 12:43:42 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1586.0 B, free 131.4 MB)
15/07/29 12:43:42 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:56974 (size: 1586.0 B, free: 131.5 MB)
15/07/29 12:43:42 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:874
15/07/29 12:43:42 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 1 (ShuffledRDD[7] at reduceByKey at Median.scala:19)
15/07/29 12:43:42 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/07/29 12:43:42 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1154 bytes)
15/07/29 12:43:42 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/07/29 12:43:42 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/07/29 12:43:42 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 3 ms
15/07/29 12:43:42 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1095 bytes result sent to driver
15/07/29 12:43:42 INFO DAGScheduler: ShuffleMapStage 1 (reduceByKey at Median.scala:19) finished in 0.071 s
15/07/29 12:43:42 INFO DAGScheduler: looking for newly runnable stages
15/07/29 12:43:42 INFO DAGScheduler: running: Set()
15/07/29 12:43:42 INFO DAGScheduler: waiting: Set(ResultStage 2)
15/07/29 12:43:42 INFO DAGScheduler: failed: Set()
15/07/29 12:43:42 INFO DAGScheduler: Missing parents for ResultStage 2: List()
15/07/29 12:43:42 INFO DAGScheduler: Submitting ResultStage 2 (ShuffledRDD[8] at sortByKey at Median.scala:19), which is now runnable
15/07/29 12:43:42 INFO MemoryStore: ensureFreeSpace(2456) called with curMem=160883, maxMem=137948037
15/07/29 12:43:42 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.4 KB, free 131.4 MB)
15/07/29 12:43:42 INFO MemoryStore: ensureFreeSpace(1501) called with curMem=163339, maxMem=137948037
15/07/29 12:43:42 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1501.0 B, free 131.4 MB)
15/07/29 12:43:42 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 85 ms on localhost (1/1)
15/07/29 12:43:42 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
15/07/29 12:43:42 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:56974 (size: 1501.0 B, free: 131.5 MB)
15/07/29 12:43:42 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:874
15/07/29 12:43:42 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (ShuffledRDD[8] at sortByKey at Median.scala:19)
15/07/29 12:43:42 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
15/07/29 12:43:42 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, PROCESS_LOCAL, 1165 bytes)
15/07/29 12:43:42 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
15/07/29 12:43:42 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/07/29 12:43:42 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 2 ms
(0,3)
(1,3)
(2,3)
(3,3)
(4,1)
(5,3)
(6,2)
(7,1)
15/07/29 12:43:42 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 886 bytes result sent to driver
15/07/29 12:43:42 INFO DAGScheduler: ResultStage 2 (foreach at Median.scala:20) finished in 0.015 s
15/07/29 12:43:42 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 23 ms on localhost (1/1)
15/07/29 12:43:42 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 
15/07/29 12:43:42 INFO DAGScheduler: Job 0 finished: foreach at Median.scala:20, took 0.882846 s
15/07/29 12:43:42 INFO SparkContext: Starting job: collectAsMap at Median.scala:22
15/07/29 12:43:42 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 143 bytes
15/07/29 12:43:42 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 143 bytes
15/07/29 12:43:42 INFO DAGScheduler: Got job 1 (collectAsMap at Median.scala:22) with 1 output partitions (allowLocal=false)
15/07/29 12:43:42 INFO DAGScheduler: Final stage: ResultStage 5(collectAsMap at Median.scala:22)
15/07/29 12:43:42 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 4)
15/07/29 12:43:42 INFO DAGScheduler: Missing parents: List()
15/07/29 12:43:42 INFO DAGScheduler: Submitting ResultStage 5 (ShuffledRDD[8] at sortByKey at Median.scala:19), which has no missing parents
15/07/29 12:43:42 INFO MemoryStore: ensureFreeSpace(2552) called with curMem=164840, maxMem=137948037
15/07/29 12:43:42 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.5 KB, free 131.4 MB)
15/07/29 12:43:42 INFO MemoryStore: ensureFreeSpace(1502) called with curMem=167392, maxMem=137948037
15/07/29 12:43:42 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 1502.0 B, free 131.4 MB)
15/07/29 12:43:42 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:56974 (size: 1502.0 B, free: 131.5 MB)
15/07/29 12:43:42 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:874
15/07/29 12:43:42 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 5 (ShuffledRDD[8] at sortByKey at Median.scala:19)
15/07/29 12:43:42 INFO TaskSchedulerImpl: Adding task set 5.0 with 1 tasks
15/07/29 12:43:42 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 3, localhost, PROCESS_LOCAL, 1165 bytes)
15/07/29 12:43:42 INFO Executor: Running task 0.0 in stage 5.0 (TID 3)
15/07/29 12:43:42 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/07/29 12:43:42 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 2 ms
15/07/29 12:43:42 INFO Executor: Finished task 0.0 in stage 5.0 (TID 3). 1179 bytes result sent to driver
15/07/29 12:43:42 INFO DAGScheduler: ResultStage 5 (collectAsMap at Median.scala:22) finished in 0.009 s
15/07/29 12:43:42 INFO DAGScheduler: Job 1 finished: collectAsMap at Median.scala:22, took 0.063031 s
15/07/29 12:43:42 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 3) in 18 ms on localhost (1/1)
15/07/29 12:43:42 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool 
3
15/07/29 12:43:42 INFO SparkContext: Starting job: sum at Median.scala:26
15/07/29 12:43:42 INFO DAGScheduler: Got job 2 (sum at Median.scala:26) with 1 output partitions (allowLocal=false)
15/07/29 12:43:42 INFO DAGScheduler: Final stage: ResultStage 8(sum at Median.scala:26)
15/07/29 12:43:42 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 7)
15/07/29 12:43:42 INFO DAGScheduler: Missing parents: List()
15/07/29 12:43:42 INFO DAGScheduler: Submitting ResultStage 8 (MapPartitionsRDD[10] at numericRDDToDoubleRDDFunctions at Median.scala:26), which has no missing parents
15/07/29 12:43:42 INFO MemoryStore: ensureFreeSpace(3464) called with curMem=168894, maxMem=137948037
15/07/29 12:43:42 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 3.4 KB, free 131.4 MB)
15/07/29 12:43:42 INFO MemoryStore: ensureFreeSpace(2011) called with curMem=172358, maxMem=137948037
15/07/29 12:43:42 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 2011.0 B, free 131.4 MB)
15/07/29 12:43:42 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:56974 (size: 2011.0 B, free: 131.5 MB)
15/07/29 12:43:42 INFO SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:874
15/07/29 12:43:42 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 8 (MapPartitionsRDD[10] at numericRDDToDoubleRDDFunctions at Median.scala:26)
15/07/29 12:43:42 INFO TaskSchedulerImpl: Adding task set 8.0 with 1 tasks
15/07/29 12:43:42 INFO TaskSetManager: Starting task 0.0 in stage 8.0 (TID 4, localhost, PROCESS_LOCAL, 1165 bytes)
15/07/29 12:43:42 INFO Executor: Running task 0.0 in stage 8.0 (TID 4)
15/07/29 12:43:42 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/07/29 12:43:42 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/07/29 12:43:42 INFO Executor: Finished task 0.0 in stage 8.0 (TID 4). 926 bytes result sent to driver
15/07/29 12:43:42 INFO TaskSetManager: Finished task 0.0 in stage 8.0 (TID 4) in 11 ms on localhost (1/1)
15/07/29 12:43:42 INFO TaskSchedulerImpl: Removed TaskSet 8.0, whose tasks have all completed, from pool 
15/07/29 12:43:42 INFO DAGScheduler: ResultStage 8 (sum at Median.scala:26) finished in 0.006 s
15/07/29 12:43:42 INFO DAGScheduler: Job 2 finished: sum at Median.scala:26, took 0.038065 s
15/07/29 12:43:42 INFO SparkContext: Starting job: count at Median.scala:38
19
15/07/29 12:43:42 INFO DAGScheduler: Got job 3 (count at Median.scala:38) with 1 output partitions (allowLocal=false)
15/07/29 12:43:42 INFO DAGScheduler: Final stage: ResultStage 11(count at Median.scala:38)
15/07/29 12:43:42 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 10)
15/07/29 12:43:42 INFO DAGScheduler: Missing parents: List()
15/07/29 12:43:42 INFO DAGScheduler: Submitting ResultStage 11 (ShuffledRDD[8] at sortByKey at Median.scala:19), which has no missing parents
15/07/29 12:43:42 INFO MemoryStore: ensureFreeSpace(2376) called with curMem=174369, maxMem=137948037
15/07/29 12:43:42 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 2.3 KB, free 131.4 MB)
15/07/29 12:43:42 INFO MemoryStore: ensureFreeSpace(1451) called with curMem=176745, maxMem=137948037
15/07/29 12:43:42 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 1451.0 B, free 131.4 MB)
15/07/29 12:43:42 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:56974 (size: 1451.0 B, free: 131.5 MB)
15/07/29 12:43:42 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:874
15/07/29 12:43:42 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 11 (ShuffledRDD[8] at sortByKey at Median.scala:19)
15/07/29 12:43:42 INFO TaskSchedulerImpl: Adding task set 11.0 with 1 tasks
15/07/29 12:43:42 INFO TaskSetManager: Starting task 0.0 in stage 11.0 (TID 5, localhost, PROCESS_LOCAL, 1165 bytes)
15/07/29 12:43:42 INFO Executor: Running task 0.0 in stage 11.0 (TID 5)
15/07/29 12:43:42 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/07/29 12:43:42 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/07/29 12:43:42 INFO BlockManagerInfo: Removed broadcast_5_piece0 on localhost:56974 in memory (size: 2011.0 B, free: 131.5 MB)
15/07/29 12:43:42 INFO Executor: Finished task 0.0 in stage 11.0 (TID 5). 924 bytes result sent to driver
15/07/29 12:43:42 INFO TaskSetManager: Finished task 0.0 in stage 11.0 (TID 5) in 9 ms on localhost (1/1)
15/07/29 12:43:42 INFO TaskSchedulerImpl: Removed TaskSet 11.0, whose tasks have all completed, from pool 
15/07/29 12:43:42 INFO DAGScheduler: ResultStage 11 (count at Median.scala:38) finished in 0.010 s
15/07/29 12:43:42 INFO DAGScheduler: Job 3 finished: count at Median.scala:38, took 0.167034 s
10 3 12 9
15/07/29 12:43:42 INFO BlockManagerInfo: Removed broadcast_4_piece0 on localhost:56974 in memory (size: 1502.0 B, free: 131.5 MB)
15/07/29 12:43:42 INFO BlockManagerInfo: Removed broadcast_3_piece0 on localhost:56974 in memory (size: 1501.0 B, free: 131.5 MB)
15/07/29 12:43:42 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:56974 in memory (size: 1586.0 B, free: 131.5 MB)
15/07/29 12:43:42 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:56974 in memory (size: 2.3 KB, free: 131.5 MB)
15/07/29 12:43:43 INFO SparkContext: Starting job: takeOrdered at Median.scala:53
15/07/29 12:43:43 INFO DAGScheduler: Registering RDD 4 (map at Median.scala:17)
15/07/29 12:43:43 INFO DAGScheduler: Got job 4 (takeOrdered at Median.scala:53) with 1 output partitions (allowLocal=false)
15/07/29 12:43:43 INFO DAGScheduler: Final stage: ResultStage 13(takeOrdered at Median.scala:53)
15/07/29 12:43:43 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 12)
15/07/29 12:43:43 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 12)
15/07/29 12:43:43 INFO DAGScheduler: Submitting ShuffleMapStage 12 (MapPartitionsRDD[4] at map at Median.scala:17), which has no missing parents
15/07/29 12:43:43 INFO MemoryStore: ensureFreeSpace(4328) called with curMem=153972, maxMem=137948037
15/07/29 12:43:43 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 4.2 KB, free 131.4 MB)
15/07/29 12:43:43 INFO MemoryStore: ensureFreeSpace(2424) called with curMem=158300, maxMem=137948037
15/07/29 12:43:43 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 2.4 KB, free 131.4 MB)
15/07/29 12:43:43 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:56974 (size: 2.4 KB, free: 131.5 MB)
15/07/29 12:43:43 INFO SparkContext: Created broadcast 7 from broadcast at DAGScheduler.scala:874
15/07/29 12:43:43 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 12 (MapPartitionsRDD[4] at map at Median.scala:17)
15/07/29 12:43:43 INFO TaskSchedulerImpl: Adding task set 12.0 with 1 tasks
15/07/29 12:43:43 INFO TaskSetManager: Starting task 0.0 in stage 12.0 (TID 6, localhost, PROCESS_LOCAL, 1399 bytes)
15/07/29 12:43:43 INFO Executor: Running task 0.0 in stage 12.0 (TID 6)
15/07/29 12:43:43 INFO HadoopRDD: Input split: file:/opt/IdeaProjects/SparkTest/data:0+49
15/07/29 12:43:43 INFO Executor: Finished task 0.0 in stage 12.0 (TID 6). 2001 bytes result sent to driver
15/07/29 12:43:43 INFO TaskSetManager: Finished task 0.0 in stage 12.0 (TID 6) in 9 ms on localhost (1/1)
15/07/29 12:43:43 INFO TaskSchedulerImpl: Removed TaskSet 12.0, whose tasks have all completed, from pool 
15/07/29 12:43:43 INFO DAGScheduler: ShuffleMapStage 12 (map at Median.scala:17) finished in 0.004 s
15/07/29 12:43:43 INFO DAGScheduler: looking for newly runnable stages
15/07/29 12:43:43 INFO DAGScheduler: running: Set()
15/07/29 12:43:43 INFO DAGScheduler: waiting: Set(ResultStage 13)
15/07/29 12:43:43 INFO DAGScheduler: failed: Set()
15/07/29 12:43:43 INFO DAGScheduler: Missing parents for ResultStage 13: List()
15/07/29 12:43:43 INFO DAGScheduler: Submitting ResultStage 13 (MapPartitionsRDD[12] at takeOrdered at Median.scala:53), which is now runnable
15/07/29 12:43:43 INFO MemoryStore: ensureFreeSpace(3600) called with curMem=160724, maxMem=137948037
15/07/29 12:43:43 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 3.5 KB, free 131.4 MB)
15/07/29 12:43:43 INFO MemoryStore: ensureFreeSpace(2075) called with curMem=164324, maxMem=137948037
15/07/29 12:43:43 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 2.0 KB, free 131.4 MB)
15/07/29 12:43:43 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:56974 (size: 2.0 KB, free: 131.5 MB)
15/07/29 12:43:43 INFO SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:874
15/07/29 12:43:43 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 13 (MapPartitionsRDD[12] at takeOrdered at Median.scala:53)
15/07/29 12:43:43 INFO TaskSchedulerImpl: Adding task set 13.0 with 1 tasks
15/07/29 12:43:43 INFO TaskSetManager: Starting task 0.0 in stage 13.0 (TID 7, localhost, PROCESS_LOCAL, 1165 bytes)
15/07/29 12:43:43 INFO Executor: Running task 0.0 in stage 13.0 (TID 7)
15/07/29 12:43:43 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/07/29 12:43:43 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
15/07/29 12:43:43 INFO Executor: Finished task 0.0 in stage 13.0 (TID 7). 1486 bytes result sent to driver
15/07/29 12:43:43 INFO TaskSetManager: Finished task 0.0 in stage 13.0 (TID 7) in 32 ms on localhost (1/1)
15/07/29 12:43:43 INFO TaskSchedulerImpl: Removed TaskSet 13.0, whose tasks have all completed, from pool 
15/07/29 12:43:43 INFO DAGScheduler: ResultStage 13 (takeOrdered at Median.scala:53) finished in 0.028 s
15/07/29 12:43:43 INFO DAGScheduler: Job 4 finished: takeOrdered at Median.scala:53, took 0.071571 s
12
15/07/29 12:43:43 INFO SparkUI: Stopped Spark web UI at http://192.168.73.129:4040
15/07/29 12:43:43 INFO DAGScheduler: Stopping DAGScheduler
15/07/29 12:43:43 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/07/29 12:43:43 INFO Utils: path = /tmp/spark-329d9ad9-4ed6-4a79-97f3-254cab1a13b8/blockmgr-f9da5521-a9c0-4801-bffb-3a92f089d1cd, already present as root for deletion.
15/07/29 12:43:43 INFO MemoryStore: MemoryStore cleared
15/07/29 12:43:43 INFO BlockManager: BlockManager stopped
15/07/29 12:43:43 INFO BlockManagerMaster: BlockManagerMaster stopped
15/07/29 12:43:43 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/07/29 12:43:43 INFO SparkContext: Successfully stopped SparkContext
15/07/29 12:43:43 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/07/29 12:43:43 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/07/29 12:43:43 INFO Utils: Shutdown hook called
15/07/29 12:43:43 INFO Utils: Deleting directory /tmp/spark-329d9ad9-4ed6-4a79-97f3-254cab1a13b8

Process finished with exit code 0
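
The line `10 3 12 9` in the output is `mid`, `index`, `temp` and `temp2`. As a sketch, the same numbers can be rederived from the bucket counts printed earlier in the log. `OutputWalkthrough` is a hypothetical name, and the counts are copied from the output by hand.

```scala
object OutputWalkthrough {
  // Bucket counts copied from the (key, count) pairs printed in the log.
  val counts = Seq(0 -> 3, 1 -> 3, 2 -> 3, 3 -> 3, 4 -> 1, 5 -> 3, 6 -> 2, 7 -> 1)

  val total = counts.map(_._2).sum   // 19 values in total
  val mid = (total + 1) / 2          // 1-based rank of the median: 10
  // Running totals after each bucket: 3, 6, 9, 12, 13, 16, 18, 19.
  val running = counts.map(_._2).scanLeft(0)(_ + _).tail
  val pos = running.indexWhere(_ >= mid)
  val index = counts(pos)._1         // bucket 3, which holds 12, 13 and 15
  val temp = running(pos)            // 12 values up to and including bucket 3
  val temp2 = temp - counts(pos)._2  // 9 values before bucket 3
  val offset = mid - temp2           // the median is the 1st value in bucket 3

  def main(args: Array[String]): Unit =
    println(s"$mid $index $temp $temp2 $offset") // prints "10 3 12 9 1"
}
```

The first value in bucket 3 is 12, which matches the final `12` printed by the program.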
