The code is as follows:
package rdd

import org.apache.spark.{SparkContext, SparkConf}

/**
  * Created by 汪本成 on 2016/7/2.
  */
object rddJoin {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rddJoin").setMaster("local")
    val sc = new SparkContext(conf)
    // Two pair RDDs, each in a single partition
    val rdd1 = sc.parallelize(Array((1, 21), (2, 42), (3, 41)), 1)
    val rdd2 = sc.parallelize(Array((3, 4), (4, 41)), 1)
    // Inner join by key: only keys present in both RDDs survive
    val rdd3 = rdd1.join(rdd2)
    rdd3.foreach(println)
    // Pair each element of rdd1 with its index
    rdd1.zipWithIndex.foreach(println)
    sc.stop()
  }
}
16/07/02 22:42:13 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 19 ms
16/07/02 22:42:13 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
16/07/02 22:42:13 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
(3,(41,4))
16/07/02 22:42:13 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 1165 bytes result sent to driver
16/07/02 22:42:13 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 113 ms on localhost (1/1)
16/07/02 22:42:13 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
16/07/02 22:42:13 INFO DAGScheduler: ResultStage 2 (foreach at rddJoin.scala:18) finished in 0.114 s
16/07/02 22:42:13 INFO DAGScheduler: Job 0 finished: foreach at rddJoin.scala:18, took 1.312562 s
16/07/02 22:42:13 INFO SparkContext: Starting job: foreach at rddJoin.scala:19
16/07/02 22:42:13 INFO DAGScheduler: Got job 1 (foreach at rddJoin.scala:19) with 1 output partitions
16/07/02 22:42:13 INFO DAGScheduler: Final stage: ResultStage 3 (foreach at rddJoin.scala:19)
16/07/02 22:42:13 INFO DAGScheduler: Parents of final stage: List()
16/07/02 22:42:13 INFO DAGScheduler: Missing parents: List()
16/07/02 22:42:13 INFO DAGScheduler: Submitting ResultStage 3 (ZippedWithIndexRDD[5] at zipWithIndex at rddJoin.scala:19), which has no missing parents
16/07/02 22:42:13 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 1608.0 B, free 12.3 KB)
16/07/02 22:42:13 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1075.0 B, free 13.3 KB)
16/07/02 22:42:13 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:46987 (size: 1075.0 B, free: 5.1 GB)
16/07/02 22:42:13 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006
16/07/02 22:42:13 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (ZippedWithIndexRDD[5] at zipWithIndex at rddJoin.scala:19)
16/07/02 22:42:13 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
16/07/02 22:42:13 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, localhost, partition 0,PROCESS_LOCAL, 2336 bytes)
16/07/02 22:42:13 INFO Executor: Running task 0.0 in stage 3.0 (TID 3)
((1,21),0)
((2,42),1)
((3,41),2)
The use of join on Spark RDDs is already covered in detail in the books by now, so here I simply demonstrate it with a program.
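For reference, the semantics the run above demonstrates: RDD.join is an inner join by key, so only keys present in both RDDs appear in the result, each as (key, (leftValue, rightValue)), and zipWithIndex pairs each element with its position. A minimal plain-Scala sketch of the same logic, with no Spark dependency (the object and method names here are my own, for illustration only):

```scala
object JoinSketch {
  // Inner join by key, mirroring what rdd1.join(rdd2) computes:
  // keep only pairs whose key occurs on both sides.
  def innerJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for {
      (k, v)  <- left
      (k2, w) <- right
      if k == k2
    } yield (k, (v, w))

  def main(args: Array[String]): Unit = {
    // Same data as the Spark example above
    val rdd1 = Seq((1, 21), (2, 42), (3, 41))
    val rdd2 = Seq((3, 4), (4, 41))

    // Only key 3 exists on both sides, matching the (3,(41,4)) line in the log
    innerJoin(rdd1, rdd2).foreach(println)

    // zipWithIndex pairs each element with its index, as in the second job
    rdd1.zipWithIndex.foreach(println)
  }
}
```

Note that the Spark version shuffles both RDDs by key across partitions before matching; the collection version here only illustrates the per-key matching, not the distribution.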