Spark Function Explained: coalesce

Function signature

def coalesce(numPartitions: Int, shuffle: Boolean = false)
    (implicit ord: Ordering[T] = null): RDD[T]

  Returns a new RDD whose number of partitions equals numPartitions. If shuffle is set to true, a shuffle is performed; without a shuffle (the default), coalesce can only decrease the number of partitions, not increase it.
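To make the no-shuffle path concrete, here is a minimal plain-Scala sketch (no Spark required; the object and method names are made up for illustration) of how coalesce's default mode merges existing parent partitions into numPartitions groups. The real CoalescedRDD also weighs data locality when grouping; this only shows the grouping idea.

```scala
// Sketch: assign parent partition indices 0..parentCount-1 to
// numPartitions output groups, keeping the groups balanced. No records
// move between executors, which is why no shuffle is needed.
object CoalesceSketch {
  def groupPartitions(parentCount: Int, numPartitions: Int): Seq[Seq[Int]] = {
    require(numPartitions > 0, "numPartitions must be positive")
    (0 until parentCount)
      .groupBy(i => i * numPartitions / parentCount) // target group index
      .toSeq
      .sortBy(_._1)
      .map(_._2)
  }

  def main(args: Array[String]): Unit = {
    // 30 parent partitions coalesced to 2, as in the example below:
    val groups = groupPartitions(30, 2)
    println(groups.map(_.size)) // two balanced groups of 15 parents each
  }
}
```

Because each output partition is just a union of whole parent partitions, this mode cannot produce more partitions than the parent RDD has.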

Example

/**
 * User: 过往记忆
 * Date: 15-03-09
 * Time: 06:30 AM
 * Blog: http://www.iteblog.com
 * Permalink: http://www.iteblog.com/archives/1279
 * 过往记忆 blog, focusing on Hadoop, Hive, Spark, Shark and Flume
 * WeChat public account: iteblog_hadoop
 */
scala> var data = sc.parallelize(List(1,2,3,4))
data: org.apache.spark.rdd.RDD[Int] =
    ParallelCollectionRDD[45] at parallelize at <console>:12

scala> data.partitions.length
res68: Int = 30

scala> val result = data.coalesce(2, false)
result: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[57] at coalesce at <console>:14

scala> result.partitions.length
res77: Int = 2

scala> result.toDebugString
res75: String =
(2) CoalescedRDD[57] at coalesce at <console>:14 []
 |  ParallelCollectionRDD[45] at parallelize at <console>:12 []

scala> val result1 = data.coalesce(2, true)
result1: org.apache.spark.rdd.RDD[Int] = MappedRDD[61] at coalesce at <console>:14

scala> result1.toDebugString
res76: String =
(2) MappedRDD[61] at coalesce at <console>:14 []
 |  CoalescedRDD[60] at coalesce at <console>:14 []
 |  ShuffledRDD[59] at coalesce at <console>:14 []
 +-(30) MapPartitionsRDD[58] at coalesce at <console>:14 []
    |   ParallelCollectionRDD[45] at parallelize at <console>:12 []

  The lineage output above shows that no shuffle occurs when shuffle is false (CoalescedRDD sits directly on top of the parent RDD), whereas shuffle=true inserts a ShuffledRDD stage into the lineage. RDD.partitions.length returns the number of partitions of an RDD.
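By contrast, the shuffle=true path redistributes individual records rather than merging whole parent partitions, which is why it can also increase the partition count. A minimal plain-Scala sketch of that redistribution (no Spark; the object name and round-robin assignment are illustrative assumptions, while Spark actually runs a distributed shuffle with a random starting partition):

```scala
// Sketch of the shuffle path of coalesce: every record is reassigned to a
// target partition, so data moves between partitions and the output can
// have more partitions than the input.
object ShuffleCoalesceSketch {
  def redistribute[T](parents: Seq[Seq[T]], numPartitions: Int): Seq[Seq[T]] = {
    require(numPartitions > 0, "numPartitions must be positive")
    val allRecords = parents.flatten
    // Round-robin assignment spreads records evenly across the
    // numPartitions output partitions.
    allRecords.zipWithIndex
      .groupBy { case (_, i) => i % numPartitions }
      .toSeq
      .sortBy(_._1)
      .map(_._2.map(_._1))
  }

  def main(args: Array[String]): Unit = {
    // Three parent partitions redistributed into four output partitions:
    val parents = Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))
    println(redistribute(parents, 4))
  }
}
```

The even spreading is the main reason to pay for the shuffle: it fixes skewed partitions, whereas the no-shuffle path only merges whatever the parent partitions happen to contain.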
