def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): U
Aggregate the elements of each partition, and then the results for all the partitions, using the given combine functions and a neutral "zero value". This function can return a different result type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into a U and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are allowed to modify and return their first argument instead of creating a new U, to avoid memory allocation.
zeroValue
the initial value for the accumulated result of each partition for the seqOp operator, and also the initial value for the combine results from different partitions for the combOp operator - this will typically be the neutral element (e.g. Nil for list concatenation or 0 for summation)
seqOp
an operator used to accumulate results within a partition
combOp
an associative operator used to combine results from different partitions
The aggregate function first folds the elements within each partition using seqOp, and then merges the per-partition results, together with the initial value (zeroValue), using combOp. The final return type does not have to match the element type of the RDD.
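To illustrate why a different result type U is useful, here is a sketch (not from the original post) that imitates aggregate's two phases with plain Scala collections, computing a (sum, count) pair (U = (Int, Int)) from Int elements (T = Int):

```scala
// Simulate RDD partitions with nested lists.
val partitions = List(List(1, 2, 3), List(4, 5, 6))

val zero = (0, 0)                                              // (sum, count)
val seqOp  = (acc: (Int, Int), x: Int) => (acc._1 + x, acc._2 + 1) // fold a T into a U
val combOp = (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2) // merge two U's

// Phase 1: seqOp within each partition; Phase 2: combOp across partitions.
val perPartition = partitions.map(_.foldLeft(zero)(seqOp))
val (sum, count) = perPartition.foldLeft(zero)(combOp)
// sum = 21, count = 6, so the mean is sum.toDouble / count
```

From the pair, an average can be computed in a single pass over the data, which a plain reduce (where U must equal T) cannot express directly.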
scala> def seqOP(a:Int, b:Int) : Int = {
| val r = a*b
| println("seqOp: " + a + "\t" + b+"=>"+r)
| r
| }
seqOP: (a: Int, b: Int)Int
scala> def combOp(a:Int, b:Int): Int = {
| val r= a+b
| println("combOp: " + a + "\t" + b+"=>"+r)
| r
| }
combOp: (a: Int, b: Int)Int
scala> val z = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)
z: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:27
scala> z.aggregate(3)(seqOP, combOp)
combOp: 3 18=>21
combOp: 21 360=>381
res20: Int = 381
1. List(1,2,3,4,5,6) is split into two partitions: (1,2,3) and (4,5,6).
2. seqOp is applied within partition (1,2,3):
3 (initial value) * 1 => 3
3 (previous result) * 2 => 6
6 * 3 => 18
seqOp is applied within partition (4,5,6):
3 (initial value) * 4 => 12
12 (previous result) * 5 => 60
60 * 6 => 360
3. combOp merges the partition results:
3 (initial value) + 18 (partition result) => 21
21 (previous result) + 360 (partition result) => 381
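The three steps above can be reproduced with plain Scala collections (a simulation of the two phases, not actual Spark code). Note that the zeroValue (3) is folded in once per partition by seqOp and then once more by combOp:

```scala
// The two partitions from the example.
val partitions = List(List(1, 2, 3), List(4, 5, 6))

// Phase 1 (seqOp): multiply within each partition, starting from zeroValue = 3.
val perPartition = partitions.map(_.foldLeft(3)(_ * _))
// perPartition = List(18, 360)

// Phase 2 (combOp): add the partition results, again starting from zeroValue = 3.
val result = perPartition.foldLeft(3)(_ + _)
// result = 381, matching the REPL output above
```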
1. The seqOp and combOp functions should be commutative and associative, since Spark makes no guarantee about the order in which elements and partition results are combined.
2. As the definition of aggregate shows, combOp's output type must be the same as its input type (both are U).
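A small plain-Scala illustration of the first point (a hypothetical sketch, not from the original post): string concatenation is associative but not commutative, so using it as a combOp would make the result depend on the order in which partition results happen to arrive:

```scala
// Pretend these are the per-partition results, in two different arrival orders.
val combOp = (a: String, b: String) => a + b   // associative, NOT commutative

val orderA = List("18", "360").foldLeft("3")(combOp)
// orderA = "318360"
val orderB = List("360", "18").foldLeft("3")(combOp)
// orderB = "336018" -- a different result for the same data
```

With a commutative and associative combOp such as addition, both orders would yield the same value.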
Reference: http://www.iteblog.com/archives/1268