[Spark Java API] Transformation (11): reduceByKey, foldByKey



Merge the values for each key using an associative reduce function. 
This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.


def reduceByKey(partitioner: Partitioner, func: JFunction2[V, V, V]): JavaPairRDD[K, V]

def reduceByKey(func: JFunction2[V, V, V], numPartitions: Int): JavaPairRDD[K, V]


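  • func: the user-defined reduce function that merges two values of the same key;
  • partitioner: the partitioner;
  • numPartitions: the number of partitions; the default partitioner is HashPartitioner.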
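As a rough illustration of what a hash-based partitioner does, the sketch below maps a key to a partition index by taking the key's hashCode modulo the number of partitions and keeping the result non-negative, which is the idea behind HashPartitioner. It is plain Java with no Spark dependency; the class name HashPartitionSketch and the method partitionFor are hypothetical names chosen for this example.

```java
// Sketch of hash partitioning in plain Java (no Spark dependency).
// HashPartitionSketch and partitionFor are hypothetical names for illustration.
final class HashPartitionSketch {

    // Maps a key to a partition index in [0, numPartitions).
    // Math.floorMod keeps the result non-negative even when hashCode() is negative,
    // whereas the % operator can return a negative remainder.
    static int partitionFor(Object key, int numPartitions) {
        if (key == null) {
            return 0; // null keys are sent to partition 0
        }
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        for (Object key : new Object[] {1, 2, 4, -7, "spark"}) {
            System.out.println(key + " -> partition " + partitionFor(key, numPartitions));
        }
        // For comparison: the plain remainder of a negative hash is negative.
        System.out.println("-7 % 3 = " + (-7 % 3));
    }
}
```

A custom Partitioner has to respect the same contract: getPartition must return a value between 0 and numPartitions - 1 for every key.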


def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKey[V]((v: V) => v, func, func, partitioner)
}

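As the source shows, reduceByKey() is implemented on top of combineByKey(): createCombiner is a plain identity conversion, while mergeValue and mergeCombiners are the same, both being the user-supplied function. reduceByKey() corresponds to classic MapReduce, and its data flow is essentially the same as the data flow in Hadoop. combineByKey() enables combine() on the map side, so reduceByKey() also enables map-side combine() by default. Before the shuffle, a mapPartitions operation performs the combine and produces a MapPartitionsRDD; the shuffle then produces a ShuffledRDD; finally the reduce step (implemented with aggregate + mapPartitions()) produces another MapPartitionsRDD.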
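The flow described above can be imitated in plain Java to make the roles of the three combineByKey() arguments concrete. This is only a simulation under simplifying assumptions: partitions are modelled as in-memory lists, the shuffle is a simple merge of per-partition maps, and the names ReduceByKeySketch, pair, combinePartition and reduceByKey are hypothetical helpers, not Spark APIs.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Plain-Java simulation of reduceByKey with map-side combine (no Spark dependency).
final class ReduceByKeySketch {

    // Helper that builds a (key, value) pair.
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Map-side combine for one partition: the first value seen for a key is kept as-is
    // (createCombiner is the identity), later values are merged with func (mergeValue).
    static Map<String, Integer> combinePartition(List<Map.Entry<String, Integer>> partition,
                                                 BinaryOperator<Integer> func) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> p : partition) {
            combined.merge(p.getKey(), p.getValue(), func);
        }
        return combined;
    }

    // Simulated shuffle and reduce: per-partition results are merged
    // with the same func (mergeCombiners).
    static Map<String, Integer> reduceByKey(List<List<Map.Entry<String, Integer>>> partitions,
                                            BinaryOperator<Integer> func) {
        Map<String, Integer> result = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> partition : partitions) {
            combinePartition(partition, func).forEach((k, v) -> result.merge(k, v, func));
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Map.Entry<String, Integer>>> partitions = new ArrayList<>();
        partitions.add(Arrays.asList(pair("a", 1), pair("b", 1), pair("a", 1)));
        partitions.add(Arrays.asList(pair("a", 1), pair("b", 1)));
        System.out.println(reduceByKey(partitions, Integer::sum)); // prints {a=3, b=2}
    }
}
```

Because the same function is applied both inside a partition and across partitions, it has to be associative for the result to be independent of how the data is split.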


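List<Integer> data = Arrays.asList(1, 2, 4, 3, 5, 6, 7);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);

// Turn each element into a (value, 1) pair
JavaPairRDD<Integer, Integer> javaPairRDD = javaRDD.mapToPair(new PairFunction<Integer, Integer, Integer>() {
    public Tuple2<Integer, Integer> call(Integer integer) throws Exception {
      return new Tuple2<Integer, Integer>(integer, 1);
    }
});

// Default partitioning
JavaPairRDD<Integer, Integer> reduceByKeyRDD = javaPairRDD.reduceByKey(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer v1, Integer v2) throws Exception {
      return v1 + v2;
    }
});

// Explicit numPartitions (the value 2 is an assumption; the argument is missing from the original listing)
JavaPairRDD<Integer, Integer> reduceByKeyRDD2 = javaPairRDD.reduceByKey(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer v1, Integer v2) throws Exception {
      return v1 + v2;
    }
}, 2);

// Custom partitioner
JavaPairRDD<Integer, Integer> reduceByKeyRDD4 = javaPairRDD.reduceByKey(new Partitioner() {
      public int numPartitions() {    return 2;    }
      public int getPartition(Object o) {
        // floorMod keeps the index non-negative even when hashCode() is negative
        return Math.floorMod(o.toString().hashCode(), numPartitions());
      }
}, new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer v1, Integer v2) throws Exception {
      return v1 + v2;
    }
});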



Merge the values for each key using an associative function and a neutral "zero value" which 
may be added to the result an arbitrary number of times, and must not change the result 
(e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication).


def foldByKey(zeroValue: V, partitioner: Partitioner, func: JFunction2[V, V, V]): JavaPairRDD[K, V]

def foldByKey(zeroValue: V, numPartitions: Int, func: JFunction2[V, V, V]): JavaPairRDD[K, V]

def foldByKey(zeroValue: V, func: JFunction2[V, V, V]): JavaPairRDD[K, V]


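  • zeroValue: the initial value;
  • numPartitions: the number of partitions; the default partitioner is HashPartitioner;
  • partitioner: the partitioner;
  • func: the user-defined function that merges two values of the same key.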


def foldByKey(zeroValue: V, partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    // Serialize the zero value to a byte array so that we can get a new clone of it on each key
    val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
    val zeroArray = new Array[Byte](zeroBuffer.limit)
    // Copy the serialized bytes into zeroArray
    zeroBuffer.get(zeroArray)
    // When deserializing, use a lazy val to create just one instance of the serializer per task
    lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
    val createZero = () => cachedSerializer.deserialize[V](ByteBuffer.wrap(zeroArray))
    val cleanedFunc = self.context.clean(func)
    combineByKey[V]((v: V) => cleanedFunc(createZero(), v), cleanedFunc, cleanedFunc, partitioner)
}
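In the source above, createCombiner is func(zero, v), so the zero value is folded in once for each key in each partition that holds that key. The plain-Java simulation below tries to show the consequence: a zero value that is not neutral (such as "X" with a ":"-joining function) gives results that depend on how the data is partitioned, while a neutral one does not. Partitions are modelled as in-memory lists, and FoldByKeySketch, pair and foldByKey are hypothetical names for this sketch, not Spark APIs.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Plain-Java simulation of foldByKey (no Spark dependency).
final class FoldByKeySketch {

    // Helper that builds a (key, value) pair.
    static Map.Entry<Integer, String> pair(int key, String value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    static Map<Integer, String> foldByKey(String zero,
                                          List<List<Map.Entry<Integer, String>>> partitions,
                                          BinaryOperator<String> func) {
        Map<Integer, String> result = new TreeMap<>();
        for (List<Map.Entry<Integer, String>> partition : partitions) {
            Map<Integer, String> local = new TreeMap<>();
            for (Map.Entry<Integer, String> p : partition) {
                String current = local.get(p.getKey());
                // First value for this key in this partition: createCombiner = func(zero, v).
                // Later values in the same partition: mergeValue = func(current, v).
                local.put(p.getKey(), current == null
                        ? func.apply(zero, p.getValue())
                        : func.apply(current, p.getValue()));
            }
            // Across partitions: mergeCombiners = func
            local.forEach((k, v) -> result.merge(k, v, func));
        }
        return result;
    }

    public static void main(String[] args) {
        BinaryOperator<String> join = (v1, v2) -> v1 + ":" + v2;
        List<List<Map.Entry<Integer, String>>> onePartition =
                Arrays.asList(Arrays.asList(pair(1, "a"), pair(1, "b")));
        List<List<Map.Entry<Integer, String>>> twoPartitions =
                Arrays.asList(Arrays.asList(pair(1, "a")), Arrays.asList(pair(1, "b")));

        // "X" is not neutral for join, so the result depends on the partitioning.
        System.out.println(foldByKey("X", onePartition, join));  // prints {1=X:a:b}
        System.out.println(foldByKey("X", twoPartitions, join)); // prints {1=X:a:X:b}

        // "" is neutral for plain concatenation, so both layouts agree.
        BinaryOperator<String> concat = (v1, v2) -> v1 + v2;
        System.out.println(foldByKey("", onePartition, concat));  // prints {1=ab}
        System.out.println(foldByKey("", twoPartitions, concat)); // prints {1=ab}
    }
}
```

This is why the zero value is described as something that may be added an arbitrary number of times: how often it is applied is decided by the partitioning, not by the caller.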



List<Integer> data = Arrays.asList(1, 2, 4, 3, 5, 6, 7, 1, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);
final Random rand = new Random(10);
// Pair each element with a random digit rendered as a String
JavaPairRDD<Integer, String> javaPairRDD = javaRDD.mapToPair(new PairFunction<Integer, Integer, String>() {
    public Tuple2<Integer, String> call(Integer integer) throws Exception {
      return new Tuple2<Integer, String>(integer, Integer.toString(rand.nextInt(10)));
    }
});

// Default partitioning
JavaPairRDD<Integer, String> foldByKeyRDD = javaPairRDD.foldByKey("X", new Function2<String, String, String>() {
    public String call(String v1, String v2) throws Exception {
      return v1 + ":" + v2;
    }
});
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + foldByKeyRDD.collect());

// Explicit numPartitions
JavaPairRDD<Integer, String> foldByKeyRDD1 = javaPairRDD.foldByKey("X", 2, new Function2<String, String, String>() {
    public String call(String v1, String v2) throws Exception {
      return v1 + ":" + v2;
    }
});
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + foldByKeyRDD1.collect());

// Custom partitioner
JavaPairRDD<Integer, String> foldByKeyRDD2 = javaPairRDD.foldByKey("X", new Partitioner() {
    public int numPartitions() {        return 3;    }
    public int getPartition(Object key) {
      // floorMod keeps the index non-negative even when hashCode() is negative
      return Math.floorMod(key.toString().hashCode(), numPartitions());
    }
}, new Function2<String, String, String>() {
    public String call(String v1, String v2) throws Exception {
      return v1 + ":" + v2;
    }
});
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + foldByKeyRDD2.collect());
