Spark Operators [07]: reduce, reduceByKey, count, countByKey

The operators reduce, reduceByKey, count, and countByKey fall into two categories:

Actions: reduce, count, countByKey
Transformations: reduceByKey


1. reduce

reduce(func) operates on a JavaRDD (or an RDD in Scala).
It aggregates the elements of the RDD using the function func, which takes two arguments and returns one. The function should be commutative and associative so that the result can be computed correctly in parallel.

Scala version

val rdd1 = sc.parallelize(List("a", "b", "b", "c"))
val res = rdd1.reduce(_ + "-" + _)

// res: String = b-c-a-b  (element order depends on partitioning)

Java version

JavaRDD<String> rdd1 = sc.parallelize(Arrays.asList("a", "b", "b", "c"));
String res = rdd1.reduce(new Function2<String, String, String>() {
    @Override
    public String call(String v1, String v2) throws Exception {
        return v1 +"-"+v2;
    }
});
System.out.println(res);

// output: b-c-a-b
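The commutative-and-associative contract matters in practice: Spark may combine elements within and across partitions in any order. A small plain-Java sketch of why (using java.util.stream rather than Spark, purely for illustration):

```java
import java.util.Arrays;
import java.util.List;

public class ReduceContract {
    public static void main(String[] args) {
        List<Integer> nums = Arrays.asList(1, 2, 3, 4);

        // Addition is commutative and associative, so sequential and
        // parallel reductions agree on the same result.
        int seq = nums.stream().reduce(0, Integer::sum);
        int par = nums.parallelStream().reduce(0, Integer::sum);
        System.out.println(seq + " == " + par); // 10 == 10

        // Subtraction is neither commutative nor associative: different
        // groupings give different answers, so reducing it across
        // partitions in parallel would be non-deterministic.
        System.out.println(((1 - 2) - 3) - 4); // -8
        System.out.println((1 - 2) - (3 - 4)); // 0
    }
}
```

Note that string concatenation, as in the example above, is associative but not commutative, which is why the reduced string came back as b-c-a-b rather than in input order.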

2. reduceByKey

reduceByKey(func, [numTasks]) operates on a JavaPairRDD.
For an RDD of (K, V) pairs, it aggregates the values of each key K with the given reduce function func and returns an RDD of (K, V) pairs. The optional second argument sets the number of reduce tasks.

Scala version

val scoreList = Array(
  Tuple2("class1", 90),
  Tuple2("class1", 60),
  Tuple2("class2", 60),
  Tuple2("class2", 50)
)
val scoreRdd = sc.parallelize(scoreList)
val resRdd = scoreRdd.reduceByKey(_ + _)
resRdd.foreach(res => println(res._1 + ":" + res._2))

// output:
class1:150
class2:110

Java version

List<Tuple2<String, Integer>> scoreList = Arrays.asList(
        new Tuple2<String, Integer>("class1", 90),
        new Tuple2<String, Integer>("class2", 60),
        new Tuple2<String, Integer>("class1", 60),
        new Tuple2<String, Integer>("class2", 50)
);

// Parallelize the collection into a JavaPairRDD; note that parallelizePairs is used here
JavaPairRDD<String, Integer> scoreRdd = sc.parallelizePairs(scoreList);
// Sum the values for each key
JavaPairRDD<String, Integer> resRdd = scoreRdd.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
});

// Print the results
resRdd.foreach(new VoidFunction<Tuple2<String, Integer>>() {
    @Override
    public void call(Tuple2<String, Integer> tuple2) throws Exception {
        System.out.println(tuple2._1() + ":" + tuple2._2());
    }
});
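For intuition, the per-key folding that reduceByKey performs can be sketched on a single machine with plain Java collections. This illustrates only the semantics, not Spark itself; the real operator also shuffles data so that all values for a key meet on one partition:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceByKeySketch {
    public static void main(String[] args) {
        List<SimpleEntry<String, Integer>> scores = Arrays.asList(
                new SimpleEntry<>("class1", 90),
                new SimpleEntry<>("class2", 60),
                new SimpleEntry<>("class1", 60),
                new SimpleEntry<>("class2", 50));

        // Map.merge applies the reduce function (Integer::sum) whenever a
        // key is seen again -- the same per-key folding reduceByKey does.
        // A TreeMap is used only to get deterministic print order.
        Map<String, Integer> reduced = new TreeMap<>();
        for (SimpleEntry<String, Integer> e : scores) {
            reduced.merge(e.getKey(), e.getValue(), Integer::sum);
        }

        System.out.println(reduced); // {class1=150, class2=110}
    }
}
```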

3. count

count() returns the number of elements in the RDD.

Scala version

val rdd1 = sc.parallelize(List("a", "b", "b", "c"))
val res = rdd1.count

// res: Long = 4

Java version

List<Tuple2<String, Integer>> scoreList = Arrays.asList(
        new Tuple2<String, Integer>("class1", 90),
        new Tuple2<String, Integer>("class2", 60),
        new Tuple2<String, Integer>("class1", 60),
        new Tuple2<String, Integer>("class2", 50)
);

// Parallelize the collection into a JavaPairRDD; note that parallelizePairs is used here
JavaPairRDD<String, Integer> scoreRdd = sc.parallelizePairs(scoreList);

long count = scoreRdd.count();
System.out.println(count);

// output: 4

4. countByKey

countByKey() is available only on RDDs of (K, V) pairs. It returns a Map[K, Long] giving the count of elements for each key.

Scala version

val scoreList = Array(
  Tuple2("class1", 90),
  Tuple2("class1", 60),
  Tuple2("class2", 60),
  Tuple2("class2", 50)
)
val scoreRdd = sc.parallelize(scoreList)
val res = scoreRdd.countByKey

// res: scala.collection.Map[String,Long] = Map(class2 -> 2, class1 -> 2)

Java version

List<Tuple2<String, Integer>> scoreList = Arrays.asList(
        new Tuple2<String, Integer>("class1", 90),
        new Tuple2<String, Integer>("class2", 60),
        new Tuple2<String, Integer>("class1", 60),
        new Tuple2<String, Integer>("class2", 50)
);

JavaPairRDD<String, Integer> scoreRdd = sc.parallelizePairs(scoreList);
Map<String,Long> res = scoreRdd.countByKey();

System.out.println(res);

// output: {class1=2, class2=2}
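countByKey is effectively "map every value to 1, then reduce by key with addition", with the result collected back to the driver as an ordinary Map, so it is only suitable when the number of distinct keys is small. A plain-Java sketch of that equivalence (illustration only, not Spark code):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CountByKeySketch {
    public static void main(String[] args) {
        List<String> keys = Arrays.asList("class1", "class2", "class1", "class2");

        // Each occurrence contributes 1L, and occurrences of the same key
        // are combined with Long::sum -- i.e. mapValues(_ => 1L) followed
        // by reduceByKey(_ + _).
        Map<String, Long> counts = new TreeMap<>();
        for (String k : keys) {
            counts.merge(k, 1L, Long::sum);
        }

        System.out.println(counts); // {class1=2, class2=2}
    }
}
```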
