The operators reduce, reduceByKey, count, and countByKey fall into two categories:
Actions: reduce, count, countByKey
Transformations: reduceByKey
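reduce(func) operates on a JavaRDD.
It aggregates the elements of the RDD using the function func, which takes two arguments and returns one. The function should be commutative and associative so that it can be computed correctly in parallel.
Scala version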
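val rdd1 = sc.parallelize(List("a","b","b","c"))
scala> val res = rdd1.reduce(_+"-"+_)
res: String = b-c-a-b
Note that string concatenation is associative but not commutative, so the order of the elements in this result is not deterministic and may differ between runs.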
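Java version
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function2;

// sc is an existing JavaSparkContext
JavaRDD<String> rdd1 = sc.parallelize(Arrays.asList("a", "b", "b", "c"));
String res = rdd1.reduce(new Function2<String, String, String>() {
    @Override
    public String call(String v1, String v2) throws Exception {
        return v1 + "-" + v2;
    }
});
System.out.println(res);
// b-c-a-b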
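To see why reduce asks for a commutative and associative function, the following is a minimal plain-Java sketch that does not use Spark at all. It assumes a simplified model in which each partition is folded left to right and the partial results are then merged in the order they arrive. The class ReduceOrderSketch, the method reduceLikeSpark, and the arrivalOrder parameter are hypothetical names introduced only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

// Plain-Java sketch (no Spark): a simplified model of a partitioned reduce.
// All names here are hypothetical and exist only for illustration.
class ReduceOrderSketch {

    // Folds each partition left to right, then merges the partial results
    // in the given arrival order (a stand-in for task completion order).
    static <T> T reduceLikeSpark(List<List<T>> partitions,
                                 List<Integer> arrivalOrder,
                                 BinaryOperator<T> func) {
        T result = null;
        for (int index : arrivalOrder) {
            List<T> partition = partitions.get(index);
            if (partition.isEmpty()) {
                continue; // empty partitions contribute nothing
            }
            T partial = partition.stream().reduce(func).get();
            result = (result == null) ? partial : func.apply(result, partial);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> letters = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("b", "c"));
        BinaryOperator<String> join = (v1, v2) -> v1 + "-" + v2;

        // Concatenation is associative but not commutative:
        // the result depends on which partition is merged first.
        System.out.println(reduceLikeSpark(letters, Arrays.asList(0, 1), join)); // a-b-b-c
        System.out.println(reduceLikeSpark(letters, Arrays.asList(1, 0), join)); // b-c-a-b

        List<List<Integer>> numbers = Arrays.asList(
                Arrays.asList(1, 2),
                Arrays.asList(3, 4));
        // Addition is commutative and associative: the merge order does not matter.
        System.out.println(reduceLikeSpark(numbers, Arrays.asList(0, 1), Integer::sum)); // 10
        System.out.println(reduceLikeSpark(numbers, Arrays.asList(1, 0), Integer::sum)); // 10
    }
}
```

With the concatenation function the two merge orders give different strings, while addition gives 10 either way, which is the property reduce relies on.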
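reduceByKey(func, [numTasks]) operates on a JavaPairRDD.
Given an RDD of (K, V) pairs, it aggregates the values of each key K using the given reduce function func and returns an RDD of (K, V) pairs. The optional second argument sets the number of tasks.
Scala version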
val scoreList = Array(
  Tuple2("class1", 90),
  Tuple2("class1", 60),
  Tuple2("class2", 60),
  Tuple2("class2", 50)
)
val scoreRdd = sc.parallelize(scoreList)
val resRdd = scoreRdd.reduceByKey(_ + _)
resRdd.foreach(res => println(res._1 + ":" + res._2))
// Output:
// class1:150
// class2:110
Java version
import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

List<Tuple2<String, Integer>> scoreList = Arrays.asList(
    new Tuple2<String, Integer>("class1", 90),
    new Tuple2<String, Integer>("class2", 60),
    new Tuple2<String, Integer>("class1", 60),
    new Tuple2<String, Integer>("class2", 50)
);
// Parallelize the collection into a JavaPairRDD; note the use of parallelizePairs
JavaPairRDD<String, Integer> scoreRdd = sc.parallelizePairs(scoreList);
// Sum the values of each key
JavaPairRDD<String, Integer> resRdd = scoreRdd.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
});
// Print the result
resRdd.foreach(new VoidFunction<Tuple2<String, Integer>>() {
    @Override
    public void call(Tuple2<String, Integer> tuple2) throws Exception {
        System.out.println(tuple2._1() + ":" + tuple2._2());
    }
});
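The per-key aggregation described for reduceByKey can be sketched in plain Java without Spark. This is only an illustration of the semantics on a local list, not how Spark distributes the work. The class ReduceByKeySketch and its static method reduceByKey are hypothetical, and java.util.Map.Entry stands in for Tuple2.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Plain-Java sketch (no Spark): the per-key fold that reduceByKey expresses.
// All names here are hypothetical and exist only for illustration.
class ReduceByKeySketch {

    // Folds all values that share a key into a single value using func.
    static <K extends Comparable<K>, V> Map<K, V> reduceByKey(
            List<Map.Entry<K, V>> pairs, BinaryOperator<V> func) {
        Map<K, V> result = new TreeMap<>(); // sorted only to keep the printout stable
        for (Map.Entry<K, V> pair : pairs) {
            // The first value for a key is stored as-is; later ones are combined with func.
            result.merge(pair.getKey(), pair.getValue(), func);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> scoreList = Arrays.asList(
                new SimpleEntry<String, Integer>("class1", 90),
                new SimpleEntry<String, Integer>("class2", 60),
                new SimpleEntry<String, Integer>("class1", 60),
                new SimpleEntry<String, Integer>("class2", 50));

        System.out.println(reduceByKey(scoreList, Integer::sum)); // {class1=150, class2=110}
    }
}
```

The sketch returns a local map for convenience; the Spark operator instead returns another pair RDD, which is why it counts as a transformation.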
count() returns the number of elements in the RDD.
Scala version
val rdd1 = sc.parallelize(List("a","b","b","c"))
scala> val res = rdd1.count
res: Long = 4
Java version
import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

List<Tuple2<String, Integer>> scoreList = Arrays.asList(
    new Tuple2<String, Integer>("class1", 90),
    new Tuple2<String, Integer>("class2", 60),
    new Tuple2<String, Integer>("class1", 60),
    new Tuple2<String, Integer>("class2", 50)
);
// Parallelize the collection into a JavaPairRDD; note the use of parallelizePairs
JavaPairRDD<String, Integer> scoreRdd = sc.parallelizePairs(scoreList);
long count = scoreRdd.count();
System.out.println(count);
// 4
countByKey() is only available on RDDs of type (K, V). It returns a Map of (K, Long) pairs holding the count for each key.
Scala version
val scoreList = Array(
  Tuple2("class1", 90),
  Tuple2("class1", 60),
  Tuple2("class2", 60),
  Tuple2("class2", 50)
)
val scoreRdd = sc.parallelize(scoreList)
scala> val res = scoreRdd.countByKey
res: scala.collection.Map[String,Long] = Map(class2 -> 2, class1 -> 2)
Java version
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

List<Tuple2<String, Integer>> scoreList = Arrays.asList(
    new Tuple2<String, Integer>("class1", 90),
    new Tuple2<String, Integer>("class2", 60),
    new Tuple2<String, Integer>("class1", 60),
    new Tuple2<String, Integer>("class2", 50)
);
JavaPairRDD<String, Integer> scoreRdd = sc.parallelizePairs(scoreList);
Map<String, Long> res = scoreRdd.countByKey();
System.out.println(res);
// {class1=2, class2=2}
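As a rough illustration of what countByKey computes, here is a plain-Java sketch that does not use Spark. It counts how many pairs carry each key and ignores the values entirely. The class CountByKeySketch and its static method countByKey are hypothetical, and java.util.Map.Entry stands in for Tuple2.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch (no Spark): counting occurrences of each key.
// All names here are hypothetical and exist only for illustration.
class CountByKeySketch {

    // Counts the pairs per key; the values are never read.
    static <K extends Comparable<K>, V> Map<K, Long> countByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, Long> counts = new TreeMap<>(); // sorted only to keep the printout stable
        for (Map.Entry<K, V> pair : pairs) {
            counts.merge(pair.getKey(), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> scoreList = Arrays.asList(
                new SimpleEntry<String, Integer>("class1", 90),
                new SimpleEntry<String, Integer>("class2", 60),
                new SimpleEntry<String, Integer>("class1", 60),
                new SimpleEntry<String, Integer>("class2", 50));

        System.out.println(countByKey(scoreList)); // {class1=2, class2=2}
    }
}
```

Like the Spark operator, the sketch hands back an ordinary map rather than a distributed collection, which matches countByKey being an action.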