groupByKey and reduceByKey are two commonly used aggregation operators; both operate on a PairRDD.
Scala
reduceByKey function signature
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
We can see that this overload takes two parameters: the first is the partitioner, and the second is the function applied to the values. If you call the overload that takes no partitioner, the default HashPartitioner is used; a sketch of the explicit-partitioner overload follows the example below.
reduceByKey example
val conf = new SparkConf().setAppName("jiangtao_demo").setMaster("local")
val sc = new SparkContext(conf)
val data = sc.makeRDD(List("pandas","numpy","pip","pip","pip"))
// map each word to a (word, 1) pair
val dataPair = data.map((_,1))
//reduceByKey
val result1 = dataPair.reduceByKey(_+_)
// or, equivalently:
// val result2 = dataPair.reduceByKey((x,y)=>(x+y))
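To control the partitioning explicitly, you can call the overload shown in the signature above and pass a Partitioner yourself. A minimal sketch, reusing dataPair from the example; the partition count of 4 is an arbitrary choice for illustration:

import org.apache.spark.HashPartitioner
// Same word count, but with an explicit HashPartitioner controlling the shuffle.
val resultPartitioned = dataPair.reduceByKey(new HashPartitioner(4), _ + _)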
groupByKey function signature
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
We can see that groupByKey's input parameter is the partitioner; when it is omitted, the default HashPartitioner is used. The return type is RDD[(K, Iterable[V])].
In fact, both reduceByKey and groupByKey are implemented on top of combineByKeyWithClassTag. The key difference: reduceByKey first merges values within each partition and only then shuffles, whereas groupByKey shuffles the raw records directly, so in terms of efficiency reduceByKey is the better choice. Looking at combineByKeyWithClassTag below, notice the mapSideCombine parameter: it controls whether a per-partition aggregation is performed before the shuffle. reduceByKey leaves it at its default of true, while groupByKey, as shown above, explicitly passes false.
def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("Default partitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](
    self.context.clean(createCombiner),
    self.context.clean(mergeValue),
    self.context.clean(mergeCombiners))
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitions(iter => {
      val context = TaskContext.get()
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, C](self, partitioner)
      .setSerializer(serializer)
      .setAggregator(aggregator)
      .setMapSideCombine(mapSideCombine)
  }
}
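To make the relationship concrete, here is a minimal sketch that reproduces the word count from the example above through combineByKey, the public wrapper over combineByKeyWithClassTag; it reuses the dataPair RDD defined earlier:

// Word count via combineByKey. The three functions correspond one-to-one to
// createCombiner, mergeValue, and mergeCombiners in the source above.
val viaCombine = dataPair.combineByKey(
  (v: Int) => v,                  // createCombiner: the first value seen for a key
  (c: Int, v: Int) => c + v,      // mergeValue: fold a value into the per-partition combiner
  (c1: Int, c2: Int) => c1 + c2   // mergeCombiners: merge combiners across partitions
)
viaCombine.collect().foreach(println) // (pip,3), (pandas,1), (numpy,1), in some order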
groupByKey example
val conf = new SparkConf().setAppName("jiangtao_demo").setMaster("local")
val sc = new SparkContext(conf)
val data = sc.makeRDD(List("pandas","numpy","pip","pip","pip"))
// map each word to a (word, 1) pair
val dataPair = data.map((_,1))
//groupByKey
val result3 = dataPair.groupByKey()
// at this point result3 is an RDD[(String, Iterable[Int])]; the code below sums each x._2
val result4 = result3.map(x=>(x._1,x._2.sum))
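A slightly more idiomatic way to write the same summation is mapValues, which leaves the key alone and, unlike map, preserves the parent RDD's partitioner:

// Equivalent to result4: sum the Iterable[Int] of counts for each key.
val result5 = result3.mapValues(_.sum)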
================================================================
Java
reduceByKey example
SparkConf conf = new SparkConf().setAppName("jiangtao_demo").setMaster("local");
JavaSparkContext jsc = new JavaSparkContext(conf);
// create a JavaRDD from a parallelized collection
JavaRDD<String> lines = jsc.parallelize(Arrays.asList("pandas","numpy","pip","pip","pip"));
JavaPairRDD<String,Integer> mapToPairResult = lines.mapToPair(new PairFunction<String,String,Integer>() {
    @Override
    public Tuple2<String,Integer> call(String o) throws Exception {
        Tuple2<String,Integer> tuple2 = new Tuple2<>(o, 1);
        //System.out.println(tuple2._1()+":"+tuple2._2());
        return tuple2;
    }
});
// reduceByKey: count the occurrences of each word
JavaPairRDD<String,Integer> reduceByKeyResult = mapToPairResult.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer i1, Integer i2) throws Exception {
        return i1 + i2;
    }
});
groupByKey example
SparkConf conf = new SparkConf().setAppName("jiangtao_demo").setMaster("local");
JavaSparkContext jsc = new JavaSparkContext(conf);
// create a JavaRDD from a parallelized collection
JavaRDD<String> lines = jsc.parallelize(Arrays.asList("pandas","numpy","pip","pip","pip"));
JavaPairRDD<String,Integer> mapToPairResult = lines.mapToPair(new PairFunction<String,String,Integer>() {
    @Override
    public Tuple2<String,Integer> call(String o) throws Exception {
        Tuple2<String,Integer> tuple2 = new Tuple2<>(o, 1);
        //System.out.println(tuple2._1()+":"+tuple2._2());
        return tuple2;
    }
});
// groupByKey: group the counts for each word
JavaPairRDD<String,Iterable<Integer>> groupByKeyResult = mapToPairResult.groupByKey();
// at this point the result is a JavaPairRDD<String,Iterable<Integer>>, e.g.
// [(pip,[1, 1, 1]), (pandas,[1]), (numpy,[1])]
System.out.println(groupByKeyResult.collect());
JavaPairRDD<String,Integer> gr = groupByKeyResult.mapToPair(new PairFunction<Tuple2<String,Iterable<Integer>>,String,Integer>() {
    @Override
    public Tuple2<String,Integer> call(Tuple2<String,Iterable<Integer>> tuple2) throws Exception {
        // sum up the counts for this key
        int sum = 0;
        Iterator<Integer> it = tuple2._2.iterator();
        while (it.hasNext()) {
            sum += it.next();
        }
        return new Tuple2<String,Integer>(tuple2._1, sum);
    }
});
System.out.println(gr.collect());
// prints [(pip,3), (pandas,1), (numpy,1)]
More common operators will be covered in follow-up posts; stay tuned. This article is original; please credit the source when reposting.