Contents:
1. Basic TOP N algorithm in practice;
2. Grouped TOP N algorithm in practice;
3. RangePartitioner internals;
Social networks, news feeds, and many other applications care a great deal about TOP N results.
The difference between take and top N: take simply grabs the first few elements, while top N involves sorting and possibly more complex algorithms, such as grouped top N.
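A minimal sketch of that difference, assuming an existing SparkContext named sc (the data is made up purely for illustration):

val nums = sc.parallelize(Seq(3, 7, 1, 9, 5), numSlices = 2)
val firstThree = nums.take(3) // Array(3, 7, 1): just the first elements encountered, no ordering involved
val topThree = nums.top(3)    // Array(9, 7, 5): the three largest values via the implicit Ordering, in descending order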
========== Basic TOP N Algorithm ============
The basic approach: sort the data first, then take the first N elements.
package com.dt.spark.cores

import org.apache.spark.{SparkContext, SparkConf}

/**
 * Basic TOP N example.
 * Created by 威 on 2016/2/9.
 */
object TopNBasic {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()      // create the SparkConf object
    conf.setAppName("TopNBasic")    // set the application name, visible in the monitoring UI while the program runs
    conf.setMaster("local")         // run locally, no Spark cluster installation needed
    val sc = new SparkContext(conf) // create the SparkContext, passing in the SparkConf to customize Spark's runtime parameters and configuration
    val lines = sc.textFile("F:/exepro/scala/TopNBasic.txt") // read each line of the file via HadoopRDD and MapPartitionsRDD
    sc.setLogLevel("ERROR")
    val pairs = lines.map(line => (line.toInt, line)) // build (key, value) pairs so that sortByKey can sort them
    val sortedPairs = pairs.sortByKey(false)          // sort in descending order
    val sortedData = sortedPairs.map(pair => pair._2) // keep only the sorted content, dropping the keys
    val top5 = sortedData.take(5)                     // take the top 5 elements, returned as an Array
    top5.foreach(println)
    sc.stop()
  }
}
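Note that this pattern shuffles and sorts the entire dataset just to keep 5 elements. When only the N largest values are needed, a sketch like the following (same assumptions about sc and the input file) produces the same result with RDD.top, which only retains N candidates per partition instead of performing a full global sort:

val lines = sc.textFile("F:/exepro/scala/TopNBasic.txt")
val top5 = lines.map(_.toInt).top(5) // the 5 largest values, already in descending order
top5.foreach(println)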
========== Grouped TOP N Algorithm ============
Find the TOP N elements within each type (group) of data.
package com.dt.spark.SparkApps.cores;
import java.util.Arrays;
import java.util.Iterator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;
/**
 * Grouped TOP N developed in Java.
 * @author 威
 *
 */
public class TopNGroup {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("TopNGroup").setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf); // under the hood this is just Scala's SparkContext
    JavaRDD<String> lines = sc.textFile("F:/exepro/scala/TopNGroup.txt");
    // turn each line into a key-value pair of the required shape
    JavaPairRDD<String, Integer> pairs = lines.mapToPair(new PairFunction<String, String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(String line) throws Exception {
        String[] splitedLine = line.split(" ");
        return new Tuple2<String, Integer>(splitedLine[0], Integer.valueOf(splitedLine[1]));
      }
    });
    JavaPairRDD<String, Iterable<Integer>> groupedPairs = pairs.groupByKey();
    JavaPairRDD<String, Iterable<Integer>> top5 = groupedPairs.mapToPair(
        new PairFunction<Tuple2<String, Iterable<Integer>>, String, Iterable<Integer>>() {
      @Override
      public Tuple2<String, Iterable<Integer>> call(Tuple2<String, Iterable<Integer>> groupedData) throws Exception {
        Integer[] top5 = new Integer[5];                            // holds this group's top 5 values
        String groupedKey = groupedData._1;                         // the group's key
        Iterator<Integer> groupedValue = groupedData._2.iterator(); // iterator over the group's values
        while (groupedValue.hasNext()) {                            // keep looping while there is a next element
          Integer value = groupedValue.next();                      // the current element
          for (int i = 0; i < 5; i++) {
            if (top5[i] == null) {
              top5[i] = value;
              break;
            } else if (value > top5[i]) {
              for (int j = 4; j > i; j--) {
                top5[j] = top5[j - 1];
              }
              top5[i] = value;
              break;
            }
          }
        }
        return new Tuple2<String, Iterable<Integer>>(groupedKey, Arrays.asList(top5));
      }
    });
    top5.foreach(new VoidFunction<Tuple2<String, Iterable<Integer>>>() {
      @Override
      public void call(Tuple2<String, Iterable<Integer>> topped) throws Exception {
        System.out.println("Group Key:" + topped._1);         // the group key
        Iterator<Integer> topValue = topped._2.iterator();    // iterator over the group's top values
        while (topValue.hasNext()) {
          Integer value = topValue.next();
          System.out.println(value);
        }
        System.out.println("**********************************");
      }
    });
    sc.close();
  }
}
Output:
16/02/09 22:17:18 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 9 ms
Group Key:Spark
100
99
98
95
91
**********************************
Group Key:Hadoop
99
98
97
95
69
**********************************
16/02/09 22:17:19 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1165 bytes result sent to driver
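A caveat on the groupByKey approach above: every value of a group is materialized in memory before its top 5 are selected, which can be costly when a single group is very large. As a rough alternative sketch in Scala (not the code used in this lesson, and assuming the same space-separated "key value" input format), aggregateByKey keeps only a bounded list of at most 5 values per key while the data is still distributed:

val lines = sc.textFile("F:/exepro/scala/TopNGroup.txt")
val pairs = lines.map(line => (line.split(" ")(0), line.split(" ")(1).toInt))
val top5PerKey = pairs.aggregateByKey(List.empty[Int])(
  (buf, v) => (v :: buf).sortWith(_ > _).take(5), // fold a value into the per-partition buffer, keeping the 5 largest
  (b1, b2) => (b1 ++ b2).sortWith(_ > _).take(5)  // merge buffers coming from different partitions
)
top5PerKey.collect().foreach { case (key, values) =>
  println("Group Key:" + key)
  values.foreach(println)
}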
========== RangePartitioner Internals ============
RangePartitioner splits the data of the parent RDD into different ranges; the crucial point is that these ranges are ordered relative to one another.
This is essentially the classic Google interview question: how do you sort data whose overall size is not known in advance?
/**
* Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
* `collect` or `save` on the resulting RDD will return or output an ordered list of records
* (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
* order of the keys).
*/
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
: RDD[(K, V)] = self.withScope
{
val part = new RangePartitioner(numPartitions, self, ascending)
new ShuffledRDD[K, V, V](self, part)
.setKeyOrdering(if (ascending) ordering else ordering.reverse)
}
Before version 1.1, sortByKey relied on HashPartitioner, which caused data skew: individual partitions could be very uneven, and in the extreme case a single partition could hold all of the RDD's data.
The current implementation uses RangePartitioner; besides being the cornerstone of the ordered result, its most important property is that it tries to keep the amount of data in every partition roughly even.
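As a small illustrative sketch (not part of the lesson code), RangePartitioner can also be applied explicitly to any pair RDD through partitionBy, producing ordered and roughly balanced key ranges:

import org.apache.spark.RangePartitioner

val pairs = sc.parallelize(1 to 100).map(i => (i, i.toString))
val ranged = pairs.partitionBy(new RangePartitioner(4, pairs)) // 4 ordered, roughly even key ranges
ranged.mapPartitionsWithIndex { (idx, iter) =>
  Iterator((idx, iter.size)) // report each partition index and how many records landed in it
}.collect().foreach(println)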
// An array of upper bounds for the first (partitions - 1) partitions
private var rangeBounds: Array[K] = {
if (partitions <= 1) {
Array.empty
} else {
// This is the sample size we need to have roughly balanced output partitions, capped at 1M.
val sampleSize = math.min(20.0 * partitions, 1e6)
// Assume the input partitions are roughly balanced and over-sample a little bit.
val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.size).toInt // over-sample a bit: small datasets still yield enough samples, and extremely large partitions can be re-sampled later
val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
if (numItems == 0L) {
Array.empty
} else {
// If a partition contains much more than the average number of items, we re-sample from it
// to ensure that enough items are collected from that partition.
val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
val candidates = ArrayBuffer.empty[(K, Float)]
val imbalancedPartitions = mutable.Set.empty[Int]
sketched.foreach { case (idx, n, sample) =>
if (fraction * n > sampleSizePerPartition) {
imbalancedPartitions += idx
} else {
// The weight is 1 over the sampling probability.
val weight = (n.toDouble / sample.size).toFloat
for (key <- sample) {
candidates += ((key, weight))
}
}
}
if (imbalancedPartitions.nonEmpty) {
// Re-sample imbalanced partitions with the desired sampling probability.
val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
val seed = byteswap32(-rdd.id - 1)
val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
val weight = (1.0 / fraction).toFloat
candidates ++= reSampled.map(x => (x, weight))
}
RangePartitioner.determineBounds(candidates, partitions)
}
}
}
def numPartitions: Int = rangeBounds.length + 1
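rangeBounds stores the upper bounds of the first (partitions - 1) partitions, which is why numPartitions is rangeBounds.length + 1. A simplified sketch of how a key is then routed to a partition (Spark's actual getPartition also switches to binary search when there are many partitions):

def partitionFor[K](key: K, rangeBounds: Array[K])(implicit ord: Ordering[K]): Int = {
  var partition = 0
  // advance past every upper bound that the key is greater than
  while (partition < rangeBounds.length && ord.gt(key, rangeBounds(partition))) {
    partition += 1
  }
  partition // keys above the last bound fall into the final partition
}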
/**
* Sketches the input RDD via reservoir sampling on each partition.
*
* @param rdd the input RDD to sketch
* @param sampleSizePerPartition max sample size per partition
* @return (total number of items, an array of (partitionId, number of items, sample))
*/
def sketch[K : ClassTag](
rdd: RDD[K],
sampleSizePerPartition: Int): (Long, Array[(Int, Long, Array[K])]) = {
val shift = rdd.id
// val classTagK = classTag[K] // to avoid serializing the entire partitioner object
val sketched = rdd.mapPartitionsWithIndex { (idx, iter) =>
val seed = byteswap32(idx ^ (shift << 16))
val (sample, n) = SamplingUtils.reservoirSampleAndCount(
iter, sampleSizePerPartition, seed)
Iterator((idx, n, sample))
}.collect()
val numItems = sketched.map(_._2).sum
(numItems, sketched)
}
This sampling is based on the reservoir sampling algorithm.
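For reference, here is a minimal sketch of reservoir sampling (the textbook algorithm, not Spark's SamplingUtils code): it keeps a uniform random sample of fixed size k from an iterator whose total length is unknown until it has been fully consumed, and also returns that total count, mirroring what reservoirSampleAndCount provides.

import scala.util.Random
import scala.reflect.ClassTag

def reservoirSample[T: ClassTag](input: Iterator[T], k: Int, seed: Long = 42L): (Array[T], Long) = {
  val rand = new Random(seed)
  val reservoir = new Array[T](k)
  var n = 0L
  for (item <- input) {
    if (n < k) {
      reservoir(n.toInt) = item // fill the reservoir with the first k items
    } else {
      val j = (rand.nextDouble() * (n + 1)).toLong
      if (j < k) reservoir(j.toInt) = item // replace a random slot with probability k / (n + 1)
    }
    n += 1
  }
  (if (n < k) reservoir.take(n.toInt) else reservoir, n)
}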
This marks the end of the introductory, hands-on part of the Spark course.
If you have truly mastered everything so far, you can consider yourself past the Spark beginner stage.
Next we will move on to internal mechanisms and performance tuning.
Homework:
Write the grouped TOP N in Scala, and sort the keys as well.
package com.dt.spark.cores

import org.apache.spark.{SparkContext, SparkConf}

/**
 * Created by 威 on 2016/2/10.
 */
object TopNGroup {
  def main(args: Array[String]) {
    val conf = new SparkConf()      // create the SparkConf object
    conf.setAppName("TopNGroup")    // set the application name, visible in the monitoring UI while the program runs
    conf.setMaster("local")         // run locally, no Spark cluster installation needed
    val sc = new SparkContext(conf) // create the SparkContext, passing in the SparkConf to customize Spark's runtime parameters and configuration
    val lines = sc.textFile("F:/exepro/scala/TopNGroup.txt") // read each line of the file via HadoopRDD and MapPartitionsRDD
    val groupRDD = lines.map(line => (line.split(" ")(0), line.split(" ")(1).toInt)).groupByKey()
    val top5 = groupRDD.map(pair => (pair._1, pair._2.toList.sortWith(_ > _).take(5))).sortByKey()
    top5.collect().foreach { pair =>
      println(pair._1 + ":")
      pair._2.foreach(println)
      println("****************")
    }
    sc.stop()
  }
}