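Finding the TopN within each group.
Find the five people (name) with the highest app click counts (click) in each province (province). Assume that each person's click count within a province has already been computed.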
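To make the goal concrete, here is a minimal non-distributed baseline on plain Scala collections. It is not the approach this post takes (it sorts every group in full), and the names `TopNBaseline` and `topNPerGroup` are hypothetical; the rows are a few records from the sample data used later in this post. It is only meant as a reference to check the distributed result against.

```scala
object TopNBaseline {
  // Baseline (hypothetical helper): group by province, sort each group by
  // click count in descending order, and keep the first n (click, name) pairs.
  def topNPerGroup(rows: Seq[(String, String, Int)], n: Int): Map[String, List[(Int, String)]] =
    rows.groupBy(_._1).map { case (province, rs) =>
      province -> rs.map(r => (r._3, r._2)).sortBy(_._1)(Ordering[Int].reverse).take(n).toList
    }

  def main(args: Array[String]): Unit = {
    // (province, name, click)
    val sample = Seq(
      ("beijing", "hehe", 38),
      ("beijing", "idea", 349),
      ("beijing", "hdfs", 208),
      ("shanghai", "spark", 8585),
      ("shanghai", "kafka", 67),
      ("shanghai", "shuaihu", 7474))
    topNPerGroup(sample, 2).foreach(println)
  }
}
```

Sorting a whole group costs more than necessary when only a handful of elements are wanted, which is what motivates the bounded priority queue.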
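Approach: the name is only an auxiliary attribute, so it is enough to find the five highest click counts in each province. First group by province, then aggregate within each group to get the top 5. Following the way the top() operator is implemented, use a priority queue with a fixed capacity (5 here): while iterating over a group, add each element to the queue, so that the queue ends up holding the required top 5. Finally, merge the priority queues computed on each partition to get the final result.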
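First build a fixed-size priority queue. This follows the implementation of org.apache.spark.util.BoundedPriorityQueue; because of access restrictions on that class, it is rewritten here as a new class: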
    import java.io.Serializable
    import java.util.{PriorityQueue => JPriorityQueue}
    import scala.collection.JavaConverters._
    import scala.collection.generic.Growable

    // Bounded priority queue that keeps only the maxSize largest elements (per ord).
    // Written against the Scala 2.11/2.12 collections API (Growable, TraversableOnce).
    class MyBoundedPriorityQueue[A](maxSize: Int)(implicit ord: Ordering[A])
      extends Iterable[A] with Growable[A] with Serializable {

      // Min-heap: the head is the smallest element currently retained.
      private val underlying = new JPriorityQueue[A](maxSize, ord)

      // Iterates in heap order, which is not sorted.
      override def iterator: Iterator[A] = underlying.iterator.asScala

      override def size: Int = underlying.size

      override def ++=(xs: TraversableOnce[A]): this.type = {
        xs.foreach { this += _ }
        this
      }

      // Add while there is room; otherwise replace the smallest element if elem is larger.
      override def +=(elem: A): this.type = {
        if (size < maxSize) {
          underlying.offer(elem)
        } else {
          maybeReplaceLowest(elem)
        }
        this
      }

      override def +=(elem1: A, elem2: A, elems: A*): this.type = {
        this += elem1 += elem2 ++= elems
      }

      override def clear(): Unit = { underlying.clear() }

      // Returns true if the element was inserted in place of the current minimum.
      private def maybeReplaceLowest(a: A): Boolean = {
        val head = underlying.peek()
        if (head != null && ord.gt(a, head)) {
          underlying.poll()
          underlying.offer(a)
        } else {
          false
        }
      }
    }
The constructor takes the capacity and an Ordering. Use += to insert a single element and ++= to insert the contents of another queue.
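The reason ++= is enough for the merge step is that the top N of a whole group equals the top N of the per-partition top-N results combined. The sketch below illustrates this with a simplified stand-in class; `BoundedQueueSketch` and `BoundedQueueDemo` are hypothetical names, the class does not extend Growable, and its `toList` returns a sorted snapshot, so it is not a drop-in replacement for the class above.

```scala
import java.util.{PriorityQueue => JPriorityQueue}

// Simplified, hypothetical stand-in for MyBoundedPriorityQueue:
// retains at most maxSize of the largest elements seen so far.
class BoundedQueueSketch[A](maxSize: Int)(implicit ord: Ordering[A]) {
  require(maxSize > 0, "maxSize must be positive")

  // Min-heap: the head is the smallest retained element.
  private val underlying = new JPriorityQueue[A](maxSize, ord)

  // Insert one element, evicting the current minimum when the queue is full.
  def +=(elem: A): this.type = {
    if (underlying.size < maxSize) {
      underlying.offer(elem)
    } else if (ord.gt(elem, underlying.peek())) {
      underlying.poll()
      underlying.offer(elem)
    }
    this
  }

  // Insert every element retained by another queue.
  def ++=(other: BoundedQueueSketch[A]): this.type = {
    other.toList.foreach(this += _)
    this
  }

  // Snapshot of the contents in descending order; the queue itself is unchanged.
  def toList: List[A] = {
    val copy = new JPriorityQueue[A](maxSize, ord)
    copy.addAll(underlying)
    var out = List.empty[A]
    while (!copy.isEmpty) out = copy.poll() :: out
    out
  }
}

object BoundedQueueDemo {
  def main(args: Array[String]): Unit = {
    // Two "partitions" of the same group, each reduced to its own top 3.
    val part1 = new BoundedQueueSketch[Int](3)
    Seq(38, 32, 74, 199).foreach(part1 += _)
    val part2 = new BoundedQueueSketch[Int](3)
    Seq(208, 19, 349, 19).foreach(part2 += _)

    part1 ++= part2
    println(part1.toList) // List(349, 208, 199)
  }
}
```

Each partial queue never holds more than maxSize elements, so the merge touches at most maxSize elements per partition regardless of how large the group is.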
The computation uses the aggregateByKey operator:
    // Orders pairs by their first component only; the second component is ignored.
    def tuple2FirstOrdering[T1, T2](implicit ord1: Ordering[T1]): Ordering[(T1, T2)] =
      new Ordering[(T1, T2)] {
        def compare(x: (T1, T2), y: (T1, T2)): Int = ord1.compare(x._1, y._1)
      }

    // (province, name, click)
    val rdd = sc.makeRDD(Seq(
      ("beijing", "hehe", 38),
      ("beijing", "lala", 32),
      ("beijing", "lele", 74),
      ("beijing", "yarn", 199),
      ("beijing", "hdfs", 208),
      ("beijing", "rdd", 19),
      ("beijing", "idea", 349),
      ("beijing", "lilei", 19),
      ("shanghai", "mayun", 194),
      ("shanghai", "flume", 87),
      ("shanghai", "spark", 8585),
      ("shanghai", "mapreduce", 40),
      ("shanghai", "xiaomei", 123),
      ("shanghai", "dazui", 84),
      ("shanghai", "xiaomao", 385),
      ("shanghai", "panghong", 983),
      ("shanghai", "xiaohu", 364),
      ("shanghai", "xiaojiang", 3748),
      ("shanghai", "shuaihu", 7474),
      ("shanghai", "kafka", 67)))

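    // Key by province; the value is (click, name) so the ordering can compare clicks.
    val result = rdd.map(t => (t._1, (t._3, t._2))).aggregateByKey(
      new MyBoundedPriorityQueue[(Int, String)](5)(tuple2FirstOrdering[Int, String](Ordering.Int)))(
      // seqOp: add one record to the partition-local queue
      (queue, ele) => queue += ele,
      // combOp: merge the queues built on different partitions
      (queue1, queue2) => queue1 ++= queue2
    ).mapValues(_.toList)

    // The result is small, so collect it to the driver before printing.
    result.collect().foreach(println)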
Result (each list is in the queue's internal heap order, not sorted by click count):
(beijing,List((38,hehe), (199,yarn), (74,lele), (208,hdfs), (349,idea)))
(shanghai,List((385,xiaomao), (983,panghong), (8585,spark), (3748,xiaojiang), (7474,shuaihu)))
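The lists above hold the right elements but come out in the priority queue's internal heap order rather than sorted. If a ranked list is wanted, one option is to sort each list by click count in descending order. The sketch below does this on plain Scala lists; `SortTopN` and `sortByClicksDesc` are hypothetical names, and in the Spark job the same call could be applied inside mapValues, for example as `qu.toList.sortBy(_._1)(Ordering[Int].reverse)`.

```scala
object SortTopN {
  // Sort (click, name) pairs by click count, largest first.
  def sortByClicksDesc(xs: List[(Int, String)]): List[(Int, String)] =
    xs.sortBy(_._1)(Ordering[Int].reverse)

  def main(args: Array[String]): Unit = {
    // The beijing list exactly as printed in the result above.
    val beijing = List((38, "hehe"), (199, "yarn"), (74, "lele"), (208, "hdfs"), (349, "idea"))
    println(sortByClicksDesc(beijing))
    // List((349,idea), (208,hdfs), (199,yarn), (74,lele), (38,hehe))
  }
}
```

Sorting only the five retained elements per key is cheap, so this step adds little on top of the aggregation.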
End!!