Spark Computation - Implementing Grouped TopN

Problem Abstraction

Find the TopN records within each group.

Example Problem

For each province (province), find the top 5 people (name) by application click count (click). We assume each person's click count within a province has already been computed.

Approach

The names are only auxiliary attributes; what we really need is the top 5 click counts per province. First group by province, then aggregate within each group to get the top 5. Following the approach used in the implementation of the top() operator, we use a priority queue with a fixed capacity (here, 5): iterate over each group's elements, add each one to the queue, and the elements left in the queue are the desired Top 5. Finally, merge the priority queues computed on each partition to obtain the final result.
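For the ungrouped case, Spark's built-in top() already follows this pattern: a bounded priority queue per partition, merged on the driver. A minimal sketch for comparison, using the Beijing click counts from the example data further below:

    // Ungrouped case for comparison: top(k) keeps only the k largest elements
    // per partition and merges the per-partition results on the driver.
    val beijingClicks = sc.parallelize(Seq(38, 32, 74, 199, 208, 19, 349, 19))
    val top5 = beijingClicks.top(5)   // Array(349, 208, 199, 74, 38)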

Implementation:

First we need a fixed-size priority queue. Spark already ships one, org.apache.spark.util.BoundedPriorityQueue, but it is not accessible outside the Spark packages, so we write an equivalent class of our own:

import java.io.Serializable
import java.util.{PriorityQueue => JPriorityQueue}

import scala.collection.JavaConverters._
import scala.collection.generic.Growable

/**
 * A bounded priority queue that retains only the `maxSize` largest elements
 * according to `ord`, modeled on Spark's package-private
 * org.apache.spark.util.BoundedPriorityQueue.
 */
class MyBoundedPriorityQueue[A](maxSize: Int)(implicit ord: Ordering[A])
  extends Iterable[A] with Growable[A] with Serializable {

  // Backing min-heap: the head is always the smallest of the retained elements.
  private val underlying = new JPriorityQueue[A](maxSize, ord)

  override def iterator: Iterator[A] = underlying.iterator.asScala

  override def size: Int = underlying.size

  // Merge another collection (e.g. the queue built on another partition) into this one.
  override def ++=(xs: TraversableOnce[A]): this.type = {
    xs.foreach { this += _ }
    this
  }

  // Add a single element, evicting the current smallest element if the queue
  // is full and the new element is larger.
  override def +=(elem: A): this.type = {
    if (size < maxSize) {
      underlying.offer(elem)
    } else {
      maybeReplaceLowest(elem)
    }
    this
  }

  override def +=(elem1: A, elem2: A, elems: A*): this.type = {
    this += elem1 += elem2 ++= elems
  }

  override def clear() { underlying.clear() }

  // Replace the smallest retained element if `a` is larger than it.
  private def maybeReplaceLowest(a: A): Boolean = {
    val head = underlying.peek()
    if (head != null && ord.gt(a, head)) {
      underlying.poll()
      underlying.offer(a)
    } else {
      false
    }
  }

}

The constructor takes the capacity and an Ordering. Use += to insert a single element and ++= to merge in another queue.
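A quick, hypothetical driver-side check of the queue's behavior (not part of the Spark job itself):

    // Capacity 3 with the natural Int ordering: only the 3 largest values survive.
    val q = new MyBoundedPriorityQueue[Int](3)(Ordering.Int)
    q ++= Seq(5, 1, 9, 7, 3)
    println(q.toList.sorted)   // List(5, 7, 9)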

The computation itself uses the aggregateByKey operator:

    // Order (click, name) pairs by the click count only; the name is ignored.
    def tuple2FirstOrdering[T1, T2](implicit ord1: Ordering[T1]): Ordering[(T1, T2)] =
      new Ordering[(T1, T2)] {
        def compare(x: (T1, T2), y: (T1, T2)): Int = ord1.compare(x._1, y._1)
      }

    val rdd = sc.makeRDD(Seq(

                    ("beijing", "hehe", 38),
                    ("beijing", "lala", 32),
                    ("beijing", "lele", 74),
                    ("beijing", "yarn", 199),
                    ("beijing", "hdfs", 208),
                    ("beijing", "rdd", 19),
                    ("beijing", "idea", 349),
                    ("beijing", "lilei", 19),

                    ("shanghai", "mayun", 194),
                    ("shanghai", "flume", 87),
                    ("shanghai", "spark", 8585),
                    ("shanghai", "mapreduce", 40),
                    ("shanghai", "xiaomei", 123),
                    ("shanghai", "dazui", 84),
                    ("shanghai", "xiaomao", 385),
                    ("shanghai", "panghong", 983),
                    ("shanghai", "xiaohu", 364),
                    ("shanghai", "xiaojiang", 3748),
                    ("shanghai", "shuaihu", 7474),
                    ("shanghai", "kafka", 67)))


    // For each province: value = (click, name). Aggregate a key's values into a
    // bounded priority queue of size 5, then merge the per-partition queues.
    val result = rdd
      .map(t => (t._1, (t._3, t._2)))
      .aggregateByKey(
        new MyBoundedPriorityQueue[(Int, String)](5)(tuple2FirstOrdering[Int, String](Ordering.Int)))(
        (queue, ele) => queue += ele,          // seqOp: add one element within a partition
        (queue1, queue2) => queue1 ++= queue2  // combOp: merge queues across partitions
      )
      .mapValues(_.toList)

    result.foreach(println(_))

Result (note that the queue's iterator does not return elements in sorted order, so each Top 5 list appears in the heap's internal order):

(beijing,List((38,hehe), (199,yarn), (74,lele), (208,hdfs), (349,idea)))
(shanghai,List((385,xiaomao), (983,panghong), (8585,spark), (3748,xiaojiang), (7474,shuaihu)))
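If each province's Top 5 should be printed in descending click order, one option (a small sketch built on the `result` RDD above) is to sort when materializing the list:

    // Sort each group's Top 5 by click count, descending, before printing.
    val sorted = result.mapValues(_.sortBy(-_._1))
    sorted.foreach(println(_))
    // Expected output (line order may vary):
    // (beijing,List((349,idea), (208,hdfs), (199,yarn), (74,lele), (38,hehe)))
    // (shanghai,List((8585,spark), (7474,shuaihu), (3748,xiaojiang), (983,panghong), (385,xiaomao)))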

End!!
