(12) Spark Core: Finding the TopN IPs by Access Count

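Requirement: find the TopN IPs by access count
1) Extract the IP from each log line and map it to (ip, 1)
2) Sum the counts per IP with reduceByKey(_ + _)
3) Sort by count with sortBy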
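
The three steps above can be tried on a handful of hand-written records before pointing the job at a real log. The following is a minimal sketch, assuming Spark is on the classpath; the object name `StepsSketch`, the helper `topIps`, and the sample IPs are made up for illustration, and the records hold only the IP rather than full log lines.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StepsSketch {
  // Hypothetical helper: runs the three steps on an in-memory list of IPs.
  def topIps(sc: SparkContext, ips: Seq[String], n: Int): Array[(String, Int)] = {
    val pairs = sc.parallelize(ips).map(ip => (ip, 1)) // step 1: ip => (ip, 1)
    val counts = pairs.reduceByKey(_ + _)              // step 2: sum the 1s per IP
    counts.sortBy(_._2, ascending = false).take(n)     // step 3: sort by count, descending
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("steps-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Made-up sample data: 10.0.0.1 appears 3 times, 10.0.0.2 twice, 10.0.0.3 once.
    val ips = Seq("10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.2", "10.0.0.1")
    topIps(sc, ips, 2).foreach(println)
    sc.stop()
  }
}
```

With this sample data the sketch prints (10.0.0.1,3) followed by (10.0.0.2,2).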

import org.apache.spark.{SparkConf, SparkContext}

object test {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("test").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val lines = sc.textFile("file:///E:/BigDataSoftware/data/baidu.log")

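    val N = 10 // number of top IPs to return; 10 is only an example value

    val topN = lines.map(x => {
      val tmp = x.split("\t")
      (tmp(5), 1) // take the IP column (index 5) and convert it to a tuple (IP, 1)
    }).reduceByKey(_ + _).sortBy(_._2, false).take(N)

    topN.foreach(println) // print the (IP, count) pairs, highest count first

    sc.stop()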
  }
}
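
Because only the first N records are needed, a full sort is not strictly required. As an alternative to the sortBy approach used here, the RDD method `top` takes an implicit `Ordering` and returns the N largest elements, largest first. The following is a minimal sketch of that alternative; the object name `TopSketch`, the helper `topByCount`, and the sample pairs are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TopSketch {
  // Returns the n pairs with the largest counts, ordered from largest to smallest.
  def topByCount(counts: RDD[(String, Int)], n: Int): Array[(String, Int)] =
    counts.top(n)(Ordering.by[(String, Int), Int](_._2))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("top-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Made-up (IP, count) pairs standing in for the output of reduceByKey.
    val counts = sc.parallelize(Seq(("10.0.0.2", 2), ("10.0.0.1", 3), ("10.0.0.3", 1)))
    topByCount(counts, 2).foreach(println)
    sc.stop()
  }
}
```

Whether this is faster than sortBy followed by take depends on the data size and cluster, so it is worth measuring on the real log before switching.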

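sortBy sorts in ascending order by default; sortBy(_._2, false) sorts in descending order by the second tuple element, here the access count. The source code of sortBy is shown below: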

  /**
   * Return this RDD sorted by the given key function.
   */
  def sortBy[K](
      f: (T) => K,
      ascending: Boolean = true,
      numPartitions: Int = this.partitions.length)
      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
    this.keyBy[K](f)
        .sortByKey(ascending, numPartitions)
        .values
  }
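
As the source shows, sortBy is a thin wrapper: it keys every element with `f`, sorts by that key with sortByKey, and then drops the key again. The sketch below spells this out on a few made-up (IP, count) pairs and contrasts the default ascending order with `ascending = false`; the object name `SortBySketch`, the helper `sortByCount`, and the data are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortBySketch {
  // Hand-written equivalent of counts.sortBy(_._2, ascending):
  // attach the count as key, sort by that key, then drop the key.
  def sortByCount(counts: RDD[(String, Int)], ascending: Boolean): RDD[(String, Int)] =
    counts.keyBy(_._2).sortByKey(ascending).values

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sortby-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val counts = sc.parallelize(Seq(("10.0.0.2", 2), ("10.0.0.1", 3), ("10.0.0.3", 1)))

    // Default (ascending = true): smallest count first.
    counts.sortBy(_._2).collect().foreach(println)
    // ascending = false via the manual version: largest count first.
    sortByCount(counts, ascending = false).collect().foreach(println)

    sc.stop()
  }
}
```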
