Spark RDD: Finding the Top 3 URLs by Access Count for Each Domain

Original article:
https://www.codeobj.com/?p=574

1. Data source

1) Sample data

http://www.itpub.net/?username=04209	[2018-12-19 10:14:28]	63
https://www.cnblogs.com/AK47Sonic/?username=03053	[2018-12-19 10:33:44]	33
http://mail.163.com/?username=09727	2018-12-19 08:42:30]	26
http://mail.163.com/?username=09981		44
https://www.baidu.com/?username=00768	[2018-12-19 11:11:49]	54
http://www.itpub.net/?username=04039	[2018-12-19 10:17:18]	48
https://blog.csdn.net/u013429010/article/details/?username=02248	[2018-12-19 10:47:09]	78
http://bigdata.ruoze.cn/?username=07331	[2018-12-19 09:22:26]	47
http://www.itpub.net/?username=04680	[2018-12-19 10:06:37]	57
http://mail.163.com/?username=09541	2018-12-19 08:45:36]	33
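
The sample mixes well-formed rows with damaged ones: a timestamp missing its opening bracket, and an empty timestamp field. As a minimal sketch, assuming the three tab-separated fields are a URL, a bracketed timestamp, and a value (the names `Record`, `SampleParser`, and `parse` are illustrative and do not appear in the original code), a parser could tell these cases apart roughly like this:

```scala
// Hypothetical helper: the field meanings are assumed from the sample rows.
final case class Record(url: String, timestamp: Option[String], value: String)

object SampleParser {
  // Matches a timestamp wrapped in square brackets, e.g. [2018-12-19 10:14:28]
  private val Bracketed = """\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]""".r

  // Returns None when the line has fewer than three tab-separated fields.
  def parse(line: String): Option[Record] =
    line.split("\t") match {
      case Array(url, ts, value, _*) =>
        val timestamp = ts match {
          case Bracketed(inner) => Some(inner)
          case _                => None // missing bracket or empty field
        }
        Some(Record(url, timestamp, value))
      case _ => None
    }
}
```

A row with an empty middle field still splits into three fields, so it passes a simple length check; only the timestamp comes back as `None`.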

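2) Simulate as many kinds of abnormal data as possible

Examples include URLs with a parameter suffix and records with an empty URL. If you are not sure how to generate such data, see the article below. The code there is already written, and you can adapt it to other testing needs.

Generating Data with Scala for Analysis with Spark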
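
For readers who just want some input to experiment with, here is a rough sketch of a generator. It is not the code from the article referenced above: the object name `SampleDataGenerator`, the anomaly rates, and the fixed date are all assumptions made for illustration, and only the domains are taken from the sample rows.

```scala
import scala.util.Random

// Hypothetical generator; rates and value ranges are arbitrary assumptions.
object SampleDataGenerator {
  // Domains copied from the sample rows
  private val domains = Vector(
    "http://www.itpub.net/",
    "http://mail.163.com/",
    "https://www.baidu.com/"
  )

  // Builds one tab-separated line: URL, timestamp, value.
  def line(rnd: Random): String = {
    val domain = domains(rnd.nextInt(domains.size))
    // Roughly 1 in 10 URLs has no parameter suffix
    val url =
      if (rnd.nextInt(10) == 0) domain
      else f"$domain?username=${rnd.nextInt(10000)}%05d"
    // Roughly 1 in 10 timestamps loses its opening bracket
    val open = if (rnd.nextInt(10) == 0) "" else "["
    val ts = f"${open}2018-12-19 ${8 + rnd.nextInt(4)}%02d:${rnd.nextInt(60)}%02d:${rnd.nextInt(60)}%02d]"
    s"$url\t$ts\t${rnd.nextInt(100)}"
  }

  // A fixed seed makes the output reproducible between runs.
  def generate(n: Int, seed: Long): Vector[String] = {
    val rnd = new Random(seed)
    Vector.fill(n)(line(rnd))
  }
}
```

Calling `SampleDataGenerator.generate(1000000, 42L)` and writing the lines to `data.txt` would give a file of the same shape as the sample, though not the same distribution as the original data set.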

2. Spark analysis code

package com.post.spark

import org.apache.spark.{SparkConf, SparkContext}

object URLCount {

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("URLCount")

    val sc = new SparkContext(sparkConf)
    val filePath = getClass.getClassLoader.getResource("data.txt").getPath
    println(filePath)
    // Read a local file; place data.txt in the resources directory
    val content = sc.textFile("file:///" + filePath)
    content.map(_.split("\t"))
      .filter(_.length >= 3)  // Skip records with fewer than 3 fields
      .map(x => getDomain(x))
      .groupByKey()
      .map(x => (x._1, countValues(x._2)))
      .foreach(println)

    sc.stop()

  }

  /**
    * Extracts the domain (the part of the URL before '?').
    * If extraction fails, it is skipped and the original URL is used as the key.
    * @param x the tab-separated fields of one record
    * @return (domain, (url, second field, third field))
    */
  def getDomain(x: Array[String]) = {
    var domain = x(0)
    try {
      domain = x(0).split("[?]")(0)
    } catch {
      case e: Exception => e.printStackTrace()
    }
    (domain, (x(0), x(1), x(2)))
  }

  /**
    * Counts how often each URL occurs and returns the 3 most frequent ones.
    * @param x all records that belong to one domain
    */
  def countValues(x: Iterable[(String, String, String)]) = {
    // Group by URL and count the records in each group
    val values = x.toArray.groupBy(_._1).map(x => (x._1, x._2.size))
    // Sort by descending count and keep the first 3
    val max = values.toArray.sortBy(-_._2).slice(0, 3).toList
    max
  }

}
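
The job above uses `groupByKey`, so all records of a domain are collected together before they are counted. A common alternative is to count each URL first and only then rank the counts per domain; in Spark that step would typically be written with `reduceByKey`. The sketch below illustrates the count-first shape with plain Scala collections rather than RDDs, so it runs without Spark. The names `TopUrls`, `domainOf`, and `topPerDomain` are illustrative, and the tie-breaking rule (alphabetical by URL) is an assumption that the original code does not make.

```scala
object TopUrls {
  // Everything before the first '?', the same key rule as getDomain
  def domainOf(url: String): String = url.takeWhile(_ != '?')

  // Counts each URL once, then keeps the n most frequent URLs per domain.
  def topPerDomain(urls: Seq[String], n: Int = 3): Map[String, List[(String, Int)]] = {
    val counts: Map[String, Int] =
      urls.groupBy(identity).map { case (url, hits) => (url, hits.size) }
    counts.toList
      .groupBy { case (url, _) => domainOf(url) }
      .map { case (domain, urlCounts) =>
        // Descending count first, then URL, so ties are ordered deterministically
        (domain, urlCounts.sortBy { case (url, c) => (-c, url) }.take(n))
      }
  }
}
```

Because only one counter per distinct URL reaches the ranking step, the amount of data held per domain depends on the number of distinct URLs rather than the number of records.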


3. Top 3 printed results

Processing 1 million records on a single machine took 10 seconds.

18/12/19 11:34:19 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.16.121:52450 in memory (size: 2.9 KB, free: 912.3 MB)
(https://www.cnblogs.com/AK47Sonic/,List((https://www.cnblogs.com/AK47Sonic/?username=03877,126), (https://www.cnblogs.com/AK47Sonic/?username=03027,122), (https://www.cnblogs.com/AK47Sonic/?username=03801,121)))
(https://www.sina.com.cn/,List((https://www.sina.com.cn/?username=06872,126), (https://www.sina.com.cn/?username=06418,123), (https://www.sina.com.cn/?username=06324,123)))
18/12/19 11:34:20 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1224 bytes result sent to driver
18/12/19 11:34:20 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 1884 ms on localhost (executor driver) (1/2)
(https://blog.csdn.net/u013429010/article/details/,List((https://blog.csdn.net/u013429010/article/details/?username=02403,126), (https://blog.csdn.net/u013429010/article/details/?username=02683,123), (https://blog.csdn.net/u013429010/article/details/?username=02302,119)))
(http://www.qq.com/,List((http://www.qq.com/?username=08785,132), (http://www.qq.com/?username=08501,125), (http://www.qq.com/?username=08199,123)))
(http://archive.cloudera.com/,List((http://archive.cloudera.com/?username=01871,131), (http://archive.cloudera.com/?username=01678,123), (http://archive.cloudera.com/?username=01525,121)))
(http://mail.163.com/,List((http://mail.163.com/?username=09100,135), (http://mail.163.com/?username=09567,121), (http://mail.163.com/?username=09267,120)))
(https://www.oschina.net/,List((https://www.oschina.net/?username=05743,129), (https://www.oschina.net/?username=05964,129), (https://www.oschina.net/?username=05016,125)))
(http://www.itpub.net/,List((http://www.itpub.net/?username=04391,132), (http://www.itpub.net/?username=04176,124), (http://www.itpub.net/?username=04951,122)))
(https://www.baidu.com/,List((https://www.baidu.com/,47088), (https://www.baidu.com/?username=00576,129), (https://www.baidu.com/?username=00518,128)))
(http://bigdata.ruoze.cn/,List((http://bigdata.ruoze.cn/?username=07673,130), (http://bigdata.ruoze.cn/?username=07671,130), (http://bigdata.ruoze.cn/?username=07022,122)))
18/12/19 11:34:22 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 1181 bytes result sent to driver
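
In the output above, the bare URL https://www.baidu.com/ (no parameter) is counted 47088 times and takes the first slot in that domain's top 3. If such records should be left out of the ranking, one option is to filter them before counting. This is a minimal sketch, assuming that "has a non-empty query string" is the right test; the names `UrlFilter` and `hasQuery` are hypothetical.

```scala
object UrlFilter {
  // True when the URL contains '?' followed by at least one character.
  def hasQuery(url: String): Boolean = {
    val i = url.indexOf('?')
    i >= 0 && i < url.length - 1
  }
}
```

In the Spark job this could be applied as an extra `.filter(x => UrlFilter.hasQuery(x(0)))` step after the existing length filter, before the records are keyed by domain.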
