Original article:
https://www.codeobj.com/?p=574
http://www.itpub.net/?username=04209 [2018-12-19 10:14:28] 63
https://www.cnblogs.com/AK47Sonic/?username=03053 [2018-12-19 10:33:44] 33
http://mail.163.com/?username=09727 2018-12-19 08:42:30] 26
http://mail.163.com/?username=09981 44
https://www.baidu.com/?username=00768 [2018-12-19 11:11:49] 54
http://www.itpub.net/?username=04039 [2018-12-19 10:17:18] 48
https://blog.csdn.net/u013429010/article/details/?username=02248 [2018-12-19 10:47:09] 78
http://bigdata.ruoze.cn/?username=07331 [2018-12-19 09:22:26] 47
http://www.itpub.net/?username=04680 [2018-12-19 10:06:37] 57
http://mail.163.com/?username=09541 2018-12-19 08:45:36] 33
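The data contains URLs with a parameter suffix as well as bare URLs with no parameters. If you are unsure how to generate such data, see the article below; its code is already written, and you can adapt it for other test needs.
Using Scala to generate data for analysis with Spark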
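For readers who only need a small test file, here is a minimal sketch of a generator. It is not the code from the article above. The names `SampleData`, `record` and `generate`, the three domains, the seed and the value ranges are all assumptions, and the field layout (URL, bracketed timestamp, number, separated by tabs) is inferred from the sample records and from the `split("\t")` call in the Spark job. It produces well-formed records only.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Random

// Hypothetical generator: names and value ranges are assumptions for illustration.
object SampleData {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // A few of the domains that appear in the sample records.
  val domains = Seq("http://www.itpub.net/", "https://www.baidu.com/", "http://mail.163.com/")

  /** Builds one record: URL, bracketed timestamp and a number, separated by tabs. */
  def record(rnd: Random, now: LocalDateTime): String = {
    val domain = domains(rnd.nextInt(domains.size))
    val url = f"$domain?username=${rnd.nextInt(100000)}%05d"
    val time = now.minusSeconds(rnd.nextInt(3600).toLong).format(fmt)
    s"$url\t[$time]\t${rnd.nextInt(100)}"
  }

  /** Generates n records from a fixed seed so that runs are repeatable. */
  def generate(n: Int, seed: Long = 42L): Seq[String] = {
    val rnd = new Random(seed)
    val now = LocalDateTime.of(2018, 12, 19, 11, 0, 0)
    Seq.fill(n)(record(rnd, now))
  }
}
```

Writing `SampleData.generate(1000).mkString("\n")` to `data.txt` in the resources directory would give the job below something to read.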
package com.post.spark

import org.apache.spark.{SparkConf, SparkContext}

/** For each domain, counts how often each full URL occurs and prints the top 3 URLs. */
object URLCount {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("URLCount")
    val sc = new SparkContext(sparkConf)

    val filePath = getClass.getClassLoader.getResource("data.txt").getPath
    println(filePath)

    // Read a local file: place data.txt in the resources directory
    val content = sc.textFile("file:///" + filePath)
    content.map(_.split("\t"))
      .filter(_.length >= 3) // skip records with fewer than 3 fields
      .map(x => getDomain(x))
      .groupByKey()
      .map(x => (x._1, countValues(x._2)))
      .foreach(println)

    sc.stop()
  }

  /**
   * Uses the part of the URL before "?" as the domain key.
   * If the split fails, the raw URL is kept as the key.
   * @param x the tab-separated fields of one record (URL first)
   * @return (domain, (field 0, field 1, field 2))
   */
  def getDomain(x: Array[String]) = {
    var domain = x(0)
    try {
      domain = x(0).split("[?]")(0)
    } catch {
      case e: Exception => e.printStackTrace()
    }
    (domain, (x(0), x(1), x(2)))
  }

  /**
   * Counts the occurrences of each URL within one domain.
   * @param x all records that share one domain key
   * @return up to three (URL, count) pairs, highest count first
   */
  def countValues(x: Iterable[(String, String, String)]) = {
    val values = x.toArray.groupBy(_._1).map(x => (x._1, x._2.size))
    values.toArray.sortBy(-_._2).slice(0, 3).toList
  }
}
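The grouping and top-3 logic of the job above can be checked without starting Spark. The following is a sketch on plain Scala collections, not part of the original job. `TopUrls`, `domainOf` and `topN` are hypothetical names, and breaking ties by URL is an added assumption to make the result deterministic (the Spark job does not define a tie order).

```scala
// Plain-Scala sketch of the per-domain top-N logic; all names here are hypothetical.
object TopUrls {

  /** Strips the query string, falling back to the raw URL when the split yields nothing. */
  def domainOf(url: String): String =
    url.split("[?]").headOption.getOrElse(url)

  /** Groups URLs by domain and keeps the n most frequent URLs per domain. */
  def topN(lines: Seq[String], n: Int = 3): Map[String, List[(String, Int)]] =
    lines
      .map(_.split("\t"))
      .filter(_.length >= 3) // same rule as the Spark job: drop short records
      .map(_(0))             // only the URL field is needed for counting
      .groupBy(domainOf)
      .map { case (domain, urls) =>
        val counts = urls.groupBy(identity).map { case (u, us) => (u, us.size) }
        // Highest count first; ties are broken by URL (an assumption of this sketch)
        domain -> counts.toList.sortBy { case (u, c) => (-c, u) }.take(n)
      }
}
```

For example, `TopUrls.topN(Seq("http://a.com/?username=1\t[t]\t1", "http://a.com/?username=1\t[t]\t2"))` groups both records under `http://a.com/` and reports a count of 2 for the single URL.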
Processing 1 million records on a single machine took 10 seconds:
18/12/19 11:34:19 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.16.121:52450 in memory (size: 2.9 KB, free: 912.3 MB)
(https://www.cnblogs.com/AK47Sonic/,List((https://www.cnblogs.com/AK47Sonic/?username=03877,126), (https://www.cnblogs.com/AK47Sonic/?username=03027,122), (https://www.cnblogs.com/AK47Sonic/?username=03801,121)))
(https://www.sina.com.cn/,List((https://www.sina.com.cn/?username=06872,126), (https://www.sina.com.cn/?username=06418,123), (https://www.sina.com.cn/?username=06324,123)))
18/12/19 11:34:20 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1224 bytes result sent to driver
18/12/19 11:34:20 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 1884 ms on localhost (executor driver) (1/2)
(https://blog.csdn.net/u013429010/article/details/,List((https://blog.csdn.net/u013429010/article/details/?username=02403,126), (https://blog.csdn.net/u013429010/article/details/?username=02683,123), (https://blog.csdn.net/u013429010/article/details/?username=02302,119)))
(http://www.qq.com/,List((http://www.qq.com/?username=08785,132), (http://www.qq.com/?username=08501,125), (http://www.qq.com/?username=08199,123)))
(http://archive.cloudera.com/,List((http://archive.cloudera.com/?username=01871,131), (http://archive.cloudera.com/?username=01678,123), (http://archive.cloudera.com/?username=01525,121)))
(http://mail.163.com/,List((http://mail.163.com/?username=09100,135), (http://mail.163.com/?username=09567,121), (http://mail.163.com/?username=09267,120)))
(https://www.oschina.net/,List((https://www.oschina.net/?username=05743,129), (https://www.oschina.net/?username=05964,129), (https://www.oschina.net/?username=05016,125)))
(http://www.itpub.net/,List((http://www.itpub.net/?username=04391,132), (http://www.itpub.net/?username=04176,124), (http://www.itpub.net/?username=04951,122)))
(https://www.baidu.com/,List((https://www.baidu.com/,47088), (https://www.baidu.com/?username=00576,129), (https://www.baidu.com/?username=00518,128)))
(http://bigdata.ruoze.cn/,List((http://bigdata.ruoze.cn/?username=07673,130), (http://bigdata.ruoze.cn/?username=07671,130), (http://bigdata.ruoze.cn/?username=07022,122)))
18/12/19 11:34:22 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 1181 bytes result sent to driver