Spark Streaming: Analyzing Tomcat Logs
Count the top 100 IPs
- Use Spark Streaming to compute (ip, ip_count) pairs, then take the 100 largest by ip_count in descending order
- Program:
package io.github.sparkstream

import org.apache.spark.SparkConf
import org.apache.spark.streaming._

/**
 * Created by sunyonggang on 16/5/10.
 */
class TomcatLog {
}

object TomcatLog {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("wordcount")
    val isDebug = true
    val duration = 5 // batch interval in seconds
    if (isDebug) {
      // run locally when debugging
      conf.setMaster("local")
    }
    val ssc = new StreamingContext(conf, Seconds(duration))
    // textFileStream creates an input stream that monitors a Hadoop-compatible
    // file system for new files and reads them as text files
    val lines = ssc.textFileStream("/Users/sunyonggang/Downloads/softwareforspark/tenten")
    // the first space-separated field is the IP; count per IP within each batch: (ip, ip_count)
    val ips = lines.map(line => (line.split(" ")(0), 1)).reduceByKey(_ + _)
    // writes one directory per batch, named result-<batch time in ms>
    ips.saveAsTextFiles("/Users/sunyonggang/Downloads/spark-1.5.2/result")
    ssc.start()
    ssc.awaitTermination()
  }
}
This produces several output directories, one per batch:
[sunyonggang@sunyongangdeMBP ~/Downloads/spark-1.5.2]$ du -sh result*
0B result
8.0K result-1462926380000
20K result-1462926385000
8.0K result-1462926390000
8.0K result-1462926395000
Inspect the output in result-1462926385000:
[sunyonggang@sunyongangdeMBP ~/Downloads/spark-1.5.2/result-1462926385000]$ head part-00000
(220.181.108.157,1)
(207.46.13.95,2)
(1.59.65.67,2)
(192.250.46.129,87)
(66.249.71.137,14)
(117.136.30.147,3)
(72.14.202.87,3)
(117.136.11.190,1)
(159.226.202.13,2)
(183.9.112.2,25)
After sorting, print the top 100 (partial listing shown):
[sunyonggang@sunyongangdeMBP ~/Downloads/spark-1.5.2/result-1462926385000]$ cat part-00000 | tr -d '()' | sort -t ',' -k2nr,2 | head -100
218.20.24.203,4597
221.194.180.166,4576
119.146.220.12,1850
117.136.31.144,1647
121.28.95.48,1597
113.109.183.126,1596
182.48.112.2,870
120.84.24.200,773
61.144.125.162,750
27.115.124.75,470
115.236.48.226,439
59.41.62.100,339
89.126.54.40,305
114.247.10.132,243
125.46.45.78,236
220.181.94.221,205
218.19.42.168,181
118.112.183.164,179
116.235.194.89,171
114.43.237.117,167
61.155.206.81,165
202.108.18.253,164
218.107.55.254,164
14.213.176.184,133
121.14.162.28,125
123.150.182.147,125
121.14.162.124,124
123.150.182.180,124
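The program only writes the per-batch (ip, ip_count) pairs; the ranking itself is done afterwards with sort and head. As a rough sketch, the same selection could be written in Scala. `TopN` and `topN` are hypothetical names that do not appear in the original program, and the sample values are taken from the output shown earlier.

```scala
// Hypothetical helper, not part of the original program.
object TopN {
  /** Returns the n pairs with the largest counts, highest first.
    * Ties are broken by key so the result is deterministic. */
  def topN(counts: Seq[(String, Int)], n: Int): Seq[(String, Int)] =
    counts.sortBy { case (key, count) => (-count, key) }.take(n)

  def main(args: Array[String]): Unit = {
    val sample = Seq(("1.59.65.67", 2), ("192.250.46.129", 87), ("66.249.71.137", 14))
    // prints "192.250.46.129,87" then "66.249.71.137,14"
    topN(sample, 2).foreach { case (ip, count) => println(s"$ip,$count") }
  }
}
```

Inside the streaming job, a similar per-batch result should be obtainable with `ips.foreachRDD { rdd => rdd.top(100)(Ordering.by[(String, Int), Int](_._2)) }`, since `RDD.top` accepts an `Ordering`. Note that either way the ranking covers a single 5-second batch, not the whole log.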
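Count page views (PV) for the top 50 pages
- Similar to the first problem, except that the key is now the destination URL instead of the IP
- Output: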
[sunyonggang@sunyongangdeMBP ~/Downloads/spark-1.5.2]$ du -sh result*
0B result
172K result-1462935920000
8.0K result-1462935925000
8.0K result-1462935930000
[sunyonggang@sunyongangdeMBP ~/Downloads/spark-1.5.2]$ cd result-1462935920000
[sunyonggang@sunyongangdeMBP ~/Downloads/spark-1.5.2/result-1462935920000]$ ls
_SUCCESS part-00000
[sunyonggang@sunyongangdeMBP ~/Downloads/spark-1.5.2/result-1462935920000]$ head part-00000
(/home.php?mod=misc&ac=sendmail&rand=1327969460,1)
(/static/js/smilies.js?AZH,194)
(/home.php?mod=misc&ac=sendmail&rand=1328006543,1)
(/space-username-Dafuyang.html?ajaxmenu=1&inajax=1&ajaxtarget=aaiqJdWSScgksYAgcXYJYOLWYaWQOaNJ_menu_content,1)
(/forum.php?mod=ajax&action=forumchecknew&fid=53&time=1328023418&inajax=yes,3)
(/home.php?mod=space&do=pm,1)
(/group.php?sgid=25,1)
(/home.php?mod=space&uid=35,1)
(/static/image/smiley/qq/tsh.gif,2)
(/forum.php?mod=ajax&action=forumchecknew&fid=46&time=1328005619&inajax=yes,10)
Sorted output (top 50):
[sunyonggang@sunyongangdeMBP ~/Downloads/spark-1.5.2/result-1462935920000]$ cat part-00000 | tr -d '()' | sort -t ',' -k2nr,2 | head -50
/static/js/floating-jf.js,1329
/static/js/jquery-1.6.js,1263
/data/cache/style_2_common.css?AZH,657
/data/cache/style_2_widthauto.css?AZH,615
/static/js/common.js?AZH,570
/static/js/forum.js?AZH,495
/forum-58-1.html,462
/popwin_js.php?fid=58,387
/static/image/common/arrwd.gif,373
/static/image/common/scrolltop.png,345
/data/cache/style_2_forum_forumdisplay.css?AZH,332
/ads/banner-01.gif,308
/static/image/common/logo.png,308
/static/image/common/nv_a.png,296
/static/image/common/qmenu.png,291
/popwin_js.php?fid=,281
/static/image/common/house.gif,279
/static/image/common/pt_item.png,275
/static/js/seditor.js?AZH,261
/static/image/common/pn_post.png,239
/static/image/common/fav.gif,234
/static/image/common/arw_l.gif,233
/popwin_js.php?fid=53,230
/static/js/share_icon.js,230
/data/cache/common_smilies_var.js?AZH,226
/static/image/common/user_online.gif,219
/forum.php,218
/static/image/common/folder_common.gif,211
/static/image/common/pin_3.gif,209
/static/image/common/feed.gif,208
/static/image/filetype/image_s.gif,207
/static/image/common/px.png,203
/static/image/common/atarget.png,202
/static/image/common/refresh.png,202
/static/js/smilies.js?AZH,194
/,181
/static/image/common/tip_bottom.png,178
/static/image/editor/editor.gif,174
/static/image/filetype/common.gif,170
/static/image/common/login.gif,162
/data/cache/style_2_forum_index.css?AZH,157
/popwin_js.php?fid=0,154
/static/image/common/notice.gif,127
/data/cache/style_2_forum_viewthread.css?AZH,113
/static/image/common/arw_r.gif,112
/static/js/forum_viewthread.js?AZH,109
/popwin_js.php?fid=46,101
/static/image/common/collapsed_no.gif,101
/static/image/common/forum.gif,101
/static/image/common/16x16.gif,92
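The code for this step is not shown above. A minimal sketch of the URL extraction follows, assuming each log line has the shape `ip - - [date zone] "METHOD /url PROTOCOL" status bytes`, so that the URL is the 7th space-separated field (index 6). That field position is an assumption about the log format, `AccessLogUrl` is a hypothetical name, and the sample line is made up for illustration.

```scala
// Hypothetical helper; the field position is an assumption about the log format.
object AccessLogUrl {
  /** Returns the requested URL (field index 6 after splitting on spaces).
    * Lines with too few fields yield None instead of throwing. */
  def url(line: String): Option[String] = {
    val fields = line.split(" ")
    if (fields.length > 6) Some(fields(6)) else None
  }

  def main(args: Array[String]): Unit = {
    // made-up sample line in the assumed format
    val sample = """1.59.65.67 - - [31/Jan/2012:00:00:01 +0800] "GET /forum.php HTTP/1.1" 200 1234"""
    println(url(sample)) // prints Some(/forum.php)
  }
}
```

In the streaming job this would replace the IP key, for example `lines.flatMap(line => AccessLogUrl.url(line).toList).map(u => (u, 1)).reduceByKey(_ + _)`. Returning an `Option` keeps a malformed line from failing the whole batch, which a bare `split(" ")(6)` would do.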
Count browser types and versions
- Similar to the above
- Details:
[sunyonggang@sunyongangdeMBP ~/Downloads/spark-1.5.2/result-1462937565000]$ cat part-00000 | tr -d '"()' | sort -t ',' -k2nr,2
Mozilla/5.0,14519
Mozilla/4.0,10191
MQQBrowser/2.9/ZTE-TU880_TD/1.0,1724
MQQBrowser/2.9/Adr,526
-,252
Sogou,205
JUCLinux;U;2.2.2;Zh_cn;TCL,165
JUC,125
JUCLinux;,124
ZTE-TU880_TD/1.0,93
Opera/9.80,90
DoCoMo/2.0,28
Sosospider++http://help.soso.com/webspider.htm,20
Huaweisymantecspider,13
ia_archiver,13
AdsBot-Google-Mobile,9
HuaweiSymantecSpider/[email protected]+compatible;,5
Shockwave,5
Dalvik/1.4.0,4
libwww-perl/5.834,4
Mozilla/0.6,2
Mozilla/4.0compatible;,2
Mozilla/4.7,2
Yahoo!,2
curl/7.15.5,2
myCrawl/Nutch-1.3,2
360se,1
AdsBot-Google,1
Baiduspider++http://www.baidu.com/search/spider.htm,1
MOT-MT620_TD/1.0,1
TencentTraveler,1
milodns,1
mozilla/4.0,1
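The extraction code for this step is not shown either. One possible sketch, assuming the log uses the combined format in which each line carries three double-quoted fields (request, referer, user agent), takes the first space-separated token of the user agent as the browser key. Both the format and the name `AccessLogBrowser` are assumptions, so the original program may well have done this differently.

```scala
// Hypothetical helper; assumes the combined log format (request, referer, user agent).
object AccessLogBrowser {
  // matches one double-quoted field and captures its content
  private val Quoted = "\"([^\"]*)\"".r

  /** Returns the first token of the third quoted field (the user agent),
    * or None when the line has fewer than three quoted fields. */
  def browser(line: String): Option[String] = {
    val quoted = Quoted.findAllMatchIn(line).map(_.group(1)).toList
    if (quoted.length >= 3) Some(quoted(2).split(" ")(0)) else None
  }

  def main(args: Array[String]): Unit = {
    // made-up sample line in the assumed format
    val sample = "1.59.65.67 - - [31/Jan/2012:00:00:01 +0800] " +
      "\"GET / HTTP/1.1\" 200 1234 \"-\" \"Mozilla/5.0 (Windows NT 6.1)\""
    println(browser(sample)) // prints Some(Mozilla/5.0)
  }
}
```

Keying on the first token explains why most requests collapse into `Mozilla/5.0` and `Mozilla/4.0`: nearly all desktop browsers start their user agent that way, so telling them apart would require looking further into the string.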