This blog plans to run a series of posts on how to put access logs to good use for improving your programs and analyzing user behavior :) In other words: examining the program's runtime state from a programmer's point of view, and analyzing how users access the site by the numbers.
Since everything starts with the access log, the first step is to parse the weblog. Making the data more readable, heh, naturally also makes it easier to process.
The log format here looks like this:
124.205.30.210 - - [14/Dec/2009:16:02:46 +0800] "GET /HLRelationLog/ent.hunantv.com/y/l/20090306/225172.html HTTP/1.1" 404 252
124.205.30.210 - - [14/Dec/2009:16:04:51 +0800] "GET /HLRelationLog/ent.hunantv.com/y/l/20090901/410652.html HTTP/1.1" 404 252
211.152.32.122 - - [14/Dec/2009:16:15:10 +0800] "GET /manager/html HTTP/1.1" 404 210
124.205.30.210 - - [14/Dec/2009:17:02:18 +0800] "GET / HTTP/1.1" 200 44
124.205.30.210 - - [14/Dec/2009:17:02:19 +0800] "GET /favicon.ico HTTP/1.1" 404 209
124.205.30.210 - - [14/Dec/2009:17:24:59 +0800] "GET /HLRelationLog/ent.hunantv.com/x/20090616/331092.html HTTP/1.1" 404 252
The format is simple: source IP, local time, and the requested path. In practice there is usually a User-Agent as well, but we'll ignore that field in this post. First let's pull these fields out and stash them in a database:
awk -F ' ' '/HLRelationLog/ {printf("insert into logs (url,ip,time) values (\"%s\",\"%s\",\"%s\");\n",$7,$1,substr($4,2))}' access.log > target.sql
Then import the data:
mysql -u root -p -h 8.8.8.8 < target.sql
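For comparison, here is a minimal Python sketch of the same extraction done in one pass; the regex mirrors the awk field positions, and the logs table with url/ip/time columns is the one assumed by the generated SQL above:

import re

# IP - - [time tz] "METHOD path HTTP/x.y" status bytes
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[(\S+) \S+\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)')

def parse_line(line):
    # Returns (url, ip, time) for one log line, or None for malformed lines.
    m = LINE_RE.match(line)
    if m is None:
        return None
    ip, time, method, url, status, size = m.groups()
    return url, ip, time

with open('access.log') as f:
    rows = [r for r in (parse_line(l) for l in f)
            if r is not None and '/HLRelationLog/' in r[0]]
# rows can now go to MySQL in one call, e.g. with MySQLdb:
#   cur.executemany("insert into logs (url, ip, time) values (%s, %s, %s)", rows)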
Now for a quick look at the overall traffic picture:
awk -F ' ' '/HLRelationLog/ {print $7}' access.log |sort|uniq -c|sort -nr > sort.txt
An excerpt of the results:
13102 /HLRelationLog/ent.hunantv.com/e/h/20080917/49514.html
7112 /HLRelationLog/ent.hunantv.com/e/h/20080719/23308.html
6539 /HLRelationLog/ent.hunantv.com/m/20091231/536520.html
6451 /HLRelationLog/ent.hunantv.com/m/20090427/281233.html
5739 /HLRelationLog/ent.hunantv.com/m/20091222/527487.html
5248 /HLRelationLog/ent.hunantv.com/x/20091224/529517.html
4863 /HLRelationLog/ent.hunantv.com/y/20091219/524512.html
4597 /HLRelationLog/ent.hunantv.com/m/20091221/526802.html
4168 /HLRelationLog/ent.hunantv.com/m/20091221/526782.html
4162 /HLRelationLog/ent.hunantv.com/y/20081028/87956.html
3951 /HLRelationLog/ent.hunantv.com/m/20091222/527342.html
3729 /HLRelationLog/ent.hunantv.com/e/20080716/22250.html
3631 /HLRelationLog/ent.hunantv.com/e/h/20080728/26581.html
3561 /HLRelationLog/ent.hunantv.com/m/20091222/527785.html
3372 /HLRelationLog/ent.hunantv.com/e/h/20080903/42076.html
3339 /HLRelationLog/ent.hunantv.com/y/20091216/522056.html
3163 /HLRelationLog/ent.hunantv.com/y/l/20091221/525879.html
2998 /HLRelationLog/ent.hunantv.com/y/20091216/521082.html
2914 /HLRelationLog/ent.hunantv.com/x/20091221/526583.html
2889 /HLRelationLog/ent.hunantv.com/y/20081231/154998.html
2851 /HLRelationLog/ent.hunantv.com/z/20091225/530766.html
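Since the data is already sitting in MySQL, the same top list can also come straight from the database; a sketch, assuming the logs table from above lives in a database named weblog (a name the import above never pins down) and the MySQLdb driver is installed:

import MySQLdb

conn = MySQLdb.connect(host='8.8.8.8', user='root', passwd='secret', db='weblog')
cur = conn.cursor()
cur.execute("select url, count(*) as hits from logs group by url order by hits desc limit 20")
for url, hits in cur.fetchall():
    print(hits, url)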
Next, let's look at how traffic is distributed across channels, the third-from-last segment of each link (x, y, e, ...): first use awk to pull each link's channel identifier and hit count into a file, then sum them per channel in the Python shell (a more readable version follows the output below):
# t is assumed to hold the [channel, count] pairs read back from that file;
# the channel strings keep their trailing '\n', as the output shows
map(lambda x:[x,sum(map(lambda y:int(y[1]),filter(lambda z:z[0]==x,t)))],set(map(lambda x:x[0],t)))
>>>
[['x\n', 48173], ['y\n', 266146], ['d\n', 4035], ['z\n', 40631], ['e\n', 98009], ['ent\n', 558], ['m\n', 174542], ['t\n', 59660]]
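The one-liner gets the job done but is hard to read. Here is an equivalent sketch that works straight off sort.txt, assuming the channel identifier is the path segment right after ent.hunantv.com (which is also why e/h/... pages count under e):

from collections import defaultdict

totals = defaultdict(int)
with open('sort.txt') as f:
    for line in f:
        fields = line.split()        # "13102 /HLRelationLog/ent.hunantv.com/e/h/20080917/49514.html"
        if len(fields) != 2:
            continue
        count, url = fields
        parts = url.split('/')       # ['', 'HLRelationLog', 'ent.hunantv.com', 'e', 'h', ...]
        if len(parts) > 3:
            totals[parts[3]] += int(count)
print(sorted(totals.items(), key=lambda kv: -kv[1]))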
Let's look at concurrency, or more precisely how many requests share the same one-second timestamp:
tail -10000 access.log | awk '{print $4;}' | sort | uniq -c | sort -nr | head
===== divider =====
33 [31/Dec/2009:22:03:18
16 [25/Dec/2009:18:16:07
14 [25/Dec/2009:18:16:06
9 [31/Dec/2009:22:03:19
8 [25/Dec/2009:18:13:35
7 [31/Dec/2009:22:06:38
7 [31/Dec/2009:19:22:40
7 [25/Dec/2009:19:55:54
7 [25/Dec/2009:14:39:06
7 [02/Jan/2010:18:37:54
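The same tally in Python, over the whole file rather than just the last 10000 lines; field 4 is the bracketed timestamp at one-second granularity, exactly as in the awk version:

from collections import Counter

seconds = Counter()
with open('access.log') as f:
    for line in f:
        fields = line.split()
        if len(fields) > 3:
            seconds[fields[3]] += 1    # fields[3] looks like "[31/Dec/2009:22:03:18"

for stamp, hits in seconds.most_common(10):
    print(hits, stamp)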