This blog plans to run a series of posts on how to put access logs to good use for improving your programs and analyzing user behavior :) In other words: examining the program's runtime state from a programmer's point of view, and analyzing how users access the site by the numbers.
Since everything starts with the access log, the first step is to parse the weblog. Making the data more readable, heh, naturally also makes it easier to process.
The log format here looks like this:
124.205.30.210 - - [14/Dec/2009:16:02:46 +0800] "GET /HLRelationLog/ent.hunantv.com/y/l/20090306/225172.html HTTP/1.1" 404 252
124.205.30.210 - - [14/Dec/2009:16:04:51 +0800] "GET /HLRelationLog/ent.hunantv.com/y/l/20090901/410652.html HTTP/1.1" 404 252
211.152.32.122 - - [14/Dec/2009:16:15:10 +0800] "GET /manager/html HTTP/1.1" 404 210
124.205.30.210 - - [14/Dec/2009:17:02:18 +0800] "GET / HTTP/1.1" 200 44
124.205.30.210 - - [14/Dec/2009:17:02:19 +0800] "GET /favicon.ico HTTP/1.1" 404 209
124.205.30.210 - - [14/Dec/2009:17:24:59 +0800] "GET /HLRelationLog/ent.hunantv.com/x/20090616/331092.html HTTP/1.1" 404 252
The format is simple: source IP, local time, and the requested path. In practice there is usually a User-Agent as well, but we'll ignore that field in this post. First let's pull these fields out and stash them in a database:
awk -F ' ' '/HLRelationLog/ {printf("insert into logs (url,ip,time) values (\"%s\",\"%s\",\"%s\");\n",$7,$1,substr($4,2))}' access.log > target.sql
Then import the data:
mysql -u root -p -h 8.8.8.8 < target.sql
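For comparison, here is a minimal Python sketch of the same extraction done in one pass; the regex mirrors the awk field positions, and the logs table with url/ip/time columns is the one assumed by the generated SQL above:

import re

# IP - - [time tz] "METHOD path HTTP/x.y" status bytes
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[(\S+) \S+\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)')

def parse_line(line):
    # Returns (url, ip, time) for one log line, or None for malformed lines.
    m = LINE_RE.match(line)
    if m is None:
        return None
    ip, time, method, url, status, size = m.groups()
    return url, ip, time

with open('access.log') as f:
    rows = [r for r in (parse_line(l) for l in f)
            if r is not None and '/HLRelationLog/' in r[0]]
# rows can now go to MySQL in one call, e.g. with MySQLdb:
#   cur.executemany("insert into logs (url, ip, time) values (%s, %s, %s)", rows)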
Now for a quick look at the overall traffic picture:
awk -F ' ' '/HLRelationLog/ {print $7}' access.log |sort|uniq -c|sort -nr > sort.txt
An excerpt of the results:
13102 /HLRelationLog/ent.hunantv.com/e/h/20080917/49514.html
7112 /HLRelationLog/ent.hunantv.com/e/h/20080719/23308.html
6539 /HLRelationLog/ent.hunantv.com/m/20091231/536520.html
6451 /HLRelationLog/ent.hunantv.com/m/20090427/281233.html
5739 /HLRelationLog/ent.hunantv.com/m/20091222/527487.html
5248 /HLRelationLog/ent.hunantv.com/x/20091224/529517.html
4863 /HLRelationLog/ent.hunantv.com/y/20091219/524512.html
4597 /HLRelationLog/ent.hunantv.com/m/20091221/526802.html
4168 /HLRelationLog/ent.hunantv.com/m/20091221/526782.html
4162 /HLRelationLog/ent.hunantv.com/y/20081028/87956.html
3951 /HLRelationLog/ent.hunantv.com/m/20091222/527342.html
3729 /HLRelationLog/ent.hunantv.com/e/20080716/22250.html
3631 /HLRelationLog/ent.hunantv.com/e/h/20080728/26581.html
3561 /HLRelationLog/ent.hunantv.com/m/20091222/527785.html
3372 /HLRelationLog/ent.hunantv.com/e/h/20080903/42076.html
3339 /HLRelationLog/ent.hunantv.com/y/20091216/522056.html
3163 /HLRelationLog/ent.hunantv.com/y/l/20091221/525879.html
2998 /HLRelationLog/ent.hunantv.com/y/20091216/521082.html
2914 /HLRelationLog/ent.hunantv.com/x/20091221/526583.html
2889 /HLRelationLog/ent.hunantv.com/y/20081231/154998.html
2851 /HLRelationLog/ent.hunantv.com/z/20091225/530766.html
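Since the data is already sitting in MySQL, the same top list can also come straight from the database; a sketch, assuming the logs table from above lives in a database named weblog (a name the import above never pins down) and the MySQLdb driver is installed:

import MySQLdb

conn = MySQLdb.connect(host='8.8.8.8', user='root', passwd='secret', db='weblog')
cur = conn.cursor()
cur.execute("select url, count(*) as hits from logs group by url order by hits desc limit 20")
for url, hits in cur.fetchall():
    print(hits, url)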
Next, let's look at how traffic is distributed across channels, the third-from-last segment of each link (x, y, e, ...): first use awk to pull each link's channel identifier and hit count into a file, then sum them per channel in the Python shell (a more readable version follows the output below):
# t is assumed to hold the [channel, count] pairs read back from that file;
# the channel strings keep their trailing '\n', as the output shows
map(lambda x:[x,sum(map(lambda y:int(y[1]),filter(lambda z:z[0]==x,t)))],set(map(lambda x:x[0],t)))
>>>
[['x\n', 48173], ['y\n', 266146], ['d\n', 4035], ['z\n', 40631], ['e\n', 98009], ['ent\n', 558], ['m\n', 174542], ['t\n', 59660]]
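The one-liner gets the job done but is hard to read. Here is an equivalent sketch that works straight off sort.txt, assuming the channel identifier is the path segment right after ent.hunantv.com (which is also why e/h/... pages count under e):

from collections import defaultdict

totals = defaultdict(int)
with open('sort.txt') as f:
    for line in f:
        fields = line.split()        # "13102 /HLRelationLog/ent.hunantv.com/e/h/20080917/49514.html"
        if len(fields) != 2:
            continue
        count, url = fields
        parts = url.split('/')       # ['', 'HLRelationLog', 'ent.hunantv.com', 'e', 'h', ...]
        if len(parts) > 3:
            totals[parts[3]] += int(count)
print(sorted(totals.items(), key=lambda kv: -kv[1]))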
Let's look at concurrency, or more precisely how many requests share the same one-second timestamp:
tail -10000 access.log | awk '{print $4;}' | sort | uniq -c | sort -nr | head
===== divider =====
33 [31/Dec/2009:22:03:18
16 [25/Dec/2009:18:16:07
14 [25/Dec/2009:18:16:06
9 [31/Dec/2009:22:03:19
8 [25/Dec/2009:18:13:35
7 [31/Dec/2009:22:06:38
7 [31/Dec/2009:19:22:40
7 [25/Dec/2009:19:55:54
7 [25/Dec/2009:14:39:06
7 [02/Jan/2010:18:37:54
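The same tally in Python, over the whole file rather than just the last 10000 lines; field 4 is the bracketed timestamp at one-second granularity, exactly as in the awk version:

from collections import Counter

seconds = Counter()
with open('access.log') as f:
    for line in f:
        fields = line.split()
        if len(fields) > 3:
            seconds[fields[3]] += 1    # fields[3] looks like "[31/Dec/2009:22:03:18"

for stamp, hits in seconds.most_common(10):
    print(hits, stamp)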