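An earlier post, Python抓取中文网页 (Fetching Chinese Web Pages with Python), showed how to fetch web pages with Python, but before long that approach stopped working for personal blogs on CSDN. I had long heard how powerful curl is, so today I gave it a try.
curl can do a great deal. There is a short guide, curl使用简介 (A Brief Introduction to Using curl), that you can consult; for anything else, search Baidu or Google. Here we only need the most basic options: --connect-timeout, which limits how many seconds curl waits for the connection, and -o, which writes the response to a file (-s simply silences the progress output). Taking this blog as the example: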
    curl -s --connect-timeout 10 -o blog "http://blog.csdn.net/nevasun"

This leaves a file named blog in the current directory. Opened as plain text, it contains the following lines:
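    <li>访问:<span>10598次</span></li>
    <li>积分:<span>610分</span></li>
    <li>排名:<span>第13159名</span></li>

The next step is to extract the visit count (访问), points (积分), and rank (排名) from the blog file. I had just learned awk a few days earlier, so awk it is. For an introduction to awk, see AWK学习总结及练习 (AWK Study Notes and Exercises); everything awk-related used below is covered there.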
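    awk 'BEGIN {FS="[<>]"; ORS="\t"}
    /(访问|积分|排名):<span>.*<\/span>/ {
        if ($3 == "访问:") {gsub(/[^0-9]+/, ""); print}
        else if ($3 == "积分:") {gsub(/[^0-9]+/, ""); print}
        else if ($3 == "排名:") {gsub(/[^0-9]+/, ""); print}
    }' blog

With FS set to "[<>]", every < or > separates fields, so the label is the third field of each matching line; gsub then strips everything except the digits, and ORS="\t" prints the three numbers on one line separated by tabs. At this point we can already extract the data we need. Combined with the shell and Python scripts below and a cron job, the system can record the blog's visits every day on a schedule.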
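To make the field numbering easier to follow, here is a minimal Python sketch of the same splitting logic. It is only an illustration, not part of the original scripts: the names SAMPLE, LABELS and extract_stats are made up for this example, and it skips the extra `<span>...</span>` pattern check that the awk command performs.

```python
import re

# Sample lines copied from the fetched page shown earlier.
SAMPLE = """\
<li>访问:<span>10598次</span></li>
<li>积分:<span>610分</span></li>
<li>排名:<span>第13159名</span></li>
"""

# Labels the awk program compares against $3.
LABELS = ("访问:", "积分:", "排名:")


def extract_stats(html):
    """Return the digits found on each matching line, in page order."""
    values = []
    for line in html.splitlines():
        # Equivalent of FS="[<>]": split on every '<' or '>'.
        fields = re.split(r"[<>]", line)
        # awk fields are 1-based, so awk's $3 is fields[2] here.
        if len(fields) > 2 and fields[2] in LABELS:
            # Equivalent of gsub(/[^0-9]+/, ""): keep only the digits.
            values.append(re.sub(r"[^0-9]+", "", line))
    return values


# Tab-separated, like ORS="\t" (without awk's trailing tab).
print("\t".join(extract_stats(SAMPLE)))
```

Because each line starts with `<`, the first field is empty, the second is `li`, and the label lands in the third, which is why the awk program tests `$3`.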
The shell script account.sh:

    #!/bin/bash
    # Fetch the blog page, retrying up to three times.
    try_time=0
    fetch_url="http://blog.csdn.net/nevasun"
    while [ $try_time -lt 3 ]
    do
        curl -s --connect-timeout 3 -o blog "$fetch_url"
        if [ $? -eq 0 ]; then
            break
        fi
        try_time=$((try_time+1))
    done

    # Extract visits, points and rank as tab-separated numbers.
    account_info=$(awk 'BEGIN {FS="[<>]"; ORS="\t"}
    /(访问|积分|排名):<span>.*<\/span>/ {
        if ($3 == "访问:") {gsub(/[^0-9]+/, ""); print}
        else if ($3 == "积分:") {gsub(/[^0-9]+/, ""); print}
        else if ($3 == "排名:") {gsub(/[^0-9]+/, ""); print}
    }' blog)

    # "daily_routine" stores today's record; anything else only prints it.
    # $account_info is left unquoted on purpose so it splits into three arguments.
    if [ "$1" == "daily_routine" ]; then
        ./dbroutine.py $account_info 1
    else
        ./dbroutine.py $account_info 0
    fi

The Python script dbroutine.py:
    #!/usr/bin/env python3
    import os
    import pickle
    import sys
    from time import localtime


    def load_record(db_file):
        """Load the saved record list (newest first); [{}] if none exists."""
        recordlist = [{}]
        if os.path.exists(db_file):
            with open(db_file, "rb") as readf:
                try:
                    recordlist = pickle.load(readf)
                except Exception:
                    recordlist = [{}]
        return recordlist


    def make_day_record(recordlist, total_access, score, rank):
        """Build today's record; day_access is the change since the last one."""
        if "total_access" in recordlist[0]:
            day_count = total_access - recordlist[0]["total_access"]
        else:
            day_count = total_access
        now = localtime()
        date = "%s.%s.%s" % (now.tm_year, now.tm_mon, now.tm_mday)
        return dict(date=date, day_access=day_count,
                    total_access=total_access, score=score, rank=rank)


    def dump_record(db_file, recordlist, total_access, score, rank):
        """Prepend today's record and write the list back to disk."""
        recordlist.insert(0, make_day_record(recordlist, total_access, score, rank))
        with open(db_file, "wb") as writef:
            pickle.dump(recordlist, writef)


    def print_record(recordlist, total_access, score, rank):
        """Print today's figures followed by the stored history."""
        print(make_day_record(recordlist, total_access, score, rank))
        for record in recordlist:
            print(record)


    db_file = "blog_record.dat"
    if len(sys.argv) != 5:
        sys.exit("usage: dbroutine.py total_access score rank flag")
    total_access = int(sys.argv[1])
    score = int(sys.argv[2])
    rank = int(sys.argv[3])
    flag = int(sys.argv[4])

    recordlist = load_record(db_file)
    if flag != 0:
        dump_record(db_file, recordlist, total_access, score, rank)
    else:
        print_record(recordlist, total_access, score, rank)
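As a rough illustration of how the record list grows, the sketch below replays two days of data in memory, without touching blog_record.dat. The helper add_record and the dates are hypothetical; the first set of numbers comes from the page excerpt above, while the second day's values are invented for the example.

```python
def add_record(recordlist, date, total_access, score, rank):
    """Return a new list with the day's record inserted at the front."""
    # Same rule as dbroutine.py: with no earlier total, the whole
    # count is treated as that day's access.
    previous = recordlist[0].get("total_access", 0)
    record = dict(date=date, day_access=total_access - previous,
                  total_access=total_access, score=score, rank=rank)
    return [record] + recordlist


# load_record returns [{}] when no data file exists yet.
records = [{}]
records = add_record(records, "2012.5.1", 10598, 610, 13159)
# Invented follow-up values, for illustration only.
records = add_record(records, "2012.5.2", 10650, 610, 13100)

for r in records:
    if r:  # skip the empty placeholder left at the end of the list
        print(r["date"], r["day_access"], r["total_access"])
```

The list is kept newest first, so the previous total is always at index 0, and the empty placeholder dict from the very first run stays at the end of the list.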