1 Tools

[Image 1: Scraping Shuangseqiu lottery results with Python]



2 Method


1. Write the scraping scripts in Python 2.7

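Besides the normal scraping, the scripts accept optional arguments. Without arguments, the scraped data is saved in the script's own directory (sys.path[0]); with arguments, you can set the output directory and the file extension. This lets each script run on its own or from a scheduled sh task.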
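The argument handling described above can be sketched in isolation. This is a minimal illustration rather than the script's own code: the helper names resolve_output and output_file are hypothetical, and the sketch assumes the two arguments are always supplied together, since the scripts read sys.argv[1] and sys.argv[2] as a pair. It should run unchanged on Python 2.7 and Python 3.

```python
import os
import sys


def resolve_output(argv, default_dir, default_suffix="txt"):
    """Return (directory, suffix) for the output file.

    argv is a list shaped like sys.argv. Unless both extra arguments
    are present, the defaults are used.
    """
    if len(argv) > 2:
        return argv[1], argv[2]
    return default_dir, default_suffix


def output_file(argv, default_dir, name="lot_500_ssq"):
    """Build the data file path; name is the prefix used by the ssq scraper."""
    directory, suffix = resolve_output(argv, default_dir)
    return os.path.join(directory, name + "." + suffix)


if __name__ == "__main__":
    # sys.path[0] is the directory that contains the running script.
    print(output_file(sys.argv, sys.path[0]))
```

Requiring both arguments avoids an IndexError when only the directory is passed.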

Contents of grab500_ssq.py, the Shuangseqiu (double color ball) scraper:

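    # -*- coding:utf-8 -*-
    import re
    import urllib
    import time
    import sys

    # Defaults: save next to this script, with a .txt extension.
    datapath = sys.path[0]
    datasuffix = 'txt'
    # The two arguments are read as a pair, so both must be present.
    if len(sys.argv) > 2:
        datapath = sys.argv[1]
        datasuffix = sys.argv[2]

    def getHtml(url):
        # Download the page and return its raw HTML.
        html = urllib.urlopen(url)
        return html.read()

    html = getHtml("http://zx.500.com/ssq/")

    # Three patterns, each with one numeric capture group. The HTML tags
    # that surrounded the capture groups are missing from this listing and
    # are shown as "..."; the second and third patterns matched list items.
    reg = [r'...([0-9]\d*).*...']
    reg.append(r'...([0-9]\d*)...')
    reg.append(r'...([0-9]\d*)...')

    # Collect every match of every pattern into one comma-separated string.
    outstr = ""
    for i in range(len(reg)):
        page = re.compile(reg[i])
        rs = re.findall(page, html)
        for j in range(len(rs)):
            outstr += rs[j] + ","

    # Append one line per run: "YYYY-MM-DD:n1,n2,..."
    #print time.strftime('%Y-%m-%d',time.localtime(time.time()))+":"+outstr[:-1]
    with open(datapath + '/lot_500_ssq.' + datasuffix, 'a') as f:
        f.write(time.strftime('%Y-%m-%d', time.localtime(time.time())) + ":" + outstr[:-1] + '\n')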
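To make the extraction step concrete, here is a small sketch of the same technique (re.findall with one capture group per pattern, results joined with commas). The sample markup, its tag and class names, and the helper names extract and format_record are all invented for illustration; they are not the real zx.500.com page structure or the script's own patterns.

```python
import re
import time

# Invented sample markup: tag and class names are assumptions,
# not the real page structure.
SAMPLE_HTML = (
    '<span class="issue">18023 draw</span>'
    '<ul>'
    '<li class="red">03</li><li class="red">11</li>'
    '<li class="blue">07</li>'
    '</ul>'
)

# One capture group per pattern, mirroring the shape of the script's list.
PATTERNS = [
    r'<span class="issue">([0-9]\d*).*?</span>',
    r'<li class="red">([0-9]\d*)</li>',
    r'<li class="blue">([0-9]\d*)</li>',
]


def extract(html, patterns):
    """Return all captured values, pattern by pattern, in page order."""
    values = []
    for pattern in patterns:
        values.extend(re.findall(pattern, html))
    return values


def format_record(values, day=None):
    """Build one output line in the form YYYY-MM-DD:v1,v2,..."""
    if day is None:
        day = time.strftime('%Y-%m-%d')
    return day + ":" + ",".join(values)


if __name__ == "__main__":
    print(format_record(extract(SAMPLE_HTML, PATTERNS)))
```

Joining with ",".join avoids the trailing comma that the script strips with outstr[:-1].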

Contents of grab500_dlt.py, the Daletou (super lotto) scraper:

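    # -*- coding:utf-8 -*-
    import re
    import urllib
    import time
    import sys

    # Defaults: save next to this script, with a .txt extension.
    datapath = sys.path[0]
    datasuffix = 'txt'
    # The two arguments are read as a pair, so both must be present.
    if len(sys.argv) > 2:
        datapath = sys.argv[1]
        datasuffix = sys.argv[2]

    def getHtml(url):
        # Download the page and return its raw HTML.
        html = urllib.urlopen(url)
        return html.read()

    html = getHtml("http://zx.500.com/dlt/")

    # Three patterns, each with one numeric capture group. The HTML tags
    # that surrounded the capture groups are missing from this listing and
    # are shown as "..."; the second and third patterns matched list items.
    reg = [r'...([0-9]\d*).*...']
    reg.append(r'...([0-9]\d*)...')
    reg.append(r'...([0-9]\d*)...')

    # Collect every match of every pattern into one comma-separated string.
    outstr = ""
    for i in range(len(reg)):
        page = re.compile(reg[i])
        rs = re.findall(page, html)
        for j in range(len(rs)):
            outstr += rs[j] + ","

    # Append one line per run: "YYYY-MM-DD:n1,n2,..."
    #print time.strftime('%Y-%m-%d',time.localtime(time.time()))+":"+outstr[:-1]
    with open(datapath + '/lot_500_dlt.' + datasuffix, 'a') as f:
        f.write(time.strftime('%Y-%m-%d', time.localtime(time.time())) + ":" + outstr[:-1] + '\n')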
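Both scrapers append one line per run in the form YYYY-MM-DD:v1,v2,... so the data files are easy to read back. The following is a minimal sketch of a reader for that format; the function names parse_record and load_records are hypothetical and are not part of the project.

```python
def parse_record(line):
    """Split one saved line into (date_string, list_of_values)."""
    day, _, numbers = line.strip().partition(":")
    values = numbers.split(",") if numbers else []
    return day, values


def load_records(lines):
    """Parse an iterable of lines (for example an open file), skipping blanks."""
    return [parse_record(line) for line in lines if line.strip()]


if __name__ == "__main__":
    sample = ["2018-03-01:18023,03,11,07\n", "\n"]
    for day, values in load_records(sample):
        print(day, values)
```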


2. Write the sh script that runs the scrapers

Next we need an sh script, bwb_lottery_everyday.sh, that runs the Python scripts. Note that date in sh returns 0 for Sunday, not 7, whereas crontab accepts either 0 or 7.


    #!/bin/sh
    basepath=$(cd `dirname $0`; pwd)     # shell's dir
    datapath=$basepath'/lotterydata'     # shell's datadir
    datasuffix='txt'                     # datasuffix

    # Day of the week for today: 0 is Sunday, 6 is Saturday.
    a=`date +%w`
    if [ $a -eq 1 ] || [ $a -eq 3 ] || [ $a -eq 6 ]; then
        python "${basepath}/grab500_ssq.py" $datapath $datasuffix
    elif [ $a -eq 2 ] || [ $a -eq 4 ] || [ $a -eq 0 ]; then
        python "${basepath}/grab500_dlt.py" $datapath $datasuffix
    fi
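The weekday dispatch is easy to get wrong because numbering conventions differ: date +%w counts Sunday as 0, while Python's isoweekday() counts it as 7. The sketch below mirrors the shell logic in Python to show the conversion. The names sh_weekday and script_for are hypothetical, and the day sets are copied from the shell script above.

```python
import datetime

# Weekday numbers as printed by `date +%w` (Sunday is 0).
SSQ_DAYS = {1, 3, 6}
DLT_DAYS = {2, 4, 0}


def sh_weekday(day):
    """Return the weekday as `date +%w` would: Sunday 0 ... Saturday 6."""
    # isoweekday() gives Monday 1 ... Sunday 7, so fold 7 back to 0.
    return day.isoweekday() % 7


def script_for(day):
    """Return the scraper to run on the given date, or None on Friday."""
    w = sh_weekday(day)
    if w in SSQ_DAYS:
        return "grab500_ssq.py"
    if w in DLT_DAYS:
        return "grab500_dlt.py"
    return None


if __name__ == "__main__":
    print(script_for(datetime.date.today()))
```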


3. Write a main sh script

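Write a main sh script, bwb_lottery_main.sh, that handles cleanup and setup. Note that it writes directly to the system file /etc/crontab to schedule the periodic run. This is not ideal, but crontab -e is hard to automate, so the job has to be registered as a system task.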

    #!/bin/sh
    cronfile="/etc/crontab"                  # debian cronfile
    basepath=$(cd `dirname $0`; pwd)         # shell's dir
    datapath=$basepath'/lotterydata'         # shell's datadir
    datasuffix='txt'                         # datasuffix
    crontaskname="bwb_lottery_everyday.sh"   # shell's name
    # crontab task run time, default: 23:00 every day except Friday
    crontasktime="0 23\t* * 1-4,6-7"

    # The \t and \n escapes below rely on sh's echo expanding them,
    # as dash (Debian's /bin/sh) does.
    echo "checking..."
    if [ ! -f ${cronfile} ]; then
        echo "crontab file $cronfile doesn't exist.\nplease check file or modify shell setting and run shell again."
        exit 1
    fi
    pyver=`python -V 2>&1|awk '{print $2}'|awk -F '.' '{print $1}'`
    if [ "$pyver" != '2' ]; then
        echo "python2(.7) is needed."
        exit 1
    fi

    echo "writing crontab file..."
    # Append the task on the first run; otherwise rewrite its line in place.
    if [ `grep -c "${crontaskname}" ${cronfile}` -eq '0' ]; then
        echo "${crontasktime}\troot\t${basepath}/${crontaskname}">>${cronfile}
    else
        sed -i "s#^.*${crontaskname}.*#${crontasktime}\troot\t${basepath}/${crontaskname}#" ${cronfile}
    fi
    /etc/init.d/cron restart

    echo "making data dir..."
    # Create the data dir, then the bak dir; once both exist, archive old data.
    if [ ! -d "${datapath}" ]; then
        mkdir ${datapath}
    else
        if [ ! -d "${datapath}/bak" ]; then
            mkdir "${datapath}/bak"
        else
            mv ${datapath}/*.${datasuffix} ${datapath}/bak/ 2>/dev/null
        fi
    fi

    echo "changing permission..."
    chmod +x "$basepath/$crontaskname"
    chmod +w -R $datapath
    echo "finished!"
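The crontab step is written so that running the main script twice does not add a second entry: the task line is appended if no line mentions the task name, and otherwise every matching line is rewritten. The sketch below shows the same append-or-replace idea on plain text in Python. The function name upsert_cron_line and the /opt/lot path are hypothetical, and the sketch works on a string rather than on /etc/crontab itself.

```python
def upsert_cron_line(crontab_text, task_name, new_line):
    """Return crontab_text with new_line added or substituted.

    Every line containing task_name is replaced by new_line; if no
    line matches, new_line is appended at the end.
    """
    lines = crontab_text.splitlines()
    found = False
    for i, line in enumerate(lines):
        if task_name in line:
            lines[i] = new_line
            found = True
    if not found:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical install path, for illustration only.
    entry = "0 23\t* * 1-4,6-7\troot\t/opt/lot/bwb_lottery_everyday.sh"
    text = "# system crontab\n"
    text = upsert_cron_line(text, "bwb_lottery_everyday.sh", entry)
    # A second call leaves the text unchanged.
    assert upsert_cron_line(text, "bwb_lottery_everyday.sh", entry) == text
    print(text)
```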

Finally, we only need to run this main script to deploy the lottery scraper in a single step.


The complete project code has been uploaded to GitHub.