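People who learn to use frameworks and stand on the shoulders of giants are rarely short on ability.
The framework we are learning here is PySpider.
Install pyspider (Python 2.7 was already installed in an earlier step):
pip install pyspider
Download phantomjs-2.1.1-windows
and add it to the PATH environment variable; it is needed for pages that load content dynamically with JavaScript.
We use MySQL for storage.
If you do not need to store results in MySQL, you can skip this step.
Install MySQL and Navicat Premium (a database management tool).
Run cmd -> pyspider all
At this point the environment is fully set up.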
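Before starting pyspider, it can help to confirm that the tools are actually reachable from the command line. The following is a minimal sketch, not part of pyspider: `find_on_path` is a hypothetical helper written for this post, and it assumes the executables are named `phantomjs` and `pyspider`. It only walks the PATH variable, so it works on both Python 2.7 and Python 3.

```python
import os


def find_on_path(name, path=None):
    """Return the full path of an executable called `name`, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    # On Windows, executables carry an extension such as .exe
    exts = [""]
    if os.name == "nt":
        exts += os.environ.get("PATHEXT", ".EXE;.BAT;.CMD").lower().split(";")
    for directory in path.split(os.pathsep):
        for ext in exts:
            candidate = os.path.join(directory, name + ext)
            if os.path.isfile(candidate):
                return candidate
    return None


if __name__ == "__main__":
    # Assumed command names; adjust if your install uses different ones
    for tool in ("phantomjs", "pyspider"):
        print("%s -> %s" % (tool, find_on_path(tool) or "NOT FOUND on PATH"))
```

If `phantomjs` is reported as not found, pages fetched with fetch_type='js' will most likely fail, so it is worth fixing the environment variable first.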
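First, get familiar with what pyspider can do:
pyspider Chinese site
pyspider official site
Next, we scrape the download count of one app from Wandoujia (豌豆荚), Baidu Mobile Assistant (百度手机助手), and Yingyongbao (应用宝).
Code walkthrough:
Code example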
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2017-11-29 16:32:33
# Project: ZenTalk_Download

from pyspider.libs.base_handler import *
import sys
# Python 2 only: make utf-8 the default encoding so the Chinese platform names can be written
reload(sys)
sys.setdefaultencoding('utf-8')
import uniout
import re
import time


class Handler(BaseHandler):
    crawl_config = {
    }

    # Class attribute, so that self.FILE_NAME resolves inside writeToFile
    FILE_NAME = 'zentalk_download.txt'

    def __init__(self):
        self.wdjurl = 'http://www.wandoujia.com/apps/com.asus.cnzentalk'
        self.baiduurl = 'http://shouji.baidu.com/software/11306355.html'
        self.yingyongbaourl = 'http://sj.qq.com/myapp/detail.htm?apkName=com.asus.cnzentalk'
        self.tool = Tool()
        self.time = self.tool.getCurrentIntTime()

    #@every(minutes=24 * 60)  # run once a day
    @every(minutes=1)  # run once a minute
    def on_start(self):
        # fetch_type='js' renders each page with PhantomJS before parsing
        self.crawl(self.wdjurl, callback=self.index_wdjpage, fetch_type='js')
        self.crawl(self.baiduurl, callback=self.index_baidupage, fetch_type='js')
        self.crawl(self.yingyongbaourl, callback=self.index_yingyongbaopage, fetch_type='js')

    # age=60: a fetched page stays valid for 1 min.
    # Both options go in one @config call; stacked @config decorators overwrite each other.
    @config(age=60, priority=3)
    def index_wdjpage(self, response):
        raw_download = response.doc('[itemprop="interactionCount"]').text()
        download = self.tool.getCount(raw_download)
        # write to file (platform: Wandoujia)
        self.writeToFile(download, '豌豆荚')

    @config(age=60, priority=2)
    def index_baidupage(self, response):
        raw_download = response.doc('.yui3-g .download-num').text()
        # the text has the form "label:count"; keep the part after the colon
        download = self.tool.getCount(re.split(':', raw_download)[1])
        # write to file (platform: Baidu Mobile Assistant)
        self.writeToFile(download, '百度手机助手')

    @config(age=60, priority=1)
    def index_yingyongbaopage(self, response):
        # drop the two trailing characters that follow the number
        raw_download = response.doc('.det-ins-num').text()[:-2]
        download = self.tool.getCount(raw_download)
        # write to file (platform: Yingyongbao)
        self.writeToFile(download, '应用宝')

    def writeToFile(self, download, platform, append=True):
        # append=True adds one line; append=False starts a new file with a timestamp header
        with open(self.FILE_NAME, 'a' if append else 'w') as f:
            if not append:
                f.write("\n------ time: " + self.tool.getCurrentTime() + "," + str(self.tool.getCurrentIntTime()) + " ------------------\n")
            f.write(platform + ": " + str(download) + '\n')
# Utility class
class Tool:
    # Strip hyperlink ads
    removeADLink = re.compile('
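The handler relies on self.tool.getCount to turn the scraped text into a number, but the body of that method is not shown in this listing. As a rough illustration only, and not the original implementation, here is a sketch of what such a helper might do. `parse_count` is a hypothetical name, and it assumes the stores display counts either as plain digits with optional thousands separators or with the Chinese unit suffixes 万 (ten thousand) and 亿 (hundred million).

```python
# -*- coding: utf-8 -*-
import re

# Multipliers for the Chinese unit suffixes that may follow a download count
UNITS = {u"万": 10 ** 4, u"亿": 10 ** 8}
# A number with an optional decimal part, followed by an optional unit suffix
NUMBER = re.compile(u"(\\d+(?:\\.\\d+)?)\\s*([万亿]?)")


def parse_count(text):
    """Hypothetical helper: turn text such as u'1.2万' or u'3,456' into an int.

    Returns None when the text contains no number.
    """
    # Remove thousands separators before matching
    match = NUMBER.search(text.replace(u",", u""))
    if match is None:
        return None
    value, unit = match.groups()
    # round() guards against float artifacts before converting to int
    return int(round(float(value) * UNITS.get(unit, 1)))
```

Normalizing every store's count to a plain integer keeps the lines written to zentalk_download.txt comparable across platforms.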
PySpider tips
- Projects created in PySpider are all saved in the location below, stored in database (db) form