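This crawler is designed to scrape JD Books (jd.com).
Most readers are probably already familiar with the Scrapy framework. It has many complex mechanisms that are beyond the scope of this article.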
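1. The spider
Tips:
1. XPath syntax is full of pitfalls, but the XPath Helper extension for Chrome makes it easy to work out XPath expressions.
2. Dynamic content such as prices, which the page fills in after it loads, cannot be scraped this way.
3. The comment-scraping part of this code chains calls on XPath selector objects and can serve as a reference.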
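The chaining mentioned in tip 3 can be tried without Scrapy. The following is a minimal sketch using the standard library's `xml.etree.ElementTree`, which supports only a small subset of XPath and is not the Scrapy `Selector` API. The sample markup and the helper name `extract_comments` are made up for illustration. The idea is the same: select the container nodes first, then run a relative expression (one starting with `.`) on each node.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real JD page is far more complex.
SAMPLE = """
<div id="hidcomment">
  <div class="i-item">
    <strong><a href="//club.example/1.html">Good book</a></strong>
    <span>2016-11-01</span>
  </div>
  <div class="i-item">
    <strong><a href="//club.example/2.html">Fast delivery</a></strong>
    <span>2016-11-02</span>
  </div>
</div>
"""


def extract_comments(xml_text):
    """Select each comment node, then query relative to that node."""
    root = ET.fromstring(xml_text.strip())
    comments = []
    # First call: pick the container nodes.
    for node in root.findall("./div[@class='i-item']"):
        # Chained calls: paths starting with '.' are relative to `node`.
        link = node.find("./strong/a")
        comments.append({
            "url": "http:" + link.get("href"),
            "content": link.text,
            "time": node.find("./span").text,
        })
    return comments


if __name__ == "__main__":
    for comment in extract_comments(SAMPLE):
        print(comment)
```

In Scrapy the equivalent pattern is calling `.xpath('.//...')` on each selector returned by the first `xpath()` call. Forgetting the leading `.` makes a `//` expression search the whole document again instead of the current node.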
# -*- coding: utf-8 -*-
# "import scrapy" could replace the next three imports, but that is not recommended
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy import Request
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
# If PyCharm flags this import, it is misreading the project layout; it can be ignored
from jdbook.items import JDBookItem


class JDBookSpider(Spider):
    name = "jdbook"
    allowed_domains = ["jd.com"]  # pages outside this domain are not crawled
    start_urls = [
        # Start URL: begin at the largest item id and iterate down towards 0
        "http://item.jd.com/11678007.html"
    ]

    # Keeps the login session: convert the cookie string copied from Chrome
    # into a dict and paste it here
    cookies = {}

    # HTTP headers sent to the server; some sites only respond to a
    # browser-like User-Agent, others do not care
    headers = {
        # 'Connection': 'keep - alive',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36'
    }

    # Settings that control how responses are handled
    meta = {
        'dont_redirect': True,  # do not follow redirects
        'handle_httpstatus_list': [301, 302]  # statuses still passed to the callback
    }

    def get_next_url(self, old_url):
        """
        Return the URL for the next iteration.
        :param old_url: the URL that was just crawled
        :return: the next URL to crawl, or None when the ids are exhausted
        """
        # Expected URL format: http://item.jd.com/11678007.html
        parts = old_url.split('/')  # split on '/'
        old_item_id = int(parts[3].split('.')[0])
        new_item_id = old_item_id - 1
        if new_item_id == 0:  # reached 0: the site is done and the spider can stop
            return None
        # Build the new URL
        return '/'.join([parts[0], parts[1], parts[2], str(new_item_id) + '.html'])

    def start_requests(self):
        """
        Overrides Spider.start_requests to issue the first Request.
        """
        # Request self.start_urls[0] with the headers and cookies; the response
        # is passed to the parse callback
        yield Request(self.start_urls[0], callback=self.parse,
                      headers=self.headers, cookies=self.cookies, meta=self.meta)

    def parse(self, response):
        """
        Parse one product page.
        """
        selector = Selector(response)
        item = JDBookItem()
        # Extracted links are not used further in this spider
        extractor = LxmlLinkExtractor(allow=r'http://item.jd.com/\d.*html')
        link = extractor.extract_links(response)
        try:
            item['_id'] = response.url.split('/')[3].split('.')[0]
            item['url'] = response.url
            item['title'] = selector.xpath('/html/head/title/text()').extract()[0]
            item['keywords'] = selector.xpath('/html/head/meta[2]/@content').extract()[0]
            item['description'] = selector.xpath('/html/head/meta[3]/@content').extract()[0]
            item['img'] = 'http:' + selector.xpath('//*[@id="spec-n1"]/img/@src').extract()[0]
            item['channel'] = selector.xpath('//*[@id="root-nav"]/div/div/strong/a/text()').extract()[0]
            item['tag'] = selector.xpath('//*[@id="root-nav"]/div/div/span[1]/a[1]/text()').extract()[0]
            item['sub_tag'] = selector.xpath('//*[@id="root-nav"]/div/div/span[1]/a[2]/text()').extract()[0]
            # Note: this uses the same XPath as sub_tag
            item['value'] = selector.xpath('//*[@id="root-nav"]/div/div/span[1]/a[2]/text()').extract()[0]
            comments = list()
            node_comments = selector.xpath('//*[@id="hidcomment"]/div')
            for node_comment in node_comments:
                node_comment_attrs = node_comment.xpath('.//div[contains(@class, "i-item")]')
                for attr in node_comment_attrs:
                    # A fresh dict per comment, so entries do not overwrite each other
                    comment = dict()
                    comment['url'] = 'http:' + attr.xpath('.//div/strong/a/@href').extract()[0]
                    comment['content'] = attr.xpath('.//div/strong/a/text()').extract()[0]
                    comment['time'] = attr.xpath('.//div/span[2]/text()').extract()[0]
                    comments.append(comment)
            item['comments'] = comments
        except Exception as ex:
            print('something wrong', str(ex))
        else:
            print('success, go for next')
        yield item
        next_url = self.get_next_url(response.url)  # response.url is the URL that was requested
        if next_url is not None:  # a new URL was returned
            yield Request(next_url, callback=self.parse,
                          headers=self.headers, cookies=self.cookies, meta=self.meta)
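The `cookies` attribute in the spider expects a dict, while the browser shows cookies as a single `name=value; name2=value2` string. Here is a minimal sketch of that conversion, assuming the simple semicolon-separated format. The function name `cookie_string_to_dict` and the sample values are hypothetical.

```python
def cookie_string_to_dict(cookie_string):
    """Turn 'a=1; b=2' (as copied from the browser) into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in cookie_string.split(";"):
        pair = pair.strip()
        if not pair or "=" not in pair:
            continue  # skip empty or malformed fragments
        # Split on the first '=' only: cookie values may contain '=' themselves.
        name, value = pair.split("=", 1)
        cookies[name.strip()] = value.strip()
    return cookies


if __name__ == "__main__":
    # Made-up cookie values, for illustration only.
    print(cookie_string_to_dict("user=alice; token=abc=="))
```

The resulting dict can be pasted into `cookies = {...}` in the spider, or built once at import time from a string kept in a local file.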
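2. Storage pipeline: pipelines
Tips:
1. This pipeline stores the scraped data in MongoDB, which is more dependable than writing local files, especially with multiple instances or a distributed setup.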
# -*- coding: utf-8 -*-
import pymongo
from datetime import datetime
from scrapy.exceptions import DropItem


class JDBookPipeline(object):
    def __init__(self, mongo_uri, mongo_db, mongo_coll):
        self.ids = set()  # ids already stored during this run
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.mongo_coll = mongo_coll

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings from the project's settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB'),
            mongo_coll=crawler.settings.get('MONGO_COLL')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        # If the database requires a username and password (older PyMongo only;
        # authenticate() was removed in PyMongo 4, where the credentials go
        # into MONGO_URI instead):
        # self.client.admin.authenticate(settings['MONGO_USER'], settings['MONGO_PSW'])
        self.db = self.client[self.mongo_db]
        self.coll = self.db[self.mongo_coll]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        if item['_id'] in self.ids:
            raise DropItem("Duplicate item found: %s" % item)
        # '图书' is the "Books" channel name as it appears on the page
        if item.get('channel') != u'图书':
            raise DropItem("Not a book: %s" % item.get('url'))
        # self.coll.insert_one(dict(item))
        # If you do not want to hard-code the collection name:
        self.ids.add(item['_id'])
        collection_name = item.__class__.__name__ + '_' + str(datetime.now().date()).replace('-', '')
        self.db[collection_name].insert_one(dict(item))
        return item
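Two details of the pipeline are easy to try in isolation: the in-memory duplicate check and the per-day collection name. The sketch below reproduces only that logic in plain Python, without MongoDB or Scrapy. `SeenIds` and `daily_collection_name` are illustrative names, not part of either library.

```python
from datetime import date


class SeenIds:
    """In-memory duplicate check, like the `self.ids` set in the pipeline."""

    def __init__(self):
        self._ids = set()

    def is_new(self, item_id):
        """Return True the first time an id is seen, False afterwards."""
        if item_id in self._ids:
            return False
        self._ids.add(item_id)
        return True


def daily_collection_name(item_class_name, day):
    """Build names such as 'JDBookItem_20161116' (class name + YYYYMMDD)."""
    return "%s_%s" % (item_class_name, day.strftime("%Y%m%d"))


if __name__ == "__main__":
    seen = SeenIds()
    print(seen.is_new("11678007"), seen.is_new("11678007"))
    print(daily_collection_name("JDBookItem", date(2016, 11, 16)))
```

Because the set lives in process memory, it is reset on every run and is not shared between machines. MongoDB still enforces uniqueness of `_id` within one collection, but the same book can be stored again in the next day's collection.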
3. Data structure: items
Tips:
1. Scrapy's Item declarations will look familiar right away: they closely resemble Django models.
# -*- coding: utf-8 -*-
import scrapy


class JDBookItem(scrapy.Item):
    _id = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
    keywords = scrapy.Field()
    description = scrapy.Field()
    img = scrapy.Field()
    channel = scrapy.Field()
    tag = scrapy.Field()
    sub_tag = scrapy.Field()
    value = scrapy.Field()
    comments = scrapy.Field()
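4. Deploying with scrapyd
Many people want to build a distributed crawler, for example by launching Scrapy crawls from Celery tasks.
Unfortunately, that is not easy to do with Scrapy alone. A better approach is to manage crawl jobs with scrapyd.
Make sure the following three packages are installed in your Python environment.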
source kangaroo.env/bin/activate
pip install scrapy scrapyd scrapyd-client
Start the scrapyd daemon from your spider project directory.
scrapyd
Next, register your spider. First write the configuration file scrapy.cfg:
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]
default = jdbook.settings

[deploy:jdbook]
url = http://localhost:6800/
project = jdbook
Now register it:
# Register the spider
scrapyd-deploy -p jdbook -d jdbook
# List the configured deploy targets
scrapyd-deploy -l
Output: jdbook http://localhost:6800/
The spider is now registered.
Start/stop the crawler:
curl -XPOST http://10.94.99.55:6800/schedule.json? -d project=jdbook -d spider=jdbook
Output: {"status": "ok", "jobid": "9d50b3dcabfc11e69aa3525400128d39", "node_name": "kvm33093.sg"}
curl -XPOST http://10.94.99.55:6800/cancel.json? -d project=jdbook -d job=9d50b3dcabfc11e69aa3525400128d39
Output: {"status": "ok", "prevstate": "running", "node_name": "kvm33093.sg"}
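With that, you can trigger the crawler from a Celery task simply by sending the requests shown above.
Because each crawler can live on a different machine, this gives you distributed crawling.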
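The two curl calls translate directly into Python. The following is a minimal sketch using only the standard library, assuming scrapyd answers with the JSON shown in the outputs above. The function names `build_scrapyd_request`, `schedule_spider`, and `cancel_job` are hypothetical helpers, not part of scrapyd or Celery.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_scrapyd_request(base_url, endpoint, **params):
    """Build a POST request for a scrapyd JSON endpoint without sending it."""
    url = base_url.rstrip("/") + "/" + endpoint
    body = urlencode(params).encode("utf-8")  # form-encoded, like curl -d
    return Request(url, data=body, method="POST")


def schedule_spider(base_url, project, spider):
    """POST to schedule.json and return the job id reported by scrapyd."""
    request = build_scrapyd_request(base_url, "schedule.json",
                                    project=project, spider=spider)
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))["jobid"]


def cancel_job(base_url, project, job_id):
    """POST to cancel.json and return the decoded JSON reply."""
    request = build_scrapyd_request(base_url, "cancel.json",
                                    project=project, job=job_id)
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

A Celery task body could then call something like `schedule_spider("http://10.94.99.55:6800", "jdbook", "jdbook")` for whichever node should run the job, and keep the returned job id in case the crawl has to be cancelled later.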