Using CrawlSpider

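CrawlSpider is a commonly used spider for crawling sites whose links follow regular patterns. It is built on Spider and adds a few attributes and methods of its own: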

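rules: a collection of Rule objects that decide which links on the target site are followed and which are filtered out.
parse_start_url: called for the responses to the start URLs; it must return an Item, a Request, or an iterable of either.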

Each entry in rules is a Rule object.

Parameters of Rule

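link_extractor: a LinkExtractor object
callback=None: the callback function
follow=None: whether to keep following links from the matched pages
process_links=None: an optional callback that intercepts the full list of extracted links
process_request=identity: an optional callback that intercepts each Request object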

The link_extractor can be a class you define yourself or the built-in LinkExtractor class, whose main parameters are:

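*   allow: URLs matching the given regular expression (or list of regular expressions) are extracted; if empty, every URL matches.
*   deny: URLs matching this regular expression (or list of regular expressions) are never extracted.
*   allow_domains: the domains whose links will be extracted.
*   deny_domains: the domains whose links will never be extracted.
*   **restrict_xpaths**: **XPath** expressions that work together with **allow** to filter links by limiting extraction to the selected regions. There is also a similar restrict_css.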
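
How these parameters combine can be illustrated with a simplified stand-in written with the standard library only. This is a sketch of the filtering idea, not Scrapy's actual implementation; the helper names `link_is_extracted` and `_host_in_domains` and the sample URLs are made up for illustration.

```python
import re
from urllib.parse import urlparse


def _host_in_domains(host, domains):
    """True if host equals one of the domains or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def link_is_extracted(url, allow=(), deny=(), allow_domains=(), deny_domains=()):
    """Hypothetical helper mimicking the allow/deny semantics described above."""
    host = urlparse(url).hostname or ""
    # An empty allow list means "match everything".
    if allow and not any(re.search(p, url) for p in allow):
        return False
    # deny takes precedence: any match excludes the URL.
    if any(re.search(p, url) for p in deny):
        return False
    if allow_domains and not _host_in_domains(host, allow_domains):
        return False
    if deny_domains and _host_in_domains(host, deny_domains):
        return False
    return True


urls = [
    "http://top.chinaz.com/hangye/index_yule.html",
    "http://top.chinaz.com/about.html",
    "http://ads.example.com/hangye/index_x.html",
]
kept = [u for u in urls
        if link_is_extracted(u, allow=[r"/hangye/index_.*?html"],
                             allow_domains=["chinaz.com"])]
print(kept)
```

Only the first URL survives: the second fails the allow pattern and the third is outside the allowed domain.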

Taking Chinaz as an example:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from china.items import ChinazprojectWebInfoItem

class ChinazSpider(CrawlSpider):
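    # Spider name
    name = 'chinaz'
    # Domains the spider is allowed to crawl
    allowed_domains = ['chinaz.com']
    # Start URL
    start_urls = ['http://top.chinaz.com/hangyemap.html']
    # rules holds the customized link-extraction Rule objects (a list or a tuple).
    # For every link extracted by a rule, CrawlSpider builds a Request and hands it to the engine.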
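    '''
        LinkExtractor: defines the rules (regular expressions) for extracting links
        allow=(): URL patterns that may be extracted
        restrict_xpaths=(): XPath expressions selecting the regions to extract links from
        restrict_css=(): CSS selectors selecting the regions to extract links from
        deny=(): URL patterns that must not be extracted (takes precedence over allow)
        allow_domains=(): domains whose links may be extracted
        deny_domains=(): domains whose links must not be extracted (takes precedence over allow_domains)
        unique=True: keep only one copy when the same URL appears several times
        strip=True: defaults to True; strips leading and trailing whitespace from URLs
    '''
    '''
    Rule
        link_extractor: a LinkExtractor object
        callback=None: the callback function
        follow=None: whether to keep following links from the matched pages
        process_links=None: an optional callback that intercepts the full list of extracted links
        process_request=identity: an optional callback that intercepts each Request object
    '''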
    rules = (
        # Rule object (the 'music websites' category)
        Rule(LinkExtractor(allow=r'http://top.chinaz.com/hangye/index_.*?html',
                           restrict_xpaths='//div[@class="Taright"]'),
             callback='parse_item',
             follow=True),
    )
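    # Note: never define a callback named parse in a CrawlSpider; CrawlSpider uses parse itself to apply the rules
    def parse_item(self, response):
        '''
        Parse the data on each paginated listing page and fill it into items
        '''
        print(response.status, response.url)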
        webInfos = response.xpath('//ul[@class="listCentent"]/li')
        for webInfo in webInfos:
            web_item = ChinazprojectWebInfoItem()
            # Cover image
            web_item['coverImage'] = webInfo.xpath('.//div[@class="leftImg"]/a/img/@src').extract_first('')
            # Title
            web_item['title'] = webInfo.xpath('.//h3[@class="rightTxtHead"]/a/text()').extract_first('')
            # Domain
            web_item['domenis'] = webInfo.xpath('.//h3[@class="rightTxtHead"]/span[@class="col-gray"]/text()').extract_first('')
            # Weekly rank
            web_item['weekRank'] = webInfo.xpath('.//div[@class="RtCPart clearfix"]/p[1]/a/text()').extract_first('')
            # Number of backlinks
            web_item['ulink'] = webInfo.xpath('.//div[@class="RtCPart clearfix"]/p[4]/a/text()').extract_first('')
            # Site description
            web_item['info'] = webInfo.xpath('.//p[@class="RtCInfo"]/text()').extract_first('')
            # Score (re_first avoids an IndexError when no digits are found)
            web_item['score'] = webInfo.xpath('.//div[@class="RtCRateCent"]/span/text()').re_first(r'\d+', default='')
            # Rank
            web_item['rank'] = webInfo.xpath('.//div[@class="RtCRateCent"]/strong/text()').extract_first('')
            print(web_item)
            yield web_item

How does CrawlSpider work?

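Because CrawlSpider inherits from Spider, it has all of Spider's methods.
First, start_requests issues a request for each URL in start_urls (via make_requests_from_url), and the response is received by parse. In a plain Spider we have to define parse ourselves, but CrawlSpider defines parse itself and uses it to parse the response: self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True).
_parse_response then does different things depending on whether a callback was given and on the values of follow and self._follow_links.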

For example:

    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        # If a callback was passed in, use it to parse the page and collect the resulting requests or items
        if callback:
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item
        # Then check follow: if set, _requests_to_follow looks for links in the response that match the rules
        if follow and self._follow_links:
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item

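_requests_to_follow in turn takes the links that link_extractor (the LinkExtractor we passed in) extracts from the page (link_extractor.extract_links(response)), passes them through process_links (a user-defined hook) for any URL processing, and issues a Request for each remaining link. Each Request is then passed through process_request (also a user-defined hook).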
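
The order of these steps can be modelled with a small, self-contained sketch. It is a simplified stand-in, not Scrapy's code: `Link` and `Request` here are plain namedtuples, and the two hooks `drop_login_links` and `force_https` are hypothetical examples of what a process_links and a process_request callback might do.

```python
from collections import namedtuple

# Minimal stand-ins for Scrapy's Link and Request objects (illustration only).
Link = namedtuple("Link", "url")
Request = namedtuple("Request", "url callback")


def requests_to_follow(links, process_links=None, process_request=None,
                       callback="_response_downloaded"):
    """Simplified model: filter links, build requests, post-process each one."""
    if process_links is not None:
        links = process_links(links)            # hook 1: rewrite or drop links
    for link in links:
        request = Request(url=link.url, callback=callback)
        if process_request is not None:
            request = process_request(request)  # hook 2: tweak or drop the request
        if request is not None:
            yield request


def drop_login_links(links):
    """Hypothetical process_links hook: discard links to login pages."""
    return [link for link in links if "login" not in link.url]


def force_https(request):
    """Hypothetical process_request hook: rewrite the URL scheme."""
    return request._replace(url=request.url.replace("http://", "https://", 1))


links = [Link("http://example.com/a"),
         Link("http://example.com/login"),
         Link("http://example.com/b")]
result = list(requests_to_follow(links, drop_login_links, force_https))
print([r.url for r in result])
```

The login link is removed by the first hook before any request is built, and the second hook rewrites the two requests that remain.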

How does CrawlSpider get its rules?

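CrawlSpider calls _compile_rules in its __init__ method. That method makes a shallow copy of each Rule in rules and resolves the callback, the link-processing hook (process_links) and the request-processing hook (process_request) to callable methods.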

For example:

    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, six.string_types):
                return getattr(self, method, None)

        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)
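
The effect of get_method is easiest to see in isolation. The sketch below is a stripped-down imitation under stated assumptions: `MiniRule` and `MiniSpider` are hypothetical classes that keep only the callback attribute, and `str` replaces `six.string_types` so that it runs on Python 3.

```python
import copy


class MiniRule:
    """Hypothetical, stripped-down Rule holding only a callback."""
    def __init__(self, callback=None):
        self.callback = callback


class MiniSpider:
    # Callbacks may be given as a method name, a callable, or omitted.
    rules = (MiniRule(callback="parse_item"), MiniRule(callback=len), MiniRule())

    def __init__(self):
        self._compile_rules()

    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            if isinstance(method, str):
                # Resolve the name to a bound method of this spider.
                return getattr(self, method, None)
            return None

        # Shallow copies keep the class-level rules untouched.
        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)

    def parse_item(self, response):
        return "parsed " + response


spider = MiniSpider()
print(spider._rules[0].callback("page"))   # bound method resolved from the string
print(MiniSpider.rules[0].callback)        # the class-level rule still holds the string
```

This is why `callback='parse_item'` can be written as a string in rules: the name is only turned into a bound method once the spider instance exists.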

How `Rule` is defined

    class Rule(object):

        def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity):
            self.link_extractor = link_extractor
            self.callback = callback
            self.cb_kwargs = cb_kwargs or {}
            self.process_links = process_links
            self.process_request = process_request
            if follow is None:
                self.follow = False if callback else True
            else:
                self.follow = follow

This is why the LinkExtractor we construct ends up in link_extractor.
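
The follow default in that constructor is easy to overlook, so here is the same branch pulled out into a standalone function. `resolve_follow` is a hypothetical name used only for this illustration.

```python
def resolve_follow(callback=None, follow=None):
    """Mirror of the follow-defaulting branch in Rule.__init__ shown above."""
    if follow is None:
        # No explicit value: follow only when there is no callback.
        return False if callback else True
    return follow


print(resolve_follow())                                   # no callback, no follow given
print(resolve_follow(callback="parse_item"))              # callback given, follow omitted
print(resolve_follow(callback="parse_item", follow=True)) # explicit follow wins
```

This is the reason the Chinaz rule sets follow=True explicitly: with a callback and no follow argument, the rule would stop following links from the matched pages.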

Responses with a callback are handled by the specified function. Which function handles the ones without a callback?

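`_parse_response` handles responses that have a `callback`:
cb_res = callback(response, **cb_kwargs) or ()
`_requests_to_follow`, on the other hand, passes `self._response_downloaded` as the `callback` of the requests it issues for the matching URLs on the page:
r = Request(url=link.url, callback=self._response_downloaded)
Every followed response therefore goes through `_response_downloaded` first, which hands it to `_parse_response` together with the rule's own callback; when the rule has no callback, only the link-following step runs.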
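
A toy model of this dispatch, using only plain functions and dictionaries, may make the two cases clearer. It is a simplified sketch rather than Scrapy's implementation: rules are plain dicts, responses are plain strings, and `fake_links` and `parse_item` are hypothetical helpers.

```python
def parse_response(response, callback, follow, extract_links):
    """Simplified model of _parse_response: run the callback, then follow links."""
    if callback:
        for item in callback(response) or ():
            yield ("item", item)
    if follow:
        for url in extract_links(response):
            yield ("request", url)


def response_downloaded(response, rule, extract_links):
    """Simplified model of _response_downloaded: use the rule's own settings."""
    return parse_response(response, rule["callback"], rule["follow"], extract_links)


def fake_links(response):
    """Hypothetical link extractor: every page links to one more page."""
    return ["http://example.com/next"]


def parse_item(response):
    """Hypothetical callback producing one item per page."""
    return [{"url": response}]


rule_without_callback = {"callback": None, "follow": True}
rule_with_callback = {"callback": parse_item, "follow": False}

only_requests = list(response_downloaded("http://example.com/", rule_without_callback, fake_links))
only_items = list(response_downloaded("http://example.com/", rule_with_callback, fake_links))
print(only_requests)
print(only_items)
```

The rule without a callback yields only new requests, while the rule with a callback and follow=False yields only items.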

How to simulate login in a CrawlSpider

Like Spider, CrawlSpider issues its first requests through start_requests, so that is the method to override.

Taking Zhihu as an example:

from scrapy.http import FormRequest, Request

# The methods below belong inside the CrawlSpider subclass.
# Override the default start_requests; the callback is post_login.
def start_requests(self):
    return [Request("http://www.zhihu.com/#signin", meta={'cookiejar': 1}, callback=self.post_login)]

def post_login(self, response):
    print('Preparing login')
    # Grab the _xsrf field from the returned page; it is needed to submit the form successfully
    xsrf = response.xpath('//input[@name="_xsrf"]/@value').extract_first('')
    print(xsrf)
    # FormRequest.from_response is a Scrapy helper for posting forms.
    # After a successful login, the after_login callback is called.
    # self.headers is assumed to be defined on the spider.
    return [FormRequest.from_response(response,   # "http://www.zhihu.com/login",
                        meta={'cookiejar': response.meta['cookiejar']},
                        headers=self.headers,
                        formdata={
                            '_xsrf': xsrf,
                            'email': '[email protected]',
                            'password': '********'
                        },
                        callback=self.after_login,
                        dont_filter=True
                        )]

# Requests built by make_requests_from_url are handled by parse, which hooks back into CrawlSpider's own parse.
# Note: make_requests_from_url is deprecated in later Scrapy releases.
def after_login(self, response):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)
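
To see what the _xsrf lookup in post_login is doing without running a crawl, the same extraction can be tried on a static HTML string with the standard library. This is only an illustrative sketch: `HiddenInputParser`, `extract_hidden_field` and the sample form are hypothetical, and the real spider uses the XPath expression shown above instead.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collects the value of the first <input> with the given name."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.value = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("name") == self.field_name and self.value is None:
            self.value = attrs.get("value")


def extract_hidden_field(html, field_name):
    """Return the field's value, or None if the form has no such input."""
    parser = HiddenInputParser(field_name)
    parser.feed(html)
    return parser.value


# Made-up login form for illustration.
sample = ('<form><input type="hidden" name="_xsrf" value="abc123"/>'
          '<input name="email"/></form>')
token = extract_hidden_field(sample, "_xsrf")
print(token)
```

The token found this way is what gets sent back in the `_xsrf` entry of formdata.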

This is only a personal study summary; if you spot any mistakes, corrections are welcome.