Distributed Python crawler: scraping a single web page with Scrapy

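Target site: http://blog.jobbole.com/
Crawl strategy: follow the pagination of the all-posts listing and scrape it page by page.
Approach 1: change the page number in the URL, e.g. http://blog.jobbole.com/all-posts/page/8/
Drawback: whenever the total number of pages changes, the source code has to be edited.
Approach 2: extract the "next page" link from each page in turn, so the code needs no changes when the pages change.
The rest of this post uses approach 2.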
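
Approach 2 can be sketched roughly as follows. This is only an illustration under stated assumptions: the listing fragment and the class name "next page-numbers" are made up for the example, and the standard library's ElementTree stands in for Scrapy's selectors, so it only handles well-formed markup.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical pagination fragment; the real site's markup may differ.
LISTING_HTML = """
<div class="navigation">
  <a class="page-numbers" href="/all-posts/page/7/">7</a>
  <a class="next page-numbers" href="/all-posts/page/9/">Next</a>
</div>
""".strip()


def next_page_url(page_url, html_fragment):
    """Return the absolute URL of the next page, or None on the last page."""
    root = ET.fromstring(html_fragment)
    # Assumed convention: the "next" link carries this exact class value.
    link = root.find(".//a[@class='next page-numbers']")
    if link is None:
        return None
    # Resolve a relative href against the URL of the page it was found on.
    return urljoin(page_url, link.get("href"))


current = "http://blog.jobbole.com/all-posts/page/8/"
print(next_page_url(current, LISTING_HTML))
```

Because the next URL is read from the page itself, nothing has to change when the number of pages grows. In a spider the same idea is usually expressed by yielding a scrapy.Request for the extracted URL with parse as its callback.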

Preparation:
Create a new virtual environment:

C:\Users\wex>mkvirtualenv article
Using base prefix 'c:\\users\\wex\\appdata\\local\\programs\\python\\python35'
New python executable in C:\Users\wex\Envs\article\Scripts\python.exe
Installing setuptools, pip, wheel...done.

Create the project:

I:\python项目>scrapy startproject ArticleSpider
New Scrapy project 'ArticleSpider', using template directory 'c:\\users\\wex\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\scrapy\\templates\\project', created in:
    I:\python项目\ArticleSpider

You can start your first spider with:
    cd ArticleSpider
    scrapy genspider example example.com

Create the spider file from the default template:

I:\python项目\ArticleSpider>scrapy genspider jobbole blog.jobbole.com
Created spider 'jobbole' using template 'basic' in module:
  ArticleSpider.spiders.jobbole

To start a spider with Scrapy (the argument is the value of the spider's name attribute):

scrapy crawl jobbole

If the command fails with:

ImportError: No module named 'win32api'

install the missing package the error points to:

I:\python项目\ArticleSpider>pip install pypiwin32
Collecting pypiwin32
  Downloading pypiwin32-219-cp35-none-win_amd64.whl (8.6MB)
    100% |████████████████████████████████| 8.6MB 88kB/s
Installing collected packages: pypiwin32
Successfully installed pypiwin32-219

After this, the spider starts without the error.

Create a main.py file in PyCharm to run the spider.

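# execute() runs a scrapy command just as the command line would
from scrapy.cmdline import execute

import sys
import os

# Add the ArticleSpider project directory to the import path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

# Run the spider; equivalent to "scrapy crawl jobbole"
execute(["scrapy", "crawl", "jobbole"])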

Note this setting in settings.py:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

Setting it to False stops Scrapy from filtering out URLs that the site's robots.txt disallows.

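About XPath: overview, terminology, syntax

1. XPath uses path expressions to navigate XML and HTML documents
2. XPath includes a standard function library
3. XPath is a W3C standard

XPath node relationships
1. Parent
2. Children
3. Siblings
4. Ancestors
5. Descendants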

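Syntax:

Selecting nodes

Expression    Description
.             selects the current node
..            selects the parent of the current node
@             selects attributes

article       selects all child elements named article of the current node
/article      selects the root element article
article/a     selects all a elements that are children of article
//div         selects all div elements, wherever they appear in the document
//@class      selects all attributes named class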

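Predicates

A predicate finds a specific node, or nodes that contain a specific value.
Predicates are written inside square brackets.

/article/div[1]               selects the first div child of article
/article/div[last()]          selects the last div child of article
/article/div[last()-1]        selects the second-to-last div child of article
/article/div[position()<3]    selects the first two div children of article
//div[@lang]                  selects all div elements that have a lang attribute
//div[@lang='eng']            selects all div elements whose lang attribute is eng
/bookstore/book[price>35.00]          selects all book children of bookstore whose price element has a value greater than 35.00
/bookstore/book[price>35.00]/title    selects the title elements of all book children of bookstore whose price is greater than 35.00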
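
The positional and attribute predicates can be tried without a network request. The sketch below uses the standard library's ElementTree, which supports only a subset of XPath (no position()<3 and no price>35.00 comparisons), and the document in it is invented for the example. Paths are written relative to the root article element.

```python
import xml.etree.ElementTree as ET

# Made-up document: an article element with three div children.
DOC = """
<article>
  <div lang="eng">first</div>
  <div>second</div>
  <div lang="zh">third</div>
</article>
""".strip()

root = ET.fromstring(DOC)  # root is the article element

first = root.find("div[1]")                # first div child
last = root.find("div[last()]")            # last div child
second_last = root.find("div[last()-1]")   # second-to-last div child
with_lang = root.findall("div[@lang]")     # div children that have a lang attribute
eng = root.findall("div[@lang='eng']")     # div children whose lang is eng

print(first.text, last.text, second_last.text)
print([d.text for d in with_lang])
print([d.text for d in eng])
```

Scrapy's own selectors accept the full XPath 1.0 syntax, so the remaining rows of the table can be tested the same way inside scrapy shell.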

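Selecting unknown nodes

*         matches any element node
@*        matches any attribute node
node()    matches a node of any kind

The | operator in a path expression selects several paths at once.

/div/*                    selects all child elements of the root div element
//*                       selects all elements in the document
//div[@*]                 selects all div elements that have at least one attribute
/div/a | /div/p           selects all a and p children of the root div element
//span | //ul             selects all span and ul elements in the document
article/div/p | //span    selects all p children of div children of article, plus all span elements in the document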

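Note: what the browser shows under F12 is the page after JavaScript and CSS have been applied, so it may differ from what View Source shows.

We want to extract the following from the page:
title, date, comments, body text

Extracting values with XPath:
scrapy shell fetches the page once and then lets us debug interactively, which saves sending a new request for every attempt.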

scrapy shell http://blog.jobbole.com/110287/

>>> title = response.xpath('//*[@id = "post-110287"]/div[1]/h1/text()')
>>> title
[<Selector xpath='//*[@id = "post-110287"]/div[1]/h1/text()' data='2016 腾讯软件开发面试题（部分）'>]
>>> title.extract()
['2016 腾讯软件开发面试题（部分）']
>>> title.extract()[0]
'2016 腾讯软件开发面试题（部分）'

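Note: the value returned by xpath() can itself be queried with further xpath() calls, but once extract() is called the result is a plain list of strings.

In PyCharm: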

    def parse(self, response):
        # Requires "import re" at the top of the spider module.
        # Absolute path copied from the browser: /html/body/div[2]/div[2]/div[1]/div[1]/h1/
        # Path anchored on the post id:          //*[@id="post-110923"]/div[1]/h1/

        # Locate the title precisely through the element id
        title = response.xpath('//*[@id="post-110923"]/div[1]/h1/text()').extract()[0]
        # re_selector = response.xpath("/html/body/div[2]/div[2]/div[1]/div[1]/h1/")
        # The absolute path returns nothing because some of the divs are generated by JavaScript
        # re3_selector = response.xpath('//div[@class="entry-header"]/h1/text()')
        # This one locates the title through the class instead

        create_date = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].strip().replace("·", "").strip()
        praise_num = response.xpath('//span[contains(@class,"vote-post-up")]/h10/text()').extract()[0]
        collect_num = response.xpath('//span[contains(@class,"bookmark-btn")]/text()').extract()[0]
        # Lazy .*? so that (\d+) captures the whole number; a greedy .* would leave only the last digit
        match_re = re.match(r".*?(\d+).*", collect_num)
        if match_re:
            collect_num = match_re.group(1)
        # comment_num = response.xpath('//*[@id="post-110923"]/div[3]/div[3]/a/').extract()
        # The comment count comes back empty with this path

        content = response.xpath('//div[@class="entry"]').extract()[0]
        tag_list = response.xpath('//*[@id="post-110923"]/div[2]/p/a/text()').extract()
        tags = ','.join(tag_list)
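
When a count is pulled out of text such as " 12 收藏", the quantifier in front of the capture group matters. The sketch below compares a greedy and a lazy prefix; the sample text and the helper name extract_count are assumptions made for illustration, not part of the spider.

```python
import re


def extract_count(text, default=0):
    """Return the first integer found in text, or default if there is none."""
    # Lazy .*? stops at the first digit, so (\d+) captures the whole number.
    match = re.match(r".*?(\d+).*", text)
    return int(match.group(1)) if match else default


sample = " 12 收藏"  # hypothetical bookmark-button text

# Greedy .* swallows as much as it can and leaves a single digit for the group.
print(re.match(r".*(\d+).*", sample).group(1))  # prints 2
print(extract_count(sample))                     # prints 12
print(extract_count(" 收藏"))                     # prints 0
```

Returning a default also covers posts with no bookmarks yet, where the button text contains no digits at all.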

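CSS selectors

Expression            Description
*                     selects all nodes
#container            selects the node whose id is container
.container            selects all nodes whose class contains container
li a                  selects all a nodes under any li
ul + p                selects the p element immediately after a ul (adjacent sibling)
div#container > ul    selects the ul elements that are direct children of the div whose id is container

ul ~ p                          selects all p siblings that follow a ul
a[title]                        selects all a elements that have a title attribute
a[href="http://jobbole.com"]    selects all a elements whose href is exactly http://jobbole.com
a[href*="jobbole"]              selects all a elements whose href contains jobbole
a[href^="http"]                 selects all a elements whose href starts with http
a[href$=".jpg"]                 selects all a elements whose href ends with .jpg
input[type=radio]:checked       selects the radio inputs that are checked

div:not(#container)    selects all div elements whose id is not container
li:nth-child(3)        selects every li that is the third child of its parent
tr:nth-child(2n)       selects the even-numbered tr elements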

scrapy shell http://blog.jobbole.com/110923/

Run the following in that shell:

        # Same fields, extracted with CSS selectors
        title = response.css(".entry-header h1::text").extract()
        create_date = response.css("p.entry-meta-hide-on-mobile::text").extract()[0].strip().replace("·", "").strip()
        praise_num = response.css(".post-adds h10::text").extract()[0]
        collect_num = response.css(".bookmark-btn::text").extract()
        comment_num = response.css("a[href='#article-comment'] span::text").extract()
        content = response.css("div.entry").extract()
        tag_list = response.css("p.entry-meta-hide-on-mobile a::text").extract()

Note:
When the result is a list that may be empty, indexing it with [0] can raise an IndexError. To avoid that, use extract_first():

In [14]: response.css("p.entry-meta-hide-on-mobile a::text").extract_first("")
Out[14]: 'IT技术'

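Called with no argument, extract_first() returns None when nothing matches; the call above passes an empty string as the default instead.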
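
The difference between indexing with [0] and asking for a first value with a default can be shown in plain Python. The helper first_or_default below is hypothetical and only mimics the behaviour described here; it is not Scrapy's implementation.

```python
def first_or_default(values, default=None):
    """Return the first item of values, or default when values is empty."""
    for value in values:
        return value
    return default


matched = ["IT技术"]   # what extract() might return when the selector matches
empty = []             # what extract() returns when nothing matches

print(first_or_default(matched, ""))   # first item of the list
print(repr(first_or_default(empty)))   # None, the default when none is given
print(repr(first_or_default(empty, "")))

try:
    empty[0]
except IndexError:
    print("indexing an empty result raises IndexError")
```

Passing an empty string as the default keeps later string calls such as strip() safe, whereas None would need an explicit check first.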