Python Scrapy Crawler Framework by Example (Part 1)

Earlier posts covered the basics of Scrapy but included no worked example, so here is a small one for reference and study.

Note: from here on the Python version is not called out; Python 3.x is assumed.

Crawl target

Here we simply pick an image site and scrape some information about its pictures.

Site URL: http://www.58pic.com/c/

Creating the project

Run the following command in a terminal:

scrapy startproject AdilCrawler

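Once the command finishes, Scrapy has generated the skeleton of a project named AdilCrawler.

As the command's output suggests, you can cd into the project and run scrapy genspider example example.com to create a spider file named example for the domain example.com.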

Writing items.py

For now, scrape only simple information such as the picture's author name and theme.

    # -*- coding: utf-8 -*-
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html

    import scrapy


    class AdilcrawlerItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        author = scrapy.Field()  # author
        theme = scrapy.Field()   # theme

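Writing the spider file

Go into the AdilCrawler directory and create a basic spider class with this command:

    scrapy genspider thousandPic www.58pic.com

Here thousandPic is the spider name and www.58pic.com is the domain the spider is allowed to crawl.

The command creates a file named thousandPic.py in the spiders folder. Now edit it: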

    # -*- coding: utf-8 -*-
    import scrapy


    # A first try at a spider
    class ThousandpicSpider(scrapy.Spider):
        name = 'thousandPic'
        allowed_domains = ['www.58pic.com']
        start_urls = ['http://www.58pic.com/c/']

        def parse(self, response):
            '''
            Inspecting the page gives this XPath for one picture's author:
            /html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()
            The page holds many pictures, told apart by the index i in
            /html/body/div[4]/div[3]/div[i]. Leaving i out makes the
            expression match every picture on the current page.
            '''
            # author: the picture's author
            # theme: the picture's theme
            author = response.xpath('/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()').extract()
            theme = response.xpath('/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()').extract()

            # Use the spider's log method to print the scraped content to the console.
            self.log(author)
            self.log(theme)

            # Print the scraped content one picture at a time;
            # the current page holds 20 pictures.
            for i in range(1, 21):
                print(i, ' **** ', theme[i - 1], ': ', author[i - 1])
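The docstring's point about leaving out the index, and a way to pair the two lists without hard-coding range(1, 21), can be tried offline. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree (which understands a small XPath subset) rather than Scrapy's selectors, and the markup in SAMPLE is invented and far simpler than the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the real listing page.
SAMPLE = """
<body>
  <div><a><p><span>Sunset</span></p><p><span>alice</span></p></a></div>
  <div><a><p><span>Forest</span></p><p><span>bob</span></p></a></div>
  <div><a><p><span>Harbor</span></p><p><span>carol</span></p></a></div>
</body>
""".strip()

root = ET.fromstring(SAMPLE)

# With an index on div, only the second picture card is selected.
second_only = [s.text for s in root.findall("./div[2]/a/p[1]/span")]

# Without an index on div, every picture card matches.
themes = [s.text for s in root.findall("./div/a/p[1]/span")]
authors = [s.text for s in root.findall("./div/a/p[2]/span")]

# zip() pairs the two lists and stops at the shorter one, so the loop
# does not depend on the page holding exactly 20 pictures.
for n, (theme, author) in enumerate(zip(themes, authors), start=1):
    print(n, " **** ", theme, ": ", author)
```

With zip() the loop no longer raises IndexError if a page happens to list fewer pictures than expected.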

Run the command and check the printed output:

    scrapy crawl thousandPic

In the output, the lines marked DEBUG come from self.log().

Optimizing the code

Import the item class AdilcrawlerItem:

    # -*- coding: utf-8 -*-
    import scrapy
    # Either this import or the from-import below works. Which one PyCharm
    # resolves depends on whether AdilCrawler was opened as a project.
    # This form is the recommended one.
    import AdilCrawler.items as items
    # The from-import form requires AdilCrawler to be opened as a project.
    # from AdilCrawler.items import AdilcrawlerItem


    class ThousandpicSpider(scrapy.Spider):
        name = 'thousandPic'
        allowed_domains = ['www.58pic.com']
        start_urls = ['http://www.58pic.com/c/']

        def parse(self, response):
            '''
            Inspecting the page gives this XPath for one picture's author:
            /html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()
            The page holds many pictures, told apart by the index i in
            /html/body/div[4]/div[3]/div[i]. Leaving i out makes the
            expression match every picture on the current page.
            '''
            item = items.AdilcrawlerItem()

            # author: the picture's author
            # theme: the picture's theme
            author = response.xpath('/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()').extract()
            theme = response.xpath('/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()').extract()

            item['author'] = author
            item['theme'] = theme

            return item

Run the spider again to see the result.

Saving the results to a file

Run the following command:

    scrapy crawl thousandPic -o items.json

This generates an items.json file.
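Because parse() returns a single item whose author and theme fields are lists, the exported file holds one object with two parallel lists rather than one record per picture. A minimal sketch of reshaping that data after loading it, assuming the export has this shape; the sample string and its values are made up for illustration:

```python
import json

# Hypothetical content shaped like the exported items.json:
# a JSON array containing one item with two parallel lists.
exported = '[{"author": ["alice", "bob"], "theme": ["Sunset", "Forest"]}]'

items = json.loads(exported)

# Turn the parallel lists into one small record per picture.
records = [
    {"theme": theme, "author": author}
    for item in items
    for theme, author in zip(item["theme"], item["author"])
]

for record in records:
    print(record)
```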

Optimizing again with the ItemLoader helper class

Use ItemLoader to replace the scattered extract() and xpath() calls.

The code is as follows:

    # -*- coding: utf-8 -*-
    import scrapy
    from AdilCrawler.items import AdilcrawlerItem
    # Import the ItemLoader helper class
    from scrapy.loader import ItemLoader


    # Optimized version of the spider
    class ThousandpicoptimizeSpider(scrapy.Spider):
        name = 'thousandPicOptimize'
        allowed_domains = ['www.58pic.com']
        start_urls = ['http://www.58pic.com/c/']

        def parse(self, response):
            '''
            Inspecting the page gives this XPath for one picture's author:
            /html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()
            The page holds many pictures, told apart by the index i in
            /html/body/div[4]/div[3]/div[i]. Leaving i out makes the
            expression match every picture on the current page.
            '''
            # Use the ItemLoader helper class in place of the messy-looking
            # extract() and xpath() calls.
            i = ItemLoader(item=AdilcrawlerItem(), response=response)

            # author: the picture's author
            # theme: the picture's theme
            i.add_xpath('author', '/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()')
            i.add_xpath('theme', '/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()')

            return i.load_item()

Writing the pipelines file

The default pipelines.py file:

    # -*- coding: utf-8 -*-
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


    class AdilcrawlerPipeline(object):
        def process_item(self, item, spider):
            return item

The optimized code is as follows:

    # -*- coding: utf-8 -*-
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

    import json


    class AdilcrawlerPipeline(object):
        '''
        Save item data to a file.
        '''

        def __init__(self):
            # An explicit encoding keeps non-ASCII text from failing on
            # platforms whose default encoding is not UTF-8.
            self.filename = open('thousandPic.json', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            # ensure_ascii=False keeps non-ASCII text readable in the file
            # instead of writing it as \uXXXX escapes.
            # Items are stored one dictionary at a time, so ',\n' is
            # appended to separate them and start a new line.
            text = json.dumps(dict(item), ensure_ascii=False) + ',\n'
            self.filename.write(text)
            return item

        def close_spider(self, spider):
            self.filename.close()
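A file made of dictionaries separated by ',\n' is not itself a valid JSON document. One alternative, sketched here under assumptions, is to write one JSON object per line (JSON Lines) so the file can be read back line by line. JsonLinesPipeline, its path parameter, and the sample values are hypothetical names for this sketch; the class is driven by hand so that it runs without Scrapy.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical variant of AdilcrawlerPipeline that writes JSON Lines."""

    def __init__(self, path="thousandPic.jl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Open the output file when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # One complete JSON object per line, non-ASCII text kept readable.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()


# Drive the pipeline by hand with a made-up item (no spider involved).
path = os.path.join(tempfile.mkdtemp(), "demo.jl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"author": ["小明"], "theme": ["日落"]}, spider=None)
pipeline.close_spider(spider=None)

# Every line parses on its own with json.loads().
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
loaded = [json.loads(line) for line in lines]
print(loaded)
```

To try it in the project, the class would go into pipelines.py and be listed in ITEM_PIPELINES in place of AdilcrawlerPipeline.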

Configuring settings

Edit the settings.py configuration file.

Find the pipelines setting and modify it:

    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    # ITEM_PIPELINES = {
    #     'AdilCrawler.pipelines.AdilcrawlerPipeline': 300,
    # }

    # To enable a pipeline, it must be added to the ITEM_PIPELINES setting.
    # Here AdilCrawler is the project package, pipelines is the pipeline
    # module, and AdilcrawlerPipeline is the class name.
    ITEM_PIPELINES = {
        'AdilCrawler.pipelines.AdilcrawlerPipeline': 300,
    }
    # Once added, the pipeline is enabled: running the spider now passes each
    # item to the listed class and calls its methods, such as the
    # save-to-file logic above.
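The number assigned to each pipeline sets its order: items pass through lower-valued pipelines first, and the values are customarily chosen in the 0-1000 range. A small sketch of how two entries would be ordered; DuplicatesPipeline is a hypothetical second class that does not exist in this project:

```python
ITEM_PIPELINES = {
    "AdilCrawler.pipelines.AdilcrawlerPipeline": 300,
    # Hypothetical second pipeline, shown only to illustrate ordering.
    "AdilCrawler.pipelines.DuplicatesPipeline": 200,
}

# Lower numbers run first, so sorting by value gives the processing order.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for position, path in enumerate(run_order, start=1):
    print(position, path)
```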

Run the command:

    scrapy crawl thousandPicOptimize

After the run, the file thousandPic.json is created and contains the saved data.

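Paginated crawling with the CrawlSpider class

Create a CrawlSpider from the crawl template

Run the following command:

    scrapy genspider -t crawl thousandPicPaging www.58pic.com

The items.py file stays unchanged. The generated spider file thousandPicPaging.py looks like this: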

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class ThousandpicpagingSpider(CrawlSpider):
        name = 'thousandPicPaging'
        allowed_domains = ['www.58pic.com']
        start_urls = ['http://www.58pic.com/']

        rules = (
            Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            i = {}
            #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
            #i['name'] = response.xpath('//div[@id="name"]').extract()
            #i['description'] = response.xpath('//div[@id="description"]').extract()
            return i

After modification:

    # -*- coding: utf-8 -*-
    import scrapy
    # Import the link extractor, used to pull out links that match a rule
    from scrapy.linkextractors import LinkExtractor
    # Import the CrawlSpider class and Rule
    from scrapy.spiders import CrawlSpider, Rule
    import AdilCrawler.items as items


    class ThousandpicpagingSpider(CrawlSpider):
        name = 'thousandPicPaging'
        allowed_domains = ['www.58pic.com']
        # Changed start page
        start_urls = ['http://www.58pic.com/c/']

        # Link extraction rule applied to each response; it yields the links
        # that match. Pagination links look like
        # http://www.58pic.com/c/1-0-0-03.html, so the page part
        # 1-0-0-03 becomes the regex \S-\S-\S-\S\S.
        # allow is used here instead of restrict_xpaths; with restrict_xpaths
        # the regex did not take effect in this case.
        page_link = LinkExtractor(allow=r'http://www.58pic.com/c/\S-\S-\S-\S\S.html',
                                  allow_domains='www.58pic.com')

        rules = (
            # Request each extracted link in turn, keep following links from
            # those pages, and handle every response with the given callback.
            Rule(page_link, callback='parse_item', follow=True),
            # Note the trailing ',' above: without it an error is raised.
        )

        # parse_item() alone does not scrape the first page. parse_start_url
        # is a CrawlSpider method that handles the start_urls responses, so
        # override it and reuse the same extraction logic.
        def parse_start_url(self, response):
            return self.parse_item(response)

        # The callback named in the rule
        def parse_item(self, response):
            i = items.AdilcrawlerItem()

            author = response.xpath('/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()').extract()
            theme = response.xpath('/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()').extract()

            i['author'] = author
            i['theme'] = theme

            yield i
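The remark about the ',' in rules comes down to plain Python syntax: parentheses around a single value do not make a tuple, the comma does. A tiny illustration, with a string standing in for the Rule object:

```python
# A string stands in for the Rule(...) object in this illustration.
placeholder_rule = "Rule(...)"

with_comma = (placeholder_rule,)     # a one-element tuple
without_comma = (placeholder_rule)   # just the value in parentheses

print(type(with_comma).__name__)     # tuple
print(type(without_comma).__name__)  # str
```

Without the comma, rules would be a single Rule object rather than a sequence of rules, which is why the spider fails.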

Run it again:

    scrapy crawl thousandPicPaging

The output shows content from 4 pages.
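The allow pattern can be checked on its own with Python's re module before running a crawl. The sketch below assumes a search-style match against the absolute URL. The stricter variant (escaped dots, digits only, anchored at the end) is a hypothetical alternative that assumes every page segment is numeric, and odd_url is a made-up address used only to show the difference.

```python
import re

# The pattern from the spider, written as a raw string.
loose = re.compile(r"http://www.58pic.com/c/\S-\S-\S-\S\S.html")

# Hypothetical stricter variant: escaped dots, digits only, anchored at the end.
strict = re.compile(r"http://www\.58pic\.com/c/\d-\d-\d-\d\d\.html$")

page_url = "http://www.58pic.com/c/1-0-0-03.html"
# Made-up URL that is not a real listing page.
odd_url = "http://www.58pic.com/c/a-b-c-ddXhtml"

for url in (page_url, odd_url):
    print(url, bool(loose.search(url)), bool(strict.search(url)))
```

Both patterns accept the pagination URL, but only the loose one also accepts the made-up address, because an unescaped dot matches any character and \S matches any non-whitespace character.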

Optimizing again: introducing the ItemLoader class

    # -*- coding: utf-8 -*-
    import scrapy
    # Import the link extractor, used to pull out links that match a rule
    from scrapy.linkextractors import LinkExtractor
    from scrapy.loader import ItemLoader
    # Import the CrawlSpider class and Rule
    from scrapy.spiders import CrawlSpider, Rule
    import AdilCrawler.items as items


    class ThousandpicpagingopSpider(CrawlSpider):
        name = 'thousandPicPagingOp'
        allowed_domains = ['www.58pic.com']
        # Changed start page
        start_urls = ['http://www.58pic.com/c/']

        # Link extraction rule applied to each response; it yields the links
        # that match. Pagination links look like
        # http://www.58pic.com/c/1-0-0-03.html, so the page part
        # 1-0-0-03 becomes the regex \S-\S-\S-\S\S.
        # allow is used here instead of restrict_xpaths; with restrict_xpaths
        # the regex did not take effect in this case.
        page_link = LinkExtractor(allow=r'http://www.58pic.com/c/\S-\S-\S-\S\S.html',
                                  allow_domains='www.58pic.com')

        rules = (
            # Request each extracted link in turn, keep following links from
            # those pages, and handle every response with the given callback.
            Rule(page_link, callback='parse_item', follow=True),
            # Note the trailing ',' above: without it an error is raised.
        )

        # parse_item() alone does not scrape the first page. parse_start_url
        # is a CrawlSpider method that handles the start_urls responses, so
        # override it and reuse the same extraction logic.
        def parse_start_url(self, response):
            return self.parse_item(response)

        # The callback named in the rule
        def parse_item(self, response):
            i = ItemLoader(item=items.AdilcrawlerItem(), response=response)
            i.add_xpath('author', '/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()')
            i.add_xpath('theme', '/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()')
            yield i.load_item()

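The result is the same.

Finally, a quick plug for an online regular expression tester, which is handy for checking patterns such as the pagination regex used here: http://tool.oschina.net/regex/

That completes a simple crawl of basic information from one site. More topics will follow in later posts.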
