Getting Started with Scrapy, a Python Web Scraping Framework

I. Installing the Scrapy Package

pip install Scrapy

II. Creating a Scrapy Project (tutorial)

scrapy startproject tutorial

The project directory contains the following:

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

III. Writing a Spider Under tutorial/spiders (quotes_spider.py)

1. Spider (version 1.0)

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')

The spider class QuotesSpider must subclass scrapy.Spider; this version defines the following attributes and methods:

name: the spider's name (it must be unique within a project)

start_requests(): kicks off the crawl (the method must return an iterable of Requests)

Each request is created with scrapy.Request(url=url, callback=self.parse), where callback names the method that will handle the response.

parse(): parses the response to each request (the response argument holds the downloaded page content)
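
To run it, go to the project's top-level directory and use Scrapy's crawl command with the spider's name:

scrapy crawl quotes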

2. Spider (version 2.0)

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

Only the start_urls attribute needs to be defined; there is no need to override start_requests(), because the inherited default generates the initial requests from start_urls, as sketched below.
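
A simplified sketch of that default (the real Scrapy implementation also handles a few deprecation cases):

def start_requests(self):
    for url in self.start_urls:
        # dont_filter=True lets the start URLs bypass the duplicate filter
        yield scrapy.Request(url, dont_filter=True)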

3. Spider (version 3.0)

HTML of a single quote on the page (simplified):

<div class="quote">
    <span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
    <span>by <small class="author">Albert Einstein</small>
    <a href="/author/Albert-Einstein">(about)</a></span>
    <div class="tags">
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

Spider code:

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

This spider extracts the text, author, and tags of every quote on the page and yields them as Python dicts, which Scrapy prints to the terminal log.
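
Run it the same way; adding -O additionally exports the yielded items to a file (here a hypothetical quotes.json, overwritten on each run):

scrapy crawl quotes -O quotes.json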

4. Spider (version 4.0)

HTML of the next-page link (simplified):

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

Spider code:

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

This spider extracts the text, author, and tags on each page and prints them to the terminal, then extracts the next-page URL and schedules a request for it, recursing until no next page remains.

# A quick test of the resolution rules (response.urljoin resolves against
# the response's own URL using these same rules):
from urllib.parse import urljoin

urljoin("http://www.xoxxoo.com/a/b/c.html", "d.html")
# result: 'http://www.xoxxoo.com/a/b/d.html'
urljoin("http://www.xoxxoo.com/a/b/c.html", "/d.html")
# result: 'http://www.xoxxoo.com/d.html'
urljoin("http://www.xoxxoo.com/a/b/c.html", "../d.html")
# result: 'http://www.xoxxoo.com/a/d.html'

5. Spider (version 5.0)

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

The difference between version 4.0 and version 5.0: scrapy.Request requires an absolute URL, whereas response.follow accepts a relative URL and returns a Request instance, so the explicit response.urljoin() call can be dropped.
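
In other words, inside parse() these two lines build the same request; the response.follow form simply saves the urljoin call:

# requires an absolute URL:
yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
# accepts the relative URL as-is:
yield response.follow(next_page, callback=self.parse)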

6. Spider (version 6.0)

import scrapy
class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']
    def parse(self, response):
        author_page_links = response.css('.author + a')
        yield from response.follow_all(author_page_links, self.parse_author)
        pagination_links = response.css('li.next a')
        yield from response.follow_all(pagination_links, self.parse)
    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()
        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

response.follow also accepts selectors directly; for an <a> element it uses the href attribute automatically, so response.css('li.next a::attr(href)') and response.css('li.next a') are interchangeable here.

The difference between response.follow_all and response.follow: the former returns an iterable of Request instances (one per link), while the latter returns a single Request instance.
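
A minimal sketch of the two forms, as they would appear inside parse():

# follow_all: one Request per matched link; <a> selectors can be passed directly
yield from response.follow_all(response.css('li.next a'), callback=self.parse)
# follow: a single Request from a single link (string, Link object, or <a> selector);
# note the [0] raises IndexError when nothing matches
yield response.follow(response.css('li.next a')[0], callback=self.parse)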

By default, Scrapy filters out requests for URLs it has already visited, avoiding duplicate requests; the filter implementation is configured by the DUPEFILTER_CLASS setting.
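
The setting lives in settings.py; the line below spells out the built-in default, and pointing it at your own subclass swaps in custom behavior:

# settings.py — this is already the default, shown explicitly
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'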

IV. Using Spider Arguments

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Run from the command line:

scrapy crawl quotes -O quotes-humor.json -a tag=humor

Arguments passed with -a are assigned as attributes on the spider by its default __init__ method. Here getattr() reads the tag attribute, falling back to None when no tag was supplied.
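
A simplified sketch of what the default constructor does with those keyword arguments (the real Spider.__init__ also handles name and a few other details):

class Spider:
    def __init__(self, name=None, **kwargs):
        # every -a key=value arrives as a keyword argument and
        # becomes an instance attribute, e.g. self.tag == 'humor'
        self.__dict__.update(kwargs)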

V. CSS Selectors and XPath

scrapy shell is an interactive shell for quickly experimenting with scraping code, especially data-extraction expressions (CSS selectors and XPath). Launch it like this:

scrapy shell "http://quotes.toscrape.com/page/1/"

Relevant page source (abridged):

<head>
    <title>Quotes to Scrape</title>
</head>
Extracting with CSS selectors:

# list of all matching results:
>>> response.css('title::text').getall()
# first match; returns None if there is no match:
>>> response.css('title::text').get()
# first match; raises IndexError if there is no match:
>>> response.css('title::text')[0].get()
# run a regex over the text selected by the CSS selector:
>>> response.css('title::text').re(r'Quotes.*')
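
For this page they return the following:

>>> response.css('title::text').getall()
['Quotes to Scrape']
>>> response.css('title::text').get()
'Quotes to Scrape'
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']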

Extracting with XPath:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'
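
Scrapy translates CSS selectors into XPath under the hood, so either syntax can express the same query. For instance, an XPath equivalent of the quote extraction above (output shown for the first quote on page 1):

>>> response.xpath('//div[@class="quote"]/span[@class="text"]/text()').get()
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'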

For more scraping know-how and example source code, follow the WeChat public account: angry_it_man
