Getting Started with Scrapy: A Worked Example

Official documentation: https://doc.scrapy.org/en/latest/intro/tutorial.html

1. Create a project:

C:\Users\Gunner>scrapy startproject tutorial

This automatically creates the project files under the current directory.
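For reference, the generated layout looks roughly like this (the exact file list varies slightly across Scrapy versions):

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # project middlewares
        pipelines.py      # project pipelines
        settings.py       # project settings
        spiders/          # directory where spiders live
            __init__.py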

2. Define a spider

Add a quotes_spider.py file under the tutorial/spiders directory:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

Here:

- name defines the spider's name;

- start_requests() defines the starting URLs to crawl and registers parse() as the callback;

- parse() handles the response of each crawled page.

3. Run the spider

scrapy crawl quotes
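This must be run from the project's root directory; with the spider above it saves quotes-1.html and quotes-2.html locally. As an alternative to the CLI, a spider can also be driven from a plain Python script. Here is a minimal sketch using Scrapy's CrawlerProcess API (the import path tutorial.spiders.quotes_spider is an assumption based on the file added in step 2):

# run_quotes.py - a sketch: run the spider without the scrapy CLI
from scrapy.crawler import CrawlerProcess

from tutorial.spiders.quotes_spider import QuotesSpider  # path assumed from step 2

process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
process.crawl(QuotesSpider)
process.start()  # blocks until the crawl finishes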

4. Extract data:

The best way to learn data extraction is to experiment in the scrapy shell, for example:

scrapy shell "http://quotes.toscrape.com/page/1/"

2017-05-30 16:46:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x...>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x...>
[s]   spider     <DefaultSpider 'default' at 0x...>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>>

You can use the response object's css() method to select different elements:

>>> response.css('title')
[<Selector xpath=u'descendant-or-self::title' data=u'<title>Quotes to Scrape</title>'>]
>>> response.css('title').extract()
[u'<title>Quotes to Scrape</title>']
>>> response.css('title').extract_first()
u'<title>Quotes to Scrape</title>'
>>> response.css('title')[0].extract()
u'<title>Quotes to Scrape</title>'
>>>
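Note that extract_first() returns None instead of raising an error when nothing matches, and it accepts a fallback value. A quick sketch ('no-such-element' is just a deliberately non-matching selector):

>>> response.css('no-such-element::text').extract_first() is None
True
>>> response.css('no-such-element::text').extract_first(default='not-found')
'not-found'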

In addition, the re() method extracts with regular expressions:

>>> response.css('title::text').re(r'Quotes.*')
[u'Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
[u'Quotes']
>>>
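There is also a re_first() method that returns just the first match:

>>> response.css('title::text').re_first(r'Q\w+')
u'Quotes'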

To view the response in a browser:

>>> view(response)
True

5. Use XPath

Besides CSS selectors, Scrapy also supports XPath for locating and extracting data. In fact, CSS selectors are translated to XPath internally.

>>> response.xpath('//title')
[<Selector xpath='//title' data=u'<title>Quotes to Scrape</title>'>]
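The equivalent of the earlier CSS text extraction, written in XPath:

>>> response.xpath('//title/text()').extract_first()
u'Quotes to Scrape'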

6. Use yield to extract data, as in quotes1.py:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes1"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

Running scrapy crawl quotes1 prints the extracted fields: text, author, and tags.
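You can preview the same selectors in the scrapy shell before running the spider; on page 1 the first quote looks like this (values taken from the live site, so they may change):

>>> quote = response.css('div.quote')[0]
>>> quote.css('small.author::text').extract_first()
u'Albert Einstein'
>>> quote.css('div.tags a.tag::text').extract()
[u'change', u'deep-thoughts', u'thinking', u'world']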

7. Save the data:

scrapy crawl quotes1 -o quotes.json

Or save in the streaming JSON Lines format:

scrapy crawl quotes1 -o quotes.jl
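Note that -o appends to the file if it already exists, which can leave a JSON file malformed after a second run; JSON Lines does not have this problem. The same export can also be configured in settings.py with the feed-export settings of the Scrapy 1.x era (a sketch; newer versions use the FEEDS setting instead):

FEED_FORMAT = 'jsonlines'
FEED_URI = 'quotes.jl'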

8. Get the next-page link:

response.css('li.next a::attr(href)').extract_first()

::attr() is used to read an attribute from a tag.
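For reference, the same attribute can be read with XPath (the value shown assumes you are still on page 1):

>>> response.css('li.next a').xpath('@href').extract_first()
u'/page/2/'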

9. Follow the next-page link, quotes2.py:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes2"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

You can also use response.follow(next_page, callback=self.parse); unlike scrapy.Request, it accepts relative URLs directly, so the response.urljoin() step becomes unnecessary.
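A sketch of the end of parse() rewritten with response.follow (the rest of quotes2.py stays the same):

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            # response.follow resolves the relative URL against response.url
            yield response.follow(next_page, callback=self.parse)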
