Python Crawler: Scraping Famous Quotes with Scrapy

toscrape (quotes.toscrape.com) is a website that lists famous quotes.


A single quote entry is structured as follows:

“I have not failed. I've just found 10,000 ways that won't work.” by (about)

Each page links to the next page.



quotes.py, implemented with CSS selectors

import scrapy

from tutorial.items import TutorialItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Each quote on the page sits in an element with class "quote".
        quotes = response.css('.quote')
        for quote in quotes:
            item = TutorialItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            # A quote can carry several tags, so collect all of them.
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item
        # Follow the "next" link only if the page has one.
        next_page = response.css('.pager .next a::attr("href")').extract_first()
        if next_page is not None:
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)
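
The pagination step above depends on response.urljoin turning the href of the "next" link into an absolute URL. The sketch below imitates that step with the standard library's urllib.parse.urljoin, so it can be run without Scrapy. The helper name next_page_url and the href values such as '/page/2/' are assumptions made for illustration, not taken from the spider.

```python
from urllib.parse import urljoin


def next_page_url(page_url, href):
    """Resolve the href of the "next" link against the current page URL.

    Hypothetical helper for illustration. Returns None when the page has
    no "next" link, which is the signal to stop crawling.
    """
    if href is None:
        return None
    return urljoin(page_url, href)


# Assumed href values; the real ones come from the page markup.
first = next_page_url('http://quotes.toscrape.com/', '/page/2/')
second = next_page_url('http://quotes.toscrape.com/page/2/', '/page/3/')
last = next_page_url('http://quotes.toscrape.com/page/10/', None)

print(first)   # http://quotes.toscrape.com/page/2/
print(second)  # http://quotes.toscrape.com/page/3/
print(last)    # None
```

Returning None for a missing link makes the stop condition explicit, which is why a spider would normally check the extracted href before building the next request.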

quotes.py, implemented with XPath

import scrapy

from tutorial.items import TutorialItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Select every quote block, then use paths relative to each block.
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            item = TutorialItem()
            item['text'] = quote.xpath('span[@class="text"]/text()').extract_first()
            item['author'] = quote.xpath('span/small[@class="author"]/text()').extract_first()
            item['tags'] = quote.xpath('div[@class="tags"]/a[@class="tag"]/text()').extract()
            yield item
        # Follow the "next" link only if the page has one.
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page is not None:
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)
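
The element paths used by the XPath spider can be tried offline on a small sample. The sketch below is not Scrapy: it swaps in the standard library's xml.etree.ElementTree, which supports only a limited XPath subset (for example, no text() or @href steps, so text and attributes are read through the element API instead). The markup in SAMPLE is a hypothetical, simplified stand-in for one quote block and the pager, and the quote text, author, and tag values are placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page carries more attributes.
SAMPLE = """
<body>
  <div class="quote">
    <span class="text">Sample quote text.</span>
    <span>by <small class="author">Sample Author</small></span>
    <div class="tags">
      <a class="tag" href="/tag/one/">one</a>
      <a class="tag" href="/tag/two/">two</a>
    </div>
  </div>
  <ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>
</body>
"""


def parse_quotes(markup):
    """Extract quote dicts and the next-page href from well-formed markup."""
    root = ET.fromstring(markup.strip())
    items = []
    for quote in root.findall('.//div[@class="quote"]'):
        items.append({
            # Same relative paths as the spider, minus the text() step.
            'text': quote.findtext('span[@class="text"]'),
            'author': quote.findtext('span/small[@class="author"]'),
            'tags': [a.text for a in
                     quote.findall('div[@class="tags"]/a[@class="tag"]')],
        })
    next_link = root.find('.//li[@class="next"]/a')
    next_href = next_link.get('href') if next_link is not None else None
    return items, next_href


items, next_href = parse_quotes(SAMPLE)
print(items)
print(next_href)
```

ElementTree needs well-formed XML, so this only works on a tidy sample like the one above; real pages should go through Scrapy's own selectors.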

items.py

import scrapy


class TutorialItem(scrapy.Item):
    # Fields filled in by the spider for each quote.
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
