Scrapy Crawler Practice --- Day One

The first crawler project

The source code for this project is at: GitHub - scrapy/quotesbot: This is a sample Scrapy project for educational purposes

The website's page looks like this:


[Image 1: the quotesbot page]

For each quote on the page we can scrape three parts: the text, the author, and the tags. Let's start!

step one:

Create a new project; let's call it quotesbot. In a terminal, from the directory where the project should live, run the following command:

scrapy startproject quotesbot

We can then see the following directory structure:

[Image 2: the project directory structure]

The contents of the directory structure are not covered here.

step two:

Write the source code. Create a new file in the spiders directory; it can be called quotesbot.py.
The source is as follows:

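from scrapy import Spider


class Quotesbot(Spider):
    # Name used on the command line: scrapy crawl quotesbot
    name = 'quotesbot'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Each quote on the page is wrapped in <div class="quote">
        quotes = response.xpath("//div[@class='quote']")
        for quote in quotes:
            yield {
                # extract_first() returns the first match, or None if there is none
                'text': quote.xpath("./span[@class='text']/text()").extract_first(),
                'author': quote.xpath(".//small[@class='author']/text()").extract_first(),
                # extract() returns a list with every match
                'tags': quote.xpath(".//a[@class='tag']/text()").extract(),
            }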
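To see how the relative XPath expressions in parse() resolve without hitting the network, here is a minimal sketch that uses only the standard library. It does not use Scrapy's selectors: xml.etree.ElementTree supports a subset of XPath (no /text() step, so .text is read from the element instead). The SAMPLE markup and the parse_quotes helper are hypothetical, a simplified stand-in for one quote block on the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modeled on one quote block.
# The real page carries more attributes and more quotes.
SAMPLE = """
<html><body>
  <div class="quote">
    <span class="text">"A sample quote."</span>
    <span>by <small class="author">Some Author</small></span>
    <div class="tags">
      <a class="tag">first-tag</a>
      <a class="tag">second-tag</a>
    </div>
  </div>
</body></html>
""".strip()


def parse_quotes(markup):
    """Mirror the spider's selection logic with ElementTree's XPath subset."""
    root = ET.fromstring(markup)
    items = []
    for quote in root.findall(".//div[@class='quote']"):
        items.append({
            # find() returns the first matching element, like extract_first()
            "text": quote.find("./span[@class='text']").text,
            "author": quote.find(".//small[@class='author']").text,
            # findall() returns every match, like extract()
            "tags": [a.text for a in quote.findall(".//a[@class='tag']")],
        })
    return items


items = parse_quotes(SAMPLE)
print(items)
```

Note that "tags" comes back as a list here, because every matching a element is collected rather than only the first one.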

step three:

Change into the quotesbot directory and run the following command in the terminal:

scrapy crawl quotesbot -o quotesbot.json

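-o saves the scraped data to the file named after it; the output format (JSON here) is inferred from the file extension.
Once the command finishes, we can see that the file has been newly created in the directory.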


[Image 3: the project directory with the newly generated file]

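The contents of the file are as follows. (Each "tags" value shown here is a single string, which is what extract_first() would return; with extract(), as in the spider above, "tags" would be a list of all the tags for that quote.)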

[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": "change"},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling", "tags": "abilities"},
{"text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d", "author": "Albert Einstein", "tags": "inspirational"},
{"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen", "tags": "aliteracy"},
{"text": "\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d", "author": "Marilyn Monroe", "tags": "be-yourself"},
{"text": "\u201cTry not to become a man of success. Rather become a man of value.\u201d", "author": "Albert Einstein", "tags": "adulthood"},
{"text": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d", "author": "Andr\u00e9 Gide", "tags": "life"},
{"text": "\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d", "author": "Thomas A. Edison", "tags": "edison"},
{"text": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d", "author": "Eleanor Roosevelt", "tags": "misattributed-eleanor-roosevelt"},
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin", "tags": "humor"}
]

We can see that the expected data was generated. Good job!
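The \u201c and \u201d sequences in the output are JSON escapes for the curly quotation marks around each quote; they are not corrupted data. As a small sketch, the snippet below parses one record copied from the output above with the standard json module and shows that the escapes decode back to the original characters. The variable names are only for illustration.

```python
import json

# One record copied from the generated quotesbot.json, kept as a raw string
# so the backslash escapes reach the JSON parser unchanged.
raw = r'''[
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin", "tags": "humor"}
]'''

items = json.loads(raw)
first = items[0]

# json.loads turns the \uXXXX escapes back into real characters.
print(first["text"])
print(first["author"])
```

If you would rather have the file contain the characters directly, Scrapy's FEED_EXPORT_ENCODING setting can be set to 'utf-8' in settings.py.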
