Scrapy Study Notes (2): Continuous Crawling and Saving Data

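When scraping multi-page content such as forums or Tieba threads, before I learned Scrapy I would first work out how many pages there were and then fetch them in a for loop. That approach is clumsy. With Scrapy you can build the next page's URL from the link on the current page and pass it to a Request, so the crawl keeps going until there is no next-page link left.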

Again using the site from the official tutorial as the example, let's look at the page elements first:

You can see the tag for the next page:

<li class="next"><a href="/page/2/">Next →</a></li>

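Joining the href value /page/2/ with the site address http://quotes.toscrape.com gives the URL of the next page. In the same way, the href of the Next link on page 2 yields the URL of page 3. So as long as we check whether the page has a next-page link, we can keep crawling.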
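To make the joining step concrete, here is a small sketch using the standard library's urllib.parse.urljoin, which is what Scrapy's response.urljoin wraps. The helper name next_page_url is made up for illustration, and the base URLs stand in for response.url.

```python
from urllib.parse import urljoin


def next_page_url(base_url, next_href):
    """Return the absolute URL of the next page, or None if there is no Next link.

    next_page_url is a hypothetical helper; base_url plays the role of response.url.
    """
    if next_href is None:
        return None
    return urljoin(base_url, next_href)


# Page 1 links to /page/2/
print(next_page_url("http://quotes.toscrape.com/", "/page/2/"))
# → http://quotes.toscrape.com/page/2/

# Page 2 links to /page/3/; a root-relative href replaces the whole path
print(next_page_url("http://quotes.toscrape.com/page/2/", "/page/3/"))
# → http://quotes.toscrape.com/page/3/

# The last page has no Next link, so there is nothing to follow
print(next_page_url("http://quotes.toscrape.com/page/10/", None))
# → None
```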

Here is the code:

import scrapy


class myspider(scrapy.Spider):

    # Spider name, used with "scrapy crawl"
    name = "get_quotes"

    # Start URL
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):

        # Select elements with CSS selectors; BeautifulSoup or similar also works if you prefer
        # Locate each whole quote block first, then extract author, tags and text inside it
        for quote in response.css('.quote'):
            yield {
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
                'content': quote.css('span.text::text').extract_first()
            }

        # Use XPath to get the href attribute of the Next button
        next_href = response.xpath('//li[@class="next"]/a/@href').extract_first()
        # Check whether next_href exists
        if next_href is not None:

            # If it exists, build the next page's URL with urljoin, e.g.
            # http://quotes.toscrape.com/page/2/
            next_page = response.urljoin(next_href)

            # Request the next page and handle it with parse again
            yield scrapy.Request(next_page, callback=self.parse)

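Here is the result:

You can see that the crawl ran through 10 pages, which is all the site has.

Every quote on the site has now been scraped. Convenient, isn't it?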
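The stopping behavior can be illustrated offline. The following is only a sketch under stated assumptions: build_fake_site and crawl are hypothetical helpers, and the "site" is an in-memory dict that maps each page URL to the href of its Next link (None on the last page). No network access or Scrapy is involved.

```python
from urllib.parse import urljoin

START_URL = "http://quotes.toscrape.com/"


def build_fake_site(page_count=10):
    """Map each page URL to the href of its Next link (None on the last page)."""
    site = {}
    for n in range(1, page_count + 1):
        url = START_URL if n == 1 else urljoin(START_URL, f"/page/{n}/")
        site[url] = f"/page/{n + 1}/" if n < page_count else None
    return site


def crawl(site, start_url):
    """Follow Next links from start_url until a page has none; return visited URLs."""
    visited = []
    url = start_url
    while url is not None:
        visited.append(url)
        next_href = site[url]
        # Same check as in the spider: only continue when a Next href exists
        url = urljoin(url, next_href) if next_href is not None else None
    return visited


pages = crawl(build_fake_site(), START_URL)
print(len(pages))   # → 10
print(pages[-1])    # → http://quotes.toscrape.com/page/10/
```

Unlike a fixed for loop, the page count is never passed to crawl; the loop ends on its own when a page has no Next link.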

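So far the scraped data is only printed to the screen and not stored anywhere. Next we will store it in MongoDB. Its advantages are easy to look up, so I won't cover them here. Download it from the official site and follow the official installation guide to set it up.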

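Using MongoDB from Python requires pymongo, which can be installed directly with pip install pymongo.
First, a demonstration of storing items directly from the spider, just as an example of writing to MongoDB; this is not recommended in practice: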

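import scrapy

# Import pymongo
import pymongo


class myspider(scrapy.Spider):

    # Spider name, used with "scrapy crawl"
    name = "get_quotes"

    # Start URL
    start_urls = ['http://quotes.toscrape.com']

    # Configure the client; default host localhost, port 27017
    client = pymongo.MongoClient('localhost', 27017)
    # Get the database named store_quotes (created on first write)
    db_name = client['store_quotes']
    # Get the collection named quotes (created on first write)
    quotes_list = db_name['quotes']

    def parse(self, response):

        # Select elements with CSS selectors; BeautifulSoup or similar also works if you prefer
        # Locate each whole quote block first, then extract author, tags and text inside it
        for quote in response.css('.quote'):
            item = {
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
                'content': quote.css('span.text::text').extract_first()
            }
            # Store the scraped data in MongoDB. insert_one replaces the old
            # insert(), which was removed in PyMongo 4. A copy is inserted
            # because insert_one adds an _id field to the dict it receives.
            self.quotes_list.insert_one(dict(item))
            # Yield the item itself, not the insert result, so Scrapy gets a valid item
            yield item

        # Use XPath to get the href attribute of the Next button
        next_href = response.xpath('//li[@class="next"]/a/@href').extract_first()
        # Check whether next_href exists
        if next_href is not None:

            # If it exists, build the next page's URL with urljoin, e.g.
            # http://quotes.toscrape.com/page/2/
            next_page = response.urljoin(next_href)

            # Request the next page and handle it with parse again
            yield scrapy.Request(next_page, callback=self.parse)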
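Since writing to MongoDB inside the spider is described above as not recommended, here is a rough sketch of the more usual arrangement in Scrapy, an item pipeline, so that the spider only yields plain dicts as in the first example. This is an illustration under assumptions, not the author's code: the class name MongoQuotesPipeline and its constructor parameters are made up, and the defaults simply mirror the database and collection names used above.

```python
class MongoQuotesPipeline:
    """Hypothetical item pipeline that stores each scraped item in MongoDB."""

    def __init__(self, mongo_uri="mongodb://localhost:27017",
                 db_name="store_quotes", collection_name="quotes"):
        # Assumed defaults, matching the names used in the spider above
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # Called by Scrapy when the spider starts: open one shared connection.
        # Imported here so this module can be loaded without pymongo installed.
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes: release the connection
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Insert a copy, because insert_one adds an _id field to its argument
        self.collection.insert_one(dict(item))
        # Return the item so any later pipelines still receive it
        return item
```

To enable such a pipeline, the class would be registered in the project's settings.py under ITEM_PIPELINES, for example {'myproject.pipelines.MongoQuotesPipeline': 300}, where myproject is a placeholder for the actual project package.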

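If you use the PyCharm editor, there is a MongoDB plugin called Mongo Plugin that makes it easy to browse the database. Add it from the Plugins page.

After adding it, restart PyCharm. Mongo Servers then appears under Settings -> Other Settings; click Mongo Servers to configure MongoDB:

The Label can be anything, and the server URL already has a default value. Run Test, and confirm once the connection succeeds. After that, the Mongo Explorer panel appears on the left side of PyCharm; expand it to see the databases.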

OK, let's run our code:

scrapy crawl get_quotes

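Then refresh the database view and you can see that the data has been saved to MongoDB.

Viewing the database contents:

Very clear: every item has been saved.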
