Introduction to Python Web Scraping 3.9: A Scrapy Crawler in Practice

Disclaimer: adapted from "从零开始学Python网络爬虫" (Learn Python Web Crawling from Scratch) by Luo Pan and Jiang Qian, China Machine Press, ISBN 9787111579991.

In the previous section we covered installing the Scrapy framework and its basics. In this section we use Scrapy to crawl data from Zhihu.

First, create a Zhihu project from the command prompt. In an open command prompt, enter:

  • F: (switch to the drive where the project will be created)
  • cd F:\soft_exercise\python (change into the directory where the project will live)
  • scrapy startproject zhihu (use the scrapy startproject command to create a project named zhihu)

The result looks like this:

[Screenshot 1: output of running scrapy startproject zhihu]

Once the project is created, open it in PyCharm and you will see the following files; zhihuspiders.py is the spider file I created myself:

 

[Screenshot 2: the project structure as shown in PyCharm]
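For reference, what you see in PyCharm is roughly the standard layout that scrapy startproject generates, plus the zhihuspiders.py file added by hand; the exact file list can vary slightly between Scrapy versions, and placing the spider under spiders/ is the usual convention assumed here:

zhihu/
    scrapy.cfg                # deployment configuration
    zhihu/
        __init__.py
        items.py              # item definitions (edited in 3.1)
        middlewares.py
        pipelines.py          # item pipelines (edited in 3.3)
        settings.py           # project settings (edited in 3.4)
        spiders/
            __init__.py
            zhihuspiders.py   # the spider written in 3.2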

 

Now let's analyze and implement the project.

1. First, clarify the goal and the approach. The goal is to crawl information from the Python topic on Zhihu and store it in a MongoDB database.

2. The information to crawl: the question title, the number of upvotes, the answering user, the user's profile line, and the answer content.

3. Writing the code

3.1 Writing items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ZhihuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    question = scrapy.Field()    # question title
    favour = scrapy.Field()      # number of upvotes
    user = scrapy.Field()        # answering user's name
    user_info = scrapy.Field()   # the user's one-line profile
    content = scrapy.Field()     # answer text

3.2 Writing zhihuspiders.py

# import the required libraries
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request
from zhihu.items import ZhihuItem

class zhihu(CrawlSpider):
    # unique name of the spider
    name = 'zhihu'
    start_urls = ['https://www.zhihu.com/topic/19552832/top-answers?page=1']
    def parse(self, response):
        item = ZhihuItem()
        selector = Selector(response)
        infos = selector.xpath('//div[@class="zu-top-feed-list"]/div')
        for info in infos:
            try:
                question = info.xpath('div/div/h2/a/text()').extract()[0].strip()
                favour = info.xpath('div/div/div[1]/div[1]/a/text()').extract()[0]
                user = info.xpath('div/div/div[1]/div[3]/span/span[1]/a/text()').extract()[0]
                user_info = info.xpath('div/div/div[1]/div[3]/span/span[2]/text()').extract()[0].strip()
                content = info.xpath('div/div/div[1]/div[5]/div/text()').extract()[0].strip()
                item['question'] = question
                item['favour'] = favour
                item['user'] = user
                item['user_info'] = user_info
                item['content'] = content
                yield item
            except IndexError:
                pass
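Note that the spider above only requests the first page of the topic. As a minimal sketch (assuming Zhihu's page query parameter keeps paginating the same way, which is an assumption rather than something verified here), start_urls can be expanded to cover several pages by replacing the single-URL line with:

    # hypothetical: build URLs for the first 20 pages of the topic's top answers
    start_urls = ['https://www.zhihu.com/topic/19552832/top-answers?page={}'.format(i)
                  for i in range(1, 21)]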

3.3 Writing pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


import pymongo


class ZhihuPipeline(object):
    def __init__(self):
        # connect to MongoDB: database 'test', collection 'zhihu'
        client = pymongo.MongoClient('localhost', 27017)
        test = client['test']
        zhihu = test['zhihu']
        self.post = zhihu

    # insert the crawled item into the collection
    def process_item(self, item, spider):
        info = dict(item)
        self.post.insert_one(info)   # insert_one() replaces the older, deprecated insert()
        return item
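After a crawl finishes, a quick way to check that the data actually reached MongoDB is to query the collection with pymongo. This small sketch assumes the same localhost server, the test database, and the zhihu collection used in the pipeline above:

import pymongo

client = pymongo.MongoClient('localhost', 27017)
collection = client['test']['zhihu']
print(collection.count_documents({}))        # how many answers were stored
for doc in collection.find().limit(3):       # peek at a few stored documents
    print(doc['question'], doc['favour'])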

3.4 Writing settings.py

ROBOTSTXT_OBEY = True
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
DOWNLOAD_DELAY = 2    # wait 2 seconds between requests
ITEM_PIPELINES = {'zhihu.pipelines.ZhihuPipeline': 300}    # enable our pipeline; 300 is its run order

3.5 Writing main.py; main.py is a new file added to the project so the spider can be started directly (for example from PyCharm) instead of typing scrapy crawl zhihu in a terminal each time.

from scrapy import cmdline
# equivalent to typing "scrapy crawl zhihu" at the command line
cmdline.execute("scrapy crawl zhihu".split())
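As an alternative to the cmdline helper, Scrapy's CrawlerProcess can also run the spider from a script. A minimal sketch, assuming it is run from the project root so the project's settings.py can be found:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# load the project settings and run the spider registered under the name 'zhihu'
process = CrawlerProcess(get_project_settings())
process.crawl('zhihu')
process.start()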

