Disclaimer: reposted from "Web Scraping with Python from Scratch" (从零开始学Python网络爬虫) by Luo Pan and Jiang Qian, China Machine Press, ISBN 9787111579991.
In the previous section we covered installing Scrapy and its basic structure; in this section we start using Scrapy to crawl data from Zhihu.
First, use the command prompt to create a Zhihu project.
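Assuming the project name zhihu that the rest of the code imports from, the command to enter would be:
scrapy startproject zhihu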
Once the project has been created, open it in PyCharm and you will see the generated files; among them, zhihuspiders.py is the spider file I created myself.
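For reference, a project generated by scrapy startproject has roughly the following layout (zhihuspiders.py is added by hand under spiders/, and main.py is added later in step 3.5):
zhihu/
    scrapy.cfg
    zhihu/
        __init__.py
        items.py
        pipelines.py
        settings.py
        main.py              # added by hand in step 3.5
        spiders/
            __init__.py
            zhihuspiders.py  # the spider written in step 3.2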
Now let's move on to analyzing and implementing the project.
1. First, clarify the goal and the technical approach. The goal is to crawl information from the Python topic on Zhihu and store it in a MongoDB database.
2. The information we need to crawl: the question title, the number of upvotes, the answering user, the user's profile info, and the answer content.
3. Writing the code
3.1 Writing items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class ZhihuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    question = scrapy.Field()    # question title
    favour = scrapy.Field()      # number of upvotes on the answer
    user = scrapy.Field()        # answering user
    user_info = scrapy.Field()   # the user's profile line
    content = scrapy.Field()     # answer content
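A ZhihuItem behaves like a dict, which is how the spider in the next step fills it in; a quick sanity check with a made-up value:
item = ZhihuItem()
item['question'] = 'example question title'   # hypothetical value, just for the check
print(dict(item))    # {'question': 'example question title'}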
3.2 Writing zhihuspiders.py
# import the required libraries
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request
from zhihu.items import ZhihuItem


class zhihu(CrawlSpider):
    # unique name of the spider
    name = 'zhihu'
    start_urls = ['https://www.zhihu.com/topic/19552832/top-answers?page=1']

    def parse(self, response):
        item = ZhihuItem()
        selector = Selector(response)
        infos = selector.xpath('//div[@class="zu-top-feed-list"]/div')
        for info in infos:
            try:
                question = info.xpath('div/div/h2/a/text()').extract()[0].strip()
                favour = info.xpath('div/div/div[1]/div[1]/a/text()').extract()[0]
                user = info.xpath('div/div/div[1]/div[3]/span/span[1]/a/text()').extract()[0]
                user_info = info.xpath('div/div/div[1]/div[3]/span/span[2]/text()').extract()[0].strip()
                content = info.xpath('div/div/div[1]/div[5]/div/text()').extract()[0].strip()
                item['question'] = question
                item['favour'] = favour
                item['user'] = user
                item['user_info'] = user_info
                item['content'] = content
                yield item
            except IndexError:
                # some feed entries are missing one of the fields; skip them
                pass
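The spider above only requests the first listing page (note the unused Request import, which hints that more pages were meant to be followed). A minimal sketch of one common way to cover more pages, assuming the same URL pattern and an arbitrary cut-off of 50 pages, is to build start_urls as a list:
# in zhihuspiders.py, replace the single-page start_urls with a page range
start_urls = ['https://www.zhihu.com/topic/19552832/top-answers?page={}'.format(i)
              for i in range(1, 51)]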
3.3 Writing pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo


class ZhihuPipeline(object):
    def __init__(self):
        # connect to the local MongoDB instance
        client = pymongo.MongoClient('localhost', 27017)
        test = client['test']      # the "test" database
        zhihu = test['zhihu']      # the "zhihu" collection
        self.post = zhihu

    # insert every scraped item into the collection
    def process_item(self, item, spider):
        info = dict(item)
        self.post.insert_one(info)    # insert_one replaces the older, deprecated insert()
        return item
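To confirm that the pipeline is writing data, you can query the collection directly with pymongo (assuming MongoDB is running locally on the default port, as the pipeline above expects):
import pymongo

client = pymongo.MongoClient('localhost', 27017)
collection = client['test']['zhihu']
print(collection.count_documents({}))   # number of stored answers
print(collection.find_one())            # one sample document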
3.4 Writing settings.py
ROBOTSTXT_OBEY = True    # if requests get filtered as "Forbidden by robots.txt", tutorials commonly set this to False
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
DOWNLOAD_DELAY = 2    # wait 2 seconds between requests
ITEM_PIPELINES = {'zhihu.pipelines.ZhihuPipeline': 300}    # register the pipeline; 300 is its priority (0-1000, lower runs first)
3.5 Writing main.py; this is a new file created by hand, used to launch the crawl.
from scrapy import cmdline
cmdline.execute("scrapy crawl zhihu".split())
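Running main.py from PyCharm (or with python main.py inside the project directory) has the same effect as typing scrapy crawl zhihu on the command line; once the spider finishes, the scraped answers should appear in the zhihu collection of the local test database.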