Scraping the Douban Top 250 movie list with Scrapy in PyCharm

Creating a basic Scrapy project is not covered here; this post just records the source code so it can be looked up later. (Go easy on me, veterans.)
The project structure looks like this:
[Figure 1: project directory structure in PyCharm]

Only items.py, doubanspider.py, and main.py are actually used here; the other generated components are left untouched, so they are not pasted one by one.
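One of those untouched files, settings.py, can still matter in practice: projects generated by scrapy startproject enable robots.txt checking by default, and Douban tends to be strict about the default Scrapy user agent (which is why the spider below sends a browser User-Agent with every request). The snippet below is only a sketch of settings such a crawl commonly relies on; the values are assumptions, not taken from the original project:

# settings.py (sketch, not part of the original post)
BOT_NAME = 'Spider'
ROBOTSTXT_OBEY = False      # assumption: keep requests from being filtered by robots.txt
DOWNLOAD_DELAY = 1          # assumption: slow the crawl down a little to be polite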

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class SpiderItem(scrapy.Item):
    ranking = scrapy.Field()        # rank in the Top 250
    movie_name = scrapy.Field()     # movie title
    score = scrapy.Field()          # rating
    score_num = scrapy.Field()      # number of people who rated it
    quote = scrapy.Field()          # one-line quote
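
For readers new to Scrapy: a scrapy.Item behaves much like a dictionary that only accepts its declared fields. A small usage sketch (not part of the project):

# sketch: SpiderItem allows dict-style access for declared fields only
from Spider.items import SpiderItem

item = SpiderItem()
item['ranking'] = '1'
item['movie_name'] = '肖申克的救赎'
print(dict(item))            # {'ranking': '1', 'movie_name': '肖申克的救赎'}
# item['director'] = '...'   # would raise KeyError: 'director' is not a declared Field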

doubanspider.py

from scrapy.spiders import Spider       # the Spider base class to inherit from
from Spider.items import SpiderItem     # the item container defined in items.py
import scrapy
import re

class Douban(Spider):
    name = 'dmoz'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
    }

    def start_requests(self):
        url = 'https://movie.douban.com/top250'
        yield scrapy.Request(url, headers=self.headers)

    def parse(self, response):
        # "Grab the big blocks first, then the details": select each movie's <li> node
        # with XPath, then extract the individual fields relative to that node
        movies = response.xpath('//ol[@class="grid_view"]/li')
        for movie in movies:
            item = SpiderItem()   # create a fresh item per movie instead of reusing one object
            item['ranking'] = movie.xpath('.//div[@class="pic"]/em/text()').extract()[0]
            item['movie_name'] = movie.xpath('.//div[@class="hd"]/a/span[1]/text()').extract()[0]
            item['score'] = movie.xpath('.//div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
            num = movie.xpath('.//div[@class="star"]/span/text()').extract()[1]
            item['score_num'] = re.sub(r'[^0-9]', '', num)   # keep only the digits from e.g. "123456人评价"
            # some entries have no quote, so fall back to an empty string
            item['quote'] = movie.xpath('.//div[@class="bd"]/p[@class="quote"]/span/text()').extract_first('')
            # yield hands the populated item to the Scrapy engine, which passes it on
            # to the item pipelines and the feed exporter (here the CSV given with -o)
            yield item

        # pagination: follow the "next page" link
        next_url = response.xpath('//span[@class="next"]/a/@href').extract()
        if next_url:
            # the href is relative (e.g. '?start=25&filter='), so append it to the list URL
            next_url = 'https://movie.douban.com/top250' + next_url[0]
            yield scrapy.Request(next_url, headers=self.headers)
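
The next-page href in the page source is relative (something like ?start=25&filter=), which is why concatenating it onto the list URL works. An equivalent drop-in sketch for that pagination block lets Scrapy resolve the relative URL itself via response.urljoin:

        # sketch: the same pagination step using response.urljoin instead of manual concatenation
        next_url = response.xpath('//span[@class="next"]/a/@href').extract_first()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url),
                                 headers=self.headers,
                                 callback=self.parse)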

main.py

from scrapy import cmdline
# equivalent to running `scrapy crawl dmoz -o douban.csv` from the project root in a terminal
cmdline.execute("scrapy crawl dmoz -o douban.csv".split())
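
If the exported douban.csv shows garbled Chinese when opened in Excel, the feed encoding can be overridden when launching the crawl. FEED_EXPORT_ENCODING is a standard Scrapy setting; using utf-8-sig here is just one common choice, not something from the original post:

# sketch: same run, but write a BOM-prefixed UTF-8 CSV so Excel displays Chinese correctly
from scrapy import cmdline
cmdline.execute("scrapy crawl dmoz -o douban.csv -s FEED_EXPORT_ENCODING=utf-8-sig".split())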

Results:
[Figures 2 and 3: screenshots of the scraped results]
