Scraping Maoyan movie information with Scrapy

Scrapy is an excellent crawler framework that makes structured data scraping straightforward. Below we walk through scraping Maoyan's movie board as an example.

First we need to create a Scrapy project. In a cmd window in your working directory, run:

scrapy startproject maoyan

This creates a crawler project named maoyan. Then change into the maoyan project directory and generate a spider:

cd maoyan
scrapy genspider maoyan_spider maoyan.com

With that, the project directory structure is in place.
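For reference, the generated layout follows the standard Scrapy project template and looks roughly like this:

maoyan/
    scrapy.cfg            # deploy configuration
    maoyan/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            maoyan_spider.py  # created by genspider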

Next we define the item in items.py:

import scrapy


class MaoyanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    index = scrapy.Field()
    title = scrapy.Field()
    star = scrapy.Field()
    releasetime = scrapy.Field()
    score = scrapy.Field()    

These are the fields we want to scrape: the ranking, title, cast, release date, and score.
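For a quick feel for the API: an Item behaves like a dict whose keys are restricted to the declared fields (purely illustrative):

from maoyan.items import MaoyanItem

item = MaoyanItem()
item['title'] = '霸王别姬'    # assignment works like a dict
print(dict(item))             # {'title': '霸王别姬'}
# item['foo'] = 1 would raise KeyError, since foo is not a declared field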

Then we write the spider (spiders/maoyan_spider.py):

# -*- coding: utf-8 -*-
import scrapy

from maoyan.items import MaoyanItem



class MySpider(scrapy.Spider):
    name = 'maoyan'  # spider name, used with `scrapy crawl maoyan`
    allowed_domains = ['maoyan.com']  # domains the spider is allowed to visit

    def start_requests(self):
        # the top-100 board shows 10 films per page, paged by offset=0,10,...,90
        url_list = []
        for i in range(0, 10):
            url_list.append('https://maoyan.com/board/4?offset=' + str(i * 10))
        for url in url_list:
            # each downloaded page is handed to the parse method
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        dl = response.css('.board-wrapper dd')
        for dd in dl:
            item = MaoyanItem()
            # extract_first() is like extract()[0], but returns None on no match
            item['index'] = dd.css('.board-index::text').extract_first()
            item['title'] = dd.css('.name a::text').extract_first()
            # strip() trims surrounding whitespace and newlines
            item['star'] = dd.css('.star::text').extract_first().strip()
            item['releasetime'] = dd.css('.releasetime::text').extract_first()
            # the score is rendered as separate integer and fraction parts
            item['score'] = dd.css('.integer::text').extract_first() + dd.css('.fraction::text').extract_first()
            yield item

Here we override the start_requests and parse methods, using CSS selectors to parse the response.
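If a selector ever returns nothing, the Scrapy shell is a convenient way to test expressions against a live page before committing them to parse (note that Maoyan may reject requests without a browser-like User-Agent, so results can vary):

scrapy shell "https://maoyan.com/board/4"
>>> response.css('.board-wrapper dd')              # one selector per film entry
>>> response.css('.name a::text').extract_first()  # first film title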

Next we write the item pipeline in pipelines.py:

import json


class MaoyanPipeline(object):
    def __init__(self):
        self.fp = open('maoyan.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # serialize each item to one JSON object per line
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        # close the output file when the spider finishes
        self.fp.close()

Here we open the output file in the constructor and implement process_item to serialize each scraped item to JSON, one object per line.
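As an aside: for a plain JSON dump, Scrapy's built-in feed exports can do the job without a custom pipeline. The command below writes a JSON array to the named file (an alternative to, not part of, the pipeline above):

scrapy crawl maoyan -o movies.json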
Now configure settings.py. Set ROBOTSTXT_OBEY = False so Scrapy ignores robots.txt, then add default request headers and register the pipeline:

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
}

ITEM_PIPELINES = {
    'maoyan.pipelines.MaoyanPipeline': 300,
}
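Optionally, a couple of standard throttling settings keep the crawl polite; the values here are only suggestions:

DOWNLOAD_DELAY = 1                    # pause one second between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 4    # cap parallel requests to one domain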

Finally, add a launcher script start.py in the maoyan project root:

from scrapy import cmdline
cmdline.execute(['scrapy','crawl','maoyan'])

Now running start.py directly produces a JSON file of the scraped data.
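To verify the result, the JSON-lines output can be loaded back in a few lines (a minimal sketch, assuming the maoyan.json written by the pipeline above):

import json

with open('maoyan.json', encoding='utf-8') as fp:
    movies = [json.loads(line) for line in fp if line.strip()]

print(len(movies))           # 100 entries for the full top-100 board
print(movies[0]['title'])    # title of the top-ranked film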
