A first look at Scrapy: scraping the Maoyan top-100 movie chart

1. Create the Scrapy project

scrapy startproject maoyanspider
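
This generates the standard project skeleton (layout per Scrapy's project template; middlewares.py appears in recent versions):

maoyanspider/
    scrapy.cfg
    maoyanspider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py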

2. Write items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MaoyanspiderItem(scrapy.Item):
    # define the fields for your item here like:
    movie_name = scrapy.Field()  # movie title
    actor = scrapy.Field()       # starring actors
    time = scrapy.Field()        # release date
    score = scrapy.Field()       # rating
    img = scrapy.Field()         # poster URL (declared but not filled by the spider below)

3. Edit settings.py
Uncomment the DEFAULT_REQUEST_HEADERS block in settings.py and add a User-Agent:

  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',

Then uncomment ITEM_PIPELINES as well.
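
After both edits, the touched parts of settings.py would look roughly like this (the Accept values are Scrapy's template defaults; the pipeline entry with priority 300 matches the generated file):

# settings.py (only the modified sections shown)
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
}

ITEM_PIPELINES = {
  'maoyanspider.pipelines.MaoyanspiderPipeline': 300,
}
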
4. Generate the spider

scrapy genspider Myspider maoyan.com

Once it succeeds, a Myspider.py file appears in the spiders directory; that is where the crawling logic goes:

# -*- coding: utf-8 -*-
import scrapy
from maoyanspider.items import MaoyanspiderItem

class MyspiderSpider(scrapy.Spider):
    name = 'Myspider'
    # domain only -- a trailing slash here makes the offsite middleware drop requests
    allowed_domains = ['maoyan.com']
    baseUrl = 'https://maoyan.com/board/4?offset='
    offset = 0
    start_urls = [baseUrl + str(offset)]

    def parse(self, response):
        # one board-item-content block per movie on the current page
        info = response.xpath(".//div[@class='board-item-content']")

        for each in info:
            # build a fresh item per movie instead of mutating a shared instance
            item = MaoyanspiderItem()

            movie_name = each.xpath('./div/p/a/text()').extract()
            actor = each.xpath('./div/p[@class="star"]/text()').extract()
            time = each.xpath('./div/p[@class="releasetime"]/text()').extract()
            score = each.xpath('./div/p[@class="score"]/i/text()').extract()

            item["movie_name"] = movie_name[0]
            item["actor"] = actor[0].strip()
            item["time"] = time[0]
            # the score is rendered as two <i> nodes, e.g. "9." and "5"
            item["score"] = score[0] + score[1]

            yield item

        # the board has ten pages at offset=0,10,...,90; queue the next page until done
        if self.offset < 90:
            self.offset += 10
            url = self.baseUrl + str(self.offset)
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)
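
Before running the full crawl, the XPaths can be sanity-checked interactively with Scrapy's shell (standard Scrapy tooling; the URL below is the first board page):

scrapy shell "https://maoyan.com/board/4?offset=0"
>>> response.xpath(".//div[@class='board-item-content']").extract_first()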

5. Write pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json

class MaoyanspiderPipeline(object):

    def __init__(self):
        self.file = open('Maoyan.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # append each item as a JSON object followed by a comma
        jsontext = json.dumps(dict(item), ensure_ascii=False, indent=1) + ','
        self.file.write(jsontext)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider when the spider finishes; the original
        # close_file method was never invoked by Scrapy, leaking the file handle
        self.file.close()
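
Note that the comma-joined objects above do not form one valid JSON document. A minimal variant that accumulates items and writes a proper JSON array on close might look like this (the class name is illustrative, not from the original project; register whichever class you use in ITEM_PIPELINES):

import json

class MaoyanJsonArrayPipeline(object):

    def open_spider(self, spider):
        # collect items in memory; fine for 100 movies
        self.items = []

    def process_item(self, item, spider):
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # dump everything as a single valid JSON array
        with open('Maoyan.json', 'w', encoding='utf-8') as f:
            json.dump(self.items, f, ensure_ascii=False, indent=1)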

6. Run the spider (two ways)
(1) Create a file, e.g. xx.py, under the spiders directory containing:

# -*- coding: utf-8 -*-
# @Time: 2019/8/21 18:25

from scrapy import cmdline
cmdline.execute('scrapy crawl Myspider'.split())

(2) Or, from the command line, change into the inner maoyanspider directory and run:

scrapy crawl Myspider

Either way, the crawl runs.
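
Alternatively, Scrapy's built-in feed exports can save the items without any custom pipeline (standard -o flag; the resulting file is a single JSON array):

scrapy crawl Myspider -o Maoyan.json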
