Scraping Movie Top 100 Data with Scrapy

Table of Contents

  • Scraping Movie Top 100 Data with Scrapy
  • Preface
  • I. Creating the project
    • 1. Project-creation commands (see earlier posts for details)
  • II. Writing each file
    • 1. Write items.py
    • 2. Write pipelines.py
    • 3. Write top100.py
    • 4. Enable the pipeline in settings.py
    • 5. Run the spider


Preface

This crawler scrapes each movie's title, starring actors, release date, and score, then saves the scraped data to a local .json file.

Tip: what follows is the main body of the article; the example below is provided for reference.

I. Creating the project

1. The project-creation commands are as follows (see earlier posts for details); the resulting project layout is sketched after the list:

Change directory: cd
Go up one level: cd ..
Create a new project: scrapy startproject maoyan1
Change into the maoyan1 directory: cd maoyan1
Create the spider: scrapy genspider -t basic top100 maoyan.com
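
After these commands, the project should have Scrapy's standard skeleton (the default startproject layout, shown here for orientation):

maoyan1/
├── scrapy.cfg
└── maoyan1/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── top100.py   # created by scrapy genspider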

II. Writing each file

1. Write items.py as follows:

The code for items.py is:

import scrapy


class Maoyan1Item(scrapy.Item):
    # Define the fields for your item here:
    name = scrapy.Field()         # movie title
    actors = scrapy.Field()       # starring actors
    releasetime = scrapy.Field()  # release date
    score = scrapy.Field()        # rating
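
A Scrapy Item behaves like a dict, which is how both the spider and the pipeline below use it. A quick illustrative check (the sample value is made up):

from maoyan1.items import Maoyan1Item

item = Maoyan1Item()
item['name'] = ['some title']  # in this project each field holds a per-page list
print(dict(item))              # {'name': ['some title']}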

2. Write pipelines.py as follows:

The code for pipelines.py is:

import codecs
import json


class Maoyan1Pipeline:
    def __init__(self):
        # Write one JSON object per line (JSON Lines); codecs.open handles the UTF-8 encoding
        self.file = codecs.open(
            "D:\\python\\代码文件\\代码\\Scrapy网络爬虫实战\\maoyan1\\maoyan.json",
            "wb", encoding="UTF-8")

    def process_item(self, item, spider):
        # Each field holds one page's worth of values; recombine them by index
        for j in range(len(item["name"])):
            goods = {
                "name": item["name"][j],
                "actors": item["actors"][j],
                "releasetime": item["releasetime"][j],
                "score": item["score"][j],
            }
            line = json.dumps(goods, ensure_ascii=False) + '\n'
            self.file.write(line)
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        self.file.close()
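
To sanity-check the result, the JSON Lines file can be read back one line at a time (a minimal sketch; adjust the path to wherever maoyan.json was written):

import json

with open("maoyan.json", encoding="utf-8") as f:
    movies = [json.loads(line) for line in f]

print(len(movies))        # should reach 100 once all pages are crawled
print(movies[0]["name"])  # title of the first record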

3. Write top100.py as follows:

The code for top100.py is:

import re

import scrapy
from scrapy.http import Request

from maoyan1.items import Maoyan1Item


class Top100Spider(scrapy.Spider):
    name = 'top100'
    allowed_domains = ['maoyan.com']
    start_urls = ['https://maoyan.com/board/4?offset=0']

    def parse(self, response):
        item = Maoyan1Item()
        item['name'] = response.xpath('//a[@class="image-link"]/@title').extract()
        # Strip the newlines and padding around the actors text
        lis = []
        for i in response.xpath('//p[@class="star"]/text()').extract():
            lis.append(re.sub(r'\s+', '', i))
        item['actors'] = lis
        item['releasetime'] = response.xpath('//p[@class="releasetime"]/text()').extract()
        item['score'] = response.xpath('//p[@class="score"]').xpath('string(.)').extract()
        yield item

        # Crawl the remaining nine pages (offsets 10-90; the first page is offset 0).
        # Requests re-yielded from later pages are dropped by Scrapy's duplicate filter.
        for i in range(10, 100, 10):
            url = "https://maoyan.com/board/4?offset=" + str(i)
            yield Request(url, callback=self.parse)
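
The re.sub call collapses the raw text of the star paragraph, which arrives padded with newlines and spaces. A quick illustration (the sample string is invented to mimic that markup):

import re

raw = "\n                主演：张国荣,张丰毅,巩俐\n        "
print(re.sub(r'\s+', '', raw))  # -> 主演：张国荣,张丰毅,巩俐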

4. Enable the pipeline (pipelines.py) in settings.py:

The relevant code is:

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'maoyan1.pipelines.Maoyan1Pipeline': 300,
}
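
The integer is the pipeline's order: values range from 0 to 1000 and lower numbers run first, which only matters once several pipelines are registered (the commented entry below is hypothetical):

ITEM_PIPELINES = {
    'maoyan1.pipelines.Maoyan1Pipeline': 300,
    # 'maoyan1.pipelines.DropLowScorePipeline': 800,  # hypothetical later stage
}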

5. Run the spider:

Command: scrapy crawl top100 --nolog
The resulting file is shown below:
[Figure 1: screenshot of the generated maoyan.json file]
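
Given the pipeline above, each line of maoyan.json holds one movie record with exactly these four keys (values elided here, not actual output):

{"name": "...", "actors": "...", "releasetime": "...", "score": "..."}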
