The data to scrape are each movie's name, starring actors, release date, and score from the Maoyan Top 100 board. The scraped data are saved to a local .json file.
Change directory: cd
Go up one directory level: cd ..
Create a new project: scrapy startproject maoyan1
Change into the maoyan1 directory: cd maoyan1
Create a spider: scrapy genspider -t basic top100 maoyan.com
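After these commands, scrapy startproject and scrapy genspider leave a project skeleton roughly like the following (the exact contents may vary slightly with your Scrapy version):

maoyan1/
    scrapy.cfg
    maoyan1/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            top100.py

The files edited below are items.py, pipelines.py, and spiders/top100.py, plus the pipeline registration in settings.py.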
The code for items.py is as follows:
import scrapy


class Maoyan1Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    actors = scrapy.Field()
    releasetime = scrapy.Field()
    score = scrapy.Field()
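For orientation (this is not part of the project files): scrapy.Item subclasses behave like dicts, which is what the pipeline below relies on. A minimal sketch with a placeholder value:

from maoyan1.items import Maoyan1Item

item = Maoyan1Item()
item['name'] = ['Example Movie']  # placeholder value, not real scraped data
print(item['name'])               # ['Example Movie']
print(dict(item))                 # {'name': ['Example Movie']}

Note that assigning a key not declared in the Item class raises a KeyError, which catches field-name typos early.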
The code for pipelines.py is as follows:
import json
import codecs


class Maoyan1Pipeline:
    def __init__(self):
        self.file = codecs.open("D:\\python\\代码文件\\代码\\Scrapy网络爬虫实战\\maoyan1\\maoyan.json", "w", encoding="UTF-8")

    def process_item(self, item, spider):
        # Each item field holds a parallel list of values for one page of
        # movies; recombine them into one JSON object per movie.
        for j in range(len(item["name"])):
            goods = {
                "name": item["name"][j],
                "actors": item["actors"][j],
                "releasetime": item["releasetime"][j],
                "score": item["score"][j],
            }
            self.file.write(json.dumps(goods, ensure_ascii=False) + '\n')
        return item  # return the item so any later pipeline stages can see it

    def close_spider(self, spider):
        self.file.close()
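Because the pipeline writes one JSON object per line, the output is JSON Lines rather than a single JSON array. A minimal sketch for reading it back, assuming the same file path as above:

import json

with open("maoyan.json", encoding="UTF-8") as f:  # adjust to the path used in the pipeline
    movies = [json.loads(line) for line in f]
print(len(movies))  # should be 100 once all ten pages have been crawled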
The code for top100.py is as follows:
import scrapy
import re
from maoyan1.items import Maoyan1Item
from scrapy.http import Request


class Top100Spider(scrapy.Spider):
    name = 'top100'
    allowed_domains = ['maoyan.com']
    start_urls = ['https://maoyan.com/board/4?offset=0']

    def parse(self, response):
        item = Maoyan1Item()
        item['name'] = response.xpath('//a[@class="image-link"]/@title').extract()
        # Strip newlines and whitespace from the actor lines
        lis = []
        for i in response.xpath('//p[@class="star"]/text()').extract():
            lis.append(re.sub(r'\s+', '', i))
        item['actors'] = lis
        item['releasetime'] = response.xpath('//p[@class="releasetime"]/text()').extract()
        item['score'] = response.xpath('//p[@class="score"]').xpath('string(.)').extract()
        yield item
        # Crawl the remaining pages automatically: offsets 10-90 plus the
        # start page cover all 10 pages (Scrapy's dupe filter drops repeats)
        for i in range(10, 100, 10):
            url = "https://maoyan.com/board/4?offset=" + str(i)
            yield Request(url, callback=self.parse)
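With the spider in place, the crawl is started from the project root (the directory containing scrapy.cfg):

scrapy crawl top100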
The pipeline is enabled in settings.py; the relevant code is as follows:
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'maoyan1.pipelines.Maoyan1Pipeline': 300,
}
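Note: maoyan.com may serve a verification page to clients that look like bots. If the spider returns empty results, two settings.py entries worth checking are the User-Agent and robots.txt handling (the values below are illustrative assumptions, not from the original project):

# Illustrative settings; adjust for your environment
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # hypothetical browser UA string
ROBOTSTXT_OBEY = False  # only if you have decided robots.txt does not apply to your use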