Scraping with Python Scrapy

Overview

  • Introduction
  • Implementation
    • Creating the project
    • Creating the spider
    • The Item class
    • The spider's parse function
    • XPath parsing
    • Pagination
    • Saving to xlsx
  • Results
  • Getting the code

Introduction

This project uses the Scrapy library to scrape information about Douban's Top 250 movies and store it in an xlsx file.

Implementation

Creating the project

Navigate to the target folder and run the following command from cmd (note: if Scrapy is not installed, install it first):

scrapy startproject douban_moive
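This generates Scrapy's standard project skeleton; the files edited in the rest of this post live inside it:

```text
douban_moive/
    scrapy.cfg            # deploy configuration
    douban_moive/
        __init__.py
        items.py          # item definitions (the Item class below)
        middlewares.py
        pipelines.py      # export logic (the xlsx saving below)
        settings.py
        spiders/          # spider modules go here
            __init__.py
```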

Creating the spider

scrapy genspider myspider "URL"

The Item class

import scrapy


class DoubanMovieItem(scrapy.Item):
    # One field per column of the final xlsx output.
    title = scrapy.Field()      # movie title(s), joined into one string
    moiveinfo = scrapy.Field()  # director / year / genre line
    star = scrapy.Field()       # rating score
    quote = scrapy.Field()      # one-line quote (may be missing)
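A scrapy.Item supports dict-style access, so a populated item carries exactly the four fields declared above. Illustratively (a plain dict standing in for the Item, with invented values):

```python
# Plain dict standing in for a populated DoubanMovieItem (values invented).
item = {
    "title": "肖申克的救赎/The Shawshank Redemption",
    "moiveinfo": "导演:弗兰克·德拉邦特;1994/美国/犯罪剧情",
    "star": "9.7",
    "quote": "希望让人自由。",
}
print(sorted(item))  # ['moiveinfo', 'quote', 'star', 'title']
```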

The spider's parse function

def parse(self, response):
    selector = Selector(response)
    # Each movie entry lives in a <div class="info"> block.
    Moives = selector.xpath("//div[@class='info']")
    for each_moive in Moives:
        # Create a fresh item per movie so earlier yields are not overwritten.
        item = DoubanMovieItem()
        # A title can span several <span> elements (Chinese + original name).
        title = each_moive.xpath('div[@class="hd"]/a/span/text()').extract()
        full_title = ""
        for each in title:
            full_title += each
        moiveinfo = each_moive.xpath(".//p/text()").extract()
        star = each_moive.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
        # Not every movie has a quote, so guard against an empty list.
        quote = each_moive.xpath('div[@class="bd"]/p/span/text()').extract()
        quote = quote[0] if quote else None
        item["title"] = full_title
        item["moiveinfo"] = ";".join(moiveinfo).replace(' ', '').replace('\n', '')
        item["star"] = star
        item["quote"] = quote
        yield item

XPath parsing

Once the page source is retrieved, the information is extracted with XPath:

title = each_moive.xpath('div[@class="hd"]/a/span/text()').extract()
full_title = ""
for each in title:
    full_title += each
moiveinfo = each_moive.xpath(".//p/text()").extract()
star = each_moive.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
quote = each_moive.xpath('div[@class="bd"]/p/span/text()').extract()
quote = quote[0] if quote else None
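The list handling around the extracted values can be illustrated with plain Python, using sample strings standing in for the lists that .extract() returns from the real page:

```python
# Sample data standing in for the lists returned by .extract();
# the real values come from the Douban page.
title = ["肖申克的救赎", " / The Shawshank Redemption"]
moiveinfo = ["\n  导演: 弗兰克·德拉邦特 ", "\n  1994 / 美国 / 犯罪 剧情\n"]
quote = []  # some movies have no quote, so the list can be empty

full_title = "".join(title)          # equivalent to the += loop above
info = ";".join(moiveinfo).replace(" ", "").replace("\n", "")
quote = quote[0] if quote else None  # guard against an empty list

print(full_title)
print(info)
print(quote)
```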

Pagination

# The "next" span holds an element whose href is relative to the list URL.
nextPage = selector.xpath('//span[@class="next"]/link/@href').extract()
if nextPage:
    nextPage = nextPage[0]
    print(self.url + str(nextPage))
    # Queue the next page and parse it with the same callback.
    yield Request(self.url + str(nextPage), callback=self.parse)
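One caveat in the snippet above: the extracted href is relative and is concatenated onto self.url by hand. A more robust alternative is to resolve it with urljoin from the standard library (Scrapy's response.urljoin does the same); the URLs below are assumptions matching a typical Top 250 page:

```python
from urllib.parse import urljoin

base = "https://movie.douban.com/top250"  # assumed start URL
next_page = "?start=25&filter="           # typical relative href on the page

# urljoin resolves the relative href against the current page URL,
# avoiding manual string-concatenation mistakes.
print(urljoin(base, next_page))  # https://movie.douban.com/top250?start=25&filter=
```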

Saving to xlsx

This part is implemented in pipelines.py; the core code is:

# Keep values as str: encoding to bytes breaks openpyxl under Python 3,
# and quote may be None.
line = [item['title'], item['moiveinfo'], item['star'], item['quote']]
self.ws.append(line)
self.wb.save(r'res.xlsx')
return item
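One way the full pipeline class might be organized (a sketch only; the class name and header row are assumptions, and openpyxl must be installed for a real run). Saving once in close_spider avoids rewriting the file for every item:

```python
class XlsxPipeline:
    """Hypothetical pipeline sketch: collect items into an openpyxl sheet."""

    def open_spider(self, spider):
        from openpyxl import Workbook  # imported lazily; assumes openpyxl is installed
        self.wb = Workbook()
        self.ws = self.wb.active
        self.ws.append(["title", "moiveinfo", "star", "quote"])  # header row

    def process_item(self, item, spider):
        # Keep values as str; a None quote becomes an empty cell.
        self.ws.append([item["title"], item["moiveinfo"], item["star"], item["quote"]])
        return item

    def close_spider(self, spider):
        # Save once at the end instead of after every item.
        self.wb.save("res.xlsx")
```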

Results

[Screenshots of the scraped xlsx results]

Getting the code

Tip ¥50 with the note "Douban info scraping" and leave your email in the comments; I will reply with the code as soon as I receive it.
[Payment QR code image]
