Crawling a Recipe Website with Scrapy

Crawler work today still mostly comes down to manually locating where each piece of information sits on the page.
The target of this crawl is the Meishijie recipe site (meishij.net),
and the crawler framework used is Scrapy.

I have to say, Scrapy really is a pleasure to use,
especially scrapy shell: writing XPath in scrapy shell to explore a page
is extremely convenient!
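
For example, an exploration session for the listing page looks roughly like this (a minimal sketch; the selectors are simply the ones that ended up in the spider below):

$ scrapy shell "http://www.meishij.net/chufang/diy/wucan/?&page=1"
>>> # links to the recipe detail pages on the listing page
>>> response.xpath("//div[@class='listtyle1']/a/@href").extract()[:3]
>>> # fetch one of those detail pages and poke at its fields
>>> fetch(response.xpath("//div[@class='listtyle1']/a/@href")[0].extract())
>>> response.xpath("//div[@class='cp_headerimg_w']/img/@src").extract_first()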

import scrapy
from cbspider.items import CbspiderItem
import os
import sqlite3
'''
    Date: 2018/4/2
    Author: Output20
    Description: this spider crawls the Meishijie (meishij.net) recipe site
'''

# Spider
class MeishiSpider(scrapy.Spider):
    name = "meishi"
    # allowed_domains = ["meishij.net"]
    # Start URLs
    start_urls = [
        # single listing page: "http://www.meishij.net/chufang/diy/wucan/?&page=1"
        # generate multiple listing pages (10 pages here):
        "http://www.meishij.net/chufang/diy/wucan/?&page=%d" % (i + 1) for i in range(10)
    ]
    # Listing page parser: extract the links to the detail pages
    def parse(self, response):
        # anchors wrapping each recipe card on the listing page
        hrefs = response.xpath("//div[@class='listtyle1']/a")
        for href in hrefs:
            # extract the detail page link
            url = href.xpath("@href")[0].extract()
            # print(url)
            # parse the detail page
            yield scrapy.Request(url, callback=self.parse_detail_page)

    # Detail page parser: manually locate each piece of information
    def parse_detail_page(self, response):
        # item that holds the scraped fields
        dish = CbspiderItem()
        info2 = response.xpath("//div[@class='info2']")
        # finished-dish image (URL)
        dish['chengpin'] = response.xpath("//div[@class='cp_headerimg_w']/img/@src")[0].extract()
        # cooking technique; ".//" keeps the search inside info2,
        # and the trailing + [''] guards against missing fields
        dish['gongyi'] = (info2.xpath(".//li[@class='w127']/a").xpath("text()").extract() + [''])[0]
        # flavor
        dish['kouwei'] = (info2.xpath(".//li[@class='w127 bb0']/a").xpath("text()").extract() + [''])[0]
        # difficulty
        dish['nandu'] = (info2.xpath(".//li[@class='w270']//a").xpath("text()").extract() + [''])[0]
        # servings
        dish['renshu'] = (info2.xpath(".//li[@class='w270 br0']//a").xpath("text()").extract() + [''])[0]
        # preparation time
        dish['zhunbeishijian'] = (info2.xpath(".//li[@class='w270 bb0']//a").xpath("text()").extract() + [''])[0]
        # cooking time
        dish['pengrenshijian'] = (info2.xpath(".//li[@class='w270 bb0 br0']//a").xpath("text()").extract() + [''])[0]
        # main ingredients
        dish['zhuliao'] = dict()
        for h4 in response.xpath("//div[@class='yl zl clearfix']//h4"):
            dish['zhuliao'][h4.xpath("a/text()").extract()[0]] = h4.xpath("span/text()").extract()[0]
        # auxiliary ingredients
        dish['fuliao'] = dict()
        for li in response.xpath("//div[@class='yl fuliao clearfix']//li"):
            dish['fuliao'][li.xpath("h4/a/text()").extract()[0]] = li.xpath("span/text()").extract()[0]
        # cooking steps (text + image link)
        # current step counter
        count = 0
        dish['guocheng'] = dict()
        for div in response.xpath("//div[@class='editnew edit']/div/div"):
            count += 1
            dish['guocheng'][count] = (div.xpath("p/text()")[0].extract(), div.xpath("p/img/@src")[0].extract())
        return dish
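
The CbspiderItem imported at the top lives in cbspider/items.py and is not shown in this post; below is a minimal sketch of what it could look like, with one Field per key assigned in parse_detail_page (the real definition may differ):

import scrapy

class CbspiderItem(scrapy.Item):
    chengpin = scrapy.Field()        # finished-dish image URL
    gongyi = scrapy.Field()          # cooking technique
    kouwei = scrapy.Field()          # flavor
    nandu = scrapy.Field()           # difficulty
    renshu = scrapy.Field()          # servings
    zhunbeishijian = scrapy.Field()  # preparation time
    pengrenshijian = scrapy.Field()  # cooking time
    zhuliao = scrapy.Field()         # main ingredients, dict of name -> amount
    fuliao = scrapy.Field()          # auxiliary ingredients, dict of name -> amount
    guocheng = scrapy.Field()        # steps, dict of step number -> (text, image URL)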

Later I improved the spider.
The main theme of this update is generalization.
In the previous version our attention went mostly to scraping data from the individual detail pages,
and one problem was left unsolved: the crawl coverage was too narrow (only recipes under a single category were fetched).
Although the site hosts a huge number of recipes, there is no single directory page that lists all of them, probably for user-experience reasons.

So the problem to solve is collecting enough recipes.
My initial plan is to start from a limited set of topics, extract new topics from the detail pages,
and then crawl the pages of those new topics in turn (somewhat like a graph search).

However, in the Scrapy framework the start_requests method runs only once, so this kind of dynamic crawl cannot be expressed directly;
we first have to make the crawl static (that is, collect as many topics as possible up front, and then crawl each topic).

So I created a new spider dedicated to collecting topics; once enough topics have been gathered,
we can crawl the concrete recipe pages statically, as sketched below.
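
A minimal sketch of such a topic spider, assuming the links to other categories can simply be filtered out of the anchors on a listing page (the spider name, seed URL, and filtering rule are assumptions, not the exact code I used):

import scrapy

class TopicSpider(scrapy.Spider):
    # collects category ("topic") URLs so the recipe spider can crawl them statically
    name = "topics"
    start_urls = ["http://www.meishij.net/chufang/diy/wucan/?&page=1"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen = set()  # avoid emitting the same topic twice

    def parse(self, response):
        for href in response.xpath("//a/@href").extract():
            url = response.urljoin(href).split("?")[0]
            # keep only category-style listing URLs, not individual recipe pages
            if ("/chufang/diy/" in url or "/china-food/" in url or "/hongpei/" in url) \
                    and url not in self.seen:
                self.seen.add(url)
                yield {"topic_url": url}

Running it with scrapy crawl topics -o topics.json dumps the collected URLs to a file; following detail pages with further Requests would widen the harvest, but even the single seed page already yields a useful list.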

Here are the preliminary results of the topic crawl:

http://www.meishij.net/chufang/diy/zaocan/
http://www.meishij.net/chufang/diy/yexiao/
http://www.meishij.net/chufang/diy/zhushi/
http://www.meishij.net/chufang/diy/diy/
http://www.meishij.net/chufang/diy/wucan/
http://www.meishij.net/chufang/diy/wancan/
http://www.meishij.net/chufang/diy/tangbaocaipu/
http://www.meishij.net/chufang/diy/jiangchangcaipu/
http://www.meishij.net/chufang/diy/gaodianxiaochi/
http://www.meishij.net/chufang/diy/jiachangtianpin/
http://www.meishij.net/chufang/diy/xiawucha/
http://www.meishij.net/chufang/diy/langcaipu/
http://www.meishij.net/china-food/caixi/yuecai/
http://www.meishij.net/chufang/diy/recaipu/
http://www.meishij.net/hongpei/tianpindianxin/
http://www.meishij.net/chufang/diy/lingshi/
http://www.meishij.net/hongpei/bingganpeifang/
http://www.meishij.net/hongpei/dangao/
http://www.meishij.net/chufang/diy/tianpindianxin/
http://www.meishij.net/chufang/diy/yinpin/
http://www.meishij.net/chufang/diy/baobaocaipu/
http://www.meishij.net/chufang/diy/shaokao/
http://www.meishij.net/chufang/diy/huoguo/
http://www.meishij.net/chufang/diy/meirong/
http://www.meishij.net/chufang/diy/shoushen/
http://www.meishij.net/chufang/diy/jiangliaozhanliao/
http://www.meishij.net/china-food/xiaochi/beijing/
http://www.meishij.net/chufang/diy/yaoshan/
http://www.meishij.net/chufang/diy/biandang/
http://www.meishij.net/hongpei/dangaomianbao/
http://www.meishij.net/hongpei/mianbao/
http://www.meishij.net/chufang/diy/haixian/
http://www.meishij.net/chufang/diy/gaodian/
http://www.meishij.net/chufang/diy/sijiacai/
http://www.meishij.net/china-food/caixi/lucai/
http://www.meishij.net/china-food/caixi/chuancai/
http://www.meishij.net/china-food/caixi/dongbeicai/
http://www.meishij.net/china-food/caixi/xiangcai/
http://www.meishij.net/chufang/diy/weibolucaipu/
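
These topic URLs can then be expanded into listing pages and used as the static start_urls of the recipe spider, for example (a sketch that assumes the list above was saved to topics.txt, one URL per line, and that 10 pages per topic is enough):

# build start_urls for the recipe spider from the collected topic URLs
with open("topics.txt") as f:
    topics = [line.strip() for line in f if line.strip()]

start_urls = ["%s?&page=%d" % (url, page)
              for url in topics
              for page in range(1, 11)]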
