Scraping JD.com Book Data with Scrapy

start_url: https://book.jd.com/booksort.html
A link to the complete project is at the end of the article.

1. Create the project

scrapy startproject jd_book
cd jd_book
scrapy genspider jdBook jd.com

The following files are then generated (the standard Scrapy project layout):
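jd_book/
├── scrapy.cfg
└── jd_book/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── jdBook.py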

2. Modify the project settings (settings.py)

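A minimal sketch of the relevant settings. The User-Agent strings are just examples, and the middleware and pipeline names assume the classes shown in sections 5 and 6:

# settings.py -- minimal sketch; the exact values in the original project may differ
BOT_NAME = 'jd_book'
SPIDER_MODULES = ['jd_book.spiders']
NEWSPIDER_MODULE = 'jd_book.spiders'

ROBOTSTXT_OBEY = False      # JD's robots.txt would otherwise block most pages
DOWNLOAD_DELAY = 1          # be polite; raise this if JD starts blocking you

# pool of User-Agent strings consumed by the middleware in section 6
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/13.1 Safari/605.1.15',
]

DOWNLOADER_MIDDLEWARES = {
    'jd_book.middlewares.RandomUserAgentMiddleware': 543,
}

ITEM_PIPELINES = {
    'jd_book.pipelines.JdBookPipeline': 300,
}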

3. Define the item fields (items.py)


import scrapy

class JdBookItem(scrapy.Item):
    b_title = scrapy.Field()    # top-level category name
    s_title = scrapy.Field()    # sub-category name
    href = scrapy.Field()       # sub-category listing-page URL
    book_img = scrapy.Field()   # cover image URL
    book_name = scrapy.Field()  # book title
    book_url = scrapy.Field()   # detail-page URL
    sku = scrapy.Field()        # JD product id
    author = scrapy.Field()     # author string
    price = scrapy.Field()      # price from the p.3.cn API
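JdBookItem behaves like a dict, except that assigning to a field that was not declared above raises a KeyError, which catches typos early:

item = JdBookItem()
item['sku'] = '12345678'      # declared field: accepted
# item['skuu'] = '12345678'   # undeclared field: raises KeyError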

4. Write the spider (jdBook.py)


# -*- coding: utf-8 -*-
import scrapy, json
from copy import deepcopy
from ..items import JdBookItem

class JdbookSpider(scrapy.Spider):
    name = 'jdBook'
    # p.3.cn serves the price API; 360buyimg.com hosts the cover images
    allowed_domains = ['jd.com', '360buyimg.com', 'p.3.cn']
    start_urls = ['https://book.jd.com/booksort.html']

    def parse(self, response):  # category page
        item = JdBookItem()
        dl_list = response.xpath('//div[@class="mc"]/dl')
        for dl in dl_list:
            # <dt> holds the top-level category name
            item['b_title'] = dl.xpath('./dt/a/text()').extract_first()
            em_list = dl.xpath('./dd/em')
            for em in em_list:
                # each <em> is a sub-category linking to its listing page
                item['s_title'] = em.xpath('./a/text()').extract_first()
                item['href'] = 'https:' + em.xpath('./a/@href').extract_first()
                yield scrapy.Request(item['href'], callback=self.book_list,
                                     meta=deepcopy({'item': item}))
    def book_list(self, response):  # listing page
        item = response.meta['item']
        li_list = response.xpath('//div[@id="J_goodsList"]/ul/li')
        for li in li_list:
            # each <li> carries the product id in its data-sku attribute
            item['sku'] = li.xpath('./@data-sku').extract_first()
            item['book_url'] = 'https://item.jd.com/{}.html'.format(item['sku'])
            yield scrapy.Request(item['book_url'], callback=self.book,
                                 meta=deepcopy({'item': item}))
        # pagination: the total page count sits in the "skip to page" widget
        sum_page = response.xpath('//span[@class="p-skip"]/em/b/text()').extract_first()
        if sum_page is not None:
            for i in range(2, int(sum_page) + 1):
                # assume the listing URL accepts a page query parameter
                url = '{}?page={}'.format(item['href'], i)
                yield scrapy.Request(url, callback=self.book_list,
                                     meta=deepcopy({'item': item}))

    def book(self, response):  # detail page
        item = response.meta['item']
        # the title is split across several text nodes; strip and re-join them
        item['book_name'] = response.xpath(
            '//div[@id="name"]/div[@class="sku-name"]/text()').extract()
        item['book_name'] = ''.join(i.strip() for i in item['book_name'] if i.strip())
        item['book_img'] = 'http:' + response.xpath('//div[@id="spec-n1"]//img/@src').extract_first()
        item['author'] = response.xpath('//div[@class="p-author"]//text()').extract()
        item['author'] = ''.join(i.strip() for i in item['author'] if i.strip())
        # the price is not in the page HTML; it comes from a separate JSON endpoint
        yield scrapy.Request('https://p.3.cn/prices/mgets?skuIds={}'.format(item['sku']),
                             callback=self.price, meta={'item': deepcopy(item)})

    def price(self, response):  # parse the book-price JSON
        item = response.meta['item']
        # the endpoint returns a JSON array; 'op' holds the displayed price
        item['price'] = json.loads(response.body.decode())[0]['op'] + '¥'
        yield item
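
For reference, the p.3.cn/prices/mgets endpoint returns a JSON array with one object per SKU. The values below are invented, but 'op' is the field the spider reads as the displayed price:

import json

# illustrative response body (values invented)
body = '[{"id": "J_12345678", "p": "45.50", "m": "59.00", "op": "45.50"}]'
print(json.loads(body)[0]['op'])  # -> 45.50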


5. Write the pipeline to save the data (pipelines.py)

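The run produces a jd.csv file (section 8), so a minimal sketch of a pipeline that writes each item to that CSV follows. The class name JdBookPipeline and the column order are assumptions:

# pipelines.py -- minimal CSV-writing sketch
import csv

FIELDS = ['b_title', 's_title', 'book_name', 'author',
          'price', 'sku', 'book_url', 'book_img']

class JdBookPipeline:
    def open_spider(self, spider):
        # utf-8-sig so Excel opens the Chinese text correctly
        self.file = open('jd.csv', 'w', newline='', encoding='utf-8-sig')
        self.writer = csv.writer(self.file)
        self.writer.writerow(FIELDS)

    def process_item(self, item, spider):
        self.writer.writerow([item.get(f, '') for f in FIELDS])
        return item

    def close_spider(self, spider):
        self.file.close()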

6. Write the downloader middleware (middlewares.py)

The middleware sets a random User-Agent on each outgoing request. A minimal sketch (the class name RandomUserAgentMiddleware is illustrative; it must match the name registered in DOWNLOADER_MIDDLEWARES):

import random

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # pick a random User-Agent from USER_AGENT_LIST in settings.py
        ua = random.choice(spider.settings.get('USER_AGENT_LIST'))
        request.headers['User-Agent'] = ua
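
Note that without the corresponding DOWNLOADER_MIDDLEWARES entry in settings.py (see the sketch in section 2), Scrapy never calls process_request, and requests go out with the default User-Agent.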

7. Create a .py file to run the Scrapy spider (run_jd_book.py)


import os

# --nolog suppresses Scrapy's console logging; drop it while debugging
os.system('scrapy crawl jdBook --nolog')
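
An equivalent that stays inside Python (and works better with an IDE debugger) is Scrapy's cmdline helper:

from scrapy import cmdline

# same effect as the os.system call above
cmdline.execute('scrapy crawl jdBook --nolog'.split())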

8. Run results

[Screenshots: console output of the running crawler]

The resulting jd.csv file:

[Screenshot: contents of jd.csv]

Complete project link

Link: https://pan.baidu.com/s/1cduUC1vIvGyOD1ch2k4N6Q
Extraction code: x84q
