[Crawler] Scraping all of JD.com with Python (products, shops, categories, comments)

I. Environment

  • OS: Windows 10
  • Python: 3.5
  • Scrapy: 1.3.2
  • pymongo: 3.2
  • PyCharm

Setting up the environment is left to the reader; there are plenty of tutorials online.

II. Database schema

1. Product categories

JD has roughly 1,183 categories, after excluding virtual products (phone credit, lottery tickets, train tickets, and so on). They can all be browsed at:

https://www.jd.com/allSort.aspx

This is also the URL the crawl starts from. Some of these "categories" are actually channel pages, which means they contain many sub-categories of their own, so some special handling is needed to reach every category; the details are covered below.
[Figure 1]

    name  # category name
    url   # category url
    _id   # category id

2. Products

[Figure 2]

    url  # product url
    _id  # product id
    category  # product category
    reallyPrice  # current price
    originalPrice  # original price
    description  # product description
    shopId  # shop id
    venderId  # vender id
    commentCount  # total number of reviews
    goodComment  # number of positive reviews
    generalComment  # number of neutral reviews
    poolComment  # number of negative reviews
    favourableDesc1  # promotion description 1
    favourableDesc2  # promotion description 2

3. Comments

[Figure 3]

    _id  # comment id
    productId  # product id
    guid
    content  # comment text
    creationTime  # comment time
    isTop
    referenceId
    referenceName
    referenceType
    referenceTypeId
    firstCategory
    secondCategory
    thirdCategory
    replyCount  # number of replies
    score  # rating
    status
    title
    usefulVoteCount  # number of "useful" votes
    uselessVoteCount  # number of "useless" votes
    userImage
    userImageUrl
    userLevelId
    userProvince
    viewCount
    orderId  # order id
    isReplyGrade
    nickname  # commenter's nickname
    userClient
    mergeOrderStatus
    discussionId
    productColor
    productSize
    imageCount  # number of images in the comment
    integral
    userImgFlag
    anonymousFlag
    userLevelName
    plusAvailable
    recommend
    userLevelColor
    userClientShow
    isMobile  # whether posted from a mobile client
    days
    afterDays  # days before the follow-up review was posted

4. Shops

[Figure 4]

A shop can have an alias, so a shop generally has two URLs. For example, the shop 宝梦旗舰店:
url1: http://mall.jd.com/index-596056.html
url2: https://baomeng.jd.com

    _id  # shop id
    name  # shop name
    url1  # shop url 1
    url2  # shop url 2
    shopId  # shop id
    venderId  # vender id
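As a sketch (the helper name is mine, not from the original spider), the numeric shop id can be pulled out of the first URL form with a regex:

```python
import re

def shop_id_from_mall_url(url):
    """Extract the numeric shop id from a mall.jd.com index URL, or None."""
    match = re.search(r'index-(\d+)\.html', url)
    return match.group(1) if match else None

print(shop_id_from_mall_url('http://mall.jd.com/index-596056.html'))  # 596056
```

The alias form (url2) carries no id, so it has to be resolved via the ids scraped from a product detail page.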

5. Comment summary

[Figure 5]

    _id
    goodRateShow  # positive-rating percentage shown
    poorRateShow  # negative-rating percentage shown
    poorCountStr  # negative-review count, as a string
    averageScore  # average rating
    generalCountStr  # neutral-review count, as a string
    showCount
    showCountStr
    goodCount  # number of positive reviews
    generalRate  # neutral-review rate
    generalCount  # number of neutral reviews
    skuId
    goodCountStr  # positive-review count, as a string
    poorRate  # negative-review rate
    afterCount  # number of follow-up reviews
    goodRateStyle
    poorCount
    skuIds
    poorRateStyle
    generalRateStyle
    commentCountStr
    commentCount
    productId  # product id
    afterCountStr
    goodRate
    generalRateShow
    jwotestProduct
    maxPage
    score
    soType
    imageListCount

III. Scraping

1. Scraping categories

The code is as follows:

    def parse_category(self, response):
        """Parse the category page."""
        selector = Selector(response)
        try:
            texts = selector.xpath('//div[@class="category-item m"]/div[@class="mc"]/div[@class="items"]/dl/dd/a').extract()
            for text in texts:
                # capture (href, link text) pairs from each <a> element
                items = re.findall(r'<a href="(.*?)">(.*?)</a>', text)
                for item in items:
                    if item[0].split('.')[0][2:] in key_word:
                        if item[0].split('.')[0][2:] != 'list':
                            yield Request(url='https:' + item[0], callback=self.parse_category)
                        else:
                            categoriesItem = CategoriesItem()
                            categoriesItem['name'] = item[1]
                            categoriesItem['url'] = 'https:' + item[0]
                            categoriesItem['_id'] = item[0].split('=')[1].split('&')[0]
                            yield categoriesItem
                            yield Request(url='https:' + item[0], callback=self.parse_list)
        except Exception as e:
            print('error:', e)

As mentioned earlier, some categories contain many sub-categories, so for such URLs the category parser is invoked again:

    if item[0].split('.')[0][2:] != 'list':
        yield Request(url='https:' + item[0], callback=self.parse_category)
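The `_id` assigned above comes from splitting the list-page URL on `=` and `&`. A quick illustration, using a made-up URL of the usual list-page shape:

```python
# A typical list-page href (example URL, not taken from a live crawl)
url = '//list.jd.com/list.html?cat=737,794,798&page=1'

# The same expression the spider uses for categoriesItem['_id']
category_id = url.split('=')[1].split('&')[0]
print(category_id)  # 737,794,798
```

The comma-separated value is the category path (top level to leaf) as JD encodes it in the `cat` query parameter.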

2. Scraping products

Visiting each category's URL yields the product list; from it we take each product's URL and visit the detail page to scrape the product details:

    def parse_list(self, response):
        """Collect the product URLs (and the next-page URL) from a list page."""
        meta = dict()
        meta['category'] = response.url.split('=')[1].split('&')[0]

        selector = Selector(response)
        texts = selector.xpath('//*[@id="plist"]/ul/li/div/div[@class="p-img"]/a').extract()
        for text in texts:
            # capture the href of each product link
            items = re.findall(r'href="(.*?)"', text)
            yield Request(url='https:' + items[0], callback=self.parse_product, meta=meta)

Most of the basic product information can be taken from the detail page, but some fields, such as the price and the promotion details, are loaded dynamically.

Start with the price. The request URL has the form:

https://p.3.cn/prices/mgets?skuIds=J_(product_id)

where the part in parentheses at the end is the product id, which has to be filled in dynamically. The code:

    response = requests.get(url=price_url + product_id)
    price_json = response.json()
    productsItem['reallyPrice'] = price_json[0]['p']
    productsItem['originalPrice'] = price_json[0]['m']

The response is JSON, which is easy to parse.
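For illustration, here is that parsing step run against a hand-written response body in the same shape as the price endpoint returns (the values below are made up):

```python
import json

# Mocked-up body in the shape returned by the price endpoint
body = '[{"id": "J_1234567", "p": "99.00", "m": "199.00"}]'

price_json = json.loads(body)
really_price = price_json[0]['p']    # current selling price
original_price = price_json[0]['m']  # original (list) price
print(really_price, original_price)  # 99.00 199.00
```

Note that the endpoint returns prices as strings, so convert to a numeric type before doing arithmetic on them.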

Now the promotions. They come in two kinds: coupons and spend-X-save-Y descriptions:
[Figure 6]
Both kinds are loaded dynamically, so both need to be fetched; the code:

    # promotions
    res_url = favourable_url % (product_id, shop_id, vender_id, category.replace(',', '%2c'))
    response = requests.get(res_url)
    fav_data = response.json()
    if fav_data['skuCoupon']:  # coupons
        desc1 = []
        for item in fav_data['skuCoupon']:
            start_time = item['beginTime']
            end_time = item['endTime']
            time_dec = item['timeDesc']
            fav_price = item['quota']
            fav_count = item['discount']
            fav_time = item['addDays']
            # "valid from %s to %s, spend %s save %s"
            desc1.append(u'有效期%s至%s,满%s减%s' % (start_time, end_time, fav_price, fav_count))
        productsItem['favourableDesc1'] = ';'.join(desc1)

    if fav_data['prom'] and fav_data['prom']['pickOneTag']:  # spend-X-save-Y offers
        desc2 = []
        for item in fav_data['prom']['pickOneTag']:
            desc2.append(item['content'])
        productsItem['favourableDesc2'] = ';'.join(desc2)
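To make the coupon branch concrete, here it is run over a hand-written `skuCoupon` entry (the field values are invented; the field names are the ones used above):

```python
# Invented sample in the shape the coupon-parsing code consumes
fav_data = {
    'skuCoupon': [
        {'beginTime': '2017-01-01', 'endTime': '2017-01-31',
         'timeDesc': '', 'quota': 100, 'discount': 10, 'addDays': 0},
    ]
}

desc1 = []
for item in fav_data['skuCoupon']:
    start_time = item['beginTime']
    end_time = item['endTime']
    fav_price = item['quota']    # spend threshold
    fav_count = item['discount']  # amount off
    # "valid from %s to %s, spend %s save %s"
    desc1.append(u'有效期%s至%s,满%s减%s' % (start_time, end_time, fav_price, fav_count))

print(';'.join(desc1))
```

Each coupon becomes one clause; multiple coupons are joined with `;` into a single description string for the product record.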

3. Scraping shop information

Both the shop id and the vender id can be found directly on each product's detail page:

    ids = re.findall(r"venderId:(.*?),\s.*?shopId:'(.*?)'", response.text)
    if not ids:
        ids = re.findall(r"venderId:(.*?),\s.*?shopId:(.*?),", response.text)
    vender_id = ids[0][0]
    shop_id = ids[0][1]
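Run against a hand-written fragment of a detail page (the snippet is mine; the first regex is the one above), the two ids come out as one tuple:

```python
import re

# Invented fragment mimicking the inline JS on a detail page
text = "venderId:123456,\n    shopId:'654321'"

ids = re.findall(r"venderId:(.*?),\s.*?shopId:'(.*?)'", text)
vender_id = ids[0][0]
shop_id = ids[0][1]
print(vender_id, shop_id)  # 123456 654321
```

The fallback regex exists because on some templates `shopId` is written without quotes, in which case the first pattern finds nothing.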

The shop name is harder to extract: there are several different page templates, the shop title sits in a different place in each, and for JD self-operated (自营) products the name can also be taken from the detail page. The code falls back through the possibilities:

    try:
        name = response.xpath('//ul[@class="parameter2 p-parameter-list"]/li/a//text()').extract()[0]
    except:
        try:
            name = response.xpath('//div[@class="name"]/a//text()').extract()[0].strip()
        except:
            try:
                name = response.xpath('//div[@class="shopName"]/strong/span/a//text()').extract()[0].strip()
            except:
                try:
                    name = response.xpath('//div[@class="seller-infor"]/a//text()').extract()[0].strip()
                except:
                    name = u'京东自营'
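The nested try/except blocks can be flattened into a loop over candidate results. This is a sketch of the same fallback logic (the helper name is mine); plain lists stand in for `.xpath(...).extract()` results so the idea is visible without Scrapy:

```python
def first_hit(candidates, default):
    """Return the first element of the first non-empty candidate list, else the default."""
    for results in candidates:
        if results:
            return results[0].strip()
    return default

# Each inner list stands for one .xpath(...).extract() result (made up here)
name = first_hit([[], [], ['  宝梦旗舰店  '], []], default=u'京东自营')
print(name)  # 宝梦旗舰店
```

In the spider, `candidates` would be the four `response.xpath(...).extract()` calls in the order shown above, which also avoids the bare `except:` clauses swallowing unrelated errors.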

4. Scraping comments

Comment data is also loaded dynamically and returned as JSON. The request URL has the form:

https://club.jd.com/comment/productPageComments.action?productId=(product_id)&score=0&sortType=5&page=%s&pageSize=10

Only the product id is needed.
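Filling in the two blanks is plain string formatting; a sketch using the same `comment_url` template that the pagination code later relies on (the product id below is made up):

```python
# Template with productId and page left as %s slots
comment_url = ('https://club.jd.com/comment/productPageComments.action'
               '?productId=%s&score=0&sortType=5&page=%s&pageSize=10')

url = comment_url % ('1234567', '0')  # made-up product id, first page
print(url)
```

`score=0` asks for all ratings and `sortType=5` is the default sort; changing `score` filters to good/neutral/poor comments only.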

The code that parses the comment response:

"""获取商品comment"""
       try:
           data = json.loads(response.text)
       except Exception as e:
           print('get comment failed:', e)
           return None

       product_id = response.meta['product_id']

       commentSummaryItem = CommentSummaryItem()
       commentSummary = data.get('productCommentSummary')
       commentSummaryItem['goodRateShow'] = commentSummary.get('goodRateShow')
       commentSummaryItem['poorRateShow'] = commentSummary.get('poorRateShow')
       commentSummaryItem['poorCountStr'] = commentSummary.get('poorCountStr')
       commentSummaryItem['averageScore'] = commentSummary.get('averageScore')
       commentSummaryItem['generalCountStr'] = commentSummary.get('generalCountStr')
       commentSummaryItem['showCount'] = commentSummary.get('showCount')
       commentSummaryItem['showCountStr'] = commentSummary.get('showCountStr')
       commentSummaryItem['goodCount'] = commentSummary.get('goodCount')
       commentSummaryItem['generalRate'] = commentSummary.get('generalRate')
       commentSummaryItem['generalCount'] = commentSummary.get('generalCount')
       commentSummaryItem['skuId'] = commentSummary.get('skuId')
       commentSummaryItem['goodCountStr'] = commentSummary.get('goodCountStr')
       commentSummaryItem['poorRate'] = commentSummary.get('poorRate')
       commentSummaryItem['afterCount'] = commentSummary.get('afterCount')
       commentSummaryItem['goodRateStyle'] = commentSummary.get('goodRateStyle')
       commentSummaryItem['poorCount'] = commentSummary.get('poorCount')
       commentSummaryItem['skuIds'] = commentSummary.get('skuIds')
       commentSummaryItem['poorRateStyle'] = commentSummary.get('poorRateStyle')
       commentSummaryItem['generalRateStyle'] = commentSummary.get('generalRateStyle')
       commentSummaryItem['commentCountStr'] = commentSummary.get('commentCountStr')
       commentSummaryItem['commentCount'] = commentSummary.get('commentCount')
       commentSummaryItem['productId'] = commentSummary.get('productId')  # 同ProductsItem的id相同
       commentSummaryItem['_id'] = commentSummary.get('productId')
       commentSummaryItem['afterCountStr'] = commentSummary.get('afterCountStr')
       commentSummaryItem['goodRate'] = commentSummary.get('goodRate')
       commentSummaryItem['generalRateShow'] = commentSummary.get('generalRateShow')
       commentSummaryItem['jwotestProduct'] = data.get('jwotestProduct')
       commentSummaryItem['maxPage'] = data.get('maxPage')
       commentSummaryItem['score'] = data.get('score')
       commentSummaryItem['soType'] = data.get('soType')
       commentSummaryItem['imageListCount'] = data.get('imageListCount')
       yield commentSummaryItem

       for hotComment in data['hotCommentTagStatistics']:
           hotCommentTagItem = HotCommentTagItem()
           hotCommentTagItem['_id'] = hotComment.get('id')
           hotCommentTagItem['name'] = hotComment.get('name')
           hotCommentTagItem['status'] = hotComment.get('status')
           hotCommentTagItem['rid'] = hotComment.get('rid')
           hotCommentTagItem['productId'] = hotComment.get('productId')
           hotCommentTagItem['count'] = hotComment.get('count')
           hotCommentTagItem['created'] = hotComment.get('created')
           hotCommentTagItem['modified'] = hotComment.get('modified')
           hotCommentTagItem['type'] = hotComment.get('type')
           hotCommentTagItem['canBeFiltered'] = hotComment.get('canBeFiltered')
           yield hotCommentTagItem

       for comment_item in data['comments']:
           comment = CommentItem()

           comment['_id'] = comment_item.get('id')
           comment['productId'] = product_id
           comment['guid'] = comment_item.get('guid')
           comment['content'] = comment_item.get('content')
           comment['creationTime'] = comment_item.get('creationTime')
           comment['isTop'] = comment_item.get('isTop')
           comment['referenceId'] = comment_item.get('referenceId')
           comment['referenceName'] = comment_item.get('referenceName')
           comment['referenceType'] = comment_item.get('referenceType')
           comment['referenceTypeId'] = comment_item.get('referenceTypeId')
           comment['firstCategory'] = comment_item.get('firstCategory')
           comment['secondCategory'] = comment_item.get('secondCategory')
           comment['thirdCategory'] = comment_item.get('thirdCategory')
           comment['replyCount'] = comment_item.get('replyCount')
           comment['score'] = comment_item.get('score')
           comment['status'] = comment_item.get('status')
           comment['title'] = comment_item.get('title')
           comment['usefulVoteCount'] = comment_item.get('usefulVoteCount')
           comment['uselessVoteCount'] = comment_item.get('uselessVoteCount')
           comment['userImage'] = 'http://' + comment_item.get('userImage')
           comment['userImageUrl'] = 'http://' + comment_item.get('userImageUrl')
           comment['userLevelId'] = comment_item.get('userLevelId')
           comment['userProvince'] = comment_item.get('userProvince')
           comment['viewCount'] = comment_item.get('viewCount')
           comment['orderId'] = comment_item.get('orderId')
           comment['isReplyGrade'] = comment_item.get('isReplyGrade')
           comment['nickname'] = comment_item.get('nickname')
           comment['userClient'] = comment_item.get('userClient')
           comment['mergeOrderStatus'] = comment_item.get('mergeOrderStatus')
           comment['discussionId'] = comment_item.get('discussionId')
           comment['productColor'] = comment_item.get('productColor')
           comment['productSize'] = comment_item.get('productSize')
           comment['imageCount'] = comment_item.get('imageCount')
           comment['integral'] = comment_item.get('integral')
           comment['userImgFlag'] = comment_item.get('userImgFlag')
           comment['anonymousFlag'] = comment_item.get('anonymousFlag')
           comment['userLevelName'] = comment_item.get('userLevelName')
           comment['plusAvailable'] = comment_item.get('plusAvailable')
           comment['recommend'] = comment_item.get('recommend')
           comment['userLevelColor'] = comment_item.get('userLevelColor')
           comment['userClientShow'] = comment_item.get('userClientShow')
           comment['isMobile'] = comment_item.get('isMobile')
           comment['days'] = comment_item.get('days')
           comment['afterDays'] = comment_item.get('afterDays')
           yield comment

           if 'images' in comment_item:
               for image in comment_item['images']:
                   commentImageItem = CommentImageItem()
                   commentImageItem['_id'] = image.get('id')
                   commentImageItem['associateId'] = image.get('associateId')  # 和CommentItem的discussionId相同
                   commentImageItem['productId'] = image.get('productId')  # 不是ProductsItem的id,这个值为0
                   commentImageItem['imgUrl'] = 'http:' + image.get('imgUrl')
                   commentImageItem['available'] = image.get('available')
                   commentImageItem['pin'] = image.get('pin')
                   commentImageItem['dealt'] = image.get('dealt')
                   commentImageItem['imgTitle'] = image.get('imgTitle')
                   commentImageItem['isMain'] = image.get('isMain')
                   yield commentImageItem

       # next page
       for i in range(1, int(data['maxPage'])):
           url = comment_url % (product_id, str(i))
           meta = dict()
           meta['product_id'] = product_id
           yield Request(url=url, callback=self.parse_comments2, meta=meta)
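Note that `range(1, maxPage)` starts at 1, presumably because page 0 is the response already being parsed. With a made-up product id and `maxPage` of 3, the loop schedules pages 1 and 2:

```python
comment_url = ('https://club.jd.com/comment/productPageComments.action'
               '?productId=%s&score=0&sortType=5&page=%s&pageSize=10')
product_id = '1234567'  # made-up id
max_page = 3            # made-up value for data['maxPage']

urls = [comment_url % (product_id, str(i)) for i in range(1, max_page)]
print(len(urls))  # 2
```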

5. The crawling process

[Figure 7]

The core code has all been shown above. It is somewhat rough; discussion is welcome.

Some screenshots of the collected data:
[Figure 8]
If you need the data, feel free to contact me.
