[Crawler - Python] Crawling 100 pages of JD.com machine-learning books (price, discounted price, title, author, positive reviews, negative reviews, total reviews)

Crawling JD.com machine-learning book information with Python

  • 1. Configure the search keyword and number of pages
  • 2. Finding the three URLs
    • 1. Book search URL
    • 2. URL for total, negative, and positive review counts
    • 3. URL for the current price and the pre-discount price
  • 3. Code walkthrough
  • 4. Complete code
  • 5. Results

1. Configure the search keyword and number of pages

In this example the search keyword is 机器学习 (machine learning). I set the page count to 100 and the account wasn't blocked; that pulled down roughly 3,000 book records. I didn't keep track of the time, but it only took a few minutes.

if __name__ == '__main__':
    # For testing, crawl only two search pages (plus their review and price data)
    test = CrawlDog('机器学习')
    test.main(2)
    test.store_xsl()

2. Finding the three URLs

1. Book search URL

https://search.jd.com/Search?keyword=%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0&enc=utf-8&suggest=1.his.0.0&wq=&pvid=d73028f8cf3e46deb44d843ef082fcef
(Screenshot 1)
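
To double-check the URL parameters, here is a minimal sketch of requesting one search page and pulling out the SKU ids. The page/s pagination values mirror crawl_message() in the full code below; reading the id from the data-sku attribute is an assumption — the full code derives it from the product URL instead.

import requests
from lxml import etree

def fetch_search_page(keyword, page):
    # Same URL pattern as crawl_message() in the full code; s is the result offset
    url = 'https://search.jd.com/Search?keyword={}&enc=utf-8&page={}&s={}'.format(
        keyword, page, (page - 1) * 30 + 1)
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
    html = etree.HTML(requests.get(url, headers=headers).content.decode())
    # data-sku on each <li class="gl-item"> is assumed to hold the SKU id
    return html.xpath('//li[contains(@class, "gl-item")]/@data-sku')

# print(fetch_search_page('机器学习', 1))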

2. URL for total, negative, and positive review counts

Open DevTools (F12) → Network and scroll down the 机器学习 search results page. You will see many image loads and JSON requests; one of them is the asynchronous request that returns the review counts.

https://club.jd.com/comment/productCommentSummaries.action?referenceIds=69957954609,33316347153,20445809140,11166079878,40853170920&callback=jQuery4865124&_=1593055042908
(Screenshot 2)
Strip &callback=jQuery4865124 from the URL. With the callback parameter present, the response is wrapped in a jQuery4865124(...) JSONP callback, which gets in the way of parsing it as JSON:

https://club.jd.com/comment/productCommentSummaries.action?referenceIds=69957954609,33316347153,20445809140,11166079878,40853170920&_=1593055042908
(Screenshot 3)
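
A minimal sketch of calling this endpoint for a few SKU ids and reading the counts; the JSON field names (CommentsCount, SkuId, CommentCount, GoodCount, PoorCount) are the same ones used in comments() in the full code below.

import requests

def fetch_comment_counts(sku_ids):
    url = ('https://club.jd.com/comment/productCommentSummaries.action'
           '?referenceIds=' + ','.join(str(i) for i in sku_ids))
    headers = {'User-Agent': 'Mozilla/5.0', 'Referer': 'https://item.jd.com/'}
    data = requests.get(url, headers=headers).json()
    # One entry per SKU: total reviews, positive reviews, negative reviews
    return {c['SkuId']: (c['CommentCount'], c['GoodCount'], c['PoorCount'])
            for c in data['CommentsCount']}

# fetch_comment_counts([69957954609, 33316347153])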

3. URL for the current price and the pre-discount price

The steps to find it are the same as for the review-count URL above.
(Screenshot 4)
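
A minimal sketch of the price request. The J_ prefix and the p / m response fields come from prices() in the full code below (p is the current price, m the pre-discount list price); whether the endpoint accepts this trimmed-down query string is an assumption — the full code passes more parameters.

import requests

def fetch_prices(sku_ids):
    sku_str = ','.join('J_' + str(i) for i in sku_ids)
    url = 'https://p.3.cn/prices/mgets?type=1&area=1_72_4137_0&skuIds=' + sku_str
    result = {}
    for item in requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).json():
        # item['id'] is 'J_<sku>'; strip the prefix to get the bare SKU id
        result[item['id'][2:]] = (item.get('p'), item.get('m'))
    return result

# fetch_prices([12615065])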

3. Code walkthrough

The crawler consists of four main methods (see the short usage sketch below):
crawl_message() crawls the basic product information from a search page, then calls comments() and prices()
comments() crawls the review statistics (total reviews, average score, positive reviews, default positive reviews, positive rate, follow-up reviews, video reviews, negative reviews, neutral reviews)
prices() crawls the current price and the pre-discount price of each product
store_xsl() saves the crawled data once the crawl is finished (despite the name, it writes a CSV file)
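
In a nutshell, the call order mirrors the __main__ block at the end of the full code:

dog = CrawlDog('机器学习')
dog.crawl_message(1)   # one search page; internally calls comments() and prices()
dog.store_xsl()        # writes data.csv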

4. Complete code

import requests
from lxml import etree
from concurrent import futures
import json

import pandas as pd

class CrawlDog:

    comment_headers = {
        'Referer': 'https://item.jd.com/%s.html' % 12615065,
        'Accept-Charset': 'utf-8',
        'accept-language': 'zh,en-US;q=0.9,en;q=0.8,zh-TW;q=0.7,zh-CN;q=0.6',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/74.0.3729.169 Safari/537.36'

    }

    def __init__(self, keyword):
        """
        初始化
        :param keyword: 搜索的关键词
        """
        self.keyword = keyword
        self.data = pd.DataFrame()

    def crawl_message(self, page):
        """
        Grab the product information from one search page and append it to self.data.
        :param page: page number of the search results page
        """
        url = 'https://search.jd.com/Search?keyword={}&enc=utf-8&page={}&s={}'.format(self.keyword, page, (page-1)*30+1)
        index_headers = {
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,'
                      'application/signed-exchange;v=b3',
            'accept-encoding': 'gzip, deflate, br',
            'Accept-Charset': 'utf-8',
            'accept-language': 'zh,en-US;q=0.9,en;q=0.8,zh-TW;q=0.7,zh-CN;q=0.6',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/74.0.3729.169 Safari/537.36'
        }
        rsp = requests.get(url=url, headers=index_headers).content.decode()
        print(url)
        # print(rsp)
        rsp = etree.HTML(rsp)
        items = rsp.xpath('//li[contains(@class, "gl-item")]')
        ids = []
        for item in items:
            try:
                info = dict()
                p_name = item.xpath('.//div[@class="p-name"]/a/em')
                info['title'] = etree.tostring(p_name[0], method='text', encoding='unicode').replace('\r','').replace('\n','').replace('\t','')
                info['price'] = item.xpath('.//div[@class="p-price"]//i/text()')[0]
                info['shop'] = item.xpath('.//div[@class="p-shopnum"]//a/text()')[0]
                info['icon'] = item.xpath('.//div[@class="p-icons"]//i/text()')
                info['url'] = 'https:' + item.xpath('.//div[@class="p-name"]/a/@href')[0]
                info['item_id'] = info.get('url').split('/')[-1][:-5]
                book_details = item.xpath('.//div[@class="p-bookdetails"]//span/a')
                info['author'] = etree.tostring(book_details[0], method='text', encoding='unicode').replace('\r','').replace('\n','').replace('\t','')
                info['publish_date'] = item.xpath('.//div[@class="p-bookdetails"]//span[@class="p-bi-date"]/text()')
                #info['price_'] = 0
                info['old_price'] = 0
                info['commentCount'] = 0
                info['averageScore'] = 0
                info['goodCount'] = 0
                info['defaultGoodCount'] = 0
                info['goodRate'] = 0
                info['afterCount'] = 0
                info['videoCount'] = 0
                info['poorCount'] = 0
                info['generalCount'] = 0
                ids.append(info['item_id'])
                # DataFrame.append was removed in pandas 2.0; concatenate a one-row frame instead
                self.data = pd.concat([self.data, pd.DataFrame([info])], ignore_index=True)

            # Some of the results are ads, for which some of the fields above are missing
            except IndexError:
                print('incomplete item info, dropped!')
                continue
        print(len(ids))
        self.comments(ids)
        self.prices(ids)

    def comments(self, ids):
        ids = ','.join([str(id) for id in ids])
        url = 'https://club.jd.com/comment/productCommentSummaries.action?my=pinglun&referenceIds={}'.format(ids)

        comments = requests.get(url=url, headers=self.comment_headers).json()
        print(ids)
        print(comments)
        # total reviews, average score, positive reviews, default positive reviews,
        # positive rate, follow-up reviews, video reviews, negative reviews, neutral reviews
        comments_columns = ['commentCount', 'averageScore', 'goodCount', 'defaultGoodCount', 'goodRate',
                            'afterCount', 'videoCount', 'poorCount', 'generalCount']
        for comment in comments["CommentsCount"]:
            comments_data = [comment["CommentCount"], comment["AverageScore"], comment["GoodCount"],
                             comment["DefaultGoodCount"], comment["GoodRate"], comment["AfterCount"],
                             comment["VideoCount"], comment["PoorCount"], comment["GeneralCount"]]
            #print(self.data.loc[self.data.item_id == str(comment['SkuId']), comments_columns])
            self.data.loc[self.data.item_id == str(comment['SkuId']), comments_columns] = comments_data


    def prices(self, ids):
        str_ids = ','.join(['J_'+str(id) for id in ids])
        url = "https://p.3.cn/prices/mgets?ext=11000000&pin=&type=1&area=1_72_4137_0&skuIds=J_%s&pdbp=0&pdtk=&pdpin=&pduid=15229474889041156750382&source=list_pc_front" % str_ids
        prices = requests.get(url, headers=self.comment_headers).json()
        for price in prices:
            #self.data.loc[self.data.item_id == price['id'][2:], 'price_'] = price.get('p')
            self.data.loc[self.data.item_id == price['id'][2:], 'old_price'] = price.get("m")

    def main(self, index_pn):
        """
        Run the crawl.
        :param index_pn: total number of search pages to crawl
        :return:
        """
        # Page numbers passed to crawl_message()
        #il = [i * 2 + 1 for i in range(index_pn)]
        il = [i + 1 for i in range(index_pn)]
        #print(il)
        # Crawl the pages with a small thread pool
        with futures.ThreadPoolExecutor(3) as executor:
            executor.map(self.crawl_message, il)
        #for i in range(index_pn):
            #self.get_index(i+1)

    def store_xsl(self):
        # Despite the name, the crawled data is written out as a CSV file
        self.data.to_csv('data.csv', encoding='utf-8', index=False)

    def get_comments__top_100(self):
        df = pd.read_csv('final_data_attach_comment.csv', encoding='utf-8')

        ids = df[df.isnull().T.any()]['item_id'].tolist()
        print(len(ids))
        index_headers = {
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,'
                      'application/signed-exchange;v=b3',
            'accept-encoding': 'gzip, deflate, br',
            'Accept-Charset': 'utf-8',
            'accept-language': 'zh,en-US;q=0.9,en;q=0.8,zh-TW;q=0.7,zh-CN;q=0.6',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/74.0.3729.169 Safari/537.36'
        }
        i = 0
        for id in ids:
            i = i + 1
            if i > 10: break
            url = "https://club.jd.com/comment/productPageComments.action?productId=%s&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1" % id
            try:
                comments_json = requests.get(url, headers=index_headers).json()
                print(comments_json, ' dd')
                comment_str = ''
                for comment_json in comments_json['comments']:
                    comment_str = comment_str + comment_json['content'] + '|'

                df.loc[df.item_id == id, 'comment'] = comment_str
                print(comment_str)

            except json.decoder.JSONDecodeError:
                print('incomplete item info, dropped!', id)
                continue
            finally:
                df.to_csv('final_data_attach_comment.csv', encoding='utf-8', index=False)
        df.to_csv('final_data_attach_comment.csv', encoding='utf-8', index=False)


if __name__ == '__main__':
    # Crawl 100 search pages; for a quick test pass a small number such as 2
    test = CrawlDog('机器学习')
    test.main(100)
    test.store_xsl()
    #test.get_comments__top_100()

5. Results

(Screenshot 5)
