Scraping the review tags of a JD product with a crawler

How the review tags to be scraped appear on the product page:
[Image 1: review tags on the product page]
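These review tags are loaded dynamically, so they do not appear in the page source. The first step is therefore to find where the data actually comes from.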
Open https://item.jd.com/4957824.html?cpdad=1DLSUE in Chrome and look for the js file that carries the review tag data.
[Image 2: the js file containing the review tag data]
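After opening it:
[Image 3: contents of the js file]
The tag data and the user comments turn out to be stored together in the same response, so a regular expression is needed to pull out just the tags.
On closer inspection, all of the tag data sits inside the first [] in the response.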

re_text = re.findall(r'\[.*?\]', text)[0]
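To see why the non-greedy pattern returns only the first array, here is a minimal sketch on a made-up, shortened response. The callback name `cb` and the keys `tags` and `comments` are hypothetical placeholders, not JD's actual field names.

```python
import re

# Hypothetical, shortened JSONP-style text; the key names are illustrative only.
sample = 'cb({"tags":[{"name":"fast","count":3}],"comments":[{"id":1}]});'

# Non-greedy: stops at the first closing bracket after the first opening one.
first = re.findall(r'\[.*?\]', sample)[0]

# Greedy, for comparison: runs on to the last closing bracket in the text.
greedy = re.findall(r'\[.*\]', sample)[0]

print(first)
print(greedy)
```

Note that this only works while no element of the first array contains a nested array or a literal `]` inside a string; in that case the match would be cut short.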

Once extracted, the data looks like this:

[{"id":"e73391bf8ddda234","name":"就是快","rid":"e73391bf8ddda234","count":1864,"type":0,"canBeFiltered":true,"stand":1},{"id":"c30f2e42283d0fa8","name":"物流很快","rid":"c30f2e42283d0fa8","count":1138,"type":0,"canBeFiltered":true,"stand":1},{"id":"0695d5cc36af5165","name":"货真价实","rid":"0695d5cc36af5165","count":818,"type":0,"canBeFiltered":true,"stand":1},{"id":"a0b0cd33c63522a3","name":"很漂亮","rid":"a0b0cd33c63522a3","count":619,"type":0,"canBeFiltered":true,"stand":1},{"id":"503da7f3b642b615","name":"性价比高","rid":"503da7f3b642b615","count":553,"type":0,"canBeFiltered":true,"stand":1},{"id":"7724d25ad160c100","name":"挺不错","rid":"7724d25ad160c100","count":328,"type":0,"canBeFiltered":true,"stand":1},{"id":"f7bda8c0f0584ea8","name":"外形美观","rid":"f7bda8c0f0584ea8","count":324,"type":0,"canBeFiltered":true,"stand":1},{"id":"bec27e1e615cc63e","name":"使用方便","rid":"bec27e1e615cc63e","count":240,"type":0,"canBeFiltered":true,"stand":1},{"id":"a25281805f626180","name":"手感超好","rid":"a25281805f626180","count":231,"type":0,"canBeFiltered":true,"stand":1},{"id":"236b04f43b63e400","name":"有质感","rid":"236b04f43b63e400","count":194,"type":0,"canBeFiltered":true,"stand":1},{"id":"6b254ad2f545dbbe","name":"自动关机","rid":"6b254ad2f545dbbe","count":26,"type":0,"canBeFiltered":true,"stand":2},{"id":"f0003219a7a77bab","name":"反应迟钝","rid":"f0003219a7a77bab","count":14,"type":0,"canBeFiltered":true,"stand":2},{"id":"028139332b37c299","name":"反应慢","rid":"028139332b37c299","count":6,"type":0,"canBeFiltered":true,"stand":2}]

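This looks like a list of JSON objects, but at this point it is not a parsed structure: re_text is still a plain string. Printing its length gives 1515, which is the number of characters in the string, not the number of tags (13).
So re_text has to be parsed with json.loads before individual fields can be extracted.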
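The difference between the raw string and the parsed result can be sketched with a two-item excerpt of the data above (shortened to the `name` and `count` fields for illustration):

```python
import json

# Two entries taken from the extracted data, reduced to two fields each.
re_text = '[{"name":"就是快","count":1864},{"name":"物流很快","count":1138}]'

# Before parsing: a str, so len() counts characters.
print(type(re_text).__name__, len(re_text))

# After parsing: a list of dicts, so len() counts tags.
tags = json.loads(re_text)
print(type(tags).__name__, len(tags))

# Fields are now reachable by key.
print(tags[0]["name"], tags[0]["count"])
```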

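import json
import re
import urllib.request

# Comment endpoint located in the browser; the response is JSONP text
url = 'https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98vv16819&productId=4957824&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1'

# Fetch the response and decode it as GBK
with urllib.request.urlopen(url) as response:
    text = response.read().decode('gbk')

# The first [...] in the response holds the review tags
re_text = re.findall(r'\[.*?\]', text)[0]
print(len(re_text))  # length of the string, not the number of tags

# Parse the JSON text into a list of dicts
tags = json.loads(re_text)
print(tags)
for tag in tags:
    print(tag.get('name') + ' ' + str(tag.get('count')))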
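The long request URL can also be assembled from its query parameters, which makes it easier to see what is being sent. The sketch below reuses the parameter names and values from the URL above; the helper name `build_comment_url` is hypothetical, and whether the endpoint accepts other product IDs or page numbers in the same way is an assumption not verified here.

```python
from urllib.parse import urlencode

BASE = 'https://club.jd.com/comment/productPageComments.action'


def build_comment_url(product_id, page=0, page_size=10):
    """Build the request URL (hypothetical helper, parameters copied from the article's URL)."""
    params = {
        'callback': 'fetchJSON_comment98vv16819',
        'productId': product_id,
        'score': 0,
        'sortType': 5,
        'page': page,
        'pageSize': page_size,
        'isShadowSku': 0,
        'fold': 1,
    }
    # urlencode keeps the dict's insertion order and joins pairs with '&'.
    return BASE + '?' + urlencode(params)


url = build_comment_url(4957824)
print(url)
```

With the defaults, this reproduces the URL used in the script character for character.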

Output (the parsed list and the name/count lines; the length printed first, 1515, is not repeated here):

[{'canBeFiltered': True, 'id': 'e73391bf8ddda234', 'name': '就是快', 'stand': 1, 'type': 0, 'count': 1864, 'rid': 'e73391bf8ddda234'}, {'canBeFiltered': True, 'id': 'c30f2e42283d0fa8', 'name': '物流很快', 'stand': 1, 'type': 0, 'count': 1138, 'rid': 'c30f2e42283d0fa8'}, {'canBeFiltered': True, 'id': '0695d5cc36af5165', 'name': '货真价实', 'stand': 1, 'type': 0, 'count': 818, 'rid': '0695d5cc36af5165'}, {'canBeFiltered': True, 'id': 'a0b0cd33c63522a3', 'name': '很漂亮', 'stand': 1, 'type': 0, 'count': 619, 'rid': 'a0b0cd33c63522a3'}, {'canBeFiltered': True, 'id': '503da7f3b642b615', 'name': '性价比高', 'stand': 1, 'type': 0, 'count': 553, 'rid': '503da7f3b642b615'}, {'canBeFiltered': True, 'id': '7724d25ad160c100', 'name': '挺不错', 'stand': 1, 'type': 0, 'count': 328, 'rid': '7724d25ad160c100'}, {'canBeFiltered': True, 'id': 'f7bda8c0f0584ea8', 'name': '外形美观', 'stand': 1, 'type': 0, 'count': 324, 'rid': 'f7bda8c0f0584ea8'}, {'canBeFiltered': True, 'id': 'bec27e1e615cc63e', 'name': '使用方便', 'stand': 1, 'type': 0, 'count': 240, 'rid': 'bec27e1e615cc63e'}, {'canBeFiltered': True, 'id': 'a25281805f626180', 'name': '手感超好', 'stand': 1, 'type': 0, 'count': 231, 'rid': 'a25281805f626180'}, {'canBeFiltered': True, 'id': '236b04f43b63e400', 'name': '有质感', 'stand': 1, 'type': 0, 'count': 194, 'rid': '236b04f43b63e400'}, {'canBeFiltered': True, 'id': '6b254ad2f545dbbe', 'name': '自动关机', 'stand': 2, 'type': 0, 'count': 26, 'rid': '6b254ad2f545dbbe'}, {'canBeFiltered': True, 'id': 'f0003219a7a77bab', 'name': '反应迟钝', 'stand': 2, 'type': 0, 'count': 14, 'rid': 'f0003219a7a77bab'}, {'canBeFiltered': True, 'id': '028139332b37c299', 'name': '反应慢', 'stand': 2, 'type': 0, 'count': 6, 'rid': '028139332b37c299'}]
就是快 1864
物流很快 1138
货真价实 818
很漂亮 619
性价比高 553
挺不错 328
外形美观 324
使用方便 240
手感超好 231
有质感 194
自动关机 26
反应迟钝 14
反应慢 6
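Each tag also carries a `stand` field, which is 1 for most tags and 2 for the last three. The article does not say what it means; in this sample the value 2 happens to accompany the unfavorable tags, but that reading is only an inference. A minimal sketch of grouping the parsed tags by that field, using four entries from the data above and a hypothetical helper `group_by_stand`:

```python
from collections import defaultdict

# Four entries from the parsed data, reduced to the fields used here.
tags = [
    {"name": "就是快", "count": 1864, "stand": 1},
    {"name": "物流很快", "count": 1138, "stand": 1},
    {"name": "自动关机", "count": 26, "stand": 2},
    {"name": "反应迟钝", "count": 14, "stand": 2},
]


def group_by_stand(items):
    """Map each stand value to a list of (name, count) pairs (hypothetical helper)."""
    groups = defaultdict(list)
    for item in items:
        groups[item.get("stand")].append((item.get("name"), item.get("count")))
    return dict(groups)


groups = group_by_stand(tags)
for stand, pairs in sorted(groups.items()):
    print(stand, pairs)
```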
