Web Scraping Lesson 7: Scraping Taobao Product Reviews with Python

First, we look at the reviews for a Jack & Jones down jacket:
https://detail.tmall.com/item.htm?spm=a1z10.5-b-s.w4011-14620146553.153.211c5897owzGUF&id=575617865437&rn=951992109b473bd4b71f1783f61d163b&abbucket=4
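Next we need to find where this jacket's review data lives.
Start by opening the page source.
The page source contains nothing about the product reviews, which suggests the reviews are loaded separately after the page itself.
So we look in the browser's Inspect panel (developer tools) instead.
There, the reviews finally turn up in a JSON response.
The next step is to send a request to the URL that serves this data.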

  • Send a request to the server
    The request URL we found is very long, but most of it is unnecessary. This much is enough: 'https://rate.tmall.com/list_detail_rate.htm?itemId=575617865437&spuId=1038280188&sellerId=305358018&order=3&currentPage=1&append=0&content=1'
    Paste this URL into the browser and you will see the review data we need.
    First, find the required parameters in the request headers: cookie and user-agent.
    Then we can write the code.
import requests

headers={
        "cookie": "miid=292998242037415425; t=a210415a1655c0232f82eb7b3a6104df; UM_distinctid=166ceb653b3579-076dd76f89438e-b79183d-100200-166ceb653b4311; cna=bbxhFGBja3gCAW8e7cIps2En; thw=cn; hng=CN%7Czh-CN%7CCNY%7C156; tracknick=%5Cu68A6%5Cu4E00%5Cu6837%5Cu81EA%5Cu7531101; lgc=%5Cu68A6%5Cu4E00%5Cu6837%5Cu81EA%5Cu7531101; tg=0; ubn=p; ucn=center; enc=UIB9oC%2F4GcT7MT%2BeTYYspmIzgCQGQVgVtIdOafyHPB%2FddpEQuoTVRFhD3T2%2B4ZTQppw07b1yUBPdsBcmiZRl0Q%3D%3D; x=e%3D1%26p%3D*%26s%3D0%26c%3D0%26f%3D0%26g%3D0%26t%3D0; mt=ci=34_1&np=; _m_h5_tk=ab92273cd1f1994a79de75803c72eedd_1542899769491; _m_h5_tk_enc=9af4f2d58367798bd6fd9fc571623a03; v=0; cookie2=1e3e736fa27ece822d8e0584a52fc0e2; _tb_token_=e3e5b95fa56e5; unb=2193645594; sg=142; _l_g_=Ug%3D%3D; skt=6dfe74172437c7ae; cookie1=AVS2RlAz2mIjdZAY7fy%2BfYtP4kUpRn3V%2FbBr8i8CU%2BA%3D; csg=934ced39; uc3=vt3=F8dByR6oLTybe7NAPL0%3D&id2=UUkHLXG%2BJ1%2FZ%2BQ%3D%3D&nk2=oHTbYBpzsOUZCkBrgQ%3D%3D&lg2=VFC%2FuZ9ayeYq2g%3D%3D; existShop=MTU0Mjg5MjM3NQ%3D%3D; _cc_=WqG3DMC9EA%3D%3D; dnk=%5Cu68A6%5Cu4E00%5Cu6837%5Cu81EA%5Cu7531101; _nk_=%5Cu68A6%5Cu4E00%5Cu6837%5Cu81EA%5Cu7531101; cookie17=UUkHLXG%2BJ1%2FZ%2BQ%3D%3D; swfstore=183268; uc1=cookie16=VFC%2FuZ9az08KUQ56dCrZDlbNdA%3D%3D&cookie21=W5iHLLyFe3xm&cookie15=UIHiLt3xD8xYTw%3D%3D&existShop=false&pas=0&cookie14=UoTYNOeMOTy2Mw%3D%3D&cart_m=0&tag=8&lng=zh_CN; isg=BJ6eJxdPST-Vf513tvmQfSZu7zQg92O0wyXss0gnR-Hcaz9FsO_g6NBJZxdC01rx",
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
        }

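def get_page(url):
    """Download one page of reviews; return its text, or None on failure."""
    try:
        # A timeout keeps the script from hanging on a stalled connection.
        r = requests.get(url, headers=headers, timeout=10)
        r.raise_for_status()  # raise on 4xx/5xx responses
        r.encoding = 'utf-8'
        return r.text
    except requests.RequestException as e:
        print(e)
        return None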
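
The shortened review URL is just a base address plus a handful of query parameters. As a minimal sketch, it could be assembled with the standard library's urllib.parse.urlencode instead of being typed out by hand. The function name build_rate_url and its argument names are made up for illustration and are not part of the original code; the parameter values are the ones from the URL in this lesson.

```python
from urllib.parse import urlencode

BASE_URL = 'https://rate.tmall.com/list_detail_rate.htm'


def build_rate_url(item_id, spu_id, seller_id, page):
    """Build the review URL for one page (hypothetical helper)."""
    params = {
        'itemId': item_id,
        'spuId': spu_id,
        'sellerId': seller_id,
        'order': 3,
        'currentPage': page,  # the only value that changes between pages
        'append': 0,
        'content': 1,
    }
    # urlencode keeps the dict's insertion order and escapes the values.
    return BASE_URL + '?' + urlencode(params)


print(build_rate_url(575617865437, 1038280188, 305358018, 1))
```

Written this way, it is easier to see that only currentPage varies from one request to the next.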
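  • Extract the information
    We use a regular expression to extract the data. Only the review text is needed; the other fields can be ignored.
    The review text is stored in the rateContent field, so the extraction code is as follows.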
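import re

def get_info(page):
    """Yield the text of every review found in the response body."""
    if page is None:  # the download failed, so there is nothing to extract
        return
    # Non-greedy match of the value of each "rateContent" field.
    for item in re.findall(r'"rateContent":"(.*?)"', page, re.S):
        yield item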
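
One caveat with the non-greedy pattern: it stops at the first double quote, so a review that itself contains an escaped quote (\") would be cut short. The sketch below shows one possible way around this, assuming the field values follow normal JSON string escaping. The name extract_comments and the sample string are invented for illustration; the sample is not real Tmall output.

```python
import json
import re

# Match the whole quoted JSON string after "rateContent":, allowing
# backslash escapes such as \" inside it.
RATE_CONTENT = re.compile(r'"rateContent":("(?:[^"\\]|\\.)*")')


def extract_comments(page):
    """Return every rateContent value, with JSON escapes decoded."""
    # json.loads turns the quoted JSON string into a plain Python str.
    return [json.loads(raw) for raw in RATE_CONTENT.findall(page)]


# Hypothetical response fragment, used only to exercise the pattern.
sample = r'{"rateList":[{"rateContent":"He said \"warm\" jacket","id":1},{"rateContent":"fits well"}]}'
for comment in extract_comments(sample):
    print(comment)
```

Decoding with json.loads also converts any \uXXXX escapes into readable characters, which the plain regex leaves untouched.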
  • Save the data
    The extracted data is plain text, so there is no need to save it to Excel; a txt file is enough.
def save_data(datas):
    # Append mode, so every page adds to the same file.
    # The with block closes the file automatically on exit.
    with open("E:\\淘宝评论.txt","a",encoding="utf-8") as f:
        for data in datas:
            f.write(data)
            f.write('\n')

  • Main program
# Build the URLs for review pages 1 to 10.
urls=['https://rate.tmall.com/list_detail_rate.htm?itemId=575617865437&spuId=1038280188&sellerId=305358018&order=3&currentPage={}&append=0&content=1'.format(i) for i in range(1,11)]
for url in urls:
    page=get_page(url)
    print(url)
    datas=get_info(page)
    save_data(datas)
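
The loop above sends its ten requests back to back and passes every result straight on, even when a download failed. As a hedged sketch of a more defensive variant, the function below skips failed pages and pauses between requests. The name crawl_pages and the one-second default delay are assumptions for illustration; fetch and extract stand in for get_page and get_info.

```python
import time


def crawl_pages(urls, fetch, extract, delay=1.0):
    """Collect comments from each URL, skipping pages that failed to download."""
    comments = []
    for url in urls:
        page = fetch(url)
        if page is None:  # fetch signals failure by returning None
            continue
        comments.extend(extract(page))
        time.sleep(delay)  # pause between requests to go easy on the server
    return comments


# Stand-in fetch/extract functions so the sketch runs without network access.
fake_pages = {'u1': 'a,b', 'u2': None, 'u3': 'c'}
print(crawl_pages(['u1', 'u2', 'u3'], fake_pages.get,
                  lambda page: page.split(','), delay=0))
```

With the functions from this lesson, the call would look like save_data(crawl_pages(urls, get_page, get_info)).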

Full code:

import requests
import re


basic_url='https://rate.tmall.com/list_detail_rate.htm?itemId=575617865437&spuId=1038280188&sellerId=305358018&order=3&currentPage={}&append=0&content=1'

headers={
        "cookie": "miid=292998242037415425; t=a210415a1655c0232f82eb7b3a6104df; UM_distinctid=166ceb653b3579-076dd76f89438e-b79183d-100200-166ceb653b4311; cna=bbxhFGBja3gCAW8e7cIps2En; thw=cn; hng=CN%7Czh-CN%7CCNY%7C156; tracknick=%5Cu68A6%5Cu4E00%5Cu6837%5Cu81EA%5Cu7531101; lgc=%5Cu68A6%5Cu4E00%5Cu6837%5Cu81EA%5Cu7531101; tg=0; ubn=p; ucn=center; enc=UIB9oC%2F4GcT7MT%2BeTYYspmIzgCQGQVgVtIdOafyHPB%2FddpEQuoTVRFhD3T2%2B4ZTQppw07b1yUBPdsBcmiZRl0Q%3D%3D; x=e%3D1%26p%3D*%26s%3D0%26c%3D0%26f%3D0%26g%3D0%26t%3D0; mt=ci=34_1&np=; _m_h5_tk=ab92273cd1f1994a79de75803c72eedd_1542899769491; _m_h5_tk_enc=9af4f2d58367798bd6fd9fc571623a03; v=0; cookie2=1e3e736fa27ece822d8e0584a52fc0e2; _tb_token_=e3e5b95fa56e5; unb=2193645594; sg=142; _l_g_=Ug%3D%3D; skt=6dfe74172437c7ae; cookie1=AVS2RlAz2mIjdZAY7fy%2BfYtP4kUpRn3V%2FbBr8i8CU%2BA%3D; csg=934ced39; uc3=vt3=F8dByR6oLTybe7NAPL0%3D&id2=UUkHLXG%2BJ1%2FZ%2BQ%3D%3D&nk2=oHTbYBpzsOUZCkBrgQ%3D%3D&lg2=VFC%2FuZ9ayeYq2g%3D%3D; existShop=MTU0Mjg5MjM3NQ%3D%3D; _cc_=WqG3DMC9EA%3D%3D; dnk=%5Cu68A6%5Cu4E00%5Cu6837%5Cu81EA%5Cu7531101; _nk_=%5Cu68A6%5Cu4E00%5Cu6837%5Cu81EA%5Cu7531101; cookie17=UUkHLXG%2BJ1%2FZ%2BQ%3D%3D; swfstore=183268; uc1=cookie16=VFC%2FuZ9az08KUQ56dCrZDlbNdA%3D%3D&cookie21=W5iHLLyFe3xm&cookie15=UIHiLt3xD8xYTw%3D%3D&existShop=false&pas=0&cookie14=UoTYNOeMOTy2Mw%3D%3D&cart_m=0&tag=8&lng=zh_CN; isg=BJ6eJxdPST-Vf513tvmQfSZu7zQg92O0wyXss0gnR-Hcaz9FsO_g6NBJZxdC01rx",
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
        }

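def get_page(url):
    """Download one page of reviews; return its text, or None on failure."""
    try:
        # A timeout keeps the script from hanging on a stalled connection.
        r = requests.get(url, headers=headers, timeout=10)
        r.raise_for_status()  # raise on 4xx/5xx responses
        r.encoding = 'utf-8'
        return r.text
    except requests.RequestException as e:
        print(e)
        return None
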
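def get_info(page):
    """Yield the text of every review found in the response body."""
    if page is None:  # the download failed, so there is nothing to extract
        return
    # Non-greedy match of the value of each "rateContent" field.
    for item in re.findall(r'"rateContent":"(.*?)"', page, re.S):
        yield item
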

def save_data(datas):
    # Append mode, so every page adds to the same file.
    # The with block closes the file automatically on exit.
    with open("E:\\爬虫\\@爬虫教程\\数据\\淘宝评论.txt","a",encoding="utf-8") as f:
        for data in datas:
            f.write(data)
            f.write('\n')

urls=['https://rate.tmall.com/list_detail_rate.htm?itemId=575617865437&spuId=1038280188&sellerId=305358018&order=3&currentPage={}&append=0&content=1'.format(i) for i in range(1,11)]
for url in urls:
    page=get_page(url)
    print(url)
    datas=get_info(page)
    save_data(datas)

Results:
[Figure: screenshot of the results]
