Python Web Scraping: The "Targeted Taobao Product Information Crawler" Example

Contents

  • The "Targeted Taobao Product Information Crawler" Example
    • Functional Description
    • Program Structure Design
    • Code Implementation

The "Targeted Taobao Product Information Crawler" Example

Functional Description

Goal: fetch Taobao search result pages and extract the product titles and prices from them.
Understanding: Taobao's search interface, and how to handle pagination.
Technical approach: requests and re.
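
A note on the pagination point above: Taobao's search URL carries the query in the q parameter, and each result page holds 44 items, with the s parameter giving the offset of the first item on the page. This 44-per-page offset scheme is what the crawler below relies on; treat it as observed behavior rather than a documented API. A minimal sketch of how the page URLs compose (searchUrls is a hypothetical helper, not part of the final program):

# Hypothetical helper: build search URLs for the first `depth` result pages,
# assuming 44 items per page addressed through the `s` offset parameter.
def searchUrls(goods, depth):
    start_url = "https://s.taobao.com/search?q=" + goods
    return [start_url + "&s=" + str(44 * i) for i in range(depth)]

print(searchUrls("连衣裙", 2))
# ['https://s.taobao.com/search?q=连衣裙&s=0', 'https://s.taobao.com/search?q=连衣裙&s=44']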

Program Structure Design

Step 1: submit a product search request and fetch the result pages in a loop.
Step 2: extract the product titles and prices from each page.
Step 3: print the collected information to the screen.

Code Implementation

Taobao's robots protocol disallows crawling the site directly, so you first need to log in through a browser and capture the request headers (including your session cookie) to send with each request.
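
If you want to check the robots restriction programmatically, the standard library's urllib.robotparser can read the file for you. A quick optional check (not part of the original walkthrough):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.taobao.com/robots.txt")
rp.read()
# Taobao's robots.txt disallows anonymous crawlers, so this should print False.
print(rp.can_fetch("*", "https://s.taobao.com/search?q=test"))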
Step 1: log in to Taobao, open a search results page, and press F12 to open the developer tools.
Step 2: switch to the Network tab, press Ctrl+R to refresh, find the request near the top whose name starts with search?, and right-click it.
[Figure 1]
Step 3: choose Copy > Copy as cURL (bash).
Step 4: convert it: paste the content copied in the previous step into the curl command window of a curl-to-Python converter.
[Figure 2]
Step 5: copy the headers content on the right, save it in a variable named headers in the program, and pass it as a parameter to requests.get(url, headers=headers).
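
Before running the full crawler, it is worth sanity-checking the captured headers: send a single request and confirm you get a 200 response whose body contains the "view_price" fields that the parser below looks for (an expired cookie typically gets you a login page instead). A minimal check, with the headers dict abbreviated:

import requests

headers = {
    'user-agent': 'Mozilla/5.0 ...',  # abbreviated; use the full dict from step 5
    'cookie': '...',                  # your own logged-in session cookie
}
r = requests.get("https://s.taobao.com/search?q=连衣裙", headers=headers)
print(r.status_code)             # expect 200
print('"view_price"' in r.text)  # expect True while the cookie is still valid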
The code:

# Targeted Taobao product information crawler
import re
import requests
def getHTMLText(url):  # fetch the page content from the network
    try:
        headers = {
    'authority': 's.taobao.com',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    'sec-fetch-user': '?1',
     'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'navigate',
     'referer': 'https://s.taobao.com/search?q=lianyiq&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cookie': 'thw=cn; UM_distinctid=1720211afb513a-0b07283cc9ace7-376b4502-100200-1720211afb617b; enc=rYjgSgMyYkg%2FWHRnkhBKczSuTdnmKNicHZCxkPESCfbhLRolDnVeHRnbdLgMUyHYA5%2Fvp9b6FITVmMBkYTpCQw%3D%3D; hng=CN%7Czh-CN%7CCNY%7C156; cna=Aj2tEmM7FXsCAd7SFZ7S6J2H; miid=44504151680266511; __guid=154677242.3002398488893500400.1589181913539.9639; t=7e6e4712321c707a754fc6568421c9b2; _m_h5_tk=8377efe03ede70801f1356259b646123_1589211624015; _m_h5_tk_enc=2d19bca4701c28e42fa77085806ce2e2; cookie2=1678c735d9d63323283551e33bf578d5; v=0; _tb_token_=e6306b649d53b; alitrackid=www.taobao.com; lastalitrackid=www.taobao.com; _samesite_flag_=true; sgcookie=E%2FWMJmPt9MNtf3%2Bry9c2P; unb=4108961318; uc3=id2=Vy0T7fP3FE8z2A%3D%3D&lg2=WqG3DMC9VAQiUQ%3D%3D&nk2=F5RHpCj6joaWIOg%3D&vt3=F8dBxGXFrgwrDIwhRj0%3D; csg=189c426b; lgc=tb253414682; cookie17=Vy0T7fP3FE8z2A%3D%3D; dnk=tb253414682; skt=8bcd7b231758f963; existShop=MTU4OTM3ODA3OA%3D%3D; uc4=id4=0%40VXqdHlRhUtsIpwQSgkmFlck8ep4m&nk4=0%40FY4MthZ8rXYbFhGt1m4DD7eA6Nemhg%3D%3D; tracknick=tb253414682; _cc_=W5iHLLyFfA%3D%3D; _l_g_=Ug%3D%3D; sg=28e; _nk_=tb253414682; cookie1=UIZs8e27JotrvGDNmcz3ohsrN8Jj6xEX6DshhvBtiN8%3D; tfstk=cxhhBI4vgvyBDnHIlwNIr5ESuRGhaNbUA-eZ_XwjuTQZk9Ga8s4dQt5kcMZZHpB5.; mt=ci=89_1; uc1=cookie14=UoTUM2M264i6GA%3D%3D&cookie15=Vq8l%2BKCLz3%2F65A%3D%3D&pas=0&cookie21=VFC%2FuZ9ainBZ&existShop=false&cookie16=UtASsssmPlP%2Ff1IHDsDaPRu%2BPw%3D%3D; JSESSIONID=3F6CDD6929AF9A0B3703ABC5E8E83DE2; monitor_count=9; l=eBxygvKRQ3TIO3fLBOfwourza77OsIRAXuPzaNbMiT5P_S1p5BAPWZbdRJ89CnhVh64WR3rEQAfvBeYBqIv4n5U62j-la1Dmn; isg=BHd3G57G5fIUk2F6YAvC2ajYBmvBPEueNmQ3fckknsateJe60Q0z7utaX9gm0yMW',
}
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        print("Fetch failed")
        return ""
def parsePage(ilt, html):  # extract prices and titles from the page
    try:
        plt = re.findall(r'\"view_price\":\"\d+\.\d*\"', html)
        tlt = re.findall(r'\"raw_title\":\".*?\"', html)
        for i in range(len(plt)):
            price = float(plt[i].split('\"')[3])  # index 3 is the value between the second pair of quotes
            title = tlt[i].split('\"')[3]
            ilt.append([price, title])  # price first, matching the column order printed below
    except:
        print("Parse error")
def printGoodsList(ilt, num):  # print the results
    tplt = "{0:^6}\t{1:^10}\t{2:{3}^20}"  # pad the title column with full-width spaces so Chinese text aligns
    print(tplt.format("No.", "Price", "Product Title", chr(12288)))
    count = 0
    for g in ilt:
        count += 1
        if count <= num:
            print(tplt.format(count, g[0], g[1], chr(12288)))
def main():
    goods = '连衣裙'  # search keyword ("dress")
    depth = 1  # number of result pages to crawl
    start_url = "https://s.taobao.com/search?q=" + goods
    infolist = []
    num = 200
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44 * i)  # s is the item offset; each page holds 44 items
            html = getHTMLText(url)
            parsePage(infolist, html)
        except:
            continue
    printGoodsList(infolist, num)
main()
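
To see what the two regular expressions in parsePage actually match, here is a self-contained demo on a made-up fragment imitating the JSON that Taobao embeds in its search pages (only the two fields the crawler uses are mimicked):

import re

html = '{"raw_title":"夏季雪纺连衣裙","view_price":"128.00",' \
       '"raw_title":"碎花长裙","view_price":"89.90"}'
plt = re.findall(r'\"view_price\":\"\d+\.\d*\"', html)
tlt = re.findall(r'\"raw_title\":\".*?\"', html)
print(plt)  # ['"view_price":"128.00"', '"view_price":"89.90"']
print(tlt)  # ['"raw_title":"夏季雪纺连衣裙"', '"raw_title":"碎花长裙"']
# Splitting on the quote character leaves the value at index 3:
print(plt[0].split('\"')[3])  # 128.00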

Sample output (partial data):
[Figure 3]
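
A note on the formatting trick in printGoodsList: chr(12288) is the full-width (ideographic) space. Chinese characters are wider than the ASCII space that str.format pads with by default, so padding the title column with chr(12288) keeps rows of mixed-width text roughly aligned:

tplt = "{0:^6}\t{1:^10}\t{2:{3}^20}"
# Pad the title column with the full-width space so Chinese titles line up.
print(tplt.format(1, 128.0, "夏季雪纺连衣裙", chr(12288)))
print(tplt.format(2, 89.9, "碎花长裙", chr(12288)))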
