Preface: by modifying the request headers, the crawler can skip the login page and fetch the results successfully.
Goal: fetch Taobao search result pages and extract the product names and prices.
Technical route: Requests + re (regular expressions)
Search URL: https://s.taobao.com/search?q=篮球
Pagination: page 2 is https://s.taobao.com/search?q=篮球&s=44
            page 3 is https://s.taobao.com/search?q=篮球&s=88
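The s parameter is the item offset: each result page holds 44 items, so page n starts at s = 44*(n-1). A minimal sketch of building a paged URL from this rule (the helper name build_page_url is my own):

def build_page_url(keyword, page):
    # 44 items per result page: page 1 -> s=0, page 2 -> s=44, page 3 -> s=88
    base = "https://s.taobao.com/search?q=" + keyword
    offset = 44 * (page - 1)
    return base if offset == 0 else base + "&s=" + str(offset)

# build_page_url("篮球", 3)  ->  "https://s.taobao.com/search?q=篮球&s=88"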
Step 1: submit the search request and fetch the result pages in a loop
Step 2: for each page, extract the product name and price information
Step 3: print the extracted information to the screen
Note: crawling Taobao directly only returns the login page, so a "fake login" is needed: capture the headers from a logged-in browser session and pass them to requests.get(url, headers=header). How to obtain them:
Detailed steps (using Google Chrome as an example):
1. Log in to Taobao, open a search results page, and press F12.
2. Select the Network tab, refresh the page, find the topmost entry starting with search?, and right-click it.
3. Choose Copy > Copy as cURL (bash).
4. Open https://curl.trillworks.com/ and paste what you copied into the "curl command" box.
5. Copy the headers from the generated code on the right, save them in a variable header in your program, and pass it as requests.get(url, headers=header); a minimal sketch of this call follows.
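For illustration, a minimal sketch of step 5, assuming only the user-agent and cookie matter; the cookie value is a placeholder that must be replaced with the one copied from your own browser:

import requests

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'cookie': '',   # paste your own cookie here
}
r = requests.get("https://s.taobao.com/search?q=篮球", headers=header)
print(r.status_code)   # should print 200 if the request succeeded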
# Taobao product price comparison
import requests
import re

def getHtmlText(url):
    # Fetch one page with the browser headers captured above; return "" on failure.
    try:
        header = {
            'authority': 's.taobao.com',
            'pragma': 'no-cache',
            'cache-control': 'no-cache',
            'upgrade-insecure-requests': '1',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'referer': '',   # referer value hidden; paste your own
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'zh-CN,zh;q=0.9',
            'cookie': '',    # cookie value hidden; paste your own
        }
        r = requests.get(url, headers=header)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        print("Fetch failed")
        return ""
def parsePage(ilist, html):
    # Pull prices and titles out of the page source with regular expressions.
    try:
        plt = re.findall(r'\"view_price\":\"\d+\.\d*\"', html)   # e.g. "view_price":"128.00"
        tlt = re.findall(r'\"raw_title\":\".*?\"', html)         # e.g. "raw_title":"..."
        # print(tlt)
        print(len(plt))
        for i in range(len(plt)):
            price = eval(plt[i].split('\"')[3])   # the number between the quotes
            title = tlt[i].split('\"')[3]
            ilist.append([title, price])
        # print(ilist)
    except:
        print("Parse error")
def printGoodsList(ilist, num):
    # Print the first num items as a simple aligned table.
    print("=====================================================================================================")
    tplt = "{0:<3}\t{1:<30}\t{2:>6}"
    print(tplt.format("No.", "Product name", "Price"))
    count = 0
    for g in ilist:
        count += 1
        if count <= num:
            print(tplt.format(count, g[0], g[1]))
    print("=====================================================================================================")
def main():
    goods = "篮球"          # search keyword (basketball)
    depth = 1               # number of result pages to crawl
    start_url = "https://s.taobao.com/search?q=" + goods
    infoList = []
    num = 20                # number of items to print
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44*i)   # page offset: 44 items per page
            html = getHtmlText(url)
            parsePage(infoList, html)
        except:
            continue
    printGoodsList(infoList, num)

main()
Output: