python小白最近看老师课程,发现淘宝网页升级了,用以前的代码爬不了,查找了很多资料后发现了一些缺陷,在此分享给大家
老师的代码大体上没问题,就是需要加一个headers
这里我们可以用https://curl.trillworks.com/#这个来获取headers
附上代码:
import requests
import re
def getHTMLText(url):
try:
#在这里添加headers
header = {
'authority': 's.taobao.com',
'cache-control': 'max-age=0',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.72 Safari/537.36 Edg/89.0.774.45',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'referer': '*******',
'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6,ja;q=0.5',
'cookie': '*******',
}#这里我把cookie和referer改了,大家可以直接按照操作获取
r = requests.get(url,headers = header, timeout = 30) #这里别忘了把headers加进去
r.raise_for_status()
r.encoding = r.apparent_encoding
return r.text
except:
return ""
def parsePage(ilt, html):
try:
plt = re.findall(r'"view_price":"[\d.]*"', html) #其实老师这里加了'\'来进行转义,但其实不加也行,加了更好
tlt = re.findall(r'"raw_title":".*?"', html)
for i in range(len(plt)):
price = eval(plt[i].split(':')[1])
title = eval(tlt[i].split(':')[1])
ilt.append([price, title])
except:
print("")
def printGoodsList(ilt):
tplt = "{:4}\t{:8}\t{:16}"
print(tplt.format("序号", "价格", "商品名称"))
count = 0
for g in ilt:
count = count + 1
print(tplt.format(count, g[0], g[1]))
def main():
goods = '书包'
depth = 3 #爬取深度
start_url = 'https://s.taobao.com/search?q=' + goods
infoList = []
for i in range(depth):
try:
url = start_url + '&s=' + str(44 * i)
html = getHTMLText(url)
parsePage(infoList, html)
except:
continue
printGoodsList(infoList)
main()
其实这篇文章和我上次那个爬取大学排名文章的处理方法很类似,只不过这个更麻烦,最后还是提醒以下,不要不加限制的爬取这个网页
最后附上爬取大学排名文章的链接:https://blog.csdn.net/wzy1414/article/details/114599525