I have recently been following Prof. Song's web-crawler course on MOOC, but the code as written in the lectures no longer works: Taobao now redirects anonymous requests to its login page. The problem is how to "bypass" the Taobao login page so the product information can be fetched correctly. I found the answer through a Baidu search and am recording it here.
Note: you must first log in to Taobao with your own account and copy out the resulting cookie; only then is the cookie below valid. Paste it into the 'cookie' field:

kv = {'cookie':'', 'user-agent':'Mozilla/5.0'}
r = requests.get(url, headers=kv, timeout=30)
(1) First, log in to Taobao using the Google Chrome browser.
(2) Then press F12 to open the developer tools.
(3) Click through as shown in the screenshot (in the Network panel, select a request to the Taobao search page and copy the values of the cookie and user-agent request headers).
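The value copied from the Cookie request header is one long semicolon-separated string. As a side note, here is a minimal sketch of splitting that string into a dict, in case you prefer passing it via requests' `cookies=` parameter rather than as a raw header. The helper name `parse_cookie_string` and the sample string are my own inventions for illustration, not part of the course code:

```python
def parse_cookie_string(raw):
    # Turn a raw "Cookie" header string (as copied from Chrome's
    # developer tools) into a dict usable with requests' cookies= kwarg.
    cookies = {}
    for pair in raw.split(';'):
        pair = pair.strip()
        if not pair:
            continue
        # Split only on the first '=' so values containing '=' survive
        name, _, value = pair.partition('=')
        cookies[name] = value
    return cookies

# Hypothetical sample string, for illustration only
sample = "thw=cn; t=abc123; _tb_token_=xyz"
print(parse_cookie_string(sample))
```

You could then call `requests.get(url, headers={'user-agent': 'Mozilla/5.0'}, cookies=parse_cookie_string(sample))`; both forms send the same information.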
import requests
import re

def getHTMLText(url):
    # Paste the cookie copied from your logged-in browser session here
    kv = {'cookie':'', 'user-agent':'Mozilla/5.0'}
    try:
        r = requests.get(url, headers=kv, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def parsePage(ilt, html):
    try:
        # Prices and titles are embedded as JSON-like data in the page source
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        for i in range(len(plt)):
            # split(':')[1] keeps the quoted value; eval() strips the quotes
            price = eval(plt[i].split(':')[1])
            title = eval(tlt[i].split(':')[1])
            ilt.append([price, title])
    except:
        print("")

def printGoodsList(ilt):
    tplt = "\t{:^4}{:^8}\t{:^16}"
    count = 0
    # Column headers: No., Price, Title
    print(tplt.format("序号", '价格', '名称'))
    for i in ilt:
        count = count + 1
        print(tplt.format(count, i[0], i[1]))

def main():
    goods = '书包'   # search keyword ("backpack")
    depth = 2        # number of result pages to crawl
    infoilt = []
    start_url = 'https://s.taobao.com/search?q=' + goods
    for i in range(depth):
        try:
            # Taobao paginates 44 items per page via the s= parameter
            url = start_url + '&s=' + str(44 * i)
            html = getHTMLText(url)
            parsePage(infoilt, html)
        except:
            continue
    printGoodsList(infoilt)

main()
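The two regular expressions in parsePage work because Taobao embeds its search results as JSON-like data inside the page source. To see concretely what they match, here is a sketch run against a made-up fragment imitating that embedded data (the sample string is invented for illustration; a real page contains many more fields):

```python
import re

# A made-up snippet mimicking the JSON embedded in Taobao's search page
html = ('"raw_title":"双肩背包","view_price":"128.00","nick":"someshop",'
        '"raw_title":"帆布书包","view_price":"59.90"')

plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)

goods = []
for p, t in zip(plt, tlt):
    # split(':')[1] keeps the quoted value; eval() strips the quotes
    goods.append([eval(p.split(':')[1]), eval(t.split(':')[1])])

print(goods)  # pairs of [price, title]
```

Note that if the cookie is missing or stale, Taobao returns the login page instead, these regexes find no matches, and the result list is simply empty. That is the symptom to look for when debugging.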
PS: this article is for academic discussion only. :)