Python web scraping: finding data APIs and common anti-scraping countermeasures

I. Manually obtaining a cookie and logging in automatically

Part 1: Find the JSON URL

1. In Chrome, open DevTools (Inspect), switch to the Network tab, filter by Fetch/XHR, then refresh the page so the requests are captured again.

2. Look through the Name column for the request that carries the data you need.

3. Ways to pick out the right request:

1) judge by its Name
2) judge by its Size (the file size)
3) finally, click the request and check its Preview tab to confirm it holds the data you want

4. Once you have found it, open its Headers tab and copy the Request URL.

Part 2: Parse the data in PyCharm

import requests

# pretend to be a normal browser; some servers reject bare requests
headers = {
   'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}
response = requests.get('https://game.gtimg.cn/images/lol/act/img/js/heroList/hero_list.js', headers=headers)
print(response.json())

If the printed result is the data you need, you picked the right request.
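As a quick sanity check you can pull fields out of the parsed JSON. A minimal sketch, assuming the endpoint returns an object with a 'hero' list whose entries carry a 'name' field (the shape this LoL endpoint returned at the time of writing):

data = response.json()
# print the first few hero names to confirm the data is what we expect
for hero in data['hero'][:5]:
    print(hero['name'])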

Getting the cookie:
Inspect → Network → All → click the page's own entry under Name → Headers → Cookie

II. Automatic login

Zhihu is used as the example here, i.e. we log in to Zhihu automatically.
Zhihu's content is only shown to a logged-in account, and the code below reuses an existing login by sending its cookie.

import requests

# 'cookie' below is the value copied from DevTools (Headers → Cookie); it is tied to one login session and will eventually expire
headers = {
    'cookie': '_zap=b1124762-828e-435d-b04c-7c59a1786742; _xsrf=774cb199-0e1c-4b28-bb60-8c62b565c8bc; d_c0=AUCYBu3vvBWPTm-arz42Iw6N9McyUzXcK4c=|1666236614; __snaker__id=h9XPzR2HWZU7g32U; gdxidpyhxdE=0E5%2Fpw5xVQk4I8AjL4%5Czi82PtOTmygoSeGwhICxLLVZ7rKD0sGAX%2Fl7ag0qgWvwWbBzp%2Bxs12%2BMMs2IKlxRPe8L8sCamvqfgU1%2B%5CTCuuj%2Fq%2F%2F%2BHyiITWG0KpRs%5Ck6WWJmfc0GBXBxMInMsQ0ccwDz8m4fd%5Ct91fnkea26sfCjcpMjU1K%3A1666237515104; YD00517437729195%3AWM_NI=6f%2BZdRG4pSroFkLgghutDCxnyNtfeQ99uG2rLkD8zzsTok7nLjMSRCCUCwF9R4Fv9q8pTvFCpGD2fshT%2BjV6hSJ70OKqsxHrJR1HLJb6bcxJbjanWbk3byL2QpG%2BcPdNajk%3D; YD00517437729195%3AWM_NIKE=9ca17ae2e6ffcda170e2e6eea6d3439b8fbab8d247b0eb8bb3c85f869b9facc84db2eba0adfb5282a7a282e42af0fea7c3b92af6bae5bbe75a85ecc0b4f16586b58488e664b7ac8ca4fb66bb92bc87e85bfbe7bba8b17a90b58cd9f972b18ca18cb66690b683b4d149818b8d98c53392ec8a97bc7bfcb09c97c549b4998cd3e579ac8ce58bd53eac8d86d8b34589afbd99cb7a8a8effa9cf42a2eaae84fb42a2b6af8fd66aa9edae9acf3a93b089cce25ca892af8bea37e2a3; YD00517437729195%3AWM_TID=NdK2P0cmZUdBQUEFUULUXq9fP1YFijKK; captcha_session_v2=2|1:0|10:1666236622|18:captcha_session_v2|88:Y1VMdHV1MFJCTjMzQzFVYVExUEhLVW8ra1lqbDFVbzhpY1FtT1BJMGErUis0VE5GZmlPTnBHQ2FKVUx1Q09jLw==|e2e3d50d5966c7f04f11bbc9430caefa856fedbe0b81fb941a5601435ce1670c; captcha_ticket_v2=2|1:0|10:1666236651|17:captcha_ticket_v2|704:eyJ2YWxpZGF0ZSI6IkNOMzFfSy1WVy5MS3N0NmZsVzk1T0VLMTdOWU05bTI5anJOZTBFUHJlcG8wWG9ZNEppak5jc2REMFpOTzVsWThSSnFlUTRhZDQ5ZEwxdkJ3Um1XaERlR01PckUtWURkLmlhd3BjVUgwdk5GNld1NzlqY0FSWDUyTldVOEhueW9yTlpHSC5ZOVBfZWxnMDVpMndHaXZrakJWZVQ4ZEU3a1BmVUdHOFM1aHZTQ1hMZWpzQXhLWlYyQm1hUnlGVl93dGd1Li1YOWdmR1c0eXMyTHhvaGJGbjRWNnVGaHlPRy15X3EwcDdTa2YwS3dxZ3lRVENzcjVNSEtDRXgwTEc1RzE1ZWU1TEFkN1lVLXI4WUduRTdNVWhEZFZfQmRfNUJ0b0JoQ2h0Ql8tbDY1Nk9QNlI5VUF2YnVnd0haNHI2UUktR3NsSUdYQ2dJVl9MV1JBZEdCZTY0LktpZ0x3emd2STZQT2RTc2ppc1JEd0hzWXpmRXdzVUhTRlZzb0ZvTEttVHVtTHM0b0IyR2RqUkc3TnBLQ09wd19fajg5aFNUaUR2RFc4Rk96Wi1Md2U3c2QxbGlGV003QmNIbDA0bDlMQ3N0VjR1UHZnNE9QY3JZOG5YQmZ3aDh2SXplczVWVEl2LmpqVEVnSmYwLmZmc1d2NzFVd0ouMDZGaFJWTDJJQlZYMyJ9|8b8701de0198ab00e821f6deeb6ea8598ca444615c1166543fff1ce899f45d76; z_c0=2|1:0|10:1666236667|4:z_c0|92:Mi4xaW5CWUdRQUFBQUFCUUpnRzdlLThGU1lBQUFCZ0FsVk4tdzQtWkFDMUdXVVRpM2xMQkNpT2x6WGwyWWJHbkh5Uk53|2f44557aacbe41f585dcf8586df529133b0eabd958de5974908ca4b500840f96; q_c1=b4bc9487357a4804b8bf100ad46fb07e|1666236667000|1666236667000; NOT_UNREGISTER_WAITING=1; tst=r; SESSIONID=NMIvpzh1H0KgDQHvcVlyotltvV0Py4d5qCNQ3PXRe4T; KLBRSID=53650870f91603bc3193342a80cf198c|1666236826|1666236613',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
}
response = requests.get('https://www.zhihu.com/', headers=headers)
print(response.text)
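Instead of sending the whole string through headers, requests can also take cookies as a dict. A small sketch; the parsing assumes the usual 'name=value; name=value' format of a copied Cookie header:

import requests

cookie_str = '_zap=...; _xsrf=...'  # paste your own copied Cookie value here
# split the raw header string into a {name: value} dict
cookies = dict(pair.split('=', 1) for pair in cookie_str.split('; '))
response = requests.get('https://www.zhihu.com/', headers={'user-agent': 'Mozilla/5.0'}, cookies=cookies)
print(response.status_code)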

III. Getting cookies with Selenium

from selenium.webdriver import Chrome

# 1. Open the site whose cookies we need with selenium
b = Chrome()
b.get('https://www.taobao.com')

# 2. Block here so there is enough time to log in by hand
input('Press Enter once you have finished logging in: ')

# 3. Wait until the page shows a successful login, then grab the cookies (this returns every cookie for the whole site)
cookies = b.get_cookies()
# print(cookies, type(cookies))

# 4. Write the collected cookies to a file for later reuse
with open('files/taobao.txt', 'w', encoding='utf-8') as f:
    f.write(str(cookies))
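str() plus eval() works, but serializing as JSON avoids calling eval() later. A small alternative sketch for step 4:

import json

with open('files/taobao.txt', 'w', encoding='utf-8') as f:
    json.dump(cookies, f)  # read back later with json.load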

IV. Using cookies with Selenium

from selenium.webdriver import Chrome


# 1. Open the page we want to log into automatically
b = Chrome()
b.get('https://www.taobao.com')

# 2. Add the saved cookie values
with open('files/taobao.txt', encoding='utf-8') as f:
    cookies = eval(f.read())  # the file was written with str(), so eval() turns it back into a list of dicts
    for x in cookies:
        b.add_cookie(x)

# 3. Reload the page; with the cookies in place it opens already logged in
b.get('https://www.taobao.com')
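If the cookies were saved with json.dump as sketched above, read them back with json.load instead of eval:

import json

with open('files/taobao.txt', encoding='utf-8') as f:
    cookies = json.load(f)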

V. Using a proxy IP with requests

Sometimes, after we hit a site too many times, it bans our IP and locks us out. The fix is to route requests through a proxy IP.
The provider I currently recommend is 极光IP.

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
}

# Set the proxy IP (proxy addresses expire quickly; replace this with a live one from your provider)
proxies = {
    'https': '175.22.188.25:4524',
}

response = requests.get('https://movie.douban.com/top250', headers=headers, proxies=proxies)

if response.status_code == 200:
    print(response.text)
else:
    print('request failed:', response.status_code)
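Free or shared proxies die often, so a simple retry loop helps. A minimal sketch, assuming a hypothetical get_proxy() helper that returns a fresh 'ip:port' string from your provider's API:

import requests

def fetch(url, headers, max_tries=3):
    # try up to max_tries different proxies before giving up
    for _ in range(max_tries):
        proxies = {'https': get_proxy()}  # get_proxy() is a hypothetical helper for your proxy provider
        try:
            response = requests.get(url, headers=headers, proxies=proxies, timeout=5)
            if response.status_code == 200:
                return response
        except requests.RequestException:
            continue  # proxy dead or timed out; try the next one
    return None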

VI. Using a proxy IP with Selenium

from selenium.webdriver import Chrome, ChromeOptions

# 1. Build a configuration for the browser
options = ChromeOptions()

# 1) Set the proxy
options.add_argument('--proxy-server=http://171.83.191.223:4526')

b = Chrome(options=options)
b.get('https://movie.douban.com/top250')
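Because a single proxy can go stale, it also helps to rotate through a pool. A small sketch reusing the sample IPs from this article (they are almost certainly expired; swap in live ones from your provider):

import random
from selenium.webdriver import Chrome, ChromeOptions

proxy_pool = ['http://171.83.191.223:4526', 'http://175.22.188.25:4524']  # sample IPs only

options = ChromeOptions()
options.add_argument(f'--proxy-server={random.choice(proxy_pool)}')  # pick one at random per browser session

b = Chrome(options=options)
b.get('https://movie.douban.com/top250')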
