Python Crawlers: Proxies, Cookie Handling, and Simulated Login

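Proxies

A proxy server accepts a request and forwards it to the target on the client's behalf.

Anonymity levels

  • Elite (high anonymity): the target server learns nothing; it cannot tell that a proxy is in use.
  • Anonymous: the server knows you are using a proxy but does not know your real IP.
  • Transparent: the server knows you are using a proxy and also knows your real IP.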
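One way to make the three levels concrete is to look at the headers the target server receives. The sketch below assumes a common convention: a transparent proxy forwards the client address (typically in X-Forwarded-For), an anonymous proxy announces itself (for example with a Via header) but hides the address, and an elite proxy adds neither. The function name and the list of hint headers are illustrative assumptions, and real proxies do not always follow this pattern.

```python
# Hypothetical helper: classify a proxy from the headers the target server saw.
# The header list is an assumption based on common proxy behaviour.
PROXY_HINT_HEADERS = ("via", "x-forwarded-for", "forwarded", "proxy-connection")


def classify_anonymity(received_headers, real_ip):
    """Return 'transparent', 'anonymous' or 'elite' for one observed request."""
    headers = {name.lower(): value for name, value in received_headers.items()}
    # The real client IP leaked in any header value: transparent.
    if any(real_ip in value for value in headers.values()):
        return "transparent"
    # A proxy-related header is present but the IP is hidden: anonymous.
    if any(name in headers for name in PROXY_HINT_HEADERS):
        return "anonymous"
    # Nothing reveals the proxy.
    return "elite"


print(classify_anonymity({"Via": "1.1 proxy"}, "203.0.113.7"))
```

In practice the headers would come from an echo endpoint that you control or trust, requested once directly (to learn the real IP) and once through the proxy.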

Types

  • http
  • https
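With requests, the type matters because the keys of the proxies dictionary are matched against the scheme of the URL being fetched, while each value is the address of the proxy itself. A minimal sketch follows; build_proxies is a hypothetical helper, and the IP is a documentation-range placeholder rather than a live proxy.

```python
def build_proxies(address, schemes=("http", "https")):
    """Return a requests-style proxies mapping for an 'ip:port' string."""
    # Assumption: the proxy itself is reached over plain HTTP,
    # which is the usual case for free proxy lists.
    if "://" not in address:
        address = "http://" + address
    return {scheme: address for scheme in schemes}


proxies = build_proxies("203.0.113.10:8080")
print(proxies)
```

If the mapping contains only an "http" key, requests to https:// URLs will normally not be routed through that proxy, so the key should match the scheme of the target site.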

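Free proxies

  • www.goubanjia.com
  • Kuaidaili (快代理)
  • Xici Daili (西祠代理)
  • https://www.zhiliandaili.cn/

The loop below tries each proxy in turn and prints the ones that respond:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36",
    "Connection": "close",
}

# Free proxies expire quickly; replace these with live addresses before running.
proxy_list_http = ['123.169.168.153:9999', '223.242.225.169:9999', '113.195.20.166:9999']
for ip in proxy_list_http:
    try:
        response = requests.get("http://www.xiaohuar.com/", headers=headers,
                                proxies={"http": f"http://{ip}"}, timeout=5)
    except requests.RequestException:
        continue  # dead or unreachable proxy, try the next one
    if response.status_code == 200:  # status_code is an int, not the string '200'
        print(ip)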
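The same check can be wrapped in a reusable function that returns the usable proxies instead of printing them. This is a sketch under stated assumptions: find_working_proxies and its fetch parameter are hypothetical names, and fetch exists only so the logic can be exercised without network access (it defaults to requests.get).

```python
import requests


def find_working_proxies(candidates, test_url, fetch=requests.get, timeout=5):
    """Return the 'ip:port' candidates that answer test_url with HTTP 200."""
    working = []
    for address in candidates:
        proxy_url = "http://" + address
        try:
            response = fetch(
                test_url,
                proxies={"http": proxy_url, "https": proxy_url},
                timeout=timeout,
            )
        except requests.RequestException:
            continue  # refused, timed out, or otherwise unusable
        if response.status_code == 200:
            working.append(address)
    return working
```

A caller could then pick one entry per request, for example with random.choice(working), to spread traffic over several proxies.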

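Cookies

Some endpoints only return data when the request carries cookies that were set by an earlier page visit. A requests.Session stores the cookies it receives and sends them on every later request made through the same session.

import requests

url = 'https://xueqiu.com/'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36",
    'Connection': "close"
}
session = requests.Session()
# First request: visit the home page so the server's cookies are stored in the session.
session.get(url=url, headers=headers)

# Second request: the session sends those cookies along with the JSON API call.
xq_url = "https://xueqiu.com/statuses/hot/listV2.json?since_id=-1&max_id=92138&size=15"
page_text = session.get(xq_url, headers=headers).json()
print(page_text)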
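The effect of the session can be seen without any network access by preparing a request and inspecting its headers. In this sketch the cookie name demo_token and its value are made-up placeholders standing in for whatever the site actually sets on the first visit.

```python
import requests

URL = "https://xueqiu.com/statuses/hot/listV2.json"

session = requests.Session()
# Stand-in for a cookie that the first session.get() would have received.
# "demo_token" is a hypothetical name, not the site's real cookie.
session.cookies.set("demo_token", "abc123", domain="xueqiu.com", path="/")

# prepare_request() builds the request the session would send, without sending it.
with_session = session.prepare_request(requests.Request("GET", URL))
# A request prepared outside the session knows nothing about stored cookies.
without_session = requests.Request("GET", URL).prepare()

print(with_session.headers.get("Cookie"))
print(without_session.headers.get("Cookie"))
```

Only the request prepared through the session carries a Cookie header, which is why the second call in the example above has to go through session.get rather than requests.get.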

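Simulated login

CAPTCHA recognition

  • Chaojiying (超级鹰): http://www.chaojiying.com/about.html
  • Damatu (打码兔)
  • Yundama (云打码)

Using Chaojiying

  1. Register and log in.
  2. Software ID: generate a software ID.
  3. Developer documentation: choose your language and download the documentation package.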

Classic example

  • URL: https://www.gushiwen.org/
  • Login verification
  • Image CAPTCHA
  • Random values generated in the page (hidden form fields)
  • Cookies
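The page-generated random values are hidden form fields that must be read from the login page and sent back with the login form. The sketch below collects every hidden input using only the standard library so it runs without extra packages; the article's own code uses lxml for the same step. The class and function names are hypothetical, and the HTML fragment and token values are made up for illustration.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_hidden_fields(html):
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


# Made-up fragment; real pages carry long opaque token values.
sample = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="token-a" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="token-b" />
  <input type="text" name="email" />
</form>
"""
print(extract_hidden_fields(sample))
```

Collecting all hidden inputs at once means the login payload keeps working if the site adds another hidden field later, whereas one XPath per field has to be updated by hand.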

import requests
from lxml import etree
from chaojiying import Chaojiying_Client


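def tranformImageData(img_path, t_type):
    # Send the image to Chaojiying and return the recognized text.
    # Client arguments: account username, password, and software ID.
    chaojiying = Chaojiying_Client('xxx', 'xxx', '1004')
    with open(img_path, 'rb') as fp:
        im = fp.read()
    return chaojiying.PostPic(im, t_type)['pic_str']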


url = "https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36',
}
session = requests.Session()
response_text = session.get(url=url, headers=headers).text
tree = etree.HTML(response_text)
VIEWSTATE = tree.xpath("//*[@id='__VIEWSTATE']/@value")[0]  # hidden random validation value
VIEWSTATEGENERATOR = tree.xpath("//*[@id='__VIEWSTATEGENERATOR']/@value")[0]  # hidden random validation value
img_path = "https://so.gushiwen.cn" + tree.xpath('//*[@id="imgCode"]/@src')[0]  # CAPTCHA image URL
img_bytes = session.get(url=img_path, headers=headers).content

with open("./code.jpg", "wb") as fp:
    fp.write(img_bytes)

code_text = tranformImageData("./code.jpg", 1004)

data = {
    "__VIEWSTATE": VIEWSTATE,
    "__VIEWSTATEGENERATOR": VIEWSTATEGENERATOR,
    "from": "http://so.gushiwen.cn/user/collect.aspx",
    "email": "xxx",
    "pwd": "xxx",
    "code": code_text,
    "denglu": "登录",
}

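login_url = 'https://so.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fso.gushiwen.cn%2fuser%2fcollect.aspx'
# Post through the session so the cookies tied to the CAPTCHA request are sent along.
# A bare requests.post(url=login_url, headers=headers, data=data) would not carry them.
response = session.post(url=login_url, headers=headers, data=data).text
# Save the returned page so the login result can be inspected in a browser.
with open('login.html', "w", encoding="utf-8") as fp:
    fp.write(response)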
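The login URL above carries the redirect target in percent-encoded form (%3a for ":" and %2f for "/"). Rather than encoding it by hand, the query string can be built with the standard library. This is a small sketch; the variable names are illustrative.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

BASE = "https://so.gushiwen.cn/user/login.aspx"
redirect_target = "http://so.gushiwen.cn/user/collect.aspx"

# urlencode() percent-encodes reserved characters such as ':' and '/'.
login_url = BASE + "?" + urlencode({"from": redirect_target})
print(login_url)

# Decoding the query string gives back the original target.
decoded = parse_qs(urlsplit(login_url).query)["from"][0]
print(decoded)
```

urlencode emits upper-case hex digits (%3A, %2F) while the URL in the article uses lower-case ones; percent-escapes are case-insensitive, so both forms refer to the same address.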
