Using the requests module

1. Sending a POST request with requests

 response = requests.post(url,
                          data=data  # request body, e.g. a dict of form fields
                          )
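As a rough illustration of what the data argument does, the sketch below prepares a POST request without sending it, so the encoded body can be inspected offline. The URL and the form fields are assumptions chosen for the example, not values from these notes.

```python
import requests

# Hypothetical endpoint and form fields, used only for illustration.
url = "https://httpbin.org/post"
form = {"user": "alice", "page": "1"}

# Prepare the request without sending it, to inspect what would go on the wire.
prepared = requests.Request("POST", url, data=form).prepare()

print(prepared.method)
print(prepared.headers["Content-Type"])
print(prepared.body)

# Actually sending it would look like this:
# response = requests.post(url, data=form, timeout=5)
```

A dict passed as data is form-encoded, which is what most login forms expect.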

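2. Proxies

  Difference between forward and reverse proxies
    Reverse proxy: from the client's point of view, a proxy that acts on behalf of the server
    Forward proxy: from the client's point of view, a proxy that acts on behalf of the client

Forward proxy: the browser knows the real address of the server
Reverse proxy: the browser does not know the real address of the server, for example nginx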
Usage:
requests.get(url, proxies=proxies)
proxies takes the form of a dict:
proxies = {
        "http": "http://xxx.xx.xxx.xx",
        "https": "https://xxx.xx.xxx.xx",
}
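A minimal sketch of the proxies dict in use follows. The proxy addresses are placeholders (assumptions) and fetch_via_proxy is a hypothetical helper name; select_proxy from requests.utils is used only to show which entry would be chosen for a given URL, without touching the network.

```python
import requests
from requests.utils import select_proxy

# Placeholder proxy addresses (assumptions); substitute a proxy you control.
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8081",
}


def fetch_via_proxy(url, proxies, timeout=5):
    """Send a GET request through the proxy that matches the URL's scheme."""
    return requests.get(url, proxies=proxies, timeout=timeout)


# select_proxy reports which entry requests would pick for each URL.
http_proxy = select_proxy("http://example.com/", proxies)
https_proxy = select_proxy("https://example.com/", proxies)
print(http_proxy, https_proxy)
```

The dict key is the scheme of the target URL, not the scheme of the proxy itself, so an https target can still be routed through a proxy reached over http.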

Types of proxy

  • Transparent proxy
  • Anonymous proxy
  • High-anonymity (elite) proxy

Protocols used for the request

  • HTTP
  • HTTPS
  • socket (SOCKS) proxy

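Using cookies in a crawler

  • Advantages:
    Sending cookies gives access to pages that require a login
    It helps get around some anti-crawling measures

  • Disadvantages:
    A set of cookies usually corresponds to a single user, so requests must not be sent too frequently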

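How requests handles cookies

  1. Put the cookie string in headers
    requests.get(url, headers={"User-Agent": ua, "Cookie": cookie_string})

  2. Pass a cookie dict to the cookies parameter of the request method
    Build the cookie dict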

Cookie string:
cookies_set = 'has_recent_activity=1; _octo=GH1.1.989874598.1564974045; _ga=GA1.2.503644495.1564974080; _gat=1; tz=Asia%2FShanghai; _device_id=40315822662c352b84396c5c807a7128; user_session=db2W8Iofe5msHvOHJDKNKScGtkGleioXUcxqTMWB_R1tTL6f; __Host-user_session_same_site=db2W8Iofe5msHvOHJDKNKScGtkGleioXUcxqTMWB_R1tTL6f; logged_in=yes; dotcom_user=changanbaimao; _gh_sess=NDdmeEcrVlEyQVdyRnFhOUtRWEJ5S2NJRjN6SDAvRkF0b09iK3I0YVhlMHJXVnZNVXdTQnJyRy9IOXV2RkpWWkVIRDBHOU5aRS84YmUwMGdRZzBzYlpwM3QveHBqczVBUXJoRFA1N0VFK1MyNTdvVU05WlBheTdkTlhRRm5nVG1wSkNza3ZQTjg3STZFMWJoanpyVld3UzB1SW83d3N0MDBZVForcFhPV0RGREs1SWFWNEh3N0VyL0ZZdlBJK242RDF3dGR6aXRiY2J3MXUwaE9xVUg1dVJ2RzIvdEpoT2grLzZteXRNMTFBSjVTWmlMU3JSTjhqZHhZdmZKdDZHYnM4ZWx1Qk8yNU5wcytzREs0ZTNTY0E9PS0tS1FPMUhzVC9GaURZT3NDQmZVd0FLUT09--b9eefb15a3ea8dc350874018950afc9b0aa64287'

Dict comprehension:
cookies_dict = {cookie.split('=', 1)[0]: cookie.split('=', 1)[1] for cookie in cookies_set.split('; ')}
  1. Split cookies_set on '; ' and loop over the pieces
  2. Build a dict of key-value pairs
  3. Split each cookie on the first '=': index 0 is the key, index 1 is the value
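The same steps can be sketched as a small function. parse_cookie_string is a hypothetical helper name and the sample string is made up; the point is that splitting only on the first '=' keeps values that themselves contain '=' intact.

```python
def parse_cookie_string(cookie_string):
    """Turn 'k1=v1; k2=v2' into a dict, keeping any '=' inside values."""
    cookies = {}
    for pair in cookie_string.split("; "):
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        key, value = pair.split("=", 1)
        cookies[key] = value
    return cookies


# Made-up cookie string; the last value contains '=' on purpose.
sample = "logged_in=yes; tz=Asia%2FShanghai; token=abc=="
parsed = parse_cookie_string(sample)
print(parsed)
```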

  requests.get(
              url,
              headers=headers,
              cookies=cookies_dict
  )
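To compare the two approaches above, the following sketch prepares one request each way and prints the Cookie header that would be sent. The URL and cookie value are assumptions, and nothing is sent over the network.

```python
import requests

url = "https://example.com/profile"  # hypothetical URL
headers = {"User-Agent": "Mozilla/5.0"}

# Approach 1: the raw cookie string goes into the Cookie header.
via_header = requests.Request(
    "GET", url, headers={**headers, "Cookie": "logged_in=yes"}
).prepare()

# Approach 2: a dict goes into the cookies parameter.
via_param = requests.Request(
    "GET", url, headers=headers, cookies={"logged_in": "yes"}
).prepare()

print(via_header.headers["Cookie"])
print(via_param.headers["Cookie"])
```

Both end up as the same Cookie header, so the choice is mostly about whether the cookies are already available as a string or as a dict.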

3. Using the session provided by requests

  • A session must be instantiated first
Logging in to GitHub with a session
import requests
import re
# Instantiate a session
session = requests.session()
headers = {
            'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1',
            # 'Cookie': 'has_recent_activity=1; _octo=GH1.1.989874598.1564974045; _ga=GA1.2.503644495.1564974080; _gat=1; tz=Asia%2FShanghai; _device_id=40315822662c352b84396c5c807a7128; user_session=db2W8Iofe5msHvOHJDKNKScGtkGleioXUcxqTMWB_R1tTL6f; __Host-user_session_same_site=db2W8Iofe5msHvOHJDKNKScGtkGleioXUcxqTMWB_R1tTL6f; logged_in=yes; dotcom_user=changanbaimao; _gh_sess=NDdmeEcrVlEyQVdyRnFhOUtRWEJ5S2NJRjN6SDAvRkF0b09iK3I0YVhlMHJXVnZNVXdTQnJyRy9IOXV2RkpWWkVIRDBHOU5aRS84YmUwMGdRZzBzYlpwM3QveHBqczVBUXJoRFA1N0VFK1MyNTdvVU05WlBheTdkTlhRRm5nVG1wSkNza3ZQTjg3STZFMWJoanpyVld3UzB1SW83d3N0MDBZVForcFhPV0RGREs1SWFWNEh3N0VyL0ZZdlBJK242RDF3dGR6aXRiY2J3MXUwaE9xVUg1dVJ2RzIvdEpoT2grLzZteXRNMTFBSjVTWmlMU3JSTjhqZHhZdmZKdDZHYnM4ZWx1Qk8yNU5wcytzREs0ZTNTY0E9PS0tS1FPMUhzVC9GaURZT3NDQmZVd0FLUT09--b9eefb15a3ea8dc350874018950afc9b0aa64287'
}
# 1. Fetch the login page
url = 'https://github.com/login'
res = session.get(url,headers=headers)
authenticity_token = re.search(r'name="authenticity_token" value="(.*?)" />', res.text).group(1)

print(authenticity_token)

# 2. Send the POST request
url = 'https://github.com/session'
data = {
        'commit':'Sign in',
        'utf8':'✓',
        'authenticity_token':authenticity_token,
        'login':'xxxxx', # enter your username
        'password': 'xxxxxx', # enter your password
    }
# The username and password could also be read with input() instead of being hard-coded
   
session.post(url,data=data,headers=headers)

# 3. Fetch the page that verifies the login
url = 'https://github.com/changanbaimao'
res = session.get(url,headers=headers)
print(res.content.decode())
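The token lookup in step 1 raises AttributeError when the pattern is not found, because re.search returns None. A slightly more defensive sketch is shown here; extract_authenticity_token is a hypothetical helper name and the HTML fragment is made up, since the markup of the real login page may differ.

```python
import re

# Matches the hidden form field and captures its value.
TOKEN_PATTERN = re.compile(r'name="authenticity_token"\s+value="([^"]*)"')


def extract_authenticity_token(html):
    """Return the hidden form token, or None if the page has no such field."""
    match = TOKEN_PATTERN.search(html)
    return match.group(1) if match else None


# Made-up HTML fragment standing in for the login page.
sample_html = '<input type="hidden" name="authenticity_token" value="abc123" />'
found = extract_authenticity_token(sample_html)
missing = extract_authenticity_token("<html></html>")
print(found, missing)
```

Checking for None before posting the form makes a layout change on the site show up as a clear error rather than a failed login.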
In short, the general pattern is:
  session = requests.session()
  response = session.get(url, headers=headers)
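What makes a session useful is that headers and cookies stored on it are attached to every later request. The sketch below shows this offline: the cookie is set by hand as a stand-in for one a server would set after login, and the User-Agent and URL are assumptions.

```python
import requests

session = requests.Session()
session.headers.update({"User-Agent": "my-crawler/0.1"})  # hypothetical UA

# Stand-in for a cookie the server would set after a successful login.
session.cookies.set("user_session", "abc123")

# prepare_request merges session-level headers and cookies into the request.
request = requests.Request("GET", "https://example.com/profile")
prepared = session.prepare_request(request)

print(prepared.headers["User-Agent"])
print(prepared.headers["Cookie"])
```

In the GitHub example above, the same mechanism carries the cookies from the login POST over to the final GET.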

Other features of the requests module

  1. Converting between a cookie dict and a cookie jar
    requests.utils.dict_from_cookiejar(cj)  -> dict
    requests.utils.cookiejar_from_dict(cd)  -> cookiejar
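A short round trip through both helpers, using made-up cookie values:

```python
import requests

cookie_dict = {"logged_in": "yes", "tz": "Asia%2FShanghai"}

# dict -> cookie jar, as used by response.cookies and session.cookies
jar = requests.utils.cookiejar_from_dict(cookie_dict)

# cookie jar -> plain dict, convenient for saving or printing
round_tripped = requests.utils.dict_from_cookiejar(jar)

print(type(jar).__name__)
print(round_tripped)
```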

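  2. Avoiding exceptions on sites whose HTTPS certificate cannot be verified
    requests.get(url, verify=False)

    Suppress the resulting security warnings (turning warnings off is not recommended):
    requests.packages.urllib3.disable_warnings()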

  3. The timeout parameter
    response = requests.get(url, timeout=3) # 3 means wait at most 3 seconds after sending the request; if no response arrives, an exception is raised
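One possible way to handle the timeout exception is sketched below. fetch_or_none is a hypothetical helper name, and returning None on timeout is just one choice; re-raising or retrying are equally reasonable.

```python
import requests


def fetch_or_none(url, timeout=3):
    """Return the response, or None if the server does not answer in time."""
    try:
        return requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Covers both connect timeouts and read timeouts.
        return None
```

timeout also accepts a (connect, read) tuple when the two phases need different limits, for example timeout=(3, 10).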

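  4. The retry decorator
    If the decorated function raises an exception, it is executed again,
    up to the maximum number of attempts given by the parameter; after that the exception is raised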

    from retrying import retry
    @retry(stop_max_attempt_number=3)
    def func(): pass
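To make the retry behaviour concrete without depending on the retrying package, here is a hand-written stand-in using only the standard library. simple_retry is a hypothetical name and this is not how retrying is implemented; it only mimics the "try up to N times, then raise" behaviour described above.

```python
import functools


def simple_retry(max_attempts=3):
    """Re-run the wrapped function until it succeeds or attempts run out."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
        return wrapper
    return decorator


calls = {"count": 0}


@simple_retry(max_attempts=3)
def flaky():
    """Fail twice, then succeed, to exercise the retry loop."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("temporary failure")
    return "ok"


result = flaky()
print(result, calls["count"])
```

In a crawler, the decorated function would typically wrap a requests.get call with a timeout, so that a slow response triggers another attempt.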
