How to Use the Requests Library

What Is Requests

Requests is an HTTP library for Python with a simple, easy-to-use API.

Installation

To install Requests, run this command in a terminal:

pip3 install requests

Hello World

import requests

r = requests.get('https://github.com/')
print(r.text)

This fetches the page content of https://github.com/.

HTTP Methods

requests provides a function for each of these HTTP methods:

get
post
put
delete
patch
head
options

For web crawlers, get is the method used most often.

HTTP Status Codes

r.status_code

This single attribute gives you the status code of the HTTP response.

Getting the Response Content

r.text

This returns the response body decoded to text.

r.content

This returns the response body as raw bytes.

Response Encoding

To parse the returned HTML, you first need to know its encoding.

r.encoding

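This value is derived from the HTTP response headers and is often inaccurate, so you may need to set it by hand, for example: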

r.encoding = 'utf-8'

requests can also guess the encoding by analyzing the response content:

r.apparent_encoding

This guess is usually more accurate.
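To make the difference between r.encoding and a manual override concrete, here is a minimal sketch. It assumes a throwaway local server (Utf8Handler is a made-up name, not part of requests) that sends UTF-8 bytes but declares no charset, so no real website is needed.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

BODY = '<p>你好，世界</p>'.encode('utf-8')


class Utf8Handler(BaseHTTPRequestHandler):
    """Hypothetical test server: sends UTF-8 bytes but declares no charset."""

    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        self.send_header('Content-Length', str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, *args):
        pass  # keep the example output quiet


server = HTTPServer(('127.0.0.1', 0), Utf8Handler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

try:
    r = requests.get('http://127.0.0.1:%d/' % server.server_port, timeout=5)
finally:
    server.shutdown()
    server.server_close()

header_encoding = r.encoding   # derived from the Content-Type header
garbled = r.text               # decoded with the header-based encoding
r.encoding = 'utf-8'           # override by hand
fixed = r.text                 # r.text is decoded again with the new encoding

print(header_encoding)
print(r.apparent_encoding)     # guess based on the body bytes
print(fixed)
```

Because r.text is decoded on each access, changing r.encoding after the request has completed is enough; there is no need to fetch the page again.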

Passing URL Parameters

Put the parameters in a dictionary and pass it to the HTTP method through the params argument, for example:

query_params = {'wd': 'Google'}
url = 'https://www.baidu.com/s'
r = requests.get(url, params=query_params)
print(r.request.url)

Output:

https://www.baidu.com/s?wd=Google
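The URL can also be inspected without sending anything. The sketch below builds a request with requests.Request and calls prepare() to see how params is encoded. The tag and page parameters are made up for illustration; they show that a list value is repeated and a None value is dropped.

```python
import requests

url = 'https://www.baidu.com/s'
# 'tag' and 'page' are made-up parameters, used only for illustration.
query_params = {'wd': 'Google', 'tag': ['a', 'b'], 'page': None}

# prepare() builds the final request without sending it over the network.
prepared = requests.Request('GET', url, params=query_params).prepare()
print(prepared.url)
```

This is a convenient way to check query-string encoding while debugging, since it needs no network access.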

Sending Form Data

To do this, pass a dictionary to the data argument. The dictionary is form-encoded automatically when the request is sent.

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post("http://httpbin.org/post", data=payload)
print(r.text)

Output:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "key1": "value1",
    "key2": "value2"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Content-Length": "23",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.18.4"
  },
  "json": null,
  "origin": "123.123.123.123",
  "url": "http://httpbin.org/post"
}
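The form encoding can be checked locally as well. This sketch prepares the same POST without sending it and prints the body and headers that requests would put on the wire; it makes no assumption about httpbin being reachable.

```python
import requests

payload = {'key1': 'value1', 'key2': 'value2'}

# Build the request but do not send it.
prepared = requests.Request(
    'POST', 'http://httpbin.org/post', data=payload
).prepare()

print(prepared.body)                        # the form-encoded payload
print(prepared.headers['Content-Type'])     # set automatically for dict data
print(prepared.headers['Content-Length'])   # length of the encoded body
```

The values match the Content-Type and Content-Length that httpbin echoes back in the response shown above.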

Custom Request Headers

By default, requests sends this User-Agent header:

r = requests.get('https://github.com/')
print(r.request.headers['user-agent'])

Output:

python-requests/2.18.4

When you crawl pages, some sites inspect the User-Agent header to block crawlers. You can get past this check by customizing the headers that requests sends:

user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
r = requests.get('https://github.com/', headers=user_agent)
print(r.request.headers['user-agent'])

Output:

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36

As you can see, the User-Agent that requests sends has changed.
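A custom header replaces only the header with the same name; the other defaults stay in place. The sketch below uses Session.prepare_request to compare the headers with and without an override, without touching the network. The value my-crawler/0.1 is a placeholder, not a recommended User-Agent.

```python
import requests

session = requests.Session()
url = 'https://github.com/'

# 'my-crawler/0.1' is a placeholder value chosen for this example.
custom_headers = {'User-Agent': 'my-crawler/0.1'}

# prepare_request() merges the session's default headers with per-request ones.
default_req = session.prepare_request(requests.Request('GET', url))
custom_req = session.prepare_request(
    requests.Request('GET', url, headers=custom_headers)
)

default_ua = default_req.headers['user-agent']   # header lookup is case-insensitive
custom_ua = custom_req.headers['user-agent']

print(default_ua)
print(custom_ua)
print(sorted(custom_req.headers))   # the remaining default headers are kept
```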

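Timeouts

You can tell requests to stop waiting for a response after the number of seconds given by the timeout argument. Nearly all production code should set it. Without it, your program may hang indefinitely.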

r = requests.get(url, timeout=timeout)
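Here is a minimal sketch of a timeout actually firing. It assumes a local test server (SlowHandler is a made-up name) that waits one second before replying, while the client is only willing to wait 0.2 seconds for the response. The timeout is given as a (connect, read) tuple, which requests also accepts in place of a single number.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical test server that waits one second before answering."""

    def do_GET(self):
        time.sleep(1.0)
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'late')
        except OSError:
            pass  # the client has already given up and closed the socket

    def log_message(self, *args):
        pass  # keep the example output quiet


server = HTTPServer(('127.0.0.1', 0), SlowHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_port

try:
    # (connect timeout, read timeout) in seconds.
    requests.get(url, timeout=(3.05, 0.2))
    timed_out = False
except requests.exceptions.Timeout:
    timed_out = True
finally:
    server.shutdown()
    server.server_close()

print(timed_out)
```

The timeout limits how long requests waits for the server to respond, not the total time of the download.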

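Errors and Exceptions

When a network problem occurs (for example a DNS failure or a refused connection), Requests raises a ConnectionError exception.

If the HTTP request returned an unsuccessful status code, Response.raise_for_status() raises an HTTPError exception.

If the request times out, a Timeout exception is raised.

If the request exceeds the configured maximum number of redirects, a TooManyRedirects exception is raised.

All exceptions that Requests explicitly raises inherit from requests.exceptions.RequestException.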
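The exception classes above can be handled one by one, as in this sketch. It assumes a local test server that always answers 404 (NotFoundHandler) and a helper called describe; both names are invented for this example. The more specific exceptions are caught first and RequestException last, because it is the base class of the others.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class NotFoundHandler(BaseHTTPRequestHandler):
    """Hypothetical test server that answers every GET with 404."""

    def do_GET(self):
        self.send_error(404)

    def log_message(self, *args):
        pass  # keep the example output quiet


def describe(url):
    """Return a short label for the outcome of fetching url."""
    try:
        r = requests.get(url, timeout=5)
        r.raise_for_status()
        return 'ok'
    except requests.exceptions.HTTPError as e:
        return 'http error %d' % e.response.status_code
    except requests.exceptions.Timeout:
        return 'timeout'
    except requests.exceptions.ConnectionError:
        return 'connection error'
    except requests.exceptions.RequestException:
        return 'other requests error'


server = HTTPServer(('127.0.0.1', 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Find a port with nothing listening on it by binding and closing a socket.
with socket.socket() as s:
    s.bind(('127.0.0.1', 0))
    closed_port = s.getsockname()[1]

try:
    not_found = describe('http://127.0.0.1:%d/' % server.server_port)
    refused = describe('http://127.0.0.1:%d/' % closed_port)
finally:
    server.shutdown()
    server.server_close()

print(not_found)
print(refused)
```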

A General-Purpose Template

import requests

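def getHTML(url, timeout):
    """Return the page text, or '' if the request fails."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()  # raise HTTPError for an unsuccessful status code
        r.encoding = r.apparent_encoding  # decode with the encoding guessed from the body
        return r.text
    except requests.exceptions.RequestException:
        # Covers ConnectionError, HTTPError, Timeout and TooManyRedirects
        return ''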