The book uses this example: https://spa16.scrape.center/. It is a demo site the author (Cui) built himself as a sample for this latest edition of his crawling book, which I strongly recommend. So the question is: how do we know the site uses HTTP/2.0, and why can't requests handle it?
Open the browser's developer tools, switch to the Network tab, click the first entry under All, and look at the Request Headers section in the Headers panel on the right. As shown in the figure.
Since the site is served over HTTP/2.0, the requests library cannot fetch it. Let's try anyway and see:
import requests
url = "https://spa16.scrape.center"
resp = requests.get(url)
print(resp.text)
Running this naturally raises an error. The failure is not caused by a missing request header. The real reason is that requests speaks HTTP/1.1 to the target site, while this site only serves HTTP/2.0, so the request cannot succeed. The solution is httpx, which supports HTTP/2.0; install it together with its HTTP/2 extra:
pip3 install httpx
pip3 install 'httpx[http2]'
httpx shares many APIs with requests. Let's start with the most basic GET request:
import httpx
response = httpx.get('https://www.httpbin.org/get')
print(response.status_code)
print(response.headers)
print(response.text)
Here we request the same test site as before, simply calling httpx's get method, whose usage is identical to that of requests. We assign the return value to the response variable and print its status_code, headers, and text attributes. The output is as follows:
200
Headers({'date': 'Mon, 26 Dec 2022 15:28:48 GMT', 'content-type': 'application/json', 'content-length': '311', 'connection': 'keep-alive', 'server': 'gunicorn/19.9.0', 'access-control-allow-origin': '*', 'access-control-allow-credentials': 'true'})
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "www.httpbin.org",
    "User-Agent": "python-httpx/0.23.1",
    "X-Amzn-Trace-Id": "Root=1-63a9bdb0-0d54247f5449601021648371"
  },
  "origin": "39.69.199.58",
  "url": "https://www.httpbin.org/get"
}
The output contains three parts: the status_code attribute is the status code, 200; the headers attribute is the response headers, a Headers object that behaves like a dictionary; the text attribute is the response body, in which the User-Agent is python-httpx/0.23.1, showing that the request was indeed made with httpx.
Now let's change the User-Agent and request again. The code becomes:
import httpx
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/90.0.4430.93 Safari/537.36'}
response = httpx.get('https://www.httpbin.org/get', headers=headers)
print(response.text)
Here we set a new User-Agent, assigned it to the headers variable, and passed it in via the headers parameter. The output:
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "www.httpbin.org",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-63a9be5f-5a590f5402f8cc9c0eb7ec6a"
  },
  "origin": "39.69.199.58",
  "url": "https://www.httpbin.org/get"
}
The new User-Agent took effect. That was the HTTP/1.1 case. Now let's try the site from the beginning:
import httpx
response = httpx.get('https://spa16.scrape.center')
print(response.text)
Running this still fails. The reason is that httpx does not enable HTTP/2.0 by default; it has to be declared explicitly, like this:
import httpx
client = httpx.Client(http2=True)
response = client.get('https://spa16.scrape.center')
print(response.text)
Now it runs correctly.
The rest of the API matches requests: status_code gives the status code, text the body as text, content the body as raw bytes, headers the response headers, plus the json method, and so on.
The Client object here is best studied by analogy with the Session object in requests.
Now let's look at how to use the Client object. The officially recommended pattern is the with-as statement. Example:
import httpx
with httpx.Client(http2=True) as client:
    resp = client.get('https://spa16.scrape.center')
    print(resp.text)
This usage is equivalent to:
import httpx
client = httpx.Client(http2=True)
try:
    resp = client.get('https://spa16.scrape.center')
    print(resp.text)
finally:
    client.close()
An example against an HTTP/1.1 site. Note that http2=True only allows httpx to negotiate HTTP/2; if the server does not support it, httpx falls back to HTTP/1.1, and resp.http_version tells us which protocol was actually used:
import httpx
url = "http://www.httpbin.org/headers"
client = httpx.Client(http2=True)
resp = client.get(url)
print(resp.text)
print(resp.http_version)
The output:
{
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "www.httpbin.org",
    "User-Agent": "python-httpx/0.23.1",
    "X-Amzn-Trace-Id": "Root=1-63a9c6e9-503cf55e6ceb91b96402ec9e"
  }
}
HTTP/1.1
And an example against the HTTP/2.0 site:
import httpx
url = "https://spa16.scrape.center"
client = httpx.Client(http2=True)
resp = client.get(url)
print(resp.text)
print(resp.http_version)
The output:
<!DOCTYPE html>
<html lang=en>
<head>
<meta charset=utf-8>
<meta http-equiv=X-UA-Compatible content="IE=edge">
<meta name=viewport content="width=device-width,initial-scale=1">
<meta name=referrer content=no-referrer>
<link rel=icon href=/favicon.ico> <title>Scrape | Book</title>
<link href=/css/chunk-50522e84.e4e1dae6.css rel=prefetch>
<link href=/css/chunk-f52d396c.4f574d24.css rel=prefetch>
<link href=/js/chunk-50522e84.6b3e24aa.js rel=prefetch>
<link href=/js/chunk-f52d396c.f8f41620.js rel=prefetch>
<link href=/css/app.ea9d802a.css rel=preload as=style>
<link href=/js/app.b93891e2.js rel=preload as=script>
<link href=/js/chunk-vendors.a02ff921.js rel=preload as=script>
<link href=/css/app.ea9d802a.css rel=stylesheet>
</head>
<body>
<noscript>
<strong>We're sorry but portal doesn't work properly without JavaScript enabled. Please enable it to continue.</strong>
</noscript>
<div id=app></div>
<script src=/js/chunk-vendors.a02ff921.js></script> <script src=/js/app.b93891e2.js></script> </body> </html>
HTTP/2
httpx also supports asynchronous requests through AsyncClient, using Python's async/await syntax. It is written like this:
import httpx
import asyncio
async def fetch(url):
    async with httpx.AsyncClient(http2=True) as client:
        resp = await client.get(url)
        print(resp.text)

if __name__ == '__main__':
    asyncio.run(fetch('https://www.httpbin.org/get'))