Web Scraping: Fetching Web Page Content with Requests


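The first step in building a crawler: fetching the data.

At some point, it seems, any conversation about Python started to include web crawlers.

And any conversation about crawlers inevitably comes back to Python.

So what does Python offer for fetching web pages?

This article covers the fetching step: how to use Requests, an HTTP library for Python.

With a good tool in hand, the work gets a lot more pleasant.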

Installing `Requests`

Install with pip

pip install requests

Install from source

git clone https://github.com/psf/requests.git
cd requests && pip install .

Using Requests

Here is a quick look at how Requests is used:

>>> import requests
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
'{"type":"User"...'
>>> r.json()
{'private_gists': 419, 'total_private_repos': 77, ...}

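This is the example from the official documentation. The code is straightforward and should be easy to follow.

requests.get: sends an HTTP GET request to the URL
r.status_code: the HTTP status code of the response
r.headers: the HTTP response headers
r.encoding: the text encoding of the response
r.text: the page content, decoded as text
r.json(): the response body parsed as JSON

The sections below walk through the request side and then the response side, introducing the relevant Requests methods for each.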

Making requests

Import the module

import requests

GET request

r = requests.get('https://api.github.com/events')

POST request

r = requests.post("http://httpbin.org/post")

POST file upload

url = 'http://httpbin.org/post'
# Open in binary mode; the with-block closes the file after the upload
with open('report.xls', 'rb') as f:
    r = requests.post(url, files={'file': f})

Other HTTP methods

r = requests.put("http://httpbin.org/put")
r = requests.delete("http://httpbin.org/delete")
r = requests.head("http://httpbin.org/get")
r = requests.options("http://httpbin.org/get")

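Passing parameters

# GET: params is encoded into the URL query string
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get("http://httpbin.org/get", params=payload)

# POST: a dict passed as data is sent form-encoded in the body
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post("http://httpbin.org/post", data=payload)

# POST: a string passed as data is sent as-is, here a JSON document
import json
url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
r = requests.post(url, data=json.dumps(payload))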
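To see how each of these arguments ends up in the outgoing request, one option is to build the request without sending it. The sketch below uses `requests.Request(...).prepare()`, which returns a `PreparedRequest`. The httpbin URLs are reused from the snippets above, and the `json=` call is shown only for comparison with `data=`. Nothing is sent over the network.

```python
import json
import requests

payload = {'key1': 'value1', 'key2': 'value2'}

# params= is encoded into the query string of the URL
get_req = requests.Request('GET', 'http://httpbin.org/get', params=payload).prepare()
print(get_req.url)

# data= with a dict becomes a form-encoded request body
form_req = requests.Request('POST', 'http://httpbin.org/post', data=payload).prepare()
print(form_req.body)
print(form_req.headers['Content-Type'])

# json= serializes the dict and sets the Content-Type header as well
json_req = requests.Request('POST', 'http://httpbin.org/post', json={'some': 'data'}).prepare()
print(json_req.headers['Content-Type'])
print(json.loads(json_req.body))
```

By contrast, `data=json.dumps(payload)` sends the same JSON text but leaves the Content-Type header unset, so `json=` is usually the more convenient choice when the server expects JSON.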

Setting a timeout

r = requests.get('https://github.com', timeout=5)
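Without a `timeout`, Requests can wait indefinitely for a slow server, and when the limit is exceeded it raises an exception instead of returning a response. Below is a minimal sketch of handling that case. The helper name `fetch_or_none`, the `(3.05, 10)` connect/read values, and the injectable `get` parameter are all assumptions made for illustration, not part of Requests.

```python
import requests


def fetch_or_none(url, timeout=(3.05, 10), get=requests.get):
    """Return the page text, or None if the server is too slow.

    timeout is (connect seconds, read seconds). `get` is injectable
    only so the function can be exercised without a network.
    """
    try:
        r = get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Timeout is the base class of ConnectTimeout and ReadTimeout
        return None
    return r.text
```

Note that the timeout limits how long Requests waits for the connection and for data to arrive on the socket, not the total download time.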

Disabling redirects

r = requests.get('http://github.com', allow_redirects=False)

Setting request headers

url = 'https://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}
r = requests.get(url, headers=headers)

Setting proxies

import requests

proxies = {
  "http": "http://10.10.1.10:3128",
  "https": "http://10.10.1.10:1080",
}

requests.get("http://example.org", proxies=proxies)

Sending cookies

url = 'http://httpbin.org/cookies'
cookies = dict(cookies_are='working')

r = requests.get(url, cookies=cookies)
print(r.text)
# '{"cookies": {"cookies_are": "working"}}'

import requests

# Log in; the form fields are sent in the POST body
params = {'username': 'Ryan', 'password': 'password'}
r = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", data=params)
print("Cookie is set to:")
print(r.cookies.get_dict())
print("-----------")
print("Going to profile page...")
# Pass the cookies from the login response along manually
r = requests.get("http://pythonscraping.com/pages/cookies/profile.php",
                 cookies=r.cookies)
print(r.text)
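A plain dict is enough when every cookie should go to every URL. When cookies need a domain and path, a `RequestsCookieJar` can be passed as `cookies=` instead. The sketch below follows the pattern in the official documentation. The cookie names and values are placeholders, and the request is only prepared, not sent, so that the resulting `Cookie` header can be inspected.

```python
import requests

# A jar lets each cookie carry its own domain and path
jar = requests.cookies.RequestsCookieJar()
jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')

# Prepare (but do not send) a request to see which cookies would go out
prepared = requests.Request('GET', 'http://httpbin.org/cookies', cookies=jar).prepare()
cookie_header = prepared.headers.get('Cookie')
print(cookie_header)
```

Only the cookie whose path matches the requested URL should appear in the header.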

Keeping a session

import requests
s = requests.Session()

s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get("http://httpbin.org/cookies")

print(r.text)
# '{"cookies": {"sessioncookie": "123456789"}}'

import requests

# The session stores the login cookies and sends them on later requests
session = requests.Session()
params = {'username': 'username', 'password': 'password'}
s = session.post("http://pythonscraping.com/pages/cookies/welcome.php", data=params)
print("Cookie is set to:")
print(s.cookies.get_dict())
print("-----------")
print("Going to profile page...")
s = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(s.text)
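A session carries more than cookies: default headers set on it are merged into every request it makes, which is handy for a crawler that wants one consistent User-Agent. The following is a small sketch of that merging, using `Session.prepare_request` so nothing is actually sent. The header values and the httpbin URL are illustrative assumptions.

```python
import requests

session = requests.Session()
# Headers set here are merged into every request made through the session
session.headers.update({'user-agent': 'my-app/0.0.1'})

# prepare_request applies the session settings without sending anything
req = requests.Request('GET', 'http://httpbin.org/headers',
                       headers={'accept': 'application/json'})
prepared = session.prepare_request(req)

# Header lookups are case-insensitive; per-request values override defaults
print(prepared.headers['User-Agent'])
print(prepared.headers['Accept'])
```

In a longer-running script, `with requests.Session() as session:` closes the underlying connections when the block ends.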

Skipping SSL certificate verification

requests.get('https://kennethreitz.com', verify=False)

The response

Status code

>>> r = requests.get('http://httpbin.org/get')
>>> r.status_code
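Requests does not raise an error just because the server answered with a 4xx or 5xx status, so a crawler usually has to check for itself. One possible way, sketched here, is `Response.raise_for_status()`, which raises `requests.exceptions.HTTPError` for those status codes. The helper name `check_response` is made up for this example.

```python
import requests


def check_response(r):
    """Return True if the status code is below 400, False for 4xx/5xx."""
    try:
        r.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        print(f"request failed: {exc}")
        return False
    return True
```

For a simple comparison, `r.status_code == requests.codes.ok` checks specifically for 200.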

URL

r.url

Response headers

r.headers
r.headers['Content-Type']
r.headers.get('content-type')

Cookies

r.cookies
r.cookies['example_cookie_name']

Text encoding

r.encoding

Response text

r.text
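`r.text` is decoded using `r.encoding`, and that attribute can be changed before reading the text. This matters for pages that send a `text/*` Content-Type without a charset, where Requests falls back to ISO-8859-1 and non-Latin text comes out garbled. The sketch below shows one way to override it. The helper name `page_text` and the UTF-8 fallback are assumptions; a real site may need a different encoding.

```python
import requests


def page_text(r, fallback="utf-8"):
    """Return r.text, replacing a missing or defaulted encoding first."""
    # ISO-8859-1 is what Requests assumes when the server names no charset
    if r.encoding is None or r.encoding.lower() == "iso-8859-1":
        r.encoding = fallback
    return r.text
```

`r.apparent_encoding` offers a guess based on the body itself and could be used in place of a fixed fallback.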

Binary content

r.content

JSON response content

r.json()
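`r.json()` raises an exception when the body is not valid JSON, which happens easily when a site returns an HTML error page instead of the expected data. A defensive wrapper might look like the sketch below. The name `json_or_none` is hypothetical, and `ValueError` is caught because the JSON decoding errors raised by Requests are subclasses of it.

```python
import requests


def json_or_none(r):
    """Return the parsed JSON body, or None if the body is not valid JSON."""
    try:
        return r.json()
    except ValueError:
        # Raised for empty bodies, HTML error pages, truncated JSON, etc.
        return None
```

A successful parse does not mean the request succeeded: some servers return a JSON error document along with a 4xx or 5xx status, so the status code is still worth checking.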

Raw response content

r.raw
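`r.raw` is only useful when the request was made with `stream=True`. For saving a large download to disk, the pattern the documentation generally recommends is to iterate over `r.iter_content()` rather than read `r.raw` directly. Here is a minimal sketch of that pattern. The helper name `save_body` and the 8192-byte chunk size are choices made for this example.

```python
import requests


def save_body(r, path, chunk_size=8192):
    """Write the response body to `path` in chunks; return bytes written."""
    written = 0
    with open(path, "wb") as fh:
        # iter_content yields the body piece by piece instead of all at once
        for chunk in r.iter_content(chunk_size=chunk_size):
            fh.write(chunk)
            written += len(chunk)
    return written
```

A call might look like `with requests.get(url, stream=True) as r: save_body(r, 'page.html')`, where the file name is only an example.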

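Those are the most commonly used parts of Requests. Some advanced usage is not covered here; see the official documentation for that.

Official documentation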

Requests: HTTP for Humans
Requests: 让 HTTP 服务人类 (Chinese translation)
GitHub
