Installing and Using the Requests Library for Web Scraping

The requests library

Installation

pip install requests 

Importing the module

import requests

GET requests

>>> r = requests.get('http://httpbin.org/get')
>>> print(r)
<Response [200]>

If the status code is 200, the request succeeded.
To see the body of the response:

>>> print(r.text)
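
Query parameters do not need to be spliced into the URL by hand: the params keyword builds the query string for you. A minimal sketch (the key names are just examples):

>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.get('http://httpbin.org/get', params=payload)
>>> print(r.url)
http://httpbin.org/get?key1=value1&key2=value2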

POST requests

POST is the request method we typically use when submitting form data to a server.

>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.post("http://httpbin.org/post", data=payload)
>>> print(r.text)
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  ...
}
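
If the server expects a JSON body instead of form data, requests can encode the dict itself via the json keyword, which also sets the Content-Type header to application/json; a minimal sketch:

>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.post('http://httpbin.org/post', json=payload)
>>> r.json()['json']
{'key1': 'value1', 'key2': 'value2'}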

What about the other HTTP request types: PUT, DELETE, HEAD, and OPTIONS? They are all just as simple:

>>> r = requests.put('http://httpbin.org/put', data={'key': 'value'})
>>> r = requests.delete('http://httpbin.org/delete')
>>> r = requests.head('http://httpbin.org/get')
>>> r = requests.options('http://httpbin.org/get')

Response content

Binary response content

>>> r.content
b'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "key1": "value1", \n    "key2": "value2"\n  }, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Content-Length": "23", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.22.0"\n  }, \n  "json": null, \n  "origin": "58.246.203.154, 58.246.203.154", \n  "url": "https://httpbin.org/post"\n}\n
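
The usual reason for reaching for r.content is to write a binary payload, such as an image, to disk; a minimal sketch using httpbin's sample image endpoint:

import requests

r = requests.get('http://httpbin.org/image/png')
with open('image.png', 'wb') as f:   # binary mode, since r.content is bytes
    f.write(r.content)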

JSON response content

>>> import requests
>>> r = requests.get('https://api.github.com/events')
>>> r.json()
[{'repository': {'open_issues': 0, 'url': 'https://github.com/...
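
Note that r.json() parses the body whether or not the request actually succeeded, so it is worth checking the status first; a minimal sketch:

>>> r = requests.get('https://api.github.com/events')
>>> r.raise_for_status()   # raises requests.exceptions.HTTPError on 4xx/5xx
>>> data = r.json()        # raises ValueError if the body is not valid JSON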

Adding request headers

>>> url = 'https://api.github.com/some/endpoint'
>>> headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox'}
>>> r = requests.get(url, headers=headers)
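
Headers flow the other way too: the server's response headers are exposed as a case-insensitive dictionary on r.headers (the value shown here is illustrative):

>>> r.headers['content-type']   # lookups are case-insensitive
'application/json; charset=utf-8'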

Sessions and cookies preserve our login state, so the login information is still there when we move across pages or come back next time.

Session

import requests
requests.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = requests.get("http://httpbin.org/cookies")
print(r.text)

The result is:

{
  "cookies": {}
}

The cookie jar comes back empty because each bare requests.get() call is an independent request: the cookie set by the first call is not carried over to the second. But we can't be expected to log in again every time we move to another page (on a site like Taobao, for example), and this is where Session comes in.

import requests
s = requests.Session()
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get("http://httpbin.org/cookies")
print(r.text)

The result is:

{
  "cookies": {
    "sessioncookie": "123456789"
  }
}

After switching to a Session, the cookies are still there when we move on to the next request.
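
A Session is also a convenient place to hang defaults: anything set on s.headers is merged into every request the session makes. A minimal sketch (the User-Agent string is just an example):

import requests

s = requests.Session()
s.headers.update({'user-agent': 'my-crawler/0.1'})   # sent with every request
r = s.get('http://httpbin.org/headers')
print(r.text)   # the User-Agent set above appears in the echoed headers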

Cookie

If a response contains cookies, you can read them back:

>>> url = 'http://example.com/some/cookie/setting/url'
>>> r = requests.get(url)
>>> r.cookies['example_cookie_name']
'example_cookie_value'

To send cookies to the server, use the cookies parameter:

>>> url = 'http://httpbin.org/cookies'
>>> cookies = dict(cookies_are='working')
>>> r = requests.get(url, cookies=cookies)
>>> r.text
'{"cookies": {"cookies_are": "working"}}'

Timeouts

The timeout parameter limits how long requests waits for the server to respond, not how long the download takes: if no response arrives within the given number of seconds, an exception is raised.

>>> requests.get('http://github.com', timeout=0.001)
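
If you want separate limits for establishing the connection and for waiting on data, timeout also accepts a (connect, read) tuple:

>>> r = requests.get('http://github.com', timeout=(3.05, 27))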

Errors and exceptions

1. On a network problem (e.g. DNS failure or a refused connection), Requests raises a ConnectionError exception.
2. If a request times out, a Timeout exception is raised.
3. If a request exceeds the configured maximum number of redirects, a TooManyRedirects exception is raised.
4. All exceptions that Requests explicitly raises inherit from requests.exceptions.RequestException, so one handler can catch them all, as the sketch below shows.
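
A minimal sketch of handling these exceptions:

import requests

try:
    r = requests.get('http://github.com', timeout=5)
    r.raise_for_status()   # turn 4xx/5xx responses into HTTPError
except requests.exceptions.Timeout:
    print('request timed out')
except requests.exceptions.RequestException as e:
    print('request failed:', e)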

Proxies

import requests

# The scheme key ("http"/"https") must match the scheme of the URL being
# requested; an "https" entry is never used for a plain http:// URL.
proxies = {
  "http": "http://41.118.132.69:4433"
}
r = requests.post("http://httpbin.org/post", proxies=proxies)
print(r.text)
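
If the proxy requires HTTP basic auth, the credentials can be embedded in the proxy URL; a minimal sketch (hosts and credentials are placeholders):

proxies = {
    "http": "http://user:password@10.10.1.10:3128",
    "https": "http://user:password@10.10.1.10:1080",
}
r = requests.get("http://httpbin.org/ip", proxies=proxies)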
