Requests is an HTTP library for Python, built on urllib3 and released under the Apache2 open-source license. It is much more convenient than urllib.
Run the following command in a terminal:
pip3 install requests
A first example:
import requests
url = "https://www.baidu.com/"
response = requests.get(url)
print(type(response))
print(response.status_code) # status code
print(type(response.text))
print(response.text) # response body; the text may be garbled here, a fix is described below
print(response.cookies)
# the cookies are exposed directly as an attribute of the response; unlike urllib, no extra class needs to be declared
The various request methods:
import requests
requests.get("http://httpbin.org/get")
requests.post("http://httpbin.org/post")
requests.put("http://httpbin.org/put")
requests.delete("http://httpbin.org/delete")
requests.head("http://httpbin.org/get")
requests.options("http://httpbin.org/get")
import requests
response = requests.get("http://www.baidu.com")
response.encoding = response.apparent_encoding
# without the line above, the response text is very likely to be garbled
# it tells Requests how to decode the response body; in this example
# response.encoding = "utf-8" would work as well
print(response.text)
GET requests with parameters:
Parameters serve many purposes; for example, when you search for a term on Baidu, the term itself can be passed to the request as a parameter.
import requests
url = "http://httpbin.org/get"
data = {
    "name": "cf",
    "age": 22
}
response = requests.get(url, params=data)
# the parameters are listed under "args" in the response; note that they are
# passed to params here, not data; the data argument is used for POST requests
print(response.text)
In the output you can see the parameter information in dictionary form:
{
  "args": {
    "age": "22",
    "name": "cf"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.21.0"
  },
  "origin": "223.91.52.38, 223.91.52.38",
  "url": "https://httpbin.org/get?name=cf&age=22"
}
Parsing JSON, which comes up often when working with AJAX requests:
import requests
url = "http://httpbin.org/get"
response = requests.get(url)
print(type(response.text)) # type is str
print(response.text)
print(type(response.json())) # type is dict
print(response.json())
Output:
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0"
  },
  "origin": "223.91.52.38, 223.91.52.38",
  "url": "https://httpbin.org/get"
}
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.22.0'}, 'origin': '223.91.52.38, 223.91.52.38', 'url': 'https://httpbin.org/get'}
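For reference, response.json() is equivalent to calling json.loads() on the response text; a quick check:
import requests
import json

response = requests.get("http://httpbin.org/get")
# response.json() parses the body for you; json.loads(response.text) does the same
print(response.json() == json.loads(response.text)) # True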
Getting binary content, commonly used when downloading images and videos:
import requests
url = "https://github.com/favicon.ico"
response = requests.get(url)
print(type(response.text)) # str
print(type(response.content)) # bytes
print(response.text)
print(response.content)
with open("github.png", "wb") as f: # use mode "wb" to write binary data
    f.write(response.content) # the with statement closes the file automatically; no f.close() needed
# once the request succeeds, the image file appears in the working directory
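For large files such as videos, you may not want the whole body in memory at once. A minimal sketch of a streamed download using stream=True and iter_content(); the URL is the same favicon as above and the output filename is arbitrary:
import requests

url = "https://github.com/favicon.ico"
# stream=True defers downloading the body until it is iterated over
response = requests.get(url, stream=True)
with open("github_streamed.ico", "wb") as f:
    for chunk in response.iter_content(chunk_size=1024): # read 1024 bytes at a time
        if chunk: # skip keep-alive chunks
            f.write(chunk)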
Adding headers lets a request masquerade as a normal browser, making it less likely to be identified as a crawler and blocked.
import requests
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}
url = "https://www.zhihu.com/explore"
response_a = requests.get(url)
response_b = requests.get(url,headers=headers)
# only the request with the headers correctly added returns the normal status code 200
print(response_a.status_code)
print(response_b.status_code)
Output
400
200
POST requests can carry form data, information that does not show up in the URL:
import requests
data = {
    "name": "germey",
    "age": 22
}
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}
url = "http://httpbin.org/post"
response = requests.post(url,data=data,headers=headers)
print(response.text)
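Besides form data, Requests can serialize a dict into a JSON request body via the json parameter; httpbin then echoes it back under the "json" field rather than "form". A small sketch:
import requests

data = {"name": "germey", "age": 22}
# json=data sets the Content-Type to application/json and serializes the dict
response = requests.post("http://httpbin.org/post", json=data)
print(response.text)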
Response attributes:
import requests
url = "http://www.jianshu.com"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}
response = requests.get(url, headers=headers)
# without the headers argument, jianshu.com refuses the request with status code 403
print(type(response.status_code), response.status_code)
print(type(response.headers), response.headers)
print(type(response.cookies), response.cookies)
print(type(response.url), response.url)
print(type(response.history), response.history)
Checking status codes:
import requests
# 200 ("ok", "okay", "all_ok", etc.)
# 404 ("not_found", ...) and so on; every status code has one or more named
# aliases, accessible as requests.codes.<name>, where <name> is the alias string
print(requests.codes.ok) #200
print(requests.codes.okay) #200
print(requests.codes.not_found) #404
More status code aliases are listed in the official Requests documentation.
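A typical use of these aliases is to compare against response.status_code instead of hard-coding the number:
import requests

response = requests.get("http://httpbin.org/get")
if response.status_code == requests.codes.ok:
    print("Request succeeded")
else:
    print("Request failed with status code", response.status_code)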
File uploads involve a POST request:
import requests
url = "http://httpbin.org/post"
files = {"file": open("github.png","rb")}
response = requests.post(url,files=files)
print(response.text)
# the uploaded byte stream appears under the "files" field of the printed response
Getting cookies is much simpler than with urllib; there is no need to declare a lot of objects:
import requests
url = "https://www.baidu.com"
response = requests.get(url)
print(response.cookies)
print(response.cookies.items())
for key, value in response.cookies.items():
    print(key + "=" + value)
Output
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
[('BDORZ', '27315')]
BDORZ=27315
Session maintenance: cookies can be used to keep a session going.
import requests
requests.get("http://httpbin.org/cookies/set/number/123456789")
# set a cookie
response = requests.get("http://httpbin.org/cookies")
# read the cookies back
print(response.text)
Output
{"cookies": {}} # the two requests are separate commands, independent of each other, so the output is empty
A successful example:
import requests
s = requests.Session()
s.get("http://httpbin.org/cookies/set/number/123456789")
response = s.get("http://httpbin.org/cookies")
print(response.text)
Output
{
  "cookies": {
    "number": "123456789"
  }
}
# both requests go through the same Session instance, i.e. the same session, so the cookie can be retrieved
Consequently, requests.Session() can be used to simulate logging in to a site.
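A minimal sketch of that idea; the login URL, form field names, and credentials below are placeholders for illustration, not a real site:
import requests

s = requests.Session()
# hypothetical login endpoint and credentials
login_data = {"username": "user", "password": "pass"}
s.post("https://example.com/login", data=login_data)
# the session stores any cookies set during login,
# so subsequent requests through s are authenticated
response = s.get("https://example.com/profile")
print(response.status_code)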
Certificate verification: a site's certificate may not be issued by an officially trusted authority, so requests, especially over HTTPS, sometimes fail with an error. Passing verify=False (the default is True) skips certificate verification and avoids this.
import requests
response = requests.get("https://www.12306.cn",verify=False)
print(response.status_code)
# as of this article's completion, requests to the 12306 site succeed even without this;
# the earlier certificate errors no longer occur, so the site has presumably fixed its certificate
Output
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
200
To suppress the extra warning messages and keep only the status code (as in the example above), do the following:
import urllib3
urllib3.disable_warnings()
import requests
response = requests.get("https://www.12306.cn",verify=False)
print(response.status_code)
# these two extra lines silence the warnings, so the output is just 200
Alternatively, leave verification on and supply certificates locally: verify also accepts the path to a trusted CA bundle, and the cert parameter takes the path to a client-side certificate (or a (cert, key) tuple). Since the request is then fully verified, no warning is emitted.
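A sketch of both options; the file paths are placeholders you would replace with your own certificates:
import requests

# verify against a locally stored CA bundle instead of the system defaults
response = requests.get("https://www.12306.cn", verify="/path/to/ca_bundle.pem")
# or supply a client-side certificate as a (cert, key) tuple
# response = requests.get("https://example.com", cert=("/path/client.crt", "/path/client.key"))
print(response.status_code)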
Proxy settings: these require special conditions (a proxy server you can actually reach), so only a sketch is given below.
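A minimal sketch of the proxies parameter, assuming a proxy is listening at 127.0.0.1:7890 (a placeholder address):
import requests

proxies = {
    "http": "http://127.0.0.1:7890",
    "https": "http://127.0.0.1:7890",
}
# the request is routed through the proxy instead of connecting directly
response = requests.get("http://httpbin.org/get", proxies=proxies)
print(response.status_code)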
Timeout settings: pass the timeout parameter (in seconds), e.g. timeout=1. If the server does not respond within the limit, an exception is raised (see exception handling below).
import requests
url = "https://www.baidu.com/"
response = requests.get(url,timeout=1)
print(response.status_code)
Authentication: if a URL requires authentication before it can be accessed, pass auth=(account, password).
import requests
url = "需要认证的网址"
response = requests.get(url,auth=(account,password))
print(response.status_code)
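For a concrete test, httpbin exposes a basic-auth endpoint whose accepted user name and password are encoded in the URL itself:
import requests

# http://httpbin.org/basic-auth/<user>/<passwd> accepts exactly those credentials
response = requests.get("http://httpbin.org/basic-auth/user/passwd", auth=("user", "passwd"))
print(response.status_code) # 200; wrong credentials would return 401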
Exception handling: much the same as with urllib; catch the more specific subclass exceptions first, then the parent classes, as in the sketch below.
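A minimal sketch, forcing a failure with a deliberately tiny timeout; ReadTimeout, ConnectionError, and RequestException all live in requests.exceptions, and RequestException is the parent class of the others:
import requests
from requests.exceptions import ReadTimeout, ConnectionError, RequestException

try:
    response = requests.get("http://httpbin.org/get", timeout=0.01)
    print(response.status_code)
except ReadTimeout:
    # the server accepted the connection but was too slow to respond
    print("Timeout")
except ConnectionError:
    # the connection itself failed (DNS error, refused, connect timeout)
    print("Connection error")
except RequestException:
    # parent class: catches anything else Requests can raise
    print("Error")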
The new semester has just started and I have a lot going on, so updates will be slower. Lately I have been learning to write clustering algorithms on my own.