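Python Web Scraping Notes: requests

Basic usage of the Requests library

Requests is an HTTP library written in Python, built on urllib3 and released under the Apache 2.0 license. It is more convenient to use than urllib.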

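Contents

  • Basic usage of the Requests library
    • Installation
    • Basic GET requests
    • Parsing JSON
    • Fetching binary data
    • Adding headers
    • POST requests
    • Response attributes
    • File upload
    • Getting cookies
    • Session persistence
    • Certificate verification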

Installation

Run the following command in a terminal:

pip3 install requests

A first example:

import requests
url = "https://www.baidu.com/"
response = requests.get(url)
print(type(response))
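print(response.status_code)  # status code
print(type(response.text))
print(response.text)  # response body; it may come out garbled here, and a fix is shown further down
print(response.cookies)
# The cookies are exposed directly as an attribute of the response; unlike urllib, no extra classes need to be declared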

The various request methods:

import requests
requests.get("http://httpbin.org/get")
requests.post("http://httpbin.org/post")
requests.put("http://httpbin.org/put")
requests.delete("http://httpbin.org/delete")
requests.head("http://httpbin.org/get")
requests.options("http://httpbin.org/get")

Basic GET requests

import requests
response = requests.get("http://www.baidu.com")
response.encoding = response.apparent_encoding
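# Without the line above, the response body is very likely to come out garbled.
# That line tells Requests which encoding to use when decoding the body; in this example
# response.encoding = "utf-8" would work as well.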
print(response.text)
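
To see why the wrong encoding produces garbled text, here is a minimal offline sketch that uses only the standard library. It assumes the server sent UTF-8 bytes without declaring a charset, in which case Requests falls back to ISO-8859-1 for text responses. The sample string is made up for illustration and is not taken from a real page.

```python
# Illustrative only: the sample text is a made-up stand-in for a page body.
original = "百度一下"
raw_bytes = original.encode("utf-8")  # the bytes the server actually sends

# Decoding with the wrong codec does not raise; it silently yields mojibake.
garbled = raw_bytes.decode("iso-8859-1")
# Decoding with the right codec recovers the text.
recovered = raw_bytes.decode("utf-8")

print(garbled)
print(recovered)
```

Setting response.encoding before reading response.text amounts to choosing the codec used in the second decode call.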

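GET requests with parameters:

Parameters can mean many things. For example, when you search for a term on Baidu, you can pass that term as a parameter of the request.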

import requests
url = "http://httpbin.org/get"
data = {
    "name":"cf",
    "age":22
}
response = requests.get(url,params=data)
# The parameters show up under "args". Note that they are passed via params, not data;
# the data argument is only used for POST requests.
print(response.text)

In the output, the parameters appear as a dictionary under "args":

{
  "args": {
    "age": "22", 
    "name": "cf"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.21.0"
  }, 
  "origin": "223.91.52.38, 223.91.52.38", 
  "url": "https://httpbin.org/get?name=cf&age=22"
}
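
As a rough sketch of what params does, the request can be built without sending it and the final URL inspected. This uses requests.Request and its prepare() method, which the original notes do not cover, so treat it as a supplementary illustration rather than part of the original walkthrough.

```python
import requests

url = "http://httpbin.org/get"
data = {"name": "cf", "age": 22}

# Build the request without sending it, then look at the URL Requests would use.
prepared = requests.Request("GET", url, params=data).prepare()
print(prepared.url)
```

The dictionary is URL-encoded and appended as the query string, which is why httpbin echoes the values back under "args".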

Parsing JSON

Commonly used when working with Ajax requests.

import requests
url = "http://httpbin.org/get"
response = requests.get(url)
print(type(response.text)) # str
print(response.text)
print(type(response.json())) # dict
print(response.json())

Output

<class 'str'>
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.22.0"
  }, 
  "origin": "223.91.52.38, 223.91.52.38", 
  "url": "https://httpbin.org/get"
}

<class 'dict'>
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.22.0'}, 'origin': '223.91.52.38, 223.91.52.38', 'url': 'https://httpbin.org/get'}
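
response.json() parses the response body as JSON. When the body is not valid JSON (an HTML error page, for instance), parsing fails. The sketch below shows the same idea with the standard json module on plain strings, so it runs offline; the helper name parse_json_body is hypothetical and not part of Requests.

```python
import json


def parse_json_body(text):
    """Return the parsed JSON value, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


print(parse_json_body('{"args": {}, "url": "https://httpbin.org/get"}'))
print(parse_json_body("<html>not json</html>"))
```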

Fetching binary data

Commonly used when downloading images or video.

import requests
url = "https://github.com/favicon.ico"
response = requests.get(url)
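print(type(response.text)) # str: the body decoded as text
print(type(response.content)) # bytes: the raw binary body
print(response.text)
print(response.content)
with open("github.png","wb") as f:   # use mode "wb" when writing binary data
    f.write(response.content)
# The with block closes the file, so no explicit f.close() is needed
# After a successful request, the image file can be found in the current folder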
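
For larger files such as video, holding the whole body in response.content can use a lot of memory. One possible approach, sketched below, is to stream the body in chunks with stream=True and iter_content. The function name download_file, the chunk size, and the timeout are assumptions chosen for illustration.

```python
import requests


def download_file(url, path, chunk_size=8192):
    """Stream url to path in chunks and return the number of bytes written."""
    written = 0
    # stream=True defers downloading the body until it is iterated over.
    with requests.get(url, stream=True, timeout=10) as response:
        response.raise_for_status()  # stop early on 4xx/5xx responses
        with open(path, "wb") as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                written += len(chunk)
    return written
```

It could be called as download_file("https://github.com/favicon.ico", "github.png") to reproduce the example above.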

Adding headers

Headers let the request pose as a normal browser, so it is less likely to be identified as a crawler and blocked.

import requests
headers = {
"user-agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}
url = "https://www.zhihu.com/explore"
response_a = requests.get(url)
response_b = requests.get(url,headers=headers)
# Here only the request that sets the headers returns status code 200
print(response_a.status_code)
print(response_b.status_code)

Output

400
200
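
When many requests need the same headers, one option is to set them once on a Session instead of passing headers= every time. The sketch below only prepares a request, without sending it, to show which User-Agent would go out. Using session.headers for this is an addition to the original notes.

```python
import requests

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
)

# Without custom headers, Requests identifies itself as python-requests/<version>.
print(requests.utils.default_user_agent())

session = requests.Session()
session.headers.update({"User-Agent": USER_AGENT})

# Prepare (but do not send) a request through the session and inspect its headers.
request = requests.Request("GET", "https://www.zhihu.com/explore")
prepared = session.prepare_request(request)
print(prepared.headers["User-Agent"])
```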

POST requests

A POST request can send information that does not appear in the URL, such as form data.

import requests
data = {
    "name":"germey",
    "age":22
}
headers = {
"user-agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}
url = "http://httpbin.org/post"
response = requests.post(url,data=data,headers=headers)
print(response.text)
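
To make the difference from params concrete, the sketch below prepares the same POST without sending it and prints where the data ends up. It relies on requests.Request and prepare(), which the original notes do not use, so read it as a supplementary illustration.

```python
import requests

data = {"name": "germey", "age": 22}

# Build the POST request without sending it.
prepared = requests.Request("POST", "http://httpbin.org/post", data=data).prepare()

print(prepared.url)                      # the URL carries no query string
print(prepared.headers["Content-Type"])  # the body is form-encoded
print(prepared.body)                     # the fields travel in the request body
```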

Response attributes

import requests
url = "http://www.jianshu.com"
response = requests.get(url)   
# Without headers, this request is rejected with status code 403
print(type(response.status_code),response.status_code)
print(type(response.headers),response.headers)
print(type(response.cookies),response.cookies)
print(type(response.url),response.url)
print(type(response.history),response.history)

Checking the status code:

import requests
# Each status code has one or more named aliases, for example
# 200 (ok, okay, all_ok, ...) and 404 (not_found, ...).
# They are accessed as requests.codes.<name>.
print(requests.codes.ok)  #200
print(requests.codes.okay)  #200
print(requests.codes.not_found)  #404

More status code names can be found in the official Requests documentation.
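
A small sketch of how these names might be used to branch on a result. The helper describe_status and its return strings are hypothetical, chosen only for illustration.

```python
import requests


def describe_status(status_code):
    """Map a numeric status code to a short label using requests.codes."""
    if status_code == requests.codes.ok:
        return "ok"
    if status_code == requests.codes.not_found:
        return "not found"
    return "other"


print(describe_status(200))
print(describe_status(404))
print(describe_status(500))
```

In real code, status_code would come from response.status_code.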

File upload

Uploads are sent with a POST request.

import requests
url = "http://httpbin.org/post"
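with open("github.png","rb") as f:   # the with block closes the file after the upload
    files = {"file": f}
    response = requests.post(url,files=files)
print(response.text)
# The uploaded bytes appear in the printed response under "files", in the "file" field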
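
Passing files= makes Requests encode the body as multipart/form-data. The sketch below shows this offline by preparing the request with an in-memory stand-in for the image. The fake file content and the explicit "image/png" content type are assumptions for illustration.

```python
import io

import requests

# An in-memory stand-in for the real image file (illustrative bytes only).
fake_image = io.BytesIO(b"not really a PNG")
files = {"file": ("github.png", fake_image, "image/png")}

# Build the upload request without sending it.
prepared = requests.Request("POST", "http://httpbin.org/post", files=files).prepare()

print(prepared.headers["Content-Type"])  # multipart/form-data with a boundary
print(len(prepared.body))                # size of the encoded multipart body
```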

Getting cookies

This is much simpler than with urllib; there is no need to declare a series of objects.

import requests
url = "https://www.baidu.com"
response = requests.get(url)
print(response.cookies)
print(response.cookies.items())
for key,value in response.cookies.items():
    print(key+"="+value)

Output

<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
[('BDORZ', '27315')]
BDORZ=27315

Session persistence

Cookies can be used to keep a session alive.

import requests
requests.get("http://httpbin.org/cookies/set/number/123456789") 
# set a cookie
response = requests.get("http://httpbin.org/cookies")
# read the cookies back
print(response.text)

Output

{"cookies": {}}  # empty, because the cookie is set and read in two independent requests that share no state

A working example:

import requests
s = requests.Session()
s.get("http://httpbin.org/cookies/set/number/123456789")
response = s.get("http://httpbin.org/cookies")
print(response.text)

Output

{
  "cookies": {
    "number": "123456789"
  }
}
# Both requests now go through the same Session instance, so the cookie is kept

requests.Session() can therefore be used to simulate a login.
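
The same behaviour can be observed offline. In the sketch below a cookie is placed in a Session's cookie jar by hand (standing in for the one httpbin would set), and two requests are prepared without being sent: one through the session and one standalone. Setting the cookie manually is an assumption made so that the example needs no network access.

```python
import requests

url = "http://httpbin.org/cookies"

session = requests.Session()
# Stand-in for the cookie that the /cookies/set/... endpoint would store.
session.cookies.set("number", "123456789")

# A request prepared through the session picks up the stored cookie...
with_session = session.prepare_request(requests.Request("GET", url))
# ...while a standalone request knows nothing about it.
without_session = requests.Request("GET", url).prepare()

print(with_session.headers.get("Cookie"))
print(without_session.headers.get("Cookie"))
```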

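Certificate verification

A site's certificate may not have been issued by a trusted authority, so requests, HTTPS ones in particular, sometimes fail. Passing False to the verify parameter skips certificate verification and avoids the error; the parameter defaults to True.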

import requests
response = requests.get("https://www.12306.cn",verify=False)
print(response.status_code)
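# Up to the time of writing, though, requests to the 12306 site succeeded;
# the earlier certificate errors no longer occur, so the site has presumably fixed its certificate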

Output

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
200

To suppress the extra warning output and keep only the status code (as in the example above), do the following:

import urllib3
urllib3.disable_warnings()  
import requests
response = requests.get("https://www.12306.cn",verify=False)
print(response.status_code)
# These two extra lines suppress the warning, so the output is just 200

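Alternatively, keep verification enabled and pass the path of a local CA bundle as a string to verify (for example verify="/path/to/ca.pem"); since the request is then verified, no warning is raised. Note that the separate cert parameter is for a client-side certificate, not for verifying the server.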
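
urllib3.disable_warnings() silences the warning for the whole process. If that is too broad, one possible alternative is to hide only InsecureRequestWarning, and only around a single call, using the standard warnings module. The helper name get_without_verification and the timeout value are assumptions for illustration.

```python
import warnings

import requests
from urllib3.exceptions import InsecureRequestWarning


def get_without_verification(url, timeout=5):
    """GET url with verify=False, hiding only the InsecureRequestWarning."""
    with warnings.catch_warnings():
        # The filter applies only inside this with block.
        warnings.simplefilter("ignore", InsecureRequestWarning)
        return requests.get(url, verify=False, timeout=timeout)
```

Other warnings, and unverified requests made elsewhere in the program, are still reported.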

Proxy settings: a special case, omitted here.

Timeouts: pass the timeout parameter, a number of seconds, for example timeout=1

import requests
url = "https://www.baidu.com/"
response = requests.get(url,timeout=1)
print(response.status_code)
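
When the timeout is exceeded, Requests raises an exception rather than returning a response. A minimal sketch of catching it follows; the helper name fetch_status and the choice to return None on timeout are assumptions for illustration.

```python
import requests


def fetch_status(url, timeout=1):
    """Return the status code, or None if the request timed out."""
    try:
        return requests.get(url, timeout=timeout).status_code
    except requests.exceptions.Timeout:
        # Covers both connection timeouts and read timeouts.
        return None
```

For example, fetch_status("https://www.baidu.com/", timeout=1) returns the status code when the server responds within the limit.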

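Authentication: if a URL requires authentication before it can be accessed, pass auth=(account, password)

import requests
url = "URL that requires authentication"  # placeholder
account, password = "your_account", "your_password"  # placeholder credentials
response = requests.get(url,auth=(account,password))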
print(response.status_code)
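
Passing a two-element tuple as auth makes Requests use HTTP Basic authentication. The sketch below prepares such a request without sending it and compares the resulting Authorization header with a hand-built one. The credentials are made up, and the httpbin basic-auth URL is used only as a plausible target.

```python
import base64

import requests

# Hypothetical credentials, for illustration only.
account, password = "user", "passwd"
url = "http://httpbin.org/basic-auth/user/passwd"

# Build the request without sending it.
prepared = requests.Request("GET", url, auth=(account, password)).prepare()

# Basic auth is "Basic " followed by base64("account:password").
expected = "Basic " + base64.b64encode(f"{account}:{password}".encode()).decode()

print(prepared.headers["Authorization"])
print(prepared.headers["Authorization"] == expected)
```

Because the credentials are only base64-encoded, not encrypted, Basic authentication should be used over HTTPS.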

Exception handling: much the same as with urllib. Catch the subclass exceptions first, then the parent class.
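
A sketch of that ordering with the exception classes in requests.exceptions, where RequestException is the common parent. The helper name safe_get and its return strings are hypothetical. The final call uses a deliberately malformed URL so that it fails before any network access.

```python
import requests


def safe_get(url, timeout=3):
    """Fetch url and describe the outcome as a short string."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into HTTPError
        return f"ok: {response.status_code}"
    # Subclasses first...
    except requests.exceptions.Timeout:
        return "timed out"
    except requests.exceptions.ConnectionError:
        return "connection failed"
    except requests.exceptions.HTTPError as exc:
        return f"bad status: {exc.response.status_code}"
    # ...then the parent class, which catches everything else Requests raises.
    except requests.exceptions.RequestException as exc:
        return f"request error: {type(exc).__name__}"


# A URL without a scheme is rejected before any network traffic happens.
print(safe_get("not-a-url"))
```

If the parent class were listed first, the more specific branches below it would never run.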

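The term has just started and things are busy, so updates will be a bit slower. Lately I have been learning to write clustering code myself.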
