requests
requests: automates HTTP requests and fetches HTML pages
Robots Exclusion Standard
robots.txt
Beautiful Soup
parses HTML pages
regular expression library (re)
extracts the page data you need
web crawler framework
Scrapy*
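As a quick illustration of where the regular expression library fits in, here is a minimal sketch that pulls link targets out of an HTML fragment; the fragment itself is made up for the example:

```python
import re

# A made-up HTML fragment standing in for a fetched page
html = ('<a href="http://www.baidu.com">百度一下</a> '
        '<a href="http://www.example.com">example</a>')

# Extract every href value with a regular expression
links = re.findall(r'href="([^"]+)"', html)
print(links)  # ['http://www.baidu.com', 'http://www.example.com']
```

In practice Beautiful Soup handles the HTML parsing and regular expressions pick out fine-grained text patterns within it.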
Python development tools - choosing an IDE
Common Python IDE tools
Text-editor IDEs
IDLE **recommended for learning, ships with Python
Notepad++
Sublime Text **recommended for learning, pleasant editing experience
Vim & Emacs
Atom
Komodo Edit
Integrated development IDEs
PyCharm **recommended for learning
Wing, commercially maintained and paid, suited to multi-person development
PyDev & Eclipse
Visual Studio
Anaconda & Spyder **recommended for learning
Canopy
Installing requests
run in cmd
pip install requests
Testing the requests library by fetching the Baidu homepage
Python 3.5.4 (v3.5.4:3f56838, Aug 8 2017, 02:17:05) [MSC v.1900 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import requests
>>> r = requests.get("http://www.baidu.com")
>>> r.status_code
200
>>> r.encoding = 'utf-8'
>>> r.test
Traceback (most recent call last):
File "", line 1, in
r.test
AttributeError: 'Response' object has no attribute 'test'
>>> r.text
'\r\n 百度一下,你就知道 \r\n'
>>>
>>> type(r)
>>> r.headers
{'Server': 'bfe/1.0.8.18', 'Content-Type': 'text/html', 'Pragma': 'no-cache', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Date': 'Sun, 08 Mar 2020 05:17:08 GMT', 'Transfer-Encoding': 'chunked', 'Last-Modified': 'Mon, 23 Jan 2017 13:28:24 GMT', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform'}
>>> r.apparent_encoding
'utf-8'
>>> r.url
'http://www.baidu.com/'
>>> r.content
b'\r\n \xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93 \xe5\x85\xb3\xe4\xba\x8e\xe7\x99\xbe\xe5\xba\xa6 About Baidu
©2017 Baidu \xe4\xbd\xbf\xe7\x94\xa8\xe7\x99\xbe\xe5\xba\xa6\xe5\x89\x8d\xe5\xbf\x85\xe8\xaf\xbb \xe6\x84\x8f\xe8\xa7\x81\xe5\x8f\x8d\xe9\xa6\x88 \xe4\xba\xacICP\xe8\xaf\x81030173\xe5\x8f\xb7
\r\n'
Main methods of the requests library
requests.request()  constructs a request; the base method underlying all the methods below
requests.get()      fetches a page, corresponding to HTTP GET
Notes
r = requests.get(url)
r = requests.get(url, params=None, **kwargs)
requests.head()     as above, corresponding to HTTP HEAD
requests.post()     as above, corresponding to HTTP POST
requests.put()      ...
requests.patch()    ....
requests.delete()   ...
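Each convenience method maps onto one HTTP verb. This mapping can be checked without touching the network by building Request objects and preparing them; httpbin.org here is just a placeholder URL and nothing is actually sent:

```python
import requests

prepared_methods = []
for method in ('GET', 'HEAD', 'POST', 'PUT', 'PATCH', 'DELETE'):
    # Request(...).prepare() builds the outgoing request without sending it
    req = requests.Request(method, 'http://httpbin.org/anything').prepare()
    prepared_methods.append(req.method)

print(prepared_methods)  # ['GET', 'HEAD', 'POST', 'PUT', 'PATCH', 'DELETE']
```

Internally, each of these convenience functions delegates to requests.request() with the verb filled in.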
requests.get()
requests.get() fetches a page, corresponding to HTTP GET
Notes
common form:  r = requests.get(url)
full form:    r = requests.get(url, params=None, **kwargs)
requests.get() returns a Response object (the content the crawler fetched back); internally, get() constructs a Request object
Response object attributes
r.status_code        HTTP status code of the response
r.text               response body as a string, i.e. the page content at the url
r.encoding           response encoding guessed from the HTTP headers
r.apparent_encoding  encoding inferred from the content itself (fallback encoding)
r.content            response body in binary (bytes) form
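The interplay between r.encoding and r.text can be shown offline by hand-building a Response object. This pokes at the private _content field purely for illustration; a real Response comes back from requests.get():

```python
import requests

r = requests.models.Response()
r.status_code = 200
r._content = '百度一下,你就知道'.encode('utf-8')  # pretend these bytes came off the wire

r.encoding = 'ISO-8859-1'   # a wrong guess from the headers: text comes out garbled
garbled = r.text

r.encoding = 'utf-8'        # the correct encoding: text now decodes cleanly
print(r.text)               # 百度一下,你就知道
```

This is exactly why the Baidu session above sets r.encoding = 'utf-8' before reading r.text.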
Exceptions in the requests library
requests.ConnectionError   network connection error, e.g. DNS lookup failure, connection refused
requests.HTTPError         HTTP error
requests.URLRequired       missing URL
requests.TooManyRedirects  exceeded the maximum number of redirects
requests.ConnectTimeout    timed out connecting to the remote server
requests.Timeout           the whole request timed out
The r.raise_for_status() method
raises requests.HTTPError if the status code indicates failure (a 4xx or 5xx code rather than a successful 200)
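A small offline sketch of raise_for_status(), again hand-building a Response just to trigger the exception; normally the object would come from requests.get():

```python
import requests

r = requests.models.Response()
r.status_code = 404  # simulate a Not Found response

try:
    r.raise_for_status()          # 4xx/5xx status codes raise requests.HTTPError
    outcome = 'ok'
except requests.HTTPError:
    outcome = 'HTTPError raised'

print(outcome)  # HTTPError raised
```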
A generic code framework for fetching pages
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError if the status is not 200
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "an exception occurred"

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))
Submitting a search keyword
>>> import requests
>>> kv = {'wd':'python'}
>>> r = requests.get("http://www.baidu.com/s", params = kv)
>>> r.status.code
Traceback (most recent call last):
File "", line 1, in
r.status.code
AttributeError: 'Response' object has no attribute 'status'
>>> r.status_code
200
>>> r.requsts.url
Traceback (most recent call last):
File "", line 1, in
r.requsts.url
AttributeError: 'Response' object has no attribute 'requsts'
>>> r.request.url
'https://wappass.baidu.com/static/captcha/tuxing.html?&ak=c27bbc89afca0463650ac9bde68ebe06&backurl=https%3A%2F%2Fwww.baidu.com%2Fs%3Fwd%3Dpython&logid=9598849199803382055&signature=797c47c36db95cc3c41178f8a93aca0d&timestamp=1583814411'
>>> len(r.text)
1519
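The session above shows Baidu redirecting the keyword query to a captcha page, but the params mechanism itself can be verified offline by preparing the request and inspecting the URL it would send; nothing is transmitted here:

```python
import requests

kv = {'wd': 'python'}
# prepare() encodes the params dict into the query string without sending the request
req = requests.Request('GET', 'http://www.baidu.com/s', params=kv).prepare()
print(req.url)  # http://www.baidu.com/s?wd=python
```

This is the URL that r.request.url reported in the session, before the server redirected it.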