Table of Contents
What is Urllib
urlopen
Sending a GET request and reading the response body
Sending a POST request
Checking whether an error is a timeout
Response
Response type
Status code and response headers
Request (passing Headers)
Same result as Example 1
Method 1
Method 2
Handler
Cookie
Getting cookies
Saving cookies
Loading cookies
Exception handling
URL parsing
urlparse
urlunparse
urlencode
Urllib is the most basic request library: Python's built-in HTTP request library, made up of the following modules.
| Module | Description |
| --- | --- |
| urllib.request | request module |
| urllib.error | exception handling module |
| urllib.parse | URL parsing module |
| urllib.robotparser | robots.txt parsing module |
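As a quick orientation (not part of the original examples), the sketch below touches each submodule once; the python.org robots.txt URL is just an illustrative target.

import urllib.error
import urllib.parse
import urllib.request
import urllib.robotparser

print(urllib.parse.quote('hello world'))  # parse: percent-encode text -> 'hello%20world'

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.python.org/robots.txt')
try:
    rp.read()  # request + robotparser: fetch and parse robots.txt
    print(rp.can_fetch('*', 'https://www.python.org/'))
except urllib.error.URLError as e:  # error: handle network failures
    print(e.reason)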
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
url: the URL to request
data: the POST request body; supplying it turns the request into a POST
timeout: timeout in seconds
Example 1
import urllib.request
response = urllib.request.urlopen('http://www.python.org')
print(response.read().decode('utf-8'))  # read() returns the response body as bytes; decode it to str
This sends a GET request and reads the content of the response body.
Example 2
import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')  # encode a dict as the POST body
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
Passing data makes this a POST request.
Example 3
import socket
import urllib.error
import urllib.request
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)  # timeout: seconds to wait before giving up
except urllib.error.URLError as e:  # catch the exception
    if isinstance(e.reason, socket.timeout):  # check whether the error is a timeout
        print('TIME OUT')
This checks whether the error type is a timeout.
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(type(response))
Output:
<class 'http.client.HTTPResponse'>
An HTTPResponse object carries the status code and the response headers.
Status code and response headers
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(response.status)
print(response.getheaders())  # response headers, returned as a list of tuples
print(response.getheader('Server'))  # a single header by name
Output:
200
[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'SAMEORIGIN'), ('x-xss-protection', '1; mode=block'), ('X-Clacks-Overhead', 'GNU Terry Pratchett'), ('Via', '1.1 varnish'), ('Content-Length', '48990'), ('Accept-Ranges', 'bytes'), ('Date', 'Fri, 18 Jan 2019 02:47:58 GMT'), ('Via', '1.1 varnish'), ('Age', '2096'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2126-IAD, cache-lax8625-LAX'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '37, 179'), ('X-Timer', 'S1547779678.133551,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
nginx
Example 4
import urllib.request
request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
The result is identical to Example 1.
A Request object makes it easier to specify the request method, add headers, and attach extra data. Method 1 passes the headers dict directly to the Request constructor:
from urllib import request,parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)',
    'Host': 'httpbin.org'
}
dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict),encoding='utf8')
req = request.Request(url=url,data=data,headers=headers,method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
Output:
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "Germey"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "11",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/4.0(compatible;MSIE 5.5;Windows NT)"
  },
  "json": null,
  "origin": "183.200.46.48",
  "url": "http://httpbin.org/post"
}
Purpose: pass custom Headers with the request.
Advantage: the logic is clearly structured.
Note that the form data must be URL-encoded and converted to bytes before it is sent.
Method 2 builds the Request first and then attaches the headers with add_header():
from urllib import request,parse
url = 'http://httpbin.org/post'
dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict),encoding='utf8')
req = request.Request(url=url,data=data,method='POST')
req.add_header('User-Agent', 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
The output is the same as with Method 1.
Typical use case: when there are several header key-value pairs, add them with a for loop, as sketched below.
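A minimal sketch of that for-loop pattern, reusing the URL and form data from Method 2 (the headers dict here is illustrative):

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)',
    'Host': 'httpbin.org'
}
data = bytes(parse.urlencode({'name': 'Germey'}), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
for key, value in headers.items():
    req.add_header(key, value)  # add each header key-value pair in turn
response = request.urlopen(req)
print(response.read().decode('utf-8'))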
A Handler can be used to route requests through an IP proxy, which helps keep the crawler from being blocked.
Example
import urllib.request
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.2',
    'https': 'https://127.0.0.2'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://baidu.com')
print(response.read())
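If the proxy should apply to every later request, the opener can also be installed globally with urllib.request.install_opener (a sketch; the proxy addresses are the same placeholders as above):

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.2',    # placeholder proxy address
    'https': 'https://127.0.0.2'
})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)                  # make this opener the default
response = urllib.request.urlopen('http://baidu.com')  # now routed through the proxy
print(response.status)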
A cookie is a text file stored on the client side that records the user's identity.
In crawling, cookies are the mechanism that keeps a login session alive across requests.
import http.cookiejar,urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + '=' + item.value)
Output:
BAIDUID=79AFFAB6229EB9D1A62E3B973D8BB693:FG=1
BIDUPSID=79AFFAB6229EB9D1A62E3B973D8BB693
H_PS_PSSID=1444_21103_28328_28131_26350_28267_27244
PSTM=1547805103
delPer=0
BDSVRTM=0
BD_HOME=0
Save the cookies to a file in the current directory.
Format 1
import http.cookiejar,urllib.request
filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)  # Mozilla/Firefox cookies.txt format
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True,ignore_expires=True)
Format 2
import http.cookiejar,urllib.request
filename = "cookie.txt"
cookie = http.cookiejar.LWPCookieJar(filename)  # libwww-perl (LWP) format
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True,ignore_expires=True)
Load the cookies with the same CookieJar class that was used to save them.
import http.cookiejar,urllib.request
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
Exception handling keeps the program running normally even when a request fails.
from urllib import request,error
try:
    response = request.urlopen('https://blog.csdn.net/qiao39gs/1234')
except error.URLError as e:
    print(e.reason)
Output:
Not Found
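urllib.error also provides HTTPError, a subclass of URLError that exposes the status code and response headers. A sketch of the usual pattern of catching the more specific error first (same URL as above):

from urllib import request, error

try:
    response = request.urlopen('https://blog.csdn.net/qiao39gs/1234')
except error.HTTPError as e:   # HTTP-level errors: 404, 500, ...
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:    # lower-level errors: DNS failure, refused connection, ...
    print(e.reason)
else:
    print('Request Successfully')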
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
Purpose: split a URL into its components.
from urllib.parse import urlparse
result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')
print(type(result),result)
Output:
<class 'urllib.parse.ParseResult'> ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
scheme: the protocol (http, https, ...)
netloc: the domain name
path: the access path
params: the parameters (the part after ';')
query: the query string (the part after '?')
fragment: the anchor (the part after '#')
When the URL itself carries no protocol, the scheme argument supplies a default; if the URL already contains a protocol, the argument has no effect (a small sketch follows the allow_fragments example below).
allow_fragments: when set to False, the fragment is not split out but is merged forward into the preceding component.
from urllib.parse import urlparse
result = urlparse('https://www.baidu.com/index.html#comment',allow_fragments=False)
print(result)
Output:
ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')
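And a quick sketch of the scheme argument described above; note that when the URL has no leading '//', the host ends up in path rather than netloc:

from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
# ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')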
Purpose: assemble a URL from its components (the inverse of urlparse).
from urllib.parse import urlunparse
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=5', 'comment']
print(urlunparse(data))
Output:
http://www.baidu.com/index.html;user?a=5#comment
Purpose: convert a dict into GET request parameters (a query string).
from urllib.parse import urlencode
params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
Output:
http://www.baidu.com?name=germey&age=22