Chapter 2: Basic Library Usage (urllib)

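urllib is Python's built-in HTTP request library. With it you can send an HTTP request by specifying just the request URL, headers, body, and similar information. urllib also turns the server's response into a Python object, from which you can conveniently read the response status code, headers, body, and so on.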

The four modules of urllib

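request: the most basic HTTP request module; it simulates sending a request. You only need to pass the URL and any extra parameters to its functions.
error: the exception-handling module.
parse: a utility module that provides many URL-handling functions, such as splitting, parsing, and joining.
robotparser: used to parse a site's robots.txt file.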

urlopen in detail

import urllib.request

url = 'http://www.python.org'
# urllib.request.urlopen() returns the response as an HTTPResponse object
response = urllib.request.urlopen(url)
# Response status code
print(response.status)
# All response headers
print(response.getheaders())
# Value of the Server response header; the author got Nginx, meaning the server runs on Nginx
print(response.getheader('Server'))
print(response.read().decode('utf-8'))

The data parameter

urllib.parse.urlencode(): converts a dictionary of parameters into a string

import urllib.parse
import urllib.request

url = 'https://www.httpbin.org/post'
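# bytes() converts the parameters into a byte stream, i.e. the bytes type
# urllib.parse.urlencode() turns the dict into a string; the second argument of bytes() specifies the encoding
# Passing data makes this a POST request; the parameters show up in the form field of the response, so a form submission was simulated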
data = bytes(urllib.parse.urlencode({'name':'germey'}), encoding='utf-8')
response = urllib.request.urlopen(url=url, data=data)
print(response.read().decode('utf-8'))
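
To see what actually ends up in the request body, the encoding step can be run on its own without touching the network. This is a minimal sketch; the 'city' field is made up for illustration and does not come from the example above.

```python
from urllib.parse import urlencode

# Hypothetical form fields; only 'name' comes from the example above
form = {'name': 'germey', 'city': 'New York'}

# urlencode() returns a str of key=value pairs joined by '&'
query_string = urlencode(form)
# urlopen() needs bytes, so the str is encoded (same as query_string.encode('utf-8'))
body = bytes(query_string, encoding='utf-8')

print(query_string)
print(body)
```

Note that the space in 'New York' is escaped as '+', which is the convention for form-encoded data.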

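The timeout parameter: sets the timeout in seconds. If no response arrives within this time, an exception is raised. If the parameter is omitted, the global default timeout is used.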

import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('https://www.httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # Runs only if the underlying cause of the exception is socket.timeout
    if isinstance(e.reason, socket.timeout):
        print('Time out')
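
The "global default" mentioned above is the socket module's default timeout. The sketch below shows how it can be read and set, and wraps the timeout check in a small helper. The helper name is_timeout and the hand-built exceptions are illustrative only. Depending on where the timeout strikes, it may also surface as a bare socket.timeout rather than a URLError, so the helper accepts both.

```python
import socket
import urllib.error

def is_timeout(exc):
    """Hypothetical helper: True if exc represents a socket timeout."""
    if isinstance(exc, socket.timeout):
        return True
    return isinstance(exc, urllib.error.URLError) and isinstance(exc.reason, socket.timeout)

# Read, change, and restore the global default used when timeout is omitted
previous = socket.getdefaulttimeout()   # None means "no timeout"
socket.setdefaulttimeout(10)
current_default = socket.getdefaulttimeout()
socket.setdefaulttimeout(previous)

# Hand-built exceptions standing in for what urlopen() would raise
simulated_timeout = urllib.error.URLError(socket.timeout('timed out'))
simulated_other = urllib.error.URLError('connection refused')

print(current_default)
print(is_timeout(simulated_timeout), is_timeout(simulated_other))
```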

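context: must be an ssl.SSLContext object; it specifies the SSL settings.
cafile and capath: specify the CA certificate file and its directory path, respectively. Both are deprecated since Python 3.6 in favor of context.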
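
As a rough sketch of how context is typically supplied: ssl.create_default_context() builds an SSLContext with certificate verification turned on, and it also accepts cafile and capath arguments of its own. The fetch helper below is hypothetical and is defined but not called, so the snippet runs without network access.

```python
import ssl
import urllib.request

# Default context: verifies the server certificate against the system CA store
context = ssl.create_default_context()

def fetch(url, timeout=10):
    """Hypothetical helper: open url using the explicit SSL context."""
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        return response.read()

# Inspect the settings the default context carries
print(context.verify_mode == ssl.CERT_REQUIRED)
print(context.check_hostname)
```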

The Request class: build a Request object and pass it to urlopen() to configure the request more richly and flexibly.

from urllib import request, parse

# POST endpoint on the same host that the Host header names
url = 'https://www.httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Host': 'www.httpbin.org'
}
# Named form_data and req so that neither the built-in dict nor the request module is shadowed
form_data = {'name': 'germey'}
data = bytes(parse.urlencode(form_data), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
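
A Request object can also be inspected before it is sent, which is a handy way to check what urlopen() would transmit. The sketch below makes no network call; it uses add_header() as an alternative to the headers argument.

```python
from urllib import parse, request

data = parse.urlencode({'name': 'germey'}).encode('utf-8')
req = request.Request('https://www.httpbin.org/post', data=data)
# add_header() adds a header after construction
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)')

print(req.get_method())   # inferred from data when method is not given
print(req.full_url)
print(req.data)
# Header names are stored capitalized, so the lookup key is 'User-agent'
print(req.get_header('User-agent'))

# Without data, the inferred method is GET
get_req = request.Request('https://www.httpbin.org/get')
print(get_req.get_method())
```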

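Handler classes
The BaseHandler class in urllib.request is the parent of all other Handler classes. Commonly used handlers and related classes include:

HTTPDefaultErrorHandler: handles HTTP response errors; every error is raised as an HTTPError exception.
HTTPRedirectHandler: handles redirects.
HTTPCookieProcessor: handles cookies.
ProxyHandler: sets proxies; when no proxies are passed in, the settings are read from the environment.
HTTPPasswordMgr: manages passwords; it maintains a table mapping usernames to passwords, and is used by the authentication handlers rather than being a Handler itself.
HTTPBasicAuthHandler: manages authentication; if opening a link requires authentication, this class can handle it.
(Important) OpenerDirector: the Opener class, which provides the open() method. An Opener can be built from Handler classes.

Requesting a page that requires authentication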

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'admin'
password = 'admin'
url = 'https://ssr3.scrape.center/'

# Create an HTTPPasswordMgrWithDefaultRealm object p
p = HTTPPasswordMgrWithDefaultRealm()
# Add the username and password with add_password()
p.add_password(None, url, username, password)
# Instantiate an HTTPBasicAuthHandler object, auth_handler
auth_handler = HTTPBasicAuthHandler(p)
# Pass auth_handler to build_opener() to build an Opener
opener = build_opener(auth_handler)
# Finally, open the link with the opener's open() method to complete authentication
try:
    result = opener.open(url)
    # Page source returned after successful authentication
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

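In short: instantiate an HTTPBasicAuthHandler, passing in an HTTPPasswordMgrWithDefaultRealm object to which the username and password have been added; build an opener from that handler; then call its open() method to get the page source.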
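
The password manager can be queried directly, which shows how the handler decides which credentials to send. A minimal offline sketch: the realm name 'Some Realm' and the example.com URL are placeholders, not values from the site above.

```python
from urllib.request import HTTPPasswordMgrWithDefaultRealm

p = HTTPPasswordMgrWithDefaultRealm()
# Realm None registers the credentials as the default for any realm
p.add_password(None, 'https://ssr3.scrape.center/', 'admin', 'admin')

# URLs under the registered prefix match, whatever realm the server names
match = p.find_user_password('Some Realm', 'https://ssr3.scrape.center/page/2')
# A different host has no entry (placeholder URL)
miss = p.find_user_password('Some Realm', 'https://example.com/')

print(match)
print(miss)
```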

Proxies

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:8080',
    'https': 'https://127.0.0.1:8080'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

ProxyHandler takes a dictionary whose keys are protocol names and whose values are proxy URLs. Build an Opener from this handler with build_opener(), then send the request.

Cookie

# Fetch the cookies
import http.cookiejar, urllib.request

url = 'http://www.baidu.com'
# Create a CookieJar object
cookie = http.cookiejar.CookieJar()
# Build a handler with HTTPCookieProcessor
handler = urllib.request.HTTPCookieProcessor(cookie)
# Build an opener with build_opener()
opener = urllib.request.build_opener(handler)
# Call open() to get the response
response = opener.open(url)
for item in cookie:
    print(item.name + '=' + item.value)

# Save the site's cookies to a file
filename = 'cookie.txt'
# Create the cookie jar with MozillaCookieJar, passing in the file name
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)
# ignore_discard: save cookies even if they are set to be discarded; ignore_expires: save cookies even if they have expired
cookie.save(ignore_discard=True, ignore_expires=True)

The code above fetches the cookies and saves them in the Mozilla browser cookie format.
LWPCookieJar can be used instead to save the cookies in LWP format.
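
To illustrate the LWP format without a network request, the sketch below builds one cookie by hand, saves it with LWPCookieJar, and loads it back. The make_cookie helper, the cookie values, and the temporary file name are all made up for the example; in real use the jar is filled by HTTPCookieProcessor as shown above.

```python
import http.cookiejar
import os
import tempfile

def make_cookie(name, value, domain):
    """Hypothetical helper: build a simple session cookie by hand."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cookie_lwp.txt')

    # Save in LWP format
    jar = http.cookiejar.LWPCookieJar(path)
    jar.set_cookie(make_cookie('session', 'abc123', 'example.com'))
    jar.save(ignore_discard=True, ignore_expires=True)

    # The first line of the file identifies the format
    with open(path) as f:
        first_line = f.readline().strip()

    # Load the file back into a fresh jar of the same class
    loaded = http.cookiejar.LWPCookieJar()
    loaded.load(path, ignore_discard=True, ignore_expires=True)
    pairs = [(c.name, c.value) for c in loaded]

print(first_line)
print(pairs)
```

A file has to be loaded by the same jar class that saved it, since the Mozilla and LWP formats differ.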

# Read cookies from a file (LWP format)
import http.cookiejar, urllib.request

url = 'http://www.baidu.com'
cookie = http.cookiejar.LWPCookieJar()
# Call load() to read the cookies; cookie.txt must have been saved by LWPCookieJar,
# because a file saved by MozillaCookieJar will not load here
cookie.load('cookie.txt', ignore_expires=True, ignore_discard=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)
print(response.read().decode('utf-8'))

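Exception handling:

URLError: comes from urllib.error and inherits from OSError. It is the base class of the error module, so any exception raised by the request module can be handled by catching it.
It has a reason attribute, which returns the cause of the error.
HTTPError: a subclass of URLError, dedicated to HTTP request errors.

The three attributes of HTTPError:
1. code: the HTTP status code
2. reason: the cause of the error
3. headers: the response headers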

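# Handling exceptions
import socket
import urllib.error
import urllib.request

# url is not defined in the original snippet; the httpbin URL used earlier is reused here
url = 'https://www.httpbin.org/get'

try:
    response = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    # Catch the subclass HTTPError first, then fall back to URLError
    print(e.code, e.reason, e.headers, sep='\n')
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print("Time out~")
else:
    print("Successful!")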
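
The order of the except clauses follows from the class hierarchy: HTTPError is a URLError, which in turn is an OSError, so the most specific class has to be tested first. The sketch below checks that hierarchy offline. The describe helper and the hand-built exceptions are illustrative; normally urlopen() raises them.

```python
import io
import urllib.error

def describe(exc):
    """Hypothetical helper: classify an exception, most specific class first."""
    if isinstance(exc, urllib.error.HTTPError):
        return 'HTTP error %d: %s' % (exc.code, exc.reason)
    if isinstance(exc, urllib.error.URLError):
        return 'URL error: %s' % exc.reason
    return 'other error'

# Hand-built exceptions for illustration only
http_exc = urllib.error.HTTPError(
    'https://www.httpbin.org/get', 404, 'Not Found', {}, io.BytesIO())
url_exc = urllib.error.URLError('connection refused')

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))
print(issubclass(urllib.error.URLError, OSError))
print(describe(http_exc))
print(describe(url_exc))
```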

The parse module: supports URLs of most protocols

urlparse: identifies a URL and splits it into its components (URL parsing)

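from urllib.parse import urlparse

url = 'https://www.baidu.com/index.html;user?id=5#comment'
# Parse the URL directly with urlparse()
result = urlparse(url)
print(type(result))
print(result)
# Full API: scheme is the default used when the URL has none;
# allow_fragments=False stops the fragment from being split off
result_no_fragment = urlparse(url, scheme='http', allow_fragments=False)
'''
Output of the two print() calls:

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
'''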

The returned result is a ParseResult, which is a named tuple, so result.scheme is equivalent to result[0] and result.netloc is equivalent to result[1].
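
The tuple behaviour, and the effect of the scheme and allow_fragments arguments, can be checked with a short offline sketch. The variable names are illustrative.

```python
from urllib.parse import urlparse

url = 'https://www.baidu.com/index.html;user?id=5#comment'
result = urlparse(url)

# Attribute access and index access refer to the same fields
same_fields = (result.scheme == result[0], result.netloc == result[1])
# Being a tuple, the result can also be unpacked into its 6 parts
scheme, netloc, path, params, query, fragment = result

# With allow_fragments=False the '#comment' part stays attached to the query
no_frag = urlparse(url, allow_fragments=False)

# scheme= is only a default: it applies when the URL itself has no scheme.
# Without a leading '//' the host is treated as part of the path.
bare = urlparse('www.baidu.com/index.html', scheme='https')

print(same_fields)
print(no_frag.query, repr(no_frag.fragment))
print(bare)
```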

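Standard URL format:
scheme://netloc/path;params?query#fragment
scheme: the protocol name, e.g. https
netloc: the domain name, e.g. www.baidu.com
path: the path, e.g. /index.html
params: parameters
query: the query string
fragment: the anchor, which locates a position within the page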

urlunparse: builds a URL from an iterable, which must have exactly 6 elements

# Build a URL
from urllib.parse import urlunparse

data = ['https', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

Result:
https://www.baidu.com/index.html;user?a=6#comment
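
Two properties of urlunparse are worth checking in a small sketch: it reverses urlparse, and it rejects input that does not have 6 elements. Absent components are passed as empty strings. The variable names are illustrative.

```python
from urllib.parse import urlparse, urlunparse

url = 'https://www.baidu.com/index.html;user?id=5#comment'
# Parsing and then rebuilding gives back the original URL
round_trip = urlunparse(urlparse(url))

# Empty strings stand in for components that are absent
no_extras = urlunparse(('https', 'www.baidu.com', '/index.html', '', '', ''))

# Only 5 items: the call fails instead of guessing the missing part
try:
    urlunparse(['https', 'www.baidu.com', 'index.html', 'a=6', 'comment'])
    error = None
except ValueError as e:
    error = e

print(round_trip)
print(no_extras)
print(type(error).__name__)
```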

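urlsplit: similar to urlparse, except that params is not split out but stays merged into path, so the result has 5 elements.
urlunsplit: similar to urlunparse, except that params is merged into path, so the URL is built from 5 components.
urljoin: takes two arguments, a base URL and a new URL. It checks whether the new URL is missing scheme, netloc, or path; any missing part is taken from the base URL and filled in to produce the final URL.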
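
The three functions above can be sketched as follows. The paths '/a/b.html' and 'c.html' and the python.org target are made-up inputs chosen to show the different urljoin cases.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# urlsplit: 5 parts, with ';user' left inside the path
parts = urlsplit('https://www.baidu.com/index.html;user?id=5#comment')

# urlunsplit: builds a URL from 5 components
rebuilt = urlunsplit(['https', 'www.baidu.com', 'index.html', 'a=6', 'comment'])

base = 'https://www.baidu.com/a/b.html'
# Relative path: resolved against the base URL's directory
joined_relative = urljoin(base, 'c.html')
# Rooted path: scheme and netloc come from the base, path from the new URL
joined_rooted = urljoin(base, '/c.html')
# Complete URL: nothing is missing, so the base is ignored
joined_absolute = urljoin(base, 'https://www.python.org/doc/')

print(len(parts), parts.path)
print(rebuilt)
print(joined_relative, joined_rooted, joined_absolute, sep='\n')
```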

urlencode: serializes a dictionary into GET request parameters

# urlencode
from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 25
}
base_url = 'https://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

Output:
https://www.baidu.com?name=germey&age=25
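
urlencode also escapes characters that would otherwise break the query string, and with doseq=True it expands a list into repeated keys. A small sketch with made-up parameters:

```python
from urllib.parse import urlencode

# Hypothetical parameters: a value containing a space and '&', plus a list value
params = {'q': 'a b&c', 'tag': ['x', 'y']}

# doseq=True emits one key=value pair per list element
encoded = urlencode(params, doseq=True)
url = 'https://www.baidu.com?' + encoded

print(encoded)
print(url)
```

Without doseq=True the list would be converted with str() and escaped as a single value, which is rarely what a server expects.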

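parse_qs: the reverse of urlencode; it deserializes a string of GET request parameters back into a dictionary.
parse_qsl: converts the parameters into a list of tuples.
quote: converts content into URL-encoded form; commonly used when a URL carries Chinese parameters, to convert the Chinese text into URL encoding first.
unquote: performs URL decoding; commonly used to turn a URL-encoded parameter back into Chinese.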
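
A short sketch of the four functions above. The query string reuses the name and age values from the urlencode example; the keyword '壁纸' ("wallpaper") and the search URL are illustrative.

```python
from urllib.parse import parse_qs, parse_qsl, quote, unquote

query = 'name=germey&age=25'
# parse_qs: each key maps to a list, because a key may repeat
as_dict = parse_qs(query)
# parse_qsl: a list of (key, value) tuples in their original order
as_list = parse_qsl(query)

# quote: percent-encode non-ASCII text (UTF-8 by default) before putting it in a URL
keyword = '壁纸'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
# unquote: reverse the encoding
decoded = unquote(url)

print(as_dict)
print(as_list)
print(url)
print(decoded)
```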
