Using the Basic Libraries: urllib.urlopen

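urllib is Python's built-in HTTP request library. It contains the following four modules:

request: the most basic HTTP request module, used to simulate sending requests.

error: the exception-handling module. If a request fails, you can catch the exception and then retry or take other action so that the program does not terminate unexpectedly.

parse: a utility module that provides many URL-handling functions, such as splitting, parsing, and joining.

robotparser: mainly used to read a site's robots.txt file and determine which pages may be crawled and which may not. It is used relatively rarely.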
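
As a quick, offline illustration of the two helper modules, the sketch below splits and joins a URL with urllib.parse and checks a crawl rule with urllib.robotparser. The example.com URLs and the robots.txt rules are made up for illustration, and nothing is fetched over the network.

```python
from urllib.parse import urlparse, urljoin
from urllib.robotparser import RobotFileParser

# Split a URL into its components (hypothetical URL, for illustration only).
parts = urlparse('https://example.com/docs/index.html?lang=en#intro')
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# Resolve a relative link against a base URL.
joined = urljoin('https://example.com/docs/index.html', 'faq.html')
print(joined)

# Feed robots.txt rules in directly instead of downloading them.
robots = RobotFileParser()
robots.parse(['User-agent: *', 'Disallow: /private/'])
blocked = robots.can_fetch('*', 'https://example.com/private/data.html')
allowed = robots.can_fetch('*', 'https://example.com/docs/index.html')
print(blocked, allowed)
```

In a real crawler you would normally point RobotFileParser at the site's robots.txt with set_url() and read() rather than passing the rules in by hand.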

1. urlopen()

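The urllib.request module provides the most basic way to construct an HTTP request. With it you can simulate how a browser initiates a request.
Take the official Python website as an example:

# Using urlopen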
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

This prints the HTML source of the page:

[Figure: screenshot of the output]

Use type() to print the type of the response:

# Use type() to print the type of the response
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(type(response))

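The output is:
<class 'http.client.HTTPResponse'>
The response is an HTTPResponse object. Its main methods include read(), readinto(), getheader(name), getheaders(), and fileno(), and its attributes include msg, version, status, reason, debuglevel, and closed.
For example, calling read() returns the content of the page, and reading the status attribute gives the status code of the response:

# Calling methods and reading attributes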
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

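This prints the status code, the list of response headers, and the value of the Server header:

[Figure: screenshot of the output]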
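
Because the response object holds an open connection, it is often used as a context manager so that it is closed automatically. The following is a minimal sketch along those lines; the helper name summarize_response is made up for illustration and is not part of urllib.

```python
import urllib.request

def summarize_response(url):
    """Return (status, Server header, body length) for a URL.

    Hypothetical helper for illustration; not part of urllib.
    """
    # The with-statement closes the response when the block ends.
    with urllib.request.urlopen(url) as response:
        body = response.read()  # raw bytes of the page
        return response.status, response.getheader('Server'), len(body)
```

Calling summarize_response('https://www.python.org') would return the status code, the Server header value, and the page size in bytes; the actual values depend on the server.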

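The urlopen function's API:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

data parameter
The data parameter is optional. If it is used, the value must be converted to a byte stream with bytes(). In addition, once data is supplied, the request method becomes POST.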

import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post',data = data)
print(response.read())

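This passes one parameter, word, with the value hello. It has to be converted to the bytes type. The bytes() function is used for this: its first argument must be a str, so urlencode() from the urllib.parse module converts the parameter dict into a string; the second argument specifies the encoding. The output is as follows: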
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "word": "hello"\n }, \n "headers": {\n "Accept-Encoding": "identity", \n "Content-Length": "10", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "httpbin.org", \n "User-Agent": "Python-urllib/3.8", \n "X-Amzn-Trace-Id": "Root=1-60f81561-6376ab2566bb9e4525f0204e"\n }, \n "json": null, \n "origin": "111.164.173.185", \n "url": "http://httpbin.org/post"\n}\n'
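
To make the two conversion steps visible without sending anything, this sketch prints the intermediate values and then builds Request objects to check which HTTP method urllib would choose. The Request class does not appear in the text above; it is introduced here only to inspect the method offline.

```python
import urllib.parse
import urllib.request

params = {'word': 'hello'}

# Step 1: dict -> query string (str).
query = urllib.parse.urlencode(params)
print(query)            # word=hello

# Step 2: str -> bytes, the type the data parameter expects.
data = bytes(query, encoding='utf8')
print(data)             # b'word=hello'

# With data present the request method is POST; without it, GET.
with_data = urllib.request.Request('http://httpbin.org/post', data=data)
without_data = urllib.request.Request('http://httpbin.org/get')
print(with_data.get_method(), without_data.get_method())   # POST GET
```

The encoded body is 10 bytes long, which matches the Content-Length of 10 echoed back in the output above.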
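timeout parameter
The timeout parameter sets the timeout in seconds: if the request exceeds this time without receiving a response, an exception is raised.

# Set a timeout; with no response after 0.05 s, a URLError is raised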
import urllib.request
response = urllib.request.urlopen('http://httpbin.org/get',timeout=0.05)
print(response.read())

Result: a URLError is raised.
You can therefore set a timeout so that a page which does not respond for a long time is skipped, using a try/except statement.

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get',timeout = 0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason,socket.timeout):
        print('TIME OUT')

The output is TIME OUT.
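
One caveat: depending on where the timeout occurs, urlopen may wrap it in a URLError (as above) or let a socket.timeout propagate directly, for example when the server accepts the connection but is slow to answer. A hedged sketch of a helper that handles both cases follows; the name fetch_or_none is made up for illustration.

```python
import socket
import urllib.error
import urllib.request

def fetch_or_none(url, timeout):
    """Return the response body, or None if the request timed out.

    Hypothetical helper for illustration; not part of urllib.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # Timeout wrapped in a URLError.
        if isinstance(e.reason, socket.timeout):
            return None
        raise  # some other URL error: let the caller see it
    except socket.timeout:
        # Timeout raised directly, e.g. while waiting for the response.
        return None
```

A crawler loop could call fetch_or_none(url, timeout=5) for each page and simply skip the pages for which it returns None.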
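Other parameters:
The context parameter must be of type ssl.SSLContext and specifies the SSL settings.
The cafile and capath parameters specify a CA certificate file and the directory containing CA certificates, respectively. Both have been deprecated since Python 3.6 in favor of context.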
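
As a minimal sketch of the context parameter, the snippet below builds a default SSLContext with the standard ssl module and passes it to urlopen inside a helper. The helper name open_with_context and its default timeout are assumptions made for illustration.

```python
import ssl
import urllib.request

# A context that loads the system's default CA certificates and has
# certificate and hostname verification enabled.
context = ssl.create_default_context()

def open_with_context(url, timeout=10):
    """Open an HTTPS URL with an explicit SSLContext (illustrative helper)."""
    return urllib.request.urlopen(url, timeout=timeout, context=context)
```

To trust a specific CA bundle instead, ssl.create_default_context() also accepts cafile and capath arguments, which covers what the deprecated urlopen parameters did.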
