Using the urllib Library

Official documentation: https://docs.python.org/3/library/urllib.html

urllib is Python's built-in HTTP request library.
It consists of the following modules:

urllib.request: opening and reading URLs (the request module)
urllib.error: exceptions raised by urllib.request
urllib.parse: parsing URLs
urllib.robotparser: parsing robots.txt files
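
As a quick orientation, the sketch below touches each of the four modules without sending anything over the network. The example.com URLs and the robots.txt rules are made up for illustration and do not come from any real site.

```python
import urllib.error
import urllib.parse
import urllib.request
import urllib.robotparser

# urllib.parse: split a URL into its components (hypothetical URL).
parts = urllib.parse.urlparse('https://example.com/search?q=python#top')
print(parts.scheme, parts.netloc, parts.path, parts.query)

# urllib.request: build a request object; nothing is sent until urlopen() is called.
request = urllib.request.Request('https://example.com/search?q=python')
print(request.get_method(), request.host)

# urllib.error: HTTPError is a subclass of URLError.
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))

# urllib.robotparser: parse robots.txt rules given as a list of lines (made-up rules).
robots = urllib.robotparser.RobotFileParser()
robots.parse(['User-agent: *', 'Disallow: /private/'])
blocked = robots.can_fetch('*', 'https://example.com/private/page')
allowed = robots.can_fetch('*', 'https://example.com/index.html')
print(blocked, allowed)
```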

urlopen

The signature of urllib.request.urlopen is:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Using the url parameter

Let's start with a simple example: fetching the Baidu home page.

import urllib.request

response=urllib.request.urlopen('https://www.baidu.com')
print(response.read().decode('utf-8'))

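Running it prints the HTML source of the Baidu home page.

urlopen has three commonly used parameters:
urllib.request.urlopen(url, data, timeout)
response.read() returns the content of the page. Without read(), printing the response shows only the response object itself (an http.client.HTTPResponse instance), not the page content: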

>>> import urllib.request

>>> response=urllib.request.urlopen('https://www.baidu.com')
>>> print(response)

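Using the data parameter

The example above fetched Baidu with a GET request. Next we send a POST request with urllib.
We use http://httpbin.org for the demonstration (a site that is handy for practicing with urllib because it can simulate all kinds of requests); the example posts to its /post endpoint.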

import urllib.request
import urllib.parse 

data=bytes(urllib.parse.urlencode({'world':'hello'}),encoding='utf8')
response=urllib.request.urlopen('http://httpbin.org/post',data=data)
print(response.read())

The output is:

b'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "world": "hello"\n  }, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Content-Length": "11", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.7"\n  }, \n  "json": null, \n  "origin": "218.2.216.7, 218.2.216.7", \n  "url": "https://httpbin.org/post"\n}\n'

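This is where urllib.parse comes in: bytes(urllib.parse.urlencode(...), encoding='utf8') converts the POST data into bytes that can be passed as the data parameter of urllib.request.urlopen, which completes a POST request.
So when the data parameter is supplied, the request is sent as POST; without it, the request is sent as GET.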
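
The rule above can be checked without sending anything. The sketch below builds two urllib.request.Request objects, one without and one with data, and asks each for its HTTP method; using Request objects here is only a way to inspect the method offline, on the assumption that urlopen applies the same rule to the arguments it is given.

```python
import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'
form = {'world': 'hello'}

# urlencode() returns a str; the data parameter must be bytes, so encode it.
data = urllib.parse.urlencode(form).encode('utf-8')

get_request = urllib.request.Request(url)              # no data
post_request = urllib.request.Request(url, data=data)  # with data

print(data)                       # b'world=hello'
print(get_request.get_method())   # GET
print(post_request.get_method())  # POST
```

str.encode('utf-8') is equivalent to the bytes(..., encoding='utf8') form used in the example above.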

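In Python 3, a string (str) is a sequence of Unicode characters, while data sent over the network is bytes in some encoding. Unicode therefore serves as the intermediate form when converting between encodings: first decode the bytes from their source encoding into a str, then encode the str into the target encoding.
decode converts bytes in a given encoding into a Unicode str.
encode converts a Unicode str into bytes in a given encoding.
So before transcoding, first work out which encoding the bytes are in, decode them into a str, and then encode the str into the other encoding.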
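
A minimal sketch of this round trip, assuming Python 3 and using GBK and UTF-8 as the two example encodings (the sample text is arbitrary):

```python
text = '你好, urllib'                 # str: Unicode characters

utf8_bytes = text.encode('utf-8')    # str -> bytes in UTF-8
gbk_bytes = text.encode('gbk')       # str -> bytes in GBK

# The same text yields different byte sequences in different encodings.
print(len(utf8_bytes), len(gbk_bytes))

# Transcoding GBK -> UTF-8: decode to str first, then encode again.
converted = gbk_bytes.decode('gbk').encode('utf-8')
print(converted == utf8_bytes)

# Decoding with the wrong codec fails (or silently produces garbage).
try:
    utf8_bytes.decode('ascii')
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
print(wrong_codec_failed)
```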

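Using the timeout parameter

When the network is poor or the server misbehaves, a request can be slow or fail outright. In that case we should set a timeout on the request rather than let the program wait for a result indefinitely. For example: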

import urllib.request
response=urllib.request.urlopen('https://www.baidu.com',timeout=1)
print(response.read())

The output is:

b'\r\n\r\n\t\r\n\r\n\r\n\t\r\n\r\n'

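With a 1-second timeout the request returns normally. If we lower the timeout to 0.1 seconds, the program fails with a timeout error.

The exception comes from the urllib.error module, and its cause is the timeout. Setting a timeout therefore lets us skip a page that takes too long to respond. This can be done with a try/except statement: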

import urllib.request
import urllib.error
import socket

try:
  response=urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except  urllib.error.URLError as e:
  if isinstance(e.reason,socket.timeout):
    print('TIME OUT')

The output is:

>>> import urllib.request
>>> import urllib.error
>>> import socket
>>>
>>> try:
...   response=urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
... except  urllib.error.URLError as e:
...   if isinstance(e.reason,socket.timeout):
...     print('TIME OUT')
...
TIME OUT
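
The skip-on-timeout idea can be wrapped in a small helper. This is a sketch under stated assumptions, not part of urllib: fetch_or_skip and its opener argument are hypothetical names introduced here, and opener exists only so the behaviour can be demonstrated without a network connection. The helper also catches socket.timeout directly, because a timeout that occurs while reading the body may not be wrapped in URLError.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_skip(url, timeout=1.0, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the request timed out.

    `opener` is a hypothetical hook for testing; the real urlopen is the default.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # A timeout while connecting arrives wrapped in URLError.
        if isinstance(e.reason, socket.timeout):
            return None
        raise  # any other URL error is not ours to swallow
    except socket.timeout:
        # A timeout while reading the body can be raised directly.
        return None


def always_times_out(url, timeout):
    """Stand-in opener that simulates a connection timeout (illustration only)."""
    raise urllib.error.URLError(socket.timeout('timed out'))


result = fetch_or_skip('http://httpbin.org/get', timeout=0.1, opener=always_times_out)
print(result)  # None: the page is skipped instead of crashing the program
```

In a crawler loop, a None result would simply mean "move on to the next URL".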

The response (type, status code, and headers)

Again using Baidu as the example:

import urllib.request

response=urllib.request.urlopen('https://www.baidu.com')
print(type(response))
print(response.status)
print(response.getheaders())
print(response.getheader("server"))

The output is:

>>> import urllib.request
>>>
>>> response=urllib.request.urlopen('https://www.baidu.com')
>>> print(type(response))
<class 'http.client.HTTPResponse'>
>>> print(response.status)
200
>>> print(response.getheaders())
[('Accept-Ranges', 'bytes'), ('Cache-Control', 'no-cache'), ('Content-Length', '227'), ('Content-Type', 'text/html'), ('Date', 'Mon, 18 Mar 2019 07:17:22 GMT'), ('Etag', '"5c7cdb1f-e3"'), ('Last-Modified', 'Mon, 04 Mar 2019 08:00:31 GMT'), ('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "'), ('Pragma', 'no-cache'), ('Server', 'BWS/1.1'), ('Set-Cookie', 'BD_NOT_HTTPS=1; path=/; Max-Age=300'), ('Set-Cookie', 'BIDUPSID=D75C49BB12A16CA06062A0EC354A25F1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'PSTM=1552893442; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Strict-Transport-Security', 'max-age=0'), ('X-Ua-Compatible', 'IE=Edge,chrome=1'), ('Connection', 'close')]
>>> print(response.getheader("server"))
BWS/1.1

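response.status, response.getheaders(), and response.getheader("server") give the status code and header information, and response.read() returns the response body. Calling urlopen with a bare URL only covers simple requests, because it offers no way to add header information. As later crawler code will show, we often need to send headers to reach the target site, and that is where the urllib.request.Request class comes in.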

Request

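The urlopen() method can issue the most basic requests, but its few simple parameters are not enough to build a complete request.

Setting Headers

To keep crawlers from overloading them, many sites only respond to requests that carry certain headers; the most common one is User-Agent. Below is a simple example that builds the request with a Request object.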

import urllib.request

request=urllib.request.Request('https://www.baidu.com')
response=urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

The output is the page's HTML source, so the fetch clearly succeeded.
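
The example above passes only the URL to Request. A minimal sketch of attaching headers follows; the User-Agent and Accept-Language values are placeholders chosen for illustration, not values any particular site requires. The network call is left commented out so the snippet can be run offline.

```python
import urllib.request

# Placeholder header values, for illustration only.
headers = {'User-Agent': 'Mozilla/5.0 (compatible; example-crawler/0.1)'}

# Headers can be supplied when the Request is created...
request = urllib.request.Request('https://www.baidu.com', headers=headers)

# ...or added afterwards, one at a time.
request.add_header('Accept-Language', 'en-US,en;q=0.8')

# Request stores header names capitalized, e.g. 'User-agent'.
print(request.header_items())
print(request.get_header('User-agent'))

# Uncomment to actually send the request:
# response = urllib.request.urlopen(request, timeout=5)
# print(response.read().decode('utf-8'))
```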
