I. Downloading a web page
1. Version 1.0:
from urllib.request import urlopen

def download(url):
    # Fetch the page and return its raw bytes
    html = urlopen(url).read()
    return html
2. That version is neither concise nor easy to follow, so here is an upgrade.
Version 1.1:
def download(url):
    print('Downloading:', url)
    return urlopen(url).read()
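3. Keep the program from crashing when fetching a page fails.
Version 2.0:
def download(url):
    print('Downloading:', url)
    try:
        html = urlopen(url).read()
    except Exception as e:
        html = None
        # Not every exception has a .reason attribute, so fall back to the exception itself
        print('Download error:', getattr(e, 'reason', e))
    return html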
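
One way to watch the error handling work without touching the network is to point the function at a local file:// URL that does not exist. This is only a sketch: the path below is made up and is assumed to be missing on your machine.

```python
from urllib.request import urlopen

def download(url):
    print('Downloading:', url)
    try:
        html = urlopen(url).read()
    except Exception as e:
        html = None
        print('Download error:', getattr(e, 'reason', e))
    return html

# Hypothetical path, assumed not to exist on this machine
result = download('file:///no/such/dir/missing-page.html')
print(result is None)
```

The function prints the error and returns None instead of raising, so the caller can decide what to do next.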
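4. Failed requests usually come back with status code 404 or 5xx (2xx means success). A 5xx code signals a server-side error, so in that case it is worth trying the request again. (404 means the requested page does not exist, so requesting it again usually gets the same result.)
Version 2.1 (adds retries):
def download(url, num_retry=2):
    print('Downloading:', url)
    try:
        html = urlopen(url).read()
    except Exception as e:
        html = None
        print('Download error:', getattr(e, 'reason', e))
        if num_retry > 0:
            # Retry only on 5xx server errors
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, num_retry=num_retry - 1)
    return html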
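
To check the retry rule without a real server, the sketch below swaps urlopen for a stand-in. The `fetch` parameter, `flaky_fetch` and `missing_fetch` are hypothetical helpers invented for this illustration; only the retry logic mirrors the version above.

```python
from urllib.error import HTTPError

def download(url, fetch, num_retry=2):
    # `fetch` (hypothetical) plays the role of urlopen(url).read()
    try:
        html = fetch(url)
    except Exception as e:
        html = None
        if num_retry > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download(url, fetch, num_retry=num_retry - 1)
    return html

calls = []

def flaky_fetch(url):
    # Raises a 503 on the first two calls, then succeeds
    calls.append(url)
    if len(calls) < 3:
        raise HTTPError(url, 503, 'Service Unavailable', None, None)
    return b'<html>ok</html>'

not_found_calls = []

def missing_fetch(url):
    # Always raises a 404, which should not be retried
    not_found_calls.append(url)
    raise HTTPError(url, 404, 'Not Found', None, None)

page = download('http://example.com', flaky_fetch)
missing = download('http://example.com/missing', missing_fetch)
print(len(calls), len(not_found_calls))
```

With num_retry=2 the function makes at most three attempts on a 5xx error, and only one attempt on a 404.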
5. The user agent: set it in the request headers, using Request to build the customized request.
Version 3.0 (final version):
from urllib.request import Request, urlopen

def download(url, User_agent='wswp', num_retry=2):
    print('Downloading:', url)
    headers = {'User-agent': User_agent}
    request = Request(url, headers=headers)
    try:
        html = urlopen(request).read()
    except Exception as e:
        html = None
        print('Download error:', getattr(e, 'reason', e))
        if num_retry > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # Pass User_agent along so the retry uses the same header
                return download(url, User_agent, num_retry - 1)
    return html
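
A quick way to confirm that the header really ends up on the request is to build a Request and inspect it, with no network involved. The URL here is just a placeholder.

```python
from urllib.request import Request

# Placeholder URL; nothing is actually fetched
request = Request('http://example.com', headers={'User-agent': 'wswp'})

# Request stores header names in capitalized form, e.g. 'User-agent'
print(request.has_header('User-agent'))
print(request.get_header('User-agent'))
```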
6. Bring in the urllib.error module to handle the error more precisely:
Version 4.0:
from urllib.request import Request, urlopen
from urllib.error import URLError

def download(url, User_agent='wswp', num_retry=2):
    print('Downloading:', url)
    headers = {'User-agent': User_agent}
    request = Request(url, headers=headers)
    try:
        html = urlopen(request).read()
    except URLError as e:  # catch URLError instead of every Exception
        html = None
        print('Download error:', e.reason)
        if num_retry > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # Pass User_agent along so the retry uses the same header
                return download(url, User_agent, num_retry - 1)
    return html
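
The hasattr(e, 'code') check is there because only HTTPError, a subclass of URLError, carries a status code; a plain URLError (for example a failed connection) has only a reason. The small sketch below builds both kinds of error by hand to show the difference. The URL and messages are made-up values.

```python
from urllib.error import HTTPError, URLError

# Hand-built errors with made-up values, only to inspect their attributes
http_error = HTTPError('http://example.com', 503, 'Service Unavailable', None, None)
url_error = URLError('connection refused')

print(issubclass(HTTPError, URLError))
print(http_error.code, http_error.reason)
print(hasattr(url_error, 'code'), url_error.reason)
```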
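7. Downloading through a proxy.
Version 5.0:
from urllib.request import Request, ProxyHandler, build_opener
from urllib.parse import urlparse
from urllib.error import URLError

def download(url, User_agent='wswp', proxy=None, num_retry=2):
    print('Downloading:', url)
    headers = {'User-agent': User_agent}
    request = Request(url, headers=headers)
    # To support a proxy server, stop using urlopen and open the page with our own opener
    opener = build_opener()
    # If a proxy was given, add it to the opener
    if proxy:
        proxy_params = {urlparse(url).scheme: proxy}
        opener.add_handler(ProxyHandler(proxy_params))  # register the proxy server with our opener
    # Without a proxy, opening the page this way behaves the same as urlopen
    try:
        # urlopen and opener.open(request) both return a file-like response object
        # with a read() method and a code attribute (the HTTP status code)
        html = opener.open(request).read()
    except URLError as e:  # catch URLError instead of every Exception
        html = None
        print('Download error:', e.reason)
        if num_retry > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # Pass User_agent and proxy along so the retry uses the same settings
                return download(url, User_agent, proxy, num_retry - 1)
    return html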
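
The proxy setup can be looked at on its own, without sending any request. In the sketch below the proxy address is a made-up assumption; the point is only that urlparse(url).scheme picks the key ('http' or 'https') under which the proxy is registered.

```python
from urllib.parse import urlparse
from urllib.request import ProxyHandler, build_opener

url = 'http://example.com/index.html'
proxy = 'http://127.0.0.1:8080'  # hypothetical proxy address

# Map the URL's scheme to the proxy, e.g. {'http': 'http://127.0.0.1:8080'}
proxy_params = {urlparse(url).scheme: proxy}

handler = ProxyHandler(proxy_params)
opener = build_opener()
opener.add_handler(handler)

print(proxy_params)
print(handler in opener.handlers)
```

Because the key comes from the URL being downloaded, a proxy given for an http:// page is not applied to https:// requests made through a different call.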