Web Scraping Study Notes (3): A download Function for Fetching a Website

I. Downloading a web page
1. Version 1.0:

from urllib.request import urlopen
def download(url):                          
    html=urlopen(url).read()
    return html
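
It helps to know that read() returns bytes, not str. The following is a minimal sketch that can be run offline: it assumes a throwaway local file served through a file:// URL, and the file name and HTML content are made up for illustration. The with statement is an addition that closes the response after reading.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen

def download(url):
    # Same idea as version 1.0, but the response is closed once it has been read
    with urlopen(url) as response:
        return response.read()

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical local page so the example needs no network access
    page = Path(tmp) / 'index.html'
    page.write_text('<html><body>hello</body></html>', encoding='utf-8')
    html = download(page.as_uri())

print(type(html))            # the raw content is bytes
print(html.decode('utf-8'))  # decode it to get text
```

For a real site the encoding may not be UTF-8, so the decode step would need to match the page.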

2. That is not very concise or easy to follow, so it was upgraded.
Version 1.1:

def download(url):
    print('Downloading:',url)
    return urlopen(url).read()

3. Guard against a crash when fetching the page fails.
Version 2.0:

def download(url):
    print('Downloading:',url)
    try:
        html=urlopen(url).read()
    except Exception as e:
        html=None
        #not every exception has a .reason attribute, so fall back to the exception itself
        print('Download error:',getattr(e,'reason',e))
    return html
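
One thing to watch with a broad except Exception: only URLError and its subclasses carry a reason attribute, while other exceptions (for example the ValueError that urlopen raises for a malformed URL) do not. The sketch below shows one possible way to report both kinds safely. The two URLs are made-up examples chosen so that nothing touches the network.

```python
from urllib.request import urlopen

def download(url):
    print('Downloading:', url)
    try:
        html = urlopen(url).read()
    except Exception as e:
        html = None
        # Use e.reason when it exists, otherwise print the exception itself
        print('Download error:', getattr(e, 'reason', e))
    return html

# A string with no scheme makes urlopen raise ValueError (no .reason attribute)
bad_url_result = download('not-a-url')

# A local file that does not exist makes urlopen raise URLError (has .reason)
missing_file_result = download('file:///this/path/should/not/exist-wswp-demo.html')
```

In both cases the function returns None instead of crashing.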

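4. Failures usually show up as one of two kinds of status code, 404 or 5xx (2xx means success). A 5xx code means the server hit an error, and in that case we want to retry the request. (404 means the requested page does not exist, so requesting it again generally gives the same result.)
Version 2.1 (adds retries):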

def download(url,num_retry=2):
    print('Downloading:',url)
    try:
        html=urlopen(url).read()
    except Exception as e:
        html=None
        #not every exception has a .reason attribute, so fall back to the exception itself
        print('Download error:',getattr(e,'reason',e))
        if num_retry>0:
            #retry only on 5xx server errors
            if hasattr(e,'code') and 500<=e.code<600:
                return download(url,num_retry=num_retry-1)
    return html
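
The retry rule can be looked at in isolation. This is a small sketch, not part of the original function: should_retry is a hypothetical helper name, and it only restates the condition used above (a 5xx code while retries remain).

```python
def should_retry(code, num_retry):
    """Return True only for 5xx codes while retries remain (hypothetical helper)."""
    return num_retry > 0 and code is not None and 500 <= code < 600

for code in (200, 404, 500, 503):
    print(code, should_retry(code, num_retry=2))
# 200 False
# 404 False
# 500 True
# 503 True
```

With num_retry=0 the helper returns False for every code, which is what stops the recursion in download.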

5. The user agent. It is sent in the request headers; to change the headers, build the request with the Request class.
Version 3.0 (final version):

from urllib.request import Request,urlopen
def download(url,User_agent='wswp',num_retry=2):
    print('Downloading:',url)
    headers={'User-agent':User_agent}
    request=Request(url,headers=headers)
    try:
        html=urlopen(request).read()
    except Exception as e:
        html=None
        #not every exception has a .reason attribute, so fall back to the exception itself
        print('Download error:',getattr(e,'reason',e))
        if num_retry>0:
            if hasattr(e,'code') and 500<=e.code<600:
                #keep the same user agent when retrying
                return download(url,User_agent,num_retry-1)
    return html
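
To check what a Request will carry before anything is sent, the header can be read back from the object. A minimal sketch, assuming the placeholder URL http://example.com/ (no connection is made here):

```python
from urllib.request import Request

# Building a Request does not open a connection; it only describes the request
request = Request('http://example.com/', headers={'User-agent': 'wswp'})

print(request.has_header('User-agent'))  # True
print(request.get_header('User-agent'))  # wswp
```

Request stores header names in capitalized form ('User-agent'), so that is the spelling to use when reading the header back.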

6. Bring in the urllib.error module to handle the errors:
Version 4.0:

from urllib.request import Request,urlopen
from urllib.error import URLError

def download(url,User_agent='wswp',num_retry=2):
    print('Downloading:',url)
    headers={'User-agent':User_agent}
    request=Request(url,headers=headers)
    try:
        html=urlopen(request).read()
    except URLError as e:#catch URLError to inspect the failure
        html=None
        print('Download error:',e.reason)
        if num_retry>0:
            #only HTTPError (a subclass of URLError) has a code attribute
            if hasattr(e,'code') and 500<=e.code<600:
                #keep the same user agent when retrying
                return download(url,User_agent,num_retry-1)
    return html
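
The hasattr(e,'code') check matters because a plain URLError (for example a refused connection) has a reason but no code, while HTTPError, a subclass of URLError, has both. The sketch below builds the two exceptions by hand to show the difference; the URL, the messages, and the empty response body are made up for illustration.

```python
import io
from urllib.error import HTTPError, URLError

print(issubclass(HTTPError, URLError))  # True

# Hand-built exceptions standing in for real failures
network_error = URLError('connection refused')
server_error = HTTPError('http://example.com/', 503, 'Service Unavailable',
                         None, io.BytesIO())

for e in (network_error, server_error):
    # Same condition as in download: retry only when a 5xx code is present
    retryable = hasattr(e, 'code') and 500 <= e.code < 600
    print(type(e).__name__, e.reason, retryable)
```

Catching URLError therefore covers both cases, and only the HTTP 5xx case triggers a retry.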

7. Downloading through a proxy.
Version 5.0:

from urllib.request import Request,ProxyHandler,build_opener
from urllib.parse import urlparse
from urllib.error import URLError

def download(url,User_agent='wswp',proxy=None,num_retry=2):
    print('Downloading:',url)
    headers={'User-agent':User_agent}
    request=Request(url,headers=headers)
    #to support a proxy server, open the page with our own opener instead of urlopen
    opener=build_opener()
    #if a proxy is set, add it to the opener
    if proxy:
        proxy_params={urlparse(url).scheme:proxy}
        opener.add_handler(ProxyHandler(proxy_params))#our own opener now goes through the proxy server
    #when no proxy is set, opening the page this way works like urlopen
    try:
        #urlopen and opener.open(request) both return a file-like response object,
        #with a read() method and a getcode() method (the HTTP status code)
        html=opener.open(request).read()
    except URLError as e:#catch URLError to inspect the failure
        html=None
        print('Download error:',e.reason)
        if num_retry>0:
            if hasattr(e,'code') and 500<=e.code<600:
                #keep the same user agent and proxy when retrying
                return download(url,User_agent,proxy,num_retry-1)
    return html
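
The proxy mapping is keyed by the URL scheme, so an https URL needs an 'https' entry and an http URL an 'http' entry. Here is a small sketch of that step on its own. make_opener is a hypothetical helper name, the proxy address 127.0.0.1:8080 is a placeholder, and passing the handler straight to build_opener is a variation on the add_handler call used above. Nothing is fetched.

```python
from urllib.parse import urlparse
from urllib.request import ProxyHandler, build_opener

def make_opener(url, proxy=None):
    """Return an opener that routes the URL's scheme through proxy, if one is given."""
    if proxy:
        # e.g. {'https': 'http://127.0.0.1:8080'} for an https URL
        proxy_params = {urlparse(url).scheme: proxy}
        return build_opener(ProxyHandler(proxy_params))
    return build_opener()

url = 'https://example.com/page'
scheme = urlparse(url).scheme
opener = make_opener(url, proxy='http://127.0.0.1:8080')

print(scheme)  # https
```

The returned opener is used the same way as in download: opener.open(request).read().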
