A Simple Web Crawler (Encapsulated)

I. The Web Crawling Process

Step 1: Analyze the requirements.

Step 2: Based on the requirements, choose the target page (specify the URL).

Step 3: Fetch the site's data to the local machine.

Step 4: Locate the data of interest within the page.

Step 5: Store the data (e.g., in MySQL or Redis).
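
To make steps 3 through 5 concrete, here is a minimal sketch of the fetch, locate, and store pipeline. The URL, the JSON shape, and the Redis connection details are placeholder assumptions for illustration, not part of the code in this article:

# Pipeline sketch. Assumptions: the URL is hypothetical and returns JSON;
# a local Redis server is listening on the default port.
import json
from urllib import request

import redis  # third-party package: pip install redis

resp = request.urlopen('http://example.com/data.json')  # step 3: fetch
data = json.loads(resp.read().decode('utf-8'))          # step 4: locate
r = redis.Redis(host='localhost', port=6379, db=0)
r.set('crawler:latest', json.dumps(data))               # step 5: store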

II. Code Implementation

The wrapper function below works through six steps:

Step 1: Pass in the url
Step 2: Set the user_agent string
Step 3: Build the headers
Step 4: Define the Request
Step 5: Call urlopen
Step 6: Return the byte array

1、导入包 

# Imports
from urllib import request, parse
from urllib.error import HTTPError, URLError

2. Define the request-method functions

# GET request helper
def get(url, headers=None):
    return urlrequests(url, headers=headers)

# POST request helper
def post(url, form, headers=None):
    return urlrequests(url, form, headers=headers)
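
As a quick usage check, calling the get() helper (which relies on the urlrequests() wrapper defined in the next subsection) might look like this; the URL here is just an example:

# Example call to the get() helper defined above
html_bytes = get('http://www.baidu.com')
print(html_bytes[:100])  # first 100 bytes of the raw response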

3. A simple wrapper around the crawl logic


# The encapsulated crawl function
def urlrequests(url, form=None, headers=None):
    # Pretend to be a regular browser so the site does not reject the request
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'

    if headers is None:
        headers = {'User-Agent': user_agent}

    html_bytes = b''
    try:
        if form:
            # POST request
            # (1) urlencode the form dict into a str
            form_str = parse.urlencode(form)
            # (2) encode the str into bytes
            form_bytes = form_str.encode('utf-8')
            # build the request that carries the form data
            req = request.Request(url, data=form_bytes, headers=headers)
        else:
            # GET request
            req = request.Request(url, headers=headers)
        # Fetch the data from the site
        response = request.urlopen(req)
        html_bytes = response.read()
    except HTTPError as e:
        print(e)
    except URLError as e:
        print(e)

    return html_bytes
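
To persist the fetched bytes, write them to a file only after the request has returned and html_bytes is populated; a minimal sketch using the same URL and form as the demo below:

# Save the fetched bytes to disk once the request has completed
html_bytes = post('http://fanyi.baidu.com/sug/', {'kw': '汽车'})
with open('fanyi.html', 'wb') as f:
    f.write(html_bytes)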


if __name__ == '__main__':
    url = 'http://fanyi.baidu.com/sug/'
    form = {'kw': '汽车'}
    html_bytes = post(url, form)
    print(html_bytes)

4. Output

b'{"errno":0,"data":[{"k":"\\u6c7d\\u8f66","v":"[q\\u00ec ch\\u0113] car; automobile; auto; motor vehicle; aut"},{"k":"\\u6c7d\\u8f66\\u7ad9","v":"bus station;"},{"k":"\\u6c7d\\u8f66\\u5c3e\\u6c14","v":"\\u540d automobile exhaust; vehicle exhaust;"},{"k":"\\u6c7d\\u8f66\\u4eba","v":"\\u540d Autobots;"},{"k":"\\u6c7d\\u8f66\\u914d\\u4ef6","v":"auto parts;"}]}'

 
