I. The Web Crawling Process
Step 1: Analyze the requirements
Step 2: Based on the requirements, select the target pages (specify the URL addresses)
Step 3: Fetch the site's data to the local machine
Step 4: Locate the data of interest
Step 5: Store the data (MySQL, Redis); a minimal sketch of steps 3 through 5 follows below
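To make steps 3 through 5 concrete, here is a minimal sketch. The URL, the regex, and the use of sqlite3 (a stand-in for MySQL/Redis so the sketch runs without a database server) are illustrative assumptions, not part of this article's code:

import re
import sqlite3
from urllib import request

# Step 3: fetch the page to the local machine
html = request.urlopen('http://fanyi.baidu.com/').read().decode('utf-8')

# Step 4: locate the data of interest (here, just the page title)
match = re.search(r'<title>(.*?)</title>', html)
title = match.group(1) if match else ''

# Step 5: store the data (sqlite3 stands in for MySQL/Redis here)
conn = sqlite3.connect('crawl.db')
conn.execute('CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT)')
conn.execute('INSERT INTO pages VALUES (?, ?)', ('http://fanyi.baidu.com/', title))
conn.commit()
conn.close()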
II. Code Implementation
Step 1: Pass in the url
Step 2: Set the user_agent
Step 3: Build the headers
Step 4: Construct the Request
Step 5: Call urlopen
Step 6: Return the bytes (see the bare-bones walk-through below)
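Before the full wrapper, here is a bare-bones, GET-only walk-through of the six steps. It is a minimal sketch with an illustrative URL and a truncated User-Agent string, not the article's final code:

from urllib import request

url = 'http://fanyi.baidu.com/'                     # Step 1: pass in the url
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64)'  # Step 2: user_agent string
headers = {'User-Agent': user_agent}                # Step 3: headers dict
req = request.Request(url, headers=headers)         # Step 4: define the Request
response = request.urlopen(req)                     # Step 5: urlopen
html_bytes = response.read()                        # Step 6: bytes come back

The full version below adds POST support, a default header, and error handling.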
1. Import the packages
# Imports
from urllib import request, parse
from urllib.error import HTTPError, URLError
2. Define the request functions
# GET request: no form data
def get(url, headers=None):
    return urlrequests(url, headers=headers)

# POST request: carries form data
def post(url, form, headers=None):
    return urlrequests(url, form, headers=headers)
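A hypothetical call to the GET wrapper (any reachable URL works; it relies on the urlrequests function defined next):

html_bytes = get('http://fanyi.baidu.com/')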
3. Encapsulate the crawler
# Crawler wrapper: sends a GET or POST request and returns the response bytes
def urlrequests(url, form=None, headers=None):
    # Impersonate a browser via the User-Agent header
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    if headers is None:
        headers = {'User-Agent': user_agent}
    html_bytes = b''
    try:
        if form:
            # POST request
            # (1) Encode the form dict as a query string (str)
            form_str = parse.urlencode(form)
            # (2) Convert the string to bytes
            form_bytes = form_str.encode('utf-8')
            # Passing data= makes urllib send a POST
            req = request.Request(url, data=form_bytes, headers=headers)
        else:
            # GET request
            req = request.Request(url, headers=headers)
        # Fetch the data from the site
        response = request.urlopen(req)
        html_bytes = response.read()
        # Write the response to a local file
        with open('fanyi.html', 'wb') as f:
            f.write(html_bytes)
    except HTTPError as e:
        print(e)
    except URLError as e:
        print(e)
    return html_bytes
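As a quick check on the form handling: parse.urlencode percent-encodes the UTF-8 bytes of non-ASCII values, so the form used in __main__ below serializes as follows:

from urllib import parse

parse.urlencode({'kw': '汽车'})   # -> 'kw=%E6%B1%BD%E8%BD%A6'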
if __name__ == '__main__':
    url = 'http://fanyi.baidu.com/sug/'
    form = {'kw': '汽车'}
    html_bytes = post(url, form)
    print(html_bytes)
4. Output
b'{"errno":0,"data":[{"k":"\\u6c7d\\u8f66","v":"[q\\u00ec ch\\u0113] car; automobile; auto; motor vehicle; aut"},{"k":"\\u6c7d\\u8f66\\u7ad9","v":"bus station;"},{"k":"\\u6c7d\\u8f66\\u5c3e\\u6c14","v":"\\u540d automobile exhaust; vehicle exhaust;"},{"k":"\\u6c7d\\u8f66\\u4eba","v":"\\u540d Autobots;"},{"k":"\\u6c7d\\u8f66\\u914d\\u4ef6","v":"auto parts;"}]}'