Scraping static web pages is the most basic crawling technique; in essence it just means fetching a page's source code. The idea is to imitate a user visiting a page through a browser: the crawler sends a request to the web server, the server receives the request and responds, and the page source comes back in the response.
The main library used here is Requests. Taking Baidu as an example, the following code receives the response returned by the Baidu server:
import requests

# Target URL and a browser-like User-Agent so the request looks like normal traffic
url = 'https://www.baidu.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0'
}

# Send a GET request and receive the server's response
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'
print(response.status_code)
Here, url is simply the address to request; headers is the request header, where a browser-like User-Agent helps get past the anti-crawling measures some servers use to block crawlers from scraping their pages; response is the response object received after building a request from url and sending it as a GET request.
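Besides the status code, the response object exposes a few other useful attributes. A small sketch of inspecting them (the Content-Type lookup uses .get because a server might not send that header):

import requests

response = requests.get('https://www.baidu.com/',
                        headers={'User-Agent': 'Mozilla/5.0'})
response.encoding = 'utf-8'

print(response.status_code)                  # HTTP status, e.g. 200
print(response.headers.get('Content-Type'))  # response headers; None if absent
print(response.text[:200])                   # first 200 characters of the decoded HTML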
Let's try downloading the Baidu logo:
import requests

baidu_logo_url = 'https://www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png'
response = requests.get(baidu_logo_url)
with open('baidu_logo.png', 'wb') as file:
    file.write(response.content)  # response.content is the raw binary data of the image
And the CSDN logo:
import requests

csdn_logo_url = 'https://img-home.csdnimg.cn/images/20201124032511.png'
response = requests.get(csdn_logo_url)
with open('csdn_logo.png', 'wb') as file:
    file.write(response.content)
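For a small logo, reading response.content in one go is fine; for larger binary files, Requests can also stream the body in chunks. A rough sketch of that variant (the chunk size and output filename are my own choices, not anything the API requires):

import requests

url = 'https://img-home.csdnimg.cn/images/20201124032511.png'
# stream=True delays downloading the body until we iterate over it
with requests.get(url, stream=True) as response:
    with open('csdn_logo_stream.png', 'wb') as file:
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)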
Now let's try scraping whole pages:
import requests

def load_page(url):
    '''
    Send a request to the given url and get the server's response
    :param url: the url to scrape
    :return: the response body as text
    '''
    headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; "
                             "Windows NT 6.1; Trident/5.0)"}
    # Send a GET request and receive the server's response
    response = requests.get(url, headers=headers)
    return response.text

def save_file(html, filename):
    '''
    Write the scraped page data to a local file
    :param html: the response content returned by the server
    :param filename: name of the file to save to
    :return:
    '''
    print('正在保存', filename)
    with open(filename, 'wb') as file:
        file.write(html)

def scrape_forum(begin_page, end_page):
    '''
    Control the flow of scraping the pages
    :param begin_page: first page number
    :param end_page: last page number
    :return:
    '''
    for page in range(begin_page, end_page + 1):
        # Build the full URL of each page
        url = f'https://ssr1.scrape.center/page/{page}'
        file_name = "第" + str(page) + "页.html"
        html = load_page(url)
        save_file(html, file_name)

scrape_forum(1, 5)
E:\ProgramFiles\anaconda\python.exe E:/python学习/爬虫学习/2022.9.22.py
Traceback (most recent call last):
  File "E:\python学习\爬虫学习\2022.9.22.py", line 82, in <module>
    scrape_forum(1,5)
  File "E:\python学习\爬虫学习\2022.9.22.py", line 80, in scrape_forum
    save_file(html,file_name)
  File "E:\python学习\爬虫学习\2022.9.22.py", line 65, in save_file
    file.write(html)
TypeError: a bytes-like object is required, not 'str'
正在保存 第1页.html
Process finished with exit code 1
An error. Why? Looking at the traceback, the failure is in save_file at file.write(html): the file is opened in binary mode ('wb'), but load_page returns response.text, which is a str, hence "a bytes-like object is required, not 'str'". It has nothing to do with scraping five pages at once; the fix is either to write bytes (response.content, or html.encode('utf-8')) or to open the file in text mode with an explicit encoding.
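To make the distinction concrete, here is a small sketch, separate from the script above, that fetches just the first page and shows both fixes side by side:

import requests

url = 'https://ssr1.scrape.center/page/1'
response = requests.get(url)

# Fix 1: binary mode needs bytes, so write response.content instead of response.text
with open('page1_binary.html', 'wb') as file:
    file.write(response.content)

# Fix 2: text mode with an explicit encoding accepts the str from response.text
with open('page1_text.html', 'w', encoding='utf-8') as file:
    file.write(response.text)

I went with the text-mode fix and, to narrow things down, also rewrote the script to scrape a single page at a time: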
import requests

def load_page(url):
    '''
    Send a request to the given url and get the server's response
    :param url: the url to scrape
    :return: the response body as text
    '''
    headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; "
                             "Windows NT 6.1; Trident/5.0)"}
    # Send a GET request and receive the server's response
    print('正在发送GET请求')
    response = requests.get(url, headers=headers)
    print('GET请求发送成功')
    return response.text

def save_file(html, filename):
    '''
    Write the scraped page data to a local file
    :param html: the response content returned by the server
    :param filename: name of the file to save to (without extension)
    :return:
    '''
    print('开始保存文件')
    print('正在保存', filename)
    # Text mode with an explicit encoding, so writing a str works
    with open(filename + '.html', 'w', encoding='utf-8') as file:
        file.write(html)
    print(filename, '保存成功!')

def forum(url):
    '''
    Control the flow of scraping a single page
    :param url: the url to scrape
    :return:
    '''
    # Use the host name (everything between "https://" and the next "/") as the file name
    file_name = ''
    for i in range(8, len(url)):
        if url[i] == '/':
            break
        file_name += url[i]
    html = load_page(url)
    save_file(html, file_name)

url = input('输入网址')
forum(url)
This time the HTML file was scraped and saved without any problems.
For data extraction I mainly use XPath. With Chrome and an XPath helper extension installed, you can work out the path of the elements you need and then pull the corresponding data out of the page.
I learned XPath mainly from CSDN articles, plus a book.
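Before the real example, a minimal sketch of how lxml parses HTML and evaluates an XPath expression (the HTML fragment below is made up purely for illustration):

from lxml import etree

# A made-up HTML fragment standing in for a real page
html = '''
<div id="index">
  <a href="/detail/1"><h2>Movie A</h2></a>
  <a href="/detail/2"><h2>Movie B</h2></a>
</div>
'''

parse = etree.HTML(html)                                # build the element tree
titles = parse.xpath('//div[@id="index"]//h2/text()')   # select all <h2> text nodes
print(titles)                                           # ['Movie A', 'Movie B']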
Example: get the movie names from one page of Scrape:
# Import urllib and lxml
import urllib.request
from lxml import etree

# URL of the page to scrape
url = 'https://ssr1.scrape.center/'
# Request the page and get its source code
response = urllib.request.urlopen(url=url)
html = response.read().decode("utf-8")
# Turn the text into an HTML element tree
parse = etree.HTML(html)
# Evaluate the XPath for each of the 10 movie cards on the page
for i in range(1, 11):
    item_list = parse.xpath('//*[@id="index"]/div/div/div[' + str(i) + ']/div/div/div[2]/a/h2/text()')
    print(item_list)
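As a side note, the positional index [i] in that XPath is what forces the Python loop. Dropping the predicate makes the same path match every movie card, so one call should return all the titles on the page at once (a sketch; how many nodes it matches depends on the page structure):

import urllib.request
from lxml import etree

html = urllib.request.urlopen('https://ssr1.scrape.center/').read().decode('utf-8')
parse = etree.HTML(html)

# Same path as above, just without the [i] predicate, so it matches every card
titles = parse.xpath('//*[@id="index"]/div/div/div/div/div/div[2]/a/h2/text()')
print(titles)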
Get the movie names from every page of the site and save them to a CSV file (which can be opened in Excel):
import urllib.request
from lxml import etree
import csv

# Create the output file (newline='' avoids blank rows in the CSV on Windows)
f = open('csv_file.csv', 'w', encoding='utf-8', newline='')
# Build the csv writer
csv_write = csv.writer(f)
for j in range(1, 11):
    # URL of the page to scrape
    url = 'https://ssr1.scrape.center/page/' + str(j)
    # Request the page and get its source code
    response = urllib.request.urlopen(url=url)
    html = response.read().decode("utf-8")
    # Turn the text into an HTML element tree
    parse = etree.HTML(html)
    # Evaluate the XPath for each of the 10 movie cards on the page
    for i in range(1, 11):
        item_list = parse.xpath('//*[@id="index"]/div/div/div[' + str(i) + ']/div/div/div[2]/a/h2/text()')
        csv_write.writerow(item_list)
        print(item_list)
f.close()
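One refinement worth making before scaling this up: pause between requests and catch network errors, so a single failed page does not kill the whole run. A rough sketch of that pattern (the 1-second delay and 10-second timeout are arbitrary choices of mine):

import time
import urllib.request

for j in range(1, 11):
    url = 'https://ssr1.scrape.center/page/' + str(j)
    try:
        html = urllib.request.urlopen(url=url, timeout=10).read().decode('utf-8')
    except OSError as e:  # urllib.error.URLError and socket timeouts are OSError subclasses
        print('failed to fetch', url, ':', e)
        continue
    # ... parse and save as before ...
    time.sleep(1)  # be polite to the server between requests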
Now let's refine it further: scrape the movie name, rating, country, runtime, and release date, and organize them into a table:
import urllib.request
from lxml import etree
import csv

f = open('csv_file.csv', 'w', encoding='utf-8', newline='')
csv_write = csv.writer(f)
item_list = []
# Header row
csv_write.writerow(['电影名字', '评分', '产地', '片长', '上映日期'])
for j in range(1, 11):
    url = 'https://ssr1.scrape.center/page/' + str(j)
    response = urllib.request.urlopen(url=url)
    html = response.read().decode("utf-8")
    parse = etree.HTML(html, parser=etree.HTMLParser(encoding='utf-8'))
    for i in range(1, 11):
        # XPath for each field of the i-th movie card on the page
        movie_name = parse.xpath('//*[@id="index"]/div/div/div[' + str(i) + ']/div/div/div[2]/a/h2/text()')
        fraction = parse.xpath('//*[@id="index"]/div[1]/div[1]/div[' + str(i) + ']/div/div/div[3]/p[1]/text()')
        country = parse.xpath('//*[@id="index"]/div[1]/div[1]/div[' + str(i) + ']/div/div/div[2]/div[2]/span[1]/text()')
        minute = parse.xpath('//*[@id="index"]/div[1]/div[1]/div[' + str(i) + ']/div/div/div[2]/div[2]/span[3]/text()')
        time = parse.xpath('//*[@id="index"]/div[1]/div[1]/div[' + str(i) + ']/div/div/div[2]/div[3]/span/text()')
        item_list.append(movie_name[0])
        item_list.append(fraction[0].strip())  # the score text is padded with whitespace
        item_list.append(country[0])
        item_list.append(minute[0])
        # Some movies have no release date; only append it when the XPath matched something
        if time:
            item_list.append(time[0])
        csv_write.writerow(item_list)
        print(item_list)
        item_list.clear()
f.close()
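To double-check the result, the CSV can be read back with the same csv module (a small sketch, assuming the script above has already produced csv_file.csv):

import csv

with open('csv_file.csv', 'r', encoding='utf-8', newline='') as f:
    for row in csv.reader(f):
        print(row)  # header first, then one row per movie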