On the target page, press Ctrl+U to view the page source (or F12 to open the inspector), and Douban's page source appears right away (very friendly):
<html lang="zh-CN" class="ua-windows ua-webkit">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="renderer" content="webkit">
<meta name="referrer" content="always">
<meta name="google-site-verification" content="ok0wCgT20tBBgo9_zat2iAcimtN4Ftf5ccsh092Xeyw" />
<title>
豆瓣电影 Top 250
</title>
<meta name="baidu-site-verification" content="cZdR4xxR7RxmM4zE" />
<meta http-equiv="Pragma" content="no-cache">
......
Scroll down to around line 330 and you will find the following code:
<ol class="grid_view">
<li>
<div class="item">
<div class="pic">
<em class="">1</em>
<a href="https://movie.douban.com/subject/1292052/">
<img width="100" alt="肖申克的救赎" src="https://img2.doubanio.com/view/photo/s_ratio_poster/public/p480747492.webp" class="">
</a>
</div>
<div class="info">
<div class="hd">
<a href="https://movie.douban.com/subject/1292052/" class="">
<span class="title">肖申克的救赎</span>
<span class="title"> / The Shawshank Redemption</span>
<span class="other"> / 月黑高飞(港) / 刺激1995(台)</span>
</a>
<span class="playable">[可播放]</span>
</div>
<div class="bd">
<p class="">
导演: 弗兰克·德拉邦特 Frank Darabont 主演: 蒂姆·罗宾斯 Tim Robbins /...<br>
1994 / 美国 / 犯罪 剧情
</p>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.7</span>
<span property="v:best" content="10.0"></span>
<span>2304569人评价</span>
</div>
<p class="quote">
<span class="inq">希望让人自由。</span>
</p>
<p>
<span class="gact">
<a href="https://movie.douban.com/wish/224683240/update?add=1292052" target="_blank" class="j a_collect_btn" name="sbtn-1292052-wish" rel="nofollow">想看</a>
</span>
<span class="gact">
<a href="https://movie.douban.com/collection/224683240/update?add=1292052" target="_blank" class="j a_collect_btn" name="sbtn-1292052-collection" rel="nofollow">看过</a>
</span>
</p>
</div>
</div>
</div>
</li>
<li>
<div class="item">
<div class="pic">
<em class="">2</em>
<a href="https://movie.douban.com/subject/1291546/">
<img width="100" alt="霸王别姬" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2561716440.webp" class="">
</a>
</div>
<div class="info">
<div class="hd">
<a href="https://movie.douban.com/subject/1291546/" class="">
<span class="title">霸王别姬</span>
<span class="other"> / 再见,我的妾 / Farewell My Concubine</span>
</a>
<span class="playable">[可播放]</span>
</div>
<div class="bd">
<p class="">
导演: 陈凯歌 Kaige Chen 主演: 张国荣 Leslie Cheung / 张丰毅 Fengyi Zha...<br>
1993 / 中国大陆 中国香港 / 剧情 爱情 同性
</p>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.6</span>
<span property="v:best" content="10.0"></span>
<span>1709666人评价</span>
</div>
<p class="quote">
<span class="inq">风华绝代。</span>
</p>
From this code we can see that everything we want to scrape is already in there. Now let's analyse the HTML for a single movie, i.e. one item block:
<div class="item">
<div class="pic">
<em class="">1</em>
<a href="https://movie.douban.com/subject/1292052/">
<img width="100" alt="肖申克的救赎" src="https://img2.doubanio.com/view/photo/s_ratio_poster/public/p480747492.webp" class="">
</a>
</div>
<div class="info">
<div class="hd">
<a href="https://movie.douban.com/subject/1292052/" class="">
<span class="title">肖申克的救赎</span>
<span class="title"> / The Shawshank Redemption</span>
<span class="other"> / 月黑高飞(港) / 刺激1995(台)</span>
</a>
<span class="playable">[可播放]</span>
</div>
<div class="bd">
<p class="">
导演: 弗兰克·德拉邦特 Frank Darabont 主演: 蒂姆·罗宾斯 Tim Robbins /...<br>
1994 / 美国 / 犯罪 剧情
</p>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.7</span>
<span property="v:best" content="10.0"></span>
<span>2304569人评价</span>
</div>
<p class="quote">
<span class="inq">希望让人自由。</span>
</p>
<p>
<span class="gact">
<a href="https://movie.douban.com/wish/224683240/update?add=1292052" target="_blank" class="j a_collect_btn" name="sbtn-1292052-wish" rel="nofollow">想看</a>
</span>
<span class="gact">
<a href="https://movie.douban.com/collection/224683240/update?add=1292052" target="_blank" class="j a_collect_btn" name="sbtn-1292052-collection" rel="nofollow">看过</a>
</span>
</p>
</div>
</div>
</div>
</li>
From the HTML for 肖申克的救赎 (The Shawshank Redemption) we can see that the content we need to extract is:
<a href="https://movie.douban.com/subject/1292052/" class="">
<span class="title">肖申克的救赎</span>
<span class="title"> / The Shawshank Redemption</span>
<span class="other"> / 月黑高飞(港) / 刺激1995(台)</span>
<span class="rating_num" property="v:average">9.7</span>
<span class="inq">希望让人自由。</span>
With this analysis done, we can settle on the implementation plan.
1. Determine the page URLs, i.e.:
start_url = 'https://movie.douban.com/top250?start={:d}&filter='
size = 10
for i in range(size):
    url = start_url.format(i * 25)  # url is the link of each page
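As a quick sanity check (a minimal sketch, not part of the final script), the start parameter steps through 0, 25, ..., 225, one value per page of 25 movies:

start_url = 'https://movie.douban.com/top250?start={:d}&filter='
for i in range(10):
    print(start_url.format(i * 25))
# first:  https://movie.douban.com/top250?start=0&filter=
# last:   https://movie.douban.com/top250?start=225&filter=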
2. Fetch the corresponding page via the requests.get() method, i.e.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'}
html = requests.get(url, headers=headers)
Fetching a page can fail, and we want to make sure that no bad data is passed on to the next step, so:
if html.status_code == 200:
    # proceed to the next step
    pass
else:
    print("error!!!")
3. The page source returned by get() is sometimes awkward to process directly, so we use BeautifulSoup to parse it (even though it is not strictly necessary this time):
soup = BeautifulSoup(html.text, 'html.parser')
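To see what the parser gives us, we can pull out the div class="info" blocks, which is exactly what getHtmlDiv() below returns (a quick check, assuming the page structure shown above):

info_divs = soup.find_all(name='div', class_='info')
print(len(info_divs))  # should be 25, one block per movie on the page
print(info_divs[0].find(name='span', class_='title').get_text())  # e.g. 肖申克的救赎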
4. After the previous steps we can start cleaning the data with regular expressions. Since the requirements are already fixed, we now extract each field in turn according to them.
4.1 Match the URL
res = r'^[\[a-z<="\s]*href="(.*)">$'  # test holds the string being matched (here, the stringified <a> tag)
if re.match(res, test):
    url = re.match(res, test).group(1)
else:
    url = 'None'
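For instance, fed the first line of the stringified anchor list from the HTML above, the pattern pulls out the movie link (a small worked example; the variable names are for illustration only):

import re

test = '[<a class="" href="https://movie.douban.com/subject/1292052/">'
res = r'^[\[a-z<="\s]*href="(.*)">$'
print(re.match(res, test).group(1))  # https://movie.douban.com/subject/1292052/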
4.2 Match the title
res = r'^[a-z<="\s]*>(.*)</span>$'  # test holds the stringified <span class="title"> or <span class="other"> tag
if re.match(res, test):
    movie_name = re.match(res, test).group(1)
else:
    movie_name = 'None'
4.3 Match the rating
res = r'^[\[=a-z"<>\s:_]*(.*)]$'
if re.match(res, test):
rating = re.match(res, test).group(1)
else:
rating = 'None'
4.4 Match the one-line quote
res = r'^[\[=a-z"<>\s]*(.*)]$']
if re.match(res, test):
inq = re.match(res, test).group(1)
else:
inq = 'None'
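Again as a quick worked example (the sample strings are taken from the HTML above, the variable names are illustrative), running the three patterns gives exactly the fields we want:

import re

title_tag = '<span class="title">肖申克的救赎</span>'
rating_tag = '[<span class="rating_num" property="v:average">9.7</span>]'
inq_tag = '[<span class="inq">希望让人自由。</span>]'

print(re.match(r'^[a-z<="\s]*>(.*)</span>$', title_tag).group(1))        # 肖申克的救赎
print(re.match(r'^[\[=a-z"<>\s:_]*(.*)</span>]$', rating_tag).group(1))  # 9.7
print(re.match(r'^[\[=a-z"<>\s]*(.*)</span>]$', inq_tag).group(1))       # 希望让人自由。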
5. After the previous steps we could already read the scraped results from the program output, but to make them easier to read and keep we save them to a file; for convenience, straight to CSV:
with open('res.csv', 'w', encoding='utf-8', newline='') as f:  # newline='' avoids blank rows on Windows
    writer = csv.writer(f)
    writer.writerow(['电影名', '评分', '推荐语', '链接'])
    for i in res:
        writer.writerow(i)
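For reference, each element of res is expected to be one tuple per movie, in the same column order as the header row (an illustrative value built from the fields extracted above):

res = [
    ('肖申克的救赎/TheShawshankRedemption/月黑高飞(港)/刺激1995(台)',
     '9.7', '希望让人自由。', 'https://movie.douban.com/subject/1292052/'),
]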
The preparation stage settled the whole scraping flow, so from here on we just act as a ruthless code-writing machine.
1. Write the getHtmlDiv(url) function, which fetches a page and parses out the div tags containing the results:
def getHtmlDiv(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'}
    html = requests.get(url, headers=headers)
    if html.status_code == 200:
        soup = BeautifulSoup(html.text, 'html.parser')
        return soup.find_all(name='div', class_='info')
    else:
        print(html.status_code)
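A quick way to check the function (assuming the imports from the full source below are in scope) is to fetch the first page and count the blocks it returns:

divs = getHtmlDiv('https://movie.douban.com/top250?start=0&filter=')
print(len(divs))  # expected: 25, one <div class="info"> per movie on the page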
2. Write the writeToCSV(res, filename) function, which saves the scraped results:
def writeToCSV(res, filename):
    with open(filename, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['电影名', '评分', '推荐语', '链接'])
        for i in res:
            writer.writerow(i)
3. Write the getRes(ans, url) function, which extracts the data we need from the div tags:
def getRes(ans, url):
    div = getHtmlDiv(url)
    for i in range(len(div)):
        s_url = str(div[i].find_all(name='a')).split('\n')[0]
        res_url = r'^[\[a-z<="\s]*href="(.*)">$'
        if re.match(res_url, s_url):
            movie_url = re.match(res_url, s_url).group(1)
        else:
            movie_url = "None"
        s_title_span = div[i].find_all(name='span', class_='title') + div[i].find_all(name='span', class_='other')
        res_title = r'^[a-z<="\s]*>(.*)</span>$'
        movie_name = ''
        for j in range(len(s_title_span)):
            if re.match(res_title, str(s_title_span[j])):
                m = re.match(res_title, str(s_title_span[j])).group(1)
                m = ''.join(m.split())
            else:
                m = "None"
            movie_name += m
        s_rating = str(div[i].find_all(name='span', class_='rating_num'))
        res_rating = r'^[\[=a-z"<>\s:_]*(.*)</span>]$'
        if re.match(res_rating, s_rating):
            movie_rating = re.match(res_rating, s_rating).group(1)
        else:
            movie_rating = "None"
        s_inq = str(div[i].find_all(name='span', class_='inq'))
        res_inq = r'^[\[=a-z"<>\s]*(.*)</span>]$'
        if re.match(res_inq, s_inq):
            movie_inq = re.match(res_inq, s_inq).group(1)
        else:
            movie_inq = "None"
        item = (movie_name, movie_rating, movie_inq, movie_url)
        ans.append(item)
    return ans
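Used on its own, the function takes an accumulator list and a page URL and returns the list with one tuple appended per movie (a small illustrative call, assuming getHtmlDiv and the imports below are defined):

first_page = getRes([], 'https://movie.douban.com/top250?start=0&filter=')
print(first_page[0])  # ('肖申克的救赎/...', '9.7', '希望让人自由。', 'https://movie.douban.com/subject/1292052/')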
4. Write the main() function:
def main():
    start_url = "https://movie.douban.com/top250?start={:d}&filter="
    size = 10
    res = []
    for i in range(size):
        try:
            url = start_url.format(i * 25)
            res = getRes(res, url)
        except:
            print("main() error! i = %d" % i)
            continue
    writeToCSV(res, 'test.csv')
5. Full source code
# -*- coding: utf-8 -*-
# author:Egoist
import requests
from bs4 import BeautifulSoup
import xlwings as xw  # imported but not actually used in this script
from tqdm import tqdm
import re
import csv


def getHtmlDiv(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'}
    # send headers so the request is less likely to be flagged by anti-scraping checks
    html = requests.get(url, headers=headers)  # fetch the page
    if html.status_code == 200:  # check the status code: 200 means success; 403 may mean a temporary ban, try again later
        soup = BeautifulSoup(html.text, 'html.parser')  # make the soup, i.e. parse the page (slightly redundant for this project)
        return soup.find_all(name='div', class_='info')  # return all <div class="info"> tags
    else:
        print(html.status_code)  # print the status code


def writeToCSV(res, filename):
    with open(filename, 'w', encoding='utf-8', newline='') as f:  # with open is the convenient way; newline='' avoids blank rows on Windows
        writer = csv.writer(f)  # CSV writer
        writer.writerow(['电影名', '评分', '推荐语', '链接'])  # write the header row
        for i in res:  # write the data rows one by one
            writer.writerow(i)


def getRes(ans, url):
    div = getHtmlDiv(url)  # get the <div class="info"> tags
    for i in tqdm(range(len(div))):  # iterate over the movies
        s_url = str(div[i].find_all(name='a')).split('\n')[0]  # first line of the stringified <a> tags, which contains the url
        res_url = r'^[\[a-z<="\s]*href="(.*)">$'  # regex that extracts the url
        if re.match(res_url, s_url):  # guard against a failed match
            movie_url = re.match(res_url, s_url).group(1)
        else:
            movie_url = "None"
        s_title_span = div[i].find_all(name='span', class_='title') + div[i].find_all(name='span', class_='other')
        res_title = r'^[a-z<="\s]*>(.*)</span>$'
        movie_name = ''
        for j in range(len(s_title_span)):
            if re.match(res_title, str(s_title_span[j])):
                m = re.match(res_title, str(s_title_span[j])).group(1)
                m = ''.join(m.split())  # strip whitespace, including the \xa0 characters
            else:
                m = "None"
            movie_name += m
        s_rating = str(div[i].find_all(name='span', class_='rating_num'))
        res_rating = r'^[\[=a-z"<>\s:_]*(.*)</span>]$'
        if re.match(res_rating, s_rating):
            movie_rating = re.match(res_rating, s_rating).group(1)
        else:
            movie_rating = "None"
        s_inq = str(div[i].find_all(name='span', class_='inq'))
        res_inq = r'^[\[=a-z"<>\s]*(.*)</span>]$'
        if re.match(res_inq, s_inq):
            movie_inq = re.match(res_inq, s_inq).group(1)
        else:
            movie_inq = "None"
        item = (movie_name, movie_rating, movie_inq, movie_url)  # pack the result into a tuple
        ans.append(item)  # append this movie's info to the result list
    return ans  # return the results


def main():
    start_url = "https://movie.douban.com/top250?start={:d}&filter="  # URL template
    size = 10  # number of pages
    res = []  # collected data
    for i in range(size):
        try:
            url = start_url.format(i * 25)  # URL of each page
            res = getRes(res, url)  # accumulate the data
        except:
            print("main() error! i = %d" % i)
            continue
    for i in res:
        print(i)
    writeToCSV(res, 'test.csv')  # write out the results


if __name__ == '__main__':
    main()  # entry point
In the folder containing the code we can now see a newly generated .csv file, which holds our scraped results.
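To double-check the output (a quick sketch using the filename written by main()), the file can be read back with the csv module:

import csv

with open('test.csv', encoding='utf-8', newline='') as f:
    for row in csv.reader(f):
        print(row)  # ['电影名', '评分', '推荐语', '链接'], then one row per movie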