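The Battle at Lake Changjin (《长津湖》) is a war film released over the National Day holiday, and I expect many of you have already seen it. It is not especially friendly to its audience, though: it runs for about three hours, and too little of that time is spent on the Battle of Chosin Reservoir (长津湖战役) itself.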
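This time I used a crawler to pull the rating data from Douban:
Data source: Douban
Data fetching: requests
Data cleaning: lxml (XPath)
Data visualization: matplotlib
The code is as follows: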
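#!/usr/bin/python3
import subprocess
import sys

def install(package):
    # Install a missing package with the pip that belongs to this interpreter.
    subprocess.run([sys.executable, "-m", "pip", "install", package], check=True)

# Import each third-party package, installing it first if it is missing.
try:
    import requests
except ImportError:
    install("requests")
    import requests
try:
    from lxml import etree
except ImportError:
    install("lxml")
    from lxml import etree
try:
    import matplotlib
    import matplotlib.pyplot as plt
    from matplotlib.font_manager import FontProperties
except ImportError:
    install("matplotlib")
    import matplotlib
    import matplotlib.pyplot as plt
    from matplotlib.font_manager import FontProperties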
# Browser-like headers so the request looks like an ordinary page visit.
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.64',
    'Referer': 'https://movie.douban.com/',
    'Upgrade-Insecure-Requests': '1',
}
pl = requests.get("https://movie.douban.com/subject/25845392/?from=showing",
                  # params={'from': 'showing'},  # alternative to the query string above
                  headers=header,
                  timeout=10)  # do not hang forever if the server is unresponsive
print("HTTP Status Code:", pl.status_code)
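print("Trying to write file test.html")
# Save a local copy of the page so it can be inspected later.
with open("test.html", 'w', encoding='utf-8') as f:
    f.write(pl.text)
# Read the saved copy back and parse it into an element tree.
with open("test.html", "r", encoding='utf-8') as f:
    html = f.read()
html = etree.HTML(html)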
# Percentage of ratings at each star level, from 5 stars (i5) down to 1 star (i1).
# Each value is text ending in "%", so the sign is stripped here.
i5 = html.xpath('//*[@id="interest_sectl"]/div[1]/div[3]/div[1]/span[2]/text()')[0].replace("%",'')
i4 = html.xpath('//*[@id="interest_sectl"]/div[1]/div[3]/div[2]/span[2]/text()')[0].replace("%",'')
i3 = html.xpath('//*[@id="interest_sectl"]/div[1]/div[3]/div[3]/span[2]/text()')[0].replace("%",'')
i2 = html.xpath('//*[@id="interest_sectl"]/div[1]/div[3]/div[4]/span[2]/text()')[0].replace("%",'')
i1 = html.xpath('//*[@id="interest_sectl"]/div[1]/div[3]/div[5]/span[2]/text()')[0].replace("%",'')
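# The Qt4Agg backend is deprecated since Matplotlib 3.3; on newer versions, drop this line or choose another backend.
matplotlib.use('qt4agg')
font = FontProperties(fname='C:\\Windows\\Fonts\\msyh.ttc')  # Microsoft YaHei; substitute another CJK font if you do not have it
plt.figure(figsize=(20, 8), dpi=100)
#plt.rcParams['font.family']=['msyh']
# Labels, from 5 stars to 1 star: very good, good, average, poor, very poor.
pi = plt.pie([float(v) for v in (i5, i4, i3, i2, i1)],  # pie() needs numbers, not strings
             labels=["很好", "不错", "一般", "较差", "很差"],
             autopct="%1.2f%%",
             colors=['b', 'r', 'g', 'y', 'c'],
             #fontproperties=font,
             )
# pi[1] holds the label Text objects; give each one the CJK font.
for f in pi[1]:
    f.set_fontproperties(font)
plt.legend(prop=font)
plt.title("《长津湖》 电影评价", fontproperties=font)  # title: "The Battle at Lake Changjin: film ratings"
plt.axis("equal")
plt.show()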
When the script runs, the terminal prints the following:
HTTP Status Code: 200
Trying to write file test.html
D:\Python38\lib\importlib\__init__.py:127: MatplotlibDeprecationWarning:
The matplotlib.backends.backend_qt4agg backend was deprecated in Matplotlib 3.3 and will be removed two minor releases later.
return _bootstrap._gcd_import(name[level:], package, level)
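The five XPath expressions in the script differ only in one index, so they could be generated from a template instead of being written out by hand. The following is a minimal sketch of that idea and not part of the original script: `XPATH_TEMPLATE`, `parse_percent`, and `extract_percentages` are hypothetical names, and the function assumes it receives a tree object whose `xpath()` method returns a list of strings, as the lxml tree built by `etree.HTML` does for a `text()` query.

```python
# Sketch only: the names below are illustrative, not from the original script.
XPATH_TEMPLATE = '//*[@id="interest_sectl"]/div[1]/div[3]/div[{row}]/span[2]/text()'


def parse_percent(text):
    """Turn a string such as ' 12.5% ' into the float 12.5."""
    return float(text.strip().rstrip("%"))


def extract_percentages(tree):
    """Return {stars: percentage} for 5 stars down to 1 star.

    Row 1 of the rating block is the 5-star bar and row 5 the 1-star bar,
    matching the order used in the original script.
    """
    result = {}
    for row in range(1, 6):
        stars = 6 - row
        matches = tree.xpath(XPATH_TEMPLATE.format(row=row))
        if not matches:
            # The page layout may have changed, or the request may have been blocked.
            raise ValueError(f"no rating found for {stars} stars")
        result[stars] = parse_percent(matches[0])
    return result
```

With the script's variables, `extract_percentages(html)` would stand in for the five separate assignments, and `list(result.values())` yields the numbers in the 5-to-1 order that the pie chart expects.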
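As the pie chart shows, "very good" (5 stars) and "good" (4 stars) together account for roughly two-thirds of the ratings, which suggests that a sizable share of viewers found the film fairly enjoyable.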
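To make the "two-thirds" reading concrete, the combined share and a weighted mean star rating can be computed directly from the five percentages. The sketch below is an illustration only: the function names are hypothetical, and the numbers in `sample` are placeholders chosen for the example, not the values scraped from Douban.

```python
def top_two_share(percentages):
    """Combined share of 5-star and 4-star ratings, in percent."""
    return percentages[5] + percentages[4]


def mean_stars(percentages):
    """Average star rating, weighted by each star level's share."""
    total = sum(percentages.values())
    return sum(stars * share for stars, share in percentages.items()) / total


# Placeholder values for illustration; substitute the scraped percentages.
sample = {5: 40.0, 4: 30.0, 3: 20.0, 2: 6.0, 1: 4.0}

print(f"5 and 4 stars combined: {top_two_share(sample):.1f}%")
print(f"Weighted mean rating: {mean_stars(sample):.2f} stars")
```

Dividing by the total rather than by 100 keeps the mean correct even when the scraped percentages do not sum to exactly 100 because of rounding.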
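(P.S. [1] The script writes the page to test.html in the current working directory; opening the file in 'w' mode creates it if it does not already exist.
[2] If a lot of warnings pop up at run time and the chart labels come out garbled, the font is the problem. Other blogs cover the fix, so I won't go into it here.
)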
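One way to make the font step less fragile is to try several candidate font files and use the first one that exists. This is a rough sketch rather than a fix taken from this post: `first_existing_font` is a hypothetical helper, and apart from the `msyh.ttc` path used in the script, the candidate paths are assumptions about where CJK fonts are commonly installed and may not match your system.

```python
import os

# Only the first path comes from the original script; the others are guesses
# about common install locations and should be adjusted for your machine.
CANDIDATE_FONTS = [
    r"C:\Windows\Fonts\msyh.ttc",  # Microsoft YaHei (Windows)
    "/System/Library/Fonts/PingFang.ttc",  # assumed macOS location
    "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc",  # assumed Linux location
]


def first_existing_font(candidates=CANDIDATE_FONTS):
    """Return the first path in `candidates` that is an existing file, else None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


font_path = first_existing_font()
if font_path is None:
    print("No candidate font found; Chinese labels may not render correctly.")
else:
    print("Using font:", font_path)
```

The returned path could then be passed to `FontProperties(fname=font_path)` in place of the hard-coded one.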
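Finally, I'd like to close with the line that stayed with me most while watching the film:
If we do not fight this war, then our next generation will have to fight it.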
——《长津湖》