Preface
As a danmaku (bullet-comment) video site, Bilibili has its own danmaku culture. So let's find out: what is the most frequent danmaku in a given video?
PS: if you'd like to watch the accompanying video tutorial, it's here:
https://www.bilibili.com/video/BV1ci4y1g7eE/
Topics covered:
- Basic crawler workflow
- Regular expressions
- requests
- jieba
- csv
- wordcloud
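The core of the workflow listed above (request a page, pull the danmaku out with a regular expression, save them to CSV) can be sketched offline on a tiny sample. The XML snippet and comment texts below are invented for illustration; the real response comes from Bilibili's danmaku API:

```python
import csv
import io
import re

# A minimal stand-in for the danmaku XML the API returns:
# each comment sits inside a <d> element (sample text is made up).
sample = '<d p="1.2,1,25,16777215">前方高能</d><d p="3.4,1,25,16777215">awsl</d>'

# In the real crawler, step 1 would be requests.get(...); here we skip
# straight to step 2: extract the comment text with a regex.
danmu = re.findall(r'<d p="[^"]*">(.*?)</d>', sample)

# Step 3: write one comment per row, CSV-style (in memory for this demo).
buf = io.StringIO()
writer = csv.writer(buf)
for text in danmu:
    writer.writerow([text])

print(danmu)  # ['前方高能', 'awsl']
```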
Development environment:
- Python 3.6
- PyCharm
Scraping target
https://www.bilibili.com/video/BV1Mk4y1z7ey/?spm_id_from=333.788.videocard.6
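A caveat about the target URL: the oid used in the danmaku API later in this post is the video's internal cid, not the BV id visible in the URL. As a small aside, here is a helper (the function name is mine, not part of any library) that at least pulls the BV id out of such a URL; turning a BV id into a cid requires a further API call, reportedly Bilibili's pagelist endpoint, which you should verify yourself:

```python
import re

def extract_bvid(url: str) -> str:
    """Pull the BV-style video id out of a Bilibili URL (hypothetical helper)."""
    m = re.search(r"(BV[0-9A-Za-z]{10})", url)
    if m is None:
        raise ValueError("no BV id found in URL")
    return m.group(1)

print(extract_bvid(
    "https://www.bilibili.com/video/BV1Mk4y1z7ey/?spm_id_from=333.788.videocard.6"))
# BV1Mk4y1z7ey
```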
Code
1. Import the tools
from bs4 import BeautifulSoup
import requests
import re
import csv
2. Import the word-cloud library wordcloud and the Chinese word-segmentation library jieba
import jieba
import wordcloud
3. Import the imread function from the imageio library and use it to read a local image as the word-cloud shape
import imageio

mk = imageio.imread(r"拳头.png")

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36",
    "cookie": "_uuid=E151BB47-D49E-6692-19A1-2514519BCB6896608infoc; buvid3=6B813D30-FC11-4BAC-B077-A65D7118C308138392infoc; sid=bmsyfv4j; DedeUserID=24695355; DedeUserID__ckMd5=1027309007358c2f; SESSDATA=2ae6a4d8%2C1612179663%2Cb09ca*81; bili_jct=77765442d3466b18abf9c6c02098f2f1; CURRENT_FNVAL=16; rpdid=|(umuum~lllm0J'ulmYYJl|Rm; bp_video_offset_24695355=420789477979262399; LIVE_BUVID=AUTO9115968115538586; PVID=1; bsource=search_baidu; bfe_id=6f285c892d9d3c1f8f020adad8bed553"
}

# response = requests.get("https://api.bilibili.com/x/v1/dm/list.so?oid=186803402", headers=headers)
response = requests.get("https://api.bilibili.com/x/v2/dm/history?type=1&oid=186803402&date=2020-08-08", headers=headers)
# print(response.text)
html_doc = response.content.decode('utf-8')
print(html_doc)

# soup = BeautifulSoup(html_doc, 'lxml')
# Each danmaku in the response sits inside a <d> element; capture its text
pattern = re.compile(r'<d p="[^"]*">(.*?)</d>')
DanMu = pattern.findall(html_doc)

# Open the CSV once and append every danmaku as its own row
with open('C:/Users/Mark/Desktop/b站弹幕.csv', "a", newline='', encoding='utf-8-sig') as csvfile:
    writer = csv.writer(csvfile)
    for i in DanMu:
        writer.writerow([i])
4. Build and configure the word-cloud object w. Note the stopwords parameter: put any words you don't want shown in the cloud into the stopwords set; here we drop the two words 曹操 (Cao Cao) and 孔明 (Kongming)
w = wordcloud.WordCloud(width=1000,
                        height=700,
                        background_color='white',
                        font_path='msyh.ttc',
                        mask=mk,
                        scale=15,
                        stopwords={'曹操', '孔明'},
                        contour_width=5,
                        contour_color='red')
5. Run Chinese word segmentation over the text from the external file to get string
# utf-8-sig matches the encoding the CSV was written with above
f = open('C:/Users/Mark/Desktop/b站弹幕.csv', encoding='utf-8-sig')
txt = f.read()
f.close()
txtlist = jieba.lcut(txt)
string = " ".join(txtlist)
6. Pass the string variable into w's generate() method to feed text into the word cloud
w.generate(string)
7. Export the word-cloud image to a file
w.to_file('C:/Users/Mark/Desktop/output2-threekingdoms.png')
Finally, run the code.