Goal:
Fetch the danmaku (bullet comments) of the Bilibili video ⚡萨 日 朗!!!⚡ (uploader: 雨夜繁星y) and build a word cloud from them.
Huaqiang is invited to sing 萨日朗. Listen once a day and everyday life turns strange.
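Original track: 火红的萨日朗 - 要不要买菜 (DJ8先生)
Original singer: 乌兰托娅
Editor: PyCharm
Libraries used: re, requests, lxml, numpy, wordcloud, jieba, PIL
This crawler sends requests to many URLs, so the request logic is wrapped in one function: pass in a different URL and get back the corresponding response.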
import re
import jieba
import numpy as np
import requests
import wordcloud
from PIL import Image

def get_req(url):
    # Browser-like headers; replace the cookie with one from your own logged-in session.
    headers = {
        "cookie": "_uuid=E457FD66-C6D9-AFAE-2906-548486B5774D60420infoc; buvid3=7888DF13-B86F-4CCA-8FF4-996665D18409167615infoc; sid=50tmbmmi; buvid_fp=7888DF13-B86F-4CCA-8FF4-996665D18409167615infoc; DedeUserID=56857312; DedeUserID__ckMd5=e9a29ff5b1982e8a; SESSDATA=8fb5d4d9%2C1642227376%2Ca099b*71; bili_jct=9ecbda0d46d72de6b7a8aea56480e592; CURRENT_FNVAL=80; blackside_state=1; rpdid=|(u))lR||k~m0J'uYkYY~lRlR; fingerprint3=f1e5e3ab8f8ca90169f8e4f23aaeb03a; fingerprint=c5280c11d9a04e348b55706583502ba3; fingerprint_s=834914d136d665aedab209f94ad4dc76; buvid_fp_plain=7888DF13-B86F-4CCA-8FF4-996665D18409167615infoc; bp_video_offset_56857312=560295015097913861; PVID=3; innersign=0",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
    }
    req = requests.get(url=url, headers=headers)
    return req
Every danmaku file URL includes the parameter cid, so the cid has to be obtained first.
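def get_cid(bv):  # bv is the BV ID set in the entry point
    # Drop only the leading "BV"; str.strip("BV") would also trim a trailing B or V.
    if bv.startswith("BV"):
        bv = bv[2:]
    url = f'https://api.bilibili.com/x/player/pagelist?bvid={bv}&jsonp=jsonp'
    page_text = get_req(url=url).json()
    cid = page_text['data'][0]['cid']  # cid of the first part of the video
    return cid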
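To make the two steps inside get_cid concrete, here is a small offline sketch. The response dictionary is a hand-written stand-in shaped like what the code indexes into (`data[0]['cid']`), with a made-up cid; it is not real API output. The helper names `strip_bv_prefix` and `first_cid` are hypothetical. The sketch also shows why `str.strip("BV")` is a risky way to drop the prefix.

```python
# Offline sketch: no network access; the IDs and the response are invented.

def strip_bv_prefix(bv):
    """Drop a leading 'BV' only (hypothetical helper)."""
    return bv[2:] if bv.startswith("BV") else bv


def first_cid(page_json):
    """Return the cid of the first part (page) of a video."""
    return page_json["data"][0]["cid"]


# str.strip treats its argument as a set of characters and trims both ends,
# so an ID that happens to end in 'B' or 'V' loses those characters too.
made_up_id = "BV1ab411c7VB"
print(made_up_id.strip("BV"))       # 1ab411c7
print(strip_bv_prefix(made_up_id))  # 1ab411c7VB

sample_response = {"code": 0, "data": [{"cid": 123456789, "page": 1, "part": "P1"}]}
print(first_cid(sample_response))   # 123456789
```

For the BV ID used in this post both approaches give the same result, since it ends in a digit.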
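Each danmaku file looks like the following. I really could not make sense of the format; after searching online, every solution I found ended up using a regular expression. If you know another approach, feel free to message me.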
def get_dm(cid):
    url = f'https://api.bilibili.com/x/v2/dm/web/seg.so?type=1&oid={cid}&pid=461847198&segment_index=1'
    page_text = get_req(url=url).text
    # Match runs of Chinese characters, i.e. the danmaku text.
    # Digits, Latin letters and punctuation are dropped by this pattern.
    pat = re.compile("[\u4E00-\u9FFF]+")
    dama_list = pat.findall(page_text)
    return dama_list
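As one possible alternative to the regular expression, some crawlers use an older XML danmaku endpoint instead (commonly cited as `https://api.bilibili.com/x/v1/dm/list.so?oid=<cid>`). Whether that endpoint is still available is an assumption here, not something this post verified. The sketch below parses a hand-written sample in that XML shape, where each danmaku is a `<d>` element, and compares the result with the Chinese-only pattern from get_dm. The function name `parse_dm_xml` and the sample data are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of the XML danmaku format (not real data).
SAMPLE_XML = """<i>
  <d p="1.5,1,25,16777215,1600000000,0,abc123,1">每日一遍</d>
  <d p="3.0,1,25,16777215,1600000001,0,def456,2">666</d>
  <d p="4.2,1,25,16777215,1600000002,0,0a1b2c,3">上头了hhh,停不下来</d>
</i>"""


def parse_dm_xml(xml_text):
    """Return the text of every <d> element, one string per danmaku."""
    root = ET.fromstring(xml_text)
    return [d.text for d in root.iter("d") if d.text]


pat = re.compile("[\u4E00-\u9FFF]+")

# XML parsing keeps each danmaku whole, including digits and Latin letters.
print(parse_dm_xml(SAMPLE_XML))  # ['每日一遍', '666', '上头了hhh,停不下来']

# The regex keeps only runs of Chinese characters and splits at anything else.
print(pat.findall(SAMPLE_XML))   # ['每日一遍', '上头了', '停不下来']
```

For a word cloud of Chinese words the regex result is often good enough, but it cannot tell where one danmaku ends and the next begins.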
Save the danmaku to a txt file, one per line; the file is named after the video's BV ID, which is passed in as a parameter.
def save_dm(dama_list, bv):
    # One danmaku per line; the with block closes the file even if a write fails.
    with open(f"{bv}.txt", "w", encoding="utf-8") as fp:
        for i in dama_list:
            fp.write(i + "\n")
The word cloud uses a picture of Huaqiang's head as its mask, and the result is saved locally.
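def get_word_cloud(BV):
    with open(f'{BV}.txt', "r", encoding="utf-8") as fp:
        dm = fp.read()
    dm_cut = jieba.lcut(dm)  # segment the Chinese text into words
    dm_cut = " ".join(dm_cut)  # WordCloud expects space-separated words
    mask = np.array(Image.open("hq.png"))  # pure white areas of the mask stay empty
    wcd = wordcloud.WordCloud(
        font_path="simkai.ttf",  # a font with Chinese glyphs, needed to render the words
        colormap="brg",
        mask=mask,
        width=800,  # width and height are ignored when a mask is given
        height=400,
        max_words=200,
        background_color="white",
        scale=16,
    ).generate(dm_cut)
    wcd.to_file(f'{BV}.jpg')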
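`generate` counts word frequencies from the space-joined text internally. If more control is wanted, for example dropping single-character tokens or filler words, one option is to count the tokens yourself and hand the result to `WordCloud.generate_from_frequencies`. The sketch below shows only the counting step, using the standard library. The token list stands in for the output of `jieba.lcut`, and the stopword set and the helper name `count_tokens` are made up for illustration.

```python
from collections import Counter


def count_tokens(tokens, stopwords=frozenset(), min_len=2):
    """Count tokens, skipping whitespace, short tokens and stopwords."""
    cleaned = (t.strip() for t in tokens)
    return Counter(t for t in cleaned if len(t) >= min_len and t not in stopwords)


# Stand-in for jieba.lcut(dm); a real run would tokenize the saved danmaku text.
tokens = ["华强", "上头", "\n", "的", "上头", "这个", "\n", "华强", "上头"]
freqs = count_tokens(tokens, stopwords={"这个"})
print(freqs.most_common())  # [('上头', 3), ('华强', 2)]

# With wordcloud installed, the counts could replace the .generate(dm_cut) call:
# wordcloud.WordCloud(font_path="simkai.ttf", mask=mask).generate_from_frequencies(freqs)
```

Counting by hand also makes it easy to print the top words and check them before rendering the image.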
The BV ID is a global variable set in the entry point.
if __name__ == '__main__':
    bv = 'BV15L411p7M8'
    cid = get_cid(bv)
    dama_list = get_dm(cid)
    save_dm(dama_list, bv)
    get_word_cloud(bv)
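Note that get_dm requests only `segment_index=1`. Community documentation of this endpoint describes each segment as covering six minutes of video, so a longer video would need several requests; treat the six-minute figure as an assumption to verify. A minimal sketch of building one URL per segment follows. The helper name `segment_urls` and its `duration_seconds` parameter are hypothetical, and `pid` is simply carried over from the hard-coded value in get_dm.

```python
import math

SEGMENT_SECONDS = 360  # assumed segment length (six minutes); verify before relying on it


def segment_urls(cid, duration_seconds, pid=461847198):
    """Build one seg.so URL per assumed six-minute segment (indices start at 1)."""
    count = max(1, math.ceil(duration_seconds / SEGMENT_SECONDS))
    return [
        "https://api.bilibili.com/x/v2/dm/web/seg.so"
        f"?type=1&oid={cid}&pid={pid}&segment_index={index}"
        for index in range(1, count + 1)
    ]


# Example with made-up values: a 13-minute video would need three segments.
for url in segment_urls(cid=123456789, duration_seconds=13 * 60):
    print(url)
```

Each URL could then be passed to get_req and the extracted danmaku lists concatenated before saving.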
It is so catchy that I am gradually forgetting the original song~~~