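Lately I've been following a documentary on Bilibili called "守护解放西". It records the daily work of the urban police in a core commercial district centered on the Pozijie police station in Changsha. The documentary is quite popular and has a lot of danmaku (bullet comments), and since I happen to be learning web scraping, I wondered whether I could scrape the danmaku of this enjoyable and worthwhile documentary.
The core steps of this scrape can be roughly listed as: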
The main scraping module used here is the requests module.
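Remember to load the danmaku list first and pick a date, as shown in the figure below. Otherwise the endpoint may not show up when you look for the API later.
Now let's find the API endpoint together. Press F12 (or right-click and choose Inspect), then click the Network tab to reach the view shown below: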
https://api.bilibili.com/x/v2/dm/history?type=1&oid=260418892&date=2021-01-04
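Paste this URL into the browser's address bar, and it turns out to be exactly the danmaku endpoint we were looking for, as shown in the figure below: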
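To make the shape of that URL easier to see, here is a minimal sketch that rebuilds it from its query parameters with the standard library. The helper name `build_history_url` and the constant `HISTORY_ENDPOINT` are made up for illustration, and reading `oid` as an identifier for the video is an assumption; only the three parameter names and their values come from the URL above.

```python
from urllib.parse import urlencode

# Base path taken from the URL found in the Network tab
HISTORY_ENDPOINT = 'https://api.bilibili.com/x/v2/dm/history'


def build_history_url(oid, date, dm_type=1):
    """Hypothetical helper: assemble the history URL for one day.

    oid     -- the numeric id seen in the URL (assumed to identify the video)
    date    -- a day in YYYY-MM-DD form
    dm_type -- the `type` parameter, 1 in the URL above
    """
    query = urlencode({'type': dm_type, 'oid': oid, 'date': date})
    return f'{HISTORY_ENDPOINT}?{query}'


print(build_history_url(260418892, '2021-01-04'))
```

Changing only the `date` argument gives the URL for another day, which is what the date loop later in this post relies on.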
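The steps above gave us the API endpoint that serves the danmaku. Looking at the response, I found that each danmaku is stored in a <d> tag, so the script below collects every <d> tag and reads its text: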
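As a rough illustration of pulling text out of <d> tags, the sketch below parses a small hand-written XML string. The sample document, its `p` attribute values, and the helper name `extract_danmaku` are all made up for this example, and it uses the standard library's `xml.etree.ElementTree` instead of the BeautifulSoup call used in this post, so it runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the response described above:
# one <d> element per danmaku, with the comment as its text.
SAMPLE_XML = """<i>
  <d p="hypothetical-metadata-1">first comment</d>
  <d p="hypothetical-metadata-2"></d>
  <d p="hypothetical-metadata-3">second comment</d>
</i>"""


def extract_danmaku(xml_text):
    """Return the text of every non-empty <d> element, in document order."""
    root = ET.fromstring(xml_text)
    # An empty <d> has text None, so it is skipped here
    return [d.text for d in root.iter('d') if d.text]


print(extract_danmaku(SAMPLE_XML))
```

Skipping empty elements matters because writing `None` to a file raises a `TypeError`.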
import requests
from bs4 import BeautifulSoup

# Spoof the User-Agent so the request looks like it comes from a browser
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.66',
"cookie": "_uuid=A366B5AD-0770-4D1E-F71B-2587760CAC6094820infoc; buvid3=8F1DB121-7BFB-4923-B5FA-9306898396A3143073infoc; sid=joozhe7k; DedeUserID=475936847; DedeUserID__ckMd5=ad02dfc55e996305; SESSDATA=988c9033%2C1613607061%2C29c31*81; bili_jct=974f12a39465683da26ee0da6ac4f5e1; rpdid=|(YuJ~|kJkk0J'ulm)|ll|)l; blackside_state=1; CURRENT_FNVAL=80; LIVE_BUVID=AUTO8615998250958107; fingerprint3=4517ff2ee6999d14f1b6c58b6b8256c3; fingerprint=00c1dd6c5cd06dc20c37736594a5e450; buivd_fp=8F1DB121-7BFB-4923-B5FA-9306898396A3143073infoc; buvid_fp_plain=8F1DB121-7BFB-4923-B5FA-9306898396A3143073infoc; fingerprint_s=7f4554ba1eba2e3390474eb2c577c79d; CURRENT_QUALITY=0; PVID=1; bsource=search_sougo; bfe_id=fdfaf33a01b88dd4692ca80f00c2de7f"
}
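# API endpoint
url = 'https://api.bilibili.com/x/v2/dm/history?type=1&oid=260418892&date=2021-01-04'
# Send the request
response = requests.get(url=url, headers=headers)
# Let requests guess the encoding from the response body
response.encoding = response.apparent_encoding
# Get the response text
content = response.text
# Print the response text
# print(content)
# "Make the soup", as bs4 parsing is often called
soup = BeautifulSoup(content, 'lxml')
# Find all <d> tags
d_list = soup.find_all('d')
# Print the list of <d> tags
# print(d_list)
# List that collects the danmaku text
dm_list = []
# Iterate over the <d> tags and read each tag's content
for d in d_list:
    # Append each danmaku to dm_list; get_text() returns '' instead of None for an empty tag
    dm_list.append(d.get_text())
# Persist the results to disk
with open('./解放西弹幕.txt', 'w', encoding='utf-8') as f:
    for dm in dm_list:
        f.write(dm)
        f.write('\n')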
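The cookie in the headers above belongs to one logged-in session, so anyone reusing the script needs to supply their own. One possible way to keep it out of the source file is to read it from an environment variable. This is only a sketch: the variable name `BILI_COOKIE` and the helper `build_headers` are hypothetical, not something Bilibili or requests defines.

```python
import os

# Same User-Agent string as in the script above
USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.66'
)


def build_headers(env_var='BILI_COOKIE'):
    """Hypothetical helper: build request headers with a cookie from the environment."""
    cookie = os.environ.get(env_var)
    if not cookie:
        # Fail early with a clear message instead of sending a request without a cookie
        raise RuntimeError(f'Set the {env_var} environment variable to your own cookie string')
    return {'user-agent': USER_AGENT, 'cookie': cookie}
```

The returned dictionary can be passed as `headers=` to `requests.get` in place of the hard-coded one.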
import requests
from bs4 import BeautifulSoup
import pandas


def get_info(date):
    # Spoof the User-Agent
    # Fill in the cookie from your own logged-in Bilibili session: find it, then copy and paste it here
    headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.66',
"cookie": "_uuid=A366B5AD-0770-4D1E-F71B-2587760CAC6094820infoc; buvid3=8F1DB121-7BFB-4923-B5FA-9306898396A3143073infoc; sid=joozhe7k; DedeUserID=475936847; DedeUserID__ckMd5=ad02dfc55e996305; SESSDATA=988c9033%2C1613607061%2C29c31*81; bili_jct=974f12a39465683da26ee0da6ac4f5e1; rpdid=|(YuJ~|kJkk0J'ulm)|ll|)l; blackside_state=1; CURRENT_FNVAL=80; LIVE_BUVID=AUTO8615998250958107; fingerprint3=4517ff2ee6999d14f1b6c58b6b8256c3; fingerprint=00c1dd6c5cd06dc20c37736594a5e450; buivd_fp=8F1DB121-7BFB-4923-B5FA-9306898396A3143073infoc; buvid_fp_plain=8F1DB121-7BFB-4923-B5FA-9306898396A3143073infoc; fingerprint_s=7f4554ba1eba2e3390474eb2c577c79d; CURRENT_QUALITY=0; PVID=1; bsource=search_sougo; bfe_id=fdfaf33a01b88dd4692ca80f00c2de7f"
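    }
    # API endpoint, with the date filled in
    url = 'https://api.bilibili.com/x/v2/dm/history?type=1&oid=260418892&date={}'.format(date)
    print(url)
    # Send the request
    response = requests.get(url=url, headers=headers)
    # Let requests guess the encoding from the response body
    response.encoding = response.apparent_encoding
    # Get the response text
    content = response.text
    # Print the response text
    # print(content)
    # "Make the soup", as bs4 parsing is often called
    soup = BeautifulSoup(content, 'lxml')
    # Find all <d> tags
    d_list = soup.find_all('d')
    # Print the list of <d> tags
    # print(d_list)
    # List that collects the danmaku text
    dm_list = []
    # Iterate over the <d> tags and read each tag's content
    for d in d_list:
        # Append each danmaku to dm_list; get_text() returns '' instead of None for an empty tag
        dm_list.append(d.get_text())
    # Persist the results; mode 'a' appends, so every date ends up in the same file
    with open('./解放西弹幕_all.txt', 'a', encoding='utf-8') as f:
        for dm in dm_list:
            f.write(dm)
            f.write('\n')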


if __name__ == '__main__':
    date_start = input('Enter the start date, in the format 2021-01-04: ')
    date_end = input('Enter the end date, in the format 2021-01-07: ')
    # Build one entry per day and format each as e.g. 2021-01-04
    date_list = pandas.date_range(start=date_start, end=date_end).strftime("%Y-%m-%d")
    for date in date_list:
        get_info(date)

'''
Sample run:
Enter the start date, in the format 2021-01-04: 2021-01-04
Enter the end date, in the format 2021-01-07: 2021-01-08
https://api.bilibili.com/x/v2/dm/history?type=1&oid=260418892&date=2021-01-04
https://api.bilibili.com/x/v2/dm/history?type=1&oid=260418892&date=2021-01-05
https://api.bilibili.com/x/v2/dm/history?type=1&oid=260418892&date=2021-01-06
https://api.bilibili.com/x/v2/dm/history?type=1&oid=260418892&date=2021-01-07
https://api.bilibili.com/x/v2/dm/history?type=1&oid=260418892&date=2021-01-08
Process finished with exit code 0
'''
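The script above uses pandas only to produce the list of dates. If installing pandas just for that feels heavy, a standard-library version might look like the sketch below. The helper name `daily_dates` is made up for illustration, and like the pandas call it includes both the start and the end date.

```python
from datetime import date, timedelta


def daily_dates(start, end):
    """Hypothetical helper: list every day from start to end (inclusive) as YYYY-MM-DD strings."""
    first = date.fromisoformat(start)
    last = date.fromisoformat(end)
    span = (last - first).days
    # range(span + 1) is empty when end is before start, so the result is then []
    return [(first + timedelta(days=i)).isoformat() for i in range(span + 1)]


for day in daily_dates('2021-01-04', '2021-01-08'):
    print(day)
```

Each string it yields can be passed straight to `get_info`.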
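That is the whole process of scraping the danmaku for the documentary "守护解放西". If you enjoyed it, feel free to leave a like before you go!
On this planet, you matter, so cherish what makes you precious! ~~~夜斗小神社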