B站番剧的爬取和普通视频有所不同,下面是我爬取刺客伍六七的方法
像这种视频类的url不会再页面源代码里,但是我们可以看看视频的名字能不能找到。
我们可以看到,在页面源代码中,我们可以找到视频的名字。然后,我就用xpath的方法将这个名字给提取了出来。
def get_name(url):
resp = requests.get(url)
html = etree.HTML(resp.text)
name = html.xpath('/html/body/div[2]/div[1]/div[5]/h1/text()')
resp.close()
return name
当然,大家也可以尝试其他方法,比如正则。
并且,它的这个响应也是这种呈现乱码的状态,我们基本可以判断,这个可能就是我们想要的包。于是我就爬了这几个包,发现,这些包的内容都是一个完整的视频,但是30032和30280这两种包,一个是纯视频,没有声音,一个是纯声音,没有视频(mp3)。所以爬下来的东西还需要后期的合成。
我们虽然找到了视频的url,但是这个url是我们在抓包工具里人工找的,但是我们在爬取视频的时候想要让程序自己获取url并且下载视频。所以我继续寻找。后来找到了这个。
这个包里有mp4,还有高清等字样,说明很有可能就是储存视频url的包,再看它的响应
可以看到区其中有很多之前查找到的类似的url,也能找到之前找到的一模一样的url,所以可以断定,这个就是我们要找的包。这里面m4s之前的数字可能代表的就是不同清晰度的视频或者音频。
因为这个不是html所以我就用正则将我要的url给提取了出来。
def get_mp34(video_url,avid,bvid,cid,url_i):
header = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36 Edg/97.0.1072.55',
'referer': 'https://www.bilibili.com/bangumi/play/ep374151?from=search&seid=6944554235287286512&spm_id_from=333.337.0.0',
"cookie": "_uuid=9D4E10FEF-4CE4-3FD7-CE107-4D1F438110A5229945infoc; buvid3=F94E7751-DA48-4F97-89C9-5A4EE9CF1AA4148817infoc; sid=6yym998l; fingerprint=a191a4892153b18f1c33048baff8c69d; buvid_fp=F94E7751-DA48-4F97-89C9-5A4EE9CF1AA4148817infoc; buvid_fp_plain=568F8F93-8E03-7122-BAF7-DCC824ECB0B572866infoc; DedeUserID=2036429689; DedeUserID__ckMd5=f4d43c2865058f15; SESSDATA=f2f4e5d6%2C1653270738%2C3f3ec*b1; bili_jct=d2a07423b8512e9456886a800a45912d; video_page_version=v_old_home; blackside_state=1; rpdid=|(J~l|mlYku|0J'uYJ~~u)~)J; PVID=2; bp_video_offset_2036429689=605053522121801357; CURRENT_BLACKGAP=1; CURRENT_QUALITY=0; i-wanna-go-back=-1; b_ut=5; innersign=0; b_lsid=3BE3410A5_17E327FFFAC; bsource=search_360; CURRENT_FNVAL=80"
}
data = {
'avid': avid,
'bvid': bvid,
'cid': cid,
'qn': 0,
'fnver': 0,
'fnval': 2000,
'fourk': 1,
'ep_id': url_i,
'session': '872c81527c28e404d27d7487ef36e110'
}
resp = requests.get(video_url, headers=header, data=data)
list = resp.text
obj1 = re.compile(r'"codecs":"avc1.64001F","base_url":"(?P.*?)",' , re.S)
obj2 = re.compile(r'"codecs":"mp4a.40.2","base_url":"http(?P.*?)30280(?P.*?)80000000' )
baseurl1 = obj1.finditer(list)
baseurl2 = obj2.finditer(list)
for it in baseurl1:
mp4 = it.group('mp4')
for it in baseurl2:
key = it.group('key')
name = it.group('mp3')
mp3 = 'http'+key+'30280'+name+'80000000'
resp.close()
return mp3,mp4
因为这里面有很多个url但是我只要了一对匹配的,所以用正则匹配的时候有点麻烦,但是最终还是提取出来了。
经过上面的操作,已经可以下载视频了,但是还是没有达到我们想要的效果,我们还是需要在抓包工具中查找包,才能拿到视频。但是我们发现带有视频url的包有除了几个关键参数,其他的都一样,所以我们接下来要彻底丢掉人工找包,就需要这几个参数。
我发现在这个带有视频url的包的url中不一样的只有avid,bvid,cid,ep_id, session 不一样,但是 session 每一次都不一样(应该是某种加密方式或是别的什么的,反正我不是很懂,但又不影响),所以我们只需要找到这几个参数就好。而且可以发现,
ep_id我们可以轻松找到,而且,在刺客伍六七中这个参数是递增的,所以我们可以通过这个来得到其他的参数
后来我找到了这个。
嗯。。。其实在这个里面找视频名字也是可以的(后来才发现),而且这里面有所有这一季的(包括后面的)所有我们要的参数,我只拿了这一集的。
def get_key(referer, url):
header = {
'referer': referer,
"cookie": "_uuid=9D4E10FEF-4CE4-3FD7-CE107-4D1F438110A5229945infoc; buvid3=F94E7751-DA48-4F97-89C9-5A4EE9CF1AA4148817infoc; buvid_fp=F94E7751-DA48-4F97-89C9-5A4EE9CF1AA4148817infoc; video_page_version=v_old_home; blackside_state=1; rpdid=|(J~l|mlYku|0J'uYJ~~u)~)J; CURRENT_BLACKGAP=1; CURRENT_QUALITY=0; CURRENT_FNVAL=80; innersign=0; b_lsid=1FCBBAF8_17E38C696A6; bsource=search_360; PVID=1; buvid_fp_plain=F94E7751-DA48-4F97-89C9-5A4EE9CF1AA4148817infoc; SESSDATA=68181722%2C1657187087%2C2b3ad%2A11; bili_jct=0e3de1c40e32e00990ca7fb365a3350e; DedeUserID=2036429689; DedeUserID__ckMd5=f4d43c2865058f15; sid=ccvhr5mv; i-wanna-go-back=-1; b_ut=5; fingerprint3=a284a6f2bcfcdf907cfb890ae01472e1; fingerprint=bb4d7ec794b31153738097e6b58d32d4; fingerprint_s=03d24d5903476f4db97405b6daaf39c9; bp_video_offset_2036429689=612909923547040740; bp_t_offset_2036429689=612909923547040740",
"origin": "https://www.bilibili.com",
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36 Edg/97.0.1072.55'
}
resp = requests.get(url, headers=header)
list = resp.text
obj = re.compile(
r'"id":1,"name":"中国大陆"}],"bkg_cover":"","cover":"http://i0.hdslb.com/bfs/bangumi/image/d179871a635e9c908a53be6580714ce3c6fee5ba.jpg","episodes":.*?"aid":(?P.*?),"badge":"","badge_info":{"bg_color":"#FB7299","bg_color_night":"#BB5B76","text":""},"badge_type":0,"bvid":"(?P.*?)","cid":(?P.*?),"cover":"' , re.S)
result = obj.finditer(list)
for it in result:
avid = it.group("avid")
bvid = it.group("bvid")
cid = it.group("cid")
resp.close()
return avid,bvid,cid
这样我们就达到我们的想法了。
因为我才开始学这个东西,还不会用线程池,所以就用最简单的方法下载了。嗯嗯。。。关于mp3,mp4的合成,大家有没有什么好的方法?
def get_name(url):
resp = requests.get(url)
html = etree.HTML(resp.text)
name = html.xpath('/html/body/div[2]/div[1]/div[5]/h1/text()')
resp.close()
return name
def get_mp34(video_url,avid,bvid,cid,url_i):
header = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36 Edg/97.0.1072.55',
'referer': 'https://www.bilibili.com/bangumi/play/ep374151?from=search&seid=6944554235287286512&spm_id_from=333.337.0.0',
"cookie": "_uuid=9D4E10FEF-4CE4-3FD7-CE107-4D1F438110A5229945infoc; buvid3=F94E7751-DA48-4F97-89C9-5A4EE9CF1AA4148817infoc; sid=6yym998l; fingerprint=a191a4892153b18f1c33048baff8c69d; buvid_fp=F94E7751-DA48-4F97-89C9-5A4EE9CF1AA4148817infoc; buvid_fp_plain=568F8F93-8E03-7122-BAF7-DCC824ECB0B572866infoc; DedeUserID=2036429689; DedeUserID__ckMd5=f4d43c2865058f15; SESSDATA=f2f4e5d6%2C1653270738%2C3f3ec*b1; bili_jct=d2a07423b8512e9456886a800a45912d; video_page_version=v_old_home; blackside_state=1; rpdid=|(J~l|mlYku|0J'uYJ~~u)~)J; PVID=2; bp_video_offset_2036429689=605053522121801357; CURRENT_BLACKGAP=1; CURRENT_QUALITY=0; i-wanna-go-back=-1; b_ut=5; innersign=0; b_lsid=3BE3410A5_17E327FFFAC; bsource=search_360; CURRENT_FNVAL=80"
}
data = {
'avid': avid,
'bvid': bvid,
'cid': cid,
'qn': 0,
'fnver': 0,
'fnval': 2000,
'fourk': 1,
'ep_id': url_i,
'session': '872c81527c28e404d27d7487ef36e110'
}
resp = requests.get(video_url, headers=header, data=data)
list = resp.text
obj1 = re.compile(r'"codecs":"avc1.64001F","base_url":"(?P.*?)",' , re.S)
obj2 = re.compile(r'"codecs":"mp4a.40.2","base_url":"http(?P.*?)30280(?P.*?)80000000' )
baseurl1 = obj1.finditer(list)
baseurl2 = obj2.finditer(list)
for it in baseurl1:
mp4 = it.group('mp4')
for it in baseurl2:
key = it.group('key')
name = it.group('mp3')
mp3 = 'http'+key+'30280'+name+'80000000'
resp.close()
return mp3,mp4
def download(mp3_url,mp4_url,url,name):
header = {
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36 Edg/96.0.1054.62",
"referer": url
}
resp_mp4 = requests.get(mp4_url,headers = header)
resp_mp3 = requests.get(mp3_url,headers = header)
mpdata_mp4 = resp_mp4.content
mpdata_mp3 = resp_mp3.content
with open(f"{name}.mp4","wb") as f:
f.write(mpdata_mp4)
print(f"{name}.mp4,over")
with open(f"{name}.mp3","wb") as f:
f.write(mpdata_mp3)
print(f"{name}.mp3,over")
resp_mp4.close()
resp_mp3.close()
def get_key(referer, url):
header = {
'referer': referer,
"cookie": "_uuid=9D4E10FEF-4CE4-3FD7-CE107-4D1F438110A5229945infoc; buvid3=F94E7751-DA48-4F97-89C9-5A4EE9CF1AA4148817infoc; buvid_fp=F94E7751-DA48-4F97-89C9-5A4EE9CF1AA4148817infoc; video_page_version=v_old_home; blackside_state=1; rpdid=|(J~l|mlYku|0J'uYJ~~u)~)J; CURRENT_BLACKGAP=1; CURRENT_QUALITY=0; CURRENT_FNVAL=80; innersign=0; b_lsid=1FCBBAF8_17E38C696A6; bsource=search_360; PVID=1; buvid_fp_plain=F94E7751-DA48-4F97-89C9-5A4EE9CF1AA4148817infoc; SESSDATA=68181722%2C1657187087%2C2b3ad%2A11; bili_jct=0e3de1c40e32e00990ca7fb365a3350e; DedeUserID=2036429689; DedeUserID__ckMd5=f4d43c2865058f15; sid=ccvhr5mv; i-wanna-go-back=-1; b_ut=5; fingerprint3=a284a6f2bcfcdf907cfb890ae01472e1; fingerprint=bb4d7ec794b31153738097e6b58d32d4; fingerprint_s=03d24d5903476f4db97405b6daaf39c9; bp_video_offset_2036429689=612909923547040740; bp_t_offset_2036429689=612909923547040740",
"origin": "https://www.bilibili.com",
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36 Edg/97.0.1072.55'
}
resp = requests.get(url, headers=header)
list = resp.text
obj = re.compile(
r'"id":1,"name":"中国大陆"}],"bkg_cover":"","cover":"http://i0.hdslb.com/bfs/bangumi/image/d179871a635e9c908a53be6580714ce3c6fee5ba.jpg","episodes":.*?"aid":(?P.*?),"badge":"","badge_info":{"bg_color":"#FB7299","bg_color_night":"#BB5B76","text":""},"badge_type":0,"bvid":"(?P.*?)","cid":(?P.*?),"cover":"' , re.S)
result = obj.finditer(list)
for it in result:
avid = it.group("avid")
bvid = it.group("bvid")
cid = it.group("cid")
resp.close()
return avid,bvid,cid
if __name__ == '__main__':
for i in range(10):
url_i = i+288244
url = f'https://www.bilibili.com/bangumi/play/ep{url_i}?from=search&seid=6944554235287286512&spm_id_from=333.337.0.0'
key_url = f'https://api.bilibili.com/pgc/view/web/season?ep_id={url_i}'
avid,bvid,cid = get_key(url,key_url)
video_url = f'https://api.bilibili.com/pgc/player/web/playurl?avid={avid}&bvid={bvid}&cid={cid}&qn=0&fnver=0&fnval=2000&fourk=1&ep_id={url_i}&session=6b5d621e05942b58ba10f42302543135'
name = get_name(url)
mp3_url,mp4_url = get_mp34(video_url,avid,bvid,cid,url_i)
download(mp3_url,mp4_url,url,name)
time.sleep(5)
print('over!!!')