1. Preface
2. Automation analysis
3. Crawling a single video
4. Crawling videos in bulk
5. Summary
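Preface
Every time I dig through material and research something, I end up discovering something new [if only I knew more]. This time, prompted by some automated-crawling work, I took another look at Bilibili's m4s files and, sure enough, found some useful things. The main purpose of this post, though, is to record the whole process of searching and investigating; think of it as a record of my thinking. [I'm not great at writing automated crawlers, I don't really need one, and there already seem to be plenty of tools, emmmm…]
Additional notes
I have already written two posts, one on crawling Bilibili's flv videos and another on crawling m4s videos. That raises a question: since Bilibili moved from flv to m4s, where is the dividing point? Some people may assume that every video Bilibili serves has been switched to the m4s segmented stream format, but that is not the case. In practice the split seems to fall around a certain point in time [feel free to prove me wrong], with flv on one side and m4s on the other.
So I used bisection to find the dividing point.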
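av: 30504429, 20180827数据分析
av: 30504430, 【iKON】「自存」那些有艺术感的video
The cutoff date works out to 2018-08-28.
At first I took this to be the dividing point, but the later av30504450 turned out to be flv again, so I can only tentatively conclude that the boundary fluctuates within some range, probably a transition period. Roughly speaking, videos uploaded before this one can be treated as flv, and those after it as m4s.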
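The bisection itself can be sketched as follows. This is only an illustration under assumptions: `find_first_m4s` and `fake_is_m4s` are hypothetical names, the predicate is a stand-in rather than a real check against Bilibili, and the search assumes the format flips exactly once, which the av30504450 counterexample shows is only approximately true.

```python
from typing import Callable


def find_first_m4s(lo: int, hi: int, is_m4s: Callable[[int], bool]) -> int:
    """Return the smallest av number in (lo, hi] for which is_m4s is True.

    Assumes is_m4s(lo) is False, is_m4s(hi) is True, and that the answer
    flips only once in between (an idealisation of the real data).
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_m4s(mid):
            hi = mid   # the boundary is at mid or before it
        else:
            lo = mid   # the boundary is after mid
    return hi


def fake_is_m4s(av: int) -> bool:
    # Stand-in predicate for demonstration only; the threshold is illustrative.
    # A real check would have to inspect what the video page actually serves.
    return av >= 30504430


print(find_first_m4s(1, 66476652, fake_is_m4s))
```

Because the real boundary is fuzzy, a result like this is better read as "somewhere around here" than as an exact cutoff.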
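Main content
As before, open the developer tools with F12, refresh the page, and find the relevant packet
Since our goal is to crawl Bilibili videos automatically, the key is how to obtain the URL programmatically, rather than opening the page, capturing packets, picking out the right one, and then downloading the video. The two are different things.
Here is the URL of this packet: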
https://cn-zjwz3-dx-v-11.acgvideo.com/upgcxcode/80/78/115287880/115287880-1-30080.m4s?expires=1567849800&platform=pc&ssig=Fp1CahgHwsUBCEd45Yo5-Q&oi=2085381866&trid=4d431eaebd8e4536a416728143ba73ebu&nfc=1&nfb=maPYqpoel5MI3qOUX6YpRA==&mid=100938015
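Look at the parameters submitted with the m4s file's URL request
From these, the most we can tell is that expires is a timestamp, platform=pc indicates a PC client, and mid is presumably my own UID. The rest are unknown.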
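To make the parameters easier to inspect, the query string can be split with Python's standard `urllib.parse`. This is a small sketch using the packet URL above; the interpretation of `expires` as a Unix timestamp in seconds is my assumption, based on its magnitude.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qsl, urlsplit

M4S_URL = (
    "https://cn-zjwz3-dx-v-11.acgvideo.com/upgcxcode/80/78/115287880/"
    "115287880-1-30080.m4s?expires=1567849800&platform=pc"
    "&ssig=Fp1CahgHwsUBCEd45Yo5-Q&oi=2085381866"
    "&trid=4d431eaebd8e4536a416728143ba73ebu&nfc=1"
    "&nfb=maPYqpoel5MI3qOUX6YpRA==&mid=100938015"
)

# Split the URL and turn the query string into a name -> value mapping
params = dict(parse_qsl(urlsplit(M4S_URL).query))
for name, value in params.items():
    print("{:>8} = {}".format(name, value))

# Assumption: expires is a Unix timestamp in seconds
expires_at = datetime.fromtimestamp(int(params["expires"]), tz=timezone.utc)
print("expires at (UTC):", expires_at.isoformat())
```

The timestamp decodes to a moment shortly after the capture, which fits the idea that these links are short-lived and cannot simply be saved for later.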
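Format the JSON data and analyze this packet further
It looks like the m4s links we need are right in there. If we can solve the problem of obtaining this JSON packet's URL, isn't the whole problem solved?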
{
"code": 0,
"message": "0",
"ttl": 1,
"data": {
"from": "local",
"result": "suee",
"message": "",
"quality": 80,
"format": "flv",
"timelength": 229291,
"accept_format": "hdflv2,flv,flv720,flv480,flv360",
"accept_description": ["高清 1080P+", "高清 1080P", "高清 720P", "清晰 480P", "流畅 360P"],
"accept_quality": [112, 80, 64, 32, 16],
"video_codecid": 7,
"seek_param": "start",
"seek_type": "offset",
"dash": {
"duration": 230,
"minBufferTime": 1.5,
"min_buffer_time": 1.5,
"video": [{
"id": 80,
"baseUrl": "http://cn-zjwz3-dx-v-11.acgvideo.com/upgcxcode/80/78/115287880/115287880-1-30080.m4s?expires=1567850700\u0026platform=pc\u0026ssig=S3R6tNYzukQZ6D4M3w-k8g\u0026oi=2085381866\u0026trid=8e7b80f4881c4966af032947f69736ceu\u0026nfc=1\u0026nfb=maPYqpoel5MI3qOUX6YpRA==\u0026mid=100938015",
"base_url": "http://cn-zjwz3-dx-v-11.acgvideo.com/upgcxcode/80/78/115287880/115287880-1-30080.m4s?expires=1567850700\u0026platform=pc\u0026ssig=S3R6tNYzukQZ6D4M3w-k8g\u0026oi=2085381866\u0026trid=8e7b80f4881c4966af032947f69736ceu\u0026nfc=1\u0026nfb=maPYqpoel5MI3qOUX6YpRA==\u0026mid=100938015",
"backupUrl": null,
"backup_url": null,
"bandwidth": 1447135,
"mimeType": "video/mp4",
"mime_type": "video/mp4",
"codecs": "avc1.640028",
"width": 1920,
"height": 1080,
"frameRate": "16000/672",
"frame_rate": "16000/672",
"sar": "1:1",
"startWithSap": 1,
"start_with_sap": 1,
"SegmentBase": {
"Initialization": "0-975",
"indexRange": "976-1559"
},
"segment_base": {
"initialization": "0-975",
"index_range": "976-1559"
},
"codecid": 7
}, {
"id": 64,
"baseUrl": "http://cn-zjwz3-dx-v-01.acgvideo.com/upgcxcode/80/78/115287880/115287880-1-30064.m4s?expires=1567850700\u0026platform=pc\u0026ssig=DCgoPxhgaq52RqUXD07bcw\u0026oi=2085381866\u0026trid=8e7b80f4881c4966af032947f69736ceu\u0026nfc=1\u0026nfb=maPYqpoel5MI3qOUX6YpRA==\u0026mid=100938015",
"base_url": "http://cn-zjwz3-dx-v-01.acgvideo.com/upgcxcode/80/78/115287880/115287880-1-30064.m4s?expires=1567850700\u0026platform=pc\u0026ssig=DCgoPxhgaq52RqUXD07bcw\u0026oi=2085381866\u0026trid=8e7b80f4881c4966af032947f69736ceu\u0026nfc=1\u0026nfb=maPYqpoel5MI3qOUX6YpRA==\u0026mid=100938015",
"backupUrl": null,
"backup_url": null,
"bandwidth": 1009346,
"mimeType": "video/mp4",
"mime_type": "video/mp4",
"codecs": "avc1.64001F",
"width": 1280,
"height": 720,
"frameRate": "16000/672",
"frame_rate": "16000/672",
"sar": "1:1",
"startWithSap": 1,
"start_with_sap": 1,
"SegmentBase": {
"Initialization": "0-974",
"indexRange": "975-1558"
},
"segment_base": {
"initialization": "0-974",
"index_range": "975-1558"
},
"codecid": 7
}, {
"id": 32,
"baseUrl": "http://cn-zjwz3-dx-v-02.acgvideo.com/upgcxcode/80/78/115287880/115287880-1-30032.m4s?expires=1567850700\u0026platform=pc\u0026ssig=f9gUrOa8UYiUpNhRNCc89A\u0026oi=2085381866\u0026trid=8e7b80f4881c4966af032947f69736ceu\u0026nfc=1\u0026nfb=maPYqpoel5MI3qOUX6YpRA==\u0026mid=100938015",
"base_url": "http://cn-zjwz3-dx-v-02.acgvideo.com/upgcxcode/80/78/115287880/115287880-1-30032.m4s?expires=1567850700\u0026platform=pc\u0026ssig=f9gUrOa8UYiUpNhRNCc89A\u0026oi=2085381866\u0026trid=8e7b80f4881c4966af032947f69736ceu\u0026nfc=1\u0026nfb=maPYqpoel5MI3qOUX6YpRA==\u0026mid=100938015",
"backupUrl": null,
"backup_url": null,
"bandwidth": 630670,
"mimeType": "video/mp4",
"mime_type": "video/mp4",
"codecs": "avc1.64001E",
"width": 852,
"height": 480,
"frameRate": "16000/672",
"frame_rate": "16000/672",
"sar": "1:1",
"startWithSap": 1,
"start_with_sap": 1,
"SegmentBase": {
"Initialization": "0-974",
"indexRange": "975-1558"
},
"segment_base": {
"initialization": "0-974",
"index_range": "975-1558"
},
"codecid": 7
}, {
"id": 16,
"baseUrl": "http://cn-zjwz3-dx-v-11.acgvideo.com/upgcxcode/80/78/115287880/115287880-1-30015.m4s?expires=1567850700\u0026platform=pc\u0026ssig=yYHw-JZzw0EDglb2kCv3PQ\u0026oi=2085381866\u0026trid=8e7b80f4881c4966af032947f69736ceu\u0026nfc=1\u0026nfb=maPYqpoel5MI3qOUX6YpRA==\u0026mid=100938015",
"base_url": "http://cn-zjwz3-dx-v-11.acgvideo.com/upgcxcode/80/78/115287880/115287880-1-30015.m4s?expires=1567850700\u0026platform=pc\u0026ssig=yYHw-JZzw0EDglb2kCv3PQ\u0026oi=2085381866\u0026trid=8e7b80f4881c4966af032947f69736ceu\u0026nfc=1\u0026nfb=maPYqpoel5MI3qOUX6YpRA==\u0026mid=100938015",
"backupUrl": null,
"backup_url": null,
"bandwidth": 391335,
"mimeType": "video/mp4",
"mime_type": "video/mp4",
"codecs": "avc1.64001E",
"width": 640,
"height": 360,
"frameRate": "16000/672",
"frame_rate": "16000/672",
"sar": "1:1",
"startWithSap": 1,
"start_with_sap": 1,
"SegmentBase": {
"Initialization": "0-974",
"indexRange": "975-1558"
},
"segment_base": {
"initialization": "0-974",
"index_range": "975-1558"
},
"codecid": 7
}],
"audio": [{
"id": 30280,
"baseUrl": "http://cn-zjwz3-dx-v-04.acgvideo.com/upgcxcode/80/78/115287880/115287880-1-30280.m4s?expires=1567850700\u0026platform=pc\u0026ssig=Meu6Rudcq5Yelyt__nKxKQ\u0026oi=2085381866\u0026trid=8e7b80f4881c4966af032947f69736ceu\u0026nfc=1\u0026nfb=maPYqpoel5MI3qOUX6YpRA==\u0026mid=100938015",
"base_url": "http://cn-zjwz3-dx-v-04.acgvideo.com/upgcxcode/80/78/115287880/115287880-1-30280.m4s?expires=1567850700\u0026platform=pc\u0026ssig=Meu6Rudcq5Yelyt__nKxKQ\u0026oi=2085381866\u0026trid=8e7b80f4881c4966af032947f69736ceu\u0026nfc=1\u0026nfb=maPYqpoel5MI3qOUX6YpRA==\u0026mid=100938015",
"backupUrl": null,
"backup_url": null,
"bandwidth": 319254,
"mimeType": "audio/mp4",
"mime_type": "audio/mp4",
"codecs": "mp4a.40.2",
"width": 0,
"height": 0,
"frameRate": "",
"frame_rate": "",
"sar": "",
"startWithSap": 0,
"start_with_sap": 0,
"SegmentBase": {
"Initialization": "0-919",
"indexRange": "920-1503"
},
"segment_base": {
"initialization": "0-919",
"index_range": "920-1503"
},
"codecid": 0
}, {
"id": 30216,
"baseUrl": "http://cn-zjwz3-dx-v-12.acgvideo.com/upgcxcode/80/78/115287880/115287880-1-30216.m4s?expires=1567850700\u0026platform=pc\u0026ssig=F_sQRRXyXKP_E1Og8Y0Kww\u0026oi=2085381866\u0026trid=8e7b80f4881c4966af032947f69736ceu\u0026nfc=1\u0026nfb=maPYqpoel5MI3qOUX6YpRA==\u0026mid=100938015",
"base_url": "http://cn-zjwz3-dx-v-12.acgvideo.com/upgcxcode/80/78/115287880/115287880-1-30216.m4s?expires=1567850700\u0026platform=pc\u0026ssig=F_sQRRXyXKP_E1Og8Y0Kww\u0026oi=2085381866\u0026trid=8e7b80f4881c4966af032947f69736ceu\u0026nfc=1\u0026nfb=maPYqpoel5MI3qOUX6YpRA==\u0026mid=100938015",
"backupUrl": null,
"backup_url": null,
"bandwidth": 67169,
"mimeType": "audio/mp4",
"mime_type": "audio/mp4",
"codecs": "mp4a.40.2",
"width": 0,
"height": 0,
"frameRate": "",
"frame_rate": "",
"sar": "",
"startWithSap": 0,
"start_with_sap": 0,
"SegmentBase": {
"Initialization": "0-919",
"indexRange": "920-1503"
},
"segment_base": {
"initialization": "0-919",
"index_range": "920-1503"
},
"codecid": 0
}]
}
}
}
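Reading the stream links out of a response shaped like the one above might look like the sketch below. `pick_dash_urls` is a hypothetical helper, the sample payload is heavily trimmed with placeholder URLs, and choosing the entry with the highest `bandwidth` is my own selection rule, not something the response prescribes.

```python
def pick_best(streams):
    # Assumption: a higher "bandwidth" means a higher-quality stream
    return max(streams, key=lambda s: s["bandwidth"])


def pick_dash_urls(payload):
    """Return (video_url, audio_url) from a playurl response with a "dash" block."""
    dash = payload["data"]["dash"]
    video = pick_best(dash["video"])
    audio = pick_best(dash["audio"])
    return video["baseUrl"], audio["baseUrl"]


# Trimmed stand-in for the response above; the URLs are placeholders
sample = {
    "code": 0,
    "data": {
        "dash": {
            "video": [
                {"id": 80, "baseUrl": "video-30080.m4s", "bandwidth": 1447135},
                {"id": 64, "baseUrl": "video-30064.m4s", "bandwidth": 1009346},
            ],
            "audio": [
                {"id": 30280, "baseUrl": "audio-30280.m4s", "bandwidth": 319254},
                {"id": 30216, "baseUrl": "audio-30216.m4s", "bandwidth": 67169},
            ],
        }
    },
}

video_url, audio_url = pick_dash_urls(sample)
print(video_url, audio_url)
```

Note that video and audio arrive as separate m4s streams, so both links are needed for a complete file.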
Things are not as simple as we had hoped, though: piecing this URL together is still quite hard.
https://api.bilibili.com/x/player/playurl?avid={avid}&cid={cid}&qn={sp}&type=&otype=json
http://www.bilibili.tv/widget/getPageList?aid={av号}
https://api.bilibili.com/x/player/playurl?avid=9561754&cid=15805564&qn=80&type=&otype=json
Requesting this URL returns:
{
"code": 0,
"message": "0",
"ttl": 1,
"data": {
"from": "local",
"result": "suee",
"message": "",
"quality": 80,
"format": "flv",
"timelength": 187560,
"accept_format": "flv,flv720,flv360",
"accept_description": ["高清 1080P", "高清 720P", "流畅 360P"],
"accept_quality": [80, 64, 16],
"video_codecid": 7,
"seek_param": "start",
"seek_type": "offset",
"durl": [{
"order": 1,
"length": 187560,
"size": 21283844,
"ahead": "EhBW5QA=",
"vhead": "AWQAH//hABxnZAAfrNlAUAW7AWoEBAKAAAADAIAAMgAHjBjLAQAGaOk5yyLA/Pj4AA==",
"url": "http://upos-hz-mirrorks3u.acgvideo.com/upgcxcode/64/55/15805564/15805564-1-80.flv?e=ig8euxZM2rNcNbN3hbUVhoMgnwNBhwdEto8g5X10ugNcXBlqNxHxNEVE5XREto8KqJZHUa6m5J0SqE85tZvEuENvNo8g2ENvNo8i8o859r1qXg8xNEVE5XREto8GuFGv2U7SuxI72X6fTr859r1qXg8gNEVE5XREto8z5JZC2X2gkX5L5F1eTX1jkXlsTXHeux_f2o859IB_\u0026uipk=5\u0026nbs=1\u0026deadline=1567873729\u0026gen=playurl\u0026os=ks3u\u0026oi=2085381866\u0026trid=91e009b48e9c4a58beb341f6d02a4b59u\u0026platform=pc\u0026upsig=f6d8cc11b6b3fd33f01461b47fe0ecaf\u0026uparams=e,uipk,nbs,deadline,gen,os,oi,trid,platform\u0026mid=100938015",
"backup_url": null
}]
}
}
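Since this works for flv, what about m4s? The avid and cid can still be obtained in the same way as above, but the other parameters seen earlier are hard to come by. That is a real problem!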
https://api.bilibili.com/x/player/playurl?avid=66476652&cid=115287880&qn=80&type=&otype=json
Visiting it returns:
{
"code": 0,
"message": "0",
"ttl": 1,
"data": {
"from": "local",
"result": "suee",
"message": "",
"quality": 80,
"format": "flv",
"timelength": 229306,
"accept_format": "hdflv2,flv,flv720,flv480,flv360",
"accept_description": ["高清 1080P+", "高清 1080P", "高清 720P", "清晰 480P", "流畅 360P"],
"accept_quality": [112, 80, 64, 32, 16],
"video_codecid": 7,
"seek_param": "start",
"seek_type": "offset",
"durl": [{
"order": 1,
"length": 229306,
"size": 50769135,
"ahead": "",
"vhead": "",
"url": "http://upos-hz-mirrorks3u.acgvideo.com/upgcxcode/80/78/115287880/115287880-1-80.flv?e=ig8euxZM2rNcNbUBhbUVhoMB7WNBhwdEto8g5X10ugNcXBlqNxHxNEVE5XREto8KqJZHUa6m5J0SqE85tZvEuENvNo8g2ENvNo8i8o859r1qXg8xNEVE5XREto8GuFGv2U7SuxI72X6fTr859r1qXg8gNEVE5XREto8z5JZC2X2gkX5L5F1eTX1jkXlsTXHeux_f2o859IB_\u0026uipk=5\u0026nbs=1\u0026deadline=1567874137\u0026gen=playurl\u0026os=ks3u\u0026oi=2085381866\u0026trid=8f17d24721884478ac9ec0f330bd211cu\u0026platform=pc\u0026upsig=8972303e4863144b7bcddc295faccab2\u0026uparams=e,uipk,nbs,deadline,gen,os,oi,trid,platform\u0026mid=100938015",
"backup_url": null
}]
}
}
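I did not expect it to return the flv video file directly like this. Now that we have an flv file, the rest of the automation becomes easy: build the request to this endpoint, take the link it returns, and download the video. The problem of crawling m4s videos has effectively turned back into the problem of crawling flv videos [I think that's right].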
https://api.bilibili.com/x/player/playurl?avid=66476652&cid=115287880&qn=80&type=&otype=json
https://api.bilibili.com/x/player/playurl
avid=66476652
cid=115287880
qn=80
type=
otype=json
https://api.bilibili.com/x/player/playurl?avid=66476652&cid=115287880
https://api.bilibili.com/x/player/playurl?avid=66476652&cid=115287880&qn=16
"length":229306,"size":13297215
"length":229306,"size":38226097
https://api.bilibili.com/x/player/playurl?avid=66476652&cid=115287880&fnval=1
upos-hz-mirrorkodou.acgvideo.com/upgcxcode/80/78/115287880/115287880-1-64.flv
mirrorcosu.acgvideo.com/upgcxcode/80/78/115287880/115287880-1-6.mp4
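The base URL and parameters listed above can be reassembled with `urllib.parse.urlencode`, which makes it easy to try the variants with and without `qn`. This is a sketch: `build_playurl` is a hypothetical helper, and it only builds the string without making any request or claiming what each parameter does.

```python
from urllib.parse import urlencode

PLAYURL_API = "https://api.bilibili.com/x/player/playurl"


def build_playurl(avid, cid, **extra):
    """Build a playurl request URL; extra keyword arguments become query parameters."""
    params = {"avid": avid, "cid": cid}
    params.update(extra)  # e.g. qn=16 or fnval=1, in the order given
    return PLAYURL_API + "?" + urlencode(params)


print(build_playurl(66476652, 115287880))
print(build_playurl(66476652, 115287880, qn=16))
print(build_playurl(66476652, 115287880, qn=80, type="", otype="json"))
```

Keeping the optional parameters as keyword arguments mirrors the experiments above, where the same avid and cid are queried with different extras.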
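Having got the video URL, it would feel incomplete not to write up the actual download. Once again I use Fiddler, the powerful packet-capture tool for lazy people, as the demonstration. Previously I captured a packet, edited its request headers, resubmitted it, and saved the result locally. This time there is no packet to capture, so the whole request has to be constructed by hand, though that makes little difference.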
http://upos-hz-mirrorks3u.acgvideo.com/upgcxcode/80/78/115287880/115287880-1-80.flv?e=ig8euxZM2rNcNbUBhbUVhoMB7WNBhwdEto8g5X10ugNcXBlqNxHxNEVE5XREto8KqJZHUa6m5J0SqE85tZvEuENvNo8g2ENvNo8i8o859r1qXg8xNEVE5XREto8GuFGv2U7SuxI72X6fTr859r1qXg8gNEVE5XREto8z5JZC2X2gkX5L5F1eTX1jkXlsTXHeux_f2o859IB_\u0026uipk=5\u0026nbs=1\u0026deadline=1567874137\u0026gen=playurl\u0026os=ks3u\u0026oi=2085381866\u0026trid=8f17d24721884478ac9ec0f330bd211cu\u0026platform=pc\u0026upsig=8972303e4863144b7bcddc295faccab2\u0026uparams=e,uipk,nbs,deadline,gen,os,oi,trid,platform\u0026mid=100938015
http://upos-hz-mirrorks3u.acgvideo.com/upgcxcode/80/78/115287880/115287880-1-80.flv?e=ig8euxZM2rNcNbUBhbUVhoMB7WNBhwdEto8g5X10ugNcXBlqNxHxNEVE5XREto8KqJZHUa6m5J0SqE85tZvEuENvNo8g2ENvNo8i8o859r1qXg8xNEVE5XREto8GuFGv2U7SuxI72X6fTr859r1qXg8gNEVE5XREto8z5JZC2X2gkX5L5F1eTX1jkXlsTXHeux_f2o859IB_&uipk=5&nbs=1&deadline=1567874137&gen=playurl&os=ks3u&oi=2085381866&trid=8f17d24721884478ac9ec0f330bd211cu&platform=pc&upsig=8972303e4863144b7bcddc295faccab2&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,platform&mid=100938015
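The two URLs above differ only in that `\u0026` is the JSON escape for `&`. When the response is parsed with a JSON library rather than copied by hand, the escape is decoded automatically, as this small sketch with a shortened placeholder URL shows:

```python
import json

# Raw JSON text as it would appear on the wire; the URL is a shortened placeholder
raw = '{"url": "http://example.invalid/video.flv?uipk=5\\u0026nbs=1\\u0026platform=pc"}'

# json.loads turns the \u0026 escapes back into literal ampersands
url = json.loads(raw)["url"]
print(url)

# When pasting by hand from a raw capture, a plain string replacement does the same
pasted = "uipk=5\\u0026nbs=1"
print(pasted.replace("\\u0026", "&"))
```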
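The best way to crawl videos in bulk is to script the repetitive download task and free up your hands. Here is a simple Python implementation of the download process [anything more complex is beyond me].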
import os
import sys

import requests


class BilibiliCrawler:
    def __init__(self, qn=80, output=''):
        # Initialization: create the output directory under the current directory if requested
        if output:
            output = os.path.join(os.getcwd(), output)
            os.makedirs(output, exist_ok=True)
        self.qn = qn
        self.output = output
        self.cid_url = 'https://api.bilibili.com/x/player/pagelist?aid={}&jsonp=jsonp'
        self.flv_url = 'https://api.bilibili.com/x/player/playurl?avid={}&cid={}&qn={}&type=&otype=json'
        user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'
        # headers1 is used for the API requests
        self.headers1 = {'host': 'api.bilibili.com',
                         'User-Agent': user_agent}
        # headers2 is used for the download; 'host' and 'Referer' are filled in by start()
        self.headers2 = {'host': '',
                         'Origin': 'https://www.bilibili.com',
                         'Referer': '',
                         'User-Agent': user_agent}

    def getCid(self, url):
        # Get the cid, title and duration of the first part of the video
        data = requests.get(url, headers=self.headers1).json()
        detail = data['data'][0]
        cid = detail['cid']
        name = detail['part']
        duration = detail['duration']
        return cid, name, duration

    def getFlv(self, url):
        # Get the length, size and download URL of the first flv segment
        data = requests.get(url, headers=self.headers1).json()
        durl = data['data']['durl'][0]
        size = durl['size']
        url = durl['url']
        length = durl['length']
        return length, size, url

    def download(self, url, filename='None.flv'):
        # Stream the file to disk in chunks while printing the progress
        size = 0
        chunk_size = 1024
        # verify=False is kept from the original version; it skips TLS certificate checks
        response = requests.get(url, headers=self.headers2, stream=True, verify=False)
        if response.status_code != 200:
            print('Download failed')
            return
        # Read content-length only after the status check, since error responses may lack it
        content_size = int(response.headers['content-length'])
        sys.stdout.write(' [File size]: %0.2f MB\n' % (content_size / chunk_size / 1024))
        filename = os.path.join(self.output, filename)
        with open(filename, 'wb') as file:
            for data in response.iter_content(chunk_size=chunk_size):
                file.write(data)
                size += len(data)
                sys.stdout.write(' [Progress]: %.2f%%' % (size / content_size * 100) + '\r')
        print('\n')

    def start(self, av):
        # Look up the cid, resolve the flv URL, then download it
        cid, name, duration = self.getCid(self.cid_url.format(av))
        length, size, flv = self.getFlv(self.flv_url.format(av, cid, self.qn))
        # The download host is the third component of 'http://host/path...'
        self.headers2['host'] = flv.split('/')[2]
        self.headers2['Referer'] = 'https://www.bilibili.com/video/av{}'.format(av)
        filename = name.replace(' ', '_') + '.flv'
        print("name: {0} duration:{1}s".format(filename, duration))
        self.download(flv, filename)


if __name__ == '__main__':
    bilibili = BilibiliCrawler(qn=80, output="download")
    avlist = ['66476652', '66551946']
    for i in avlist:
        bilibili.start(av=i)
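To use it, fill in the av numbers and run it from a console [because it prints a progress indicator; the code may still have bugs, which I'll fix as I find them].
Personally, I think you should crawl videos whichever way you find convenient and enjoyable. There are so many tools available now that you don't have to program it yourself, although doing so is a good way to sharpen your coding skills.
I had meant to write more, but this covers most of it, so I'll stop here and may add to it later if something else comes to mind.
emm… One more thing: video at quality 80 (1080P) also needs login information, namely the Cookie value. If you want the high-definition version, add it to the request headers yourself. I hadn't worked this out last time, sigh!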
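Adding the Cookie might look like the sketch below. `build_headers` is a hypothetical helper, the cookie string is a placeholder to be copied from a logged-in browser session, and I have not verified which cookie fields the server actually checks.

```python
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'


def build_headers(av, cookie=None):
    """Return request headers for one video, with an optional login Cookie."""
    headers = {
        'Referer': 'https://www.bilibili.com/video/av{}'.format(av),
        'User-Agent': USER_AGENT,
    }
    if cookie:
        # Placeholder value; copy the real Cookie header from a logged-in browser
        headers['Cookie'] = cookie
    return headers


anonymous = build_headers('66476652')
logged_in = build_headers('66476652', cookie='name=value')
print(sorted(anonymous), sorted(logged_in))
```

Headers built this way could be passed to `requests.get(..., headers=...)` in place of the fixed dictionaries, keeping the Cookie out of the source code itself.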
Switching from the HTML5 player to the Flash player lets you see the flv link directly.
Fin.