python爬虫B站番剧

python爬虫B站番剧

B站番剧的爬取和普通视频有所不同,下面是我爬取刺客伍六七的方法

一、获取视频名字

像这种视频类的url不会再页面源代码里,但是我们可以看看视频的名字能不能找到。
python爬虫B站番剧_第1张图片

我们可以看到,在页面源代码中,我们可以找到视频的名字。然后,我就用xpath的方法将这个名字给提取了出来。

def get_name(url):

    resp = requests.get(url)
    html = etree.HTML(resp.text)
    name = html.xpath('/html/body/div[2]/div[1]/div[5]/h1/text()')
    resp.close()
    return name

当然,大家也可以尝试其他方法,比如正则。

二、查找视频url

我打开了抓包工具,在里面查找可能有用的包,然后我发现了这个
python爬虫B站番剧_第2张图片

可以看到这个包是m4s(一种视频格式,常见的有mp4等)
python爬虫B站番剧_第3张图片

并且,它的这个响应也是这种呈现乱码的状态,我们基本可以判断,这个可能就是我们想要的包。于是我就爬了这几个包,发现,这些包的内容都是一个完整的视频,但是30032和30280这两种包,一个是纯视频,没有声音,一个是纯声音,没有视频(mp3)。所以爬下来的东西还需要后期的合成。

1、获取视频url

我们虽然找到了视频的url,但是这个url是我们在抓包工具里人工找的,但是我们在爬取视频的时候想要让程序自己获取url并且下载视频。所以我继续寻找。后来找到了这个。

在这里插入图片描述

python爬虫B站番剧_第4张图片

这个包里有mp4,还有高清等字样,说明很有可能就是储存视频url的包,再看它的响应

python爬虫B站番剧_第5张图片

可以看到区其中有很多之前查找到的类似的url,也能找到之前找到的一模一样的url,所以可以断定,这个就是我们要找的包。这里面m4s之前的数字可能代表的就是不同清晰度的视频或者音频。

因为这个不是html所以我就用正则将我要的url给提取了出来。

def get_mp34(video_url,avid,bvid,cid,url_i):

    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36 Edg/97.0.1072.55',
        'referer': 'https://www.bilibili.com/bangumi/play/ep374151?from=search&seid=6944554235287286512&spm_id_from=333.337.0.0',
        "cookie": "_uuid=9D4E10FEF-4CE4-3FD7-CE107-4D1F438110A5229945infoc; buvid3=F94E7751-DA48-4F97-89C9-5A4EE9CF1AA4148817infoc; sid=6yym998l; fingerprint=a191a4892153b18f1c33048baff8c69d; buvid_fp=F94E7751-DA48-4F97-89C9-5A4EE9CF1AA4148817infoc; buvid_fp_plain=568F8F93-8E03-7122-BAF7-DCC824ECB0B572866infoc; DedeUserID=2036429689; DedeUserID__ckMd5=f4d43c2865058f15; SESSDATA=f2f4e5d6%2C1653270738%2C3f3ec*b1; bili_jct=d2a07423b8512e9456886a800a45912d; video_page_version=v_old_home; blackside_state=1; rpdid=|(J~l|mlYku|0J'uYJ~~u)~)J; PVID=2; bp_video_offset_2036429689=605053522121801357; CURRENT_BLACKGAP=1; CURRENT_QUALITY=0; i-wanna-go-back=-1; b_ut=5; innersign=0; b_lsid=3BE3410A5_17E327FFFAC; bsource=search_360; CURRENT_FNVAL=80"
    }

    data = {
        'avid': avid,
        'bvid': bvid,
        'cid': cid,
        'qn': 0,
        'fnver': 0,
        'fnval': 2000,
        'fourk': 1,
        'ep_id': url_i,
        'session': '872c81527c28e404d27d7487ef36e110'
    }

    resp = requests.get(video_url, headers=header, data=data)

    list = resp.text

    obj1 = re.compile(r'"codecs":"avc1.64001F","base_url":"(?P.*?)",', re.S)
    obj2 = re.compile(r'"codecs":"mp4a.40.2","base_url":"http(?P.*?)30280(?P.*?)80000000')

    baseurl1 = obj1.finditer(list)
    baseurl2 = obj2.finditer(list)

    for it in baseurl1:
       mp4 = it.group('mp4')

    for it in baseurl2:
       key = it.group('key')
       name = it.group('mp3')

    mp3 = 'http'+key+'30280'+name+'80000000'

    resp.close()
    return mp3,mp4

因为这里面有很多个url但是我只要了一对匹配的,所以用正则匹配的时候有点麻烦,但是最终还是提取出来了。

2、爬取url参数

经过上面的操作,已经可以下载视频了,但是还是没有达到我们想要的效果,我们还是需要在抓包工具中查找包,才能拿到视频。但是我们发现带有视频url的包有除了几个关键参数,其他的都一样,所以我们接下来要彻底丢掉人工找包,就需要这几个参数。

我发现在这个带有视频url的包的url中不一样的只有avid,bvid,cid,ep_id, session 不一样,但是 session 每一次都不一样(应该是某种加密方式或是别的什么的,反正我不是很懂,但又不影响),所以我们只需要找到这几个参数就好。而且可以发现,

在这里插入图片描述

ep_id我们可以轻松找到,而且,在刺客伍六七中这个参数是递增的,所以我们可以通过这个来得到其他的参数

后来我找到了这个。

python爬虫B站番剧_第6张图片

嗯。。。其实在这个里面找视频名字也是可以的(后来才发现),而且这里面有所有这一季的(包括后面的)所有我们要的参数,我只拿了这一集的。

def get_key(referer, url):
    header = {
        'referer': referer,
        "cookie": "_uuid=9D4E10FEF-4CE4-3FD7-CE107-4D1F438110A5229945infoc; buvid3=F94E7751-DA48-4F97-89C9-5A4EE9CF1AA4148817infoc; buvid_fp=F94E7751-DA48-4F97-89C9-5A4EE9CF1AA4148817infoc; video_page_version=v_old_home; blackside_state=1; rpdid=|(J~l|mlYku|0J'uYJ~~u)~)J; CURRENT_BLACKGAP=1; CURRENT_QUALITY=0; CURRENT_FNVAL=80; innersign=0; b_lsid=1FCBBAF8_17E38C696A6; bsource=search_360; PVID=1; buvid_fp_plain=F94E7751-DA48-4F97-89C9-5A4EE9CF1AA4148817infoc; SESSDATA=68181722%2C1657187087%2C2b3ad%2A11; bili_jct=0e3de1c40e32e00990ca7fb365a3350e; DedeUserID=2036429689; DedeUserID__ckMd5=f4d43c2865058f15; sid=ccvhr5mv; i-wanna-go-back=-1; b_ut=5; fingerprint3=a284a6f2bcfcdf907cfb890ae01472e1; fingerprint=bb4d7ec794b31153738097e6b58d32d4; fingerprint_s=03d24d5903476f4db97405b6daaf39c9; bp_video_offset_2036429689=612909923547040740; bp_t_offset_2036429689=612909923547040740",
        "origin": "https://www.bilibili.com",
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36 Edg/97.0.1072.55'
    }

    resp = requests.get(url, headers=header)

    list = resp.text

    obj = re.compile(
        r'"id":1,"name":"中国大陆"}],"bkg_cover":"","cover":"http://i0.hdslb.com/bfs/bangumi/image/d179871a635e9c908a53be6580714ce3c6fee5ba.jpg","episodes":.*?"aid":(?P.*?),"badge":"","badge_info":{"bg_color":"#FB7299","bg_color_night":"#BB5B76","text":""},"badge_type":0,"bvid":"(?P.*?)","cid":(?P.*?),"cover":"', re.S)
    result = obj.finditer(list)

    for it in result:
        avid = it.group("avid")
        bvid = it.group("bvid")
        cid = it.group("cid")
    resp.close()
   
    return avid,bvid,cid

这样我们就达到我们的想法了。

三、下载视频

因为我才开始学这个东西,还不会用线程池,所以就用最简单的方法下载了。嗯嗯。。。关于mp3,mp4的合成,大家有没有什么好的方法?

代码

def get_name(url):

    resp = requests.get(url)
    html = etree.HTML(resp.text)
    name = html.xpath('/html/body/div[2]/div[1]/div[5]/h1/text()')
    resp.close()
    return name

def get_mp34(video_url,avid,bvid,cid,url_i):

    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36 Edg/97.0.1072.55',
        'referer': 'https://www.bilibili.com/bangumi/play/ep374151?from=search&seid=6944554235287286512&spm_id_from=333.337.0.0',
        "cookie": "_uuid=9D4E10FEF-4CE4-3FD7-CE107-4D1F438110A5229945infoc; buvid3=F94E7751-DA48-4F97-89C9-5A4EE9CF1AA4148817infoc; sid=6yym998l; fingerprint=a191a4892153b18f1c33048baff8c69d; buvid_fp=F94E7751-DA48-4F97-89C9-5A4EE9CF1AA4148817infoc; buvid_fp_plain=568F8F93-8E03-7122-BAF7-DCC824ECB0B572866infoc; DedeUserID=2036429689; DedeUserID__ckMd5=f4d43c2865058f15; SESSDATA=f2f4e5d6%2C1653270738%2C3f3ec*b1; bili_jct=d2a07423b8512e9456886a800a45912d; video_page_version=v_old_home; blackside_state=1; rpdid=|(J~l|mlYku|0J'uYJ~~u)~)J; PVID=2; bp_video_offset_2036429689=605053522121801357; CURRENT_BLACKGAP=1; CURRENT_QUALITY=0; i-wanna-go-back=-1; b_ut=5; innersign=0; b_lsid=3BE3410A5_17E327FFFAC; bsource=search_360; CURRENT_FNVAL=80"
    }

    data = {
        'avid': avid,
        'bvid': bvid,
        'cid': cid,
        'qn': 0,
        'fnver': 0,
        'fnval': 2000,
        'fourk': 1,
        'ep_id': url_i,
        'session': '872c81527c28e404d27d7487ef36e110'
    }

    resp = requests.get(video_url, headers=header, data=data)

    list = resp.text

    obj1 = re.compile(r'"codecs":"avc1.64001F","base_url":"(?P.*?)",', re.S)
    obj2 = re.compile(r'"codecs":"mp4a.40.2","base_url":"http(?P.*?)30280(?P.*?)80000000')

    baseurl1 = obj1.finditer(list)
    baseurl2 = obj2.finditer(list)

    for it in baseurl1:
       mp4 = it.group('mp4')

    for it in baseurl2:
       key = it.group('key')
       name = it.group('mp3')

    mp3 = 'http'+key+'30280'+name+'80000000'

    resp.close()
    return mp3,mp4
   
def download(mp3_url,mp4_url,url,name):
    header = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36 Edg/96.0.1054.62",
        "referer": url
    }

    resp_mp4 = requests.get(mp4_url,headers = header)
    resp_mp3 = requests.get(mp3_url,headers = header)

    mpdata_mp4 = resp_mp4.content
    mpdata_mp3 = resp_mp3.content

    with open(f"{name}.mp4","wb") as f:
        f.write(mpdata_mp4)
    print(f"{name}.mp4,over")

    with open(f"{name}.mp3","wb") as f:
       f.write(mpdata_mp3)
    print(f"{name}.mp3,over")
    
    resp_mp4.close()

    resp_mp3.close()


def get_key(referer, url):
    header = {
        'referer': referer,
        "cookie": "_uuid=9D4E10FEF-4CE4-3FD7-CE107-4D1F438110A5229945infoc; buvid3=F94E7751-DA48-4F97-89C9-5A4EE9CF1AA4148817infoc; buvid_fp=F94E7751-DA48-4F97-89C9-5A4EE9CF1AA4148817infoc; video_page_version=v_old_home; blackside_state=1; rpdid=|(J~l|mlYku|0J'uYJ~~u)~)J; CURRENT_BLACKGAP=1; CURRENT_QUALITY=0; CURRENT_FNVAL=80; innersign=0; b_lsid=1FCBBAF8_17E38C696A6; bsource=search_360; PVID=1; buvid_fp_plain=F94E7751-DA48-4F97-89C9-5A4EE9CF1AA4148817infoc; SESSDATA=68181722%2C1657187087%2C2b3ad%2A11; bili_jct=0e3de1c40e32e00990ca7fb365a3350e; DedeUserID=2036429689; DedeUserID__ckMd5=f4d43c2865058f15; sid=ccvhr5mv; i-wanna-go-back=-1; b_ut=5; fingerprint3=a284a6f2bcfcdf907cfb890ae01472e1; fingerprint=bb4d7ec794b31153738097e6b58d32d4; fingerprint_s=03d24d5903476f4db97405b6daaf39c9; bp_video_offset_2036429689=612909923547040740; bp_t_offset_2036429689=612909923547040740",
        "origin": "https://www.bilibili.com",
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36 Edg/97.0.1072.55'
    }

    resp = requests.get(url, headers=header)

    list = resp.text

    obj = re.compile(
        r'"id":1,"name":"中国大陆"}],"bkg_cover":"","cover":"http://i0.hdslb.com/bfs/bangumi/image/d179871a635e9c908a53be6580714ce3c6fee5ba.jpg","episodes":.*?"aid":(?P.*?),"badge":"","badge_info":{"bg_color":"#FB7299","bg_color_night":"#BB5B76","text":""},"badge_type":0,"bvid":"(?P.*?)","cid":(?P.*?),"cover":"', re.S)
    result = obj.finditer(list)

    for it in result:
        avid = it.group("avid")
        bvid = it.group("bvid")
        cid = it.group("cid")
    resp.close()
   
    return avid,bvid,cid

if __name__ == '__main__':
    for i in range(10):
        url_i = i+288244
        url = f'https://www.bilibili.com/bangumi/play/ep{url_i}?from=search&seid=6944554235287286512&spm_id_from=333.337.0.0'
        key_url = f'https://api.bilibili.com/pgc/view/web/season?ep_id={url_i}'
        avid,bvid,cid = get_key(url,key_url)
        video_url = f'https://api.bilibili.com/pgc/player/web/playurl?avid={avid}&bvid={bvid}&cid={cid}&qn=0&fnver=0&fnval=2000&fourk=1&ep_id={url_i}&session=6b5d621e05942b58ba10f42302543135'
        name = get_name(url)
        mp3_url,mp4_url = get_mp34(video_url,avid,bvid,cid,url_i)
        download(mp3_url,mp4_url,url,name)
        time.sleep(5)
    print('over!!!')

你可能感兴趣的:(Python爬虫,python,爬虫)