[Crawler] Scraping the Bilibili daily top-100 uploaders and their video data with Python

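Start by visiting the page we are going to scrape. A word of warning: don't linger. The content is so rich and entertaining that half a day can pass before you notice.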


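[Figure: the first-place entry]
Look at what each entry on this page contains: the title, the play count, the danmaku (bullet comment) count, the uploader's name, and so on.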

As usual, press F12 to see where this data sits in the page source.
[Figure: the list holding the daily top-100 entries]
[Figure: where the data sits inside each item]

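Once the structure is clear, we can start writing the crawler. It needs the libraries below. Only the first two are third-party (pip install beautifulsoup4 requests); the rest ship with Python.

  • BeautifulSoup4 (parses the HTML page)
  • requests (sends the requests)
  • datetime (adds the date to the output files)
  • json (handles JSON data)
  • time (pauses after each loop iteration to reduce the load on the server)
  • os (file operations)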
import requests
from bs4 import BeautifulSoup

url = 'https://www.bilibili.com/ranking/all/0/0/3'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36 Edg/80.0.361.48'}
response = requests.get(url, headers=headers)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
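  1. url is the address of the page we want to pull data from.
  2. headers holds the request headers. Sending a browser User-Agent makes the request look like it comes from a browser and lowers the chance of being flagged as a crawler (to find yours, type about:version into the address bar of Chrome or Edge and look for the User Agent entry).
  3. if response.status_code == 200 means the returned page is parsed only when the request succeeded.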

Now locate the data.

for item in soup.find_all(attrs={"class": "rank-item"}):
    no = item.find(attrs={"class": "num"})  # rank on the daily chart
    title = item.find(attrs={"class": "title"})
    web = item.find(attrs={"class": "detail"})
    web_detail = web.find('a')
    web_detail = web_detail['href'].replace("//space.bilibili.com/", "")  # the uploader's account ID
    tt = title.text.replace(",", "")  # video title, commas removed
    for details in item.find_all(attrs={"class": "data-box"}):  # search inside elements with class data-box
        for detail in details:
            if detail.string is None:
                continue
            value = detail.string  # play count, danmaku count, uploader name

In a CSV file a comma marks the start of the next column, so any commas in the video title tt have to be removed to keep the columns from shifting.
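Stripping commas works, but it alters the title. As an alternative sketch (not what the script in this article does), Python's standard csv module quotes any field that contains a comma, so the title survives intact. The sample row is made up for illustration.

```python
import csv
import io

# Hypothetical sample row: rank, title (containing a comma), play count
row = ["1", "Hello, Bilibili", "123456"]

buffer = io.StringIO()  # stands in for a real file opened with newline=""
writer = csv.writer(buffer)
writer.writerow(row)

# The title is wrapped in quotes, so its comma no longer splits the column
line = buffer.getvalue()

# Reading the line back yields the original three fields
parsed = next(csv.reader(io.StringIO(line)))
```

With a real file, the same writer would be created from a handle opened with newline="" as the csv documentation recommends.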

With all the data located, the next step is to save it to a CSV file.

The complete code:

from bs4 import BeautifulSoup
import requests
import datetime

date = datetime.datetime.now().strftime('%Y-%m-%d')

# Both output directories must already exist.
with open('F:/crawler_data/Bilibili/day_rank/day_rank_' + date + '.csv', 'w', encoding='gb18030', errors='ignore') as file:
    file.write("Rank,Title,Plays,Danmaku count,Uploader name,Uploader ID\n")

    url = 'https://www.bilibili.com/ranking?spm_id_from=333.158.b_7072696d61727950616765546162.3'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36 Edg/80.0.361.48'}
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # stop here if the page could not be fetched
    soup = BeautifulSoup(response.text, 'html.parser')

    # Also save the uploader IDs of the top-100 entries to a txt file, one per line
    with open("F:/crawler_data/Bilibili/rank_id_list/" + date + "日榜up主id.txt", "w") as file_id_list:
        for item in soup.find_all(attrs={"class": "rank-item"}):
            no = item.find(attrs={"class": "num"})  # rank on the daily chart
            title = item.find(attrs={"class": "title"})
            web = item.find(attrs={"class": "detail"})
            web_detail = web.find('a')
            web_detail = web_detail['href'].replace("//space.bilibili.com/", "")  # the uploader's account ID
            file_id_list.write(web_detail + "\n")
            tt = title.text.replace(",", "")  # video title, commas removed
            file.write("{},{}".format(no.text, tt))
            for details in item.find_all(attrs={"class": "data-box"}):
                for detail in details:
                    if detail.string is None:
                        continue
                    file.write(",")
                    file.write("{}".format(detail.string))  # play count, danmaku count, uploader name
            file.write(",")
            file.write("{}".format(web_detail))
            file.write("\n")
            print("---- fetching info for uploader id: " + web_detail + " ----")
    file.write(date)  # last line of the CSV: the date of the crawl
    print("-- done --")
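The script writes to fixed folders under F:/crawler_data and fails if they are missing. A sketch of building the date-stamped path with pathlib and creating the folders first; the base directory here is a temporary folder chosen purely so the example runs anywhere, and the header row is shortened.

```python
import datetime
import tempfile
from pathlib import Path

date = datetime.datetime.now().strftime("%Y-%m-%d")

# Temporary base directory, used only for this example
base = Path(tempfile.mkdtemp())

out_file = base / "Bilibili" / "day_rank" / f"day_rank_{date}.csv"
out_file.parent.mkdir(parents=True, exist_ok=True)  # create missing folders

with open(out_file, "w", encoding="gb18030", errors="ignore") as file:
    file.write("Rank,Title\n")
```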

The scraped data:
[Figure: screenshot of the output]

Analyzing the videos of each uploader on the daily ranking

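While scraping the daily ranking, I saved the ID of every uploader on it to a txt file, for example:
[Figure: sample of the ID file]
At first I tried fetching the uploader's page directly with requests, but the response contained none of the data.
[Figure: the error]
A bit of searching showed that the uploader's personal page loads its data dynamically, so a different approach is needed.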

Press F12 → Network → refresh the page → filter by XHR → look through the requests for these:

[Figure: the XHR request list]
Uploader video list (JSON): https://api.bilibili.com/x/space/arc/search?mid=270308437&pn=1&ps=25&jsonp=jsonp
Uploader follower count (JSON): https://api.bilibili.com/x/relation/stat?vmid=270308437&jsonp=jsonp

The preparation is done. The script uses these imports:

  • import requests
  • import time
  • import datetime
  • import os
  • import json

Read the IDs from the txt file prepared earlier:

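data = []  # IDs of the top-100 uploaders on the daily ranking
with open("F:/crawler_data/Bilibili/rank_id_list/" + date + "日榜up主id.txt", "r") as id_file:  # read from the txt file
    for line in id_file:
        data.append(line.rstrip("\n"))  # drop the trailing newline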
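When reading the IDs it helps to tolerate a missing trailing newline or a blank line at the end of the file. A minimal sketch, with read_ids as a hypothetical helper name:

```python
def read_ids(lines):
    """Return the non-empty, whitespace-stripped entries from an iterable of lines."""
    ids = []
    for line in lines:
        uid = line.strip()  # removes "\n", "\r\n" and stray spaces
        if uid:             # skip blank lines
            ids.append(uid)
    return ids


# Works with an open file as well as with a plain list of strings
sample = ["270308437\n", "\n", "12345"]
ids = read_ids(sample)
```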

Since we are scraping more than one uploader, each ID is substituted into the URLs:

for j in data:
    up_detail = 'https://api.bilibili.com/x/space/acc/info?mid=%s&jsonp=jsonp'%j
    up_fans = 'https://api.bilibili.com/x/relation/stat?vmid=%s&jsonp=jsonp'%j
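The same URLs can also be assembled with urllib.parse.urlencode from the standard library, which takes care of escaping the query parameters. This is only a sketch: build_url is a hypothetical helper, and the endpoints are simply the ones quoted in this article.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.bilibili.com"


def build_url(path, **params):
    """Join an API path and its query parameters into a full URL."""
    return f"{API_ROOT}{path}?{urlencode(params)}"


mid = "270308437"  # example uploader ID taken from the article
up_fans = build_url("/x/relation/stat", vmid=mid, jsonp="jsonp")
up_videos = build_url("/x/space/arc/search", mid=mid, ps=30, pn=1)
```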

The responses are JSON, so parse them with the json library:

response_fans = requests.get(up_fans,headers = headers)
response_detail = requests.get(up_detail,headers = headers)

text_fans = json.loads(response_fans.text)
text_detail = json.loads((response_detail.text))

Get the uploader's follower count:

res_fans = text_fans['data']
follower = str(res_fans['follower'])  # the uploader's follower count
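To see the shape of this lookup without touching the network, here is a sketch that parses a hand-written JSON string. The payload is invented for illustration and only mimics the two keys read here (data and follower).

```python
import json

# Invented payload that mimics only the keys used in this article
sample_text = '{"data": {"follower": 1024}}'

text_fans = json.loads(sample_text)

# dict.get() returns a default instead of raising KeyError when a key is missing
res_fans = text_fans.get("data") or {}
follower = str(res_fans.get("follower", ""))
```

Using get() is a design choice for robustness: if a response lacks the data key, follower ends up as an empty string rather than stopping the whole crawl.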

Get the number of pages of the uploader's videos:

res_page = text_page['data']['page']
page = (res_page['count'] + 29) // 30  # number of pages at 30 videos per page, rounded up
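Counting pages is a ceiling division: 61 videos at 30 per page need 3 pages, while 60 need exactly 2. A small sketch, with page_count as a hypothetical helper name, together with the matching loop range (pn starts at 1, so the upper bound is page_count + 1):

```python
def page_count(total, per_page=30):
    """Number of pages needed to show `total` items, `per_page` at a time."""
    return (total + per_page - 1) // per_page  # integer ceiling division


# pn is 1-based, so iterate from 1 up to and including the last page
pages_for_61 = list(range(1, page_count(61) + 1))
pages_for_60 = list(range(1, page_count(60) + 1))
pages_for_0 = list(range(1, page_count(0) + 1))
```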

Get each video's title, av number, comment count, play count and duration:

res = text['data']['list']['vlist']
for item in res:
    title = str(item['title'])  # video title
    av = str(item['aid'])  # video av number
    comment = str(item['comment'])  # comment count
    play = str(item['play'])  # play count
    video_length = str(item['length'])  # video duration
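Put together, the extraction step can be sketched as a function that turns one parsed response into rows. The payload below is made up and contains only the keys read in this article; video_rows is a hypothetical helper name.

```python
def video_rows(text):
    """Extract (av, title, comment, length, play) string tuples from a parsed response."""
    rows = []
    for item in text["data"]["list"]["vlist"]:
        rows.append((
            str(item["aid"]),                     # av number
            str(item["title"]).replace(",", ""),  # drop commas to keep CSV columns aligned
            str(item["comment"]),                 # comment count
            str(item["length"]),                  # duration
            str(item["play"]),                    # play count
        ))
    return rows


# Invented sample with a single video
sample = {"data": {"list": {"vlist": [
    {"aid": 1, "title": "a,b", "comment": 2, "length": "03:15", "play": 99},
]}}}
rows = video_rows(sample)
```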

The complete code:

import requests
import json
import time
import datetime
import os

date = datetime.datetime.now().strftime('%Y-%m-%d')

data = []  # IDs of the top-100 uploaders on the daily ranking
with open("F:/crawler_data/Bilibili/rank_id_list/" + date + "日榜up主id.txt", "r") as id_file:
    for line in id_file:
        data.append(line.rstrip("\n"))  # drop the trailing newline

path = 'F:/crawler_data/Bilibili/up_detail/' + date + '/'
if not os.path.exists(path):
    os.makedirs(path)
    print("Directory created")

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36 Edg/80.0.361.50'}

for j in data:
    with open(path + j + '.csv', 'w', encoding='gb18030', errors='ignore') as file:
        print("-- scraping videos of uploader id " + j + " --")
        file.write("av number,Title,Comments,Duration,Plays\n")
        up_detail = 'https://api.bilibili.com/x/space/acc/info?mid=%s&jsonp=jsonp' % j
        up_fans = 'https://api.bilibili.com/x/relation/stat?vmid=%s&jsonp=jsonp' % j
        response_fans = requests.get(up_fans, headers=headers)
        response_detail = requests.get(up_detail, headers=headers)

        text_fans = json.loads(response_fans.text)
        text_detail = json.loads(response_detail.text)  # profile info, fetched but not used below

        res_fans = text_fans['data']
        follower = str(res_fans['follower'])  # the uploader's follower count

        url_page = 'https://api.bilibili.com/x/space/arc/search?mid=%s&ps=30&tid=0&pn=1&keyword=&order=pubdate&jsonp=jsonp' % j
        response_page = requests.get(url_page, headers=headers)
        text_page = json.loads(response_page.text)
        res_page = text_page['data']['page']
        page = (res_page['count'] + 29) // 30  # number of pages at 30 videos per page, rounded up

        for i in range(1, page + 1):  # pn starts at 1 and must include the last page
            url = 'https://api.bilibili.com/x/space/arc/search?mid=%s&ps=30&tid=0&pn=%s&keyword=&order=pubdate&jsonp=jsonp' % (j, i)
            response = requests.get(url, headers=headers)
            text = json.loads(response.text)
            res = text['data']['list']['vlist']
            print("------ scraping page " + str(i) + " ------")
            for item in res:
                title = str(item['title']).replace(",", "")  # video title, commas removed for CSV
                av = str(item['aid'])  # video av number
                comment = str(item['comment'])  # comment count
                play = str(item['play'])  # play count
                video_length = str(item['length'])  # video duration
                file.write("{},{},{},{},{}\n".format(av, title, comment, video_length, play))
                print("----- scraping video av" + av + " -----")
        print("----- done -----")
        file.write("{}".format(follower))  # last line of the CSV: follower count
    time.sleep(5)  # pause between uploaders to go easy on the server

The scraped data:
[Figure: screenshot of the output]

Crawling stage: done
