Scraping Bilibili Danmaku and Generating a Word Cloud

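Most of the scraping tutorials I found online rely on endpoints that no longer work, so this time I put one together myself and am keeping it as study notes.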

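I spent a long time hunting for the danmaku, because I kept trying to find the address they are loaded from on the video itself...

It turns out the danmaku are right there on the right-hand side of the page.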


[Image 1: 1.png]

Much of the implementation still reuses what those earlier tutorials did.
The code is as follows:

from wordcloud import WordCloud
import matplotlib.pyplot as plt
import requests
import jieba
from pyquery import PyQuery as pq
from urllib.parse import urlencode
import datetime


def get_html(url):
    # Fetch the URL and return the raw response bytes, or None on failure
    try:
        headers = {
            'Cookie': 'b LIVE_BUVID__ckMd5=7776ad817b9e0091; bp_t_offset_328350021=150073248314016020; _dfcaptcha=29276d4b1897beac8fcc8bb55f8ecdce',
            'Host': 'api.bilibili.com',
            'Origin': 'https://www.bilibili.com',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            # .content is raw bytes, so no text encoding needs to be set here
            return response.content
        else:
            return None
    except requests.RequestException:
        # Only catch network errors instead of swallowing every exception
        print("Connect_Error")
        return None

def get_text(html):
    doc = pq(html)
    items = doc('i d').items()
    for item in items:
        yield item.text()

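def create_date(datestart=None, dateend=None):
    # Build the list of dates, one 'YYYY-MM-DD' string per day

    if datestart is None:
        datestart = '2018-01-01'
    if dateend is None:
        dateend = datetime.datetime.now().strftime('%Y-%m-%d')

    # Convert the strings to datetime objects
    datestart = datetime.datetime.strptime(datestart, '%Y-%m-%d')  # strptime parses a string into a datetime
    dateend = datetime.datetime.strptime(dateend, '%Y-%m-%d')
    date_list = []
    date_list.append(datestart.strftime('%Y-%m-%d'))
    while datestart < dateend:
        # Step forward one day at a time until dateend is reached
        datestart += datetime.timedelta(days=1)
        date_list.append(datestart.strftime('%Y-%m-%d'))
    return date_list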
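
The rest of the script is not shown above, so here is a rough sketch of how the remaining fetch step might be wired up, using only the standard library. It is not the original code. The query parameter names (`type`, `oid`, `date`) and the `base_url` value are assumptions for illustration, and `build_urls` and `parse_danmaku` are hypothetical helper names. The parser assumes the response is XML with each comment in a `<d>` element under a root `<i>` element, which is what the `'i d'` selector in `get_text` suggests.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_urls(base_url, oid, dates):
    """Return one URL per date string.

    The parameter names below are assumptions, not a documented API.
    """
    return [base_url + '?' + urlencode({'type': 1, 'oid': oid, 'date': d})
            for d in dates]


def parse_danmaku(xml_bytes):
    """Yield the text of every non-empty <d> element in the XML document."""
    root = ET.fromstring(xml_bytes)
    for d in root.iter('d'):
        if d.text:
            yield d.text


# A tiny hand-written sample standing in for a real response body.
sample = '<i><d p="1,1">前方高能</d><d p="2,1">哈哈哈</d></i>'.encode('utf-8')

# Placeholder endpoint: replace it with whatever the browser's network panel shows.
urls = build_urls('https://example.invalid/dm', 123, ['2018-01-01', '2018-01-02'])
comments = list(parse_danmaku(sample))
```

In a real run, each URL would be passed to `get_html` and the returned bytes to `parse_danmaku` (or to the PyQuery-based `get_text` above).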

Maybe it would be faster to save the danmaku text to a txt file first and read it back afterwards?
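
A minimal sketch of that caching idea, assuming the danmaku have already been collected into a list of strings. The helper names, the cache file name and the word cloud settings are all illustrative choices, not the original code. The main benefit is probably not raw speed but that the word cloud can be regenerated without hitting the network again. `jieba` and `wordcloud` are imported inside `make_wordcloud` so the caching helpers run even when those packages are not installed.

```python
import os
import tempfile


def save_danmaku(lines, path):
    """Write one danmaku comment per line, UTF-8 encoded."""
    with open(path, 'w', encoding='utf-8') as f:
        for line in lines:
            f.write(line.strip() + '\n')


def load_danmaku(path):
    """Read the cached file back as a single string."""
    with open(path, encoding='utf-8') as f:
        return f.read()


def make_wordcloud(text, font_path, out_path):
    """Segment Chinese text with jieba and save a word cloud image."""
    import jieba
    from wordcloud import WordCloud

    # WordCloud splits on whitespace, so join the jieba tokens with spaces.
    words = ' '.join(jieba.cut(text))
    # font_path must point to a font with CJK glyphs, otherwise the
    # Chinese words are drawn as empty boxes.
    cloud = WordCloud(font_path=font_path, background_color='white',
                      width=800, height=600).generate(words)
    cloud.to_file(out_path)


# Round trip through a temporary cache file (hypothetical file name).
cache = os.path.join(tempfile.gettempdir(), 'danmaku_demo.txt')
save_danmaku(['前方高能', ' 哈哈哈 '], cache)
text = load_danmaku(cache)
```

Calling `make_wordcloud(text, font_path, 'cloud.png')` with the path to a local CJK font would then render the image from the cached text.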
The result:


[Image 2: Figure_1.png, the generated word cloud]
