NetEase Cloud Music Crawler && Data Visualization

Introduction

I came across an article titled "I analyzed 420,000 characters of lyrics with Python to figure out what folk singers are singing about" and thought it looked like fun, so I decided to build something similar myself. That's how this project was born.

Source code:

GitHub: CloudMusic-Crawler

Crawler

The crawler mainly calls NetEase's existing web API. A good reference for this part is NetEase-MusicBox, whose author built a command-line client for NetEase Cloud Music; I tried it and it works quite well. I mostly drew on its api.py module.

# -*- coding: utf-8 -*-
# @Author: GreatV
# @Date: 2017-04-16
# Based on https://github.com/darknessomi/musicbox

import requests

headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Host': 'music.163.com',
    'Referer': 'http://music.163.com/search/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'
}

cookies = {'appver': '1.5.2'}

def get_artist_music(artist_id):
    """Return the artist's hot songs as a list of song dicts."""
    url = 'http://music.163.com/api/artist/{}'.format(artist_id)

    try:
        r = requests.get(url, headers=headers, cookies=cookies)
        return r.json()['hotSongs']
    except requests.exceptions.RequestException as e:
        print(e)
        return []


def get_music_lyric(song_id):
    """Return the song's lyric text, or an empty string if there is none."""
    url = 'http://music.163.com/api/song/lyric?os=pc&id={}&lv=-1&kv=-1&tv=-1'.format(song_id)

    try:
        r = requests.get(url, headers=headers, cookies=cookies)
        data = r.json()
        if data.get('lrc') and data['lrc'].get('lyric'):
            return data['lrc']['lyric']
        return ''  # no lyric, e.g. instrumental tracks
    except requests.exceptions.RequestException as e:
        print(e)
        return ''

def get_music_comments(song_id, offset=0, total='false', limit=100):
    """Return one page of comments for a song; 'total' is the literal string 'false'."""
    url = 'http://music.163.com/api/v1/resource/comments/R_SO_4_{}/'.format(song_id)
    payload = {
        'rid': 'R_SO_4_{}'.format(song_id),
        'offset': offset,
        'total': total,
        'limit': limit
    }

    try:
        r = requests.get(url, headers=headers, cookies=cookies, params=payload)
        return r.json()
    except requests.exceptions.RequestException as e:
        print(e)
        return {}

if __name__ == '__main__':
    pass
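
The comments endpoint is paginated via offset and limit, so collecting every comment for a song means stepping the offset by the page size until the total reported by the first response is covered. A minimal sketch of the offset schedule, with an illustrative total in place of a live response:

```python
# Total comment count as reported by the first response's 'total' field
# (the value 250 here is made up for illustration).
total_count = 250
limit = 100

# One request per page: offsets 0, 100, 200
offsets = list(range(0, total_count, limit))
```

Each offset would then be passed to get_music_comments in turn; a short time.sleep between requests is a good idea to stay polite to the server.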

File Processing

This step writes all lyrics into one combined file, and also gathers each artist's lyrics into a file of their own, ready for the analysis below.

The crawl yielded roughly 26,000 lines of lyrics.
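
The merging itself is plain file I/O: append each artist's lyrics both to a per-artist file and to one combined corpus. A minimal sketch, where the function name, the output layout, and the lyrics dict are all hypothetical stand-ins for the project's actual file-handling code:

```python
import os

def write_lyrics(lyrics_by_artist, out_dir):
    """Write one file per artist plus a combined all_lyrics.txt."""
    os.makedirs(out_dir, exist_ok=True)
    all_path = os.path.join(out_dir, 'all_lyrics.txt')

    with open(all_path, 'w', encoding='utf-8') as all_f:
        for artist, lyrics in lyrics_by_artist.items():
            # Per-artist file
            artist_path = os.path.join(out_dir, artist + '.txt')
            with open(artist_path, 'w', encoding='utf-8') as f:
                f.write('\n'.join(lyrics))
            # Combined corpus
            all_f.write('\n'.join(lyrics) + '\n')
```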

Text Analysis

Word segmentation is done with "jieba", a popular Chinese segmentation library.

I first picked one singer as a representative and computed word frequencies, shown below:

I also built a word cloud from the same data:

Then I analyzed all the lyrics together, which produced the following pie chart:

And another word cloud, this time over the full corpus:

Part of the code:

import jieba
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from pandas import DataFrame
from PIL import Image
from wordcloud import WordCloud

# Segment a lyrics file into a flat word list
def get_word_list(path):
    raw_word_list = []

    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            seg_list = jieba.cut(line, cut_all=False)  # generator of words
            raw_word_list += seg_list  # e.g. ['xx', 'a', 'b', 'a']
    print('Word list done!')
    return raw_word_list

# Word frequency statistics
def words_frequency(word_list, top):  # top: number of high-frequency words to keep
    # Counter(word_list) maps each word to its count, e.g. Counter({'c': 5, 'aa': 2})
    sorted_list = sorted(Counter(word_list).items(),
                         key=lambda t: t[1], reverse=True)  # [('xx', 100), ('aa', 97), ...]

    return sorted_list[:top]
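
As an aside, collections.Counter can do the sort-and-truncate in one call: for this use case, Counter(word_list).most_common(top) gives the same result. A tiny example with made-up words:

```python
from collections import Counter

word_list = ['家', '姑娘', '家', '南方', '家', '姑娘']

# Highest-frequency words first
top2 = Counter(word_list).most_common(2)  # [('家', 3), ('姑娘', 2)]
```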

# Filter out stopwords, then return the top frequencies
def word_list_filter(raw_word_list, top=25):
    # A few hand-picked stopwords on top of the standard list;
    # a set gives O(1) membership tests
    stopwords = {u' ', u'说', u'里', u'嘞', u'做', u'噢', u'话'}

    # Stopword list from:
    # https://github.com/XuJin1992/ChineseTextClassifier/blob/master/src/main/resources/text/stop_word.txt
    with open('stop_words.txt', 'r', encoding='utf-8') as f:
        for word in f:
            stopwords.add(word.strip())

    word_list = [word for word in raw_word_list if word not in stopwords]

    return words_frequency(word_list, top=top)
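
On a small in-memory example (both the stopword set and the word list here are made up), the filter-then-count pipeline behaves like this:

```python
from collections import Counter

stopwords = {u' ', u'的', u'说'}  # tiny stand-in stopword set
raw_words = [u'的', u'姑娘', u'说', u'姑娘', u'南方']

# Drop stopwords, then count what remains
filtered = [w for w in raw_words if w not in stopwords]
top = Counter(filtered).most_common(2)  # [('姑娘', 2), ('南方', 1)]
```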


# Visualization
# 'kind' accepts:
#   'bar' or 'barh' for bar charts
#   'pie' for a pie chart
#   others: see http://pandas.pydata.org/pandas-docs/stable/visualization.html
def visualization(word_list, kind='bar'):
    frame = DataFrame(word_list, columns=['word', 'frequency'])
    frame.set_index('word', inplace=True)

    if kind == 'pie':
        frame.plot(subplots=True, kind=kind, legend=False)
    else:
        frame.plot(kind=kind)
    plt.show()


# Create a word cloud
def generate_wordcloud(word_list, mask_name):
    text = ' '.join(word_list)
    mask = np.array(Image.open(mask_name))

    # A CJK-capable font is required, otherwise Chinese renders as boxes
    wc = WordCloud(font_path='SourceHanSerifCN-Regular.otf', background_color='white',
                   max_words=5000, mask=mask)

    # Generate the word cloud from the joined text
    wc.generate(text)

    # wc.to_file('haha.png')  # uncomment to save the image

    # Show the cloud alongside its mask image
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.figure()
    plt.imshow(mask, cmap=plt.cm.gray, interpolation='bilinear')
    plt.axis('off')
    plt.show()

Next Steps

  • Sentiment analysis
  • The comments on Cloud Music are great; crawling and analyzing them might turn up something interesting
  • The pie chart is ugly; I'd like to replace it with something better
