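I came across an article titled "I used Python to analyze 420,000 characters of lyrics to find out what folk singers are singing about" and thought it looked fun, so I decided to build my own version. That is how this project came about.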
GitHub: CloudMusic-Crawler
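The crawler mostly calls existing APIs. For this part, see NetEase-MusicBox, whose author built a command-line client for NetEase Cloud Music; I tried it and it works well. I mainly borrowed from that project's api.py.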
# -*- coding: utf-8 -*-
# @Author: GreatV
# @Date: 2017-04-16
# Reference: https://github.com/darknessomi/musicbox
import requests

# Browser-like headers sent with every API request
headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Host': 'music.163.com',
    'Referer': 'http://music.163.com/search/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'
}
cookies = {'appver': '1.5.2'}


def get_artist_music(artist_id):
    # Return the artist's hot songs, or [] if the request fails
    url = 'http://music.163.com/api/artist/{}'.format(artist_id)
    try:
        r = requests.get(url, headers=headers, cookies=cookies)
        return r.json()['hotSongs']
    except requests.exceptions.RequestException as e:
        print(e)
        return []


def get_music_lyric(song_id):
    # Return the lyric text of a song, or [] if it is missing or the request fails
    url = 'http://music.163.com/api/song/lyric?os=pc&id={}&lv=-1&kv=-1&tv=-1'.format(song_id)
    try:
        r = requests.get(url, headers=headers, cookies=cookies)
        data = r.json()  # parse the response body only once
        if 'lrc' in data and 'lyric' in data['lrc'] and data['lrc']['lyric'] is not None:
            return data['lrc']['lyric']
        else:
            return []
    except requests.exceptions.RequestException as e:
        print(e)
        return []


def get_music_comments(song_id, offset=0, total='false', limit=100):
    # Return one page of comments as parsed JSON, or [] if the request fails
    url = 'http://music.163.com/api/v1/resource/comments/R_SO_4_{}/'.format(song_id)
    payload = {
        'rid': 'R_SO_4_{}'.format(song_id),
        'offset': offset,
        'total': total,
        'limit': limit
    }
    try:
        r = requests.get(url, headers=headers, cookies=cookies, params=payload)
        return r.json()
    except requests.exceptions.RequestException as e:
        print(e)
        return []


if __name__ == '__main__':
    pass
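The main job of this part is to write all the lyrics into a single file, and also to write each artist's lyrics into a separate file, for use in the later analysis.
In total I collected roughly 26,000 lines of lyrics this time.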
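The file-writing step described above could look roughly like the sketch below. It is not the project's actual code: the names `strip_time_tags`, `save_lyrics`, and `all_lyrics.txt` are made up for illustration, and it assumes each lyric is a string in LRC format with time tags such as `[00:12.34]`, which are stripped before writing.

```python
import os
import re

# LRC time tags such as [00:12.34] or [01:02]. The exact tag format is an
# assumption for this sketch, not something taken from the original project.
TIME_TAG = re.compile(r'\[\d+:\d+(?:[.:]\d+)?\]')


def strip_time_tags(lyric):
    """Remove time tags and blank lines, returning the plain lyric lines."""
    lines = []
    for line in lyric.splitlines():
        text = TIME_TAG.sub('', line).strip()
        if text:
            lines.append(text)
    return lines


def save_lyrics(lyrics_by_artist, out_dir):
    """Write one file per artist plus a combined all_lyrics.txt.

    lyrics_by_artist maps an artist name to a list of raw lyric strings.
    Returns the number of lyric lines written to the combined file.
    """
    os.makedirs(out_dir, exist_ok=True)
    total_lines = 0
    all_path = os.path.join(out_dir, 'all_lyrics.txt')
    with open(all_path, 'w', encoding='utf-8') as all_file:
        for artist, lyrics in lyrics_by_artist.items():
            artist_path = os.path.join(out_dir, '{}.txt'.format(artist))
            with open(artist_path, 'w', encoding='utf-8') as artist_file:
                for lyric in lyrics:
                    for line in strip_time_tags(lyric):
                        artist_file.write(line + '\n')
                        all_file.write(line + '\n')
                        total_lines += 1
    return total_lines
```

Returning the line count makes it easy to check how much text was collected. Songs without lyrics would need to be skipped before calling `save_lyrics`, since this sketch expects every entry to be a string.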
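For word segmentation I used the "jieba" Chinese word segmentation library.
I first picked one singer as a representative and looked at the word frequencies, shown below:
I made a word cloud:
Then I analyzed all the lyrics together and got the following pie chart:
I also made a word cloud for them, shown below:
Part of the code follows: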
import io
from collections import Counter

import jieba
import matplotlib.pyplot as plt
import numpy as np
from pandas import DataFrame
from PIL import Image
from wordcloud import WordCloud


# Segment a lyrics file into words
def get_word_list(path):
    raw_word_list = []
    # io.open decodes the file as UTF-8 on both Python 2 and Python 3
    with io.open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            seg_list = jieba.cut(line, cut_all=False)  # generator of words, e.g. 'a', 'b', 'a'
            raw_word_list += seg_list  # grows to e.g. ['xx', 'a', 'b', 'a']
    print('Write word list done!')
    return raw_word_list


# Word frequency statistics
def words_frequency(word_list, top):  # top: number of most frequent words to return
    # Counter maps word -> count, e.g. Counter({'aa': 2, 'c': 5, 'd': 89});
    # most_common returns (word, count) pairs sorted by count, e.g. [('xx', 100), ('aa', 97)]
    return Counter(word_list).most_common(top)


# Filter out stop words, then count frequencies
def word_list_filter(raw_word_list, top=25):
    stopwords = set([u' ', u'说', u'里', u'嘞', u'做', u'噢', u'话'])  # predefined stop words
    # The stop word list comes from:
    # https://github.com/XuJin1992/ChineseTextClassifier/blob/master/src/main/resources/text/stop_word.txt
    with io.open('stop_words.txt', 'r', encoding='utf-8') as f:
        for word in f:
            stopwords.add(word.strip())
    # Drop every word that is a stop word
    word_list = [word for word in raw_word_list if word not in stopwords]
    return words_frequency(word_list, top=top)


# Visualization
# Values accepted by kind:
# 'bar' or 'barh': bar chart
# 'pie': pie chart
# For others see http://pandas.pydata.org/pandas-docs/stable/visualization.html
def visualization(word_list, kind='bar'):
    # word_list is a list of (word, frequency) pairs
    frame = DataFrame(word_list, columns=['word', 'frequency'])
    frame.set_index('word', inplace=True)
    if kind == 'pie':
        frame.plot(subplots=True, kind=kind, legend=False)
    else:
        frame.plot(kind=kind)
    plt.show()


# Create a word cloud
def generate_wordcloud(word_list, mask_name):
    text = ' '.join(word_list)
    mask = np.array(Image.open(mask_name))
    wc = WordCloud(font_path='SourceHanSerifCN-Regular.otf', background_color='white',
                   max_words=5000, mask=mask)
    # generate word cloud
    wc.generate(text)
    # create coloring from image
    # image_colors = ImageColorGenerator(mask)
    # store to file
    # wc.to_file('haha.png')
    # show the word cloud, then the mask image in a second figure
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    plt.figure()
    plt.imshow(mask, cmap=plt.cm.gray, interpolation='bilinear')
    plt.axis("off")
    plt.show()
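
To see what the filter-and-count step does without installing jieba or preparing a stop word file, here is a small self-contained sketch on a toy input. The function name `top_words` and the sample list are made up for illustration; the list simply stands in for words that have already been segmented.

```python
from collections import Counter


def top_words(words, stopwords, top=25):
    """Drop stop words, then return the `top` most frequent (word, count) pairs."""
    stop = set(stopwords)
    kept = [w for w in words if w not in stop]
    return Counter(kept).most_common(top)


# Hypothetical, already-segmented input standing in for jieba's output
sample = ['我', '的', '故乡', '的', '南方', '故乡', ' ', '南方', '故乡']
print(top_words(sample, stopwords=[' ', '的'], top=2))
# prints [('故乡', 3), ('南方', 2)]
```

The resulting list of pairs has the same shape as what `visualization` expects, so it could be passed straight to a bar or pie chart.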