Word clouds are nothing new in the big-data world. I remember Weibo used to offer a word-cloud feature years ago, and my biggest keyword there was, if memory serves, "eat"…
This post is adapted from the article "Python 爬虫实战(1):分析豆瓣中最新电影的影评" (Python Web Scraping in Practice (1): Analyzing Reviews of the Latest Movies on Douban). It uses Python 2.7, with BeautifulSoup as the parsing library.
See the linked original article for details.
import urllib
from bs4 import BeautifulSoup as bs

CommentList = []
for a in range(11):
    # each comment page holds 20 comments; 'start' is the paging offset
    url = 'https://movie.douban.com/subject/11537954/comments?start={}&limit=20'.format(a*20)
    resp = urllib.urlopen(url)
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')  # the second argument selects the parser
    comment_eachpage = soup.find_all('div', class_='comment')
    for item in comment_eachpage:
        comment = item.find_all('p')[0].find('span').string
        if comment is not None:
            CommentList.append(comment)
allcomments = ''
for k in range(len(CommentList)):
    allcomments = allcomments + (CommentList[k]).strip()
import re

# keep only Chinese characters, dropping punctuation, digits and Latin text
pattern = re.compile(ur'[\u4e00-\u9fa5]+')
filterdata = re.findall(pattern, allcomments)
comments_zh = ''.join(filterdata)
import jieba  # Chinese word segmentation package
import pandas as pd

segment = jieba.lcut(comments_zh)
words_detail = pd.DataFrame({'segment': segment})

# drop common function words using a local Chinese stop-word list
stopwords = pd.read_csv(r"D:\myprograms\douban\moviecontent\chineseStopWords.txt", index_col=False, quoting=3, sep="\t", names=['stopword'], encoding=u'gbk')
words_detail = words_detail[~words_detail.segment.isin(stopwords.stopword)]
import numpy as np

# count how often each word appears and sort by frequency, descending
words_result = words_detail.groupby(by=['segment'])['segment'].agg({"countnum": np.size})
words_result = words_result.reset_index().sort_values(by=["countnum"], ascending=False)
import matplotlib
import matplotlib.pyplot as plt
# %matplotlib inline  # uncomment when running in a Jupyter notebook
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)

from wordcloud import WordCloud  # word cloud package

# a Chinese-capable font such as simhei.ttf is needed to render the characters
wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", max_font_size=80)
# build a {word: count} dict from the 1000 most frequent words
word_frequence = {x[0]: x[1] for x in words_result.head(1000).values}
wordcloud = wordcloud.fit_words(word_frequence)

fig = plt.gcf()
plt.imshow(wordcloud)
fig.savefig('rick&morty.png', dpi=100)
To generate a word cloud for a different film or TV show, all you have to do is swap out the id in the Douban URL. One drawback, though, is that the script can only scrape two hundred-odd comments, presumably because it runs into Douban's anti-scraping limits. I'll come back and improve it once I've figured out how to make the requests look like they come from a browser (yes, I'm digging myself a hole here); a rough sketch of that idea is below.
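As a starting point, here is a minimal sketch of the usual trick, assuming the same Python 2.7 setup as the rest of this post: attach a browser-like User-Agent header to each request via urllib2. The fetch_page helper and the User-Agent string are illustrative choices of mine, not part of the original code, and Douban may still cap how many comment pages it serves to visitors who are not logged in.

import urllib2

def fetch_page(url):
    # Pretend to be an ordinary desktop browser; the User-Agent value below
    # is only an example, not a string Douban specifically requires.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    request = urllib2.Request(url, headers=headers)
    return urllib2.urlopen(request).read().decode('utf-8')

# drop-in replacement for the urllib.urlopen(...).read().decode(...) lines above
html_data = fetch_page('https://movie.douban.com/subject/11537954/comments?start=0&limit=20')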