Python | Scraping Movie Reviews and Generating a Word Cloud

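Word clouds are nothing new in the big-data world. I remember that Weibo had a word-cloud feature years ago, and I think the biggest keyword in mine was "吃" ("eat")...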

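This post is adapted from "Python 爬虫实战(1):分析豆瓣中最新电影的影评" (Python crawler in practice, part 1: analyzing reviews of the latest movies on Douban). It uses Python 2.7, and the parsing library is BeautifulSoup.
See the linked article for the details.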

import urllib
from bs4 import BeautifulSoup as bs

CommentList = []
for a in range(11):
    url = 'https://movie.douban.com/subject/11537954/comments?start={}&limit=20'.format(a*20)
    resp = urllib.urlopen(url)
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser') # the second argument selects the parser
    comment_eachpage = soup.find_all('div', class_='comment')
    for item in comment_eachpage:
        p = item.find('p')
        span = p.find('span') if p is not None else None
        # skip comments whose text is missing or is not a single plain string
        if span is not None and span.string is not None:
            CommentList.append(span.string)

# concatenate all comments into one string, trimming surrounding whitespace
allcomments = ''.join(c.strip() for c in CommentList)

import re

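# keep only runs of Chinese characters (U+4E00 to U+9FA5); punctuation, digits and Latin letters are dropped
pattern = re.compile(u'[\u4e00-\u9fa5]+')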
filterdata = re.findall(pattern, allcomments)
comments_zh = ''.join(filterdata)            

import jieba    # Chinese word segmentation package
import pandas as pd  

segment = jieba.lcut(comments_zh)
words_detail=pd.DataFrame({'segment':segment})

stopwords=pd.read_csv(r"D:\myprograms\douban\moviecontent\chineseStopWords.txt",index_col=False,quoting=3,sep="\t",names=['stopword'], encoding=u'gbk')
words_detail=words_detail[~words_detail.segment.isin(stopwords.stopword)] 

# count occurrences of each word; size() avoids the dict-renaming form of agg(), which pandas 1.0 and later reject
words_result=words_detail.groupby('segment').size().reset_index(name='countnum')
words_result=words_result.sort_values(by='countnum',ascending=False)

import matplotlib.pyplot as plt
#%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
from wordcloud import WordCloud # word cloud package

wordcloud=WordCloud(font_path="simhei.ttf",background_color="white",max_font_size=80)
word_frequence = {x[0]:x[1] for x in words_result.head(1000).values}


wordcloud=wordcloud.fit_words(word_frequence)
fig = plt.gcf()
plt.imshow(wordcloud)
fig.savefig('rick&morty.png', dpi=100)

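To generate a word cloud for a different film or show, just swap the id in the Douban URL.
One limitation is that the script only retrieves a little over two hundred reviews, probably because of an anti-scraping mechanism.
I'll come back and improve it once I've learned how to make the requests look like they come from a browser (and with that I've suddenly dug myself a hole).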
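
As a rough sketch of both ideas above (swapping the id and posing as a browser), the page URLs can be built from a subject id and each request can carry a browser-like User-Agent header. This is written for Python 3, where `Request` lives in `urllib.request` (on Python 2.7 the equivalent class is in `urllib2`). The names `build_comment_requests`, `COMMENTS_URL` and `BROWSER_HEADERS`, and the header value itself, are made up for illustration and are not part of the original script. Whether a User-Agent header alone gets past the limit is untested here.

```python
# Python 3 sketch; on Python 2.7 the same Request class lives in urllib2.
from urllib.request import Request

# Hypothetical header value: any common browser User-Agent string could be used.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}

# Same URL shape as in the script above, with the id and paging made parameters.
COMMENTS_URL = "https://movie.douban.com/subject/{}/comments?start={}&limit={}"


def build_comment_requests(subject_id, pages=11, per_page=20):
    """Return one Request per comment page for the given Douban subject id."""
    requests = []
    for page in range(pages):
        url = COMMENTS_URL.format(subject_id, page * per_page, per_page)
        # attach the browser-like headers to every page request
        requests.append(Request(url, headers=BROWSER_HEADERS))
    return requests


# 11537954 is the subject id used in the script above.
reqs = build_comment_requests("11537954")
```

Each element of `reqs` could then be passed to `urllib.request.urlopen` in place of the bare URL string, and the rest of the pipeline would stay the same.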
