Word Frequency Statistics and Word Clouds with Python3

Origin:

I came across an article that was a solid wall of text and wanted a quick way to pull out its key words, so I tried to do it with Python3.

Code

import codecs

import jieba
import pandas
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Read the source text; specify the encoding explicitly to avoid decode errors.
file = codecs.open(r"ljs.txt", encoding="utf-8")
content = file.read()
file.close()

# Segment the text with jieba, dropping single characters and line breaks.
segment = []
segs = jieba.cut(content)
for seg in segs:
    if len(seg) > 1 and seg != '\r\n':
        segment.append(seg)

# Load the stop-word list (one word per line) and filter those words out.
words_df = pandas.DataFrame({'segment': segment})
stopwords = pandas.read_csv('stopword.txt', index_col=False, quoting=3, sep=',',
                            names=['stopword'], encoding='utf-8')
words_df = words_df[~words_df.segment.isin(stopwords.stopword)]

# Count each word and sort by frequency. The dict form of .agg({"计数": numpy.size})
# is deprecated in newer pandas, so use groupby().size() instead.
words_stat = words_df.groupby('segment').size().reset_index(name='count')
words_stat = words_stat.sort_values(by='count', ascending=False)
words_stat.head()

# Build the word cloud; a Chinese font such as simhei.ttf is needed so that
# Chinese characters render correctly.
wordcloud = WordCloud(font_path='simhei.ttf', background_color='black')
words_frequence = {x[0]: x[1] for x in words_stat.values}

# fit_words expects a dict; other types raise errors such as
# "object has no attribute 'items'".
wordcloud = wordcloud.fit_words(words_frequence)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

Output

[Image 1: word cloud generated from the text]
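
The image above is displayed in a matplotlib window by plt.show(). If you would rather keep it as a file, a minimal addition (the file name is just an example) is:

# Save the rendered cloud directly to an image file (file name is illustrative).
wordcloud.to_file('wordcloud.png')
# Or save the matplotlib figure instead:
# plt.savefig('wordcloud.png', dpi=200, bbox_inches='tight')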

The stopword.txt file used above is a stop-word list; one reference list is available at:
https://github.com/wendy1990/short_text_classification/blob/master/conf/stopwords.txt
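
The loading code above (pandas.read_csv with names=['stopword']) assumes a plain text file with one stop word per line. A minimal illustrative way to create such a file (the five entries are just common examples):

# Illustrative only: write a tiny stop-word file, one word per line.
with open('stopword.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(['的', '了', '是', '在', '我们']))
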
This post only runs word-frequency statistics on an existing local file; a follow-up may use a crawler to fetch web pages and analyse those instead. For reference:
https://segmentfault.com/a/1190000010473819
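
As a rough sketch of that follow-up (assuming the requests and beautifulsoup4 packages and a purely hypothetical URL), the fetched text can simply replace the content variable in the code above:

# Rough sketch: fetch a page and extract its visible text.
# The URL is hypothetical; requests and beautifulsoup4 must be installed.
import requests
from bs4 import BeautifulSoup

resp = requests.get('https://example.com/some-article')
resp.encoding = resp.apparent_encoding  # guard against mis-detected encodings
content = BeautifulSoup(resp.text, 'html.parser').get_text()
# `content` can then be fed into the jieba segmentation code above.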
