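Two libraries are needed. The first is jieba, a Chinese word segmentation library that splits a sentence into words; the second is wordcloud, which draws the cloud. jieba is simple to use: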
import jieba
# jieba.cut returns a generator of tokens; join them with spaces for display.
print(" ".join(jieba.cut('我是来自中国北京某某大学的一名硕士研究生,这是我的测试语句,下面测试北京大学生和北京大学学生。')))
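The word cloud code follows, implemented in Python. Each string in mylist can be either a full passage or a single term; a passage has to be segmented with jieba first.
The requirement here is fairly simple, so feel free to adapt the code if you are interested.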
import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Each entry may be a short phrase or a longer passage; repeated entries weigh more.
mylist = [u'投诉分布',u'业务处理规范',u'刷卡赠好礼',u'天天民生日',u'惠吃惠生活',u'懂你的信用卡',u'精细化经营',u'诚于民,道相生',u'差异化分析',u'预测',u'客户行为',u'精准营销',u'以客户为中心',u'客户工单分析',u'投诉分布',u'业务处理规范',u'刷卡增好礼',u'以市场为导向',u'以创新为导向']
# Segment every entry with jieba and join its tokens with spaces.
word_list = [" ".join(jieba.cut(sentence)) for sentence in mylist]
# Keep the unsegmented phrases as well.
word_list.extend(mylist)
new_text = ' '.join(word_list)
# font_path must point to a font with Chinese glyphs; adjust it for your system.
wc = WordCloud(font_path="C:/windows/fonts/MSYHBD.TTC", background_color="black", max_words=2000, height=300, width=600, prefer_horizontal=0.75, min_font_size=5, max_font_size=50, margin=0)
wc.generate(new_text)
plt.imshow(wc)
plt.axis("off")
plt.show()
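The listing above gives a phrase more weight by repeating it in mylist, and the joined text is tokenized again when generate() parses it. As an alternative, here is a minimal sketch that counts the phrases and hands the counts to WordCloud.generate_from_frequencies, so each phrase is kept whole. The helper names phrase_frequencies and render are made up for illustration, and font_path must still point to a font with Chinese glyphs on your machine.

```python
from collections import Counter


def phrase_frequencies(phrases):
    """Count how often each phrase occurs; repeats become higher weights."""
    return Counter(phrases)


def render(frequencies, font_path):
    """Build a word cloud directly from a phrase -> count mapping."""
    # Imported lazily so the counting helper works without wordcloud installed.
    from wordcloud import WordCloud
    wc = WordCloud(font_path=font_path, background_color="black",
                   width=600, height=300)
    return wc.generate_from_frequencies(frequencies)


# A shortened version of mylist, used only for this illustration.
phrases = [u'投诉分布', u'业务处理规范', u'精准营销', u'投诉分布']
freqs = phrase_frequencies(phrases)
print(freqs.most_common(1))
```

render(freqs, font_path) returns the fitted WordCloud object, which can be passed to plt.imshow in the same way as before.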
The parameters of the WordCloud class are documented as follows:
Word cloud object for generating and drawing.
Parameters
----------
font_path : string; font path
Font path to the font that will be used (OTF or TTF).
Defaults to DroidSansMono path on a Linux machine. If you are on
another OS or don't have this font, you need to adjust this path.
width : int (default=400); canvas width
Width of the canvas.
height : int (default=200); canvas height
Height of the canvas.
prefer_horizontal : float (default=0.90); proportion of words placed horizontally
The ratio of times to try horizontal fitting as opposed to vertical.
If prefer_horizontal < 1, the algorithm will try rotating the word
if it doesn't fit. (There is currently no built-in way to get only vertical
words.)
mask : nd-array or None (default=None); shape of the word cloud, rectangular by default, following the shape of the image when one is supplied
If not None, gives a binary mask on where to draw words. If mask is not
None, width and height will be ignored and the shape of mask will be
used instead. All white (#FF or #FFFFFF) entries will be considered
"masked out" while other entries will be free to draw on. [This
changed in the most recent version!]
scale : float (default=1)
Scaling between computation and drawing. For large word-cloud images,
using scale instead of larger canvas size is significantly faster, but
might lead to a coarser fit for the words.
min_font_size : int (default=4); smallest font size
Smallest font size to use. Will stop when there is no more room in this
size.
font_step : int (default=1); step between successive font sizes
Step size for the font. font_step > 1 might speed up computation but
give a worse fit.
max_words : number (default=200); maximum number of words
The maximum number of words.
stopwords : set of strings or None; words to exclude
The words that will be eliminated. If None, the built-in STOPWORDS
list will be used.
background_color : color value (default="black"); background color
Background color for the word cloud image.
max_font_size : int or None (default=None); largest font size
Maximum font size for the largest word. If None, height of the image is
used.
mode : string (default="RGB"); image color mode
Transparent background will be generated when mode is "RGBA" and
background_color is None.
relative_scaling : float (default=.5)
Importance of relative word frequencies for font-size. With
relative_scaling=0, only word-ranks are considered. With
relative_scaling=1, a word that is twice as frequent will have twice
the size. If you want to consider the word frequencies and not only
their rank, relative_scaling around .5 often looks good.
.. versionchanged: 2.0
Default is now 0.5.
color_func : callable, default=None
Callable with parameters word, font_size, position, orientation,
font_path, random_state that returns a PIL color for each word.
Overwrites "colormap".
See colormap for specifying a matplotlib colormap instead.
regexp : string or None (optional)
Regular expression to split the input text into tokens in process_text.
If None is specified, ``r"\w[\w']+"`` is used.
collocations : bool, default=True
Whether to include collocations (bigrams) of two words.
.. versionadded: 2.0
colormap : string or matplotlib colormap, default="viridis"
Matplotlib colormap to randomly draw colors from for each word.
Ignored if "color_func" is specified.
.. versionadded: 2.0
normalize_plurals : bool, default=True
Whether to remove trailing 's' from words. If True and a word
appears with and without a trailing 's', the one with trailing 's'
is removed and its counts are added to the version without
trailing 's' -- unless the word ends with 'ss'.
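For Chinese text segmented with jieba, the regexp default listed above matters: the pattern r"\w[\w']+" only matches tokens of two or more characters, so single-character words are discarded. The sketch below applies the pattern with re.findall to a space-joined string, on the assumption that the library tokenizes in roughly this way. The sample text is segmented by hand rather than by jieba, and the alternative pattern r"\w+" is a hypothetical choice, not a library default.

```python
import re

DEFAULT_REGEXP = r"\w[\w']+"  # default quoted in the parameter list
KEEP_SINGLE = r"\w+"          # hypothetical alternative, not a library default

# Hand-segmented sample, standing in for space-joined jieba output.
text = "懂 你 的 信用卡 以 客户 为 中心"

default_tokens = re.findall(DEFAULT_REGEXP, text)
all_tokens = re.findall(KEEP_SINGLE, text)

print(default_tokens)   # ['信用卡', '客户', '中心']
print(len(all_tokens))  # 8
```

If single-character words matter for your cloud, passing a custom pattern through the regexp parameter is one thing to try.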
Attributes
----------
``words_`` : dict of string to float
Word tokens with associated frequency.
.. versionchanged: 2.0
``words_`` is now a dictionary
``layout_`` : list of tuples (string, int, (int, int), int, color)
Encodes the fitted word cloud. Encodes for each word the string, font
size, position, orientation and color.
Notes
-----
Larger canvases will make the code significantly slower. If you need a
large word cloud, try a lower canvas size, and set the scale parameter.
The algorithm might give more weight to the ranking of the words
than their actual frequencies, depending on the ``max_font_size`` and the
scaling heuristic.
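The note on ranking versus frequency ties back to relative_scaling. One interpolation consistent with the two endpoints given in the parameter list (rank only at 0, size proportional to frequency at 1) is sketched below. The function name next_font_size and the formula are assumptions for illustration; the library's actual heuristic may differ in detail.

```python
def next_font_size(prev_size, prev_freq, freq, relative_scaling):
    """Font size for the next word, given the previous word's size and frequency."""
    rs = relative_scaling
    ratio = freq / float(prev_freq)
    # Blend "scale by frequency ratio" (rs=1) with "keep the size" (rs=0).
    return int(round((rs * ratio + (1 - rs)) * prev_size))


# A word half as frequent as the previous one, which was drawn at size 40.
for rs in (0, 0.5, 1):
    print(rs, next_font_size(40, 1.0, 0.5, rs))
# prints "0 40", "0.5 30", "1 20"
```

With relative_scaling=0 the frequency ratio has no effect in this sketch, which matches the note that ranking can outweigh the actual frequencies.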
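The mask parameter described in the list above takes an array in which white (255) cells are masked out and all other cells can be drawn on. A mask does not have to come from an image file. The sketch below builds a circular one with numpy; the helper name circle_mask and the 0.45 radius factor are arbitrary choices for illustration.

```python
import numpy as np


def circle_mask(size=300):
    """Return a size x size uint8 array: 0 inside a circle, 255 (masked out) outside."""
    y, x = np.ogrid[:size, :size]
    center = (size - 1) / 2.0
    radius = size * 0.45
    outside = (x - center) ** 2 + (y - center) ** 2 > radius ** 2
    mask = np.zeros((size, size), dtype=np.uint8)
    mask[outside] = 255  # white cells are never drawn on
    return mask


mask = circle_mask(300)
print(mask.shape, mask[0, 0], mask[150, 150])
```

Passing mask=circle_mask() to WordCloud should make width and height irrelevant, as the parameter description says, and the words are then fitted inside the circle.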