Word Frequency Counting


    Word frequency counting means counting how many times each word appears in a text, which is easy to do with Python's dict data structure. I use matplotlib to draw a bar chart of the top-K most frequent words. One thing to watch out for is displaying Chinese text in the figure: before plotting, you need to change the relevant matplotlib settings. The details are easy to find with a Google search, so I won't go into them here.
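For reference, a minimal sketch of that font setup (the font name `SimHei` is an assumption here; substitute any Chinese font installed on your system):

```python
import matplotlib

# Point matplotlib at a font that contains CJK glyphs
# ('SimHei' is an assumption; use any installed Chinese font)
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
# Keep the minus sign rendering correctly under a non-default font
matplotlib.rcParams['axes.unicode_minus'] = False
```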
# -*- coding: utf-8 -*-
import numpy
import pylab

def getstr(word, count):
    # Format one "word,count" output line.
    return word + ',' + str(count)

def get_wordlist(infile):
    # Read the segmented text and collect every word longer than one character.
    wordlist = []
    with open(infile, encoding='utf-8') as f:
        for line in f:
            if len(line) > 1:
                for word in line.split():
                    if len(word) > 1:
                        wordlist.append(word)
    return wordlist

def get_wordcount(wordlist, outfile):
    # Count occurrences with a dict, then write "word,count" lines
    # sorted by descending frequency.
    wordcnt = {}
    for w in wordlist:
        wordcnt[w] = wordcnt.get(w, 0) + 1
    worddict = sorted(wordcnt.items(), key=lambda a: -a[1])
    with open(outfile, 'w', encoding='gbk') as out:
        for word, cnt in worddict:
            out.write(getstr(word, cnt) + '\n')
    return wordcnt

def barGraph(wcDict):
    # Keep words that occur more than 5 times and are longer than
    # 3 characters, then draw a bar chart of their frequencies.
    wordlist = [(key, val) for key, val in wcDict.items()
                if val > 5 and len(key) > 3]
    wordlist.sort()
    keylist = [key for key, val in wordlist]
    vallist = [val for key, val in wordlist]
    barwidth = 0.5
    xVal = numpy.arange(len(keylist))
    pylab.xticks(xVal + barwidth / 2.0, keylist, rotation=45)
    pylab.bar(xVal, vallist, width=barwidth, color='y')
    pylab.title(u'微博词频分析图')
    pylab.show()

if __name__ == '__main__':
    myfile = 'F://NLP/iWInsightor/weibo_filter.dat'
    outfile = 'F://NLP/iWInsightor/result.dat'
    wordlist = get_wordlist(myfile)
    wordcnt = get_wordcount(wordlist, outfile)
    barGraph(wordcnt)
    
    That completes the job. Below is a bar chart of my Weibo word frequencies. This is just a spare-time project, and there is still plenty of room for improvement.
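As an aside, the count-then-sort step above can also be written with `collections.Counter` from the standard library, which does both in one call (a sketch, not the code used in this post):

```python
from collections import Counter

# A small sample word list (hypothetical data for illustration)
words = ['微博', '分析', '微博', '词频', '微博', '分析']

# Counter builds the word -> count mapping in one pass;
# most_common(k) returns the top-k (word, count) pairs by descending count
top2 = Counter(words).most_common(2)
print(top2)  # [('微博', 3), ('分析', 2)]
```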
