Sentiment analysis is becoming a popular area of research and social media analysis, especially around user reviews and tweets. It is a special case of text mining that generally focuses on identifying positive versus negative opinions, and while it is often not very accurate, it can still be useful. For simplicity (and because the training data is easy to obtain), I'll focus on 2 possible sentiment classifications: positive and negative.
```python
def word_feats(words):
    return dict([(word, True) for word in words])
```
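As a quick illustration, this bag-of-words feature extractor simply maps every word to True: presence features only, so word order and counts are discarded and duplicates collapse into a single key.

```python
def word_feats(words):
    # Bag-of-words presence features: every word maps to True
    return dict([(word, True) for word in words])

print(word_feats(['magnificent', 'acting', 'magnificent']))
# {'magnificent': True, 'acting': True}
```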
Here is the complete Python code for training and testing a Naive Bayes classifier on the movie reviews corpus.
```python
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# use the first 3/4 of each class for training, the rest for testing
negcutoff = len(negfeats) * 3 // 4   # integer division for Python 3
poscutoff = len(posfeats) * 3 // 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
classifier.show_most_informative_features()
```
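The 3/4 cutoff is what produces the instance counts in the output: the movie_reviews corpus ships 1000 negative and 1000 positive files. A standalone arithmetic check of the split, no corpus download needed:

```python
# movie_reviews contains 1000 'neg' and 1000 'pos' review files
n_neg, n_pos = 1000, 1000

negcutoff = n_neg * 3 // 4   # 750 (integer division for Python 3)
poscutoff = n_pos * 3 // 4   # 750

n_train = negcutoff + poscutoff                      # first 3/4 of each class
n_test = (n_neg - negcutoff) + (n_pos - poscutoff)   # remaining 1/4 of each

print('train on %d instances, test on %d instances' % (n_train, n_test))
# train on 1500 instances, test on 500 instances
```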
```
train on 1500 instances, test on 500 instances
accuracy: 0.728
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0
```
As you can see, the 10 most informative features are, in most cases, highly descriptive adjectives. The only 2 words that seem a bit odd are "vulnerable" and "avoids". Perhaps these words refer to important plot points or plot developments that indicate a good movie. Whatever the case, with simple assumptions and very little code we were able to achieve almost 73% accuracy. This is somewhat near human accuracy, as apparently people agree on sentiment only around 80% of the time. Future articles in this series will cover precision and recall metrics, alternative classifiers, and techniques for improving accuracy.
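The "pos : neg = 15.0 : 1.0" figures come from comparing how often a feature occurs in each class. A rough, self-contained sketch of that ratio computation with toy counts (the smoothing here is a simplified stand-in, not NLTK's actual internal probability estimates):

```python
def feature_ratio(count_in_pos, n_pos_docs, count_in_neg, n_neg_docs):
    # Ratio of per-class feature probabilities, with add-0.5 smoothing
    # so that a zero count in one class doesn't divide by zero
    p_pos = (count_in_pos + 0.5) / (n_pos_docs + 0.5)
    p_neg = (count_in_neg + 0.5) / (n_neg_docs + 0.5)
    return p_pos / p_neg

# toy counts: a word seen in 29 of 1000 pos reviews and 2 of 1000 neg reviews
print('pos : neg = %.1f : 1.0' % feature_ratio(29, 1000, 2, 1000))
# pos : neg = 11.8 : 1.0
```

A high ratio means the feature is strongly associated with one class, which is exactly why show_most_informative_features() surfaces vivid adjectives like "magnificent" and "ludicrous".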
Original article: http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/