1. Introduction to TF-IDF
1.1 The TF-IDF concept
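TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. It measures how important a word is to one document within a document collection or corpus. A word's importance rises in proportion to the number of times it appears in the document, and falls in inverse proportion to how often it appears across the corpus.
Main idea: if a word or phrase has a high term frequency (TF) in one article and rarely appears in other articles, it is considered to discriminate well between categories and to be suitable for classification, so it can serve as a keyword for that article (the above is taken from Baidu Baike).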
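TF (term frequency) is how often a token appears in a document. The larger the value, the more important the token is to that document. It is computed as:
TF = (number of times the token appears in the document) / (total number of tokens in the document)
IDF (inverse document frequency): within a document collection, the fewer documents a token appears in, the better it distinguishes those documents from the rest. It is computed as:
IDF = log((total number of documents / number of documents containing the token) + 0.01)
TF-IDF is the product of these two values: TF-IDF = TF * IDF.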
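To make the formulas above concrete, here is a minimal sketch that applies them to a small, already tokenized corpus. The helper names (`tf`, `idf`, `tf_idf`), the toy corpus, and the use of the natural logarithm are assumptions made for illustration; the formulas above do not fix the log base.

```python
import math


def tf(term, doc_tokens):
    """TF = occurrences of the term in the document / total tokens in the document."""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus_tokens):
    """IDF = log(total documents / documents containing the term + 0.01).

    Assumes the term occurs in at least one document; otherwise the
    division below would fail.
    """
    containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / containing + 0.01)


def tf_idf(term, doc_tokens, corpus_tokens):
    """TF-IDF is the product of the two values."""
    return tf(term, doc_tokens) * idf(term, corpus_tokens)


# Toy corpus (assumed): each document is a list of tokens.
corpus_tokens = [
    ["buy", "good", "oil", "massage"],
    ["advise", "use", "car", "oil", "type", "forever"],
    ["searching", "oil", "body", "massage"],
    ["love"],
]

first_doc = corpus_tokens[0]
for term in first_doc:
    print(term, round(tf_idf(term, first_doc, corpus_tokens), 3))
```

In this toy corpus "buy" appears in one document and "oil" in three, so "buy" gets the higher score in the first document even though both have the same TF.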
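1.2 Applications of TF-IDF
The main application is computing the similarity between two articles or two short texts. First tokenize each text and merge all tokens into one set. Then compute each text's frequency for every word in that set, which gives one term-frequency vector per text. Finally compare the two vectors: with cosine similarity a larger value means the texts are more similar, while with Euclidean distance a smaller value means they are more similar.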
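The steps above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the texts are already tokenized and non-empty, the vectors hold plain term frequencies, and the function names and sample token lists are made up for the example.

```python
import math
from collections import Counter


def term_frequency_vectors(tokens_a, tokens_b):
    """Build one term-frequency vector per text over the merged word set."""
    vocabulary = sorted(set(tokens_a) | set(tokens_b))
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vec_a = [counts_a[word] / len(tokens_a) for word in vocabulary]
    vec_b = [counts_b[word] / len(tokens_b) for word in vocabulary]
    return vec_a, vec_b


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two vectors; larger means more similar."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Sample token lists (assumed for illustration).
doc_a = ["buy", "good", "oil", "massage"]
doc_b = ["searching", "oil", "body", "massage"]
doc_c = ["love"]

sim_ab = cosine_similarity(*term_frequency_vectors(doc_a, doc_b))
sim_ac = cosine_similarity(*term_frequency_vectors(doc_a, doc_c))
print("a vs b:", sim_ab)
print("a vs c:", sim_ac)
```

The first pair shares two of its four words, so it scores higher than the second pair, which shares none. Replacing the plain term frequencies with TF-IDF weights would down-weight words that are common to most documents.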
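1.3 Strengths and weaknesses of TF-IDF
TF-IDF is easy to understand and easy to implement. However, its simple structure does not fully reflect how important each word really is. Experience suggests that words at the beginning and end of a document tend to express its main point, yet TF-IDF ignores word position. It also ignores how a word is distributed within the document.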
2. Implementing TF-IDF
2.1 Implementation with Scikit-Learn
# -*- coding:utf-8 -*-
from sklearn.feature_extraction.text import TfidfVectorizer


def get_tf_tfid(corpus):
    # token_pattern keeps only words of three or more ASCII letters;
    # max_df=0.85 drops terms found in more than 85% of the documents.
    vectorizer = TfidfVectorizer(encoding='utf-8', lowercase=True, stop_words='english',
                                 token_pattern=r'(?u)[A-Za-z][A-Za-z]+[A-Za-z]', ngram_range=(1, 1),
                                 analyzer='word', max_df=0.85, min_df=1, max_features=150)
    # Rows are documents; columns are the vocabulary terms in sorted order.
    vector = vectorizer.fit_transform(corpus).toarray()
    print(vector)


if __name__ == '__main__':
    corpus = ['Where I can buy good oil for massage?',
              'I advise you to use car oil any type forever.',
              'I`m searching oil for body massage.',
              'I love you!', ]
    get_tf_tfid(corpus)
Output:
[[ 0. 0. 0.574 0. 0. 0.574 0. 0.453 0.366 0. 0. 0. ]
[ 0.430 0. 0. 0.430 0.430 0. 0. 0. 0.274 0. 0.430 0.430 ]
[ 0. 0.574 0. 0. 0. 0. 0. 0.453 0.366 0.574 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. ]]
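The numbers printed above do not follow the formulas in section 1.1 directly. With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` uses raw counts for TF, computes IDF as ln((1 + n) / (1 + df)) + 1, and then scales each row to unit Euclidean length. The sketch below redoes that calculation in plain Python. The function name `sklearn_style_tfidf` is hypothetical, and the token lists are my assumption of what remains of each sentence after the stop-word filter and token pattern used above.

```python
import math


def sklearn_style_tfidf(doc_tokens, corpus_tokens):
    """Raw count * smoothed IDF, then L2-normalize the row."""
    n_docs = len(corpus_tokens)
    weights = {}
    for term in set(doc_tokens):
        df = sum(1 for doc in corpus_tokens if term in doc)
        smoothed_idf = math.log((1 + n_docs) / (1 + df)) + 1
        weights[term] = doc_tokens.count(term) * smoothed_idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


# Assumed tokens left after lowercasing, stop-word removal and the
# "three or more letters" token pattern.
corpus_tokens = [
    ["buy", "good", "oil", "massage"],
    ["advise", "use", "car", "oil", "type", "forever"],
    ["searching", "oil", "body", "massage"],
    ["love"],
]

rows = [sklearn_style_tfidf(doc, corpus_tokens) for doc in corpus_tokens]
for row in rows:
    print({term: round(weight, 3) for term, weight in sorted(row.items())})
```

Under these assumptions the first row comes out at roughly 0.575 for "buy" and "good", 0.453 for "massage" and 0.367 for "oil", which lines up with the first row of the matrix above.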
2.2 Manual implementation with NLTK
# -*- coding:utf-8 -*-
# Written for Python 3. Requires the NLTK data packages 'stopwords' and 'wordnet'.
import math
from nltk.tokenize import WordPunctTokenizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer


def nltkProcess(corpus):
    """Tokenize, drop stop words, lemmatize, then stem each document."""
    tokenizer = WordPunctTokenizer()
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english'))
    tokens = []
    for line in corpus:
        words = [w.lower() for w in tokenizer.tokenize(line)]
        # Punctuation is not filtered, so tokens such as '!' are kept.
        noStopwords = [w for w in words if w not in stop_words]
        lemmas = [lemmatizer.lemmatize(w) for w in noStopwords]
        tokens.append([stemmer.stem(w) for w in lemmas])
    return tokens


def get_word(word, corpus):
    """Count the documents that contain the word."""
    return sum(1 for text in corpus if word in text)


def get_tf(word, text):
    return text.count(word) / len(text)


def get_idf(word, corpus):
    # The +1 in the denominator avoids division by zero; this smoothing
    # differs from the IDF formula given in section 1.1.
    return math.log(len(corpus) / (1 + get_word(word, corpus)))


def get_tfidf(corpus):
    corpus = nltkProcess(corpus)
    for i, text in enumerate(corpus):
        print("The words of TF-IDF is:")
        scores = {word: get_tf(word, text) * get_idf(word, corpus) for word in text}
        sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        # Print the two highest-scoring tokens of each document.
        for word, score in sorted_words[:2]:
            print("\tWord: {}, TF-IDF: {}".format(word, round(score, 3)))


if __name__ == '__main__':
    corpus = ['Where I can buy good oil for massage?',
              'I advise you to use car oil any type forever.',
              'I`m searching oil for body massage.',
              'I love you!', ]
    get_tfidf(corpus)
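Output (several tokens in each document share the top score, so which two are printed, and in what order, can vary between Python versions):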
The words of TF-IDF is:
Word: good, TF-IDF: 0.139
Word: buy, TF-IDF: 0.139
The words of TF-IDF is:
Word: use, TF-IDF: 0.099
Word: car, TF-IDF: 0.099
The words of TF-IDF is:
Word: `, TF-IDF: 0.116
Word: search, TF-IDF: 0.116
The words of TF-IDF is:
Word: !, TF-IDF: 0.347
Word: love, TF-IDF: 0.347