python sklearn TfidfVectorizer

Reference: http://python.jobbole.com/81311/

# -*- coding:utf-8 -*-

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']

# Fit once and reuse the fitted vectorizer (min_df=1 is the default).
vectorizer = TfidfVectorizer(min_df=1)
tfidf = vectorizer.fit_transform(corpus)  # sparse matrix, one row per document

# Note: get_feature_names() was removed in scikit-learn 1.2;
# on newer versions call get_feature_names_out() instead.
print(vectorizer.get_feature_names())  # terms in column order
print(vectorizer.vocabulary_)          # term -> column index
print(vectorizer.idf_)                 # idf weight of each column
print(vectorizer.smooth_idf)           # True by default
print(tfidf.toarray())                 # dense TF-IDF matrix

Output:

[u'and', u'document', u'first', u'is', u'one', u'second', u'the', u'third', u'this']
{u'and': 0, u'third': 7, u'this': 8, u'is': 3, u'one': 4, u'second': 5, u'the': 6, u'document': 1, u'first': 2}
[1.91629073 1.22314355 1.51082562 1.22314355 1.91629073 1.91629073
 1.         1.91629073 1.22314355]
True
[[0.         0.43877674 0.54197657 0.43877674 0.         0.
  0.35872874 0.         0.43877674]
 [0.         0.27230147 0.         0.27230147 0.         0.85322574
  0.22262429 0.         0.27230147]
 [0.55280532 0.         0.         0.         0.55280532 0.
  0.28847675 0.55280532 0.        ]
 [0.         0.43877674 0.54197657 0.43877674 0.         0.
  0.35872874 0.         0.43877674]]

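These values differ from a hand calculation with the textbook formula tf × log(N/df). By default TfidfVectorizer uses a smoothed idf (smooth_idf=True), idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t, and then L2-normalizes each row (norm='l2').
In idf_, "the" has the value 1 because it appears in all four documents: ln(5/5) + 1 = 1. This is the smallest possible idf, and every rarer word is weighted relative to it.
The squares of the values in each row sum to 1, which is the effect of the L2 normalization.
Within a row, the ratio of two entries equals the ratio of their tf × idf values: 0.43877674/0.35872874 = 1.22314355 (the idf of "document" divided by the idf of "the", which is 1), and 0.85322574/0.22262429 = 1.91629073 × 2 (the idf of "second" times its count of 2 in the second document).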
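
To check these observations, the calculation can be redone by hand with only the standard library. This is a minimal sketch, not scikit-learn's actual implementation: it assumes the default settings (lowercasing, tokens of two or more word characters, smoothed idf, L2 normalization), and the helper name tfidf_row is made up for illustration.

```python
import math
import re

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# Assumption: approximate the default tokenizer by lowercasing the text
# and keeping tokens made of two or more word characters.
docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]
vocab = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {}
for term in vocab:
    df = sum(1 for doc in docs if term in doc)  # documents containing the term
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(doc):
    """Hypothetical helper: raw count times idf, scaled to unit L2 length."""
    weights = [doc.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


rows = [tfidf_row(doc) for doc in docs]

print(vocab)
print([round(idf[t], 8) for t in vocab])
for row in rows:
    print([round(v, 8) for v in row])
```

If the assumptions hold, the printed idf values and rows should agree with idf_ and the toarray() output above up to the displayed precision.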
