Understanding and Using TF-IDF

1. TF-IDF summary

A TF-IDF score is the product of two parts, tf and idf:

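  • 1. tf is the frequency of each word within a sentence. A higher frequency suggests that the sentence places weight on that word, so its main topic is probably related to it.
  • 2. idf is $\log_{10}(\text{total number of sentences in the corpus} / \text{number of sentences containing the term})$. It reflects how informative the term is: a term that appears in every sentence gets an idf of 0 and does little to distinguish one sentence from another.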

2. Formulas

  • $tf = \text{number of times the word occurs in the sentence} / \text{total number of words in the sentence}$
  • $idf = \log_{10}(\text{total number of sentences in the corpus} / \text{number of sentences containing the term})$
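The two formulas can be sketched in plain Python. This is only a minimal illustration, not the scikit-learn implementation: the helper names `tf`, `idf`, and `tf_idf` are hypothetical, and the sketch assumes every sentence is already lowercased, free of punctuation, and split on whitespace.

```python
import math

# Toy corpus (assumption: lowercased, punctuation already removed).
corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]
corpus_tokens = [sentence.split() for sentence in corpus]


def tf(term, sentence_tokens):
    """Occurrences of the term in the sentence / total words in the sentence."""
    return sentence_tokens.count(term) / len(sentence_tokens)


def idf(term, all_tokens):
    """log10(total sentences / sentences containing the term).

    Raises ZeroDivisionError if no sentence contains the term.
    """
    containing = sum(1 for tokens in all_tokens if term in tokens)
    return math.log10(len(all_tokens) / containing)


def tf_idf(term, sentence_tokens, all_tokens):
    """Product of the two parts defined above."""
    return tf(term, sentence_tokens) * idf(term, all_tokens)


# Scores of a few terms in the second sentence.
for term in ("this", "document", "second"):
    print(term, tf_idf(term, corpus_tokens[1], corpus_tokens))
```

Because "this" occurs in all four sentences, its idf is log10(4/4) = 0, so its tf-idf is 0 in every sentence, which matches the intuition that such a term is not informative.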

3. Usage

3.1 Getting the result directly

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
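# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())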
print(X.toarray())
print(X.shape)

Result

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
(4, 9)
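The numbers above do not follow the log10 formulas from section 2 exactly. With its default settings, `TfidfVectorizer` uses the raw term count as tf, a smoothed natural-log idf, `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit L2 length. The sketch below reproduces the matrix in plain Python under those assumptions; the helper names `tokenize` and `tfidf_row` are made up for illustration, and the regular expression mirrors the vectorizer's default token pattern (tokens of two or more word characters).

```python
import math
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]


def tokenize(text):
    # Lowercase, then keep tokens made of at least two word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


docs = [tokenize(doc) for doc in corpus]
vocab = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Document frequency and smoothed idf for every vocabulary term.
df = {term: sum(term in doc for doc in docs) for term in vocab}
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def tfidf_row(tokens):
    # Raw count times idf, followed by L2 normalization of the row.
    raw = [tokens.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


rows = [tfidf_row(doc) for doc in docs]
print(vocab)
for row in rows:
    print([round(value, 8) for value in row])
```

For the first sentence, for example, "document" has df = 3, so its idf is ln(5/4) + 1 ≈ 1.2231; dividing by the row norm of about 2.6036 gives the 0.46979139 shown in the result.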

3.2 Different n-grams

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
]
vectorizer = TfidfVectorizer(ngram_range=(1,5))
X = vectorizer.fit_transform(corpus)
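# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())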

Result

['and', 'and this', 'and this is', 'and this is the', 'and this is the third', 'document', 'document is', 'document is the', 'document is the second', 'document is the second document', 'first', 'first document', 'is', 'is the', 'is the first', 'is the first document', 'is the second', 'is the second document', 'is the third', 'is the third one', 'is this', 'is this the', 'is this the first', 'is this the first document', 'one', 'second', 'second document', 'the', 'the first', 'the first document', 'the second', 'the second document', 'the third', 'the third one', 'third', 'third one', 'this', 'this document', 'this document is', 'this document is the', 'this document is the second', 'this is', 'this is the', 'this is the first', 'this is the first document', 'this is the third', 'this is the third one', 'this the', 'this the first', 'this the first document']
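To see where these features come from, the following sketch builds word n-grams by hand: every run of 1 to 5 consecutive tokens in a sentence becomes one feature. It is an illustration only, not scikit-learn's code; `word_ngrams` is a hypothetical helper, and the tokenization (lowercasing, tokens of two or more word characters) is assumed to match the vectorizer's defaults.

```python
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]


def word_ngrams(text, min_n, max_n):
    """Return all runs of min_n..max_n consecutive tokens, joined by spaces."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


# The feature list is the sorted set of n-grams over the whole corpus.
vocabulary = sorted({gram for doc in corpus for gram in word_ngrams(doc, 1, 5)})
print(len(vocabulary))
print(word_ngrams(corpus[0], 2, 3))
```

The vocabulary grows quickly with the upper bound of `ngram_range`: the same four sentences give 9 features with single words only, and 50 with n-grams of up to five words.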

3.3 Character-level features

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
]
vectorizer = TfidfVectorizer(analyzer='char_wb')
X = vectorizer.fit_transform(corpus)
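# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())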

Result

[' ', '.', '?', 'a', 'c', 'd', 'e', 'f', 'h', 'i', 'm', 'n', 'o', 'r', 's', 't', 'u']

3.4 Character-level n-grams

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
]
vectorizer = TfidfVectorizer(analyzer='char_wb',ngram_range=(2,5))
X = vectorizer.fit_transform(corpus)
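# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist(), X.toarray().shape)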

Result

[' a', ' an', ' and', ' and ', ' d', ' do', ' doc', ' docu', ' f', ' fi', ' fir', ' firs', ' i', ' is', ' is ', ' o', ' on', ' one', ' one.', ' s', ' se', ' sec', ' seco', ' t', ' th', ' the', ' the ', ' thi', ' thir', ' this', '. ', '? ', 'an', 'and', 'and ', 'co', 'con', 'cond', 'cond ', 'cu', 'cum', 'cume', 'cumen', 'd ', 'do', 'doc', 'docu', 'docum', 'e ', 'e.', 'e. ', 'ec', 'eco', 'econ', 'econd', 'en', 'ent', 'ent ', 'ent.', 'ent. ', 'ent?', 'ent? ', 'fi', 'fir', 'firs', 'first', 'he', 'he ', 'hi', 'hir', 'hird', 'hird ', 'his', 'his ', 'ir', 'ird', 'ird ', 'irs', 'irst', 'irst ', 'is', 'is ', 'me', 'men', 'ment', 'ment ', 'ment.', 'ment?', 'nd', 'nd ', 'ne', 'ne.', 'ne. ', 'nt', 'nt ', 'nt.', 'nt. ', 'nt?', 'nt? ', 'oc', 'ocu', 'ocum', 'ocume', 'on', 'ond', 'ond ', 'one', 'one.', 'one. ', 'rd', 'rd ', 'rs', 'rst', 'rst ', 's ', 'se', 'sec', 'seco', 'secon', 'st', 'st ', 't ', 't.', 't. ', 't?', 't? ', 'th', 'the', 'the ', 'thi', 'thir', 'third', 'this', 'this ', 'um', 'ume', 'umen', 'ument'] (4, 138)
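The leading and trailing spaces in these features come from how `analyzer='char_wb'` works: as far as I understand it, the text is lowercased and split on whitespace, each word is padded with one space on both sides, and character n-grams are taken inside each padded word, so no n-gram spans two words. Punctuation stays attached to its word, which is why features such as 'ent. ' appear. The sketch below imitates that behavior in plain Python; `char_wb_ngrams` and `char_wb_vocabulary` are hypothetical helpers, not part of scikit-learn.

```python
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]


def char_wb_ngrams(text, min_n, max_n):
    """Character n-grams taken inside space-padded words."""
    grams = []
    for word in text.lower().split():
        padded = " " + word + " "
        for n in range(min_n, max_n + 1):
            if n >= len(padded):
                # The padded word is no longer than n: emit it once and stop.
                grams.append(padded)
                break
            for start in range(len(padded) - n + 1):
                grams.append(padded[start:start + n])
    return grams


def char_wb_vocabulary(docs, min_n, max_n):
    """Sorted set of character n-grams over all documents."""
    return sorted({gram for doc in docs for gram in char_wb_ngrams(doc, min_n, max_n)})


print(char_wb_vocabulary(corpus, 1, 1))
print(len(char_wb_vocabulary(corpus, 2, 5)))
```

Called with the range (1, 1), the sketch corresponds to the single-character setup in 3.3; with (2, 5) it corresponds to the setup in this subsection.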
