Our goal is either to extract the keywords of an article, or, given a set of keywords, to find the articles in a corpus that match those keywords best. The two tasks boil down to essentially the same problem. Today we solve it with a very simple yet effective method: TF-IDF. In this post we take the second formulation: given a set of keywords, find the articles in the corpus that are closest to them.
TF, Term Frequency, is the number of times a term appears in a document: the larger the TF value, the more often the term occurs in that article. However, judging whether a word is a keyword by its count alone is clearly not enough. For example [1], words such as "的" and "是" appear very frequently in an article, yet they are not the keywords we want; such words are called stop words. To fix this, IDF is introduced.
IDF, Inverse Document Frequency, measures how well a term discriminates between documents: the larger a word's IDF value, the more informative that word is. This post does not list the formulas; readers who want them can refer to the cited article by Ruan Yifeng (阮一峰).
The main goal of this post is to implement a demo.
With TF (the count) and IDF (the weight) in hand, we multiply the two to get a reasonable measure of a word's importance: the TF-IDF score.
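The post skips the derivation, but for reference, the smoothed variant that the code below actually computes is

TF(w, d) = count(w, d) / |d|
IDF(w) = log( N / (df(w) + 1) )
TF-IDF(w, d) = TF(w, d) * IDF(w)

where N is the number of documents in the corpus, |d| is the number of tokens in document d, and df(w) is the number of documents containing w (the +1 guards against division by zero).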
import numpy as np
import math
file_dir = 'input/tf_idf_data.txt' # the data file is given at the end of the post
docid2content = {} # doc_id (int) -> list of tokens
word2id = {} # word (str) -> id (int)
id2word = {} # id (int) -> word (str)
word_id = 0
with open(file_dir, 'r') as f:
    doc_id = 0
    for line in f.readlines():
        seg = line.strip('\n').split(' ')
        docid2content[doc_id] = seg
        doc_id += 1
        for word in seg:
            # build the vocabulary on the fly
            if word not in word2id:
                word2id[word] = word_id
                id2word[word_id] = word
                word_id += 1
n_doc = len(docid2content)
n_word = len(word2id)
print('Document length = %d' % n_doc)
print('Unique word number = %d' % n_word)
Document length = 148
Unique word number = 20035
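A quick sanity check (my addition, not in the original post) to confirm the loader parsed the file as expected; it assumes each line of tf_idf_data.txt is one pre-segmented document whose tokens are separated by single spaces:

print(docid2content[0][:10])   # first ten tokens of document 0
print(len(docid2content[0]))   # token count of document 0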
# V = vocabulary size, M = number of documents
# Term Frequency (TF) statistics
word_tf_VM = np.zeros(shape=[n_word, n_doc])
for doc_id in range(n_doc):
    for word in docid2content[doc_id]:
        word_tf_VM[word2id[word]][doc_id] += (1.0 / len(docid2content[doc_id])) # normalize by document length
print('==========> Term frequency preview')
for i in range(5):
    print(word_tf_VM[i])
==========> Term frequency preview
[ 0.01611279  0.          0.          0.          0.          0.0021978   0. ...  0.00239808]
[ 0.00302115  0.          0.          0.          0.          0.          0. ...  0.        ]
(output truncated: five rows are printed, each holding one word's length-normalized frequency across the 148 documents; the matrix is extremely sparse)
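Because every count is divided by the document length, each column of word_tf_VM should sum to 1. A small sanity check (my own sketch, assuming no empty lines in the data file):

col_sums = word_tf_VM.sum(axis=0)             # one sum per document
assert np.allclose(col_sums, np.ones(n_doc))  # TF values of each document sum to 1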
# Inverse Document Frequency (IDF)
word_idf_V = np.zeros(shape=[n_word])
# first count the document frequency of each word ...
for i in range(n_word):
    word = id2word[i]
    for doc_id in range(n_doc):
        if word in docid2content[doc_id]:
            word_idf_V[i] += 1
# ... then convert it to IDF with +1 smoothing
for i in range(n_word):
    word_idf_V[i] = math.log(n_doc / (word_idf_V[i] + 1))
print('==========> IDF preview')
for i in range(5):
    print(word_idf_V[i])
==========> IDF preview
1.73911573574
2.22462355152
2.16399892971
3.8985999851
3.61091791264
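The double loop above runs a membership test over the whole document for every one of the 20,035 words, which is slow. An equivalent but much faster variant (my own sketch, not from the original post) reads the document frequency directly off the non-zero entries of word_tf_VM:

df = np.count_nonzero(word_tf_VM, axis=1)        # number of documents containing each word
word_idf_V_fast = np.log(n_doc / (df + 1.0))     # same +1 smoothing as above
assert np.allclose(word_idf_V_fast, word_idf_V)  # matches the loop-based result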
# Given query keywords, use the TF-IDF scores to print the ids of the top-3 closest documents
input_word = [2, 5, 10, 34, 100]
for word_id in input_word:
    word = id2word[word_id]
    tf_idf = list()  # each element is (doc_id, tf-idf score)
    for doc_id in range(n_doc):
        tf_idf.append((doc_id, word_tf_VM[word_id][doc_id] * word_idf_V[word_id]))
    sort_tf_idf = sorted(tf_idf, key=lambda x: x[1], reverse=True)
    print(word, '==>', sort_tf_idf[0], sort_tf_idf[1], sort_tf_idf[2])
你们好 ==> (106, 0.0058250307663738872) (30, 0.0045130321787443146) (52, 0.0034679470027370175)
二周目 ==> (0, 0.062589924690941171) (20, 0.0063331879234507513) (101, 0.0056082711158862925)
微信 ==> (127, 0.0083196841455246261) (126, 0.0069068013846811339) (109, 0.006832832962890706)
弹幕 ==> (23, 0.0017124917032905211) (0, 0.0012503086454034534) (3, 0.0010561444578422166)
快乐 ==> (125, 0.009259887219141132) (121, 0.0070725122854857457) (88, 0.0037244328690671309)
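The demo above ranks documents for one keyword at a time. For the goal stated at the beginning (a group of keywords as the query), one simple extension, sketched here under the assumption that summing per-word scores is an acceptable way to combine them, is to add up the TF-IDF scores of all query words for each document:

query = [2, 5, 10]                  # ids of the query keywords (hypothetical choice)
scores = np.zeros(n_doc)
for wid in query:
    scores += word_tf_VM[wid] * word_idf_V[wid]   # add this word's TF-IDF for every document
top3 = np.argsort(-scores)[:3]      # document ids with the highest combined score
print('query ==>', [(int(d), float(scores[d])) for d in top3])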