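TF-IDF (Term Frequency-Inverse Document Frequency) is one of the most widely used methods for extracting features from text, with broad application in natural language processing (NLP) tasks such as information retrieval and semantic extraction. This article briefly walks through an implementation of TF-IDF for English text and presents the result visually as a word cloud, a currently popular format.
Development environment: Python 3.6.0 and NLTK 3.2. (NLTK, the Natural Language Toolkit, is a Python library widely used for natural language processing; its built-in methods can greatly speed up development. It is available from the official Natural Language Toolkit site and can also be installed with pip.)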
$ pip3 install nltk
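# Run-time NLTK data: the 'punkt' tokenizer models and the 'stopwords' corpus (fetch with nltk.download).
import nltk
import math
import string
from nltk.corpus import stopwords
from collections import Counter
from nltk.stem.porter import PorterStemmer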
text1 = "Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof."
text2 = "The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem.[2] However, real progress was much slower, and after the ALPAC report in 1966, which found that ten-year-long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed."
text3 = "During the 1970s, many programmers began to write conceptual ontologies, which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert 1981). During this time, many chatterbots were written including PARRY, Racter, and Jabberwacky."
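def get_tokens(text):
    # Lowercase the text so that "Language" and "language" count as one word.
    lower = text.lower()
    # Build a translation table that deletes every ASCII punctuation character.
    remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
    no_punctuation = lower.translate(remove_punctuation_map)
    # Split the cleaned text into a list of word tokens.
    tokens = nltk.word_tokenize(no_punctuation)
    return tokens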
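This step defines a tokenization function. It converts all letters to lowercase to simplify later analysis, strips punctuation, and turns a paragraph of text into a Python list of words. For example, the sentence "Natural language processing is cool!" becomes ["natural", "language", "processing", "is", "cool"]; the exclamation mark is dropped along with the rest of the punctuation.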
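As a rough illustration of the normalization step, the sketch below lowercases a sentence and strips ASCII punctuation with the same `str.translate` approach as `get_tokens`. To avoid any NLTK data download it splits on whitespace instead of calling `nltk.word_tokenize`; that substitution and the helper name `simple_tokens` are assumptions made for this example only.

```python
import string


def simple_tokens(text):
    """Lowercase, drop ASCII punctuation, then split on whitespace."""
    # Map every punctuation character to None so translate() deletes it.
    remove_punctuation_map = {ord(char): None for char in string.punctuation}
    cleaned = text.lower().translate(remove_punctuation_map)
    # str.split() stands in for nltk.word_tokenize in this sketch.
    return cleaned.split()


print(simple_tokens("Natural language processing is cool!"))
# ['natural', 'language', 'processing', 'is', 'cool']
```

One side effect worth knowing: because hyphens are deleted rather than replaced by spaces, a compound such as "human-computer" collapses into the single token "humancomputer".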
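def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        # Reduce each token to its stem, e.g. "processing" -> "process".
        stemmed.append(stemmer.stem(item))
    return stemmed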
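This step stems the tokens produced above. English words take many inflected forms: apple and apples are singular and plural, while process and processing are a base form and its participle or gerund. Such variants usually carry the same meaning, so reducing them to a shared stem lets words with the same root be counted together.
Together with stop-word removal (done in count_term below), this completes the routine text preprocessing for NLP. Next comes a brief look at how TF-IDF works.
TF (Term Frequency): how often a word occurs in a piece of text. Because preprocessing removes English stop words (words such as to, is, are, and the, which occur very often but carry little meaning), we can generally assume that the more frequently a remaining word appears, the more it matters to the document as a whole.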
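To make the term-frequency idea concrete, here is a small sketch that counts words with `collections.Counter` and divides each count by the number of tokens that remain. The token list and the tiny stop-word set are made up for illustration; the article itself relies on NLTK's English stop-word list.

```python
from collections import Counter

# Hypothetical, already-tokenized input (not part of the article's corpus).
tokens = ["this", "thing", "made", "in", "china",
          "and", "this", "thing", "is", "big"]
# A tiny stand-in for NLTK's English stop-word list.
stop_words = {"this", "is", "in", "and"}

filtered = [w for w in tokens if w not in stop_words]
count = Counter(filtered)
total = sum(count.values())

# Term frequency: occurrences of a word divided by all remaining tokens.
term_freq = {word: n / total for word, n in count.items()}
print(term_freq)
```

Five tokens survive the filter, so thing scores 2/5 = 0.4 and each of the other words scores 0.2.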
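IDF (Inverse Document Frequency): Think back to stop words: they occur very frequently yet say nothing about what a document actually means. The same holds, to a lesser degree, among the words that are not stop words: some convey the content far better than others. Take the sentence "this thing made in china, and this thing is big". Here thing is only a placeholder; it tells you neither what the object is nor anything about its properties, whereas big and china describe what the sentence is about. Yet thing has a higher term frequency than either china or big, which is clearly a problem. To correct for this, TF is adjusted by the inverse document frequency, a weight that decreases as a word becomes more common across documents.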
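The effect of this correction can be sketched on a toy corpus. The three term sets below are invented for illustration, and two weightings are shown side by side: `idf_as_in_article`, which mirrors the expression used in this article's `idf` function, and `idf_log_ratio`, the more commonly cited form that takes the logarithm of the whole ratio. Both helper names are assumptions of this sketch.

```python
import math

# Hypothetical corpus: each document is reduced to its set of distinct terms.
docs = [
    {"thing", "made", "china", "big"},
    {"thing", "small", "cheap"},
    {"thing", "china", "export"},
]


def n_containing(word, docs):
    """Number of documents in which the word occurs."""
    return sum(1 for doc in docs if word in doc)


def idf_as_in_article(word, docs):
    # log(N) divided by (1 + document frequency), as written in the article.
    return math.log(len(docs)) / (1 + n_containing(word, docs))


def idf_log_ratio(word, docs):
    # Commonly cited smoothed form: log of the whole ratio N / (1 + df).
    return math.log(len(docs) / (1 + n_containing(word, docs)))


for word in ("big", "china", "thing"):
    print(word, n_containing(word, docs),
          round(idf_as_in_article(word, docs), 4),
          round(idf_log_ratio(word, docs), 4))
```

Either way the rarer word receives the larger weight. With only three documents, the log-ratio form already drops to zero for a word found in two documents and below zero for a word found in all three, which is worth keeping in mind on very small corpora.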
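def tf(word, count):
    # Term frequency: occurrences of the word divided by all tokens in the document.
    return count[word] / sum(count.values())

def n_containing(word, count_list):
    # Document frequency: number of documents that contain the word.
    return sum(1 for count in count_list if word in count)

def idf(word, count_list):
    # Weight used for the output shown below: log(N) / (1 + df).
    # The more common definition instead takes the log of the whole ratio N / (1 + df).
    return math.log(len(count_list)) / (1 + n_containing(word, count_list))

def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)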
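def count_term(text):
    tokens = get_tokens(text)
    # Drop English stop words; build the set once instead of once per token.
    stop_words = set(stopwords.words('english'))
    filtered = [w for w in tokens if w not in stop_words]
    # Stem the remaining tokens and count how often each stem occurs.
    stemmer = PorterStemmer()
    stemmed = stem_tokens(filtered, stemmer)
    count = Counter(stemmed)
    return count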
def main():
    texts = [text1, text2, text3]
    # Build one term-count Counter per document.
    countlist = []
    for text in texts:
        countlist.append(count_term(text))
    for i, count in enumerate(countlist):
        print("Top words in document {}".format(i + 1))
        scores = {word: tfidf(word, count, countlist) for word in count}
        sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        # Print the five highest-scoring stems for this document.
        for word, score in sorted_words[:5]:
            print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))

if __name__ == "__main__":
    main()
Running the script prints the following:
[nltk_data] Downloading package punkt to ...
[nltk_data] \AppData\Roaming\nltk_data...
[nltk_data] Package punkt is already up-to-date!
Top words in document 1
    Word: languag, TF-IDF: 0.07121
    Word: natur, TF-IDF: 0.06103
    Word: comput, TF-IDF: 0.04069
    Word: process, TF-IDF: 0.03052
    Word: concern, TF-IDF: 0.02034
Top words in document 2
    Word: translat, TF-IDF: 0.05086
    Word: machin, TF-IDF: 0.02713
    Word: research, TF-IDF: 0.02034
    Word: sixti, TF-IDF: 0.01017
    Word: littl, TF-IDF: 0.01017
Top words in document 3
    Word: mani, TF-IDF: 0.02555
    Word: lehnert, TF-IDF: 0.02555
    Word: 1978, TF-IDF: 0.02555
    Word: began, TF-IDF: 0.01277
    Word: exampl, TF-IDF: 0.01277
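As a sanity check, the top score for document 1 can be reproduced by hand. The sketch assumes that the stem `languag` occurs 7 times among 54 tokens left in document 1 after stop-word removal, and that it appears in only one of the three documents. These counts come from a manual tally of `text1`, not from the article, so treat them as assumptions.

```python
import math

# Assumed counts from a manual tally of text1 (not stated in the article).
occurrences = 7        # times the stem "languag" appears in document 1
total_tokens = 54      # tokens left in document 1 after stop-word removal
docs_containing = 1    # documents that contain "languag"
n_docs = 3

tf = occurrences / total_tokens
# Same expression as the article's idf(): log(N) / (1 + document frequency).
idf = math.log(n_docs) / (1 + docs_containing)
print(round(tf * idf, 5))
# 0.07121
```

Under those assumed counts the result agrees with the first line of the printed output, which suggests the scores were produced by the `idf` expression exactly as written in the code.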
