The goal of this post is to automatically extract article topics using the sklearn toolkit.
First, a quick review of the basics of tf-idf:
TF-IDF (term frequency–inverse document frequency) is a weighting technique widely used in information retrieval and text mining. It is a statistical measure of how important a word is to one document within a collection or corpus. A word's importance increases proportionally with the number of times it appears in the document, but is offset by how frequently it appears across the corpus. Variants of TF-IDF weighting are often used by search engines to score and rank a document's relevance to a user query; in addition to TF-IDF, web search engines also use link-analysis-based ranking methods to determine the order in which documents appear in search results.
TF: term frequency, which measures how often a word occurs in a document. Because document lengths vary widely, a word may appear far more times in one document than in another simply because the first document is longer, so term frequency is usually the raw count of the word divided by the document length, which amounts to a normalization.
IDF: inverse document frequency, which measures how informative a word is. Term frequency alone treats every word as equally important, but words such as "is", "of", and "that" may occur very often while carrying little meaning, so we down-weight words that appear frequently across many documents.
TF-IDF is simply the product of the two: tf-idf(t, d) = tf(t, d) × idf(t).
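As a sanity check of the product formula above, the sketch below reproduces scikit-learn's default tf-idf weighting (raw counts times a smoothed idf, followed by l2 row normalization) by hand on a toy corpus and compares it against TfidfVectorizer; the corpus and variable names here are illustrative, not part of the original post:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat", "the cat ran", "dogs ran fast"]  # toy corpus for illustration
vec = TfidfVectorizer()  # defaults: smooth_idf=True, norm='l2'
X = vec.fit_transform(corpus).toarray()

# Reproduce the same weighting by hand, using the same vocabulary/column order.
counts = CountVectorizer(vocabulary=vec.vocabulary_).fit_transform(corpus).toarray()
n = counts.shape[0]                      # number of documents
df = (counts > 0).sum(axis=0)            # document frequency of each word
idf = np.log((1 + n) / (1 + df)) + 1     # smoothed idf, as in scikit-learn
tfidf = counts * idf                     # tf * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)  # l2-normalize rows

assert np.allclose(X, tfidf)             # matches TfidfVectorizer's output
```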
The implementation follows:
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.datasets import fetch_20newsgroups
import numpy as np
n_features = 1000       # vocabulary size
stop_words = 'english'  # stop-word list
n_top_words = 10        # show the 10 most representative words per topic
n_topics = 10           # number of topics, i.e. how many classes the articles are split into

# Download the article data
print("Loading dataset ...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
print("done in %0.3f s." % (time() - t0))
data_samples = dataset.data[:]  # data_samples is a list of strings

# Print one sample to see what the data looks like
print('------------------------------------------------------------------------------')
print(dataset['data'][1].replace('\n', ''))
print('------------------------------------------------------------------------------')

print("Extracting tf-idf features ...")
# Compute the tf-idf of each text as the input feature for the NMF model.
# max_df=0.95, min_df=2: heuristic preprocessing that drops words occurring in
#   only one document or in more than 95% of the documents
# max_features=n_features: of the remaining words, keep the n_features most frequent ones
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words=stop_words)
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(data_samples)  # bag-of-words representation of each article
print("done in %0.3f s." % (time() - t0))
print("tf-idf bag-of-words representation of sample #1:", tfidf[1])

# Train the NMF model
print("Fitting the NMF model to the tf-idf feature matrix with %d topics..." % n_topics)
t0 = time()
# What the model learns about the texts is stored in the nmf.components_ matrix
# (in scikit-learn >= 1.2 the alpha parameter is split into alpha_W / alpha_H)
model = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3f s." % (time() - t0))

print("\nMost representative words of each topic:")
# in scikit-learn >= 1.2 use list(tfidf_vectorizer.get_feature_names_out())
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
for topic_idx, topic in enumerate(model.components_):
    print("Topic #%d:" % topic_idx)
    print(", ".join([tfidf_feature_names[i]
                     for i in topic.argsort()[:-n_top_words - 1:-1]]))

# Distribution of individual words over the topics
word_ids = [None] * 4
word_ids[0] = tfidf_feature_names.index('software')
word_ids[1] = tfidf_feature_names.index('computer')
word_ids[2] = tfidf_feature_names.index('faith')
word_ids[3] = tfidf_feature_names.index('bible')
xs = [None] * 4
for i in range(4):
    xs[i] = model.components_[:, word_ids[i]]  # column of components_: the word's weight in every topic
print("Coordinates/representations (features) of four words in topic space\n")
print("Each word is a %d-dimensional topic vector:" % n_topics)
print('software ', 'computer ', 'faith ', 'bible')
print(np.array(xs).T)
The output looks like the following:
Loading dataset ...
done in 1.113 s.
------------------------------------------------------------------------------
Yeah, do you expect people to read the FAQ, etc. and actually accept hardatheism? No, you need a little leap of faith, Jimmy. Your logic runs outof steam!Jim,Sorry I can't pity you, Jim. And I'm sorry that you have these feelings ofdenial about the faith you need to get by. Oh well, just pretend that it willall end happily ever after anyway. Maybe if you start a new newsgroup,alt.atheist.hard, you won't be bummin' so much?Bye-Bye, Big Jim. Don't forget your Flintstone's Chewables! :) --Bake Timmons, III
------------------------------------------------------------------------------
Extracting tf-idf features ...
done in 1.534 s.
tf-idf bag-of-words representation of sample #1: (0, 359) 0.157950555534
(0, 670) 0.0931791735124
(0, 741) 0.121443360906
(0, 369) 0.173106148429
(0, 88) 0.125432039287
(0, 83) 0.162222363177
(0, 441) 0.262077091882
(0, 138) 0.200556359561
(0, 620) 0.215594107778
(0, 546) 0.122396192927
(0, 366) 0.334605423579
(0, 781) 0.16296634579
(0, 503) 0.53173441063
(0, 837) 0.299508506649
(0, 644) 0.152814863448
(0, 508) 0.0832618012544
(0, 341) 0.129313818352
(0, 584) 0.131614732473
(0, 851) 0.136967454787
(0, 625) 0.102062172628
(0, 627) 0.167821730866
(0, 103) 0.178910856535
(0, 978) 0.136899150037
(0, 162) 0.136424939982
(0, 321) 0.0850802014059
Fitting the NMF model to the tf-idf feature matrix with 10 topics...
done in 0.759 s.
Most representative words of each topic:
Topic #0:
don, just, people, think, like, know, time, good, right, ve
Topic #1:
card, video, monitor, drivers, cards, bus, vga, driver, color, ram
Topic #2:
god, jesus, bible, christ, faith, believe, christian, christians, church, sin
Topic #3:
game, team, year, games, season, players, play, hockey, win, player
Topic #4:
car, new, 00, sale, 10, price, offer, condition, shipping, 20
Topic #5:
thanks, does, know, advance, mail, hi, anybody, info, looking, help
Topic #6:
windows, file, use, files, dos, window, program, using, problem, running
Topic #7:
edu, soon, cs, university, com, email, internet, article, ftp, send
Topic #8:
key, chip, encryption, clipper, keys, government, escrow, public, use, algorithm
Topic #9:
drive, scsi, drives, hard, disk, ide, controller, floppy, cd, mac
Coordinates/representations (features) of four words in topic space
Each word is a 10-dimensional topic vector:
software computer faith bible
[[ 0. 0.04794712 0.04666755 0.10193202 0.11112844]
[ 0.14831491 0.18456368 0. 0. 0. ]
[ 0. 0. 0.58343343 0.72806754 0. ]
[ 0. 0. 0. 0. 0. ]
[ 0.18538898 0.19922135 0. 0. 0.01818877]
[ 0.22524223 0.02979431 0. 0. 0. ]
[ 0.47153804 0.18160167 0. 0. 0. ]
[ 0. 0.07469935 0. 0. 0.03092071]
[ 0.00697661 0.12276027 0. 0. 0. ]
[ 0.17984988 0.22276098 0. 0. 0. ]]
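Besides the topic-word matrix `components_` read above, NMF also gives each document a coordinate in topic space via `transform`. The self-contained sketch below (a tiny illustrative corpus, not the 20newsgroups data used in this post) shows how the factorization splits into a document-topic matrix W and a topic-word matrix H, and how each document's dominant topic can be read off:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Toy corpus with two obvious themes (religion vs. computers); illustrative only
docs = [
    "the bible teaches faith and god",
    "faith in god and the church",
    "new computer software and hardware",
    "install the software on the computer",
]
tfidf = TfidfVectorizer().fit_transform(docs)
nmf = NMF(n_components=2, random_state=1, max_iter=500).fit(tfidf)

W = nmf.transform(tfidf)  # shape (n_docs, n_topics): document-topic weights
H = nmf.components_       # shape (n_topics, n_words): topic-word weights
print("dominant topic per document:", W.argmax(axis=1))
```

With this vocabulary split, the first two documents land on one topic and the last two on the other; `tfidf ≈ W @ H` is the low-rank approximation NMF optimizes.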