@[vsm|vector space model|text similarity]
Original post: http://www.houzhuo.net/archives/51.html
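The idea behind the VSM (vector space model) is simple: processing text content is turned into vector operations in a vector space, and similarity in that space serves as an intuitive measure of semantic similarity.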
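Text clustering
Text clustering rests mainly on the cluster hypothesis: documents of the same class are highly similar to each other, while documents of different classes are much less similar. As an unsupervised machine learning method, clustering needs neither a training phase nor manual labeling of the documents beforehand. This makes it flexible and easy to automate, and it has become an important way to organize, summarize, and navigate text collections.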
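Every text can be represented as a vector. Each dimension of the vector stands for one distinct term (a phrase or a single word) that occurs in the documents, and each term is given a weight. The simplest weight is the raw term frequency; a well-known alternative is the tf-idf weight. A document is thus converted into an n-dimensional vector.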
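Vector angle formula
Next we use the formula from secondary-school math to compute the angle between two vectors: the smaller the angle, the higher the similarity. Before comparing, of course, the two vectors have to be brought into the same dimensions (the code below demonstrates this).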
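For reference, the angle formula can be written out as follows. The symbols are assumptions made for this note: x and y stand for two document vectors of dimension n, and θ is the angle between them.

```latex
\cos\theta
  = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
  = \frac{\sum_{i=1}^{n} x_i y_i}
         {\sqrt{\sum_{i=1}^{n} x_i^{2}} \; \sqrt{\sum_{i=1}^{n} y_i^{2}}}
```

Because term weights are non-negative here, the cosine lies between 0 and 1, so the angle lies between 0 and 90 degrees.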
__author__ = 'iothz'

list_of_all_file = []

def strip_punctuation(line):
    # Delete , . ? " ' ( ) and turn [ ] and newlines into spaces.
    for ch in ',.?"\'()':
        line = line.replace(ch, '')
    for ch in '[]\n':
        line = line.replace(ch, ' ')
    return line

# Read each file into one cleaned string and collect the strings in a list.
for filename in ('science.txt', 'science2.txt'):
    with open(filename, 'r') as f:
        list_of_all_file.append(''.join(strip_punctuation(line) for line in f))
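Text preprocessing differs from case to case. The code above strips the punctuation from the two texts and appends them to a single list for the steps that follow.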
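The same cleaning step can also be expressed with a translation table. This is only a sketch, assuming Python 3 (`str.maketrans` with three arguments); the function name `clean_text` and the sample sentence are made up for illustration.

```python
# Characters that are deleted outright, and characters that become a space.
DELETE = ',.?"\'()'
TO_SPACE = '[]\n'

# Three-argument maketrans: map TO_SPACE characters to spaces, drop DELETE characters.
TABLE = str.maketrans(TO_SPACE, ' ' * len(TO_SPACE), DELETE)

def clean_text(text):
    """Strip punctuation in a single pass over the text."""
    return text.translate(TABLE)

sample = 'I love basketball, I love you (and "NLP").\n'
print(repr(clean_text(sample)))
```

A table like this keeps the list of characters in one place, which makes it easier to extend than a chain of `replace` calls.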
from collections import Counter

def build_lexicon(corpus):
    # Collect every distinct word that occurs anywhere in the corpus.
    lexicon = set()
    for doc in corpus:
        lexicon.update(doc.split())
    return lexicon

vocabulary = build_lexicon(list_of_all_file)
print('the vector of two file is [' + ', '.join(vocabulary) + ']')
the vector of two file is [and, nlp, basketball, love, her, i, baseball, you, cins]
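Counter, a Python class for counting items in a loop, is imported here, although build_lexicon itself only needs a set to collect the distinct words. We now have the vocabulary shared by both files, but the documents still cannot be compared: each one first has to be expressed as a vector of word counts over this same vocabulary.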
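To illustrate what Counter contributes, here is a small sketch that counts the words of each document and then projects the counts onto one shared vocabulary. The two strings are the cleaned documents shown in this post's output; sorting the vocabulary is a choice made here for a reproducible column order, so the columns differ from the set order printed above.

```python
from collections import Counter

docs = ["i love basketball i love you and nlp",
        "i love baseball i love her and cins"]

# One Counter per document: each only knows the words of its own document.
counts = [Counter(doc.split()) for doc in docs]

# Shared vocabulary, sorted so that the column order is reproducible.
vocabulary = sorted(set(word for doc in docs for word in doc.split()))

# A Counter returns 0 for a missing key, so both vectors get the same length.
tf_vectors = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for vec in tf_vectors:
    print(vec)
```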
def freq(term, document):
    # Number of times `term` occurs in the whitespace-tokenized document.
    return document.split().count(term)

def tf(term, document):
    # Term frequency: here simply the raw count.
    return freq(term, document)

doc_term_matrix = []
for i, doc in enumerate(list_of_all_file):
    print('the doc is "' + doc + ' " ')
    tf_vector = [tf(word, doc) for word in vocabulary]
    print(tf_vector)
    tf_vector_string = ', '.join(format(count, 'd') for count in tf_vector)
    print('the tf vector for Document %d is [%s]' % (i + 1, tf_vector_string))
    doc_term_matrix.append(tf_vector)
print('All combined, here is our master document term matrix: ')
print(doc_term_matrix)
the doc is "i love basketball i love you and nlp "
[1, 1, 1, 2, 0, 2, 0, 1, 0]
the tf vector for Document 1 is [1, 1, 1, 2, 0, 2, 0, 1, 0]
the doc is "i love baseball i love her and cins "
[1, 0, 0, 2, 1, 2, 1, 0, 1]
the tf vector for Document 2 is [1, 0, 0, 2, 1, 2, 1, 0, 1]
All combined, here is our master document term matrix:
[[1, 1, 1, 2, 0, 2, 0, 1, 0], [1, 0, 0, 2, 1, 2, 1, 0, 1]]
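This code gives us count vectors of equal length; that length is determined by the vocabulary of the corpus. Anyone with some machine learning experience knows that a few words occurring very frequently in a document can skew the analysis. To avoid this, we rescale each term-frequency vector to unit length (normalization).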
import math
import numpy as np

def normalizer(vec):
    # Divide every element by the vector's Euclidean (L2) length.
    denom = np.sum([el**2 for el in vec])
    return [(el / math.sqrt(denom)) for el in vec]

doc_term_matrix_normalizer = []
for vec in doc_term_matrix:
    doc_term_matrix_normalizer.append(normalizer(vec))

print('A regular old document term matrix ')
print(np.matrix(doc_term_matrix))
print('\nA document term matrix with row-wise norms of 1:')
print(np.matrix(doc_term_matrix_normalizer))
A regular old document term matrix
[[1 1 1 2 0 2 0 1 0]
[1 0 0 2 1 2 1 0 1]]
A document term matrix with row-wise norms of 1:
[[ 0.28867513 0.28867513 0.28867513 0.57735027 0. 0.57735027
0. 0.28867513 0. ]
[ 0.28867513 0. 0. 0.57735027 0.28867513 0.57735027
0.28867513 0. 0.28867513]]
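As a check on the first row of this output, the arithmetic can be written out by hand. The symbol v is an assumed name for the first term-frequency vector, [1, 1, 1, 2, 0, 2, 0, 1, 0].

```latex
\lVert v \rVert
  = \sqrt{1^2 + 1^2 + 1^2 + 2^2 + 0^2 + 2^2 + 0^2 + 1^2 + 0^2}
  = \sqrt{12} \approx 3.4641,
\qquad
\frac{1}{\sqrt{12}} \approx 0.28867513,
\qquad
\frac{2}{\sqrt{12}} \approx 0.57735027
```

The second vector also has a squared length of 12, which is why the same two values appear in both rows.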
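This gives us normalized vectors without losing much information. In an article, however, very common words such as "I" or "of" contribute little to a similarity comparison: they appear in every article and can even distort the result. We therefore introduce the most widely used text weighting scheme, tf-idf.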
def numDocsContaining(word, doclist):
    # Document frequency: number of documents in which `word` occurs.
    docCount = 0
    for doc in doclist:
        if freq(word, doc) > 0:
            docCount += 1
    return docCount

def idf(word, doclist):
    n_samples = len(doclist)
    df = numDocsContaining(word, doclist)
    # idf = log(N / (1 + df)); the 1 guards against division by zero.
    return np.log(float(n_samples) / (1 + df))

my_idf_vector = [idf(word, list_of_all_file) for word in vocabulary]
print('Our vocabulary vector is [' + ', '.join(vocabulary) + ']')
print('The inverse document frequency vector is [' + ', '.join(format(value, 'f') for value in my_idf_vector) + ']')
Our vocabulary vector is [and, nlp, basketball, love, her, i, baseball, you, cins]
The inverse document frequency vector is
[1.098612, 1.098612, 1.098612, 1.098612, 0.693147, 1.098612, 0.693147, 1.098612, 0.693147]
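tf-idf is a statistical method. Its main idea: if a word or phrase occurs with high frequency (TF) in one article and rarely in other articles, it is considered to discriminate well between categories and to be well suited for classification.
IDF, the inverse document frequency, measures how important a term is across the whole corpus. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents that contain the term, and then taking the logarithm of the quotient. Here the numerator is the total number of documents in the corpus, and the denominator is the number of documents containing the term.
We are nearly there. To obtain the TF-IDF weighted term vector, one simple calculation remains: tf * idf.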
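Written as formulas, the definition described above looks like this. The notation is an assumption made for this note: t is a term, d a document, D the corpus, and |D| the number of documents in it.

```latex
\mathrm{idf}(t, D) = \log \frac{|D|}{\left|\{\, d \in D : t \in d \,\}\right|},
\qquad
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
```

Implementations commonly add 1 to the denominator so that a term absent from every document does not cause a division by zero.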
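Taking the ordinary matrix product of a term-frequency vector and the idf vector (with one of them transposed) would give either a scalar or a full matrix. That is not what we want. We want a term vector with the same dimensions as before (1 x number of words), in which each element has been weighted by its own idf weight. Turning the idf vector into a diagonal matrix achieves this: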
def build_idf_matrix(idf_vector):
    # Square matrix with the idf weights on the diagonal and zeros elsewhere.
    idf_mat = np.zeros((len(idf_vector), len(idf_vector)))
    np.fill_diagonal(idf_mat, idf_vector)
    return idf_mat

my_idf_matrix = build_idf_matrix(my_idf_vector)
print(my_idf_matrix)
[[ 1.09861229 0. 0. 0. 0. 0. 0.
0. 0. ]
[ 0. 1.09861229 0. 0. 0. 0. 0.
0. 0. ]
[ 0. 0. 1.09861229 0. 0. 0. 0.
0. 0. ]
[ 0. 0. 0. 1.09861229 0. 0. 0.
0. 0. ]
[ 0. 0. 0. 0. 0.69314718 0. 0.
0. 0. ]
[ 0. 0. 0. 0. 0. 1.09861229
...]]
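The IDF vector has now been turned into a BxB matrix whose diagonal is the IDF vector. This means we can multiply each term-frequency vector by the inverse document frequency matrix. We still normalize the result once more afterwards.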
# Weight every term-frequency vector by the idf matrix.
doc_term_tfidf__matrix = []
for tf_vector in doc_term_matrix:
    doc_term_tfidf__matrix.append(np.dot(tf_vector, my_idf_matrix))

# Rescale each tf-idf vector to unit length.
doc_term_tfidf__matrix_normalizer = []
for tf_vector in doc_term_tfidf__matrix:
    doc_term_tfidf__matrix_normalizer.append(normalizer(tf_vector))

print(vocabulary)
print(np.matrix(doc_term_tfidf__matrix_normalizer))
[[ 0.28867513 0.28867513 0.28867513 0.57735027 0. 0.57735027
0. 0.28867513 0. ]
[ 0.31320094 0. 0. 0.62640189 0.19760779 0.62640189
0.19760779 0. 0.19760779]]
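Multiplying a row vector by a diagonal matrix amounts to an element-wise product with the diagonal. The sketch below shows this with plain Python lists; the helper names `diag` and `vec_mat` and the toy numbers are made up for illustration and are not part of the code above.

```python
def diag(values):
    """Square matrix with `values` on the diagonal and zeros elsewhere."""
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

def vec_mat(vec, mat):
    """Row vector times matrix, written out with plain loops."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
            for j in range(len(mat[0]))]

tf_vector = [1, 0, 2]          # toy term counts (made-up numbers)
idf_vector = [0.5, 1.0, 2.0]   # toy idf weights (made-up numbers)

via_matrix = vec_mat(tf_vector, diag(idf_vector))
elementwise = [t * w for t, w in zip(tf_vector, idf_vector)]

print(via_matrix)
print(elementwise)
```

The diagonal matrix mirrors the post's approach; for a large vocabulary the element-wise form avoids building a BxB matrix that is almost entirely zeros.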
With the tf-idf weights computed, the last step is to calculate the angle between the vectors.
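x = np.array(doc_term_tfidf__matrix_normalizer[0])
y = np.array(doc_term_tfidf__matrix_normalizer[1])

# Vector lengths; both are 1.0 because the rows were normalized.
Lx = np.sqrt(x.dot(x))
Ly = np.sqrt(y.dot(y))
print('%s %s' % (Lx, Ly))

cos_angle = x.dot(y) / (Lx * Ly)
print('cos_value: %s' % cos_angle)

# Clip to [-1, 1] so rounding error cannot push arccos out of its domain.
angle = np.arccos(np.clip(cos_angle, -1.0, 1.0))
angle2 = np.degrees(angle)
print('angle: %s' % angle2)

# Map 0..90 degrees linearly onto a similarity of 1..0.
similarity = (90 - angle2) / 90
print('similarity is: %s' % similarity)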
1.0 1.0
cos_value: 0.8137199207
angle: 35.5390135166
similarity is: 0.605122072038
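The same computation can be wrapped in two small functions that need only the standard library. This is a sketch: the function names are assumptions, returning 0.0 for an all-zero vector is a convention chosen here, and the inputs are the raw term-frequency rows from the document term matrix rather than the tf-idf rows.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_x * norm_y)

def angle_similarity(x, y):
    """Map the angle (0..90 degrees) linearly onto a similarity of 1..0."""
    cos_value = max(-1.0, min(1.0, cosine_similarity(x, y)))
    angle = math.degrees(math.acos(cos_value))
    return (90.0 - angle) / 90.0

a = [1, 1, 1, 2, 0, 2, 0, 1, 0]
b = [1, 0, 0, 2, 1, 2, 1, 0, 1]
print(cosine_similarity(a, b))
print(angle_similarity(a, b))
```

For these raw counts the dot product is 9 and both squared lengths are 12, so the cosine is 9/12 = 0.75 up to floating-point rounding.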
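That is the final result! There is of course a much shorter way, which is to let scikit-learn do the computation, but to build a solid foundation it is still worth starting from the basics.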
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=1)
tfidf_matrix = tfidf_vectorizer.fit_transform(list_of_all_file)
print(tfidf_matrix.todense())