The word-similarity methods implemented below are all evaluated on the WordSimilarity-353 dataset: each method computes a similarity score for every word pair, and Spearman's rank correlation is then used to measure how well the predicted scores agree with the human-annotated gold scores.
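The evaluation step is just `scipy.stats.spearmanr` applied to the two score columns. A minimal sketch on made-up toy scores (the numbers below are illustrative, not from WordSim-353):

```python
from scipy import stats

# Hypothetical gold and predicted similarity scores for five word pairs.
gold = [9.0, 7.5, 6.0, 3.0, 1.0]
pred = [0.95, 0.80, 0.60, 0.40, 0.10]

# spearmanr ranks both lists and correlates the ranks, so the absolute
# scale of the predictions does not matter, only their ordering.
coef, pvalue = stats.spearmanr(gold, pred)
print(coef)  # 1.0: the predicted ordering matches the gold ordering exactly
```

Because only the rank ordering matters, a method's raw scores do not need to live on the same 0-10 scale as the annotations.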
1. The semantic-dictionary-based method (WordNet)
The most commonly used semantic dictionary is WordNet, which is easiest to access from Python through NLTK: install it with pip install nltk, then download nltk-data and place it in the appropriate folder (the officially recommended downloader is extremely slow). The implementation below uses the Wu-Palmer measure (wup_similarity), which scores two synsets by the depth of their lowest common ancestor in the WordNet taxonomy (for details see http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html; NLTK also offers path_similarity, based on shortest path length), and takes the maximum similarity over all synset pairs as the result for a word pair.
# -*- coding: utf-8 -*-
from nltk.corpus import wordnet as wn
import pandas as pd
import numpy as np
from scipy import stats

def WordSimibywup_simi():
    data = pd.read_csv("combined.csv")
    wordsList = np.array(data.iloc[:, [0, 1]])
    simScore = np.array(data.iloc[:, [2]])
    predScore = np.zeros((len(simScore), 1))
    for i, (word1, word2) in enumerate(wordsList):
        print("process #%d: %s and %s" % (i, word1, word2))
        synsets1 = wn.synsets(word1)
        synsets2 = wn.synsets(word2)
        score = 0  # must be reset for every word pair
        for synset1 in synsets1:
            for synset2 in synsets2:
                temp = synset1.wup_similarity(synset2)
                if temp is not None and temp > score:
                    score = temp
        predScore[i, 0] = score
    submitData = np.hstack((wordsList, simScore, predScore))
    pd.DataFrame(submitData).to_csv("WordSimbywup_simi.csv", index=False,
                                    header=["Word1", "Word2", "OriginSimi", "PredSimi"])
    coef1, pvalue = stats.spearmanr(simScore, predScore)
    print("WordSimibywup_simi:", 'correlation=', coef1, 'pvalue=', pvalue)

if __name__ == '__main__':
    WordSimibywup_simi()
# (WordSimibywup_simi: correlation= 0.339297340633 pvalue= 5.85303993897e-11)
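The inner double loop is simply "take the maximum score over every pair of synsets, skipping pairs for which NLTK returns None". That pattern can be sketched in plain Python, which also makes explicit that the running maximum must start fresh at 0 for each word pair (the scoring function here is a toy stand-in, not a WordNet call):

```python
from itertools import product

def max_pair_score(items1, items2, score_fn):
    """Maximum of score_fn over the cross product, ignoring None scores."""
    best = 0.0  # the running maximum starts fresh for every call / word pair
    for a, b in product(items1, items2):
        s = score_fn(a, b)
        if s is not None and s > best:
            best = s
    return best

# Toy stand-in for wup_similarity: Jaccard overlap of character sets,
# returning None (like NLTK does for incomparable synsets) when disjoint.
def toy_sim(a, b):
    common = set(a) & set(b)
    if not common:
        return None
    return len(common) / len(set(a) | set(b))

print(max_pair_score(["car", "cab"], ["arc", "bar"], toy_sim))  # 1.0
```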
2. The word-vector-based method (GoogleNews Word2vec)

This is the word-vector model that Google pre-trained with Word2vec on Google News; it contains vectors for an enormous vocabulary. Search online for GoogleNews-vectors-negative300.bin.gz, download it, and unzip it. Here we use the gensim toolkit (installable via pip install gensim); the code is as follows:
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import gensim
from scipy import stats

def WordSimibyGoogleNews():
    df = pd.read_csv('combined.csv')
    data = np.array(df.iloc[:, [0, 1]])
    simScore = np.array(df.iloc[:, [2]])
    Prescore = np.zeros((len(data), 1))
    model = gensim.models.KeyedVectors.load_word2vec_format(
        'GoogleNews-vectors-negative300.bin', binary=True)
    for i, (word1, word2) in enumerate(data):
        print("process #%d: %s and %s" % (i, word1, word2))
        Prescore[i, 0] = model.similarity(word1, word2)
    coef1, pvalue = stats.spearmanr(simScore, Prescore)
    submitData = np.hstack((data, simScore, Prescore))
    pd.DataFrame(submitData).to_csv("wordsimbypath_GoogleNews.csv", index=False,
                                    header=["Word1", "Word2", "OriginSimi", "PredSimi"])
    print("WordSimibyCS:", 'correlation=', coef1, 'pvalue=', pvalue)

if __name__ == '__main__':
    WordSimibyGoogleNews()
# (WordSimibyCS: correlation= 0.700016648627 pvalue= 2.86866666051e-53)
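gensim's model.similarity returns the cosine similarity of the two word vectors, which is easy to reproduce by hand with numpy. A sketch on made-up 3-dimensional vectors (these are illustrative toys, not real GoogleNews embeddings):

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the L2 norms.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional "embeddings", for illustration only.
v_king = np.array([0.8, 0.1, 0.3])
v_queen = np.array([0.7, 0.2, 0.3])
v_apple = np.array([-0.1, 0.9, 0.0])

print(cosine(v_king, v_queen) > cosine(v_king, v_apple))  # True: king is closer to queen
```

Note that, unlike the WordNet approach, this method fails outright for words missing from the pre-trained vocabulary, so model.similarity raises an error for out-of-vocabulary words.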