【Natural Language Processing】Word Similarity Computation

        The word-similarity methods below are implemented and evaluated on the WordSimilarity-353 dataset: each method computes a similarity score for every given word pair, and Spearman's rank correlation is then used to measure how well the predicted similarities agree with the human-annotated ones.
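        Spearman's rank correlation depends only on the rank order of the two score lists, not on their absolute scales, so 0-10 human ratings can be compared directly with, say, 0-1 model scores. A minimal sketch with scipy:

from scipy import stats

human = [10.0, 7.4, 6.3, 1.2]     # human-annotated similarities (0-10 scale)
pred  = [0.92, 0.81, 0.40, 0.05]  # predicted similarities (0-1 scale)
coef, pvalue = stats.spearmanr(human, pred)
print(coef)  # 1.0 -- both lists rank the word pairs identically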

1. Method based on a semantic lexicon

        The most widely used semantic lexicon is WordNet, which is typically accessed directly from Python: install NLTK via pip install nltk, then download nltk-data and place it in the appropriate folder (the officially recommended download method is extremely slow). The implementation below uses the Wu-Palmer measure (wup_similarity), which scores two synsets by the depth of their least common subsumer in the WordNet taxonomy (see: http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html), and takes the maximum similarity over all synset pairs as the result for a word pair.
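        As a quick sanity check, this is what wup_similarity returns for a single synset pair (a minimal sketch; the exact value may vary slightly with the WordNet version). A word usually maps to several synsets, which is why the script below maximizes over all combinations:

from nltk.corpus import wordnet as wn

dog = wn.synsets('dog')[0]      # Synset('dog.n.01')
cat = wn.synsets('cat')[0]      # Synset('cat.n.01')
print(dog.wup_similarity(cat))  # roughly 0.857

The full evaluation over WordSim-353: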

# -*- coding: utf-8 -*-

from nltk.corpus import wordnet as wn
import pandas as pd
import numpy as np
from scipy import stats

def WordSimibywup_simi():
    # WordSim-353: columns are Word1, Word2 and the human similarity score.
    data = pd.read_csv("combined.csv")
    wordsList = np.array(data.iloc[ :, [ 0, 1 ] ])
    simScore = np.array(data.iloc[ :, [ 2 ] ])
    predScore = np.zeros((len(simScore), 1))
    for i, (word1, word2) in enumerate(wordsList):
        print("process #%d: %s and %s" % (i, word1, word2))
        synsets1 = wn.synsets(word1)
        synsets2 = wn.synsets(word2)
        # Reset the running maximum for every word pair, then keep the
        # highest Wu-Palmer similarity over all synset combinations.
        score = 0.0
        for synset1 in synsets1:
            for synset2 in synsets2:
                temp = synset1.wup_similarity(synset2)
                if temp is not None and temp > score:
                    score = temp
        predScore[ i, 0 ] = score
    submitData = np.hstack((wordsList, simScore, predScore))
    (pd.DataFrame(submitData)).to_csv("WordSimbywup_simi.csv", index=False,\
                                      header=[ "Word1", "Word2", "OriginSimi", "PredSimi" ])
    (coef1, pvalue) = stats.spearmanr(simScore, predScore)
    print("WordSimibywup_simi:", 'correlation=', coef1, 'pvalue=', pvalue)

if __name__=='__main__':
    WordSimibywup_simi()
    # (WordSimibywup_simi: correlation= 0.339297340633 pvalue= 5.85303993897e-11)

2. Computing word similarity with the GoogleNews word vectors

        This is a word-vector model that Google pretrained on the Google News corpus with Word2vec; it contains 300-dimensional vectors for roughly 3 million words and phrases. Search online for GoogleNews-vectors-negative300.bin.gz, download it, and decompress it. Here we use the gensim toolkit, which can be installed via pip install gensim.
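        The binary file is large (about 1.5 GB compressed and several GB once loaded into memory), so loading takes a while. For quick experiments, load_word2vec_format accepts a limit parameter that reads only the most frequent vectors (a sketch; rare words will then be missing from the model):

import gensim

# Load only the 500,000 most frequent vectors to save time and memory.
model = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True, limit=500000)
print(model.similarity('car', 'automobile'))

The full evaluation script over WordSim-353: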

# -*- coding:utf-8 -*-

import pandas as pd
import numpy as np
import gensim
from scipy import stats

def WordSimibyGoogleNews():
    df = pd.read_csv('combined.csv')
    data = np.array(df.iloc[ :, [ 0, 1 ] ])
    simScore = np.array(df.iloc[ :, [ 2 ] ])
    Prescore = np.zeros((len(data), 1))
    # Load the pretrained 300-dimensional GoogleNews vectors (binary format).
    model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    for i, (word1, word2) in enumerate(data):
        print("process #%d: %s and %s" % (i, word1, word2))
        # Cosine similarity between the two pretrained word vectors.
        Prescore[ i, 0 ] = model.similarity(word1, word2)
    (coef1, pvalue) = stats.spearmanr(simScore, Prescore)
    submitData = np.hstack((data, simScore, Prescore))
    (pd.DataFrame(submitData)).to_csv("wordsimbypath_GoogleNews.csv", index=False,
                                      header=[ "Word1", "Word2", "OriginSimi", "PredSimi" ])
    print("WordSimibyCS:", 'correlation=', coef1, 'pvalue=', pvalue)

if __name__=='__main__':
    WordSimibyGoogleNews()
    #(WordSimibyCS: correlation= 0.700016648627 pvalue= 2.86866666051e-53)
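        One practical caveat: model.similarity raises a KeyError if either word is missing from the pretrained vocabulary. The run above covered every WordSim-353 pair without hitting this, but for other datasets a guarded lookup is a safer default (a minimal sketch; the 0.0 fallback is an arbitrary choice):

def safe_similarity(model, word1, word2, default=0.0):
    # gensim's KeyedVectors supports vocabulary membership tests with `in`.
    if word1 in model and word2 in model:
        return model.similarity(word1, word2)
    return default  # arbitrary score for out-of-vocabulary words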

        The above are just the two simplest approaches. Many others exist: for example, one can crawl a suitable corpus, compute a TF-IDF feature vector for each word, and directly compare the cosine similarity of the two target words' TF-IDF vectors; similarity can even be estimated from the number of pages a search engine returns for the words.
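        To make the TF-IDF idea concrete, here is a minimal sketch on a toy three-document corpus (scikit-learn is assumed purely for illustration; any TF-IDF implementation works). Each word is represented by its TF-IDF weights across documents, and two words are compared by the cosine of those vectors:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; in practice this would be a crawled document collection.
corpus = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the dog barked at the mailman",
]
vectorizer = TfidfVectorizer()
doc_term = vectorizer.fit_transform(corpus)   # documents x terms
term_doc = doc_term.T.tocsr()                 # terms x documents: one row per word
vocab = vectorizer.vocabulary_                # term -> row index in term_doc
sim = cosine_similarity(term_doc[vocab["cat"]], term_doc[vocab["dog"]])
print(sim[0, 0])  # nonzero, since "cat" and "dog" co-occur in one document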
