由于自己在csdn的上传空间不够,暂时将语料放在百度云上
链接: https://pan.baidu.com/s/1qYKRXOo 密码: 4psr
文件名是 text8
或者在gensim的word2vec模块官网进行下载。
下载word2vec源文件
http://download.csdn.net/detail/a1368783069/9585714
在终端直接 make 会报错:
error: malloc.h: No such file or directory
找到报错的文件,
将#include 注释,并添加头文件#include
如下所示:
//#include
#include
当以上错误在所有文件中都被改正后,直接 make进行编译,最后编译,并出现一句错误(忘了),不用管,打开word2vec自带的demo,例如 ./demo-analogy.sh,查看是否正常运行。
安装gnesim:
自己使用的python3.5,在终端直接 pip3 install gensim 进行安装。
pip3 install gensim
通过加载训练预料,进行初始化并训练word2vec模型
from gensim.models import word2vec
sentences = word2vec.Text8Corpus("./text8")
model = word2vec.Word2Vec(sentences,size=200)
计算相似的词,topn=1设置自取最相似词表中的第一个
model.most_similar(positive=["woman","king"],negative=["man"],topn=1)
# output : [('queen', 0.6571654081344604)]
#其意义是计算一个词d(或者词表),使得该词的向量v(d)与v(a="woman")-v(c="man")+v(b="king")最近
持久化训练后的模型到本地,加载持久化的模型创建新的模型
model.save("./text8.model")
model.save_word2vec_format("./text8.model.bin",binary=True)
model= word2vec.Word2Vec.load_word2vec_format("./text8.model.bin", binary=True)
1,类比的方法,即原版本中的 Analogy 方法。
model.most_similar(['girl','father'],['boy'],topn=3)
# output:[('mother', 0.7488583326339722), ('wife', 0.7154518365859985), ('lover', 0.6844878196716309)]
model.most_similar(['girl','father'],['boy'],topn=10)
#output:[('mother', 0.7488583326339722), ('wife', 0.7154518365859985), ('lover', 0.6844878196716309), ('daughter', 0.673076331615448), ('grandmother', 0.6717995405197144), ('grandfather', 0.6610891819000244), ('widow', 0.6601356267929077), ('aunt', 0.6599773168563843), ('brother', 0.6576226949691772), ('servant', 0.6481131315231323)]
more_examples = ["he his she", "big bigger bad","going went being"]
for example in more_examples:
a,b,x = example.split()
predicted = model.most_similar([x,b],[a])[0][0]
print(a,b,x,predicted)
#output:
#he his she her
#big bigger bad worse
#going went being was
2, 计算与当前词距离最近的词,即原版本中的distance方法
model.most_similar(["boy"], topn=3)
# or
model.most_similar("boy", topn=3)
#[('girl', 0.7218124270439148), ('kid', 0.6669290065765381), ('boss', 0.6069169044494629)]
3, 计算相似概率,即词向量的cosine值
model.similarity("boy", "girl")
#0.72181237676432175
4, word-embeding向量大小
model["boy"]
#array([ 1.06947005, 0.11825128, 0.66958714, 0.64033055, 0.15489197,
-0.57684672, -0.4583835 , 1.20578945, -0.79072052, .......
model["boy"].shape
# (200,)
发现不匹配的词
model.doesnt_match("breakfast cereal dinner lunch".split())
#output:'cereal'
具体的类语法:gensim.models.word2vec.Word2Vec
参考官网:http://radimrehurek.com/gensim/models/word2vec.html