1. Principles
The theory behind GloVe has already been written up well by others; here is a link for reference:
https://blog.csdn.net/coderTC/article/details/73864097
CBOW/Skip-Gram are local context window methods: trained, for example, with negative sampling (NS), they miss the global relationships between words, and drawing negative examples by sampling loses relational information between words.
Moreover, training Skip-Gram-style algorithms directly tends to give highly frequent (high-exposure) words too much weight.
GloVe (Global Vectors) combines the global statistical information of matrix factorization approaches such as Latent Semantic Analysis (LSA) with the advantages of the local context window. Folding in global co-occurrence statistics as prior information both speeds up training and makes it possible to control the relative weight of words.
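Concretely, GloVe fits word vectors to the logarithm of the global co-occurrence counts X_{ij} with a weighted least-squares objective (as given in the original GloVe paper):
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
where w_i and \tilde{w}_j are the word and context vectors, b_i and \tilde{b}_j are bias terms, and the weighting function f(x) = (x / x_{max})^\alpha for x < x_{max} (and 1 otherwise) caps the influence of very frequent pairs; this is where the control over the relative weight of words comes from.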
2. GloVe in Practice
I used Ubuntu 16.04 with Python 2.7 installed.
First, open a terminal and install gensim:
sudo easy_install --upgrade gensim
The GloVe training code is the C version and runs under Linux.
Open a terminal in the glove folder and compile it:
make
This generates a build folder.
Then just run the demo script:
sh demo.sh
In demo.sh you can set the path to the training corpus (by default it downloads a corpus from the web; delete that part and point it at your own corpus), as well as the number of iterations, the vector dimensionality, and so on. Feel free to experiment.
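For reference, a rough sketch of the lines you would typically edit in demo.sh (the variable names below are taken from the demo.sh that ships with the Stanford GloVe code; check your own copy, since they can differ between versions):
CORPUS=/path/to/your/corpus.txt   # your own whitespace-tokenized corpus (replaces the default download)
SAVE_FILE=vectors                 # output prefix; training writes vectors.txt
VOCAB_MIN_COUNT=5                 # drop words that occur fewer than 5 times
VECTOR_SIZE=50                    # dimensionality of the word vectors
MAX_ITER=15                       # number of training iterations
WINDOW_SIZE=15                    # context window size
The rest of demo.sh simply chains the binaries that make produced (vocabulary counting, co-occurrence counting, shuffling, and the GloVe training step itself) and, in the default setup, writes the trained vectors to vectors.txt, which is what the Python script below consumes.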
The script below converts the trained GloVe vectors (vectors.txt) into word2vec text format so that gensim can load them:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import shutil
import gensim


def getFileLineNums(filename):
    # Count the word vectors in the GloVe output file (one vector per line).
    count = 0
    with open(filename, 'r') as f:
        for line in f:
            count += 1
    return count


def prepend_line(infile, outfile, line):
    """
    Prepend a line using fast file copying on Linux.
    (source: http://stackoverflow.com/a/10850588/610569)
    """
    with open(infile, 'r') as old:
        with open(outfile, 'w') as new:
            new.write(str(line) + "\n")
            shutil.copyfileobj(old, new)


def prepend_slow(infile, outfile, line):
    """
    Slower way to prepend the line by re-creating the input file.
    """
    with open(infile, 'r') as fin:
        with open(outfile, 'w') as fout:
            fout.write(line + "\n")
            for line in fin:
                fout.write(line)


def load(filename):
    # Input: GloVe vector file.
    # More pre-trained models can be downloaded from http://nlp.stanford.edu/projects/glove/
    # glove_file = "glove.840B.300d.txt"
    glove_file = filename
    num_lines = getFileLineNums(filename)
    dims = 50  # must match the vector dimension used during training
    print num_lines

    # Output: gensim model in word2vec text format.
    gensim_file = 'glove_model.txt'
    gensim_first_line = "{} {}".format(num_lines, dims)

    # Prepend the "count dimension" header line that the word2vec text format expects.
    # if platform == "linux" or platform == "linux2":
    prepend_line(glove_file, gensim_file, gensim_first_line)
    # else:
    #     prepend_slow(glove_file, gensim_file, gensim_first_line)

    # Demo: load the newly created glove_model.txt through the gensim API.
    model = gensim.models.KeyedVectors.load_word2vec_format(gensim_file, binary=False)  # GloVe model
    model_name = gensim_file[6:-4]  # 'glove_model.txt' -> 'model'
    model.save('/home/qf/GloVe-master/' + model_name)  # save so it can be re-loaded below
    return model


if __name__ == '__main__':
    myfile = '/home/qf/GloVe-master/vectors.txt'
    load(myfile)
    ####################################
    model_name = 'model'
    model = gensim.models.KeyedVectors.load('/home/qf/GloVe-master/' + model_name)
    print len(model.vocab)
    word_list = [u'to', u'one']
    for word in word_list:
        print word, '--'
        for i in model.most_similar(word, topn=10):
            print i[0], i[1]
        print ''
Running this script in a terminal produces a new vector txt file whose first line carries two extra numbers: the first is the total number of vectors and the second is the dimensionality of each vector (for example, 10000 50 for ten thousand 50-dimensional vectors). It also saves a gensim model file named model.
At test time you can then load the converted file directly with gensim's word2vec loading function.
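As an aside, newer gensim releases also ship a small helper that performs this same conversion (it prepends the vector count and dimension for you). A minimal sketch, assuming your installed gensim already provides gensim.scripts.glove2word2vec and reusing the paths from the script above:
# -*- coding: utf-8 -*-
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# Convert GloVe's vectors.txt into word2vec text format (adds the "count dimension" header line).
glove2word2vec('/home/qf/GloVe-master/vectors.txt', 'glove_model.txt')

# Load the converted file exactly as before and query it.
model = KeyedVectors.load_word2vec_format('glove_model.txt', binary=False)
print model.most_similar(u'one', topn=10)
Either route gives the same glove_model.txt; the helper just saves you the manual line counting and prepending.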
That completes training and testing of the model!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Postscript:
Notes: 1. The corpus that the demo script downloads is already preprocessed; if you use your own corpus, you need to preprocess it yourself.
2. Some problems I ran into along the way:
(1) Installing gensim with pip install gensim failed; using sudo easy_install --upgrade gensim worked.
(2) Python reported SyntaxError: Non-ASCII character '\xe5' in file.
Fix: this is an encoding problem; add the following two lines at the top of the Python script:
#!/usr/bin/python
# -*- coding: utf-8 -*-
(3) In the script above, loading the saved model with gensim.models.Word2Vec.load('/home/qf/GloVe-master/'+model_name)
raised an error; switching to
gensim.models.KeyedVectors.load('/home/qf/GloVe-master/'+model_name)
worked. The likely reason is that the script saves a KeyedVectors object (the word vectors only), which must be loaded back with KeyedVectors.load, whereas Word2Vec.load expects a full Word2Vec model.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Reference: https://blog.csdn.net/sscssz/article/details/53333225