Preface
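I had always used word2vec. Today, while trying to load GloVe vectors with gensim, I found that gensim only provides a loader for the word2vec format. So how do we load GloVe vectors with gensim?
Both word2vec and GloVe vectors can be stored as plain text. Opening the files shows a single difference: the word2vec format begins with a header line giving the number of vectors and their dimension.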
GloVe vector format:
word1 0.134 0.254 0.354
word2 0.245 0.335 0.377
word3 0.345 0.488 0.553
word4 0.564 0.234 0.564
word2vec vector format:
4 3
word1 0.134 0.254 0.354
word2 0.245 0.335 0.377
word3 0.345 0.488 0.553
word4 0.564 0.234 0.564
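The difference can also be checked programmatically. The following is only a minimal sketch, assuming whitespace-separated plain-text files; has_word2vec_header is a hypothetical helper name, not part of gensim, and the check is a heuristic (a one-dimensional GloVe file whose first word is itself a number would fool it).

```python
def has_word2vec_header(first_line: str) -> bool:
    """Return True if the line looks like a word2vec header: exactly two positive integers."""
    parts = first_line.split()
    if len(parts) != 2:
        return False
    try:
        count, dims = int(parts[0]), int(parts[1])
    except ValueError:
        # A GloVe data row such as "word1 0.134" ends up here.
        return False
    return count > 0 and dims > 0


print(has_word2vec_header("4 3"))                      # True
print(has_word2vec_header("word1 0.134 0.254 0.354"))  # False
```

In practice the argument would be the first line read from the vector file.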
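So to load pretrained GloVe vectors with gensim, it is enough to add a first line to the GloVe file stating the number of vectors and their dimension.
gensim also ships an official conversion script for exactly this problem.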
The gensim conversion script
The conversion is very simple; the official code, pasted here, makes it clear at a glance.
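Two functions are involved: get_glove_info counts the vectors and determines their dimension, and glove2word2vec performs the conversion.

# Imports used by the script in older gensim releases; smart_open.smart_open
# was later deprecated in favor of smart_open.open.
import logging

from smart_open import smart_open

logger = logging.getLogger(__name__)


def get_glove_info(glove_file_name):
    """Get number of vectors in provided `glove_file_name` and dimension of vectors.

    Parameters
    ----------
    glove_file_name : str
        Path to file in GloVe format.

    Returns
    -------
    (int, int)
        Number of vectors (lines) of input file and its dimension.

    """
    with smart_open(glove_file_name) as f:
        num_lines = sum(1 for _ in f)
    with smart_open(glove_file_name) as f:
        num_dims = len(f.readline().split()) - 1
    return num_lines, num_dims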


def glove2word2vec(glove_input_file, word2vec_output_file):
    """Convert `glove_input_file` in GloVe format to word2vec format and write it to `word2vec_output_file`.

    Parameters
    ----------
    glove_input_file : str
        Path to file in GloVe format.
    word2vec_output_file: str
        Path to output file.

    Returns
    -------
    (int, int)
        Number of vectors (lines) of input file and its dimension.

    """
    num_lines, num_dims = get_glove_info(glove_input_file)
    logger.info("converting %i vectors from %s to %s", num_lines, glove_input_file, word2vec_output_file)
    with smart_open(word2vec_output_file, 'wb') as fout:
        fout.write("{0} {1}\n".format(num_lines, num_dims).encode('utf-8'))
        with smart_open(glove_input_file, 'rb') as fin:
            for line in fin:
                fout.write(line)
    return num_lines, num_dims
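
To see the effect on the four-word sample from earlier, here is a stdlib-only sketch of the same logic. It is not the gensim code: prepend_header is a hypothetical name, the built-in open stands in for smart_open (so compressed or remote files are not handled), and the file names are made up for the demonstration.

```python
import os
import tempfile
from typing import Tuple


def prepend_header(glove_path: str, out_path: str) -> Tuple[int, int]:
    """Copy a GloVe text file to out_path behind a '<count> <dims>' header line."""
    # First pass: count the rows.
    with open(glove_path, "rb") as f:
        num_lines = sum(1 for _ in f)
    # Read the dimension off the first row (token count minus the word itself).
    with open(glove_path, "rb") as f:
        num_dims = len(f.readline().split()) - 1
    # Second pass: write the header, then copy every row unchanged.
    with open(out_path, "wb") as fout, open(glove_path, "rb") as fin:
        fout.write("{0} {1}\n".format(num_lines, num_dims).encode("utf-8"))
        for line in fin:
            fout.write(line)
    return num_lines, num_dims


# The four-word GloVe sample shown in the format comparison.
sample = (
    "word1 0.134 0.254 0.354\n"
    "word2 0.245 0.335 0.377\n"
    "word3 0.345 0.488 0.553\n"
    "word4 0.564 0.234 0.564\n"
)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "glove.txt")
    dst = os.path.join(tmp, "word2vec.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write(sample)
    shape = prepend_header(src, dst)
    with open(dst, encoding="utf-8") as f:
        converted = f.read()

print(shape)                      # (4, 3)
print(converted.splitlines()[0])  # 4 3
```

The output file is the input with one extra first line, which is all the word2vec loader needs.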
Loading pretrained GloVe vectors with gensim
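With the official conversion script, converting GloVe to word2vec is straightforward. (From gensim 4.0 onward the script is deprecated, since KeyedVectors.load_word2vec_format(glove_file, binary=False, no_header=True) reads a GloVe text file directly.)

from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

glove_file = 'test_glove.txt'
tmp_file = 'test_word2vec.txt'

# Call the glove2word2vec script function.
# Default way (through the CLI): python -m gensim.scripts.glove2word2vec --input <glove_file> --output <w2v_file>
glove2word2vec(glove_file, tmp_file)

model = KeyedVectors.load_word2vec_format(tmp_file)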
Other notes
While we are at it, here is how gensim saves and loads word2vec vectors.
Saving

# Here `model` is a trained gensim.models.Word2Vec instance.
model.save('/tmp/mymodel.model')
model.wv.save_word2vec_format('/tmp/mymodel.txt', binary=False)
model.wv.save_word2vec_format('/tmp/mymodel.bin.gz', binary=True)
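The first method produces a file that cannot be read in a text editor, but it keeps the full training state, so the model can be loaded later and trained further.
The latter two methods write the word2vec format (text and binary respectively). They drop some information, such as the vocabulary tree, so training cannot be resumed from them.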
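
It helps to look at what the text format actually holds: a header, then one word and its vector per line, and nothing about the training state. The reader below is only an illustrative sketch, assuming the plain-text layout shown earlier; read_word2vec_text is a hypothetical helper, not a gensim function.

```python
import io
from typing import Dict, List, TextIO


def read_word2vec_text(stream: TextIO) -> Dict[str, List[float]]:
    """Parse word2vec text format: a 'count dims' header, then 'word v1 ... vN' rows."""
    count, dims = (int(x) for x in stream.readline().split())
    vectors: Dict[str, List[float]] = {}
    for line in stream:
        if not line.strip():
            continue  # tolerate blank lines
        parts = line.split()
        if len(parts) != dims + 1:
            raise ValueError("expected %d values, got %d" % (dims, len(parts) - 1))
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != count:
        raise ValueError("header says %d vectors, file has %d" % (count, len(vectors)))
    return vectors


text = (
    "4 3\n"
    "word1 0.134 0.254 0.354\n"
    "word2 0.245 0.335 0.377\n"
    "word3 0.345 0.488 0.553\n"
    "word4 0.564 0.234 0.564\n"
)
vectors = read_word2vec_text(io.StringIO(text))
print(len(vectors))      # 4
print(vectors["word1"])  # [0.134, 0.254, 0.354]
```

Everything recoverable from such a file is a word-to-vector mapping, which is why it is enough for lookups but not for continued training.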
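Loading

import gensim

# Saved with the first method: load directly and continue training.
model = gensim.models.Word2Vec.load('/tmp/mymodel.model')
# more_sentences is a list of tokenized sentences; recent gensim versions
# require total_examples and epochs to be passed explicitly.
model.train(more_sentences, total_examples=len(more_sentences), epochs=model.epochs)

# Saved with either of the latter two methods.
model = gensim.models.KeyedVectors.load_word2vec_format('/tmp/mymodel.txt', binary=False)
model = gensim.models.KeyedVectors.load_word2vec_format('/tmp/mymodel.bin.gz', binary=True)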
References