Notes on incremental word-vector training with gensim (and a few problems I ran into)

I rarely use gensim to train word vectors, but almost every online write-up about incremental word-vector training uses gensim, so here are my notes on doing incremental training with gensim and a problem I hit along the way (these are notes for myself... but if they happen to help another beginner like me, even better).
1. The code file
You can find code on the forums; I used the one from the blogger below, whose write-up is already quite clear:
https://blog.csdn.net/qq_43404784/article/details/83794296

2. Data format and parameters for word2vec in gensim
At first this wasn't clear to me: I assumed gensim worked like the original word2vec tool, where the training data is a tokenized text file with one sentence per line, but that did not produce the expected word vectors.
After checking the official docs at https://radimrehurek.com/gensim/models/word2vec.html it turned out the training data needs to be a list (more precisely, a list of token lists). The docs also describe Word2vec's training parameters; a minimal training sketch follows the parameter list below.
Parameters:
sentences (iterable of iterables, optional) – The sentences iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in word2vec module for such examples. See also the tutorial on data streaming in Python. If you don’t supply sentences, the model is left uninitialized – use if you plan to initialize it in some other way.
corpus_file (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of sentences to get performance boost. Only one of sentences or corpus_file arguments need to be passed (or none of them, in that case, the model is left uninitialized).
size (int, optional) – Dimensionality of the word vectors.
window (int, optional) – Maximum distance between the current and predicted word within a sentence.
min_count (int, optional) – Ignores all words with total frequency lower than this.
workers (int, optional) – Use these many worker threads to train the model (=faster training with multicore machines).
sg ({0, 1}, optional) – Training algorithm: 1 for skip-gram; otherwise CBOW.
hs ({0, 1}, optional) – If 1, hierarchical softmax will be used for model training. If 0, and negative is non-zero, negative sampling will be used.
negative (int, optional) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.
…and so on; I won't paste the rest here…
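For reference, here is a minimal training sketch using the list-of-lists format. This is just an illustration, not the blogger's code: the toy corpus and file name are made up, and it assumes gensim 3.x, where the dimensionality parameter is still called size (later renamed vector_size).

import gensim

# Training data: an iterable of token lists, one inner list per tokenized sentence.
# This toy corpus is purely for illustration.
sentences = [
    ['我', '喜欢', '词向量'],
    ['gensim', '可以', '增量', '训练'],
]

model = gensim.models.Word2Vec(
    sentences,
    size=100,      # dimensionality of the word vectors (vector_size in gensim 4.x)
    window=5,      # max distance between the current and predicted word
    min_count=1,   # keep rare words in this toy example
    workers=4,     # worker threads
    sg=1,          # 1 = skip-gram, 0 = CBOW
)

model.save('w2v.model')   # save the FULL model so it can be trained incrementally later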

3. Loading pre-trained word2vec-format vectors and exporting vectors in word2vec format
At first I thought I was being clever: since pre-training outputs a model anyway, couldn't I just load an already-trained word2vec-format text file and use it as the model?
Whether that can work is one thing; this newbie charged ahead and loaded the vectors first, questions later.
model = gensim.models.KeyedVectors.load_word2vec_format('xx.bin/xx.txt', binary=True/False)  # binary=True for .bin, binary=False for .txt
You may hit an encoding problem here:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position xx: invalid start byte
Make sure the file you load is UTF-8 encoded.
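As a side note, load_word2vec_format also takes encoding and unicode_errors arguments, so if re-encoding the file is not an option, something like the following may get past the error by dropping the undecodable bytes (the path is a placeholder, as above):

model = gensim.models.KeyedVectors.load_word2vec_format(
    'xx.txt', binary=False, encoding='utf-8', unicode_errors='ignore')  # skip bytes that cannot be decoded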
After finally getting hold of a UTF-8 encoded file and loading it, I ran:
model.build_vocab(text, update=True)  # update the vocabulary
model.train(text, total_examples=model.corpus_count, epochs=model.iter)
This is where it blows up. Checking the official docs, it turns out that word2vec-format vectors loaded as KeyedVectors cannot stand in for a gensim-trained model: they only hold the vectors, not the training state needed to continue training. (Did I even know what I was doing???)
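For completeness, here is roughly what does work, along the lines of the blog post linked above: save the full gensim model with model.save(), reload it with Word2Vec.load(), and only then call build_vocab(update=True) and train(). A minimal sketch, assuming gensim 3.x; the file names and the new corpus are placeholders:

import gensim

# Load the FULL model saved earlier with model.save(); unlike the word2vec-format
# text/bin vectors, it still carries the internal weights and vocab needed to keep training.
model = gensim.models.Word2Vec.load('w2v.model')

# The new corpus, again a list of token lists (placeholder data).
new_text = [['新', '语料', '的', '句子']]

model.build_vocab(new_text, update=True)   # add the new words to the existing vocabulary
model.train(new_text,
            total_examples=model.corpus_count,
            epochs=model.iter)             # model.iter in gensim 3.x; model.epochs in 4.x

model.save('w2v_updated.model')            # save the updated full model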
Finally, if you want to export the trained word vectors as word2vec-format text, add the following:
model.wv.save_word2vec_format('xx.txt', binary=False)

In the end, who knows how much good these painstakingly produced word vectors will actually do for my work ORZ... life is hard...
