Training Word Vectors with Gensim: A Tutorial
1. Background
Once text has been converted into vectors, methods such as SVM, logistic regression, and deep learning can readily be applied to practical tasks like text classification, tagging, and sentiment analysis. Obtaining good word vectors is therefore an important piece of groundwork. This post covers the basics of training word vectors with the gensim package.
The training methods are described in the following papers:
There are two main training methods: CBOW and Skip-gram.
Deep learning via word2vec's "skip-gram and CBOW models", using either hierarchical softmax or negative sampling [1] [2].
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
CBOW Method
Skip-gram Method
2. Key Classes, Methods, and Parameters for Training
(1) The main model class
sg defines the training algorithm. By default (sg=0), CBOW is used. Otherwise (sg=1), skip-gram is employed.
size is the dimensionality of the feature vectors.
window is the maximum distance between the current and predicted word within a sentence.
alpha is the initial learning rate (will linearly drop to zero as training progresses).
min_count = ignore all words with total frequency lower than this (words below this frequency are dropped).
workers = use this many worker threads to train the model (= faster training with multicore machines).
hs = if 1, hierarchical softmax will be used for model training. If set to 0 (default), and negative is non-zero, negative sampling will be used.
negative = if > 0, negative sampling will be used; the int for negative specifies how many "noise words" should be drawn (usually between 5-20). Default is 5. If set to 0, no negative sampling is used.
cbow_mean = if 0, use the sum of the context word vectors. If 1 (default), use the mean. Only applies when CBOW is used.
iter = number of iterations (epochs) over the corpus.
Example:
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
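For reference, here is a fuller call showing the parameters above together. This is a minimal sketch: the toy sentences are made up for illustration, and the parameter names follow gensim versions before 4.0 (in gensim 4.x, size was renamed vector_size and iter was renamed epochs).

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of unicode tokens.
sentences = [
    ["this", "is", "the", "first", "sentence"],
    ["this", "is", "another", "sentence"],
]

# Skip-gram (sg=1) with negative sampling; CBOW with hierarchical softmax
# would instead use sg=0, hs=1, negative=0.
model = Word2Vec(
    sentences,
    sg=1,          # 1 = skip-gram, 0 = CBOW (default)
    size=100,      # dimensionality of the word vectors
    window=5,      # maximum context window
    min_count=1,   # keep every word in this toy corpus
    negative=5,    # number of "noise words" for negative sampling
    iter=5,        # epochs over the corpus
    workers=4,     # worker threads (real speedup needs Cython)
)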
(2) After the model is constructed (i.e., the class is instantiated), the vocabulary tree must be built before calling the class's train method.
build_vocab(sentences, keep_raw_vocab=False, trim_rule=None)
Build vocabulary from a sequence of sentences (can be a once-only generator stream). Each sentence must be a list of unicode strings.
The vocabulary tree is a Huffman tree built from the word frequencies in the corpus, so that frequently occurring words can be reached more quickly during training, saving search time.
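The explicit two-step workflow only matters when sentences are not passed to the constructor; passing them in one call, as in the example above, handles vocabulary building internally. A minimal sketch assuming the pre-4.0 gensim API:

from gensim.models import Word2Vec

sentences = [["hello", "word2vec"], ["hello", "gensim"]]

model = Word2Vec(size=100, min_count=1)          # no corpus passed yet
model.build_vocab(sentences)                     # builds the vocabulary/Huffman tree; must precede train()
model.train(sentences,
            total_examples=model.corpus_count,   # set by build_vocab
            epochs=model.iter)                   # gensim >= 1.0 requires these arguments explicitly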
(3) class gensim.models.word2vec.LineSentence(source, max_sentence_length=10000, limit=None)
Each line is one sentence, with words separated by spaces, so tokens such as @XXX mentions need to be removed or replaced during preprocessing.
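A minimal sketch of streaming a corpus with LineSentence (the file name corpus.txt is only illustrative; the file is assumed to already be tokenized, one sentence per line with space-separated words):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence('corpus.txt')   # streams the file; the corpus never has to fit in memory
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)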
(4) Once the model is trained, it can be saved and loaded:
model.save(fname)
model = Word2Vec.load(fname) # you can continue training with the loaded model!
Or save the output word vectors:
save_word2vec_format(fname, fvocab=None, binary=False)
Store the input-hidden weight matrix in the same format used by the original C word2vec tool, for compatibility.
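A minimal sketch of exporting and re-loading the vectors (file names are illustrative; in older gensim these are methods of Word2Vec itself, while newer versions move them to model.wv / KeyedVectors). Note that, unlike model.save/load above, the C word2vec format stores only the vectors, so training cannot be resumed from it:

model.save_word2vec_format('vectors.txt', binary=False)   # text format; binary=True writes the binary C format

from gensim.models import Word2Vec
vectors = Word2Vec.load_word2vec_format('vectors.txt', binary=False)
# in newer gensim: from gensim.models import KeyedVectors; KeyedVectors.load_word2vec_format(...)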
(5) Using the trained model
model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]
model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'  # returns the word in the list that does not match the others
model.similarity('woman', 'man')
0.73723527
model['computer'] # raw numpy vector of a word
array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)
trained_model.similarity('woman', 'man')
0.73723527  # cosine similarity between the two words
trained_model.n_similarity(['restaurant', 'japanese'], ['japanese', 'restaurant'])  # cosine similarity between two sets of words
1.0000000000000004
(6) Other notes
Model parameters are stored as matrices (NumPy arrays), each of shape #vocabulary (the number of words kept) times #size (the size parameter, i.e., the vector dimensionality) floats.
Memory estimate:
Main memory usage:
100,000 unique words, 200 dimensions
Memory used by the model parameters: 100,000 * 200 * 4 (a float takes 4 bytes) * 3 = ~229 MB
Plus a little extra memory (a few MB) for storing the vocabulary Huffman tree (which saves search time).
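The ~229 MB figure is simply three weight matrices of #vocabulary x #size single-precision floats; a quick check of the arithmetic:

vocab_size = 100000       # unique words kept after min_count filtering
vector_size = 200         # dimensionality (the size parameter)
bytes_per_float = 4       # single precision
num_matrices = 3          # matrices held in RAM during training

total_bytes = vocab_size * vector_size * bytes_per_float * num_matrices
print(total_bytes / 1024 ** 2)   # ~228.9 MB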
Speedup:
model = Word2Vec(sentences, workers=4)
The workers parameter only has an effect if you have Cython installed.
Reference links:
1. Making sense of word2vec
http://rare-technologies.com/making-sense-of-word2vec/
2. Gensim word2vec
http://rutumulkar.com/blog/2015/word2vec/
http://rare-technologies.com/word2vec-tutorial/
3. word2vec experiments on Chinese and English Wikipedia corpora (52nlp): http://www.52nlp.cn/中英文维基百科语料上的word2vec实验
4. Gensim word2vec API documentation
http://radimrehurek.com/gensim/models/word2vec.html