Training Word Vectors with Gensim: A Tutorial


1. Background

Once text has been converted into vectors, it becomes straightforward to apply SVMs, logistic regression, deep learning, and other methods to practical tasks such as text classification, tagging, and sentiment analysis. Obtaining good word vectors is therefore an important piece of groundwork. This post covers the basics of training word vectors with the gensim package.

The training methods are described in the papers referenced below.

There are two main training methods: CBOW and Skip-gram.

 

Deep learning via word2vec's "skip-gram and CBOW models", using either hierarchical softmax or negative sampling [1] [2].

[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.

[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.

CBOW Method

 

[Figure: CBOW model architecture]

 

Skip-gram Method

[Figure: Skip-gram model architecture]
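Written out in the notation of [1] [2], the Skip-gram model maximizes the average log probability of the surrounding context words given the current word:

\[
\frac{1}{T}\sum_{t=1}^{T}\;\sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)
\]

where T is the number of words in the corpus and c is the context window size (the window parameter in section 2). CBOW is the mirror image: it predicts the current word w_t from the sum or mean of its context word vectors (see the cbow_mean parameter), and p is computed with either hierarchical softmax or negative sampling, as noted above.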

 

 

2. Key classes, methods, and parameters used for training

(1) The main model class (Word2Vec)

 

sg defines the training algorithm. By default (sg=0), CBOW is used. Otherwise (sg=1), skip-gram is employed.

size is the dimensionality of the feature vectors.

window is the maximum distance between the current and predicted word within a sentence.

alpha is the initial learning rate (will linearly drop to zero as training progresses).

min_count = ignore all words with total frequency lower than this (rarer words are dropped).

workers = use this many worker threads to train the model (= faster training with multicore machines).

hs = if 1, hierarchical softmax will be used for model training. If set to 0 (default), and negative is non-zero, negative sampling will be used.

negative = if > 0, negative sampling will be used; the int for negative specifies how many "noise words" should be drawn (usually between 5-20). Default is 5. If set to 0, no negative sampling is used.

cbow_mean = if 0, use the sum of the context word vectors. If 1 (default), use the mean. Only applies when CBOW is used.

iter = number of iterations (epochs) over the corpus.

Example:

 

from gensim.models import Word2Vec
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
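For a concrete picture of what the sentences argument looks like, here is a minimal sketch on a tiny made-up corpus (the toy sentences and min_count=1 are purely illustrative; note that gensim 4.x renamed size to vector_size and exposes word vectors via model.wv):

from gensim.models import Word2Vec

# each "sentence" is simply a list of unicode tokens; this toy corpus is made up for illustration
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["cats", "and", "dogs", "are", "animals"],
]

model = Word2Vec(sentences, size=100, window=5, min_count=1, workers=4)
print(model["cat"])  # the learned 100-dimensional vector for "cat" (model.wv["cat"] in gensim 4.x)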

 

(2) After the model object is constructed (i.e. the class is instantiated), the vocabulary tree must be built before calling its train method.

 

build_vocab(sentences, keep_raw_vocab=False, trim_rule=None)

Build vocabulary from a sequence of sentences (can be a once-only generator stream). Each sentence must be a list of unicode strings.

The vocabulary tree is a Huffman tree built from each word's frequency in the corpus, so frequently occurring words can be reached faster during training, which saves search time. A sketch of the explicit build_vocab/train workflow follows below.
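A minimal sketch of this two-step workflow (the toy corpus is illustrative; in gensim 4.x the size and iter parameters were renamed to vector_size and epochs):

from gensim.models import Word2Vec

sentences = [["human", "machine", "interface"], ["machine", "learning", "system"]]  # toy corpus

model = Word2Vec(size=100, window=5, min_count=1, workers=4)  # no sentences passed yet
model.build_vocab(sentences)                                  # builds the vocabulary (Huffman) tree
# recent gensim versions require total_examples and epochs; very old ones accept just model.train(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)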

(3) Reading a corpus with LineSentence

class gensim.models.word2vec.LineSentence(source, max_sentence_length=10000, limit=None)

Each line of the source file is one sentence, with words separated by spaces, so tokens such as @XXX mentions and similar special characters need to be replaced or stripped out during preprocessing. A usage sketch follows below.
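A minimal usage sketch (corpus.txt is a hypothetical file name; the file should already be cleaned and tokenized, one sentence per line with tokens separated by spaces):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence('corpus.txt')  # streams the file line by line instead of loading it all into RAM
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)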

(4) Once the model is trained, it can be saved and loaded:

 

model.save(fname)
model = Word2Vec.load(fname)  # you can continue training with the loaded model!
 

Alternatively, save just the output word vectors:

 

 

save_word2vec_format(fname, fvocab=None, binary=False)

Store the input-hidden weight matrix in the same format used by the original C word2vec tool, for compatibility.

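A small sketch of exporting and reloading vectors in the C word2vec format (vectors.bin is a hypothetical file name; in gensim 4.x these methods live on model.wv and KeyedVectors instead):

# export the vectors in the binary format of the original C word2vec tool
model.save_word2vec_format('vectors.bin', binary=True)

# reload them later; the loaded vectors can be queried but not trained further
model = Word2Vec.load_word2vec_format('vectors.bin', binary=True)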

 

 


(5) Using the trained model

 

 model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]

 model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'  # returns the word in the list that does not fit with the others

 

 model.similarity('woman', 'man')
0.73723527  # cosine similarity between the two words

 model['computer']  # raw numpy vector of a word
array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)

model.n_similarity(['restaurant', 'japanese'], ['japanese', 'restaurant'])  # cosine similarity between two sets of words
1.0000000000000004


(6) Other related notes

The model parameters are stored as matrices (NumPy arrays). Each array is #vocabulary (the vocabulary count) times #size (the size parameter, i.e. the vector dimensionality) floats.

Memory estimate:

Main memory usage, for 100,000 unique words with size=200:

Memory for the model parameters: 100,000 * 200 * 4 bytes (single-precision float) * 3 matrices ≈ 229 MB

Plus a little extra memory: a few MB (for storing the vocabulary tree).
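A quick sanity check of that estimate (plain arithmetic, nothing gensim-specific):

vocab_size, vector_size = 100000, 200  # unique words, vector dimensionality
bytes_per_float = 4                    # single-precision float
num_matrices = 3                       # three such matrices are held in RAM
total_bytes = vocab_size * vector_size * bytes_per_float * num_matrices
print(total_bytes / 2**20)             # ~228.9 MB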

Storage: the vocabulary Huffman tree (which reduces search cost).

Speedup:

model = Word2Vec(sentences, workers=4)

The workers parameter only has an effect if you have Cython installed.
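One way to check whether the optimized Cython/C routines are actually being used is the FAST_VERSION flag (present in gensim 3.x and earlier):

from gensim.models import word2vec

# -1 means the compiled extension is missing and training falls back to the slow pure-Python path
print(word2vec.FAST_VERSION)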


Reference links:

1. Making sense of word2vec
http://rare-technologies.com/making-sense-of-word2vec/

2. Gensim word2vec
http://rutumulkar.com/blog/2015/word2vec/
http://rare-technologies.com/word2vec-tutorial/

3. word2vec experiments on Chinese and English Wikipedia corpora (52nlp)
http://www.52nlp.cn/中英文维基百科语料上的word2vec实验

4. Gensim word2vec API documentation
http://radimrehurek.com/gensim/models/word2vec.html

 

 
