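In NLP tasks, the first problem to solve is how to represent words (or characters) inside the computer. A good word (or character) representation should accurately capture both semantic and syntactic features.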
Two methods are currently in common use for training word embeddings.
This article explains how to train your own word vectors with the word2vec and GloVe algorithms.
Gensim provides a solid implementation of the word2vec algorithm, and its gensim.models.word2vec module can be used to train your own word vectors (word embeddings).
Python environment: see the link;
gensim: pip install gensim;
official API.
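1. LineSentence: this class iterates over a file that contains sentences (one sentence per line, where each sentence has been preprocessed and its tokens are separated by spaces);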
class gensim.models.word2vec.LineSentence(object):
    def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None):
Important parameters:
source: the file that contains the sentences;
max_sentence_length: the maximum sentence length; the default is 10000
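To make the one-sentence-per-line input format concrete, here is a minimal pure-Python stand-in for what such an iterator does. It is a sketch, not gensim's implementation: the function name iter_line_sentences and the toy corpus are made up for illustration, and the choice to split overly long lines into chunks (rather than truncate them) reflects my understanding of LineSentence and should be treated as an assumption.

```python
import os
import tempfile


def iter_line_sentences(path, max_sentence_length=10000):
    """Yield one list of tokens per line (hypothetical helper, not part of gensim)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()  # tokens are already separated by spaces
            # Assumption: lines longer than the limit are cut into several chunks.
            for i in range(0, len(tokens), max_sentence_length):
                yield tokens[i:i + max_sentence_length]


# Toy corpus written to a temporary file, one pre-tokenized sentence per line.
with tempfile.TemporaryDirectory() as tmp:
    corpus_path = os.path.join(tmp, "corpus.txt")
    with open(corpus_path, "w", encoding="utf-8") as f:
        f.write("我 爱 自然 语言 处理\n词 向量 很 有用\n")
    sentences = list(iter_line_sentences(corpus_path))
    chunks = list(iter_line_sentences(corpus_path, max_sentence_length=3))
```

Because the iterator reads the file lazily, the whole corpus never has to fit in memory, which is the main reason to feed a file-backed iterable to the trainer instead of a Python list.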
2. Word2Vec: this class implements the word2vec algorithm and is used to train the word vectors;
class gensim.models.word2vec.Word2Vec(BaseWordEmbeddingsModel):
    def __init__(self, sentences=None, corpus_file=None, size=100, alpha=0.025,
                 window=5, min_count=5, max_vocab_size=None, sample=1e-3, seed=1,
                 workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5,
                 ns_exponent=0.75, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
                 trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH,
                 compute_loss=False, callbacks=(), max_final_vocab=None):
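Important parameters (the signature above is the gensim 3.x one; gensim 4.0 and later rename size to vector_size and iter to epochs):
sentences: an iterable of sentences, which can be a LineSentence object;
size: the dimension of the word vectors; the default is 100;
window: the maximum distance within a sentence between the current word and the predicted word; the default is 5;
min_count: words whose frequency is lower than this value are left out of the vocabulary; the default is 5;
workers: the number of worker threads; the default is 3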
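The effect of min_count and window can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not gensim's training loop: build_vocab and context_pairs are hypothetical helper names, and details of the real implementation (for example the subsampling controlled by sample) are left out.

```python
from collections import Counter


def build_vocab(sentences, min_count=5):
    """Keep only the words that occur at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}


def context_pairs(sentences, vocab, window=5):
    """Yield (current word, context word) pairs at most `window` positions apart."""
    for sentence in sentences:
        kept = [word for word in sentence if word in vocab]  # drop rare words first
        for i, center in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield center, kept[j]


toy_sentences = [["a", "b", "c"], ["a", "b"]]
toy_vocab = build_vocab(toy_sentences, min_count=2)  # "c" occurs only once
toy_pairs = list(context_pairs(toy_sentences, toy_vocab, window=1))
```

A larger window produces more pairs per word and tends to capture broader topical similarity, while a higher min_count shrinks the vocabulary and removes words that have too few examples to train a useful vector.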
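import multiprocessing

from gensim.models.word2vec import LineSentence, Word2Vec

# file_util and hidden_dims (the vector dimension) come from the author's project.
def train_word_vec():
    hanzi_sentences = LineSentence(file_util.get_project_path() + './data/hanzi.txt')
    # gensim 3.x signature; gensim 4.0 and later expect vector_size= instead of size=.
    hanzi_model = Word2Vec(hanzi_sentences, size=hidden_dims, window=5, min_count=3,
                           workers=multiprocessing.cpu_count())
    hanzi_model.save(file_util.get_project_path() + './data/word2vec_models/hanzi_embedding_{}.model'.format(hidden_dims))
    hanzi_model.wv.save_word2vec_format(file_util.get_project_path() + './data/word2vec_models/hanzi_embedding_{}.txt'.format(hidden_dims))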
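Where:
hanzi_model.save: saves the full model, which can be loaded later to continue training;
hanzi_model.wv.save_word2vec_format: saves only the trained word vectors.
GloVe is a word-vector representation algorithm proposed by Jeffrey Pennington, Richard Socher, and Christopher D. Manning at Stanford University. Its full name is Global Vectors for Word Representation, and it is a word representation algorithm based on global word-frequency statistics (count-based & overall statistics). It combines the advantages of two approaches: global matrix factorization and local context windows.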
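The "global statistics plus local window" idea can be illustrated with a small sketch that slides a context window over a token stream and accumulates the counts into one corpus-wide co-occurrence table. It is a simplified illustration, not the code of GloVe's cooccur tool: the function name cooccurrence_counts is hypothetical, and the symmetric window with 1/distance weighting reflects my understanding of the tool's default behavior, so treat both as assumptions.

```python
from collections import defaultdict


def cooccurrence_counts(tokens, window_size=15):
    """Accumulate distance-weighted co-occurrence counts over a token list."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        start = max(0, i - window_size)
        for j in range(start, i):
            # Assumption: a pair that is d positions apart contributes 1/d.
            weight = 1.0 / (i - j)
            counts[(word, tokens[j])] += weight
            counts[(tokens[j], word)] += weight  # symmetric context
    return dict(counts)


toy_counts = cooccurrence_counts(["a", "b", "c"], window_size=2)
```

The window keeps the counting local, but the resulting table summarizes the whole corpus, and it is this table (rather than individual windows) that the model is fitted to.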
The glove tool officially provided by Stanford currently runs only on Linux; see the project website;
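Download the GloVe source: git clone http://github.com/stanfordnlp/glove ./;
Compile: make. Note: some warnings appear during compilation and can be ignored. When compilation finishes, a build folder is created in the current directory;
GloVe trains word vectors through the demo.sh script, and all parameters are configured in that script. The script is annotated as follows: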
#!/bin/bash
set -e
# Makes programs, downloads sample data, trains a GloVe model, and then evaluates it.
# One optional argument can specify the language used for eval script: matlab, octave or [default] python
make
# Download the sample corpus; comment this block out when training on your own corpus
if [ ! -e text8 ]; then
  if hash wget 2>/dev/null; then
    wget http://mattmahoney.net/dc/text8.zip
  else
    curl -O http://mattmahoney.net/dc/text8.zip
  fi
  unzip text8.zip
  rm text8.zip
fi
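# Corpus file
CORPUS=text8
# Vocabulary file produced by vocab_count
VOCAB_FILE=vocab.txt
# Binary co-occurrence matrix files (raw and shuffled)
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
# Name of the word-vector output file (vectors.txt)
SAVE_FILE=vectors
VERBOSE=2
MEMORY=4.0
# Words that occur fewer times than this are left out of the vocabulary
VOCAB_MIN_COUNT=5
# Word vector dimension
VECTOR_SIZE=50
# Number of training iterations
MAX_ITER=15
# Training window size
WINDOW_SIZE=15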
BINARY=2
NUM_THREADS=8
X_MAX=10
echo
echo "$ $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE"
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
echo "$ $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE"
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
echo "$ $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE"
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
echo "$ $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE"
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
if [ "$CORPUS" = 'text8' ]; then
  if [ "$1" = 'matlab' ]; then
    matlab -nodisplay -nodesktop -nojvm -nosplash < ./eval/matlab/read_and_evaluate.m 1>&2
  elif [ "$1" = 'octave' ]; then
    octave < ./eval/octave/read_and_evaluate_octave.m 1>&2
  else
    echo "$ python eval/python/evaluate.py"
    python eval/python/evaluate.py
  fi
fi
Once the configuration is done, run the script: ./demo.sh
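After training, the vectors can be inspected from Python. The sketch below assumes the text output is named vectors.txt (following SAVE_FILE=vectors in the script) and that each line holds a word followed by its space-separated float components; that layout is my understanding of GloVe's text format and should be checked against your own output file. The helper names load_vectors and cosine, and the toy file contents, are made up for illustration.

```python
import math
import os
import tempfile


def load_vectors(path):
    """Read 'word v1 v2 ... vn' lines into a dict of word -> list of floats."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy stand-in for a real vectors.txt, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "vectors.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("king 1.0 0.0\nqueen 1.0 0.0\napple 0.0 1.0\n")
    demo_vectors = load_vectors(demo_path)
```

With a real model, comparing cosine(demo_vectors[w1], demo_vectors[w2]) for a few word pairs you know to be related or unrelated is a quick sanity check before using the vectors downstream.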