Understanding the gensim word2vec Python source (Part 2): training the Skip-gram model

Understanding the gensim word2vec Python source (Part 1): initialization and vocabulary construction
Understanding the gensim word2vec Python source (Part 2): training the Skip-gram model

Part 2 has been on hold for too long. Picking it back up, I found that word2vec in gensim 3.8.0 received updates in 2018; the changelog mentions:

The gensim package integrates many methods for training word vectors, not just word2vec:
for example :class:`~gensim.models.doc2vec.Doc2Vec` and :class:`~gensim.models.fasttext.FastText`,
plus wrappers for :class:`~gensim.models.wrappers.VarEmbed` and :class:`~gensim.models.wrappers.WordRank`.

This post builds on the previous one, "Building the vocabulary with the Hierarchical Softmax method", and continues recording my understanding of the word2vec source code.

if sentences is not None:
    if isinstance(sentences, GeneratorType):
        raise TypeError("You can't pass a generator as the sentences argument. Try an iterator.")
    # 1. Build the vocabulary
    self.build_vocab(sentences, trim_rule=trim_rule)
    # 2. Run training
    self.train(
        sentences, total_examples=self.corpus_count, epochs=self.iter,
        start_alpha=self.alpha, end_alpha=self.min_alpha
    )
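The GeneratorType check above matters because train() makes one pass over the corpus per epoch, and a generator is exhausted after a single pass. A minimal illustration of the difference (the corpus class and names here are my own, not part of gensim):

```python
def corpus_gen():
    # a generator: single-use, exhausted after one pass
    yield ["hello", "world"]
    yield ["foo", "bar"]

class RestartableCorpus:
    # an iterable: __iter__ returns a fresh generator each time,
    # so every training epoch sees the full corpus again
    def __iter__(self):
        yield ["hello", "world"]
        yield ["foo", "bar"]

g = corpus_gen()
first_pass = list(g)
second_pass = list(g)   # empty: the generator was already consumed

corpus = RestartableCorpus()
passes = [list(corpus), list(corpus)]   # both passes see all sentences
```

This is why gensim asks for an iterator/iterable: it can restart the corpus for each training epoch.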

In practice, gensim dispatches to the optimized C implementations of the skip-gram and CBOW training routines, but here we look at how skip-gram training is implemented in the pure-Python code.

def train_batch_sg(model, sentences, alpha, work=None, compute_loss=False):
    """
    Update skip-gram model by training on a sequence of sentences.

    Each sentence is a list of string tokens, which are looked up in the model's
    vocab dictionary. Called internally from `Word2Vec.train()`.

    This is the non-optimized, Python version. If you have cython installed, gensim
    will use the optimized version from word2vec_inner instead.
    """
    result = 0
    for sentence in sentences:  # process each sentence of the batch in turn
        # keep the in-vocabulary words of the sentence that survive subsampling
        word_vocabs = [model.wv.vocab[w] for w in sentence if w in model.wv.vocab and
                       model.wv.vocab[w].sample_int > model.random.rand() * 2**32]
        for pos, word in enumerate(word_vocabs):  # iterate over every kept word
            reduced_window = model.random.randint(model.window)  # `b` in the original word2vec code
            # now go over all words from the (reduced) window, predicting each one in turn
            start = max(0, pos - model.window + reduced_window)
            for pos2, word2 in enumerate(word_vocabs[start:(pos + model.window + 1 - reduced_window)], start):
                # don't train on the `word` itself
                if pos2 != pos:
                    # one (center, context) pair: updates the word vector and the
                    # vectors of the inner nodes on the Huffman path
                    train_sg_pair(
                        model, model.wv.index2word[word.index], word2.index, alpha, compute_loss=compute_loss
                    )
        result += len(word_vocabs)  # count the words processed and return the total
    return result
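Two details in train_batch_sg are easy to miss: the subsampling test `sample_int > rand * 2**32` keeps a frequent word with probability `sample_int / 2**32`, and the window is shrunk by a random amount `b = reduced_window` on both sides. A sketch of both (the function names are mine, not gensim's):

```python
import random

def keep_word(sample_int, rng):
    # subsampling: sample_int is a precomputed per-word threshold;
    # the word survives with probability sample_int / 2**32
    return sample_int > rng.random() * 2**32

def context_positions(pos, sentence_len, window, reduced_window):
    # dynamic window: shrink by reduced_window (`b`) on both sides,
    # clamp to the sentence bounds, and exclude the center word itself
    start = max(0, pos - window + reduced_window)
    stop = min(sentence_len, pos + window + 1 - reduced_window)
    return [p for p in range(start, stop) if p != pos]
```

With window=5 and reduced_window=2, the center word at pos=10 is paired with positions 7 through 13 (excluding 10), i.e. an effective window of 3 on each side.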

The function that scores each sentence:

def score_sentence_sg(model, sentence, work=None):
    """
    Obtain likelihood score for a single sentence in a fitted skip-gram representation.

    The sentence is a list of Vocab objects (or None, when the corresponding
    word is not in the vocabulary). Called internally from `Word2Vec.score()`.

    This is the non-optimized, Python version. If you have cython installed, gensim
    will use the optimized version from word2vec_inner instead.
    """
    log_prob_sentence = 0.0  # initialize the log-likelihood score to 0
    if model.negative:
        raise RuntimeError("scoring is only available for HS=True")  # hierarchical softmax only

    word_vocabs = [model.wv.vocab[w] for w in sentence if w in model.wv.vocab]  # keep the in-vocabulary words
    for pos, word in enumerate(word_vocabs):  # iterate over every word in the sentence
        if word is None:
            continue  # OOV word in the input sentence => skip

        # now go over all words from the window, predicting each one in turn
        start = max(0, pos - model.window)  # window start: current position minus window size, clamped at 0
        for pos2, word2 in enumerate(word_vocabs[start: pos + model.window + 1], start):
            # don't score OOV words or the `word` itself
            if word2 is not None and pos2 != pos:
                log_prob_sentence += score_sg_pair(model, word, word2)  # accumulate each pair's score

    return log_prob_sentence  # the log-likelihood score of this sentence
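score_sg_pair (not shown above) scores one (center, context) pair under hierarchical softmax: it sums log σ(±l1·vⱼ) over the internal nodes vⱼ on the center word's Huffman path, with the sign taken from the Huffman code bits. A numpy sketch of that computation under those assumptions (not gensim's actual function signature):

```python
import numpy as np

def hs_log_prob(l1, node_vecs, code):
    # l1: context word vector; node_vecs: one row per internal node on
    # the target word's Huffman path; code: the target's Huffman code bits
    sgn = (-1.0) ** np.asarray(code)       # +1 for bit 0, -1 for bit 1
    dots = node_vecs @ l1                   # one dot product per path node
    # -logaddexp(0, -x) == log(sigmoid(x)), computed in a numerically stable way
    return float(np.sum(-np.logaddexp(0, -sgn * dots)))
```

With all-zero vectors every sigmoid evaluates to 0.5, so a path of length k yields a score of k·log(0.5), which is a handy sanity check.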
