TensorFlow example: implementing Word2Vec (Skip-Gram mode)

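Word2Vec is also known as "word vectors" or "word embeddings". It is a model that converts the words of a language into vector representations.
With a One-Hot Encoder, each word corresponds to one vector in which exactly one entry is 1 and all others are 0. Typically every word in a document is converted to such a vector, so the whole document becomes a sparse matrix. One-Hot encoding has two problems: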

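  1. The encoding assigned to each word is essentially arbitrary. It carries no relational information and ignores any relationships that may exist between words.
  2. Storing words as sparse vectors usually means more data is needed for training, because training on sparse data is inefficient and the computation is cumbersome.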

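Vector representations address both problems. Vector Space Models map words to continuous-valued vectors (as opposed to the discrete values of One-Hot encoding), and words with similar meanings are mapped to nearby positions in the vector space.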
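
To make the contrast concrete, here is a minimal sketch in plain Python. The three-word vocabulary and the dense vector values are hypothetical and hand-picked for illustration; they are not learned embeddings. It shows that One-Hot vectors of different words always have zero cosine similarity, while dense vectors can place related words close together.

```python
import math

def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One-hot vectors for "cat" (index 0) and "dog" (index 1) in a toy 3-word vocabulary.
onehot_cat, onehot_dog = one_hot(0, 3), one_hot(1, 3)
onehot_sim = cosine(onehot_cat, onehot_dog)  # orthogonal: no similarity information

# Hand-picked dense vectors (illustrative values only, not trained).
dense = {"cat": [0.9, 0.8], "dog": [0.8, 0.9], "car": [-0.7, 0.6]}
sim_cat_dog = cosine(dense["cat"], dense["dog"])
sim_cat_car = cosine(dense["cat"], dense["car"])

print("one-hot cat/dog:", onehot_sim)
print("dense cat/dog:", sim_cat_dog)
print("dense cat/car:", sim_cat_car)
```

In a trained model the dense vectors come out of optimization rather than being chosen by hand, but the same similarity measure applies.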

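Word2Vec is a computationally efficient predictive model that learns word vectors from a raw corpus. It has two modes: CBOW (Continuous Bag of Words) and Skip-Gram. CBOW predicts a target word (for example "Beijing") from its surrounding sentence (for example "The capital of China is _"). Skip-Gram does the opposite: it predicts the surrounding words from the target word. CBOW is better suited to small datasets, while Skip-Gram performs better on large corpora.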

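In this section we use the Skip-Gram mode of Word2Vec. Let us first look at how its training samples are constructed, using the sentence "the quick brown fox jumped over the lazy dog" as an example. We want a mapping between contexts and target words, where the context consists of the words to the left and right of a word. With a window size of 1, the possible mappings include [the, brown] -> quick, [quick, fox] -> brown, [brown, jumped] -> fox, and so on. Because the Skip-Gram model predicts the context from the target word, the training samples are no longer [the, brown] -> quick but quick -> the and quick -> brown. The dataset therefore becomes (quick, the), (quick, brown), (brown, quick), (brown, fox), and so on. During training we want the model to predict the context word "the" from the target word "quick". We also draw random words as negative samples (noise): the predicted probability should be as large as possible on the positive sample "the" and as small as possible on the randomly drawn negative samples.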
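
The sample construction described above can be sketched in a few lines of plain Python. The helper name `skipgram_pairs` is hypothetical and is not part of the script later in this post. Unlike the `generate_batch` function used for training, this sketch enumerates every (target, context) pair deterministically and also emits pairs for the words at the sentence boundaries.

```python
def skipgram_pairs(tokens, window=1):
    """Return (target, context) pairs for each token and every neighbor within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                 # leftmost context position
        hi = min(len(tokens), i + window + 1)   # one past the rightmost context position
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox jumped over the lazy dog".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs[:5])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'), ('brown', 'fox')]
```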
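The approach is to use an optimization algorithm such as SGD to update the Word Embedding parameters so that the loss over the predicted distribution (NCE Loss) becomes as small as possible. Each word's embedded vector is adjusted continually during training until it settles at the position in the space that best fits the corpus. At that point the loss is minimal, the model fits the corpus best, and the probability of predicting the correct word is highest.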
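
As a rough illustration of that objective, the sketch below computes a simplified sampled logistic loss (in the style of negative sampling) in plain Python. It is not the exact computation behind NCE Loss: full NCE also corrects for the noise distribution, which is omitted here. All vectors are hypothetical 2-d values chosen by hand, and the function name is made up for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sampled_logistic_loss(target_vec, positive_vec, negative_vecs):
    """Loss is low when the true context scores high and the noise words score low."""
    loss = -math.log(sigmoid(dot(target_vec, positive_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(target_vec, neg)))
    return loss

# Hypothetical vectors for illustration only.
quick = [1.0, 0.5]                     # embedding of the target word
context_aligned = [1.0, 0.5]           # context vector pointing the same way
context_opposed = [-1.0, -0.5]         # context vector pointing the opposite way
noise = [[-0.8, 0.1], [0.2, -1.0]]     # vectors of randomly drawn negative samples

loss_good = sampled_logistic_loss(quick, context_aligned, noise)
loss_bad = sampled_logistic_loss(quick, context_opposed, noise)
print(loss_good, loss_bad)  # the aligned context gives the smaller loss
```

Gradient descent on a loss of this shape pulls the target vector toward the vectors of its true context words and pushes it away from the sampled noise words.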
Now let us implement Word2Vec training with TensorFlow.

import collections
import math
import os
import random
import zipfile
import numpy as np
import urllib.request  # urlretrieve lives in the urllib.request submodule
import tensorflow as tf

# First, define a function that downloads the text data
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
    if not os.path.exists(filename):
        filename, _ = urllib.request.urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified', filename)
    else:
        print(statinfo.st_size)
        raise Exception('Failed to verify ' + filename + '. Can you get to it with a browser?')
    return filename

filename = maybe_download('text8.zip', 31344016)

# Next, unzip the downloaded archive and use tf.compat.as_str to turn the data into a list
# of words. The program output shows that the data ends up as a list of 17005207 words
def read_data(filename):
    with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0])).split()
    return data

words = read_data(filename)
print('Data size', len(words))

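# Next, build the vocabulary. We count word frequencies with collections.Counter and use
# most_common to keep the most frequent words (vocabulary_size - 1 of them, plus UNK).
# These words go into a dict (dictionary) for fast lookup. Words outside the vocabulary
# are treated as Unknown and given id 0, and we count how many such words there are.
# We then walk the word list: a word found in dictionary is converted to its id, otherwise
# to id 0 (Unknown). The function returns the encoded data (data), the frequency count of
# each word (count), the vocabulary (dictionary) and its inverse (reverse_dictionary)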
vocabulary_size = 50000

def build_dataset(words):
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)

# Delete the original word list to save memory, then print the most frequent words in the
# vocabulary together with their counts
del words
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])

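# Now generate the Word2Vec training samples. Following the Skip-Gram mode described above
# (predicting the context from the target word), the raw text
# "the quick brown fox jumped over the lazy dog" is turned into samples such as
# (quick, the), (quick, brown), (brown, fox). generate_batch produces one training batch.
# skip_window is the maximum distance at which two words can be paired; 1 means a word is
# paired only with its two immediate neighbors. num_skips is the number of samples generated
# per word; it must not exceed 2 * skip_window, and batch_size must be a multiple of it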
data_index = 0

def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1
    buffer = collections.deque(maxlen=span)

    # Starting at data_index, read span consecutive words into buffer as its initial contents
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window
        targets_to_avoid = [ skip_window ]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels

batch_size = 128
embedding_size = 128
skip_window = 1
num_skips = 2

valid_size = 16
valid_window = 100
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
num_sampled = 64

# Now define the network structure of the Skip-Gram Word2Vec model.
graph = tf.Graph()
with graph.as_default():
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

    embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)

    nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size],
                                                  stddev=1.0 / math.sqrt(embedding_size)))
    nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

    loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,
                                         biases=nce_biases,
                                         labels=train_labels,
                                         inputs=embed,
                                         num_sampled=num_sampled,
                                         num_classes=vocabulary_size))
    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

    init = tf.global_variables_initializer()

num_steps = 100001
with tf.Session(graph=graph) as sess:
    init.run()
    print("Initialized")

    average_loss = 0
    for step in range(num_steps):
        batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

        _, loss_val = sess.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val

        if step % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            print("Average loss at step ", step, ":", average_loss)
            average_loss = 0

        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8
                nearest = (-sim[i, :]).argsort()[1:top_k+1]
                log_str = "Nearest to %s:" % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log_str = "%s %s" % (log_str, close_word)
                print(log_str)
    final_embeddings = normalized_embeddings.eval()
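
Once training finishes, final_embeddings holds one L2-normalized row per word id, so the dot product of two rows equals their cosine similarity. The following is a minimal sketch of how such a matrix could be queried for nearest neighbors. The helper `nearest_words` and the toy matrix and dictionary are hypothetical stand-ins for final_embeddings and reverse_dictionary, with values chosen by hand.

```python
def nearest_words(embeddings, id_to_word, word_id, top_k=2):
    """Return the top_k words whose unit-length vectors are most similar to word_id's vector."""
    query = embeddings[word_id]
    scores = []
    for other_id, vec in enumerate(embeddings):
        if other_id == word_id:
            continue  # skip the query word itself
        similarity = sum(q * v for q, v in zip(query, vec))  # dot product of unit vectors
        scores.append((similarity, other_id))
    scores.sort(reverse=True)  # highest similarity first
    return [id_to_word[i] for _, i in scores[:top_k]]

# Toy stand-ins (hypothetical values); every row has unit length.
toy_embeddings = [
    [1.0, 0.0],    # id 0
    [0.8, 0.6],    # id 1
    [0.0, 1.0],    # id 2
    [-1.0, 0.0],   # id 3
]
toy_reverse_dictionary = {0: "king", 1: "queen", 2: "apple", 3: "cold"}

neighbors = nearest_words(toy_embeddings, toy_reverse_dictionary, word_id=0, top_k=2)
print(neighbors)  # ['queen', 'apple']
```

The loop only indexes rows and iterates over their entries, so it should accept a NumPy array such as final_embeddings as well as nested lists, although a vectorized matrix product would be far faster for a 50000-word vocabulary.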
