Word2Vec, also known as word vectors or word embeddings, is a model that converts the words of a language into vector representations.
With a One-Hot Encoder, each word corresponds to a vector in which a single element is 1 and all the others are 0. Every word in a document is typically converted into such a vector, so the whole document becomes a sparse matrix. One-Hot encoding has two problems: the codes are assigned arbitrarily and carry no information about how words relate to each other, and the vectors are extremely sparse, which makes storage and training inefficient.
Vector representations solve these problems effectively. Vector Space Models map words to vectors of continuous values (as opposed to the discrete values of One-Hot encoding), and words with similar meanings are mapped to nearby positions in the vector space.
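To make the contrast concrete, here is a minimal sketch; the five-word vocabulary and the dense values are made up purely for illustration:

import numpy as np

vocab = ['the', 'quick', 'brown', 'fox', 'dog']      # hypothetical tiny vocabulary
# One-Hot: a vector as long as the vocabulary with a single 1
one_hot_quick = np.zeros(len(vocab))
one_hot_quick[vocab.index('quick')] = 1.0            # [0., 1., 0., 0., 0.]
# Dense embedding: a short vector of continuous values (the numbers here are arbitrary);
# after training, words with similar meanings end up close together in this space
embedding_quick = np.array([0.12, -0.58, 0.33, 0.91])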
Word2Vec is a computationally efficient predictive model for learning word embeddings from raw text. It has two modes, CBOW (Continuous Bag of Words) and Skip-Gram: CBOW predicts the target word from its surrounding context (for example, predicting "Beijing" from "the capital of China is _"), while Skip-Gram does the reverse and predicts the context from the target word. CBOW is better suited to small datasets, whereas Skip-Gram performs better on large corpora.
In this section we use the Skip-Gram mode of Word2Vec. Let us first look at how its training samples are constructed, taking the sentence "the quick brown fox jumped over the lazy dog" as an example. We want to build a mapping from context to target word, where the context consists of the words to the left and right of the target. With a sliding window of size 1, the mappings we can construct include [the, brown] -> quick, [quick, fox] -> brown, [brown, jumped] -> fox, and so on. Because the Skip-Gram model predicts the context from the target word, the training samples are not [the, brown] -> quick but rather quick -> the and quick -> brown, so the dataset becomes (quick, the), (quick, brown), (brown, quick), (brown, fox), and so on. During training we want the model to predict the context word the from the target word quick; at the same time we draw random words as negative samples (noise), and we want the predicted probability to be as large as possible on the true context word the and as small as possible on the randomly drawn negative samples.
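As a quick illustration, and independently of the training code below, a few lines of plain Python are enough to enumerate these (target, context) pairs for a window of size 1 (the sentence is hard-coded here):

sentence = "the quick brown fox jumped over the lazy dog".split()
window = 1
pairs = []
for i, target in enumerate(sentence):
    # every word within `window` positions of the target becomes one of its context words
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))
print(pairs[:4])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick')]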
The approach is to use an optimization algorithm such as SGD to update the parameters of the word embedding so that the loss over this probability distribution (the NCE loss) becomes as small as possible. In this way the embedded vector of every word is adjusted continuously during training until it settles at the position in the space that best fits the corpus. At that point the loss is minimal, the embeddings best match the corpus, and the probability of predicting the correct word is highest.
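TensorFlow's tf.nn.nce_loss, used below, draws the noise words and computes this loss for us. Purely to illustrate the idea, the following is a minimal numpy sketch of the closely related negative-sampling objective for a single (target, context) pair; the names are invented for this example and it is not TensorFlow's exact NCE implementation:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_target, u_context, u_noise):
    # reward a high score for the true (target, context) pair ...
    pos = -np.log(sigmoid(np.dot(u_context, v_target)))
    # ... and penalize high scores for the k randomly drawn noise words
    neg = -np.sum(np.log(sigmoid(-u_noise.dot(v_target))))
    return pos + neg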
Now let us implement the training of Word2Vec in TensorFlow.
import collections
import math
import os
import random
import zipfile
import numpy as np
import urllib.request
import tensorflow as tf
# First define a function that downloads the text data and verifies its size against the expected number of bytes
url = 'http://mattmahoney.net/dc/'
def maybe_download(filename, expected_bytes):
    if not os.path.exists(filename):
        filename, _ = urllib.request.urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified', filename)
    else:
        print(statinfo.st_size)
        raise Exception('Failed to verify ' + filename + '. Can you get to it with a browser?')
    return filename
filename = maybe_download('text8.zip', 31344016)
# Next, unzip the downloaded file and use tf.compat.as_str to convert the data into a list of words.
# The program output shows that the data ends up as a list of 17,005,207 words
def read_data(filename):
    with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0])).split()
    return data
words = read_data(filename)
print('Data size', len(words))
# Next, build the vocabulary. We use collections.Counter to count the frequency of each word in the
# word list and take the top 50000 most frequent words as the vocabulary via most_common. We then
# create a dict (dictionary) holding the top 50000 words for fast lookup. Words outside the top
# 50000 are treated as unknown (UNK), encoded as 0, and counted. We then walk through the word list
# and, for each word, check whether it is in the dictionary: if so, convert it to its code,
# otherwise to 0 (UNK). Finally we return the encoded data (data), the frequency counts (count),
# the vocabulary (dictionary) and its inverted form (reverse_dictionary)
vocabulary_size = 50000
def build_dataset(words):
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary
data, count, dictionary, reverse_dictionary = build_dataset(words)
# Then delete the original list of words to save memory, and print the most frequent words in the
# vocabulary along with their counts
del words
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])
# Now generate the training samples for Word2Vec. Following the Skip-Gram scheme described above
# (predicting the context from the target word), we turn the raw text "the quick brown fox jumped
# over the lazy dog" into samples such as (quick, the), (quick, brown), (brown, fox). We define the
# function generate_batch to produce batches of training data; skip_window is the maximum distance
# a context word may be from the target (1 means only the two adjacent words are used), and
# num_skips is the number of samples generated per target word. It must not exceed twice
# skip_window, and batch_size must be an integer multiple of it
data_index = 0
def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1
    buffer = collections.deque(maxlen=span)
    # Starting from data_index, read span consecutive words into buffer as the initial window
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels
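# As a quick sanity check (this call is optional and only for illustration; note that it advances
# the global data_index), we can print a few generated samples:
demo_batch, demo_labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
for i in range(8):
    print(demo_batch[i], reverse_dictionary[demo_batch[i]], '->',
          demo_labels[i, 0], reverse_dictionary[demo_labels[i, 0]])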
batch_size = 128
embedding_size = 128    # dimension of the word embedding vectors
skip_window = 1         # how far to the left and right of the target context words may be
num_skips = 2           # how many samples to generate per target word
valid_size = 16         # number of words used to evaluate nearest neighbors during training
valid_window = 100      # validation words are drawn from the 100 most frequent words
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
num_sampled = 64        # number of negative (noise) words sampled per batch
# Now define the network structure of the Skip-Gram Word2Vec model
graph = tf.Graph()
with graph.as_default():
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
    # Embedding matrix: one embedding_size-dimensional vector per vocabulary word
    embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)
    # Weights and biases used by the NCE loss
    nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size],
                                                  stddev=1.0 / math.sqrt(embedding_size)))
    nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
    loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,
                                         biases=nce_biases,
                                         labels=train_labels,
                                         inputs=embed,
                                         num_sampled=num_sampled,
                                         num_classes=vocabulary_size))
    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
    # L2-normalize the embeddings so that cosine similarity becomes a simple matrix product
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_examples)
    similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)
    init = tf.global_variables_initializer()
num_steps = 100001
with tf.Session(graph=graph) as sess:
    init.run()
    print("Initialized")
    average_loss = 0
    for step in range(num_steps):
        batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
        _, loss_val = sess.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val
        if step % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            print("Average loss at step ", step, ":", average_loss)
            average_loss = 0
        # Every 10000 steps, print the nearest neighbors of the validation words
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log_str = "Nearest to %s:" % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log_str = "%s %s" % (log_str, close_word)
                print(log_str)
    final_embeddings = normalized_embeddings.eval()
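# As a small follow-up sketch (the query word 'three' and the helper below are just illustrative,
# not part of the training code): because the rows of final_embeddings are L2-normalized, a dot
# product gives cosine similarity, so the nearest neighbors of any word can be looked up directly:
def nearest_words(word, k=8):
    vec = final_embeddings[dictionary[word]]
    sims = np.dot(final_embeddings, vec)     # cosine similarity against every vocabulary word
    best = (-sims).argsort()[1:k + 1]        # position 0 is the word itself, so skip it
    return [reverse_dictionary[i] for i in best]
print(nearest_words('three'))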