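One-hot representation is a very simple way to represent words as vectors, but it has many problems. The biggest is that the vocabulary is usually very large, say on the order of a million words, so representing every word with a million-dimensional vector is a disaster for memory. Such a vector has a 1 in exactly one position and 0 everywhere else, which is an inefficient encoding. Can we reduce the dimensionality of the word vectors?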
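To make the memory argument concrete, here is a back-of-the-envelope sketch. The vocabulary size of one million comes from the paragraph above; storing every element as a 4-byte float32 is an assumption made for illustration, and 128 is simply the embedding size used later in this post.

```python
# Rough storage comparison. float32 (4 bytes per element) is an assumption.
BYTES_PER_FLOAT32 = 4


def table_bytes(vocab_size, dim):
    """Bytes needed to store one dim-dimensional float32 vector per word."""
    return vocab_size * dim * BYTES_PER_FLOAT32


vocab_size = 1_000_000                                # million-word vocabulary
one_hot_bytes = table_bytes(vocab_size, vocab_size)   # one slot per vocabulary word
dense_bytes = table_bytes(vocab_size, 128)            # 128 slots per word

print("one-hot: %.1f TB" % (one_hot_bytes / 1e12))
print("dense:   %.1f MB" % (dense_bytes / 1e6))
```

In practice nobody materializes the full one-hot table, but the comparison shows why a dense table with a fixed, small number of columns is the practical choice.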
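Distributed representation solves the problems of one-hot representation. The idea is to learn, through training, a mapping from each word to a much shorter vector. Together these word vectors form a vector space, in which ordinary statistical methods can be used to study the relationships between words. How many dimensions should the shorter vector have? That is generally something we specify ourselves at training time.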
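A small sketch of why a dense vector space is useful for studying relationships between words: any two distinct one-hot vectors are orthogonal, so their cosine similarity is always 0, whereas dense vectors can place related words close together. The two-dimensional dense vectors below are invented for illustration and are not trained values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# One-hot vectors: every pair of distinct words looks equally unrelated.
one_hot = {"five": [1, 0, 0], "six": [0, 1, 0], "states": [0, 0, 1]}

# Hypothetical dense vectors (made up, not the output of any training run).
dense = {"five": [0.9, 0.1], "six": [0.8, 0.2], "states": [-0.2, 0.9]}

print(cosine(one_hot["five"], one_hot["six"]))
print(cosine(dense["five"], dense["six"]))
print(cosine(dense["five"], dense["states"]))
```

With these toy values, "five" and "six" score far higher than "five" and "states", which is the kind of structure the training below is meant to discover on its own.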
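This post explores how to train, save, and then use Word2Vec embeddings with TensorFlow's embedding_lookup function.
Building on this, we can use our own trained Word2Vec embeddings as the input to RNN applications.
The dataset used in this walkthrough is text8.zip.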
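Introduction to tf.nn.embedding_lookup
tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None, validate_indices=True, max_norm=None)
Looks up the elements of params that correspond to the ids in ids. It can be thought of as indexing, so every value in ids must be smaller than the size of the first dimension of params.
For example, with ids=[1,3,5], the vectors at indices 1, 3, and 5 of params are gathered into a matrix and returned.
embedding_lookup is not merely a table lookup: when params is a trainable variable, the vectors that the ids map to are updated during training. The number of trainable parameters is category num * embedding size, so the lookup is equivalent to a fully connected layer (without bias) applied to one-hot inputs.
Parameters:
params: the complete embedding tensor, or a list of P tensors that all have the same shape except for the first dimension, representing a partitioned embedding tensor.
ids: a Tensor of type int32 or int64 containing the ids to look up in params.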
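The equivalence between a lookup and a bias-free fully connected layer on one-hot inputs can be checked with a few lines of NumPy. This is only a stand-in for the simple, unpartitioned case; it is not how TensorFlow implements tf.nn.embedding_lookup, and the table values are arbitrary.

```python
import numpy as np

# Toy embedding table: 6 "words", 3 dimensions each (arbitrary values).
params = np.arange(18, dtype=np.float32).reshape(6, 3)
ids = np.array([1, 3, 5])

# Lookup as plain row indexing: picks rows 1, 3 and 5 of params.
looked_up = params[ids]

# The same result computed as one-hot inputs times the weight matrix.
one_hot = np.eye(params.shape[0], dtype=np.float32)[ids]  # shape (3, 6)
via_matmul = one_hot @ params                             # shape (3, 3)

print(looked_up)
print(np.array_equal(looked_up, via_matmul))
```

Indexing is preferred in practice because it avoids building the one-hot matrix and multiplying by all the zeros.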
Code:
# -*- coding: utf-8 -*-
# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import collections
import pickle
import math
import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"
import random
import zipfile
import numpy as np
import urllib.request  # urlretrieve lives in the urllib.request submodule
import tensorflow as tf
# Step 1: Download the data.
url = 'http://mattmahoney.net/dc/'


def maybe_download(filename, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
    if not os.path.exists(filename):
        filename, _ = urllib.request.urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified', filename)
    else:
        print(statinfo.st_size)
        raise Exception(
            'Failed to verify ' + filename + '. Can you get to it with a browser?')
    return filename


filename = maybe_download('text8.zip', 31344016)


# Read the data into a list of strings.
def read_data(filename):
    """Extract the first file enclosed in a zip file as a list of words"""
    with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0])).split()
    return data


words = read_data(filename)
print('Data size', len(words))
# Step 2: Build the dictionary and replace rare words with UNK token.
vocabulary_size = 50000


def build_dataset(words):
    # count holds [word, frequency] pairs; index 0 is reserved for UNK.
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    # dictionary maps each word to its id (its rank by frequency).
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary


data, count, dictionary, reverse_dictionary = build_dataset(words)
del words  # Hint to reduce memory.
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])
data_index = 0
# Step 3: Function to generate a training batch for the skip-gram model.
def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        # Slide the window one word to the right.
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels


batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
for i in range(8):
    print(batch[i], reverse_dictionary[batch[i]],
          '->', labels[i, 0], reverse_dictionary[labels[i, 0]])
# Step 4: Build and train a skip-gram model.
# Note: this script uses the TensorFlow 1.x graph API (tf.placeholder, tf.Session).
batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1       # How many words to consider left and right.
num_skips = 2         # How many times to reuse an input to generate a label.
# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16      # Random set of words to evaluate similarity on.
valid_window = 100   # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
num_sampled = 64     # Number of negative examples to sample.
graph = tf.Graph()
with graph.as_default():
    # Input data.
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
    # Ops and variables pinned to the CPU because of missing GPU implementation
    with tf.device('/cpu:0'):
        # Look up embeddings for inputs.
        embeddings = tf.Variable(
            tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        embed = tf.nn.embedding_lookup(embeddings, train_inputs)
        # Construct the variables for the NCE loss
        nce_weights = tf.Variable(
            tf.truncated_normal([vocabulary_size, embedding_size],
                                stddev=1.0 / math.sqrt(embedding_size)))
        nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
    # Compute the average NCE loss for the batch.
    # tf.nn.nce_loss automatically draws a new sample of the negative labels each
    # time we evaluate the loss.
    loss = tf.reduce_mean(
        tf.nn.nce_loss(weights=nce_weights,
                       biases=nce_biases,
                       labels=train_labels,
                       inputs=embed,
                       num_sampled=num_sampled,
                       num_classes=vocabulary_size))
    # Construct the SGD optimizer using a learning rate of 1.2.
    optimizer = tf.train.GradientDescentOptimizer(1.2).minimize(loss)
    # Compute the cosine similarity between minibatch examples and all embeddings.
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(
        normalized_embeddings, valid_dataset)
    similarity = tf.matmul(
        valid_embeddings, normalized_embeddings, transpose_b=True)
    # Add variable initializer.
    init = tf.global_variables_initializer()
# Step 5: Begin training.
num_steps = 100001
with tf.Session(graph=graph) as session:
    # We must initialize all variables before we use them.
    init.run()
    saver = tf.train.Saver()
    print("Initialized")
    average_loss = 0
    for step in range(num_steps):
        batch_inputs, batch_labels = generate_batch(
            batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
        # We perform one update step by evaluating the optimizer op (including it
        # in the list of returned values for session.run()).
        _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val
        if step % 1000 == 0:
            if step > 0:
                average_loss /= 1000
            # The average loss is an estimate of the loss over the last 1000 batches.
            print("Average loss at step ", step, ": ", average_loss)
            average_loss = 0
        # Note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8  # number of nearest neighbors
                # Index 0 of the sorted result is the word itself, so skip it.
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log_str = "Nearest to %s:" % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log_str = "%s %s," % (log_str, close_word)
                print(log_str)
    final_embeddings = normalized_embeddings.eval()
    # Make sure the output directory exists before saving.
    os.makedirs('./2RNN/3_1Word2Vec', exist_ok=True)
    saver_path = saver.save(session, './2RNN/3_1Word2Vec/MyModel')
    print("saver path: ", saver_path)
    # Save the normalized embeddings together with both lookup dictionaries.
    with open('./2RNN/3_1Word2Vec/tf_128_2.pkl', 'wb') as fw:
        pickle.dump({'embeddings': final_embeddings,
                     'word2id': dictionary,
                     'id2word': reverse_dictionary}, fw, protocol=4)
# Step 6: Visualize the embeddings.
def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
    assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
    plt.figure(figsize=(18, 18))  # in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i, :]
        plt.scatter(x, y)
        plt.annotate(label,
                     xy=(x, y),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.savefig(filename)


#%%
try:
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt
    tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
    plot_only = 200
    low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
    labels = [reverse_dictionary[i] for i in range(plot_only)]
    plot_with_labels(low_dim_embs, labels)
except ImportError:
    print("Please install sklearn, matplotlib, and scipy to visualize embeddings.")
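The batch generator in Step 3 is probably the least obvious part of the script, so here is a simplified sketch of what it produces. The helper skipgram_pairs is hypothetical and differs from generate_batch in several ways: it works on words rather than ids, it emits every context word inside the window instead of sampling num_skips of them at random, and it truncates the window at the sentence edges instead of wrapping around the corpus.

```python
def skipgram_pairs(tokens, skip_window):
    """Return every (center, context) pair within skip_window of each center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - skip_window)
        hi = min(len(tokens), i + skip_window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is never its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "anarchism originated as a term".split()  # toy sentence
pairs = skipgram_pairs(tokens, skip_window=1)
for center, context in pairs:
    print(center, "->", context)
```

Each pair plays the role of one (batch[i], labels[i]) entry: the model is trained to predict the context word from the center word.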
Output:
Average loss at step 1000 : 149.03840727233887
Average loss at step 2000 : 86.77497396659851
Average loss at step 3000 : 61.10482195854187
...
Average loss at step 97000 : 4.575266252994537
Average loss at step 98000 : 4.605689331054688
Average loss at step 99000 : 4.6487927632331845
Average loss at step 100000 : 4.653323454380035
Nearest to or: and, agouti, microcebus, ssbn, dasyprocta, than, clodius, mucus,
Nearest to as: agouti, when, microcebus, ssbn, bpp, amalthea, roshan, michelob,
Nearest to i: we, ii, you, t, subcode, they, tabula, g,
Nearest to they: there, he, we, you, it, these, not, who,
Nearest to and: or, but, microcebus, agouti, mucus, dasyprocta, while, michelob,
Nearest to zero: eight, five, seven, four, six, nine, dasyprocta, michelob,
Nearest to states: nations, bandanese, kingdom, absalom, dasyprocta, aediles, applescript, kv,
Nearest to have: had, has, are, were, be, klister, having, agouti,
Nearest to five: four, six, seven, eight, three, two, zero, nine,
Nearest to used: known, agouti, microcebus, iit, abitibi, spoken, dasyprocta, upanija,
Nearest to an: wernicke, riley, binds, oddly, tunings, rearranged, tamarin, apparition,
Nearest to between: with, within, into, from, in, through, jarman, saracens,
Nearest to time: reginae, year, callithrix, iit, albury, upanija, brahma, microcebus,
Nearest to it: he, she, this, there, they, which, amalthea, microcebus,
Nearest to from: into, through, during, in, within, between, dominican, with,
Nearest to six: four, seven, five, eight, nine, three, two, agouti,
saver path: ./2RNN/3_1Word2Vec/MyModel
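Analysis:
After 100,000 training steps, the average loss has fallen from about 149 to about 4.65, and each word has settled into a reasonably suitable position in the embedding space.
For example: Nearest to five: four, six, seven, eight, three, two, zero, nine.
The number words have been singled out and all sit close to one another.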
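The "Nearest to" lines come from ranking one row of the similarity matrix. The sketch below reproduces that ranking step in isolation; the six-word vocabulary and the similarity values are made up for illustration and are not taken from the trained model.

```python
import numpy as np

# Hypothetical cosine similarities between the query word "five" and a
# six-word vocabulary. Index 0 is the query itself, hence similarity 1.0.
vocab = ["five", "four", "states", "six", "agouti", "seven"]
sim_row = np.array([1.0, 0.9, 0.1, 0.8, 0.2, 0.7])

top_k = 3
# argsort sorts ascending, so negating the row ranks by descending similarity.
# Position 0 of the ranking is the query word itself and is skipped.
nearest = (-sim_row).argsort()[1:top_k + 1]
neighbors = [vocab[i] for i in nearest]
print("Nearest to five:", ", ".join(neighbors))
```

Skipping position 0 relies on the query word being its own best match, which holds here because the embeddings are normalized before the similarity is computed.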
During training in the previous part, we also saved the results to the file tf_128_2.pkl. In this part we load that saved data back.
Code
# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import pickle
with open('./2RNN/3_1Word2Vec/tf_128_2.pkl', 'rb') as fr:
    data = pickle.load(fr)
final_embeddings = data['embeddings']
word2id = data['word2id']
id2word = data['id2word']
print("word2id:",type(word2id),len(word2id))
print("word2id one:",list(word2id.items())[0])
print("id2word:",type(id2word),len(id2word))
print("id2word one:",list(id2word.items())[0])
print("final_embeddings:",type(final_embeddings),final_embeddings.shape)
print("final_embeddings one:",final_embeddings[0])
Output
word2id: <class 'dict'> 50000
word2id one: ('UNK', 0)
id2word: <class 'dict'> 50000
id2word one: (0, 'UNK')
final_embeddings: <class 'numpy.ndarray'> (50000, 128)
final_embeddings one:
[ 0.07824267  0.02380653 -0.04904078 -0.15769418 -0.03343008 -0.00123829
 -0.00840652  0.11035322  0.05255153 -0.01701773 -0.03454393  0.07412812
  0.12529139  0.08700892  0.13564599  0.06016889 -0.02242458  0.01967838
 -0.08621006  0.19164786  0.05878171  0.15053993  0.15180601  0.11737475
  0.02684335 -0.02697461  0.02076019 -0.07443079  0.0905515 -0.00580214
 -0.10034874  0.10663538  0.10468851 -0.0018832 -0.03854908 -0.04377652
 -0.07925367 -0.01276041  0.06139784 -0.04612593 -0.0026719 -0.14129621
  0.03356975 -0.08864117  0.03864674  0.06496057 -0.03393148 -0.18256697
  0.1531667  0.01806654 -0.25479555 -0.0102073 -0.01091281 -0.13244723
  0.03231056 -0.04288295  0.00475867 -0.06387896  0.16555941 -0.1105833
  0.16233324 -0.01569812 -0.03743415  0.11839435  0.14104177 -0.06637108
 -0.02597998 -0.05089493  0.05379589  0.02132376 -0.0230114  0.16737887
 -0.07722343  0.06376561 -0.06996173  0.07367135 -0.04434428 -0.05931331
  0.13638481 -0.12992401  0.05051441  0.10075318  0.1285995  0.03757066
 -0.15496145  0.02049168 -0.02400574  0.04723364 -0.05883536  0.20387387
 -0.01346673  0.09482987  0.02737017  0.07975979  0.02752302  0.1652701
 -0.06379505 -0.01461394 -0.01188034  0.118714 -0.0942675  0.08787307
 -0.06561033  0.04986798  0.18926224  0.11162002  0.01565995  0.09576936
 -0.02896462  0.03163688  0.08406845  0.07642328 -0.04427774 -0.03355639
 -0.07277506 -0.20906252 -0.00820385 -0.00606967  0.02557734  0.03273683
  0.04223491  0.04725773 -0.011081 -0.0294039  0.04183002 -0.00577809
  0.13359077 -0.02493091]
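Analysis
word2id and id2word are both dictionaries with 50,000 entries; they are used to convert between ids and words.
final_embeddings is a two-dimensional array with 50,000 rows, one per word, each a 128-dimensional vector. This is analogous to the 784 pixel values that represent each image in the MNIST handwritten-digit dataset, and it is the essence of a word vector. From here we can use our own trained word vectors for semantic-analysis tasks.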
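As a minimal sketch of that next step, a sentence can be turned into a matrix of word vectors that a downstream model such as an RNN could consume. The helper encode is hypothetical, and the tiny word2id dictionary and random embedding table below are stand-ins for the objects loaded from the pickle file, so the numbers themselves carry no meaning.

```python
import numpy as np

# Stand-ins for the loaded objects (made-up vocabulary, random vectors).
word2id = {"UNK": 0, "the": 1, "of": 2, "five": 3}
final_embeddings = np.random.RandomState(0).rand(len(word2id), 128).astype(np.float32)


def encode(sentence, word2id, embeddings):
    """Map a whitespace-tokenized sentence to ids and a (num_words, dim) matrix."""
    # Words missing from the vocabulary fall back to the UNK id.
    ids = [word2id.get(word, word2id["UNK"]) for word in sentence.split()]
    return np.array(ids), embeddings[ids]


ids, vectors = encode("five of the zebras", word2id, final_embeddings)
print(ids)
print(vectors.shape)
```

The fallback to id 0 mirrors how build_dataset maps rare words to UNK during training, so unseen words at inference time are handled the same way.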