Implementing word2vec with TensorFlow (skip-gram + NCE model)

Preface:

The code in this post is mainly based on the open-source "Basic word2vec example" on GitHub, but it keeps only the parts needed to build the network. To make things easier to follow as a beginner, I simplified some of the wording (the model itself is not simplified) and added some notes of my own.
The main goal is to learn and get comfortable with TensorFlow while deepening my understanding of word2vec, so I am keeping this record here.

Main text:

Step 1: Download and read the data

The data is read in and stored word by word in vocabulary. Note that the words must keep the original word order of the text rather than being shuffled (the skip-gram objective of word2vec requires this), e.g. vocabulary = [I, like, eating, ChongQing, food, …].

# Dependencies used throughout this post.
import collections
import math
import os
import random
import urllib.request
import zipfile
from tempfile import gettempdir

import numpy as np
import tensorflow as tf

# Location of the text8 corpus used by the original "Basic word2vec example".
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  local_filename = os.path.join(gettempdir(), filename)
  if not os.path.exists(local_filename):
    local_filename, _ = urllib.request.urlretrieve(url + filename,
                                                   local_filename)
  statinfo = os.stat(local_filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified', filename)
  else:
    print(statinfo.st_size)
    raise Exception('Failed to verify ' + local_filename +
                    '. Can you get to it with a browser?')
  return local_filename

filename = maybe_download('text8.zip', 31344016)

# Read the data into a list of strings.
def read_data(filename):
  """Extract the first file enclosed in a zip file as a list of words."""
  with zipfile.ZipFile(filename) as f:
    data = tf.compat.as_str(f.read(f.namelist()[0])).split()
  return data

vocabulary = read_data(filename)
print('Data size', len(vocabulary))

vocabulary_size = 50000  # keep only the 50,000 most frequent words, as in the original example

Step 2: Build the dictionary

Count the frequency of every word in vocabulary, then assign each word an integer ID in order of decreasing frequency (the most frequent word gets ID 1, the next ID 2, and so on). In particular, all rare words that do not make it into the top vocabulary_size - 1 most frequent words are mapped to ID 0, which stands for the unknown token 'UNK' (a small toy illustration follows the code below).

def build_dataset(words, n_words):
  """Process raw inputs into a dataset."""
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(n_words - 1))
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  data = list()
  unk_count = 0
  for word in words:
    index = dictionary.get(word, 0)
    if index == 0:  # dictionary['UNK']
      unk_count += 1
    data.append(index)
  count[0][1] = unk_count
  reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
  return data, count, dictionary, reversed_dictionary

# Filling 4 global variables:
# data - list of codes (integers from 0 to vocabulary_size-1).
#   This is the original text but words are replaced by their codes
# count - list of (word, count) pairs for the kept words, most frequent first (UNK at position 0)
# dictionary - map of words(strings) to their codes(integers)
# reverse_dictionary - maps codes(integers) to words(strings)
data, count, dictionary, reverse_dictionary = build_dataset(
    vocabulary, vocabulary_size)
del vocabulary  # Hint to reduce memory.
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])
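
As a quick, hypothetical illustration of what build_dataset returns, here is a toy call on a short sentence similar to the one in step 1, keeping only the 3 most frequent words plus UNK (the exact ordering among words with equal counts may vary):

toy_words = ['i', 'like', 'eating', 'i', 'like', 'chongqing', 'food']
toy_data, toy_count, toy_dict, _ = build_dataset(toy_words, n_words=4)
# toy_dict  -> {'UNK': 0, 'i': 1, 'like': 2, 'eating': 3}
# toy_data  -> [1, 2, 3, 1, 2, 0, 0]   ('chongqing' and 'food' fall below the cut and become UNK)
# toy_count -> [['UNK', 2], ('i', 2), ('like', 2), ('eating', 1)]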

Step 3: Build the function that generates training batches

Following the skip-gram idea, each training example we feed in should be a pair of the form (x=center_word, y=context_word). For example, with num_skips=2 (how many (center, context) pairs each window's center word contributes) and skip_window=1 (the window extends one word to each side), the training pairs would look like (like, I), (like, eating), (eating, like), (eating, ChongQing), … A quick sanity check of the batch generator is shown right after the code below.

data_index = 0
def generate_batch(batch_size, num_skips, skip_window):
  global data_index
  assert batch_size % num_skips == 0
  assert num_skips <= 2 * skip_window
  batch = np.ndarray(shape=(batch_size), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * skip_window + 1  # [ skip_window target skip_window ]
  buffer = collections.deque(maxlen=span)  # pylint: disable=redefined-builtin
  if data_index + span > len(data):
    data_index = 0
  buffer.extend(data[data_index:data_index + span])
  data_index += span
  for i in range(batch_size // num_skips):
    context_words = [w for w in range(span) if w != skip_window]
    words_to_use = random.sample(context_words, num_skips)
    for j, context_word in enumerate(words_to_use):
      batch[i * num_skips + j] = buffer[skip_window]
      labels[i * num_skips + j, 0] = buffer[context_word]
    if data_index == len(data):
      buffer.extend(data[0:span])
      data_index = span
    else:
      buffer.append(data[data_index])
      data_index += 1
  # Backtrack a little bit to avoid skipping words in the end of a batch
  data_index = (data_index + len(data) - span) % len(data)
  return batch, labels
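
To sanity-check generate_batch, it helps to print a few (center, context) pairs; this snippet mirrors the check in the original "Basic word2vec example" (the small batch_size of 8 is only for inspection):

batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
for i in range(8):
    print(batch[i], reverse_dictionary[batch[i]],
          '->', labels[i, 0], reverse_dictionary[labels[i, 0]])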

Step 4: Build the model and train it

This step builds the network and runs the training. The key piece is tf.nn.nce_loss, which handles the negative sampling and the corresponding loss computation for us; we only need to pass in the training pairs and the parameters to be trained. Everything else follows the skip-gram formulation of word2vec almost exactly. Readers who are still unsure about the underlying algorithm can consult the reference cited in the earlier fastText post: "Word2vec中的数学原理" (the mathematics behind word2vec).
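
Before the full graph, here is a rough NumPy sketch of my own (not the actual tf.nn.nce_loss implementation, which additionally corrects for the probability of the sampled negatives) showing what a negative-sampling-style loss computes for a single (center, context) pair; the names sampled_pair_loss, out_weights and neg_ids are made up for illustration:

import numpy as np

def sampled_pair_loss(center_vec, out_weights, out_biases, context_id, neg_ids):
    """center_vec: (d,) embedding of the center word;
    out_weights: (V, d) output weight matrix; out_biases: (V,);
    context_id: id of the true context word; neg_ids: ids of sampled negative words."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # The true (center, context) pair should score high (pushed toward label 1)...
    pos_logit = out_weights[context_id] @ center_vec + out_biases[context_id]
    loss = -np.log(sigmoid(pos_logit))
    # ...while each randomly sampled "negative" pair should score low (pushed toward label 0).
    for nid in neg_ids:
        neg_logit = out_weights[nid] @ center_vec + out_biases[nid]
        loss -= np.log(sigmoid(-neg_logit))
    return loss

In the graph below, nce_weights and nce_biases play the role of out_weights and out_biases, and num_sampled controls how many negatives are drawn per positive pair.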

batch_size = 128
embedding_size = 50  # Dimension of the embedding vector.
skip_window = 2  # How many words to consider left and right.
num_skips = 4  # How many times to reuse an input to generate a label.
num_sampled = 5  # Number of negative examples to sample.
graph = tf.Graph()

with graph.as_default():
    # Input data
    with tf.name_scope('inputs'):
        train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
        train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])

    # Look up embeddings for inputs.
    with tf.name_scope('embeddings'):
        embeddings = tf.Variable(
            tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        embed = tf.nn.embedding_lookup(embeddings, train_inputs)

    # Construct the variables for the NCE loss
    with tf.name_scope('weights'):
        nce_weights = tf.Variable(
            tf.truncated_normal(
                [vocabulary_size, embedding_size],
                stddev=1.0 / math.sqrt(embedding_size)))
    with tf.name_scope('biases'):
        nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

    # Compute the average NCE loss for the batch.
    # tf.nn.nce_loss automatically draws a new sample of the negative labels each
    # time we evaluate the loss.
    with tf.name_scope('loss'):
        loss = tf.reduce_mean(
            tf.nn.nce_loss(
                weights=nce_weights,
                biases=nce_biases,
                labels=train_labels,
                inputs=embed,
                num_sampled=num_sampled,
                num_classes=vocabulary_size))

    with tf.name_scope('optimizer'):
        optimizer = tf.train.AdamOptimizer().minimize(loss)

    # Add variable initializer.
    init = tf.global_variables_initializer()

num_steps = 100000

with tf.Session(graph=graph) as session:
    # We must initialize all variables before we use them.
    init.run()
    print('Initialized')

    for step in range(num_steps):
        batch_inputs, batch_labels = generate_batch(batch_size, num_skips,
                                                    skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
        # Run one optimization step and fetch the loss value in the same call.
        _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
        if step % 10000 == 0:
            print('Loss at step', step, ':', loss_val)
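
    # (Illustrative addition, not part of the original post.) After training, the learned
    # embedding matrix can be pulled out of the graph while the session is still open.
    # Normalising each row to unit length makes a plain dot product equal to cosine
    # similarity, which is handy for nearest-neighbour inspection of the word vectors.
    final_embeddings = session.run(embeddings)
    norms = np.sqrt((final_embeddings ** 2).sum(axis=1, keepdims=True))
    final_embeddings = final_embeddings / norms
    print('Trained embedding matrix:', final_embeddings.shape)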

Summary:
That is a bare-bones walkthrough of training word2vec with TensorFlow. TensorFlow really is powerful: almost everything you can think of is already provided for you.
