TensorFlow and NLP (Word Vectors: skip-gram)

Introduction

We have already covered two ways of turning text into feature vectors, so we can finally get to word embeddings. Word embeddings are one of the most important achievements in NLP in recent years, and it is now hard to find a basic NLP task that does without them. Today we will look at what they actually are from the code side. (PS: my procrastination has been terrible and this post is long overdue, so today it is getting finished no matter what.)

word2vec

The two vectorization schemes we covered earlier share a drawback: the vectors are far too long, since their dimensionality equals the vocabulary size, which forces us to cap the vocabulary just to keep the dimensionality under control. With one-hot encoding, for example, a vocabulary of 100,000 words gives you 100,000-dimensional word vectors. That is extremely sparse, and because every word gets a completely distinct representation, there is no sensible way to compute similarity between words, which is a real problem. So where do the word vectors we use today come from? They are learned from the surrounding context and are essentially a by-product of a language model. I am including a few of Hung-yi Lee's slides below to help with the intuition.
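To make the similarity problem concrete, here is a tiny sketch (the toy vocabulary and the random vectors are my own illustration, not from the original post) of why one-hot vectors cannot express similarity while dense embeddings can:

import numpy as np

vocab = ['movie', 'film', 'banana']          # toy vocabulary, for illustration only
one_hot = np.eye(len(vocab))                 # each word is its own axis

# The dot product of any two different one-hot vectors is always 0,
# so 'movie' and 'film' look exactly as unrelated as 'movie' and 'banana'.
print(one_hot[0] @ one_hot[1])               # 0.0

# With dense embeddings (random 4-d vectors standing in for trained ones),
# similar words can end up with similar vectors and a meaningful cosine similarity.
emb = np.random.randn(len(vocab), 4)
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(emb[0], emb[1]))                   # nonzero; becomes meaningful after training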
[Figure 1: slide from Hung-yi Lee's lecture, illustrating a neural language model]

First, a word about neural network language models (we will certainly have a dedicated post with full language-model code later). What is a language model? Put simply, it judges whether a sentence sounds natural, i.e. how probable that sentence is in real life. If the probability is high, the sentence is judged to be well-formed, natural language. A trained neural language model is there to generate natural text: in the figure above, we feed in the two people's names and expect the model to output "宣誓就职" ("was sworn into office"). If it outputs anything else, there is an error and we adjust the parameters. This is a typical prediction model, and that prediction model is exactly how a language model is implemented; with a recurrent network such as an LSTM, once we give it an initial input it can keep generating the sentences we want. OK, you may be wondering what this has to do with word vectors. It has everything to do with them: word vectors are produced from the surrounding context words using exactly this kind of language model, except that it is not a classical n-gram model, since it does not need to respect word order; the context information is what matters.
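As a minimal illustration of "how probable is this sentence", here is a toy bigram language model; the corpus and counts are made up purely for illustration and are not part of the original post:

from collections import Counter

# Toy corpus (assumption for illustration only)
corpus = [['the', 'president', 'was', 'sworn', 'in'],
          ['the', 'president', 'gave', 'a', 'speech']]

unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter((s[i], s[i+1]) for s in corpus for i in range(len(s)-1))

def sentence_prob(sentence):
    # P(w1..wn) is approximated as the product of P(w_i | w_{i-1}) under a bigram assumption
    p = 1.0
    for prev, cur in zip(sentence, sentence[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

print(sentence_prob(['the', 'president', 'was', 'sworn', 'in']))   # relatively high
print(sentence_prob(['the', 'president', 'was', 'a', 'speech']))   # zero: not "natural" under this corpus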

[Figure 2: slide from Hung-yi Lee's lecture, showing the word-embedding model]

Note that the word-vector model has three layers, with only a single hidden layer, and the word vectors come mainly from the hidden layer's weight matrix. There is a very good answer on Zhihu about word vectors that should give you some useful intuition. Since this is not a pure theory post, some things are only touched on briefly, but feel free to ask questions; if there are mistakes, I hope readers will point them out so I can fix them. Word embeddings are the usual entry point into NLP, and well-written posts about them are rare, so I hope the answer I linked gives you some inspiration.
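The point that "the word vector is a row of the hidden-layer weight matrix" can be seen directly: multiplying a one-hot input by the input-to-hidden weight matrix just selects one row, which is exactly what tf.nn.embedding_lookup does later in the code. A minimal numpy sketch (the sizes here are arbitrary assumptions):

import numpy as np

vocabulary_size, embedding_size = 6, 4                  # tiny sizes, for illustration only
W = np.random.randn(vocabulary_size, embedding_size)    # hidden-layer weight matrix

word_ix = 3                                             # index of some word in the dictionary
one_hot = np.zeros(vocabulary_size)
one_hot[word_ix] = 1.0

# Multiplying the one-hot vector by W picks out row `word_ix` of W:
assert np.allclose(one_hot @ W, W[word_ix])
# so the trained rows of W are the word vectors, and an embedding lookup
# is simply this row selection done efficiently.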

Code

Let's start with the preprocessing code: downloading the data, building the vocabulary, and generating the training batches. My coding fundamentals are weak, so I have stepped through all of this preprocessing code myself. Let me briefly restate what the code has to do: word2vec is not a piece of code that you hand a corpus to and it hands you back word vectors; if that is how you picture it, please go back and read the theory first. The overall task is still a prediction task: skip-gram, as in the figure above, takes a center word and predicts its context. The word vectors are a by-product of this language model; they come from the hidden layer's weight matrix. Hopefully that is clear enough. Now for the preprocessing code. Why spend time on preprocessing? Because my fundamentals are weak and I want to accumulate exactly this kind of hands-on experience.

First, loading the data:

def load_movie_data():
    save_folder_name = 'temp'
    pos_file = os.path.join(save_folder_name, 'rt-polarity.pos')
    neg_file = os.path.join(save_folder_name, 'rt-polarity.neg')

    # Check if files are already downloaded
    if os.path.exists(save_folder_name):
        pos_data = []
        with open(pos_file, 'r') as temp_pos_file:
            for row in temp_pos_file:
                pos_data.append(row)
        neg_data = []
        with open(neg_file, 'r') as temp_neg_file:
            for row in temp_neg_file:
                neg_data.append(row)
    else: # If not downloaded, download and save
        movie_data_url = 'http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz'
        stream_data = urllib.request.urlopen(movie_data_url)
        tmp = io.BytesIO()
        while True:
            s = stream_data.read(16384)
            if not s:
                break
            tmp.write(s)
        # Close the stream and rewind the buffer only after the download has finished
        stream_data.close()
        tmp.seek(0)

        tar_file = tarfile.open(fileobj=tmp, mode="r:gz")
        pos = tar_file.extractfile('rt-polaritydata/rt-polarity.pos')
        neg = tar_file.extractfile('rt-polaritydata/rt-polarity.neg')
        # Save pos/neg reviews
        pos_data = []
        for line in pos:
            pos_data.append(line.decode('ISO-8859-1').encode('ascii',errors='ignore').decode())
        neg_data = []
        for line in neg:
            neg_data.append(line.decode('ISO-8859-1').encode('ascii',errors='ignore').decode())
        tar_file.close()
        # Write to file
        if not os.path.exists(save_folder_name):
            os.makedirs(save_folder_name)
        # Save files
        with open(pos_file, "w") as pos_file_handler:
            pos_file_handler.write(''.join(pos_data))
        with open(neg_file, "w") as neg_file_handler:
            neg_file_handler.write(''.join(neg_data))
    texts = pos_data + neg_data
    target = [1]*len(pos_data) + [0]*len(neg_data)
    return(texts, target)

The dataset here is a sentiment-analysis dataset. The code tries to download it for you, but that does not always work well and often fails; you can simply open the link in a browser and it will download to your machine. When opening the downloaded files you may run into encoding problems; in that case add the parameter encoding='latin-1' to the open call. For the reasons behind this, see my post on Python encoding issues. The target labels are not really needed here, since we are not doing classification; all we need is the text.

In the full version of the code later on I include the variant that does not need to download, so remember to download the dataset yourself first. We will use this same dataset again for the CNN text-classification post. On to the next piece of code: standard preprocessing, where we keep only sentences with at least three words and discard the shorter ones.

# Normalize text
def normalize_text(texts, stops):
    # Lower case
    texts = [x.lower() for x in texts]

    # Remove punctuation
    texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]

    # Remove numbers
    texts = [''.join(c for c in x if c not in '0123456789') for x in texts]

    # Remove stopwords
    texts = [' '.join([word for word in x.split() if word not in (stops)]) for x in texts]

    # Trim extra whitespace
    texts = [' '.join(x.split()) for x in texts]

    return(texts)

texts = normalize_text(texts, stops)

# Texts must contain at least 3 words
target = [target[ix] for ix, x in enumerate(texts) if len(x.split()) > 2]
texts = [x for x in texts if len(x.split()) > 2]

These are fairly standard steps; if you are interested you can do much more fine-grained preprocessing, and I give some suggestions in another post of mine (one possible refinement is sketched below). All of this preprocessing exists for one reason: to give the model more accurate results downstream.
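As one example of what "more fine-grained preprocessing" could mean, here is a stemming step; this is my own suggestion for illustration, not something the original code does:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Collapse inflected forms ('loved', 'loving' -> 'love') so they share one embedding
texts = [' '.join(stemmer.stem(word) for word in x.split()) for x in texts]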

Next, the code that builds the dictionary:

# Build dictionary of words
def build_dictionary(sentences, vocabulary_size):
    # Turn sentences (list of strings) into lists of words
    split_sentences = [s.split() for s in sentences]
    words = [x for sublist in split_sentences for x in sublist]

    # Initialize list of [word, word_count] for each word, starting with unknown
    count = [['RARE', -1]]

    # Now add most frequent words, limited to the N-most frequent (N=vocabulary size)
    count.extend(collections.Counter(words).most_common(vocabulary_size-1))

    # Now create the dictionary
    word_dict = {}
    # For each word, that we want in the dictionary, add it, then make it
    # the value of the prior dictionary length
    for word, word_count in count:
        word_dict[word] = len(word_dict)
        # Note: indices are assigned in order of descending word frequency

    return(word_dict)

This just uses Python's collections library to count word frequencies and then assigns each word an index in order of decreasing frequency, building the dictionary we will use. The vocabulary size is capped, so we can only keep the most frequent words, which inevitably introduces an OOV (out-of-vocabulary) problem.
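As a quick sanity check of what build_dictionary returns, here is a toy example (the sentences are mine, not from the dataset):

toy_sentences = ['the movie was great', 'the movie was boring']
toy_dict = build_dictionary(toy_sentences, vocabulary_size=5)
print(toy_dict)
# {'RARE': 0, 'the': 1, 'movie': 2, 'was': 3, 'great': 4}
# index 0 is reserved for rare/unknown words; the rest are ranked by frequency,
# and 'boring' falls outside the top vocabulary_size-1 words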

Next we convert the text into integers:

# Turn text data into lists of integers from dictionary
def text_to_numbers(sentences, word_dict):
    # Initialize the returned data
    data = []
    for sentence in sentences:
        sentence_data = []
        # For each word, either use selected index or rare word index
        for word in sentence.split():  # split into words; iterating over the raw string would yield characters
            if word in word_dict:
                word_ix = word_dict[word]
            else:
                word_ix = 0
            sentence_data.append(word_ix)
        data.append(sentence_data)
    return(data)

Nothing special here: a double loop that maps each word of each sentence to its index in the dictionary (note the split(): iterating over the raw string would give characters instead of words). Words not in the dictionary are mapped to index 0, the RARE bucket.
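Continuing the toy example from above:

print(text_to_numbers(['the movie was boring', 'great movie'], toy_dict))
# [[1, 2, 3, 0], [4, 2]]  -- 'boring' is out of vocabulary, so it maps to 0 (RARE)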

Next comes batch generation; the details are explained in the comments inside the code.

# Generate data randomly (N words behind, target, N words ahead)
def generate_batch_data(sentences, batch_size, window_size, method='skip_gram'):
    # Fill up data batch
    batch_data = []
    label_data = []
    while len(batch_data) < batch_size:
        # select random sentence to start
        # Index into the list directly; np.random.choice on a ragged list of lists can fail
        rand_sentence = sentences[int(np.random.choice(len(sentences)))]
        # Generate consecutive windows to look at
        # Build a window around each position; note the first few windows have fewer than 2*window_size+1 words
        window_sequences = [rand_sentence[max((ix-window_size),0):(ix+window_size+1)] for ix, x in enumerate(rand_sentence)]
        # Denote which element of each window is the center word of interest
        # Index of the center word within each window
        label_indices = [ix if ix < window_size else window_size for ix, x in enumerate(window_sequences)]
        # A partial example of what the two lines above produce:
        # window_sequences: [[0, 0, 2451], [0, 0, 2451, 0], [0, 0, 2451, 0, 1304], [0, 2451, 0, 1304, 2713]]
        # label_indices:    [0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]


        # Pull out center word of interest for each window and create a tuple for each window
        if method=='skip_gram':
            batch_and_labels = [(x[y], x[:y] + x[(y+1):]) for x,y in zip(window_sequences, label_indices)]
            # batch_and_labels looks like: [(0, [0, 2451]), (0, [0, 2451, 0]), (2451, [0, 0, 0, 1304])]
            # Make it in to a big list of tuples (target word, surrounding word)
            tuple_data = [(x, y_) for x,y in batch_and_labels for y_ in y]
            # This pairs each center word one-to-one with every one of its surrounding words
            #[(0, 0), (0, 2451), (0, 0), (0, 2451), (0, 0), (2451, 0), (2451, 0), (2451, 0)]
        elif method=='cbow':
            batch_and_labels = [(x[:y] + x[(y+1):], x[y]) for x,y in zip(window_sequences, label_indices)]
            # Make it in to a big list of tuples (target word, surrounding word)
            tuple_data = [(x_, y) for x,y in batch_and_labels for x_ in x]
        else:
            raise ValueError('Method {} not implemented yet.'.format(method))

        # extract batch and labels
        batch, labels = [list(x) for x in zip(*tuple_data)]
        batch_data.extend(batch[:batch_size])
        label_data.extend(labels[:batch_size])
    # Trim batch and label at the end
    batch_data = batch_data[:batch_size]
    label_data = label_data[:batch_size]

    # Convert to numpy array
    batch_data = np.array(batch_data)
    label_data = np.transpose(np.array([label_data]))

    return(batch_data, label_data)

Our skip-gram model uses a center word to predict the words around it, but after all this preprocessing you may have noticed that the generated data and labels are one-to-one: each training example pairs one center word with one surrounding word, rather than one center word with all of its surrounding words at once. I believe this is mainly for convenience of training, and since training the embeddings is the real goal, we are nominally building a prediction model but we do not particularly care how accurate it gets; all we want is for the embedding matrix to end up being a good representation of the words.
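As a quick check, once text_data has been built (as in the full code further down), you can inspect one batch; the exact numbers depend on which sentence gets sampled:

batch, labels = generate_batch_data(text_data, batch_size=50, window_size=2, method='skip_gram')
print(batch.shape, labels.shape)      # (50,) and (50, 1)
print(batch[:5], labels[:5].ravel())  # center-word indices and their single surrounding-word labels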

word2vec has two training architectures, and their difference deserves a brief mention, even though at the level of basic concepts there is not much to it. CBOW uses the surrounding words to predict the center word, and skip-gram inverts that process. The motivation for the inversion is that CBOW smooths over a lot of the distributional information (it treats an entire context as a single observation), which tends to help on smaller datasets, whereas skip-gram treats every context-target pair as a new observation, which tends to work better on large datasets.

Now we get to the core of training word vectors. The basic task is still prediction, so the obvious loss function is the softmax used in a neural language model. But then for every single training example we would have to compute, over the entire vocabulary, the probability that each word is the target word. That is very expensive, and the cost again scales with the vocabulary size. Our algorithm engineers are no slouches, though, and they came up with hierarchical softmax and negative sampling. What is hierarchical softmax? Here I will borrow Andrew Ng's description: instead of deciding in one shot which of the 10,000 classes (the vocabulary size) the word belongs to, imagine a classifier that first tells you whether the target word is in the first 5,000 words of the vocabulary or in the last 5,000; if it says the first 5,000, a second classifier tells you whether it is in the first 2,500 or the second 2,500, and so on, until you reach the classifier that pins down exactly which word it is, i.e. a leaf of the tree (labelled 3 in the figure below). With a tree-shaped classifier like this, every internal node can be a binary classifier such as logistic regression, so you never have to sum over all 10,000 words of the vocabulary for a single prediction. The computational cost of such a classification tree scales with the logarithm of the vocabulary size rather than linearly with it. This is the hierarchical softmax classifier.
[Figure 3: a hierarchical softmax classification tree (figure referenced from Andrew Ng's course slides)]

In practice, the hierarchical softmax tree is built so that common words sit near the top while rare words like "durian" sit deeper in the tree (the classification tree labelled 2 in the figure above). Because common words occur more frequently, you want to reach words like "the" and "of" after only a few decisions, while a rarely seen word like "durian" can afford to live deeper in the tree, since you seldom need to go that far down. There are various heuristics for constructing the tree of a hierarchical softmax classifier.
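To make the mechanics concrete, here is a minimal sketch of how a word's probability is computed as a product of binary decisions along its tree path. The tree, the path, and the vectors below are invented for illustration; real implementations typically build a Huffman tree over word frequencies:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

embedding_size = 4
# One vector per internal tree node (toy values)
node_vectors = {'root': np.random.randn(embedding_size),
                'left': np.random.randn(embedding_size)}

# Hypothetical path for one word: at 'root' go left (+1), at 'left' go right (-1)
path = [('root', +1), ('left', -1)]

def hierarchical_softmax_prob(context_vec, path, node_vectors):
    # P(word | context) = product over path nodes of sigmoid(direction * node . context)
    p = 1.0
    for node, direction in path:
        p *= sigmoid(direction * node_vectors[node] @ context_vec)
    return p

context_vec = np.random.randn(embedding_size)
print(hierarchical_softmax_prob(context_vec, path, node_vectors))
# cost is O(log V) sigmoids per word instead of an O(V) softmax over the whole vocabulary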

Negative sampling is even simpler computationally: for each true example it draws k negative samples, where k is usually quite small, between 2 and 5. Then, as in the figure below, the loss becomes essentially a binary classification objective: the softmax is replaced by logistic regression, which cuts the amount of computation drastically. TensorFlow uses a similar approach (the NCE loss we call below).
[Figure 4: negative sampling turns the softmax into a set of binary (logistic regression) classifications]
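Here is a minimal sketch of the negative-sampling objective for one (center, context) pair; everything below is toy data, and in the actual code tf.nn.nce_loss implements a closely related noise-contrastive loss for us:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

embedding_size, k = 4, 3
center_vec = np.random.randn(embedding_size)          # embedding of the center word
true_context_vec = np.random.randn(embedding_size)    # output vector of the true context word
negative_vecs = np.random.randn(k, embedding_size)    # output vectors of k sampled "noise" words

# Maximize log sigma(u_pos . v) + sum over negatives of log sigma(-u_neg . v),
# i.e. minimize the negative log likelihood of k+1 binary classifications
loss = -np.log(sigmoid(true_context_vec @ center_vec))
loss -= np.sum(np.log(sigmoid(-negative_vecs @ center_vec)))
print(loss)   # only k+1 dot products per example, instead of one per vocabulary word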

The official TensorFlow word2vec tutorial covers all of this as well; if anything here is unclear, it is worth a read. It explains things very clearly, and anyone with some grounding in word-vector theory can follow it.

Now for the core code: defining and training the model.

# Define Embeddings:
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

# NCE loss parameters
nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size],
                                               stddev=1.0 / np.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

# Create data/target placeholders
x_inputs = tf.placeholder(tf.int32, shape=[batch_size])
y_target = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

# Lookup the word embedding:
embed = tf.nn.embedding_lookup(embeddings, x_inputs)

# Get loss from prediction
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights, biases=nce_biases,
                                     labels=y_target, inputs=embed,
                                     num_sampled=num_sampled, num_classes=vocabulary_size))

# Create optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)

Here is the complete code:

import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import random
import os
import string
import requests
import collections
import io
import tarfile
import urllib.request
from nltk.corpus import stopwords
from tensorflow.python.framework import ops
ops.reset_default_graph()

os.chdir(os.path.dirname(os.path.realpath(__file__)))

# Start a graph session
sess = tf.Session()

# Declare model parameters
batch_size = 50
embedding_size = 200
vocabulary_size = 10000
generations = 50000
print_loss_every = 500

num_sampled = int(batch_size/2)    # Number of negative examples to sample.
window_size = 2       # How many words to consider left and right.

# Declare stop words
stops = stopwords.words('english')

# We pick five test words. We are expecting synonyms to appear
print_valid_every = 2000
valid_words = ['cliche', 'love', 'hate', 'silly', 'sad']
# Later we will have to transform these into indices

# Load the movie review data
# Check if data was downloaded, otherwise download it and save for future use
def load_movie_data():
    save_folder_name = 'temp'
    pos_file = os.path.join(save_folder_name, 'rt-polarity.pos')
    neg_file = os.path.join(save_folder_name, 'rt-polarity.neg')

    # Check if files are already downloaded
    if os.path.exists(save_folder_name):
        pos_data = []
        with open(pos_file, 'r', encoding='latin-1') as temp_pos_file:
            for row in temp_pos_file:
                pos_data.append(row)
        neg_data = []
        with open(neg_file, 'r', encoding='latin-1') as temp_neg_file:
            for row in temp_neg_file:
                neg_data.append(row)
    else: # If not downloaded, download and save
        movie_data_url = 'http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz'
        stream_data = urllib.request.urlopen(movie_data_url)
        tmp = io.BytesIO()
        while True:
            s = stream_data.read(16384)
            if not s:
                break
            tmp.write(s)
        # Close the stream and rewind the buffer only after the download has finished
        stream_data.close()
        tmp.seek(0)

        tar_file = tarfile.open(fileobj=tmp, mode="r:gz")
        pos = tar_file.extractfile('rt-polaritydata/rt-polarity.pos')
        neg = tar_file.extractfile('rt-polaritydata/rt-polarity.neg')
        # Save pos/neg reviews
        pos_data = []
        for line in pos:
            pos_data.append(line.decode('ISO-8859-1').encode('ascii',errors='ignore').decode())
        neg_data = []
        for line in neg:
            neg_data.append(line.decode('ISO-8859-1').encode('ascii',errors='ignore').decode())
        tar_file.close()
        # Write to file
        if not os.path.exists(save_folder_name):
            os.makedirs(save_folder_name)
        # Save files
        with open(pos_file, "w") as pos_file_handler:
            pos_file_handler.write(''.join(pos_data))
        with open(neg_file, "w") as neg_file_handler:
            neg_file_handler.write(''.join(neg_data))
    texts = pos_data + neg_data
    target = [1]*len(pos_data) + [0]*len(neg_data)
    return(texts, target)

texts, target = load_movie_data()

# Normalize text
def normalize_text(texts, stops):
    # Lower case
    texts = [x.lower() for x in texts]

    # Remove punctuation
    texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]

    # Remove numbers
    texts = [''.join(c for c in x if c not in '0123456789') for x in texts]

    # Remove stopwords
    texts = [' '.join([word for word in x.split() if word not in (stops)]) for x in texts]

    # Trim extra whitespace
    texts = [' '.join(x.split()) for x in texts]

    return(texts)

texts = normalize_text(texts, stops)

# Texts must contain at least 3 words
target = [target[ix] for ix, x in enumerate(texts) if len(x.split()) > 2]
texts = [x for x in texts if len(x.split()) > 2]

# Build dictionary of words
# The dictionary is built using Python's collections module
def build_dictionary(sentences, vocabulary_size):
    # Turn sentences (list of strings) into lists of words
    split_sentences = [s.split() for s in sentences]
    words = [x for sublist in split_sentences for x in sublist]

    # Initialize list of [word, word_count] for each word, starting with unknown
    count = [['RARE', -1]]

    # Now add most frequent words, limited to the N-most frequent (N=vocabulary size)
    count.extend(collections.Counter(words).most_common(vocabulary_size-1))

    # Now create the dictionary
    word_dict = {}
    # For each word, that we want in the dictionary, add it, then make it
    # the value of the prior dictionary length
    # Note: a word's value in the dictionary is its frequency rank, not its raw count
    for word, word_count in count:
        word_dict[word] = len(word_dict)

    return(word_dict)


# Turn text data into lists of integers from dictionary
# Each word of every sentence is replaced by its index in the dictionary
def text_to_numbers(sentences, word_dict):
    # Initialize the returned data
    data = []
    for sentence in sentences:
        sentence_data = []
        # For each word, either use selected index or rare word index
        for word in sentence.split():  # split into words; iterating over the raw string would yield characters
            if word in word_dict:
                word_ix = word_dict[word]
            else:
                word_ix = 0
            sentence_data.append(word_ix)
        data.append(sentence_data)
    return(data)

# Build our data set and dictionaries
word_dictionary = build_dictionary(texts, vocabulary_size)
word_dictionary_rev = dict(zip(word_dictionary.values(), word_dictionary.keys()))
text_data = text_to_numbers(texts, word_dictionary)

# Get validation word keys
valid_examples = [word_dictionary[x] for x in valid_words]

# Generate data randomly (N words behind, target, N words ahead)
def generate_batch_data(sentences, batch_size, window_size, method='skip_gram'):
    # Fill up data batch
    batch_data = []
    label_data = []
    while len(batch_data) < batch_size:
        # select random sentence to start
        # Index into the list directly; np.random.choice on a ragged list of lists can fail
        rand_sentence = sentences[int(np.random.choice(len(sentences)))]
        # Generate consecutive windows to look at
        window_sequences = [rand_sentence[max((ix-window_size),0):(ix+window_size+1)] for ix, x in enumerate(rand_sentence)]
        # Denote which element of each window is the center word of interest
        label_indices = [ix if ix < window_size else window_size for ix, x in enumerate(window_sequences)]

        # Pull out center word of interest for each window and create a tuple for each window
        if method=='skip_gram':
            batch_and_labels = [(x[y], x[:y] + x[(y+1):]) for x,y in zip(window_sequences, label_indices)]
            # Make it in to a big list of tuples (target word, surrounding word)
            tuple_data = [(x, y_) for x,y in batch_and_labels for y_ in y]
        elif method=='cbow':
            batch_and_labels = [(x[:y] + x[(y+1):], x[y]) for x,y in zip(window_sequences, label_indices)]
            # Make it in to a big list of tuples (target word, surrounding word)
            tuple_data = [(x_, y) for x,y in batch_and_labels for x_ in x]
        else:
            raise ValueError('Method {} not implemented yet.'.format(method))

        # extract batch and labels
        batch, labels = [list(x) for x in zip(*tuple_data)]
        batch_data.extend(batch[:batch_size])
        label_data.extend(labels[:batch_size])
    # Trim batch and label at the end
    batch_data = batch_data[:batch_size]
    label_data = label_data[:batch_size]

    # Convert to numpy array
    batch_data = np.array(batch_data)
    label_data = np.transpose(np.array([label_data]))

    return(batch_data, label_data)


# Define Embeddings:
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

# NCE loss parameters
nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size],
                                               stddev=1.0 / np.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

# Create data/target placeholders
x_inputs = tf.placeholder(tf.int32, shape=[batch_size])
y_target = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

# Lookup the word embedding:
embed = tf.nn.embedding_lookup(embeddings, x_inputs)

# Get loss from prediction
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights, biases=nce_biases,
                                     labels=y_target, inputs=embed,
                                     num_sampled=num_sampled, num_classes=vocabulary_size))

# Create optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)

# Cosine similarity between words
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

#Add variable initializer.
init = tf.global_variables_initializer()
sess.run(init)

# Run the skip gram model.
loss_vec = []
loss_x_vec = []
for i in range(generations):
    batch_inputs, batch_labels = generate_batch_data(text_data, batch_size, window_size)
    feed_dict = {x_inputs : batch_inputs, y_target : batch_labels}

    # Run the train step
    sess.run(optimizer, feed_dict=feed_dict)

    # Return the loss
    if (i+1) % print_loss_every == 0:
        loss_val = sess.run(loss, feed_dict=feed_dict)
        loss_vec.append(loss_val)
        loss_x_vec.append(i+1)
        print("Loss at step {} : {}".format(i+1, loss_val))

    # Validation: Print some random words and top 5 related words
    if (i+1) % print_valid_every == 0:
        sim = sess.run(similarity, feed_dict=feed_dict)
        for j in range(len(valid_words)):
            valid_word = word_dictionary_rev[valid_examples[j]]
            top_k = 5 # number of nearest neighbors
            nearest = (-sim[j, :]).argsort()[1:top_k+1]
            log_str = "Nearest to {}:".format(valid_word)
            for k in range(top_k):
                close_word = word_dictionary_rev[nearest[k]]
                log_str = "%s %s," % (log_str, close_word)
            print(log_str)
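
Once training finishes, the learned embedding matrix itself is the artifact we actually want. Here is a short sketch of how you might pull it out and look up a word's vector (the save filename is just an example, not from the original post):

# After the training loop, extract the (normalized) embedding matrix
final_embeddings = sess.run(normalized_embeddings)
print(final_embeddings.shape)          # (vocabulary_size, embedding_size)

# Vector for a single word, e.g. 'love'
love_vec = final_embeddings[word_dictionary['love']]
print(love_vec[:5])

# Optionally persist it for later use (filename is illustrative)
np.save('skipgram_embeddings.npy', final_embeddings)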
