TensorFlow Study Notes (4): Implementing GloVe in TensorFlow

The code in this article comes from Natural Language Processing with TensorFlow by Thushan Ganegedara.

Table of Contents

  • 0 Preface
  • 1 Downloading the Dataset
  • 2 Reading the Dataset
  • 3 Building the Dictionary
  • 4 Generating Batch Data for GloVe
  • 5 Building the Co-occurrence Matrix
  • 6 The GloVe Algorithm
    • 6.1 Defining Hyperparameters
    • 6.2 Defining Inputs and Outputs
    • 6.3 Defining Model Parameters and Other Variables
    • 6.4 Defining the Model Computation
    • 6.5 Similarity Computation
    • 6.6 Defining the Optimizer
    • 6.7 Running the GloVe Model
  • References

0 Preface

The code in this article comes from Natural Language Processing with TensorFlow by Thushan Ganegedara. On top of the author's code I have added some comments of my own (the author's original comments are in English; the additional ones are mine). The code has been uploaded to GitHub; here is the link.

If you find any mistakes or anything that is not explained clearly, please leave a comment below and I will correct it once I see it.

If you have questions about how GloVe itself works, you can refer to my earlier article: GloVe: Principles and Formula Walkthrough.

The TensorFlow version used here is 1.8.0.

1 Downloading the Dataset

# Imports used throughout this article
import os
import bz2
import random
import collections
from math import ceil
from urllib.request import urlretrieve

import nltk
import numpy as np
import tensorflow as tf
from scipy.sparse import lil_matrix

nltk.download('punkt')  # tokenizer models required by nltk.word_tokenize

url = 'http://www.evanjones.ca/software/'

def maybe_download(filename, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
    if not os.path.exists(filename):
        filename, _ = urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified %s' % filename)
    else:
        print(statinfo.st_size)
        raise Exception(
            'Failed to verify ' + filename + '. Can you get to it with a browser?')
    return filename

filename = maybe_download('wikipedia2text-extracted.txt.bz2', 18377035)

If you would rather not download the file this way, you can also fetch it directly from http://www.evanjones.ca/software/wikipedia2text-extracted.txt.bz2.

2 Reading the Dataset

This step reads the data into strings, converts it to lowercase, and tokenizes it, processing 1 MB at a time.

def read_data(filename):
    """
    Extract the first file enclosed in a zip file as a list of words
    and pre-processes it using the nltk python library
    """

    with bz2.BZ2File(filename) as f:

        data = []
        file_size = os.stat(filename).st_size
        chunk_size = 1024 * 1024 # reading 1 MB at a time as the dataset is moderately large
        print('Reading data...')
        for i in range(ceil(file_size//chunk_size)+1):
            bytes_to_read = min(chunk_size,file_size-(i*chunk_size))
            file_string = f.read(bytes_to_read).decode('utf-8')
            file_string = file_string.lower()  # convert the text to lowercase
            # tokenizes a string to words residing in a list
            file_string = nltk.word_tokenize(file_string)  # tokenize into words
            data.extend(file_string)
    return data

words = read_data(filename)
print('Data size %d' % len(words))
token_count = len(words)

print('Example words (start): ',words[:10])
print('Example words (end): ',words[-10:])

Output:

Reading data...
Data size 3361192
Example words (start):  ['propaganda', 'is', 'a', 'concerted', 'set', 'of', 'messages', 'aimed', 'at', 'influencing']
Example words (end):  ['favorable', 'long-term', 'outcomes', 'for', 'around', 'half', 'of', 'those', 'diagnosed', 'with']

3 Building the Dictionary

The dictionary is built according to the following rules. To make the structures below easier to understand, we use "I like to go to school" as an example.

  • dictionary: maps each word to an ID (e.g. {'I': 0, 'like': 1, 'to': 2, 'go': 3, 'school': 4})
  • reverse_dictionary: maps each ID back to its word (e.g. {0: 'I', 1: 'like', 2: 'to', 3: 'go', 4: 'school'})
  • count: a list of tuples, each holding a word and its frequency (word, frequency) (e.g. [('I', 1), ('like', 1), ('to', 2), ('go', 1), ('school', 1)])
  • data: the words of the text, each replaced by its ID (e.g. [0, 1, 2, 3, 2, 4])

The token UNK is used to represent rare words.

Only the 50,000 most common words are kept in the dictionary.
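
To make these four structures concrete, here is a minimal sketch of my own (not the author's code) that builds them for the toy sentence. Since most_common sorts by frequency, the exact IDs can differ from the bullet example above, which assigns IDs in order of appearance.

import collections

toy_words = ['I', 'like', 'to', 'go', 'to', 'school']

# count: (word, frequency) pairs, most frequent first
toy_count = collections.Counter(toy_words).most_common()

# dictionary: word -> ID, assigned in order of frequency
toy_dictionary = {word: idx for idx, (word, _) in enumerate(toy_count)}

# reverse_dictionary: ID -> word
toy_reverse_dictionary = {idx: word for word, idx in toy_dictionary.items()}

# data: the original text with every word replaced by its ID
toy_data = [toy_dictionary[w] for w in toy_words]

print(toy_count)  # [('to', 2), ('I', 1), ('like', 1), ('go', 1), ('school', 1)]
print(toy_data)   # [1, 2, 0, 3, 0, 4]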

# we restrict our vocabulary size to 50000
vocabulary_size = 50000 

def build_dataset(words):
    count = [['UNK', -1]]
    # Gets only the vocabulary_size most common words as the vocabulary
    # All the other words will be replaced with UNK token
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = dict()

    # Create an ID for each word by giving the current length of the dictionary
    # And adding that item to the dictionary
    for word, _ in count:
        dictionary[word] = len(dictionary)
    
    data = list()
    unk_count = 0
    # Traverse through all the text we have and produce a list
    # where each element corresponds to the ID of the word found at that index
    for word in words:
        # If word is in the dictionary use the word ID,
        # else use the ID of the special token "UNK"
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count = unk_count + 1
        data.append(index)
    
    # update the count variable with the number of UNK occurences
    count[0][1] = unk_count
  
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys())) 
    # Make sure the dictionary is of size of the vocabulary
    assert len(dictionary) == vocabulary_size
    
    return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10])
del words  # Hint to reduce memory.

Since this is the same author and the same code, I already walked through the logic of this snippet in part 4 of my earlier post, TensorFlow Study Notes (3): Implementing Word2Vec in TensorFlow. If anything about this code is unclear, you can follow that link.

Output:

Most common words (+UNK) [['UNK', 68751], ('the', 226893), (',', 184013), ('.', 120919), ('of', 116323)]
Sample data [1721, 9, 8, 16479, 223, 4, 5168, 4459, 26, 11597]

4 Generating Batch Data for GloVe

batch holds the center words; labels holds the words in each center word's context window. For each center word we read 2 * window_size + 1 words at a time, called a span. Within each span there is 1 center word and 2 * window_size context words. The function keeps going this way until it has produced batch_size data points. Whenever we reach the end of the word sequence, we start over from the beginning.

batch is a $1 \times 8$ vector; labels is an $8 \times 1$ vector; weights is a $1 \times 8$ vector that stores, for each pair of word $i$ and word $j$, the value $\frac{1}{d}$, where $d$ is the distance between the two words.

data_index = 0

def generate_batch(batch_size, window_size):
    # data_index is updated by 1 every time we read a data point
    global data_index 
    
    # two numpy arrays to hold target words (batch)
    # and context words (labels)
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    weights = np.ndarray(shape=(batch_size), dtype=np.float32)

    # span defines the total window size, where
    # data we consider at an instance looks as follows. 
    # [ skip_window target skip_window ]
    span = 2 * window_size + 1 
    
    # The buffer holds the data contained within the span
    buffer = collections.deque(maxlen=span)
  
    # Fill the buffer and update the data_index
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
  
    # This is the number of context words we sample for a single target word
    num_samples = 2*window_size 

    # We break the batch reading into two for loops
    # The inner for loop fills in the batch and labels with
    # num_samples data points using data contained within the span
    # The outer for loop repeats this for batch_size//num_samples times
    # to produce a full batch
    for i in range(batch_size // num_samples):
        k=0
        # avoid the target word itself as a prediction
        # fill in batch and label numpy arrays
        for j in list(range(window_size))+list(range(window_size+1,2*window_size+1)):
            batch[i * num_samples + k] = buffer[window_size]
            labels[i * num_samples + k, 0] = buffer[j]
            # since j skips over window_size, j - window_size is never 0
            weights[i * num_samples + k] = abs(1.0/(j - window_size))
            k += 1 
    
        # Every time we read num_samples data points,
        # we have created the maximum number of datapoints possible
        # within a single span, so we need to move the span by 1
        # to create a fresh new span
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels, weights

print('data:', [reverse_dictionary[di] for di in data[:9]])

for window_size in [2, 4]:
    data_index = 0
    batch, labels, weights = generate_batch(batch_size=8, window_size=window_size)
    print('\nwith window_size = %d:' %window_size)
    print('    batch:', [reverse_dictionary[bi] for bi in batch])
    print('    labels:', [reverse_dictionary[li] for li in labels.reshape(8)])
    print('    weights:', [w for w in weights])

Here weights reflects exactly what the paper describes:

In all cases we use a decreasing weighting function, so that word pairs that are $d$ words apart contribute $1/d$ to the total count.

Output:

data: ['propaganda', 'is', 'a', 'concerted', 'set', 'of', 'messages', 'aimed', 'at']

with window_size = 2:
    batch: ['a', 'a', 'a', 'a', 'concerted', 'concerted', 'concerted', 'concerted']
    labels: ['propaganda', 'is', 'concerted', 'set', 'is', 'a', 'set', 'of']
    weights: [0.5, 1.0, 1.0, 0.5, 0.5, 1.0, 1.0, 0.5]

with window_size = 4:
    batch: ['set', 'set', 'set', 'set', 'set', 'set', 'set', 'set']
    labels: ['propaganda', 'is', 'a', 'concerted', 'of', 'messages', 'aimed', 'at']
    weights: [0.25, 0.33333334, 0.5, 1.0, 1.0, 0.5, 0.33333334, 0.25]

Let's look at the output for window_size = 4. In this window the center word is set; its left context is ['propaganda', 'is', 'a', 'concerted'] and its right context is ['of', 'messages', 'aimed', 'at']. The entries of batch and labels correspond one to one (for example, labels[0] is a context word of batch[0]). Taking propaganda as an example, its distance from set is 4 ($4-0=4$), so weights[0] = 1/4 = 0.25. The window_size = 2 case works the same way.
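
The weights row for window_size = 4 can be reproduced directly from the inner-loop expression abs(1.0/(j - window_size)); a quick standalone check, not part of the original code:

window_size = 4
# j runs over the context positions inside the span, skipping the center word at index window_size
context_positions = list(range(window_size)) + list(range(window_size + 1, 2 * window_size + 1))
print([abs(1.0 / (j - window_size)) for j in context_positions])
# [0.25, 0.3333..., 0.5, 1.0, 1.0, 0.5, 0.3333..., 0.25]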

5 Building the Co-occurrence Matrix

# We are creating the co-occurrence matrix as a row-based list-of-lists sparse matrix (lil_matrix) from scipy.
cooc_data_index = 0
dataset_size = len(data) # We iterate through the full text
skip_window = 4 # How many words to consider left and right.

# The sparse matrix that stores the word co-occurences
cooc_mat = lil_matrix((vocabulary_size, vocabulary_size), dtype=np.float32)

print(cooc_mat.shape)
def generate_cooc(batch_size, skip_window):
    '''
    Generate co-occurence matrix by processing batches of data
    '''
    global data_index  # reset the global index so we start from the beginning of the corpus
    data_index = 0
    print('Running %d iterations to compute the co-occurance matrix'%(dataset_size//batch_size))
    for i in range(dataset_size//batch_size):
        # Printing progress
        if i>0 and i%100000==0:
            print('\tFinished %d iterations'%i)
            
        # Generating a single batch of data
        batch, labels, weights = generate_batch(batch_size, skip_window)
        labels = labels.reshape(-1)
        
        # Incrementing the sparse matrix entries accordingly
        # inp: ID of the center word i
        # lbl: ID of the context word j
        # w: weighted co-occurrence of i and j
        for inp,lbl,w in zip(batch,labels,weights):            
            cooc_mat[inp,lbl] += (1.0*w)

# Generate the matrix
generate_cooc(8,skip_window)    

# Just printing some parts of co-occurance matrix
print('Sample chunks of co-occurance matrix')


# Basically finds the words with the highest co-occurrence counts for several chosen words
for i in range(10):
    idx_target = i
    
    # get the ith row of the sparse matrix and make it dense
    ith_row = cooc_mat.getrow(idx_target)     
    ith_row_dense = ith_row.toarray('C').reshape(-1)  # densify the row; entries missing from ith_row become 0
    
    # select target words only with a reasonable words around it.
    # pick a word whose total co-occurrence count sum(X_i) is between 10 and 50000
    while np.sum(ith_row_dense)<10 or np.sum(ith_row_dense)>50000:
        # Choose a random word
        idx_target = np.random.randint(0,vocabulary_size)
        
        # get the ith row of the sparse matrix and make it dense
        ith_row = cooc_mat.getrow(idx_target) 
        ith_row_dense = ith_row.toarray('C').reshape(-1)    
        
    print('\nTarget Word: "%s"'%reverse_dictionary[idx_target])
        
    # sort_indices: indices that sort ith_row_dense in ascending order of co-occurrence count
    sort_indices = np.argsort(ith_row_dense).reshape(-1) # indices with highest count of ith_row_dense
    # flip to descending order (largest counts first)
    sort_indices = np.flip(sort_indices,axis=0) # reverse the array (to get max values to the start)

    # printing several context words to make sure cooc_mat is correct
    print('Context word:',end='')
    for j in range(10):        
        idx_context = sort_indices[j]       
        print('"%s"(id:%d,count:%.2f), '%(reverse_dictionary[idx_context],idx_context,ith_row_dense[idx_context]),end='')
    print()

Here the author uses lil_matrix from scipy.sparse. As the original paper points out, the co-occurrence matrix is sparse, so lil_matrix saves memory. lil_matrix(arg1, shape=None, dtype=None, copy=False) is a row-based linked-list sparse matrix: it keeps the non-zero elements in two lists, data, which holds the non-zero values of each row, and rows, which holds the column indices of those values. This format is also well suited to adding elements one at a time and to fast row-wise access [4]. In short, lil_matrix stores only the row and column positions and values of the non-zero elements; every other position is 0.
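
As a tiny standalone illustration of how lil_matrix stores and returns entries (the 5 x 5 size is arbitrary, chosen just for this demo):

import numpy as np
from scipy.sparse import lil_matrix

m = lil_matrix((5, 5), dtype=np.float32)
m[1, 2] += 0.5   # only non-zero entries are actually stored
m[1, 2] += 0.5
m[3, 0] = 2.0

print(m.rows)    # per-row column indices of the non-zero entries
print(m.data)    # per-row values of the non-zero entries
print(m.getrow(1).toarray('C').reshape(-1))  # dense copy of row 1: [0. 0. 1. 0. 0.]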

Partial output:

(50000, 50000)
Running 420149 iterations to compute the co-occurance matrix
	Finished 100000 iterations
	Finished 200000 iterations
	Finished 300000 iterations
	Finished 400000 iterations
Sample chunks of co-occurance matrix

...

Target Word: "to"
Context word:"the"(id:1,count:2481.16), ","(id:2,count:989.33), "."(id:3,count:689.00), "a"(id:8,count:579.83), "and"(id:5,count:573.08), "be"(id:30,count:553.83), "of"(id:4,count:470.50), "UNK"(id:0,count:470.00), "in"(id:6,count:412.25), "is"(id:9,count:283.42), 

The logic here is straightforward: each call grabs 8 data points (batch_size), so 420149 iterations are needed in total. Each grab yields one center word, its 8 context words (window_size = 4), and the weighted co-occurrence of the center word with each context word in that window, which is then added to the co-occurrence matrix.

6 The GloVe Algorithm

6.1 Defining Hyperparameters

batch_size: number of samples in a single batch; embedding_size: dimension of the embedding vectors; window_size: size of the context window; valid_examples: randomly chosen validation samples (fixed as constants once selected); epsilon: keeps the $\log$ in the loss function from diverging.

batch_size = 128 # Data points in a single batch
embedding_size = 128 # Dimension of the embedding vector.
window_size = 4 # How many words to consider left and right.

# We pick a random validation set to sample nearest neighbors
valid_size = 16 # Random set of words to evaluate similarity on.
# We sample valid datapoints randomly from a large window without always being deterministic
valid_window = 50

# When selecting valid examples, we select some of the most frequent words as well as
# some moderately rare words as well
valid_examples = np.array(random.sample(range(valid_window), valid_size))
valid_examples = np.append(valid_examples,random.sample(range(1000, 1000+valid_window), valid_size),axis=0)

num_sampled = 32 # Number of negative examples to sample.

epsilon = 1 # used for the stability of log in the loss function

6.2 Defining Inputs and Outputs

Create placeholders for the training inputs and labels of each batch, and a constant tensor for the validation set.

tf.reset_default_graph()

# Training input data (target word IDs).
train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
# Training input label data (context word IDs)
train_labels = tf.placeholder(tf.int32, shape=[batch_size])
# Validation input data, we don't need a placeholder
# as we have already defined the IDs of the words selected
# as validation data
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

Here valid_dataset corresponds to valid_examples from Section 6.1, while train_dataset and train_labels are used within each batch to look up the word vectors; see Section 6.4 for details.

6.3 Defining Model Parameters and Other Variables

in_embeddings: $W$, $50000 \times 128$; in_bias_embeddings: $b$, a length-$50000$ vector; out_embeddings: $\tilde{W}$, $50000 \times 128$; out_bias_embeddings: $\tilde{b}$, a length-$50000$ vector.

The word embeddings are all initialized uniformly in $[-1, 1]$, and the biases uniformly in $[0, 0.01]$.

# Variables.
in_embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0),name='embeddings')
in_bias_embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size],0.0,0.01,dtype=tf.float32),name='embeddings_bias')

out_embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0),name='embeddings')
out_bias_embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size],0.0,0.01,dtype=tf.float32),name='embeddings_bias')

This defines the word embedding matrices $W$ and $\tilde{W}$, as well as the bias terms $b$ and $\tilde{b}$ that appear in the loss function.

6.4 Defining the Model Computation

Four lookups are defined: embed_in, embed_out, embed_bias_in, embed_bias_out.

weights_x: a vector of length batch_size holding the weighting function $f(X_{ij})$.

x_ij: a vector of length batch_size holding the co-occurrence value of words $i$ and $j$, $X_{ij}$.

Loss function: $J=\sum_{i, j=1}^V f(X_{ij}) \left(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log(1 + X_{ij})\right)^2$

# Look up embeddings for inputs and outputs
# Have two seperate embedding vector spaces for inputs and outputs
embed_in = tf.nn.embedding_lookup(in_embeddings, train_dataset)
embed_out = tf.nn.embedding_lookup(out_embeddings, train_labels)
embed_bias_in = tf.nn.embedding_lookup(in_bias_embeddings,train_dataset)
embed_bias_out = tf.nn.embedding_lookup(out_bias_embeddings,train_labels)

# weights used in the cost function
weights_x = tf.placeholder(tf.float32,shape=[batch_size],name='weights_x') 
# Cooccurence value for that position
x_ij = tf.placeholder(tf.float32,shape=[batch_size],name='x_ij')

# Compute the loss defined in the paper. Note that 
# I'm not following the exact equation given (which is computing a pair of words at a time)
# I'm calculating the loss for a batch at one time, but the calculations are identical.
# I also made an assumption about the bias, that it is a smaller type of embedding
loss = tf.reduce_mean(
    weights_x * (tf.reduce_sum(embed_in*embed_out,axis=1) + embed_bias_in + embed_bias_out - tf.log(epsilon+x_ij))**2)

Here each batch's train_dataset and train_labels are used to look up the word vectors and bias terms, and the results are plugged into the loss. Since the original paper notes that $\log(0)$ diverges, $\log(1 + X_{ij})$ is used to avoid the problem.
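
To tie the TensorFlow expression back to the formula, here is a small NumPy sketch of the same batch loss on random, made-up values (the array names simply mirror the tensors above; it is a cross-check, not part of the training code):

import numpy as np

batch, dim = 128, 128
epsilon = 1.0
rng = np.random.RandomState(0)

embed_in  = rng.uniform(-1, 1, (batch, dim))   # rows of W looked up for the batch (w_i)
embed_out = rng.uniform(-1, 1, (batch, dim))   # rows of W~ looked up for the batch (w~_j)
bias_in   = rng.uniform(0, 0.01, batch)        # b_i
bias_out  = rng.uniform(0, 0.01, batch)        # b~_j
x_ij      = rng.randint(0, 50, batch).astype(np.float64)  # co-occurrence values X_ij
f_xij     = np.minimum(1.0, (x_ij / 100.0) ** 0.75)       # weighting f(X_ij), x_max = 100

loss = np.mean(
    f_xij * (np.sum(embed_in * embed_out, axis=1)
             + bias_in + bias_out - np.log(epsilon + x_ij)) ** 2)
print(loss)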

6.5 Similarity Computation

This part uses cosine similarity to measure how similar words are; how it is used is covered in Section 6.7.

# Compute the similarity between minibatch examples and all embeddings.
# We use the cosine distance:
embeddings = (in_embeddings + out_embeddings)/2.0  # X = (U + V) / 2
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))  # L2 norm of each row
normalized_embeddings = embeddings / norm  # L2-normalize each row
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)  # pull out the rows for the validation words
similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))  # cosine similarity

The cosine similarity computation relies on L2 normalization, which makes $|\vec{A}| \times |\vec{B}| = 1$, so the dot product of two word vectors directly gives their cosine similarity.
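
The same idea in plain NumPy, with made-up 2-D vectors: once every row is L2-normalized, a matrix product of the rows gives the cosine similarities directly.

import numpy as np

emb = np.array([[3.0, 4.0],
                [1.0, 0.0],
                [0.0, 2.0]])

norm = np.sqrt(np.sum(emb ** 2, axis=1, keepdims=True))
normalized = emb / norm            # every row now has length 1

sim = normalized @ normalized.T    # entry (i, j) = cosine similarity of rows i and j
print(sim)
# row 0 vs row 1: 3/5 = 0.6, row 0 vs row 2: 4/5 = 0.8, diagonal entries = 1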

6.6 Defining the Optimizer

The Adagrad optimizer is used here.

# Optimizer.
optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)

6.7 Running the GloVe Model

Train for num_steps steps. At regular intervals the algorithm is evaluated on a fixed validation set, and the words closest to each validation word are printed.

From the results, as training progresses, the nearest neighbors of the validation words keep changing.

num_steps = 100001
glove_loss = []

average_loss = 0
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as session:
    
    tf.global_variables_initializer().run()
    print('Initialized')
    
    for step in range(num_steps):
        
        # generate a single batch (data,labels,co-occurance weights)
        batch_data, batch_labels, batch_weights = generate_batch(
            batch_size, skip_window) 
        
        # the co-occurrence matrix has already been computed, so the batch_weights returned by generate_batch are not needed here
        # Computing the weights required by the loss function
        batch_weights = [] # weighting used in the loss function
        batch_xij = [] # weighted frequency of finding i near j
        
        # Compute the weights for each datapoint in the batch
        for inp,lbl in zip(batch_data,batch_labels.reshape(-1)):  
            # 100: x_max, 0.75: 3/4, point_weight: f(X_ij), batch_xij: co-occurrence value of words i and j
            point_weight = (cooc_mat[inp,lbl]/100.0)**0.75 if cooc_mat[inp,lbl]<100.0 else 1.0 
            batch_weights.append(point_weight)
            batch_xij.append(cooc_mat[inp,lbl])
        batch_weights = np.clip(batch_weights,-100,1)
        batch_xij = np.asarray(batch_xij)
        
        # Populate the feed_dict and run the optimizer (minimize loss)
        # and compute the loss. Specifically we provide
        # train_dataset/train_labels: training inputs and training labels
        # weights_x: measures the importance of a data point with respect to how much those two words co-occur
        # x_ij: co-occurence matrix value for the row and column denoted by the words in a datapoint
        feed_dict = {train_dataset : batch_data.reshape(-1), train_labels : batch_labels.reshape(-1),
                     weights_x:batch_weights,x_ij:batch_xij}
        _, l = session.run([optimizer, loss], feed_dict=feed_dict)
        
        # Update the average loss variable
        average_loss += l
        if step % 2000 == 0:
            if step > 0:
                average_loss = average_loss / 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step %d: %f' % (step, average_loss))
            glove_loss.append(average_loss)
            average_loss = 0
        
        # Here we compute the top_k closest words for a given validation word
        # in terms of the cosine distance
        # We do this for all the words in the validation set
        # Note: This is an expensive step
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8 # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k+1]
                log = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log = '%s %s,' % (log, close_word)
                print(log)
            
    final_embeddings = normalized_embeddings.eval()

Partial output (showing step 0, i.e. the initial state, and step 100000):

Average loss at step 0: 8.672687
Nearest to ,: pitcher, discharges, pigs, tolerant, fuzzy, medium-, on-campus, eduskunta,
Nearest to this: mediastinal, destined, implementing, honolulu, non-mormon, juniors, tycho, powered,
Nearest to most: translating, absolute, 111, bechet, adam, aleksey, penetrators, rake,
Nearest to but: motown, ridged, beginnings, shareholder, resurfacing, english, intelligence, o'dea,
Nearest to is: higher-quality, kitchener, kelley, confronted, m15, stanislaus, depictions, buf,
Nearest to ): encyclopedic, commute, symbiotic, forecasts, 1993., 243-year, cenwealh, inclosure,
Nearest to not: toulon, discount, dunblane, vividly, recorded, olive, afrikaansche, german-speaking,
Nearest to with: tofu, expansive, penned, grids, 102, drought, merced, cunningham,
Nearest to ;: all-electric, internationally-recognised, czars, 1216, kana, immaculate, innings, wnba,
Nearest to a: non-residents, presumption, cephas, tau, stepfather, beside, aorist, vom,
Nearest to for: bitterroots, sx-64, weekday, edificio, sousley, self-proclaimed, whoever, liquid,
Nearest to have: dissenting, barret, psilocybin, massamba-débat, kopfstein, 5.5, fillmore, innovator,
Nearest to was: ., is, most, wheelchair, 1575, warm-blooded, dynamically, 1913.,
Nearest to 's: eoka, melancholia, downs, gallipoli, reichswehr, easter, chest, construed,
Nearest to were: 1138, djuna, 3, beni, high-grade, slander, agency, séamus,
Nearest to be: knelt, horrors, assistant, hospitalised, 1802, fierce, cinemas, magnified,
...
Average loss at step 100000: 0.019544
Nearest to ,: ., the, in, a, of, and, ,, is,
Nearest to this: ), (, ``, UNK, or, ., in, ,,
Nearest to most: ., the, of, ,, and, for, a, to,
Nearest to but: ), UNK, '', or, and, ,, in, .,
Nearest to is: 's, the, of, at, world, ., in, on,
Nearest to ): were, in, ., and, ,, the, by, is,
Nearest to not: (, ``, UNK, ), '', of, 's, the,
Nearest to with: been, had, to, has, be, that, a, may,
Nearest to ;: a, such, an, ,, for, and, with, is,
Nearest to a: the, was, ., in, and, ,, to, of,
Nearest to for: are, by, and, ,, in, to, the, was,
Nearest to have: is, was, that, also, this, not, has, a,
Nearest to was: ., of, in, and, ,, 's, for, to,
Nearest to 's: it, is, has, there, this, are, was, not,
Nearest to were: a, as, is, with, and, ,, to, for,
Nearest to be: was, it, when, had, that, his, in, ,,

From these results we can see that, as training proceeds, the nearest neighbors of the validation words keep changing and become more reasonable (for example, at initialization the words closest to be look random, while after 100000 steps they include was).

In terms of the overall code logic:

  1. Each iteration generates a batch of center words batch_data and their in-window context words batch_labels.
  2. Iterating over these word pairs, we compute the weighting function $f(X_{ij})=\left(\frac{X_{ij}}{x_{\rm max}}\right)^{0.75}$ and extract the co-occurrence value $X_{ij}$ (a standalone sketch of this function follows the list).
  3. np.clip() caps the weighting function at a maximum of 1.
  4. The data is fed into the loss defined in 6.4 and the optimizer defined in 6.6 for training.
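
A standalone sketch of that weighting function, equivalent to the point_weight line in the training loop (the function name glove_weight and the sample inputs are mine):

import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(X_ij): (x / x_max)^alpha below x_max, capped at 1 above it."""
    return np.minimum(1.0, (np.asarray(x, dtype=np.float64) / x_max) ** alpha)

print(glove_weight([0.5, 10, 100, 2500]))
# approximately [0.0188 0.1778 1. 1.]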

References

[1] Thushan Ganegedara. Natural Language Processing with TensorFlow [M]. Beijing: China Machine Press, 2019: 88-90.
[2] Jeffrey Pennington, Richard Socher, Christopher D. Manning. GloVe: Global Vectors for Word Representation [C]// Conference on Empirical Methods in Natural Language Processing. 2014.
[3] AI研习社-译站. CS224n: Stanford Deep Natural Language Processing (official bilingual version, @雷锋字幕组) [EB/OL]. (2019-01-22) [2021-07-06]. https://www.bilibili.com/video/BV1pt411h7aT?p=3
[4] -柚子皮-. SciPy Tutorial: the scipy.sparse sparse matrix library [EB/OL]. (2014-12-06) [2021-07-08]. https://blog.csdn.net/pipisorry/article/details/41762945
[5] TaoTao Yu. Notes on embedding_lookup [EB/OL]. (2019-08-04) [2021-07-08]. https://blog.csdn.net/hit0803107/article/details/98377030
