08 - Word Embeddings

Table of Contents

      • 1. One-hot encoding
      • 2. Encoding each word with a unique number
      • 3. Word embeddings
      • 4. Code example
        • 4.1 Data preparation
        • 4.2 Using the Embedding layer
        • 4.3 Text preprocessing
        • 4.4 Creating a classification model
        • 4.5 Compiling/training the model

1. One-hot encoding

This part was already explained in detail in the earlier post 01-Embedding层是什么?怎么理解?简单的评论情感分类实验.
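As a quick reminder, here is a minimal sketch of one-hot encoding, assuming the toy vocabulary ["cat", "mat", "on", "sat", "the"] used in the next section. Each word becomes a vector that is all zeros except for a single 1, which is why the representation is sparse and inefficient:

import tensorflow as tf

vocab = ["cat", "mat", "on", "sat", "the"]
indices = [vocab.index(w) for w in "the cat sat on the mat".split()]
one_hot = tf.one_hot(indices, depth=len(vocab))
print(one_hot.numpy())  # 6 rows, each all zeros except for a single 1.0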

2. Encoding each word with a unique number

A second approach we might try is to encode each word with a unique number. Continuing the example above, you could assign 1 to "cat", 2 to "mat", and so on. You could then encode the sentence "the cat sat on the mat" as a dense vector such as [5, 1, 4, 3, 5, 2]. This approach is efficient: instead of a sparse vector, you now have a dense one (where all elements are full).
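This integer encoding can be sketched in a few lines, assuming the same toy vocabulary as above:

word_index = {"cat": 1, "mat": 2, "on": 3, "sat": 4, "the": 5}
encoded = [word_index[w] for w in "the cat sat on the mat".split()]
print(encoded)  # [5, 1, 4, 3, 5, 2]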

However, this approach has two downsides:

  1. The integer encoding is arbitrary (it does not capture any relationship between words).
  2. An integer encoding can be challenging for a model to interpret. A linear classifier, for example, learns a single weight for each feature. Because there is no relationship between the similarity of any two words and the similarity of their encodings, this feature-weight combination is not meaningful.

3. Word embeddings

Word embeddings give us an efficient, dense representation in which similar words have a similar encoding. Importantly, you do not have to specify this encoding by hand. An embedding is a dense vector of floating-point values (the length of the vector is a parameter you specify). Instead of setting the embedding values manually, they are trainable parameters (weights learned by the model during training, in the same way a model learns the weights of a Dense layer). It is common to see word embeddings that are 8-dimensional (for small datasets) and up to 1024-dimensional when working with large datasets. A higher-dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.

(Figure: a word-embedding diagram, with each word represented as a 4-dimensional vector of floating-point values.)
The figure above shows a word-embedding diagram: each word is represented as a 4-dimensional vector of floating-point values. Another way to think of an embedding is as a "lookup table": once these weights have been learned, you can encode each word by looking up the dense vector it corresponds to in the table.
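To make the lookup-table intuition concrete, here is a minimal sketch with made-up weights (in a real model these values would be learned during training):

import numpy as np

# A hypothetical learned embedding table: one 4-dimensional row per word.
embedding_table = np.array([[1.2, -0.1, 4.3, 3.2],   # cat
                            [0.4,  2.5, -0.9, 0.5],   # mat
                            [2.1,  0.3,  0.1, 0.4]])  # sat
word_to_index = {"cat": 0, "mat": 1, "sat": 2}
print(embedding_table[word_to_index["cat"]])  # [ 1.2 -0.1  4.3  3.2]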

4. Code example

4.1 Data preparation

Import the required packages:

import io
import os
import re  # regular expressions; Python's built-in re module
import shutil
import string
import tensorflow as tf

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.layers import TextVectorization

In this tutorial we will use the Large Movie Review Dataset, train a sentiment-classifier model on it, and in the process learn embeddings from scratch. Download the dataset with the Keras file utility and inspect the directories.

url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                  untar=True, cache_dir='.',
                                  cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')  # /public/home/zhaiyuxin/mycode/classnote/tensorflow_zyx/aclImdb
print(os.listdir(dataset_dir))  # ['imdbEr.txt', 'train', 'imdb.vocab', 'test', 'README']

Take a look at the train/ directory. It has pos and neg folders containing movie reviews labelled as positive and negative, respectively. We will use the reviews from the pos and neg folders to train a binary classification model.

train_dir = os.path.join(dataset_dir, 'train')
print(os.listdir(train_dir))  # ['urls_neg.txt', 'pos', 'labeledBow.feat', 'urls_pos.txt', 'unsup', 'neg', 'unsupBow.feat', 'urls_unsup.txt']

The train directory also contains additional folders, which should be removed before creating the training dataset.

remove_dir = os.path.join(train_dir, 'unsup')  # the extra unsup folder to delete
shutil.rmtree(remove_dir)  # shutil.rmtree() recursively deletes a directory and all of its contents
print(os.listdir(train_dir))  # ['urls_neg.txt', 'pos', 'labeledBow.feat', 'urls_pos.txt', 'neg', 'unsupBow.feat', 'urls_unsup.txt']

As the listing above shows, the unsup folder was successfully deleted.
Next, create a tf.data.Dataset using tf.keras.utils.text_dataset_from_directory.
Use the train directory to create both the training and validation datasets, splitting off 20% for validation.

batch_size = 1024
seed = 123
# subset is only used together with validation_split; here 80% of the
# data is used for training and 20% for validation.
train_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size, validation_split=0.2,
    subset='training', seed=seed)
val_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size, validation_split=0.2,
    subset='validation', seed=seed)

Running this prints the following to the console:

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.

We can look at a few movie reviews and their labels (1: positive, 0: negative) from the training dataset.

for text_batch, label_batch in train_ds.take(1):  # the outer loop takes one batch; the inner loop iterates over examples within it
    for i in range(10):
        print(label_batch[i].numpy(), text_batch[i].numpy())


Configure the dataset for performance:
these are two important methods you should use when loading data, to make sure that I/O does not become blocking.

AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

  1. .cache() keeps data in memory after it is loaded off disk. This ensures the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.
  2. .prefetch() overlaps data preprocessing and model execution while training.

4.2 Using the Embedding layer

Keras makes word embeddings easy to use. The Embedding layer can be understood as a lookup table that maps from integer indices (which stand for specific words) to dense vectors (their embeddings). The dimensionality (or width) of the embedding is a parameter you can experiment with to see what works well for your problem, much in the same way you would experiment with the number of neurons in a Dense layer.

# Embed a 1,000 word vocabulary into 5 dimensions.
# 1000 is the vocabulary size; 5 is the embedding dimension.
embedding_layer = tf.keras.layers.Embedding(1000, 5)

When you create an Embedding layer, its weights are randomly initialized (just like any other layer). During training they are gradually adjusted via backpropagation. Once trained, the learned word embeddings will roughly encode similarities between words (as they were learned for the specific problem your model is trained on).

If you pass integers to the Embedding layer, the result replaces each integer with the corresponding vector from the embedding table:

result = embedding_layer(tf.constant([1, 0, 7]))
print(result.numpy())
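print(result.shape)  # (3, 5): three integers, each mapped to a 5-dimensional vector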

We can see that after passing through the Embedding layer, each integer has been replaced by its own 5-dimensional vector from the embedding table.
For text or sequence problems, the Embedding layer takes a 2D tensor of integers of shape (samples, sequence_length), where each entry is a sequence of integers. It can embed sequences of variable length: you could feed the layer batches of shape (32, 10) (a batch of 32 sequences of length 10) or (64, 15) (a batch of 64 sequences of length 15).

The returned tensor has one more axis than the input, with the embedding vectors aligned along the new last axis. Pass it a (2, 3) input batch, for example, and the output is (2, 3, N), where N is the value of output_dim in tf.keras.layers.Embedding; in this example, output_dim=5.

result = embedding_layer(tf.constant([[0, 1, 2], [3, 4, 5]]))
print(result.numpy())
print(result.shape)  # (2, 3, 5)

When given a batch of sequences as input, the Embedding layer returns a 3D floating-point tensor of shape (samples, sequence_length, embedding_dimension). There are various standard approaches to convert from this sequence of variable length to a fixed representation: you can use an RNN, attention, or a pooling layer before passing it to a Dense layer. This tutorial uses pooling because it is the simplest; text classification with an RNN is a good next step.
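As a minimal sketch of the pooling step (reusing the (2, 3, 5) result computed above), averaging over the sequence axis collapses each example into one fixed-length vector:

pooled = tf.keras.layers.GlobalAveragePooling1D()(result)  # result has shape (2, 3, 5)
print(pooled.shape)  # (2, 5): one fixed-length vector per example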

4.3 Text preprocessing

Next, define the dataset preprocessing steps required for the sentiment-classification model: initialize a TextVectorization layer with the desired parameters to vectorize the movie reviews.
This uses Python's re module; for details on how to use it, see 正则表达式-re.

First, we standardize the text in the training set.

# Create a custom standardization function to strip HTML break tags '<br />'.
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')  # replace every '<br />' tag with a space
    return tf.strings.regex_replace(stripped_html,
                                    '[%s]' % re.escape(string.punctuation), '')  # strip all punctuation

Let's check that this standardization function works:

for text_batch, label_batch in train_ds.take(1):
    for i in range(5):
        print(label_batch[i].numpy(), custom_standardization(text_batch[i]).numpy())

As the output below shows, the '<br />' tags and punctuation are gone.

0 b'oh my god please for the love of all that is holy do not watch this movie it it 82 minutes of my life i will never get back sure i could have stopped watching half way through but i thought it might get better it didnt anyone who actually enjoyed this movie is one seriously sick and twisted individual no wonder us australiansnew zealanders have a terrible reputation when it comes to making movies everything about this movie is horrible from the acting to the editing i dont even normally write reviews on here but in this case ill make an exception i only wish someone had of warned me before i hired this catastrophe'
1 b'this movie is soooo funny the acting is wonderful the ramones are sexy the jokes are subtle and the plot is just what every high schooler dreams of doing to hisher school i absolutely loved the soundtrack as well as the carefully placed cynicism if you like monty python you will love this film this movie is a tad bit greaseesk without all the annoying songs the songs that are sung are likable you might even find yourself singing these songs once the movie is through this musical ranks number two in musicals to me second next to the blues brothers but please do not think of it as a musical per say seeing as how the songs are so likable it is hard to tell a carefully choreographed scene is taking place i think of this movie as more of a comedy with undertones of romance you will be reminded of what it was like to be a rebellious teenager needless to say you will be reminiscing of your old high school days after seeing this film highly recommended for both the family since it is a very youthful but also for adults since there are many jokes that are funnier with age and experience'
0 b'alex d linz replaces macaulay culkin as the central figure in the third movie in the home alone empire four industrial spies acquire a missile guidance system computer chip and smuggle it through an airport inside a remote controlled toy car because of baggage confusion grouchy mrs hess marian seldes gets the car she gives it to her neighbor alex linz just before the spies turn up the spies rent a house in order to burglarize each house in the neighborhood until they locate the car home alone with the chicken pox alex calls 911 each time he spots a theft in progress but the spies always manage to elude the police while alex is accused of making prank calls the spies finally turn their attentions toward alex unaware that he has rigged devices to cleverly boobytrap his entire house home alone 3 wasnt horrible but probably shouldnt have been made you cant just replace macauley culkin joe pesci or daniel stern home alone 3 had some funny parts but i dont like when characters are changed in a movie series view at own risk'
0 b'theres a good movie lurking here but this isnt it the basic idea is good to explore the moral issues that would face a group of young survivors of the apocalypse but the logic is so muddled that its impossible to get involved  for example our four heroes are understandably paranoid about catching the mysterious airborne contagion thats wiped out virtually all of mankind yet they wear surgical masks some times not others some times theyre fanatical about wiping down with bleach any area touched by an infected person other times they seem completely unconcerned  worse after apparently surviving some weeks or months in this new killorbekilled world these people constantly behave like total newbs they dont bother accumulating proper equipment or food theyre forever running out of fuel in the middle of nowhere they dont take elementary precautions when meeting strangers and after wading through the rotting corpses of the entire human race theyre as squeamish as sheltered debutantes you have to constantly wonder how they could have survived this long and even if they did why anyone would want to make a movie about them  so when these dweebs stop to agonize over the moral dimensions of their actions its impossible to take their soulsearching seriously their actions would first have to make some kind of minimal sense  on top of all this we must contend with the dubious acting abilities of chris pine his portrayal of an arrogant young james t kirk might have seemed shrewd when viewed in isolation but in carriers he plays on exactly that same note arrogant and boneheaded its impossible not to suspect that this constitutes his entire dramatic range  on the positive side the film looks excellent its got an oversharp saturated look that really suits the southwestern us locale but that cant save the truly feeble writing nor the paperthin and annoying characters even if youre a fan of the endoftheworld genre you should save yourself the agony of watching carriers'
0 b'i saw this movie at an actual movie theater probably the 200 one with my cousin and uncle we were around 11 and 12 i guess and really into scary movies i remember being so excited to see it because my cool uncle let us pick the movie and we probably never got to do that again and sooo disappointed afterwards just boring and not scary the only redeeming thing i can remember was corky pigeon from silver spoons and that wasnt all that great just someone i recognized ive seen bad movies before and this one has always stuck out in my mind as the worst this was from what i can recall one of the most boring nonscary waste of our collective 6 and a waste of film i have read some of the reviews that say it is worth a watch and i say too each his own but i wouldnt even bother not even so bad its good'

Here we use tf.keras.layers.TextVectorization to standardize, split, and map the strings to integers. The code below is a simple example of configuring tf.keras.layers.TextVectorization.

# Vocabulary size and number of words in a sequence.
vocab_size = 10000
sequence_length = 100

# Set an explicit output sequence length, since the samples all have different lengths.
vectorize_layer = TextVectorization(
    standardize=custom_standardization,  # the standardization function defined above
    max_tokens=vocab_size,  # maximum vocabulary size
    output_mode='int',  # output integer indices
    output_sequence_length=sequence_length)  # only valid in 'int' mode

text_ds = train_ds.map(lambda x, y: x)  # make a text-only dataset (without labels)
vectorize_layer.adapt(text_ds)  # call adapt to build the vocabulary (fit the TextVectorization layer on text_ds)

We can inspect the vocabulary with vectorize_layer.get_vocabulary():
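For example, a quick peek at the first entries (the exact tokens depend on the adapted data):

# Index 0 is the padding token '' and index 1 is the out-of-vocabulary
# token '[UNK]'; the remaining tokens are sorted by frequency.
print(vectorize_layer.get_vocabulary()[:10])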
Inspect what the data looks like after it passes through the TextVectorization layer:

for text_batch, label_batch in train_ds.take(1):
    for i in range(2):  # look at just two examples
        print(vectorize_layer(text_batch[i]))


4.4 Creating a classification model

Use the Keras Sequential API to define the sentiment-classification model. In this case it is a "continuous bag of words" style model.

  • The TextVectorization layer transforms strings into vocabulary indices. We have already initialized vectorize_layer as a TextVectorization layer and built its vocabulary by calling adapt. Now vectorize_layer can be used as the first layer of the end-to-end classification model, feeding transformed strings into the Embedding layer.
  • The Embedding layer takes the integer-encoded vocabulary and looks up the embedding vector for each word index. These vectors are learned as the model trains, and they add a dimension to the output array; the resulting dimensions are (batch, sequence, embedding).
  • The GlobalAveragePooling1D layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length in the simplest possible way.
  • The fixed-length output vector is piped through a fully connected (Dense) layer with 16 hidden units.
  • The last layer is densely connected with a single output node.

embedding_dim = 16

model = Sequential([
    vectorize_layer,
    Embedding(vocab_size, embedding_dim, name="embedding"),
    GlobalAveragePooling1D(),
    Dense(16, activation='relu'),
    Dense(1)
])

4.5 Compiling/training the model

We will use TensorBoard to visualize the metrics, loss and accuracy. To do so, create a tf.keras.callbacks.TensorBoard callback:

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15,
    callbacks=[tensorboard_callback])

With this approach the model reaches a validation accuracy of around 79% (note that the model is overfitting, since the training accuracy is higher).


We can inspect the model's structure with model.summary(). Note that the Embedding layer alone accounts for vocab_size × embedding_dim = 10,000 × 16 = 160,000 trainable parameters.

Visualize the model metrics in TensorBoard (for example, by running tensorboard --logdir logs); for a walkthrough, see TensorBoard使用教程.
