Keras - Text sequences: text vectorization (one-hot encoding of tokens)
Reference:
https://blog.csdn.net/qq_30614345/article/details/98714874
6.1.1 One-hot encoding of words and characters
Listing 6-1 Word-level one-hot encoding (toy example)
Listing 6-2 Character-level one-hot encoding (toy example)
Listing 6-3 Using Keras for word-level one-hot encoding
Listing 6-4 Word-level one-hot encoding with the hashing trick (toy example)
# One-hot encoding of words and characters
# One-hot encoding is the most common, most basic way to turn a token into a vector.
# You already used it in the IMDB and Reuters examples in chapter 3 (with words, in
# that case). It consists of associating a unique integer index with every word and
# then turning this integer index i into a binary vector of size N (the size of the
# vocabulary); the vector is all zeros except for the i-th entry, which is 1.
# Of course, one-hot encoding can also be done at the character level. To make it
# completely clear what one-hot encoding is and how to implement it, listings 6-1
# and 6-2 show two toy examples: one word-level, the other character-level.
# Listing 6-1 Word-level one-hot encoding (toy example)
import numpy as np

# Initial data: one entry per "sample" (in this toy example, a "sample" is just
# a sentence, but it could be an entire document).
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# First, build an index of all tokens in the data.
token_index = {}
for sample in samples:
    # We simply tokenize the samples via the `split` method. In real life, we
    # would also strip punctuation and special characters from the samples.
    for word in sample.split():
        if word not in token_index:
            # Assign a unique index to each unique word.
            # Note that we don't attribute index 0 to anything.
            token_index[word] = len(token_index) + 1

# Next, we vectorize our samples. We will only consider the first
# `max_length` words in each sample.
max_length = 10

# This is where we store our results:
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.
print(token_index)
print(results)
{'The': 1, 'cat': 2, 'sat': 3, 'on': 4, 'the': 5, 'mat.': 6, 'dog': 7, 'ate': 8, 'my': 9, 'homework.': 10}
[[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]
Character level one-hot encoding (toy example)
# Listing 6-2 Character-level one-hot encoding (toy example)
import string

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
characters = string.printable  # All printable ASCII characters.
token_index = dict(zip(characters, range(1, len(characters) + 1)))

max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1.
print(token_index)
print(results)
{'0': 1, '1': 2, '2': 3, '3': 4, '4': 5, '5': 6, '6': 7, '7': 8, '8': 9, '9': 10, 'a': 11, 'b': 12, 'c': 13, 'd': 14, 'e': 15, 'f': 16, 'g': 17, 'h': 18, 'i': 19, 'j': 20, 'k': 21, 'l': 22, 'm': 23, 'n': 24, 'o': 25, 'p': 26, 'q': 27, 'r': 28, 's': 29, 't': 30, 'u': 31, 'v': 32, 'w': 33, 'x': 34, 'y': 35, 'z': 36, 'A': 37, 'B': 38, 'C': 39, 'D': 40, 'E': 41, 'F': 42, 'G': 43, 'H': 44, 'I': 45, 'J': 46, 'K': 47, 'L': 48, 'M': 49, 'N': 50, 'O': 51, 'P': 52, 'Q': 53, 'R': 54, 'S': 55, 'T': 56, 'U': 57, 'V': 58, 'W': 59, 'X': 60, 'Y': 61, 'Z': 62, '!': 63, '"': 64, '#': 65, '$': 66, '%': 67, '&': 68, "'": 69, '(': 70, ')': 71, '*': 72, '+': 73, ',': 74, '-': 75, '.': 76, '/': 77, ':': 78, ';': 79, '<': 80, '=': 81, '>': 82, '?': 83, '@': 84, '[': 85, '\\': 86, ']': 87, '^': 88, '_': 89, '`': 90, '{': 91, '|': 92, '}': 93, '~': 94, ' ': 95, '\t': 96, '\n': 97, '\r': 98, '\x0b': 99, '\x0c': 100}
[[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]]
Note that Keras has built-in utilities for doing word-level or character-level one-hot encoding of text, starting from raw text data. This is what you should actually be using, as it takes care of a number of important features, such as stripping special characters from strings or only taking into account the top N most common words in your dataset (a common restriction to avoid dealing with very large input vector spaces).
Using Keras for word-level one-hot encoding:
# Listing 6-3 Using Keras for word-level one-hot encoding
from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Create a tokenizer, configured to only take into account the
# 1,000 most common words.
tokenizer = Tokenizer(num_words=1000)
# Build the word index.
tokenizer.fit_on_texts(samples)

# Turn strings into lists of integer indices.
sequences = tokenizer.texts_to_sequences(samples)

# You could also directly get the one-hot binary representations.
# Vectorization modes other than one-hot encoding are supported as well.
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

# This is how you can recover the word index that was computed.
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
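As noted above, the same utilities can also operate at the character level. A minimal sketch, assuming the `char_level` argument of `Tokenizer` (variable names here are illustrative, not from the original listing):

# Character-level variant (sketch): `char_level=True` makes every character,
# rather than every word, a token.
char_tokenizer = Tokenizer(num_words=100, char_level=True)
char_tokenizer.fit_on_texts(samples)
char_sequences = char_tokenizer.texts_to_sequences(samples)
char_one_hot_results = char_tokenizer.texts_to_matrix(samples, mode='binary')
print('Found %s unique characters.' % len(char_tokenizer.word_index))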
# A variant of one-hot encoding is the so-called one-hot hashing trick, which you
# can use when the number of unique tokens in your vocabulary is too large to
# handle explicitly. Instead of explicitly assigning an index to each word and
# keeping these indices in a dictionary, you can hash words into vectors of fixed
# size, typically with a very lightweight hashing function. The main advantage of
# this method is that it does away with maintaining an explicit word index, which
# saves memory and allows online encoding of the data (you can generate token
# vectors right away, before you have seen all of the available data). The one
# drawback is the possibility of hash collisions: two different words may end up
# with the same hash value, and any machine learning model looking at these hashes
# will then be unable to tell the words apart. The likelihood of hash collisions
# decreases when the dimensionality of the hashing space is much larger than the
# total number of unique tokens being hashed.
# Listing 6-4 Word-level one-hot encoding with the hashing trick (toy example)
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# We will store our words as vectors of size 1000. Note that if you have close
# to 1,000 words (or more), you will start seeing many hash collisions, which
# will decrease the accuracy of this encoding method.
dimensionality = 1000
max_length = 10

results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # Hash the word into a "random" integer index
        # between 0 and 1000.
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1.
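Note that Python's built-in hash() is salted per process for strings (unless PYTHONHASHSEED is fixed), so the indices produced above will differ between runs. Keras also ships a small helper for the same idea; the snippet below is a hedged sketch, assuming keras.preprocessing.text.hashing_trick and its 'md5' hash option (a stable hash, unlike the built-in hash()):

# Sketch of the same idea with Keras' own helper (assumption: `hashing_trick`
# from keras.preprocessing.text, using the stable 'md5' hash function).
from keras.preprocessing.text import hashing_trick

for sample in samples:
    # Returns a list of integer word indices for the sample.
    indices = hashing_trick(sample, dimensionality, hash_function='md5')
    print(indices)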