16_NLP stateful CharRNN_window_Tokenizer_stationary_colab_ResetState_character word level_regex_IMDb: https://blog.csdn.net/Linli522362242/article/details/115388298
16_2NLP RNN_colab tensorboard_os.curdir_Pretrained Embed_TrainingSampler_En–Decoder_Beam search_mask: https://blog.csdn.net/Linli522362242/article/details/115518150
16_3_NLP RNNs Encoder Decoder Multi-Head Attention_complexity_max path length_sequential operations_colorbar:https://blog.csdn.net/Linli522362242/article/details/115689038
Exercises
In general, if you translate a sentence one word at a time, the result will be terrible. For example, the French sentence “Je vous en prie” means “You are welcome,” but if you translate it one word at a time, you get “I you in pray.” Huh? It is much better to read the whole sentence first and then translate it. A plain sequence-to-sequence RNN would start translating a sentence immediately after reading the first word, while an Encoder–Decoder RNN will first read the whole sentence and then translate it. That said, one could imagine a plain sequence-to-sequence RNN that would output silence whenever it is unsure about what to say next (just like human translators do when they must translate a live broadcast).
Variable-length input sequences can be handled by padding the shorter sequences so that all sequences in a batch have the same length, and using masking to ensure the RNN ignores the padding token. For better performance, you may also want to create batches containing sequences of similar sizes. Ragged tensors can hold sequences of variable lengths, and tf.keras will likely support them eventually, which will greatly simplify handling variable-length input sequences (at the time of this writing, it is not the case yet).
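For example, a minimal sketch of padding plus masking (the vocabulary size and layer sizes below are made up for illustration):
import tensorflow as tf
from tensorflow import keras

# pad the shorter sequences with 0 so every sequence in the batch has the same length
padded = keras.preprocessing.sequence.pad_sequences([[3, 5, 7], [2, 4]], padding="post")

# mask_zero=True makes the Embedding layer emit a mask so the GRU ignores the padding token (ID 0)
model = keras.models.Sequential([
    keras.layers.Embedding(input_dim=100, output_dim=8, mask_zero=True),
    keras.layers.GRU(16),
    keras.layers.Dense(1, activation="sigmoid"),
])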
Regarding variable-length output sequences, if the output length is known in advance you can simply configure the loss function (and any code that uses the model) to ignore tokens that come after the end of each sequence. In general, though, the output length is not known ahead of time, so the usual solution is to train the model to emit an end-of-sequence (EOS) token at the end of every sequence and to ignore everything it outputs after that token.
Beam search is a technique used to improve the performance of a trained Encoder–Decoder model, for example in a neural machine translation system. The algorithm keeps a short list of the k most promising output sentences (say, the top three); at each decoder step it tries to extend each of them by one word, then it keeps only the k most likely sentences. The parameter k is called the beam width: the larger it is, the more CPU and RAM will be used, but the more accurate the system will be.
https://blog.csdn.net/Linli522362242/article/details/115689038
An attention mechanism is a technique initially used in Encoder–Decoder models to give the decoder more direct access to the input sequence, allowing it to deal with longer input sequences. At each decoder time step, the current decoder state and the full output of the encoder are processed by an alignment model, which outputs an alignment score e_(t,i) for each input time step i. This score indicates which part of the input is most relevant to the current decoder time step. https://blog.csdn.net/Linli522362242/article/details/115689038
The weighted sum of the encoder outputs (weighted by their attention weights α_(t,i), the softmax of the alignment scores) is then fed to the decoder, which produces the next decoder state and the output for this time step.
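The symbols lost above correspond to the standard attention equations; as a sketch (with h_i the encoder output at input time step i): the alignment scores e_(t,i) are normalized into attention weights α_(t,i) = exp(e_(t,i)) / Σ_i' exp(e_(t,i')) (a softmax over the input time steps), and the context vector fed to the decoder at step t is c_(t) = Σ_i α_(t,i) · h_i.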
The most important layer in the Transformer architecture is the Multi-Head Attention layer (the original Transformer architecture contains 18 of them, including 6 Masked Multi-Head Attention layers).
It is at the core of language models such as BERT and GPT-2. Its purpose is to allow the model to identify which words are most aligned with each other, and then improve each word’s representation using these contextual clues.
Sampled softmax is used when training a classification model with many classes (e.g., thousands). When the output vocabulary is large (which is the case here), outputting a probability for each and every possible word would be terribly slow: if the target vocabulary contains, say, 50,000 French words, then the decoder would output 50,000-dimensional vectors, and computing the softmax function over such a large vector would be very computationally intensive (https://blog.csdn.net/Linli522362242/article/details/115518150). Sampled softmax instead computes an approximation of the cross-entropy loss based on the logit predicted by the model for the correct word and the predicted logits for a sample of incorrect words. This speeds up training considerably compared to computing the softmax over all logits and then estimating the cross-entropy loss. After training, the model can be used normally, using the regular softmax function to compute all the class probabilities based on all the logits.
In TensorFlow you can use the tf.nn.sampled_softmax_loss() function for this during training and use the normal softmax function at inference time (sampled softmax cannot be used at inference time because it requires knowing the target).
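A minimal sketch of how this could look (the sizes and variable names here are hypothetical, not part of the exercise):
import tensorflow as tf

vocab_size, hidden_dim, batch_size = 50_000, 128, 32
softmax_W = tf.Variable(tf.random.normal([vocab_size, hidden_dim]))   # output weights
softmax_b = tf.Variable(tf.zeros([vocab_size]))                       # output biases
decoder_outputs = tf.random.normal([batch_size, hidden_dim])          # stand-in for decoder states
labels = tf.random.uniform([batch_size, 1], maxval=vocab_size, dtype=tf.int64)

# training: cross-entropy approximated over the true class plus 64 sampled classes
train_loss = tf.nn.sampled_softmax_loss(
    weights=softmax_W, biases=softmax_b,
    labels=labels, inputs=decoder_outputs,
    num_sampled=64, num_classes=vocab_size)

# inference: fall back to the full softmax over all logits
full_logits = tf.matmul(decoder_outputs, softmax_W, transpose_b=True) + softmax_b
Y_proba = tf.nn.softmax(full_logits)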
First we need to build a function that generates strings based on a Reber grammar. The grammar will be represented as a list of possible transitions for each state. A transition specifies the string to output (or a grammar to generate it) and the next state.
We start at B, and move from one node(state_i) to the next, adding the symbols we pass to our string as we go. When we reach the final E, we stop. If there are two paths we can take, e.g. after T we can go to either S or X, we randomly choose one (with equal probability).
default_reber_grammar = [
    [("B", 1)],            # (state 0) output=B ==> (state 1)
    [("T", 2), ("P", 3)],  # (state 1) output=T ==> (state 2) OR output=P ==> (state 3)
    [("S", 2), ("X", 4)],  # (state 2) output=S ==> (state 2) OR output=X ==> (state 4)
    [("T", 3), ("V", 5)],  # and so on...
    [("X", 3), ("S", 6)],
    [("P", 4), ("V", 6)],
    [("E", None)]]         # (state 6) output=E ==> (terminal state)
In this manner we can generate an infinite number of strings which belong to the rather peculiar Reber language. Verify for yourself that the strings on the left below are possible Reber strings, while those on the right are not.
What is the set of symbols that can "legally" follow a T? An S can follow a T, but only if the immediately preceding symbol was a B. A V or a T can follow a T, but only if the symbol immediately preceding it was either a T or a P (e.g. PTT, PTV, PTTV, PTTT, ...). In order to know which symbol sequences are legal, therefore, any system that recognizes Reber strings must have some form of memory, which can use not only the current input but also fairly recent history in making a decision.
Let's generate a few strings based on the default Reber grammar:
import numpy as np

np.random.seed(42)

def generate_string(grammar):
    state = 0  # state: current node in the grammar graph
    output = []
    while state is not None:  # loop until we reach the terminal state
        index = np.random.randint(len(grammar[state]))  # pick one of the possible transitions
        production, state = grammar[state][index]
        # if isinstance(production, list):  # for embedded_reber_grammar
        #     production = generate_string(grammar=production)
        output.append(production)
    return "".join(output)

for _ in range(25):
    print(generate_string(default_reber_grammar), end=" ")
While the Reber grammar represents a simple finite state grammar and has served as a benchmark for equally simple learning systems (it can be learned by an Elman network), a more demanding task is to learn the Embedded Reber Grammar, shown below:
Using this grammar, two types of strings are generated: one kind made using the top path through the graph, BT<Reber string>TE, and the other kind made using the bottom path, BP<Reber string>PE.
embedded_reber_grammar = [
    [("B", 1)],
    [("T", 2), ("P", 3)],
    [(default_reber_grammar, 4)],
    [(default_reber_grammar, 5)],
    [("T", 6)],
    [("P", 6)],
    [("E", None)]]

import numpy as np

def generate_string(grammar):
    state = 0  # state: node
    output = []
    while state is not None:
        index = np.random.randint(len(grammar[state]))
        production, state = grammar[state][index]
        if isinstance(production, list):  # for embedded_reber_grammar: recurse into the inner grammar
            production = generate_string(grammar=production)
        output.append(production)
    return "".join(output)
Now let's generate a few strings based on the embedded Reber grammar:
np.random.seed(42)
for _ in range(25):
    print(generate_string(embedded_reber_grammar), end=" ")
Okay, now we need a function to generate strings that do not respect the grammar. We could generate a random string, but the task would be a bit too easy, so instead we will generate a string that respects the grammar, and we will corrupt it by changing just one character:
POSSIBLE_CHARS = "BEPSTVX"
def generate_corrupted_string( grammar, chars=POSSIBLE_CHARS ):
good_string = generate_string( grammar )
index = np.random.randint( len(good_string) )
good_char = good_string[index]
# changing just one character
bad_char = np.random.choice( sorted(set(chars)-set(good_char)) )
return good_string[:index] + bad_char + good_string[index+1:]
# Let's look at a few corrupted strings:
np.random.seed(42)
for _ in range(25):
print( generate_corrupted_string(embedded_reber_grammar),
end=" " )
We cannot feed strings directly to an RNN, so we need to encode them somehow.
def string_to_ids(s, chars=POSSIBLE_CHARS):
    return [chars.index(c) for c in s]

string_to_ids("BTTTXXVVETE")
We can now generate the dataset, with 50% good strings, and 50% bad strings:
###########################
import tensorflow as tf  # the ragged-tensor examples below need TensorFlow

# x      (2D ragged): 2 x (num_rows)
# y      (scalar)
# result (2D ragged): 2 x (num_rows)
x = tf.ragged.constant([[1, 2], [3]])  # 2 x (r1)
y = 3
print(x.ragged_rank)  # ==> 1
print(x + y)
# A ragged tensor's ragged rank (ragged_rank) is the number of times the underlying values tensor
# has been partitioned (i.e., the nesting depth of the RaggedTensor object).
# x's underlying values tensor is partitioned once, so ragged_rank=1.
# x (2d ragged): 3 x (num_rows)
# y (2d tensor): 3 x 1
# Result (2d ragged): 3 x (num_rows)
x = tf.ragged.constant([
[10, 87, 12],
[19, 53],
[12, 32]
])# 3x r1
y = [[1000], [2000], [3000]]
print(x.ragged_rank) # ==> 1
print(x + y)
==>因为y是3x1按 3x r1进行广播
# x 底层 values
张量的分区次数 是 ragged_rank=1
# x (2d ragged): 3 x (r1)
# y (2d tensor): 3 x 4   ########## trailing dimensions do not match ##########
x = tf.ragged.constant([[1, 2], [3, 4, 5, 6], [7]])  # row lengths: 2, 4, 1
y = tf.constant([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])  # row lengths: 4, 4, 4
try:
    x + y
except tf.errors.InvalidArgumentError as exception:
    print(exception)
Expected 'tf.Tensor(False, shape=(), dtype=bool)' to be true. Summarized data: b'Unable to broadcast: dimension size mismatch in dimension'
1
b'lengths='
4
b'dim_size='
2, 4, 1
# Reminder: the ragged rank is the number of partitions of the underlying values tensor
# (the nesting depth of the RaggedTensor); x's ragged_rank=1.
# x (2d ragged): 3 x (r1)
# y (2d ragged): 3 x (r2)   ########## ragged dimensions do not match ##########
x = tf.ragged.constant([[1, 2, 3], [4], [5, 6]])    # row lengths: 3, 1, 2
y = tf.ragged.constant([[10, 20], [30, 40], [50]])  # row lengths: 2, 2, 1
try:
    x + y
except tf.errors.InvalidArgumentError as exception:
    print(exception)
Expected 'tf.Tensor(False, shape=(), dtype=bool)' to be true. Summarized data: b'Unable to broadcast: dimension size mismatch in dimension'
1
b'lengths='
2, 2, 1
b'dim_size='
3, 1, 2
# both x's and y's underlying values tensors are partitioned once: ragged_rank=1
# x (3d ragged): 2 x (r1) x 2
# y (3d ragged): 2 x (r1) x 3   ########## trailing dimensions do not match ##########
x = tf.ragged.constant([[[1, 2], [3, 4], [5, 6]],
                        [[7, 8], [9, 10]]])        # inner lists of length 2
y = tf.ragged.constant([[[1, 2, 0], [3, 4, 0], [5, 6, 0]],
                        [[7, 8, 0], [9, 10, 0]]])  # inner lists of length 3
print(x.ragged_rank)  # ==> 2
print(y.ragged_rank)  # ==> 2
try:
    x + y
except tf.errors.InvalidArgumentError as exception:
    print(exception)
# Are x's and y's underlying values tensors partitioned twice, i.e. ragged_rank=2?
# Yes: the pylist passed to tf.ragged.constant (a nested list, tuple or np.ndarray whose
# innermost nested elements are scalar values) is partitioned twice, so ragged_rank=2,
# as the equivalent from_row_splits construction below shows:
rt = tf.RaggedTensor.from_row_splits(
values=tf.RaggedTensor.from_row_splits(
#index= 0 1 2 3 4 5 6 7 8 9 10
values=[1,2,3,4,5,6,7,8,9,10], #index:0 1 2 3 4 5
row_splits=[0, 2, 4, 6, 8, 10]), # ==> [0,2), [2,4), [4,6), [6,8), [8,10)
row_splits=[0, 3, 5])
print(rt)
print("Shape: {}".format(rt.shape))
print("Number of partitioned dimensions: {}".format(rt.ragged_rank))
# y is a 1x1 tensor, so it can be broadcast to 2 x (r1) x 2; and x explicitly specifies
# ragged_rank=1 for its underlying values tensor
# x (3d ragged): 2 x (r1) x 2
# y (2d ragged): 1 x 1
# Result (3d ragged): 2 x (r1) x 2
x = tf.ragged.constant([ [[1, 2], [3, 4], [5, 6]],
[[7, 8]]
],ragged_rank=1)
y = tf.constant([[10]])
print( (x+y).ragged_rank )# ==>1
print(x + y)
Why can x specify ragged_rank=1 for its underlying values tensor (the pylist: a nested list, tuple or np.ndarray)? Because the nested elements of x's underlying values are lists of equal length 2, i.e. a uniform inner dimension:
rt = tf.RaggedTensor.from_row_splits(
#index= 0 1 2 3 4
values=[ [1, 2], [3, 4], [5, 6], [7, 8] ],
row_splits=[0, 3, 4]) # ==> [0,3), [3,4)
print(rt)
print("Shape: {}".format(rt.shape))
print("Number of partitioned dimensions: {}".format(rt.ragged_rank))
# x (3d ragged): 2 x (r1) x (r2) x 1
# y (1d tensor): 3
# Result (3d ragged): 2 x (r1) x (r2) x 3
x = tf.ragged.constant([
[
[[1], [2]],
[],
[[3]],
[[4]],
],
[
[[5], [6]],
[[7]]
]
],ragged_rank=2)
y = tf.constant([10, 20, 30])
# X broadcasting ==>2 x (r1) x (r2) x 3
# x = tf.ragged.constant([
# [
# [[1,1,1], [2,2,2]],
# [],
# [[3,3,3]],
# [[4,4,4]],
# ],
# [
# [[5,5,5], [6,6,6]],
# [[7,7,7]]
# ]
# ],ragged_rank=2)
print(x + y)
Ragged tensors are encoded using the RaggedTensor class. Internally, each RaggedTensor contains:
- a values tensor, which concatenates the variable-length rows into a flattened list, e.g. [3, 1, 4, 1, 5, 9, 2];
- a row_partition, which indicates how those flattened values are divided into rows, e.g. row_splits=[0, 4, 4, 6, 7].
rt = tf.RaggedTensor.from_row_splits(
    values=[3, 1, 4, 1, 5, 9, 2],
    row_splits=[0, 4, 4, 6, 7])
print(rt)
The row_partition can be stored using four different encodings:
- row_splits is an integer vector specifying the split points between rows: [start_index, end_index).
- value_rowids is an integer vector specifying the row index of each value.
- row_lengths is an integer vector specifying the length of each row.
- uniform_row_length is an integer scalar specifying a single length for all rows.
An nrows integer scalar can also be included in the row_partition encoding, to account for empty trailing rows with value_rowids or empty rows with uniform_row_length.
The choice of which encoding to use for the row partition is managed internally by ragged tensors, to improve efficiency in some contexts. In particular, some advantages and disadvantages of the different row-partitioning schemes are:
- Efficient indexing: the row_splits encoding enables constant-time indexing and slicing into ragged tensors.
- Efficient concatenation: the row_lengths encoding is more efficient when concatenating ragged tensors, since row lengths do not change when two tensors are concatenated together.
- Small encoding size: the value_rowids encoding is more efficient when storing ragged tensors that have a large number of empty rows, since the size of the tensor depends only on the total number of values. On the other hand, the row_splits and row_lengths encodings are more efficient when storing ragged tensors with longer rows, since they require only one scalar value per row.
- Compatibility: the value_rowids scheme matches the segmentation format used by operations such as tf.segment_sum. The row_limits scheme matches the format used by operations such as tf.sequence_mask.
- Uniform dimensions: as discussed below, the uniform_row_length encoding is used to encode ragged tensors with uniform dimensions.
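A small sketch showing three of these encodings producing the same ragged tensor (the values and row boundaries are taken from the example above):
values = [3, 1, 4, 1, 5, 9, 2]
rt1 = tf.RaggedTensor.from_row_splits(values, row_splits=[0, 4, 4, 6, 7])
rt2 = tf.RaggedTensor.from_row_lengths(values, row_lengths=[4, 0, 2, 1])
rt3 = tf.RaggedTensor.from_value_rowids(values, value_rowids=[0, 0, 0, 0, 2, 2, 3], nrows=4)
print(rt1)  # all three print <tf.RaggedTensor [[3, 1, 4, 1], [], [5, 9], [2]]>
print(rt2)
print(rt3)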
Ragged tensors with multiple ragged dimensions are encoded by using a nested RaggedTensor for the values tensor. Each nested RaggedTensor adds one ragged dimension.
rt = tf.RaggedTensor.from_row_splits(
values=tf.RaggedTensor.from_row_splits(
#index:0 1 2 3 4 5 6 7 8 9 10
values=[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],#idx 0 1 2 3 4 5
row_splits=[0, 3, 3, 5, 9, 10]), #==>[0,3), [3,3), [3,5), [5,9), [9,10)
row_splits=[0, 1, 1, 5]) #==>[0,1),[1,1),[1, 5)
print(rt)
print("Shape: {}".format(rt.shape))
print("Number of partitioned dimensions: {}".format(rt.ragged_rank))
The factory function tf.RaggedTensor.from_nested_row_splits can be used to construct a RaggedTensor with multiple ragged dimensions directly, by providing a list of row_splits tensors:
rt = tf.RaggedTensor.from_nested_row_splits(
flat_values=[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
nested_row_splits=([0, 1, 1, 5], # second operation
[0, 3, 3, 5, 9, 10] # first operation
)
)
print(rt)
print("Number of partitioned dimensions: {}".format(rt.ragged_rank))
A ragged tensor's ragged rank is the number of times the underlying values tensor has been partitioned (i.e., the nesting depth of the RaggedTensor object). The innermost values tensor is called its flat_values. In the following example, conversations has ragged_rank=3 and its flat_values is a 1-D Tensor with 24 strings:
# shape = [batch, (paragraph), (sentence), (word)]
conversations = tf.ragged.constant(
#b p s word_list
[ [ [ ["I", "like", "ragged", "tensors."]
],
[ ["Oh", "yeah?"],
["What", "can", "you", "use", "them", "for?"]
],
[ ["Processing", "variable", "length", "data!"]
]
],
[ [ ["I", "like", "cheese."],
["Do", "you?"]
],
[ ["Yes."],
["I", "do."]
]
]
])
print( conversations.shape )
print("Number of partitioned dimensions: {}".format(conversations.ragged_rank) )
print("flat_values: ", conversations.flat_values)
Ragged tensors with uniform inner dimensions are encoded by using a multidimensional tf.Tensor for the flat_values (i.e., the innermost values).
Here x's underlying values are lists of equal length 2 (a uniform inner dimension); x is built from a pylist (a nested list, tuple or np.ndarray), and the underlying values can likewise be a list, tuple or np.ndarray.
rt = tf.RaggedTensor.from_row_splits(
#index 0 1 2 3 4 5 6
values=[[1, 3], [0, 0], [1, 3], [5, 3], [3, 3], [1, 2]],
row_splits=[0, 3, 4, 6])
print(rt)
print("Shape: {}".format(rt.shape))
print("Number of partitioned dimensions: {}".format(rt.ragged_rank))
print("Flat values shape: {}".format(rt.flat_values.shape))
print("Flat values:\n{}".format(rt.flat_values))
Ragged tensors with uniform non-inner dimensions are encoded by partitioning their rows with uniform_row_length.
rt = tf.RaggedTensor.from_uniform_row_length(
values=tf.RaggedTensor.from_row_splits(
#index 0 1 2 3 4 5 6 7 8 9 10
values=[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],#idx 0 2 4
row_splits=[0, 3, 5, 9, 10]), # ==>[0,3), [3,5), [5,9),[9,10)
uniform_row_length=2)
print(rt)
print("Shape: {}".format(rt.shape))
print("Number of partitioned dimensions: {}".format(rt.ragged_rank))
inner_shape: a tuple of integers specifying the shape for individual inner values (the innermost values tensor, also called the flat_values) in the returned RaggedTensor. Defaults to () if ragged_rank is not specified. If ragged_rank is specified, then a default is chosen based on the contents of pylist.
########################### https://www.tensorflow.org/guide/ragged_tensor?hl=zh-cn
We can now generate the dataset, with 50% good strings, and 50% bad strings:
def generate_dataset(size):
    good_strings = [string_to_ids(generate_string(embedded_reber_grammar))
                    for _ in range(size // 2)]              # [[id, id, ...], [id, ...], ...]
    bad_strings = [string_to_ids(generate_corrupted_string(embedded_reber_grammar))
                   for _ in range(size - size // 2)]
    # all_strings is 2D
    all_strings = good_strings + bad_strings  # '+': list concatenation
    # ragged_rank must be 1 since the innermost lists of all_strings are not uniform
    X = tf.ragged.constant(all_strings, ragged_rank=1)
    y = np.array([[1.] for _ in range(len(good_strings))] +
                 [[0.] for _ in range(len(bad_strings))])
    return X, y

np.random.seed(42)
X_train, y_train = generate_dataset(10000)
X_valid, y_valid = generate_dataset(2000)
X_train[:5], y_train[:5], y_train[-5:]
Let's take a look at the first training sequence:
X_train[0]
Perfect! We are ready to create the RNN to identify good strings. We build a simple sequence binary classifier:
####################################https://www.tensorflow.org/api_docs/python/tf/keras/layers/InputLayer
tf.keras.layers.InputLayer
ragged : Boolean, whether the placeholder created is meant to be ragged. In this case, values of 'None' in the 'shape' argument represent ragged dimensions. For more information about RaggedTensors, see this guide. Default to False.
####################################
from tensorflow import keras
# POSSIBLE_CHARS = "BEPSTVX"
np.random.seed(42)
tf.random.set_seed(42)
embedding_size = 5
# Ragged tensors can hold sequences of variable lengths
model = keras.models.Sequential([ # None: timesteps or input length
keras.layers.InputLayer( input_shape=[None], dtype=tf.int32, ragged=True ),
keras.layers.Embedding( input_dim=len(POSSIBLE_CHARS), # without oov
output_dim=embedding_size ),
keras.layers.GRU(30),
keras.layers.Dense(1, activation="sigmoid")
])
optimizer = keras.optimizers.SGD( lr=0.02, momentum=0.95, nesterov=True )
model.compile( loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"] )
history = model.fit( X_train, y_train, epochs=20,
validation_data=(X_valid, y_valid) )
... ...
Now let's test our RNN on two tricky strings: the first one is bad while the second one is good. They only differ by the second-to-last character. If the RNN gets this right, it shows that it managed to notice the pattern that the second letter should always be equal to the second-to-last letter. That requires a fairly long short-term memory, which is why we used a GRU cell.
#
test_strings = ["BPBTSSSSSSSXXTTVPXVPXTTTTTVVETE",
"BPBTSSSSSSSXXTTVPXVPXTTTTTVVEPE"]
X_test = tf.ragged.constant([ string_to_ids(s) for s in test_strings ],
ragged_rank=1 )
y_proba = model.predict( X_test )
print()
print("Estimated probability that teste are Reber strings:")
for index, string in enumerate( test_strings ):
print( "{}: {:.2f}%".format(string, 100*y_proba[index][0]) )
It worked fine. The RNN found the correct answers with very high confidence.
Let's start by creating the dataset. We will use random days between 1000-01-01 and 9999-12-31:
##########################
date.toordinal()
Return the proleptic Gregorian ordinal of the date, where January 1 of year 1 has ordinal 1.
For any date object d, date.fromordinal(d.toordinal()) == d.
from datetime import date
dt=date.fromordinal( 1 )
print(dt)
print( dt.strftime( "%d, %Y" ) )
print( dt.isoformat() )
##########################
from datetime import date
# cannot use strftime()'s %B format since it depends on the locale
MONTHS = ["January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "November", "December"]
def random_dates(n_dates):
    min_date = date(1000, 1, 1).toordinal()
    max_date = date(9999, 12, 31).toordinal()
    ordinals = np.random.randint(max_date - min_date, size=n_dates) + min_date
    dates = [date.fromordinal(ordinal) for ordinal in ordinals]
    x = [MONTHS[dt.month - 1] + " " + dt.strftime("%d, %Y") for dt in dates]
    y = [dt.isoformat() for dt in dates]
    return x, y
Here are a few random dates, displayed in both the input format and the target format:
np.random.seed(42)
n_dates = 3
x_example, y_example = random_dates( n_dates )
print( "{:25s}{:25s}".format("Input", "Target") )
print( "-"*50 )
for idx in range(n_dates):
    print("{:25s}{:25s}".format(x_example[idx], y_example[idx]))
Let's get the list of all possible characters in the inputs:
INPUT_CHARS = "".join( sorted( set( "".join(MONTHS)
+ "0123456789, " )
) )
INPUT_CHARS
And here's the list of possible characters in the outputs:
OUTPUT_CHARS = "0123456789-"
Let's write a function to convert a string to a list of character IDs, as we did in the previous exercise:
def date_str_to_ids(date_str, chars=INPUT_CHARS):
    return [chars.index(c) for c in date_str]

date_str_to_ids(x_example[0], INPUT_CHARS)
date_str_to_ids(y_example[0], OUTPUT_CHARS)

def prepare_date_strs(date_strs, chars=INPUT_CHARS):  # ragged: the strings have variable lengths
    X_ids = [date_str_to_ids(dt, chars) for dt in date_strs]  # [[ids], [ids], ...]
    X = tf.ragged.constant(X_ids, ragged_rank=1)
    return (X + 1).to_tensor()  # +1 so IDs start from 1 (0 is reserved for padding)

def create_dataset(n_dates):
    x, y = random_dates(n_dates)
    return prepare_date_strs(x, INPUT_CHARS), prepare_date_strs(y, OUTPUT_CHARS)
np.random.seed(42)
X_train, Y_train = create_dataset( 10000 )
X_valid, Y_valid = create_dataset( 2000 )
X_test, Y_test = create_dataset( 2000 )
Y_train[0]
(The IDs start from 1 because prepare_date_strs() returns (X+1).to_tensor(), reserving 0 for padding.)
Let's first try the simplest possible model: we feed in the input sequence, which first goes through the encoder (an embedding layer followed by a single LSTM layer), which outputs a vector; then the vector goes through a decoder (a single LSTM layer followed by a dense output layer), which outputs a sequence of vectors, each representing the estimated probabilities for all possible output characters.
Since the decoder expects a sequence as input, we repeat the vector (which is output by the encoder) as many times as the longest possible output sequence.
embedding_size = 32
max_output_length = Y_train.shape[1] # 10==len( "7075-09-20" )
np.random.seed(42)
tf.random.set_seed(42)
# INPUT_CHARS = ' ,0123456789ADFJMNOSabceghilmnoprstuvy'
encoder = keras.models.Sequential([
keras.layers.Embedding( input_dim=len(INPUT_CHARS)+1, #+1 since (X+1).to_tensor() # +1 for id start from 1
output_dim=embedding_size,
input_shape=[None] ),
keras.layers.LSTM(128) # return_sequences=False,
])# ==>(batch_size,128 ) just the last time step
# OUTPUT_CHARS = '0123456789-'
decoder = keras.models.Sequential([
keras.layers.LSTM( 128, return_sequences=True ),
keras.layers.Dense( len(OUTPUT_CHARS)+1, #+1 since (X+1).to_tensor() # +1 for id start from 1
activation="softmax" )
])
model = keras.models.Sequential([
encoder, #==>(batch_size,128 )
keras.layers.RepeatVector( max_output_length ),#==>(batch_size,max_output_length,128 )
decoder#==>(batch_size, max_output_length )
])
# https://blog.csdn.net/Linli522362242/article/details/113311720
# adam==>Nadam
optimizer = keras.optimizers.Nadam()  # the length of the date strings is variable ==> ragged tensor
model.compile( loss="sparse_categorical_crossentropy", # since OUTPUT_CHARS = "0123456789-"
optimizer=optimizer, metrics=["accuracy"] )
history = model.fit( X_train, Y_train, epochs=20,
validation_data=(X_valid, Y_valid) )
... ...
Looks great, we reach 100% validation accuracy!
model.summary()
Let's use the model to make some predictions. We will need to be able to convert a sequence of character IDs to a readable string:
# def prepare_date_strs( date_strs, chars=INPUT_CHARS ): #ragg #veriable length
# X_ids = [ date_str_to_ids(dt, chars) for dt in date_strs ]# [[nested_list],[nested_list]...]
# X = tf.ragged.constant( X_ids, ragged_rank=1 )
# return (X+1).to_tensor() # +1 for id start from 1
# def create_dataset( n_dates ):
# x,y = random_dates(n_dates)
# return prepare_date_strs(x, INPUT_CHARS),\
# prepare_date_strs(y, OUTPUT_CHARS)
#OUTPUT_CHARS = "0123456789-"
def ids_to_date_strs(ids, chars=OUTPUT_CHARS):
    # prepend " " since (X+1).to_tensor() shifted the IDs: ID 0 is the padding token
    return ["".join([(" " + chars)[index] for index in sequence])
            for sequence in ids]

X_new = prepare_date_strs(["September 17, 2009", "July 14, 1789"])
# OR: ids = model.predict_classes(X_new)
# model.predict(X_new).shape: (2, 10, 12) -- 10 == len("1789-07-14"), 12 == len("0123456789-") + 1
ids = np.argmax(model.predict(X_new), axis=-1)
for date_str in ids_to_date_strs(ids):
    print(date_str)
Perfect! :)
However, since the model was only trained on input strings of length 18 (which is the length of the longest date, e.g. len("September 17, 2009")==18), it does not perform well if we try to use it to make predictions on shorter sequences:
X_new = prepare_date_strs([ "May 02, 2020", "July 14, 1789" ])
#ids = model.predict_classes(X_new)
ids = np.argmax(model.predict(X_new), axis=-1)
for date_str in ids_to_date_strs(ids):
    print(date_str)
Wrong! We need to ensure that we always pass sequences of the same length (the same number of time steps) as during training, using padding if necessary. Let's write a little helper function for that:
X_train[19]
#####################################
https://www.tensorflow.org/api_docs/python/tf/pad
tf.pad(
tensor, paddings, mode='CONSTANT', constant_values=0, name=None
)
mode : One of "CONSTANT"(填充0), "REFLECT", or "SYMMETRIC" (case-insensitive)
This operation pads a tensor
according to the paddings
you specify. paddings
is an integer tensor with shape [n, 2]
, where n is the rank of tensor(Note: The rank of a tensor is not the same as the rank of a matrix. The rank of a tensor is
the number of indices required to uniquely select each element of the tensor
. Rank is also known as "order", "degree", or "ndims.")
.
# TF v2 style: a quick illustration of tensor rank
def compute_z(a, b, c):
    r1 = tf.subtract(a, b)
    r2 = tf.multiply(2, r1)
    z = tf.add(r2, c)
    return z

tf.print('Scalar Inputs:', compute_z(1, 2, 3))
tf.print('Rank 1 Inputs:', compute_z([1], [2], [3]))
tf.print('Rank 2 Inputs:', compute_z([[1]], [[2]], [[3]]))
For each dimension D of input, paddings[D, 0] indicates how many values to add before the contents of tensor in that dimension, and paddings[D, 1] indicates how many values to add after the contents of tensor in that dimension.

import tensorflow as tf
t=[[2,3,4],
[5,6,7]
]
print(tf.pad(t,[[1,2], # [fill padding before d=0 OR axis=0 content, fill padding after d=0 OR axis=0 content]
[3,4] # [fill padding before d=1 OR axis=1 content, fill padding after d=1 OR axis=1 content]
],
"CONSTANT")
)
tf.Tensor(
[[0 0 0 0 0 0 0 0 0 0]
[0 0 0 2 3 4 0 0 0 0]
[0 0 0 5 6 7 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0]], shape=(5, 10), dtype=int32)  # "CONSTANT" pads with zeros
Note: [1, 2] is the first entry of paddings, so it refers to the first dimension (the rows): the 1 adds 1 row of zeros above and the 2 adds 2 rows of zeros below. Likewise, [3, 4] is the second entry and refers to the columns: 3 columns of zeros are added on the left and 4 on the right.
If mode is "REFLECT", then both paddings[D, 0] and paddings[D, 1] must be no greater than tensor.dim_size(D) - 1.

import tensorflow as tf
t=[[2,3,4],
[5,6,7]
]
print(tf.pad(t,[[1,2],
[3,4]
],
"REFLECT")
)
t.shape == (2, 3), so the call above fails.
Reason: for D=0 (axis=0, the rows), both paddings[D, 0] and paddings[D, 1] must be no greater than tensor.dim_size(0) - 1 = 2 - 1 = 1, yet we asked for 2 rows of padding below.
import tensorflow as tf
t=[[2,3,4],
[5,6,7]
]
print(tf.pad(t,[[1,1], # d=0 OR axis=0, only 1 since t is 2x3, 2-1=1
[1,1] # d=1 OR axis=1,at most 2 since t is 2x3, 3-1=2
],
"REFLECT")
)
REFLECT mirrors the tensor about its border rows/columns (excluding the border itself): upward about the row 2 3 4, downward about the row 5 6 7, and similarly to the left about the first column and to the right about the last column.
import tensorflow as tf
t=[[2,3,4],
[5,6,7]
]
print(tf.pad(t,[[1,1],# d=0 OR axis=0
[1,2] # d=1 OR axis=1, at most 2 since t is 2x3, 3-1=2
],
"REFLECT")
)
If mode is "SYMMETRIC", then both paddings[D, 0] and paddings[D, 1] must be no greater than tensor.dim_size(D).

import tensorflow as tf
t=[[2,3,4],
[5,6,7]
]
print(tf.pad(t,[[1,1],
[1,2]
],
"SYMMETRIC")
)
3D
import tensorflow as tf
t = tf.Variable(tf.ones([2, 3, 4]), name="tsr")
pad = np.array([[0, 0], # not fill padding on axis=0
[1, 2], # fill padding on axis=1
[3, 4]])# fill padding on axis=2
print(t)
print(tf.pad(t,pad,
"CONSTANT")
)
import tensorflow as tf
t = tf.Variable(tf.ones([2, 3, 3]), name="tsr")
pad = np.array([[1, 1], # fill padding on axis=0
[0, 0], # not fill padding on axis=1
[0, 0]])# not fill padding on axis=2
# print(t)
print(tf.pad(t,pad,
"CONSTANT")
)
#####################################
X_train[19]
# since we use X = tf.ragged.constant( X_ids, ragged_rank=1 )  # the inner dimension is non-uniform
max_input_length = X_train.shape[1] # 18
def prepare_date_strs_padded(date_strs):
    X = prepare_date_strs(date_strs)
    if X.shape[1] < max_input_length:
        X = tf.pad(X, [[0, 0],                                # don't pad the batch dimension
                       [0, max_input_length - X.shape[1]]])   # pad the time-step dimension up to max_input_length
    return X
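As a usage sketch (the helper name and test strings are illustrative; a character-by-character version for the next model appears further below):
# a minimal sketch: pad the inputs to the training length, predict, decode the IDs
def predict_date_strs(date_strs):
    X = prepare_date_strs_padded(date_strs)
    ids = np.argmax(model.predict(X), axis=-1)
    return ids_to_date_strs(ids)

for date_str in predict_date_strs(["May 02, 2020", "July 14, 1789"]):
    print(date_str)   # short inputs should now work too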
Cool! Granted, there are certainly much easier ways to write a date conversion tool (e.g., using regular expressions or even basic string manipulation), but you have to admit that using neural networks is way cooler. ;-)
However, real-life sequence-to-sequence problems will usually be harder, so for the sake of completeness, let's build a more powerful model.
Instead of feeding the decoder a simple repetition of the encoder's output vector, we can feed it the target sequence, shifted by one time step to the right. This way, at each time step the decoder will know what the previous target character was. This should help it tackle more complex sequence-to-sequence problems.
Since the first output character of each target sequence has no previous character, we will need a new token to represent the start-of-sequence (sos).
######################
sos_id = len(OUTPUT_CHARS) + 1

def shifted_output_sequences(Y):
    sos_tokens = tf.fill(dims=(len(Y), 1), value=sos_id)
    return tf.concat([sos_tokens, Y[:, :-1]], axis=1)
######################
During inference, we won't know the target, so what will we feed the decoder? We can just predict one character at a time, starting with an sos token, then feeding the decoder all the characters that were predicted so far.
#####################
# since we use X = tf.ragged.constant( X_ids, ragged_rank=1 )  # the inner dimension is non-uniform
max_input_length = X_train.shape[1]   # 18
max_output_length = Y_train.shape[1]

def prepare_date_strs_padded(date_strs):
    X = prepare_date_strs(date_strs)
    if X.shape[1] < max_input_length:
        X = tf.pad(X, [[0, 0],
                       [0, max_input_length - X.shape[1]]])
    return X

sos_id = len(OUTPUT_CHARS) + 1

def predict_date_strs(date_strs):
    X = prepare_date_strs_padded(date_strs)
    Y_pred = tf.fill(dims=(len(X), 1), value=sos_id)
    for index in range(max_output_length):
        pad_size = max_output_length - Y_pred.shape[1]
        X_decoder = tf.pad(Y_pred, [[0, 0],          # don't pad the batch dimension
                                    [0, pad_size]])  # pad the time-step dimension up to max_output_length
        Y_probas_next = model.predict([X, X_decoder])[:, index:index + 1]
        Y_pred_next = tf.argmax(Y_probas_next, axis=-1, output_type=tf.int32)
        Y_pred = tf.concat([Y_pred, Y_pred_next], axis=1)
    return ids_to_date_strs(Y_pred[:, 1:])
#####################
But if the decoder's LSTM expects to get the previous target as input at each step, how shall we pass it the vector output by the encoder? Well, one option is to ignore the output vector and instead use the encoder's LSTM state as the initial state of the decoder's LSTM (which requires that the encoder's LSTM have the same number of units as the decoder's LSTM).
Now let's create the decoder's inputs (for training, validation and testing). The sos token will be represented using the last possible output character's ID + 1.
# def prepare_date_strs( date_strs, chars=INPUT_CHARS ): #ragg #veriable length
# X_ids = [ date_str_to_ids(dt, chars) for dt in date_strs ]# [[nested_list],[nested_list]...]
# X = tf.ragged.constant( X_ids, ragged_rank=1 )
# return (X+1).to_tensor() # +1 for id start from 1
# def create_dataset( n_dates ):
# x,y = random_dates(n_dates)
# return prepare_date_strs(x, INPUT_CHARS),\
# prepare_date_strs(y, OUTPUT_CHARS)
# np.random.seed(42)
# X_train, Y_train = create_dataset( 10000 )
# X_valid, Y_valid = create_dataset( 2000 )
# X_test, Y_test = create_dataset( 2000 )
sos_id = len(OUTPUT_CHARS) + 1

def shifted_output_sequences(Y):
    sos_tokens = tf.fill(dims=(len(Y), 1), value=sos_id)
    return tf.concat([sos_tokens, Y[:, :-1]], axis=1)

X_train_decoder = shifted_output_sequences(Y_train)
X_valid_decoder = shifted_output_sequences(Y_valid)
X_test_decoder = shifted_output_sequences(Y_test)
Y_train
(The decoder's training inputs below are these same targets shifted one time step to the right, with the sos token prepended.)
Let's take a look at the decoder's training inputs:
X_train_decoder
Now let's build the model. It's not a simple sequential model anymore, so let's use the functional API:
encoder_embedding_size = 32
decoder_embedding_size = 32
lstm_units = 128
np.random.seed(42)
tf.random.set_seed(42)
# INPUT_CHARS=' ,0123456789ADFJMNOSabceghilmnoprstuvy'
# keras.layers.Embedding( input_dim=len(INPUT_CHARS)+1, #+1 since (X+1).to_tensor() # +1 for id start from 1
# output_dim=embedding_size,
# input_shape=[None] ),
encoder_input = keras.layers.Input( shape=[None], dtype=tf.int32 )
encoder_embedding = keras.layers.Embedding( input_dim=len(INPUT_CHARS)+1,#+1 since (X+1).to_tensor() # +1 for id start from 1
output_dim=encoder_embedding_size
)(encoder_input)
_, encoder_state_h, encoder_state_c = keras.layers.LSTM(
lstm_units, return_state=True, # return_sequences=False,
)(encoder_embedding)
encoder_state = [encoder_state_h, encoder_state_c]
# OUTPUT_CHARS = "0123456789-"
decoder_input = keras.layers.Input( shape=[None], dtype=tf.int32 )
decoder_embedding = keras.layers.Embedding( input_dim=len(OUTPUT_CHARS)+2,#+1 since +1 in (X+1).to_tensor() for id start from 1 and +1 again for sos
output_dim=decoder_embedding_size
)(decoder_input)
decoder_lstm_output = keras.layers.LSTM( lstm_units, return_sequences=True )(
decoder_embedding, initial_state=encoder_state )
decoder_output = keras.layers.Dense( len(OUTPUT_CHARS)+1, #+1 since (X+1).to_tensor() # +1 for id start from 1 and we don't need to +1 again for predicting 'sos' with 0 probability
activation="softmax" )(
decoder_lstm_output )
model = keras.models.Model( inputs=[encoder_input, decoder_input],
outputs=[decoder_output] )
# https://blog.csdn.net/Linli522362242/article/details/113311720
# adam==>Nadam
optimizer = keras.optimizers.Nadam()  # the length of the date strings is variable ==> ragged tensor
model.compile( loss="sparse_categorical_crossentropy", # since OUTPUT_CHARS = "0123456789-"
optimizer=optimizer,
metrics=["accuracy"] )
history = model.fit( [X_train, X_train_decoder], Y_train,
epochs=10,
validation_data=([X_valid, X_valid_decoder], Y_valid)
)
This model also reaches 100% validation accuracy, but it does so even faster.
Let's once again use the model to make some predictions. This time we need to predict characters one by one.
Figure 16-6. Neural machine translation using an Encoder–Decoder network with an attention model : https://blog.csdn.net/Linli522362242/article/details/115689038
# since we use X = tf.ragged.constant( X_ids, ragged_rank=1 )  # the inner dimension is non-uniform
# max_input_length = X_train.shape[1]   # 18
# max_output_length = Y_train.shape[1]
# def prepare_date_strs( date_strs, chars=INPUT_CHARS ):  # ragged, variable length
#     X_ids = [ date_str_to_ids(dt, chars) for dt in date_strs ]
#     X = tf.ragged.constant( X_ids, ragged_rank=1 )
#     return (X+1).to_tensor()  # +1 so IDs start from 1
# def prepare_date_strs_padded( date_strs ):
#     X = prepare_date_strs( date_strs )
#     if X.shape[1] < max_input_length:
#         X = tf.pad( X, [[0, 0], [0, max_input_length - X.shape[1]]] )
#     return X
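A minimal usage sketch of the predict_date_strs() helper defined above (the test strings are just examples):
for date_str in predict_date_strs(["July 14, 1789", "May 01, 2020"]):
    print(date_str)   # expected, if training went well: 1789-07-14 and 2020-05-01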
Works fine!
https://blog.csdn.net/Linli522362242/article/details/115518150
https://blog.csdn.net/Linli522362242/article/details/115518150
The sequences in this problem are pretty short, but if we wanted to tackle longer sequences, we would probably have to use attention mechanisms. While it's possible to code our own implementation, it's simpler and more efficient to use TF-Addons's implementation instead. Let's do that now, this time using Keras' subclassing API.
Warning: due to a TensorFlow bug (see this issue https://github.com/tensorflow/addons/issues/1153 for details), the get_initial_state()
method fails in eager mode, so for now we have to use the subclassing API, as Keras automatically calls tf.function()
on the call()
method (so it runs in graph mode).
In this implementation, we've reverted back to using the TrainingSampler
, for simplicity (but you can easily tweak it to use a ScheduledEmbeddingTrainingSampler
instead). We also use a GreedyEmbeddingSampler
during inference, so this class is pretty easy to use:
Figure 16-3. A simple machine translation model(just sending the encoder’s final hidden state to the decoder)
Figure 16-6. Neural machine translation using an Encoder–Decoder network with an attention model
e_(t) is the alignment score vector at decoder time step t. The decoder decides which parts of the source sentence it needs to pay attention to, instead of having the encoder encode all the information of the source sentence into a fixed-length vector.
α_(t) is the attention weight vector (one weight per encoder time step) at decoder time step t. We apply a softmax activation function to the alignment scores to obtain the attention weights.
c_(t) is the attention context vector (of size 1 x the number of units in the LSTM/GRU cell) at decoder time step t.
The scaled dot-product attention of queries Q, keys K, and values V is softmax(Q K^T / sqrt(d_keys)) V.
Note: the decoder receives as input its previous hidden state, all time steps of the encoder outputs, and the previous decoder output (at inference time) or the previous target token (at training time).
At each step, the decoder LSTM/GRU therefore gets: the previous hidden state (initialized with encoder_final_state = [encoder_state_h, encoder_state_c]), together with the concatenation of the previous decoder output (inference) or previous target token (training, after embedding lookup) and the attention context vector computed between the previous hidden state and all time steps of the encoder outputs.
The AttentionWrapperState also stores the alignments Tensor(s) emitted at the previous time step for each attention mechanism, and optionally TensorArray(s) containing the alignment matrices from all time steps (call stack() on each to convert it to a Tensor). AttentionWrapperState also has a clone() method, which our model calls: it takes the AttentionWrapperState we just initialized and replaces its cell_state attribute with the state output by the encoder (after average pooling, in that article's model). https://zhuanlan.zhihu.com/p/52608602
import tensorflow_addons as tfa  # needed for the seq2seq components below

max_output_length = Y_train.shape[1]

class DateTranslation(keras.models.Model):
    def __init__(self, units=128, encoder_embedding_size=32,
                 decoder_embedding_size=32, **kwargs):
        super().__init__(**kwargs)
        ################################# encoder
        self.encoder_embedding = keras.layers.Embedding(
            input_dim=len(INPUT_CHARS) + 1,  # +1 since (X+1).to_tensor() shifts IDs to start from 1
            output_dim=encoder_embedding_size)
        self.encoder = keras.layers.LSTM(
            units,
            return_sequences=True,  # the attention (alignment) model needs all encoder outputs
            return_state=True)      # whether to return the last state in addition to the output (default: False)
        ################################# decoder
        self.decoder_embedding = keras.layers.Embedding(
            input_dim=len(OUTPUT_CHARS) + 2,  # +1 so IDs start from 1, and +1 again for 'sos'
            output_dim=decoder_embedding_size)
        # https://blog.csdn.net/Linli522362242/article/details/115689038
        # https://github.com/tensorflow/addons/blob/v0.12.0/tensorflow_addons/seq2seq/attention_wrapper.py#L490-L609
        # def _calculate_attention(self, query, state):
        #     score = _luong_score(query, self.keys, self.scale_weight)
        #     # def _luong_score(query, keys, scale):
        #     #     # Reshape from [batch_size, depth] to [batch_size, 1, depth] for matmul.
        #     #     query = tf.expand_dims(query, 1)
        #     #     score = tf.matmul(query, keys, transpose_b=True)  # simply compute the dot product
        #     #     score = tf.squeeze(score, [1])  # remove the dimension (=1) at index 1
        #     #     if scale is not None:
        #     #         score = scale * score
        #     alignments = self.probability_fn(score, state)  # probability_fn: str = "softmax"
        #     next_state = alignments
        #     return alignments, next_state
        #
        # The attention mechanism measures the similarity between one of the encoder's
        # outputs and the decoder's previous hidden state:
        #   the encoder outputs are both the keys and the values,
        #   the decoder hidden state at decoder time step t-1 is the query.
        self.attention = tfa.seq2seq.LuongAttention(units)  # multiplicative attention
        # Why keras.layers.LSTMCell? During inference we feed one step's output as
        # the next step's input, and an LSTMCell processes a single time step.
        decoder_inner_cell = keras.layers.LSTMCell(units)
        self.decoder_cell = tfa.seq2seq.AttentionWrapper(
            cell=decoder_inner_cell,
            attention_mechanism=self.attention)
        # default output_attention=True: the output at each time step is the attention value
        # +1 since (X+1).to_tensor() shifts IDs to start from 1; no need to add 1 again,
        # since we never need to predict 'sos'
        output_layer = keras.layers.Dense(len(OUTPUT_CHARS) + 1)
        # https://www.tensorflow.org/addons/api_docs/python/tfa/seq2seq/TrainingSampler
        # A training sampler that simply reads its inputs: its role is to tell the decoder
        # at each step what it should pretend the previous output was.
        # During training this is the embedding of the previous target token; during
        # inference it would be the embedding of the token that was actually output.
        # time_major: whether the input tensors are time major (default False: batch major).
        # In tfa.seq2seq.BasicDecoder, the tfa.seq2seq.Sampler instance passed as argument
        # is responsible for sampling from the output distribution and producing the input
        # for the next decoding step.
        # https://www.tensorflow.org/addons/api_docs/python/tfa/seq2seq/BasicDecoder
        self.decoder = tfa.seq2seq.BasicDecoder(
            cell=self.decoder_cell,  # contains the LSTMCell and the attention
            sampler=tfa.seq2seq.sampler.TrainingSampler(),
            output_layer=output_layer)
        # During inference, why GreedyEmbeddingSampler? With a TrainingSampler we would
        # have to run the model once for each new character (its input contains all
        # previous outputs). With GreedyEmbeddingSampler, at each time step it computes
        # the argmax of the decoder's outputs, runs the resulting token IDs through the
        # decoder's embedding layer, and feeds the resulting embeddings to the decoder's
        # LSTM cell at the next time step, so we only need to run the decoder once to get
        # the full prediction.
        self.inference_decoder = tfa.seq2seq.BasicDecoder(
            cell=self.decoder_cell,  # contains the LSTMCell and the attention
            sampler=tfa.seq2seq.sampler.GreedyEmbeddingSampler(
                embedding_fn=self.decoder_embedding),
            output_layer=output_layer,
            maximum_iterations=max_output_length)  # prevent an infinite loop

    def call(self, inputs, training=None):
        # encoder_input and decoder_input both have shape [batch_size, None (time steps)]
        encoder_input, decoder_input = inputs
        ################################# encoder
        encoder_embeddings = self.encoder_embedding(encoder_input)
        encoder_outputs, encoder_state_h, encoder_state_c = self.encoder(
            encoder_embeddings, training=training)
        encoder_state = [encoder_state_h, encoder_state_c]  # the last state
        # https://www.tensorflow.org/addons/api_docs/python/tfa/seq2seq/LuongAttention#setup_memory
        self.attention(encoder_outputs,  # shape: (batch_size, 18 time steps, 128 units)
                       setup_memory=True)
        ################################# decoder
        decoder_embeddings = self.decoder_embedding(decoder_input)  # shape: time steps x decoder_embedding_size
        # Luong attention (multiplicative) requires both vectors to have the same dimensionality
        decoder_initial_state = self.decoder_cell.get_initial_state(
            decoder_embeddings)  # generates a zero-filled state for the cell
        # https://www.tensorflow.org/addons/api_docs/python/tfa/seq2seq/AttentionWrapperState
        # clone: the new state fields' shapes must match the original state fields' shapes.
        # This is validated, and the original fields' shapes are "propagated" to the new fields.
        # print("decoder_initial_state", decoder_initial_state)
        decoder_initial_state = decoder_initial_state.clone(
            # cell_state: the state of the wrapped RNN cell at the previous time step
            cell_state=encoder_state)
        # print("decoder_initial_state after clone", decoder_initial_state)
        if training:
            decoder_outputs, _, _ = self.decoder(
                decoder_embeddings,
                initial_state=decoder_initial_state,
                training=training)
        else:
            # sos_id = len(OUTPUT_CHARS) + 1  # == 12
            start_tokens = tf.zeros_like(encoder_input[:, 0]) + sos_id
            # OR:
            # batch_size = tf.shape(encoder_input)[:1]
            # start_tokens = tf.fill(dims=batch_size, value=sos_id)
            decoder_outputs, _, _ = self.inference_decoder(
                decoder_embeddings,
                initial_state=decoder_initial_state,
                start_tokens=start_tokens,
                end_token=0)
        # faster than keras.layers.Activation("softmax")(decoder_outputs.rnn_output)
        return tf.nn.softmax(decoder_outputs.rnn_output)  # Y_proba
(Debug printout, tensor values omitted: decoder_initial_state is an AttentionWrapperState(cell_state=[...], attention=..., alignments=..., attention_state=...); after clone(), its cell_state holds the encoder's final [h, c] state.)
np.random.seed(42)
tf.random.set_seed(42)
model = DateTranslation()
optimizer = keras.optimizers.Nadam()
model.compile( loss="sparse_categorical_crossentropy", optimizer=optimizer,
metrics=["accuracy"] )
history = model.fit( [X_train, X_train_decoder], Y_train, epochs=25,
validation_data=([X_valid, X_valid_decoder], Y_valid)
)
... ...
Not quite 100% validation accuracy, but close. It took a bit longer to converge this time, but there were also more parameters and more computations per iteration. And we did not use a scheduled sampler.
To use the model, we can write yet another little function:
def ids_to_date_strs(ids, chars=OUTPUT_CHARS):
    # prepend " " since we are using 0 as the padding token ID
    return ["".join([(" " + chars)[index] for index in sequence])
            for sequence in ids]

# since we use X = tf.ragged.constant( X_ids, ragged_rank=1 )  # the inner dimension is non-uniform
max_input_length = X_train.shape[1]   # 18

def prepare_date_strs_padded(date_strs):
    X = prepare_date_strs(date_strs)
    if X.shape[1] < max_input_length:
        X = tf.pad(X, [[0, 0], [0, max_input_length - X.shape[1]]])
    return X
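A minimal prediction sketch for this subclassed model (the function name is illustrative; since the inference decoder relies on the GreedyEmbeddingSampler and its own start tokens, the decoder input only needs the right batch size, so zeros are fine):
def predict_date_strs_attention(date_strs):
    X = prepare_date_strs_padded(date_strs)
    X_decoder = tf.zeros(shape=(len(X), max_output_length), dtype=tf.int32)  # content is ignored at inference
    Y_probas = model.predict([X, X_decoder])
    Y_pred = np.argmax(Y_probas, axis=-1)
    return ids_to_date_strs(Y_pred)

for date_str in predict_date_strs_attention(["July 14, 1789", "May 01, 2020"]):
    print(date_str)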
There are still a few interesting features from TF-Addons that you may want to look at:
- Using a BeamSearchDecoder rather than a BasicDecoder for inference. Instead of outputting the character with the highest probability, this decoder keeps track of several candidates and keeps only the most likely sequences of candidates (see chapter 16 in the book for more details).
- Setting masks or specifying sequence_length if the input or target sequences may have very different lengths.
- Using a ScheduledOutputTrainingSampler, which gives you more flexibility than the ScheduledEmbeddingTrainingSampler to decide how to feed the output at time t to the cell at time t+1. By default it feeds the outputs directly to the cell, without computing the argmax ID and passing it through an embedding layer. Alternatively, you can specify a next_inputs_fn function that will be used to convert the cell outputs to inputs at the next step.

Exercise: Go through TensorFlow's Neural Machine Translation with Attention tutorial: https://www.tensorflow.org/tutorials/text/nmt_with_attention.
Simply open the Colab and follow its instructions. Alternatively, if you want a simpler example of using TF-Addons's seq2seq implementation for Neural Machine Translation (NMT), look at the solution to the previous question. The last model implementation will give you a simpler example of using TF-Addons to build an NMT model using attention mechanisms.
This notebook trains a sequence to sequence (seq2seq) model for Spanish to English translation. This is an advanced example that assumes some knowledge of sequence to sequence models.
After training the model in this notebook, you will be able to input a Spanish sentence, such as "¿todavia estan en casa?", and return the English translation: "are you still at home?"
The translation quality is reasonable for a toy example, but the generated attention plot is perhaps more interesting. It shows which parts of the input sentence have the model's attention while translating:
We'll use a language dataset provided by http://www.manythings.org/anki/ This dataset contains language translation pairs in the format:
May I borrow this book? ¿Puedo tomar prestado este libro?
The English sentence and the Spanish sentence are separated by '\t'.
There are a variety of languages available, but we'll use the English-Spanish dataset. For convenience, we've hosted a copy of this dataset on Google Cloud, but you can also download your own copy.
import os
import tensorflow as tf
# Download the file
path_to_zip = tf.keras.utils.get_file(
fname='spa-eng.zip',
origin='http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip',
cache_subdir="/content/drive/MyDrive/Colab Notebooks/data/spa_eng",
extract=True
)
path_to_file = os.path.dirname( path_to_zip ) + "/spa-eng/spa.txt"
https://docs.python.org/3/library/unicodedata.html
unicodedata.normalize(form, unistr)
Return the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.
The Unicode standard defines various normalization forms of a Unicode string, based on the definition of canonical equivalence and compatibility equivalence. In Unicode, several characters can be expressed in various ways. For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
For each character, there are two Normal Forms: normal form C and normal form D.
In addition to these two forms, there are two additional normal forms based on compatibility equivalence. In Unicode, certain characters are supported which normally would be unified with other characters. For example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I). However, it is supported in Unicode for compatibility with existing character sets (e.g. gb2312).
Even if two unicode strings are normalized and look the same to a human reader, if one has combining characters and the other doesn’t, they may not compare equal.
unicodedata.category(chr)
Returns the general category assigned to the character chr as string.
After downloading the dataset, here are the steps we'll take to prepare the data:
import unicodedata
import re
# Converts the unicode file to ascii
def unicode_to_ascii(s):
    # 'NFD' translates each character into its decomposed form
    # "Mn": Mark, Nonspacing  https://blog.csdn.net/xc_zhou/article/details/82079753
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != "Mn")
def preprocess_sentence(w):
    # u"May I borrow this book?" ==> w.lower().strip() ==> "may i borrow this book?"
    # u"¿Puedo tomar prestado este libro?" ==> w.lower().strip() ==> "b'\xc2\xbfpuedo tomar prestado este libro?'"
    w = unicode_to_ascii(w.lower().strip())

    # creating a space between a word and the punctuation following it
    # eg: "he is a boy." => "he is a boy ."
    # Reference: https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
    # \1 refers to the first captured group
    # "may i borrow this book?" ==> "may i borrow this book ? "
    # note: '\xc2\xbf' represents '¿'
    # "b'\xc2\xbfpuedo tomar prestado este libro?'" ==> "b' \xc2\xbf puedo tomar prestado este libro ? '"
    w = re.sub(r"([?.!,¿])", r" \1 ", w)

    # strip the leading/trailing spaces just added
    # "may i borrow this book ? " ==> "may i borrow this book ?"
    w = w.strip()

    # adding a start and an end token to the sentence
    # so that the model knows when to start and stop predicting
    w = '<start> ' + w + ' <end>'
    return w
en_sentence = u"May I borrow this book?"
sp_sentence = u"¿Puedo tomar prestado este libro?"
print( preprocess_sentence(en_sentence) )
print( preprocess_sentence(sp_sentence).encode('utf-8') )
import io

# 1. Remove the accents
# 2. Clean the sentences
# 3. Return word pairs in the format: [ENGLISH, SPANISH]
def create_dataset(path, num_example):
    lines = io.open(path, encoding='UTF-8').read().strip().split('\n')
    word_pairs = [[preprocess_sentence(w) for w in line.split('\t')]
                  for line in lines[:num_example]]
    return zip(*word_pairs)

en, sp = create_dataset(path_to_file, None)
print(en[0])
print(sp[0])
<== def tokenize( langSentence ):
lang_tokenizer = tf.keras.preprocessing.text.Tokenizer( filters='',
char_level=False )
lang_tokenizer.fit_on_texts( langSentence )
tensor_seqs_indx=lang_tokenizer.texts_to_sequences( langSentence )
# tf.keras.preprocessing.sequence.pad_sequences
# maxlen
# Optional Int, maximum length of all sequences. If not provided,
# sequences will be padded to the length of the longest individual sequence.
# padding
# String, 'pre' or 'post' (optional, defaults to 'pre'):
# pad either before or after each sequence.
# value
# Float or String, padding value. (Optional, defaults to 0.)
tensor_seqs_indx=tf.keras.preprocessing.sequence.pad_sequences( tensor_seqs_indx,
padding='post'
)
return tensor_seqs_indx, lang_tokenizer
def load_dataset(path, num_examples=None):
    # creating cleaned input, output pairs
    # target: English; input: Spanish
    targ_lang, inp_lang = create_dataset(path, num_examples)
    input_tensor, inp_lang_tokenizer = tokenize(inp_lang)
    target_tensor, targ_lang_tokenizer = tokenize(targ_lang)
    return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer
Training on the complete dataset of >100,000 sentences will take a long time. To train faster, we can limit the size of the dataset to 30,000 sentences (of course, translation quality degrades with fewer data):
# Try experimenting with the size of that dataset
num_examples = 30000
# input: spanish; target: english
input_tensor, target_tensor,\
inp_lang_tokenizer, targ_lang_tokenizer = load_dataset( path_to_file,
num_examples )
# Calculate max_length of the target tensors
max_length_inp, max_length_targ = input_tensor.shape[1], target_tensor.shape[1]
max_length_inp, max_length_targ
from sklearn.model_selection import train_test_split
# Creating training and validation sets using an 80-20 split
input_train, input_val, target_train, target_val=train_test_split(input_tensor,
target_tensor,
test_size=0.2
)
# Show length
print( len(input_train), len(target_train), len(input_val), len(target_val) )
def convert(lang_tokenizer, tensor):
    for t in tensor:
        if t != 0:  # 0 is the padding index, which has no word
            print(f'{t} ----> {lang_tokenizer.index_word[t]}')

print("Input Language: index to word mapping")
convert(inp_lang_tokenizer, input_train[0])
print()
print("Target Language: index to word mapping")
convert(targ_lang_tokenizer, target_train[0])
BUFFER_SIZE = len( input_train )
BATCH_SIZE = 64
steps_per_epoch = len( input_train ) // BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len( inp_lang_tokenizer.word_index )+1 # +1 for oov #9693
vocab_targ_size = len( targ_lang_tokenizer.word_index )+1 #5111
dataset = tf.data.Dataset.from_tensor_slices( (input_train, target_train)
).shuffle( BUFFER_SIZE )
dataset = dataset.batch( BATCH_SIZE, drop_remainder=True )
example_input_batch, example_target_batch = next( iter(dataset) )
example_input_batch.shape, example_target_batch.shape
Implement an encoder-decoder model with attention, which you can read about in the TensorFlow Neural Machine Translation (seq2seq) tutorial. This example uses a more recent set of APIs and implements the attention equations from the seq2seq tutorial. The following diagram shows that each input word is assigned a weight by the attention mechanism, which is then used by the decoder to predict the next word in the sentence. The picture and formulas below are an example of an attention mechanism from Luong's paper.
The input is put through an encoder model which gives us the encoder output of shape (batch_size, max_length of individual sequence, hidden_size=num_units) and the encoder hidden state of shape (batch_size, hidden_size==num_units).
This tutorial uses Bahdanau attention for the encoder. Let's decide on notation before writing the simplified form:
FC = a fully connected (Dense) layer, EO = the encoder output, H = the decoder hidden state, X = the input to the decoder.
And the pseudo-code:
score = FC( tanh( FC(EO) + FC(H) ) )
attention weights = softmax(score, axis=1). Softmax is applied on the last axis by default, but here we want to apply it on the 1st axis, since the shape of score is (batch_size, max_length of individual sequence, 1) and max_length is the length of our input. Since we are trying to assign a weight to each input time step, the softmax should be applied on that axis.
context vector = sum(attention weights * EO, axis=1). Same reason as above for choosing axis 1.
embedding output = the input to the decoder X passed through an embedding layer.
merged vector = concat(embedding output, expand_dims(context vector, 1)); the context vector is expanded to shape (batch_size, 1, hidden_size==num_units) so it can be concatenated with the embedding output along the last axis, and this merged vector is then given to the GRU.
The shapes of all the vectors at each step have been specified in the comments in the code below:
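Before reading the full classes, here is a quick standalone sanity check of those shapes (a minimal sketch using throwaway Dense layers and random tensors, not part of the model; EO and H follow the notation above):
# Minimal shape check for the attention pseudo-code above (illustrative only)
import tensorflow as tf
batch_size, max_length, num_units, attn_units = 64, 16, 1024, 10
EO = tf.random.normal( (batch_size, max_length, num_units) ) # stand-in encoder output
H = tf.random.normal( (batch_size, num_units) )              # stand-in decoder hidden state
W1, W2 = tf.keras.layers.Dense( attn_units ), tf.keras.layers.Dense( attn_units )
V = tf.keras.layers.Dense( 1 )
score = V( tf.nn.tanh( W1( tf.expand_dims(H, 1) ) + W2(EO) ) ) # ==> (64, 16, 1)
attention_weights = tf.nn.softmax( score, axis=1 )             # softmax over the time axis
context_vector = tf.reduce_sum( attention_weights*EO, axis=1 ) # ==> (64, 1024)
print( score.shape, attention_weights.shape, context_vector.shape )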
class Encoder(tf.keras.Model):
def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
super(Encoder, self).__init__()
self.batch_sz = batch_sz
self.enc_units = enc_units
self.embedding = tf.keras.layers.Embedding( input_dim=vocab_size,
output_dim=embedding_dim
)
self.gru = tf.keras.layers.GRU( self.enc_units,
return_sequences=True,
return_state=True,
recurrent_initializer='glorot_uniform'
)
def call(self, x, hidden):
# x : (batch_size, max_length of individual sequence)
x = self.embedding(x) # ==> (batch_size, max_length, embedding_dim)
output, state = self.gru(x, initial_state=hidden)
# output : (batch_size, max_length, self.enc_units)
# final state : (batch_size, self.enc_units)
return output, state
def initialize_hidden_state(self):
# encoder hidden state of shape : (batch_size, hidden_size==num_units)
return tf.zeros( (self.batch_sz, self.enc_units) )
# vocab_inp_size = len( inp_lang_tokenizer.word_index )+1 # +1 for oov
# vocab_inp_size=9693, embedding_dim = 256, units = 1024, BATCH_SIZE = 64
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)
# sample input
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)
print("Encoder output shape: (batch size, sequence length, units)", sample_output.shape)
print("Encoder Hidden state shape: (batch size, units)", sample_hidden.shape)
class BahdanauAttention( tf.keras.layers.Layer ):
def __init__(self, units):
super(BahdanauAttention, self).__init__()
self.W1 = tf.keras.layers.Dense(units)
self.W2 = tf.keras.layers.Dense(units)
self.V = tf.keras.layers.Dense(1)
def call(self, query, values):
# query(decoder previous hidden state) shape == (batch_size, hidden size)
# we are doing this to broadcast addition along the time axis to calculate the score
query_with_time_axis = tf.expand_dims( query, 1 )
# query_with_time_axis shape == (batch_size, 1, hidden size)
# values shape == (batch_size, max_len, hidden size==num_units) #from encoder output
        # the last axis is 1 because self.V projects the score down to a single value per time step
        # the + below broadcasts (batch_size, 1, units) against (batch_size, max_length, units)
score = self.V( tf.nn.tanh(self.W1(query_with_time_axis)+#==>(batch_size,1,units)
self.W2(values) #==>(batch_size,max_length,units)
) # ==>(batch_size, max_length, units)
) #==>alignment score shape : (batch_size, max_length, 1)
# normalization
attention_weights = tf.nn.softmax(score, axis=1)#==>(batch_size,max_length,1)
# context_vector shape after sum == (batch_size, hidden_size)
context_vector = attention_weights*values
context_vector = tf.reduce_sum(context_vector, axis=1)#==>(batch_size, hidden_size==num_units)
return context_vector, attention_weights
attention_layer = BahdanauAttention(10)
attention_result, attention_weights = attention_layer( sample_hidden,
sample_output )
print("Attention result shape: (batch size, units)", attention_result.shape )
print("Attention weights shape: (batch size, sequence_length, 1", attention_weights.shape )
class Decoder( tf.keras.Model ):
def __init__( self, vocab_size, embedding_dim, dec_units, batch_sz ):
super( Decoder, self ).__init__()
self.batch_sz = batch_sz
self.dec_units = dec_units
self.embedding = tf.keras.layers.Embedding( vocab_size, embedding_dim )
self.gru = tf.keras.layers.GRU( self.dec_units,
return_sequences=True,
return_state=True,
recurrent_initializer='glorot_uniform' )
self.fc = tf.keras.layers.Dense( vocab_size )
# used for attention
self.attention = BahdanauAttention( self.dec_units )
        # decoder_units == encoder_units, since both the decoder GRU cell and the decoder's attention
        # take the encoder's final state at the beginning
def call( self, x, hidden, enc_output ):
# x : target value(training time) OR decoder actual output(inference time)
# enc_output shape == (batch_size, max_length, self.enc_units)
# hidden : encoder final hidden state (batch_size, self.enc_units)
# OR decoder previous time step hidden state (batch_size, self.dec_units)
context_vector, attention_weights = self.attention( hidden, enc_output )
#context_vector: (batch_size, hidden_size==num_units OR self.dec_units)
# x shape (batch_size, 1 time step)
x = self.embedding(x) # ==> (batch_size, 1, embedding_dim)
x = tf.concat([ tf.expand_dims(context_vector,1), #==>(batch_size, 1, hidden_size==num_units)
x # (batch_size, 1, embedding_dim)
], axis=-1) #==>(batch_size, 1, embedding+hidden_size)
# passing the concatenated vector to the GRU
output, state = self.gru(x)
# output : (batch_size, 1, hidden_size=self.dec_units)
# current state : (batch_size, self.dec_units)
# output shape == (batch_size * 1, hidden_size==self.dec_units)
output = tf.reshape( output, (-1, output.shape[2]) )
# output shape == (batch_size, target vocab size)
x = self.fc(output)
return x, state, attention_weights
decoder = Decoder( vocab_targ_size, embedding_dim, units, BATCH_SIZE )
# target value
sample_decoder_output, _, _ = decoder( tf.random.uniform( (BATCH_SIZE,1) ),
                                       sample_hidden, # encoder final hidden state or decoder t-1 hidden state
sample_output )# encoder output sequences
print( "Decoder output shape: (batch_size, vocab size)", sample_decoder_output.shape )
optimizer = tf.keras.optimizers.Adam()
# https://blog.csdn.net/Linli522362242/article/details/108414534
loss_object = tf.keras.losses.SparseCategoricalCrossentropy( from_logits=True,
reduction='none' )
def loss_function( real, pred ):
# if real == 0 return True ==> logical_not ==> False ==> tf.cast ==>0.
# if real == 1 return False ==> logical_not ==> True ==> tf.cast ==>1.
mask = tf.math.logical_not( tf.math.equal(real,0) )
loss_ = loss_object( real, pred )
#https://blog.csdn.net/Linli522362242/article/details/115518150
mask = tf.cast( mask, dtype=loss_.dtype )
loss_ *= mask # to ignore the padding tokens(whose ID is 0) or
# maskout time step whose value is padded with 0
# so the masked time steps will not contribute to the loss (their loss will be 0)
return tf.reduce_mean( loss_ )
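As a quick illustration of the masking (a toy example with a made-up target pair and random logits, not part of the training pipeline), a padded position contributes nothing to the loss:
# Toy check: the padded entry (ID 0) is masked out of the loss
real_toy = tf.constant( [7, 0] )                    # second target is a padding token
pred_toy = tf.random.normal( (2, vocab_targ_size) ) # random logits for a 2-example batch
print( loss_object(real_toy, pred_toy).numpy() )    # both per-example losses are non-zero
print( loss_function(real_toy, pred_toy).numpy() )  # only the first example contributes to the mean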
checkpoint_dir = '/content/drive/MyDrive/Colab Notebooks/checkpoints'
checkpoint_prefix = os.path.join( checkpoint_dir, "ckpt" )
checkpoint = tf.train.Checkpoint( optimizer=optimizer,
encoder=encoder,
decoder=decoder )
# tf.function
# https://blog.csdn.net/Linli522362242/article/details/107459161
@tf.function
def train_step(inp, targ, enc_hidden):
loss = 0
with tf.GradientTape() as tape:
enc_output, enc_hidden = encoder(inp, enc_hidden)
dec_hidden = enc_hidden
        # the decoder's first input is the start token for every sequence in the batch
        # (equivalently dec_input = tf.expand_dims( targ[:,0], 1 ), since each target begins with the start token)
        dec_input = tf.expand_dims( [targ_lang_tokenizer.word_index['<start>']]*BATCH_SIZE, 1 )
# Teacher forcing - feeding the target as the next input
for t in range(1, targ.shape[1]):
# passing enc_output to the decoder
predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
loss += loss_function( targ[:, t], predictions )
# using teacher forcing
dec_input = tf.expand_dims( targ[:,t], 1 )
batch_loss = ( loss/int(targ.shape[1]) )
variables = encoder.trainable_variables + decoder.trainable_variables
gradients = tape.gradient(loss, variables)
optimizer.apply_gradients(zip(gradients, variables))
return batch_loss
# vocab_inp_size = len( inp_lang_tokenizer.word_index )+1 #==>9693=9692 words + 1 oov
# GRU( units =1024 )
# https://blog.csdn.net/Linli522362242/article/details/114941730
# tf.Variable encoder.trainable_variables:
# vocab_targ_size = len( targ_lang_tokenizer.word_index )+1 #==>5111=5110 words + 1 oov
# GRU( units =1024 )
# Dense( vocab_size=5111 )
# tf.Variable decoder.trainable_variables:
import time
EPOCHS =10
for epoch in range(EPOCHS):
start = time.time()
enc_hidden = encoder.initialize_hidden_state() ###
total_loss = 0
# BUFFER_SIZE = len( input_train )
# BATCH_SIZE = 64
# steps_per_epoch = len( input_train ) // BATCH_SIZE =24000/64=375
# dataset = tf.data.Dataset.from_tensor_slices( (input_train, target_train)
# ).shuffle( BUFFER_SIZE )
# dataset = dataset.batch( BATCH_SIZE, drop_remainder=True )
for (batch, (inp, targ)) in enumerate( dataset.take(steps_per_epoch) ):
batch_loss = train_step( inp, targ, enc_hidden )
total_loss += batch_loss
if batch % 100 == 0:
print( f'Epoch {epoch+1} Batch {batch} Loss {batch_loss.numpy():.4f}' )
# saving (checkpoint) the model every 2 epochs
if (epoch+1) % 2 ==0:
checkpoint.save( file_prefix=checkpoint_prefix )
print( f'Epoch {epoch+1} Loss {total_loss/steps_per_epoch:.4f}' )
print( f'Time taken for 1 epoch {time.time()-start:.2f} sec\n' )
import numpy as np
# units = 1024
def evaluate(sentence):
    # note: batch_size must be 1 here, since we evaluate and plot a single sequence
attention_plot = np.zeros( (max_length_targ, max_length_inp) )
sentence = preprocess_sentence(sentence)
inputs = [ inp_lang_tokenizer.word_index[i] for i in sentence.split(' ') ]
inputs = tf.keras.preprocessing.sequence.pad_sequences( [inputs], # extend dim to 2D (+batch_size dimension)
maxlen=max_length_inp,
padding='post'
)
inputs = tf.convert_to_tensor( inputs )
result = ''
    # initialize the Encoder hidden state (batch size is 1 at inference time)
    hidden = [ tf.zeros( (1,units) ) ] # shape (batch_size=1, Encoder units==Decoder units)
enc_out, enc_final_hidden = encoder( inputs, hidden )
dec_hidden = enc_final_hidden
    dec_input = tf.expand_dims( [targ_lang_tokenizer.word_index['<start>']], 0 ) # 0 adds a batch_size dimension
for t in range( max_length_targ ):
predictions, dec_hidden, attention_weights = decoder( dec_input,
dec_hidden,
enc_out )
# storing the attention weights of decoder to plot later on
        # attention_weights: (1, max_length_inp, 1) ==> reshape to (max_length_inp,)
attention_weights = tf.reshape( attention_weights, (-1,) )
attention_plot[t] = attention_weights.numpy() # batch_size must one
# predictions shape(batch_size, vocab_targ_size) for current time step ############
predicted_id = tf.argmax( predictions[0] ).numpy()
result += targ_lang_tokenizer.index_word[predicted_id] + ' '
        if targ_lang_tokenizer.index_word[predicted_id] == '<end>':
return result, sentence, attention_plot
# the predicted ID is fed back into the model
dec_input = tf.expand_dims([predicted_id],0) # 0 for appending a batch_size dimension
return result, sentence, attention_plot
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
# function for plotting the attention weights
def plot_attention(attention, sentence, predicted_sentence):
fig = plt.figure( figsize=(10,10) )
ax = fig.add_subplot(1,1,1)
ms=ax.matshow( attention, cmap='viridis' )
fig.colorbar(ms)
fontdict = {'fontsize':14}
ax.set_xticklabels( ['']+sentence, fontdict=fontdict, rotation=90 )
ax.set_yticklabels( ['']+predicted_sentence, fontdict=fontdict )
ax.xaxis.set_major_locator( ticker.MultipleLocator(1) )
ax.yaxis.set_major_locator( ticker.MultipleLocator(1) )
plt.show()
def translate( sentence ):
result, sentence, attention_plot = evaluate(sentence)
print('Input:', sentence)
print('Predicted translation:', result)
attention_plot = attention_plot[:len( result.split(' ') ),
:len( sentence.split(' ') )
]
plot_attention( attention_plot, sentence.split(' '), result.split(' ') )
# restoring the latest checkpoint in checkpoint_dir
checkpoint.restore( tf.train.latest_checkpoint( checkpoint_dir) )
translate(u'hace mucho frio aqui.')
The simplest way to use recent language models is to use the excellent transformers library, open sourced by Hugging Face(https://huggingface.co/transformers/). It provides many modern neural net architectures (including BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet and more) for Natural Language Processing (NLP), including many pretrained models. It relies on either TensorFlow or PyTorch. Best of all: it's amazingly simple to use.
First, let's load a pretrained model. In this example, we will use OpenAI's GPT model (https://huggingface.co/transformers/v2.0.0/pretrained_models.html), with an additional Language Model on top (just a linear layer with weights tied to the input embeddings). Let's import it and load the pretrained weights (this will download about 445MB of data to ~/.cache/torch/transformers):
!pip install transformers
from transformers import TFOpenAIGPTLMHeadModel
model = TFOpenAIGPTLMHeadModel.from_pretrained("openai-gpt")
Next we will need a specialized tokenizer for this model. This one will try to use the spaCy and ftfy libraries if they are installed, or else it will fall back to BERT's BasicTokenizer followed by Byte-Pair Encoding (which should be fine for most use cases). https://huggingface.co/transformers/v2.0.0/model_doc/gpt.html
from transformers import OpenAIGPTTokenizer
tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
Now let's use the tokenizer to tokenize and encode the prompt text:
prompt_text = "This royal throne of kings, this sceptred isle"
encoded_prompt = tokenizer.encode( prompt_text, add_special_tokens=False,
return_tensors="tf" )
encoded_prompt
tokenizer.convert_ids_to_tokens([ 616, 5751, 6404, 498, 9606, 240, 616, 26271, 7428, 16187])
Easy! Next, let's use the model to generate text after the prompt. We will generate 5 different sentences, each starting with the prompt text, followed by 40 additional tokens. For an explanation of what all the hyperparameters do, make sure to check out this great blog post by Patrick von Platen (from Hugging Face https://huggingface.co/blog/how-to-generate). You can play around with the hyperparameters to try to obtain better results.
In transformers, we set do_sample=True and deactivate Top-K sampling (more on this later) via top_k=0. In the following, we will fix random_seed=0 for illustration purposes. Feel free to change the random_seed to play around with the model.
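Note that the generation cell below does not actually set a seed; if you want the sampled sequences to be reproducible, one option (an assumption about how you run the notebook, not something generate() does for you) is to fix TensorFlow's global seed first:
import tensorflow as tf
tf.random.set_seed( 0 ) # makes the sampled continuations repeatable across runs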
Fan et al. (2018) introduced a simple, but very powerful sampling scheme, called Top-K sampling. In Top-K sampling, the K most likely next words are filtered and the probability mass is redistributed among only those K next words. GPT-2 adopted this sampling scheme, which was one of the reasons for its success in story generation.
After "The",
Having set K = 6, in both sampling steps we limit our sampling pool to 6 words. While the 6 most likely words, defined as encompass only ca. two-thirds of the whole probability mass in the first step, it includes almost all of the probability mass in the second step. Nevertheless, we see that it successfully eliminates the rather weird candidates in the second sampling step.
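To make the filtering step concrete, here is a minimal NumPy sketch of Top-K filtering with made-up probabilities (illustrative only; transformers applies this internally when you pass top_k to generate()):
import numpy as np
def top_k_filter( probs, k ):
    # keep the k most likely words, zero out the rest, then renormalize
    top_idx = np.argsort( probs )[-k:]
    filtered = np.zeros_like( probs )
    filtered[top_idx] = probs[top_idx]
    return filtered / filtered.sum()
probs = np.array([0.30, 0.25, 0.15, 0.10, 0.08, 0.05, 0.04, 0.03]) # made-up next-word distribution
print( top_k_filter(probs, k=6) ) # only the 6 most likely words keep non-zero mass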
Instead of sampling only from the most likely K words, Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability p. The probability mass is then redistributed among this set of words. This way, the size of the set of words (i.e., the number of words in the set) can dynamically increase and decrease according to the next word's probability distribution. Ok, that was very wordy, let's visualize.
Having set p=0.92, Top-p sampling picks the minimum number of words needed to together exceed p=92% of the probability mass. In the first example this included the 9 most likely words, whereas it only had to pick the top 3 words in the second example to exceed 92%. Quite simple actually! It can be seen that Top-p sampling keeps a wide range of words when the next word is arguably less predictable (as in the first step), and only a few words when the next word seems more predictable (as in the second step).
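The corresponding Top-p step, again as a made-up-numbers sketch rather than the library's actual implementation (transformers applies it internally when you pass top_p to generate()):
import numpy as np
def top_p_filter( probs, p ):
    # keep the smallest set of words whose cumulative probability exceeds p, then renormalize
    order = np.argsort( probs )[::-1]             # most likely first
    cumulative = np.cumsum( probs[order] )
    cutoff = np.searchsorted( cumulative, p ) + 1 # number of words needed to pass p
    filtered = np.zeros_like( probs )
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()
probs = np.array([0.30, 0.25, 0.15, 0.10, 0.08, 0.05, 0.04, 0.03]) # same made-up distribution
print( top_p_filter(probs, p=0.92) ) # here 6 words are needed to pass 92% of the mass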
repetition_penalty
can be used to penalize words that were already generated or belong to the context. It was first introduced by Keskar et al. (2019) and is also used in the training objective in Welleck et al. (2019). It can be quite effective at preventing repetitions, but it seems to be very sensitive to different models and use cases, e.g. see this discussion on GitHub.
num_sequences = 5
length = 40
generated_sequences = model.generate(
input_ids = encoded_prompt,
do_sample=True,
    max_length = length + len(encoded_prompt[0]), # [0] selects the single sequence in the batch
temperature=1.0,
top_k=0,
    top_p=0.9, # nucleus (Top-p) sampling threshold
repetition_penalty=1.0,
num_return_sequences=num_sequences,
)
generated_sequences
Now let's decode the generated sequences and print them:
for sequence in generated_sequences:
text = tokenizer.decode( sequence, clean_up_tokenization_spaces=True )
print(text)
print('-'*80)
You can try more recent (and larger) models, such as GPT-2, CTRL, Transformer-XL or XLNet, which are all available as pretrained models in the transformers library, including variants with Language Models on top. The preprocessing steps vary slightly between models, so make sure to check out this generation example from the transformers documentation (that example uses PyTorch, but it will work with very few tweaks, such as adding TF at the beginning of the model class name, removing the .to() method calls, and using return_tensors="tf" instead of "pt").
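For instance, a minimal GPT-2 sketch along those lines (TFGPT2LMHeadModel and GPT2Tokenizer are the TensorFlow model class and tokenizer that transformers ships for GPT-2; the sampling hyperparameters below are just placeholders to experiment with):
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer
gpt2_tokenizer = GPT2Tokenizer.from_pretrained( "gpt2" )
gpt2_model = TFGPT2LMHeadModel.from_pretrained( "gpt2" )
gpt2_prompt = gpt2_tokenizer.encode( prompt_text, return_tensors="tf" ) # "tf" instead of "pt"
gpt2_sequences = gpt2_model.generate( gpt2_prompt,
                                      do_sample=True,
                                      top_p=0.9,
                                      max_length=40 + len( gpt2_prompt[0] ),
                                      num_return_sequences=3 )
for sequence in gpt2_sequences:
    print( gpt2_tokenizer.decode( sequence, clean_up_tokenization_spaces=True ) )
    print( '-'*80 )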