Decoding Strategies You Need to Know for Response Generation

Introduction

Deep learning has been deployed for many NLP tasks, such as translation, image captioning, and dialogue systems. In machine translation, it reads the source language (input) and generates text in the desired language (output). Similarly, in a dialogue system, it generates a response given a context. This is also known as Natural Language Generation (NLG).

The model is split into two parts: an encoder and a decoder. The encoder reads the input text and returns a vector representing that input. The decoder then takes that vector and generates the corresponding output text.

Figure 1: Encoder-Decoder Architecture

Text is commonly generated one token at a time. Without proper techniques, the generated response can be very generic and boring. In this article, we will explore the following strategies:

  • Greedy
  • Beam Search
  • Random Sampling
  • Temperature
  • Top-K Sampling
  • Nucleus Sampling

Decoding Strategies

At each timestep during decoding, we take the vector that carries information from one step to the next and apply the softmax function to it, converting it into a probability distribution over every word in the vocabulary.

Equation 1: Softmax Function. x is the token at timestep i; u is the vector that contains the value of every token in the vocabulary.
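
As a minimal sketch of this step, the softmax can be written in a few lines of NumPy (the 5-token vocabulary size and logit values below are made up for illustration):

import numpy as np

def softmax(u):
    # Subtract the max logit for numerical stability before exponentiating
    u = u - np.max(u)
    exp_u = np.exp(u)
    return exp_u / exp_u.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])  # made-up logits for a 5-token vocabulary
probs = softmax(logits)
print(probs, probs.sum())  # a probability for each token, summing to 1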

Greedy Approach

This approach is the simplest. At each timestep, it simply chooses whichever token is the most probable.

Context:            Try this cake. I baked it myself.
Optimal Response:   This cake tastes great.
Generated Response: This is okay.

However, this approach may generate a suboptimal response, as shown in the example above: the generated response is not the best one possible. This is because training data commonly contains examples like “That is […]”; if we always generate the single most probable token at a time, the model might output “is” instead of “cake”.
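
A minimal sketch of greedy decoding, assuming a hypothetical step_fn that wraps the decoder and softmax and returns the next-token distribution:

import numpy as np

def greedy_decode(step_fn, max_len=10, eos_id=0):
    # step_fn(tokens) stands in for the decoder plus softmax: given the
    # tokens generated so far, it returns the next-token distribution.
    tokens = []
    for _ in range(max_len):
        probs = step_fn(tokens)
        next_id = int(np.argmax(probs))  # always pick the single most probable token
        tokens.append(next_id)
        if next_id == eos_id:  # stop once the end-of-sequence token is produced
            break
    return tokens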

Beam Search

Exhaustive search could solve the previous problem, since it searches the whole space; however, it is computationally expensive. Suppose the vocabulary contains 10,000 tokens: to generate a sentence 10 tokens long, the search space would be (10,000)¹⁰.

Beam search can cope with this problem. At each timestep, it expands every candidate with all possible tokens in the vocabulary, then keeps only the top B candidates with the highest probability. Those B candidates move on to the next timestep, and the process repeats. In the end, there are only B candidates, so the search space at each timestep is only 10,000 × B.
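
A sketch of the idea, again assuming the same hypothetical step_fn that returns the next-token distribution; beam_width plays the role of B:

import numpy as np

def beam_search(step_fn, beam_width=3, max_len=10, eos_id=0):
    # Each beam is a (tokens, cumulative log-probability) pair; summing
    # log-probabilities avoids underflow from multiplying raw probabilities.
    beams = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos_id:
                candidates.append((tokens, score))  # finished beams carry over unchanged
                continue
            log_probs = np.log(step_fn(tokens) + 1e-12)
            for tok_id, lp in enumerate(log_probs):  # expand with every vocabulary token
                candidates.append((tokens + [tok_id], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # prune to the top-B candidates
    return beams[0][0]  # the highest-scoring sequence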

Context:    Try this cake. I baked it myself.
Response A: That cake tastes great.
Response B: Thank you.

But sometimes it chooses an even more probable response (Response B). In this case, that makes perfect sense. But imagine a model that likes to play it safe and keeps generating “I don’t know” or “Thank you” for most contexts; that is a pretty bad bot.

Random Sampling

Alternatively, we can look into stochastic approaches to avoid the response being generic. We can utilize the probability of each token from the softmax function to generate the next token.

Suppose we are generating the first token for the context “I love watching movies”. The figure below shows the probability of what the first word should be.

Figure 3: Probability of each word. The x-axis is the token index; e.g., index 37 corresponds to the word “yeah”.

If we use a greedy approach, the token “i” will be chosen. With random sampling, however, the token “i” only has a probability of around 0.2 of occurring. At the same time, any token with a probability of 0.0001 can also occur; it is just very unlikely.
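
In code, this is just a weighted draw from the softmax output (the vocabulary and probabilities below are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)

# A made-up next-token distribution over a tiny 5-token vocabulary
vocab = ["i", "yeah", "me", "you", "the"]
probs = np.array([0.20, 0.15, 0.10, 0.05, 0.50])

token = rng.choice(vocab, p=probs)  # a weighted draw instead of argmax
print(token)  # "the" is likeliest, but any token with nonzero probability can appear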

Random Sampling with Temperature

Random sampling, by itself, could potentially generate a very random word by chance. Temperature is used to increase the probability of the probable tokens while reducing the probability of the improbable ones. Usually, the range is 0 < temp ≤ 1; note that when temp = 1, the softmax output is unchanged and there is no effect.

Equation 2: Random sampling with temperature. The temperature t is used to scale the value of each token before it goes into the softmax function.
Figure 4: Random sampling vs. random sampling with temperature

In Figure 4, with temp=0.5, the most probable words like “i”, “yeah”, and “me” have more chance of being generated. At the same time, this also lowers the probability of the less probable ones, although it does not stop them from occurring.
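
A sketch of Equation 2 with the same made-up logits as before: the logits are divided by the temperature before the softmax, so temp=0.5 sharpens the distribution while temp=1 leaves it unchanged.

import numpy as np

def softmax_with_temperature(logits, temp=1.0):
    # Scale logits by 1/temp before the softmax; temp < 1 sharpens
    # the distribution, temp = 1 leaves it unchanged.
    u = np.asarray(logits) / temp
    u = u - np.max(u)  # numerical stability
    exp_u = np.exp(u)
    return exp_u / exp_u.sum()

logits = [2.0, 1.0, 0.5, 0.1, -1.0]  # made-up logits
print(softmax_with_temperature(logits, temp=1.0))  # plain softmax
print(softmax_with_temperature(logits, temp=0.5))  # probable tokens gain probability mass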

Top-K Sampling

Top-K sampling is used to ensure that the less probable words have no chance at all: only the top K most probable tokens are considered for generation.

Figure 5: Distributions of the three strategies: random sampling, random sampling with temperature, and top-K sampling

The tokens with indices between 50 and 80 have some small probability if we use random sampling with temperature = 0.5 or 1.0. With top-K sampling (K=10), those tokens have no chance of being generated. Note that we can also combine top-K sampling with temperature, but you kinda get the idea already, so we choose not to discuss it here.

This sampling technique has been adopted in many recent generation tasks, and its performance is quite good. One limitation of this approach is that the number K must be defined in advance. Suppose we choose K=300, but at some decoding timestep the model is sure that there should only be 10 highly probable words; using top-K means we also consider the other 290 far less probable words.
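
A minimal sketch of the filtering step, assuming probs is the softmax output over the vocabulary:

import numpy as np

def top_k_sample(probs, k=10, rng=None):
    # Zero out everything except the K most probable tokens, renormalize, sample.
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    top_ids = np.argsort(probs)[-k:]  # indices of the K largest probabilities
    filtered = np.zeros_like(probs)
    filtered[top_ids] = probs[top_ids]
    filtered /= filtered.sum()  # re-scale so the K survivors sum to 1
    return int(rng.choice(len(probs), p=filtered))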

Nucleus Sampling

Nucleus sampling is similar to top-K sampling. Instead of focusing on the top K words, nucleus sampling focuses on the smallest possible set of top words V^(p) such that the sum of their probabilities is ≥ p. The tokens that are not in V^(p) have their probabilities set to 0; the rest are re-scaled to ensure that they sum to 1.

Equation 3: Nucleus sampling. V^(p) is the smallest possible set of tokens whose cumulative probability is ≥ p. P(x|…) is the probability of generating token x given the previously generated tokens x_1 through x_(i-1).

The intuition is that when the model is very certain about some tokens, the set of potential candidate tokens is small; otherwise, there will be more potential candidate tokens.

Certain → those few tokens have high probability, so the sum of a few tokens is enough to exceed p.
Uncertain → many tokens have small probability, so the sum of many tokens is needed to exceed p.
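
Following the definition above, a sketch of the nucleus filtering step (probs is again assumed to be the softmax output):

import numpy as np

def nucleus_sample(probs, p=0.5, rng=None):
    # Keep the smallest set of most-probable tokens whose cumulative probability >= p.
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]  # token ids from most to least probable
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1  # size of the nucleus V^(p)
    nucleus_ids = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[nucleus_ids] = probs[nucleus_ids]
    filtered /= filtered.sum()  # re-scale the nucleus so it sums to 1
    return int(rng.choice(len(probs), p=filtered))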

Figure 6: Distributions of top-K and nucleus sampling

Comparing nucleus sampling (p=0.5) with top-K sampling (K=10), we can see that nucleus sampling does not consider the token “you” to be a candidate. This shows that, unlike top-K sampling, it can adapt to different cases and select different numbers of tokens.

Summary

  • Greedy: Select the most probable token at each step
  • Beam Search: Select the most probable response
  • Random Sampling: Sample randomly based on probability
  • Temperature: Shrink or enlarge probabilities
  • Top-K Sampling: Select from the top K probable tokens
  • Nucleus Sampling: Dynamically choose the number K (sort of)

Commonly, the top choices among researchers are beam search, top-K sampling (with temperature), and nucleus sampling.

Conclusion

We have gone through a list of different ways to decode a response. These techniques can be applied to different generation tasks, e.g., image captioning, translation, and story generation. Using a good model with bad decoding strategies, or a bad model with good decoding strategies, is not enough: a good balance between the two can make the generation a lot more interesting.

Translated from: https://towardsdatascience.com/decoding-strategies-that-you-need-to-know-for-response-generation-ba95ee0faadc
