The Illustrated Transformer (Translated)

I have been studying NLP recently and a colleague recommended this article. My English is limited enough that I was lost after three lines, so I ended up translating the whole piece, studying it as I went. There is still a lot to learn.

The Illustrated Transformer

In the previous post, we looked at Attention – a ubiquitous method in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neural Machine Translation model in specific tasks. The biggest benefit, however, comes from how The Transformer lends itself to parallelization. It is in fact Google Cloud’s recommendation to use The Transformer as a reference model for their Cloud TPU offering. So let’s try to break the model apart and look at how it functions.

The Transformer was proposed in the paper Attention is All You Need. A TensorFlow implementation of it is available as a part of the Tensor2Tensor package. Harvard’s NLP group created a guide annotating the paper with a PyTorch implementation. In this post, we will attempt to oversimplify things a bit and introduce the concepts one by one, to hopefully make it easier to understand for people without in-depth knowledge of the subject matter.

A High-Level Look

Let’s begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language, and output its translation in another.

 

Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them.

 

The encoding component is a stack of encoders (the paper stacks six of them on top of each other – there’s nothing magical about the number six, one can definitely experiment with other arrangements). The decoding component is a stack of decoders of the same number.

 

The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:

 

The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We’ll look closer at self-attention later in the post.

The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
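To make the per-position independence concrete, here is a minimal NumPy sketch of a position-wise feed-forward network. The two-layer, ReLU-activated form and the 2048-dimensional inner layer come from the original paper; the weight values below are random stand-ins, not trained parameters.

```python
import numpy as np

d_model, d_ff = 512, 2048                     # sizes used in the paper
rng = np.random.default_rng(0)

# Random stand-ins for the trained parameters of ONE feed-forward network.
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)

def feed_forward(x):
    """Apply the same two-layer network to every position independently.
    x: (seq_len, d_model) -> (seq_len, d_model)"""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Three positions flow through the exact same weights; no position looks at another.
out = feed_forward(rng.standard_normal((3, d_model)))
print(out.shape)  # (3, 512)
```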

The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models).

 

Bringing The Tensors Into The Picture

Now that we’ve seen the major components of the model, let’s start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.

As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.

 

Each word is embedded into a vector of size 512. We'll represent those vectors with these simple boxes.

The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors, each of size 512 – in the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that’s directly below. The size of this list is a hyperparameter we can set – basically it would be the length of the longest sentence in our training dataset.

After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.
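As a toy illustration of the embedding step, the sketch below maps each word to a 512-dimensional vector via a lookup table. The three-word vocabulary and the random table are made up for the example; in a real model the table is learned during training.

```python
import numpy as np

d_model = 512
vocab = {"je": 0, "suis": 1, "étudiant": 2}        # toy vocabulary
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((len(vocab), d_model))  # stand-in for learned embeddings

def embed(sentence):
    """Turn a list of words into a (seq_len, d_model) matrix, one row per word."""
    return embedding_table[[vocab[w] for w in sentence]]

x = embed(["je", "suis", "étudiant"])
print(x.shape)  # (3, 512) -- one 512-dimensional vector per input word
```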

 

Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.

Next, we’ll switch up the example to a shorter sentence and we’ll look at what happens in each sub-layer of the encoder.

 

Now We’re Encoding!

As we’ve mentioned already, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a ‘self-attention’ layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.

 

The word at each position passes through a self-attention process. Then, they each pass through a feed-forward neural network -- the exact same network with each vector flowing through it separately.

 

Self-Attention at a High Level

Don’t be fooled by me throwing around the word “self-attention” like it’s a concept everyone should be familiar with. I had personally never come across the concept until reading the Attention is All You Need paper. Let us distill how it works.

Say the following sentence is an input sentence we want to translate:

The animal didn't cross the street because it was too tired

What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s a simple question to a human, but not as simple to an algorithm.

When the model is processing the word “it”, self-attention allows it to associate “it” with “animal”.

As the model processes each word (each position in the input sequence), self attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.

If you’re familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words/vectors it has processed with the current one it’s processing. Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.

 

As we are encoding the word "it" in encoder #5 (the top encoder in the stack), part of the attention mechanism was focusing on "The Animal", and baked a part of its representation into the encoding of "it".

Be sure to check out the Tensor2Tensor notebook where you can load a Transformer model, and examine it using this interactive visualization.

 

Self-Attention in Detail

Let’s first look at how to calculate self-attention using vectors, then proceed to look at how it’s actually implemented – using matrices.

The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.

Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have dimensionality of 512. They don’t HAVE to be smaller, this is an architecture choice to make the computation of multiheaded attention (mostly) constant.
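A minimal sketch of this projection step, using the dimensions from the post (512-dimensional embeddings projected down to 64-dimensional query/key/value vectors). The weight matrices here are random placeholders for the trained WQ, WK and WV.

```python
import numpy as np

d_model, d_k = 512, 64
rng = np.random.default_rng(0)
WQ = rng.standard_normal((d_model, d_k)) * 0.02   # stand-ins for the trained matrices
WK = rng.standard_normal((d_model, d_k)) * 0.02
WV = rng.standard_normal((d_model, d_k)) * 0.02

x1 = rng.standard_normal(d_model)                 # embedding of the first word
q1, k1, v1 = x1 @ WQ, x1 @ WK, x1 @ WV
print(q1.shape, k1.shape, v1.shape)               # (64,) (64,) (64,)
```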

 

Multiplying x1 by the WQ weight matrix produces q1, the "query" vector associated with that word. We end up creating a "query", a "key", and a "value" projection of each word in the input sentence.

What are the “query”, “key”, and “value” vectors? 

They’re abstractions that are useful for calculating and thinking about attention. Once you proceed with reading how attention is calculated below, you’ll know pretty much all you need to know about the role each of these vectors plays.

The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.

 

The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper – 64. This leads to having more stable gradients. There could be other possible values here, but this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they’re all positive and add up to 1.

 

This softmax score determines how much each word will be expressed at this position. Clearly the word at this position will have the highest softmax score, but sometimes it’s useful to attend to another word that is relevant to the current word.

The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown-out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).

The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
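Putting steps one through six together for a two-word input (“Thinking”, “Machines”), here is a NumPy sketch of the whole vector-level calculation for position #1. All weights and embeddings are random stand-ins, so the numbers are illustrative only.

```python
import numpy as np

d_model, d_k = 512, 64
rng = np.random.default_rng(0)
WQ, WK, WV = (rng.standard_normal((d_model, d_k)) * 0.02 for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

X = rng.standard_normal((2, d_model))            # embeddings of "Thinking", "Machines"
Q, K, V = X @ WQ, X @ WK, X @ WV                 # step 1: query/key/value vectors

scores = np.array([Q[0] @ K[0], Q[0] @ K[1]])    # step 2: dot products q1·k1 and q1·k2
scores = scores / np.sqrt(d_k)                   # step 3: divide by 8 (the square root of 64)
weights = softmax(scores)                        # step 4: softmax
weighted_values = weights[:, None] * V           # step 5: scale each value vector by its score
z1 = weighted_values.sum(axis=0)                 # step 6: sum -> self-attention output for position #1
print(z1.shape)  # (64,)
```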

 

That concludes the self-attention calculation. The resulting vector is one we can send along to the feed-forward neural network. In the actual implementation, however, this calculation is done in matrix form for faster processing. So let’s look at that now that we’ve seen the intuition of the calculation on the word level.

 

Matrix Calculation of Self-Attention

The first step is to calculate the Query, Key, and Value matrices. We do that by packing our embeddings into a matrix X, and multiplying it by the weight matrices we’ve trained (WQ, WK, WV).

 

Every row in the X matrix corresponds to a word in the input sentence. We again see the difference in size of the embedding vector (512, or 4 boxes in the figure), and the q/k/v vectors (64, or 3 boxes in the figure)

Finally, since we’re dealing with matrices, we can condense steps two through six in one formula to calculate the outputs of the self-attention layer.
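Since the figure that showed the condensed formula did not survive the paste, here it is as written in the paper, with dk = 64 being the dimension of the key vectors:

Attention(Q, K, V) = softmax(QKᵀ / √dk) V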

 

The self-attention calculation in matrix form

 

The Beast With Many Heads

The paper further refined the self-attention layer by adding a mechanism called “multi-headed” attention. This improves the performance of the attention layer in two ways:

1. It expands the model’s ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. This would be useful if we’re translating a sentence like “The animal didn’t cross the street because it was too tired”, where we would want to know which word “it” refers to.

2. It gives the attention layer multiple “representation subspaces”. As we’ll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.

 

With multi-headed attention, we maintain separate Q/K/V weight matrices for each head resulting in different Q/K/V matrices. As we did before, we multiply X by the WQ/WK/WV matrices to produce Q/K/V matrices.

If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices.

 

This leaves us with a bit of a challenge. The feed-forward layer is not expecting eight matrices – it’s expecting a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix.

How do we do that? We concatenate the matrices, then multiply them by an additional weight matrix WO.
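A sketch of that concatenation step with the sizes used in the paper (eight heads, each producing 64-dimensional outputs, concatenated back to 512 and projected by WO). The per-head outputs and WO below are random placeholders, not values from a trained model.

```python
import numpy as np

num_heads, d_k, d_model, seq_len = 8, 64, 512, 2
rng = np.random.default_rng(0)

# Pretend these are the eight per-head outputs Z0..Z7, each of shape (seq_len, 64).
Z_heads = [rng.standard_normal((seq_len, d_k)) for _ in range(num_heads)]
WO = rng.standard_normal((num_heads * d_k, d_model)) * 0.02   # stand-in for the trained WO

Z = np.concatenate(Z_heads, axis=1) @ WO   # concat to (seq_len, 512), then project
print(Z.shape)  # (2, 512) -- one vector per position, ready for the feed-forward layer
```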

 

That’s pretty much all there is to multi-headed self-attention. It’s quite a handful of matrices, I realize. Let me try to put them all in one visual so we can look at them in one place.

 

Now that we have touched upon attention heads, let’s revisit our example from before to see where the different attention heads are focusing as we encode the word “it” in our example sentence:

 

As we encode the word "it", one attention head is focusing most on "the animal", while another is focusing on "tired" -- in a sense, the model's representation of the word "it" bakes in some of the representation of both "animal" and "tired".

If we add all the attention heads to the picture, however, things can be harder to interpret:

 

 

Representing The Order of The Sequence Using Positional Encoding

One thing that’s missing from the model as we have described it so far is a way to account for the order of the words in the input sequence.

To address this, the transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they’re projected into Q/K/V vectors and during dot-product attention.

 

To give the model a sense of the order of the words, we add positional encoding vectors -- the values of which follow a specific pattern.

If we assumed the embedding has a dimensionality of 4, the actual positional encodings would look like this:

 

A real example of positional encoding with a toy embedding size of 4

What might this pattern look like?

In the following figure, each row corresponds to the positional encoding of a vector. So the first row would be the vector we’d add to the embedding of the first word in an input sequence. Each row contains 512 values – each with a value between 1 and -1. We’ve color-coded them so the pattern is visible.

 

A real example of positional encoding for 20 words (rows) with an embedding size of 512 (columns). You can see that it appears split in half down the center. That's because the values of the left half are generated by one function (which uses sine), and the right half is generated by another function (which uses cosine). They're then concatenated to form each of the positional encoding vectors.

The formula for positional encoding is described in the paper (section 3.5). You can see the code for generating positional encodings in get_timing_signal_1d(). This is not the only possible method for positional encoding. It, however, gives the advantage of being able to scale to unseen lengths of sequences (e.g. if our trained model is asked to translate a sentence longer than any of those in our training set).
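Here is a small sketch of sinusoidal positional encodings that follows the concatenated layout described above (sine half first, then cosine half). It uses the frequencies from section 3.5 of the paper; the actual get_timing_signal_1d code differs in minor details, so treat this as an approximation.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """One row per position: sine half followed by cosine half."""
    positions = np.arange(max_len)[:, None]                # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d_model/2)
    angle_rates = 1.0 / np.power(10000, 2 * i / d_model)   # one frequency per dimension pair
    angles = positions * angle_rates                       # (max_len, d_model/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

pe = positional_encoding(max_len=20, d_model=512)
print(pe.shape)   # (20, 512), values between -1 and 1
```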

 

The Residuals

One detail in the architecture of the encoder that we need to mention before moving on, is that each sub-layer (self-attention, ffnn) in each encoder has a residual connection around it, and is followed by a layer-normalization step.
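A bare-bones sketch of that Add & Normalize step is below. The layer_norm here normalizes each position's vector to zero mean and unit variance and omits the learned gain and bias a real implementation would include; self_attention and feed_forward stand for the sub-layers described above.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """Residual connection around a sub-layer, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

# Inside one encoder layer (self_attention and feed_forward defined elsewhere):
#   h   = add_and_norm(x, self_attention)
#   out = add_and_norm(h, feed_forward)
```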

 

If we’re to visualize the vectors and the layer-norm operation associated with self attention, it would look like this:

 

This goes for the sub-layers of the decoder as well. If we’re to think of a Transformer of 2 stacked encoders and decoders, it would look something like this:

 

 

The Decoder Side

Now that we’ve covered most of the concepts on the encoder side, we basically know how the components of decoders work as well. But let’s take a look at how they work together.

The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are to be used by each decoder in its “encoder-decoder attention” layer, which helps the decoder focus on appropriate places in the input sequence:

 

After finishing the encoding phase, we begin the decoding phase. Each step in the decoding phase outputs an element from the output sequence (the English translation sentence in this case).

The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.

 

The self attention layers in the decoder operate in a slightly different way than the one in the encoder:

In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.
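Concretely, the masking can be sketched like this: every position to the right of the one currently being decoded gets -inf added to its score, so the softmax assigns it zero weight. The score matrix below is made up purely for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.random.default_rng(0).standard_normal((seq_len, seq_len))  # toy attention scores

# Row i may only attend to columns 0..i; everything in the future is masked out.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[future] = -np.inf

weights = softmax(scores)
print(np.round(weights, 2))   # the upper triangle is exactly 0 after the softmax
```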

The “Encoder-Decoder Attention” layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.

The Final Linear and Softmax Layer

The decoder stack outputs a vector of floats. How do we turn that into a word? That’s the job of the final Linear layer which is followed by a Softmax Layer.

The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders, into a much, much larger vector called a logits vector.

Let’s assume that our model knows 10,000 unique English words (our model’s “output vocabulary”) that it’s learned from its training dataset. This would make the logits vector 10,000 cells wide – each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the Linear layer.

The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
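A sketch of this last step: project the decoder's output vector to vocabulary-sized logits, softmax them, and pick the highest-probability word (the greedy choice). The vocabulary and weights are toy stand-ins.

```python
import numpy as np

d_model = 512
vocab = ["a", "am", "i", "thanks", "student", "<eos>"]        # toy output vocabulary
rng = np.random.default_rng(0)
W_vocab = rng.standard_normal((d_model, len(vocab))) * 0.02   # stand-in for the Linear layer

decoder_output = rng.standard_normal(d_model)   # vector produced by the decoder stack
logits = decoder_output @ W_vocab               # one score per word in the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()                            # softmax: all positive, sums to 1.0
print(vocab[int(np.argmax(probs))])             # the word emitted at this time step
```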

 

This figure starts from the bottom with the vector produced as the output of the decoder stack. It is then turned into an output word.

Recap Of Training

Now that we’ve covered the entire forward-pass process through a trained Transformer, it would be useful to glance at the intuition of training the model.

During training, an untrained model would go through the exact same forward pass. But since we are training it on a labeled training dataset, we can compare its output with the actual correct output.

To visualize this, let’s assume our output vocabulary only contains six words (“a”, “am”, “i”, “thanks”, “student”, and “<eos>” (short for ‘end of sentence’)).

 

The output vocabulary of our model is created in the preprocessing phase before we even begin training.

Once we define our output vocabulary, we can use a vector of the same width to indicate each word in our vocabulary. This is also known as one-hot encoding. So for example, we can indicate the word “am” using the following vector:
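With the six-word vocabulary above, a quick sketch of that one-hot vector for “am” (index 1) looks like this:

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]
one_hot_am = np.zeros(len(vocab))
one_hot_am[vocab.index("am")] = 1.0
print(one_hot_am)   # [0. 1. 0. 0. 0. 0.]
```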

 

Example: one-hot encoding of our output vocabulary

The Loss Function

Say we are training our model. Say it’s our first step in the training phase, and we’re training it on a simple example – translating “merci” into “thanks”.

What this means, is that we want the output to be a probability distribution indicating the word “thanks”. But since this model is not yet trained, that’s unlikely to happen just yet.

 

Since the model's parameters (weights) are all initialized randomly, the (untrained) model produces a probability distribution with arbitrary values for each cell/word. We can compare it with the actual output, then tweak all the model's weights using backpropagation to make the output closer to the desired output.

How do you compare two probability distributions? We simply subtract one from the other. For more details, look at cross-entropy and Kullback–Leibler divergence.
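The “simply subtract” phrasing is itself a simplification; in practice the comparison is usually a cross-entropy loss. A minimal sketch with the toy vocabulary, using made-up model probabilities:

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]
target = np.array([0, 0, 0, 1, 0, 0], dtype=float)         # one-hot target for "thanks"
model_probs = np.array([0.2, 0.2, 0.15, 0.25, 0.1, 0.1])   # an untrained model's guess

cross_entropy = -np.sum(target * np.log(model_probs))
print(cross_entropy)   # about 1.386; it reaches 0 only if the model puts all its mass on "thanks"
```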

But note that this is an oversimplified example. More realistically, we’ll use a sentence longer than one word. For example – input: “je suis étudiant” and expected output: “i am a student”. What this really means, is that we want our model to successively output probability distributions where:

  • Each probability distribution is represented by a vector of width vocab_size (6 in our toy example, but more realistically a number like 3,000 or 10,000)
  • The first probability distribution has the highest probability at the cell associated with the word “i”
  • The second probability distribution has the highest probability at the cell associated with the word “am”
  • And so on, until the fifth output distribution indicates the ‘<end of sentence>’ symbol, which also has a cell associated with it from the 10,000 element vocabulary.

 

After training the model for enough time on a large enough dataset, we would hope the produced probability distributions would look like this:

 

Hopefully upon training, the model would output the right translation we expect. Of course it's no real indication if this phrase was part of the training dataset (see: cross validation). Notice that every position gets a little bit of probability even if it's unlikely to be the output of that time step -- that's a very useful property of softmax which helps the training process.

Now, because the model produces the outputs one at a time, we can assume that the model is selecting the word with the highest probability from that probability distribution and throwing away the rest. That’s one way to do it (called greedy decoding). Another way to do it would be to hold on to, say, the top two words (say, ‘I’ and ‘a’ for example), then in the next step, run the model twice: once assuming the first output position was the word ‘I’, and another time assuming the first output position was the word ‘a’, and whichever version produced less error considering both positions #1 and #2 is kept. We repeat this for positions #2 and #3…etc. This method is called “beam search”, where in our example, beam_size was two (because we compared the results after calculating the beams for positions #1 and #2), and top_beams is also two (since we kept two words). These are both hyperparameters that you can experiment with.
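To make greedy decoding concrete, here is a schematic sketch. The decode_step function is a placeholder for a full forward pass through the decoder stack plus the final Linear and Softmax layers, which are not implemented here; it just returns deterministic fake probabilities.

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]

def decode_step(encoder_output, produced_so_far):
    """Placeholder for the real model: return a probability distribution over the vocabulary."""
    rng = np.random.default_rng(len(produced_so_far))   # fake, deterministic probabilities
    p = rng.random(len(vocab))
    return p / p.sum()

def greedy_decode(encoder_output, max_len=10):
    output = []
    while len(output) < max_len:
        probs = decode_step(encoder_output, output)
        word = vocab[int(np.argmax(probs))]   # keep only the single most likely word each step
        output.append(word)
        if word == "<eos>":                   # stop once the end-of-sentence symbol is produced
            break
    return output

print(greedy_decode(encoder_output=None))
```

Beam search would instead keep the beam_size most promising partial outputs at every step and re-score them together, as described above.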

Go Forth And Transform

I hope you’ve found this a useful place to start to break the ice with the major concepts of the Transformer. If you want to go deeper, I’d suggest these next steps:

  • Read the Attention Is All You Need paper, the Transformer blog post (Transformer: A Novel Neural Network Architecture for Language Understanding), and the Tensor2Tensor announcement.
  • Watch Łukasz Kaiser’s talk walking through the model and its details
  • Play with the Jupyter Notebook provided as part of the Tensor2Tensor repo
  • Explore the Tensor2Tensor repo.

Acknowledgements

Thanks to Illia Polosukhin, Jakob Uszkoreit, Llion Jones , Lukasz Kaiser, Niki Parmar, and Noam Shazeer for providing feedback on earlier versions of this post.

Please hit me up on Twitter for any corrections or feedback.

Written on June 27, 2018

 

Original article: https://jalammar.github.io/illustrated-transformer/

The original post includes figures; they were lost when the text was pasted here.
