Attention Is All You Need

注意力机制是你需要的全部
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

attention [ə'tenʃ(ə)n]：n. 注意，关注，注意力，关心 int. 注意，立正
attention mechanism：注意力机制
Computer Science，CS：计算机科学
Computer Vision，CV：计算机视觉
Computation and Language，CL
Bilingual Evaluation Understudy，BLEU：双语评估替换，双语评估替补
University of Toronto，UofT or UToronto：多伦多大学，多大
Google Research
long short-term memory，LSTM：长短期记忆
recurrent neural network，RNN：循环神经网络
recursive neural network，RvNN：递归神经网络

arXiv (archive - the X represents the Greek letter chi [χ]) is a repository of electronic preprints approved for posting after moderation, but not full peer review.

Google Brain is a deep learning artificial intelligence research team at Google.

December 6, 2017

Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research.
同等贡献。名单顺序随机。Jakob 建议以 self-attention 取代 RNN，并开始努力评估这一想法。Ashish 与 Illia 一起设计并实现了第一个 Transformer 模型，并在这项工作中的各个方面起着至关重要的作用。Noam 提出了 scaled dot-product attention, multi-head attention 和参数无关的位置表示，并成为涉及几乎每个细节的另一个人。Niki 在我们原始的代码库和 tensor2tensor 中设计、实现、调优和评估了无数模型变体。Llion 还尝试了新的模型变体，负责我们的初始代码库以及高效的推理和可视化。Lukasz 和 Aidan 花了无数漫长的时间来设计和实现 tensor2tensor 的各个部分，以取代我们之前的代码库，从而大大改善了结果并极大地加速了我们的研究。

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Neural Information Processing Systems，NeurIPS，NIPS：神经信息处理系统
Long Beach：长滩市，长堤
California，CA：加利福尼亚州、加州、黄金之州、金州

Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
主流的序列转换模型基于复杂的循环或卷积神经网络 (RNN/CNN)，这些神经网络包含一个编码器和一个解码器。表现最佳的模型还通过 attention 机制连接编码器和解码器。我们提出了一种新的简单网络架构 Transformer，它仅仅基于 attention 机制，完全避免了循环和卷积。在两个机器翻译任务上进行的实验表明，这些模型在质量上具有优势，同时具有更高的可并行性，并且所需的训练时间明显更少。我们的模型在 2014 年 WMT 英语到德语翻译任务中达到 28.4 BLEU，比包括整合模型在内的现有最佳结果提高了 2 BLEU 以上。在 2014 年 WMT 英语到法语翻译任务中，我们的模型在八个 GPU 上进行了 3.5 天的训练后 (这个时间只是目前文献中记载的最好的模型训练成本的一小部分)，建立了一个新的单模型最新的 BLEU 分数 41.8。我们通过将其成功地应用于大量和有限训练数据的英语相关的成分分析任务上，表明了 Transformer 对其他任务具有很好的通用性。

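Mainstream sequence-to-sequence transduction models are all based on complex recurrent or convolutional networks that contain an encoder and a decoder, and the best-performing models add an attention mechanism between the encoder and decoder. This paper proposes a new network structure that discards recurrence and convolution and relies on attention alone.

Most mainstream sequence transduction models are based on RNNs/CNNs. Google's Transformer abandons the RNN/CNN structure completely and, starting from the characteristics of natural language itself, builds a machine translation architecture based entirely on attention.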

transduction [træns'dʌkʃən]：n. 转导 (作用)，能量转换
dispense [dɪ'spens]：v. 分配，分发，提供 (尤指服务)，配 (药)
ensemble [ɒn'sɒmb(ə)l]：n. 乐团，整体，全体，全套服装
constituency [kən'stɪtjʊənsi]：n. (选举议会议员的) 选区，选区的选民，(统称) 支持者

1 Introduction

Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].
循环神经网络，特别是长短期记忆 [13] 和门控循环 [7] 神经网络已被牢固地确立为序列建模和转换问题 (如语言建模和机器翻译) 中的最先进方法 [35, 2, 5]。自那以来，已经进行了许多努力来突破循环语言模型和编码器-解码器体系结构的发展 [38, 24, 15]。

Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$ , as a function of the previous hidden state $h_{t-1}$ and the input for position $t$ . This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.
循环模型通常沿输入和输出序列的符号位置进行因子计算。通过在计算期间将位置与步骤对齐，它们根据前一步的隐藏状态 $h_{t-1}$ 和输入产生位置 $t$ 的隐藏状态序列 $h_t$ 。这种固有的顺序特性阻止了训练样本内的并行化，这在较长的序列长度上变得至关重要，因为有限的内存限制了样本的批处理大小。最近的工作通过巧妙的因式分解 [21] 和条件计算 (conditional computation) [32] 在计算效率上取得了显着提高，同时在后者的情况下还提高了模型性能。但是，顺序计算的基本约束仍然存在。

Recurrent models typically organize their computation along the symbol positions of the input and output sequences.

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.
在各种任务中，attention 机制已经成为序列建模和转换模型不可或缺的一部分，可以建模依赖关系而不考虑其在输入或输出序列中的距离 [2, 19]。然而除少数情况外 [27]，在所有情况下，此类 attention 机制都与循环网络结合使用。

In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
在这项工作中，我们提出了 Transformer，它是一种避免循环发生的模型体系结构，完全依赖于注意力机制来绘制输入和输出之间的全局依存关系。该 Transformer 可以实现更多的并行化，在八个 P100 GPU 上进行了长达 12 个小时的训练之后，可以在翻译质量方面达到新的最佳水平。

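RNNs, LSTMs and GRUs are the standard tools for sequence modeling and transduction problems, especially language modeling and machine translation. A typical recurrent network computes along the symbol positions of the input and output sequences: the hidden state $h_t$ produced at each step is a function of the previous hidden state $h_{t-1}$ and the input at position $t$ . This sequential nature is what blocks parallelization during training. Attention mechanisms address long-range dependencies within a sequence. The Transformer avoids recurrence and relies entirely on attention to capture global dependencies between input and output.

RNN-based seq2seq models struggle with long sentences, cannot be parallelized, and face alignment problems. Adding attention to seq2seq improves accuracy, and current seq2seq models are generally RNN + attention.

The network has to compress all necessary information of the source sentence into a fixed-length vector, which can make long sentences hard to handle, especially those longer than the sentences in the training corpus. The output at each time step depends on the outputs of earlier time steps, so the model cannot be parallelized, is inefficient, and faces alignment problems. CNNs cannot directly process variable-length sequences, but they can compute in parallel. A seq2seq model built entirely on CNNs can be parallelized, but it is memory-hungry and its parameters are hard to tune on large datasets. Attention Is All You Need drops the earlier pattern of combining an encoder-decoder with RNNs/CNNs and uses attention alone, reducing computation and improving parallel efficiency without hurting final accuracy.

Future direction: directionality of the input (unidirectional -> bidirectional)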

2 Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.
减少顺序计算的目标也构成了 Extended Neural GPU [16], ByteNet [18] and ConvS2S [9] 的基础，它们全部使用卷积神经网络作为基本构建模块，并行计算所有输入和输出位置的隐藏表示。在这些模型中，关联来自两个任意输入或输出位置的信号所需的操作数会随着位置之间的距离而增加，对于 ConvS2S 线性增长，而对于 ByteNet 则对数增长。这使得学习远距离之间的依存关系变得更加困难 [12]。在 Transformer 中，这可以减少到固定的操作次数，尽管会因用 attention 权重化的位置取平均而导致有效分辨率降低的代价，这是我们在第 3.2 节中所述的 Multi-Head Attention 抵消的效果。

albeit [ɔːl'biːɪt]：conj. 尽管，虽然
counteract [.kaʊntər'ækt]：v. 抵消，抵抗，抵制
entailment [en'teɪlmənt]：n. 导出

Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22].
self-attention，有时也称为 intra-attention，是一种注意力机制，它关联单个序列的不同位置以计算序列的表示。self-attention 已经成功地用于各种任务中，包括阅读理解、抽象概括、文本蕴涵和学习与任务无关的句子表示 [4, 27, 28, 22]。

End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [34].
端到端的记忆网络基于循环 attention 机制，而不是序列对齐的循环，并且已被证明在简单语言问答和语言建模任务中表现良好 [34]。

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9].
然而，据我们所知，Transformer 是第一个完全依靠 self-attention 来计算其输入和输出表示的转换模型，而无需使用序列对齐的 RNN 或卷积。在以下各节中，我们将描述 Transformer，激发自注意并讨论其相对于模型 [17, 18] 和 [9] 的优势。

3 Model Architecture

Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations $(x_{1}, ..., x_{n})$ to a sequence of continuous representations $\mathbf{z} = (z_{1}, ..., z_{n})$ . Given $\mathbf{z}$ , the decoder then generates an output sequence $(y_{1}, ..., y_{m})$ of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.
大多数有竞争力的神经序列转换模型具有编码器-解码器结构 [5, 2, 35]。在此，编码器将符号表示 $(x_{1}, ..., x_{n})$ 的输入序列映射到一系列连续表示 $\mathbf{z} = (z_{1}, ..., z_{n})$ 。在给定 $\mathbf{z}$ 的情况下，解码器然后一次生成一个符号的符号输出序列 $(y_{1}, ..., y_{m})$ 。模型的每一步都是自回归的 [10]，在生成下一个时，会将先前生成的符号用作附加输入。

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
Transformer 遵循这种总体架构，对编码器和解码器使用堆叠式 self-attention 和逐点、全连接层，分别如图 1 的左半部分和右半部分所示。

Figure 1: The Transformer - model architecture.

The Transformer

sequence-to-sequence，Seq2Seq：序列到序列
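The encoder on the left takes the input; the decoder on the right produces the output.

The encoder (left) and the decoder (right) work together. Each contains N layers, and the encoder output is fed into every decoder layer.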

3.1 Encoder and Decoder Stacks

Encoder: The encoder is composed of a stack of $N = 6$ identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm( $x$ + Sublayer( $x$ )), where Sublayer( $x$ ) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_\text{model} = 512$ .
编码器由 $N = 6$ 个相同的层堆叠组成。每层都有两个子层。第一个子层是一个 multi-head self-attention 机制，第二个子层是一个简单的、位置完全连接的前馈网络。我们对每个子层再采用一个残差连接 [11]，然后进行 layer normalization [1]。也就是说，每个子层的输出是 LayerNorm( $x$ + Sublayer( $x$ ))，其中 Sublayer( $x$ ) 是由子层本身实现的函数。为了方便这些残差连接，模型中的所有子层以及嵌入层均产生维度为 $d_\text{model} = 512$ 的输出。

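The encoder has $N = 6$ layers, each with two sub-layers.
The first sub-layer is a multi-head self-attention mechanism, which computes self-attention over the input.
The second sub-layer is a simple fully connected network.

Each sub-layer is wrapped in a residual connection in the style of ResNet, so its output is LayerNorm( $x$ + Sublayer( $x$ )), where Sublayer( $x$ ) is the mapping the sub-layer applies to its input $x$ . For the residual addition to work, all sub-layers and the embedding layers produce outputs of the same dimension.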
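The residual-plus-normalization pattern described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the paper's implementation: `layer_norm` omits the learned gain and bias, and `toy_sublayer` is a hypothetical stand-in for the attention or feed-forward sub-layer.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for a single position."""
    y = sublayer(x)
    # The residual addition only works if the sub-layer preserves the dimension.
    assert len(y) == len(x), "sub-layer must preserve the dimension"
    return layer_norm([a + b for a, b in zip(x, y)])

def toy_sublayer(x):
    """Hypothetical sub-layer used only for this demonstration."""
    return [2.0 * v for v in x]

out = residual_block([1.0, 2.0, 3.0, 4.0], toy_sublayer)
print(out)
```

The dimension check mirrors the reason every sub-layer and the embedding layers share the output size $d_\text{model}$ .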

Decoder: The decoder is also composed of a stack of $N = 6$ identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position $i$ can depend only on the known outputs at positions less than $i$ .
解码器还由 $N = 6$ 个相同的层堆叠组成。除了每个编码器层中的两个子层之外，解码器还插入第三子层，该第三子层对编码器堆栈的输出执行 multi-head attention。与编码器类似，我们在每个子层再采用残差连接，然后进行 layer normalization。我们还修改了解码器堆栈中的 self-attention 子层，以防止位置关注后续位置。这种掩码，加上输出嵌入偏移一个位置的事实，确保了对位置 $i$ 的预测只能依赖于位置小于 $i$ 的已知输出。

attend [ə'tend]：v. 参加，出席，注意，专心
facilitate [fə'sɪləteɪt]：v. 促进，促使，使便利

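The decoder has $N = 6$ layers, each with three sub-layers.
The first sub-layer is masked multi-head attention, which computes self-attention over the decoder input. Because decoding is generative, at step $i$ only the results for steps before $i$ exist, so later positions have to be masked.
The second sub-layer performs multi-head attention over the output of the encoder stack.
The third sub-layer is the fully connected network, the same as in the encoder.
The self-attention in the decoder is modified so that each position attends only to positions up to and including itself. Since the decoder input is offset by one position, the prediction at step $t$ depends only on outputs before step $t$ . This is the mask operation.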

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
attention 函数可以描述为将 query 和一组键值对映射到输出，其中 query, keys, values 和输出都是向量。将输出为 values 的加权和，其中分配给每个值的权重是通过 query 与相应 key 的兼容性函数来计算的。

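Attention measures relevance (for example, in translation, different English words depend on different Chinese words to different degrees). It maps a query ( $Q$ ) and a set of key-value pairs to an output, where the query, keys, values and output are all vectors. The output is a weighted sum of all values ( $V$ ), with weights computed from the query and the corresponding key in three steps.

Compute the similarity between $Q$ and each key $K_{i}$ , denoted $f$ .
$f(Q, K_{i}), \ i = 1, 2, ..., m$
Apply softmax to the similarities.
$w_{i} = \frac{e^{f(Q, K_{i})}}{\sum_{j=1}^{m} e^{f(Q, K_{j})}}, \ i = 1, 2, ..., m$
Use the weights $w_{i}$ to take a weighted sum over all values in $V$ , which gives the attention vector.
$\sum^m_{i=1} w_{i} V_{i}$

Four common choices for the similarity in the first step (in the last three, $W$ , $U$ and $V$ are learned parameters, and this $V$ is not the value matrix):
Dot product: $f(Q, K_{i}) = Q^{T}K_{i}, \ i = 1, 2, ..., m$
Weighted (general): $f(Q, K_{i}) = Q^{T}WK_{i}, \ i = 1, 2, ..., m$
Concatenation (concat): $f(Q, K_{i}) = W[Q; K_{i}], \ i = 1, 2, ..., m$
Perceptron: $f(Q, K_{i}) = V^{T} \tanh(WQ + UK_{i}), \ i = 1, 2, ..., m$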
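The three steps above (score, softmax, weighted sum) can be sketched for a single query as follows. This is a minimal sketch using only the standard library; the function names and the toy vectors are assumptions made for illustration, and the plain dot product is used as the scoring function.

```python
import math

def dot(a, b):
    """Dot-product similarity between two vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values, score_fn=dot):
    # Step 1: similarity between the query and every key.
    scores = [score_fn(query, k) for k in keys]
    # Step 2: softmax turns the scores into weights that sum to 1.
    weights = softmax(scores)
    # Step 3: weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
output, weights = attention(query, keys, values)
print(weights, output)
```

Passing a different `score_fn` would swap in one of the other similarity functions without changing steps 2 and 3.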

Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.

3.2.1 Scaled Dot-Product Attention

We call our particular attention “Scaled Dot-Product Attention” (Figure 2). The input consists of queries and keys of dimension $d_k$ , and values of dimension $d_v$ . We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$ , and apply a softmax function to obtain the weights on the values.
我们称我们特别的 attention “Scaled Dot-Product Attention” (Figure 2)。输入由 queries 和维度为 $d_k$ 的 keys 以及维度为 $d_v$ 的 values 组成。我们计算 query 和所有 keys 的点积，除以 $\sqrt{d_k}$ ，然后应用 softmax 函数获得 values 的权重。

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$ . The keys and values are also packed together into matrices $K$ and $V$ . We compute the matrix of outputs as:
实际上，我们在一组 queries 上同时计算注意力函数，将它们打包成矩阵 $Q$ 。keys and values 也打包到矩阵 $K$ and $V$ 中。我们将输出矩阵计算为：
$\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^{T}} {\sqrt{d_{k}}}) V \tag{1}$
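Equation (1) can be sketched in matrix form with plain Python lists. This is an illustrative sketch only: the helper names and the toy matrices are assumptions, and a real implementation would use an optimized matrix library instead of nested loops.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def scaled_dot_product_attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))                         # shape: queries x keys
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = softmax_rows(scaled)                           # each row sums to 1
    return matmul(weights, V), weights

# Toy input: 2 queries, 3 keys/values, d_k = 4, d_v = 2.
Q = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
K = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
Z, W = scaled_dot_product_attention(Q, K, V)
print(Z)
```

The output has one row per query and $d_v$ columns, regardless of how many keys there are.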

The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_{k}}}$ . Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
两个最常用的 attention 函数是加法 attention [2] 和点积 (乘法) attention。除 $\frac{1}{\sqrt{d_{k}}}$ 的缩放因子外，点积 attention 与我们的算法相同。加法 attention 使用具有单个隐藏层的前馈网络来计算兼容性函数。尽管两者在理论上的复杂度相似，但是在实践中点积的 attention 要快得多，而且空间效率更高，因为可以使用高度优化的矩阵乘法代码来实现。

While for small values of $d_k$ the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of $d_k$ [3]. We suspect that for large values of $d_k$ , the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$ .
当 $d_k$ 的值比较小的时候，这两个机制的性能相差相近，当 $d_k$ 的值比较大时，加法 attention 比不带缩放的点积 attention 性能好 [3]。我们猜想，对于很大的 $d_k$ 值，点积大幅度增长，将 softmax 函数推向具有极小梯度的区域。为了抵消这种影响，我们缩小点积 $\frac{1}{\sqrt{d_k}}$ 倍。

To illustrate why the dot products get large, assume that the components of $q$ and $k$ are independent random variables with mean 0 and variance 1. Then their dot product, $q \cdot k = \sum^{d_k}_{i=1} q_{i}k_{i}$ , has mean 0 and variance $d_k$ .
为了说明点积为何变大，假定 $q$ and $k$ 的分量是均值为 0 且方差为 1 的独立随机变量。
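A short derivation of the mean and variance of the dot product, under the stated assumptions (all components mutually independent, each with mean 0 and variance 1):

$\begin{aligned} \mathbb{E}[q \cdot k] &= \sum_{i=1}^{d_k} \mathbb{E}[q_{i}] \, \mathbb{E}[k_{i}] = 0 \\ \operatorname{Var}(q_{i}k_{i}) &= \mathbb{E}[q_{i}^{2}] \, \mathbb{E}[k_{i}^{2}] - (\mathbb{E}[q_{i}] \, \mathbb{E}[k_{i}])^{2} = 1 \\ \operatorname{Var}(q \cdot k) &= \sum_{i=1}^{d_k} \operatorname{Var}(q_{i}k_{i}) = d_k \end{aligned}$

The standard deviation of $q \cdot k$ is therefore $\sqrt{d_k}$ , and dividing the dot product by $\sqrt{d_k}$ brings its variance back to 1, which keeps the softmax inputs in a range where gradients are not vanishingly small.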

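Where do $Q, K, V$ in Scaled Dot-Product Attention come from? The input $X$ is turned into $Q, K, V$ by three linear transformations. For example, the two words Thinking and Machines are embedded as two vectors $X_{1}, X_{2}$ , each $[1 \times 4]$ . Each is multiplied by the three matrices $(W^{Q}, W^{K}, W^{V})$ , each $[4 \times 3]$ , giving six vectors $\{q_{1}, q_{2}\}, \{k_{1}, k_{2}\}, \{v_{1}, v_{2}\}$ , each $[1 \times 3]$ .
The dot product of $q_{1}$ and $k_{1}$ gives a score of 112, and the dot product of $q_{1}$ and $k_{2}$ gives a score of 96.
To keep gradients stable, the scores are scaled by dividing by 8 (that is, $\sqrt{d_k}$ for the paper's $d_k = 64$ ). Softmax over the scaled scores $[14, 12]$ then gives the weights $[0.88, 0.12]$ .
The weights $[0.88, 0.12]$ are multiplied by the values $[v_{1}, v_{2}]$ , and the weighted values are summed to give $z_{1}$ , the output of this layer for the first word. In other words, $Q$ and $K$ give the weights of Thinking with respect to Thinking and Machines, those weights scale the $V$ vectors of Thinking and Machines, and the sum is the output $Z$ for each word.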
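The numbers in this walkthrough can be checked directly. The sketch below takes the two raw scores from the notes (112 and 96) and assumes $d_k = 64$ , which is what makes the divisor equal to 8.

```python
import math

scores = [112.0, 96.0]   # q1 . k1 and q1 . k2 from the walkthrough
d_k = 64                 # assumed key dimension, so sqrt(d_k) = 8

scaled = [s / math.sqrt(d_k) for s in scores]
top = max(scaled)
exps = [math.exp(s - top) for s in scaled]
weights = [e / sum(exps) for e in exps]

print(scaled)                          # [14.0, 12.0]
print([round(w, 2) for w in weights])  # [0.88, 0.12]
```

The output $z_{1}$ is then $0.88 \, v_{1} + 0.12 \, v_{2}$ .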
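The steps above work on individual vectors. In matrix form, the input is a $[2 \times 4]$ matrix (the word embeddings), each weight matrix is $[4 \times 3]$ , and the products give $Q, K, V$ .

$Q$ is multiplied by the transpose of $K$ and divided by the square root of $d_k$ . Softmax yields weights that sum to 1, which are multiplied by $V$ to produce the output $Z$ . $Z$ is thus an output for Thinking that takes the surrounding word (Machines) into account.

$QK^T$ forms a word-to-word attention map (after softmax, each row is a set of weights that sums to 1). For example, with the 4-word input I have a dream, we get a $4 \times 4$ attention map in which every word has a weight for every word.

The encoder uses self-attention and the decoder uses masked self-attention. Masking means that in language modeling (or translation) the model must not see future information. The mask covers the region above the diagonal (the grey area in the figure): those scores are set to $-\infty$ before the softmax, so their weights become 0.

I, as the first word, attends only to itself. have, as the second word, attends to I and have. a, as the third word, attends to I, have and a. Only the last word, dream, attends to all 4 words of the sentence. After softmax, each row sums to 1.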
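The masking pattern for the 4-word example can be sketched as follows. The raw scores are a placeholder (all zeros) chosen only so that the effect of the mask is easy to see; real scores would come from $QK^T / \sqrt{d_k}$ .

```python
import math

tokens = ["I", "have", "a", "dream"]
n = len(tokens)

# Placeholder raw scores: all equal, so unmasked positions share weight evenly.
scores = [[0.0] * n for _ in range(n)]

# Causal mask: position i may only see positions j <= i.
NEG_INF = float("-inf")
masked = [[scores[i][j] if j <= i else NEG_INF for j in range(n)]
          for i in range(n)]

def softmax(row):
    top = max(row)  # the diagonal entry is always finite
    exps = [math.exp(v - top) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

weights = [softmax(row) for row in masked]
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each row still sums to 1, but the weight above the diagonal is exactly 0, so no position receives information from later positions.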

3.2.2 Multi-Head Attention

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.
Multi-head attention 允许模型的不同表示子空间联合关注不同位置的信息。如果只有一个 attention head，它的平均值会削弱这个信息。

$\begin{aligned} \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_{1}, ..., \text{head}_{h})W^{O} \\ \text{where} \ \text{head}_{i} = \text{Attention}(QW_{i}^{Q}, KW_{i}^{K}, VW_{i}^{V}) \\ \end{aligned}$

Where the projections are parameter matrices $W_{i}^{Q} \in \mathbb{R}^{d_\text{model}\times d_{k}}, W_{i}^{K} \in \mathbb{R}^{d_\text{model}\times d_{k}}, W_{i}^{V} \in \mathbb{R}^{d_\text{model}\times d_{v}}$ and $W^{O} \in \mathbb{R}^{h d_{v} \times d_\text{model}}$ .
其中，映射为参数矩阵 $W_{i}^{Q} \in \mathbb{R}^{d_\text{model}\times d_{k}}, W_{i}^{K} \in \mathbb{R}^{d_\text{model}\times d_{k}}, W_{i}^{V} \in \mathbb{R}^{d_\text{model}\times d_{v}}$ 和 $W^{O} \in \mathbb{R}^{h d_{v} \times d_\text{model}}$ 。

In this work we employ $h = 8$ parallel attention layers, or heads. For each of these we use $d_{k} = d_{v} = d_\text{model}/h = 64$ . Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
在这项工作中，我们采用 $h = 8$ 个并行 attention 层或 heads。对每个 head，我们使用 $d_{k} = d_{v} = d_\text{model}/h = 64$ 。由于每个 head 的维度减小，总的计算成本与具有全部维度的单个 head attention 相似。

inhibit [ɪn'hɪbɪt]：v. 抑制，阻止，阻碍，使拘束
projection [prə'dʒekʃ(ə)n]：n. 投影，投射，预测，放映

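Multi-Head Attention runs the Scaled Dot-Product Attention procedure $h$ times and then combines the outputs $Z$ .

Repeating the operation 8 times gives 8 matrices $Z_i$ .
To match the output shape to the input, the concatenated result is multiplied by a linear projection $W^O$ to give the final $Z$ .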
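The project, attend, concatenate and project-again flow can be sketched with small dimensions. This is a toy sketch under stated assumptions: it uses $d_\text{model} = 8$ and $h = 2$ instead of the paper's 512 and 8, random untrained weights, self-attention (the same $X$ for queries, keys and values), and helper names invented for this example.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def attention(Q, K, V):
    """Scaled dot-product attention for small list-of-rows matrices."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(x * y for x, y in zip(q, k)) / math.sqrt(d_k) for k in K]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j] for e, v in zip(exps, V))
                    for j in range(len(V[0]))])
    return out

def random_matrix(rows, cols, rng):
    """Random weights, scaled down by sqrt(rows); purely illustrative."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(rows) for _ in range(cols)]
            for _ in range(rows)]

def multi_head_attention(X, heads, W_O):
    # One attention call per head, each on its own projections of X.
    head_outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
                    for W_Q, W_K, W_V in heads]
    # Concatenate the per-head outputs position by position.
    concat = [[v for head in head_outputs for v in head[i]]
              for i in range(len(X))]
    # Final projection back to d_model.
    return matmul(concat, W_O)

rng = random.Random(0)
d_model, h = 8, 2
d_k = d_model // h
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3))
         for _ in range(h)]
W_O = random_matrix(h * d_k, d_model, rng)
X = random_matrix(3, d_model, rng)   # 3 positions
Z = multi_head_attention(X, heads, W_O)
print(len(Z), len(Z[0]))
```

Because each head works in $d_\text{model}/h$ dimensions, the concatenation has width $h \cdot d_v = d_\text{model}$ , which is why the total cost stays close to that of one full-width head.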

3.2.3 Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways:

In “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9].
在 encoder-decoder attention 层中，queries 来自上面解码器层，而存储 keys and values 来自编码器的输出。这允许解码器中的每个位置都能关注到输入序列中的所有位置。这模仿了序列到序列模型 (例如 [38, 2, 9]) 中的典型编码器-解码器注意机制。
The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
编码器包含 self-attention 层。在 self-attention 层中，所有 keys, values and queries 都来自同一位置，在这种情况下，是编码器中前一层的输出。编码器中的每个位置都可以关注编码器上一层中的所有位置。
Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to $-\infty$ ) all values in the input of the softmax which correspond to illegal connections. See Figure 2.
类似地，解码器中的 self-attention 层允许解码器中的每个位置都关注解码器中的所有位置，直到包括该位置为止。我们需要防止解码器中的向左信息流，以保留自回归属性。通过屏蔽 softmax 的输入中所有不合法连接的值 (设置为 $-\infty$ )，我们在 scaled dot-product attention 中实现此目标。参见图 2。

mimic ['mɪmɪk]：v. 模仿 (某人的言行举止)，(尤指) 做滑稽模仿，似 adj. 模仿的，拟态的 n. 会模仿的人 (或动物)

3.3 Position-wise Feed-Forward Networks - 基于位置的前馈网络

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.
除了 attention 子层之外，我们的编码器和解码器中的每个层都包含一个全连接的前馈网络，该前馈网络单独且相同地应用于每个位置。它由两个线性变换组成，之间有一个 ReLU 激活。

$\text{FFN}(x) = \text{max}(0, xW_{1} + b_{1})W_{2} + b_{2} \tag{2}$

While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is $d_\text{model} = 512$ , and the inner-layer has dimensionality $d_{ff} = 2048$ .
尽管线性变换在不同位置上是相同的，但它们层与层之间使用不同的参数。它的另一种描述方式是两个内核大小为1 的卷积。输入和输出的维度为 $d_\text{model} = 512$ ，内部层的维度为 $d_{ff} = 2048$ 。
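Equation (2) applied position by position can be sketched as below. The tiny weights ( $d_\text{model} = 2$ , $d_{ff} = 3$ ) are made up for illustration; the paper uses 512 and 2048.

```python
def linear(x, W, b):
    """Compute xW + b for a row vector x and a matrix W stored as rows."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for one position."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]  # ReLU
    return linear(hidden, W2, b2)

def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply the same FFN, with the same parameters, to every position."""
    return [ffn(x, W1, b1, W2, b2) for x in X]

# Illustrative parameters: d_model = 2, d_ff = 3.
W1 = [[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.5, -0.5]

X = [[1.0, 2.0], [1.0, 2.0], [0.0, 0.0]]   # three positions
Y = position_wise_ffn(X, W1, b1, W2, b2)
print(Y)
```

Identical inputs at different positions give identical outputs, which is the sense in which the network is applied "separately and identically" and equivalent to a convolution with kernel size 1.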

3.4 Embeddings and Softmax - 嵌入和 Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_\text{model}$ . We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30]. In the embedding layers, we multiply those weights by $\sqrt{d_\text{model}}$ .
与其他序列转导模型类似，我们使用学习的嵌入将输入词符和输出词符转换为维度为 $d_\text{model}$ 的向量。我们还使用普通学习的线性变换和 softmax 函数将解码器输出转换为预测的下一个词符的概率。在我们的模型中，两个嵌入层之间和 pre-softmax 线性变换共享相同的权重矩阵，类似于 [30]。在嵌入层中，我们将这些权重乘以 $\sqrt{d_\text{model}}$ 。

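The output of the decoder stack serves as the input to the final linear transformation and softmax, which make the final prediction.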

3.5 Positional Encoding (位置编码)

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_\text{model}$ as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9].
由于我们的模型不包含循环和卷积，为了让模型利用序列的顺序，我们必须注入序列中关于词符相对或者绝对位置的一些信息。为此，我们将位置编码 (positional encodings) 和 embeddings 相加，作为编码器和解码器堆栈底部的输入。位置编码和嵌入的维度 $d_\text{model}$ 相同，所以它们可以相加。有多种位置编码可以选择，例如通过学习得到的位置编码和固定的位置编码 [9]。

In this work, we use sine and cosine functions of different frequencies:
在这项工作中，我们使用不同频率的正弦和余弦函数：

$\begin{aligned} PE(pos, 2i) = \sin(pos/10000^{2i/d_\text{model}}) \\ PE(pos, 2i+1) = \cos(pos/10000^{2i/d_\text{model}}) \\ \end{aligned}$

where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$ . We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$ , $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$ .
其中 $pos$ 是位置， $i$ 是维度。也就是说，位置编码的每个维度对应于一个正弦曲线。这些波长形成一个几何级数，从 $2\pi$ 到 $10000 \cdot 2\pi$ 。我们选择这个函数是因为我们假设它允许模型很容易学习对相对位置的关注，因为对任意确定的偏移 $k$ , $PE_{pos+k}$ 可以表示为 $PE_{pos}$ 的线性函数。

We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.
我们还尝试使用学习的位置嵌入 [9] 进行实验，发现这两个版本产生了几乎相同的结果 (see Table 3 row (E))。我们选择正弦曲线版本是因为它可以使模型外推到比训练过程中遇到的序列长度更长的序列长度。
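The sinusoidal encoding can be sketched directly from the two formulas. The function name and the small sizes used in the demonstration are assumptions made for illustration.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # two_i plays the role of 2i in the formulas: the even dimension index.
        for two_i in range(0, d_model, 2):
            angle = pos / (10000 ** (two_i / d_model))
            pe[pos][two_i] = math.sin(angle)          # even dimensions: sine
            if two_i + 1 < d_model:
                pe[pos][two_i + 1] = math.cos(angle)  # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=4, d_model=8)
for row in pe:
    print([round(v, 3) for v in row])
```

The table depends only on `pos` and the dimension index, so it can be computed once and added to the embeddings for any sequence length, including lengths not seen in training.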

sinusoidal [ˌsɪnə'sɔɪdl]：adj. 血窦，曲形，正弦，正弦曲线的
extrapolate [ɪk'stræpəleɪt]：v. 推断，推知，判定，推测
hypothesize [haɪ'pɒθəsaɪz]：v. 假设，假定

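The model has no recurrence or convolution, so it cannot capture order on its own: for example, shuffling the rows of $K$ and $V$ with the same permutation leaves the attention output unchanged. But order matters a great deal, since it carries the global structure, so the relative or absolute position of each token has to be made available to the model.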
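The claim that attention ignores the order of the key-value pairs can be checked numerically. The sketch below uses made-up vectors and applies the same permutation to the keys and the values; `attend` is a helper written for this example.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[j] for e, v in zip(exps, values))
            for j in range(dim)]

query = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

perm = [2, 0, 1]  # reorder the key-value pairs together
original = attend(query, keys, values)
shuffled = attend(query, [keys[p] for p in perm], [values[p] for p in perm])
print(original, shuffled)
```

The two results agree up to floating-point rounding, which is why position information has to be injected separately.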

Each token's position embedding is a vector of dimension $d_\text{model}$ . It is added to the original input embedding to form the final embedding that is fed to the encoder/decoder. The position embedding is computed as follows:

$\begin{aligned} PE(pos, 2i) = \sin(pos/10000^{2i/d_\text{model}}) \\ PE(pos, 2i+1) = \cos(pos/10000^{2i/d_\text{model}}) \\ \end{aligned}$

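where $pos$ is the position index and $i$ is the dimension index.

A position embedding by itself encodes absolute position, but relative position also matters in language. An important reason Google chose the formula above is that we have the identities: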

$\begin{aligned} \sin(\alpha + \beta) = \sin \alpha \cos \beta + \cos \alpha \sin \beta \\ \cos(\alpha + \beta) = \cos \alpha \cos \beta - \sin \alpha \sin \beta \end{aligned}$

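This shows that the vector at position $p + k$ can be written as a linear transformation of the vector at position $p$ , which makes it possible to express relative position information.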
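Written out for one sine/cosine pair, with the shorthand $\omega_{i} = 1/10000^{2i/d_\text{model}}$ (a symbol introduced here for convenience), the angle-addition identities give:

$\begin{pmatrix} \sin(\omega_{i}(p + k)) \\ \cos(\omega_{i}(p + k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_{i}k) & \sin(\omega_{i}k) \\ -\sin(\omega_{i}k) & \cos(\omega_{i}k) \end{pmatrix} \begin{pmatrix} \sin(\omega_{i}p) \\ \cos(\omega_{i}p) \end{pmatrix}$

The $2 \times 2$ matrix depends only on the offset $k$ and not on the position $p$ , so moving by a fixed offset is the same linear map at every position.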

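In earlier models a position embedding is usually a learned vector and serves only as an extra feature: it helps, but performance does not drop drastically without it, because RNNs/CNNs capture position information by themselves. In the Transformer, however, the position embedding is the only source of position information, so it is a core component of the model rather than an auxiliary feature.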

4 Why Self-Attention

In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations $(x_{1}, ..., x_{n})$ to another sequence of equal length $(z_{1}, ..., z_{n})$ , with $x_{i}, z_{i} \in \mathbb{R}^{d}$ , such as a hidden layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we consider three desiderata.
本节，我们比较 self-attention 与循环层、卷积层的各个方面，它们通常用于映射变长的符号序列表示 $(x_{1}, ..., x_{n})$ 到另一个等长的序列 $(z_{1}, ..., z_{n})$ ，其中 $x_{i}, z_{i} \in \mathbb{R}^{d}$ ，例如一个典型的序列转导编码器或解码器中的隐藏层。我们使用 self-attention 是考虑到解决三个问题。

One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.
一个是每层的总计算复杂度。另一个是可以并行化的计算量，以所需的最少顺序操作数衡量。

The third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [12]. Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.
第三个是网络中长距离依赖之间的路径长度。学习长距离依赖性是许多序列转导任务中的关键挑战。影响学习这种依赖性能力的一个关键因素是前向和后向信号必须在网络中传播的路径长度。输入和输出序列中任意位置组合之间的这些路径越短，学习长距离依赖性就越容易 [12]。因此，我们还比较了由不同图层类型组成的网络中任意两个输入和输出位置之间的最大路径长度。

As noted in Table 1, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires $O (n)$ sequential operations. In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length $n$ is smaller than the representation dimensionality $d$ , which is most often the case with sentence representations used by state-of-the-art models in machine translations, such as word-piece [38] and byte-pair [31] representations. To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size $r$ in the input sequence centered around the respective output position. This would increase the maximum path length to $O (n / r)$ . We plan to investigate this approach further in future work.
如表 1 所示，self-attention 层将所有位置连接到恒定数量的顺序执行的操作，而循环层需要 $O (n)$ 顺序操作。在计算复杂性方面，当序列长度 $n$ 小于表示维度 $d$ 时，self-attention 层比循环层快，这是机器翻译中最先进的模型最常见情况，例如 word-piece [38] and byte-pair [31] 表示法。为了提高涉及很长序列的任务的计算性能，可以将 self-attention 限制在仅考虑大小为 $r$ 的邻域。这会将最大路径长度增加到 $O (n / r)$ 。我们计划在未来的工作中进一步调查这种方法。

Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types. $n$ is the sequence length, $d$ is the representation dimension, $k$ is the kernel size of convolutions and $r$ the size of the neighborhood in restricted self-attention.
不同层类型的最大路径长度、每层复杂度和最少顺序操作数。 $n$ 为序列的长度， $d$ 为表示的维度， $k$ 为卷积的核的大小， $r$ 为受限 self-attention 中邻域的大小。

A single convolutional layer with kernel width $k < n$ does not connect all pairs of input and output positions. Doing so requires a stack of $O (n / k)$ convolutional layers in the case of contiguous kernels, or $O(\log_{k}(n))$ in the case of dilated convolutions [18], increasing the length of the longest paths between any two positions in the network. Convolutional layers are generally more expensive than recurrent layers, by a factor of $k$ . Separable convolutions [6], however, decrease the complexity considerably, to $O(k \cdot n \cdot d + n \cdot d^{2})$ . Even with $k = n$ , however, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model.
kernel width $k < n$ 的单卷积层不会连接每一对输入和输出的位置。要这么做，在邻近核的情况下需要堆叠 $O (n / k)$ 个卷积层，在 dilated convolutions [18] 的情况下需要 $O(\log_{k}(n))$ 个层，它们增加了网络中任意两个位置之间的最长路径的长度。卷积层通常比循环层更昂贵，大约是 $k$ 倍。然而，separable convolutions [6] 大幅减少复杂度到 $O(k \cdot n \cdot d + n \cdot d^{2})$ 。然而，即使 $k = n$ ，一个可分卷积的复杂度等同于 self-attention 层和 point-wise 前向层的组合，这是我们在模型中采用的方法。

As a side benefit, self-attention could yield more interpretable models. We inspect attention distributions from our models and present and discuss examples in the appendix. Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.
作为附带的好处，self-attention 可以产生更具可解释性的模型。我们检查了模型的注意力分布，并在附录中介绍和讨论示例。各个 attention head 不仅清楚地学会了执行不同的任务，许多 head 似乎还展现出与句子的句法和语义结构相关的行为。

desideratum：n. 急需品 (desiderata 是 desideratum 的复数)
present [prɪ'zent]：n. 目前，现在，礼物，礼品 adj. 存在，出席，在场，出现 v. 出现，提出，显示，提交

5 Training

This section describes the training regime for our models.

regime [reɪ'ʒiːm]：n. 政体，组织方法，管理体制
token ['təʊkən]：n. 表示，(用以启动某些机器或用作支付方式的) 代币，代金券，赠券 adj. 装样子的，装点门面的，敷衍的，象征性的

5.1 Training Data and Batching

We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [3], which has a shared source-target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [38]. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.
我们在标准的 WMT 2014 英语-德语数据集上进行训练，该数据集包含约 450 万个句子对。句子使用字节对编码 [3] 进行编码，源语言和目标语言共享一个约 37000 个词符的词汇表。对于英语-法语，我们使用了大得多的 WMT 2014 英语-法语数据集，该数据集包含 3600 万个句子，并将词符切分为包含 32000 个 word-piece 的词汇表 [38]。句子对按大致的序列长度分批。每个训练批次包含一组句子对，其中大约有 25000 个源词符和 25000 个目标词符。
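
One way to batch sentence pairs by approximate length under a token budget is sketched below. This is not the tensor2tensor implementation. It is a minimal greedy scheme in which `batch_by_length`, its sort key, and the choice to count real tokens rather than padded positions are all assumptions made for illustration.

```python
from typing import List, Tuple

Pair = Tuple[List[str], List[str]]  # (source tokens, target tokens)


def batch_by_length(pairs: List[Pair], max_tokens: int = 25000) -> List[List[Pair]]:
    """Group pairs of similar length so each batch holds at most max_tokens per side."""
    # Sorting by the longer side puts sentences of similar length next to each other.
    ordered = sorted(pairs, key=lambda p: max(len(p[0]), len(p[1])))
    batches: List[List[Pair]] = []
    current: List[Pair] = []
    src_tokens = tgt_tokens = 0
    for src, tgt in ordered:
        over_budget = (src_tokens + len(src) > max_tokens
                       or tgt_tokens + len(tgt) > max_tokens)
        if current and over_budget:
            batches.append(current)
            current, src_tokens, tgt_tokens = [], 0, 0
        current.append((src, tgt))
        src_tokens += len(src)
        tgt_tokens += len(tgt)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    toy = [(["s"] * n, ["t"] * n) for n in (6, 2, 5, 3)]
    for batch in batch_by_length(toy, max_tokens=10):
        print([len(src) for src, _ in batch])
```

A real pipeline would usually also shuffle the resulting batches between epochs so that training does not always proceed from short to long sentences.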

5.2 Hardware and Schedule

We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models (described on the bottom line of Table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days).
我们在一台配备 8 个 NVIDIA P100 GPU 的机器上训练模型。对于使用本文所述超参数的基础模型，每个训练步骤大约需要 0.4 秒。我们对基础模型进行了总共 100,000 步 (12 个小时) 的训练。对于我们的大型模型 (在表 3 的最后一行描述)，每步耗时 1.0 秒。大型模型训练了 300,000 步 (3.5 天)。
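
As a quick sanity check, the quoted wall-clock times follow from the step counts and step times. The helper name below is hypothetical. The base model works out slightly under the quoted 12 hours, which is consistent with "about 0.4 seconds" being a rounded figure.

```python
def training_hours(steps: int, seconds_per_step: float) -> float:
    """Total training time in hours for a given number of steps."""
    return steps * seconds_per_step / 3600.0


base_hours = training_hours(100_000, 0.4)     # roughly 11.1 hours
big_days = training_hours(300_000, 1.0) / 24  # roughly 3.5 days

if __name__ == "__main__":
    print(f"base: {base_hours:.1f} hours, big: {big_days:.2f} days")
```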

5.3 Optimizer - 优化器

We used the Adam optimizer [20] with $\beta_{1} = 0.9$ , $\beta_{2} = 0.98$ and $\epsilon = 10^{-9}$ . We varied the learning rate over the course of training, according to the formula:
我们使用 Adam 优化器 [20]，其中 $\beta_{1} = 0.9$ , $\beta_{2} = 0.98$ and $\epsilon = 10^{-9}$ 。我们根据以下公式在训练过程中改变学习率：

$$lrate = d_{\text{model}}^{-0.5} \cdot \min(step\_num^{-0.5}, step\_num \cdot warmup\_steps^{-1.5}) \tag{3}$$

This corresponds to increasing the learning rate linearly for the first $warmup\_steps$ training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used $warmup\_steps = 4000$ .
这对应于在前 $warmup\_steps$ 个训练步骤中线性增加学习率，此后按步骤数平方根的倒数成比例地降低学习率。我们使用 $warmup\_steps = 4000$ 。
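
The schedule in Equation (3) can be written as a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the default $d_{\text{model}} = 512$ is the base-model width from the paper's base configuration, which this section does not state.

```python
def transformer_lrate(step_num: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate at a given training step, following Equation (3)."""
    if step_num < 1:
        raise ValueError("step_num must be >= 1")
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)


if __name__ == "__main__":
    for step in (1, 1000, 4000, 16000, 100000):
        print(f"step {step:6d}: lrate = {transformer_lrate(step):.6f}")
```

The two arguments of `min` are equal exactly at `step_num == warmup_steps`, so the rate rises linearly up to that step, peaks there, and then decays as the inverse square root of the step number.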

5.4 Regularization - 正则化

We employ three types of regularization during training:
训练期间我们采用三种正则化方法：

Residual Dropout. We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of $P_{drop} = 0.1$ .
我们将 dropout [33] 应用到每个子层的输出，在将它与子层的输入相加和 normalization 之前。此外，在编码器和解码器堆叠中，我们将 dropout 应用到嵌入和位置编码的和。对于基本模型，我们使用 $P_{drop} = 0.1$ 丢弃率。
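
The order of operations described above (sub-layer, then dropout, then residual addition, then normalization) can be sketched in plain Python. Several details are assumptions rather than anything the text specifies: "inverted" dropout that rescales the surviving values, a layer normalization without learned gain and bias, and all function names.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def dropout(x: Vector, p_drop: float, rng: random.Random, training: bool = True) -> Vector:
    """Inverted dropout: zero each element with probability p_drop, rescale the rest."""
    if not training or p_drop == 0.0:
        return list(x)
    keep = 1.0 - p_drop
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def layer_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Normalize to zero mean and unit variance (learned gain and bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_sublayer(x: Vector, sublayer: Callable[[Vector], Vector],
                      p_drop: float, rng: random.Random, training: bool = True) -> Vector:
    """Compute LayerNorm(x + Dropout(Sublayer(x)))."""
    dropped = dropout(sublayer(x), p_drop, rng, training)
    return layer_norm([a + b for a, b in zip(x, dropped)])


if __name__ == "__main__":
    rng = random.Random(0)
    double = lambda v: [2.0 * t for t in v]  # stand-in for attention or feed-forward
    print(residual_sublayer([1.0, 2.0, 3.0, 4.0], double, p_drop=0.1, rng=rng))
```

At inference time `training=False` turns dropout into the identity, so the block reduces to the plain residual connection followed by normalization.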

Label Smoothing. During training, we employed label smoothing of value $\epsilon_{ls} = 0.1$ [36]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
在训练过程中，我们使用了值为 $\epsilon_{ls} = 0.1$ 的 label smoothing [36]。这会损害困惑度 (perplexity)，因为模型学得更加不确定，但提高了准确率和 BLEU 得分。
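
A small example shows why label smoothing makes the model "more unsure". The sketch assumes the common variant in which the mass $\epsilon_{ls}$ is spread uniformly over all $K$ classes. Other implementations spread it over the $K - 1$ non-target classes, and the text does not say which one was used. The function names are hypothetical.

```python
import math
from typing import List


def smooth_labels(target: int, num_classes: int, epsilon: float = 0.1) -> List[float]:
    """Target distribution: 1 - epsilon on the true class plus epsilon / K on every class."""
    uniform = epsilon / num_classes
    dist = [uniform] * num_classes
    dist[target] += 1.0 - epsilon
    return dist


def cross_entropy(target_dist: List[float], predicted_probs: List[float]) -> float:
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target_dist, predicted_probs) if t > 0.0)


if __name__ == "__main__":
    target = smooth_labels(target=0, num_classes=4)
    confident = [0.997, 0.001, 0.001, 0.001]
    hedged = [0.925, 0.025, 0.025, 0.025]
    print("loss, confident prediction:", round(cross_entropy(target, confident), 4))
    print("loss, hedged prediction:   ", round(cross_entropy(target, hedged), 4))
```

Under the smoothed target, the over-confident prediction receives the higher loss, so training pushes the model away from probabilities near 1. That raises perplexity even when the top-ranked prediction is unchanged.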

hurt [hɜː(r)t]：v. 受伤，感到疼痛，使不快，使烦恼 n. 委屈，心灵创伤 adj.（身体上） 受伤的，（感情上） 受伤的
perplexity [pə(r)'pleksəti]：n. 困惑，迷惘，难以理解的事物，疑团

6 Results

6.1 Machine Translation

On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is listed in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.
在 WMT 2014 英语-德语翻译任务中，大型 transformer 模型 (表 2 中的 Transformer (big)) 比以前报道的最佳模型 (包括整合模型) 高出 2.0 个 BLEU 以上，确立了 28.4 的全新最高 BLEU 分数。该模型的配置列在表 3 的最后一行。训练在 8 个 P100 GPU 上花费了 3.5 天。即使是我们的基础模型也超过了以前发布的所有模型和整合模型，且训练成本只是任何有竞争力的模型的一小部分。

Table 2: The Transformer achieves better BLEU scores than previous state-of-the-art models on the English-to-German and English-to-French newstest2014 tests at a fraction of the training cost.

Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base model. All metrics are on the English-to-German translation development set, newstest2013. Listed perplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to per-word perplexities.
Transformer 架构的变体。未列出的值与基本模型的值相同。所有指标都基于英文到德文翻译开发集 newstest2013。列出的困惑度是按 wordpiece (依据我们的字节对编码) 计算的，不应与按词计算的困惑度进行比较。

对模型自身的参数执行改变自变量的测试，确认哪些参数对模型的影响比较大。

wordpiece：n. 词片，子词单元 (sub-word unit)

On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0, outperforming all of the previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model. The Transformer (big) model trained for English-to-French used dropout rate $P_{drop} = 0.1$ , instead of 0.3.
在 WMT 2014 英语-法语翻译任务中，我们的大型模型的 BLEU 得分为 41.0，超过了之前发布的所有单一模型，训练成本低于先前最先进模型的 1/4。用于英语-法语的 Transformer (big) 模型使用的丢弃率 (dropout rate) 为 $P_{drop} = 0.1$ ，而不是 0.3。

For the base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints. We used beam search with a beam size of 4 and length penalty $\alpha = 0.6$ [38]. These hyperparameters were chosen after experimentation on the development set. We set the maximum output length during inference to input length + 50, but terminate early when possible [38].
对于基础模型，我们使用的单个模型来自最后 5 个检查点的平均值，这些检查点每 10 分钟写一次。对于大型模型，我们对最后 20 个检查点进行了平均。我们使用 beam search，beam 大小为 4，长度惩罚 $\alpha = 0.6$ [38]。这些超参数是在开发集上进行实验后选定的。在推断时，我们设置最大输出长度为输入长度 + 50，但在可能时尽早终止 [38]。
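
The inference settings above can be sketched as follows. Several parts are assumptions: checkpoints are modelled as plain dictionaries mapping parameter names to flat lists of floats, and the length penalty is taken to be the formula from the cited GNMT paper [38], $lp(Y) = (5 + |Y|)^{\alpha} / (5 + 1)^{\alpha}$, which this section references but does not spell out. All function names are hypothetical.

```python
from typing import Dict, List

Checkpoint = Dict[str, List[float]]  # parameter name -> flattened values


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Element-wise mean of every parameter across the given checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    count = len(checkpoints)
    return {
        name: [sum(values) / count for values in zip(*(c[name] for c in checkpoints))]
        for name in checkpoints[0]
    }


def length_penalty(length: int, alpha: float = 0.6) -> float:
    """GNMT-style penalty; a hypothesis score is divided by this value."""
    return ((5.0 + length) / 6.0) ** alpha


def max_output_length(input_length: int, margin: int = 50) -> int:
    """Upper bound on the number of decoded tokens (input length + 50 for translation)."""
    return input_length + margin


if __name__ == "__main__":
    last_two = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
    print(average_checkpoints(last_two))
    print(length_penalty(20), max_output_length(30))
```

Dividing log-probabilities by a penalty that grows with length offsets the bias of beam search toward short outputs. With $\alpha = 0$ the penalty is 1 for every length and has no effect.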

beam search：定向搜索
beam [biːm]：n. 梁，光线，平衡木，(电波的) 波束 v. 照射，发光，笑容满面，眉开眼笑

Table 2 summarizes our results and compares our translation quality and training costs to other model architectures from the literature. We estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU.
表 2 总结了我们的结果，并将我们的翻译质量和训练成本与文献中的其他模型架构进行了比较。我们通过将训练时间、所使用的 GPU 的数量以及每个 GPU 的持续单精度浮点能力的估计值相乘来估计用于训练模型的浮点运算的数量。

We used values of 2.8, 3.7, 6.0 and 9.5 TFLOPS for K80, K40, M40 and P100, respectively.
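
The estimate described above is a single product: training time × number of GPUs × sustained throughput per GPU. The sketch below applies it to the training times quoted in Section 5.2. The function and constant names are hypothetical, while the TFLOPS values are the ones listed in the text.

```python
# Sustained single-precision throughput per GPU, in TFLOPS, as quoted in the text.
SUSTAINED_TFLOPS = {"K80": 2.8, "K40": 3.7, "M40": 6.0, "P100": 9.5}


def training_flops(hours: float, num_gpus: int, gpu: str) -> float:
    """Estimated floating point operations used to train a model."""
    return hours * 3600.0 * num_gpus * SUSTAINED_TFLOPS[gpu] * 1e12


if __name__ == "__main__":
    base = training_flops(hours=12, num_gpus=8, gpu="P100")
    big = training_flops(hours=3.5 * 24, num_gpus=8, gpu="P100")
    print(f"base: {base:.1e} FLOPs, big: {big:.1e} FLOPs")
```

For 8 P100 GPUs this gives roughly $3.3 \times 10^{18}$ operations for the base model (12 hours) and $2.3 \times 10^{19}$ for the big model (3.5 days).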

6.2 Model Variations

To evaluate the importance of different components of the Transformer, we varied our base model in different ways, measuring the change in performance on English-to-German translation on the development set, newstest2013. We used beam search as described in the previous section, but no checkpoint averaging. We present these results in Table 3.
为了评估 Transformer 不同组件的重要性，我们以不同的方式改变我们的基础模型，测量开发集 newstest2013 上英文-德文翻译的性能变化。我们使用前一节所述的 beam search，但没有平均检查点。我们在 Table 3 中列出这些结果。

In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.
在 Table 3 rows (A) 中，我们改变 attention head 的数量和 attention key 和 value 的维度，保持计算量不变，如 3.2.2节所述。虽然只有一个 head attention 比最佳设置差 0.9 BLEU，但质量也随着 head 太多而下降。
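
Keeping the amount of computation constant while varying the number of heads means shrinking each head as heads are added, so that $h \cdot d_k$ stays equal to $d_{\text{model}}$. A minimal sketch, assuming the base width $d_{\text{model}} = 512$ (not stated in this section) and a hypothetical helper name:

```python
def per_head_dimension(d_model: int, num_heads: int) -> int:
    """Key/value dimension per head when the total width is split evenly across heads."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    return d_model // num_heads


if __name__ == "__main__":
    D_MODEL = 512  # assumed base-model width
    for h in (1, 4, 8, 16, 32):
        d_k = per_head_dimension(D_MODEL, h)
        print(f"h={h:2d}  d_k=d_v={d_k:3d}  h*d_k={h * d_k}")
```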

In Table 3 rows (B), we observe that reducing the attention key size $d_k$ hurts model quality. This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial. We further observe in rows (C) and (D) that, as expected, bigger models are better, and dropout is very helpful in avoiding over-fitting. In row (E) we replace our sinusoidal positional encoding with learned positional embeddings [9], and observe nearly identical results to the base model.
在 Table 3 rows (B) 中，我们观察到减小 key 的大小 $d_k$ 会有损模型质量。这表明确定兼容性并不容易，并且比点积更复杂的兼容性函数可能更有用。我们在 rows (C) and (D) 中进一步观察到，如预期的那样，更大的模型更好，并且丢弃对避免过度拟合非常有帮助。在 row (E) 中，我们用学习到的位置嵌入 [9] 来替换我们的正弦位置编码，并观察到与基本模型几乎相同的结果。
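
For reference, the sinusoidal positional encoding that row (E) swaps out can be generated without any learned parameters. The sketch below uses the formula from the paper's positional-encoding section (not reproduced in this section): $PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{\text{model}}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{\text{model}}})$. The function name and the pure-Python list representation are assumptions made for illustration.

```python
import math
from typing import List


def sinusoidal_positional_encoding(max_len: int, d_model: int) -> List[List[float]]:
    """Return a max_len x d_model table: sine on even dimensions, cosine on odd ones."""
    table = []
    for pos in range(max_len):
        row = []
        for j in range(d_model):
            # Dimensions 2i and 2i + 1 share the same frequency.
            angle = pos / (10000 ** ((2 * (j // 2)) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


if __name__ == "__main__":
    pe = sinusoidal_positional_encoding(max_len=4, d_model=8)
    for pos, row in enumerate(pe):
        print(pos, [round(v, 3) for v in row])
```

A learned positional embedding would replace this fixed table with a trainable `max_len x d_model` matrix. The paragraph above reports that the two choices give nearly identical results.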

determine [dɪ'tɜː(r)mɪn]：v. 确定，决定，测定，查明
sophisticate [sə'fɪstɪkeɪt]：v. 用诡辩欺骗，使迷惑，窜改，掺坏 n. 老于世故的人，见多识广的人

6.3 English Constituency Parsing

To evaluate if the Transformer can generalize to other tasks we performed experiments on English constituency parsing. This task presents specific challenges: the output is subject to strong structural constraints and is significantly longer than the input. Furthermore, RNN sequence-to-sequence models have not been able to attain state-of-the-art results in small-data regimes [37].
为了评估 Transformer 是否可以推广到其他任务，我们对 English constituency parsing 进行了实验。这项任务提出了具体的挑战：输出受到强大的结构约束，并且比输入要长得多。此外，RNN 序列到序列模型还无法在小数据体制中获得最好的结果 [37]。

attain [ə'teɪn]：v. 得到，达到 （某年龄、水平、状况）

Table 4: The Transformer generalizes well to English constituency parsing (Results are on Section 23 of WSJ)

We trained a 4-layer transformer with $d_{model} = 1024$ on the Wall Street Journal (WSJ) portion of the Penn Treebank [25], about 40K training sentences. We also trained it in a semi-supervised setting, using the larger high-confidence and BerkeleyParser corpora from [37] with approximately 17M sentences. We used a vocabulary of 16K tokens for the WSJ-only setting and a vocabulary of 32K tokens for the semi-supervised setting.
我们在 Penn Treebank [25] 的 Wall Street Journal (WSJ) 部分 (约 40K 个训练句子) 上训练了一个 $d_{model} = 1024$ 的 4 层 transformer。我们还在半监督设置下对其进行了训练，使用了来自 [37] 的更大的高置信度语料库和 BerkeleyParser 语料库，约 17M 个句子。对于仅 WSJ 的设置，我们使用 16K 词符的词汇表；对于半监督设置，我们使用 32K 词符的词汇表。

We performed only a small number of experiments to select the dropout, both attention and residual (Section 5.4), learning rates and beam size on the Section 22 development set; all other parameters remained unchanged from the English-to-German base translation model. During inference, we increased the maximum output length to input length + 300. We used a beam size of 21 and $\alpha = 0.3$ for both the WSJ-only and the semi-supervised settings.
我们仅在 Section 22 开发集上进行了少量实验，以选择 dropout (包括 attention dropout 和 residual dropout，见 5.4 节)、学习率和 beam 大小；所有其他参数均与英语-德语基础翻译模型保持一致。在推理过程中，我们将最大输出长度增加到输入长度 + 300。对于仅 WSJ 和半监督两种设置，我们都使用 21 的 beam 大小和 $\alpha = 0.3$ 。

corpus ['kɔːpəs]：n. 语料，全集，文集，语料库 (复数 corpora ['kɔːpərə])

Our results in Table 4 show that, despite the lack of task-specific tuning, our model performs surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar [8].
表 4 中的结果表明，尽管缺少针对特定任务的调优，我们的模型仍表现得出乎意料地好，其结果优于除 Recurrent Neural Network Grammar [8] 之外的所有先前报告的模型。

In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the BerkeleyParser [29] even when training only on the WSJ training set of 40K sentences.
与 RNN 序列到序列模型 [37] 相比，即使仅在 40K 句子的 WSJ training set 上训练时，Transformer 也胜过 BerkeleyParser [29]。

7 Conclusion

In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
在这项工作中，我们介绍了 Transformer，这是完全基于注意力的第一个序列转换模型，用 multi-headed self-attention代替了编码器-解码器体系结构中最常用的循环层。

For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles.
对于翻译任务，与基于循环层或卷积层的体系结构相比，可以大大加快 Transformer 的训练速度。在 WMT 2014 English-to-German and WMT 2014 English-to-French 翻译任务上，我们都达到了新的最佳水平。在前面的任务中，我们最好的模型甚至胜过以前报道过的所有整合模型。

We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goal of ours.
我们对基于注意力的模型的未来感到兴奋，并计划将其应用于其他任务。我们计划将 Transformer 扩展到文本以外的涉及输入和输出方式的问题，并研究局部受限的注意机制，以有效处理大型输入和输出，例如图像、音频和视频。让生成具有更少的顺序性是我们的另一个研究目标。

The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor.

Acknowledgements We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful comments, corrections and inspiration.

fruitful ['fruːtf(ə)l]：adj. 成果丰硕的，富有成效的，富饶的，丰产的
inspiration [.ɪnspə'reɪʃ(ə)n]：n. 灵感，妙计，启发灵感的人 (或事物)，使人产生动机的人 (或事物)
modality [məʊ'dæləti]：n. 形态，情态，形式，方式

Attention Visualizations

Figure 3: An example of the attention mechanism following long-distance dependencies in the encoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of the verb ‘making’, completing the phrase ‘making…more difficult’. Attentions are shown here only for the word ‘making’. Different colors represent different heads. Best viewed in color.
attention 机制的一个示例：在 6 层中第 5 层的编码器 self-attention 中跟踪长距离依赖。很多 attention head 都关注动词 making 的远距离依赖，补全了 making…more difficult 这个短语。此处仅显示单词 making 的 attention。不同的颜色代表不同的 head。彩色效果最佳。

spirit ['spɪrɪt]：n. 精神，灵魂，心灵，勇气 v. 偷偷带走，让人不可思议地弄走
registration [.redʒɪ'streɪʃ(ə)n]：n. 登记，注册，挂号，登记文档

Figure 4: Two attention heads, also in layer 5 of 6, apparently involved in anaphora resolution. Top: Full attentions for head 5. Bottom: Isolated attentions from just the word ‘its’ for attention heads 5 and 6. Note that the attentions are very sharp for this word.
两个 attention head，也在 5/6 层，显然有逆向照应 (下文的词返指或代替上文的词)。Top: head 5 的完整的 attention。Bottom: 仅将 attention heads 5 和 6 中单词 its 的 attention 分离出来。请注意，这个词的 attention 非常明确。

anaphora [ə'næfərə]：n. 逆向照应 (下文的词返指或代替上文的词)

Figure 5: Many of the attention heads exhibit behaviour that seems related to the structure of the sentence. We give two such examples above, from two different heads from the encoder self-attention at layer 5 of 6. The heads clearly learned to perform different tasks.
很多 attention head 表现出的行为似乎与句子的结构有关。我们给出了两个这样的例子，来自编码器 5/6 层 self-attention 的两个不同的 head。Heads 清楚地学会了执行不同的任务。

References
