Technical Dive into Open AI GPT-3

We have all heard a lot about GPT-3 recently, all over LinkedIn, Twitter, and Medium, mostly about its use cases. But what is GPT-3, and why is it so powerful? That is the question on everyone's mind. The GPT-3 team released a 70-page paper. Let's try to understand what is in it.

Note: I won't be explaining any of its use cases, since there are plenty of articles about them available out there. I have also attached them in the References below.

Let’s get to work now :) …

Technical Overview

Generally, before training any model, the first important thing is clean data. So let's look at the data used by the GPT-3 team.

Data:

The major portion of the data is collected from the Common Crawl corpus, which contains raw web page data, metadata, and text extracts. On top of this data, some filtering techniques are applied at the document level, including fuzzy deduplication to remove overlapping documents. In addition, a set of high-quality curated datasets is mixed in, along with an expanded version of the WebText dataset.

So the team downloaded 41 shards of Common Crawl covering 2016 to 2019, amounting to 570 GB after filtering, which is equivalent to roughly 400 billion byte-pair-encoded tokens (byte-pair encoding is a subword algorithm that compresses data by merging frequent byte-pair combinations).
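
To make the byte-pair idea concrete, here is a minimal sketch of the BPE merge step on a toy corpus. It is illustrative only and is not the exact tokenizer used by GPT-3 (which uses a byte-level BPE vocabulary of roughly 50,000 tokens).

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across a {word-as-symbol-tuple: frequency} vocab."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the chosen pair with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words split into characters, with frequencies.
vocab = {tuple("lower"): 2, tuple("lowest"): 1, tuple("newer"): 3}
for _ in range(5):                     # 5 merge steps for illustration
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print("merged", best)
```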

Along with the curated and crawled datasets, they also included two internet-based book corpora and English Wikipedia.

In total, the training data amounts to 300 billion tokens.

In GPT-3, the datasets are not sampled in proportion to their size; higher-quality datasets are sampled more frequently. The paper also mentions that they accepted a small amount of overfitting in exchange for higher-quality training data. This makes it clear that the data was polished very well before training.
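
As a rough illustration of quality-weighted sampling, a data loader might draw the source of each training document like this. The mixture weights below are approximately those reported in the paper; treat the rest of the sketch as illustrative.

```python
import random

# Approximate mixture weights: higher-quality corpora are sampled far more
# often than their raw size alone would suggest.
mixture = {
    "common_crawl": 0.60,
    "webtext2":     0.22,
    "books1":       0.08,
    "books2":       0.08,
    "wikipedia":    0.03,
}

def sample_source(weights):
    """Pick which corpus the next training document comes from."""
    names = list(weights)
    return random.choices(names, weights=[weights[n] for n in names], k=1)[0]

counts = {name: 0 for name in mixture}
for _ in range(10_000):
    counts[sample_source(mixture)] += 1
print(counts)  # counts come out roughly proportional to the mixture weights
```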

They acknowledge that, due to a bug during filtering, there is some contamination (overlap) between the train and test datasets. It was not removed because retraining would be too costly.
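
As a hedged sketch of the kind of n-gram overlap check used to flag such contamination (the paper measures overlap with 13-gram matching; everything else here, including the toy documents, is illustrative):

```python
def ngrams(tokens, n=13):
    """Return the set of all n-grams (as tuples) from a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_doc, train_ngrams, n=13):
    """Flag a test document if any of its n-grams also appears in the training set."""
    return bool(ngrams(test_doc.split(), n) & train_ngrams)

# Hypothetical usage: build the training n-gram set once, then scan test docs.
train_ngrams = set()
for doc in ["... training documents would go here ..."]:
    train_ngrams |= ngrams(doc.split())
print(is_contaminated("... some benchmark question ...", train_ngrams))
```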

Architecture:

As mentioned in the paper, GPT-3 is a successor to the GPT-2 model, an auto-regressive Transformer (decoder-only) with the same pre-normalization and reversible tokenization techniques.

The significant changes to be noted from GPT-2 to GPT-3 are:

  1. Additional decoder layers for each model and a richer dataset

  2. Model parallelism applied across the layers of the model as well as within each matrix multiplication

  3. Usage of sparse attention patterns (similar to the Sparse Transformer's factorization of the attention matrix) to manage memory

OpenAI released a blog post a while back emphasizing how to select the batch size using the gradient noise scale, a simple statistical measure of the signal-to-noise ratio of the network's gradients.
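
As a rough sketch of the idea, the simple noise scale is B ≈ tr(Σ) / |g|², where Σ is the per-example gradient covariance and g the true gradient. The estimator below is simplified and assumes per-example gradients are available; it is not OpenAI's exact implementation.

```python
import numpy as np

def simple_noise_scale(per_example_grads):
    """Estimate the simple gradient noise scale B ≈ tr(Σ) / |g|².

    per_example_grads: array of shape (batch, n_params), one flattened
    gradient per training example. A larger noise scale suggests that a
    larger batch size would still be useful.
    """
    g = per_example_grads.mean(axis=0)                 # estimate of the true gradient
    trace_sigma = per_example_grads.var(axis=0).sum()  # tr(Σ): summed per-parameter variance
    return trace_sigma / (np.linalg.norm(g) ** 2 + 1e-12)

# Toy example with random gradients, just to show the call.
grads = np.random.randn(64, 1000)
print(simple_noise_scale(grads))
```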

This approach eventually helped the GPT-3 (175B) model use a batch size of 3.2M tokens for its training.

The team developed 8 different models varying in parameters, layers, and model size, ranging from 125 million to 175 billion parameters. All of these models are trained on the same number of tokens (300 billion) on high-bandwidth V100 GPUs provided by Microsoft. It is estimated that training the GPT-3 175B model would take 355 GPU-years on a V100 (with one GPU, the task would take 355 years) and cost about $4.6M for a single training run.
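
A back-of-the-envelope check of that 355 GPU-years figure, using the common approximation of roughly 6 · N · D FLOPs for a training run and an assumed sustained V100 throughput (the throughput number is my assumption, not from the paper):

```python
n_params = 175e9          # model parameters
n_tokens = 300e9          # training tokens
total_flops = 6 * n_params * n_tokens   # ≈ 3.15e23 FLOPs (common approximation)

sustained_flops = 28e12   # assumed sustained throughput per V100, in FLOP/s
seconds = total_flops / sustained_flops
gpu_years = seconds / (365 * 24 * 3600)
print(f"{gpu_years:.0f} GPU-years")     # ≈ 355-360 under these assumptions
```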

Before getting deep into the architecture, it is necessary to know a few terms: Auto-Regressive Model, Meta-Learning, Zero-Shot, One-Shot, Few-Shot.

Auto-Regressive model: In an auto-regressive model, we forecast the variable of interest using a combination of past values of the variable. The term auto-regression indicates that it is a regression of the variable against itself. For a language model, this means predicting each token from the tokens that came before it.
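
In language-model terms, the model factorizes the probability of a sequence left to right and generates one token at a time. A minimal greedy decoding loop looks like this, where `model` is a hypothetical callable returning next-token probabilities:

```python
import numpy as np

def generate(model, prompt_ids, max_new_tokens=20):
    """Greedy auto-regressive decoding: each new token is predicted from
    everything generated so far, then appended to the context.

    `model(ids)` is assumed to return a probability distribution over the
    vocabulary for the next token (a hypothetical interface)."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_probs = model(ids)            # p(x_t | x_1 .. x_{t-1})
        ids.append(int(np.argmax(next_probs)))
    return ids
```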

Meta-Learning: In the context of language models, it means the model develops a broad set of skills and pattern-recognition abilities at training time, and then uses these abilities at inference time to rapidly adapt to or recognize the desired task.

The paper uses the term in-context learning to describe the same behavior, which happens inside the forward pass (values passed from the input layer to the output layer) over each sequence.

ZERO SHOT:

In zero-shot, we describe the task to GPT-3, followed by our question, in the prompt. GPT-3 tries to understand the description and give us the answer to the prompt.

[Figure: Zero-shot example]

ONE SHOT:

In one-shot, we give one example along with the task description and then prompt GPT-3 with a question.

[Figure: One-shot example]

FEW SHOTS:

In few-shot, we give multiple examples along with the task description and then prompt a question.

[Figure: Few-shot example]
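
To make the three prompting styles concrete, here are prompt strings for an English-to-French translation task, loosely following the translation example shown in the paper's figures; the exact wording is illustrative.

```python
zero_shot = (
    "Translate English to French:\n"
    "cheese =>"
)

one_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "cheese =>"
)

few_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "plush giraffe => girafe peluche\n"
    "cheese =>"
)
# The model simply completes the prompt, so the expected continuation is "fromage".
```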

Now that we have an idea of how to give input to the GPT-3 model, let's see what happens in the Transformer decoder architecture of GPT-3, using the GPT-2 model as a reference.

Let's unpack each layer and observe what operations are happening.

[Figure: First layer of the decoder]

The flow starts with the word embeddings of the tokens (whose dimension depends on the size of the model), to which position embeddings (vectors encoding, in a specific pattern, the position of each word) are added to form the input representation of each token.
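
A minimal sketch of this input step. The dimensions are toy-sized (in GPT-3 175B the model dimension is 12288 and the context window is 2048), and the embedding tables are random placeholders rather than learned weights.

```python
import numpy as np

# Toy dimensions; in GPT-3 175B, d_model is 12288 and the context window is 2048.
vocab_size, d_model, seq_len = 1000, 64, 2048

token_embedding = np.random.randn(vocab_size, d_model) * 0.02     # learned in the real model
position_embedding = np.random.randn(seq_len, d_model) * 0.02     # learned in the real model

def embed(token_ids):
    """Add the word embedding and the position embedding for each token."""
    positions = np.arange(len(token_ids))
    return token_embedding[token_ids] + position_embedding[positions]

x = embed(np.array([42, 7, 999]))   # arbitrary token ids
print(x.shape)                      # (3, 64)
```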

This is applied to all tokens in a sequence, whose length is limited to 2048. If a document is shorter than 2048 tokens, multiple documents are packed together into one sequence, as this helps computational efficiency.

At the end of each document, a special end-of-text delimiter token is used to separate documents. This indicates to the model that the context after the end-of-text token is unrelated.
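
A rough sketch of that packing step; the delimiter id and sequence length here are placeholders for illustration.

```python
END_OF_TEXT = 50256          # placeholder delimiter id
SEQ_LEN = 2048

def pack_documents(docs):
    """Concatenate tokenized documents, separated by an end-of-text token,
    and cut the stream into fixed-length training sequences."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(END_OF_TEXT)
    # Drop the ragged tail for simplicity in this sketch.
    n_full = len(stream) // SEQ_LEN
    return [stream[i * SEQ_LEN:(i + 1) * SEQ_LEN] for i in range(n_full)]
```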

The next step after input processing is the decoder. Let's get into layer 1, which contains masked self-attention (followed by layer normalization) and a feed-forward neural network.

[Figure: reference from jalammar.github.io]

Masked self-attention is similar to self-attention, except that the model cannot see the words after the current word. As in self-attention, the query, key, and value vectors are computed, and the same attention formula is used to produce the scores for the current word.

For each attention head, token scores are calculated. In the GPT-3 175B model there are 96 heads, so 96 sets of score matrices are calculated for each word; the heads' outputs are then concatenated and multiplied by a weight matrix to form the final single matrix (please refer to the attention calculation here). This final matrix is sent to the feed-forward network.
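
A single-head sketch of masked self-attention in NumPy. The dimensions are toy-sized; the real model uses 96 heads and a model dimension of 12288, and this sketch omits multi-head concatenation and the output projection.

```python
import numpy as np

def masked_self_attention(x, Wq, Wk, Wv):
    """Single-head masked self-attention over a (seq_len, d_model) input."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                 # (seq_len, seq_len)
    # Causal mask: position i may only attend to positions <= i.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -1e9
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v

seq_len, d_model, d_head = 8, 64, 16
x = np.random.randn(seq_len, d_model)
Wq, Wk, Wv = (np.random.randn(d_model, d_head) for _ in range(3))
print(masked_self_attention(x, Wq, Wk, Wv).shape)   # (8, 16)
```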

The feed-forward network again has two layers. The first layer is 4 times the model size (4 × 12288 in the 175B model); the larger size provides capacity to handle the task. The result is then projected back to the original model size by the second layer of the network. The output of the feed-forward network is the output of that particular decoder layer.
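
A sketch of that position-wise feed-forward block, using the GELU activation common to GPT-style models. Dimensions are toy-sized, with the 175B values noted in comments.

```python
import numpy as np

def gelu(x):
    """Gaussian Error Linear Unit (tanh approximation), the activation used in GPT-style blocks."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def feed_forward(x, W1, b1, W2, b2):
    """Expand to 4×d_model, apply the nonlinearity, project back to d_model."""
    return gelu(x @ W1 + b1) @ W2 + b2

d_model = 64                 # 12288 in GPT-3 175B
d_ff = 4 * d_model           # 49152 in GPT-3 175B
W1, b1 = np.random.randn(d_model, d_ff) * 0.02, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.02, np.zeros(d_model)
x = np.random.randn(8, d_model)                # (seq_len, d_model)
print(feed_forward(x, W1, b1, W2, b2).shape)   # (8, 64)
```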

Each model has a single embedding matrix (of shape vocabulary size × model size) and a single set of positional embeddings, shared across all of its blocks.

The Transformer decoder (a stack of 96 layers in GPT-3 175B) produces an output vector for each position.

To convert the output vector into a word, it is passed to a linear layer that produces a logits vector of scores. A softmax layer then turns the logits into probabilities over the vocabulary; the token with the highest probability is chosen as the index, and the word is retrieved from the vocabulary using that index.
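
A sketch of that final projection step. Here the output projection reuses the (toy) token embedding matrix, as GPT-style models typically do; the vocabulary is a placeholder.

```python
import numpy as np

def next_token(hidden, embedding_matrix, vocab):
    """Project the final hidden state to logits, apply softmax, and pick the argmax token."""
    logits = hidden @ embedding_matrix.T              # one score per vocabulary entry
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                              # softmax
    return vocab[int(np.argmax(probs))], probs

vocab = ["the", "cat", "sat", "mat", "."]
d_model = 16
embedding_matrix = np.random.randn(len(vocab), d_model)
hidden = np.random.randn(d_model)                     # decoder output for the last position
word, probs = next_token(hidden, embedding_matrix, vocab)
print(word, probs.round(3))
```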

Note: Additional layer normalization is included in each layer.

Now that we have an idea about the architecture, let's check the performance of the GPT-3 model on the various tasks it was tested on.

[Figure: Tasks tested on and their performance]

LIMITATIONS

  • GPT-3 seems to have special difficulty with "common sense physics". It struggles with questions of the type "If I put cheese into the fridge, will it melt?".

  • Since the model is not bidirectional and has no training objective such as denoising, it performs worst on tasks that involve filling in blanks, or reading a long passage and then generating a very short answer, such as WiC (which involves comparing the use of a word in two sentences), ANLI (which involves comparing two sentences to see if one implies the other), and several reading comprehension tasks.

  • A limitation associated with models at the scale of GPT-3, regardless of objective function or algorithm, is that they are both expensive and inconvenient to perform inference on, which may present a challenge for the practical applicability of models of this scale in their current form.

  • It is uncertain whether, at inference time, the model actually learns new tasks from scratch or simply recognizes patterns it picked up during pre-training. Synthetic tasks such as word scrambling or defining nonsense words seem especially likely to be learned from scratch, whereas translation clearly must be learned during pre-training.

Future Works (picked from the paper)

  • Looking forward to working on a fine-tuning feature.
  • One possible future direction is the distillation of large models down to a manageable size for specific tasks. Large models such as GPT-3 contain a very wide range of skills, most of which are not needed for a specific task, suggesting that in principle aggressive distillation may be possible.

  • Working more aggressively on removing data contamination (removing overlaps between test and train data).
  • Making a bidirectional model at the scale of GPT-3, and trying to make bidirectional models work with few- or zero-shot learning, is a promising direction for future research.

Translated from: https://medium.com/@harika3196/technical-dive-into-open-ai-gpt-3-2ed546085c2b
