Unlike traditional supervised learning, which trains a model to take in an input x and predict an output y as P(y|x), prompt-based learning is based on language models that model the probability of text directly. To use these models to perform prediction tasks, the original input x is modified using a template into a textual string prompt x' that has some unfilled slots, and then the language model is used to probabilistically fill the unfilled information to obtain a final string x̂, from which the final output y can be derived.
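In the abstract, the authors describe prompt-based learning as follows: the input x is passed through a template to obtain x', a text string with unfilled slots; a language model then fills the slots to obtain x̂, from which y can be derived.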
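This feels somewhat similar to BERT's pre-training procedure; its advantage is that pre-training can be done on large-scale raw corpora.
In the schematic of prompt-based learning, the prompt seems to play the role of the “template” mentioned above, while the language model that fills in the blanks corresponds to the existing “Sesame Street” pre-trained models (e.g. ELMo, BERT, ERNIE).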
Fully supervised learning, where a task-specific model is trained solely on a dataset of input-output examples for the target task, has long played a central role in many machine learning tasks, and natural language processing (NLP) was no exception. Because such fully supervised datasets are ever-insufficient for learning high-quality models, early NLP models relied heavily on feature engineering, where NLP researchers or engineers used their domain knowledge to define and extract salient features from raw data and provide models with the appropriate inductive bias to learn from this limited data.
With the advent of neural network models for NLP, salient features were learned jointly with the training of the model itself, and hence focus shifted to architecture engineering, where inductive bias was rather provided through the design of a suitable network architecture conducive to learning such features.
However, from 2017-2019 there was a sea change in the learning of NLP models, and this fully supervised paradigm is now playing an ever-shrinking role. Specifically, the standard shifted to the pre-train and fine-tune paradigm. In this paradigm, a model with a fixed architecture is pre-trained as a language model (LM), predicting the probability of observed textual data. Because the raw textual data necessary to train LMs is available in abundance, these LMs can be trained on large datasets, in the process learning robust general-purpose features of the language they are modeling. The pre-trained LM is then adapted to different downstream tasks by introducing additional parameters and fine-tuning them using task-specific objective functions. Within this paradigm, the focus turned mainly to objective engineering, designing the training objectives used at both the pre-training and fine-tuning stages. For example, Zhang et al. (2020a) show that introducing a loss function of predicting salient sentences from a document will lead to a better pre-trained model for text summarization. Notably, the main body of the pre-trained LM is generally (but not always) fine-tuned as well to make it more suitable for solving the downstream task.
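This is the first major shift in the development of NLP, as summarized by the authors:
- Feature engineering stage (traditional machine learning methods): fully supervised learning dominated, but the annotated corpora that supervision requires were hard to obtain. Research in this period focused on manually extracting features so that machine learning models could learn better from limited data.
- Architecture engineering stage (early neural networks): a major advantage of neural networks is that features need not be designed by hand; the network extracts them automatically. Research in this period focused on designing network architectures better suited to the task, strengthening the model's feature extraction ability.
- Objective engineering stage (rise of pre-trained models): fully supervised learning is always constrained by limited training corpora, whereas the unsupervised learning of pre-trained language models can draw on abundant unlabeled corpora without such limits. “Pre-train + fine-tune” therefore became the mainstream paradigm, and research shifted to designing training objectives (loss functions) for the pre-training and fine-tuning stages.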
Now, as of this writing in 2021, we are in the middle of a second sea change, in which the “pre-train, fine-tune” procedure is replaced by one which we dub “pre-train, prompt, and predict”. In this paradigm, instead of adapting pre-trained LMs to downstream tasks via objective engineering, downstream tasks are reformulated to look more like those solved during the original LM training with the help of a textual prompt.
For example, when recognizing the emotion of a social media post, “I missed the bus today.”, we may continue with a prompt “I felt so ”, and ask the LM to fill the blank with an emotion-bearing word. Or if we choose the prompt “English: I missed the bus today. French: ”, an LM may be able to fill in the blank with a French translation. In this way, by selecting the appropriate prompts we can manipulate the model behavior so that the pre-trained LM itself can be used to predict the desired output, sometimes even without any additional task-specific training (Tab. 1). The advantage of this method is that, given a suite of appropriate prompts, a single LM trained in an entirely unsupervised fashion can be used to solve a great number of tasks.
However, as with most conceptually enticing prospects, there is a catch – this method introduces the necessity for prompt engineering, finding the most appropriate prompt to allow an LM to solve the task at hand.
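The authors argue that we are currently in the middle of a second major shift: the introduction of prompt-based learning.
The authors give the following examples of prompting:
- “I missed the bus today.” -> “I missed the bus today. I felt so ___” (sentiment classification)
- “I missed the bus today.” -> “English: I missed the bus today. French: ______” (machine translation)
Pre-trained models are known to be good at filling in blanks, so we recast the problem as a fill-in-the-blank question; the pre-trained model alone can then complete the task (seemingly without any downstream task training).
Research at this stage therefore focuses on prompt engineering: designing suitable prompts so that the pre-trained model solves the problem better.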
In the Prompt Addition step, a prompting function f_prompt(x) is applied to modify the input text x into a prompt x' = f_prompt(x). In the majority of previous work (Kumar et al., 2016; McCann et al., 2018; Radford et al., 2019; Schick and Schütze, 2021a), this function consists of a two-step process:
- Apply a template, which is a textual string that has two slots: an input slot [X] for input x and an answer slot [Z] for an intermediate generated answer text z that will later be mapped into y.
- Fill slot [X] with the input text x.
In the Answer Search step, we search for the highest-scoring text ẑ that maximizes the score of the LM. We first define Z as a set of permissible values for z. Z could range from the entirety of the language in the case of generative tasks, or could be a small subset of the words in the language in the case of classification, such as defining Z = {“excellent”, “good”, “OK”, “bad”, “horrible”} to represent each of the classes in Y = {++, +, ~, -, --}.
We then define a function f_fill(x', z) that fills in the location [Z] in prompt x' with the potential answer z. We will call any prompt that has gone through this process a filled prompt. In particular, if the prompt is filled with a true answer, we will refer to it as an answered prompt (Tab. 2 shows an example). Finally, we search over the set of potential answers z by calculating the probability of their corresponding filled prompts using a pre-trained LM.
This search function could be an argmax search that searches for the highest-scoring output, or sampling that randomly generates outputs following the probability distribution of the LM.
In the Answer Mapping step, we would like to go from the highest-scoring answer ẑ to the highest-scoring output ŷ. This is trivial in some cases, where the answer itself is the output (as in language generation tasks such as translation), but there are also other cases where multiple answers could result in the same output. For example, one may use multiple different sentiment-bearing words (e.g. “excellent”, “fabulous”, “wonderful”) to represent a single class (e.g. “++”), in which case it is necessary to have a mapping between the searched answer and the output value.
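The paper notes that prompt-based learning can be roughly divided into three steps:
- Prompt Addition: apply the template to add a prompt to the input string
- Answer Search: search for the answer that fills the prompt; the size of the candidate set varies with the task
- Answer Mapping: depending on the task, sometimes Output = Answer and sometimes Output = Map(Answer)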
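The three steps can be sketched end to end as follows. This is a minimal illustration rather than the survey's implementation: the template string, the `toy_score` function (a keyword-matching stand-in for a real LM score) and the extra answer words are assumptions made for the example.

```python
from typing import Callable, Dict

# Hypothetical template with an input slot [X] and an answer slot [Z].
TEMPLATE = "[X] Overall, it was a [Z] movie."

# Answer space Z and its mapping to the label space Y.
ANSWER_TO_LABEL: Dict[str, str] = {
    "excellent": "++", "fabulous": "++", "good": "+",
    "OK": "~", "bad": "-", "horrible": "--",
}


def f_prompt(x: str, template: str = TEMPLATE) -> str:
    """Prompt addition: fill slot [X], leave slot [Z] open."""
    return template.replace("[X]", x)


def f_fill(prompt: str, z: str) -> str:
    """Fill slot [Z] with a candidate answer, giving a filled prompt."""
    return prompt.replace("[Z]", z)


def toy_score(filled_prompt: str) -> float:
    """Stand-in for an LM score. NOT a real language model."""
    text = filled_prompt.lower()
    if "love" in text and "excellent" in text:
        return 1.0
    if "hate" in text and "horrible" in text:
        return 1.0
    return 0.0


def predict(x: str, score: Callable[[str], float] = toy_score) -> str:
    prompt = f_prompt(x)
    # Answer search: argmax over the permissible answers in Z.
    z_hat = max(ANSWER_TO_LABEL, key=lambda z: score(f_fill(prompt, z)))
    # Answer mapping: go from the best answer to the output label.
    return ANSWER_TO_LABEL[z_hat]


print(predict("I love this movie."))
print(predict("I hate this movie."))
```

In practice `toy_score` would be replaced by the probability a pre-trained LM assigns to the filled prompt; the rest of the structure stays the same.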
Training the model to optimize the probability P(x) of text from a training corpus (Radford et al., 2019). In these cases, the text is generally predicted in an autoregressive fashion, predicting the tokens in the sequence one at a time. This is usually done from left to right (as detailed below), but can be done in other orders as well.
Predict the upcoming words or assign a probability $P(x)$ to a sequence of words $x = x_1, \cdots, x_n$. The probability is commonly broken down using the chain rule in a left-to-right fashion: $P(x) = P(x_1) \times \cdots \times P(x_n \mid x_1 \cdots x_{n-1})$.
One word is predicted at a time, from left to right.
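To make the left-to-right factorization concrete, here is a small sketch that multiplies conditional probabilities token by token. The probability table is invented for illustration, and it conditions only on the previous token (a bigram approximation of the full chain rule) to keep the example short.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical conditional probabilities P(next | previous); "<s>" marks the start.
BIGRAM: Dict[Tuple[str, str], float] = {
    ("<s>", "this"): 0.5,
    ("this", "is"): 0.8,
    ("is", "good"): 0.25,
}


def sequence_log_prob(tokens: List[str]) -> float:
    """Sum log P(x_i | x_{i-1}) from left to right."""
    prev = "<s>"
    total = 0.0
    for tok in tokens:
        total += math.log(BIGRAM[(prev, tok)])
        prev = tok
    return total


# 0.5 * 0.8 * 0.25 = 0.1
print(math.exp(sequence_log_prob(["this", "is", "good"])))
```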
Apply some noising function x̃ = f_noise(x) to the input sentence (details in the following subsection), then try to predict the original input sentence given this noised text, P(x|x̃).
First add noise to the sentence, then have the model predict the original sentence from the noised one.
- Corrupted Text Reconstruction (CTR) These objectives restore the processed text to its uncorrupted state by calculating loss over only the noised parts of the input sentence.
- Full Text Reconstruction (FTR) These objectives reconstruct the text by calculating the loss over the entirety of the input texts whether it has been noised or not (Lewis et al., 2020a).
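How the language model is trained affects which tasks it suits, for example:
- FTR attends to the whole text, so it has an advantage in language generation tasks (fluent sentences)
- Left-to-right autoregressive modeling is better suited to prefix prompts
- Denoising modeling is better suited to cloze-style prompts (masked modeling)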
Masking (e.g. Devlin et al. (2019)) The text will be masked in different levels, replacing a token or multi-token span with a special token such as [MASK]. Notably, masking can either be random from some distribution or specifically designed to introduce prior knowledge, such as the above-mentioned example of masking entities to encourage the model to be good at predicting entities.
This objective aims to predict masked text pieces based on the surrounding context. For example, $P(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$ represents the probability of the word $x_i$ given the surrounding context.
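The words to mask can be chosen at random or targeted at a particular class of words. For example, masking named entities trains the model specifically to predict named entities.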
Replacement (e.g. Raffel et al. (2020)) Replacement is similar to masking, except that the token or multi-token span is not replaced with a [MASK] but rather another token or piece of information (e.g., an image region (Su et al., 2020)).
Deletion (e.g. Lewis et al. (2020a)) Tokens or multi-token spans will be deleted from a text without the addition of [MASK] or any other token. This operation is usually used together with the FTR loss.
Permutation (e.g. Liu et al. (2020a)) The text is first divided into different spans (tokens, sub-sentential spans, or sentences), and then these spans are permuted into a new text.
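The four noising operations can be sketched on a token list as below. The function names and the `<X>` substitute token are assumptions chosen for illustration; real pre-training code selects positions by sampling and works on subword ids rather than words.

```python
import random
from typing import List, Set

MASK = "[MASK]"


def mask_tokens(tokens: List[str], positions: Set[int]) -> List[str]:
    """Masking: replace the chosen tokens with a special [MASK] token."""
    return [MASK if i in positions else t for i, t in enumerate(tokens)]


def replace_tokens(tokens: List[str], positions: Set[int], substitute: str) -> List[str]:
    """Replacement: like masking, but substitute another token instead of [MASK]."""
    return [substitute if i in positions else t for i, t in enumerate(tokens)]


def delete_tokens(tokens: List[str], positions: Set[int]) -> List[str]:
    """Deletion: drop the chosen tokens without leaving any placeholder."""
    return [t for i, t in enumerate(tokens) if i not in positions]


def permute_spans(spans: List[List[str]], rng: random.Random) -> List[str]:
    """Permutation: shuffle the spans, then flatten them into a new text."""
    shuffled = list(spans)
    rng.shuffle(shuffled)
    return [tok for span in shuffled for tok in span]


tokens = "This is a good movie".split()
print(mask_tokens(tokens, {3}))
print(replace_tokens(tokens, {3}, "<X>"))
print(delete_tokens(tokens, {3}))
print(permute_spans([["This", "is"], ["a", "good"], ["movie"]], random.Random(0)))
```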
- Left-to-Right The representation of each word is calculated based on the word itself and all previous words in the sentence. For example, if we have a sentence “This is a good movie”, the representation of the word “good” would be calculated based on previous words. This variety of factorization is particularly widely used when calculating standard LM objectives or when calculating the output side of an FTR objective, as we discuss in more detail below.
- Bidirectional The representation of each word is calculated based on all words in the sentence, including words to the right of the current word. In the example above, “good” would be influenced by all words in the sentence, even the following “movie”.
Left-to-right corresponds to the traditional language model.
Bidirectional is closer to a language model built on the attention mechanism.
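The difference between the two factorizations can be shown as visibility masks over the example sentence “This is a good movie”. This is a sketch using plain nested lists; the helper names are assumptions, and entry `[i][j]` says whether position `i` may use position `j` when computing its representation.

```python
from typing import List


def left_to_right_mask(n: int) -> List[List[bool]]:
    """Position i sees itself and all previous positions."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> List[List[bool]]:
    """Every position sees every position."""
    return [[True] * n for _ in range(n)]


tokens = "This is a good movie".split()
i = tokens.index("good")
visible_l2r = [t for t, ok in zip(tokens, left_to_right_mask(len(tokens))[i]) if ok]
visible_bi = [t for t, ok in zip(tokens, bidirectional_mask(len(tokens))[i]) if ok]
print(visible_l2r)  # "movie" is hidden from "good"
print(visible_bi)   # "movie" is visible to "good"
```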
The prompts above will have an empty slot to fill in for z, either in the middle of the prompt or at the end. In the following text, we will refer to the first variety of prompt with a slot to fill in the middle of the text as a cloze prompt, and the second variety of prompt where the input text comes entirely before z as a prefix prompt.
As noted above, there are two main varieties of prompts: cloze prompts (Petroni et al., 2019; Cui et al., 2021), which fill in the blanks of a textual string, and prefix prompts (Li and Liang, 2021; Lester et al., 2021), which continue a string prefix.
Which one is chosen will depend both on the task and the model that is being used to solve the task. In general, for tasks regarding generation, or tasks being solved using a standard auto-regressive LM, prefix prompts tend to be more conducive, as they mesh well with the left-to-right nature of the model. For tasks that are solved using masked LMs, cloze prompts are a good fit, as they very closely match the form of the pre-training task. Full text reconstruction models are more versatile, and can be used with either cloze or prefix prompts. Finally, for some tasks regarding multiple inputs such as text pair classification, prompt templates must contain space for two inputs, [X1] and [X2], or more.
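There are two forms of prompt:
- cloze prompt: a fill-in-the-blank prompt with the slot inside the text
- prefix prompt: a prompt with the slot at the end of the text
Which one to choose depends on the type of task and on how the language model was trained.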
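As a small illustration of the distinction, the helper below labels a template by where its answer slot sits. The function name and the rule (“prefix” when nothing but whitespace follows `[Z]`) are simplifying assumptions for this sketch.

```python
def prompt_shape(template: str) -> str:
    """Return "prefix" if the answer slot [Z] ends the template, else "cloze"."""
    if "[Z]" not in template:
        raise ValueError("template has no answer slot [Z]")
    _, _, after = template.partition("[Z]")
    return "prefix" if after.strip() == "" else "cloze"


print(prompt_shape("[X] Overall, it was a [Z] movie."))  # slot in the middle
print(prompt_shape("English: [X] French: [Z]"))          # slot at the end
```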
While the strategy of manually crafting templates is intuitive and does allow solving various tasks with some degree of accuracy, there are also several issues with this approach: (1) creating and experimenting with these prompts is an art that takes time and experience, particularly for some complicated tasks such as semantic parsing (Shin et al., 2021); (2) even experienced prompt designers may fail to manually discover optimal prompts (Jiang et al., 2020c).
Manual design is intuitive, but it is difficult and inefficient; relatively little work in this area is covered.
To address these problems, a number of methods have been proposed to automate the template design process. In particular, the automatically induced prompts can be further separated into discrete prompts, where the prompt is an actual text string, and continuous prompts, where the prompt is instead described directly in the embedding space of the underlying LM.
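There are two kinds of automatic template methods:
- discrete prompts: the prompt is actual text
- continuous prompts: the prompt is injected in the form of embeddings (the name parallels continuous word embeddings such as Word2Vec, as opposed to discrete tokens)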
One other orthogonal design consideration is whether the prompting function fprompt(x) is static, using essentially the same prompt template for each input, or dynamic, generating a custom template for each input. Both static and dynamic strategies have been used for different varieties of discrete and continuous prompts, as we will mention below.
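Another way to categorize them:
- static: the same prompt is added to every text in the dataset
- dynamic: each text in the dataset gets its own prompt (this feels like the better option)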
Given a set of training inputs x and outputs y, this method scrapes a large text corpus (e.g. Wikipedia) for strings containing x and y, and finds either the middle words or dependency paths between the inputs and outputs. Frequent middle words or dependency paths can serve as a template as in “[X] middle words [Z]”.
The original paper is available at: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00324/96460/How-Can-We-Know-What-Language-Models-Know
We identify all the Wikipedia sentences that contain both subjects and objects of a specific relation r using the assumption of distant supervision.
For example, “Barack Obama was born in Hawaii” is converted into a prompt “x was born in y” by replacing the subject and the object with placeholders (middle words).
For instance, the dependency path in the above example is “France ← of ← capital ← is → Paris”, where the leftmost and rightmost words are “capital” and “Paris”, giving a prompt of “capital of x is y”.
Notably, these mining-based methods do not rely on any manually created prompts, and can thus be flexibly applied to any relation where we can obtain a set of subject-object pairs. This will result in diverse prompts, covering a wide variety of ways that the relation may be expressed in text. However, it may also be prone to noise, as many prompts acquired in this way may not be very indicative of the relation (e.g., “x,y”), even if they are frequent.
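First, take the condition x and the target y (which depend on the task type) and find the sentences in the corpus where x and y co-occur. The sentences are then processed in one of two ways: use the middle words directly as the prompt, or parse the sentence and assemble the prompt from the dependency path.
Prompts found this way can be very diverse and may even introduce noise.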
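The middle-word variant of prompt mining can be sketched as follows. This is a simplified illustration, not the procedure from the cited paper: the three-sentence corpus is invented, matching is done by plain substring search, and only sentences where x appears before y are considered.

```python
from collections import Counter
from typing import List, Tuple


def mine_middle_word_templates(
    sentences: List[str], pairs: List[Tuple[str, str]]
) -> List[Tuple[str, int]]:
    """Count the text found between x and y and turn it into templates."""
    counts: Counter = Counter()
    for sent in sentences:
        for x, y in pairs:
            ix, iy = sent.find(x), sent.find(y)
            # Skip sentences that lack x or y, or where y does not follow x.
            if ix == -1 or iy == -1 or ix + len(x) > iy:
                continue
            middle = sent[ix + len(x):iy].strip()
            counts[f"[X] {middle} [Z]"] += 1
    return counts.most_common()


corpus = [
    "Barack Obama was born in Hawaii.",
    "Albert Einstein was born in Ulm.",
    "Barack Obama grew up in Hawaii.",
]
pairs = [("Barack Obama", "Hawaii"), ("Albert Einstein", "Ulm")]
for template, freq in mine_middle_word_templates(corpus, pairs):
    print(freq, template)
```

Frequent templates rise to the top, but as the excerpt notes, frequency alone does not guarantee that a template expresses the relation.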
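Paraphrasing-based approaches take in an existing seed prompt (e.g. manually constructed or mined), paraphrase it into a set of other candidate prompts, and then select the one that achieves the highest training accuracy on the target task.
First obtain a seed prompt, either manually or by mining; then paraphrase the seed prompt in various ways to obtain multiple candidate prompts, and choose the best one among them. Paraphrasing methods include back-translation, replacement using dictionary definitions, and training a dedicated neural paraphrasing model.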
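Wallace et al. (2019a) applied a gradient-based search over actual tokens to find short sequences that can trigger the underlying pre-trained LM to generate the desired target prediction. This search is done in an iterative fashion, stepping through tokens in the prompt.
This paper borrows an idea from adversarial attacks on neural networks, where gradients are used to work backwards to the inputs the network handles worst (Hung-yi Lee covered this in his lectures; I have forgotten some details and should review them). The authors apply the idea to prompt generation, using gradients to find the prompt that most strongly drives the model to output the correct answer.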
Other works treat the generation of prompts as a text generation task and use standard natural language generation models to perform this task.
This is probably the most standard NLP approach: treat the task as a seq2seq problem and train on it, which requires some training data.
Davison et al. (2019) investigate the task of knowledge base completion and design a template for an input (head-relation-tail triple) using LMs. They first hand-craft a set of templates as potential candidates, and fill the input and answer slots to form a filled prompt. They then use a unidirectional LM to score those filled prompts, selecting the one with the highest LM probability.
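First hand-craft many prompt templates, then fill x into each template to obtain a sentence. Finally, use a language model to score each sentence: one that reads like a plausible sentence counts as a good prompt, and one that does not read like natural language counts as a bad prompt. This makes full use of the language model, but the underlying premise is not necessarily sound.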
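The scoring idea can be sketched as below. The per-token log-probability table is invented and acts as a unigram stand-in for the unidirectional LM described in the excerpt, and the templates and the example triple are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical per-token log-probabilities; unseen tokens get a low default.
TOKEN_LOG_PROB: Dict[str, float] = {
    "musicians": -2.0, "can": -1.0, "are": -1.0, "of": -1.0,
    "play": -2.0, "music": -2.0, "capable": -4.0,
}
UNKNOWN_LOG_PROB = -8.0


def toy_lm_log_prob(sentence: str) -> float:
    """Stand-in for an LM: sum of per-token log-probabilities."""
    return sum(
        TOKEN_LOG_PROB.get(tok, UNKNOWN_LOG_PROB)
        for tok in sentence.lower().split()
    )


def best_template(templates: List[str], x: str, z: str) -> str:
    """Fill both slots of every template and keep the highest-scoring one."""
    def filled(template: str) -> str:
        return template.replace("[X]", x).replace("[Z]", z)

    return max(templates, key=lambda t: toy_lm_log_prob(filled(t)))


templates = ["[X] can [Z]", "[X] are capable of [Z]", "[X] capable [Z]"]
print(best_template(templates, "musicians", "play music"))
```

Note that a raw sum of log-probabilities favors shorter sentences; whether and how to normalize by length is a design choice this sketch leaves out.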
Because the purpose of prompt construction is to find a method that allows an LM to effectively perform a task,
rather than being for human consumption, it is not necessary to limit the prompt to human-interpretable natural
language.
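The purpose of a prompt is to let the language model perform as well as possible, not to make a sentence look plausible or read smoothly. The form of a prompt therefore need not be confined to natural language, and a prompt need not be understandable to humans. This is the motivation for continuous prompts.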
Prefix Tuning (Li and Liang, 2021) is a method that prepends a sequence of continuous task-specific vectors to the input, while keeping the LM parameters frozen. Mathematically, this consists of optimizing over the following log-likelihood objective given a trainable prefix matrix M_φ and a fixed pre-trained LM parameterized by θ.
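The objective itself did not survive in these notes. As best I can reconstruct it from the survey (the exact notation here should be treated as an assumption and checked against the original), it reads:

```latex
\max_{\phi} \log P(y \mid x; \theta; \phi)
  = \max_{\phi} \sum_{y_i} \log P(y_i \mid h_{<i}; \theta; \phi)
```

Here $h_{<i} = [h_{<i}^{(1)}; \cdots; h_{<i}^{(n)}]$ is the concatenation of all network layers at time step $i$. It is copied directly from $M_\phi$ when the time step falls inside the prefix, and otherwise computed by the frozen pre-trained LM.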
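The definition mentions time steps. These refer to token positions during autoregressive decoding, so they do not rule out self-attention models; Li and Liang applied the method to Transformer LMs (GPT-2 and BART).
The key idea is to prepend a trainable matrix. Each row of the matrix is equivalent to the concatenation of all layers' activations for one position in the LM's computation (i rows correspond to i tokens prepended to the original text), but there are no neural connections behind them; the matrix values are updated directly during training.
So the prompt obtained by this method must also be static.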
There are also methods that initialize the search for a continuous prompt using a prompt that has already been created or discovered using discrete prompt search methods.
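This one is easy to understand: convert a discrete prompt into continuous form through the model's embeddings and use it as the initial value for further tuning.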
Instead of using a purely learnable prompt template, these methods insert some tunable embeddings into a hard prompt template.
Liu et al. (2021b) propose “P-tuning”, where continuous prompts are learned by inserting trainable variables into the embedded input. To account for interaction between prompt tokens, they represent prompt embeddings as the output of a BiLSTM (Graves et al., 2013). P-tuning also introduces the use of task-related anchor tokens (such as “capital” in relation extraction) within the template for further improvement. These anchor tokens are not tuned during training.
In contrast to prompt engineering, which designs appropriate inputs for prompting methods, answer engineering aims to search for an answer space Z and a map to the original output Y that results in an effective predictive model. Fig.1’s “Answer Engineering” section illustrates two dimensions that must be considered when performing answer engineering: deciding the answer shape and choosing an answer design method.
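As noted earlier, Z is not necessarily equal to Y, so an additional Z→Y mapping may be needed.
Answer engineering amounts to: find Z, then map it to Y.
Two things matter here: the answer shape and the design method.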
The shape of an answer characterizes its granularity. Some common choices include:
- Tokens: One of the tokens in the pre-trained LM’s vocabulary, or a subset of the vocabulary.
- Span: A short multi-token span. These are usually used together with cloze prompts.
- Sentence: A sentence or document. These are commonly used with prefix prompts.
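Answer forms can be divided by granularity: token level, span level, and sentence level.
Different tasks need different answer forms; for example, classification tasks may use token-level answers, while generation tasks need sentence-level answers.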
The next question to answer is how to design the appropriate answer space Z, as well as the mapping to the output
space Y if the answers are not used as the final outputs.
That is, how to determine the candidate set, and how to map from Z to Y.
In manual design, the space of potential answers Z and its mapping to Y are crafted manually by an interested
system or benchmark designer. There are a number of strategies that can be taken to perform this design.
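Unconstrained spaces: the candidate set Z is the entire vocabulary; in this case usually Z = Y.
Constrained spaces: typically text classification and similar tasks, where Z ≠ Y.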
These methods start with an initial answer space Z', and then use paraphrasing to expand this answer space to broaden its coverage (Jiang et al., 2020b). Given a pair of answer and output $\langle z', y \rangle$, we define a function that generates a paraphrased set of answers para(z'). The probability of the final output is then defined as the marginal probability of all of the answers in this paraphrase set: $P(y \mid x) = \sum_{z \in \mathrm{para}(z')} P(z \mid x)$.
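First we need a candidate answer set Z whose correspondence to Y is known.
Then paraphrase each candidate in Z to obtain a set of paraphrases.
Finally, sum the probabilities of all the paraphrases and assign the total to the corresponding y; this is the probability of y.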
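The marginalization over a paraphrase set is a plain sum, as the sketch below shows. The answer probabilities are made-up numbers standing in for P(z|x) from an LM, and the paraphrase sets are assumed to be given.

```python
from typing import Dict, List


def label_probabilities(
    answer_probs: Dict[str, float], paraphrases: Dict[str, List[str]]
) -> Dict[str, float]:
    """P(y|x) is the sum of P(z|x) over the paraphrase set attached to y."""
    return {
        y: sum(answer_probs.get(z, 0.0) for z in zs)
        for y, zs in paraphrases.items()
    }


# Hypothetical P(z|x) values for one input x.
answer_probs = {
    "excellent": 0.30, "fabulous": 0.20, "wonderful": 0.10,
    "bad": 0.25, "horrible": 0.15,
}
paraphrases = {
    "++": ["excellent", "fabulous", "wonderful"],
    "--": ["bad", "horrible"],
}
print(label_probabilities(answer_probs, paraphrases))
```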
In these methods, first, an initial pruned answer space of several plausible answers Z’ is generated, and then an algorithm further searches over this pruned space to select a final set of answers.
I could not follow this part; I will read the original paper when I get the chance.
For example, for the relation per:city-of-death, the decomposed label words would be {person, city, death}. The probability of the answer span will be calculated as the sum of each token’s probability.
A method specific to relation extraction.
Very few works explore the possibility of using soft answer tokens which can be optimized through gradient descent. Hambardzumyan et al. (2021) assign a virtual token for each class label and optimize the token embedding for each class together with prompt token embeddings. Since the answer tokens are optimized directly in the embedding space, they do not make use of the embeddings learned by the LM and instead learn an embedding from scratch for each label.
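There are very few continuous (soft) answer methods, and the authors describe only one here. It treats each class label as a learnable vector that is initialized directly rather than taken from the language model.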
The prompt engineering methods we discussed so far focused mainly on constructing a single prompt for an input. However, a significant body of research has demonstrated that the use of multiple prompts can further improve the efficacy of prompting methods, and we will call these methods multi-prompt learning methods. In practice, there are several ways to extend single-prompt learning to the use of multiple prompts, which have a variety of motivations. We summarize representative methods in the “Multi-prompt Learning” section of Fig.1 as well as Fig.4.
In all the prompting methods above, each input corresponds to a single prompt. However, research shows that adding multiple prompts to one input can effectively improve results. Some multi-prompt methods follow.
Prompt ensembling is the process of using multiple unanswered prompts for an input at inference time to make predictions. The multiple prompts can either be discrete prompts or continuous prompts.
This sort of prompt ensembling can
(1) leverage the complementary advantages of different prompts,
(2) alleviate the cost of prompt engineering, since choosing one best-performing prompt is challenging,
(3) stabilize performance on downstream tasks.
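Advantages of ensembling: complementary strengths, no need to choose (you can have them all), and more stable performance.
The discussion below focuses on how to combine the predictions from different prompts.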
The most intuitive way to combine the predictions when using multiple prompts is to take the average of probabilities from different prompts.
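That is, average the results obtained from the top K prompts.
Simple averaging may not be optimal in some cases, so weighted averaging is also used for ensembling.
Voting suits classification tasks: the final result is chosen according to the votes of multiple classifiers.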
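The three combination rules (uniform averaging, weighted averaging, majority voting) can be sketched over per-prompt label distributions. The distributions and weights below are invented numbers, chosen so that averaging and voting disagree.

```python
from collections import Counter
from typing import Dict, List, Optional


def average(
    dists: List[Dict[str, float]], weights: Optional[List[float]] = None
) -> Dict[str, float]:
    """Uniform average by default; weighted average when weights are given."""
    if weights is None:
        weights = [1.0] * len(dists)
    total = sum(weights)
    return {
        label: sum(w * d[label] for w, d in zip(weights, dists)) / total
        for label in dists[0]
    }


def majority_vote(dists: List[Dict[str, float]]) -> str:
    """Each prompt votes for its own top label; the most common label wins."""
    votes = Counter(max(d, key=d.get) for d in dists)
    return votes.most_common(1)[0][0]


# Hypothetical predictions from three prompts for the same input.
dists = [
    {"pos": 0.90, "neg": 0.10},
    {"pos": 0.40, "neg": 0.60},
    {"pos": 0.45, "neg": 0.55},
]
print(average(dists))                   # one confident prompt tips the mean to "pos"
print(average(dists, [0.2, 0.4, 0.4]))  # weighted version
print(majority_vote(dists))             # two of three prompts vote "neg"
```

The example shows why the choice of rule matters: averaging keeps confidence information, while voting discards it.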
An ensemble of deep learning models can typically improve the performance, and this
superior performance can be distilled into a single model using knowledge distillation.
After ensembling multiple models, their knowledge can be distilled into a single model.
There is relatively little work on prompt ensembling for generation tasks (i.e. tasks where the answer is a string of tokens instead of a single one). A simple way to perform ensembling in this case is to use standard methods that generate the output based on the ensembled probability of the next word in the answer sequence.
Ensembling is done word by word. This also suggests that prompting is not yet well suited to translation (generation) tasks and works better for tasks whose answer is a single word (classification, extraction).
Prompt augmentation, also sometimes called demonstration learning (Gao et al., 2021), provides a few additional answered prompts that can be used to demonstrate how the LM should provide the answer to the actual prompt instantiated with the input x.
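For example, instead of only giving the prompt “China’s capital is [Z].”, one can preface it with answered prompts: “Great Britain’s capital is London. Japan’s capital is Tokyo. China’s capital is [Z].”
These few-shot demonstrations take advantage of the ability of strong language models to learn repetitive patterns (Brown et al., 2020). Although the idea of prompt augmentation is simple, there are several aspects that make it challenging: (1) Sample Selection: how to choose the most effective examples? (2) Sample Ordering: how to order the chosen examples with the prompt?
The advantage of prompt augmentation is that it makes fuller use of the language model. It also brings additional difficulties:
- How to choose the most effective demonstrations: the choice has a large impact and a poor one can make the model fail outright
- How the demonstrations should be ordered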
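Building an augmented prompt is simple string assembly, sketched below. The template and the capital-city demonstrations are illustrative assumptions; the order of the `demonstrations` list is exactly the sample-ordering choice discussed above.

```python
from typing import List, Tuple


def augment_prompt(
    template: str, demonstrations: List[Tuple[str, str]], x: str
) -> str:
    """Prepend answered prompts, then append the unanswered prompt for x."""
    answered = [
        template.replace("[X]", dx).replace("[Z]", dz)
        for dx, dz in demonstrations
    ]
    return " ".join(answered + [template.replace("[X]", x)])


template = "[X]'s capital is [Z]."
demos = [("Great Britain", "London"), ("Japan", "Tokyo")]
print(augment_prompt(template, demos, "China"))
```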
For those composable tasks, which can be composed based on more fundamental subtasks, we can also perform prompt composition, using multiple sub-prompts, each for one subtask, and then defining a composite prompt based on those sub-prompts.
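For some special task types, the main task can be decomposed into several subtasks; each subtask gets its own sub-prompt, and the main prompt is composed from the sub-prompts.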
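Take relation extraction, where the goal is to find the relation between two entities. It can be decomposed into subtasks that identify the two entities and classify the relation.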
For tasks where multiple predictions should be performed for one sample (e.g., sequence labeling), directly defining a holistic prompt with regards to the entire input text x is challenging. One intuitive method to address this problem is to break down the holistic prompt into different sub-prompts, and then answer each sub-prompt separately.
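Some tasks require multiple outputs, such as sequence labeling, where one input sentence yields multiple labels. A single prompt is hard to apply to such tasks, so the prompt can be broken into pieces; answering each sub-prompt completes the task.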
In many cases, prompting methods can be used without any explicit training of the LM for the down-stream task, simply taking an LM that has been trained to predict the probability of text P(x) and applying it as-is to fill the cloze or prefix prompts defined to specify the task. This is traditionally called the zero-shot setting, as there is zero training data for the task of interest.
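In theory, with prompting methods the language model can be used as-is, with no additional training data.
Few-shot learning: a very small number of examples are used to train the model.
Full-data learning: a reasonably large number of training examples are used to train the model.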
In prompt-based downstream task learning, there are usually two types of parameters, namely those from (1) pre-trained models and (2) prompts. Which part of the parameters should be updated is one important design decision, which can lead to different levels of applicability in different scenarios. We summarize five tuning strategies (as shown in Tab. 6) based on (i) whether the parameters of the underlying LM are tuned, (ii) whether there are additional prompt-related parameters, (iii) if there are additional prompt-related parameters, whether those parameters are tuned.
Parameters fall into two groups: language model parameters and prompt parameters.
There are five strategies, determined by whether the LM parameters are tuned, whether additional prompt parameters exist, and whether those prompt parameters are tuned.
Here we refer to pre-training and fine-tuning without prompts as promptless fine-tuning. In this strategy, given a dataset of a task, all (or some (Howard and Ruder, 2018; Peters et al., 2019)) of the parameters of the pre-trained LM will be updated via gradients induced from downstream training samples.
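BERT is the representative example. This refers to the traditional approach without prompts.
- Advantage: simple, with no prompt design needed
- Disadvantage: needs a fair amount of training data, otherwise it may overfit or be unstable (catastrophic forgetting, where the language model loses abilities it had before fine-tuning)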
Tuning-free prompting directly generates the answers without changing the parameters of the pre-trained LMs based only on a prompt.
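The language model parameters are left unchanged; only a prompt is added.
- Advantage: no training and no parameter updates, so it is efficient
- Disadvantage: accuracy depends entirely on the quality of the prompt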
In the scenario where additional prompt-relevant parameters are introduced besides parameters of the pre-trained model, fixed-LM prompt tuning updates only the prompts’ parameters using the supervision signal obtained from the downstream training samples, while keeping the entire pre-trained LM unchanged.
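The language model parameters are left unchanged; a prompt is added and additional parameters are introduced in the prompt. Only the prompt parameters are trained.
- Advantage: outperforms tuning-free prompting
- Disadvantage: requires some training data, and continuous prompts are usually not human-interpretable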
Fixed-prompt LM tuning tunes the parameters of the LM, as in the standard pre-train and fine-tune paradigm, but additionally uses prompts with fixed parameters to specify the model behavior. This potentially leads to improvements, particularly in few-shot scenarios.
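The language model parameters are trained while the prompt is kept fixed, the opposite of the previous method.
- Advantage: the prompt specifies the task more strongly and learning is more effective (because more parameters are updated)
- Disadvantage: the language model fits the downstream task more closely, but this also brings overfitting and makes it harder to transfer to other tasks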
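In this setting, there are prompt-relevant parameters, which can be fine-tuned together with all or some of the parameters of the pre-trained models.
All parameters take part in fine-tuning.
- Advantage: the strongest fitting capacity, with a clear edge when data is plentiful
- Disadvantage: overfits when data is limited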
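The five strategies can be summarized as a small table in code. The field names are my own, the name “Prompt+LM fine-tuning” for the last strategy is an assumption (its heading is missing from these notes), and the boolean values follow my reading of the three criteria above rather than a copy of Tab. 6.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TuningStrategy:
    name: str
    lm_tuned: bool             # (i) are the LM parameters updated?
    extra_prompt_params: bool  # (ii) are there additional prompt parameters?
    prompt_params_tuned: bool  # (iii) if so, are they updated?


STRATEGIES: List[TuningStrategy] = [
    TuningStrategy("Promptless fine-tuning", True, False, False),
    TuningStrategy("Tuning-free prompting", False, False, False),
    TuningStrategy("Fixed-LM prompt tuning", False, True, True),
    TuningStrategy("Fixed-prompt LM tuning", True, False, False),
    TuningStrategy("Prompt+LM fine-tuning", True, True, True),
]


def trainable_parts(s: TuningStrategy) -> List[str]:
    """List which parameter groups receive gradient updates."""
    parts = []
    if s.lm_tuned:
        parts.append("LM")
    if s.extra_prompt_params and s.prompt_params_tuned:
        parts.append("prompt")
    return parts


for s in STRATEGIES:
    print(f"{s.name}: {trainable_parts(s) or ['nothing']}")
```

Promptless fine-tuning and fixed-prompt LM tuning update the same parameters; they differ in whether a (fixed) prompt is used to specify the task.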
The motivation of exploring this task is to quantify how much factual knowledge the pre-trained LM’s internal representations bear.
Besides factual knowledge, large-scale pre-training also allows LMs to handle linguistic phenomena such as analogies (Brown et al., 2020), negations (Ettinger, 2020), semantic role sensitivity (Ettinger, 2020), semantic similarity (Sun et al., 2021), cant understanding (Sun et al., 2021), and rare word understanding (Schick and Schütze, 2020). The above knowledge can also be elicited by presenting linguistic probing tasks in the form of natural language sentences that are to be completed by the LM.
This area has been explored extensively, mostly with cloze-style prompts. Research focuses more on tasks in the few-shot setting.
NLI aims to predict the relationship (e.g., entailment) of two given sentences.
Likewise, cloze-style prompts are used more often, with a focus on the few-shot setting.
Relation extraction is a task of predicting the relation between two entities in a sentence.
The relation extraction task presents two difficulties:
Semantic parsing is a task of generating a structured meaning representation given a natural language input.
There is still a debate about if deep neural networks are capable of performing “reasoning” or just memorizing patterns based on large training data.
For example, to self-diagnose whether the generated text contains violent information, we can use the following template: “The following text contains violence. [X][Z]”. Then we fill [X] with the input text and look at the generation probability at [Z]: if the probability of “Yes” is greater than that of “No”, we assume the given text contains violence, and vice versa.
As an example, suppose we have an unlabeled dataset in which each sample is a sentence. If we want to construct a dataset containing pairs of semantically similar sentences, then we can use the following template for each input sentence: “Write two sentences that mean the same thing. [X][Z]”
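Ensemble learning requires training multiple models. Prompt-based learning offers a new approach: simply changing the prompt yields different results, so repeated training is unnecessary.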
Applications to information extraction and text analysis tasks have been discussed less, largely because the design of prompts is less straightforward.
Prompts for this kind of task are not easy to design.
In many NLP tasks, the inputs are imbued with some variety of structure, such as tree, graph, table, or relational structures. How to best express these structures in prompt or answer engineering is a major challenge.
If the input has a graph or tree structure, the prompt must change accordingly.
For classification-based tasks, there are two main challenges for answer engineering:
(a) When there are too many classes, how to select an appropriate answer space becomes a difficult combinatorial optimization problem.
(b) When using multi-token answers, how to best decode multiple tokens using LMs remains unknown.
For text generation tasks, qualified answers can be semantically equivalent but syntactically diverse. So far, almost all works use prompt learning for text generation relying solely on a single answer, with only a few exceptions.
Answers lack diversity.
In prompt ensembling methods, the space and time complexity increase as we consider more prompts. How to distill the knowledge from different prompts remains underexplored.
Both prompt composition and decomposition aim to break down the difficulty of a complicated task input by introducing multiple sub-prompts. In practice, how to make a good choice between them is a crucial step.
Existing prompt augmentation methods are limited by the input length, i.e., feeding too many demonstrations to the input is infeasible. Therefore, how to select informative demonstrations and order them in an appropriate way is an interesting but challenging problem.
We may also consider prompt sharing, where prompt learning is applied to multiple tasks, domains, or languages. Some key issues that may arise include how to design individual prompts for different tasks, and how to modulate their interaction with each other.
With plenty of pre-trained LMs to select from (see §3), how to choose them to better leverage prompt-based learning is an interesting and difficult problem.
Despite their success in many scenarios, theoretical analysis and guarantees for prompt-based learning are scarce.
Understanding the extent to which prompts are specific to the model and improving the transferability of prompts are also important topics.
Not all prompts are transferable, nor are all prompts non-transferable.