Notes on "reStructured Pre-training"

This post mainly records the parts of the paper that I consider important, together with my own understanding; if there are mistakes, please point them out directly. Due to formatting issues, I strongly recommend reading the complete version on the Notion page. Thanks!

Abstract

In such a paradigm, the role of data will be re-emphasized, and model pre-training and fine-tuning of downstream tasks are viewed as a process of data storing and accessing.

a good storage mechanism should not only have the ability to cache a large amount of data but also consider the ease of access.
We achieve this by pre-training models over restructured data that consist of a variety of valuable information instead of raw data after overcoming several engineering challenges.

Hypothesis of NLP technique evolution

[Figure: hypothesis of NLP technique evolution]

1 Introduction

We argue that the ultimate goal of data storage is to better serve human life, and how data is accessed is as important as how it is stored. However, there are often differences in the way that data is stored and accessed.

The authors argue that the ultimate goal of storing data is to better serve human life, and therefore how data is accessed is just as important as how it is stored.

Although prompting methods have narrowed the difference between data storage and access, it does not fundamentally eliminate the gap, as the way models store data in the pre-training stage is not transparent to diverse downstream tasks.

Although prompting methods narrow the difference between data storage and access, they do not fundamentally eliminate the gap, because the way a model stores data during pre-training is not transparent to the diverse downstream tasks.

In other words, downstream tasks do not know which method (i.e., which prompts) can best retrieve the desired data from the pre-trained model.

For example, in a sentiment classification task, to predict the sentiment of a sentence with the help of a pre-trained model, we must choose a question format the model is familiar with. However, the system designer does not know which format the model prefers, because the distribution and structure of the pre-training data are not interpretable. The figure below illustrates this example vividly:

[Figure: the prompt-selection problem in sentiment classification]
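To make the opacity concrete, here is a minimal Python sketch of the prompt-selection problem; the sentence and templates below are illustrative assumptions, not taken from the paper:

```python
# Illustrative sketch: three equally plausible prompt templates for the
# same sentiment query. Sentence and templates are invented for this note.
sentence = "The movie was a waste of two hours."

templates = [
    f"{sentence} Overall, the sentiment was [MASK].",
    f"Review: {sentence} Sentiment: [MASK]",
    f"Is the following review positive or negative? {sentence} [MASK]",
]

# Each template is a different "access pattern" into the PLM. Because
# the pre-training data distribution is opaque, the designer cannot
# tell in advance which pattern the model will answer most reliably.
for t in templates:
    print(t)
```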

Methodologically, we present a new way to look at data that contains various types of information, which could be regarded as pre-training signals that can instruct models for parameter optimization. We structurally represent data in the unit of signals and claim that a good PLM should mark various signals during pre-training in a way that expected information could be accessed efficiently by downstream tasks.

The authors view the different kinds of information contained in the data as pre-training signals that guide the model's parameter optimization, and structurally represent the data in units of such signals.

A good PLM should mark the various kinds of signals during pre-training so that downstream tasks can efficiently access the data they need.

Just as when we store data in a database, we first structure it and put it into structured tables, so that we can later retrieve exactly the data we want through a structured query language such as SQL.
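A minimal sketch of this database analogy, using Python's built-in sqlite3; the schema and rows are invented for illustration:

```python
import sqlite3

# Store data in an explicit structure first ...
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signals (sentence TEXT, sentiment TEXT)")
conn.execute(
    "INSERT INTO signals VALUES (?, ?)",
    ("The movie was great.", "positive"),
)

# ... then access it precisely with a structured query language.
for row in conn.execute(
    "SELECT sentiment FROM signals WHERE sentence LIKE '%great%'"
):
    print(row)  # ('positive',)
```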

Moreover, we argue that valuable signals are rich and exist everywhere from the data in the world instead of simply existing in the supervised datasets that are manually curated

Valuable signals are abundant and exist everywhere in the world's data, rather than only in manually curated supervised datasets.

and what we need to do is to (a) identify them, (b) restructure them in a unified language, (c) integrate and store them into the pre-trained language model. We call this learning paradigm reStructured Pre-training.

What we need to do is:

  1. identify them
  2. restructure them in a unified language
  3. integrate and store them into the pre-trained language model

We call this learning paradigm reStructured Pre-training.
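A minimal sketch of steps 1 and 2, restructuring two heterogeneous signal types into one unified (input, output) text format; the templates are my own assumptions, not the paper's actual prompts:

```python
# Heterogeneous signals, each with its own native format.
signals = [
    {"type": "sentiment", "text": "Great movie!", "label": "positive"},
    {"type": "qa", "question": "Who wrote Hamlet?", "answer": "Shakespeare"},
]

def restructure(sig):
    """Rewrite a signal into a unified (input, output) text pair."""
    if sig["type"] == "sentiment":
        return (f"What is the sentiment of: {sig['text']}", sig["label"])
    if sig["type"] == "qa":
        return (f"Answer the question: {sig['question']}", sig["answer"])

# After restructuring, every signal shares one format and can be fed
# to a single text-to-text model for pre-training (step 3).
for inp, out in map(restructure, signals):
    print(f"INPUT: {inp} -> OUTPUT: {out}")
```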

A good PLM should have a clear picture of the composition of the various signals in the data to provide accurate information for downstream tasks according to their different needs.

A good PLM should have a clear understanding of the composition of the different kinds of signals in the data, so as to provide accurate information according to the different needs of downstream tasks.

2 reStructured Pre-training

2.1 Paradigm Shift in Modern NLP

[Figure: paradigm shift in modern NLP]

2.2 reStructured Pre-training

Unlike existing paradigms that mainly focus on model-centric design, we think more from the data perspective to maximize the utility of the already available data.

Think from the data perspective, focusing on maximizing the utility of the already available data.

Specifically, we take a data storing & accessing view where the pre-training stage is considered as a data storing process while downstream task training based on pre-trained models is regarded as data accessing process from pre-trained models, and claim that a good data storage mechanism should make the stored data more accessible.

We take a data storing and accessing view, in which the pre-training stage is regarded as a data storing process, while downstream task training is regarded as a process of accessing data from the pre-trained model.

A good data storage mechanism should make the stored data easier to access.

To achieve this goal, we look at data as an object that consists of diverse signals and argue that a good pre-trained model should (1) cover as many types of signals as possible and (2) provide precise access mechanisms for these signals when required by downstream tasks. i.e., a shift from pre-training over plain texts to pre-training over structured signals. In general, there are three steps within this new paradigm.

To achieve this goal, we view data as an object composed of diverse signals and argue that a good pre-trained model should:

  1. cover as many types of signals as possible
  2. provide precise access mechanisms for these signals when downstream tasks require them (i.e., a shift from pre-training over plain text to pre-training over structured signals)

Overall, this new paradigm consists of three steps:

  1. reStructure
  2. Pre-train
  3. Fine-tune

reStructure: since existing signals come in many different formats, they must be restructured into a unified format for model pre-training.

Pre-train: once all the training data has been restructured into a unified format, choose a pre-training architecture and train on the structured data.

Fine-tune: after pre-training, the model can be further fine-tuned with structured labeled data; another common scenario is to apply it directly to downstream tasks, usually via zero-shot prompting.
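The two usage modes after pre-training can be sketched as follows; `PretrainedModel` is a stand-in for illustration, no real checkpoint or API is assumed:

```python
class PretrainedModel:
    """Stand-in for a text-to-text model pre-trained on restructured signals."""

    def generate(self, prompt: str) -> str:
        return "positive"  # placeholder output for illustration

model = PretrainedModel()

# (a) Zero-shot prompting: phrase the downstream task in the same
# unified format the signals were stored in, then read the answer out.
print(model.generate("What is the sentiment of: Great movie!"))

# (b) Fine-tuning: continue training on restructured labeled pairs.
labeled_pairs = [("What is the sentiment of: Awful plot.", "negative")]
# for inp, out in labeled_pairs:
#     model.train_step(inp, out)  # hypothetical method, sketch only
```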

2.3 Evolutionary Process of Engineering Cycles

[Figure: evolutionary process of engineering cycles]

The core driving force behind machine-learning technology:

the iteration of technology always moves along the direction that system developers can design a better and more general system by doing fewer things.

Technology always iterates in the direction where system developers can design a better and more general system by doing fewer things.

2.4 Design Considerations

  1. Signal Definition

As the first step of restructured learning, we need to know which signals exist naturally in the world and are collectible and accessible.

  2. Data Mine Identification

A data mine is a collection of data containing many types of signals. Once Signal Definition is done, we start looking for suitable data mines.

  3. Signal Extraction

How to extract signals effectively from a data mine is also important.

  4. Signal Restructuring

This step concerns how to represent all types of signals in a unified format and narrow the gap between data storage and retrieval.

  5. Pre-training and Tuning

This step concerns which pre-training architecture to use so that all the structured data can be represented effectively.
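As a concrete illustration of steps 2-4, here is a toy extraction of one signal type from a miniature "data mine"; the pattern, source text, and template are all invented for this note:

```python
import re

# A miniature "data mine": raw text that implicitly contains
# (entity, birth year) signals.
mine = "Mozart (1756-1791) was a composer. Einstein (1879-1955) was a physicist."

# Signal extraction: pull the signals out with a simple pattern.
extracted = re.findall(r"(\w+) \((\d{4})-\d{4}\)", mine)

# Signal restructuring: rewrite each signal in the unified format.
restructured = [(f"When was {name} born?", year) for name, year in extracted]
print(restructured)
# [('When was Mozart born?', '1756'), ('When was Einstein born?', '1879')]
```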

3 reStructuring Engineering
