Notes on "reStructured Pre-training"




In such a paradigm, the role of data will be re-emphasized, and model
pre-training and fine-tuning of downstream tasks are viewed as a process of data storing and accessing.

A good storage mechanism should not only have the ability to cache a large amount of data but also consider the ease of access.
We achieve this by pre-training models over restructured data that consist of a variety of valuable information instead of raw data after overcoming several engineering challenges.

Hypothesis of NLP technique evolution



1 Introduction

We argue that the ultimate goal of data storage is to better serve human life, and how data is accessed is as important as how it is stored. However, there are often differences in the way that data is stored and accessed.


Although prompting methods have narrowed the difference between data storage and access, it does not fundamentally eliminate the gap, as the way models store data in the pre-training stage is not transparent to diverse downstream tasks.

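In other words, prompting methods narrow the difference between storage and access but do not fundamentally close the gap between them, because the way a model stores data during pre-training is opaque to the various downstream tasks.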


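For example, in a sentiment classification task, to predict the sentiment of a sentence with the help of a pre-trained model, we must pick a question format the model is familiar with. The system designer, however, does not know which question format the model prefers, because the distribution or structure of the pre-training data is not interpretable. A figure accompanying this passage illustrates the example.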
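To make the sentiment example concrete, here is a minimal sketch of the problem: the same sentence can be wrapped in several equally plausible question formats, and nothing visible to the designer says which one the pre-trained model has seen most often. The templates and the `build_prompts` helper are hypothetical illustrations and are not taken from the paper.

```python
from typing import List

# Hypothetical prompt templates for sentiment classification.
# None of these come from the paper; they only show that many phrasings
# are possible and the designer cannot tell which one the model "prefers".
TEMPLATES = [
    "Review: {text}\nIs this review positive or negative?",
    "{text}\nOverall, it was [MASK].",
    "What is the sentiment of the following sentence? {text}",
]


def build_prompts(text: str) -> List[str]:
    """Wrap one input sentence in every candidate template."""
    return [template.format(text=text) for template in TEMPLATES]


if __name__ == "__main__":
    for prompt in build_prompts("The plot was thin but the acting was superb."):
        print(prompt)
        print("---")
```

In practice the designer would have to try each format against the model and compare results, which is exactly the trial-and-error that an opaque storage format forces.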


Methodologically, we present a new way to look at data that contains various types of information, which could be regarded as pre-training signals that can instruct models for parameter optimization. We structurally represent data in the unit of signals and claim that a good PLM should mark various signals during pre-training in a way that expected information could be accessed efficiently by downstream tasks.




Moreover, we argue that valuable signals are rich and exist everywhere from the data in the world instead of simply existing in the supervised datasets that are manually curated, and what we need to do is to (a) identify them, (b) restructure them in a unified language, (c) integrate and store them into the pre-trained language model. We call this learning paradigm reStructured Pre-training.


  1. Identify them
  2. Restructure them in a unified language
  3. Integrate and store them in the pre-trained model
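As a rough illustration of step 2, the sketch below turns heterogeneous signals into plain (input text, output text) pairs so that one text-to-text model could consume all of them. The `Signal` schema, the `PROMPTS` templates, and the `restructure` function are assumptions made for this example; the paper's actual templates may differ.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Signal:
    """One training signal found in raw data (hypothetical schema)."""
    kind: str    # e.g. "sentiment" or "summary"
    source: str  # the text the signal was observed on
    target: str  # the value the signal provides


# Hypothetical templates: one natural-language pattern per signal kind.
PROMPTS = {
    "sentiment": "TEXT: {source}\nQUERY: What is the sentiment of the text?",
    "summary": "TEXT: {source}\nQUERY: Summarize the text in one sentence.",
}


def restructure(signal: Signal) -> Tuple[str, str]:
    """Turn a signal into a (model input, model output) text pair."""
    model_input = PROMPTS[signal.kind].format(source=signal.source)
    return model_input, signal.target


if __name__ == "__main__":
    pair = restructure(Signal("sentiment", "I loved it.", "positive"))
    print(pair[0])
    print("->", pair[1])
```

Because every signal kind ends up in the same textual form, the query wording used at storage time is known and can be reused when a downstream task needs that signal.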


A good PLM should have a clear picture of the composition of the various signals in the data to provide accurate information for downstream tasks according to their different needs.


2 reStructured Pre-training

2.1 Paradigm Shift in Modern NLP


2.2 reStructured Pre-training

Unlike existing paradigms that mainly focus on model-centric design, we think more from the data perspective to maximize the utility of the already available data.


Specifically, we take a data storing & accessing view where the pre-training stage is considered as a data storing process while downstream task training based on pre-trained models is regarded as data accessing process from pre-trained models, and claim that a good data storage mechanism should make the stored data more accessible.



To achieve this goal, we look at data as an object that consists of diverse signals and argue that a good pre-trained model should (1) cover as many types of signals as possible and (2) provide precise access mechanisms for these signals when required by downstream tasks, i.e., a shift from pre-training over plain texts to pre-training over structured signals. In general, there are three steps within this new paradigm.


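  1. Cover as many types of signals as possible
  2. Provide precise access mechanisms for the signals that downstream tasks require (i.e., shift from pre-training over plain text to pre-training over structured signals)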


  1. reStructure
  2. Pre-train
  3. Fine-tune



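Fine-tune: Once pre-training is complete, the model can be further fine-tuned on structured labeled data; another common scenario is to apply it directly to downstream tasks, typically via zero-shot prompting.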

2.3 Evolutionary Process of Engineering Cycles



The iteration of technology always moves along the direction that system developers can design a better and more general system by doing fewer things.


2.4 Design Considerations

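  1. Signal Definition

As the first step of restructured learning, we need to know which signals naturally exist in the world and can be collected and accessed.

  2. Data Mine Identification

A data mine is a collection of data that contains multiple types of signals. Once signal definition is complete, the search for suitable data mines begins.

  3. Signal Extraction

How to extract signals from a data mine efficiently also matters.

  4. Signal Restructuring


  5. Pre-training and Tuning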
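The first three considerations above can be sketched on toy data: define which signals to look for, pick a data mine that contains them, and run extractors over it. Everything in this sketch is an assumption made for illustration, including the review-style record layout, the star-rating rule, and the function names; it is not the paper's implementation.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A toy "data mine": records that happen to carry several signals at once.
# The title / body / stars layout is hypothetical.
DATA_MINE = [
    {"title": "Solid budget laptop", "body": "Fast enough for office work.", "stars": 4},
    {"title": "Broke in a week", "body": "The hinge snapped after six days.", "stars": 1},
    {"title": "It is a laptop", "body": "Does what it says.", "stars": 3},
]

Pair = Tuple[str, str]  # (input text, output text)


def sentiment_signal(record: dict) -> Optional[Pair]:
    """Map the star rating to a sentiment label; skip ambiguous 3-star reviews."""
    if record["stars"] == 3:
        return None
    label = "positive" if record["stars"] > 3 else "negative"
    return record["body"], label


def title_signal(record: dict) -> Optional[Pair]:
    """Treat the review title as a naturally occurring short summary."""
    return record["body"], record["title"]


# Signal definition: signal name -> extractor that knows where it lives.
SIGNAL_DEFINITIONS: Dict[str, Callable[[dict], Optional[Pair]]] = {
    "sentiment": sentiment_signal,
    "title": title_signal,
}


def extract_signals(records: List[dict]) -> List[Tuple[str, str, str]]:
    """Run every extractor over every record, yielding (name, input, output)."""
    extracted = []
    for record in records:
        for name, extractor in SIGNAL_DEFINITIONS.items():
            result = extractor(record)
            if result is not None:
                extracted.append((name, *result))
    return extracted


if __name__ == "__main__":
    for name, text, value in extract_signals(DATA_MINE):
        print(f"{name}: {text!r} -> {value!r}")
```

Keeping the extractors in a registry means adding a new signal type only requires a new entry, which matches the goal of covering as many signal types as possible.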


3 reStructuring Engineering
