[Paper Reading: T5] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

文章目录

    • Foreword
    • Intro
    • Setting
      • Model
      • The Colossal Clean Crawled Corpus
      • Downstream Tasks
      • Input and Output Format
    • Experiments
      • Baselines
        • Model
        • Training
        • Unsupervised Objective
        • Baseline Performance
    • Architecture
      • Model Structure
      • Comparing Different Model Structures
      • Objectives
      • Results

Foreword

  • The paper is famous; we usually just call it T5
  • The paper is quite long, so I'm still reading it
  • This paper reads like a survey: many previously proposed approaches are compared in it, which is very helpful for connecting some of the knowledge we have learned before
  • Original paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Intro

Basic idea: treat every text processing problem as a text-to-text problem

  • taking text as input and producing new text as output
  • The text-to-text framework allows us to directly apply the same model, objective, training procedure, and decoding process to every task we consider

Goal: not to propose new methods but instead to provide a comprehensive perspective on where the field stands

Data: In order to perform experiments at this scale (11B parameters), we introduce the “Colossal Clean Crawled Corpus” (C4)

An interesting and straightforward conclusion:

  • in transfer learning for computer vision, pre-training is typically done via supervised learning on a large labeled data set like ImageNet
  • In contrast, modern techniques for transfer learning in NLP often pre-train using unsupervised learning on unlabeled data

Setting

Model

A Transformer-based encoder-decoder model: it closely follows the originally proposed Transformer architecture, with minor modifications (e.g. relative position embeddings and a simplified layer normalization)

The Colossal Clean Crawled Corpus

leverage Common Crawl as a source of text scraped from the web

  • Common Crawl is a publicly-available web archive that provides “web extracted text” by removing markup and other non-text content from the scraped HTML files
  • it produces about 20 TB of scraped text data each month
  • however, the majority of the resulting text is not natural language
  • to address this, the authors clean up Common Crawl's web extracted text with a set of heuristic filters (e.g. keeping only lines that end in terminal punctuation, dropping short or boilerplate lines, deduplicating, and keeping only English pages)

They download the web extracted text from April 2019 and apply the aforementioned filtering

  • this produces a collection of text that is not only orders of magnitude larger than most data sets used for pre-training (about 750 GB) but also comprises reasonably clean and natural English text
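A minimal sketch of what a few of these filters might look like, assuming the heuristics described in the paper (terminal punctuation, minimum line/page length, dropping boilerplate markers); the real pipeline has more rules and the exact thresholds may differ:

```python
import re
from typing import Optional

def clean_page(text: str, min_words: int = 3, min_sentences: int = 5) -> Optional[str]:
    """Apply a few C4-style heuristics to one web-extracted page (illustrative only)."""
    kept_lines = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only lines ending in terminal punctuation (period, !, ?, end quote).
        if not line.endswith((".", "!", "?", "\"")):
            continue
        # Drop very short lines and obvious boilerplate.
        if len(line.split()) < min_words or "javascript" in line.lower():
            continue
        kept_lines.append(line)
    page = "\n".join(kept_lines)
    # Drop pages containing code markers or placeholder text.
    if "{" in page or "lorem ipsum" in page.lower():
        return None
    # Require a minimum number of sentences per page.
    if len(re.findall(r"[.!?]", page)) < min_sentences:
        return None
    return page
```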

Downstream Tasks

Input and Output Format

(Image 1)

The model is trained with a maximum likelihood objective regardless of the task

To specify which task the model should perform, we add a task-specific (text) prefix to the original input sequence before feeding it to the model.

  • we convert all tasks into the same format so that they can be cast as text-to-text problems

Note: The choice of text prefix used for a given task is essentially a hyperparameter
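As a concrete illustration, casting a task to text-to-text is just string manipulation before tokenization. The prefixes and target strings below follow the examples shown in the paper (translation, CoLA, STS-B, summarization); the helper function itself is hypothetical:

```python
def to_text_to_text(task: str, example: dict) -> tuple:
    """Cast one example into an (input_text, target_text) pair."""
    if task == "translate_en_de":
        return (f"translate English to German: {example['en']}", example["de"])
    if task == "cola":
        # Classification targets are literal label words.
        return (f"cola sentence: {example['sentence']}",
                "acceptable" if example["label"] else "not acceptable")
    if task == "stsb":
        # Regression is cast to text by rounding the score to the nearest 0.2.
        return (f"stsb sentence1: {example['s1']} sentence2: {example['s2']}",
                f"{round(example['score'] * 5) / 5:.1f}")
    if task == "summarize":
        return (f"summarize: {example['article']}", example["summary"])
    raise ValueError(f"unknown task: {task}")

# e.g. to_text_to_text("translate_en_de", {"en": "That is good.", "de": "Das ist gut."})
# -> ("translate English to German: That is good.", "Das ist gut.")
```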

Experiments

Baselines

Pre-train a standard Transformer (described in Section 2.1) using a simple denoising objective and then separately fine-tune on each of our downstream tasks

Model

A standard encoder-decoder Transformer

  • the encoder and decoder are each similar in size and configuration to a $BERT_{BASE}$ stack
  • this results in a model with about 220 million parameters (roughly twice the number of parameters of $BERT_{BASE}$)

Training

Train: Use standard maximum likelihood (teacher forcing + cross-entropy loss)

Test: greedy decoding (choosing the highest-probability logit at every timestep)
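A minimal sketch of greedy decoding, assuming a generic encoder-decoder call `model(input_ids, decoder_input_ids)` that returns next-token logits; the function and argument names are placeholders, not the paper's actual code:

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids, start_id, eos_id, max_len=64):
    """Pick the highest-probability token at every timestep until EOS or max_len."""
    output_ids = [start_id]
    for _ in range(max_len):
        logits = model(input_ids, torch.tensor([output_ids]))  # (1, t, vocab_size)
        next_id = int(logits[0, -1].argmax())                   # greedy choice at the last step
        output_ids.append(next_id)
        if next_id == eos_id:
            break
    return output_ids[1:]  # drop the start token
```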

Pre-train data: C4

  • steps: $2^{19}$
  • maximum seq length: 512
  • batch size: 128 seqs
  • total: $2^{35}$ tokens (≈34B), just a fraction of the entire C4 (for comparison: BERT was pre-trained on roughly 137B tokens, RoBERTa on roughly 2.2T)
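The total token count follows directly from this schedule:

$$
2^{19}\ \text{steps} \times 128\ \tfrac{\text{seqs}}{\text{step}} \times 512\ \tfrac{\text{tokens}}{\text{seq}} = 2^{19} \cdot 2^{7} \cdot 2^{9} = 2^{35} \approx 3.4 \times 10^{10}\ \text{tokens}
$$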

Fine-tune:

  • steps: $2^{18}$
  • seq len & batch size: as before

Unsupervised Objective

Use a denoising objective (it produces better performance): predict missing or otherwise corrupted tokens in the input

  • randomly sample and then drop out 15% of the tokens in the input sequence; each consecutive span of dropped-out tokens is replaced by a single sentinel token (a minimal sketch follows the figure below)

(Image 2)
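A minimal sketch of this corruption scheme, assuming word-level tokens and the `<extra_id_N>` sentinel naming used by common T5 implementations (illustrative, not the paper's actual code):

```python
import random

def corrupt(tokens, rate=0.15, seed=0):
    """Drop ~15% of tokens; each run of consecutive dropped tokens becomes one sentinel."""
    rng = random.Random(seed)
    dropped = [rng.random() < rate for _ in tokens]
    inputs, targets, sentinel_id, prev = [], [], 0, False
    for tok, is_dropped in zip(tokens, dropped):
        if is_dropped:
            if not prev:  # start of a new dropped span: emit a fresh sentinel in both streams
                inputs.append(f"<extra_id_{sentinel_id}>")
                targets.append(f"<extra_id_{sentinel_id}>")
                sentinel_id += 1
            targets.append(tok)
        else:
            inputs.append(tok)
        prev = is_dropped
    targets.append(f"<extra_id_{sentinel_id}>")  # final sentinel marks the end of the targets
    return inputs, targets

# For the paper's example sentence, one possible outcome is:
#   inputs : Thank you <extra_id_0> me to your party <extra_id_1> week .
#   targets: <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
corrupt("Thank you for inviting me to your party last week .".split())
```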

Baseline Performance

(Image 3)

  • the baseline model is trained 10 times from scratch (i.e. with different random initializations and data set shuffling) to get a sense of the inter-run variance
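Summarizing the repeated runs is then just a matter of aggregating per-run scores; a trivial sketch with hypothetical numbers (the baseline tables report an average alongside a standard deviation across runs):

```python
import statistics

def summarize_runs(scores):
    """Summarize repeated training runs by their mean and standard deviation."""
    return statistics.mean(scores), statistics.stdev(scores)

# hypothetical GLUE averages from 10 independent baseline runs
mean, std = summarize_runs([83.1, 83.4, 82.9, 83.2, 83.0, 83.3, 83.1, 83.5, 82.8, 83.2])
print(f"{mean:.2f} ± {std:.2f}")
```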

Architecture

Model Structure

(Image 4)

(Image 5)

Comparing Different Model Structures

An encoder-decoder model with L layers in the encoder and L layers in the decoder has approximately the same number of parameters as a language model with 2L layers

  • However, it has approximately the same computational cost as a language model with only L layers

  • This is because a language model's layers must be applied to both the input and the output sequence (i.e. to their concatenation), while the encoder is only applied to the input sequence and the decoder is only applied to the output sequence
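Concretely, with input length $n_{in}$ and target length $n_{out}$, the work per forward pass scales roughly with (number of layers) × (tokens those layers process):

$$
\underbrace{L \cdot n_{in}}_{\text{encoder}} + \underbrace{L \cdot n_{out}}_{\text{decoder}} = L\,(n_{in} + n_{out})
\qquad\text{vs.}\qquad
2L\,(n_{in} + n_{out})\ \text{for a 2L-layer language model}
$$

So the L+L encoder-decoder matches the parameter count of the 2L-layer language model but roughly the compute of an L-layer one.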

Objectives

consider both a basic language modeling objective and the baseline denoising objective

  • we sample a span of text from our unlabeled data set and choose a random point to split it into prefix and target portions
    • For the standard language model, we train the model to predict the entire span from beginning to end
    • For the denoising objective (adapted to a language model): concatenate the inputs and targets (see the example below)
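For example, reusing the corrupted sentence from the baseline objective above, the two setups for a language-model-style architecture look roughly like this (sentinel naming illustrative; whether the loss also covers the prefix is a detail of each variant):

```python
# Standard LM objective: predict the sampled span from beginning to end.
lm_example = "Thank you for inviting me to your party last week ."

# Denoising objective adapted to an LM: the corrupted inputs and the targets are
# concatenated into a single sequence; the model reads the corrupted part as
# context and predicts the target part.
lm_denoising_example = (
    "Thank you <extra_id_0> me to your party <extra_id_1> week . "
    "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"
)
```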

Results

(Image 6)

  • For all tasks, the encoder-decoder architecture with the denoising objective performed best.
  • sharing parameters across the encoder and decoder performed nearly as well
    • We also note that the shared parameter encoder-decoder outperforms the decoder-only prefix LM
  • using a denoising objective always results in better downstream task performance compared to a language modeling objective
