射大雕的迪西。

7.2 迁移学习

迁移学习

2.1 迁移学习理论

学习目标:
- 了解迁移学习中的有关概念.
- 掌握迁移学习的两种迁移方式.

2.1.1 迁移学习中的有关概念:
- 预训练模型
- 微调
- 微调脚本

2.1.2 预训练模型(Pretrained model):
- 一般情况下预训练模型都是大型模型，具备复杂的网络结构，众多的参数量，以及在足够大的数据集下进行训练而产生的模型. 在NLP领域，预训练模型往往是语言模型，因为语言模型的训练是无监督的，可以获得大规模语料，同时语言模型又是许多典型NLP任务的基础，如机器翻译，文本生成，阅读理解等，常见的预训练模型有BERT, GPT, roBERTa, transformer-XL等.

2.1.3 微调(Fine-tuning):
- 根据给定的预训练模型，改变它的部分参数或者为其新增部分输出结构后，通过在小部分数据集上训练，来使整个模型更好的适应特定任务.

2.1.3.1 微调脚本(Fine-tuning script):
- 实现微调过程的代码文件。这些脚本文件中，应包括对预训练模型的调用，对微调参数的选定以及对微调结构的更改等，同时，因为微调是一个训练过程，它同样需要一些超参数的设定，以及损失函数和优化器的选取等, 因此微调脚本往往也包含了整个迁移学习的过程.

关于微调脚本的说明:
- 一般情况下，微调脚本应该由不同的任务类型开发者自己编写，但是由于目前研究的NLP任务类型（分类，提取，生成）以及对应的微调输出结构都是有限的，有些微调方式已经在很多数据集上被验证是有效的，因此微调脚本也可以使用已经完成的规范脚本.

2.1.4 两种迁移方式:
- 直接使用预训练模型，进行相同任务的处理，不需要调整参数或模型结构，这些模型开箱即用。但是这种情况一般只适用于普适任务, 如：fasttest工具包中预训练的词向量模型。另外，很多预训练模型开发者为了达到开箱即用的效果，将模型结构分各个部分保存为不同的预训练模型，提供对应的加载方法来完成特定目标.
- 更加主流的迁移学习方式是发挥预训练模型特征抽象的能力，然后再通过微调的方式，通过训练更新小部分参数以此来适应不同的任务。这种迁移方式需要提供小部分的标注数据来进行监督学习.

关于迁移方式的说明:
- 直接使用预训练模型的方式, 已经在fasttext的词向量迁移中学习. 接下来的迁移学习实践将主要讲解通过微调的方式进行迁移学习.

2.2 NLP中的标准数据集

学习目标:
- 了解NLP中GLUE标准数据集合的相关知识.
- 掌握GLUE标准数据集合的下载方式, 数据样式及其对应的任务类型.

2.2.1 GLUE数据集合的介绍:
- GLUE由纽约大学, 华盛顿大学, Google联合推出, 涵盖不同NLP任务类型, 截止至2020年1月其中包括11个子任务数据集, 成为衡量NLP研究发展的衡量标准.

2.2.1.1 GLUE数据集合包含以下数据集:
- CoLA 数据集
- SST-2 数据集
- MRPC 数据集
- STS-B 数据集
- QQP 数据集
- MNLI 数据集
- SNLI 数据集
- QNLI 数据集
- RTE 数据集
- WNLI 数据集
- diagnostics数据集(官方未完善)

GLUE数据集合的下载方式:

下载脚本代码:

''' Script for downloading all GLUE data.'''


import os
import sys
import shutil
import argparse
import tempfile
import urllib.request
import zipfile


TASKS = ["CoLA", "SST", "MRPC", "QQP", "STS", "MNLI", "SNLI", "QNLI", "RTE", "WNLI", "diagnostic"]
TASK2PATH = {"CoLA":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FCoLA.zip?alt=media&token=46d5e637-3411-4188-bc44-5809b5bfb5f4',
             "SST":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSST-2.zip?alt=media&token=aabc5f6b-e466-44a2-b9b4-cf6337f84ac8',
             "MRPC":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2Fmrpc_dev_ids.tsv?alt=media&token=ec5c0836-31d5-48f4-b431-7480817f1adc',
             "QQP":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FQQP.zip?alt=media&token=700c6acf-160d-4d89-81d1-de4191d02cb5',
             "STS":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSTS-B.zip?alt=media&token=bddb94a7-8706-4e0d-a694-1109e12273b5',
             "MNLI":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FMNLI.zip?alt=media&token=50329ea1-e339-40e2-809c-10c40afff3ce',
             "SNLI":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSNLI.zip?alt=media&token=4afcfbb2-ff0c-4b2d-a09a-dbf07926f4df',
             "QNLI": 'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FQNLIv2.zip?alt=media&token=6fdcf570-0fc5-4631-8456-9505272d1601',
             "RTE":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FRTE.zip?alt=media&token=5efa7e85-a0bb-4f19-8ea2-9e1840f077fb',
             "WNLI":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FWNLI.zip?alt=media&token=068ad0a0-ded7-4bd7-99a5-5e00222e0faf',
             "diagnostic":'https://storage.googleapis.com/mtl-sentence-representations.appspot.com/tsvsWithoutLabels%2FAX.tsv?GoogleAccessId=firebase-adminsdk-0khhl@mtl-sentence-representations.iam.gserviceaccount.com&Expires=2498860800&Signature=DuQ2CSPt2Yfre0C%2BiISrVYrIFaZH1Lc7hBVZDD4ZyR7fZYOMNOUGpi8QxBmTNOrNPjR3z1cggo7WXFfrgECP6FBJSsURv8Ybrue8Ypt%2FTPxbuJ0Xc2FhDi%2BarnecCBFO77RSbfuz%2Bs95hRrYhTnByqu3U%2FYZPaj3tZt5QdfpH2IUROY8LiBXoXS46LE%2FgOQc%2FKN%2BA9SoscRDYsnxHfG0IjXGwHN%2Bf88q6hOmAxeNPx6moDulUF6XMUAaXCSFU%2BnRO2RDL9CapWxj%2BDl7syNyHhB7987hZ80B%2FwFkQ3MEs8auvt5XW1%2Bd4aCU7ytgM69r8JDCwibfhZxpaa4gd50QXQ%3D%3D'}


MRPC_TRAIN = 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt'
MRPC_TEST = 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt'


def download_and_extract(task, data_dir):
    print("Downloading and extracting %s..." % task)
    data_file = "%s.zip" % task
    urllib.request.urlretrieve(TASK2PATH[task], data_file)
    with zipfile.ZipFile(data_file) as zip_ref:
        zip_ref.extractall(data_dir)
    os.remove(data_file)
    print("\tCompleted!")


def format_mrpc(data_dir, path_to_data):
    print("Processing MRPC...")
    mrpc_dir = os.path.join(data_dir, "MRPC")
    if not os.path.isdir(mrpc_dir):
        os.mkdir(mrpc_dir)
    if path_to_data:
        mrpc_train_file = os.path.join(path_to_data, "msr_paraphrase_train.txt")
        mrpc_test_file = os.path.join(path_to_data, "msr_paraphrase_test.txt")
    else:
        print("Local MRPC data not specified, downloading data from %s" % MRPC_TRAIN)
        mrpc_train_file = os.path.join(mrpc_dir, "msr_paraphrase_train.txt")
        mrpc_test_file = os.path.join(mrpc_dir, "msr_paraphrase_test.txt")
        urllib.request.urlretrieve(MRPC_TRAIN, mrpc_train_file)
        urllib.request.urlretrieve(MRPC_TEST, mrpc_test_file)
    assert os.path.isfile(mrpc_train_file), "Train data not found at %s" % mrpc_train_file
    assert os.path.isfile(mrpc_test_file), "Test data not found at %s" % mrpc_test_file
    urllib.request.urlretrieve(TASK2PATH["MRPC"], os.path.join(mrpc_dir, "dev_ids.tsv"))


    dev_ids = []
    with open(os.path.join(mrpc_dir, "dev_ids.tsv"), encoding="utf8") as ids_fh:
        for row in ids_fh:
            dev_ids.append(row.strip().split('\t'))


    with open(mrpc_train_file, encoding="utf8") as data_fh, \
         open(os.path.join(mrpc_dir, "train.tsv"), 'w', encoding="utf8") as train_fh, \
         open(os.path.join(mrpc_dir, "dev.tsv"), 'w', encoding="utf8") as dev_fh:
        header = data_fh.readline()
        train_fh.write(header)
        dev_fh.write(header)
        for row in data_fh:
            label, id1, id2, s1, s2 = row.strip().split('\t')
            if [id1, id2] in dev_ids:
                dev_fh.write("%s\t%s\t%s\t%s\t%s\n" % (label, id1, id2, s1, s2))
            else:
                train_fh.write("%s\t%s\t%s\t%s\t%s\n" % (label, id1, id2, s1, s2))


    with open(mrpc_test_file, encoding="utf8") as data_fh, \
            open(os.path.join(mrpc_dir, "test.tsv"), 'w', encoding="utf8") as test_fh:
        header = data_fh.readline()
        test_fh.write("index\t#1 ID\t#2 ID\t#1 String\t#2 String\n")
        for idx, row in enumerate(data_fh):
            label, id1, id2, s1, s2 = row.strip().split('\t')
            test_fh.write("%d\t%s\t%s\t%s\t%s\n" % (idx, id1, id2, s1, s2))
    print("\tCompleted!")


def download_diagnostic(data_dir):
    print("Downloading and extracting diagnostic...")
    if not os.path.isdir(os.path.join(data_dir, "diagnostic")):
        os.mkdir(os.path.join(data_dir, "diagnostic"))
    data_file = os.path.join(data_dir, "diagnostic", "diagnostic.tsv")
    urllib.request.urlretrieve(TASK2PATH["diagnostic"], data_file)
    print("\tCompleted!")
    return


def get_tasks(task_names):
    task_names = task_names.split(',')
    if "all" in task_names:
        tasks = TASKS
    else:
        tasks = []
        for task_name in task_names:
            assert task_name in TASKS, "Task %s not found!" % task_name
            tasks.append(task_name)
    return tasks


def main(arguments):
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_dir', help='directory to save data to', type=str, default='glue_data')
    parser.add_argument('--tasks', help='tasks to download data for as a comma separated string',
                        type=str, default='all')
    parser.add_argument('--path_to_mrpc', help='path to directory containing extracted MRPC data, msr_paraphrase_train.txt and msr_paraphrase_text.txt',
                        type=str, default='')
    args = parser.parse_args(arguments)


    if not os.path.isdir(args.data_dir):
        os.mkdir(args.data_dir)
    tasks = get_tasks(args.tasks)


    for task in tasks:
        if task == 'MRPC':
            format_mrpc(args.data_dir, args.path_to_mrpc)
        elif task == 'diagnostic':
            download_diagnostic(args.data_dir)
        else:
            download_and_extract(task, args.data_dir)




if __name__ == '__main__':
    sys.exit(main(sys.argv[1:]))

运行脚本下载所有数据集:

# 假设你已经将以上代码copy到download_glue_data.py文件中

# 运行这个python脚本, 你将同目录下得到一个glue文件夹

python download_glue_data.py

输出效果:

Downloading and extracting CoLA...

Completed!

Downloading and extracting SST...

Completed!

Processing MRPC...

Local MRPC data not specified, downloading data from https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt

Completed!

Downloading and extracting QQP...

Completed!

Downloading and extracting STS...

Completed!

Downloading and extracting MNLI...

Completed!

Downloading and extracting SNLI...

Completed!

Downloading and extracting QNLI...

Completed!

Downloading and extracting RTE...

Completed!

Downloading and extracting WNLI...

Completed!

Downloading and extracting diagnostic...

Completed!

2.2.2 GLUE数据集合中子数据集的样式及其任务类型:

CoLA数据集文件样式:

- CoLA/

- dev.tsv

- original/

- test.tsv

- train.tsv

文件样式说明:

在使用中常用到的文件是train.tsv, dev.tsv, test.tsv, 分别代表训练集, 验证集和测试集. 其中train.tsv与dev.tsv数据样式相同, 都是带有标签的数据, 其中test.tsv是不带有标签的数据.

train.tsv数据样式:

...

gj04 1 She coughed herself awake as the leaf landed on her nose.

gj04 1 The worm wriggled onto the carpet.

gj04 1 The chocolate melted onto the carpet.

gj04 0 * The ball wriggled itself loose.

gj04 1 Bill wriggled himself loose.

bc01 1 The sinking of the ship to collect the insurance was very devious.

bc01 1 The ship's sinking was very devious.

bc01 0 * The ship's sinking to collect the insurance was very devious.

bc01 1 The testing of such drugs on oneself is too risky.

bc01 0 * This drug's testing on oneself is too risky.

...

train.tsv数据样式说明:

train.tsv中的数据内容共分为4列, 第一列数据, 如gj04, bc01等代表每条文本数据的来源即出版物代号; 第二列数据, 0或1, 代表每条文本数据的语法是否正确, 0代表不正确, 1代表正确; 第三列数据, '', 是作者最初的正负样本标记, 与第二列意义相同, ''表示不正确; 第四列即是被标注的语法使用是否正确的文本句子.

test.tsv数据样式:

index sentence

0 Bill whistled past the house.

1 The car honked its way down the road.

2 Bill pushed Harry off the sofa.

3 the kittens yawned awake and played.

4 I demand that the more John eats, the more he pay.

5 If John eats more, keep your mouth shut tighter, OK?

6 His expectations are always lower than mine are.

7 The sooner you call, the more carefully I will word the letter.

8 The more timid he feels, the more people he interviews without asking questions of.

9 Once Janet left, Fred became a lot crazier.

...

test.tsv数据样式说明:

test.tsv中的数据内容共分为2列, 第一列数据代表每条文本数据的索引; 第二列数据代表用于测试的句子.

CoLA数据集的任务类型:

二分类任务

评估指标为: MMC(马修斯相关系数, 在正负样本分布十分不均衡的情况下使用的二分类评估指标)

SST-2数据集文件样式:

- SST-2/

- dev.tsv

- original/

- test.tsv

- train.tsv

文件样式说明:

在使用中常用到的文件是train.tsv, dev.tsv, test.tsv, 分别代表训练集, 验证集和测试集. 其中train.tsv与dev.tsv数据样式相同, 都是带有标签的数据, 其中test.tsv是不带有标签的数据.

train.tsv数据样式:

sentence label

hide new secretions from the parental units 0

contains no wit , only labored gags 0

that loves its characters and communicates something rather beautiful about human nature 1

remains utterly satisfied to remain the same throughout 0

on the worst revenge-of-the-nerds clichés the filmmakers could dredge up 0

that 's far too tragic to merit such superficial treatment 0

demonstrates that the director of such hollywood blockbusters as patriot games can still turn out a small , personal film with an emotional wallop . 1

of saucy 1

a depressed fifteen-year-old 's suicidal poetry 0

...

train.tsv数据样式说明:

train.tsv中的数据内容共分为2列, 第一列数据代表具有感情色彩的评论文本; 第二列数据, 0或1, 代表每条文本数据是积极或者消极的评论, 0代表消极, 1代表积极.

test.tsv数据样式:

index sentence

0 uneasy mishmash of styles and genres .

1 this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation .

2 by the end of no such thing the audience , like beatrice , has a watchful affection for the monster .

3 director rob marshall went out gunning to make a great one .

4 lathan and diggs have considerable personal charm , and their screen rapport makes the old story seem new .

5 a well-made and often lovely depiction of the mysteries of friendship .

6 none of this violates the letter of behan 's book , but missing is its spirit , its ribald , full-throated humor .

7 although it bangs a very cliched drum at times , this crowd-pleaser 's fresh dialogue , energetic music , and good-natured spunk are often infectious .

8 it is not a mass-market entertainment but an uncompromising attempt by one artist to think about another .

9 this is junk food cinema at its greasiest .

...

test.tsv数据样式说明: * test.tsv中的数据内容共分为2列, 第一列数据代表每条文本数据的索引; 第二列数据代表用于测试的句子.

2.2.3 SST-2数据集的任务类型:

二分类任务

评估指标为: ACC

MRPC数据集文件样式:

- MRPC/

- dev.tsv

- test.tsv

- train.tsv

- dev_ids.tsv

- msr_paraphrase_test.txt

- msr_paraphrase_train.txt

文件样式说明:

在使用中常用到的文件是train.tsv, dev.tsv, test.tsv, 分别代表训练集, 验证集和测试集. 其中train.tsv与dev.tsv数据样式相同, 都是带有标签的数据, 其中test.tsv是不带有标签的数据.

train.tsv数据样式:

Quality #1 ID #2 ID #1 String #2 String

1 702876 702977 Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence . Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .

0 2108705 2108831 Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion . Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .

1 1330381 1330521 They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added . On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale .

0 3344667 3344648 Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 . Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at A $ 4.57 .

1 1236820 1236712 The stock rose $ 2.11 , or about 11 percent , to close Friday at $ 21.51 on the New York Stock Exchange . PG & E Corp. shares jumped $ 1.63 or 8 percent to $ 21.03 on the New York Stock Exchange on Friday .

1 738533 737951 Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier . With the scandal hanging over Stewart 's company , revenue the first quarter of the year dropped 15 percent from the same period a year earlier .

0 264589 264502 The Nasdaq had a weekly gain of 17.27 , or 1.2 percent , closing at 1,520.15 on Friday . The tech-laced Nasdaq Composite .IXIC rallied 30.46 points , or 2.04 percent , to 1,520.15 .

1 579975 579810 The DVD-CCA then appealed to the state Supreme Court . The DVD CCA appealed that decision to the U.S. Supreme Court .

...

train.tsv数据样式说明:

train.tsv中的数据内容共分为5列, 第一列数据, 0或1, 代表每对句子是否具有相同的含义, 0代表含义不相同, 1代表含义相同. 第二列和第三列分别代表每对句子的id, 第四列和第五列分别具有相同/不同含义的句子对.