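Chapter 2: Transformer Architecture Analysis
2.1 Understanding the Transformer Architecture
Learning objectives
Understand what the Transformer model is used for.
Learn the names of the components in the overall Transformer architecture diagram.
What the Transformer model is used for
The Transformer model, built on the seq2seq architecture, can handle typical NLP tasks such as machine translation and text generation. It can also be used to build pretrained language models for transfer learning across different tasks.
Note:
In the architecture analysis that follows, we assume the Transformer is used to translate text from one language into another, so many names follow NLP conventions. For example, the Embedding layer is called the text embedding layer, the tensor it produces is called the word embedding tensor, and its last dimension is called the word vector.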
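Overall Transformer architecture diagram
[Figure: overall Transformer architecture]
The overall Transformer architecture can be divided into four parts:
Input part
Output part
Encoder part
Decoder part
The input part contains:
The source text embedding layer and its positional encoder
The target text embedding layer and its positional encoder
[Figure: input part]
The output part contains:
A linear layer
A softmax layer
[Figure: output part]
The encoder part:
Consists of N stacked encoder layers
Each encoder layer consists of two sublayer connections
The first sublayer connection contains a multi-head self-attention sublayer, a normalization layer, and a residual connection
The second sublayer connection contains a feed-forward fully connected sublayer, a normalization layer, and a residual connection
[Figure: encoder part]
The decoder part:
Consists of N stacked decoder layers
Each decoder layer consists of three sublayer connections
The first sublayer connection contains a multi-head self-attention sublayer, a normalization layer, and a residual connection
The second sublayer connection contains a multi-head attention sublayer, a normalization layer, and a residual connection
The third sublayer connection contains a feed-forward fully connected sublayer, a normalization layer, and a residual connection
[Figure: decoder part]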
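Section summary
We learned what the Transformer model is used for:

The Transformer model, built on the seq2seq architecture, can handle typical NLP tasks such as machine translation and text generation. It can also be used to build pretrained language models for transfer learning across different tasks.
The overall Transformer architecture can be divided into four parts:

Input part
Output part
Encoder part
Decoder part
The input part contains:

The source text embedding layer and its positional encoder
The target text embedding layer and its positional encoder
The output part contains:

A linear layer
A softmax layer
The encoder part:

Consists of N stacked encoder layers
Each encoder layer consists of two sublayer connections
The first sublayer connection contains a multi-head self-attention sublayer, a normalization layer, and a residual connection
The second sublayer connection contains a feed-forward fully connected sublayer, a normalization layer, and a residual connection
The decoder part:

Consists of N stacked decoder layers
Each decoder layer consists of three sublayer connections
The first sublayer connection contains a multi-head self-attention sublayer, a normalization layer, and a residual connection
The second sublayer connection contains a multi-head attention sublayer, a normalization layer, and a residual connection
The third sublayer connection contains a feed-forward fully connected sublayer, a normalization layer, and a residual connection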
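2.2 Implementing the Input Part
Learning objectives
Understand the roles of the text embedding layer and positional encoding.
Master how the text embedding layer and positional encoding are implemented.
The input part contains:
The source text embedding layer and its positional encoder
The target text embedding layer and its positional encoder
[Figure: input part]
The role of the text embedding layer
Both the source and the target text embedding layers convert the numeric representation (index) of each word into a vector, with the aim of capturing relationships between words in that high-dimensional space.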
Installing PyTorch 0.3.0 and the required packages:

The packages installed with pip are pytorch-0.3.0, numpy, matplotlib, and seaborn

pip install http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp36-cp36m-linux_x86_64.whl numpy matplotlib seaborn

Installation on macOS, Python version <= 3.6

pip install torch==0.3.0.post4 numpy matplotlib seaborn
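Code analysis of the text embedding layer:

# Import the required packages
import torch

# torch.nn provides predefined network layers that the library authors have already
# implemented, such as convolution, LSTM, and embedding layers, so we do not have to
# reinvent the wheel.
import torch.nn as nn

# Math utilities
import math

# Variable, the tensor wrapper used by older versions of torch
# (since PyTorch 0.4 plain tensors can be used directly)
from torch.autograd import Variable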

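# Define the Embeddings class to implement the text embedding layer. The trailing "s"
# indicates two identical embedding layers (source and target) that share parameters.
# The class inherits from nn.Module, which gives it the standard layer functionality.
# This is a pattern: every layer we implement ourselves is written this way.
class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        """Initialization takes two parameters. d_model: the word embedding dimension,
           vocab: the size of the vocabulary."""
        # Call the initializer of nn.Module through super, as in every layer we implement.
        super(Embeddings, self).__init__()
        # Use the predefined nn.Embedding layer to get the embedding object self.lut.
        self.lut = nn.Embedding(vocab, d_model)
        # Finally, store d_model on the instance.
        self.d_model = d_model

    def forward(self, x):
        """The forward-pass logic of the layer; every layer has this function.
           It is called automatically when arguments are passed to an instance of the class.
           x: because Embedding is the first layer, x is the input text after each word
           has been mapped to its vocabulary index."""

        # Pass x to self.lut and multiply the result by the square root of self.d_model.
        return self.lut(x) * math.sqrt(self.d_model)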
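
The lookup-and-scale step in forward can be illustrated without PyTorch. The following is only a sketch under stated assumptions: the helper names make_embedding_table and embed are hypothetical, and random numbers stand in for the learned weights that nn.Embedding would hold.

```python
import math
import random

def make_embedding_table(vocab, d_model, seed=0):
    """Build a vocab x d_model table of random floats (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(vocab)]

def embed(table, token_ids, d_model):
    """Look up the row of each token id and scale it by sqrt(d_model)."""
    scale = math.sqrt(d_model)
    return [[value * scale for value in table[token_id]] for token_id in token_ids]

table = make_embedding_table(vocab=10, d_model=4)
vectors = embed(table, [1, 2, 4, 5], d_model=4)
print(len(vectors), len(vectors[0]))  # one 4-dimensional vector per input token
```

An embedding layer is therefore a table indexed by word id; the factor sqrt(d_model) only rescales the rows that were looked up.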

nn.Embedding demo:

embedding = nn.Embedding(10, 3)
input = torch.LongTensor([[1,2,4,5],[4,3,2,9]])
embedding(input)
tensor([[[-0.0251, -1.6902, 0.7172],
[-0.6431, 0.0748, 0.6969],
[ 1.4970, 1.3448, -0.9685],
[-0.3677, -2.7265, -0.1685]],

    [[ 1.4970,  1.3448, -0.9685],
     [ 0.4362, -0.4004,  0.9400],
     [-0.6431,  0.0748,  0.6969],
     [ 0.9124, -2.3616,  1.1151]]])

embedding = nn.Embedding(10, 3, padding_idx=0)
input = torch.LongTensor([[0,2,0,5]])
embedding(input)
tensor([[[ 0.0000, 0.0000, 0.0000],
[ 0.1535, -2.0309, 0.9315],
[ 0.0000, 0.0000, 0.0000],
[-0.1655, 0.9897, 0.0635]]])
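Instantiation parameters:

# The word embedding dimension is 512
d_model = 512

# The vocabulary size is 1000
vocab = 1000
Input parameters:

# The input x is a long-integer tensor wrapped in Variable, with shape 2 x 4
x = Variable(torch.LongTensor([[100,2,421,508],[491,998,1,221]]))
Call:

emb = Embeddings(d_model, vocab)
embr = emb(x)
print("embr:", embr)
Output: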

embr: Variable containing:
( 0 ,.,.) =
35.9321 3.2582 -17.7301 … 3.4109 13.8832 39.0272
8.5410 -3.5790 -12.0460 … 40.1880 36.6009 34.7141
-17.0650 -1.8705 -20.1807 … -12.5556 -34.0739 35.6536
20.6105 4.4314 14.9912 … -0.1342 -9.9270 28.6771

( 1 ,.,.) =
27.7016 16.7183 46.6900 … 17.9840 17.2525 -3.9709
3.0645 -5.5105 10.8802 … -13.0069 30.8834 -38.3209
33.1378 -32.1435 -3.9369 … 15.6094 -29.7063 40.1361
-31.5056 3.3648 1.4726 … 2.8047 -9.6514 -23.4909
[torch.FloatTensor of size 2x4x512]
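The role of the positional encoder
The Transformer encoder has no built-in handling of word order, so a positional encoder is added after the Embedding layer. It adds information about each word's position, which can change the meaning of a sentence, to the word embedding tensor to compensate for the missing positional information.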
Code analysis of the positional encoder:

# Define the positional encoder class. We treat it as a layer too, so it inherits from nn.Module.
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout, max_len=5000):
        """Initialization takes three parameters. d_model: the word embedding dimension,
           dropout: the zeroing rate, max_len: the maximum length of a sentence."""
        super(PositionalEncoding, self).__init__()

        # Instantiate the predefined nn.Dropout layer with the given rate.
        self.dropout = nn.Dropout(p=dropout)

        # Initialize the positional encoding matrix as zeros of size max_len x d_model.
        pe = torch.zeros(max_len, d_model)

        # Absolute positions: a word's position is represented by its index.
        # arange gives a vector of consecutive integers, and unsqueeze(1) turns it
        # into a max_len x 1 matrix.
        position = torch.arange(0, max_len).unsqueeze(1)

        # To fill pe, the max_len x 1 position matrix is multiplied by a row of scaling
        # factors, div_term, and the product broadcasts to max_len x (d_model / 2).
        # The factors shrink the raw integer positions so the encoded values stay small.
        # div_term has only d_model / 2 entries because each factor is used twice:
        # its sine fills an even column of pe and its cosine fills the following odd column.
        div_term = torch.exp(torch.arange(0, d_model, 2) *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        # pe is still a 2-D matrix. To add it to the 3-D embedding output,
        # add a leading dimension with unsqueeze.
        pe = pe.unsqueeze(0)

        # Register pe as a buffer: a tensor that helps the model but is neither a
        # hyperparameter nor a parameter, so the optimizer does not update it.
        # Once registered, it is saved and reloaded together with the model's parameters.
        self.register_buffer('pe', pe)

    def forward(self, x):
        """x is the word embedding representation of the text sequence."""
        # Slice pe along its second dimension (sentence length) down to x.size(1).
        # The default max_len of 5000 is far longer than real sentences, so pe must be
        # cut to fit the input. Wrap it in Variable with requires_grad=False because
        # it needs no gradient.
        x = x + Variable(self.pe[:, :x.size(1)],
                         requires_grad=False)
        # Apply dropout and return the result.
        return self.dropout(x)
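
To see what the pe matrix contains, the same arithmetic can be written with explicit loops. This is a sketch in plain Python, not the class above; the function name sinusoidal_table is hypothetical and d_model is assumed to be even.

```python
import math

def sinusoidal_table(max_len, d_model):
    """Return a max_len x d_model table: sine on even columns, cosine on odd columns."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Same scaling factor as div_term: exp(i * -(ln(10000) / d_model))
            freq = math.exp(i * -(math.log(10000.0) / d_model))
            table[pos][i] = math.sin(pos * freq)
            table[pos][i + 1] = math.cos(pos * freq)
    return table

pe_table = sinusoidal_table(max_len=50, d_model=8)
print(pe_table[0])  # position 0: sin(0) = 0 on even columns, cos(0) = 1 on odd columns
```

Each pair of columns shares one frequency, and the frequency decreases as the column index grows, so later columns vary more slowly with position.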

nn.Dropout demo:

m = nn.Dropout(p=0.2)
input = torch.randn(4, 5)
output = m(input)
output
Variable containing:
0.0000 -0.5856 -1.4094 0.0000 -1.0290
2.0591 -1.3400 -1.7247 -0.9885 0.1286
0.5099 1.3715 0.0000 2.2079 -0.5497
-0.0000 -0.7839 -1.2434 -0.1222 1.2815
[torch.FloatTensor of size 4x5]
torch.unsqueeze demo:

x = torch.tensor([1, 2, 3, 4])
torch.unsqueeze(x, 0)
tensor([[ 1, 2, 3, 4]])

torch.unsqueeze(x, 1)
tensor([[ 1],
[ 2],
[ 3],
[ 4]])
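Instantiation parameters:

# The word embedding dimension is 512
d_model = 512

# The zeroing rate is 0.1
dropout = 0.1

# Maximum sentence length
max_len = 60
Input parameters:

# The input x is the output tensor of the Embedding layer, with shape 2 x 4 x 512
x = embr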
Variable containing:
( 0 ,.,.) =
35.9321 3.2582 -17.7301 … 3.4109 13.8832 39.0272
8.5410 -3.5790 -12.0460 … 40.1880 36.6009 34.7141
-17.0650 -1.8705 -20.1807 … -12.5556 -34.0739 35.6536
20.6105 4.4314 14.9912 … -0.1342 -9.9270 28.6771

Call:

pe = PositionalEncoding(d_model, dropout, max_len)
pe_result = pe(x)
print("pe_result:", pe_result)
Output:

pe_result: Variable containing:
( 0 ,.,.) =
-19.7050 0.0000 0.0000 … -11.7557 -0.0000 23.4553
-1.4668 -62.2510 -2.4012 … 66.5860 -24.4578 -37.7469
9.8642 -41.6497 -11.4968 … -21.1293 -42.0945 50.7943
0.0000 34.1785 -33.0712 … 48.5520 3.2540 54.1348

( 1 ,.,.) =
7.7598 -21.0359 15.0595 … -35.6061 -0.0000 4.1772
-38.7230 8.6578 34.2935 … -43.3556 26.6052 4.3084
24.6962 37.3626 -26.9271 … 49.8989 0.0000 44.9158
-28.8435 -48.5963 -0.9892 … -52.5447 -4.1475 -3.0450
[torch.FloatTensor of size 2x4x512]
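Plotting the distribution of features in the word vectors:

import matplotlib.pyplot as plt
import numpy as np

# Create a 15 x 5 canvas
plt.figure(figsize=(15, 5))

# Instantiate PositionalEncoding with d_model=20 and dropout=0
pe = PositionalEncoding(20, 0)

# Pass an all-zero tensor wrapped in Variable to pe, which runs forward.
# Because the input is all zeros, the output is the positional encoding itself.
y = pe(Variable(torch.zeros(1, 100, 20)))

# The x-axis covers the 100 positions; the y-axis is the value of one feature
# dimension at each position. Of the 20 dimensions we plot only dimensions 4, 5, 6, 7.
plt.plot(np.arange(100), y[0, :, 4:8].data.numpy())

# Add a legend naming the dimensions
plt.legend(["dim %d"%p for p in [4,5,6,7]])
Output:
[Figure: positional encoding values of dimensions 4 to 7 over positions 0 to 99]
Analysis:
Each colored curve shows how one feature dimension of the encoding varies with position.
This guarantees that the same word gets a different positional embedding vector at different positions.
Sine and cosine both take values in [-1, 1], which keeps the added values bounded and helps keep gradient computation well behaved.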
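Section summary
We learned the role of the text embedding layer:

Both the source and the target text embedding layers convert the numeric representation (index) of each word into a vector, with the aim of capturing relationships between words in that high-dimensional space.
We learned and implemented the text embedding layer class: Embeddings

The initializer takes d_model, the word embedding dimension, and vocab, the vocabulary size, and internally uses the predefined nn.Embedding layer.
In forward, the input x is passed to the Embedding instance and the result is multiplied by the square root of d_model to scale the values.
Its output is the embedded text.
We learned the role of the positional encoder:

The Transformer encoder has no built-in handling of word order, so a positional encoder is added after the Embedding layer. It adds information about each word's position, which can change the meaning of a sentence, to the word embedding tensor to compensate for the missing positional information.
We learned and implemented the positional encoder class: PositionalEncoding

The initializer takes d_model, dropout, and max_len. d_model: the word embedding dimension, dropout: the zeroing rate, max_len: the maximum length of a sentence.
The input x of forward is the output of the Embedding layer.
The final output is a word embedding tensor with positional encoding added.
We plotted the distribution of features in the word vectors:

This guarantees that the same word gets a different positional embedding vector at different positions.
Sine and cosine both take values in [-1, 1], which keeps the added values bounded and helps keep gradient computation well behaved.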
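2.3 Implementing the Encoder Part
Learning objectives
Understand the role of each component of the encoder.
Master how each component of the encoder is implemented.
The encoder part:
Consists of N stacked encoder layers
Each encoder layer consists of two sublayer connections
The first sublayer connection contains a multi-head self-attention sublayer, a normalization layer, and a residual connection
The second sublayer connection contains a feed-forward fully connected sublayer, a normalization layer, and a residual connection
[Figure: encoder part]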
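2.3.1 Mask Tensors
Learning objectives:
Understand what a mask tensor is and what it does.
Master how a mask tensor is generated.
What a mask tensor is:
A mask hides things. A mask tensor can have any size and generally contains only 1s and 0s, which mark each position as masked or not masked. Whether 0 or 1 means "masked" is a matter of convention. Its purpose is to hide, or equivalently replace, some of the values in another tensor.
What a mask tensor does:
In the Transformer, masks are mainly used when applying attention (covered in the next subsection). Some values in the attention tensor could otherwise be computed with knowledge of future information. Future information is visible because, during training, the entire target sequence is embedded at once, whereas in principle the decoder does not produce its output in a single step: it generates it step by step, each step building on the previous results. Without a mask, future information could be used prematurely, so we mask it. The decoder is covered in later sections.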
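Code analysis of mask generation:

# numpy is needed for np.ones and np.triu
import numpy as np

def subsequent_mask(size):
    """Generate a mask that hides subsequent positions. size is the size of the last
       two dimensions of the mask, which form a square matrix."""
    # First define the shape of the mask tensor.
    attn_shape = (1, size, size)

    # Fill this shape with ones using np.ones and keep the strict upper triangle
    # with np.triu(k=1). To save space, convert the data type to unsigned 8-bit
    # integer, uint8.
    subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')

    # Finally convert the numpy array to a torch tensor after computing 1 - mask.
    # This flips the triangle: every element of subsequent_mask is subtracted from 1,
    # so a 0 becomes 1 and a 1 becomes 0.
    return torch.from_numpy(1 - subsequent_mask)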
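
The resulting pattern is easy to state directly: entry (row, col) is 1 when col <= row and 0 otherwise. A minimal sketch in plain Python, with the hypothetical name subsequent_mask_rows, which builds the same square pattern without numpy or torch:

```python
def subsequent_mask_rows(size):
    """1 where col <= row (current or earlier position), 0 where col > row (future)."""
    return [[1 if col <= row else 0 for col in range(size)] for row in range(size)]

mask_rows = subsequent_mask_rows(4)
for row in mask_rows:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Row i lists which positions position i may use: itself and everything before it.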

np.triu demo:

np.triu([[1,2,3],[4,5,6],[7,8,9],[10,11,12]], k=-1)
array([[ 1, 2, 3],
[ 4, 5, 6],
[ 0, 8, 9],
[ 0, 0, 12]])

np.triu([[1,2,3],[4,5,6],[7,8,9],[10,11,12]], k=0)
array([[ 1, 2, 3],
[ 0, 5, 6],
[ 0, 0, 9],
[ 0, 0, 0]])

np.triu([[1,2,3],[4,5,6],[7,8,9],[10,11,12]], k=1)
array([[ 0, 2, 3],
[ 0, 0, 6],
[ 0, 0, 0],
[ 0, 0, 0]])
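Input example:

# Size of the last two dimensions of the generated mask tensor
size = 5
Call:

sm = subsequent_mask(size)
print("sm:", sm)
Output:

# The last two dimensions form a lower-triangular matrix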

sm: (0 ,.,.) =
1 0 0 0 0
1 1 0 0 0
1 1 1 0 0
1 1 1 1 0
1 1 1 1 1
[torch.ByteTensor of size 1x5x5]
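Visualizing the mask tensor:

plt.figure(figsize=(5,5))
plt.imshow(subsequent_mask(20)[0])
Output:
[Figure: 20 x 20 subsequent mask, yellow lower triangle and purple upper triangle]
Analysis:
In the plotted matrix, yellow cells have value 1 and purple cells have value 0. Because the attention function in the next subsection replaces scores where the mask equals 0, yellow marks positions that can be seen and purple marks positions that are masked. Each row corresponds to the position of the target word, and each column to a position it may look at;
Row 0 has a single yellow cell, so position 0 can see only itself. Row 1 can see positions 0 and 1 but nothing later, and so on: no position can see the positions after it.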
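2.3.1 Mask tensor summary:

We learned what a mask tensor is:
A mask hides things. A mask tensor can have any size and generally contains only 1s and 0s, which mark each position as masked or not masked. Whether 0 or 1 means "masked" is a matter of convention. Its purpose is to hide, or equivalently replace, some of the values in another tensor.
We learned what a mask tensor does:
In the Transformer, masks are mainly used when applying attention (covered in the next subsection). Some values in the attention tensor could otherwise be computed with knowledge of future information. Future information is visible because, during training, the entire target sequence is embedded at once, whereas in principle the decoder does not produce its output in a single step: it generates it step by step, each step building on the previous results. Without a mask, future information could be used prematurely, so we mask it. The decoder is covered in later sections.
We learned and implemented the function that generates a mask hiding subsequent positions: subsequent_mask
Its input is size, the size of the mask tensor.
Its output is a tensor whose last two dimensions form a lower-triangular matrix of ones.
Finally, we visualized the generated mask to better understand its purpose.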
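2.3.2 The Attention Mechanism
Learning objectives:
Understand what the attention computation rule and the attention mechanism are.
Master how the attention computation rule is implemented.
What attention is:
When we look at something, we can judge it quickly (though the judgment may be wrong) because the brain focuses on its most distinctive parts rather than examining it from beginning to end before deciding. The attention mechanism is based on this idea.
What the attention computation rule is:
It takes three inputs, Q (query), K (key), and V (value), and computes the attention result with a formula. The result is a representation of the query under the influence of key and value. There are many such rules; only the one used here is introduced.
The attention computation rule used here:
[Figure: attention formula]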
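
The formula image is not reproduced in this text. Judging from the attention function implemented later in this subsection (the scores are the product of query and the transposed key divided by the square root of d_k, followed by a softmax over the last dimension and a product with value), the rule appears to be scaled dot-product attention, which can be written as:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

Here d_k is the size of the last dimension of Q, and the softmax is taken over each row of the score matrix.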
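An analogy for Q, K, V:

Suppose we are given a task: describe a passage of text using a few keywords.
To make answers comparable, the task may come with some keywords as hints. Those hints can be seen as the key,
and the whole passage corresponds to the query. The value is more abstract: think of it as the answer that forms in your mind after reading the passage.
Suppose that at first we are not very insightful, and after the first reading the only thing in mind is the hints themselves,
so key and value are essentially the same. As we think about the problem more deeply, more and more comes to mind,
and we begin to extract key information that represents the query, that is, the passage. This is the attention process:
through it, the value in our mind changes,
and, guided by the key, we produce a keyword representation of the query, which is another feature representation of it.

We said that key and value are usually the same by default and differ from query. This is the general form of attention input.
There is a special case in which query, key, and value are all the same, called self-attention. In terms of the example,
general attention describes the given text with keywords that come from outside it, whereas self-attention
must describe the text using the text itself: you extract keywords from the given text, which amounts to a feature extraction on the text itself.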
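What the attention mechanism is:
The attention mechanism is the network component that carries the attention computation rule. Besides the rule itself, it includes the fully connected layers and tensor operations needed to integrate it into the surrounding network. An attention mechanism that uses the self-attention computation rule is called a self-attention mechanism.
Diagram of the attention mechanism as implemented in the network:
[Figure: attention mechanism in the network]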
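Code analysis of the attention computation rule:

# F provides the functional form of softmax used below
import torch.nn.functional as F

def attention(query, key, value, mask=None, dropout=None):
    """Implementation of the attention computation. Inputs are query, key, value,
       mask: the mask tensor, and dropout: an nn.Dropout instance, None by default."""
    # Take the size of the last dimension of query, which is generally the word
    # embedding dimension, and call it d_k.
    d_k = query.size(-1)
    # Following the attention formula, multiply query by the transpose of key (the last
    # two dimensions of key are swapped) and divide by the scaling factor sqrt(d_k).
    # This is known as scaled dot-product attention. The result is the score tensor.
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

    # Check whether a mask tensor is used.
    if mask is not None:
        # masked_fill compares the mask with scores position by position. Wherever the
        # mask is 0, the corresponding score is replaced with -1e9, as shown in the demo.
        scores = scores.masked_fill(mask == 0, -1e9)

    # Apply softmax over the last dimension of scores with F.softmax. The first argument
    # is the input tensor and the second is the dimension. This gives the attention tensor.
    p_attn = F.softmax(scores, dim = -1)

    # Check whether dropout is used for random zeroing.
    if dropout is not None:
        # Pass p_attn through the dropout object.
        p_attn = dropout(p_attn)

    # Finally multiply p_attn by value to get the attention representation of query,
    # and return it together with the attention tensor.
    return torch.matmul(p_attn, value), p_attn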
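
The same computation can be followed by hand on a tiny example. The sketch below uses plain Python lists for a single unbatched sequence; attention_2d and softmax are hypothetical helper names, and dropout is left out. It also shows what the mask does: an all-zero mask sets every score to -1e9, so the softmax becomes uniform, while a lower-triangular mask restricts each row to current and earlier positions.

```python
import math

def softmax(row):
    """Numerically stable softmax of one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_2d(query, key, value, mask=None):
    """query: n x d, key: m x d, value: m x d_v, mask: n x m of 0/1 or None."""
    d_k = len(query[0])
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in key] for q_row in query]
    if mask is not None:
        # Replace the score with -1e9 wherever the mask is 0
        scores = [[-1e9 if m == 0 else s for s, m in zip(s_row, m_row)]
                  for s_row, m_row in zip(scores, mask)]
    weights = [softmax(s_row) for s_row in scores]
    output = [[sum(w * v_row[j] for w, v_row in zip(w_row, value))
               for j in range(len(value[0]))] for w_row in weights]
    return output, weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
zero_mask = [[0] * 3 for _ in range(3)]
_, uniform = attention_2d(x, x, x, mask=zero_mask)
causal = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
_, causal_weights = attention_2d(x, x, x, mask=causal)
print(uniform[0])         # every weight equals 1/3
print(causal_weights[0])  # [1.0, 0.0, 0.0]
```

With the all-zero mask every row of the output is the plain average of the value rows, which is why a fully masked input produces identical rows.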

tensor.masked_fill demo:

input = Variable(torch.randn(5, 5))
input
Variable containing:
2.0344 -0.5450 0.3365 -0.1888 -2.1803
1.5221 -0.3823 0.8414 0.7836 -0.8481
-0.0345 -0.8643 0.6476 -0.2713 1.5645
0.8788 -2.2142 0.4022 0.1997 0.1474
2.9109 0.6006 -0.6745 -1.7262 0.6977
[torch.FloatTensor of size 5x5]

mask = Variable(torch.zeros(5, 5))
mask
Variable containing:
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
[torch.FloatTensor of size 5x5]

input.masked_fill(mask == 0, -1e9)
Variable containing:
-1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09
-1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09
-1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09
-1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09
-1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09
[torch.FloatTensor of size 5x5]
Input parameters:

# Let query, key, and value all be the same: the output of the positional encoder
query = key = value = pe_result
Variable containing:
( 0 ,.,.) =
46.5196 16.2057 -41.5581 … -16.0242 -17.8929 -43.0405
-32.6040 16.1096 -29.5228 … 4.2721 20.6034 -1.2747
-18.6235 14.5076 -2.0105 … 15.6462 -24.6081 -30.3391
0.0000 -66.1486 -11.5123 … 20.1519 -4.6823 0.4916

( 1 ,.,.) =
-24.8681 7.5495 -5.0765 … -7.5992 -26.6630 40.9517
13.1581 -3.1918 -30.9001 … 25.1187 -26.4621 2.9542
-49.7690 -42.5019 8.0198 … -5.4809 25.9403 -27.4931
-52.2775 10.4006 0.0000 … -1.9985 7.0106 -0.5189
[torch.FloatTensor of size 2x4x512]
Call:

attn, p_attn = attention(query, key, value)
print("attn:", attn)
print("p_attn:", p_attn)
Output:

# There are two results

# The attention representation of query:

attn: Variable containing:
( 0 ,.,.) =
12.8269 7.7403 41.2225 … 1.4603 27.8559 -12.2600
12.4904 0.0000 24.1575 … 0.0000 2.5838 18.0647
-32.5959 -4.6252 -29.1050 … 0.0000 -22.6409 -11.8341
8.9921 -33.0114 -0.7393 … 4.7871 -5.7735 8.3374

( 1 ,.,.) =
-25.6705 -4.0860 -36.8226 … 37.2346 -27.3576 2.5497
-16.6674 73.9788 -33.3296 … 28.5028 -5.5488 -13.7564
0.0000 -29.9039 -3.0405 … 0.0000 14.4408 14.8579
30.7819 0.0000 21.3908 … -29.0746 0.0000 -5.8475
[torch.FloatTensor of size 2x4x512]

# The attention tensor:

p_attn: Variable containing:
(0 ,.,.) =
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1

(1 ,.,.) =
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
[torch.FloatTensor of size 2x4x4]
Input parameters with a mask:

query = key = value = pe_result

# Let mask be a 2x4x4 tensor of zeros
mask = Variable(torch.zeros(2, 4, 4))
Call:

attn, p_attn = attention(query, key, value, mask=mask)
print("attn:", attn)
print("p_attn:", p_attn)
Output with a mask:

# The attention representation of query:

attn: Variable containing:
( 0 ,.,.) =
0.4284 -7.4741 8.8839 … 1.5618 0.5063 0.5770
0.4284 -7.4741 8.8839 … 1.5618 0.5063 0.5770
0.4284 -7.4741 8.8839 … 1.5618 0.5063 0.5770
0.4284 -7.4741 8.8839 … 1.5618 0.5063 0.5770

( 1 ,.,.) =
-2.8890 9.9972 -12.9505 … 9.1657 -4.6164 -0.5491
-2.8890 9.9972 -12.9505 … 9.1657 -4.6164 -0.5491
-2.8890 9.9972 -12.9505 … 9.1657 -4.6164 -0.5491
-2.8890 9.9972 -12.9505 … 9.1657 -4.6164 -0.5491
[torch.FloatTensor of size 2x4x512]

# The attention tensor:

p_attn: Variable containing:
(0 ,.,.) =
0.2500 0.2500 0.2500 0.2500
0.2500 0.2500 0.2500 0.2500
0.2500 0.2500 0.2500 0.2500
0.2500 0.2500 0.2500 0.2500

(1 ,.,.) =
0.2500 0.2500 0.2500 0.2500
0.2500 0.2500 0.2500 0.2500
0.2500 0.2500 0.2500 0.2500
0.2500 0.2500 0.2500 0.2500
[torch.FloatTensor of size 2x4x4]
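2.3.2 Attention mechanism summary:

We learned what attention is:
When we look at something, we can judge it quickly (though the judgment may be wrong) because the brain focuses on its most distinctive parts rather than examining it from beginning to end before deciding. The attention mechanism is based on this idea.
What the attention computation rule is:
It takes three inputs, Q (query), K (key), and V (value), and computes the attention result with a formula. The result is a representation of the query under the influence of key and value. There are many such rules; only the one used here is introduced.
We learned an analogy for Q, K, V:
Q is the passage to be summarized; K is the given hints; V is what the mind builds up from the hints K.
When Q=K=V, the mechanism is called self-attention.
What the attention mechanism is:
The attention mechanism is the network component that carries the attention computation rule. Besides the rule itself, it includes the fully connected layers and tensor operations needed to integrate it into the surrounding network. An attention mechanism that uses the self-attention computation rule is called a self-attention mechanism.
We learned and implemented the function for the attention computation rule: attention
Its inputs are Q, K, V, mask, and dropout. mask is used for masking and dropout for random zeroing.
It has two outputs: the attention representation of query and the attention tensor.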
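2.3.3 The Multi-Head Attention Mechanism
Learning objectives:
Understand the role of the multi-head attention mechanism.
Master how the multi-head attention mechanism is implemented.
What the multi-head attention mechanism is:
From the multi-head attention diagram, the multiple heads might appear to be multiple sets of linear layers, but that is not the case here. Only one set of linear layers is used: three transformation matrices applied to Q, K, and V respectively. These transformations do not change the size of the tensors, so each matrix is square. The heads come into play after the transformations: the output tensor is split along the last dimension, the word embedding vector, so each head gets its own Q, K, V for the attention computation but sees only a slice of each word's representation. Feeding each head's input into the attention computation forms the multi-head attention mechanism.
Diagram of the multi-head attention mechanism:
[Figure: multi-head attention mechanism]
The role of the multi-head attention mechanism:
This design lets each head optimize a different part of each word's features, which offsets the bias that a single attention computation might introduce and gives word meanings a more varied representation. Experiments show that this improves model performance.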
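Implementation of the multi-head attention mechanism:

# The copy package, used for deep copies
import copy

# First define a clone function, because multi-head attention uses several linear
# layers with the same structure. clones initializes them together in one list of
# network layers. The function is reused in later structures.
def clones(module, N):
    """Produce identical network layers. module is the layer to clone,
       N is the number of copies."""
    # Deep-copy module N times so that each copy is an independent layer,
    # then store the copies in an nn.ModuleList.
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])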
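
Why a deep copy is used rather than repeating the same object can be seen in a small stand-alone sketch. TinyLayer and clone_list are hypothetical stand-ins for a real layer and for clones, so that the example runs without PyTorch.

```python
import copy

class TinyLayer:
    """A stand-in for a layer: it only holds a list of weights."""
    def __init__(self, weight):
        self.weight = list(weight)

def clone_list(module, n):
    """Return n independent deep copies of module."""
    return [copy.deepcopy(module) for _ in range(n)]

base = TinyLayer([1.0, 2.0])
independent = clone_list(base, 3)
shared = [base] * 3  # three references to the same object

independent[0].weight[0] = 99.0  # changes only the first copy
shared[0].weight[0] = -1.0       # changes the one object behind all three references

print(independent[1].weight[0])  # 1.0
print(shared[1].weight[0])       # -1.0
```

With deep copies each layer can learn its own weights; with shared references every "layer" would be the same layer.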

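# We use a class to implement the multi-head attention mechanism
class MultiHeadedAttention(nn.Module):
    def __init__(self, head, embedding_dim, dropout=0.1):
        """Initialization takes three parameters. head: the number of heads,
           embedding_dim: the word embedding dimension,
           dropout: the zeroing rate used by dropout, 0.1 by default."""
        super(MultiHeadedAttention, self).__init__()

        # Use an assert statement to check that embedding_dim is divisible by head,
        # because each head is given an equal share of the word features,
        # namely embedding_dim / head of them.
        assert embedding_dim % head == 0

        # The dimension d_k of the word vector slice that each head receives
        self.d_k = embedding_dim // head

        # Store the number of heads
        self.head = head

        # Create a linear layer with nn.Linear whose weight matrix is
        # embedding_dim x embedding_dim, and clone it four times with clones.
        # Four are needed because Q, K, and V each need one, and the concatenated
        # result of the heads needs one more.
        self.linears = clones(nn.Linear(embedding_dim, embedding_dim), 4)

        # self.attn will hold the resulting attention tensor. There is no result yet,
        # so it is None.
        self.attn = None

        # Finally, a self.dropout object instantiated from nn.Dropout with the given rate.
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        """Forward logic. The first three inputs are the Q, K, V needed by attention.
           The last is the mask tensor that attention may need, None by default."""

        # If a mask tensor is given
        if mask is not None:
            # add a dimension with unsqueeze so it broadcasts against the scores
            mask = mask.unsqueeze(0)

        # batch_size is the first number in the size of query: the number of samples.
        batch_size = query.size(0)

        # Multi-head processing starts here.
        # zip pairs Q, K, V with the first three linear layers, and the loop passes each
        # input through its layer. After the linear transformation, view reshapes the
        # result to add a dimension for the heads, so each head gets sentences made of
        # a slice of the word features. The -1 is an inferred dimension that is computed
        # automatically. Then the second and third dimensions are transposed so that the
        # sentence-length dimension and the word-vector dimension are adjacent, because
        # attention operates on the last two dimensions of its inputs.
        # This gives the input of each head.
        query, key, value = \
           [model(x).view(batch_size, -1, self.head, self.d_k).transpose(1, 2)
            for model, x in zip(self.linears, (query, key, value))]

        # Pass the per-head inputs to the attention function implemented earlier,
        # together with mask and dropout.
        x, self.attn = attention(query, key, value, mask=mask, dropout=self.dropout)

        # The result is a 4-D tensor holding the output of every head. It must be
        # converted back to the shape of the input for later computation, so we invert
        # the first step: transpose the second and third dimensions, then call
        # contiguous, which makes view usable on the transposed tensor,
        # and finally reshape with view to the same shape as the input.
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.head * self.d_k)

        # Finally apply the last linear layer in the list to obtain the output of the
        # multi-head attention structure.
        return self.linears[-1](x)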
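
The reshaping done by view(batch_size, -1, self.head, self.d_k).transpose(1, 2) can be pictured for a single sentence as cutting every word vector into head consecutive slices and grouping the slices by head. The sketch below uses plain Python lists; split_heads and merge_heads are hypothetical names, and the linear layers are left out.

```python
def split_heads(rows, head):
    """rows: seq_len x d_model  ->  head x seq_len x d_k, where d_k = d_model // head."""
    d_model = len(rows[0])
    assert d_model % head == 0
    d_k = d_model // head
    return [[row[h * d_k:(h + 1) * d_k] for row in rows] for h in range(head)]

def merge_heads(heads):
    """Inverse of split_heads: head x seq_len x d_k  ->  seq_len x (head * d_k)."""
    seq_len = len(heads[0])
    return [[v for h in heads for v in h[t]] for t in range(seq_len)]

tokens = [[0, 1, 2, 3, 4, 5, 6, 7],
          [10, 11, 12, 13, 14, 15, 16, 17]]
per_head = split_heads(tokens, head=4)
restored = merge_heads(per_head)
print(per_head[1])  # [[2, 3], [12, 13]]: head 1 sees features 2 and 3 of every word
print(restored == tokens)  # True
```

Each head therefore works on the full sentence but on only d_k features per word, and merging the heads restores the original layout.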

tensor.view demo:

x = torch.randn(4, 4)
x.size()
torch.Size([4, 4])

y = x.view(16)
y.size()
torch.Size([16])

z = x.view(-1, 8) # the size -1 is inferred from other dimensions
z.size()
torch.Size([2, 8])

a = torch.randn(1, 2, 3, 4)
a.size()
torch.Size([1, 2, 3, 4])

b = a.transpose(1, 2) # Swaps 2nd and 3rd dimension
b.size()
torch.Size([1, 3, 2, 4])

c = a.view(1, 3, 2, 4) # Does not change tensor layout in memory
c.size()
torch.Size([1, 3, 2, 4])

torch.equal(b, c)
False
torch.transpose demo:

x = torch.randn(2, 3)
x
tensor([[ 1.0028, -0.9893, 0.5809],
[-0.1669, 0.7299, 0.4942]])

torch.transpose(x, 0, 1)
tensor([[ 1.0028, -0.1669],
[-0.9893, 0.7299],
[ 0.5809, 0.4942]])
Instantiation parameters:

# Number of heads
head = 8

# Word embedding dimension
embedding_dim = 512

# Zeroing rate
dropout = 0.2
Input parameters:

# Assume Q, K, V are still equal
query = value = key = pe_result

# The input mask tensor
mask = Variable(torch.zeros(8, 4, 4))
Call:

mha = MultiHeadedAttention(head, embedding_dim, dropout)
mha_result = mha(query, key, value, mask)
print(mha_result)
Output:

tensor([[[-0.3075, 1.5687, -2.5693, …, -1.1098, 0.0878, -3.3609],
[ 3.8065, -2.4538, -0.3708, …, -1.5205, -1.1488, -1.3984],
[ 2.4190, 0.5376, -2.8475, …, 1.4218, -0.4488, -0.2984],
[ 2.9356, 0.3620, -3.8722, …, -0.7996, 0.1468, 1.0345]],

    [[ 1.1423,  0.6038,  0.0954,  ...,  2.2679, -5.7749,  1.4132],
     [ 2.4066, -0.2777,  2.8102,  ...,  0.1137, -3.9517, -2.9246],
     [ 5.8201,  1.1534, -1.9191,  ...,  0.1410, -7.6110,  1.0046],
     [ 3.1209,  1.0008, -0.5317,  ...,  2.8619, -6.3204, -1.3435]]],
   grad_fn=)

torch.Size([2, 4, 512])
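2.3.3 Multi-head attention mechanism summary:

We learned what the multi-head attention mechanism is:
The output tensor is split along the last dimension, the word embedding vector, so each head gets its own Q, K, V for the attention computation but sees only a slice of each word's representation. Feeding each head's input into the attention computation forms the multi-head attention mechanism.
We learned the role of the multi-head attention mechanism:
This design lets each head optimize a different part of each word's features, which offsets the bias that a single attention computation might introduce and gives word meanings a more varied representation. Experiments show that this improves model performance.
We learned and implemented the multi-head attention class: MultiHeadedAttention
Because multi-head attention uses several identical linear layers, we first implemented the clone function clones.
The inputs of clones are module and N, the layer to clone and the number of copies.
The output of clones is a Module list holding N cloned layers.
We then implemented the MultiHeadedAttention class. Its initializer takes head, embedding_dim, and dropout: the number of heads, the word embedding dimension, and the zeroing rate.
An instance takes Q, K, V and the mask tensor mask as input.
An instance outputs the attention representation of Q after multi-head attention.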
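2.3.4 The Feed-Forward Fully Connected Layer
Learning objectives:
Understand what the feed-forward fully connected layer is and what it does.
Master how the feed-forward fully connected layer is implemented.
What the feed-forward fully connected layer is:
In the Transformer, the feed-forward fully connected layer is a fully connected network with two linear layers.
The role of the feed-forward fully connected layer:
Attention alone may not fit complex processes well enough, so two more layers are added to increase the capacity of the model.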
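Code analysis of the feed-forward fully connected layer:

# The class PositionwiseFeedForward implements the feed-forward fully connected layer
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        """Initialization takes three parameters: d_model, d_ff, and dropout=0.1.
           d_model is the input dimension of the first linear layer and the output
           dimension of the second, because the layer should keep the input and output
           dimensions the same. d_ff is the output dimension of the first linear layer
           and the input dimension of the second. dropout is the zeroing rate."""
        super(PositionwiseFeedForward, self).__init__()

        # Instantiate two linear layers with nn.Linear, self.w1 and self.w2,
        # with arguments (d_model, d_ff) and (d_ff, d_model).
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)
        # Then instantiate self.dropout with nn.Dropout.
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """x is the output of the previous layer."""
        # Pass x through the first linear layer and activate it with relu from
        # torch.nn.functional (imported as F). Then apply dropout for random zeroing,
        # pass the result through the second linear layer w2, and return it.
        return self.w2(self.dropout(F.relu(self.w1(x))))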
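
The forward pass amounts to linear, ReLU, linear applied to each position's vector independently. A worked sketch in plain Python with small hand-picked weights (all values and the helper names relu, linear, and feed_forward are illustrative assumptions, and dropout is omitted):

```python
def relu(vector):
    """ReLU(x) = max(0, x), applied element by element."""
    return [max(0.0, v) for v in vector]

def linear(vector, weight, bias):
    """weight has one row per output feature: out[i] = sum_j weight[i][j] * vector[j] + bias[i]."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weight, bias)]

def feed_forward(vector, w1, b1, w2, b2):
    """Two linear layers with a ReLU in between."""
    return linear(relu(linear(vector, w1, b1)), w2, b2)

# d_model = 2, d_ff = 3
w1 = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
b2 = [0.0, 0.0]

out = feed_forward([2.0, 1.0], w1, b1, w2, b2)
print(out)  # hidden = [1.0, -1.0, 1.5] -> ReLU -> [1.0, 0.0, 1.5] -> [2.5, 1.5]
```

The output has the same size as the input, d_model, which is what allows the layer to be wrapped in a residual connection.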

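ReLU formula: ReLU(x) = max(0, x)
Graph of the ReLU function:
[Figure: graph of the ReLU function]
Instantiation parameters:

d_model = 512

# Dimension of the intermediate linear transformation
d_ff = 64

dropout = 0.2
Input parameters:

# The input x can be the output of the multi-head attention mechanism
x = mha_result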
tensor([[[-0.3075, 1.5687, -2.5693, …, -1.1098, 0.0878, -3.3609],
[ 3.8065, -2.4538, -0.3708, …, -1.5205, -1.1488, -1.3984],
[ 2.4190, 0.5376, -2.8475, …, 1.4218, -0.4488, -0.2984],
[ 2.9356, 0.3620, -3.8722, …, -0.7996, 0.1468, 1.0345]],

    [[ 1.1423,  0.6038,  0.0954,  ...,  2.2679, -5.7749,  1.4132],
     [ 2.4066, -0.2777,  2.8102,  ...,  0.1137, -3.9517, -2.9246],
     [ 5.8201,  1.1534, -1.9191,  ...,  0.1410, -7.6110,  1.0046],
     [ 3.1209,  1.0008, -0.5317,  ...,  2.8619, -6.3204, -1.3435]]],
   grad_fn=)

torch.Size([2, 4, 512])
Call:

ff = PositionwiseFeedForward(d_model, d_ff, dropout)
ff_result = ff(x)
print(ff_result)
Output:

tensor([[[-1.9488e+00, -3.4060e-01, -1.1216e+00, …, 1.8203e-01,
-2.6336e+00, 2.0917e-03],
[-2.5875e-02, 1.1523e-01, -9.5437e-01, …, -2.6257e-01,
-5.7620e-01, -1.9225e-01],
[-8.7508e-01, 1.0092e+00, -1.6515e+00, …, 3.4446e-02,
-1.5933e+00, -3.1760e-01],
[-2.7507e-01, 4.7225e-01, -2.0318e-01, …, 1.0530e+00,
-3.7910e-01, -9.7730e-01]],

    [[-2.2575e+00, -2.0904e+00,  2.9427e+00,  ...,  9.6574e-01,
      -1.9754e+00,  1.2797e+00],
     [-1.5114e+00, -4.7963e-01,  1.2881e+00,  ..., -2.4882e-02,
      -1.5896e+00, -1.0350e+00],
     [ 1.7416e-01, -4.0688e-01,  1.9289e+00,  ..., -4.9754e-01,
      -1.6320e+00, -1.5217e+00],
     [-1.0874e-01, -3.3842e-01,  2.9379e-01,  ..., -5.1276e-01,
      -1.6150e+00, -1.1295e+00]]], grad_fn=)

torch.Size([2, 4, 512])
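2.3.4 Feed-forward fully connected layer summary:

We learned what the feed-forward fully connected layer is:
In the Transformer, the feed-forward fully connected layer is a fully connected network with two linear layers.
We learned the role of the feed-forward fully connected layer:
Attention alone may not fit complex processes well enough, so two more layers are added to increase the capacity of the model.
We learned and implemented the feed-forward fully connected layer class: PositionwiseFeedForward
Its instantiation parameters are d_model, d_ff, and dropout: the word embedding dimension, the dimension of the intermediate linear transformation, and the zeroing rate.
Its input x is the output of the previous layer.
Its output is the feature representation after the two linear layers.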
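2.3.5 The Normalization Layer
Learning objectives:
Understand the role of the normalization layer.
Master how the normalization layer is implemented.
The role of the normalization layer:
It is a standard layer in deep network models. As the number of layers grows, the values produced after many layers of computation may become too large or too small, which can destabilize learning and make convergence very slow. A normalization layer is therefore added after a certain number of layers to keep the feature values within a reasonable range.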
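Implementation of the normalization layer:

# The class LayerNorm implements the normalization layer
class LayerNorm(nn.Module):
    def __init__(self, features, eps=1e-6):
        """Initialization takes two parameters. features is the word embedding
           dimension. eps is a small number that appears in the denominator of the
           normalization formula to prevent division by zero. It defaults to 1e-6."""
        super(LayerNorm, self).__init__()

        # Initialize two parameter tensors of size features, a2 and b2. The first is
        # all ones and the second is all zeros. They are the parameters of the layer:
        # applying the normalization formula directly would change the representation
        # produced by the previous layer, so these parameters act as adjustment factors
        # that satisfy the normalization while preserving the representation.
        # Wrapping them in nn.Parameter marks them as model parameters.
        self.a2 = nn.Parameter(torch.ones(features))
        self.b2 = nn.Parameter(torch.zeros(features))

        # Store eps on the instance
        self.eps = eps

    def forward(self, x):
        """x is the output of the previous layer."""
        # First take the mean of x over its last dimension, keeping the number of
        # dimensions unchanged. Then take the standard deviation over the last dimension.
        # Following the normalization formula, subtract the mean from x and divide by the
        # standard deviation. Finally multiply element-wise by the scale parameter a2
        # and add the shift parameter b2.
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a2 * (x - mean) / (std + self.eps) + self.b2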
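
The formula in forward can be checked on one small vector. The following plain Python sketch mirrors it under one assumption worth stating: the sample standard deviation divides by n - 1, which is the default behaviour of torch's std. The function name layer_norm is hypothetical.

```python
import math

def layer_norm(vector, gain=None, bias=None, eps=1e-6):
    """gain * (x - mean) / (std + eps) + bias over one feature vector."""
    n = len(vector)
    mean = sum(vector) / n
    # Sample variance with n - 1 in the denominator (assumed to match torch's default std)
    variance = sum((v - mean) ** 2 for v in vector) / (n - 1)
    std = math.sqrt(variance)
    gain = gain or [1.0] * n
    bias = bias or [0.0] * n
    return [g * (v - mean) / (std + eps) + b for v, g, b in zip(vector, gain, bias)]

normed = layer_norm([1.0, 2.0, 3.0, 4.0])
print(normed)  # values centred on 0, roughly [-1.16, -0.39, 0.39, 1.16]
```

Note that eps is added to the standard deviation here, as in the class above, rather than to the variance inside the square root.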

Instantiation parameters:

features = d_model = 512
eps = 1e-6
Input parameters:

# The input x is the output of the feed-forward fully connected layer
x = ff_result
tensor([[[-1.9488e+00, -3.4060e-01, -1.1216e+00, …, 1.8203e-01,
-2.6336e+00, 2.0917e-03],
[-2.5875e-02, 1.1523e-01, -9.5437e-01, …, -2.6257e-01,
-5.7620e-01, -1.9225e-01],
[-8.7508e-01, 1.0092e+00, -1.6515e+00, …, 3.4446e-02,
-1.5933e+00, -3.1760e-01],
[-2.7507e-01, 4.7225e-01, -2.0318e-01, …, 1.0530e+00,
-3.7910e-01, -9.7730e-01]],

    [[-2.2575e+00, -2.0904e+00,  2.9427e+00,  ...,  9.6574e-01,
      -1.9754e+00,  1.2797e+00],
     [-1.5114e+00, -4.7963e-01,  1.2881e+00,  ..., -2.4882e-02,
      -1.5896e+00, -1.0350e+00],
     [ 1.7416e-01, -4.0688e-01,  1.9289e+00,  ..., -4.9754e-01,
      -1.6320e+00, -1.5217e+00],
     [-1.0874e-01, -3.3842e-01,  2.9379e-01,  ..., -5.1276e-01,
      -1.6150e+00, -1.1295e+00]]], grad_fn=)

torch.Size([2, 4, 512])
Call:

ln = LayerNorm(features, eps)
ln_result = ln(x)
print(ln_result)
Output:

tensor([[[ 2.2697, 1.3911, -0.4417, …, 0.9937, 0.6589, -1.1902],
[ 1.5876, 0.5182, 0.6220, …, 0.9836, 0.0338, -1.3393],
[ 1.8261, 2.0161, 0.2272, …, 0.3004, 0.5660, -0.9044],
[ 1.5429, 1.3221, -0.2933, …, 0.0406, 1.0603, 1.4666]],

    [[ 0.2378,  0.9952,  1.2621,  ..., -0.4334, -1.1644,  1.2082],
     [-1.0209,  0.6435,  0.4235,  ..., -0.3448, -1.0560,  1.2347],
     [-0.8158,  0.7118,  0.4110,  ...,  0.0990, -1.4833,  1.9434],
     [ 0.9857,  2.3924,  0.3819,  ...,  0.0157, -1.6300,  1.2251]]],
   grad_fn=)

torch.Size([2, 4, 512])
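2.3.5 Normalization layer summary:

We learned the role of the normalization layer:
It is a standard layer in deep network models. As the number of layers grows, the values produced after many layers of computation may become too large or too small, which can destabilize learning and make convergence very slow. A normalization layer is therefore added after a certain number of layers to keep the feature values within a reasonable range.
We learned and implemented the normalization layer class: LayerNorm
It has two instantiation parameters, features and eps: the size of the word embedding features and a small number.
Its input x is the output of the previous layer.
Its output is the normalized feature representation.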
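2.3.6 The Sublayer Connection
Learning objectives:
Understand what the sublayer connection is.
Master how the sublayer connection is implemented.
What the sublayer connection is:
As the figure shows, a residual connection (skip connection) is used around each sublayer and its normalization layer. We call this whole structure a sublayer connection, meaning the sublayer together with its connections. Each encoder layer has two sublayers, and these two sublayers plus the surrounding connections form two sublayer connections.
Diagram of the sublayer connection:
[Figure: sublayer connection, first view]
[Figure: sublayer connection, second view]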
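Code analysis of the sublayer connection:

# The class SublayerConnection implements the sublayer connection
class SublayerConnection(nn.Module):
    def __init__(self, size, dropout=0.1):
        """There are two parameters, size and dropout. size is generally the word
           embedding dimension. dropout is the rate at which nodes in the model are
           randomly suppressed. Because a suppressed node outputs 0, dropout can also
           be seen as the rate at which the output matrix is randomly zeroed."""
        super(SublayerConnection, self).__init__()
        # Instantiate the normalization object self.norm
        self.norm = LayerNorm(size)
        # Instantiate self.dropout with the predefined nn.Dropout
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x, sublayer):
        """The forward logic takes the output of the previous layer or sublayer as its
           first argument and the sublayer function of this connection as its second."""

        # First normalize the input, then pass the result to the sublayer, and then
        # apply dropout to the sublayer output, which randomly disables some neurons
        # to prevent overfitting. Finally comes the add step: because of the skip
        # connection, the input x is added to the sublayer output after dropout,
        # and the sum is the output of the sublayer connection.
        return x + self.dropout(sublayer(self.norm(x)))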
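
The order of operations in forward (normalize, apply the sublayer, add the input back) can be traced with plain functions. In this sketch dropout is omitted, and center is a hypothetical, simplified stand-in for the normalization layer that only subtracts the mean.

```python
def center(vector):
    """Simplified stand-in for normalization: subtract the mean."""
    mean = sum(vector) / len(vector)
    return [v - mean for v in vector]

def sublayer_connection(x, sublayer, norm):
    """Residual connection around norm followed by sublayer: x + sublayer(norm(x))."""
    return [a + b for a, b in zip(x, sublayer(norm(x)))]

double = lambda vector: [2.0 * v for v in vector]
zero = lambda vector: [0.0 for _ in vector]

x = [1.0, 2.0, 3.0]
print(sublayer_connection(x, double, center))  # [-1.0, 2.0, 5.0]
print(sublayer_connection(x, zero, center))    # [1.0, 2.0, 3.0]: the skip path alone
```

The second call shows the purpose of the skip connection: even when the sublayer contributes nothing, the input passes through unchanged.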

Instantiation parameters:

size = 512
dropout = 0.2
head = 8
d_model = 512
Input parameters:

# Let x be the output of the positional encoder
x = pe_result
mask = Variable(torch.zeros(8, 4, 4))

# Assume the sublayer is a multi-head attention layer, and instantiate that class
self_attn = MultiHeadedAttention(head, d_model)

# Use lambda to obtain the sublayer as a function
sublayer = lambda x: self_attn(x, x, x, mask)
Call:

sc = SublayerConnection(size, dropout)
sc_result = sc(x, sublayer)
print(sc_result)
print(sc_result.shape)
Output:

tensor([[[ 14.8830, 22.4106, -31.4739, …, 21.0882, -10.0338, -0.2588],
[-25.1435, 2.9246, -16.1235, …, 10.5069, -7.1007, -3.7396],
[ 0.1374, 32.6438, 12.3680, …, -12.0251, -40.5829, 2.2297],
[-13.3123, 55.4689, 9.5420, …, -12.6622, 23.4496, 21.1531]],

    [[ 13.3533,  17.5674, -13.3354,  ...,  29.1366,  -6.4898,  35.8614],
     [-35.2286,  18.7378, -31.4337,  ...,  11.1726,  20.6372,  29.8689],
     [-30.7627,   0.0000, -57.0587,  ...,  15.0724, -10.7196, -18.6290],
     [ -2.7757, -19.6408,   0.0000,  ...,  12.7660,  21.6843, -35.4784]]],
   grad_fn=)

torch.Size([2, 4, 512])
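2.3.6 Sublayer connection summary:

What the sublayer connection is:
As the figure shows, a residual connection (skip connection) is used around each sublayer and its normalization layer. We call this whole structure a sublayer connection, meaning the sublayer together with its connections. Each encoder layer has two sublayers, and these two sublayers plus the surrounding connections form two sublayer connections.
We learned and implemented the sublayer connection class: SublayerConnection
The initializer takes size and dropout: the word embedding size and the zeroing rate.
An instance takes x and sublayer as input: the output of the previous layer and the sublayer as a function.
Its output is the result of processing by the sublayer connection.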
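2.3.7 The Encoder Layer
Learning objectives:
Understand the role of the encoder layer.
Master how the encoder layer is implemented.
The role of the encoder layer:
As the building block of the encoder, each encoder layer performs one round of feature extraction on the input, that is, one encoding step.
Diagram of the encoder layer:
[Figure: structure of the encoder layer]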
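Code analysis of the encoder layer:

# The class EncoderLayer implements the encoder layer
class EncoderLayer(nn.Module):
    def __init__(self, size, self_attn, feed_forward, dropout):
        """Initialization takes four parameters. size is the word embedding dimension
           and also serves as the size of the encoder layer. self_attn will receive an
           instance of the multi-head attention sublayer, used as self-attention.
           feed_forward will receive an instance of the feed-forward fully connected
           layer. dropout is the zeroing rate."""
        super(EncoderLayer, self).__init__()

        # First store self_attn and feed_forward.
        self.self_attn = self_attn
        self.feed_forward = feed_forward

        # As the figure shows, the encoder layer has two sublayer connections,
        # so clone them with clones.
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        # Store size
        self.size = size

    def forward(self, x, mask):
        """forward takes two inputs, x and mask: the output of the previous layer
           and the mask tensor."""
        # Follow the flow on the left of the structure diagram. First pass through the
        # first sublayer connection, which contains the multi-head self-attention
        # sublayer, then through the second, which contains the feed-forward fully
        # connected sublayer. Finally return the result.
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)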

Instantiation parameters:

size = 512
head = 8
d_model = 512
d_ff = 64
x = pe_result
dropout = 0.2
self_attn = MultiHeadedAttention(head, d_model)
ff = PositionwiseFeedForward(d_model, d_ff, dropout)
mask = Variable(torch.zeros(8, 4, 4))
Call:

el = EncoderLayer(size, self_attn, ff, dropout)
el_result = el(x, mask)
print(el_result)
print(el_result.shape)
Output:

tensor([[[ 33.6988, -30.7224, 20.9575, …, 5.2968, -48.5658, 20.0734],
[-18.1999, 34.2358, 40.3094, …, 10.1102, 58.3381, 58.4962],
[ 32.1243, 16.7921, -6.8024, …, 23.0022, -18.1463, -17.1263],
[ -9.3475, -3.3605, -55.3494, …, 43.6333, -0.1900, 0.1625]],

    [[ 32.8937, -46.2808,   8.5047,  ...,  29.1837,  22.5962, -14.4349],
     [ 21.3379,  20.0657, -31.7256,  ..., -13.4079, -44.0706,  -9.9504],
     [ 19.7478,  -1.0848,  11.8884,  ...,  -9.5794,   0.0675,  -4.7123],
     [ -6.8023, -16.1176,  20.9476,  ...,  -6.5469,  34.8391, -14.9798]]],
   grad_fn=)

torch.Size([2, 4, 512])
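2.3.7 Encoder layer: summary

We learned the role of the encoder layer:
As the building block of the encoder, each encoder layer performs one round of feature extraction on its input, that is, one encoding step.
We learned and implemented the encoder layer class: EncoderLayer
The initializer takes four arguments. size is the word-embedding dimension. self_attn is a multi-head attention instance, used as self-attention. feed_forward is a feed-forward layer instance. dropout is the dropout rate.
An instance is called with two arguments: x is the output of the previous layer and mask is the mask tensor.
Its output is the feature representation produced by the whole encoder layer.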
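2.3.8 Encoder
Learning objectives:
Understand the role of the encoder.
Be able to implement the encoder.
Role of the encoder:
The encoder performs feature extraction on the input, also called encoding, and consists of N stacked encoder layers.
Structure of the encoder:
[Figure: encoder structure]
Code walkthrough for the encoder:

The Encoder class implements the encoder.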

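class Encoder(nn.Module):
    def __init__(self, layer, N):
        """The two arguments are an encoder layer and the number of encoder layers."""
        super(Encoder, self).__init__()
        # Clone N encoder layers with clones and keep them in self.layers.
        self.layers = clones(layer, N)
        # Create a normalization layer that is applied at the very end of the encoder.
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        """Same inputs as the encoder layer: x is the previous output, mask is the mask tensor."""
        # Loop over the cloned encoder layers; each pass produces a new x,
        # so after the loop x has been processed by all N encoder layers.
        # Finally apply self.norm and return the result.
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)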

Instantiation arguments:

The first argument, layer, is an encoder layer instance, so the encoder layer's own arguments are needed first.

The sublayers of different encoder layers do not share parameters, so each object is passed in as a deep copy.

size = 512
head = 8
d_model = 512
d_ff = 64
dropout = 0.2
c = copy.deepcopy
attn = MultiHeadedAttention(head, d_model)
ff = PositionwiseFeedForward(d_model, d_ff, dropout)
layer = EncoderLayer(size, c(attn), c(ff), dropout)

N is the number of encoder layers in the encoder.

N = 8
mask = Variable(torch.zeros(8, 4, 4))
Call:

en = Encoder(layer, N)
en_result = en(x, mask)
print(en_result)
print(en_result.shape)
Output:

tensor([[[-0.2081, -0.3586, -0.2353, …, 2.5646, -0.2851, 0.0238],
[ 0.7957, -0.5481, 1.2443, …, 0.7927, 0.6404, -0.0484],
[-0.1212, 0.4320, -0.5644, …, 1.3287, -0.0935, -0.6861],
[-0.3937, -0.6150, 2.2394, …, -1.5354, 0.7981, 1.7907]],

    [[-2.3005,  0.3757,  1.0360,  ...,  1.4019,  0.6493, -0.1467],
     [ 0.5653,  0.1569,  0.4075,  ..., -0.3205,  1.4774, -0.5856],
     [-1.0555,  0.0061, -1.8165,  ..., -0.4339, -1.8780,  0.2467],
     [-2.1617, -1.5532, -1.4330,  ..., -0.9433, -0.5304, -1.7022]]],
   grad_fn=)

torch.Size([2, 4, 512])
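The reason for deep-copying each layer is easier to see with a toy object. The sketch below is an assumption-laden stand-in, not the clones function used in this chapter: ToyLayer is a hypothetical class whose weight list plays the role of a layer's parameters, and clones here returns a plain Python list.

```python
import copy


class ToyLayer:
    """Stand-in for an encoder layer; `weight` plays the role of its parameters."""

    def __init__(self, weight):
        self.weight = [weight]


def clones(module, n):
    """Return n independent deep copies of module (plain-Python sketch)."""
    return [copy.deepcopy(module) for _ in range(n)]


layer = ToyLayer(1.0)
independent = clones(layer, 3)   # three separate objects
shared = [layer] * 3             # three references to one object

independent[0].weight[0] = 9.0   # changes only the first copy
shared[0].weight[0] = 5.0        # changes the single shared object

print([l.weight[0] for l in independent])
print([l.weight[0] for l in shared])
```

With deep copies each stacked layer can learn its own parameters; with shared references every "layer" would be the same one applied N times.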
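2.3.8 Encoder: summary

We learned the role of the encoder:
The encoder performs feature extraction on the input, also called encoding, and consists of N stacked encoder layers.
We learned and implemented the encoder class: Encoder
The initializer takes two arguments, layer and N: an encoder layer and the number of encoder layers.
forward also takes two arguments, the same as the encoder layer's forward: x is the output of the previous layer and mask is the mask tensor.
The output of the Encoder class is the feature representation extracted by the Transformer encoder; it becomes part of the input to the decoder.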
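2.4 Implementing the decoder
Learning objectives
Understand the role of each component of the decoder.
Be able to implement each component of the decoder.
The decoder:
Consists of N stacked decoder layers
Each decoder layer consists of three sublayer connection structures
The first sublayer connection structure contains a multi-head self-attention sublayer, a normalization layer, and a residual connection
The second sublayer connection structure contains a multi-head attention sublayer, a normalization layer, and a residual connection
The third sublayer connection structure contains a feed-forward sublayer, a normalization layer, and a residual connection
[Figure: decoder structure]
Note:
The components of the decoder layer, such as multi-head attention, the normalization layer, the feed-forward network, and the sublayer connection structure, are implemented exactly as in the encoder, so they can be reused directly to build the decoder layer.
2.4.1 Decoder layer
Learning objectives:
Understand the role of the decoder layer.
Be able to implement the decoder layer.
Role of the decoder layer:
As the building block of the decoder, each decoder layer performs feature extraction on its input in the direction of the target, that is, one decoding step.
Implementation of the decoder layer:

The DecoderLayer class implements the decoder layer.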

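class DecoderLayer(nn.Module):
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        """The initializer takes five arguments. size is the word-embedding dimension and also the
        size of the decoder layer. self_attn is a multi-head self-attention object, so Q = K = V.
        src_attn is a multi-head attention object with Q != K = V. feed_forward is the
        feed-forward layer object, and dropout is the dropout rate.
        """
        super(DecoderLayer, self).__init__()
        # The initializer mainly stores its inputs on the instance.
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        # Following the architecture diagram, clone three sublayer connection objects.
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, source_mask, target_mask):
        """forward takes four arguments: x, the input from the previous layer; memory, the semantic
        representation produced by the encoder; and the source and target mask tensors.
        """
        # Alias memory as m for brevity.
        m = memory

        # First sublayer: self-attention, so Q, K and V are all x. The last argument is the
        # target mask. The target data must be masked because, at the moment of prediction, the
        # model has not generated it yet. For example, when the decoder is about to produce the
        # first token, the first token has already been fed in so the loss can be computed, but
        # the model must not use it, so it is masked. Likewise, when producing the second token
        # the model may use only the first token; the second and all later tokens are hidden.
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, target_mask))

        # Second sublayer: ordinary attention. The query is the input x; key and value are the
        # encoder output memory. source_mask is passed in too, not to prevent information leakage
        # but to mask attention values produced by characters that carry no meaning for the
        # result, which improves model quality and training speed.
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, source_mask))

        # Last sublayer: the feed-forward sublayer. Its output is the output of the decoder layer.
        return self.sublayer[2](x, self.feed_forward)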
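What the target mask does in the first sublayer can be shown with plain lists. This is a sketch under stated assumptions rather than the mask code from the earlier section: it assumes the convention that 1 means "may attend" and 0 means "blocked", and both function names are hypothetical.

```python
def subsequent_mask(size):
    """Entry [i][j] is 1 if position i may attend to position j (j <= i), else 0."""
    return [[1 if j <= i else 0 for j in range(size)] for i in range(size)]


def apply_mask(scores, mask, blocked=-1e9):
    """Replace blocked scores with a large negative number so softmax gives them ~0 weight."""
    return [
        [s if m == 1 else blocked for s, m in zip(score_row, mask_row)]
        for score_row, mask_row in zip(scores, mask)
    ]


mask = subsequent_mask(4)
for row in mask:
    print(row)

# Dummy attention scores for a 4-token target sequence.
scores = [[0.5] * 4 for _ in range(4)]
masked = apply_mask(scores, mask)
print(masked[0])
```

Row i of the mask describes what the decoder may look at while producing position i: everything up to and including i, and nothing after it.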

Instantiation arguments:

The arguments are similar to those of the encoder layer, with one addition, src_attn, which is an instance of the same class as self_attn.

head = 8
size = 512
d_model = 512
d_ff = 64
dropout = 0.2
self_attn = src_attn = MultiHeadedAttention(head, d_model, dropout)

The feed-forward layer is the same as before

ff = PositionwiseFeedForward(d_model, d_ff, dropout)
Input arguments:

x is the embedded representation of the target data. It has the same form as the embedded source data, so pe_result stands in for it here.

x = pe_result

memory is the output of the encoder

memory = en_result

In practice source_mask and target_mask differ; for convenience both are set to mask here

mask = Variable(torch.zeros(8, 4, 4))
source_mask = target_mask = mask
Call:

dl = DecoderLayer(size, self_attn, src_attn, ff, dropout)
dl_result = dl(x, memory, source_mask, target_mask)
print(dl_result)
print(dl_result.shape)
Output:

tensor([[[ 1.9604e+00, 3.9288e+01, -5.2422e+01, …, 2.1041e-01,
-5.5063e+01, 1.5233e-01],
[ 1.0135e-01, -3.7779e-01, 6.5491e+01, …, 2.8062e+01,
-3.7780e+01, -3.9577e+01],
[ 1.9526e+01, -2.5741e+01, 2.6926e-01, …, -1.5316e+01,
1.4543e+00, 2.7714e+00],
[-2.1528e+01, 2.0141e+01, 2.1999e+01, …, 2.2099e+00,
-1.7267e+01, -1.6687e+01]],

    [[ 6.7259e+00, -2.6918e+01,  1.1807e+01,  ..., -3.6453e+01,
      -2.9231e+01,  1.1288e+01],
     [ 7.7484e+01, -5.0572e-01, -1.3096e+01,  ...,  3.6302e-01,
       1.9907e+01, -1.2160e+00],
     [ 2.6703e+01,  4.4737e+01, -3.1590e+01,  ...,  4.1540e-03,
       5.2587e+00,  5.2382e+00],
     [ 4.7435e+01, -3.7599e-01,  5.0898e+01,  ...,  5.6361e+00,
       3.5891e+01,  1.5697e+01]]], grad_fn=)

torch.Size([2, 4, 512])
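2.4.1 Decoder layer: summary

We learned the role of the decoder layer:
As the building block of the decoder, each decoder layer performs feature extraction on its input in the direction of the target, that is, one decoding step.
We learned and implemented the decoder layer class: DecoderLayer
The initializer takes five arguments. size is the word-embedding dimension and also the size of the decoder layer. self_attn is a multi-head self-attention object, so Q = K = V. src_attn is a multi-head attention object with Q != K = V. feed_forward is the feed-forward layer object, and dropout is the dropout rate.
forward takes four arguments: x, the input from the previous layer; memory, the semantic representation produced by the encoder; and the source and target mask tensors.
The output is a feature representation shaped jointly by the encoder output and the target data.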
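2.4.2 Decoder
Learning objectives:
Understand the role of the decoder.
Be able to implement the decoder.
Role of the decoder:
Given the encoder output and the previous prediction, the decoder produces a feature representation of the next value to be predicted.
Code walkthrough for the decoder:

The Decoder class implements the decoder.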

class Decoder(nn.Module):
    def __init__(self, layer, N):
        """The two arguments are the decoder layer, layer, and the number of decoder layers, N."""
        super(Decoder, self).__init__()
        # Clone N copies of layer with clones, then create a normalization layer,
        # because the data is normalized once after passing through all decoder layers.
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, memory, source_mask, target_mask):
        """forward takes four arguments: x is the embedded target data, memory is the encoder
        output, and source_mask and target_mask are the source and target mask tensors."""

        # Pass x through every layer in turn,
        # then normalize the final result and return it.
        for layer in self.layers:
            x = layer(x, memory, source_mask, target_mask)
        return self.norm(x)

Instantiation arguments:

The decoder layer, layer, and the number of decoder layers, N

size = 512
d_model = 512
head = 8
d_ff = 64
dropout = 0.2
c = copy.deepcopy
attn = MultiHeadedAttention(head, d_model)
ff = PositionwiseFeedForward(d_model, d_ff, dropout)
layer = DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout)
N = 8
Input arguments:

The inputs are the same as for the decoder layer

x = pe_result
memory = en_result
mask = Variable(torch.zeros(8, 4, 4))
source_mask = target_mask = mask
Call:

de = Decoder(layer, N)
de_result = de(x, memory, source_mask, target_mask)
print(de_result)
print(de_result.shape)
Output:

tensor([[[ 0.9898, -0.3216, -1.2439, …, 0.7427, -0.0717, -0.0814],
[-0.7432, 0.6985, 1.5551, …, 0.5232, -0.5685, 1.3387],
[ 0.2149, 0.5274, -1.6414, …, 0.7476, 0.5082, -3.0132],
[ 0.4408, 0.9416, 0.4522, …, -0.1506, 1.5591, -0.6453]],

    [[-0.9027,  0.5874,  0.6981,  ...,  2.2899,  0.2933, -0.7508],
     [ 1.2246, -1.0856, -0.2497,  ..., -1.2377,  0.0847, -0.0221],
     [ 3.4012, -0.4181, -2.0968,  ..., -1.5427,  0.1090, -0.3882],
     [-0.1050, -0.5140, -0.6494,  ..., -0.4358, -1.2173,  0.4161]]],
   grad_fn=)

torch.Size([2, 4, 512])
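2.4.2 Decoder: summary

We learned the role of the decoder:
Given the encoder output and the previous prediction, the decoder produces a feature representation of the next value to be predicted.
We learned and implemented the decoder class: Decoder
The initializer takes two arguments: the decoder layer, layer, and the number of decoder layers, N.
forward takes four arguments: x is the embedded target data, memory is the encoder output, and source_mask and target_mask are the source and target mask tensors.
The output is the final feature representation of the decoding process.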
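2.5 Implementing the output part
Learning objectives
Understand the roles of the linear layer and softmax.
Be able to implement the linear layer and softmax.
The output part contains:
Linear layer
softmax layer
[Figure: output part]
Role of the linear layer
Applies a linear transformation to the previous output to produce an output of the required dimension, that is, it changes the dimension.
Role of the softmax layer
Scales the numbers in the last dimension into the probability range 0 to 1 so that they sum to 1.
Code walkthrough for the linear layer and softmax layer:

The nn.functional package holds the layers that only compute and have no parameters

import torch.nn.functional as F

The linear layer and the softmax computation are implemented together, because they share the goal of producing the final result

The class is therefore called Generator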

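class Generator(nn.Module):
    def __init__(self, d_model, vocab_size):
        """Two arguments: d_model is the word-embedding dimension, vocab_size is the vocabulary size."""
        super(Generator, self).__init__()
        # Instantiate the predefined linear layer from nn and keep it as self.project.
        # Its two arguments are the ones passed to the initializer: d_model and vocab_size.
        self.project = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        """The input x is the output tensor of the previous layer."""
        # Apply the linear transformation self.project to x,
        # then apply log_softmax from F over the last dimension.
        # log_softmax is used instead of softmax because of how the loss function is implemented
        # in this PyTorch version: losses such as nn.NLLLoss expect log-probabilities as input.
        # log_softmax is the logarithm of softmax; since the logarithm is monotonically
        # increasing, it does not change which entry has the highest probability.
        return F.log_softmax(self.project(x), dim=-1)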
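The claim that taking the logarithm does not change the winning entry can be checked by hand. The sketch below reimplements log-softmax with the standard library only; it is an illustration of the formula, not PyTorch's implementation, and the sample logits are made up.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


logits = [2.0, 1.0, 0.1]
log_probs = log_softmax(logits)
probs = [math.exp(v) for v in log_probs]

print(log_probs)   # all values are negative
print(sum(probs))  # exponentiating recovers probabilities that sum to about 1

# The index of the largest entry is the same before and after the transform.
best_before = max(range(len(logits)), key=lambda i: logits[i])
best_after = max(range(len(log_probs)), key=lambda i: log_probs[i])
print(best_before, best_after)
```

This also explains the negative values in the Generator output: they are log-probabilities, not probabilities.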

nn.Linear demo:

m = nn.Linear(20, 30)
input = torch.randn(128, 20)
output = m(input)
print(output.size())
torch.Size([128, 30])
Instantiation arguments:

The word-embedding dimension is 512

d_model = 512

The vocabulary size is 1000

vocab_size = 1000
Input arguments:

The input x is the output of the previous network layer; here we use the output of the decoder

x = de_result
Call:

gen = Generator(d_model, vocab_size)
gen_result = gen(x)
print(gen_result)
print(gen_result.shape)
Output:

tensor([[[-7.8098, -7.5260, -6.9244, …, -7.6340, -6.9026, -7.5232],
[-6.9093, -7.3295, -7.2972, …, -6.6221, -7.2268, -7.0772],
[-7.0263, -7.2229, -7.8533, …, -6.7307, -6.9294, -7.3042],
[-6.5045, -6.0504, -6.6241, …, -5.9063, -6.5361, -7.1484]],

    [[-7.1651, -6.0224, -7.4931,  ..., -7.9565, -8.0460, -6.6490],
     [-6.3779, -7.6133, -8.3572,  ..., -6.6565, -7.1867, -6.5112],
     [-6.4914, -6.9289, -6.2634,  ..., -6.2471, -7.5348, -6.8541],
     [-6.8651, -7.0460, -7.6239,  ..., -7.1411, -6.5496, -7.3749]]],
   grad_fn=)

torch.Size([2, 4, 1000])
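Section summary
We learned that the output part contains:

Linear layer
softmax layer
Role of the linear layer:

Applies a linear transformation to the previous output to produce an output of the required dimension, that is, it changes the dimension.
Role of the softmax layer:

Scales the numbers in the last dimension into the probability range 0 to 1 so that they sum to 1.
We learned and implemented the class for the linear layer and softmax layer: Generator

The initializer takes two arguments: d_model is the word-embedding dimension and vocab_size is the vocabulary size.
forward receives the output of the previous layer.
The result is the output of the linear layer followed by the softmax layer.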
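2.6 Building the model
Learning objectives
Be able to implement the encoder-decoder structure.
Be able to build the Transformer model.
The previous sections implemented every component; next we implement the complete encoder-decoder structure.
Transformer overall architecture:
[Figure: Transformer overall architecture]
Implementation of the encoder-decoder structure

The EncoderDecoder class implements the encoder-decoder structure.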

class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder, source_embed, target_embed, generator):
        """The initializer takes five arguments: the encoder object, the decoder object,
        the source embedding function, the target embedding function, and the generator
        object of the output part.
        """
        super(EncoderDecoder, self).__init__()
        # Store the arguments on the instance.
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = source_embed
        self.tgt_embed = target_embed
        self.generator = generator

    def forward(self, source, target, source_mask, target_mask):
        """forward takes four arguments: source is the source data, target is the target data,
        and source_mask and target_mask are the corresponding mask tensors."""

        # Pass source and source_mask to encode, then pass the result together with
        # source_mask, target, and target_mask to decode.
        return self.decode(self.encode(source, source_mask), source_mask,
                           target, target_mask)

    def encode(self, source, source_mask):
        """Encoding function; takes source and source_mask."""
        # Embed source with src_embed, then pass it with source_mask to self.encoder.
        return self.encoder(self.src_embed(source), source_mask)

    def decode(self, memory, source_mask, target, target_mask):
        """Decoding function; takes memory (the encoder output), source_mask, target, target_mask."""
        # Embed target with tgt_embed, then pass it with memory, source_mask, and
        # target_mask to self.decoder.
        return self.decoder(self.tgt_embed(target), memory, source_mask, target_mask)

Instantiation arguments

vocab_size = 1000
d_model = 512
encoder = en
decoder = de
source_embed = nn.Embedding(vocab_size, d_model)
target_embed = nn.Embedding(vocab_size, d_model)
generator = gen
Input arguments:

Assume the source and target data are the same; in practice they differ

source = target = Variable(torch.LongTensor([[100, 2, 421, 508], [491, 998, 1, 221]]))

Assume source_mask and target_mask are the same; in practice they differ

source_mask = target_mask = Variable(torch.zeros(8, 4, 4))
Call:

ed = EncoderDecoder(encoder, decoder, source_embed, target_embed, generator)
ed_result = ed(source, target, source_mask, target_mask)
print(ed_result)
print(ed_result.shape)
Output:

tensor([[[ 0.2102, -0.0826, -0.0550, …, 1.5555, 1.3025, -0.6296],
[ 0.8270, -0.5372, -0.9559, …, 0.3665, 0.4338, -0.7505],
[ 0.4956, -0.5133, -0.9323, …, 1.0773, 1.1913, -0.6240],
[ 0.5770, -0.6258, -0.4833, …, 0.1171, 1.0069, -1.9030]],

    [[-0.4355, -1.7115, -1.5685,  ..., -0.6941, -0.1878, -0.1137],
     [-0.8867, -1.2207, -1.4151,  ..., -0.9618,  0.1722, -0.9562],
     [-0.0946, -0.9012, -1.6388,  ..., -0.2604, -0.3357, -0.6436],
     [-1.1204, -1.4481, -1.5888,  ..., -0.8816, -0.6497,  0.0606]]],
   grad_fn=)

torch.Size([2, 4, 512])
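Next, we build the model used for training on top of this structure.
Code walkthrough for building the Transformer model

def make_model(source_vocab, target_vocab, N=6,
               d_model=512, d_ff=2048, head=8, dropout=0.1):
    """Build the model. The seven arguments are the source vocabulary size, the target vocabulary
    size, the number of stacked encoder and decoder layers, the word-embedding dimension, the
    inner dimension of the feed-forward network, the number of attention heads, and the
    dropout rate."""

    # Shorthand for deep copy. Many components below are deep-copied
    # so that they stay independent of one another.
    c = copy.deepcopy

    # Instantiate multi-head attention as attn.
    attn = MultiHeadedAttention(head, d_model)

    # Instantiate the feed-forward layer as ff.
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)

    # Instantiate positional encoding as position.
    position = PositionalEncoding(d_model, dropout)

    # Following the architecture diagram, the outermost module is EncoderDecoder. It holds the
    # encoder, the decoder, the source Embedding layer followed by positional encoding, the
    # target Embedding layer followed by positional encoding, and the generator.
    # Each encoder layer has one attention sublayer and one feed-forward sublayer;
    # each decoder layer has two attention sublayers and one feed-forward sublayer.
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn),
                             c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, source_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, target_vocab), c(position)),
        Generator(d_model, target_vocab))

    # With the structure in place, initialize the parameters, for example the weight matrices
    # of the linear layers. Every parameter with more than one dimension is initialized
    # from a Xavier uniform distribution.
    for p in model.parameters():
        if p.dim() > 1:
            # xavier_uniform_ is the in-place name used since PyTorch 0.4;
            # older releases call it xavier_uniform.
            nn.init.xavier_uniform_(p)
    return model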

nn.init.xavier_uniform_ demo:

The result follows the uniform distribution U(-a, a)

w = torch.empty(3, 5)
w = nn.init.xavier_uniform_(w, gain=nn.init.calculate_gain('relu'))
w
tensor([[-0.7742, 0.5413, 0.5478, -0.4806, -0.2555],
        [-0.8358, 0.4673, 0.3012, 0.3882, -0.6375],
        [ 0.4622, -0.0794, 0.1851, 0.8462, -0.3591]])
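The bound a of U(-a, a) in Xavier uniform initialization is a = gain * sqrt(6 / (fan_in + fan_out)). The sketch below computes it for the 3 x 5 tensor in the demo, using sqrt(2) as the ReLU gain. It is a standard-library illustration with hypothetical function names, and its random samples will not match the PyTorch output above.

```python
import math
import random


def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    """Bound a of U(-a, a): a = gain * sqrt(6 / (fan_in + fan_out))."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


def xavier_uniform(rows, cols, gain=1.0, seed=0):
    """Fill a rows x cols matrix (fan_out x fan_in) with samples from U(-a, a)."""
    rng = random.Random(seed)
    a = xavier_uniform_bound(fan_in=cols, fan_out=rows, gain=gain)
    return [[rng.uniform(-a, a) for _ in range(cols)] for _ in range(rows)]


relu_gain = math.sqrt(2.0)  # the gain documented for ReLU
bound = xavier_uniform_bound(fan_in=5, fan_out=3, gain=relu_gain)
w = xavier_uniform(3, 5, gain=relu_gain)

print(round(bound, 4))
for row in w:
    print([round(v, 4) for v in row])
```

For this shape the bound is sqrt(1.5), about 1.2247, which is consistent with every value in the demo output lying between -1.2247 and 1.2247.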
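Input arguments:

source_vocab = 11
target_vocab = 11
N = 6

All other arguments use their default values

Call:

if __name__ == '__main__':
    res = make_model(source_vocab, target_vocab, N)
    print(res)
Output:

The final model structure, built according to the Transformer architecture diagram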

EncoderDecoder(
(encoder): Encoder(
(layers): ModuleList(
(0): EncoderLayer(
(self_attn): MultiHeadedAttention(
(linears): ModuleList(
(0): Linear(in_features=512, out_features=512)
(1): Linear(in_features=512, out_features=512)
(2): Linear(in_features=512, out_features=512)
(3): Linear(in_features=512, out_features=512)
)
(dropout): Dropout(p=0.1)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048)
(w_2): Linear(in_features=2048, out_features=512)
(dropout): Dropout(p=0.1)
)
(sublayer): ModuleList(
(0): SublayerConnection(
(norm): LayerNorm(
)
(dropout): Dropout(p=0.1)
)
(1): SublayerConnection(
(norm): LayerNorm(
)
(dropout): Dropout(p=0.1)
)
)
)
(1): EncoderLayer(
(self_attn): MultiHeadedAttention(
(linears): ModuleList(
(0): Linear(in_features=512, out_features=512)
(1): Linear(in_features=512, out_features=512)
(2): Linear(in_features=512, out_features=512)
(3): Linear(in_features=512, out_features=512)
)
(dropout): Dropout(p=0.1)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048)
(w_2): Linear(in_features=2048, out_features=512)
(dropout): Dropout(p=0.1)
)
(sublayer): ModuleList(
(0): SublayerConnection(
(norm): LayerNorm(
)
(dropout): Dropout(p=0.1)
)
(1): SublayerConnection(
(norm): LayerNorm(
)
(dropout): Dropout(p=0.1)
)
)
)
)
(norm): LayerNorm(
)
)
(decoder): Decoder(
(layers): ModuleList(
(0): DecoderLayer(
(self_attn): MultiHeadedAttention(
(linears): ModuleList(
(0): Linear(in_features=512, out_features=512)
(1): Linear(in_features=512, out_features=512)
(2): Linear(in_features=512, out_features=512)
(3): Linear(in_features=512, out_features=512)
)
(dropout): Dropout(p=0.1)
)
(src_attn): MultiHeadedAttention(
(linears): ModuleList(
(0): Linear(in_features=512, out_features=512)
(1): Linear(in_features=512, out_features=512)
(2): Linear(in_features=512, out_features=512)
(3): Linear(in_features=512, out_features=512)
)
(dropout): Dropout(p=0.1)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048)
(w_2): Linear(in_features=2048, out_features=512)
(dropout): Dropout(p=0.1)
)
(sublayer): ModuleList(
(0): SublayerConnection(
(norm): LayerNorm(
)
(dropout): Dropout(p=0.1)
)
(1): SublayerConnection(
(norm): LayerNorm(
)
(dropout): Dropout(p=0.1)
)
(2): SublayerConnection(
(norm): LayerNorm(
)
(dropout): Dropout(p=0.1)
)
)
)
(1): DecoderLayer(
(self_attn): MultiHeadedAttention(
(linears): ModuleList(
(0): Linear(in_features=512, out_features=512)
(1): Linear(in_features=512, out_features=512)
(2): Linear(in_features=512, out_features=512)
(3): Linear(in_features=512, out_features=512)
)
(dropout): Dropout(p=0.1)
)
(src_attn): MultiHeadedAttention(
(linears): ModuleList(
(0): Linear(in_features=512, out_features=512)
(1): Linear(in_features=512, out_features=512)
(2): Linear(in_features=512, out_features=512)
(3): Linear(in_features=512, out_features=512)
)
(dropout): Dropout(p=0.1)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048)
(w_2): Linear(in_features=2048, out_features=512)
(dropout): Dropout(p=0.1)
)
(sublayer): ModuleList(
(0): SublayerConnection(
(norm): LayerNorm(
)
(dropout): Dropout(p=0.1)
)
(1): SublayerConnection(
(norm): LayerNorm(
)
(dropout): Dropout(p=0.1)
)
(2): SublayerConnection(
(norm): LayerNorm(
)
(dropout): Dropout(p=0.1)
)
)
)
)
(norm): LayerNorm(
)
)
(src_embed): Sequential(
(0): Embeddings(
(lut): Embedding(11, 512)
)
(1): PositionalEncoding(
(dropout): Dropout(p=0.1)
)
)
(tgt_embed): Sequential(
(0): Embeddings(
(lut): Embedding(11, 512)
)
(1): PositionalEncoding(
(dropout): Dropout(p=0.1)
)
)
(generator): Generator(
(proj): Linear(in_features=512, out_features=11)
)
)
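Section summary
We learned and implemented the class for the encoder-decoder structure: EncoderDecoder

The initializer takes five arguments: the encoder object, the decoder object, the source embedding function, the target embedding function, and the generator object of the output part.
The class implements three functions: forward, encode, and decode.
forward holds the main logic and takes four arguments: source is the source data, target is the target data, and source_mask and target_mask are the corresponding mask tensors.
encode is the encoding function and takes source and source_mask.
decode is the decoding function and takes memory (the encoder output), source_mask, target, and target_mask.
We learned and implemented the model-building function: make_model

It takes seven arguments: the source vocabulary size, the target vocabulary size, the number of stacked encoder and decoder layers, the word-embedding dimension, the inner dimension of the feed-forward network, the number of attention heads, and the dropout rate.
The function returns the fully built model object.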
©Copyright 2019, itcast.cn.
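Chapter 1: Using the fasttext tool
1.1 Getting to know fasttext
Learning objectives
Understand what the fasttext tool is for.
Understand the advantages of fasttext and the reasons behind them.
Be able to install fasttext.

As a toolkit widely used in NLP engineering, fasttext serves two main purposes:
Text classification
Training word vectors
Advantages of the fasttext toolkit:
As its name suggests, fasttext's main strength is fast training and prediction while keeping accuracy high.
Reasons for this advantage:
The fasttext model inside the toolkit has a very simple network structure.
When training word vectors, the fasttext model uses a hierarchical softmax structure to improve performance when the number of classes is very large.
Because the fasttext model is too simple to capture word order, it extracts n-gram features to make up for this and improve accuracy.
Installing fasttext:

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText

Install the fasttext Python package with pip

$ sudo pip install .
Verify the installation:

Python 3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import fasttext
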
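1.2 Text classification
Learning objectives
Understand what text classification is and what kinds exist.
Be able to carry out text classification with fasttext.
What text classification is
Text classification assigns documents (for example emails, posts, text messages, or product reviews) to one or more categories. Today it is mostly implemented with machine learning methods that extract classification rules from training data, so building a text classifier requires labeled data.
Kinds of text classification
Binary classification:
Each text is assigned to one of two categories, which are usually opposites, for example deciding whether a review is positive or negative.
Single-label multi-class classification:
There are several categories, and each text belongs to exactly one of them (it receives a single label), for example deciding which country a given name comes from.
Multi-label multi-class classification:
There are several categories, and each text may belong to more than one of them (it may receive several labels), for example deciding which hobbies a description relates to; one description may discuss both food and games.
The text classification process with fasttext
Step 1: Get the data
Step 2: Split into training and validation sets
Step 3: Train the model
Step 4: Predict with the model and evaluate
Step 5: Tune the model
Step 6: Save and reload the model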
Step 1: Get the data

Download the cooking dataset, a demonstration dataset provided by Facebook AI Research

$ wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz

Look at the first 10 lines of the data

$ head cooking.stackexchange.txt

__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
__label__restaurant Michelin Three Star Restaurant; but if the chef is not there
__label__knife-skills __label__dicing Without knife skills, how can I quickly and accurately dice vegetables?
__label__storage-method __label__equipment __label__bread What’s the purpose of a bread box?
__label__baking __label__food-safety __label__substitutions __label__peanuts how to seperate peanut oil from roasted peanuts at home?
__label__chocolate American equivalent for British chocolate terms
__label__baking __label__oven __label__convection Fan bake vs bake
__label__sauce __label__storage-lifetime __label__acidity __label__mayonnaise Regulation and balancing of readymade packed mayonnaise and other sauces
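About the data:
Each line of cooking.stackexchange.txt contains a list of labels followed by the corresponding document. The label list looks like "__label__sauce __label__cheese", meaning the two labels sauce and cheese. Every label starts with the prefix __label__, which is how fastText tells labels from words. The text after the labels is the document itself, for example: How much does potato starch affect a cheese sauce recipe?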
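The line format described above is simple enough to parse by hand. The sketch below is not part of fasttext; parse_line is a hypothetical helper that assumes the labels are the leading tokens of the line, as they are in this dataset.

```python
LABEL_PREFIX = "__label__"


def parse_line(line):
    """Split one dataset line into (labels, text).

    Leading tokens that start with the prefix are treated as labels;
    everything after them is the document text.
    """
    tokens = line.split()
    labels = []
    i = 0
    while i < len(tokens) and tokens[i].startswith(LABEL_PREFIX):
        labels.append(tokens[i][len(LABEL_PREFIX):])
        i += 1
    return labels, " ".join(tokens[i:])


labels, text = parse_line(
    "__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?"
)
print(labels)
print(text)
```

A helper like this is handy for checking a custom dataset before training, for example to count how many labels each line carries.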
Step 2: Split into training and validation sets

Count the data

$ wc cooking.stackexchange.txt

15404 169582 1401900 cooking.stackexchange.txt

Use the first 12404 lines as training data

$ head -n 12404 cooking.stackexchange.txt > cooking.train

Use the last 3000 lines as validation data

$ tail -n 3000 cooking.stackexchange.txt > cooking.valid
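Step 3: Train the model

The code runs in the Python interpreter

Import fasttext

import fasttext

Train a text classification model with fasttext's train_supervised method

model = fasttext.train_supervised(input="cooking.train")

Result

Read 0M words

Number of distinct words

Number of words: 14543

Number of labels

Number of labels: 735

Progress: training progress; this is the final line printed once training completes, so it is 100%

words/sec/thread: average number of words processed per second per thread

lr: current learning rate; it is 0 because training has finished

avg.loss: average loss over training

ETA: estimated remaining training time; it is 0 because training has finished

Progress: 100.0% words/sec/thread: 60162 lr: 0.000000 avg.loss: 10.056812 ETA: 0h 0m 0s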
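Step 4: Predict with the model and evaluate

Predict the label of an input text. Common sense tells us this prediction is correct, but the predicted probability is low

model.predict("Which baking dish is best to bake a banana bread ?")

The first item of the tuple is the label, the second is the corresponding probability

(('__label__baking',), array([0.06550845]))

Common sense tells us this prediction is wrong

model.predict("Why not put knives in the dishwasher?")
(('__label__food-safety',), array([0.07541209]))

To see how the model really performs, test it on the 3000-example validation set

model.test("cooking.valid")

The items of the tuple are the number of validation examples, the precision, and the recall

Both precision and recall are poor; next we will learn how to improve them.

(3000, 0.124, 0.0541)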
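Why recall sits so far below precision becomes clear when both are computed by hand. The sketch below is an illustration with made-up examples, assuming one predicted label per example: precision divides the correct predictions by the number of predictions made, while recall divides them by the number of true labels, and many examples carry several labels.

```python
def precision_recall(predictions, gold):
    """predictions: list of predicted label lists (top-k per example).
    gold: list of true label sets. Returns (precision, recall)."""
    correct = sum(len(set(p) & g) for p, g in zip(predictions, gold))
    n_predicted = sum(len(p) for p in predictions)
    n_gold = sum(len(g) for g in gold)
    return correct / n_predicted, correct / n_gold


# Three made-up validation examples with 2, 1 and 3 true labels.
gold = [{"baking", "bread"}, {"knives"}, {"sauce", "cheese", "storage"}]
# One predicted label per example.
top1 = [["baking"], ["food-safety"], ["sauce"]]

p, r = precision_recall(top1, gold)
print(p, r)
```

Two of the three predictions are right, so precision is 2/3, but only two of the six true labels were found, so recall is 1/3.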
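Step 5: Tune the model
Preprocessing the raw data:

Looking at the data, we find many punctuation marks attached to words and inconsistent capitalization.

These do nothing for the classification target and make it harder for the model to extract classification patterns,

so we remove or normalize them

Part of the data before processing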

__label__fish Arctic char available in North-America
__label__pasta __label__salt __label__boiling When cooking pasta in salted water how much of the salt is absorbed?
__label__coffee Emergency Coffee via Chocolate Covered Coffee Beans?
__label__cake Non-beet alternatives to standard red food dye
__label__cheese __label__lentils Could cheese “halt” the tenderness of cooking lentils?
__label__asian-cuisine __label__chili-peppers __label__kimchi __label__korean-cuisine What kind of peppers are used in Gochugaru ()?
__label__consistency Pavlova Roll failure
__label__eggs __label__bread What qualities should I be looking for when making the best French Toast?
__label__meat __label__flour __label__stews __label__braising Coating meat in flour before browning, bad idea?
__label__food-safety Raw roast beef on the edge of safe?
__label__pork __label__food-identification How do I determine the cut of a pork steak prior to purchasing it?

Simple preprocessing from the server terminal

Separate punctuation from words and convert everything to lowercase

cat cooking.stackexchange.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > cooking.preprocessed.txt
head -n 12404 cooking.preprocessed.txt > cooking.train
tail -n 3000 cooking.preprocessed.txt > cooking.valid

Part of the data after processing

__label__fish arctic char available in north-america
__label__pasta __label__salt __label__boiling when cooking pasta in salted water how much of the salt is absorbed ?
__label__coffee emergency coffee via chocolate covered coffee beans ?
__label__cake non-beet alternatives to standard red food dye
__label__cheese __label__lentils could cheese “halt” the tenderness of cooking lentils ?
__label__asian-cuisine __label__chili-peppers __label__kimchi __label__korean-cuisine what kind of peppers are used in gochugaru ( ) ?
__label__consistency pavlova roll failure
__label__eggs __label__bread what qualities should i be looking for when making the best french toast ?
__label__meat __label__flour __label__stews __label__braising coating meat in flour before browning , bad idea ?
__label__food-safety raw roast beef on the edge of safe ?
__label__pork __label__food-identification how do i determine the cut of a pork steak prior to purchasing it ?
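The same preprocessing can be written in Python. This is a sketch rather than the tutorial's method: it mirrors the punctuation set and the lowercasing of the shell pipeline, and additionally collapses repeated spaces, which the shell version does not do. The function name preprocess is hypothetical.

```python
import re

# The same punctuation marks handled by the shell pipeline.
PUNCT = re.compile(r"([.!?,'/()])")


def preprocess(line):
    """Put spaces around the listed punctuation, lowercase, and collapse whitespace."""
    spaced = PUNCT.sub(r" \1 ", line)
    return " ".join(spaced.lower().split())


print(preprocess(
    "__label__meat __label__flour Coating meat in flour before browning, bad idea?"
))
print(preprocess("What kind of peppers are used in Gochugaru ()?"))
```

The label tokens pass through unchanged because the prefix contains none of the listed punctuation marks and is already lowercase.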
Train and test on the processed data:

Retrain

model = fasttext.train_supervised(input="cooking.train")
Read 0M words

The number of distinct words drops sharply, because previously a word with a capital letter or attached punctuation counted as a new word

Number of words: 8952
Number of labels: 735

The average loss has decreased

Progress: 100.0% words/sec/thread: 65737 lr: 0.000000 avg.loss: 9.966091 ETA: 0h 0m 0s

Retest

model.test("cooking.valid")

Both precision and recall have improved

(3000, 0.161, 0.06962663975782038)
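Increase the number of epochs:

Set the epoch parameter of train_supervised to increase the number of training epochs; the default is 5

More epochs give the model more chances to adjust its classification rules on the limited data, at the cost of longer training

model = fasttext.train_supervised(input="cooking.train", epoch=25)
Read 0M words
Number of words: 8952
Number of labels: 735

The average loss keeps decreasing

Progress: 100.0% words/sec/thread: 66283 lr: 0.000000 avg.loss: 7.203885 ETA: 0h 0m 0s

model.test("cooking.valid")

Precision has risen to about 42% and recall to about 18%.

(3000, 0.4206666666666667, 0.1819230214790255)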
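Adjust the learning rate:

Set the lr parameter of train_supervised to adjust the learning rate; the default is 0.1

A larger learning rate means larger gradient descent steps, which lets the model get closer to the optimum within a limited number of iterations

model = fasttext.train_supervised(input="cooking.train", lr=1.0, epoch=25)
Read 0M words
Number of words: 8952
Number of labels: 735

The average loss keeps decreasing

Progress: 100.0% words/sec/thread: 66027 lr: 0.000000 avg.loss: 4.278283 ETA: 0h 0m 0s

model.test("cooking.valid")

Precision has risen to about 47.6% and recall to about 20.6%.

(3000, 0.47633333333333333, 0.20599682860025947)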
Add n-gram features:

Set the wordNgrams parameter of train_supervised to add n-gram features; the default is 1, meaning no n-gram features

Setting it to 2 adds 2-gram features. These help the model capture relations between adjacent words and extract better classification rules, at the cost of more resources and time during training.

model = fasttext.train_supervised(input="cooking.train", lr=1.0, epoch=25, wordNgrams=2)
Read 0M words
Number of words: 8952
Number of labels: 735

The average loss keeps decreasing

Progress: 100.0% words/sec/thread: 65084 lr: 0.000000 avg.loss: 3.189422 ETA: 0h 0m 0s

model.test("cooking.valid")

Precision has risen to about 49.2% and recall to about 21.3%.

(3000, 0.49233333333333335, 0.2129162462159435)
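What wordNgrams=2 adds to the feature set can be sketched in a few lines. This illustrates the idea only and is not fasttext's implementation (which, among other things, does not store n-grams as strings); both function names are hypothetical.

```python
def word_ngrams(tokens, n):
    """Return all contiguous n-token sequences, each joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def features(text, max_n=2):
    """Unigrams plus all word n-grams up to max_n."""
    tokens = text.split()
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(word_ngrams(tokens, n))
    return feats


print(features("raw roast beef"))
print(features("raw roast beef", max_n=1))
```

With bigrams, "roast beef" becomes a feature of its own, so the model can tell it apart from sentences that merely contain "roast" and "beef" somewhere.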
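Change how the loss is computed:

As we keep adding optimizations, training becomes slower and slower

To make fasttext training more efficient and shorten training time,

set the loss parameter of train_supervised to change how the loss is computed (equivalently, the structure of the output layer); the default is a softmax layer

Here we set it to 'hs', hierarchical softmax, which changes the structure (the computation) of the output layer so that the loss is computed with lower complexity.

model = fasttext.train_supervised(input="cooking.train", lr=1.0, epoch=25, wordNgrams=2, loss='hs')
Read 0M words
Number of words: 8952
Number of labels: 735
Progress: 100.0% words/sec/thread: 1341740 lr: 0.000000 avg.loss: 2.225962 ETA: 0h 0m 0s

model.test("cooking.valid")

Precision and recall fluctuate slightly, but training time shrinks to just a few seconds

(3000, 0.483, 0.20887991927346114)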
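The lower complexity of hierarchical softmax comes from arranging the labels as leaves of a binary tree, so that scoring one label needs one decision per tree level instead of one score per label. The sketch below builds a Huffman tree over made-up label counts, on the assumption that frequent labels should sit closer to the root; it illustrates the idea only and is not fasttext's code.

```python
import heapq
import itertools
import math


def huffman_depths(counts):
    """Return the depth of each label's leaf in a Huffman tree built from label counts."""
    tie = itertools.count()  # unique tie-breaker so heap entries never compare lists
    heap = [(c, next(tie), [i]) for i, c in enumerate(counts)]
    heapq.heapify(heap)
    depths = [0] * len(counts)
    while len(heap) > 1:
        c1, _, leaves1 = heapq.heappop(heap)
        c2, _, leaves2 = heapq.heappop(heap)
        for leaf in leaves1 + leaves2:
            depths[leaf] += 1  # each merge pushes these leaves one level deeper
        heapq.heappush(heap, (c1 + c2, next(tie), leaves1 + leaves2))
    return depths


# Made-up label frequencies: one very common label, several rare ones.
depths = huffman_depths([50, 20, 10, 10, 5, 5])
print(depths)

# With 735 labels, a balanced binary tree needs about this many decisions per label.
print(math.ceil(math.log2(735)))
```

The most frequent label ends up one step from the root, and even a balanced tree over 735 labels needs only about 10 binary decisions instead of 735 scores.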
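Automatic hyperparameter tuning:

Finding hyperparameters by hand is hard, because parameters may interact and different datasets need different settings,

so fasttext's autotuneValidationFile parameter can be used to tune them automatically.

autotuneValidationFile takes the path of the validation set; fasttext then uses random search on that set to look for the best hyperparameters it can find.

autotuneDuration controls how long the random search runs; the default is 300 s, and it can be lengthened or shortened as needed.

Validation set path 'cooking.valid', random search for 600 seconds

model = fasttext.train_supervised(input='cooking.train', autotuneValidationFile='cooking.valid', autotuneDuration=600)

Progress: 100.0% Trials: 38 Best score: 0.376170 ETA: 0h 0m 0s
Training again with best arguments
Read 0M words
Number of words: 8952
Number of labels: 735
Progress: 100.0% words/sec/thread: 63791 lr: 0.000000 avg.loss: 1.888165 ETA: 0h 0m 0s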
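Loss for multi-label multi-class problems in production:

For multi-label multi-class problems, 'softmax' or 'hs' is not always the best choice, because the result should contain several labels while softmax can only maximize one.

So we usually use an independent binary classifier per label as the output layer,

and the corresponding loss is 'ova', one vs all.

This change of output layer means that several binary classifiers are trained at once on the same corpus.

For binary classifiers lr should not be too large; here we set it to 0.2

model = fasttext.train_supervised(input="cooking.train", lr=0.2, epoch=25, wordNgrams=2, loss='ova')
Read 0M words
Number of words: 8952
Number of labels: 735
Progress: 100.0% words/sec/thread: 65044 lr: 0.000000 avg.loss: 7.713312 ETA: 0h 0m 0s

Predict a single example with the model and look at its output.

The parameter k sets how many labels the model returns; the default is 1, and -1 here means as many as possible.

The parameter threshold is the probability cutoff for returned labels; 0.5 means only labels with probability above 0.5 are returned

model.predict("Which baking dish is best to bake a banana bread ?", k=-1, threshold=0.5)

For this input the model returns its three most likely labels

((u'__label__baking', u'__label__bananas', u'__label__bread'), array([1.00000, 0.939923, 0.592677]))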
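The difference from softmax is that each label gets its own probability, so several labels can pass the threshold at once and the probabilities need not sum to 1. The sketch below illustrates this with made-up raw scores; it assumes a sigmoid per label and is not fasttext's implementation.

```python
import math


def sigmoid(z):
    """Map a raw score to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_ova(scores, threshold=0.5):
    """scores: dict of label -> raw score. Keep labels whose probability exceeds threshold."""
    probs = {label: sigmoid(z) for label, z in scores.items()}
    kept = [(label, p) for label, p in probs.items() if p > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical raw scores for one sentence.
scores = {"baking": 6.0, "bananas": 2.7, "bread": 0.4, "knives": -3.0}
result = predict_ova(scores)
for label, p in result:
    print(label, round(p, 3))
```

Three labels pass the 0.5 threshold here and their probabilities add up to more than 1, which cannot happen with a softmax output.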
Step 6: Save and reload the model

Save the model to a chosen directory with the model's save_model method

The file model_cooking.bin will appear in that directory

model.save_model("./model_cooking.bin")

Reload the model with fasttext's load_model

model = fasttext.load_model("./model_cooking.bin")

The reloaded model is used exactly as before

model.predict("Which baking dish is best to bake a banana bread ?", k=-1, threshold=0.5)
((u'__label__baking', u'__label__bananas', u'__label__bread'), array([1.00000, 0.939923, 0.592677]))
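Section summary
We learned what text classification is:

Text classification assigns documents (for example emails, posts, text messages, or product reviews) to one or more categories. Today it is mostly implemented with machine learning methods that extract classification rules from training data, so building a text classifier requires labeled data.
Kinds of text classification:

Binary classification:
Each text is assigned to one of two categories, which are usually opposites, for example deciding whether a review is positive or negative.
Single-label multi-class classification:
There are several categories, and each text belongs to exactly one of them (it receives a single label), for example deciding which country a given name comes from.
Multi-label multi-class classification:
There are several categories, and each text may belong to more than one of them (it may receive several labels), for example deciding which hobbies a description relates to; one description may discuss both food and games.
The text classification process with fasttext:

Step 1: Get the data
Step 2: Split into training and validation sets
Step 3: Train the model
Step 4: Predict with the model and evaluate
Step 5: Tune the model
Step 6: Save and reload the model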
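1.3 Training word vectors
Learning objectives
Understand the basics of word vectors.
Be able to train word vectors with fasttext.
About word vectors:
Representing the words (or characters) of a text as vectors is the most popular approach in modern machine learning. These vectors capture relations between words well and so improve the NLP tasks built on them.
The process of training word vectors with fasttext
Step 1: Get the data
Step 2: Train word vectors
Step 3: Set model hyperparameters
Step 4: Check model quality
Step 5: Save and reload the model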
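Step 1: Get the data

Here we work with part of the English Wikipedia, about 300M in size

The corpus is already prepared and can be downloaded from Matt Mahoney's website.

First create a directory named data to hold the data

$ mkdir data

Download the zip archive with wget; it is stored in the data directory

$ wget -c http://mattmahoney.net/dc/enwik9.zip -P data

Extract it with unzip; if your server does not have unzip yet, run: yum install unzip -y

After extraction a file named enwik9 appears in the data directory

$ unzip data/enwik9.zip -d data
Look at the raw data:

$ head -10 data/enwik9

The raw data contains a lot of XML/HTML markup that we do not need

Wikipedia http://en.wikipedia.org/wiki/Main_Page MediaWiki 1.6alpha first-letter Media Special
Process the raw data:

Use the wikifil.pl script to strip the XML/HTML markup

Note: the wikifil.pl file has been provided to you

$ perl wikifil.pl data/enwik9 > data/fil9
Look at the preprocessed data:

Show the first 80 characters

head -c 80 data/fil9

The output is a sequence of words separated by spaces

anarchism originated as a term of abuse first used against early working class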
Step 2: Train word vectors

The code runs in the Python interpreter

Import fasttext

import fasttext

Train word vectors with fasttext's train_unsupervised (unsupervised training) method

Its argument is the path of the dataset file, 'data/fil9'

model = fasttext.train_unsupervised('data/fil9')

The corpus contains 124M tokens and 218316 distinct words

Read 124M words
Number of words: 218316
Number of labels: 0
Progress: 100.0% words/sec/thread: 53996 lr: 0.000000 loss: 0.734999 ETA: 0h 0m
Look up the vector of a word:

Use get_word_vector to get the vector of a given word

model.get_word_vector("the")

array([-0.03087516, 0.09221972, 0.17660329, 0.17308897, 0.12863874,
0.13912526, -0.09851588, 0.00739991, 0.37038437, -0.00845221,
…
-0.21184735, -0.05048715, -0.34571868, 0.23765688, 0.23726143],
dtype=float32)
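Step 3: Set model hyperparameters

When training word vectors, several common hyperparameters can be set to adjust the model, for example:

Unsupervised training mode: 'skipgram' or 'cbow', default 'skipgram'. In practice skipgram makes better use of subword information than cbow.

Embedding dimension dim: default 100; larger corpora usually call for a larger dimension.

Number of passes over the data epoch: default 5; with a large enough dataset fewer passes may be enough.

Learning rate lr: default 0.05; from experience, choose a value in the range [0.01, 1].

Number of threads thread: default 12; a common recommendation is to match the number of CPU cores.

model = fasttext.train_unsupervised('data/fil9', "cbow", dim=300, epoch=1, lr=0.1, thread=8)

Read 124M words
Number of words: 218316
Number of labels: 0
Progress: 100.0% words/sec/thread: 49523 lr: 0.000000 avg.loss: 1.777205 ETA: 0h 0m 0s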
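The subword information mentioned for the training modes refers to character n-grams: a word is wrapped in the boundary markers < and > and split into overlapping pieces, and words that share pieces end up with related vectors. The sketch below shows only the splitting. It assumes n-gram lengths from 3 to 6, the function name is hypothetical, and the hashing that fasttext applies afterwards is left out.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of '<word>' for every n from minn to maxn."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


print(char_ngrams("where", 3, 3))

# Words with a common stem share many of their n-grams.
shared = set(char_ngrams("sport")) & set(char_ngrams("sports"))
print(sorted(shared))
```

Shared pieces such as "spor" are one reason the nearest neighbors of a word in this kind of model often contain the word itself as a substring.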
Step 4: Check model quality

A simple way to check the quality of word vectors is to look at a word's nearest neighbors and judge subjectively whether they are related to the target word. This gives a rough estimate of how good the model is.

Looking up the nearest neighbors of "sports", we find "sportsnet", "sportscars", "sportswear" and so on.

model.get_nearest_neighbors('sports')

[(0.8414610624313354, 'sportsnet'), (0.8134572505950928, 'sport'), (0.8100415468215942, 'sportscars'), (0.8021156787872314, 'sportsground'), (0.7889881134033203, 'sportswomen'), (0.7863013744354248, 'sportsplex'), (0.7786710262298584, 'sporty'), (0.7696356177330017, 'sportscar'), (0.7619683146476746, 'sportswear'), (0.7600985765457153, 'sportin')]

Looking up the nearest neighbors of "music", we find words related to music.

model.get_nearest_neighbors('music')

[(0.8908010125160217, 'emusic'), (0.8464668393135071, 'musicmoz'), (0.8444250822067261, 'musics'), (0.8113634586334229, 'allmusic'), (0.8106718063354492, 'musices'), (0.8049437999725342, 'musicam'), (0.8004694581031799, 'musicom'), (0.7952923774719238, 'muchmusic'), (0.7852965593338013, 'musicweb'), (0.7767147421836853, 'musico')]

Looking up the nearest neighbors of "dog", we find words related to dog. Many of them share character sequences with the query word, which reflects fasttext's use of subword information.

model.get_nearest_neighbors('dog')

[(0.8456876873970032, 'catdog'), (0.7480780482292175, 'dogcow'), (0.7289096117019653, 'sleddog'), (0.7269964218139648, 'hotdog'), (0.7114801406860352, 'sheepdog'), (0.6947550773620605, 'dogo'), (0.6897546648979187, 'bodog'), (0.6621081829071045, 'maddog'), (0.6605004072189331, 'dogs'), (0.6398137211799622, 'dogpile')]
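To make the idea of "nearest" concrete, the sketch below ranks words by cosine similarity over a tiny hand-made vocabulary. It assumes that the scores returned by get_nearest_neighbors are cosine similarities; the toy vectors and the function names are made up for illustration and this is not fasttext's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbors(query, vectors, k=3):
    """Return up to k (score, word) pairs most similar to the query word, best first."""
    scored = [(cosine(vectors[query], vec), word)
              for word, vec in vectors.items() if word != query]
    scored.sort(reverse=True)
    return scored[:k]

# Made-up 2-dimensional vectors, only for demonstration.
toy = {"sports": [1.0, 0.0], "sport": [0.9, 0.1], "music": [0.0, 1.0]}
print(nearest_neighbors("sports", toy, k=2))
```

The result has the same (score, word) shape as the lists shown above, so "sport" ranks ahead of "music" for the query "sports".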
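Step 5: Save and reload the model

Save the model with save_model:

model.save_model("fil9.bin")

Load the model with fasttext.load_model:

model = fasttext.load_model("fil9.bin")
model.get_word_vector("the")

Representing the words (or characters) of a text as vectors is one of the most popular practices in modern machine learning. These vectors capture relationships in language well and thereby improve the NLP tasks built on top of them.
The process of training word vectors with the fasttext tool:

Step 1: Get the data
Step 2: Train the word vectors
Step 3: Set model hyperparameters
Step 4: Check model quality
Step 5: Save and reload the model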
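1.4 Word vector transfer
Learning objectives
Understand what word vector transfer is.
Know which transferable word vector models the fasttext tool provides.
Learn how to transfer a word vector model with fasttext.
What is word vector transfer:
Using a word vector model that has already been trained on a large corpus.
Transferable word vectors provided by fasttext:
fasttext provides transferable word vector models for 157 languages, trained on CommonCrawl and Wikipedia. They are trained in CBOW mode with 300-dimensional vectors. The models for each language are listed at: https://fasttext.cc/docs/en/crawl-vectors.html
fasttext provides transferable word vector models for 294 languages, trained on Wikipedia. They are trained in skipgram mode, also with 300-dimensional vectors. The models for each language are listed at: https://fasttext.cc/docs/en/pretrained-vectors.html
How to transfer a word vector model with fasttext
Step 1: Download the compressed bin.gz file of the word vector model
Step 2: Decompress the bin.gz file into a bin file
Step 3: Load the bin file and get word vectors
Step 4: Check the result using nearest neighbors
Step 1: Download the compressed bin.gz file of the word vector model

Here we take the Chinese word vector model trained on CommonCrawl and Wikipedia as an example.

Download the Chinese word vector model (bin.gz file):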

wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.zh.300.bin.gz
Step 2: Decompress the bin.gz file into a bin file

Decompress with gunzip to get the cc.zh.300.bin file:

gunzip cc.zh.300.bin.gz
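If gunzip is not available, the same step can be done with Python's standard library. This is a minimal sketch; gunzip_file is a hypothetical helper name, and unlike the gunzip command it leaves the original .gz file in place.

```python
import gzip
import shutil

def gunzip_file(src, dst):
    """Decompress the gzip file src into dst, streaming so the data is not held in memory."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

# Example call (assumes the archive from Step 1 is in the current directory):
# gunzip_file("cc.zh.300.bin.gz", "cc.zh.300.bin")
```

Streaming with shutil.copyfileobj matters here because the decompressed model file is several gigabytes in size.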
Step 3: Load the bin file and get word vectors

Load the model:

model = fasttext.load_model("cc.zh.300.bin")

Look at the first 100 vocabulary entries (entries in a broad sense: they can be Chinese punctuation marks or characters):

model.words[:100]
['，', '的', '。', '', '、', '是', '一', '在', '：', '了', '（', '）', "'", '和', '不', '有', '我', ',', ')', '(', '“', '”', '也', '人', '个', ':', '中', '.', '就', '他', '》', '《', '-', '你', '都', '上', '大', '！', '这', '为', '多', '与', '章', '「', '到', '」', '要', '？', '被', '而', '能', '等', '可以', '年', '；', '|', '以', '及', '之', '公司', '对', '中国', '很', '会', '小', '但', '我们', '最', '更', '/', '1', '三', '新', '自己', '可', '2', '或', '次', '好', '将', '第', '种', '她', '…', '3', '地', '對', '用', '工作', '下', '后', '由', '两', '使用', '还', '又', '您', '?', '其', '已']

Use the model to get the word vector of the noun '音乐' (music):

model.get_word_vector("音乐")
array([-6.81843981e-02, 3.84048335e-02, 4.63239700e-01, 6.11658543e-02,
9.38086119e-03, -9.63955745e-02, 1.28141120e-01, -6.51574507e-02,
…
3.13430429e-02, -6.43611327e-02, 1.68979481e-01, -1.95011273e-01],
dtype=float32)
Step 4: Check the result using nearest neighbors

Taking '音乐' (music) as an example, the returned neighbors are almost all related to music, such as 乐曲 (musical piece), 音乐会 (concert) and 声乐 (vocal music).

model.get_nearest_neighbors("音乐")
[(0.6703276634216309, '乐曲'), (0.6569967269897461, '音乐人'), (0.6565821170806885, '声乐'), (0.6557438373565674, '轻音乐'), (0.6536258459091187, '音乐家'), (0.6502416133880615, '配乐'), (0.6501686573028564, '艺术'), (0.6437276005744934, '音乐会'), (0.639589250087738, '原声'), (0.6368917226791382, '音响')]

Taking '美术' (fine arts) as an example, the returned neighbors are almost all related to fine arts, such as 艺术 (art), 绘画 (painting) and 霍廷霄 (the art director of Curse of the Golden Flower).

model.get_nearest_neighbors("美术")
[(0.724744975566864, '艺术'), (0.7165924310684204, '绘画'), (0.6741853356361389, '霍廷霄'), (0.6470299363136292, '纯艺'), (0.6335071921348572, '美术家'), (0.6304370164871216, '美院'), (0.624431312084198, '艺术类'), (0.6244068741798401, '陈浩忠'), (0.62302166223526, '美术史'), (0.621710479259491, '环艺系')]

Taking '周杰伦' (Jay Chou) as an example, the returned neighbors are almost all related to celebrities, such as 杰伦, 周董 and 陈奕迅 (Eason Chan).

model.get_nearest_neighbors("周杰伦")
[(0.6995140910148621, '杰伦'), (0.6967097520828247, '周杰倫'), (0.6859776377677917, '周董'), (0.6381043195724487, '陈奕迅'), (0.6367626190185547, '张靓颖'), (0.6313326358795166, '张韶涵'), (0.6271176338195801, '谢霆锋'), (0.6188404560089111, '周华健'), (0.6184280514717102, '林俊杰'), (0.6143589019775391, '王力宏')]
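Section summary
We learned what word vector transfer is:

Using a word vector model that has already been trained on a large corpus.
We learned which transferable word vectors fasttext provides:

fasttext provides transferable word vector models for 157 languages, trained on CommonCrawl and Wikipedia. They are trained in CBOW mode with 300-dimensional vectors. The models for each language are listed at: https://fasttext.cc/docs/en/crawl-vectors.html
fasttext provides transferable word vector models for 294 languages, trained on Wikipedia. They are trained in skipgram mode, also with 300-dimensional vectors. The models for each language are listed at: https://fasttext.cc/docs/en/pretrained-vectors.html
How to transfer a word vector model with fasttext:

Step 1: Download the compressed bin.gz file of the word vector model
Step 2: Decompress the bin.gz file into a bin file
Step 3: Load the bin file and get word vectors
Step 4: Check the result using nearest neighbors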
©Copyright 2019, itcast.cn.
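Chapter 2: Transfer learning
2.1 Transfer learning theory
Learning objectives
Understand the key concepts of transfer learning.
Learn the two ways of doing transfer.
Key concepts in transfer learning:
Pretrained model
Fine-tuning
Fine-tuning script
Pretrained model:
A pretrained model is usually a large model with a complex network structure and a large number of parameters, trained on a sufficiently large dataset. In NLP, pretrained models are often language models, because language model training is unsupervised and can therefore use large-scale corpora, and because language models underlie many typical NLP tasks such as machine translation, text generation and reading comprehension. Common pretrained models include BERT, GPT, RoBERTa and Transformer-XL.
Fine-tuning:
Given a pretrained model, change some of its parameters or add some output structure to it, then train on a small dataset so that the whole model adapts better to a specific task.
Fine-tuning script:
The code file that implements the fine-tuning process. Such a script should cover loading the pretrained model, choosing which parameters to fine-tune and changing the model structure for fine-tuning. Because fine-tuning is itself a training process, it also needs hyperparameter settings as well as a loss function and an optimizer, so a fine-tuning script usually contains the whole transfer learning process.
Notes on fine-tuning scripts:
In general, fine-tuning scripts should be written by the developers of each task type. However, the NLP task types currently studied (classification, extraction, generation) and their corresponding fine-tuning output structures are limited, and some fine-tuning approaches have been validated on many datasets, so existing standard scripts can also be used.
Two ways of doing transfer:
Use the pretrained model directly for the same task it was trained on, without adjusting parameters or model structure; such models work out of the box. This usually only suits general-purpose tasks, for example the pretrained word vector models in the fasttext toolkit. In addition, to achieve this out-of-the-box effect, many developers save the parts of a model as separate pretrained models and provide matching loading methods for specific goals.
The more mainstream approach is to exploit the pretrained model's ability to extract abstract features and then fine-tune it, updating a small portion of the parameters through training to adapt to different tasks. This approach requires a small amount of labeled data for supervised learning.
Notes on the two ways:
Using a pretrained model directly was covered in the fasttext word vector transfer section. The transfer learning practice that follows focuses on transfer learning through fine-tuning.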
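2.2 Standard datasets in NLP
Learning objectives
Understand the GLUE collection of standard NLP datasets.
Learn how to download the GLUE datasets, what their data looks like and which task type each one represents.
Introduction to GLUE:
GLUE was jointly released by New York University, the University of Washington and DeepMind (part of Google). It covers different NLP task types; as of January 2020 it includes 11 sub-task datasets and has become a standard measure of progress in NLP research.
GLUE includes the following datasets
CoLA dataset
SST-2 dataset
MRPC dataset
STS-B dataset
QQP dataset
MNLI dataset
SNLI dataset
QNLI dataset
RTE dataset
WNLI dataset
diagnostics dataset (not yet finalized officially)
How to download the GLUE datasets:
Download script: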

''' Script for downloading all GLUE data.'''
import os
import sys
import shutil
import argparse
import tempfile
import urllib.request
import zipfile

# Names of the tasks this script can download
TASKS = ["CoLA", "SST", "MRPC", "QQP", "STS", "MNLI", "SNLI", "QNLI", "RTE", "WNLI", "diagnostic"]
# Download URL for each task (for MRPC this is only the list of dev-set ids)
TASK2PATH = {"CoLA":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FCoLA.zip?alt=media&token=46d5e637-3411-4188-bc44-5809b5bfb5f4',
             "SST":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSST-2.zip?alt=media&token=aabc5f6b-e466-44a2-b9b4-cf6337f84ac8',
             "MRPC":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2Fmrpc_dev_ids.tsv?alt=media&token=ec5c0836-31d5-48f4-b431-7480817f1adc',
             "QQP":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FQQP.zip?alt=media&token=700c6acf-160d-4d89-81d1-de4191d02cb5',
             "STS":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSTS-B.zip?alt=media&token=bddb94a7-8706-4e0d-a694-1109e12273b5',
             "MNLI":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FMNLI.zip?alt=media&token=50329ea1-e339-40e2-809c-10c40afff3ce',
             "SNLI":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSNLI.zip?alt=media&token=4afcfbb2-ff0c-4b2d-a09a-dbf07926f4df',
             "QNLI": 'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FQNLIv2.zip?alt=media&token=6fdcf570-0fc5-4631-8456-9505272d1601',
             "RTE":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FRTE.zip?alt=media&token=5efa7e85-a0bb-4f19-8ea2-9e1840f077fb',
             "WNLI":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FWNLI.zip?alt=media&token=068ad0a0-ded7-4bd7-99a5-5e00222e0faf',
             "diagnostic":'https://storage.googleapis.com/mtl-sentence-representations.appspot.com/tsvsWithoutLabels%2FAX.tsv?GoogleAccessId=firebase-adminsdk-0khhl@mtl-sentence-representations.iam.gserviceaccount.com&Expires=2498860800&Signature=DuQ2CSPt2Yfre0C%2BiISrVYrIFaZH1Lc7hBVZDD4ZyR7fZYOMNOUGpi8QxBmTNOrNPjR3z1cggo7WXFfrgECP6FBJSsURv8Ybrue8Ypt%2FTPxbuJ0Xc2FhDi%2BarnecCBFO77RSbfuz%2Bs95hRrYhTnByqu3U%2FYZPaj3tZt5QdfpH2IUROY8LiBXoXS46LE%2FgOQc%2FKN%2BA9SoscRDYsnxHfG0IjXGwHN%2Bf88q6hOmAxeNPx6moDulUF6XMUAaXCSFU%2BnRO2RDL9CapWxj%2BDl7syNyHhB7987hZ80B%2FwFkQ3MEs8auvt5XW1%2Bd4aCU7ytgM69r8JDCwibfhZxpaa4gd50QXQ%3D%3D'}

# Raw MRPC train/test files
MRPC_TRAIN = 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt'
MRPC_TEST = 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt'

def download_and_extract(task, data_dir):
    # Download the zip archive of a task, unpack it into data_dir, then delete the archive
    print("Downloading and extracting %s..." % task)
    data_file = "%s.zip" % task
    urllib.request.urlretrieve(TASK2PATH[task], data_file)
    with zipfile.ZipFile(data_file) as zip_ref:
        zip_ref.extractall(data_dir)
    os.remove(data_file)
    print("\tCompleted!")

def format_mrpc(data_dir, path_to_data):
    # Build train.tsv, dev.tsv and test.tsv for MRPC from the raw files
    print("Processing MRPC...")
    mrpc_dir = os.path.join(data_dir, "MRPC")
    if not os.path.isdir(mrpc_dir):
        os.mkdir(mrpc_dir)
    if path_to_data:
        mrpc_train_file = os.path.join(path_to_data, "msr_paraphrase_train.txt")
        mrpc_test_file = os.path.join(path_to_data, "msr_paraphrase_test.txt")
    else:
        print("Local MRPC data not specified, downloading data from %s" % MRPC_TRAIN)
        mrpc_train_file = os.path.join(mrpc_dir, "msr_paraphrase_train.txt")
        mrpc_test_file = os.path.join(mrpc_dir, "msr_paraphrase_test.txt")
        urllib.request.urlretrieve(MRPC_TRAIN, mrpc_train_file)
        urllib.request.urlretrieve(MRPC_TEST, mrpc_test_file)
    assert os.path.isfile(mrpc_train_file), "Train data not found at %s" % mrpc_train_file
    assert os.path.isfile(mrpc_test_file), "Test data not found at %s" % mrpc_test_file
    urllib.request.urlretrieve(TASK2PATH["MRPC"], os.path.join(mrpc_dir, "dev_ids.tsv"))

    # Read the id pairs that belong to the dev set
    dev_ids = []
    with open(os.path.join(mrpc_dir, "dev_ids.tsv"), encoding="utf8") as ids_fh:
        for row in ids_fh:
            dev_ids.append(row.strip().split('\t'))

    # Split the raw training file into train.tsv and dev.tsv
    with open(mrpc_train_file, encoding="utf8") as data_fh, \
         open(os.path.join(mrpc_dir, "train.tsv"), 'w', encoding="utf8") as train_fh, \
         open(os.path.join(mrpc_dir, "dev.tsv"), 'w', encoding="utf8") as dev_fh:
        header = data_fh.readline()
        train_fh.write(header)
        dev_fh.write(header)
        for row in data_fh:
            label, id1, id2, s1, s2 = row.strip().split('\t')
            if [id1, id2] in dev_ids:
                dev_fh.write("%s\t%s\t%s\t%s\t%s\n" % (label, id1, id2, s1, s2))
            else:
                train_fh.write("%s\t%s\t%s\t%s\t%s\n" % (label, id1, id2, s1, s2))

    # Write test.tsv without labels, replacing the label column with a running index
    with open(mrpc_test_file, encoding="utf8") as data_fh, \
            open(os.path.join(mrpc_dir, "test.tsv"), 'w', encoding="utf8") as test_fh:
        header = data_fh.readline()
        test_fh.write("index\t#1 ID\t#2 ID\t#1 String\t#2 String\n")
        for idx, row in enumerate(data_fh):
            label, id1, id2, s1, s2 = row.strip().split('\t')
            test_fh.write("%d\t%s\t%s\t%s\t%s\n" % (idx, id1, id2, s1, s2))
    print("\tCompleted!")

def download_diagnostic(data_dir):
    # The diagnostic set is a single tsv file, not a zip archive
    print("Downloading and extracting diagnostic...")
    if not os.path.isdir(os.path.join(data_dir, "diagnostic")):
        os.mkdir(os.path.join(data_dir, "diagnostic"))
    data_file = os.path.join(data_dir, "diagnostic", "diagnostic.tsv")
    urllib.request.urlretrieve(TASK2PATH["diagnostic"], data_file)
    print("\tCompleted!")
    return

def get_tasks(task_names):
    # Turn a comma-separated string such as "CoLA,SST" into a list of valid task names
    task_names = task_names.split(',')
    if "all" in task_names:
        tasks = TASKS
    else:
        tasks = []
        for task_name in task_names:
            assert task_name in TASKS, "Task %s not found!" % task_name
            tasks.append(task_name)
    return tasks

def main(arguments):
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_dir', help='directory to save data to', type=str, default='glue_data')
    parser.add_argument('--tasks', help='tasks to download data for as a comma separated string',
                        type=str, default='all')
    parser.add_argument('--path_to_mrpc', help='path to directory containing extracted MRPC data, msr_paraphrase_train.txt and msr_paraphrase_test.txt',
                        type=str, default='')
    args = parser.parse_args(arguments)

    if not os.path.isdir(args.data_dir):
        os.mkdir(args.data_dir)
    tasks = get_tasks(args.tasks)

    for task in tasks:
        if task == 'MRPC':
            format_mrpc(args.data_dir, args.path_to_mrpc)
        elif task == 'diagnostic':
            download_diagnostic(args.data_dir)
        else:
            download_and_extract(task, args.data_dir)

if __name__ == '__main__':
    sys.exit(main(sys.argv[1:]))
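Run the script to download all datasets:

Assume you have copied the code above into a file named download_glue_data.py.

Running this Python script creates a glue_data folder (the default value of data_dir) in the current directory: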

python download_glue_data.py
Output:

Downloading and extracting CoLA...
	Completed!
Downloading and extracting SST...
	Completed!
Processing MRPC...
Local MRPC data not specified, downloading data from https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt
	Completed!
Downloading and extracting QQP...
	Completed!
Downloading and extracting STS...
	Completed!
Downloading and extracting MNLI...
	Completed!
Downloading and extracting SNLI...
	Completed!
Downloading and extracting QNLI...
	Completed!
Downloading and extracting RTE...
	Completed!
Downloading and extracting WNLI...
	Completed!
Downloading and extracting diagnostic...
	Completed!
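File format and task type of each GLUE sub-dataset
CoLA dataset file layout

CoLA/
- dev.tsv
- original/
- test.tsv
- train.tsv
File layout notes:
The commonly used files are train.tsv, dev.tsv and test.tsv: the training, validation and test sets. train.tsv and dev.tsv share the same format and contain labels; test.tsv is unlabeled.
Format of train.tsv: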

…
gj04 1 She coughed herself awake as the leaf landed on her nose.
gj04 1 The worm wriggled onto the carpet.
gj04 1 The chocolate melted onto the carpet.
gj04 0 * The ball wriggled itself loose.
gj04 1 Bill wriggled himself loose.
bc01 1 The sinking of the ship to collect the insurance was very devious.
bc01 1 The ship’s sinking was very devious.
bc01 0 * The ship’s sinking to collect the insurance was very devious.
bc01 1 The testing of such drugs on oneself is too risky.
bc01 0 * This drug’s testing on oneself is too risky.
…
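Notes on the train.tsv format:
train.tsv has 4 columns. Column 1 (e.g. gj04, bc01) is the source of each sentence, i.e. a code for the publication. Column 2 is 0 or 1 and indicates whether the sentence is grammatically correct: 0 means incorrect, 1 means correct. Column 3 is the original authors' mark and has the same meaning as column 2: '*' marks an incorrect sentence and the column is empty otherwise. Column 4 is the sentence whose grammar is being judged.
Format of test.tsv: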

index sentence
0 Bill whistled past the house.
1 The car honked its way down the road.
2 Bill pushed Harry off the sofa.
3 the kittens yawned awake and played.
4 I demand that the more John eats, the more he pay.
5 If John eats more, keep your mouth shut tighter, OK?
6 His expectations are always lower than mine are.
7 The sooner you call, the more carefully I will word the letter.
8 The more timid he feels, the more people he interviews without asking questions of.
9 Once Janet left, Fred became a lot crazier.
…
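Notes on the test.tsv format:
test.tsv has 2 columns. Column 1 is the index of each sentence; column 2 is the sentence used for testing.
Task type of the CoLA dataset:
Binary classification
Evaluation metric: MCC (Matthews correlation coefficient, a binary classification metric used when positive and negative samples are highly imbalanced)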
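For reference, MCC can be computed from the four confusion-matrix counts as (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below is a plain-Python illustration with made-up labels; returning 0.0 when the denominator is zero is a common convention and an assumption of this sketch.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels given as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Made-up gold labels and predictions.
print(matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0]))
```

The value ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction right).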
SST-2 dataset file layout

SST-2/
- dev.tsv
- original/
- test.tsv
- train.tsv
File layout notes:
The commonly used files are train.tsv, dev.tsv and test.tsv: the training, validation and test sets. train.tsv and dev.tsv share the same format and contain labels; test.tsv is unlabeled.
Format of train.tsv:

sentence label
hide new secretions from the parental units 0
contains no wit , only labored gags 0
that loves its characters and communicates something rather beautiful about human nature 1
remains utterly satisfied to remain the same throughout 0
on the worst revenge-of-the-nerds clichés the filmmakers could dredge up 0
that 's far too tragic to merit such superficial treatment 0
demonstrates that the director of such hollywood blockbusters as patriot games can still turn out a small , personal film with an emotional wallop . 1
of saucy 1
a depressed fifteen-year-old 's suicidal poetry 0
…
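Notes on the train.tsv format:
train.tsv has 2 columns. Column 1 is a review text that carries sentiment. Column 2 is 0 or 1 and indicates whether the review is negative or positive: 0 means negative, 1 means positive.
Format of test.tsv: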

index sentence
0 uneasy mishmash of styles and genres .
1 this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor – if durable – imitation .
2 by the end of no such thing the audience , like beatrice , has a watchful affection for the monster .
3 director rob marshall went out gunning to make a great one .
4 lathan and diggs have considerable personal charm , and their screen rapport makes the old story seem new .
5 a well-made and often lovely depiction of the mysteries of friendship .
6 none of this violates the letter of behan 's book , but missing is its spirit , its ribald , full-throated humor .
7 although it bangs a very cliched drum at times , this crowd-pleaser 's fresh dialogue , energetic music , and good-natured spunk are often infectious .
8 it is not a mass-market entertainment but an uncompromising attempt by one artist to think about another .
9 this is junk food cinema at its greasiest .
…
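Notes on the test.tsv format:
test.tsv has 2 columns. Column 1 is the index of each sentence; column 2 is the sentence used for testing.
Task type of the SST-2 dataset:
Binary classification
Evaluation metric: ACC (accuracy)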
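A file in this format can be read with Python's csv module. The sketch below assumes tab-separated columns with the header row shown above; read_sst2 is a hypothetical helper name, and the in-memory sample stands in for a real train.tsv.

```python
import csv
import io

def read_sst2(fh):
    """Return a list of (sentence, label) pairs from a tab-separated file with a header row."""
    # QUOTE_NONE keeps quotation marks inside sentences as ordinary characters.
    reader = csv.DictReader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row["sentence"].strip(), int(row["label"])) for row in reader]

# Two rows taken from the sample above, joined with tabs.
sample = "sentence\tlabel\ncontains no wit , only labored gags \t0\nof saucy \t1\n"
rows = read_sst2(io.StringIO(sample))
print(rows)

# With the real data (path assumed from the download step):
# with open("glue_data/SST-2/train.tsv", encoding="utf8") as fh:
#     rows = read_sst2(fh)
```

Reading by column name rather than by position keeps the code valid for dev.tsv as well, since it shares the same header.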
MRPC dataset file layout

MRPC/
- dev.tsv
- test.tsv
- train.tsv
- dev_ids.tsv
- msr_paraphrase_test.txt
- msr_paraphrase_train.txt
File layout notes:
The commonly used files are train.tsv, dev.tsv and test.tsv: the training, validation and test sets. train.tsv and dev.tsv share the same format and contain labels; test.tsv is unlabeled.
Format of train.tsv:

Quality #1 ID #2 ID #1 String #2 String
1 702876 702977 Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence . Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .
0 2108705 2108831 Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion . Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .
1 1330381 1330521 They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added . On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale .
0 3344667 3344648 Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 . Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at A $ 4.57 .
1 1236820 1236712 The stock rose $ 2.11 , or about 11 percent , to close Friday at $ 21.51 on the New York Stock Exchange . PG & E Corp. shares jumped $ 1.63 or 8 percent to $ 21.03 on the New York Stock Exchange on Friday .
1 738533 737951 Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier . With the scandal hanging over Stewart 's company , revenue the first quarter of the year dropped 15 percent from the same period a year earlier .
0 264589 264502 The Nasdaq had a weekly gain of 17.27 , or 1.2 percent , closing at 1,520.15 on Friday . The tech-laced Nasdaq Composite .IXIC rallied 30.46 points , or 2.04 percent , to 1,520.15 .
1 579975 579810 The DVD-CCA then appealed to the state Supreme Court . The DVD CCA appealed that decision to the U.S. Supreme Court .
…
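Notes on the train.tsv format:
train.tsv has 5 columns. Column 1 is 0 or 1 and indicates whether the two sentences have the same meaning: 0 means different, 1 means the same. Columns 2 and 3 are the ids of the two sentences. Columns 4 and 5 are the two sentences of the pair.
Format of test.tsv: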

index #1 ID #2 ID #1 String #2 String
0 1089874 1089925 PCCW 's chief operating officer , Mike Butcher , and Alex Arena , the chief financial officer , will report directly to Mr So . Current Chief Operating Officer Mike Butcher and Group Chief Financial Officer Alex Arena will report to So .
1 3019446 3019327 The world 's two largest automakers said their U.S. sales declined more than predicted last month as a late summer sales frenzy caused more of an industry backlash than expected . Domestic sales at both GM and No. 2 Ford Motor Co. declined more than predicted as a late summer sales frenzy prompted a larger-than-expected industry backlash .
2 1945605 1945824 According to the federal Centers for Disease Control and Prevention ( news - web sites ) , there were 19 reported cases of measles in the United States in 2002 . The Centers for Disease Control and Prevention said there were 19 reported cases of measles in the United States in 2002 .
3 1430402 1430329 A tropical storm rapidly developed in the Gulf of Mexico Sunday and was expected to hit somewhere along the Texas or Louisiana coasts by Monday night . A tropical storm rapidly developed in the Gulf of Mexico on Sunday and could have hurricane-force winds when it hits land somewhere along the Louisiana coast Monday night .
4 3354381 3354396 The company didn 't detail the costs of the replacement and repairs . But company officials expect the costs of the replacement work to run into the millions of dollars .
5 1390995 1391183 The settling companies would also assign their possible claims against the underwriters to the investor plaintiffs , he added . Under the agreement , the settling companies will also assign their potential claims against the underwriters to the investors , he added .
6 2201401 2201285 Air Commodore Quaife said the Hornets remained on three-minute alert throughout the operation . Air Commodore John Quaife said the security operation was unprecedented .
7 2453843 2453998 A Washington County man may have the countys first human case of West Nile virus , the health department said Friday . The countys first and only human case of West Nile this year was confirmed by health officials on Sept . 8 .
…
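Notes on the test.tsv format:
test.tsv has 5 columns. Column 1 is the index of each sentence pair; the remaining columns have the same meaning as in train.tsv.
Task type of the MRPC dataset:
Sentence-pair binary classification
Evaluation metrics: ACC and F1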
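Both metrics can be computed directly from the gold labels and the predictions. The sketch below is a plain-Python illustration with made-up labels; it treats label 1 (same meaning) as the positive class, which is an assumption of this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up gold labels and predictions.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(accuracy(y_true, y_pred), f1_score(y_true, y_pred))
```

F1 is reported alongside accuracy because it stays informative when one class is much more frequent than the other.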
STS-B dataset file layout

STS-B/
- dev.tsv
- test.tsv
- train.tsv
- LICENSE.txt
- readme.txt
- original/
File layout notes:
The commonly used files are train.tsv, dev.tsv and test.tsv: the training, validation and test sets. train.tsv and dev.tsv share the same format and contain labels; test.tsv is unlabeled.
Format of train.tsv:

index genre filename year old_index source1 source2 sentence1 sentence2 score
0 main-captions MSRvid 2012test 0001 none none A plane is taking off. An air plane is taking off. 5.000
1 main-captions MSRvid 2012test 0004 none none A man is playing a large flute. A man is playing a flute. 3.800
2 main-captions MSRvid 2012test 0005 none none A man is spreading shreded cheese on a pizza. A man is spreading shredded cheese on an uncooked pizza. 3.800
3 main-captions MSRvid 2012test 0006 none none Three men are playing chess. Two men are playing chess. 2.600
4 main-captions MSRvid 2012test 0009 none none A man is playing the cello. A man seated is playing the cello. 4.250
5 main-captions MSRvid 2012test 0011 none none Some men are fighting. Two men are fighting. 4.250
6 main-captions MSRvid 2012test 0012 none none A man is smoking. A man is skating. 0.500
7 main-captions MSRvid 2012test 0013 none none The man is playing the piano. The man is playing the guitar. 1.600
8 main-captions MSRvid 2012test 0014 none none A man is playing on a guitar and singing. A woman is playing an acoustic guitar and singing. 2.200
9 main-captions MSRvid 2012test 0016 none none A person is throwing a cat on to the ceiling. A person throws a cat on the ceiling. 5.000
…
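Notes on the train.tsv format:
train.tsv has 10 columns. Column 1 is the row index. Column 2 is the source of each sentence pair, e.g. main-captions means it comes from captions. Column 3 is the name of the file the pair came from. Column 4 is the year of origin. Column 5 is the index in the original data. Columns 6 and 7 are the original sources of the two sentences. Columns 8 and 9 are the two sentences of the pair. Column 10 is the similarity score of the pair, from low to high, in the range [0, 5].
Format of test.tsv: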

index genre filename year old_index source1 source2 sentence1 sentence2
0 main-captions MSRvid 2012test 0024 none none A girl is styling her hair. A girl is brushing her hair.
1 main-captions MSRvid 2012test 0033 none none A group of men play soccer on the beach. A group of boys are playing soccer on the beach.
2 main-captions MSRvid 2012test 0045 none none One woman is measuring another woman’s ankle. A woman measures another woman’s ankle.
3 main-captions MSRvid 2012test 0063 none none A man is cutting up a cucumber. A man is slicing a cucumber.
4 main-captions MSRvid 2012test 0066 none none A man is playing a harp. A man is playing a keyboard.
5 main-captions MSRvid 2012test 0074 none none A woman is cutting onions. A woman is cutting tofu.
6 main-captions MSRvid 2012test 0076 none none A man is riding an electric bicycle. A man is riding a bicycle.
7 main-captions MSRvid 2012test 0082 none none A man is playing the drums. A man is playing the guitar.
8 main-captions MSRvid 2012test 0092 none none A man is playing guitar. A lady is playing the guitar.
9 main-captions MSRvid 2012test 0095 none none A man is playing a guitar. A man is playing a trumpet.
10 main-captions MSRvid 2012test 0096 none none A man is playing a guitar. A man is playing a trumpet.
…
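Notes on the test.tsv format:
test.tsv has 9 columns, with the same meaning as the first 9 columns of train.tsv.
Task type of the STS-B dataset:
Sentence-pair multi-class classification / sentence-pair regression
Evaluation metrics: Pearson and Spearman correlation (Pearson-Spearman Corr)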
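Pearson correlation measures linear agreement between predicted and gold scores, while Spearman correlation is Pearson correlation applied to the ranks of the scores. The sketch below is a plain-Python illustration with made-up scores; it gives tied values their average rank, and it assumes neither input is constant (otherwise the denominator would be zero).

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks of the values; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up gold similarity scores and model predictions.
gold = [5.0, 3.8, 2.6, 0.5]
pred = [4.5, 4.0, 2.0, 1.0]
print(pearson(gold, pred), spearman(gold, pred))
```

In this example the predictions preserve the ordering of the gold scores, so the Spearman value is 1 even though the Pearson value is slightly lower.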
QQP dataset file layout

QQP/
- dev.tsv
- original/
- test.tsv
- train.tsv
File layout notes:
The commonly used files are train.tsv, dev.tsv and test.tsv: the training, validation and test sets. train.tsv and dev.tsv share the same format and contain labels; test.tsv is unlabeled.
Format of train.tsv:

id qid1 qid2 question1 question2 is_duplicate
133273 213221 213222 How is the life of a math student? Could you describe your own experiences? Which level of prepration is enough for the exam jlpt5? 0
402555 536040 536041 How do I control my horny emotions? How do you control your horniness? 1
360472 364011 490273 What causes stool color to change to yellow? What can cause stool to come out as little balls? 0
150662 155721 7256 What can one do after MBBS? What do i do after my MBBS ? 1
183004 279958 279959 Where can I find a power outlet for my laptop at Melbourne Airport? Would a second airport in Sydney, Australia be needed if a high-speed rail link was created between Melbourne and Sydney? 0
119056 193387 193388 How not to feel guilty since I am Muslim and I’m conscious we won’t have sex together? I don’t beleive I am bulimic, but I force throw up atleast once a day after I eat something and feel guilty. Should I tell somebody, and if so who? 0
356863 422862 96457 How is air traffic controlled? How do you become an air traffic controller? 0
106969 147570 787 What is the best self help book you have read? Why? How did it change your life? What are the top self help books I should read? 1
…
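Notes on the train.tsv format:
train.tsv has 6 columns. Column 1 is the row index. Columns 2 and 3 are the ids of question 1 and question 2. Columns 4 and 5 are the two questions to be judged as duplicates or not. Column 6 is the label saying whether the two questions are duplicates: 0 means not duplicate, 1 means duplicate.
Format of test.tsv: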

id question1 question2
0 Would the idea of Trump and Putin in bed together scare you, given the geopolitical implications? Do you think that if Donald Trump were elected President, he would be able to restore relations with Putin and Russia as he said he could, based on the rocky relationship Putin had with Obama and Bush?
1 What are the top ten Consumer-to-Consumer E-commerce online? What are the top ten Consumer-to-Business E-commerce online?
2 Why don’t people simply ‘Google’ instead of asking questions on Quora? Why do people ask Quora questions instead of just searching google?
3 Is it safe to invest in social trade biz? Is social trade geniune?
4 If the universe is expanding then does matter also expand? If universe and space is expanding? Does that mean anything that occupies space is also expanding?
5 What is the plural of hypothesis? What is the plural of thesis?
6 What is the application form you need for launching a company? What is the application form you need for launching a company in Austria?
7 What is Big Theta? When should I use Big Theta as opposed to big O? Is O(Log n) close to O(n) or O(1)?
8 What are the health implications of accidentally eating a small quantity of aluminium foil? What are the implications of not eating vegetables?
…
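Notes on the test.tsv format:
test.tsv has 3 columns. Column 1 is the index of each row; columns 2 and 3 are the question pair used for testing.
Task type of the QQP dataset:
Sentence-pair binary classification
Evaluation metrics: ACC/F1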
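(MNLI/SNLI) dataset file layout

(MNLI/SNLI)/
- dev_matched.tsv
- dev_mismatched.tsv
- original/
- test_matched.tsv
- test_mismatched.tsv
- train.tsv
File layout notes:
The commonly used files are train.tsv (training set), dev_matched.tsv (validation set collected together with the training set), dev_mismatched.tsv (validation set not collected together with the training set), test_matched.tsv (test set collected together with the training set) and test_mismatched.tsv (test set not collected together with the training set). train.tsv, dev_matched.tsv and dev_mismatched.tsv share the same format and contain labels; test_matched.tsv and test_mismatched.tsv share the same format and are unlabeled.
Format of train.tsv: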

index promptID pairID genre sentence1_binary_parse sentence2_binary_parse sentence1_parse sentence2_parse sentence1 sentence2 label1 gold_label
0 31193 31193n government ( ( Conceptually ( cream skimming ) ) ( ( has ( ( ( two ( basic dimensions ) ) - ) ( ( product and ) geography ) ) ) . ) ) ( ( ( Product and ) geography ) ( ( are ( what ( make ( cream ( skimming work ) ) ) ) ) . ) ) (ROOT (S (NP (JJ Conceptually) (NN cream) (NN skimming)) (VP (VBZ has) (NP (NP (CD two) (JJ basic) (NNS dimensions)) (: -) (NP (NN product) (CC and) (NN geography)))) (. .))) (ROOT (S (NP (NN Product) (CC and) (NN geography)) (VP (VBP are) (SBAR (WHNP (WP what)) (S (VP (VBP make) (NP (NP (NN cream)) (VP (VBG skimming) (NP (NN work)))))))) (. .))) Conceptually cream skimming has two basic dimensions - product and geography. Product and geography are what make cream skimming work. neutral neutral
1 101457 101457e telephone ( you ( ( know ( during ( ( ( the season ) and ) ( i guess ) ) ) ) ( at ( at ( ( your level ) ( uh ( you ( ( ( lose them ) ( to ( the ( next level ) ) ) ) ( if ( ( if ( they ( decide ( to ( recall ( the ( the ( parent team ) ) ) ) ) ) ) ) ( ( the Braves ) ( decide ( to ( call ( to ( ( recall ( a guy ) ) ( from ( ( triple A ) ( ( ( then ( ( a ( double ( A guy ) ) ) ( ( goes up ) ( to ( replace him ) ) ) ) ) and ) ( ( a ( single ( A guy ) ) ) ( ( goes up ) ( to ( replace him ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ( You ( ( ( ( lose ( the things ) ) ( to ( the ( following level ) ) ) ) ( if ( ( the people ) recall ) ) ) . ) ) (ROOT (S (NP (PRP you)) (VP (VBP know) (PP (IN during) (NP (NP (DT the) (NN season)) (CC and) (NP (FW i) (FW guess)))) (PP (IN at) (IN at) (NP (NP (PRP$ your) (NN level)) (SBAR (S (INTJ (UH uh)) (NP (PRP you)) (VP (VBP lose) (NP (PRP them)) (PP (TO to) (NP (DT the) (JJ next) (NN level))) (SBAR (IN if) (S (SBAR (IN if) (S (NP (PRP they)) (VP (VBP decide) (S (VP (TO to) (VP (VB recall) (NP (DT the) (DT the) (NN parent) (NN team)))))))) (NP (DT the) (NNPS Braves)) (VP (VBP decide) (S (VP (TO to) (VP (VB call) (S (VP (TO to) (VP (VB recall) (NP (DT a) (NN guy)) (PP (IN from) (NP (NP (RB triple) (DT A)) (SBAR (S (S (ADVP (RB then)) (NP (DT a) (JJ double) (NNP A) (NN guy)) (VP (VBZ goes) (PRT (RP up)) (S (VP (TO to) (VP (VB replace) (NP (PRP him))))))) (CC and) (S (NP (DT a) (JJ single) (NNP A) (NN guy)) (VP (VBZ goes) (PRT (RP up)) (S (VP (TO to) (VP (VB replace) (NP (PRP him)))))))))))))))))))))))))))) (ROOT (S (NP (PRP You)) (VP (VBP lose) (NP (DT the) (NNS things)) (PP (TO to) (NP (DT the) (JJ following) (NN level))) (SBAR (IN if) (S (NP (DT the) (NNS people)) (VP (VBP recall))))) (. 
.))) you know during the season and i guess at at your level uh you lose them to the next level if if they decide to recall the the parent team the Braves decide to call to recall a guy from triple A then a double A guy goes up to replace him and a single A guy goes up to replace him You lose the things to the following level if the people recall. entailment entailment
2 134793 134793e fiction ( ( One ( of ( our number ) ) ) ( ( will ( ( ( carry out ) ( your instructions ) ) minutely ) ) . ) ) ( ( ( A member ) ( of ( my team ) ) ) ( ( will ( ( execute ( your orders ) ) ( with ( immense precision ) ) ) ) . ) ) (ROOT (S (NP (NP (CD One)) (PP (IN of) (NP (PRP$ our) (NN number)))) (VP (MD will) (VP (VB carry) (PRT (RP out)) (NP (PRP$ your) (NNS instructions)) (ADVP (RB minutely)))) (. .))) (ROOT (S (NP (NP (DT A) (NN member)) (PP (IN of) (NP (PRP$ my) (NN team)))) (VP (MD will) (VP (VB execute) (NP (PRP$ your) (NNS orders)) (PP (IN with) (NP (JJ immense) (NN precision))))) (. .))) One of our number will carry out your instructions minutely. A member of my team will execute your orders with immense precision. entailment entailment
3 37397 37397e fiction ( ( How ( ( ( do you ) know ) ? ) ) ( ( All this ) ( ( ( is ( their information ) ) again ) . ) ) ) ( ( This information ) ( ( belongs ( to them ) ) . ) ) (ROOT (S (SBARQ (WHADVP (WRB How)) (SQ (VBP do) (NP (PRP you)) (VP (VB know))) (. ?)) (NP (PDT All) (DT this)) (VP (VBZ is) (NP (PRP$ their) (NN information)) (ADVP (RB again))) (. .))) (ROOT (S (NP (DT This) (NN information)) (VP (VBZ belongs) (PP (TO to) (NP (PRP them)))) (. .))) How do you know? All this is their information again. This information belongs to them. entailment entailment
…
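Notes on the train.tsv format:
train.tsv has 12 columns. Column 1 is the example index. Columns 2 and 3 are two different IDs of the sentence pair (promptID and pairID). Column 4 is the source (genre) of the pair. Columns 5 and 6 hold the two sentences as binary syntactic parses. Columns 7 and 8 hold the two sentences as parses with part-of-speech tags. Columns 9 and 10 hold the raw sentences. Columns 11 and 12 hold labels produced under different annotation standards; in this file they are always identical. There are three labels: neutral means the two sentences neither contradict nor entail each other, entailment means the first sentence entails the second, and contradiction means the two sentences contradict each other.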
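The column layout above can be made concrete with a small reader. This is only a sketch: the names `label1` and `gold_label` for the two label columns are assumptions (the header of train.tsv is not shown in this section), the function `read_mnli_rows` is hypothetical, and the parse columns in the sample row are shortened to the placeholder `(...)`.

```python
import csv
import io

# Column names follow the header shown for test_matched.tsv; the last two
# names (label1, gold_label) are assumed for illustration.
MNLI_TRAIN_COLUMNS = [
    "index", "promptID", "pairID", "genre",
    "sentence1_binary_parse", "sentence2_binary_parse",
    "sentence1_parse", "sentence2_parse",
    "sentence1", "sentence2", "label1", "gold_label",
]


def read_mnli_rows(handle, has_header=True):
    """Yield one dict per well-formed 12-column row of a tab-separated file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the header row
    for fields in reader:
        if len(fields) != len(MNLI_TRAIN_COLUMNS):
            continue  # skip malformed rows instead of guessing
        yield dict(zip(MNLI_TRAIN_COLUMNS, fields))


# One sample row; the four parse columns are replaced by "(...)" placeholders.
sample = "\n".join([
    "\t".join(MNLI_TRAIN_COLUMNS),
    "\t".join([
        "2", "134793", "134793e", "fiction",
        "(...)", "(...)", "(...)", "(...)",
        "One of our number will carry out your instructions minutely.",
        "A member of my team will execute your orders with immense precision.",
        "entailment", "entailment",
    ]),
]) + "\n"

rows = list(read_mnli_rows(io.StringIO(sample)))
print(rows[0]["genre"], rows[0]["gold_label"])
```

For a real file, pass an open file handle (for example `open(path, encoding="utf-8", newline="")`) instead of the in-memory sample.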
Format of test_matched.tsv:

index promptID pairID genre sentence1_binary_parse sentence2_binary_parse sentence1_parse sentence2_parse sentence1 sentence2
0 31493 31493 travel ( ( ( ( ( ( ( ( Hierbas , ) ( ans seco ) ) , ) ( ans dulce ) ) , ) and ) frigola ) ( ( ( are just ) ( ( a ( few names ) ) ( worth ( ( keeping ( a look-out ) ) for ) ) ) ) . ) ) ( Hierbas ( ( is ( ( a name ) ( worth ( ( looking out ) for ) ) ) ) . ) ) (ROOT (S (NP (NP (NNS Hierbas)) (, ,) (NP (NN ans) (NN seco)) (, ,) (NP (NN ans) (NN dulce)) (, ,) (CC and) (NP (NN frigola))) (VP (VBP are) (ADVP (RB just)) (NP (NP (DT a) (JJ few) (NNS names)) (PP (JJ worth) (S (VP (VBG keeping) (NP (DT a) (NN look-out)) (PP (IN for))))))) (. .))) (ROOT (S (NP (NNS Hierbas)) (VP (VBZ is) (NP (NP (DT a) (NN name)) (PP (JJ worth) (S (VP (VBG looking) (PRT (RP out)) (PP (IN for))))))) (. .))) Hierbas, ans seco, ans dulce, and frigola are just a few names worth keeping a look-out for. Hierbas is a name worth looking out for.
1 92164 92164 government ( ( ( The extent ) ( of ( the ( behavioral effects ) ) ) ) ( ( would ( ( depend ( in ( part ( on ( ( the structure ) ( of ( ( ( the ( individual ( account program ) ) ) and ) ( any limits ) ) ) ) ) ) ) ) ( on ( accessing ( the funds ) ) ) ) ) . ) ) ( ( Many people ) ( ( would ( be ( very ( unhappy ( to ( ( loose control ) ( over ( their ( own money ) ) ) ) ) ) ) ) ) . ) ) (ROOT (S (NP (NP (DT The) (NN extent)) (PP (IN of) (NP (DT the) (JJ behavioral) (NNS effects)))) (VP (MD would) (VP (VB depend) (PP (IN in) (NP (NP (NN part)) (PP (IN on) (NP (NP (DT the) (NN structure)) (PP (IN of) (NP (NP (DT the) (JJ individual) (NN account) (NN program)) (CC and) (NP (DT any) (NNS limits)))))))) (PP (IN on) (S (VP (VBG accessing) (NP (DT the) (NNS funds))))))) (. .))) (ROOT (S (NP (JJ Many) (NNS people)) (VP (MD would) (VP (VB be) (ADJP (RB very) (JJ unhappy) (PP (TO to) (NP (NP (JJ loose) (NN control)) (PP (IN over) (NP (PRP$ their) (JJ own) (NN money)))))))) (. .))) The extent of the behavioral effects would depend in part on the structure of the individual account program and any limits on accessing the funds. Many people would be very unhappy to loose control over their own money.
2 9662 9662 government ( ( ( Timely access ) ( to information ) ) ( ( is ( in ( ( the ( best interests ) ) ( of ( ( ( both GAO ) and ) ( the agencies ) ) ) ) ) ) . ) ) ( It ( ( ( is ( in ( ( everyone 's ) ( best interest ) ) ) ) ( to ( ( have access ) ( to ( information ( in ( a ( timely manner ) ) ) ) ) ) ) ) . ) ) (ROOT (S (NP (NP (JJ Timely) (NN access)) (PP (TO to) (NP (NN information)))) (VP (VBZ is) (PP (IN in) (NP (NP (DT the) (JJS best) (NNS interests)) (PP (IN of) (NP (NP (DT both) (NNP GAO)) (CC and) (NP (DT the) (NNS agencies))))))) (. .))) (ROOT (S (NP (PRP It)) (VP (VBZ is) (PP (IN in) (NP (NP (NN everyone) (POS 's)) (JJS best) (NN interest))) (S (VP (TO to) (VP (VB have) (NP (NN access)) (PP (TO to) (NP (NP (NN information)) (PP (IN in) (NP (DT a) (JJ timely) (NN manner))))))))) (. .))) Timely access to information is in the best interests of both GAO and the agencies. It is in everyone’s best interest to have access to information in a timely manner.
3 5991 5991 travel ( ( Based ( in ( ( the ( Auvergnat ( spa town ) ) ) ( of Vichy ) ) ) ) ( , ( ( the ( French government ) ) ( often ( ( ( ( proved ( more zealous ) ) ( than ( its masters ) ) ) ( in ( ( ( suppressing ( civil liberties ) ) and ) ( ( drawing up ) ( anti-Jewish legislation ) ) ) ) ) . ) ) ) ) ) ( ( The ( French government ) ) ( ( passed ( ( anti-Jewish laws ) ( aimed ( at ( helping ( the Nazi ) ) ) ) ) ) . ) ) (ROOT (S (PP (VBN Based) (PP (IN in) (NP (NP (DT the) (NNP Auvergnat) (NN spa) (NN town)) (PP (IN of) (NP (NNP Vichy)))))) (, ,) (NP (DT the) (JJ French) (NN government)) (ADVP (RB often)) (VP (VBD proved) (NP (JJR more) (NNS zealous)) (PP (IN than) (NP (PRP$ its) (NNS masters))) (PP (IN in) (S (VP (VP (VBG suppressing) (NP (JJ civil) (NNS liberties))) (CC and) (VP (VBG drawing) (PRT (RP up)) (NP (JJ anti-Jewish) (NN legislation))))))) (. .))) (ROOT (S (NP (DT The) (JJ French) (NN government)) (VP (VBD passed) (NP (NP (JJ anti-Jewish) (NNS laws)) (VP (VBN aimed) (PP (IN at) (S (VP (VBG helping) (NP (DT the) (JJ Nazi)))))))) (. .))) Based in the Auvergnat spa town of Vichy, the French government often proved more zealous than its masters in suppressing civil liberties and drawing up anti-Jewish legislation. The French government passed anti-Jewish laws aimed at helping the Nazi.
…
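Notes on the test_matched.tsv format:
test_matched.tsv has 10 columns, with the same meaning as the first 10 columns of train.tsv.
Task type of the MNLI/SNLI datasets:
Sentence-pair multi-class classification
Evaluation metric: accuracy (ACC)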
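As a reminder of what ACC measures, here is a minimal sketch of accuracy for a multi-class task. The function name `accuracy` and the toy predictions are made up for illustration; they are not taken from any dataset file.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy example with the three MNLI labels: 3 of the 4 predictions are right.
preds = ["entailment", "neutral", "contradiction", "neutral"]
golds = ["entailment", "neutral", "neutral", "neutral"]
acc = accuracy(preds, golds)
print(acc)
```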
File layout of the QNLI/RTE/WNLI datasets

The QNLI, RTE and WNLI datasets share essentially the same layout.

(QNLI/RTE/WNLI)/
- dev.tsv
- test.tsv
- train.tsv
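Notes on the files:
The files commonly used are train.tsv, dev.tsv and test.tsv, which are the training, validation and test sets respectively. train.tsv and dev.tsv share the same format and both carry labels; test.tsv carries no labels.
Format of train.tsv in QNLI: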

index question sentence label
0 When did the third Digimon series begin? Unlike the two seasons before it and most of the seasons that followed, Digimon Tamers takes a darker and more realistic approach to its story featuring Digimon who do not reincarnate after their deaths and more complex character development in the original Japanese. not_entailment
1 Which missile batteries often have individual launchers several kilometres from one another? When MANPADS is operated by specialists, batteries may have several dozen teams deploying separately in small sections; self-propelled air defence guns may deploy in pairs. not_entailment
2 What two things does Popper argue Tarski’s theory involves in an evaluation of truth? He bases this interpretation on the fact that examples such as the one described above refer to two things: assertions and the facts to which they refer. entailment
3 What is the name of the village 9 miles north of Calafat where the Ottoman forces attacked the Russians? On 31 December 1853, the Ottoman forces at Calafat moved against the Russian force at Chetatea or Cetate, a small village nine miles north of Calafat, and engaged them on 6 January 1854. entailment
4 What famous palace is located in London? London contains four World Heritage Sites: the Tower of London; Kew Gardens; the site comprising the Palace of Westminster, Westminster Abbey, and St Margaret’s Church; and the historic settlement of Greenwich (in which the Royal Observatory, Greenwich marks the Prime Meridian, 0° longitude, and GMT). not_entailment
5 When is the term ‘German dialects’ used in regard to the German language? When talking about the German language, the term German dialects is only used for the traditional regional varieties. entailment
6 What was the name of the island the English traded to the Dutch in return for New Amsterdam? At the end of the Second Anglo-Dutch War, the English gained New Amsterdam (New York) in North America in exchange for Dutch control of Run, an Indonesian island. entailment
7 How were the Portuguese expelled from Myanmar? From the 1720s onward, the kingdom was beset with repeated Meithei raids into Upper Myanmar and a nagging rebellion in Lan Na. not_entailment
8 What does the word ‘customer’ properly apply to? The bill also required rotation of principal maintenance inspectors and stipulated that the word “customer” properly applies to the flying public, not those entities regulated by the FAA. entailment
…
Format of train.tsv in RTE:

index sentence1 sentence2 label
0 No Weapons of Mass Destruction Found in Iraq Yet. Weapons of Mass Destruction Found in Iraq. not_entailment
1 A place of sorrow, after Pope John Paul II died, became a place of celebration, as Roman Catholic faithful gathered in downtown Chicago to mark the installation of new Pope Benedict XVI. Pope Benedict XVI is the new leader of the Roman Catholic Church. entailment
2 Herceptin was already approved to treat the sickest breast cancer patients, and the company said, Monday, it will discuss with federal regulators the possibility of prescribing the drug for more breast cancer patients. Herceptin can be used to treat breast cancer. entailment
3 Judie Vivian, chief executive at ProMedica, a medical service company that helps sustain the 2-year-old Vietnam Heart Institute in Ho Chi Minh City (formerly Saigon), said that so far about 1,500 children have received treatment. The previous name of Ho Chi Minh City was Saigon. entailment
4 A man is due in court later charged with the murder 26 years ago of a teenager whose case was the first to be featured on BBC One’s Crimewatch. Colette Aram, 16, was walking to her boyfriend’s house in Keyworth, Nottinghamshire, on 30 October 1983 when she disappeared. Her body was later found in a field close to her home. Paul Stewart Hutchinson, 50, has been charged with murder and is due before Nottingham magistrates later. Paul Stewart Hutchinson is accused of having stabbed a girl. not_entailment
5 Britain said, Friday, that it has barred cleric, Omar Bakri, from returning to the country from Lebanon, where he was released by police after being detained for 24 hours. Bakri was briefly detained, but was released. entailment
6 Nearly 4 million children who have at least one parent who entered the U.S. illegally were born in the United States and are U.S. citizens as a result, according to the study conducted by the Pew Hispanic Center. That’s about three quarters of the estimated 5.5 million children of illegal immigrants inside the United States, according to the study. About 1.8 million children of undocumented immigrants live in poverty, the study found. Three quarters of U.S. illegal immigrants have children. not_entailment
7 Like the United States, U.N. officials are also dismayed that Aristide killed a conference called by Prime Minister Robert Malval in Port-au-Prince in hopes of bringing all the feuding parties together. Aristide had Prime Minister Robert Malval murdered in Port-au-Prince. not_entailment
8 WASHINGTON – A newly declassified narrative of the Bush administration’s advice to the CIA on harsh interrogations shows that the small group of Justice Department lawyers who wrote memos authorizing controversial interrogation techniques were operating not on their own but with direction from top administration officials, including then-Vice President Dick Cheney and national security adviser Condoleezza Rice. At the same time, the narrative suggests that then-Defense Secretary Donald H. Rumsfeld and then-Secretary of State Colin Powell were largely left out of the decision-making process. Dick Cheney was the Vice President of Bush. entailment
Format of train.tsv in WNLI:

index sentence1 sentence2 label
0 I stuck a pin through a carrot. When I pulled the pin out, it had a hole. The carrot had a hole. 1
1 John couldn’t see the stage with Billy in front of him because he is so short. John is so short. 1
2 The police arrested all of the gang members. They were trying to stop the drug trade in the neighborhood. The police were trying to stop the drug trade in the neighborhood. 1
3 Steve follows Fred’s example in everything. He influences him hugely. Steve influences him hugely. 0
4 When Tatyana reached the cabin, her mother was sleeping. She was careful not to disturb her, undressing and climbing back into her berth. mother was careful not to disturb her, undressing and climbing back into her berth. 0
5 George got free tickets to the play, but he gave them to Eric, because he was particularly eager to see it. George was particularly eager to see it. 0
6 John was jogging through the park when he saw a man juggling watermelons. He was very impressive. John was very impressive. 0
7 I couldn’t put the pot on the shelf because it was too tall. The pot was too tall. 1
8 We had hoped to place copies of our newsletter on all the chairs in the auditorium, but there were simply not enough of them. There were simply not enough copies of the newsletter. 1
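Notes on the train.tsv format in QNLI/RTE/WNLI:
train.tsv has 4 columns. Column 1 is the example index; columns 2 and 3 hold the sentence pair to be judged for entailment; column 4 states whether the pair is in an entailment relation: 0/not_entailment means no entailment, 1/entailment means entailment.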
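Because QNLI and RTE use text labels while WNLI uses 0/1, it can help to normalise them to one integer scheme. The sketch below assumes tab-separated columns as described above; `LABEL_TO_ID` and `parse_pair_row` are hypothetical names introduced for illustration.

```python
# Map both label spellings described above onto 0 / 1.
LABEL_TO_ID = {"not_entailment": 0, "0": 0, "entailment": 1, "1": 1}


def parse_pair_row(line):
    """Split one tab-separated row; rows from test.tsv have no label column."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 columns, got {len(fields)}")
    index, first, second = fields[:3]
    label = LABEL_TO_ID[fields[3]] if len(fields) == 4 else None
    return {"index": int(index), "first": first, "second": second, "label": label}


rte_row = parse_pair_row(
    "0\tNo Weapons of Mass Destruction Found in Iraq Yet.\t"
    "Weapons of Mass Destruction Found in Iraq.\tnot_entailment"
)
wnli_row = parse_pair_row(
    "7\tI couldn’t put the pot on the shelf because it was too tall.\t"
    "The pot was too tall.\t1"
)
unlabeled_row = parse_pair_row(
    "5\tWhich network broadcasted Super Bowl 50 in the U.S.?\t"
    "CBS broadcast Super Bowl 50 in the U.S."
)
print(rte_row["label"], wnli_row["label"], unlabeled_row["label"])
```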
Format of test.tsv in QNLI:

index question sentence
0 What organization is devoted to Jihad against Israel? For some decades prior to the First Palestine Intifada in 1987, the Muslim Brotherhood in Palestine took a “quiescent” stance towards Israel, focusing on preaching, education and social services, and benefiting from Israel’s “indulgence” to build up a network of mosques and charitable organizations.
1 In what century was the Yarrow-Schlick-Tweedy balancing system used? In the late 19th century, the Yarrow-Schlick-Tweedy balancing ‘system’ was used on some marine triple expansion engines.
2 The largest brand of what store in the UK is located in Kingston Park? Close to Newcastle, the largest indoor shopping centre in Europe, the MetroCentre, is located in Gateshead.
3 What does the IPCC rely on for research? In principle, this means that any significant new evidence or events that change our understanding of climate science between this deadline and publication of an IPCC report cannot be included.
4 What is the principle about relating spin and space variables? Thus in the case of two fermions there is a strictly negative correlation between spatial and spin variables, whereas for two bosons (e.g. quanta of electromagnetic waves, photons) the correlation is strictly positive.
5 Which network broadcasted Super Bowl 50 in the U.S.? CBS broadcast Super Bowl 50 in the U.S., and charged an average of $5 million for a 30-second commercial during the game.
6 What did the museum acquire from the Royal College of Science? To link this to the rest of the museum, a new entrance building was constructed on the site of the former boiler house, the intended site of the Spiral, between 1978 and 1982.
7 What is the name of the old north branch of the Rhine? From Wijk bij Duurstede, the old north branch of the Rhine is called Kromme Rijn (“Bent Rhine”) past Utrecht, first Leidse Rijn (“Rhine of Leiden”) and then, Oude Rijn (“Old Rhine”).
8 What was one of Luther’s most personal writings? It remains in use today, along with Luther’s hymns and his translation of the Bible.
…
Format of test.tsv in RTE/WNLI:

index sentence1 sentence2
0 Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they came in sight. Horses ran away when Maude and Dora came in sight.
1 Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they came in sight. Horses ran away when the trains came in sight.
2 Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they came in sight. Horses ran away when the puffs came in sight.
3 Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they came in sight. Horses ran away when the roars came in sight.
4 Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they came in sight. Horses ran away when the whistles came in sight.
5 Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they came in sight. Horses ran away when the horses came in sight.
6 Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they saw a train coming. Maude and Dora saw a train coming.
7 Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they saw a train coming. The trains saw a train coming.
8 Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they saw a train coming. The puffs saw a train coming.
…
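Notes on the test.tsv format in QNLI/RTE/WNLI:
test.tsv has 3 columns. Column 1 is the index of each example; columns 2 and 3 hold the sentence pair to be judged for entailment.
Task type of the QNLI/RTE/WNLI datasets:
Sentence-pair binary classification
Evaluation metric: accuracy (ACC)
Section summary
We covered an introduction to the GLUE collection:

GLUE was released jointly by New York University, the University of Washington and Google. It covers a range of NLP task types, included 11 sub-task datasets as of January 2020, and has become a standard yardstick for progress in NLP research.
The GLUE collection contains the following datasets: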

CoLA dataset
SST-2 dataset
MRPC dataset
STS-B dataset
QQP dataset
MNLI dataset
SNLI dataset
QNLI dataset
RTE dataset
WNLI dataset
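2.3 Common pretrained models in NLP
Learning objectives
Know the pretrained models currently popular in NLP.
Learn how to load and use a pretrained model.
Pretrained models currently popular in NLP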
BERT
GPT
GPT-2
Transformer-XL
XLNet
XLM
RoBERTa
DistilBERT
ALBERT
T5
XLM-RoBERTa
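BERT and its variants:
bert-base-uncased: encoder with 12 hidden layers, 768-dimensional output, 12 self-attention heads, 110M parameters in total; trained on lowercased English text.
bert-large-uncased: encoder with 24 hidden layers, 1024-dimensional output, 16 self-attention heads, 340M parameters in total; trained on lowercased English text.
bert-base-cased: encoder with 12 hidden layers, 768-dimensional output, 12 self-attention heads, 110M parameters in total; trained on case-sensitive (cased) English text.
bert-large-cased: encoder with 24 hidden layers, 1024-dimensional output, 16 self-attention heads, 340M parameters in total; trained on case-sensitive (cased) English text.
bert-base-multilingual-uncased: encoder with 12 hidden layers, 768-dimensional output, 12 self-attention heads, 110M parameters in total; trained on lowercased text in 102 languages.
bert-large-multilingual-uncased: encoder with 24 hidden layers, 1024-dimensional output, 16 self-attention heads, 340M parameters in total; trained on lowercased text in 102 languages.
bert-base-chinese: encoder with 12 hidden layers, 768-dimensional output, 12 self-attention heads, 110M parameters in total; trained on Simplified and Traditional Chinese text.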
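To see where a figure such as "110M parameters" comes from, the count can be rebuilt from the layer count and hidden size. This is a rough sketch under stated assumptions: a vocabulary of 30522 entries, 512 positions, 2 segment types and a feed-forward size of 4x the hidden size (the values commonly associated with bert-base-uncased, not given in this section), counting embeddings, encoder layers and the pooler only. The function name `bert_parameter_count` is hypothetical.

```python
def bert_parameter_count(num_layers, hidden, vocab_size=30522,
                         max_positions=512, type_vocab=2, intermediate=None):
    """Approximate parameter count of a BERT encoder (weights plus biases)."""
    if intermediate is None:
        intermediate = 4 * hidden  # assumed feed-forward width
    layer_norm = 2 * hidden  # scale and shift vectors

    # Token, position and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, followed by one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two linear layers of the feed-forward block, followed by one LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    # Pooler: one linear layer applied to the first token's vector.
    pooler = hidden * hidden + hidden

    return embeddings + num_layers * (attention + feed_forward) + pooler


base = bert_parameter_count(num_layers=12, hidden=768)
large = bert_parameter_count(num_layers=24, hidden=1024)
print(f"base: {base / 1e6:.1f}M, large: {large / 1e6:.1f}M")
```

Under these assumptions the base configuration lands close to the 110M quoted above and the large configuration close to 340M; models with a different vocabulary, such as bert-base-chinese, will differ somewhat in the embedding term.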
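GPT:
openai-gpt: 12 hidden layers, 768-dimensional output, 12 self-attention heads, 110M parameters in total; trained by OpenAI on an English corpus.
GPT-2 and its variants:
gpt2: 12 hidden layers, 768-dimensional output, 12 self-attention heads, 117M parameters in total; trained on the OpenAI GPT-2 English corpus.
gpt2-xl: 48 hidden layers, 1600-dimensional output, 25 self-attention heads, 1558M parameters in total; trained on the large OpenAI GPT-2 English corpus.
Transformer-XL:
transfo-xl-wt103: 18 hidden layers, 1024-dimensional output, 16 self-attention heads, 257M parameters in total; trained on the wikitext-103 English corpus.
XLNet and its variants:
xlnet-base-cased: 12 hidden layers, 768-dimensional output, 12 self-attention heads, 110M parameters in total; trained on an English corpus.
xlnet-large-cased: 24 hidden layers, 1024-dimensional output, 16 self-attention heads, 340M parameters in total; trained on an English corpus.
XLM:
xlm-mlm-en-2048: encoder with 12 hidden layers, 2048-dimensional output, 16 self-attention heads; trained on English text.
RoBERTa and its variants:
roberta-base: encoder with 12 hidden layers, 768-dimensional output, 12 self-attention heads, 125M parameters in total; trained on English text.
roberta-large: encoder with 24 hidden layers, 1024-dimensional output, 16 self-attention heads, 355M parameters in total; trained on English text.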
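DistilBERT and its variants:
distilbert-base-uncased: a distilled (compressed) model based on bert-base-uncased; encoder with 6 hidden layers, 768-dimensional output, 12 self-attention heads, 66M parameters in total.
distilbert-base-multilingual-cased: a distilled (compressed) model based on bert-base-multilingual-cased; encoder with 6 hidden layers, 768-dimensional output, 12 self-attention heads, 134M parameters in total.
ALBERT:
albert-base-v1: encoder with 12 (weight-shared) hidden layers, 768-dimensional output, 12 self-attention heads, 11M parameters in total; trained on English text.
albert-base-v2: encoder with 12 (weight-shared) hidden layers, 768-dimensional output, 12 self-attention heads, 11M parameters in total; trained on English text, with more data and a longer training time than v1.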
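T5 and its variants:
t5-small: encoder with 6 hidden layers, 512-dimensional output, 8 self-attention heads, 60M parameters in total; trained on the C4 corpus.
t5-base: encoder with 12 hidden layers, 768-dimensional output, 12 self-attention heads, 220M parameters in total; trained on the C4 corpus.
t5-large: encoder with 24 hidden layers, 1024-dimensional output, 16 self-attention heads, 770M parameters in total; trained on the C4 corpus.
XLM-RoBERTa and its variants:
xlm-roberta-base: encoder with 12 hidden layers, 768-dimensional output, 8 self-attention heads, 125M parameters in total; trained on 2.5TB of text in 100 languages.
xlm-roberta-large: encoder with 24 hidden layers, 1024-dimensional output, 16 self-attention heads, 355M parameters in total; trained on 2.5TB of text in 100 languages.
Notes on the pretrained models:
All of the pretrained models and variants above are built on the Transformer. They differ only in structural details such as how neurons are connected, the number of encoder layers and the number of attention heads, and most of those choices were driven by results on standard datasets. As users we therefore do not need to study the theoretical merits of each design in depth; it is enough to try as many of the available models as practical on our own target data and keep the one that performs best.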
Section summary
Pretrained models currently popular in NLP:
BERT
GPT
GPT-2
Transformer-XL
XLNet
XLM
RoBERTa
DistilBERT
ALBERT
T5
XLM-RoBERTa
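2.4 Loading and using pretrained models
Learning objectives
Know the tool used to load and use pretrained models.
Learn the process of loading and using a pretrained model.
The tool for loading and using pretrained models
Here we use torch.hub to load and use the models.
These pretrained models are provided by huggingface, a leading NLP research and engineering team.
Note: the code below downloads resources from servers outside China. When run from inside China, the download may hang or fail with connection timeouts and similar network errors. These are network problems, not code problems; you can skip the download for now and focus on following the main logic.
Steps for loading and using a pretrained model
Step 1: Choose the pretrained model to load and install the dependencies.
Step 2: Load the tokenizer of the pretrained model.
Step 3: Load the pretrained model with or without a head.
Step 4: Run the model to obtain its output.
Step 1: Choose the pretrained model to load and install the dependencies
For the models that can be loaded, see 2.3 Common pretrained models in NLP.
Here we assume a Chinese text task, so the model to load is the Chinese BERT model: bert-base-chinese
Before loading the model, install the required dependencies: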

pip install tqdm boto3 requests regex sentencepiece sacremoses
Step 2: Load the tokenizer of the pretrained model

import torch

# Source of the pretrained models
source = 'huggingface/pytorch-transformers'

# Which part of the model to load; here, the tokenizer
part = 'tokenizer'

# Name of the pretrained model to load
model_name = 'bert-base-chinese'
tokenizer = torch.hub.load(source, part, model_name)
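Step 3: Load the pretrained model with or without a head
When loading a pretrained model we can choose a version with or without a head.
The "head" is the task-specific output layer of the model. Loading the model without a head amounts to using it to produce a feature representation of the input text.
When loading a model with a head, three kinds of head are available: modelWithLMHead (language-model head), modelForSequenceClassification (classification head) and modelForQuestionAnswering (question-answering head).
Each kind of head makes the pretrained model output a tensor of a specific shape. With the classification head, for example, the output is a tensor of shape (1, 2) that gives the classification decision.

# Load the pretrained model without a head
part = 'model'
model = torch.hub.load(source, part, model_name)

# Load the pretrained model with a language-model head
part = 'modelWithLMHead'
lm_model = torch.hub.load(source, part, model_name)

# Load the pretrained model with a classification head
part = 'modelForSequenceClassification'
classification_model = torch.hub.load(source, part, model_name)

# Load the pretrained model with a question-answering head
part = 'modelForQuestionAnswering'
qa_model = torch.hub.load(source, part, model_name)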
Step 4: Run the model to obtain its output
Output of the model without a head:

# Input Chinese text
input_text = "人生该如何起头"

# Map the text to token ids with the tokenizer
indexed_tokens = tokenizer.encode(input_text)

# Print the mapped ids
print("indexed_tokens:", indexed_tokens)

# Convert the ids to a tensor to feed the model without a head
tokens_tensor = torch.tensor([indexed_tokens])

# Run the model without a head; no gradients are needed for inference
with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor)

print("不带头的模型输出结果:", encoded_layers)

print("不带头的模型输出结果的尺寸:", encoded_layers.shape)
Output:

# Result of the tokenizer mapping; 101 and 102 are the start and end markers,

# and each number in between corresponds to one character of "人生该如何起头".

indexed_tokens: [101, 782, 4495, 6421, 1963, 862, 6629, 1928, 102]

不带头的模型输出结果: tensor([[[ 0.5421, 0.4526, -0.0179, …, 1.0447, -0.1140, 0.0068],
[-0.1343, 0.2785, 0.1602, …, -0.0345, -0.1646, -0.2186],
[ 0.9960, -0.5121, -0.6229, …, 1.4173, 0.5533, -0.2681],
…,
[ 0.0115, 0.2150, -0.0163, …, 0.6445, 0.2452, -0.3749],
[ 0.8649, 0.4337, -0.1867, …, 0.7397, -0.2636, 0.2144],
[-0.6207, 0.1668, 0.1561, …, 1.1218, -0.0985, -0.0937]]])

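# The output has shape 1x9x768: each of the 9 tokens (7 characters plus the two markers) is represented by a 768-dimensional vector.

# This encoding can feed any further custom processing, e.g. your own fine-tuning network that produces the final output.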

不带头的模型输出结果的尺寸: torch.Size([1, 9, 768])
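The relation between the ids printed above and the input characters can be checked without loading any model. The sketch below simply reuses the ids from the output; the helper `strip_special_tokens` is a hypothetical name, and the one-id-per-character pairing holds for this example as stated above, not for arbitrary text.

```python
CLS_ID, SEP_ID = 101, 102  # start and end markers seen in the output above


def strip_special_tokens(ids):
    """Drop the start/end markers and keep the ids of the actual characters."""
    return [i for i in ids if i not in (CLS_ID, SEP_ID)]


input_text = "人生该如何起头"
indexed_tokens = [101, 782, 4495, 6421, 1963, 862, 6629, 1928, 102]

content_ids = strip_special_tokens(indexed_tokens)
# Pair every character with its id; valid here because each character
# was mapped to exactly one id.
char_to_id = dict(zip(input_text, content_ids))
print(char_to_id)
```

This also explains the sequence length of 9 in the output shape: 7 characters plus the two markers.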
Output of the model with a language-model head:

# Run the pretrained model with a language-model head
with torch.no_grad():
    lm_output = lm_model(tokens_tensor)

print("带语言模型头的模型输出结果:", lm_output)

print("带语言模型头的模型输出结果的尺寸:", lm_output[0].shape)
Output:

带语言模型头的模型输出结果: (tensor([[[ -7.9706, -7.9119, -7.9317, …, -7.2174, -7.0263, -7.3746],
[ -8.2097, -8.1810, -8.0645, …, -7.2349, -6.9283, -6.9856],
[-13.7458, -13.5978, -12.6076, …, -7.6817, -9.5642, -11.9928],
…,
[ -9.0928, -8.6857, -8.4648, …, -8.2368, -7.5684, -10.2419],
[ -8.9458, -8.5784, -8.6325, …, -7.0547, -5.3288, -7.8077],
[ -8.4154, -8.5217, -8.5379, …, -6.7102, -5.9782, -7.6909]]]),)

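# The output has shape 1x9x21128: each token is now represented by a 21128-dimensional vector, one score per entry of the model vocabulary.

# As with the model without a head, this result can feed further custom processing, e.g. your own fine-tuning network that produces the final output.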

带语言模型头的模型输出结果的尺寸: torch.Size([1, 9, 21128])
Output of the model with a classification head:

# Run the pretrained model with a classification head
with torch.no_grad():
    classification_output = classification_model(tokens_tensor)

print("带分类模型头的模型输出结果:", classification_output)

print("带分类模型头的模型输出结果的尺寸:", classification_output[0].shape)
Output:

带分类模型头的模型输出结果: (tensor([[-0.0649, -0.1593]]),)

# The output has shape 1x2 and can be used directly as the output of a binary text classification problem

带分类模型头的模型输出结果的尺寸: torch.Size([1, 2])
Output of the model with a question-answering head:

# With the question-answering head, the input must be a sentence pair.
# The first sentence is a statement of fact.
# The second sentence is a question about the first sentence.
# The model returns two tensors; the index of the largest value in each
# gives the start and the end position of the answer in the text.

input_text1 = "我家的小狗是黑色的"
input_text2 = "我家的小狗是什么颜色的呢?"

# Map the two sentences to token ids
indexed_tokens = tokenizer.encode(input_text1, input_text2)
print("句子对的indexed_tokens:", indexed_tokens)

# Output: [101, 2769, 2157, 4638, 2207, 4318, 3221, 7946, 5682, 4638, 102, 2769, 2157, 4638, 2207, 4318, 3221, 784, 720, 7582, 5682, 4638, 1450, 136, 102]

# Use 0 and 1 to tell the first sentence from the second
segments_ids = [0]*11 + [1]*14

# Convert to tensors
segments_tensors = torch.tensor([segments_ids])
tokens_tensor = torch.tensor([indexed_tokens])

# Run the pretrained model with a question-answering head
with torch.no_grad():
    start_logits, end_logits = qa_model(tokens_tensor, token_type_ids=segments_tensors)

print("带问答模型头的模型输出结果:", (start_logits, end_logits))
print("带问答模型头的模型输出结果的尺寸:", (start_logits.shape, end_logits.shape))
Output:

句子对的indexed_tokens: [101, 2769, 2157, 4638, 2207, 4318, 3221, 7946, 5682, 4638, 102, 2769, 2157, 4638, 2207, 4318, 3221, 784, 720, 7582, 5682, 4638, 1450, 136, 102]

带问答模型头的模型输出结果: (tensor([[ 0.2574, -0.0293, -0.8337, -0.5135, -0.3645, -0.2216, -0.1625, -0.2768,
-0.8368, -0.2581, 0.0131, -0.1736, -0.5908, -0.4104, -0.2155, -0.0307,
-0.1639, -0.2691, -0.4640, -0.1696, -0.4943, -0.0976, -0.6693, 0.2426,
0.0131]]), tensor([[-0.3788, -0.2393, -0.5264, -0.4911, -0.7277, -0.5425, -0.6280, -0.9800,
-0.6109, -0.2379, -0.0042, -0.2309, -0.4894, -0.5438, -0.6717, -0.5371,
-0.1701, 0.0826, 0.1411, -0.1180, -0.4732, -0.1541, 0.2543, 0.2163,
-0.0042]]))

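# The output is two tensors of shape 1x25; each holds one score (logit) for every position of the combined sentence pair.

# The index of the largest value in the first tensor is the start index of the answer.

# The index of the largest value in the second tensor is the end index of the answer.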

带问答模型头的模型输出结果的尺寸: (torch.Size([1, 25]), torch.Size([1, 25]))
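The "largest value gives the position" rule can be sketched in plain Python. The logits below are invented toy values, not model output (the head loaded above has not been fine-tuned for question answering, so its real scores are not meaningful), and `best_span` is a hypothetical helper. It also restricts the end index to be at or after the start index, a common refinement that goes slightly beyond the rule stated above.

```python
def best_span(start_logits, end_logits):
    """Return (start, end) indices of the highest-scoring answer span."""
    start = max(range(len(start_logits)), key=start_logits.__getitem__)
    # Only consider end positions at or after the chosen start.
    end = max(range(start, len(end_logits)), key=end_logits.__getitem__)
    return start, end


# Tokens of the first sentence, wrapped in the start/end markers.
tokens = ["[CLS]", "我", "家", "的", "小", "狗", "是", "黑", "色", "的", "[SEP]"]

# Toy scores, made up so that the answer "黑色" stands out.
start_logits = [0.1, 0.0, 0.2, 0.1, 0.3, 0.2, 0.4, 2.5, 0.6, 0.1, 0.0]
end_logits = [0.0, 0.1, 0.1, 0.2, 0.1, 0.3, 0.2, 0.7, 2.2, 0.4, 0.1]

start, end = best_span(start_logits, end_logits)
answer = "".join(tokens[start:end + 1])
print(start, end, answer)
```

With tensors, the same idea is usually expressed with `torch.argmax` on each output.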
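Section summary
The tool for loading and using pretrained models:

Here we use torch.hub to load and use the models. These pretrained models are provided by huggingface, a leading NLP research and engineering team.
Steps for loading and using a pretrained model:

Step 1: Choose the pretrained model to load and install the dependencies.
Step 2: Load the tokenizer of the pretrained model.
Step 3: Load the pretrained model with or without a head.
Step 4: Run the model to obtain its output.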
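2.5 Transfer learning in practice
Learning objectives
Learn how to use the fine-tuning script for a given task type.
Learn how to use a model after it has been fine-tuned with the script.
Learn the two ways of doing transfer learning through fine-tuning.
The fine-tuning script for a given task type:
huggingface provides fine-tuning scripts for the task types in the GLUE collection. At their core, these scripts fine-tune the final fully connected layer of the model.
A few simple arguments select the GLUE task type (for example, CoLA is single-sentence binary classification, MRPC is sentence-pair binary classification, and STS-B is a sentence-pair task whose similarity score can be treated as multi-class classification or regression) and the pretrained model to fine-tune.
Steps for using the fine-tuning script
Step 1: Download the fine-tuning script
Step 2: Configure the script arguments
Step 3: Run it and check the result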
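Step 1: Download the fine-tuning script

# Clone the huggingface transformers repository
git clone https://github.com/huggingface/transformers.git

# Enter the transformers directory
cd transformers

# Install the Python transformers package, since the fine-tuning script is a .py file
pip install .

# The current version may differ from the one used in this course, so also run:
pip install transformers==2.3.0

# Enter the directory that holds the fine-tuning scripts and list it
cd examples
ls

# run_glue.py is the fine-tuning script for the GLUE task types

Note:
Because run_glue.py has changed between versions, copy the code from http://git.itcast.cn/Stephen/AI-key-file/blob/master/run_glue.py and use it to overwrite the original file content.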
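Step 2: Configure the script arguments
In the same directory as run_glue.py, create a file run_glue.sh with the following content:

# DATA_DIR: path of the fine-tuning data; here we use the data in glue_data
export DATA_DIR="…/…/glue_data"

# SAVE_DIR: where the model is saved; here, bert_finetuning_test under the current directory
export SAVE_DIR="./bert_finetuning_test/"

# Run the fine-tuning script with python
# --model_type: type of the model to fine-tune; choices include BERT, XLNET, XLM, roBERTa, distilBERT, ALBERT
# --model_name_or_path: the specific model or variant; we fine-tune on English data, so we choose bert-base-uncased
# --task_name: the task type, e.g. MRPC is sentence-pair binary classification
# --do_train: train with the fine-tuning script
# --do_eval: evaluate with the fine-tuning script
# --data_dir: path of the training and validation sets; train.tsv and dev.tsv under it are picked up automatically
# --max_seq_length: maximum input length; longer inputs are truncated, shorter ones are padded
# --learning_rate: learning rate
# --num_train_epochs: number of training epochs
# --output_dir $SAVE_DIR: where the trained model is saved
# --overwrite_output_dir: when training again, overwrite the previous content of the output directory

python run_glue.py \
  --model_type BERT \
  --model_name_or_path bert-base-uncased \
  --task_name MRPC \
  --do_train \
  --do_eval \
  --data_dir $DATA_DIR/MRPC/ \
  --max_seq_length 128 \
  --learning_rate 2e-5 \
  --num_train_epochs 1.0 \
  --output_dir $SAVE_DIR \
  --overwrite_output_dir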
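The "truncate if longer, pad if shorter" behaviour described for --max_seq_length can be illustrated with a small helper. This is a simplified sketch, not the script's own implementation: `pad_or_truncate` is a hypothetical function, the padding id 0 is an assumption, and the naive slice may cut off the closing 102 marker, which a real preprocessing step would take care to keep.

```python
def pad_or_truncate(token_ids, max_seq_length, pad_id=0):
    """Return (ids, attention_mask), both exactly max_seq_length long."""
    ids = list(token_ids[:max_seq_length])   # truncate if too long
    attention_mask = [1] * len(ids)          # 1 marks real tokens
    padding = max_seq_length - len(ids)      # 0 when nothing to pad
    return ids + [pad_id] * padding, attention_mask + [0] * padding


short_ids = [101, 782, 4495, 102]
padded, mask = pad_or_truncate(short_ids, max_seq_length=8)
print(padded)  # real ids followed by padding
print(mask)    # 1 for real tokens, 0 for padding

long_ids = list(range(200))
truncated, full_mask = pad_or_truncate(long_ids, max_seq_length=128)
print(len(truncated))
```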
Step 3: Run it and check the result

# Run the script with sh
sh run_glue.sh
Output:

# The evaluation results are printed at the end:

01/05/2020 23:59:53 - INFO - main - Saving features into cached file …/…/glue_data/MRPC/cached_dev_bert-base-uncased_128_mrpc
01/05/2020 23:59:53 - INFO - main - ***** Running evaluation *****
01/05/2020 23:59:53 - INFO - main - Num examples = 408
01/05/2020 23:59:53 - INFO - main - Batch size = 8
Evaluating: 100%|█| 51/51 [00:23<00:00, 2.20it/s]
01/06/2020 00:00:16 - INFO - main - ***** Eval results *****
01/06/2020 00:00:16 - INFO - main - acc = 0.7671568627450981
01/06/2020 00:00:16 - INFO - main - acc_and_f1 = 0.8073344506341863
01/06/2020 00:00:16 - INFO - main - f1 = 0.8475120385232745
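The numbers in this log are consistent with one another, which is a quick sanity check worth knowing. The sketch below assumes, based on the values themselves, that acc_and_f1 is the plain average of acc and f1, and that the number of evaluation steps is the example count divided by the batch size, rounded up.

```python
import math

# Values copied from the evaluation log above.
acc = 0.7671568627450981
f1 = 0.8475120385232745
num_examples, batch_size = 408, 8

# Assumed definition: simple mean of the two metrics.
acc_and_f1 = (acc + f1) / 2

# Number of evaluation batches, matching the "51/51" progress bar.
eval_steps = math.ceil(num_examples / batch_size)

# Number of correctly classified validation pairs implied by the accuracy.
correct = round(acc * num_examples)

print(acc_and_f1, eval_steps, correct)
```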
Contents of $SAVE_DIR:

added_tokens.json
checkpoint-450
checkpoint-400
checkpoint-350
checkpoint-200
checkpoint-300
checkpoint-250
checkpoint-200
checkpoint-150
checkpoint-100
checkpoint-50
pytorch_model.bin
training_args.bin
config.json
special_tokens_map.json
vocab.txt
eval_results.txt
tokenizer_config.json
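File descriptions:
pytorch_model.bin holds the model parameters; it can be loaded and inspected with torch.load.
training_args.bin holds the training hyperparameters, such as batch_size and epoch; it can also be inspected with torch.load.
config.json is the model configuration file (number of attention heads, number of encoder layers, and so on). It describes a standard model structure such as bert or xlnet and is normally left unchanged.
added_tokens.json records the ids of custom tokens added by code during training, i.e. the custom vocabulary added with the add_tokens method.
special_tokens_map.json is used when an added token has a special meaning, such as a separator. It stores the special tokens and their meanings, so that special tokens in the text are first mapped to their meaning; that meaning is then still mapped to an id with the add_tokens method.
checkpoint: model parameter files saved every fixed number of steps (also called checkpoint files).
Steps for using a model fine-tuned with the script
Step 1: Create an account at https://huggingface.co/join
Step 2: Log in with transformers-cli from the server terminal
Step 3: Upload the model with transformers-cli and check it
Step 4: Load the model with torch.hub and use it
Step 1: Create an account at https://huggingface.co/join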

# If the site cannot be reached for network reasons, a default account is provided

username: ItcastAI
password: ItcastAI

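Step 2: Log in with transformers-cli from the server terminal

# Log in on the server where the model was fine-tuned

# Use the username and password you just registered

# Default username: ItcastAI

# Default password: ItcastAI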

$ transformers-cli login
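Step 3: Upload the model with transformers-cli and check it

# Upload the model with the transformers-cli upload command

# Make sure the path points at the fine-tuned model

$ transformers-cli upload ./bert_finetuning_test/

# List the uploaded files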

$ transformers-cli ls

Filename LastModified ETag Size

bert_finetuning_test/added_tokens.json 2020-01-05T17:39:57.000Z “99914b932bd37a50b983c5e7c90ae93b” 2
bert_finetuning_test/checkpoint-400/config.json 2020-01-05T17:26:49.000Z “74d53ea41e5acb6d60496bc195d82a42” 684
bert_finetuning_test/checkpoint-400/training_args.bin 2020-01-05T17:26:47.000Z “b3273519c2b2b1cb2349937279880f50” 1207
bert_finetuning_test/checkpoint-450/config.json 2020-01-05T17:15:42.000Z “74d53ea41e5acb6d60496bc195d82a42” 684
bert_finetuning_test/checkpoint-450/pytorch_model.bin 2020-01-05T17:15:58.000Z “077cc0289c90b90d6b662cce104fe4ef” 437982584
bert_finetuning_test/checkpoint-450/training_args.bin 2020-01-05T17:15:40.000Z “b3273519c2b2b1cb2349937279880f50” 1207
bert_finetuning_test/config.json 2020-01-05T17:28:50.000Z “74d53ea41e5acb6d60496bc195d82a42” 684
bert_finetuning_test/eval_results.txt 2020-01-05T17:28:56.000Z “67d2d49a96afc4308d33bfcddda8a7c5” 81
bert_finetuning_test/pytorch_model.bin 2020-01-05T17:28:59.000Z “d46a8ccfb8f5ba9ecee70cef8306679e” 437982584
bert_finetuning_test/special_tokens_map.json 2020-01-05T17:28:54.000Z “8b3fb1023167bb4ab9d70708eb05f6ec” 112
bert_finetuning_test/tokenizer_config.json 2020-01-05T17:28:52.000Z “0d7f03e00ecb582be52818743b50e6af” 59
bert_finetuning_test/training_args.bin 2020-01-05T17:28:48.000Z “b3273519c2b2b1cb2349937279880f50” 1207
bert_finetuning_test/vocab.txt 2020-01-05T17:39:55.000Z “64800d5d8528ce344256daf115d4965e” 231508
Step 4: Load the model with torch.hub and use it; for details see 2.4 Loading and using pretrained models

# If you have used huggingface transformers before, clear ~/.cache first

import torch

# Source of the pretrained models
source = 'huggingface/pytorch-transformers'

# Which part of the model to load; here, the tokenizer
part = 'tokenizer'

#############################################
# Name of the pretrained model to load
# Use your own model name in the form "username/model_name"
# e.g. 'ItcastAI/bert_finetuning_test'
model_name = 'ItcastAI/bert_finetuning_test'
#############################################

tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', model_name)
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', model_name)
index = tokenizer.encode("Talk is cheap", "Please show me your code!")

# 102 is the id of the separator (end) token in the bert model
mark = 102

# Find the index of the first 102, i.e. the separator between the two sentences
k = index.index(mark)

# Segment ids of the sentence pair, made of 0s and 1s: 0 marks the first sentence, 1 the second
segments_ids = [0] * (k + 1) + [1] * (len(index) - k - 1)

# Convert to tensors
tokens_tensor = torch.tensor([index])
segments_tensors = torch.tensor([segments_ids])

# Inference only: disable gradient tracking
with torch.no_grad():
    # Run the model to get the prediction
    result = model(tokens_tensor, token_type_ids=segments_tensors)
    # Print the prediction and the tensor shape
    print(result)
    print(result[0].shape)
输出效果:

(tensor([[-0.0181, 0.0263]]),)
torch.Size([1, 2])
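The segment-id construction used above can be isolated into a small helper and checked without loading any model. This is a minimal sketch, not part of the original tutorial: `build_segment_ids` is a hypothetical name, and apart from 101 and 102 (the start and separator ids the tutorial mentions) the token ids below are arbitrary placeholders.

```python
def build_segment_ids(token_ids, sep_id=102):
    """Return 0 for every position up to and including the first separator, 1 afterwards."""
    k = token_ids.index(sep_id)  # position of the first separator
    return [0] * (k + 1) + [1] * (len(token_ids) - k - 1)


# Placeholder ids: 101 = start, 102 = separator; the other values are arbitrary.
example_ids = [101, 7, 8, 9, 102, 20, 21, 102]
print(build_segment_ids(example_ids))  # [0, 0, 0, 0, 0, 1, 1, 1]
```

The result always has the same length as the input id list, which is what `token_type_ids` requires.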
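Two types of transfer learning through fine-tuning
Type 1: Fine-tune a pretrained model with the fine-tuning script for a given task type; a predefined network with an output head produces the result.
Type 2: Load a pretrained model directly to produce feature representations of the input text, then fine-tune a custom network on top of it to produce the result.
Note: all hands-on demonstrations below use Chinese text.
Type 1 hands-on demonstration
Use the fine-tuning script for the binary text classification task type SST-2 to fine-tune a Chinese pretrained model, followed by a predefined network with a classification head. The goal is to predict the sentiment of a sentence.
Prepare a sentiment analysis corpus of Chinese hotel reviews in the same format as the SST-2 dataset; label 0 is a negative review and label 1 is a positive review.
The corpus is stored in cn_data/, a sibling directory of glue_data/; its SST-2 directory contains train.tsv and dev.tsv.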
train.tsv

sentence label
早餐不好,服务不到位,晚餐无西餐,早餐晚餐相同,房间条件不好,餐厅不分吸烟区.房间不分有无烟房. 0
去的时候 ,酒店大厅和餐厅在装修,感觉大厅有点挤.由于餐厅装修本来该享受的早饭,也没有享受(他们是8点开始每个房间送,但是我时间来不及了)不过前台服务员态度好! 1
有很长时间没有在西藏大厦住了，以前去北京在这里住的较多。这次住进来发现换了液晶电视，但网络不是很好，他们自己说是收费的原因造成的。其它还好。 1
非常好的地理位置，住的是豪华海景房，打开窗户就可以看见栈桥和海景。记得很早以前也住过，现在重新装修了。总的来说比较满意，以后还会住 1
交通很方便，房间小了一点，但是干净整洁，很有香港的特色，性价比较高，推荐一下哦 1
酒店的装修比较陈旧，房间的隔音，主要是卫生间的隔音非常差，只能算是一般的 0
酒店有点旧，房间比较小，但酒店的位子不错，就在海边，可以直接去游泳。8楼的海景打开窗户就是海。如果想住在热闹的地带，这里不是一个很好的选择，不过威海城市真的比较小，打车还是相当便宜的。晚上酒店门口出租车比较少。 1
位置很好，走路到文庙、清凉寺5分钟都用不了，周边公交车很多很方便，就是出租车不太爱去（老城区路窄爱堵车），因为是老宾馆所以设施要陈旧些， 1
酒店设备一般，套房里卧室的不能上网，要到客厅去。 0
dev.tsv

sentence label
房间里有电脑，虽然房间的条件略显简陋，但环境、服务还有饭菜都还是很不错的。如果下次去无锡，我还是会选择这里的。 1
我们是5月1日通过携程网入住的，条件是太差了，根本达不到四星级的标准，所有的东西都很陈旧，卫生间水龙头用完竟关不上，浴缸的漆面都掉了，估计是十年前的四星级吧，总之下次是不会入住了。 0
离火车站很近很方便。住在东楼标间，相比较在九江住的另一家酒店，房间比较大。卫生间设施略旧。服务还好。10元中式早餐也不错，很丰富，居然还有青菜肉片汤。 1
坐落在香港的老城区，可以体验香港居民生活，门口交通很方便，如果时间不紧，坐叮当车很好呀！周围有很多小餐馆，早餐就在中远后面的南北嚼吃的，东西很不错。我们定的大床房，挺安静的，总体来说不错。前台结账没有银联！ 1
酒店前台服务差，对待客人不热情。号称携程没有预定。感觉是客人在求他们，我们一定得住。这样的宾馆下次不会入住！ 0
价格确实比较高，而且还没有早餐提供。 1
是一家很实惠的酒店，交通方便，房间也宽敞，晚上没有电话骚扰，住了两次，有一次住５０１房间，洗澡间排水不畅通，也许是个别问题．服务质量很好，刚入住时没有调好宽带，服务员很快就帮忙解决了． 1
位置非常好，就在西街的街口，但是却闹中取静，环境很清新优雅。 1
房间应该超出30平米,是HK同级酒店中少有的大;重装之后,设备也不错. 1
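Before fine-tuning, it can help to confirm that a corpus file really follows this layout. The following is a minimal sketch using only the standard library; `check_sst2_rows` is a hypothetical helper, the two sample rows are copied from the listings above, and it assumes a single tab separates sentence and label.

```python
import csv
import io


def check_sst2_rows(tsv_text):
    """Parse SST-2 style text (header plus 'sentence<TAB>label' rows) and count labels."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    if header != ["sentence", "label"]:
        raise ValueError("unexpected header: %r" % header)
    counts = {"0": 0, "1": 0}
    for line_no, row in enumerate(reader, start=2):
        # Each data row must have exactly two fields and a label of 0 or 1
        if len(row) != 2 or row[1] not in counts:
            raise ValueError("malformed row at line %d" % line_no)
        counts[row[1]] += 1
    return counts


sample = (
    "sentence\tlabel\n"
    "酒店设备一般，套房里卧室的不能上网，要到客厅去。\t0\n"
    "价格确实比较高，而且还没有早餐提供。\t1\n"
)
print(check_sst2_rows(sample))  # {'0': 1, '1': 1}
```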
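In the same directory as run_glue.py, create a file named run_cn.sh with the following content:

# DATA_DIR: path to the fine-tuning data
export DATA_DIR="…/…/cn_data"

# SAVE_DIR: where the model is saved; here the bert_cn_finetuning directory under the current directory
export SAVE_DIR="./bert_cn_finetuning/"

# Run the fine-tuning script with python
# --model_type: BERT
# --model_name_or_path: bert-base-chinese
# --task_name: the binary sentence classification task SST-2
# --do_train: train with the fine-tuning script
# --do_eval: evaluate with the fine-tuning script
# --data_dir: $DATA_DIR/SST-2/; train.tsv and dev.tsv under this path are picked up automatically as the training and validation sets
# --max_seq_length: 128, the maximum length of an input sentence
# --output_dir: $SAVE_DIR, i.e. "./bert_cn_finetuning/", where the fine-tuned model is saved
python run_glue.py \
  --model_type BERT \
  --model_name_or_path bert-base-chinese \
  --task_name SST-2 \
  --do_train \
  --do_eval \
  --data_dir $DATA_DIR/SST-2/ \
  --max_seq_length 128 \
  --learning_rate 2e-5 \
  --num_train_epochs 1.0 \
  --output_dir $SAVE_DIR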
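The same invocation can also be assembled from Python, which makes it easier to vary paths between runs. This is a minimal sketch: `build_run_glue_command` is a hypothetical helper, the flag names and values are the ones listed in the script above, and the `cn_data` path in the usage line is a placeholder rather than the tutorial's DATA_DIR.

```python
import shlex


def build_run_glue_command(data_dir, save_dir):
    """Return the argument list equivalent to the run_glue.py call in run_cn.sh."""
    return [
        "python", "run_glue.py",
        "--model_type", "BERT",
        "--model_name_or_path", "bert-base-chinese",
        "--task_name", "SST-2",
        "--do_train",
        "--do_eval",
        "--data_dir", data_dir.rstrip("/") + "/SST-2/",
        "--max_seq_length", "128",
        "--learning_rate", "2e-5",
        "--num_train_epochs", "1.0",
        "--output_dir", save_dir,
    ]


# Placeholder paths for illustration only.
cmd = build_run_glue_command("cn_data", "./bert_cn_finetuning/")
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would launch the script; that step is left out so the sketch runs without run_glue.py present.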
Run and check the results

# Run with the sh command
sh run_cn.sh
Output:

# The validation result is printed at the end; accuracy reaches 0.88.

01/06/2020 14:22:36 - INFO - __main__ - Saving features into cached file …/…/cn_data/SST-2/cached_dev_bert-base-chinese_128_sst-2
01/06/2020 14:22:36 - INFO - __main__ - ***** Running evaluation *****
01/06/2020 14:22:36 - INFO - __main__ - Num examples = 1000
01/06/2020 14:22:36 - INFO - __main__ - Batch size = 8
Evaluating: 100%|████████████| 125/125 [00:56<00:00, 2.20it/s]
01/06/2020 14:23:33 - INFO - __main__ - ***** Eval results *****
01/06/2020 14:23:33 - INFO - __main__ - acc = 0.88
Contents of $SAVE_DIR:

added_tokens.json
checkpoint-350
checkpoint-300
checkpoint-250
checkpoint-200
checkpoint-150
checkpoint-100
checkpoint-50
pytorch_model.bin
training_args.bin
config.json
special_tokens_map.json
vocab.txt
eval_results.txt
tokenizer_config.json
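Upload the model with transformers-cli:

# Default username: ItcastAI
# Default password: ItcastAI
$ transformers-cli login

# Upload the model with the transformers-cli upload command
# Make sure to choose the correct path of the fine-tuned model
$ transformers-cli upload ./bert_cn_finetuning/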
Load the model through torch.hub and use it:

import torch

source = 'huggingface/pytorch-transformers'

# The model name is 'ItcastAI/bert_cn_finetuning'
model_name = 'ItcastAI/bert_cn_finetuning'

tokenizer = torch.hub.load(source, 'tokenizer', model_name)
model = torch.hub.load(source, 'modelForSequenceClassification', model_name)

def get_label(text):
    index = tokenizer.encode(text)
    tokens_tensor = torch.tensor([index])
    # Disable gradient tracking during inference
    with torch.no_grad():
        # Run the model to get the prediction
        result = model(tokens_tensor)
    predicted_label = torch.argmax(result[0]).item()
    return predicted_label

if __name__ == "__main__":
    # The two printed labels mean "input text" and "predicted label"
    # text = "早餐不好,服务不到位,晚餐无西餐,早餐晚餐相同,房间条件不好"
    text = "房间应该超出30平米,是HK同级酒店中少有的大;重装之后,设备也不错."
    print("输入文本为:", text)
    print("预测标签为:", get_label(text))
Output:

输入文本为: 早餐不好,服务不到位,晚餐无西餐,早餐晚餐相同,房间条件不好
预测标签为: 0

输入文本为: 房间应该超出30平米,是HK同级酒店中少有的大;重装之后,设备也不错.
预测标签为: 1
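get_label returns only the index of the largest logit. If a confidence value is wanted as well, the logits can be passed through softmax first. Below is a sketch in plain Python; `softmax` and `label_with_confidence` are hypothetical helpers, and the two logits are the example values printed earlier in this section for a sentence pair, reused purely for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits):
    """Return (argmax index, softmax probability of that index)."""
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]


label, confidence = label_with_confidence([-0.0181, 0.0263])
print(label, round(confidence, 3))
```

Logits this close together give a probability only slightly above 0.5, which is a useful signal that a prediction should not be trusted much.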
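Type 2 hands-on demonstration
Load a pretrained model directly to produce feature representations of the input text, then fine-tune a custom network on top of it to produce the result.
The corpus and the goal are the same as in the Type 1 demonstration.
Load the pretrained model directly to produce feature representations of the input text: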

import torch

# Used to truncate and pad sentences to a fixed length
from keras.preprocessing import sequence

source = 'huggingface/pytorch-transformers'

# Use the pretrained Chinese BERT model directly
model_name = 'bert-base-chinese'

# Get the pretrained bert-base-chinese model through torch.hub
model = torch.hub.load(source, 'model', model_name)

# Get the matching tokenizer, which maps each Chinese character to an id
tokenizer = torch.hub.load(source, 'tokenizer', model_name)

# Fixed sentence length
cutlen = 32

def get_bert_encode(text):
    """
    description: encode Chinese text with bert-chinese
    :param text: the text to encode
    :return: tensor representation of the text after BERT encoding
    """
    # First map every character to its id with the tokenizer.
    # Note that BERT's tokenizer adds the start and end markers 101 and 102 around the result.
    # They matter when encoding several text segments but not here, so [1:-1] slices them off.
    indexed_tokens = tokenizer.encode(text[:cutlen])[1:-1]
    # Truncate or pad the mapped sentence to cutlen
    indexed_tokens = sequence.pad_sequences([indexed_tokens], cutlen)
    # Convert the result to a tensor
    tokens_tensor = torch.LongTensor(indexed_tokens)
    # Do not track gradients
    with torch.no_grad():
        # Call the model to get the hidden-layer output
        encoded_layers, _ = model(tokens_tensor)
    # The output is a 3-D tensor whose outermost dimension is 1; [0] drops it.
    encoded_layers = encoded_layers[0]
    return encoded_layers
Usage:

if __name__ == "__main__":
    text = "早餐不好,服务不到位,晚餐无西餐,早餐晚餐相同,房间条件不好"
    encoded_layers = get_bert_encode(text)
    print(encoded_layers)
    print(encoded_layers.shape)
Output:

tensor([[-1.2282, 1.0551, -0.7953, …, 2.3363, -0.6413, 0.4174],
[-0.9769, 0.8361, -0.4328, …, 2.1668, -0.5845, 0.4836],
[-0.7990, 0.6181, -0.1424, …, 2.2845, -0.6079, 0.5288],
…,
[ 0.9514, 0.5972, 0.3120, …, 1.8408, -0.1362, -0.1206],
[ 0.1250, 0.1984, 0.0484, …, 1.2302, -0.1905, 0.3205],
[ 0.2651, 0.0228, 0.1534, …, 1.0159, -0.3544, 0.1479]])

torch.Size([32, 768])
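The fixed output length of 32 comes from the truncate-and-pad step. With its default arguments, Keras `pad_sequences` pads at the front and truncates from the front; the sketch below reproduces that behavior for a single sequence in plain Python so it can be inspected without Keras. `pad_or_truncate` is a hypothetical helper, and it assumes `maxlen` is at least 1.

```python
def pad_or_truncate(ids, maxlen, value=0):
    """Mimic pad_sequences defaults for one sequence: keep the last maxlen ids, left-pad with value."""
    ids = list(ids)[-maxlen:]                      # truncating='pre' drops ids from the front
    return [value] * (maxlen - len(ids)) + ids     # padding='pre' adds the fill value at the front


print(pad_or_truncate([5, 6, 7], 5))            # [0, 0, 5, 6, 7]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))   # [3, 4, 5, 6]
```

For a short review this means the leading rows of the [32, 768] output correspond to padding ids rather than to characters of the text.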
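A custom single-layer fully connected network as the fine-tuning network:
As a rule of thumb from practice, the total number of parameters in the custom fine-tuning network should be more than 0.5 times and less than 10 times the number of training samples; this helps the model converge within a reasonable time.

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    """Fine-tuning network"""
    def __init__(self, char_size=32, embedding_size=768):
        """
        :param char_size: number of characters in the input sentence, i.e. the fixed sentence length 32.
        :param embedding_size: character embedding dimension; the Chinese BERT model used here has an embedding dimension of 768, so embedding_size is 768
        """
        super(Net, self).__init__()
        # Store char_size and embedding_size
        self.char_size = char_size
        self.embedding_size = embedding_size
        # Instantiate one fully connected layer
        self.fc1 = nn.Linear(char_size*embedding_size, 2)

    def forward(self, x):
        # Reshape the input tensor to match what the next layer expects
        x = x.view(-1, self.char_size*self.embedding_size)
        # Apply the fully connected layer
        x = self.fc1(x)
        return x

Usage:

if __name__ == "__main__":
    # Randomly initialize an input
    x = torch.randn(1, 32, 768)
    # Instantiate the network with all default parameters
    net = Net()
    nr = net(x)
    print(nr)
Output: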

tensor([[0.3279, 0.2519]], grad_fn=)
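The parameter-count rule of thumb stated for this network can be checked with simple arithmetic. The sketch below counts the weights and biases of the single linear layer and compares the total with the training-set size of 2960 that this tutorial reports; `linear_param_count` is a hypothetical helper.

```python
def linear_param_count(in_features, out_features, bias=True):
    """Weights plus optional biases of one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


char_size, embedding_size, num_classes = 32, 768, 2
n_params = linear_param_count(char_size * embedding_size, num_classes)
n_train = 2960  # training-set size reported by the data loader in this tutorial

print(n_params)                       # 49154
print(round(n_params / n_train, 1))   # 16.6
```

With the default sizes the ratio comes out at roughly 16.6, which is above the suggested upper bound of 10, so the rule is best read as a rough guide rather than a hard constraint.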
Build the batch generators for training and validation data:

import pandas as pd
from collections import Counter
from functools import reduce
from sklearn.utils import shuffle

def data_loader(train_data_path, valid_data_path, batch_size):
    """
    description: load data from persisted files
    :param train_data_path: path to the training data
    :param valid_data_path: path to the validation data
    :param batch_size: batch size for the training and validation sets
    :return: training data generator, validation data generator, number of training samples, number of validation samples
    """
    # Read the tab-separated data with pandas and drop the first row (the column names)
    train_data = pd.read_csv(train_data_path, header=None, sep="\t").drop([0])
    valid_data = pd.read_csv(valid_data_path, header=None, sep="\t").drop([0])

    # Print the number of positive and negative samples in each set
    print("训练数据集的正负样本数量:")
    print(dict(Counter(train_data[1].values)))
    print("验证数据集的正负样本数量:")
    print(dict(Counter(valid_data[1].values)))

    # The validation set must contain at least one full batch
    if len(valid_data) < batch_size:
        raise ValueError("Batch size or split not match!")

    def _loader_generator(data):
        """
        description: generator over the batches of the training or validation set
        :param data: training or validation data
        :return: a generator yielding one batch of data at a time
        """
        # Shuffle once per pass so that every sample is used exactly once
        rows = shuffle(data.values.tolist())
        # Walk through the data one batch at a time
        for batch in range(0, len(rows), batch_size):
            # Lists holding the tensors and labels of this batch
            batch_encoded = []
            batch_labels = []
            # Iterate over the batch_size rows of this batch one by one
            for item in rows[batch: batch+batch_size]:
                # Encode the sentence with the Chinese BERT model
                encoded = get_bert_encode(item[0])
                # Collect the encoded sample
                batch_encoded.append(encoded)
                # Collect the matching label
                batch_labels.append([int(item[1])])
            # Use reduce to turn the lists into the tensors the model needs
            # encoded has shape (batch_size*max_len, embedding_size)
            encoded = reduce(lambda x, y: torch.cat((x, y), dim=0), batch_encoded)
            labels = torch.tensor(reduce(lambda x, y: x + y, batch_labels))
            # Yield the data and labels
            yield (encoded, labels)

    # Build a generator for each set with _loader_generator
    # and also return the number of samples in each set
    return _loader_generator(train_data), _loader_generator(valid_data), len(train_data), len(valid_data)

Usage:

if __name__ == "__main__":
    train_data_path = "./cn_data/SST-2/train.tsv"
    valid_data_path = "./cn_data/SST-2/dev.tsv"
    batch_size = 16
    train_data_labels, valid_data_labels, \
    train_data_len, valid_data_len = data_loader(train_data_path, valid_data_path, batch_size)
    print(next(train_data_labels))
    print(next(valid_data_labels))
    print("train_data_len:", train_data_len)
    print("valid_data_len:", valid_data_len)
Output:

训练数据集的正负样本数量:
{'0': 1518, '1': 1442}
验证数据集的正负样本数量:
{'1': 518, '0': 482}
(tensor([[[-0.8328, 0.9376, -1.2489, …, 1.8594, -0.4636, -0.1682],
[-0.9798, 0.5113, -0.9868, …, 1.5500, -0.1934, 0.2521],
[-0.7574, 0.3086, -0.6031, …, 1.8467, -0.2507, 0.3916],
…,
[ 0.0064, 0.2321, 0.3785, …, 0.3376, 0.4748, -0.1272],
[-0.3175, 0.4018, -0.0377, …, 0.6030, 0.2916, -0.4172],
[-0.6154, 1.0439, 0.2921, …, 0.5048, -0.0983, 0.0061]]]), tensor([0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
1, 0, 1, 1, 1, 1, 0, 0]))
(tensor([[[-0.1611, 0.9182, -0.3419, …, 0.6323, -0.2013, 0.0184],
[-0.1224, 0.7706, -0.2386, …, 0.7925, 0.0444, 0.2160],
[-0.0301, 0.6867, -0.1510, …, 0.9140, 0.0308, 0.2611],
…,
[ 0.3662, -0.4925, 1.2332, …, 0.7741, -0.1007, -0.3099],
[-0.0932, -0.8494, 0.6586, …, 0.1235, -0.3152, -0.1635],
[ 0.5306, -0.5510, 0.3105, …, 1.2631, -0.5882, -0.1133]]]), tensor([1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
1, 0, 0, 1, 1, 1, 0, 0]))
train_data_len: 2960
valid_data_len: 1000
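When batches are cut out of a shuffled list, the shuffle should happen once per pass over the data. Reshuffling before every slice lets some samples appear in several batches and others in none. The following plain-Python sketch shows the shuffle-once pattern and checks that every item is covered exactly once; `batches_shuffle_once` is a hypothetical helper and uses `random` instead of scikit-learn so it runs on its own.

```python
import random


def batches_shuffle_once(items, batch_size, seed=0):
    """Shuffle a copy once, then yield consecutive slices of batch_size items."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


items = list(range(10))
batches = list(batches_shuffle_once(items, 4))
seen = [x for batch in batches for x in batch]

print([len(b) for b in batches])   # [4, 4, 2]
print(sorted(seen) == items)       # True: full coverage, no duplicates
```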
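Write the training and validation functions:

import torch.optim as optim

def train(train_data_labels):
    """
    description: training function; updates the model parameters and collects loss and accuracy
    :param train_data_labels: generator of training data and labels
    :return: the sum of per-batch average losses over the whole pass, and the number of correct labels
    """
    # Running totals for training loss and number of correct predictions
    train_running_loss = 0.0
    train_running_acc = 0.0
    # Loop over the training generator; the parameters are updated once per batch
    for train_tensor, train_labels in train_data_labels:
        # Reset the gradients for this batch
        optimizer.zero_grad()
        # Get the output of the fine-tuning network
        train_outputs = net(train_tensor)
        # Average loss of this batch
        train_loss = criterion(train_outputs, train_labels)
        # Add the batch average loss to train_running_loss
        train_running_loss += train_loss.item()
        # Backpropagate the loss
        train_loss.backward()
        # Update the model parameters
        optimizer.step()
        # Accumulate the number of correct labels in this batch, used later for accuracy
        train_running_acc += (train_outputs.argmax(1) == train_labels).sum().item()
    return train_running_loss, train_running_acc

def valid(valid_data_labels):
    """
    description: validation function; evaluates the model on held-out data and collects loss and accuracy
    :param valid_data_labels: generator of validation data and labels
    :return: the sum of per-batch average losses over the whole pass, and the number of correct labels
    """
    # Running totals for validation loss and number of correct predictions
    valid_running_loss = 0.0
    valid_running_acc = 0.0
    # Loop over the validation generator
    for valid_tensor, valid_labels in valid_data_labels:
        # Do not track gradients
        with torch.no_grad():
            # Get the output of the fine-tuning network
            valid_outputs = net(valid_tensor)
            # Average loss of this batch
            valid_loss = criterion(valid_outputs, valid_labels)
            # Add the batch average loss to valid_running_loss
            valid_running_loss += valid_loss.item()
            # Accumulate the number of correct labels in this batch, used later for accuracy
            valid_running_acc += (valid_outputs.argmax(1) == valid_labels).sum().item()
    return valid_running_loss, valid_running_acc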
Run training and save the model:

if __name__ == "__main__":
    # Data paths
    train_data_path = "./cn_data/SST-2/train.tsv"
    valid_data_path = "./cn_data/SST-2/dev.tsv"
    # Instantiate the fine-tuning network (Net is defined above)
    net = Net()
    # Cross-entropy loss
    criterion = nn.CrossEntropyLoss()
    # SGD optimizer
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
    # Number of training epochs
    epochs = 4
    # Number of samples per batch
    batch_size = 16
    # Train for the given number of epochs
    for epoch in range(epochs):
        # Print the epoch number
        print("Epoch:", epoch + 1)
        # Get the training and validation generators and the sample counts from the data loader
        train_data_labels, valid_data_labels, train_data_len, \
        valid_data_len = data_loader(train_data_path, valid_data_path, batch_size)
        # Train
        train_running_loss, train_running_acc = train(train_data_labels)
        # Validate
        valid_running_loss, valid_running_acc = valid(valid_data_labels)
        # Average loss per epoch: train_running_loss and valid_running_loss are sums of per-batch average losses,
        # so multiplying by batch_size gives the total loss of the epoch, and dividing by the sample count gives the average
        train_average_loss = train_running_loss * batch_size / train_data_len
        valid_average_loss = valid_running_loss * batch_size / valid_data_len

        # train_running_acc and valid_running_acc are the accumulated numbers of correct labels,
        # so dividing by the sample count gives the accuracy of the epoch
        train_average_acc = train_running_acc / train_data_len
        valid_average_acc = valid_running_acc / valid_data_len
        # Print training and validation loss and accuracy for this epoch
        print("Train Loss:", train_average_loss, "|", "Train Acc:", train_average_acc)
        print("Valid Loss:", valid_average_loss, "|", "Valid Acc:", valid_average_acc)

    print('Finished Training')

    # Save path
    MODEL_PATH = './BERT_net.pth'
    # Save the model parameters
    torch.save(net.state_dict(), MODEL_PATH)
    print('Finished Saving')

Output:

Epoch: 1
Train Loss: 2.144986984236597 | Train Acc: 0.7347972972972973
Valid Loss: 2.1898122818128902 | Valid Acc: 0.704
Epoch: 2
Train Loss: 1.3592962406135032 | Train Acc: 0.8435810810810811
Valid Loss: 1.8816152956699324 | Valid Acc: 0.784
Epoch: 3
Train Loss: 1.5507876996199943 | Train Acc: 0.8439189189189189
Valid Loss: 1.8626576719331536 | Valid Acc: 0.795
Epoch: 4
Train Loss: 0.7825378059198299 | Train Acc: 0.9081081081081082
Valid Loss: 2.121698483480899 | Valid Acc: 0.803
Finished Training
Finished Saving
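The epoch average used above multiplies the sum of per-batch mean losses by batch_size, which is exact only when every batch is full. With 1000 validation samples and a batch size of 16, the last batch holds only 8 samples, so that batch is over-weighted. Here is a sketch of a size-weighted average in plain Python; `epoch_average_loss` is a hypothetical helper and the loss values are made up to make the difference visible.

```python
def epoch_average_loss(batch_mean_losses, batch_sizes):
    """Average loss per sample, weighting each batch mean by its actual size."""
    total = sum(loss * n for loss, n in zip(batch_mean_losses, batch_sizes))
    return total / sum(batch_sizes)


# Made-up per-batch mean losses; the last batch is only half full.
losses = [1.0, 1.0, 3.0]
sizes = [16, 16, 8]

exact = epoch_average_loss(losses, sizes)    # (16 + 16 + 24) / 40
approx = sum(losses) * 16 / sum(sizes)       # treats the last batch as if it held 16 samples
print(exact, approx)                         # 1.4 2.0
```

In a training loop the same effect can be had by accumulating `loss.item() * len(labels)` per batch and dividing by the sample count at the end.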
Load the model and use it:

if __name__ == "__main__":
    MODEL_PATH = './BERT_net.pth'
    # Load the model parameters
    net.load_state_dict(torch.load(MODEL_PATH))

    # text = "酒店设备一般，套房里卧室的不能上网，要到客厅去。"
    text = "房间应该超出30平米,是HK同级酒店中少有的大;重装之后,设备也不错."
    print("输入文本为:", text)
    with torch.no_grad():
        output = net(get_bert_encode(text))
        # Take the index of the largest value in output
        print("预测标签为:", torch.argmax(output).item())

Output:

输入文本为: 房间应该超出30平米,是HK同级酒店中少有的大;重装之后,设备也不错.
预测标签为: 1
输入文本为: 酒店设备一般，套房里卧室的不能上网，要到客厅去。
预测标签为: 0
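Section summary
Covered the fine-tuning scripts for specific task types:

The huggingface research organization provides fine-tuning scripts for the task types in the GLUE dataset collection; at their core, these scripts all fine-tune the last fully connected layer of the model.
A few simple parameters select the GLUE task type (for example, CoLA corresponds to binary text classification, MRPC to binary classification of sentence pairs, and STS-B to multi-class classification of sentence pairs) and the pretrained model to fine-tune.
Covered the steps for using a task-specific fine-tuning script:

Step 1: Download the fine-tuning script
Step 2: Configure the script parameters
Step 3: Run and check the results
Covered the steps for using a model after fine-tuning it with the script:

Step 1: Create an account at https://huggingface.co/join
Step 2: Log in with transformers-cli in the server terminal
Step 3: Upload the model with transformers-cli and view it
Step 4: Load the model with torch.hub and use it
Covered the two types of transfer learning through fine-tuning:

Type 1: Fine-tune a pretrained model with the fine-tuning script for a given task type; a predefined network with an output head produces the result.
Type 2: Load a pretrained model directly to produce feature representations of the input text, then fine-tune a custom network on top of it to produce the result.
Covered the Type 1 hands-on demonstration:

Use the fine-tuning script for the binary text classification task type SST-2 to fine-tune a Chinese pretrained model, followed by a predefined network with a classification head. The goal is to predict the sentiment of a sentence.
Prepare a sentiment analysis corpus of Chinese hotel reviews in the same format as the SST-2 dataset; label 0 is a negative review and label 1 is a positive review.
The corpus is stored in cn_data/, a sibling directory of glue_data/; its SST-2 directory contains train.tsv and dev.tsv.
Covered the Type 2 hands-on demonstration: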