There is a vocab.txt file where each line is one token, for example:
abc
bcd
吴家行
The Tokenizer class then has several attributes (a minimal sketch of building the first two follows this list):
vocab
A dict, which turns the vocab.txt file above into the following form:
{
"abc": 0,
"bcd": 1,
"吴家行": 2,
}
ids_to_tokens
Swaps the positions of token and id in vocab, giving the following form:
{
0: "abc",
1: "bcd",
2: "吴家行",
}
basic_tokenizer
wordpiece_tokenizer
max_len
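Here is a minimal sketch of how vocab and ids_to_tokens can be built from vocab.txt (the file path and the helper name are my own choices for illustration, not necessarily what the library does):

import collections

def load_vocab(vocab_file):
    # Read vocab.txt (one token per line) into an ordered token -> id dict.
    vocab = collections.OrderedDict()
    with open(vocab_file, "r", encoding="utf-8") as reader:
        for index, line in enumerate(reader):
            vocab[line.rstrip("\n")] = index
    return vocab

vocab = load_vocab("vocab.txt")                   # {"abc": 0, "bcd": 1, "吴家行": 2, ...}
ids_to_tokens = {i: t for t, i in vocab.items()}  # {0: "abc", 1: "bcd", 2: "吴家行", ...}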
Load both the structure (all the various sizes) and the pretrained parameters (i.e. the weight and bias of each sub-module).
My understanding is that this is the set of hyperparameters fixed by the model, stored as JSON in the following form:
{
"attention_probs_dropout_prob": 0.1,
"directionality": "bidi",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pooler_fc_size": 768,
"pooler_num_attention_heads": 12,
"pooler_num_fc_layers": 3,
"pooler_size_per_head": 128,
"pooler_type": "first_token_transform",
"type_vocab_size": 2,
"vocab_size": 21128
}
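As a minimal sketch, this config can be read into a plain dict with json (the file name bert_config.json is an assumption based on the usual checkpoint naming):

import json

with open("bert_config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

print(config["hidden_size"], config["num_hidden_layers"])  # 768 12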
My understanding is that this file holds the model's trained parameters; it is a binary file that parses into a dict. I roughly dump the parsed keys below (a sketch of how to print them comes after the list):
bert.embeddings.word_embeddings.weight
bert.embeddings.position_embeddings.weight
bert.embeddings.token_type_embeddings.weight
bert.embeddings.LayerNorm.weight
bert.embeddings.LayerNorm.bias
bert.encoder.layer.0.attention.self.query.weight
bert.encoder.layer.0.attention.self.query.bias
bert.encoder.layer.0.attention.self.key.weight
bert.encoder.layer.0.attention.self.key.bias
bert.encoder.layer.0.attention.self.value.weight
bert.encoder.layer.0.attention.self.value.bias
bert.encoder.layer.0.attention.output.dense.weight
bert.encoder.layer.0.attention.output.dense.bias
bert.encoder.layer.0.attention.output.LayerNorm.weight
bert.encoder.layer.0.attention.output.LayerNorm.bias
bert.encoder.layer.0.intermediate.dense.weight
bert.encoder.layer.0.intermediate.dense.bias
bert.encoder.layer.0.output.dense.weight
bert.encoder.layer.0.output.dense.bias
bert.encoder.layer.0.output.LayerNorm.weight
bert.encoder.layer.0.output.LayerNorm.bias
bert.encoder.layer.1.attention.self.query.weight
bert.encoder.layer.1.attention.self.query.bias
bert.encoder.layer.1.attention.self.key.weight
bert.encoder.layer.1.attention.self.key.bias
bert.encoder.layer.1.attention.self.value.weight
bert.encoder.layer.1.attention.self.value.bias
bert.encoder.layer.1.attention.output.dense.weight
bert.encoder.layer.1.attention.output.dense.bias
bert.encoder.layer.1.attention.output.LayerNorm.weight
bert.encoder.layer.1.attention.output.LayerNorm.bias
bert.encoder.layer.1.intermediate.dense.weight
bert.encoder.layer.1.intermediate.dense.bias
bert.encoder.layer.1.output.dense.weight
bert.encoder.layer.1.output.dense.bias
bert.encoder.layer.1.output.LayerNorm.weight
bert.encoder.layer.1.output.LayerNorm.bias
bert.encoder.layer.2.attention.self.query.weight
bert.encoder.layer.2.attention.self.query.bias
bert.encoder.layer.2.attention.self.key.weight
bert.encoder.layer.2.attention.self.key.bias
bert.encoder.layer.2.attention.self.value.weight
bert.encoder.layer.2.attention.self.value.bias
bert.encoder.layer.2.attention.output.dense.weight
bert.encoder.layer.2.attention.output.dense.bias
bert.encoder.layer.2.attention.output.LayerNorm.weight
bert.encoder.layer.2.attention.output.LayerNorm.bias
bert.encoder.layer.2.intermediate.dense.weight
bert.encoder.layer.2.intermediate.dense.bias
bert.encoder.layer.2.output.dense.weight
bert.encoder.layer.2.output.dense.bias
bert.encoder.layer.2.output.LayerNorm.weight
bert.encoder.layer.2.output.LayerNorm.bias
bert.encoder.layer.3.attention.self.query.weight
bert.encoder.layer.3.attention.self.query.bias
bert.encoder.layer.3.attention.self.key.weight
bert.encoder.layer.3.attention.self.key.bias
bert.encoder.layer.3.attention.self.value.weight
bert.encoder.layer.3.attention.self.value.bias
bert.encoder.layer.3.attention.output.dense.weight
bert.encoder.layer.3.attention.output.dense.bias
bert.encoder.layer.3.attention.output.LayerNorm.weight
bert.encoder.layer.3.attention.output.LayerNorm.bias
bert.encoder.layer.3.intermediate.dense.weight
bert.encoder.layer.3.intermediate.dense.bias
bert.encoder.layer.3.output.dense.weight
bert.encoder.layer.3.output.dense.bias
bert.encoder.layer.3.output.LayerNorm.weight
bert.encoder.layer.3.output.LayerNorm.bias
bert.encoder.layer.4.attention.self.query.weight
bert.encoder.layer.4.attention.self.query.bias
bert.encoder.layer.4.attention.self.key.weight
bert.encoder.layer.4.attention.self.key.bias
bert.encoder.layer.4.attention.self.value.weight
bert.encoder.layer.4.attention.self.value.bias
bert.encoder.layer.4.attention.output.dense.weight
bert.encoder.layer.4.attention.output.dense.bias
bert.encoder.layer.4.attention.output.LayerNorm.weight
bert.encoder.layer.4.attention.output.LayerNorm.bias
bert.encoder.layer.4.intermediate.dense.weight
bert.encoder.layer.4.intermediate.dense.bias
bert.encoder.layer.4.output.dense.weight
bert.encoder.layer.4.output.dense.bias
bert.encoder.layer.4.output.LayerNorm.weight
bert.encoder.layer.4.output.LayerNorm.bias
bert.encoder.layer.5.attention.self.query.weight
bert.encoder.layer.5.attention.self.query.bias
bert.encoder.layer.5.attention.self.key.weight
bert.encoder.layer.5.attention.self.key.bias
bert.encoder.layer.5.attention.self.value.weight
bert.encoder.layer.5.attention.self.value.bias
bert.encoder.layer.5.attention.output.dense.weight
bert.encoder.layer.5.attention.output.dense.bias
bert.encoder.layer.5.attention.output.LayerNorm.weight
bert.encoder.layer.5.attention.output.LayerNorm.bias
bert.encoder.layer.5.intermediate.dense.weight
bert.encoder.layer.5.intermediate.dense.bias
bert.encoder.layer.5.output.dense.weight
bert.encoder.layer.5.output.dense.bias
bert.encoder.layer.5.output.LayerNorm.weight
bert.encoder.layer.5.output.LayerNorm.bias
bert.encoder.layer.6.attention.self.query.weight
bert.encoder.layer.6.attention.self.query.bias
bert.encoder.layer.6.attention.self.key.weight
bert.encoder.layer.6.attention.self.key.bias
bert.encoder.layer.6.attention.self.value.weight
bert.encoder.layer.6.attention.self.value.bias
bert.encoder.layer.6.attention.output.dense.weight
bert.encoder.layer.6.attention.output.dense.bias
bert.encoder.layer.6.attention.output.LayerNorm.weight
bert.encoder.layer.6.attention.output.LayerNorm.bias
bert.encoder.layer.6.intermediate.dense.weight
bert.encoder.layer.6.intermediate.dense.bias
bert.encoder.layer.6.output.dense.weight
bert.encoder.layer.6.output.dense.bias
bert.encoder.layer.6.output.LayerNorm.weight
bert.encoder.layer.6.output.LayerNorm.bias
bert.encoder.layer.7.attention.self.query.weight
bert.encoder.layer.7.attention.self.query.bias
bert.encoder.layer.7.attention.self.key.weight
bert.encoder.layer.7.attention.self.key.bias
bert.encoder.layer.7.attention.self.value.weight
bert.encoder.layer.7.attention.self.value.bias
bert.encoder.layer.7.attention.output.dense.weight
bert.encoder.layer.7.attention.output.dense.bias
bert.encoder.layer.7.attention.output.LayerNorm.weight
bert.encoder.layer.7.attention.output.LayerNorm.bias
bert.encoder.layer.7.intermediate.dense.weight
bert.encoder.layer.7.intermediate.dense.bias
bert.encoder.layer.7.output.dense.weight
bert.encoder.layer.7.output.dense.bias
bert.encoder.layer.7.output.LayerNorm.weight
bert.encoder.layer.7.output.LayerNorm.bias
bert.encoder.layer.8.attention.self.query.weight
bert.encoder.layer.8.attention.self.query.bias
bert.encoder.layer.8.attention.self.key.weight
bert.encoder.layer.8.attention.self.key.bias
bert.encoder.layer.8.attention.self.value.weight
bert.encoder.layer.8.attention.self.value.bias
bert.encoder.layer.8.attention.output.dense.weight
bert.encoder.layer.8.attention.output.dense.bias
bert.encoder.layer.8.attention.output.LayerNorm.weight
bert.encoder.layer.8.attention.output.LayerNorm.bias
bert.encoder.layer.8.intermediate.dense.weight
bert.encoder.layer.8.intermediate.dense.bias
bert.encoder.layer.8.output.dense.weight
bert.encoder.layer.8.output.dense.bias
bert.encoder.layer.8.output.LayerNorm.weight
bert.encoder.layer.8.output.LayerNorm.bias
bert.encoder.layer.9.attention.self.query.weight
bert.encoder.layer.9.attention.self.query.bias
bert.encoder.layer.9.attention.self.key.weight
bert.encoder.layer.9.attention.self.key.bias
bert.encoder.layer.9.attention.self.value.weight
bert.encoder.layer.9.attention.self.value.bias
bert.encoder.layer.9.attention.output.dense.weight
bert.encoder.layer.9.attention.output.dense.bias
bert.encoder.layer.9.attention.output.LayerNorm.weight
bert.encoder.layer.9.attention.output.LayerNorm.bias
bert.encoder.layer.9.intermediate.dense.weight
bert.encoder.layer.9.intermediate.dense.bias
bert.encoder.layer.9.output.dense.weight
bert.encoder.layer.9.output.dense.bias
bert.encoder.layer.9.output.LayerNorm.weight
bert.encoder.layer.9.output.LayerNorm.bias
bert.encoder.layer.10.attention.self.query.weight
bert.encoder.layer.10.attention.self.query.bias
bert.encoder.layer.10.attention.self.key.weight
bert.encoder.layer.10.attention.self.key.bias
bert.encoder.layer.10.attention.self.value.weight
bert.encoder.layer.10.attention.self.value.bias
bert.encoder.layer.10.attention.output.dense.weight
bert.encoder.layer.10.attention.output.dense.bias
bert.encoder.layer.10.attention.output.LayerNorm.weight
bert.encoder.layer.10.attention.output.LayerNorm.bias
bert.encoder.layer.10.intermediate.dense.weight
bert.encoder.layer.10.intermediate.dense.bias
bert.encoder.layer.10.output.dense.weight
bert.encoder.layer.10.output.dense.bias
bert.encoder.layer.10.output.LayerNorm.weight
bert.encoder.layer.10.output.LayerNorm.bias
bert.encoder.layer.11.attention.self.query.weight
bert.encoder.layer.11.attention.self.query.bias
bert.encoder.layer.11.attention.self.key.weight
bert.encoder.layer.11.attention.self.key.bias
bert.encoder.layer.11.attention.self.value.weight
bert.encoder.layer.11.attention.self.value.bias
bert.encoder.layer.11.attention.output.dense.weight
bert.encoder.layer.11.attention.output.dense.bias
bert.encoder.layer.11.attention.output.LayerNorm.weight
bert.encoder.layer.11.attention.output.LayerNorm.bias
bert.encoder.layer.11.intermediate.dense.weight
bert.encoder.layer.11.intermediate.dense.bias
bert.encoder.layer.11.output.dense.weight
bert.encoder.layer.11.output.dense.bias
bert.encoder.layer.11.output.LayerNorm.weight
bert.encoder.layer.11.output.LayerNorm.bias
bert.pooler.dense.weight
bert.pooler.dense.bias
classifier.weight
classifier.bias
Judging from these keys, the pretrained model appears to be composed roughly of the following layers: embeddings, encoder, pooler, and classifier.
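Here is a minimal sketch of how the keys above can be dumped (the file name pytorch_model.bin is an assumption based on the usual checkpoint naming):

import torch

# The checkpoint parses into an (ordered) dict mapping parameter names to tensors.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
for key, tensor in state_dict.items():
    print(key, tuple(tensor.shape))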
The optimizer used is a newer variant of Adam; it appears to add a regularization term via weight_decay (still to be studied)…
Here is the paper link: Decoupled Weight Decay Regularization
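As I understand the paper, the key point is that the weight decay is applied directly to the parameters instead of being folded into the gradient that Adam rescales, so the update is roughly

$\theta_t \leftarrow \theta_{t-1} - \eta_t \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_{t-1} \right)$

where $\hat{m}_t$ and $\hat{v}_t$ are Adam's bias-corrected moment estimates, $\eta_t$ is the learning rate, and $\lambda$ is the weight_decay coefficient (this is my paraphrase of the paper, not necessarily the exact implementation used here).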
Each example is an object with four attributes: guid, text_a, text_b, and label.
train_examples is a list made up of many such examples, and we need to convert train_examples into feature representations.
First basic_tokenizer splits the text into tokens, and then wordpiece_tokenizer is applied to each token to produce sub_tokens. In most cases a token and its sub_token are identical, but there are also cases like the following:
token: あすみ
sub_token: ['あ', '##す', '##み']
token: clucl
sub_token: ['cl', '##uc', '##l']
token: 5000
sub_token: ['50', '##00']
...
When handling a sentence pair, the two sentences together can contain at most max_seq_length - 3 tokens (the other three slots go to [CLS], [SEP], [SEP]); anything beyond that is truncated.
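A minimal sketch of that truncation, trimming the longer sequence one token at a time (my own condensed version, not necessarily the exact helper in the library):

def truncate_seq_pair(tokens_a, tokens_b, max_length):
    # Pop tokens off the end of whichever sequence is currently longer
    # until the pair fits within max_length.
    while len(tokens_a) + len(tokens_b) > max_length:
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()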
After this processing, the two sentences in an example look like the following (I still need to take another look at exactly how tokens_a is produced):
tokens_a: ['喜', '欢', '打', '篮', '球', '的', '男', '生', '喜', '欢', '什', '么', '样', '的', '女', '生']
tokens_b: ['爱', '打', '篮', '球', '的', '男', '生', '喜', '欢', '什', '么', '样', '的', '女', '生']
Then the two sentences are joined together and given segment ids; the segment id distinguishes the two sentences, with the first one all 0s and the second one all 1s:
tokens: ['[CLS]', '喜', '欢', '打', '篮', '球', '的', '男', '生', '喜', '欢', '什', '么', '样', '的', '女', '生', '[SEP]', '爱', '打', '篮', '球', '的', '男', '生', '喜', '欢', '什', '么', '样', '的', '女', '生', '[SEP]']
segment_ids: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Then the vocab dict from earlier is used to convert these tokens into ids:
input_ids: [101, 1599, 3614, 2802, 5074, 4413, 4638, 4511, 4495, 1599, 3614, 784, 720, 3416, 4638, 1957, 4495, 102, 4263, 2802, 5074, 4413, 4638, 4511, 4495, 1599, 3614, 784, 720, 3416, 4638, 1957, 4495, 102]
Then input_ids, input_mask, and segment_ids are padded with 0 up to max_seq_length.
There is also label_id; since this is a binary classification problem the label is 0 or 1, and a dict stores the mapping between label and id, as follows:
{
"0": 0,
"1": 1,
}
With all of the above, each example has been converted into a feature representation.
Each feature has the following elements: input_ids, input_mask, segment_ids, and label_id.
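A minimal sketch of the whole example-to-feature conversion (tokenizer, max_seq_length, and label_map follow the names used above; this is my condensed version rather than the exact library code, and it reuses truncate_seq_pair from the earlier sketch):

def convert_example_to_feature(example, label_map, max_seq_length, tokenizer):
    tokens_a = tokenizer.tokenize(example.text_a)
    tokens_b = tokenizer.tokenize(example.text_b)
    truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)  # leave room for [CLS]/[SEP]/[SEP]

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

    input_ids = tokenizer.convert_tokens_to_ids(tokens)  # look each token up in vocab
    input_mask = [1] * len(input_ids)                    # 1 for real tokens, 0 for padding

    padding = [0] * (max_seq_length - len(input_ids))    # pad everything with 0 to max_seq_length
    input_ids += padding
    input_mask += padding
    segment_ids += padding

    label_id = label_map[example.label]                  # e.g. {"0": 0, "1": 1}
    return input_ids, input_mask, segment_ids, label_id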
The loss function (i.e. the objective function) and the optimizer have already been defined. loss.backward() computes the gradients, optimizer.step() performs the gradient-descent-style parameter update, and optimizer.zero_grad() clears the gradients for the next iteration.
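A minimal sketch of how these three calls fit together in one training step (model, optimizer, and train_dataloader stand in for the objects described in this post):

model.train()
for input_ids, input_mask, segment_ids, label_ids in train_dataloader:
    loss = model(input_ids, segment_ids, input_mask, label_ids)  # forward pass returns the loss
    loss.backward()        # compute gradients of the loss w.r.t. every parameter
    optimizer.step()       # update the parameters with those gradients
    optimizer.zero_grad()  # clear the gradients before the next batch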
In a PyTorch model, the sub-modules are usually defined first in the __init__() function, for example:
self.bert = BertModel(config)
and then the forward() method is invoked by calling the module instance directly, for example:
_, pooled_output = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
The line above is actually executing the forward() method of BertModel.
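A minimal sketch of this pattern, a stripped-down classifier in the spirit of BertForSequenceClassification (the import path assumes the pytorch_pretrained_bert package; the class name and num_labels are my own choices for illustration):

import torch.nn as nn
from pytorch_pretrained_bert.modeling import BertModel  # assumed package layout

class SimpleBertClassifier(nn.Module):
    def __init__(self, config, num_labels=2):
        super().__init__()
        self.bert = BertModel(config)                         # sub-module defined in __init__()
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, num_labels)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None):
        # Calling self.bert(...) runs BertModel.forward(...)
        _, pooled_output = self.bert(input_ids, token_type_ids, attention_mask,
                                     output_all_encoded_layers=False)
        return self.classifier(self.dropout(pooled_output))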
Then input_ids, segment_ids, and input_mask are fed into the model; next we dig into the model's structure.
input_ids, token_type_ids, and attention_mask are all batched into tensors of size [20, 128], where 20 is the batch size and 128 is the per-example max length mentioned earlier.
BertForSequenceClassification
bert(BertModel)
Inside it, attention_mask is expanded into extended_attention_mask, whose size becomes [20, 1, 1, 128], representing [batch_size, num_heads, from_seq_length, to_seq_length].
unsqueeze(i) inserts a new dimension at position i; here the mask gets extra dimensions at positions 1 and 2, so that it can broadcast across the attention heads and the from_seq_length dimension.
Then, in extended_attention_mask, the positions that were 0 become -10000 and the positions that were 1 become 0.
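A minimal sketch of that mask expansion, following the shapes just described:

# attention_mask: [20, 128], 1 for real tokens and 0 for padding
extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)    # -> [20, 1, 1, 128]
extended_attention_mask = extended_attention_mask.float()
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0  # 0 -> -10000.0, 1 -> 0.0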
embeddings(BertEmbeddings)
input_ids and token_type_ids are sent into the embedding layer.
This layer also has position_ids, which simply numbers every position from 0 up to the max length; its size is the same as that of input_ids, and it looks like this:
tensor([[ 0, 1, 2, ..., 125, 126, 127],
[ 0, 1, 2, ..., 125, 126, 127],
[ 0, 1, 2, ..., 125, 126, 127],
...,
[ 0, 1, 2, ..., 125, 126, 127],
[ 0, 1, 2, ..., 125, 126, 127],
[ 0, 1, 2, ..., 125, 126, 127]], device='cuda:0')
Then input_ids, position_ids, and token_type_ids are fed into the corresponding word_embeddings, position_embeddings, and token_type_embeddings layers to produce their embedding vectors; the three are summed to give embeddings, which is fed into the LayerNorm layer.
The output of the LayerNorm layer is:
$w \times \frac{embeddings - \overline{embeddings}}{\sqrt{\overline{(embeddings - \overline{embeddings})^2} + \epsilon}} + b$
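A minimal sketch of this embedding step (word_embeddings, position_embeddings, token_type_embeddings, and LayerNorm stand for the corresponding sub-modules; shapes follow the [20, 128, 768] sizes used below):

import torch

seq_length = input_ids.size(1)                                   # 128
position_ids = torch.arange(seq_length, dtype=torch.long,
                            device=input_ids.device)
position_ids = position_ids.unsqueeze(0).expand_as(input_ids)    # [20, 128], each row is 0..127

embeddings = (word_embeddings(input_ids)
              + position_embeddings(position_ids)
              + token_type_embeddings(token_type_ids))           # sum of the three, [20, 128, 768]
embeddings = LayerNorm(embeddings)                               # normalize over the hidden dimension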
encoder(BertEncoder)
"num_hidden_layers"是12,也就是说有12个这个样的层:
BertLayer(
  (attention): BertAttention(
    (self): BertSelfAttention(
      (query): Linear(in_features=768, out_features=768, bias=True)
      (key): Linear(in_features=768, out_features=768, bias=True)
      (value): Linear(in_features=768, out_features=768, bias=True)
      (dropout): Dropout(p=0.1)
    )
    (output): BertSelfOutput(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (LayerNorm): BertLayerNorm()  (add & norm)
      (dropout): Dropout(p=0.1)
    )
  )
  (intermediate): BertIntermediate(
    (dense): Linear(in_features=768, out_features=3072, bias=True)
  )
  (output): BertOutput(
    (dense): Linear(in_features=3072, out_features=768, bias=True)
    (LayerNorm): BertLayerNorm()
    (dropout): Dropout(p=0.1)
  )
)
BertAttention
BertIntermediate
The size goes from [20, 128, 768] to [20, 128, 3072] through the fully connected layer.
gelu activation
BertOutput
The size goes from [20, 128, 3072] back to [20, 128, 768] through the BertOutput layer.
pooler(BertPooler)
From the [20, 128, 768] data, only the first token's embedding is kept, so the size becomes [20, 768].
tanh activation
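A minimal sketch of the pooler, matching the shapes above (dense stands for the Linear(768, 768) layer inside BertPooler):

import torch

# hidden_states: [20, 128, 768], the output of the last encoder layer
first_token = hidden_states[:, 0]               # take the first ([CLS]) position -> [20, 768]
pooled_output = torch.tanh(dense(first_token))  # Linear(768, 768) followed by tanh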
dropout(Dropout)
In machine learning, if a model has too many parameters and too few training samples, the trained model easily overfits. Overfitting is a common problem when training neural networks; concretely, the model has a small loss and high accuracy on the training data, but a large loss and low accuracy on the test data.
Dropout can alleviate overfitting fairly effectively and acts as a form of regularization.
Put simply, during the forward pass Dropout makes each neuron's activation stop working with probability p, i.e. some outputs become 0 with probability p, while the remaining values are scaled up by $\frac{1}{1-p}$. This makes the model generalize better, because it cannot rely too heavily on particular local features.
>>> import torch
>>> from torch import nn
>>> m = nn.Dropout(p=0.2)  # zero each element with probability 0.2, scale the rest by 1/(1-0.2)=1.25
>>> input = torch.arange(1., 11.)
>>> input
tensor([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.])
>>> output = m(input)
>>> output
tensor([ 1.2500,  0.0000,  3.7500,  5.0000,  6.2500,  7.5000,  8.7500, 10.0000,
        11.2500, 12.5000])
classifier(Linear)
The size then goes from [20, 768] to [20, 2] through this classifier (Linear) layer, giving the classification results for the 20 samples.