This post builds a training pipeline with the transformers models from the HuggingFace toolkit; it is often said that the core training code takes only about five lines. Today we put together a simple binary text classification task.
The training pipeline lives in demo.py, as follows:
(The original code is still being organized.)
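In the meantime, here is a minimal sketch of what such a demo.py can look like, reconstructed from the log output further down (bert-base-chinese, BertForSequenceClassification with 2 labels, DataParallel across 4 GPUs). The sample texts, hyperparameters, and variable names are assumptions for illustration, not the exact code used in this post.

import torch
import torch.nn as nn
from transformers import BertTokenizer, BertForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tokenizer and model: bert-base-chinese with a 2-label classification head.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)
if torch.cuda.device_count() > 1:
    print("Let's use %d GPUs!" % torch.cuda.device_count())
    model = nn.DataParallel(model)  # the log below was produced on 4 GPUs
model.to(device)

# Toy data for illustration; in a real run, load your own texts and labels here.
texts = ["这部电影真好看", "太难看了,浪费时间"]
labels = torch.tensor([1, 0]).to(device)
enc = tokenizer(texts, padding="max_length", truncation=True, max_length=32, return_tensors="pt")
input_ids = enc["input_ids"].to(device)
attention_mask = enc["attention_mask"].to(device)

# The "five lines" of core training code: forward, loss, backward, step.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(10):
    optimizer.zero_grad()
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    loss = outputs[0].mean()  # outputs[0] is the loss; .mean() handles the per-GPU scalars gathered by DataParallel
    loss.backward()
    optimizer.step()
    print("Epoch %d, loss = %.4f" % (epoch + 1, loss.item()))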
Key point: for a classification task, the training loss should decrease. Simply take a few samples, train for 100 epochs, and watch whether the loss converges. If it does, the training pipeline is correct; if not, you need to carefully check your code logic (be sure to forget your previous reasoning and walk through it again from scratch. We always tend to assume our own code is right, so discard that assumption and re-examine everything with fresh eyes; only then are you likely to spot the bug). A minimal sketch of this sanity check is given right after this paragraph.
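The sketch below illustrates the check; it assumes model, optimizer, and the tensors come from the demo.py sketch above and is not the author's original script.

def overfit_sanity_check(model, optimizer, input_ids, attention_mask, labels, epochs=100):
    # Train on only a handful of samples; a correctly wired pipeline drives the loss toward 0.
    model.train()
    for epoch in range(epochs):
        optimizer.zero_grad()
        loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)[0].mean()
        loss.backward()
        optimizer.step()
        if (epoch + 1) % 10 == 0:
            print("epoch %3d, loss = %.4f" % (epoch + 1, loss.item()))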
Running demo.py prints the following output:
Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
/data/deploy/wang/bertt/lib/python3.6/site-packages/transformers/tokenization_utils_base.py:1944: FutureWarning: The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).
FutureWarning,
Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-chinese and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Let's use 4 GPUs!
357
Epoch [1/10]
/data/deploy/wang/bertt/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:64: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
Iter: 0, Train Loss: 1.8, Train Acc: 14.84%, Val Loss: 1.9, Val Acc: 16.43%, Time: 0:00:43 *
Iter: 100, Train Loss: 0.64, Train Acc: 82.03%, Val Loss: 0.63, Val Acc: 79.52%, Time: 0:02:14 *
Iter: 200, Train Loss: 0.4, Train Acc: 85.94%, Val Loss: 0.31, Val Acc: 87.70%, Time: 0:03:45 *
Iter: 300, Train Loss: 0.34, Train Acc: 85.16%, Val Loss: 0.27, Val Acc: 88.87%, Time: 0:05:17 *
Epoch [2/10]
Iter: 400, Train Loss: 0.17, Train Acc: 92.19%, Val Loss: 0.26, Val Acc: 89.30%, Time: 0:06:48 *
Iter: 500, Train Loss: 0.24, Train Acc: 92.19%, Val Loss: 0.24, Val Acc: 90.36%, Time: 0:08:19 *
Iter: 600, Train Loss: 0.19, Train Acc: 92.19%, Val Loss: 0.24, Val Acc: 90.58%, Time: 0:09:51 *
Iter: 700, Train Loss: 0.18, Train Acc: 94.53%, Val Loss: 0.23, Val Acc: 90.55%, Time: 0:11:22 *
Epoch [3/10]
Iter: 800, Train Loss: 0.17, Train Acc: 93.75%, Val Loss: 0.24, Val Acc: 90.58%, Time: 0:12:52
Iter: 900, Train Loss: 0.15, Train Acc: 96.09%, Val Loss: 0.24, Val Acc: 90.62%, Time: 0:14:22
Iter: 1000, Train Loss: 0.27, Train Acc: 89.06%, Val Loss: 0.23, Val Acc: 90.76%, Time: 0:15:53 *
Epoch [4/10]
Iter: 1100, Train Loss: 0.15, Train Acc: 95.31%, Val Loss: 0.24, Val Acc: 91.02%, Time: 0:17:23
Iter: 1200, Train Loss: 0.12, Train Acc: 94.53%, Val Loss: 0.25, Val Acc: 90.82%, Time: 0:18:53
Iter: 1300, Train Loss: 0.059, Train Acc: 98.44%, Val Loss: 0.24, Val Acc: 90.87%, Time: 0:20:24
Iter: 1400, Train Loss: 0.16, Train Acc: 92.97%, Val Loss: 0.25, Val Acc: 90.72%, Time: 0:21:54
Epoch [5/10]
Iter: 1500, Train Loss: 0.094, Train Acc: 96.88%, Val Loss: 0.27, Val Acc: 91.02%, Time: 0:23:24
Iter: 1600, Train Loss: 0.12, Train Acc: 96.09%, Val Loss: 0.28, Val Acc: 90.70%, Time: 0:24:55
Iter: 1700, Train Loss: 0.13, Train Acc: 96.09%, Val Loss: 0.28, Val Acc: 90.84%, Time: 0:26:25
Epoch [6/10]
Iter: 1800, Train Loss: 0.053, Train Acc: 97.66%, Val Loss: 0.28, Val Acc: 90.85%, Time: 0:27:55
Iter: 1900, Train Loss: 0.075, Train Acc: 97.66%, Val Loss: 0.3, Val Acc: 91.24%, Time: 0:29:25
Iter: 2000, Train Loss: 0.097, Train Acc: 96.09%, Val Loss: 0.31, Val Acc: 90.45%, Time: 0:30:56
Iter: 2100, Train Loss: 0.041, Train Acc: 98.44%, Val Loss: 0.31, Val Acc: 90.58%, Time: 0:32:26
No optimization for a long time, auto-stopping...
train total time is : 1970.5445399284363
As we can see, the training loss converges, so the training pipeline is working correctly.
Next, check whether the trainable parameters actually change:
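A minimal sketch of how the dump below can be produced (an assumed reconstruction, not the original script; the "module." prefix in the parameter names comes from the nn.DataParallel wrapper):

def show_param(model, name="module.bert.encoder.layer.11.output.LayerNorm.bias"):
    # Print a single named parameter so it can be compared before and after training.
    for n, p in model.named_parameters():
        if n == name:
            print("----name : %s" % n)
            print("parameter : %s" % p)

show_param(model)   # before training
# ... run one training epoch here ...
show_param(model)   # after training: the values shift slightly, so gradients are flowing
for n, _ in model.named_parameters():
    if "classifier" in n:  # the randomly initialized head added on top of bert-base-chinese
        print(n)           # module.classifier.weight / module.classifier.bias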
Before training:
----name : module.bert.encoder.layer.11.output.LayerNorm.bias
parameter : Parameter containing:
tensor([-7.5394e-02, -5.3339e-02, -8.4475e-02, 1.3116e-01, -7.9671e-03,
2.0331e-01, -1.0460e-01, 9.0587e-02, -1.6743e-01, 1.6022e-02,
-4.2025e-02, -2.6452e-02, 5.9012e-02, -1.6719e-02, -2.2786e-01,
-3.5230e-02, 3.8843e-02, -6.1143e-02, -1.8396e-03, -5.0667e-02,
-9.7109e-02, 9.6873e-02, 4.0096e-02, -9.9374e-02, -3.5263e-02,
-6.1067e-02, -1.9375e-01, -1.2152e-01, -2.8524e-02, -1.5492e-01,
1.6237e-02, -8.2809e-02, -1.4949e-01, 1.4481e-01, -8.8371e-02,
After one epoch of training:
tensor([-7.5270e-02, -5.3138e-02, -8.5101e-02, 1.3082e-01, -8.3281e-03,
2.0305e-01, -1.0474e-01, 9.1127e-02, -1.6659e-01, 1.5437e-02,
-4.1256e-02, -2.7110e-02, 5.8693e-02, -1.5990e-02, -2.2822e-01,
-3.4954e-02, 3.8195e-02, -6.1271e-02, -1.4476e-03, -5.0210e-02,
-9.7067e-02, 9.6599e-02, 4.0670e-02, -9.8674e-02, -3.5899e-02,
-6.1480e-02, -1.9452e-01, -1.2167e-01, -2.8671e-02, -1.5562e-01,
module.classifier.weight
module.classifier.bias
Calling print(model) prints the model structure, as follows:
---models : BertForSequenceClassification(
(bert): BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(21128, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): BertEncoder(
(layer): ModuleList(
(0): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(1): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(2): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(3): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(4): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(5): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(6): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(7): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(8): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(9): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(10): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(11): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
(pooler): BertPooler(
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
)
)
(dropout): Dropout(p=0.1, inplace=False)
(classifier): Linear(in_features=768, out_features=2, bias=True)
)
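Note that the final classifier head is a single Linear(in_features=768, out_features=2) layer on top of the pooled output; these are exactly the classifier.weight and classifier.bias parameters that the loading messages above reported as newly initialized, which is why the model must be fine-tuned on the downstream task before it can be used for prediction.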