Verifying the Correctness of a HuggingFace Model Training Pipeline

1. Building the training pipeline

We build the training pipeline with the transformers library from HuggingFace; it is often said that the core training code takes only about five lines. Here we put together a simple binary text classification task.

The training script, demo.py, is built as follows:

(The script itself is still being organized; a hedged sketch of what it might look like is given below.)
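Since the original demo.py is not included yet, the following is only a minimal sketch, assuming the bert-base-chinese checkpoint named in the log below and a handful of invented example sentences; it is not the author's actual code.

# demo.py -- minimal sketch of a BERT fine-tuning pipeline for binary text
# classification. The checkpoint name bert-base-chinese matches the log below;
# the toy sentences and labels are invented for illustration only.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizer, BertForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)
model.to(device)

# Hypothetical toy data; replace with a real dataset.
texts = ["这部电影真好看", "质量太差了,不推荐", "服务很贴心", "完全是浪费钱"]
labels = [1, 0, 1, 0]

enc = tokenizer(texts, padding="max_length", truncation=True, max_length=32,
                return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
loader = DataLoader(dataset, batch_size=2, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    for input_ids, attention_mask, batch_labels in loader:
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids.to(device),
                        attention_mask=attention_mask.to(device),
                        labels=batch_labels.to(device))
        loss = outputs[0]  # when labels are passed, the first output is the loss
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}, last batch loss: {loss.item():.4f}")

The "five core lines" are essentially: load the tokenizer, load the model, tokenize, forward pass with labels, then backward plus optimizer step; everything else is data plumbing.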

2. Verifying that the loss converges and that the model parameters change

The key point: for a classification task, the training loss should go down. Simply take a few samples, train for on the order of 100 epochs, and see whether the loss converges. If it converges, the training pipeline is correct; otherwise you need to check your code logic carefully (and you must set aside your earlier reasoning and start over, because we all tend to assume our own code is right; drop that assumption and re-examine everything from a fresh perspective, or you may never spot the mistake).
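A quick way to run this check, reusing the hypothetical model, enc, labels, and optimizer from the sketch above: overfit the same handful of samples for about 100 epochs and confirm that the final loss is well below the initial one.

# Sanity check: overfit a few samples and confirm the loss actually drops.
# Reuses the hypothetical objects from the demo.py sketch above.
input_ids = enc["input_ids"].to(device)
attention_mask = enc["attention_mask"].to(device)
label_tensor = torch.tensor(labels).to(device)

losses = []
model.train()
for epoch in range(100):
    optimizer.zero_grad()
    loss = model(input_ids=input_ids, attention_mask=attention_mask,
                 labels=label_tensor)[0]
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
assert losses[-1] < losses[0], "loss did not decrease -- check the pipeline logic"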

Running demo.py prints the following:

Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
/data/deploy/wang/bertt/lib/python3.6/site-packages/transformers/tokenization_utils_base.py:1944: FutureWarning: The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).
  FutureWarning,
Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-chinese and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Let's use 4 GPUs!
357

Epoch [1/10]
/data/deploy/wang/bertt/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:64: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
Iter:      0,  Train Loss:   1.8,  Train Acc: 14.84%,  Val Loss:   1.9,  Val Acc: 16.43%,  Time: 0:00:43 *
Iter:    100,  Train Loss:  0.64,  Train Acc: 82.03%,  Val Loss:  0.63,  Val Acc: 79.52%,  Time: 0:02:14 *
Iter:    200,  Train Loss:   0.4,  Train Acc: 85.94%,  Val Loss:  0.31,  Val Acc: 87.70%,  Time: 0:03:45 *
Iter:    300,  Train Loss:  0.34,  Train Acc: 85.16%,  Val Loss:  0.27,  Val Acc: 88.87%,  Time: 0:05:17 *
Epoch [2/10]
Iter:    400,  Train Loss:  0.17,  Train Acc: 92.19%,  Val Loss:  0.26,  Val Acc: 89.30%,  Time: 0:06:48 *
Iter:    500,  Train Loss:  0.24,  Train Acc: 92.19%,  Val Loss:  0.24,  Val Acc: 90.36%,  Time: 0:08:19 *
Iter:    600,  Train Loss:  0.19,  Train Acc: 92.19%,  Val Loss:  0.24,  Val Acc: 90.58%,  Time: 0:09:51 *
Iter:    700,  Train Loss:  0.18,  Train Acc: 94.53%,  Val Loss:  0.23,  Val Acc: 90.55%,  Time: 0:11:22 *
Epoch [3/10]
Iter:    800,  Train Loss:  0.17,  Train Acc: 93.75%,  Val Loss:  0.24,  Val Acc: 90.58%,  Time: 0:12:52
Iter:    900,  Train Loss:  0.15,  Train Acc: 96.09%,  Val Loss:  0.24,  Val Acc: 90.62%,  Time: 0:14:22
Iter:   1000,  Train Loss:  0.27,  Train Acc: 89.06%,  Val Loss:  0.23,  Val Acc: 90.76%,  Time: 0:15:53 *
Epoch [4/10]
Iter:   1100,  Train Loss:  0.15,  Train Acc: 95.31%,  Val Loss:  0.24,  Val Acc: 91.02%,  Time: 0:17:23
Iter:   1200,  Train Loss:  0.12,  Train Acc: 94.53%,  Val Loss:  0.25,  Val Acc: 90.82%,  Time: 0:18:53
Iter:   1300,  Train Loss: 0.059,  Train Acc: 98.44%,  Val Loss:  0.24,  Val Acc: 90.87%,  Time: 0:20:24
Iter:   1400,  Train Loss:  0.16,  Train Acc: 92.97%,  Val Loss:  0.25,  Val Acc: 90.72%,  Time: 0:21:54
Epoch [5/10]
Iter:   1500,  Train Loss: 0.094,  Train Acc: 96.88%,  Val Loss:  0.27,  Val Acc: 91.02%,  Time: 0:23:24
Iter:   1600,  Train Loss:  0.12,  Train Acc: 96.09%,  Val Loss:  0.28,  Val Acc: 90.70%,  Time: 0:24:55
Iter:   1700,  Train Loss:  0.13,  Train Acc: 96.09%,  Val Loss:  0.28,  Val Acc: 90.84%,  Time: 0:26:25
Epoch [6/10]
Iter:   1800,  Train Loss: 0.053,  Train Acc: 97.66%,  Val Loss:  0.28,  Val Acc: 90.85%,  Time: 0:27:55
Iter:   1900,  Train Loss: 0.075,  Train Acc: 97.66%,  Val Loss:   0.3,  Val Acc: 91.24%,  Time: 0:29:25
Iter:   2000,  Train Loss: 0.097,  Train Acc: 96.09%,  Val Loss:  0.31,  Val Acc: 90.45%,  Time: 0:30:56
Iter:   2100,  Train Loss: 0.041,  Train Acc: 98.44%,  Val Loss:  0.31,  Val Acc: 90.58%,  Time: 0:32:26
No optimization for a long time, auto-stopping...
train total time is : 1970.5445399284363

As the log shows, the training loss converges, so the training pipeline is working correctly.

Next, check whether the trained parameters actually change:

Before training:

----name : module.bert.encoder.layer.11.output.LayerNorm.bias
 parameter : Parameter containing:
tensor([-7.5394e-02, -5.3339e-02, -8.4475e-02,  1.3116e-01, -7.9671e-03,
         2.0331e-01, -1.0460e-01,  9.0587e-02, -1.6743e-01,  1.6022e-02,
        -4.2025e-02, -2.6452e-02,  5.9012e-02, -1.6719e-02, -2.2786e-01,
        -3.5230e-02,  3.8843e-02, -6.1143e-02, -1.8396e-03, -5.0667e-02,
        -9.7109e-02,  9.6873e-02,  4.0096e-02, -9.9374e-02, -3.5263e-02,
        -6.1067e-02, -1.9375e-01, -1.2152e-01, -2.8524e-02, -1.5492e-01,
         1.6237e-02, -8.2809e-02, -1.4949e-01,  1.4481e-01, -8.8371e-02,

After one epoch of training:
tensor([-7.5270e-02, -5.3138e-02, -8.5101e-02,  1.3082e-01, -8.3281e-03,
         2.0305e-01, -1.0474e-01,  9.1127e-02, -1.6659e-01,  1.5437e-02,
        -4.1256e-02, -2.7110e-02,  5.8693e-02, -1.5990e-02, -2.2822e-01,
        -3.4954e-02,  3.8195e-02, -6.1271e-02, -1.4476e-03, -5.0210e-02,
        -9.7067e-02,  9.6599e-02,  4.0670e-02, -9.8674e-02, -3.5899e-02,
        -6.1480e-02, -1.9452e-01, -1.2167e-01, -2.8671e-02, -1.5562e-01,

The newly initialized classification-head parameters (the module. prefix comes from the DataParallel wrapper) likewise change during training:

module.classifier.weight
module.classifier.bias
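The comparison above was done by eye. A small sketch of how it could be automated, again assuming the hypothetical objects from the earlier sketches: snapshot every named parameter before training, then report which ones changed afterwards.

# Snapshot every parameter before training, then report which ones changed.
before = {name: p.detach().clone() for name, p in model.named_parameters()}

# ... run training here (e.g. one epoch of the loop from the demo.py sketch) ...

changed = [name for name, p in model.named_parameters()
           if not torch.equal(before[name], p.detach())]
unchanged = [name for name, p in model.named_parameters()
             if torch.equal(before[name], p.detach())]
print(f"{len(changed)} parameters changed, {len(unchanged)} unchanged")

# Inspect one specific tensor, as in the printout above. Note: when the model
# is wrapped in nn.DataParallel ("Let's use 4 GPUs!"), the names gain a
# "module." prefix; on the bare model the key is:
print(model.state_dict()["bert.encoder.layer.11.output.LayerNorm.bias"][:10])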

3. Printing the model structure

Calling print(model) prints the model structure, as shown below:


---models : BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(21128, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (1): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (2): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (3): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (4): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (5): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (6): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (7): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (8): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (9): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (10): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (11): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
)
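
Beyond print(model), the printed structure can also be summarized programmatically; a small sketch (same hypothetical model object as above) that counts parameters and lists the top-level submodules:

# Summarize the structure printed above: parameter counts and top-level modules.
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total params: {total:,}, trainable params: {trainable:,}")

# Top-level children correspond to (bert), (dropout), (classifier) above.
for name, module in model.named_children():
    print(name, type(module).__name__)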
