BERT Testing in Practice

Runtime environment: NF5288M5

Runtime environment: NF5488M5

Google BERT pre-training task

https://www.jianshu.com/p/22e462f01d8c

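Sample generation

The BERT project ships a script for generating pre-training data. Its inputs are the training corpus (a plain-text file) and the vocabulary; its output is a TFRecord file. The corpus must contain one sentence per line, with a blank line between paragraphs or documents.

# Input file format:

(1) One sentence per line. These should ideally be actual sentences, not entire paragraphs or arbitrary spans of text. (Because we use the sentence boundaries for the "next sentence prediction" task).

(2) Blank lines between documents. Document boundaries are needed so that the "next sentence prediction" task doesn't span between documents.

The script loads the entire input file into memory before processing it, so if the file is too large, run the script several times on separate pieces to produce multiple TFRecord files.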
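
One way to prepare such pieces is to cut the corpus only at blank lines, so that no document is split across files. The following is a minimal sketch under the input format described above; `read_documents`, `split_corpus`, `shard_to_text` and the `max_lines` budget are hypothetical names for illustration, not part of the BERT repository.

```python
from typing import Iterable, Iterator, List


def read_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield documents as lists of sentences; blank lines separate documents."""
    doc: List[str] = []
    for raw in lines:
        sentence = raw.strip()
        if sentence:
            doc.append(sentence)
        elif doc:
            yield doc
            doc = []
    if doc:
        yield doc


def split_corpus(lines: Iterable[str], max_lines: int) -> List[List[List[str]]]:
    """Group whole documents into shards of roughly max_lines sentences each."""
    shards: List[List[List[str]]] = []
    current: List[List[str]] = []
    count = 0
    for doc in read_documents(lines):
        # Start a new shard when adding this document would exceed the budget.
        if current and count + len(doc) > max_lines:
            shards.append(current)
            current, count = [], 0
        current.append(doc)
        count += len(doc)
    if current:
        shards.append(current)
    return shards


def shard_to_text(shard: List[List[str]]) -> str:
    """Serialize a shard: one sentence per line, blank line between documents."""
    return "\n\n".join("\n".join(doc) for doc in shard) + "\n"
```

Each shard written with `shard_to_text` can then be passed to the script as its own `--input_file`, with a different `--output_file` per run.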

python create_pretraining_data.py \
  --input_file=./sample_text.txt \
  --output_file=/tmp/tf_examples.tfrecord \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5

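While generating the data I kept getting a warning that the generated data was empty. Stepping through the code line by line showed that the input file could not be read, even though the path itself was correct. The cause turned out to be an extra space at the end of the file name.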
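
A small guard run before the script would have caught this earlier. This is only a sketch; `check_input_path` is a hypothetical helper, not something the BERT repository provides.

```python
import os


def check_input_path(path: str) -> str:
    """Return the path unchanged if it is usable, otherwise raise with a hint."""
    # A stray space is invisible in most terminals, so report it with repr().
    if path != path.strip():
        raise ValueError(f"path has leading/trailing whitespace: {path!r}")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"input file not found: {path!r}")
    return path
```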

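Training steps

To train from scratch, omit the init_checkpoint flag. The flags below are: input_file, the pre-training dataset, i.e. the TFRecord file produced in the previous step; output_dir, the directory for logs and model checkpoints; do_train and do_eval, whether to run training and evaluation, at least one of which must be True; bert_config_file, the parameters needed to build the BERT model, a JSON file included with the downloaded model; init_checkpoint, the starting point for training. The remaining flags are the batch size, the maximum sequence length, the maximum number of masked predictions per sequence, the number of training steps, the number of learning-rate warmup steps, and the initial learning rate.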

python run_pretraining.py \
  --input_file=/tmp/tf_examples.tfrecord \
  --output_dir=/tmp/pretraining_output \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=20 \
  --num_warmup_steps=10 \
  --learning_rate=2e-5
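
To see how num_train_steps, num_warmup_steps and learning_rate interact, here is a sketch of the schedule. It assumes linear warmup from zero followed by linear decay to zero over the whole run, which is my understanding of the repository's optimizer code; treat it as an illustration rather than the exact implementation. `learning_rate_at` is a hypothetical name.

```python
def learning_rate_at(step: int, init_lr: float, num_train_steps: int,
                     num_warmup_steps: int) -> float:
    """Learning rate at a given global step (assumed warmup + linear decay)."""
    if num_warmup_steps > 0 and step < num_warmup_steps:
        # Warmup: ramp linearly from 0 towards init_lr.
        return init_lr * step / num_warmup_steps
    # After warmup: linear decay to 0, measured from step 0 of the run.
    step = min(step, num_train_steps)
    return init_lr * (1.0 - step / num_train_steps)


if __name__ == "__main__":
    # Values taken from the command above.
    for s in (0, 5, 9, 10, 15, 20):
        print(s, learning_rate_at(s, 2e-5, 20, 10))
```

With only 20 training steps, half of the run is spent in warmup, so this configuration is a smoke test rather than a real training schedule.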

1) The input is data in TFRecord format, which can be generated with the script from the sample generation step.

Each TFRecord example contains:

input_ids: [101, 1131, 3090, 1106, 9416, 1103, 18127, 103, 117, 1115, 103, 1821, 170, 14798, 103, 4267, 20394, 1785, 2111, 103, 102, 170, 4984, 2851, 117, 178, 1821, 117, 6442, 106, 112, 1598, 1119, 103, 8228, 8788, 103, 117, 15992, 103, 8290, 3472, 118, 118, 112, 103, 4984, 2851, 106, 103, 118, 4984, 117, 1191, 1103, 103, 1104, 103, 103, 2621, 1104, 1103, 27466, 17893, 117, 1621, 1103, 16358, 5700, 1104, 1103, 2211, 1362, 118, 118, 5750, 117, 1256, 1154, 103, 16358, 1403, 118, 15398, 2111, 119, 1218, 117, 1170, 1155, 117, 178, 6111, 1437, 1128, 1103, 1236, 1106, 1103, 19026, 112, 188, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Input_mask:: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Segement id:: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

masked_lm_positions: [4, 7, 10, 14, 16, 19, 33, 36, 39, 45, 49, 55, 57, 58, 79, 92, 0, 0, 0, 0]

masked_lm_ids: [1143, 1864, 178, 1104, 6871, 119, 117, 1193, 1117, 170, 118, 14931, 5027, 1209, 1103, 1209, 0, 0, 0, 0]

masked_lm_weights: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]

next_sentence_labels: 0
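
The masked positions are drawn at random. The sketch below is a simplified version of the masking rule described in the BERT paper (about 15% of tokens selected; of those, 80% replaced by [MASK], 10% by a random token, 10% left unchanged). It is not the repository's exact code, and `mask_tokens` is a hypothetical name. The rule would explain why some masked positions in the sample above, such as position 4, hold an id other than 103 ([MASK]).

```python
import random
from typing import List, Tuple

CLS_ID, SEP_ID, MASK_ID = 101, 102, 103  # ids as they appear in the sample above


def mask_tokens(input_ids: List[int], vocab_size: int, rng: random.Random,
                masked_lm_prob: float = 0.15,
                max_predictions_per_seq: int = 20
                ) -> Tuple[List[int], List[int], List[int]]:
    """Return (masked ids, masked positions, original ids at those positions).

    input_ids is the unpadded sequence, including [CLS] and [SEP].
    """
    # [CLS] and [SEP] are never selected for prediction.
    candidates = [i for i, t in enumerate(input_ids) if t not in (CLS_ID, SEP_ID)]
    rng.shuffle(candidates)
    num_to_predict = min(max_predictions_per_seq,
                         max(1, int(round(len(input_ids) * masked_lm_prob))))
    positions = sorted(candidates[:num_to_predict])
    output = list(input_ids)
    for pos in positions:
        draw = rng.random()
        if draw < 0.8:
            output[pos] = MASK_ID                    # 80%: replace with [MASK]
        elif draw < 0.9:
            output[pos] = rng.randrange(vocab_size)  # 10%: random token id
        # remaining 10%: keep the original token
    labels = [input_ids[pos] for pos in positions]
    return output, positions, labels
```

For a sequence of 104 real tokens and masked_lm_prob=0.15, this formula selects 16 positions, which matches the 16 non-zero entries of masked_lm_weights in the sample; the remaining 4 slots up to max_predictions_per_seq=20 are padding.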

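2) Batch the data with tf.data; the function is input_fn_builder.

3) Build the model; its inputs are input_ids, input_mask and segment_ids.

4) Embedding lookup and embedding post-processing produce a tensor of shape batch * seq_len * emb_size.

5) transformer_model produces sequence_out of shape batch * seq_len * hidden_size. For sentence-level tasks such as classification, pool_out of shape batch * hidden_size can be used instead.

6) Compute the masked LM loss: from sequence_out, obtain a per-example loss of shape batch_size * masked_lm_ids_length, then reduce it to a scalar total loss.

7) Compute the next sentence loss: from pool_out, obtain a batch_size * 2 classification output and a per-example loss of shape batch_size * 1, then reduce it to a scalar total loss.

8) The optimizer back-propagates the loss to compute gradients and updates the weights according to the learning rate.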
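
Step 6 can be sketched in plain Python as follows. The sketch assumes the scalar loss is the negative log-likelihood averaged over real prediction slots, with masked_lm_weights zeroing out the padded ones; the 1e-5 term is a small constant to avoid division by zero. Function names are hypothetical and the real code operates on tensors, not lists.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one row of vocabulary scores."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - log_z for x in logits]


def masked_lm_loss(logits: List[List[float]], label_ids: List[int],
                   label_weights: List[float]) -> float:
    """Weighted mean NLL; logits holds one row per prediction slot (flattened)."""
    per_example = [-log_softmax(row)[label]
                   for row, label in zip(logits, label_ids)]
    numerator = sum(w * loss for w, loss in zip(label_weights, per_example))
    denominator = sum(label_weights) + 1e-5
    return numerator / denominator
```

Because padded slots carry weight 0.0, whatever the model predicts there does not affect the loss.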

Pre-training speed 1

NF5288M5: 96-100 examples/s

Pre-training speed 2

NF5488M5: 107-112 examples/s
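
These throughput figures translate into wall-clock time with simple arithmetic. The sketch below is illustrative only: the step count and batch size passed in are hypothetical inputs, and the measured rates may not hold for other sequence lengths or batch sizes.

```python
def training_hours(num_steps: int, batch_size: int, examples_per_sec: float) -> float:
    """Estimated hours to process num_steps batches at a given throughput."""
    return num_steps * batch_size / examples_per_sec / 3600.0


if __name__ == "__main__":
    # Hypothetical run: 90,000 steps at batch size 32.
    for name, rate in (("NF5288M5", 96.0), ("NF5488M5", 112.0)):
        print(name, round(training_hours(90_000, 32, rate), 2), "hours")
```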

BERT model construction

https://www.jianshu.com/p/d7ce41b58801

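(1) Model configuration

The model configuration is straightforward. In order, the fields are: vocabulary size, hidden size, number of transformer layers, number of attention heads, activation function, intermediate (feed-forward) layer size, hidden dropout rate, attention dropout rate, maximum sequence length, vocabulary size of token_type_ids, and the stddev of the truncated_normal_initializer.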
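
As an illustration, the fields can be collected in a small dataclass that loads from a bert_config.json-style string. `BertConfigSketch` is a hypothetical stand-in, not the repository's class, and the default values are assumptions for a Base-sized model that should be checked against the JSON file shipped with the downloaded checkpoint.

```python
import json
from dataclasses import dataclass


@dataclass
class BertConfigSketch:
    vocab_size: int                              # vocabulary size
    hidden_size: int = 768                       # hidden units per position
    num_hidden_layers: int = 12                  # number of transformer layers
    num_attention_heads: int = 12                # attention heads per layer
    hidden_act: str = "gelu"                     # activation function
    intermediate_size: int = 3072                # feed-forward layer size
    hidden_dropout_prob: float = 0.1             # dropout on hidden layers
    attention_probs_dropout_prob: float = 0.1    # dropout inside attention
    max_position_embeddings: int = 512           # maximum sequence length
    type_vocab_size: int = 2                     # vocabulary size of token_type_ids
    initializer_range: float = 0.02              # stddev of the initializer

    def __post_init__(self) -> None:
        # Each head gets hidden_size / num_attention_heads dimensions.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be a multiple of num_attention_heads")

    @classmethod
    def from_json(cls, text: str) -> "BertConfigSketch":
        """Build a config from the contents of a JSON config file."""
        return cls(**json.loads(text))
```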

(2) Word embedding

(3) Embedding post-processing (adding position and token type information)

(4) Building the attention mask

(5) Attention layer (multi-head attention)

(6) Transformer

(7) Building the BERT model
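
Steps (4) and (5) are easiest to follow on a tiny example. The sketch below handles a single sequence and a single head in plain Python; the function names are hypothetical, and adding -10000.0 to masked scores is, to my understanding, the approach used in the repository, but treat the constant as an assumption.

```python
import math
from typing import List


def attention_mask_from_input_mask(input_mask: List[int]) -> List[List[int]]:
    """Expand [seq_len] to [seq_len, seq_len]: every query row sees the same keys."""
    return [list(input_mask) for _ in input_mask]


def masked_attention_probs(scores: List[List[float]],
                           attention_mask: List[List[int]]) -> List[List[float]]:
    """Softmax over keys, with padding positions pushed to near-zero probability."""
    probs = []
    for score_row, mask_row in zip(scores, attention_mask):
        # Add a large negative number where the mask is 0.
        adjusted = [s + (1.0 - m) * -10000.0 for s, m in zip(score_row, mask_row)]
        top = max(adjusted)
        exps = [math.exp(a - top) for a in adjusted]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs
```

For example, with input_mask [1, 1, 0] and all-zero scores, each query splits its attention evenly between the first two positions and gives essentially nothing to the padded third one.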

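Results returned by the BERT model

BERT's main flow is to compute the embeddings first (including the position and token_type embeddings) and then call the transformer to get the output. The embedding output, the embedding_table, the outputs of all transformer layers, the output of the last transformer layer, and pooled_output are all available, for use in fine-tuning for transfer learning and in prediction tasks.

BERT uses only the encoder part of the transformer; there is no decoder. This is because the model exists purely for pre-training, and pre-training is done with language-model objectives, unlike other task-specific NLP work. A decoder can be added as needed when doing transfer learning.

Because there is no decoder, the attention function also drops a lot of functionality it does not need.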

BERT pre-training tips

1) Masked LM and next sentence prediction loss

```
***** Eval results *****
  global_step = 20
  loss = 0.0979674
  masked_lm_accuracy = 0.985479
  masked_lm_loss = 0.0979328
  next_sentence_accuracy = 1.0
  next_sentence_loss = 3.45724e-05
```
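
The reported loss appears to be the sum of the two task losses. A quick check with the numbers from the log above (variable names are just for illustration):

```python
masked_lm_loss = 0.0979328
next_sentence_loss = 3.45724e-05
reported_loss = 0.0979674

total = masked_lm_loss + next_sentence_loss
# The two values agree to within the precision printed in the log.
print(f"sum = {total:.7f}, reported = {reported_loss:.7f}")
```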

2) When switching to your own vocabulary, make sure vocab_size in the config matches the size of the new vocabulary.
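
A quick consistency check before training can catch a mismatch. This sketch assumes vocab.txt holds one token per line and that the config is a JSON object with a vocab_size key; `check_vocab_size` is a hypothetical helper.

```python
import json


def count_vocab_entries(vocab_text: str) -> int:
    """Count tokens in the contents of vocab.txt (one token per line)."""
    return len(vocab_text.splitlines())


def check_vocab_size(vocab_text: str, config_json: str) -> int:
    """Raise if the vocabulary and the config disagree; return the size otherwise."""
    expected = json.loads(config_json)["vocab_size"]
    actual = count_vocab_entries(vocab_text)
    if actual != expected:
        raise ValueError(
            f"vocab.txt has {actual} entries but config says vocab_size={expected}")
    return actual
```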

3) Start training from an existing checkpoint when working with a domain-specific corpus, for example movie-review analysis.

4) The learning rate we used in the paper was 1e-4. However, if you are doing additional steps of pre-training starting from an existing BERT checkpoint, you should use a smaller learning rate (e.g., 2e-5).

5) Longer sequences are disproportionately expensive because attention is quadratic to the sequence length. In other words, a batch of 64 sequences of length 512 is much more expensive than a batch of 256 sequences of length 128. The fully-connected/convolutional cost is the same, but the attention cost is far greater for the 512-length sequences. Therefore, one good recipe is to pre-train for, say, 90,000 steps with a sequence length of 128 and then for 10,000 additional steps with a sequence length of 512. The very long sequences are mostly needed to learn positional embeddings, which can be learned fairly quickly. Note that this does require generating the data twice with different values of `max_seq_length`.
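
The comparison in this tip can be made concrete with a rough cost model. It only counts tokens (which drive the fully-connected cost) and query-key pairs (which drive the attention cost); it ignores constants, hidden size and number of heads, and `relative_costs` is a hypothetical name.

```python
from typing import Dict


def relative_costs(batch_size: int, seq_length: int) -> Dict[str, int]:
    """Rough per-batch work: tokens for dense layers, pairs for attention."""
    return {
        "tokens": batch_size * seq_length,
        "attention_pairs": batch_size * seq_length ** 2,
    }


if __name__ == "__main__":
    long_batch = relative_costs(64, 512)
    short_batch = relative_costs(256, 128)
    print("same token count:", long_batch["tokens"] == short_batch["tokens"])
    print("attention ratio:",
          long_batch["attention_pairs"] / short_batch["attention_pairs"])
```

Both batches contain the same number of tokens, but under this model the 512-length batch does four times as much attention work.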

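6) Is this code compatible with Cloud TPUs? What about GPUs?

Yes, all of the code in this repository works out-of-the-box with CPU, GPU, and Cloud TPU. However, GPU training is single-GPU only.

7) Why choose the BERT-Base, Uncased model? There are three reasons. 1. The training corpus is English, so the Chinese and multilingual models are ruled out. 2. Hardware is limited: if your GPU has less than 16 GB of memory, stick with Base and don't bother with Large. 3. Cased means the model distinguishes upper and lower case; uncased means it does not. Unless you know that your task is case-sensitive (for example named entity recognition or part-of-speech tagging), uncased usually works better.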

Understanding and Testing BERT and Transformer (Part 2)