How to fine-tune BERT on your own dataset

This experiment runs in a Colab environment.

Here we train BERT on a text matching task; the training data is the Ant Financial text matching dataset.
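To use your own data, you need sentence pairs with a 0/1 label. Below is a minimal sketch for reading such pairs and splitting them into train and dev sets. The column layout (id, sentence 1, sentence 2, label, separated by tabs) and the helper names `parse_pairs` and `split_pairs` are assumptions made for illustration; check the data processor in the training script for the exact format it expects.

```python
import csv
import io
import random


def parse_pairs(text):
    """Parse tab-separated lines assumed to be: id, sentence1, sentence2, label."""
    pairs = []
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip malformed lines and header rows (label must be "0" or "1").
        if len(row) != 4 or row[3].strip() not in ("0", "1"):
            continue
        _, s1, s2, label = row
        pairs.append((s1.strip(), s2.strip(), int(label)))
    return pairs


def split_pairs(pairs, dev_ratio=0.1, seed=42):
    """Shuffle deterministically and return (train, dev)."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_ratio))
    return shuffled[n_dev:], shuffled[:n_dev]


if __name__ == "__main__":
    # Made-up sample lines, not taken from the real dataset.
    sample = "1\thow do I reset my password\thow can I change my password\t1\n"
    sample += "2\thow do I reset my password\twhere is my invoice\t0\n"
    train, dev = split_pairs(parse_pairs(sample), dev_ratio=0.5)
    print(len(train), len(dev))
```

A fixed seed keeps the split reproducible across Colab sessions, which makes it easier to compare runs.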

Download the code and set up the environment

!git clone https://github.com/BonnieHuangxin/Bert_sentence_similarity.git
!mv Bert_sentence_similarity/* ./
!pip install -r sentence_similarity_Bert/requirements.txt
!mv sentence_similarity_Bert/examples/* sentence_similarity_Bert/

Convert the TensorFlow BERT model to a PyTorch model

!git clone https://github.com/xieyufei1993/Bert-Pytorch-Chinese-TextClassification.git
!mv Bert-Pytorch-Chinese-TextClassification/* ./

Download the pretrained Chinese BERT model (TensorFlow format)

!wget https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
!unzip chinese_L-12_H-768_A-12.zip
# Convert the TensorFlow checkpoint to a PyTorch model
!python convert_tf_to_pytorch/convert_tf_checkpoint_to_pytorch.py \
  --tf_checkpoint_path chinese_L-12_H-768_A-12/bert_model.ckpt \
  --bert_config_file chinese_L-12_H-768_A-12/bert_config.json \
  --pytorch_dump_path chinese_L-12_H-768_A-12/pytorch_model.bin
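Before training, it can help to confirm that the model directory contains everything the loader will look for. The sketch below only checks for files on disk. The file list is an assumption: `bert_config.json` and `vocab.txt` ship in the downloaded zip, and `pytorch_model.bin` is the `--pytorch_dump_path` used above. The helper name `missing_model_files` is hypothetical.

```python
import os

# Assumed file names: two from the downloaded zip, one written by the conversion step.
EXPECTED_FILES = ("bert_config.json", "vocab.txt", "pytorch_model.bin")


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected files that are absent or empty in model_dir."""
    missing = []
    for name in expected:
        path = os.path.join(model_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # An empty list means all expected files were found.
    print(missing_model_files("chinese_L-12_H-768_A-12"))
```

If `pytorch_model.bin` shows up as missing, the conversion step most likely failed and should be rerun before starting training.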

Training

# The trained model is saved to the directory given by --output_dir
!python sentence_similarity_Bert/run_classifier_modify2.py \
  --data_dir=sentence_similarity_Bert/chinese_data \
  --bert_model=chinese_L-12_H-768_A-12 \
  --task_name=mrpc \
  --output_dir=/home/tmp/sim_model \
  --do_train \
  --train_batch_size=32

Prediction

!python sentence_similarity_Bert/run_classifier_class.py
