Running text classification with Hugging Face Transformers

The Hugging Face team has updated Transformers again: it now includes DistilRoBERTa, DistilBERT, and ALBERT, three models worth comparing against the existing ones. So how do we run them?

First, go to GitHub and search for transformers:

https://github.com/huggingface/transformers

Open this repo.

 

git clone it, or download it as an archive.

 

Then open the repo in PyCharm or another editor.

https://github.com/huggingface/transformers/tree/master/examples

Select run_glue.py in the examples folder.

 

Find the __main__ block at the bottom, cut all of its code out, and wrap it in a new function main() that takes two parameters, model and task (the dataset).

task is the name of the dataset, which is also the name of the folder the data lives in; model is the model type. The most important part here is the command-line arguments. Since we do not want to type them on the command line, we can pass a default value to each parser.add_argument call and set required=False, so every argument gets a usable default.

Next we set the data dir, the training batch size, and the number of epochs.

 

def main(model, task):

    parser = argparse.ArgumentParser()
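    # model_to_dir is a small dict (defined later in this post) that maps a model
    # type to either a local checkpoint directory or a hub shortcut name.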
    model_dir = model_to_dir[model]
    ## Required parameters
    data_dir = '/home/socialbird/Downloads/transformers-master/examples/glue_data/{}'.format(task)
    #task = 'RTE'
    train_bs = 8
    eps = 3.0
    parser.add_argument("--data_dir", default=data_dir, type=str, required=False,
                        help="The input data dir. Should contain the .tsv files (or other data files) for the task.")
    parser.add_argument("--model_type", default=model, type=str, required=False,
                        help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()))
    parser.add_argument("--model_name_or_path", default=model_dir, type=str, required=False,
                        help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS))
    parser.add_argument("--task_name", default=task, type=str, required=False,
                        help="The name of the task to train selected in the list: " + ", ".join(processors.keys()))
    parser.add_argument("--output_dir", default='output', type=str, required=False,
                        help="The output directory where the model predictions and checkpoints will be written.")

    ## Other parameters
    parser.add_argument("--config_name", default="", type=str,
                        help="Pretrained config name or path if not the same as model_name")
    parser.add_argument("--tokenizer_name", default="", type=str,
                        help="Pretrained tokenizer name or path if not the same as model_name")
    parser.add_argument("--cache_dir", default="", type=str,
                        help="Where do you want to store the pre-trained models downloaded from s3")
    parser.add_argument("--max_seq_length", default=128, type=int,
                        help="The maximum total input sequence length after tokenization. Sequences longer "
                             "than this will be truncated, sequences shorter will be padded.")
    parser.add_argument("--do_train", action='store_true', default=True,
                        help="Whether to run training.")
    parser.add_argument("--do_eval", action='store_true',default=True,
                        help="Whether to run eval on the dev set.")
    parser.add_argument("--evaluate_during_training", action='store_true',default=True,
                        help="Run evaluation during training at each logging step.")
    parser.add_argument("--do_lower_case", action='store_true',
                        help="Set this flag if you are using an uncased model.")

    parser.add_argument("--per_gpu_train_batch_size", default=train_bs, type=int,
                        help="Batch size per GPU/CPU for training.")
    parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int,
                        help="Batch size per GPU/CPU for evaluation.")
    parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
                        help="Number of updates steps to accumulate before performing a backward/update pass.")     
    parser.add_argument("--learning_rate", default=5e-5, type=float,
                        help="The initial learning rate for Adam.")
    parser.add_argument("--weight_decay", default=0.0, type=float,
                        help="Weight decay if we apply some.")
    parser.add_argument("--adam_epsilon", default=1e-8, type=float,
                        help="Epsilon for Adam optimizer.")
    parser.add_argument("--max_grad_norm", default=1.0, type=float,
                        help="Max gradient norm.")
    parser.add_argument("--num_train_epochs", default=eps, type=float,
                        help="Total number of training epochs to perform.")
    parser.add_argument("--max_steps", default=-1, type=int,
                        help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
    parser.add_argument("--warmup_steps", default=0, type=int,
                        help="Linear warmup over warmup_steps.")

    parser.add_argument('--logging_steps', type=int, default=200,
                        help="Log every X updates steps.")
    parser.add_argument('--save_steps', type=int, default=500,
                        help="Save checkpoint every X updates steps.")
    parser.add_argument("--eval_all_checkpoints", action='store_true',
                        help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number")
    parser.add_argument("--no_cuda", action='store_true', default=False,
                        help="Avoid using CUDA when available")
    parser.add_argument('--overwrite_output_dir', action='store_true',default=True,
                        help="Overwrite the content of the output directory")
    parser.add_argument('--overwrite_cache', action='store_true',
                        help="Overwrite the cached training and evaluation sets")
    parser.add_argument('--seed', type=int, default=42,
                        help="random seed for initialization")

    parser.add_argument('--fp16', action='store_true',
                        help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
    parser.add_argument('--fp16_opt_level', type=str, default='O1',
                        help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
                             "See details at https://nvidia.github.io/apex/amp.html")
    parser.add_argument("--local_rank", type=int, default=-1,
                        help="For distributed training: local_rank")
    parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.")
    parser.add_argument('--server_port', type=str, default='', help="For distant debugging.")
    args = parser.parse_args()

 

Next, define model_to_dir. This is a dictionary I use to set the model path for each model type; you can skip it entirely. Running the script loads BERT or another model, so you have to specify the model type and the model's location (model_name_or_path). If you have not downloaded a model in advance, you can simply put the model's shortcut name here, such as roberta-base.

    model_to_dir = {
        'distilbert': 'distilbert-base-uncased',
        'distilroberta': MODEL_DIRS['distilroberta'],
        'albert': 'albert-base-v2',
        'bert': MODEL_DIRS['bert-base'],
        'roberta': 'roberta-base',
        'camembert': 'camembert-base',
        'xlm': 'xlm-mlm-ende-1024',
        'xlnet': 'xlnet-base-cased'
    }
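MODEL_DIRS is not defined anywhere in this post; judging by the name, it maps model names to locally downloaded checkpoint directories. If you want to keep the same idea, a hypothetical version (defined above model_to_dir; the paths are placeholders) could look like this:

# Hypothetical example only: each name points at a local directory holding the
# downloaded config, vocab and weight files for that model.
MODEL_DIRS = {
    'distilroberta': '/path/to/distilroberta-base',
    'bert-base': '/path/to/bert-base-uncased',
}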

We also need a processor. The script data/processors/glue.py already defines a number of processors that we can use directly, such as the RTE processor.

To use RTE, we need to know its registered task name, i.e. the standard name of the dataset. It is listed at the bottom of that script:


glue_processors = {
    "cola": ColaProcessor,
    "mnli": MnliProcessor,
    "mnli-mm": MnliMismatchedProcessor,
    "mrpc": MrpcProcessor,
    "sst-2": Sst2Processor,
    "sts-b": StsbProcessor,
    "qqp": QqpProcessor,
    "qnli": QnliProcessor,
    "rte": RteProcessor,
    "wnli": WnliProcessor,
}
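The keys of this dict are the valid values for task_name (run_glue.py lower-cases task_name before looking it up, so passing 'RTE' also works). A quick way to check what a task's processor provides, assuming the same transformers version (which exposes glue_processors and glue_output_modes at the package level):

from transformers import glue_processors, glue_output_modes

task = "rte"                          # must be a key of glue_processors
processor = glue_processors[task]()   # -> an RteProcessor instance
print(processor.get_labels())         # ['entailment', 'not_entailment']
print(glue_output_modes[task])        # 'classification'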

 

We also need the data itself: grab the download_glue_data script, which can be found in the W4ngatang repo on GitHub. It is usually run with something like python download_glue_data.py --data_dir glue_data --tasks all (check that repo's README for the exact usage).

Finally, run run_glue.py and we are done.
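Since every argument now has a default, the entry point only needs to call the new main() with a model and a task; a minimal sketch (the values below are just examples):

if __name__ == "__main__":
    # Any model type from model_to_dir and any task from glue_processors
    # (with its data downloaded into data_dir) should work here.
    main(model='distilroberta', task='RTE')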

 
