How to fix the "Assertion 'srcIndex < srcSelectDimSize' failed" error when pretraining your own model with Hugging Face's transformers

Honestly, this pitfall left me speechless: it was entirely my own carelessness, and it cost me almost an hour of debugging. I am certainly not the last person who will run into it, so here is a short write-up in the hope that it saves someone else the trouble.

Hugging Face's transformers is currently one of the most capable frameworks for working with pretrained Transformer-family models: https://github.com/huggingface/transformers

In this notebook they show how to pretrain a model on your own corpus: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb

The tutorial itself is carefully written, but I still managed to dig myself a hole through sheer carelessness:

Following the steps as written, I hit an out-of-GPU-memory error, so I went back and reduced some of the values in this block:

from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

namely vocab_size, num_hidden_layers and the like. The model's parameter count did indeed shrink, but training kept failing with errors such as: indexSelectLargeIndex: block: [43,0,0], thread: [96,0,0] Assertion 'srcIndex < srcSelectDimSize' failed. as well as RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling 'cublasCreate(handle)'. I spent a good while chasing this before it finally dawned on me:

tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

When defining the tokenizer here, I had never changed vocab_size, so the tokenizer was still built for 52,000 tokens while the model's embedding table had been shrunk; that mismatch is exactly what produced the errors above. Just plain carelessness on my part. Finally, here is the complete code in one place for reference (with a small vocabulary-size sanity check added after the config block):

import torch
print(torch.cuda.is_available())

############################################################
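# Train a byte-level BPE tokenizer on the corpus. The vocab_size chosen here must be
# consistent with config.vocab_size below; this is exactly what I originally forgot.
# Note: save_model does not create the "EsperBERTo" directory, so it may need to be
# created beforehand.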
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path("NLPCorpus").glob("**/*.txt")]
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=paths, vocab_size=12_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

tokenizer.save_model("EsperBERTo")
############################################################
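# Model configuration. vocab_size must be large enough to cover every id the tokenizer
# can emit (here both are 12_000); max_position_embeddings is 514 = 512 tokens plus the
# 2 offset positions RoBERTa reserves.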
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=12_000,
    max_position_embeddings=514,
    num_attention_heads=8,
    num_hidden_layers=6,
    type_vocab_size=1,
)
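
# Optional sanity check (not in the original tutorial): every token id the tokenizer can
# produce has to fit inside the model's embedding table, otherwise CUDA fails with
# "Assertion 'srcIndex < srcSelectDimSize' failed" during training.
assert tokenizer.get_vocab_size() <= config.vocab_size, (
    f"tokenizer vocab size {tokenizer.get_vocab_size()} exceeds config.vocab_size {config.vocab_size}"
)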
############################################################
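# Reload the trained tokenizer through the transformers API so the Trainer can use it.
# (Newer transformers versions prefer model_max_length=512 over max_len=512.)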
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo", max_len=512)
############################################################
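# Build a RoBERTa masked-language model from scratch using the config above; its
# embedding table is sized by config.vocab_size.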
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)
print(model.num_parameters())
############################################################
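# Wrap the training corpus as one example per line. (Newer transformers versions mark
# LineByLineTextDataset as deprecated in favour of the datasets library, but it still
# works here.)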
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./NLPCorpus/eo.txt",
    block_size=128,
)
############################################################
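# Data collator that dynamically masks 15% of the tokens for masked-language-model training.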
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
############################################################
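# Training setup. Note: in newer transformers releases per_gpu_train_batch_size has been
# replaced by per_device_train_batch_size, and prediction_loss_only is expected as a
# TrainingArguments field rather than a Trainer argument.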
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./EsperBERTo",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
    prediction_loss_only=True,
)

trainer.train()

trainer.save_model("./EsperBERTo")

Overall, the whole process is actually quite clear and simple.
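
As a quick smoke test of the result, you can load the saved model into a fill-mask pipeline, much as the original tutorial does. This is a minimal sketch, assuming the ./EsperBERTo output directory from the code above; the Esperanto prompt is just a placeholder:

from transformers import pipeline

# Load the freshly trained model and tokenizer and ask it to fill in a masked token.
fill_mask = pipeline(
    "fill-mask",
    model="./EsperBERTo",
    tokenizer="./EsperBERTo",
)
print(fill_mask("La suno <mask>."))

If this runs without the CUDA assertion and returns plausible candidates, the tokenizer and model vocabulary sizes are consistent.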
