Honestly, this pitfall is embarrassing: it was entirely my own carelessness and cost me nearly an hour of debugging. But I'm surely not the last person who will hit it, so I'm writing it up in the hope that it saves someone else the trouble.
Hugging Face's transformers is currently a very capable framework that bundles all kinds of pretrained Transformer-family models: https://github.com/huggingface/transformers
Here they explain how to train your own pretrained model on your own corpus: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb
The tutorial itself is written quite carefully, yet I still managed to dig myself a hole through sheer carelessness:
I followed the steps as written and ran out of GPU memory, so I went to adjust this part:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
specifically vocab_size, num_hidden_layers, and similar values. The model's parameter count did go down as expected, but training kept failing with errors like indexSelectLargeIndex: block: [43,0,0], thread: [96,0,0] Assertion 'srcIndex < srcSelectDimSize' failed. and RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling 'cublasCreate(handle)'. I spent ages chasing these before it finally dawned on me:
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
When defining the tokenizer here I had never changed vocab_size, so the tokenizer kept producing token ids far beyond the model's shrunken embedding table, which is exactly what those CUDA index assertions mean. Very careless of me. A quick consistency check (sketched below) would have caught it immediately; after that, the full script is pasted in one piece for reference:
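Just as a sketch (assuming tokenizer is the ByteLevelBPETokenizer and config is the RobertaConfig from the script below; the helper name is mine, not part of the tutorial), the mismatch can be surfaced before training even starts:

# Hypothetical sanity check: the tokenizer's vocabulary must fit inside the
# model's embedding table, otherwise the embedding lookup indexes out of range
# and CUDA raises assertions like the ones quoted above.
def check_vocab_sizes(tokenizer, config):
    tok_vocab = tokenizer.get_vocab_size()  # ByteLevelBPETokenizer API
    if tok_vocab > config.vocab_size:
        raise ValueError(
            f"Tokenizer vocab ({tok_vocab}) is larger than model vocab "
            f"({config.vocab_size}); token ids will index past the embedding table."
        )

check_vocab_sizes(tokenizer, config)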
import torch
print(torch.cuda.is_available())
############################################################
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer
paths = [str(x) for x in Path("NLPCorpus").glob("**/*.txt")]
tokenizer = ByteLevelBPETokenizer()
# vocab_size here must match RobertaConfig.vocab_size below
tokenizer.train(files=paths, vocab_size=12_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
tokenizer.save_model("EsperBERTo")
############################################################
from transformers import RobertaConfig
config = RobertaConfig(
    vocab_size=12_000,
    max_position_embeddings=514,
    num_attention_heads=8,
    num_hidden_layers=6,
    type_vocab_size=1,
)
############################################################
from transformers import RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo", max_len=512)
############################################################
from transformers import RobertaForMaskedLM
model = RobertaForMaskedLM(config=config)
print(model.num_parameters())
############################################################
from transformers import LineByLineTextDataset
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./NLPCorpus/eo.txt",
    block_size=128,
)
############################################################
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
############################################################
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir="./EsperBERTo",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=64,  # newer transformers versions call this per_device_train_batch_size
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
    prediction_loss_only=True,
)
trainer.train()
trainer.save_model("./EsperBERTo")
Overall, the process itself is quite clear and simple.
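Once training finishes, the tutorial also suggests checking that the language model actually learned something by querying it with the fill-mask pipeline. A minimal sketch, assuming the ./EsperBERTo output directory produced by the script above (the example sentence is just an illustration):

from transformers import pipeline

# Load the freshly trained model and tokenizer from the output directory
fill_mask = pipeline(
    "fill-mask",
    model="./EsperBERTo",
    tokenizer="./EsperBERTo",
)

# Ask the model to fill in the masked token; it returns the top candidates with scores
print(fill_mask("La suno <mask>."))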