Feel free to leave a comment below if you have any questions.
All of the code in this article was run in a Jupyter Notebook.
The companion code for this article has already been uploaded.
Hugging Face in Practice, Tutorial 10: Building a Pretrained Text Model (1)
Hugging Face in Practice, Tutorial 11: Building a Pretrained Text Model (2)
Next we need to randomly mask some positions and have the model predict them; Hugging Face already provides a utility for this.
Normally we would write our own function to replace the token at a given index with the mask token, but the ecosystem moves fast and there is already a ready-made class that does it for us.
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
# 0.15 is the masking probability from the original BERT paper; leave it as is
This works much like the DataLoader machinery in PyTorch, and in practice the collator plays exactly that role, batching and masking the samples (see the sketch after the example output below).
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    print(sample)
    _ = sample.pop("word_ids")  # the collator cannot turn this column into tensors, so drop it

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")
    print(len(chunk))
Output:
The unmasked text serves as the labels and the masked text serves as the training input: lm_datasets holds the chunked dataset we prepared earlier, while data_collator applies the random masking via DataCollatorForLanguageModeling, giving us the inputs and labels respectively.
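To make the DataLoader analogy concrete, here is a minimal sketch that plugs the collator into a plain PyTorch DataLoader as its collate_fn, reusing the lm_datasets and data_collator from above; the word_ids column has to be dropped first because the collator cannot convert it to tensors.

from torch.utils.data import DataLoader

# The collator slots straight into a regular PyTorch DataLoader as collate_fn.
train_loader = DataLoader(
    lm_datasets["train"].remove_columns(["word_ids"]),
    batch_size=8,
    shuffle=True,
    collate_fn=data_collator,
)

batch = next(iter(train_loader))
print(batch["input_ids"].shape, batch["labels"].shape)
# Labels are -100 everywhere except at the randomly masked positions,
# so the loss is only computed on the tokens the model has to predict.
print((batch["labels"] != -100).sum().item(), "masked tokens in this batch")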
train_size = 10000
test_size = int(0.1 * train_size)
downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset
Output:
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1000
    })
})
from transformers import TrainingArguments
batch_size = 64
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-imdb",  # pick any output directory name you like
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    logging_steps=logging_steps,
    num_train_epochs=1,
    save_strategy='epoch',
)
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
)
import math
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
This evaluation metric is called perplexity, which is simply the exponential of the cross-entropy loss. Intuitively, it is roughly how many candidate words the model has to choose from, on average, at each masked position before it picks the right one.
'>>> Perplexity: 21.94'
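For reference, perplexity is just the exponential of the average cross-entropy loss over the masked tokens, which is exactly what the print above computes: $\mathrm{PPL} = \exp(\text{eval\_loss}) = \exp\!\left(-\tfrac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid \text{context}_i)\right)$.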
trainer.train()
Take a look at the loss:
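In a notebook the progress table already shows the loss; if you want to print it programmatically, here is a small sketch using the Trainer's built-in log history:

# trainer.state.log_history holds one dict per logging/evaluation event.
for record in trainer.state.log_history:
    if "loss" in record:  # training-loss entries
        print(f"step {record['step']}: loss = {record['loss']:.4f}")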
Now check whether the fine-tuned model gets a better (lower) perplexity:
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
'>>> Perplexity: 12.85'
After training, let's see how the model performs. Load the model we just trained:
from transformers import AutoModelForMaskedLM
model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained("./distilbert-base-uncased-finetuned-imdb/checkpoint-157")
The fine-tuned model is saved locally. The 157 in the folder name is the global step at which the checkpoint was saved, so it depends on your data size and batch size, and your folder name may differ.
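If you are not sure what your checkpoint folder is called, a quick sketch (assuming the default output_dir set above) is to just list the output directory:

import os

# Trainer writes one folder per save; the number after "checkpoint-" is the
# global step at which it was saved, so it varies with train_size and batch_size.
print(os.listdir("distilbert-base-uncased-finetuned-imdb"))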
Load the tokenizer the same way as before:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
Check the new model's predictions:
import torch

text = "This is a great [MASK]."  # the same test sentence used in the earlier parts of this series

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
# Find the position of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the 5 highest-scoring candidate tokens
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")
Output:
'>>> This is a great deal.'
'>>> This is a great idea.'
'>>> This is a great adventure.'
'>>> This is a great film.'
'>>> This is a great movie.'
After fine-tuning, "film" and "movie" now show up among the top predictions; they were not there before training on the IMDB reviews.
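As a side note, the same top-5 predictions can be obtained with the fill-mask pipeline, reusing the model and tokenizer loaded above; this is a minimal sketch, since the pipeline just wraps the tokenize, forward, and top-k steps we did by hand:

from transformers import pipeline

# fill-mask returns one dict per candidate, with the completed sentence and its score.
mask_filler = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for pred in mask_filler("This is a great [MASK].", top_k=5):
    print(pred["sequence"], round(pred["score"], 4))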
In NLP you almost always fine-tune an existing pretrained model; training one from scratch on your own is rarely practical.