创建虚拟环境:
conda create -n myenv#创建
conda activate myenv#激活
安装transformers包
pip install git+https://github.com/huggingface/transformers
或者
conda install -c huggingface transformers
transformer库中最基本的对象是pipeline(管道),将模型与其他必要预处理和后处理步骤组合起来,使我们可以直接输入任何文本并获得可理解的答案,它支持如下的任务:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
results = classifier(["We are very happy to show you the Transformers library.",
"We hope you don't hate it."])
for result in results:
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
当第一次运行的时候,它会下载预训练模型和分词器(tokenizer)并且缓存下来
pipeline将三个步骤组合在一起:预处理、传递输入到模型和后处理:
transformers不是一个基础的神经网络库来一步一步构造Transformer,而是把常见的Transformer模型封装成一个building block,我们可以方便的在PyTorch或者TensorFlow里使用它。
只有configuration,models和tokenizer三个主要类。
Configuration
类用于配置模型的各种参数,如模型类型、层数、隐藏单元数等。它允许用户通过创建一个配置对象来自定义模型的架构。例如,BertConfig
是 BERT 模型的配置类,用户可以通过设置不同的参数来配置不同的 BERT 模型变体。Models
类包含了各种预训练的深度学习模型,如BERT、GPT等。通过从相应的模型类中实例化对象,用户可以加载预训练的权重并进行推理或微调。例如,BertModel
是 Hugging Face Transformers 库中用于加载和使用 BERT 模型的类。Tokenizer
将输入文本转换为模型可以理解的标记。它也负责将模型的输出标记转换回人类可读的形式。例如,BertTokenizer
是用于对文本进行标记化的类,与 BERT 模型一起使用。例如wikitext数据集,地址:https://huggingface.co/datasets/wikitext/tree/main
sudo apt-get install git-lfs
git lfs clone https://huggingface.co/datasets/wikitext
- 不能使用git clone https://huggingface.co/t5-base,从huggingface中git clone下来的模型看似下载下来了,但是其实下载下来的并不是实质的模型文件(如果你检查文件的大小,只有几B) ,后续通过from_pretrained()函数来加载模型时会报错:safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge
- 使用git clone 数据集似乎没有问题
如果实验室的服务器不能上外网,下载数据集的时候遇到问题:
Cloning into 'wikitext'...
fatal: unable to access 'https://huggingface.co/datasets/wikitext/': Failed to connect to huggingface.co port 443: Connection timed out
解决办法:手动下载文件到本地,然后上传到服务器上
如果实验室服务器配置了代理,能上外网,那么通过以下命令就能成功下载:
git lfs clone https://huggingface.co/datasets/wikitext
从服务器(本地)加载:
import datasets
wikitext_datasets=datasets.load_dataset("/mnt/workspace/wzf/transformer/datasets/wikitext", 'wikitext-103-v1')
print(wikitext_datasets)
mrpc_datasets=datasets.load_dataset("/mnt/workspace/wzf/transformer/datasets/mrpc")
print(mrpc_datasets)
rte_datasets=datasets.load_dataset("/mnt/workspace/wzf/transformer/datasets/rte")
print(rte_datasets)
注:
load_dataset函数
会执行wikitext/mrpc/rte文件夹下的.py文件,通过.py代码,加载原始数据,如果报错应该查看.py文件的内容
ps:save_to_disk
保存至服务器;load_from_disk
读取服务器数据集
wikitext:
DatasetDict({
test: Dataset({
features: ['text'],
num_rows: 4358
})
train: Dataset({
features: ['text'],
num_rows: 1801350
})
validation: Dataset({
features: ['text'],
num_rows: 3760
})
})
mrpc:
DatasetDict({
train: Dataset({
features: ['text1', 'text2', 'label', 'idx', 'label_text'],
num_rows: 3668
})
validation: Dataset({
features: ['text1', 'text2', 'label', 'idx', 'label_text'],
num_rows: 408
})
test: Dataset({
features: ['text1', 'text2', 'label', 'idx', 'label_text'],
num_rows: 1725
})
})
rte:
DatasetDict({
train: Dataset({
features: ['text1', 'text2', 'label', 'idx', 'label_text'],
num_rows: 2490
})
validation: Dataset({
features: ['text1', 'text2', 'label', 'idx', 'label_text'],
num_rows: 277
})
test: Dataset({
features: ['text1', 'text2', 'label', 'idx', 'label_text'],
num_rows: 3000
})
})
wikitext:
raw_train_dataset = wikitext_datasets["train"]
print(raw_train_dataset[0])
print(raw_train_dataset[1])
print(raw_train_dataset[2])
print(raw_train_dataset[3])
输出:
{'text': ''}
{'text': ' = Valkyria Chronicles III = \n'}
{'text': ''}
{'text': ' Senjō no Valkyria 3 : Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Raven " . \n'}
mrpc:
raw_train_dataset = mrpc_datasets["train"]
print(raw_train_dataset[0])
输出:
{'text1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
'text2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
'label': 1,
'idx': 0,
'label_text': 'equivalent'}
mrpc:
raw_train_dataset.features
输出:
{'text1': Value(dtype='string', id=None),
'text2': Value(dtype='string', id=None),
'label': Value(dtype='int64', id=None),
'idx': Value(dtype='int64', id=None),
'label_text': Value(dtype='string', id=None)}
使用AutoTokenizer类来处理数据,通过AutoTokenizer.from_pretrained()函数根据预训练模型,定义分词器tokenizer。transformers会处理下载、缓存和其它所有加载模型相关的细节(下载了模型需要的词表库tokens vocabulary,它会被缓存起来,从而再次使用的时候不会重新下载),而所有这些模型都统一在Hugging Face Models管理。
Tokenizer
的主要作用是将文本转换为模型可以处理的输入形式(通常是标记的索引序列)
from transformers import AutoTokenizer
checkpoint = "/mnt/workspace/wzf/transformer/model/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)#实例化tokenizer
inputs = tokenizer("This is the first sentence.", "This is the second one.")
print(inputs)
输出:
{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
input_ids一般来说随着预训练模型名字的不同而有所不同。原因是不同的预训练模型在预训练的时候设定了不同的规则。token_type_ids:0就表示的第一句话,1表示第二句话
解码input_ids:
tokenizer.convert_ids_to_tokens(inputs["input_ids"])
输出:
['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]']
tokenizer()这种处理方法的缺点是处理之后返回的不是dataset
格式,而是返回字典(带有我们的键:input_ids
、attention_mask
和 token_type_ids
,对应键值对的值)。为了使我们的数据保持dataset
的格式,我们将使用更灵活的Dataset.map
方法,map
方法是对数据集中的每个元素应用同一个函数,所以让我们定义一个函数来对输入进行tokenize
预处理
mrpc:
def mrpc_tokenize_function(example):
return tokenizer(example["text1"], example["text2"], truncation=True)
mrpc_tokenized_datasets = mrpc_datasets.map(mrpc_tokenize_function, batched=True)
print(mrpc_tokenized_datasets)
map函数参数:num_proc
参数用于指定并行处理数据的进程数量,例如num_proc=4;batched=True参数用于
map函数一次应用于数据集的整个batch元素,而不是分别应用于每个元素;tokenizer参数:
use_fast=True
参数表示选择使用快速分词器;padding=True填充输入序列,使得批次内序列长度一致;truncation=True 截断过长的序列;return_tensors="pt" 返回PyTorch 张量;
输出:
DatasetDict({
train: Dataset({
features: ['text1', 'text2', 'label', 'idx', 'label_text', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 3668
})
validation: Dataset({
features: ['text1', 'text2', 'label', 'idx', 'label_text', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 408
})
test: Dataset({
features: ['text1', 'text2', 'label', 'idx', 'label_text', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 1725
})
})
wikitext:
def wikitext_tokenize_function(example):
return tokenizer(example["text"], truncation=True, max_length=512)
wikitext_tokenized_datasets = wikitext_datasets.map(wikitext_tokenize_function, batched=True)
print(wikitext_tokenized_datasets)
输出:
DatasetDict({
test: Dataset({
features: ['text', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 4358
})
train: Dataset({
features: ['text', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 1801350
})
validation: Dataset({
features: ['text', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 3760
})
})
映射后多出了tokenizer生成的input_ids、token_type_ids、attention_mask
DataCollatorForLanguageModeling主要用于处理模型的输入数据,包括 tokenization、masking 等。
mlm
:设置为 True
,DataCollatorForLanguageModeling
将为 MLM 任务准备批次数据。
mlm_probability
: 用于掩盖标记的概率。
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
TrainingArguments类定义模型参数。
from transformers import TrainingArguments
training_args = TrainingArguments(output_dir="test-trainer", evaluation_strategy="epoch")
print(training_args)
evaluation_strategy="epoch" 表示每个epoch评估一次,evaluation_strategy="steps"每训练 eval_steps步时进行一次评估
输出:
TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=test-trainer/runs/Dec14_19-35-03_dsw-30998-59d59d765d-gp7nz,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=500,#训练500次打印一次损失
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_torch,
optim_args=None,
output_dir=test-trainer,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=test-trainer,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=500,
save_strategy=steps,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
模型有很多的类,其中AutoModel
类可以从checkpoint实例化任何模型,这是一种比较好的实例化模型方法。但是AutoModel
类只包含基本的Transformer模块,给定一些输入,它会输出隐藏状态hidden states(logits向量),将隐藏状态输入到Model heads(通常由一个或几个线性层组成)中,并将它们投影到不同的维度上,得到model output输出(logits向量),然后可以使用softmax激活函数得到概率
以下为举例,与任务无关:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint) # 自动加载该模型训练时所用的分词器
raw_inputs = [
"I've been waiting for a HuggingFace course my whole life.",
"I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
from transformers import AutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits.shape)
import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
输出:
{'input_ids': tensor([[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
2607, 2026, 2878, 2166, 1012, 102],
[ 101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0,
0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
torch.Size([2, 2])
tensor([[4.0195e-02, 9.5980e-01],
[9.9946e-01, 5.4418e-04]], grad_fn=)
因为我们要使用wikitext数据集pre_tain,所以选择AutoModelForMaskedLM类,因为它包含了我们想要的模型Head。
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained(checkpoint)
如果是情感分析,使用AutoModelForSequenceClassification类
使用训练器trainer,主要参数:
from transformers import Trainer
pretraining_trainer = Trainer(
model=model,
args=training_args,
train_dataset=wikitext_tokenized_datasets["train"],
eval_dataset=wikitext_tokenized_datasets["validation"],
data_collator=data_collator,
tokenizer=tokenizer,
)
fine_tuning_trainer = Trainer(
model=model,
args=fine_tuning_args,
train_dataset=mrpc_tokenized_datasets["train"],
eval_dataset=mrpc_tokenized_datasets["validation"],
data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15),
tokenizer=tokenizer,
compute_metrics=mrpc_compute_metrics,
)
要在我们的数据集上微调模型,我们只需要调用 Trainer 的 train方法:
pretraining_trainer.train()
fine_tuning_trainer.train()
使用 Trainer.predict 命令获得模型的预测结果:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)
输出:
(408, 2) (408,)
predict 方法输出一个具有三个字段的元组。
import evaluate
metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)
正常来说如果直接使用指标名称“accuracy”或者"glue", "mrpc"等,程序将会从 huggingface 上下载相应模块到缓存中使用,实际上我的问题就是无法顺利下载
报错:
FileNotFoundError: Couldn't find a module script at /mnt/workspace/wzf/transformer/datasets/fine_tuningdata/glue/glue.py.
Module 'glue' doesn't exist on the Hugging Face Hub either.
解决方法:
将相关文件下载到本地,然后上传到服务器上(采用 local metric script 方法):打开官方Github,GitHub - huggingface/evaluate: Evaluate: A library for easily evaluating machine learning models and datasets.
下载 metrics 文件夹,放在测试脚本的目录下,将'glue' 改为 '/mnt/workspace/wzf/transformer/metrics/glue',再次运行文件即可得到正确结果
Trainer的compute_metrics 可以计算训练时具体的评估指标的值(比如acc、F1分数等等)。如果trainer不设置compute_metrics 就只显示training loss,这不是一个直观的数字。
import evaluate
import numpy as np
def compute_metrics(eval_preds):
metric = evaluate.load("/mnt/workspace/wzf/transformer/metrics/glue", "mrpc")
logits, labels = eval_preds
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
import datasets
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling, TrainingArguments, Trainer
# 加载Wikitext数据集
wikitext_datasets = datasets.load_dataset("/mnt/workspace/wzf/transformer/datasets/wikitext", 'wikitext-103-v1')
mrpc_datasets = datasets.load_dataset("/mnt/workspace/wzf/transformer/datasets/mrpc")
rte_datasets = datasets.load_dataset("/mnt/workspace/wzf/transformer/datasets/rte")
import evaluate
import numpy as np
def mrpc_compute_metrics(eval_preds):
metric = evaluate.load("/mnt/workspace/wzf/transformer/metrics/glue", "mrpc")
logits, labels = eval_preds
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
def rte_compute_metrics(eval_preds):
metric = evaluate.load("/mnt/workspace/wzf/transformer/metrics/glue", "rte")
logits, labels = eval_preds
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
modelname=["bert-base-cased","bert-base-chinese","bert-base-uncased","gpt2","roberta-base","roberta-large","t5-base","t5-small","vit-gpt2-image-captioning"]
for i in range(len(modelname)):
# 分词器
checkpoint = "/mnt/workspace/wzf/transformer/model/"+modelname[i]
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# 分词函数和添加标签
def wikitext_tokenize_function(example):
return tokenizer(example["text"], truncation=True, max_length=512)
def mrpc_tokenize_function(example):
return tokenizer(example["text1"], example["text2"], truncation=True)
# 分词和添加标签
wikitext_tokenized_datasets = wikitext_datasets.map(wikitext_tokenize_function, batched=True)
mrpc_tokenized_datasets = mrpc_datasets.map(mrpc_tokenize_function, batched=True)
rte_tokenized_datasets = rte_datasets.map(mrpc_tokenize_function, batched=True)
# 数据填充
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
# 加载预训练模型
model = AutoModelForMaskedLM.from_pretrained(checkpoint)
# 模型参数
training_args = TrainingArguments(
output_dir="./pretraining_result/"+modelname[i],
overwrite_output_dir=True,
num_train_epochs=1,
)
# 创建预训练 Trainer
pretraining_trainer = Trainer(
model=model,
args=training_args,
train_dataset=wikitext_tokenized_datasets["train"],
eval_dataset=wikitext_tokenized_datasets["validation"],
data_collator=data_collator,
tokenizer=tokenizer,
)
pretraining_trainer.train()
# 微调参数
fine_tuning_args = TrainingArguments(
output_dir="./fine_tuning_result/"+modelname[i],
overwrite_output_dir=True,
num_train_epochs=1,
)
# 创建微调 Trainer
fine_tuning_trainer = Trainer(
model=model,
args=fine_tuning_args,
train_dataset=mrpc_tokenized_datasets["train"],
eval_dataset=mrpc_tokenized_datasets["validation"],
data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15),
tokenizer=tokenizer,
compute_metrics=mrpc_compute_metrics,
)
fine_tuning_trainer.train()
fine_tuning_trainer = Trainer(
model=model,
args=fine_tuning_args,
train_dataset=rte_tokenized_datasets["train"],
eval_dataset=rte_tokenized_datasets["validation"],
data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15),
tokenizer=tokenizer,
compute_metrics=rte_compute_metrics,
)
fine_tuning_trainer.train()
输出:暂时还没跑完
# Load the dataset
data_files = {}
data_path = DATA_PATH
train_file = data_path + "train.json"
data_files["train"] = train_file
extension = train_file.split(".")[-1]
valid_file = data_path + "dev.json"
data_files["validation"] = valid_file
test_file = data_path + "test.json"
data_files["test"] = test_file
raw_datasets = load_dataset(extension, data_files=data_files)
model.resize_token_embeddings(len(tokenizer))
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
参考:
HuggingFace Transformers框架使用教程_haggingface transformers怎么搭建-CSDN博客
Hugging Face实战-系列教程8:GLUE数据集/文本分类上(NLP实战/Transformer实战/预训练模型/分词器/模型微调/模型自动选择/PyTorch版本/代码逐行解析)-CSDN博客
Huggingface Evaluate包使用小坑-CSDN博wan
深入浅出对话系统——拥抱笑脸Transformer库的使用_trainingarguments-CSDN博客