① Transformers tutorial on Zhihu
② Colab usage tutorial on Cnblogs (博客园)
③ The Hugging Face official website
The library was designed with two strong goals in mind:
Goal 1: be as easy and fast to use as possible.
① We strictly limit the number of user-facing abstractions to learn; in fact, there are almost none. Using each model requires only three standard classes: configuration, model, and tokenizer.
② All of these classes can be initialized in a simple and unified way from pretrained instances via the common **from_pretrained()** instantiation method, which takes care of downloading (if needed), caching, and loading the relevant class instance and its associated data (the configuration's hyperparameters, the tokenizer's vocabulary, and the model's weights), either from a pretrained checkpoint on the Hugging Face Hub or from your own saved checkpoint.
③ On top of these three base classes, the library provides two APIs: pipeline(), for quickly using a model (and its associated tokenizer and configuration) on a given task, and Trainer (or Keras fit), for quickly training or fine-tuning a given model.
④ Consequently, the library is not a modular toolbox of neural-network building blocks. If you want to extend or build on the library, simply use regular Python/PyTorch/TensorFlow/Keras modules and inherit from the library's base classes to reuse functionality such as model loading and saving (a minimal sketch of this pattern appears below).
Goal 2: provide state-of-the-art models whose results match the original implementations.
① We provide at least one example for each architecture that reproduces a result published by the official authors of that architecture.
② The code is usually kept as close to the original code base as possible, which means some PyTorch code may not be as idiomatic as it could be because it was converted from TensorFlow code, and vice versa.
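As point ④ notes, extending the library just means inheriting from its base classes. Below is a minimal sketch of our own (not from the official docs): a custom PyTorch head on top of a standard BERT encoder, which still gets from_pretrained()/save_pretrained() for free. The class name BertWithCustomHead is a hypothetical example.

import torch.nn as nn
from transformers import BertModel, BertPreTrainedModel

class BertWithCustomHead(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.bert = BertModel(config)                 # reuse the standard encoder
        self.head = nn.Linear(config.hidden_size, 2)  # our own classification head

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        return self.head(outputs.last_hidden_state[:, 0])  # classify on the [CLS] position

# from_pretrained() still works and loads the encoder weights:
model = BertWithCustomHead.from_pretrained("bert-base-cased")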
The library is built around three types of classes for each model:
1 Model classes
2 Configuration classes
3 Tokenizer classes
All these classes can be instantiated from pretrained instances and saved locally using two methods:
1 from_pretrained()
lets you instantiate a model/configuration/tokenizer from a pretrained version provided either by the library itself (the supported models can be found on the Model Hub) or stored locally (or on a server) by the user.
2 save_pretrained()
lets you save a model/config/tokenizer locally so that it can be reloaded with from_pretrained().
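A minimal sketch of the round trip (the local directory "./my-bert" is our own placeholder):

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # downloads, or reads the cache
model = AutoModel.from_pretrained("bert-base-cased")

tokenizer.save_pretrained("./my-bert")  # writes the vocabulary files locally
model.save_pretrained("./my-bert")      # writes config.json and the weights locally

model = AutoModel.from_pretrained("./my-bert")  # reload from the local copy instead of the Hub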
We have just seen that Transformers provides three base classes and two APIs (pipeline and Trainer). Next, let's walk through the simplest way to use the library.
First, let's look at the most common application of pipeline(). Install the dependencies:
pip install transformers torch
Does this look familiar? We use pip like this all the time to install packages (note that it's pip, not pipeline()). Without further ado, let's use pipeline() to implement a sentiment-classification example.
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
This loads a default pretrained model and tokenizer for "sentiment-analysis":
classifier("We are very happy to show you the Transformers library.")
Result:
[{'label': 'POSITIVE', 'score': 0.9998}]
results = classifier(["We are very happy to show you the Transformers library.", "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
The results come back as a list of dictionaries, and the loop prints each label and score.
Result:
label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
When we want to run a pipeline over an entire dataset for a specific task, we need a way to load the data. First, use pip to install a library named datasets:
pip install datasets
Create a pipeline() with the task you want to solve and the model you want to use. Setting the device argument to 0 places the tensors on a CUDA device:
from transformers import pipeline
speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0)
Then load a dataset named superb:
import datasets
dataset = datasets.load_dataset("superb", name="asr", split="test")
Let's take a quick look at the data by transcribing the first four audio files:
files = dataset["file"]
speech_recognizer(files[:4])
[{'text': 'HE HOPED THERE WOULD BE STEW FOR DINNER TURNIPS AND CARROTS AND BRUISED POTATOES AND FAT MUTTON PIECES TO BE LADLED OUT IN THICK PEPPERED FLOWER FAT AND SAUCE'},
{'text': 'STUFFERED INTO YOU HIS BELLY COUNSELLED HIM'},
{'text': 'AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS'},
{'text': 'HO BERTIE ANY GOOD IN YOUR MIND'}]
For larger datasets with large inputs (as in speech or vision), you will want to pass a generator instead of a list, so that all the inputs are not loaded into memory at once; a sketch follows. See the documentation for more information.
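Here is a minimal sketch of that pattern (our own illustration, reusing the dataset and speech_recognizer defined above); the pipeline consumes the generator one item at a time:

def audio_files():
    for example in dataset:       # yield one file path at a time
        yield example["file"]

for prediction in speech_recognizer(audio_files()):
    print(prediction["text"])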
In many cases, the architecture you want to use can be guessed from the name or path of the pretrained model you supply to the from_pretrained() method. The Auto classes do this job for you, so that you automatically retrieve the relevant model given the name/path of the pretrained weights/config/vocabulary.
With the Auto classes (AutoConfig, AutoModel, and AutoTokenizer), we don't have to pick the concrete classes ourselves. Given the name or path of a pretrained model, these classes automatically identify the model we want to use and set up the related files, so no additional configuration is needed; we only have to call the Model and Tokenizer. (Note: the pretrained model name or path used for the Model and for the Tokenizer must be identical.)
Let's reimplement the example above using the Auto classes.
Since the problem we want to solve is a classification task, we use AutoModelForSequenceClassification to load the Model and AutoTokenizer to load the Tokenizer. Here is the code:
from transformers import AutoTokenizer
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
Note: see the [Model hub] for how to find a suitable model name.
encoding = tokenizer("We are very happy to show you the Transformers library.")
print(encoding)
Result:
{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
The tokenizer returns a dictionary containing:
input_ids: the numerical representations of your tokens.
token_type_ids: which sequence each token belongs to (used for sentence pairs).
attention_mask: indicates which tokens should be attended to.
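To see what the input_ids stand for, you can map them back to tokens (a quick check of our own):

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# ['[CLS]', 'We', 'are', ..., '[SEP]'] - special tokens were added automatically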
Note: you can set some parameters on the Tokenizer, such as padding and maximum length. For example:
pt_batch = tokenizer(
    ["We are very happy to show you the Transformers library.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
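Inspecting the batch (our own check) shows both sentences padded to a common length, with zeros in the attention mask marking the padding:

print(pt_batch["input_ids"].shape)    # e.g. torch.Size([2, 14]) - one row per sentence
print(pt_batch["attention_mask"][1])  # trailing zeros mark the padded positions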
For more details about the tokenizer, see here.
Model
For how to choose a model name, see here.
from transformers import AutoModelForSequenceClassification
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
pt_batch is the dictionary produced by the tokenizer; pass it into the model:
pt_outputs = pt_model(**pt_batch)  # the inputs are a dict, so unpack them with **
The model outputs raw logits, which need to pass through a softmax:
from torch import nn
pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)
print(pt_predictions)
Result:
tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725],
        [0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=<SoftmaxBackward0>)
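To map these probabilities back to human-readable labels (our own addition), use the id2label mapping stored in the model's config:

pred_ids = pt_predictions.argmax(dim=-1)
print([pt_model.config.id2label[i.item()] for i in pred_ids])  # e.g. ['5 stars', '5 stars']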
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
classes = ["not paraphrase", "is paraphrase"]
sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
# The tokenizer will automatically add any model specific separators (i.e. [CLS] and [SEP]) and tokens to
# the sequence, as well as compute the attention masks.
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")
paraphrase_classification_logits = model(**paraphrase).logits
not_paraphrase_classification_logits = model(**not_paraphrase).logits
paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]
# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
# Output
not paraphrase: 10%
is paraphrase: 90%
# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")
# Output
not paraphrase: 94%
is paraphrase: 6%
a. Question Answering
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
text = r"""
Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""
questions = [
    "How many pretrained models are available in Transformers?",
    "What does Transformers provide?",
    "Transformers provides interoperability between which frameworks?",
]
for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits
    # Get the most likely beginning of answer with the argmax of the score
    answer_start = torch.argmax(answer_start_scores)
    # Get the most likely end of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1
    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    )
    print(f"Question: {question}")
    print(f"Answer: {answer}")
b-1. Language Modeling: MLM (masked language modeling)
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")
sequence = (
    "Distilled models are smaller than the models they mimic. Using them instead of the large "
    f"versions would help {tokenizer.mask_token} our carbon footprint."
)
inputs = tokenizer(sequence, return_tensors="pt")
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
token_logits = model(**inputs).logits
mask_token_logits = token_logits[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))
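Equivalently, the fill-mask pipeline wraps all of these steps (a sketch using the same checkpoint):

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="distilbert-base-cased")
for pred in fill_mask(sequence):  # top predictions for the masked position
    print(pred["sequence"], round(pred["score"], 4))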
For more information about the Auto classes, [click here](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForPreTraining).
b-2. Language Modeling: CLM (causal language modeling)
from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering
import torch
from torch import nn
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
sequence = "Hugging Face is based in DUMBO, New York City, and"
inputs = tokenizer(sequence, return_tensors="pt")
input_ids = inputs["input_ids"]
# get logits of last hidden state
next_token_logits = model(**inputs).logits[:, -1, :]
# filter
filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
# sample
probs = nn.functional.softmax(filtered_next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
generated = torch.cat([input_ids, next_token], dim=-1)
resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)
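To generate more than one token at a time, the higher-level generate() method is usually more convenient (a sketch with the same gpt2 model and tokenizer):

generated_ids = model.generate(input_ids, max_length=30, do_sample=True, top_k=50)
print(tokenizer.decode(generated_ids[0]))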
For other tasks, see here.
① Define the model name:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
② Load the Model and tokenizer with the Auto classes:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
③ Pass the Model and tokenizer to pipeline():
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
classifier("Nous sommes très heureux de vous présenter la bibliothèque Transformers.")
Result:
[{'label': '5 stars', 'score': 0.7273}]
Everything above shows several ways to use pretrained models directly, but real applications may require fine-tuning. That will be covered in the next post.
Once a model is fine-tuned, you can save it together with its tokenizer using save_pretrained():
pt_save_directory = "./pt_save_pretrained"
tokenizer.save_pretrained(pt_save_directory)
pt_model.save_pretrained(pt_save_directory)
To use it later, load it back with from_pretrained():
pt_model = AutoModelForSequenceClassification.from_pretrained("./pt_save_pretrained")
And here is how to load a model saved from TensorFlow into PyTorch (tf_save_directory below is a placeholder for the directory where the TensorFlow model was saved):
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tf_save_directory = "./tf_save_pretrained"  # assumed path to the saved TensorFlow model
tokenizer = AutoTokenizer.from_pretrained(tf_save_directory)
pt_model = AutoModelForSequenceClassification.from_pretrained(tf_save_directory, from_tf=True)
For the reverse conversion, see here; a minimal sketch follows.
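A sketch of that reverse direction (assuming TensorFlow is installed and reusing pt_save_directory from above): pass from_pt=True to a TF model class.

from transformers import TFAutoModelForSequenceClassification

tf_model = TFAutoModelForSequenceClassification.from_pretrained(pt_save_directory, from_pt=True)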