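Training involves two things that inference does not: data augmentation and gradient backpropagation. Although only these two things are added, training code is much more complex than inference code, so we look at the simpler inference process first.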
In Hugging Face transformers, inference is handled entirely by the Pipeline class.
Instantiation: pipeline() returns a Pipeline object.
By default, the first argument passed to pipeline() is the task argument.
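>>> from transformers import pipeline
>>> # Get a Pipeline object; a str argument selects which pipeline type is returned (by default it is the task argument)
>>> pipe = pipeline("text-classification")
>>> # Pass the input data to the pipeline object to get the prediction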
>>> pipe("This restaurant is awesome")
[{'label': 'POSITIVE', 'score': 0.9998743534088135}]
If you do not pass a task, you can pass the specific model you want (by name):
# A model name can be passed instead
>>> pipe = pipeline(model="roberta-large-mnli")
>>> pipe("This restaurant is awesome")
[{'label': 'NEUTRAL', 'score': 0.7313136458396912}]
Passing a model object:
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
# Sentiment analysis pipeline
analyzer = pipeline("sentiment-analysis")
# Question answering pipeline, specifying the checkpoint identifier
oracle = pipeline(
    "question-answering", model="distilbert-base-cased-distilled-squad", tokenizer="bert-base-cased"
)
# Named entity recognition pipeline, passing in a specific model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
recognizer = pipeline("ner", model=model, tokenizer=tokenizer)
Processing multiple inputs with a list:
>>> pipe = pipeline("text-classification")
>>> pipe(["This restaurant is awesome", "This restaurant is awful"])
[{'label': 'POSITIVE', 'score': 0.9998743534088135},
{'label': 'NEGATIVE', 'score': 0.9996669292449951}]
Using datasets directly:
import datasets
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from tqdm.auto import tqdm
pipe = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0)
dataset = datasets.load_dataset("superb", name="asr", split="test")
# Simply pass the dataset to the pipeline instance
# KeyDataset (only *pt*) will simply return the item in the dict returned by the dataset item
# as we're not interested in the *target* part of the dataset. For sentence pair use KeyPairDataset
for out in tqdm(pipe(KeyDataset(dataset, "file"))):
    print(out)
# {"text": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND"}
# {"text": ....}
# ....
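To make the role of KeyDataset concrete, here is a minimal sketch of the idea in plain Python. SimpleKeyDataset and the toy records are hypothetical stand-ins introduced for illustration, not the library's implementation; the sketch only assumes that the wrapper returns one field of each dataset item.

```python
class SimpleKeyDataset:
    """Hypothetical stand-in for KeyDataset: exposes one field of each record."""

    def __init__(self, dataset, key):
        self.dataset = dataset
        self.key = key

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        # Return only the requested field, dropping e.g. the target text.
        return self.dataset[index][self.key]


# Toy records shaped like rows of an ASR dataset (file names are made up).
records = [
    {"file": "a.flac", "text": "HELLO"},
    {"file": "b.flac", "text": "WORLD"},
]
files = SimpleKeyDataset(records, "file")
print([files[i] for i in range(len(files))])
```

The pipeline then only ever sees the "file" values, which is why the target transcription never reaches the model.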
In fact, passing a generator also works:
from transformers import pipeline
pipe = pipeline("text-classification")
def data():
    while True:
        # This could come from a dataset, a database, a queue or HTTP request
        # in a server
        # Caveat: because this is iterative, you cannot use `num_workers > 1` variable
        # to use multiple threads to preprocess data. You can still have 1 thread that
        # does the preprocessing while the main runs the big inference
        yield "This is a test"


for out in pipe(data()):
    print(out)
    # {"label": ..., "score": ...}
    # {"label": ..., "score": ...}
    # ....
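Because the generator above never stops, the loop over it never ends either. The following sketch shows one way to bound such a stream with itertools.islice. toy_pipe is a hypothetical stand-in for a pipeline, and the sketch assumes results are produced lazily, one per input.

```python
from itertools import islice


def data():
    # Endless source, e.g. a queue or incoming requests.
    while True:
        yield "This is a test"


def toy_pipe(inputs):
    # Hypothetical stand-in for a pipeline: lazily maps each input to a result.
    for text in inputs:
        yield {"label": "TOY", "length": len(text)}


# islice stops after 3 results; without it the loop would never end.
results = list(islice(toy_pipe(data()), 3))
print(results)
```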
transformers.pipeline() parameters
There are many parameters; only the most important ones are covered here.
Other parameters of from_pretrained()
zero-shot-classification and question-answering use ChunkPipeline, because a single input might yield multiple forward passes of the model. Under normal circumstances, this would cause issues with the batch_size argument.
Previously it was enough to pass the data straight to the pipeline. Conceptually, the pipeline's stages are now called separately:
pipe.preprocess()
pipe.forward()
pipe.postprocess()
Basic usage:
all_model_outputs = []
for preprocessed in pipe.preprocess(inputs):
    model_outputs = pipe.forward(preprocessed)
    all_model_outputs.append(model_outputs)
outputs = pipe.postprocess(all_model_outputs)
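The loop above can be tried out without any model. The sketch below uses a hypothetical ToyChunkPipeline whose preprocess() is a generator that splits one input into several chunks; the scoring rule is invented purely for illustration.

```python
class ToyChunkPipeline:
    """Hypothetical illustration: one input is split into several model calls."""

    def __init__(self, chunk_size=3):
        self.chunk_size = chunk_size

    def preprocess(self, words):
        # A generator: one input yields several chunks.
        for start in range(0, len(words), self.chunk_size):
            yield words[start:start + self.chunk_size]

    def forward(self, chunk):
        # Stand-in for a model forward pass: score a chunk by its total length.
        return {"chunk": chunk, "score": sum(len(w) for w in chunk)}

    def postprocess(self, all_model_outputs):
        # Aggregate all chunk outputs into a single answer for the original input.
        return max(all_model_outputs, key=lambda out: out["score"])


pipe = ToyChunkPipeline(chunk_size=3)
inputs = "a single long input may need several forward passes".split()

all_model_outputs = []
for preprocessed in pipe.preprocess(inputs):
    all_model_outputs.append(pipe.forward(preprocessed))
outputs = pipe.postprocess(all_model_outputs)
print(len(all_model_outputs), outputs)
```

One input produces three forward passes here, which is why a plain per-input batch_size does not map cleanly onto this kind of pipeline.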
First, work out what the inputs and the outputs are.
import torch
from transformers import Pipeline


class MyPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        # Route each user-supplied kwarg to the stage that consumes it.
        preprocess_kwargs = {}
        if "maybe_arg" in kwargs:
            preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]
        return preprocess_kwargs, {}, {}

    def preprocess(self, inputs, maybe_arg=2):
        # Pipeline input -> model input.
        model_input = torch.tensor(inputs["input_ids"])
        return {"model_input": model_input}

    def _forward(self, model_inputs):
        # model_inputs == {"model_input": model_input}
        outputs = self.model(**model_inputs)
        # Maybe {"logits": Tensor(...)}
        return outputs

    def postprocess(self, model_outputs):
        # Model output -> user-facing output (here: class probabilities).
        best_class = model_outputs["logits"].softmax(-1)
        return best_class
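preprocess()
Its input is the original input that you define. The method processes it into the model's input, which is the output of preprocess(). (Keep the pipeline's input and the model's input distinct.)
The output of preprocess() is usually a dict, which is then passed to the model as **kwargs.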
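_forward()
forward() contains safeguards that make sure everything runs on the expected device. All other model-related code goes into _forward(), which forward() calls.
Note that only model-related code belongs in _forward(); pre-processing and post-processing go into their own methods.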
postprocess()
The output of _forward() is the input of postprocess(), which turns it into the output the user wants.
_sanitize_parameters()
This function exists to allow users to pass any parameters whenever they wish, be it at initialization time, pipeline(...., maybe_arg=4), or at call time, pipe = pipeline(...); output = pipe(...., maybe_arg=4).
It returns three dicts, which are passed to preprocess(), _forward() and postprocess() respectively.
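How the four methods fit together can be sketched with a heavily simplified base class. ToyPipeline and Repeater are hypothetical names; the real Pipeline class does far more (device placement, batching, framework handling). The sketch only assumes that parameters given at construction time act as defaults and parameters given at call time override them.

```python
class ToyPipeline:
    """Hypothetical, heavily simplified model of how the pieces fit together."""

    def __init__(self, **kwargs):
        # Parameters given at construction time become defaults.
        self._pre, self._fwd, self._post = self._sanitize_parameters(**kwargs)

    def __call__(self, inputs, **kwargs):
        # Parameters given at call time override the defaults.
        pre, fwd, post = self._sanitize_parameters(**kwargs)
        pre = {**self._pre, **pre}
        fwd = {**self._fwd, **fwd}
        post = {**self._post, **post}
        model_inputs = self.preprocess(inputs, **pre)
        model_outputs = self.forward(model_inputs, **fwd)
        return self.postprocess(model_outputs, **post)

    def forward(self, model_inputs, **kwargs):
        # The real forward() also takes care of device placement; omitted here.
        return self._forward(model_inputs, **kwargs)


class Repeater(ToyPipeline):
    def _sanitize_parameters(self, **kwargs):
        preprocess_kwargs = {}
        if "maybe_arg" in kwargs:
            preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]
        return preprocess_kwargs, {}, {}

    def preprocess(self, inputs, maybe_arg=2):
        return {"model_input": [inputs] * maybe_arg}

    def _forward(self, model_inputs):
        return {"count": len(model_inputs["model_input"])}

    def postprocess(self, model_outputs):
        return model_outputs["count"]


default_result = Repeater()("x")             # uses the default maybe_arg=2
init_result = Repeater(maybe_arg=4)("x")     # set at initialization time
call_result = Repeater()("x", maybe_arg=3)   # set at call time
print(default_result, init_result, call_result)
```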
Target behavior:
>>> pipe = pipeline("my-new-task")
>>> pipe("This is a test")
[{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}, {"label": "3-star", "score": 0.05},
 {"label": "4-star", "score": 0.025}, {"label": "5-star", "score": 0.025}]
>>> pipe("This is a test", top_k=2)
[{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}]
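The first prediction passed nothing besides the input data and returned five results, so top_k defaults to 5 (this should be a parameter of postprocess()). To support this, edit _sanitize_parameters() so that it forwards the parameter: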
# Methods of MyPipeline
def postprocess(self, model_outputs, top_k=5):
    # top_k defaults to 5; this snippet does not yet use it to trim the result.
    best_class = model_outputs["logits"].softmax(-1)
    return best_class

def _sanitize_parameters(self, **kwargs):
    preprocess_kwargs = {}
    if "maybe_arg" in kwargs:
        preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]
    postprocess_kwargs = {}
    if "top_k" in kwargs:
        postprocess_kwargs["top_k"] = kwargs["top_k"]
    return preprocess_kwargs, {}, postprocess_kwargs
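The snippet above accepts top_k but does not use it yet. One possible way to turn logits into the ranked label/score records shown in the target behavior is sketched below in plain Python. The labels argument and the logit values are assumptions made up for illustration; a real pipeline would more likely read label names from the model configuration.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(model_outputs, labels, top_k=5):
    # Turn logits into label/score records, best first, keeping only top_k.
    scores = softmax(model_outputs["logits"])
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return [{"label": label, "score": score} for label, score in ranked[:top_k]]


labels = ["1-star", "2-star", "3-star", "4-star", "5-star"]  # illustrative names
model_outputs = {"logits": [3.0, 1.0, 0.5, 0.0, -0.5]}       # made-up values
top_two = postprocess(model_outputs, labels, top_k=2)
print(top_two)
```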
Register the pipeline by calling PIPELINE_REGISTRY.register_pipeline():
from transformers import AutoModelForSequenceClassification
from transformers.pipelines import PIPELINE_REGISTRY

PIPELINE_REGISTRY.register_pipeline(
    "new-task",
    pipeline_class=MyPipeline,
    pt_model=AutoModelForSequenceClassification,
)
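What registration achieves can be pictured as a mapping from task name to pipeline class. The sketch below is a hypothetical miniature (ToyRegistry, EchoPipeline and the create() method are invented here) and is not how PIPELINE_REGISTRY is implemented; it only illustrates the lookup idea.

```python
class ToyRegistry:
    """Hypothetical miniature of a task registry: task name -> pipeline class."""

    def __init__(self):
        self._tasks = {}

    def register_pipeline(self, task, pipeline_class):
        self._tasks[task] = pipeline_class

    def create(self, task, **kwargs):
        # Look up the class registered for this task and instantiate it.
        if task not in self._tasks:
            raise KeyError(f"Unknown task: {task!r}")
        return self._tasks[task](**kwargs)


class EchoPipeline:
    def __call__(self, text):
        return [{"label": "echo", "text": text}]


registry = ToyRegistry()
registry.register_pipeline("new-task", pipeline_class=EchoPipeline)
pipe = registry.create("new-task")
print(pipe("This is a test"))
```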
Image classification:
>>> from transformers import pipeline
>>> classifier = pipeline(model="microsoft/beit-base-patch16-224-pt22k-ft22k")
>>> classifier("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png")
[{'score': 0.442, 'label': 'macaw'}, {'score': 0.088, 'label': 'popinjay'}, {'score': 0.075, 'label': 'parrot'}, {'score': 0.073, 'label': 'parodist, lampooner'}, {'score': 0.046, 'label': 'poll, poll_parrot'}]
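Parameters of __call__():
images (str, List[str], PIL.Image or List[PIL.Image]): The pipeline handles three types of images: a string containing an HTTP link pointing to an image, a string containing a local path to an image, or an image loaded in PIL directly.
top_k (int, optional, defaults to 5): The number of top labels that will be returned by the pipeline.
Image segmentation:
>>> from transformers import pipeline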
>>> segmenter = pipeline(model="facebook/detr-resnet-50-panoptic")
>>> segments = segmenter("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png")
>>> len(segments)
2
>>> segments[0]["label"]
'bird'
>>> segments[1]["label"]
'bird'
>>> type(segments[0]["mask"])  # This is a black and white mask showing where the bird is in the original image.
<class 'PIL.Image.Image'>
>>> segments[0]["mask"].size
(768, 512)
Parameters of __call__():
images (str, List[str], PIL.Image or List[PIL.Image]): The pipeline handles three types of images: a string containing an HTTP link pointing to an image, a string containing a local path to an image, or an image loaded in PIL directly.
subtask (str, optional): Segmentation task to be performed; choose from [semantic, instance and panoptic] depending on model capabilities. If not set, the pipeline will attempt to resolve in the following order: panoptic, instance, semantic.
threshold (float, optional, defaults to 0.9): Probability threshold to filter out predicted masks.
mask_threshold (float, optional, defaults to 0.5): Threshold to use when turning the predicted masks into binary values.
overlap_mask_area_threshold (float, optional, defaults to 0.5): Mask overlap threshold to eliminate small, disconnected segments.
Object detection:
>>> from transformers import pipeline
>>> detector = pipeline(model="facebook/detr-resnet-50")
>>> detector("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png")
[{'score': 0.997, 'label': 'bird', 'box': {'xmin': 69, 'ymin': 171, 'xmax': 396, 'ymax': 507}}, {'score': 0.999, 'label': 'bird', 'box': {'xmin': 398, 'ymin': 105, 'xmax': 767, 'ymax': 507}}]
>>> # x, y are expressed relative to the top left hand corner.
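The detections are plain dicts, so they are easy to post-process. A small sketch using the exact output shown above follows; box_area and keep_confident are hypothetical helper names, not part of the library.

```python
detections = [
    {"score": 0.997, "label": "bird",
     "box": {"xmin": 69, "ymin": 171, "xmax": 396, "ymax": 507}},
    {"score": 0.999, "label": "bird",
     "box": {"xmin": 398, "ymin": 105, "xmax": 767, "ymax": 507}},
]


def box_area(box):
    # Width times height in pixels; coordinates are relative to the top-left corner.
    return (box["xmax"] - box["xmin"]) * (box["ymax"] - box["ymin"])


def keep_confident(items, min_score):
    # Hypothetical helper: drop detections below a score cut-off.
    return [item for item in items if item["score"] >= min_score]


areas = [box_area(d["box"]) for d in detections]
best = max(detections, key=lambda d: d["score"])
print(areas, best["box"])
```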
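Parameters of __call__():
images (str, List[str], PIL.Image or List[PIL.Image]): The pipeline handles three types of images: a string containing an HTTP link pointing to an image, a string containing a local path to an image, or an image loaded in PIL directly.
threshold (float, optional, defaults to 0.9): The probability necessary to make a prediction.
Image captioning (image-to-text):
>>> from transformers import pipeline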
>>> captioner = pipeline(model="ydshieh/vit-gpt2-coco-en")
>>> captioner("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png")
[{'generated_text': 'two birds are standing next to each other '}]
Parameters of __call__():
images (str, List[str], PIL.Image or List[PIL.Image]): The pipeline handles three types of images: a string containing an HTTP link pointing to an image, a string containing a local path to an image, or an image loaded in PIL directly.
max_new_tokens (int, optional): The maximum number of tokens to generate. By default it uses the generate default.
generate_kwargs (Dict, optional): Pass it to send all of these arguments directly to generate, allowing full control of this function.
This visual question answering pipeline can currently be loaded from pipeline() using the following task identifiers: "visual-question-answering", "vqa".
>>> from transformers import pipeline
>>> oracle = pipeline(model="dandelin/vilt-b32-finetuned-vqa")
>>> image_url = "https://huggingface.co/datasets/Narsil/image_dummy/raw/main/lena.png"
>>> oracle(question="What is she wearing ?", image=image_url)
[{'score': 0.948, 'answer': 'hat'}, {'score': 0.009, 'answer': 'fedora'}, {'score': 0.003, 'answer': 'clothes'}, {'score': 0.003, 'answer': 'sun hat'}, {'score': 0.002, 'answer': 'nothing'}]
>>> oracle(question="What is she wearing ?", image=image_url, top_k=1)
[{'score': 0.948, 'answer': 'hat'}]
>>> oracle(question="Is this a person ?", image=image_url, top_k=1)
[{'score': 0.993, 'answer': 'yes'}]
>>> oracle(question="Is this a man ?", image=image_url, top_k=1)
[{'score': 0.996, 'answer': 'no'}]
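Parameters of __call__():
image (str, List[str], PIL.Image or List[PIL.Image]): The pipeline handles three types of images: a string containing an HTTP link pointing to an image, a string containing a local path to an image, or an image loaded in PIL directly.
question (str, List[str]): The question(s) asked. If given a single question, it can be broadcast to multiple images.
top_k (int, optional, defaults to 5)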