诸神缄默不语 - my personal CSDN blog post index
This post is part of my series of study notes on the huggingface.transformers documentation.
Series link: huggingface transformers包 文档学习笔记(持续更新ing…)
This part corresponds to: https://huggingface.co/docs/transformers/master/en/task_summary
It introduces transformers-based solutions for some common NLP tasks. For details on the AutoModel classes used here, see their documentation, or my earlier notes on the transformers documentation, where I cover related usage and example code.
A model needs to be loaded from a checkpoint that was fine-tuned on the corresponding task in order to perform well on that task. (If you load a checkpoint that was not fine-tuned on the specific task, only the base transformer layers are loaded; the task-specific additional head is missing, so its weights are randomly initialized and the outputs will be random.)
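For example, here is a small sketch of my own (not from the docs) of what this looks like in practice: loading a base checkpoint into a task-specific architecture triggers a warning that the head weights are newly initialized.
from transformers import AutoModelForSequenceClassification

# Assumption for illustration: "bert-base-cased" has no sequence-classification head,
# so Transformers reuses the encoder weights, randomly initializes the classifier,
# and prints a warning that some weights are newly initialized and the model should
# be fine-tuned before being used for predictions.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)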
These checkpoints are usually pre-trained on a large corpus and then fine-tuned on a specific task. This means that not every checkpoint has been fine-tuned on every task, so you need to pick one that matches your task.
To run inference directly on a given task, the library offers two mechanisms: the pipeline API, a very easy-to-use abstraction, and direct model use via an AutoClass plus its tokenizer, which is less abstract but more flexible.
Both approaches are shown below for each task.
Sequence Classification is the task of classifying a sequence into one of a given set of classes, e.g. the GLUE benchmark. For fine-tuning on GLUE, see run_glue.py or run_xnli.py.
Example of sentiment classification with pipeline, using a model fine-tuned on SST-2 (a GLUE task); it returns a label ("POSITIVE" or "NEGATIVE") and a score:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("I hate you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
result = classifier("I love you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
Output:
label: NEGATIVE, with score: 0.9991
label: POSITIVE, with score: 0.9999
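If you want to pin the checkpoint instead of relying on the pipeline's default, you can pass a model name to pipeline(). The checkpoint below (an SST-2 fine-tuned DistilBERT) is what I believe the default resolves to, but treat the exact name as my assumption:
from transformers import pipeline

# Assumption: explicitly naming the SST-2 fine-tuned checkpoint rather than using the default.
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("I love you")[0])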
Example with an AutoClass: judging whether two sentences are paraphrases of each other:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
classes = ["not paraphrase", "is paraphrase"]
sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
# The tokenizer will automatically add any model specific separators (i.e. [CLS] and [SEP]) and tokens to
# the sequence, as well as compute the attention masks.
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")
paraphrase_classification_logits = model(**paraphrase).logits
not_paraphrase_classification_logits = model(**not_paraphrase).logits
paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]
# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")
Output:
not paraphrase: 10%
is paraphrase: 90%
not paraphrase: 94%
is paraphrase: 6%
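If you only need a single predicted label instead of the whole distribution, you can take the class with the highest probability; this is a small follow-up sketch of mine reusing the variables defined above:
# Sketch: map the highest-probability entry back to its class name.
predicted_class = classes[max(range(len(classes)), key=lambda i: paraphrase_results[i])]
print(predicted_class)  # expected: "is paraphrase"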
Extractive Question Answering is the task of extracting an answer from a context (a passage of text) given a question, e.g. the SQuAD[1] dataset. For fine-tuning on SQuAD, see run_qa.py.
Example with pipeline, using a model fine-tuned on SQuAD; it returns the answer extracted from the context, a confidence score, and the start and end values that locate the answer inside the context:
from transformers import pipeline
question_answerer = pipeline("question-answering")
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""
result = question_answerer(question="What is extractive question answering?", context=context)
print(
    f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
)
result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
print(
    f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
)
Output:
Answer: 'the task of extracting an answer from a text given a question', score: 0.6177, start: 34, end: 95
Answer: 'SQuAD dataset', score: 0.5152, start: 147, end: 160
Example with an AutoClass: pass the question and the text through the model, which outputs scores for how likely each token is to be the start index or the end index of the answer; take the tokens between the resulting start and end values and convert them to a string:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
text = r"""
Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""
questions = [
    "How many pretrained models are available in Transformers?",
    "What does Transformers provide?",
    " Transformers provides interoperability between which frameworks?",
]
for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits
    # Get the most likely beginning of answer with the argmax of the score
    answer_start = torch.argmax(answer_start_scores)
    # Get the most likely end of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1
    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    )
    print(f"Question: {question}")
    print(f"Answer: {answer}")
Output:
Question: How many pretrained models are available in Transformers?
Answer: over 32 +
Question: What does Transformers provide?
Answer: general - purpose architectures
Question: Transformers provides interoperability between which frameworks?
Answer: tensorflow 2. 0 and pytorch
Language modeling is the task of fitting a model to a corpus, usually a domain-specific one. That may sound abstract, so the examples below should make the meaning concrete.
All popular transformer-based models are trained with some variant of language modeling, e.g. BERT with masked language modeling and GPT-2 with causal language modeling.
Language modeling is also useful outside of pre-training, e.g. to shift the model distribution to a specific domain: take a model pre-trained on a very large corpus and fine-tune it on a new dataset, for example on research papers: lysandre/arxiv-nlp · Hugging Face
Masked language modeling (MLM) masks some tokens of a sequence with a masking token, and the model is trained to fill each mask with an appropriate token. This lets the model attend to both the right context (tokens to the right of the mask) and the left context (tokens to the left of the mask). Such a training setup creates a strong basis for downstream tasks that require bi-directional context, such as SQuAD[1].
For fine-tuning on MLM, see run_mlm.py.
Example with pipeline; it outputs the sequences with the mask filled, the confidence scores, and the filling tokens together with their token ids in the tokenizer vocabulary:
from transformers import pipeline
unmasker = pipeline("fill-mask")
from pprint import pprint
pprint(
    unmasker(
        f"HuggingFace is creating a {unmasker.tokenizer.mask_token} that the community uses to solve NLP tasks."
    )
)
Output:
[{'score': 0.1793,
'sequence': 'HuggingFace is creating a tool that the community uses to solve '
'NLP tasks.',
'token': 3944,
'token_str': ' tool'},
{'score': 0.1135,
'sequence': 'HuggingFace is creating a framework that the community uses to '
'solve NLP tasks.',
'token': 7208,
'token_str': ' framework'},
{'score': 0.0524,
'sequence': 'HuggingFace is creating a library that the community uses to '
'solve NLP tasks.',
'token': 5560,
'token_str': ' library'},
{'score': 0.0349,
'sequence': 'HuggingFace is creating a database that the community uses to '
'solve NLP tasks.',
'token': 8503,
'token_str': ' database'},
{'score': 0.0286,
'sequence': 'HuggingFace is creating a prototype that the community uses to '
'solve NLP tasks.',
'token': 17715,
'token_str': ' prototype'}]
Example with an AutoClass: replace one word in the sequence with tokenizer.mask_token, a string variable that is placed inside braces in the f-string so it gets substituted[2] (I think "word" here really means a token), then use the topk method to retrieve the 5 highest-scoring tokens:
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")
sequence = (
    "Distilled models are smaller than the models they mimic. Using them instead of the large "
    f"versions would help {tokenizer.mask_token} our carbon footprint."
)
inputs = tokenizer(sequence, return_tensors="pt")
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
token_logits = model(**inputs).logits
mask_token_logits = token_logits[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
# indices of the 5 highest-scoring tokens
# note that topk sorts by value by default; see its docs: https://pytorch.org/docs/stable/generated/torch.topk.html
for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))
    # decode each candidate token back to text and substitute it for tokenizer.mask_token in the original sequence
Output:
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.
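The pipeline example above also reports a score for each candidate. With the AutoClass route you can recover comparable probabilities by applying a softmax to the logits at the mask position; this is my own small sketch reusing the variables defined above:
# Sketch: softmax over the vocabulary at the mask position gives probabilities
# comparable to the pipeline's scores.
probs = torch.softmax(mask_token_logits, dim=1)
top_5 = torch.topk(probs, 5, dim=1)
for score, token_id in zip(top_5.values[0].tolist(), top_5.indices[0].tolist()):
    print(f"{tokenizer.decode([token_id])}: {round(score, 4)}")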
Causal language modeling (CLM) is the task of predicting the token that follows a sequence. In this setting the model only attends to the left context (the tokens to the left of the mask). Such a training setup is particularly suited to generation tasks.
For fine-tuning on CLM, see run_clm.py.
In general, the next token is predicted by sampling from the logits of the last hidden state the model produces for the input sequence.
Example with an AutoClass: use AutoModelForCausalLM, AutoTokenizer and the top_k_top_p_filtering() method to sample the next token after the input sequence:
from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering
import torch
from torch import nn
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
sequence = f"Hugging Face is based in DUMBO, New York City, and"
inputs = tokenizer(sequence, return_tensors="pt")
input_ids = inputs["input_ids"]
# get logits of last hidden state
next_token_logits = model(**inputs).logits[:, -1, :]
# filter
filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
# sample
probs = nn.functional.softmax(filtered_next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
generated = torch.cat([input_ids, next_token], dim=-1)
resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)
Output:
Hugging Face is based in DUMBO, New York City, and is
I didn't run this one myself. The docs say the next token should be is or features, and when I tried the inference API on the gpt2 model page the first generated token was also is, so let's go with that :)
Honestly I haven't fully understood this example; the above is roughly what it does. I'll write a more detailed explanation after reading more material.
The generation_utils.GenerationMixin.generate() method used in the next section can generate multiple tokens up to a specified length, instead of only one token at a time.
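As a small bridge to the next section, here is my own sketch (not from the docs) that continues the same GPT-2 prompt with generate() instead of sampling a single token; the max_length and sampling parameters are arbitrary choices:
# Sketch: reuse the gpt2 model, tokenizer and `inputs` from the causal LM example above.
generated_ids = model.generate(inputs["input_ids"], max_length=30, do_sample=True, top_k=50)
print(tokenizer.decode(generated_ids[0]))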
The goal of text generation (a.k.a. open-ended text generation) is to generate a coherent continuation of a given context (a piece of text).
Example with pipeline, using the GPT-2 model with Top-K sampling; see GPT-2's configuration file: config.json · gpt2 at main
from transformers import pipeline
text_generator = pipeline("text-generation")
print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))
Output:
[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a
"free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]
The pipeline object actually calls the PreTrainedModel.generate() method; for an introduction see part ③ of section 1 in my earlier post huggingface.transformers速成笔记:Pipeline推理和AutoClass_诸神缄默不语的博客-CSDN博客.
Example with an AutoClass: use XLNet and its corresponding tokenizer. This model can call the generate() method directly:
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("xlnet-base-cased")
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
# Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
a young Grigori Rasputin is asked by his father and a group of men to perform magic.
Rasputin has a vision and denounces one of the men as a horse thief. Although his
father initially slaps him for making such an accusation, Rasputin watches as the
man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing. """
prompt = "Today the weather is really nice and I am planning on "
inputs = tokenizer(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="pt")["input_ids"]
prompt_length = len(tokenizer.decode(inputs[0]))
outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
generated = prompt + tokenizer.decode(outputs[0])[prompt_length + 1 :]
print(generated)
For the padding text, see the explanation at the link in the code comment. I didn't fully understand the details either; roughly, the way XLNet computes its representations makes generation poor when the context is too short, so a hard-coded chunk of random text is prepended (an end-of-document marker is appended after this padding text, and then the real context follows).
The result I got when I ran this model myself:
Today the weather is really nice and I am planning on going for a walk in the park with my mom (she can walk and play golf) to explore and see something I didn't know. The park is actually a giant green, with lots of shade in the first two thirds of the way (it is en route to a major golf course, which is really neat). I decided to walk
The quality seems decent?
Without the padding text, the result I got is:
Today the weather is really nice and I am planning on going on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on going on on on on on on on on on on on on going on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on and on on on on on on on and on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on and on on on on on on on on on on on on on on and on on on on on on and on on on on on on on on on on on on on on on on and on on on and on on on on on on on on on on on on on on on on on on and on and on on on
Yep, pure artificial stupidity; clearly the padding text is necessary…
I also tried it in the inference widget on xlnet-base-cased · Hugging Face, and the result was just as nonsensical.
Text generation is currently supported in PyTorch for GPT-2, OpenAI GPT, CTRL, XLNet, Transfo-XL and Reformer.
As in the example above, XLNet and Transfo-XL need their input to be padded to work well.
GPT-2 is usually a good choice for open-ended text generation because it was trained on millions of web pages with a causal language modeling objective.
For how to use different decoding strategies for text generation, the docs point to the official blog post How to generate text: using different decoding methods for language generation with Transformers; I plan to write study notes on that post as well.
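As a quick preview of that post, here is my own sketch (not from the docs): the main decoding strategies are all selected through arguments of generate(); the parameter values below are arbitrary illustrative choices:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
input_ids = tokenizer("As far as I am concerned, I will", return_tensors="pt")["input_ids"]

# Greedy search: always pick the most probable next token.
greedy = model.generate(input_ids, max_length=30)
# Beam search: keep several candidate continuations and return the most probable one.
beam = model.generate(input_ids, max_length=30, num_beams=5, early_stopping=True)
# Top-k / top-p (nucleus) sampling: sample from a truncated next-token distribution.
sampled = model.generate(input_ids, max_length=30, do_sample=True, top_k=50, top_p=0.95)

for ids in (greedy, beam, sampled):
    print(tokenizer.decode(ids[0], skip_special_tokens=True))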
(I don't work on NER, so the following is my half-understood, half-guessed reading of the docs without careful verification; if anything is off, please tell me directly.)
Named Entity Recognition (NER) is a token classification task: identify the named entities in a text, e.g. label a token as part of a person, organization or location entity, or as belonging to no entity. An example dataset is CoNLL-2003. For fine-tuning on NER, see run_ner.py.
Example of NER with pipeline, classifying tokens into the following 9 classes: O (outside of any named entity), B-MISC / I-MISC (beginning of / inside a miscellaneous[3] entity), B-PER / I-PER (person), B-ORG / I-ORG (organization), B-LOC / I-LOC (location), where a B- tag marks the beginning of an entity that immediately follows another entity of the same type.
It uses a model fine-tuned on CoNLL-2003 (fine-tuned by @stefan-it from dbmdz):
from transformers import pipeline
ner_pipe = pipeline("ner")
sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
therefore very close to the Manhattan Bridge which is visible from the window."""
Print each returned entity:
for entity in ner_pipe(sequence):
    print(entity)
Output:
{'entity': 'I-ORG', 'score': 0.9996, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': 0.9910, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': 0.9982, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-ORG', 'score': 0.9995, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}
{'entity': 'I-LOC', 'score': 0.9994, 'index': 11, 'word': 'New', 'start': 40, 'end': 43}
{'entity': 'I-LOC', 'score': 0.9993, 'index': 12, 'word': 'York', 'start': 44, 'end': 48}
{'entity': 'I-LOC', 'score': 0.9994, 'index': 13, 'word': 'City', 'start': 49, 'end': 53}
{'entity': 'I-LOC', 'score': 0.9863, 'index': 19, 'word': 'D', 'start': 79, 'end': 80}
{'entity': 'I-LOC', 'score': 0.9514, 'index': 20, 'word': '##UM', 'start': 80, 'end': 82}
{'entity': 'I-LOC', 'score': 0.9337, 'index': 21, 'word': '##BO', 'start': 82, 'end': 84}
{'entity': 'I-LOC', 'score': 0.9762, 'index': 28, 'word': 'Manhattan', 'start': 114, 'end': 123}
{'entity': 'I-LOC', 'score': 0.9915, 'index': 29, 'word': 'Bridge', 'start': 124, 'end': 130}
In this sequence, "Hugging Face" is identified as an organization, and "New York City", "DUMBO" and "Manhattan Bridge" are identified as locations.
Example of NER with an AutoClass: define a sequence with known entities (e.g. the organization "Hugging Face" and the location "New York City"), run it through the model, take the argmax of each token's logits, and zip each token with its predicted class:
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = (
"Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, "
"therefore very close to the Manhattan Bridge."
)
inputs = tokenizer(sequence, return_tensors="pt")
tokens = inputs.tokens()
outputs = model(**inputs).logits
predictions = torch.argmax(outputs, dim=2)
Unlike the pipeline, the 0th class ('O', meaning the token is not part of any entity) is not removed here. Each prediction in predictions is an integer; the mapping from these integers to class names can be decoded via model.config.id2label:
for token, prediction in zip(tokens, predictions[0].numpy()):
    print((token, model.config.id2label[prediction]))
Output:
('[CLS]', 'O')
('Hu', 'I-ORG')
('##gging', 'I-ORG')
('Face', 'I-ORG')
('Inc', 'I-ORG')
('.', 'O')
('is', 'O')
('a', 'O')
('company', 'O')
('based', 'O')
('in', 'O')
('New', 'I-LOC')
('York', 'I-LOC')
('City', 'I-LOC')
('.', 'O')
('Its', 'O')
('headquarters', 'O')
('are', 'O')
('in', 'O')
('D', 'I-LOC')
('##UM', 'I-LOC')
('##BO', 'I-LOC')
(',', 'O')
('therefore', 'O')
('very', 'O')
('close', 'O')
('to', 'O')
('the', 'O')
('Manhattan', 'I-LOC')
('Bridge', 'I-LOC')
('.', 'O')
('[SEP]', 'O')
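To get output closer to the pipeline's, one simple option (my own sketch, not from the docs) is to drop the 'O' predictions when printing:
# Sketch: keep only tokens predicted to be part of some entity (label != 'O').
for token, prediction in zip(tokens, predictions[0].numpy()):
    label = model.config.id2label[prediction]
    if label != "O":
        print((token, label))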
The goal of summarization is to condense a long document into a shorter summary, e.g. the CNN / Daily Mail news dataset. For fine-tuning on summarization, see transformers/examples/pytorch/summarization at main · huggingface/transformers.
Example with pipeline, using a BART model fine-tuned on the CNN / Daily Mail dataset:
from transformers import pipeline
summarizer = pipeline("summarization")
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.
"""
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))
Output:
[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in
the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and
2002 . At one time, she was married to eight men at once, prosecutors say .'}]
The summarization pipeline is also built on PreTrainedModel.generate(); see the text generation section above.
Example with an AutoClass: add the T5-specific prefix "summarize: " before the article, then generate the summary with the PreTrainedModel.generate() method. The example below uses Google's T5 model, which was pre-trained on a multi-task mixture of datasets (including the CNN / Daily Mail dataset):
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")
# T5 uses a max_length of 512 so we cut the article to 512 tokens.
inputs = tokenizer("summarize: " + ARTICLE, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(
    inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True
)
print(tokenizer.decode(outputs[0]))
Output:
prosecutors say the marriages were part of an immigration scam. if convicted, barrientos faces two criminal
counts of "offering a false instrument for filing in the first degree" she has been married 10 times, nine of them
between 1999 and 2002.
The goal of translation is to translate text from one language into another, e.g. the WMT dataset with English as input and German as output. For fine-tuning on translation, see transformers/examples/pytorch/translation at main · huggingface/transformers.
Example with pipeline, using the same T5 model as in the summarization AutoClass example above (its training data includes the WMT dataset):
from transformers import pipeline
translator = pipeline("translation_en_to_de")
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))
Output:
[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]
The translation pipeline is also built on PreTrainedModel.generate(); see the text generation section above.
Example with an AutoClass: add the T5-specific prefix "translate English to German: " before the input, then generate the translation with the PreTrainedModel.generate() method:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")
inputs = tokenizer(
    "translate English to German: Hugging Face is a technology company based in New York and Paris",
    return_tensors="pt",
)
outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)
print(tokenizer.decode(outputs[0]))
Output:
Same as the result of the pipeline example.
[1] The docs give a reference for SQuAD: BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension; see section 4.2. ↩︎ ↩︎
[2] For this f-string usage, see: Python字符串f-string使用大括号{}_彭世瑜的博客-CSDN博客_python 字符串大括号 ↩︎
[3] miscellaneous: of various kinds; mixed. Here it presumably means entity types other than person, organization and location. ↩︎