Relevant resources and their URLs:
Resource | URL |
---|---|
GitHub repository | https://github.com/huggingface/transformers |
Official documentation | https://huggingface.co/docs/transformers/index |
Pretrained model hub | https://huggingface.co/models |
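Both PyTorch and TensorFlow are supported, but PyTorch is recommended; all code below is based on PyTorch.
Libraries to install:
Existing models and checkpoints can be used directly for a given task, such as sentiment classification, text generation, named entity recognition, or question answering.
Supported tasks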
Task | Description | Modality | Pipeline identifier |
---|---|---|---|
Text classification | assign a label to a given sequence of text | NLP | pipeline(task="sentiment-analysis") |
Text generation | generate text that follows a given prompt | NLP | pipeline(task="text-generation") |
Named entity recognition | assign a label to each token in a sequence (people, organization, location, etc.) | NLP | pipeline(task="ner") |
Question answering | extract an answer from the text given some context and a question | NLP | pipeline(task="question-answering") |
Fill-mask | predict the correct masked token in a sequence | NLP | pipeline(task="fill-mask") |
Summarization | generate a summary of a sequence of text or document | NLP | pipeline(task="summarization") |
Translation | translate text from one language into another | NLP | pipeline(task="translation") |
Image classification | assign a label to an image | Computer vision | pipeline(task="image-classification") |
Image segmentation | assign a label to each individual pixel of an image (supports semantic, panoptic, and instance segmentation) | Computer vision | pipeline(task="image-segmentation") |
Object detection | predict the bounding boxes and classes of objects in an image | Computer vision | pipeline(task="object-detection") |
Audio classification | assign a label to an audio file | Audio | pipeline(task="audio-classification") |
Automatic speech recognition | extract speech from an audio file into text | Audio | pipeline(task="automatic-speech-recognition") |
Visual question answering | given an image and a question, correctly answer a question about the image | Multimodal | pipeline(task="vqa") |
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
Test the sentiment of the sentence "We are very happy to show you the Transformers library."
classifier("We are very happy to show you the Transformers library.")
Output:
[{'label': 'POSITIVE', 'score': 0.9998}]
Test the sentiment of a batch of sentences
results = classifier(["We are very happy to show you the Transformers library.", "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
Output:
label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
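The score is a probability, which is why a lukewarm sentence like the second one lands close to 0.5. The sketch below shows how a label and score could be derived from a classifier's raw outputs. The logits are made up for illustration, and `softmax` and `top_label` are hypothetical helpers, not Transformers functions.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_label(logits, id2label):
    """Return the highest-probability label together with its score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Made-up logits for two sentences; index 0 = NEGATIVE, index 1 = POSITIVE
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
confident = top_label([-4.0, 4.5], id2label)
uncertain = top_label([0.1, 0.0], id2label)
print(confident)
print(uncertain)
```

When the two logits are nearly equal, the winning probability is only slightly above 0.5, which is how a low-confidence prediction shows up in the pipeline output.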
See the "Loading a pretrained model" section for details.
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
# Downloads and caches this model
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
Test the sentiment of the sentence "We are very happy to show you the Transformers library."
classifier("We are very happy to show you the Transformers library.")
Output:
[{'label': '5 stars', 'score': 0.772534966468811}]
Test the sentiment of a batch of sentences
results = classifier(["We are very happy to show you the Transformers library.", "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
Output:
label: 5 stars, with score: 0.7725
label: 5 stars, with score: 0.2365
Given a context passage, ask questions about it
from transformers import pipeline
question_answerer = pipeline("question-answering")
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""
The model's answers
Question: question="What is extractive question answering?"
result = question_answerer(question="What is extractive question answering?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Output:
Answer: 'the task of extracting an answer from a text given a question', score: 0.6177, start: 34, end: 95
Question: question="What is a good example of a question answering dataset?"
result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
print(
    f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
)
Output:
Answer: 'SQuAD dataset', score: 0.5152, start: 147, end: 160
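The start and end values are character offsets into context, so slicing the context with them recovers each answer. A library-free check using the offsets printed in the two outputs above (the variable names first_answer and second_answer are only for illustration):

```python
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

# Offsets copied from the pipeline outputs above; end is exclusive
first_answer = context[34:95]
second_answer = context[147:160]
print(first_answer)
print(second_answer)
```

Because the model is extractive, every answer is a contiguous span of the context rather than newly generated text.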
BERT is the model architecture, while model_name identifies a set of weights (a checkpoint) that is loaded into that architecture
from transformers import BertTokenizer
from transformers import BertModel
from transformers import BertConfig
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
config = BertConfig.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
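The checkpoint is normally downloaded and cached locally. To choose where models are cached, set this environment variable before importing transformers:
import os
os.environ['TRANSFORMERS_CACHE'] = './cache'  # cache directory
The .from_pretrained() method loads a pretrained model, so there is no need to train from scratch.
Transformers currently supports a large number of model architectures; a small excerpt: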
albert — AlbertConfig (ALBERT model)
bart — BartConfig (BART model)
beit — BeitConfig (BEiT model)
bert — BertConfig (BERT model)
bert-generation — BertGenerationConfig (Bert Generation model)
big_bird — BigBirdConfig (BigBird model)
bigbird_pegasus — BigBirdPegasusConfig (BigBird-Pegasus model)
blenderbot — BlenderbotConfig (Blenderbot model)
blenderbot-small — BlenderbotSmallConfig (BlenderbotSmall model)
bloom — BloomConfig (BLOOM model)
camembert — CamembertConfig (CamemBERT model)
canine — CanineConfig (CANINE model)
clip — CLIPConfig (CLIP model)
codegen — CodeGenConfig (CodeGen model)
conditional_detr — ConditionalDetrConfig (Conditional DETR model)
convbert — ConvBertConfig (ConvBERT model)
convnext — ConvNextConfig (ConvNeXT model)
...
The Auto classes infer the model architecture automatically from the checkpoint you provide.
The most commonly used are AutoTokenizer, AutoModel, and AutoConfig.
from transformers import AutoTokenizer, AutoModel, AutoConfig
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
config = AutoConfig.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
The examples above load a model directly by its name; you can also download a model first and load it from a local path.
model_path = "../pretrained_model/distilbert-base-uncased"
model = AutoModel.from_pretrained(model_path)
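Conceptually, the Auto classes work by dispatch: a checkpoint's config.json carries a model_type field, and that string selects the concrete architecture. The toy sketch below imitates the idea with a plain dictionary. Every name in it (ToyBertConfig, ToyDistilBertConfig, CONFIG_REGISTRY, toy_auto_config) is hypothetical, and the JSON is a made-up, heavily shortened stand-in for a real config file; this is not the library's implementation.

```python
import json

# Hypothetical stand-ins for architecture-specific config classes
class ToyBertConfig:
    model_type = "bert"

    def __init__(self, **kwargs):
        self.hidden_size = kwargs.get("hidden_size", 768)

class ToyDistilBertConfig:
    model_type = "distilbert"

    def __init__(self, **kwargs):
        self.dim = kwargs.get("dim", 768)

# Registry mapping each model_type string to its config class
CONFIG_REGISTRY = {cls.model_type: cls for cls in (ToyBertConfig, ToyDistilBertConfig)}

def toy_auto_config(config_json):
    """Pick the config class named by the checkpoint's model_type field."""
    fields = json.loads(config_json)
    model_type = fields.pop("model_type")
    if model_type not in CONFIG_REGISTRY:
        raise ValueError(f"Unknown model_type: {model_type!r}")
    return CONFIG_REGISTRY[model_type](**fields)

config = toy_auto_config('{"model_type": "bert", "hidden_size": 768}')
print(type(config).__name__)
```

The same pattern explains why one call site can serve many architectures: only the checkpoint changes, not the code.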
AutoModelForXXXX classes load a pretrained model with the head for a given task
# sequence classification task
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
# token classification task
from transformers import AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased")
Before training a model, your data must be preprocessed into the model's input format.
Define a tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sentence = "Do not meddle in the affairs of wizards, for they are subtle and quick to anger."
encoded_input = tokenizer(sentence)
# Alternatively, this can be written as
encoded_input = tokenizer.encode_plus(sentence)
print(encoded_input)
Output:
{'input_ids': [101, 2091, 1136, 1143, 13002, 1107, 1103, 5707, 1104, 16678, 1116, 117, 1111, 1152, 1132, 11515, 1105, 3613, 1106, 4470, 119, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
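Difference between encode and encode_plus: encode returns only the list of input_ids, while encode_plus returns a dict with the following fields
input_ids: the index of each token in the vocabulary
token_type_ids: indicates which sentence each token belongs to
attention_mask: indicates which tokens take part in self-attention
sentence = "Hello, my son is laughing."
print(tokenizer.encode(sentence))
print(tokenizer.encode_plus(sentence))
tokenizer.decode(encoded_input["input_ids"])
Output:
'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'
The tokenizer automatically adds [CLS] at the start of the sentence and [SEP] at the end.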
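How the special tokens and token_type_ids fit together is easiest to see with a sentence pair. The sketch below wraps already-tokenized ids BERT-style, using the ids 101 for [CLS] and 102 for [SEP] seen in the output above. The word ids 7 to 11 are made up, and toy_build_inputs is a hypothetical helper rather than a tokenizer method.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] in the output above

def toy_build_inputs(ids_a, ids_b=None):
    """Wrap token ids as [CLS] A [SEP], or [CLS] A [SEP] B [SEP] for a pair."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID]
    token_type_ids = [0] * len(input_ids)  # first segment is marked 0
    if ids_b is not None:
        input_ids += ids_b + [SEP_ID]
        token_type_ids += [1] * (len(ids_b) + 1)  # second segment is marked 1
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": [1] * len(input_ids),  # no padding, so attend everywhere
    }

single = toy_build_inputs([7, 8, 9])
pair = toy_build_inputs([7, 8, 9], [10, 11])
print(single)
print(pair)
```

With a single sentence every token_type_id is 0, which matches the all-zero lists in the outputs of this section.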
Multiple sentences can be encoded in one call
batch_sentences = [
"But what about second breakfast?",
"Don't think he knows about second breakfast, Pip.",
"What about elevensies?",
]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)
Output:
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102],
[101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
[101, 1327, 1164, 5450, 23434, 136, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1]]}
Sentences in the same batch are padded to the length of the longest one using [PAD], whose token id is 0
batch_sentences = [
"But what about second breakfast?",
"Don't think he knows about second breakfast, Pip.",
"What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True)  # enable padding
print(encoded_input)
Output:
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
[101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
[101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
[PAD] tokens do not need to be attended to, so their attention_mask is 0
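The padding rule can be reproduced by hand. The sketch below pads the three id lists from the unpadded output earlier in this section to the longest length and builds the matching attention_mask; toy_pad is a hypothetical helper, not part of the tokenizer API.

```python
PAD_ID = 0  # id of [PAD], as stated above

def toy_pad(batch_ids, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and mask out the padding."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 = ignore this position
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [
    [101, 1252, 1184, 1164, 1248, 6462, 136, 102],
    [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
    [101, 1327, 1164, 5450, 23434, 136, 102],
]
padded = toy_pad(batch)
print(padded["attention_mask"][2])
```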
If a sentence is longer than the model's maximum length (max_length), for example more than 512 tokens, it is truncated
batch_sentences = [
"But what about second breakfast?",
"Don't think he knows about second breakfast, Pip.",
"What about elevensies?",
]
encoded_input = tokenizer(batch_sentences,
                          padding=True,
                          truncation=True)  # enable truncation
print(encoded_input)
Output:
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
[101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
[101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
The sentences here are all short, so none of them is truncated.
You can also set max_length yourself
batch_sentences = [
"But what about second breakfast?",
"Don't think he knows about second breakfast, Pip.",
"What about elevensies?",
]
encoded_input = tokenizer(batch_sentences,
                          padding=True,
                          truncation=True,  # enable truncation
                          max_length=8)  # maximum sequence length
print(encoded_input)
Output:
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102],
[101, 1790, 112, 189, 1341, 1119, 3520, 102],
[101, 1327, 1164, 5450, 23434, 136, 102, 0]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 0]]}
Sentences longer than max_length=8 are truncated; shorter ones are padded to the longest sentence in the batch, which here is also 8 tokens
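Note that the truncated second sentence still ends with 102 ([SEP]): the cut removes ordinary tokens, not the closing special token. A minimal sketch that reproduces the output above from the full id lists, assuming right-side truncation and padding; toy_truncate_and_pad is a hypothetical helper.

```python
def toy_truncate_and_pad(batch_ids, max_length, pad_id=0):
    """Cut sequences to max_length while keeping the final [SEP], then pad."""
    truncated = []
    for ids in batch_ids:
        if len(ids) > max_length:
            # keep the first max_length - 1 ids and re-attach the closing [SEP]
            ids = ids[: max_length - 1] + [ids[-1]]
        truncated.append(ids)
    longest = max(len(ids) for ids in truncated)
    return [ids + [pad_id] * (longest - len(ids)) for ids in truncated]

batch = [
    [101, 1252, 1184, 1164, 1248, 6462, 136, 102],
    [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
    [101, 1327, 1164, 5450, 23434, 136, 102],
]
result = toy_truncate_and_pad(batch, max_length=8)
for row in result:
    print(row)
```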
Setting return_tensors="pt" returns PyTorch tensors instead of the lists shown earlier
batch_sentences = [
"But what about second breakfast?",
"Don't think he knows about second breakfast, Pip.",
"What about elevensies?",
]
encoded_input = tokenizer(batch_sentences,
                          padding=True,
                          truncation=True,  # enable truncation
                          max_length=8,  # maximum sequence length
                          return_tensors="pt")
print(encoded_input)
Output:
{'input_ids':
tensor([[ 101, 1252, 1184, 1164, 1248, 6462, 136, 102],
[ 101, 1790, 112, 189, 1341, 1119, 3520, 102],
[ 101, 1327, 1164, 5450, 23434, 136, 102, 0]]), 'token_type_ids':
tensor([[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0]]),
'attention_mask':
tensor([[1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 0]])}
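Continuing to train a pretrained model for a specific task is called fine-tuning.
There are two approaches: the Trainer API, or a native PyTorch training loop.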
Load the built-in dataset yelp_review_full
from datasets import load_dataset
dataset = load_dataset("yelp_review_full")
dataset
Output:
DatasetDict({
train: Dataset({
features: ['label', 'text'],
num_rows: 650000
})
test: Dataset({
features: ['label', 'text'],
num_rows: 50000
})
})
There are two splits, train and test, with 650,000 and 50,000 examples respectively.
The features are label and text.
dataset["train"][100]
Output:
{'label': 0,
'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I\'ve worked at more than one location. I expect bad days, bad moods, and the occasional mistake. But I have yet to have a decent experience at this store. It will remain a place I avoid unless someone in my party needs to avoid illness from low blood sugar. Perhaps I should go back to the racially biased service of Steak n Shake instead!'}
Preprocess the data with the tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets
Output:
DatasetDict({
train: Dataset({
features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 650000
})
test: Dataset({
features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 50000
})
})
The input_ids, token_type_ids, and attention_mask columns have been added
tokenized_datasets["train"][100]
Output:
{'label': 0,
'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I\'ve worked at more than one location. I expect bad days, bad moods, and the occasional mistake. But I have yet to have a decent experience at this store. It will remain a place I avoid unless someone in my party needs to avoid illness from low blood sugar. Perhaps I should go back to the racially biased service of Steak n Shake instead!',
'input_ids': [101, 1422, 11471, 1111, 9092, 1116, 1132, 189, 6034, 1344, 119, 1252, 1111, 1141, 1106, 1253, 8693, 1177, 14449, 1193, 119, 119, 119, 1115, 2274, 1380, 1957, 106, 165, 183, 1942, 4638, 5948, 2852, 1261, 1139, 2053, 112, 188, 1546, 117, 1173, 13796, 5794, 1143, 119, 146, 1125, 1106, 2049, 1991, 1107, 1524, 1104, 170, 5948, 2852, 1150, 1533, 1117, 8077, 1106, 3074, 1113, 1103, 1825, 139, 2036, 3048, 11607, 2137, 1143, 119, 146, 3932, 1166, 1421, 1904, 1111, 170, 23275, 1546, 1115, 1529, 11228, 1141, 5102, 112, 188, 7696, 119, 1258, 2903, 1160, 1234, 1150, 2802, 1170, 1143, 1129, 3541, 1147, 2094, 117, 146, 1455, 1187, 2317, 1108, 119, 1109, 2618, 1408, 13732, 1120, 1103, 5948, 11528, 1111, 165, 107, 2688, 1228, 1147, 3791, 165, 107, 1165, 1152, 1238, 112, 189, 1138, 1147, 2094, 119, 1252, 4534, 5948, 2852, 1108, 5456, 1485, 1343, 7451, 117, 1105, 1103, 2618, 1108, 1103, 1141, 2688, 2094, 1106, 5793, 1105, 8650, 1103, 8190, 119, 165, 183, 1942, 4638, 2618, 1108, 14708, 1165, 2368, 1143, 1139, 1546, 119, 1153, 1238, 112, 189, 1294, 1612, 1115, 146, 1125, 1917, 21748, 150, 3663, 155, 8231, 27514, 2101, 1942, 117, 1105, 1309, 1256, 1125, 1103, 1260, 2093, 7232, 1106, 12529, 1115, 146, 1464, 146, 1108, 2033, 2869, 1555, 119, 165, 183, 2240, 112, 1396, 8527, 1120, 1672, 9092, 1116, 7724, 1111, 1166, 1476, 1201, 119, 146, 112, 1396, 1589, 1120, 1167, 1190, 1141, 2450, 119, 146, 5363, 2213, 1552, 117, 2213, 6601, 1116, 117, 1105, 1103, 7957, 6223, 119, 1252, 146, 1138, 1870, 1106, 1138, 170, 11858, 2541, 1120, 1142, 2984, 119, 1135, 1209, 3118, 170, 1282, 146, 3644, 4895, 1800, 1107, 1139, 1710, 2993, 1106, 3644, 6946, 1121, 1822, 1892, 6656, 119, 5203, 146, 1431, 1301, 1171, 1106, 1103, 5209, 1193, 15069, 1174, 1555, 1104, 1457, 23783, 183, 25775, 1939, 106, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
You can select just a small subset of the dataset
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
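Transformers provides a ready-made Trainer class.
The label in the yelp_review_full dataset takes five values (0 to 4, corresponding to 1 to 5 stars), so set num_labels=5;
the model is a five-class classifier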
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
Use the TrainingArguments class to hold the hyperparameters; the defaults are used here
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(output_dir="test_trainer")
The metric used during training has to be chosen by you; the Evaluate library provides a simple accuracy implementation
import numpy as np
import evaluate
metric = evaluate.load("accuracy")
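Define a compute_metrics function. It turns the model's output logits into predicted classes with argmax; applying softmax first is unnecessary because it does not change which class scores highest
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)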
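To see why argmax on the raw logits is enough, the library-free sketch below checks that softmax never changes the winning class and then computes accuracy by hand. The logits and labels are made up, and argmax, softmax, and toy_accuracy are hypothetical helpers standing in for NumPy and the Evaluate metric.

```python
import math

def softmax(row):
    """Probabilities from logits; monotonic, so the ordering is preserved."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def toy_accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Made-up five-class logits for four examples
logits = [
    [0.1, 2.0, -1.0, 0.3, 0.0],
    [1.5, 0.2, 0.1, 0.0, -0.5],
    [0.0, 0.1, 0.2, 0.3, 3.0],
    [2.0, 1.0, 0.0, -1.0, -2.0],
]
labels = [1, 0, 4, 3]

same_winner = all(argmax(row) == argmax(softmax(row)) for row in logits)
scores = toy_accuracy(logits, labels)
print(same_winner, scores)
```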
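To watch the metric during training, set evaluation_strategy="epoch" so that the score is reported at the end of every epoch (recent Transformers releases name this argument eval_strategy)
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")
Create a Trainer instance, passing in the model, training arguments, training set, evaluation set, and metric function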
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
Train the model
trainer.train()
Output:
TrainOutput(global_step=39,
training_loss=1.516761681972406,
metrics={
'train_runtime': 109.446,
'train_samples_per_second': 27.411,
'train_steps_per_second': 0.356,
'total_flos': 789354427392000.0,
'train_loss': 1.516761681972406,
'epoch': 3.0
})
For a native PyTorch training loop, first process tokenized_datasets further: the text column is not needed, only input_ids, token_type_ids, attention_mask, and labels
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
For this task, use only a small subset of the dataset
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
Wrap the datasets in DataLoaders
from torch.utils.data import DataLoader
train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)
Load the model
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
Define the optimizer and the learning rate scheduler
Use the AdamW optimizer
from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=5e-5)
from transformers import get_scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)
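With the settings above, the step count and the shape of the "linear" schedule can be worked out by hand. This is a sketch of the schedule's shape under the stated assumptions (1000 training examples, batch_size=8, a single process), not the library's code; linear_lr is a hypothetical helper.

```python
import math

def linear_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Linear warmup from 0 to base_lr, then linear decay to 0."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)

num_epochs = 3
steps_per_epoch = math.ceil(1000 / 8)              # batches per epoch
num_training_steps = num_epochs * steps_per_epoch  # total optimizer steps

for step in (0, 150, num_training_steps):
    print(step, linear_lr(step, 5e-5, 0, num_training_steps))
```

With num_warmup_steps=0 the learning rate starts at its full value of 5e-5 and falls in a straight line to 0 at the last step.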
If a GPU is available, move the model onto it
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
Training
from tqdm.auto import tqdm
progress_bar = tqdm(range(num_training_steps))
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
Evaluation
import evaluate
metric = evaluate.load("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
metric.compute()
Output:
{'accuracy': 0.587}
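The add_batch / compute pattern accumulates counts over all batches and only divides at the end, so the result is the accuracy over the whole evaluation set rather than an average of per-batch accuracies. A toy accumulator illustrating the idea; ToyAccuracy is hypothetical and only mimics the two method names used above.

```python
class ToyAccuracy:
    """Accumulate correct/total counts across batches, then report accuracy."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def add_batch(self, predictions, references):
        # count matches in this batch and add them to the running totals
        self.correct += sum(p == r for p, r in zip(predictions, references))
        self.total += len(references)

    def compute(self):
        return {"accuracy": self.correct / self.total}

toy_metric = ToyAccuracy()
toy_metric.add_batch(predictions=[1, 0, 2], references=[1, 0, 1])  # 2 of 3 correct
toy_metric.add_batch(predictions=[4], references=[4])              # 1 of 1 correct
final = toy_metric.compute()
print(final)
```

Averaging the two batch accuracies instead would overweight the single-example batch, which is why the counts are pooled first.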
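On a machine with multiple GPUs, or across several multi-GPU machines, the accelerate library can be used to speed up training
Install the library
pip install accelerate
Create an Accelerator object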
from accelerate import Accelerator
accelerator = Accelerator()
Pass the earlier dataloaders, model, and optimizer through accelerator.prepare
train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)
In the earlier training code, replace loss.backward() with accelerator.backward(loss)
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        # batch = {k: v.to(device) for k, v in batch.items()}  # no longer needed
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
First create a configuration file
accelerate config
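Then launch the accelerated training
accelerate launch train.py
When working directly in a notebook, first put all the training-related code into a training_function, then call notebook_launcher
from accelerate import notebook_launcher
notebook_launcher(training_function, num_processes=8)  # number of GPUs to use
Note: the requirements for using this are fairly strict
torch.cuda.is_available()
Complete accelerated code for a notebook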
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, get_scheduler
from torch.utils.data import DataLoader
from accelerate import Accelerator
from torch.optim import AdamW
from tqdm.auto import tqdm
dataset = load_dataset("yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
def training_function():
    accelerator = Accelerator()
    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
    model.to(device)
    optimizer = AdamW(model.parameters(), lr=5e-5)
    small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
    small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
    train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
    eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)
    train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
        train_dataloader, eval_dataloader, model, optimizer
    )
    num_epochs = 3
    num_training_steps = num_epochs * len(train_dataloader)
    lr_scheduler = get_scheduler(
        name="linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps
    )
    progress_bar = tqdm(range(num_training_steps))
    model.train()
    for epoch in range(num_epochs):
        for batch in train_dataloader:
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)
from accelerate import notebook_launcher
notebook_launcher(training_function, num_processes=8)