For a project I need to fine-tune a BERT model on a dataset with more than twenty million samples covering a classification problem with more than twenty thousand classes. I took 2% of the data (about 500,000 samples) as the test set and kept the remaining twenty-plus million samples as the training set.
I adapted the procedure from the Transformers library's official "Fine-tuning with custom datasets" guide, which fine-tunes DistilBERT on the IMDb dataset. The original code is as follows:
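As a rough sketch, the 2% / 98% split itself can be done with scikit-learn's train_test_split. The file name mag_all.tsv and the tab-separated layout below are assumptions for illustration, not the actual project files:

from sklearn.model_selection import train_test_split

# Assumed format (hypothetical file): one sample per line, "original_text\t\tnormalized_label"
texts, labels = [], []
with open("mag_all.tsv", "r", encoding="utf-8") as f:
    for line in f:
        ori, nor = line.rstrip("\n").split("\t\t")
        texts.append(ori)
        labels.append(nor)

# Hold out 2% (~500k of ~20M samples) as the test set, keep the rest for training.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.02, random_state=42
)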
# read_imdb_split (defined earlier in the tutorial) returns parallel lists of review texts and 0/1 sentiment labels
train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
import torch
class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
# Fine tune with Trainer
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,                  # the instantiated Transformers model to be trained
    args=training_args,           # training arguments, defined above
    train_dataset=train_dataset,  # training dataset
    eval_dataset=val_dataset      # evaluation dataset
)

trainer.train()
Here the documentation tokenizes train_texts, val_texts and test_texts up front, and my first adaptation did the same. For a dataset this size that has two drawbacks: tokenizing all twenty-plus million samples before training even starts takes a very long time, and holding every encoding in memory at once has a huge memory footprint.
There is no need to process all of the data at the start, especially for very large datasets. Instead we can use a "lazy dataset" that tokenizes a batch only when the training loop actually needs it. The implementation is as follows:
class LazyTextMAG_Dataset(torch.utils.data.Dataset):
    """
    Works with datasets of simple lines of text. Lines are tokenized lazily
    rather than having their encodings built and pulled into memory up-front.
    This reduces the memory footprint when using large datasets, and also
    remedies a problem seen when using the other Datasets (above) whereby they
    take too long to load all of the data and tokenize it before doing any training.
    The file i/o work is handled in self.read_mag_file. This class just indexes
    into the loaded texts and applies the tokenization.
    """
    def __init__(self, tokenizer, filepath, label2mid_dict, block_size=32):
        """
        :args:
            tokenizer: a transformers PreTrainedTokenizer(Fast) instance
                : This tokenizer will be directly applied to the text data
                  to prepare the data for passing through the model.
            filepath: str
                : Path to the data file to be used.
            label2mid_dict: dict
                : key is the label string, value is the label id
            block_size: int
                : The maximum length of a sequence (truncated beyond this length).
        :returns: None.
        """
        self.texts, self.labels = self.read_mag_file(filepath, label2mid_dict)
        self.label2mid_dict = label2mid_dict
        self.tokenizer = tokenizer
        self.max_len = block_size

    def __len__(self):
        return len(self.labels)

    def read_mag_file(self, filepath, label2mid_dict):
        # Each line is "original_text\t\tnormalized_label"; map the label string to its id.
        texts = []
        labels = []
        with open(filepath, "r", encoding="utf-8") as f:
            for line in f:
                ori, nor = line.rstrip("\n").split("\t\t")
                mid = label2mid_dict[nor]
                texts.append(ori)
                labels.append(mid)
        return texts, labels

    def _text_to_encoding(self, item):
        """
        Defines the logic for transforming a single raw text item to a tokenized
        tensor ready to be passed into a model.
        :args:
            item: str
                : The text item as a string to be passed to the tokenizer.
        """
        return self.tokenizer(item, padding='max_length', truncation=True, max_length=self.max_len)

    def _text_to_item(self, text):
        """
        Convenience function to encapsulate re-used logic for converting raw
        text to the output of __getitem__ or __next__.
        :returns:
            Tokenizer encodings of the text if no errors.
            None if any errors encountered.
        """
        try:
            if text is not None:
                return self._text_to_encoding(text)
            else:
                return None
        except Exception:
            return None

    def __getitem__(self, _id):
        """
        :returns:
            dict of torch.Tensors (tokenized text plus the label).
        """
        text = self.texts[_id]
        label = self.labels[_id]
        encodings = self._text_to_item(text)
        item = {key: torch.tensor(value) for key, value in encodings.items()}
        item['label'] = torch.tensor(label)
        return item
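The label2mid_dict passed in above maps each label string to an integer id. Its construction is not shown here; a minimal sketch, assuming it is built by scanning the distinct labels in the training file (the same tab-separated format that read_mag_file expects), could look like this, with build_label2mid_dict being a hypothetical helper:

def build_label2mid_dict(filepath):
    """Assign a consecutive integer id to every distinct label in the file."""
    label2mid = {}
    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            _, nor = line.rstrip("\n").split("\t\t")
            if nor not in label2mid:
                label2mid[nor] = len(label2mid)
    return label2mid

train_label2mid_dict = build_label2mid_dict(train_filepath)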
The datasets are then created in this way:
train_dataset = LazyTextMAG_Dataset(tokenizer, train_filepath, train_label2mid_dict)
test_dataset = LazyTextMAG_Dataset(tokenizer, test_filepath, train_label2mid_dict)
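These datasets plug into the same Trainer setup as in the tutorial code above. One detail that matters for a 20,000+-class problem is passing num_labels when loading the model, since the classification head otherwise defaults to two outputs. The following is an illustrative sketch only; it keeps DistilBERT and the tutorial's hyperparameters for brevity (substitute BertForSequenceClassification and your own TrainingArguments as needed):

from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(train_label2mid_dict),  # 20,000+ classes instead of the default 2
)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    logging_dir='./logs',
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,  # no separate validation split here, so the test set is reused
)

trainer.train()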
Coming back to the dataset class itself: it loads all of the raw text into memory up front and only applies the tokenizer batch by batch, when the encodings are actually needed. (Of course, an even lazier approach would be to read each text from disk only when it is requested; compared with that, loading all of the raw text up front avoids repeated I/O requests and is probably faster, though that is just a guess.)
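For completeness, the even lazier variant guessed at above, reading each line from disk only when it is requested, could record byte offsets once and seek on demand. This is only a sketch of that trade-off (one open/seek per sample), not something I have benchmarked:

class FullyLazyMAG_Dataset(torch.utils.data.Dataset):
    """Keeps only byte offsets in memory; reads and tokenizes each line on demand."""
    def __init__(self, tokenizer, filepath, label2mid_dict, block_size=32):
        self.tokenizer = tokenizer
        self.filepath = filepath
        self.label2mid_dict = label2mid_dict
        self.max_len = block_size
        # One pass over the file to record where every line starts.
        self.offsets = []
        with open(filepath, "rb") as f:
            offset = 0
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, _id):
        # Seek to the recorded offset and read exactly one line (extra I/O per sample).
        with open(self.filepath, "rb") as f:
            f.seek(self.offsets[_id])
            line = f.readline().decode("utf-8")
        ori, nor = line.rstrip("\n").split("\t\t")
        encodings = self.tokenizer(ori, padding='max_length',
                                   truncation=True, max_length=self.max_len)
        item = {key: torch.tensor(value) for key, value in encodings.items()}
        item['label'] = torch.tensor(self.label2mid_dict[nor])
        return item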