Notes on Fixing Out-of-Memory Errors When Loading a Very Large Dataset in PyTorch

Problem Description

For a project, I needed to fine-tune BERT on a dataset with over 20 million samples and more than 20,000 classes. I took 2% of the data (about 500,000 samples) as the test set and used the remaining 20-million-plus samples as the training set.
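
For reference, the 2% split can be done on the raw lines before any tokenization. A minimal sketch, assuming all samples sit in one tab-separated file; the filename and random_state here are only placeholders:

from sklearn.model_selection import train_test_split

# Read the raw tab-separated lines once; nothing is tokenized at this point.
with open("mag_samples.tsv", "r", encoding="utf-8") as f:   # hypothetical filename
    raw_lines = f.readlines()

# Hold out 2% of the ~20M lines (roughly 500k) as the test set.
train_lines, test_lines = train_test_split(raw_lines, test_size=0.02, random_state=42)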

I adapted my code from the fine-tuning of BERT on the IMDb dataset shown in the Fine-tuning with custom datasets page of the official Transformers documentation. The original code is as follows:

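# read_imdb_split is a helper defined earlier on the same documentation page;
# it walks the aclImdb pos/neg folders and returns the texts and their labels.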
train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')

from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

import torch

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

# Fine tune with Trainer

from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()

Here, the documentation tokenizes train_texts, val_texts, and test_texts up front. My first adaptation did the same, but this approach has a couple of drawbacks:

  1. Tokenization is slow, and with BertTokenizer it is extremely slow; BertTokenizerFast is noticeably better (see the sketch after this list).
  2. It needs a lot of memory: tokenizing train_texts ran out of memory outright, and the Jupyter Lab process was killed.
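
The first drawback is easy to mitigate, since the fast tokenizer is a drop-in replacement. A minimal sketch, assuming a standard BERT checkpoint (the checkpoint name is only an example):

from transformers import BertTokenizerFast

# Rust-backed fast tokenizer; same call signature as BertTokenizer, but much
# faster when encoding tens of millions of lines.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

The second drawback, however, is not fixed by a faster tokenizer: holding every encoding in memory is what kills the process, and that is what the lazy dataset below addresses.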

Solution

When preparing a dataset, there is no need to process all of the data up front, especially for a very large dataset. We can instead use a "lazy dataset": a batch is only tokenized when training actually needs it. The implementation is as follows:

class LazyTextMAG_Dataset(torch.utils.data.Dataset):
    """
    Works with datasets of simple lines of text. Lines are loaded and tokenized
    lazily rather than being pulled into memory up-front. This reduces the memory
    footprint when using large datasets, and also remedies the problem seen with
    the eager-loading Dataset above, which takes too long to load and tokenize
    all of the data before any training can start.

    The file I/O is done once in read_mag_file; __getitem__ then just indexes
    into the stored texts and applies the tokenizer.
    """
    def __init__(self, tokenizer, filepath, label2mid_dict, block_size=32):
        """
        :args:
            tokenizer: an instantiated tokenizer, e.g. transformers.BertTokenizerFast
                     : This tokenizer will be directly applied to the text data
                       to prepare the data for passing through the model.
            filepath: str
                     : Path to the data file to be used.
            label2mid_dict: dict
                            : key is label string, value is label id
            block_size: int
                      : The maximum length of a sequence (truncated beyond this length).
        :returns: None.
        """
        self.texts, self.labels = self.read_mag_file(filepath, label2mid_dict)
        self.label2mid_dict = label2mid_dict
        self.tokenizer = tokenizer
        self.max_len = block_size

        
    def __len__(self):
        return len(self.labels)
    
            
    def read_mag_file(self, filepath, label2mid_dict):
        texts = []
        labels = []
        with open(filepath, "r", encoding="utf-8") as f:
            for line in f:
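                # each line is "original_text\t\tnormalized_label"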
                ori, nor = line.replace("\n", "").split("\t\t")
                mid = label2mid_dict[nor]
                texts.append(ori)
                labels.append(mid)

        return texts, labels

    
    def _text_to_encoding(self, item):
        """
        Defines the logic for transforming a single raw text item to a tokenized
        tensor ready to be passed into a model.

        :args:
            item: str
                : The text item as a string to be passed to the tokenizer.
        """
        return self.tokenizer(item, padding='max_length', truncation=True, max_length=self.max_len)

    
    def _text_to_item(self, text):
        """
        Convenience function to encapsulate re-used logic for converting raw
        text to the output of __getitem__ or __next__.

        :returns:
            The tokenizer output (a BatchEncoding) if no errors.
            None if any error is encountered.
        """
        try:
            if (text is not None):
                return self._text_to_encoding(text)
            else:
                return None
        except Exception:  # tokenization failed for this item
            return None

        
    def __getitem__(self, _id):
        """
        :returns:
            dict mapping the encoding keys (input_ids, attention_mask, ...) and
            'label' to torch.Tensors for the sample at index _id.
            Raises IndexError if tokenization fails for this sample.
        """
        text = self.texts[_id]
        label = self.labels[_id]
        encodings = self._text_to_item(text)
        if encodings is None:
            raise IndexError(f"could not tokenize sample {_id}")

        item = {key: torch.tensor(value) for key, value in encodings.items()}
        item['label'] = torch.tensor(label)
        return item

The datasets are then built this way:

train_dataset = LazyTextMAG_Dataset(tokenizer, train_filepath, train_label2mid_dict)
test_dataset = LazyTextMAG_Dataset(tokenizer, test_filepath, train_label2mid_dict)
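
The lazy datasets plug into the same Trainer setup as before. A minimal sketch, assuming the number of classes equals the size of the label mapping; the checkpoint name and training-argument values are only placeholders:

from transformers import BertForSequenceClassification, Trainer, TrainingArguments

# The classification head has to match the ~20k-class label mapping.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(train_label2mid_dict)
)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,   # tokenization now happens inside __getitem__
    eval_dataset=test_dataset,
)
trainer.train()

Note that Trainer's default data collator maps the 'label' key produced by __getitem__ to the 'labels' field the model expects, so no custom collate_fn should be needed.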

Here, the class still loads all of the raw text into memory at construction time and only tokenizes it on demand, batch by batch. (An even more memory-frugal approach would be to also load the text itself only when needed; compared with that, reading everything up front avoids per-item I/O requests and is probably faster, though this is only a guess.)
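
For completeness, a fully lazy variant that keeps only byte offsets in memory and reads each line from disk on demand might look roughly like this. It is a sketch under the same tab-separated format as read_mag_file; the LazyFileMAG_Dataset name and its details are not from the original code:

class LazyFileMAG_Dataset(torch.utils.data.Dataset):
    """Stores only byte offsets; each sample's line is read from disk on demand."""
    def __init__(self, tokenizer, filepath, label2mid_dict, block_size=32):
        self.tokenizer = tokenizer
        self.filepath = filepath
        self.label2mid_dict = label2mid_dict
        self.max_len = block_size
        # One pass over the file to record where every line starts.
        self.offsets = []
        with open(filepath, "rb") as f:
            offset = 0
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, _id):
        # Seek straight to the stored offset and read exactly one line.
        with open(self.filepath, "rb") as f:
            f.seek(self.offsets[_id])
            line = f.readline().decode("utf-8")
        ori, nor = line.rstrip("\n").split("\t\t")
        encodings = self.tokenizer(ori, padding='max_length', truncation=True,
                                   max_length=self.max_len)
        item = {key: torch.tensor(value) for key, value in encodings.items()}
        item['label'] = torch.tensor(self.label2mid_dict[nor])
        return item

The trade-off is one open/seek/read per sample, which a DataLoader with several workers usually hides; whether it beats keeping all the text in memory depends on the disk.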

References

  1. Memory error : load 200GB file in run_language_model.py, https://github.com/huggingface/transformers/issues/3083
