RNN/LSTM (4): Adapting the Hands-On Example

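Contents

  • Background
  • Writing the code
    • Train/validation split
    • Building the vocabulary
    • Building the iterators
    • Implementing the LSTM model
    • Creating the model
    • Writing the training logic
  • Pitfalls
  • Summary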

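Background

After working through RNN/LSTM (2): Hands-On Example, I found that it relies on the older torchtext 0.9, many of whose APIs have been removed in newer releases. This article rewrites that code against torchtext 0.14.

How to download the dataset is covered in the earlier article.

The torchtext API changed substantially from version 0.12 onward, and all code in this article is written against the new API. Readers used to the torchtext 0.9 API may want to read the torchtext 0.14 practice guide first (it applies equally to 0.12) to adjust their mental model.

The complete code is hosted on Gitee. The folder rnn_text_classification2 contains everything for this tutorial; the other folders are unrelated and can be ignored.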

Writing the code

Train/validation split

Create the file train_val_split.py, which splits the training set train.csv.

import os.path

import pandas as pd

# Split the data into train/val sets

# Import Data
from sklearn.model_selection import train_test_split

train = pd.read_csv("data/train.csv")

# Shape of dataset
print(train.shape)
print(train.head())

# drop 'id' , 'keyword' and 'location' columns.
train.drop(columns=['id','keyword','location'], inplace=True)


def normalise_text(text):
    text = text.str.lower()  # lowercase
    text = text.str.replace(r"\#", "", regex=True)  # strip the '#' from hashtags
    text = text.str.replace(r"http\S+", "URL", regex=True)  # replace URLs with the placeholder "URL"
    text = text.str.replace("@", "", regex=False)  # strip the '@' from mentions
    text = text.str.replace(r"[^A-Za-z0-9()!?\'\`\"]", " ", regex=True)  # other characters become spaces
    text = text.str.replace(r"\s{2,}", " ", regex=True)  # collapse repeated whitespace
    return text

# to clean data
train["text"] = normalise_text(train["text"])
print(train['text'].head())

# split data into train and validation
train_df, valid_df = train_test_split(train)
print(train_df.head())
print(valid_df.head())

if not os.path.exists("processed_data"):
    os.mkdir("processed_data")

train_df.to_csv("processed_data/train.csv")
valid_df.to_csv("processed_data/valid.csv")
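  1. pandas loads a CSV file into a DataFrame object.
  2. normalise_text replaces strings we do not want to analyse, such as URLs and the @ and # symbols. The likely motivation is to keep them out of the model's vocabulary, where they could add noise and encourage overfitting.
  3. train_test_split is a scikit-learn API that splits the data into train/valid sets. With its default arguments it holds out 25% of the rows for validation.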
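
To see what the cleaning rules do, here is a minimal sketch that applies the same steps, in the same order, to a single string using the standard re module instead of pandas. The function name normalise_one and the sample tweet are made up for illustration.

```python
import re


def normalise_one(text: str) -> str:
    """Apply the same cleaning steps as normalise_text to one string."""
    text = text.lower()                                 # lowercase
    text = text.replace("#", "")                        # drop '#', keep the hashtag word
    text = re.sub(r"http\S+", "URL", text)              # replace links with the placeholder URL
    text = text.replace("@", "")                        # drop '@', keep the user name
    text = re.sub(r"[^A-Za-z0-9()!?'`\"]", " ", text)   # other characters become spaces
    text = re.sub(r"\s{2,}", " ", text)                 # collapse repeated whitespace
    return text


sample = "Forest #fire near @LaRonge: see http://t.co/abc123 NOW!!"
cleaned = normalise_one(sample)
print(cleaned)
```

Note that the placeholder URL stays uppercase because lowercasing happens before the link is replaced.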

Building the vocabulary

With the train/val split done, write the training script text_classification_demo.py.

First, read the dataset and build the vocabulary from it, using torchtext's build_vocab_from_iterator.

def main():
    # load the train and validation splits produced earlier
    train_df = pd.read_csv("processed_data/train.csv")
    valid_df = pd.read_csv("processed_data/valid.csv")

    SEED = 42
    torch.manual_seed(SEED)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # build vocab
    vocab = build_vocab_from_iterator(yield_tokens(train_df), min_freq=5, specials=['<unk>', '<pad>'])
    vocab.set_default_index(vocab["<unk>"])

    print(vocab.get_itos()[:10])
    print(vocab.get_stoi()["<unk>"], vocab.get_stoi()["<pad>"], vocab.get_stoi()["the"])
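
The article calls yield_tokens without showing it. Below is a minimal sketch of what such a helper could look like, together with a toy vocabulary builder that mimics the idea of min_freq and special tokens. Everything here is an assumption for illustration: whitespace splitting stands in for the spaCy tokenizer, build_vocab_sketch is a hypothetical stand-in and not torchtext's implementation, and the special token names are chosen by convention.

```python
from collections import Counter


def yield_tokens(df):
    """Yield one token list per entry of the 'text' column (hypothetical helper)."""
    for text in df["text"]:
        yield text.split()  # assumption: whitespace split stands in for spaCy


def build_vocab_sketch(token_lists, min_freq, specials):
    """Toy vocabulary: special tokens first, then tokens seen at least min_freq times."""
    counter = Counter(tok for tokens in token_lists for tok in tokens)
    kept = sorted((t for t, c in counter.items() if c >= min_freq),
                  key=lambda t: (-counter[t], t))  # most frequent first
    itos = list(specials) + [t for t in kept if t not in specials]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


# A dict of lists is enough here; a DataFrame column iterates the same way.
toy = {"text": ["the fire is near", "the flood is here", "the storm"]}
itos, stoi = build_vocab_sketch(yield_tokens(toy), min_freq=2,
                                specials=["<unk>", "<pad>"])

unk = stoi["<unk>"]
encoded = [stoi.get(tok, unk) for tok in "the fire".split()]  # rare words map to <unk>
print(itos, encoded)
```

The lookup with a fallback to the unknown index plays the role that set_default_index plays for a torchtext vocabulary.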

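collate_batch, shown in full in the next section, looks long, but its body is a single for loop that tokenizes, encodes, truncates, tensorizes, and pads each sentence in the batch, while recording each sentence's length for later use.

Building the iterators

Next, the dataset needs iterators. PyTorch's native Dataset (DataFrameDataset is a subclass of it) and DataLoader are all that is required.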

    # build data loader
    train_iter = DataFrameDataset(list(train_df['text']), list(train_df['target']))
    train_loader = DataLoader(train_iter, batch_size=8, shuffle=True,
                              collate_fn=partial(collate_batch, vocab=vocab, device=device))

    valid_iter = DataFrameDataset(list(valid_df['text']), list(valid_df['target']))
    valid_loader = DataLoader(valid_iter, batch_size=8, shuffle=True,
                              collate_fn=partial(collate_batch, vocab=vocab, device=device))

    print(len(train_loader))
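
DataFrameDataset is used above but not listed in the article. A minimal sketch, assuming it simply wraps two parallel lists and returns raw (text, label) pairs, could look like this. The identity collate_fn in the demo is only there to show what a batch looks like before any preprocessing.

```python
from torch.utils.data import Dataset, DataLoader


class DataFrameDataset(Dataset):
    """Map-style dataset returning raw (text, label) pairs (assumed implementation)."""

    def __init__(self, texts, labels):
        assert len(texts) == len(labels)
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]


dataset = DataFrameDataset(["fire near town", "nice day", "flood warning"], [1, 0, 1])
# Identity collate_fn: each batch stays a plain list of (text, label) tuples.
loader = DataLoader(dataset, batch_size=2, shuffle=False, collate_fn=lambda batch: batch)
batches = list(loader)
print(len(dataset), len(loader), batches[0])
```

Keeping the dataset this thin leaves all tokenizing and padding to the collate function.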

The raw text and label data are a list of str and a plain list respectively, so each batch needs preprocessing. That is the job of the collate function collate_batch.

def collate_batch(batch, vocab, device):
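    # Batch preprocessing: truncate and pad the texts, then move them to the device together with the labels
    label_list, text_list = [], []

    # A named function rather than a lambda, which is easier to debug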
    def tokenize_and_encode(x):
        tokens = spacy_tokenizer(x)
        return [vocab[token.text] for token in tokens]

    def label_pipeline(x):
        return int(x)

    truncate = Truncate(max_seq_len=20)
    pad = PadTransform(max_length=20, pad_value=vocab['<pad>'])
    text_lengths = []
    for (_text, _label) in batch:
        label_list.append(label_pipeline(_label))
        text = tokenize_and_encode(_text)  # tokenize and encode the string
        text = truncate(text)  # truncate to at most 20 tokens
        text_lengths.append(len(text))  # record the length before padding
        text = torch.tensor(text, dtype=torch.int64)  # convert to a tensor
        text = pad(text)  # pad to length 20
        text_list.append(text)
        text_list.append(text)
    text_list = torch.vstack(text_list)
    label_list = torch.tensor(label_list, dtype=torch.float)
    return text_list.to(device), text_lengths, label_list.to(device)
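
To make the truncate, record length, pad sequence concrete, here is a small sketch that does the same thing with plain PyTorch on already-encoded token ids. The helper truncate_and_pad, the pad id 1, and the maximum length 5 are illustrative assumptions, not part of the original code.

```python
import torch


def truncate_and_pad(ids, max_len, pad_id):
    """Cut ids to max_len, remember the real length, then pad to max_len."""
    ids = ids[:max_len]                      # truncate
    length = len(ids)                        # length before padding
    padded = ids + [pad_id] * (max_len - length)
    return torch.tensor(padded, dtype=torch.int64), length


encoded_batch = [[7, 8, 9], [4, 5, 6, 7, 8, 9, 10]]
rows, lengths = zip(*(truncate_and_pad(ids, max_len=5, pad_id=1) for ids in encoded_batch))
batch = torch.vstack(rows)                   # shape: [batch size, max_len]
print(batch, lengths)
```

The resulting tensor is batch-first, which is why the model later packs it with batch_first=True.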

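Implementing the LSTM model

Create the file LSTM_net.py. Most of the LSTM model is the same as in the earlier tutorial, and this article does not go into LSTM details or theory. The model uses nn.utils.rnn.pack_padded_sequence; for its purpose and parameters, see the companion article "PyTorch nn.utils.rnn.pack_padded_sequence analysis".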

import torch
from torch import nn


class LSTM_net(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers,
                 bidirectional, dropout, pad_idx):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)

        self.rnn = nn.LSTM(embedding_dim,
                           hidden_dim,
                           num_layers=n_layers,
                           bidirectional=bidirectional,
                           dropout=dropout)

        self.fc1 = nn.Linear(hidden_dim * 2, hidden_dim)

        self.fc2 = nn.Linear(hidden_dim, 1)

        self.dropout = nn.Dropout(dropout)

    def forward(self, text, text_lengths):
        # text = [batch size, sent len] (collate_batch stacks the sentences row by row)

        embedded = self.embedding(text)

        # embedded = [batch size, sent len, emb dim]

        # pack sequence
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths, batch_first=True, enforce_sorted=False)

        packed_output, (hidden, cell) = self.rnn(packed_embedded)

        # unpack sequence (not needed here, only the final hidden state is used)
        # output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=True)

        # output = [batch size, sent len, hid dim * num directions] (with batch_first=True)
        # output over padding tokens are zero tensors

        # hidden = [num layers * num directions, batch size, hid dim]
        # cell = [num layers * num directions, batch size, hid dim]

        # concat the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:]) hidden layers
        # and apply dropout

        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        output = self.fc1(hidden)
        output = self.dropout(self.fc2(output))

        # hidden = [batch size, hid dim * num directions]

        return output
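
Why pack the padded batch at all? The following sketch, using a small single-layer LSTM with made-up sizes, suggests the reason: with packing, the final hidden state of a short sequence matches what you get by running that sequence alone, whereas feeding the padded tensor directly lets the LSTM keep stepping through the padding.

```python
import torch
from torch import nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True)

# Two sequences padded to length 5; their true lengths are 2 and 5.
batch = torch.randn(2, 5, 4)
lengths = [2, 5]
batch[0, 2:] = 0.0  # padded positions of the short sequence

packed = nn.utils.rnn.pack_padded_sequence(
    batch, lengths, batch_first=True, enforce_sorted=False)
_, (h_packed, _) = lstm(packed)

# Reference: the short sequence on its own, without any padding.
_, (h_alone, _) = lstm(batch[0:1, :2])

# Without packing, the LSTM also processes the three padded steps.
_, (h_padded, _) = lstm(batch)

same_as_alone = torch.allclose(h_packed[:, 0], h_alone[:, 0], atol=1e-5)
differs_when_padded = not torch.allclose(h_padded[:, 0], h_alone[:, 0], atol=1e-5)
print(h_packed.shape, same_as_alone, differs_when_padded)
```

enforce_sorted=False lets the lengths appear in any order, so the batch does not have to be sorted by length first.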

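Creating the model

Back in the main function, the next step is to create the LSTM_net model using the code from the previous section. Define a few parameters first, then construct the model object.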

    # build model
    MODEL_PATH = "model.pth"
    INPUT_DIM = len(vocab)
    EMBEDDING_DIM = 200
    HIDDEN_DIM = 256
    OUTPUT_DIM = 1
    N_LAYERS = 2
    BIDIRECTIONAL = True
    DROPOUT = 0.2
    PAD_IDX = vocab.get_stoi()["<pad>"]  # index of the padding token

    model = LSTM_net(INPUT_DIM,
                     EMBEDDING_DIM,
                     HIDDEN_DIM,
                     OUTPUT_DIM,
                     N_LAYERS,
                     BIDIRECTIONAL,
                     DROPOUT,
                     PAD_IDX)
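  • The model should be able to resume training from saved progress, hence model.load_state_dict.
  • On the first training run, the embedding layer is initialised from pretrained word vectors (a form of transfer learning), hence torchtext.vocab.GloVe.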
    if os.path.exists(MODEL_PATH):
        model.load_state_dict(torch.load(MODEL_PATH))
    else:
        # transfer learning: copy in the pretrained GloVe word vectors
        pretrained = torchtext.vocab.GloVe(name="6B", dim=200)
        print(f"pretrained.vectors device: {pretrained.vectors.device}, shape: {pretrained.vectors.shape}")
        for i, token in enumerate(vocab.get_itos()):
            model.embedding.weight.data[i] = pretrained.get_vecs_by_tokens(token)

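Initialise the vector of the PAD token to zeros, then move the model to the GPU.

    # initialise the padding vector to zeros
    model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)
    print("model.embedding.weight.data", model.embedding.weight.data)
    model.to(device)  # move the model to the device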
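
The copy-then-zero-the-padding pattern can be tried without downloading GloVe. In this sketch a small dict, toy_pretrained, stands in for the pretrained vectors, and the vocabulary and dimensions are made up. One deliberate difference from the code above: tokens without a pretrained vector keep their random initialisation here, whereas get_vecs_by_tokens, as far as I know, returns a zero vector for unknown tokens by default.

```python
import torch
from torch import nn

torch.manual_seed(0)
itos = ["<unk>", "<pad>", "fire", "flood"]                  # toy vocabulary
toy_pretrained = {                                           # stand-in for GloVe
    "fire": torch.tensor([1.0, 2.0, 3.0]),
    "flood": torch.tensor([4.0, 5.0, 6.0]),
}
pad_idx = itos.index("<pad>")

embedding = nn.Embedding(len(itos), 3, padding_idx=pad_idx)
with torch.no_grad():
    for i, token in enumerate(itos):
        if token in toy_pretrained:       # others keep their random init
            embedding.weight[i] = toy_pretrained[token]
    embedding.weight[pad_idx] = torch.zeros(3)  # padding contributes nothing

looked_up = embedding(torch.tensor([2, 1]))  # "fire", "<pad>"
print(looked_up)
```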

Writing the training logic

Define some hyperparameters.

    # training hyperparameters
    num_epochs = 25
    learning_rate = 0.001

    # Loss and optimizer
    criterion = nn.BCEWithLogitsLoss()

    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

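Write three functions:

  • binary_accuracy computes the accuracy.
  • train runs the training phase.
  • evaluate runs the validation phase.

Note that in the for loops of train and evaluate, each batch is unpacked into three elements; this follows from what collate_batch returns.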

    def binary_accuracy(preds, y):
        """
        Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
        """

        # round predictions to the closest integer
        rounded_preds = torch.round(torch.sigmoid(preds))
        correct = (rounded_preds == y).float()  # convert into float for division
        acc = correct.sum() / len(correct)
        return acc

    # training function
    def train(model, iterator):
        epoch_loss = 0
        epoch_acc = 0

        model.train()

        for batch in tqdm(iterator):
            text, text_lengths, labels = batch

            optimizer.zero_grad()
            predictions = model(text, text_lengths).squeeze(1)
            loss = criterion(predictions, labels)
            acc = binary_accuracy(predictions, labels)

            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()
            epoch_acc += acc.item()

        return epoch_loss / len(iterator), epoch_acc / len(iterator)

    def evaluate(model, iterator):
        epoch_acc = 0
        model.eval()

        with torch.no_grad():
            for batch in tqdm(iterator):
                text, text_lengths, labels = batch
                predictions = model(text, text_lengths).squeeze(1)
                acc = binary_accuracy(predictions, labels)

                epoch_acc += acc.item()

        return epoch_acc / len(iterator)
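
As a quick check of binary_accuracy, the sketch below repeats the function and runs it on four hand-picked logits. The values are made up so that exactly three of the four predictions are correct.

```python
import torch


def binary_accuracy(preds, y):
    """Fraction of correct predictions in one batch of raw logits."""
    rounded_preds = torch.round(torch.sigmoid(preds))  # logits -> 0.0 or 1.0
    correct = (rounded_preds == y).float()
    return correct.sum() / len(correct)


logits = torch.tensor([2.0, -1.0, 0.5, -0.3])   # sigmoid gives > 0.5 for positive logits
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])
acc = binary_accuracy(logits, labels)
print(acc.item())
```

The third logit is positive, so it is predicted as class 1 while its label is 0; that is the single mistake.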

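Finally, write the training logic in the main loop. Each epoch first runs one pass of train over the training set, then one pass of evaluate over the validation set, and finally saves the model with torch.save.

    t = time.time()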
    loss = []
    acc = []
    val_acc = []

    for epoch in range(num_epochs):
        train_loss, train_acc = train(model, train_loader)
        valid_acc = evaluate(model, valid_loader)

        print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc * 100:.2f}%')
        print(f'\t Val. Acc: {valid_acc * 100:.2f}%')

        loss.append(train_loss)
        acc.append(train_acc)
        val_acc.append(valid_acc)

        torch.save(model.state_dict(), MODEL_PATH)

    print(f'time:{time.time() - t:.3f}')
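
Saving the state dict after every epoch is what makes resuming possible. A minimal round trip, using a tiny nn.Linear as a stand-in for LSTM_net and a temporary directory instead of the real model.pth, might look like this.

```python
import os
import tempfile

import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)                      # stand-in for LSTM_net
path = os.path.join(tempfile.mkdtemp(), "model.pth")

torch.save(model.state_dict(), path)         # what the training loop does each epoch

restored = nn.Linear(4, 1)                   # freshly initialised, different weights
if os.path.exists(path):                     # same check as in main()
    restored.load_state_dict(torch.load(path, map_location="cpu"))

same = all(torch.equal(a, b) for a, b in
           zip(model.state_dict().values(), restored.state_dict().values()))
print(same)
```

map_location="cpu" is an addition of mine so that a checkpoint written on a GPU machine can also be loaded on a CPU-only one.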

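Pitfalls

Q: When preprocessing the dataset, is it better to override Dataset.__getitem__ or to implement the DataLoader's collate_fn?
A: Personally, I find rewriting collate_fn clearer. The Dataset then only has to return raw data, that is, strings, integers, and the like.

Q: How do I get the vector for a given word from the GloVe vocabulary?
A: Call get_vecs_by_tokens; see the article on torchtext and GloVe.

Q: RuntimeError: lengths array must be sorted in decreasing order when enforce_sorted is True
A: Pass enforce_sorted=False to pack_padded_sequence, as the model code above does. For details, see this explanation of enforce_sorted in pack_padded_sequence: https://blog.csdn.net/BierOne/article/details/116133857

Q: The main function uses criterion = nn.BCEWithLogitsLoss(). Calling criterion(output, target) with a float output and an int64 target raises "RuntimeError: result type Float can't be cast to the desired output type Long".
A: See the article on "RuntimeError: result type Float can't be cast to the desired output type Long", which recommends converting the target to float. In this code, the problem is that pred is float while label is Long when the two are compared. The fix is to change collate_fn as follows:

label_list = torch.tensor(label_list, dtype=torch.float)

This makes the returned labels a float Tensor.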
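
The dtype requirement can be reproduced in isolation. In this sketch the logits and labels are made up; the point is only that the targets are converted to float before they reach BCEWithLogitsLoss.

```python
import torch
from torch import nn

criterion = nn.BCEWithLogitsLoss()

logits = torch.tensor([2.0, -1.0, 0.5])      # raw model outputs (float32)
labels_int = torch.tensor([1, 0, 1])         # int64, as integer class labels would be
labels = labels_int.float()                  # the conversion the fix performs

loss = criterion(logits, labels)             # mean binary cross-entropy over the batch
print(labels_int.dtype, labels.dtype, loss.item())
```

Creating the label tensor with dtype=torch.float in collate_batch, as above, avoids the extra conversion step.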

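Summary

Tutorials for the new torchtext API are currently scarce online. By rewriting an existing example, this article walks through a complete NLP project from scratch, shows how the new API is used, and open-sources the code.

One open question came to mind. Every sentence here is capped at a fixed length (for example 20), so how do chatbots that can converse and write without an apparent limit (such as ChatGPT) work? That is something to explore gradually.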
