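After working through the hands-on example in "RNN/LSTM (二) 实践案例", I found that it relies on the older torchtext 0.9, and many of its APIs have been removed in newer releases. This article attempts to port that code to torchtext 0.14.
How to download the dataset was already covered in the earlier article.
The torchtext API changed substantially starting with version 0.12, and all the code in this article is written against the new API. Readers who are used to the torchtext 0.9 API may want to read "torchtext0.14 实践手册" first (it applies equally to 0.12) to adjust their mental model.
The complete code is hosted on Gitee (码云). The folder rnn_text_classification2 contains everything used in this tutorial; the other folders are unrelated and can be ignored.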
Create the file train_val_split.py, which splits the training set train.csv into a training part and a validation part.
import os

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the raw training data
train = pd.read_csv("data/train.csv")
# Shape and first rows of the dataset
print(train.shape)
print(train.head())
# Drop the 'id', 'keyword' and 'location' columns
train.drop(columns=['id', 'keyword', 'location'], inplace=True)


def normalise_text(text):
    text = text.str.lower()  # lowercase
    text = text.str.replace(r"\#", "", regex=True)  # strip the hashtag sign, keep the word
    text = text.str.replace(r"http\S+", "URL", regex=True)  # replace each link with the token URL
    text = text.str.replace("@", "", regex=False)  # strip the @ sign, keep the user name
    text = text.str.replace(r"[^A-Za-z0-9()!?\'\`\"]", " ", regex=True)  # other symbols become spaces
    text = text.str.replace(r"\s{2,}", " ", regex=True)  # collapse repeated whitespace
    return text


# Clean the text column
train["text"] = normalise_text(train["text"])
print(train['text'].head())
# Split the data into train and validation sets
train_df, valid_df = train_test_split(train)
print(train_df.head())
print(valid_df.head())
os.makedirs("processed_data", exist_ok=True)
train_df.to_csv("processed_data/train.csv")
valid_df.to_csv("processed_data/valid.csv")
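normalise_text replaces strings we do not want to analyze, such as URLs, @ signs and # signs. Presumably this keeps them out of the model's vocabulary, where they would add noise and encourage overfitting. train_test_split is a scikit-learn API that splits the data into train and validation sets. Once the split is done, write the training script text_classification_demo.py.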
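To see what the cleaning steps do to one tweet, here is a minimal sketch that applies the same patterns to a single string with the standard re module instead of pandas. The helper name normalise_one and the sample sentence are made up for illustration.

```python
import re


def normalise_one(text: str) -> str:
    """Apply the same cleaning steps as normalise_text to a single string."""
    text = text.lower()                                   # lowercase
    text = text.replace("#", "")                          # strip the hashtag sign
    text = re.sub(r"http\S+", "URL", text)                # replace each link with the token URL
    text = text.replace("@", "")                          # strip the @ sign
    text = re.sub(r"[^A-Za-z0-9()!?\'\`\"]", " ", text)   # other symbols become spaces
    text = re.sub(r"\s{2,}", " ", text)                   # collapse repeated whitespace
    return text


sample = "Forest #fire near @LaRonge!! Details: http://t.co/abc123"
cleaned = normalise_one(sample)
print(cleaned)  # forest fire near laronge!! details URL
```

Note that the link is replaced after lowercasing, so the placeholder URL stays uppercase and ends up as its own token.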
First, read the dataset and build the vocabulary from it, using torchtext's vocabulary API build_vocab_from_iterator.
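def main():
    # Load the train/validation splits written by train_val_split.py
    train_df = pd.read_csv("processed_data/train.csv")
    valid_df = pd.read_csv("processed_data/valid.csv")
    # Fix the random seed for reproducibility
    SEED = 42
    torch.manual_seed(SEED)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Use the GPU when one is available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Build the vocab; the special tokens are "unknown" and "padding"
    vocab = build_vocab_from_iterator(yield_tokens(train_df), min_freq=5, specials=['<unk>', '<pad>'])
    vocab.set_default_index(vocab["<unk>"])  # out-of-vocabulary tokens map to <unk>
    print(vocab.get_itos()[:10])
    print(vocab.get_stoi()["<unk>"], vocab.get_stoi()["<pad>"], vocab.get_stoi()["the"])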
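The roles of min_freq, specials and the default index are easier to see on a toy corpus. The sketch below is a simplified pure-Python imitation, not torchtext's implementation: build_toy_vocab, lookup and the corpus are hypothetical, the special-token names "<unk>" and "<pad>" are assumed, and the tie-breaking order may differ from torchtext's.

```python
from collections import Counter


def build_toy_vocab(token_lists, min_freq, specials):
    """Specials come first, then every token seen at least min_freq times."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # most frequent first; ties broken alphabetically (an assumption of this sketch)
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = list(specials) + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


corpus = [["the", "fire", "is", "out"], ["the", "flood", "is", "over"], ["the", "fire"]]
itos, stoi = build_toy_vocab(corpus, min_freq=2, specials=["<unk>", "<pad>"])
default_index = stoi["<unk>"]  # plays the role of vocab.set_default_index


def lookup(token):
    # unknown or rare tokens fall back to the default index
    return stoi.get(token, default_index)


print(itos)                                   # ['<unk>', '<pad>', 'the', 'fire', 'is']
print(lookup("fire"), lookup("earthquake"))   # 3 0
```

Rare words such as "flood" are dropped by min_freq and are therefore encoded as the unknown token, just like words never seen at all.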
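Each batch is preprocessed by a function named collate_batch; one implementation is listed later in this section. It looks long, but its body is a single for loop that tokenizes, encodes, truncates, tensorizes and pads every sentence in the batch, and also records each sentence's length (needed later).
Next, build iterators over the dataset. PyTorch's native Dataset (DataFrameDataset is a subclass of it) and DataLoader are enough.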
    # Build the data loaders
    train_iter = DataFrameDataset(list(train_df['text']), list(train_df['target']))
    train_loader = DataLoader(train_iter, batch_size=8, shuffle=True,
                              collate_fn=partial(collate_batch, vocab=vocab, device=device))
    valid_iter = DataFrameDataset(list(valid_df['text']), list(valid_df['target']))
    # Shuffling is not needed for validation
    valid_loader = DataLoader(valid_iter, batch_size=8, shuffle=False,
                              collate_fn=partial(collate_batch, vocab=vocab, device=device))
    print(len(train_loader))
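The DataFrameDataset class itself is not listed in this article. A minimal sketch of what such a class might look like, assuming it only stores two parallel lists and returns raw (text, label) pairs, is shown below. This is a guess for illustration, not the code from the repository; the toy sentences and the trivial collate_fn are made up.

```python
from torch.utils.data import Dataset, DataLoader


class DataFrameDataset(Dataset):
    """Hypothetical sketch: holds parallel lists of raw texts and labels."""

    def __init__(self, texts, labels):
        assert len(texts) == len(labels)
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # return the raw (str, int) pair; preprocessing is left to collate_fn
        return self.texts[idx], self.labels[idx]


toy = DataFrameDataset(["fire in the hills", "nice weather today", "flood warning"], [1, 0, 1])
# this collate_fn only regroups the pairs; the real one would be collate_batch
loader = DataLoader(toy, batch_size=2, shuffle=False,
                    collate_fn=lambda batch: list(zip(*batch)))
batches = list(loader)
print(batches[0])
```

Keeping the Dataset this thin means all tokenizing, truncating and padding lives in one place, the collate function.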
The raw text and label data are a list of str and a plain list respectively, so every batch has to be preprocessed. The preprocessing function collate_batch does this.
def collate_batch(batch, vocab, device):
    # Batch preprocessing: truncate and pad the texts, then move texts and labels to the device
    label_list, text_list = [], []

    # Written as functions rather than lambdas to make debugging easier
    def tokenize_and_encode(x):
        tokens = spacy_tokenizer(x)
        return [vocab[token.text] for token in tokens]

    def label_pipeline(x):
        return int(x)

    truncate = Truncate(max_seq_len=20)
    pad = PadTransform(max_length=20, pad_value=vocab['<pad>'])
    text_lengths = []
    for (_text, _label) in batch:
        label_list.append(label_pipeline(_label))
        text = tokenize_and_encode(_text)  # tokenize and encode the string
        text = truncate(text)  # truncate
        text_lengths.append(len(text))  # record the length before padding
        text = torch.tensor(text, dtype=torch.int64)  # convert to a tensor
        text = pad(text)  # pad
        text_list.append(text)
    text_list = torch.vstack(text_list)  # [batch size, 20]
    label_list = torch.tensor(label_list, dtype=torch.float)
    return text_list.to(device), text_lengths, label_list.to(device)
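The truncate, record-length, pad sequence inside the loop can be traced by hand on plain lists of token ids. The following is a minimal pure-Python sketch; truncate_and_pad, MAX_LEN and PAD_ID are illustrative names and values, not part of the original code.

```python
MAX_LEN = 5   # the article uses 20; 5 keeps the printout short
PAD_ID = 1    # assumed index of the padding token


def truncate_and_pad(ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return (fixed-length id list, length before padding)."""
    ids = ids[:max_len]                          # truncate
    length = len(ids)                            # length after truncation, before padding
    ids = ids + [pad_id] * (max_len - length)    # pad on the right
    return ids, length


batch = [[7, 8, 9], [4, 5, 6, 7, 8, 9, 10]]
padded, lengths = zip(*(truncate_and_pad(ids) for ids in batch))
print(padded)    # ([7, 8, 9, 1, 1], [4, 5, 6, 7, 8])
print(lengths)   # (3, 5)
```

The recorded lengths are what later lets the LSTM skip the padded positions.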
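Create the file LSTM_net.py. Most of the LSTM model is the same as in the old tutorial, so the details and theory of LSTM are not repeated here. The model uses nn.utils.rnn.pack_padded_sequence; for its purpose and parameters, see "pytorch nn.utils.rnn.pack_padded_sequence 分析".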
import torch
from torch import nn


class LSTM_net(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers,
                 bidirectional, dropout, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.rnn = nn.LSTM(embedding_dim,
                           hidden_dim,
                           num_layers=n_layers,
                           bidirectional=bidirectional,
                           dropout=dropout)
        # hidden_dim * 2 assumes bidirectional=True (forward + backward states)
        self.fc1 = nn.Linear(hidden_dim * 2, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text, text_lengths):
        # text = [batch size, sent len], as produced by collate_batch
        embedded = self.embedding(text)
        # embedded = [batch size, sent len, emb dim]
        # pack the sequence so the LSTM skips padded positions
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths,
                                                            batch_first=True, enforce_sorted=False)
        packed_output, (hidden, cell) = self.rnn(packed_embedded)
        # hidden = [num layers * num directions, batch size, hid dim]
        # cell = [num layers * num directions, batch size, hid dim]
        # concat the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:]) hidden states
        # and apply dropout
        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        # hidden = [batch size, hid dim * num directions]
        output = self.fc1(hidden)
        # Note: dropout is applied to the final logit here, as in the original code;
        # applying it before fc2 is the more common choice.
        output = self.dropout(self.fc2(output))
        # output = [batch size, output dim]
        return output
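The tensor shapes in forward are easy to get wrong, so here is a small stand-alone check, a sketch with made-up sizes rather than the tutorial's real dimensions. It packs a random batch, runs a 2-layer bidirectional LSTM, and confirms that values at padded positions do not influence the final hidden state.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch_size, max_len, emb_dim, hid_dim = 3, 5, 4, 6
embedded = torch.randn(batch_size, max_len, emb_dim)   # [batch, sent len, emb dim]
lengths = [2, 5, 3]                                    # true lengths, unsorted

lstm = nn.LSTM(emb_dim, hid_dim, num_layers=2, bidirectional=True)
packed = nn.utils.rnn.pack_padded_sequence(embedded, lengths,
                                           batch_first=True, enforce_sorted=False)
_, (hidden, cell) = lstm(packed)
# hidden = [num layers * num directions, batch, hid dim]
final = torch.cat((hidden[-2], hidden[-1]), dim=1)     # last layer: forward + backward
print(hidden.shape, final.shape)

# Overwrite the padded steps of sample 0 (length 2) and run again
changed = embedded.clone()
changed[0, 2:] = 100.0
packed2 = nn.utils.rnn.pack_padded_sequence(changed, lengths,
                                            batch_first=True, enforce_sorted=False)
_, (hidden2, _) = lstm(packed2)
print(torch.allclose(hidden, hidden2))
```

With 2 layers and 2 directions, hidden has 4 slices along its first dimension, and the concatenated vector has hid_dim * 2 features per sample, which is why fc1 takes hidden_dim * 2 inputs.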
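Back in the main function, continue with the logic. Next we create the LSTM_net model using the code from the previous section: define some parameters first, then build the model object.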
    # Build the model
    MODEL_PATH = "model.pth"
    INPUT_DIM = len(vocab)
    EMBEDDING_DIM = 200
    HIDDEN_DIM = 256
    OUTPUT_DIM = 1
    N_LAYERS = 2
    BIDIRECTIONAL = True
    DROPOUT = 0.2
    PAD_IDX = vocab.get_stoi()["<pad>"]  # padding
    model = LSTM_net(INPUT_DIM,
                     EMBEDDING_DIM,
                     HIDDEN_DIM,
                     OUTPUT_DIM,
                     N_LAYERS,
                     BIDIRECTIONAL,
                     DROPOUT,
                     PAD_IDX)
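If a saved model file already exists, load it with model.load_state_dict. Otherwise, initialize the embedding layer from the pretrained GloVe word vectors provided by torchtext.vocab.GloVe.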
    if os.path.exists(MODEL_PATH):
        # Resume from a previously saved model
        model.load_state_dict(torch.load(MODEL_PATH, map_location=device))
    else:
        # Transfer learning: copy pretrained GloVe vectors into the embedding layer
        pretrained = torchtext.vocab.GloVe(name="6B", dim=200)
        print(f"pretrained.vectors device: {pretrained.vectors.device}, shape: {pretrained.vectors.shape}")
        for i, token in enumerate(vocab.get_itos()):
            model.embedding.weight.data[i] = pretrained.get_vecs_by_tokens(token)
Initialize the vector of the PAD token to zeros, then move the model to the GPU.
    # Initialize the padding row to zeros
    model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)
    print("model.embedding.weight.data", model.embedding.weight.data)
    model.to(device)  # move the model to the GPU (or CPU)
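Downloading GloVe takes a while, so the row-by-row copy can be tried first on a toy table. In the sketch below, toy_pretrained is a made-up dictionary standing in for the GloVe vectors, and the vocabulary and special-token names are assumed; only the copying pattern mirrors the code above.

```python
import torch
from torch import nn

itos = ["<unk>", "<pad>", "the", "fire"]                   # assumed toy vocabulary
toy_pretrained = {"the": torch.tensor([0.1, 0.2, 0.3]),    # stand-in for GloVe vectors
                  "fire": torch.tensor([0.4, 0.5, 0.6])}
emb_dim = 3
pad_idx = itos.index("<pad>")

embedding = nn.Embedding(len(itos), emb_dim, padding_idx=pad_idx)
with torch.no_grad():
    for i, token in enumerate(itos):
        # tokens missing from the table get a zero vector in this sketch
        embedding.weight[i] = toy_pretrained.get(token, torch.zeros(emb_dim))
    embedding.weight[pad_idx] = torch.zeros(emb_dim)       # padding row stays zero
print(embedding.weight)
```

Because padding_idx is set, the padding row receives no gradient during training and therefore remains zero.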
Define some hyperparameters, the loss function and the optimizer.
    # Training hyperparameters
    num_epochs = 25
    learning_rate = 0.001
    # Loss and optimizer
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
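Write three functions: binary_accuracy computes the accuracy, train runs the training phase, and evaluate runs the validation phase. Note that in the for loops of train and evaluate, each batch is unpacked into three elements; this follows from what collate_batch returns.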
    # These helpers are nested inside main() so that train() can see criterion and optimizer
    def binary_accuracy(preds, y):
        """
        Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
        """
        # round predictions to the closest integer
        rounded_preds = torch.round(torch.sigmoid(preds))
        correct = (rounded_preds == y).float()  # convert into float for division
        acc = correct.sum() / len(correct)
        return acc

    # training function
    def train(model, iterator):
        epoch_loss = 0
        epoch_acc = 0
        model.train()
        for batch in tqdm(iterator):
            text, text_lengths, labels = batch
            optimizer.zero_grad()
            predictions = model(text, text_lengths).squeeze(1)
            loss = criterion(predictions, labels)
            acc = binary_accuracy(predictions, labels)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
            epoch_acc += acc.item()
        return epoch_loss / len(iterator), epoch_acc / len(iterator)

    # validation function
    def evaluate(model, iterator):
        epoch_acc = 0
        model.eval()
        with torch.no_grad():
            for batch in tqdm(iterator):
                text, text_lengths, labels = batch
                predictions = model(text, text_lengths).squeeze(1)
                acc = binary_accuracy(predictions, labels)
                epoch_acc += acc.item()
        return epoch_acc / len(iterator)
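A quick worked example makes the behavior of binary_accuracy concrete. The logits and labels below are made up; the function body is copied from the listing above.

```python
import torch


def binary_accuracy(preds, y):
    rounded_preds = torch.round(torch.sigmoid(preds))   # logits -> probabilities -> 0/1
    correct = (rounded_preds == y).float()
    return correct.sum() / len(correct)


logits = torch.tensor([2.0, -1.0, 0.5, -0.3])   # raw model outputs, before sigmoid
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])
acc = binary_accuracy(logits, labels)
print(acc.item())   # 0.75
```

A positive logit rounds to class 1 and a negative logit to class 0, so the predictions are 1, 0, 1, 0 and three of the four match the labels.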
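Finally, write the training logic in the main loop. Each epoch first runs one pass of train over the training set, then one pass of evaluate over the validation set, and finally saves the model with torch.save.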
    t = time.time()
    loss = []
    acc = []
    val_acc = []
    for epoch in range(num_epochs):
        train_loss, train_acc = train(model, train_loader)
        valid_acc = evaluate(model, valid_loader)
        print(f'Epoch {epoch + 1}/{num_epochs}')
        print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc * 100:.2f}%')
        print(f'\t Val. Acc: {valid_acc * 100:.2f}%')
        loss.append(train_loss)
        acc.append(train_acc)
        val_acc.append(valid_acc)
        # Save after every epoch so an interrupted run can be resumed
        torch.save(model.state_dict(), MODEL_PATH)
    print(f'time:{time.time() - t:.3f}')
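Q: When preprocessing the dataset, is it better to override Dataset::__getitem__ or to implement the collate_fn function of DataLoader?
A: Personally, I find implementing collate_fn clearer. The Dataset then only has to return the raw data, that is, strings, integers and the like.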
Q: How do I get the vector of a given word from the GloVe vocabulary?
A: Call get_vecs_by_tokens; see "torchtext与glove".
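Q: RuntimeError: `lengths` array must be sorted in decreasing order when `enforce_sorted` is True
A: See "Pytorch-RNN关于pack_padded_sequence之enforce_sorted详解" https://blog.csdn.net/BierOne/article/details/116133857. In this article's code the error is avoided by passing enforce_sorted=False to pack_padded_sequence, as done in LSTM_net.forward.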
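The error is easy to reproduce in isolation. The sketch below uses a dummy all-zero batch with unsorted lengths: the default call fails, while the same call with enforce_sorted=False succeeds.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

padded = torch.zeros(3, 4, 2)   # [batch, sent len, emb dim]
lengths = [2, 4, 3]             # not in decreasing order

try:
    # enforce_sorted defaults to True, so unsorted lengths are rejected
    pack_padded_sequence(padded, lengths, batch_first=True)
    raised = False
except RuntimeError as err:
    raised = True
    print("RuntimeError:", err)

# With enforce_sorted=False the function sorts the batch internally
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
print(packed.batch_sizes)
```

The alternative is to sort each batch by length in the collate function, but letting PyTorch handle the ordering keeps collate_batch simpler.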
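Q: The main function uses criterion = nn.BCEWithLogitsLoss(). When calling BCEWithLogitsLoss(output, target) with a float output and an int64 target, it fails with "RuntimeError: result type Float can't be cast to the desired output type Long".
A: See "RuntimeError: result type Float can‘t be cast to the desired output type Long"; that article suggests converting the target to float. In this code, the problem is that the loss compares pred with label, where pred is float but label is Long. The fix is to change the collate_fn (collate_batch) and build the label tensor as follows:
label_list = torch.tensor(label_list, dtype=torch.float)
so that the returned labels are a float Tensor.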
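The dtype mismatch can be inspected on a two-element example. This sketch uses made-up logits and labels and only shows the working path, converting the integer labels to float before computing the loss.

```python
import torch
from torch import nn

criterion = nn.BCEWithLogitsLoss()
logits = torch.tensor([0.0, 2.0])    # float model outputs
int_labels = torch.tensor([1, 0])    # Python ints become an int64 tensor by default
print(int_labels.dtype)              # torch.int64

float_labels = int_labels.float()    # equivalent to building with dtype=torch.float
loss = criterion(logits, float_labels)
print(loss.item())
```

Creating the tensor with dtype=torch.float inside the collate function, as the article does, avoids a separate conversion step in the training loop.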
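Tutorials for the new torchtext API are currently hard to find online. By rewriting an existing example, this article walks through a complete NLP project from scratch, shows how the new API is used, and makes the code available as open source.
One open question came to mind. Every sentence here is limited to a fixed length (for example 20), so how do chatbots that can converse and write seemingly without limit (such as ChatGPT) work? That remains to be explored.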