[Study Notes] PyTorch: A Two-Layer BiLSTM Sentiment Computing Demo, Code Walkthrough

BiLSTM for Sentiment Computing Demo

  • Model: two-layer, bidirectional LSTM

  • Dataset: IMDB

  • Environment:

    • Python 3.7
    • torch==1.10.0
    • torchtext==0.11.0
    • spacy==2.2.4
  • Reference code adapted from: https://www.bilibili.com/video/BV1Rv411y7oE?p=75

Code

When importing data and datasets here, note that the torchtext version must not be newer than 0.11.0, otherwise the import fails (the legacy module was removed in later releases). A quick version check is sketched after the references below.

See the official documentation for details.

References
PyTorch / torchtext / Python version compatibility
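
As a guard against the version issue above, one can check the installed versions before running anything; this check is my addition, not part of the original post.

import torch
import torchtext
import spacy

# Sanity check (illustrative): confirm the environment matches the versions listed above.
print(torch.__version__)      # expected: 1.10.0
print(torchtext.__version__)  # expected: 0.11.0 -- torchtext.legacy is removed in 0.12+
print(spacy.__version__)      # expected: 2.2.4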

import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.legacy import data, datasets

In a neural network, parameters are randomly initialized by default. Without a fixed seed, every training run starts from a different initialization, so the results are not reproducible. Setting the seed makes the initialization the same on every run.

print('GPU:', torch.cuda.is_available())
# set the random seed for the GPU
torch.cuda.manual_seed(123)

Output: GPU: True
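
For fully reproducible runs you would normally seed the CPU RNG as well; a minimal sketch of what that could look like (my addition, not in the original code):

# Reproducibility sketch (assumption: both CPU and GPU RNGs should be seeded).
torch.manual_seed(123)           # CPU RNG
torch.cuda.manual_seed(123)      # current GPU, as in the original code
torch.cuda.manual_seed_all(123)  # all GPUs, in case more than one is visible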

1. Loading the Data

1.1 Defining How the Text Data Is Processed (Field)

# Define how each individual sample is processed; here spacy is used as the tokenizer.
TEXT = data.Field(tokenize='spacy')
# The label must be torch.float: when computing the BCE loss later, target and input need the same dtype.
LABEL = data.LabelField(dtype=torch.float)
# Download the dataset and create dataset objects for the train/test split.
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

Check the number of samples in each split and inspect the content of one training example:

print('len of train data:', len(train_data))
print('len of test data:', len(test_data))
print(train_data.examples[15].text)
print(train_data.examples[15].label)

Output:

len of train data: 25000

len of test data: 25000

[‘A’, ‘bus’, ‘full’, ‘of’, ‘passengers’, ‘is’, ‘stuck’, ‘during’, ‘a’, ‘snow’, ‘storm’, ‘.’, ‘The’, ‘police’, ‘have’, ‘closed’, ‘the’, ‘bridge’, ‘–’, ‘saying’, ‘it’, “'s”, ‘unsafe’, ‘and’, ‘they’, ‘are’, ‘stuck’, ‘in’, ‘a’, ‘little’, ‘café’, ‘until’, ‘the’, ‘road’, ‘has’, ‘been’, ‘cleared’, ‘.’, ‘However’, ‘,’, ‘after’, ‘a’, ‘while’, ‘,’, ‘their’, ‘boredom’, ‘is’, ‘turned’, ‘to’, ‘concern’, ‘,’, ‘as’, ‘it’, ‘seems’, ‘that’, ‘one’, ‘of’, ‘the’, ‘passengers’, ‘was’, ‘NOT’, ‘originally’, ‘on’, ‘the’, ‘bus’, ‘and’, ‘may’, ‘just’, ‘be’, ‘an’, ‘alien’, ‘!’, ‘!’, ‘This’, ‘leads’, ‘to’, ‘a’, ‘conclusion’, ‘that’, ‘is’, ‘ironic’, ‘but’, ‘also’, ‘rather’, ‘funny’, ‘in’, ‘a’, ‘low’, ‘-’, ‘brow’, ‘way.This’, ‘is’, ‘another’, ‘of’, ‘the’, ‘fun’, ‘episodes’, ‘of’, ‘The’, ‘Twilight’, ‘Zone’, ‘.’, ‘Instead’, ‘of’, ‘the’, ‘typical’, ‘twists’, ‘or’, ‘social’, ‘commentary’, ‘,’, ‘this’, ‘one’, ‘features’, ‘no’, ‘lasting’, ‘message’, ‘.’, ‘However’, ‘,’, ‘it’, “'s”, ‘also’, ‘very’, ‘and’, ‘watchable’, ‘,’, ‘so’, ‘who’, ‘cares’, ‘?’, ‘!’, ‘Exactly’, ‘WHAT’, ‘occurs’, ‘you’, “'ll”, ‘just’, ‘have’, ‘to’, ‘see’, ‘for’, ‘yourself.By’, ‘the’, ‘way’, ‘,’, ‘this’, ‘one’, ‘stars’, ‘John’, ‘Hoyt’, ‘–’, ‘a’, ‘face’, ‘most’, ‘of’, ‘you’, ‘will’, ‘recognize’, ‘from’, ‘countless’, ‘old’, ‘TV’, ‘shows’, ‘and’, ‘movies’, ‘.’, ‘In’, ‘almost’, ‘every’, ‘case’, ‘,’, ‘he’, ‘played’, ‘a’, ‘real’, ‘grouch’, ‘(’, ‘like’, ‘Charles’, ‘Lane’, ‘during’, ‘the’, ‘same’, ‘era’, ‘)’, ‘,’, ‘but’, ‘boy’, ‘did’, ‘I’, ‘love’, ‘seeing’, ‘him’, ‘–’, ‘as’, ‘he’, ‘perfected’, ‘the’, ‘grouchy’, ‘persona’, ‘and’, ‘was’, ‘kind’, ‘of’, ‘funny’, ‘at’, ‘the’, ‘same’, ‘time’, ‘.’]

pos

1.2 Building the Vocabulary

Build the vocabulary from the training set; the vocabulary maps every word to an integer index. Here the 10k most frequent words are kept (set via the max_size argument), and all other words are represented by the unknown token <unk>. After the mapping, the vectors argument initializes word vectors for these indices from pretrained embeddings.

TEXT.build_vocab(train_data, max_size=10000, vectors='glove.6B.100d')
# LABEL needs a vocab too: the labels are the strings pos and neg, so its vocab contains only these two entries.
LABEL.build_vocab(train_data)
# Inspect the string-to-int mapping (stoi); itos gives the reverse mapping.
print(TEXT.vocab.stoi)
print(LABEL.vocab.stoi)
# Output: defaultdict(None, {'neg': 0, 'pos': 1})
batchsz = 32
# use the GPU
device = torch.device('cuda')
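
To see what the vocabulary actually contains, a small inspection helps (my addition; it assumes torchtext's usual behavior of reserving indices 0 and 1 for the special tokens):

# Illustrative check: 10000 words plus the special tokens <unk> and <pad> explain the size 10002.
print(len(TEXT.vocab))           # 10002
print(TEXT.vocab.itos[:5])       # typically ['<unk>', '<pad>', 'the', ',', '.']
print(TEXT.vocab.vectors.shape)  # torch.Size([10002, 100])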

1.3 Creating the Iterators

Create the iterators, the torchtext counterpart of PyTorch's DataLoader. Each batch produced by an iterator has two parts: the tokens (.text) and the labels (.label), where the text has already been converted to integer indices.

BucketIterator puts sentences of similar length into the same batch, so that no batch contains too much padding. Because padding is limited here, the <pad> tokens are simply treated as part of the model input during training. If a GPU is available, you can also require that every tensor returned by the iterator lives on the GPU.

train_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, test_data),
    batch_size = batchsz,
    device=device
)
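
To verify what an iterator yields, one can peek at a single batch (my addition; the sequence length varies from batch to batch because of the bucketing):

# Peek at one batch (illustrative, not in the original code).
batch = next(iter(train_iterator))
print(batch.text.shape)   # e.g. torch.Size([seq_len, 32]); seq_len differs per batch
print(batch.label.shape)  # torch.Size([32])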

2. Defining the Model

Pay attention to the input and output shapes between the layers; the model concatenates the final outputs of the forward and backward directions. The output layout of a multi-layer bidirectional LSTM is described below.

As an example, define a bidirectional LSTM with num_layers=3.
The first dimension of h_n then has size 6 (2*3):
h_n[0] is the forward output of layer 1 at the last time step, and h_n[1] is the backward output of layer 1;
h_n[2] is the forward output of layer 2 at the last time step, and h_n[3] is the backward output of layer 2;
h_n[4] and h_n[5] are the forward and backward outputs of layer 3 at the last time step.
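
This layout can be checked directly with a throwaway LSTM; the following sketch (my addition) confirms that h_n[-2] and h_n[-1] are the top layer's forward and backward states:

# Sanity check (not part of the demo): verify the h_n layout of a 3-layer BiLSTM.
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=3, bidirectional=True)
x = torch.randn(7, 4, 10)                 # [seq_len, batch, input_size]
output, (h_n, c_n) = lstm(x)

print(h_n.shape)                          # torch.Size([6, 4, 20]) -> 2 directions * 3 layers
# Forward state of the top layer == forward half of output at the last time step.
print(torch.allclose(h_n[-2], output[-1, :, :20]))  # True
# Backward state of the top layer == backward half of output at the first time step.
print(torch.allclose(h_n[-1], output[0, :, 20:]))   # True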

class RNN(nn.Module):
    
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
 
        super(RNN, self).__init__()
        
        # [0-10001] => [100]
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # [100] => [256]
        # two-layer bidirectional LSTM
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=2, 
                           bidirectional=True, dropout=0.5)
        # [256*2] => [1]
        # fully connected layer; its input is the concatenation of the two directions' final hidden states
        self.fc = nn.Linear(hidden_dim*2, 1)
        self.dropout = nn.Dropout(0.5)
        
        
    def forward(self, x):
        """
        x: [seq_len, b] (cf. an image batch of shape [b, 3, 28, 28])
        """
        # [seq, b] => [seq, b, 100]
        embedding = self.dropout(self.embedding(x))
        
        # output: [seq, b, hid_dim*2]
        # hidden/h: [num_layers*2, b, hid_dim]
        # cell/c: [num_layers*2, b, hid_dim]
        output, (hidden, cell) = self.rnn(embedding)
        
        # [num_layers*2, b, hid_dim] => 2 of [b, hid_dim] => [b, hid_dim*2]
        # concatenate the final hidden states of the forward and backward directions
        # see the LSTM output layout above for why indices [-2] and [-1] pick the top layer
        hidden = torch.cat([hidden[-2], hidden[-1]], dim=1)
        
        # [b, hid_dim*2] => [b, 1]
        hidden = self.dropout(hidden)
        out = self.fc(hidden)
        
        return out
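
Before training, a quick shape check with a fake batch of token indices confirms that the model produces the expected [b, 1] output (my addition; the sizes simply mirror the ones used below):

# Shape check (illustrative): 50 time steps, batch of 32, vocab of size 10002.
fake_x = torch.randint(0, 10002, (50, 32))       # [seq_len, b] of token indices
check = RNN(vocab_size=10002, embedding_dim=100, hidden_dim=256)
print(check(fake_x).shape)                       # torch.Size([32, 1])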

3. Initializing the Model

# instantiate the model
rnn = RNN(len(TEXT.vocab), 100, 256)

# Replace the random initialization with the pretrained embedding (tip: functions with a trailing underscore, such as .copy_(), modify the tensor in place).
pretrained_embedding = TEXT.vocab.vectors
print('pretrained_embedding:', pretrained_embedding.shape)
rnn.embedding.weight.data.copy_(pretrained_embedding)
print('embedding layer inited.')
# define the optimizer
optimizer = optim.Adam(rnn.parameters(), lr=1e-3)
# define the loss function
criterion = nn.BCEWithLogitsLoss().to(device)
rnn.to(device)

Output:

pretrained_embedding: torch.Size([10002, 100])
embedding layer inited.
RNN(
  (embedding): Embedding(10002, 100)
  (rnn): LSTM(100, 256, num_layers=2, dropout=0.5, bidirectional=True)
  (fc): Linear(in_features=512, out_features=1, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

4. Defining the Evaluation Metric

import numpy as np

# function to compute accuracy
def binary_acc(preds, y):
    """
    get accuracy
    """
    # pass the logits through a sigmoid to squash them into (0, 1), then round;
    # since this is a binary classification problem, rounding directly gives the predicted class.
    preds = torch.round(torch.sigmoid(preds))
    correct = torch.eq(preds, y).float()
    acc = correct.sum() / len(correct)
    return acc
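
A tiny worked example (my addition) makes the rounding behaviour concrete:

# Three samples: logits vs. true labels (illustrative values).
logits = torch.tensor([2.0, -1.0, 0.5])  # sigmoid -> ~0.88, 0.27, 0.62 -> rounded to 1, 0, 1
labels = torch.tensor([1., 0., 0.])
print(binary_acc(logits, labels))        # tensor(0.6667): 2 of the 3 predictions are correct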

5. Defining the Training and Evaluation Functions


def train(rnn, iterator, optimizer, criterion):
    
    avg_acc = []
    rnn.train()
    
    # compute the accuracy for every batch
    for i, batch in enumerate(iterator):
        
        # [seq, b] => [b, 1] => [b]
        # squeeze [b, 1] -> [b] so that pred has the same shape as batch.label
        pred = rnn(batch.text).squeeze(1)
        # print(pred)
        # pred is a raw float logit; why can the loss be computed like this?
        # Because this is binary classification, a single probability is enough to decide positive vs. negative.
        loss = criterion(pred, batch.label)
        acc = binary_acc(pred, batch.label).item()
        avg_acc.append(acc)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if i%10 == 0:
            print(i, acc)
        
    avg_acc = np.array(avg_acc).mean()
    print('avg acc:', avg_acc)
    
    
def eval(rnn, iterator, criterion):
    
    avg_acc = []
    # switch to evaluation mode
    rnn.eval()
    
    # torch.no_grad() is a context manager; it disables building the computation graph
    with torch.no_grad():
        for batch in iterator:

            # [b, 1] => [b]
            pred = rnn(batch.text).squeeze(1)
            loss = criterion(pred, batch.label)

            acc = binary_acc(pred, batch.label).item()
            avg_acc.append(acc)
        
    avg_acc = np.array(avg_acc).mean()
    
    print('>>test:', avg_acc)

6. Training

for epoch in range(10):    
    train(rnn, train_iterator, optimizer, criterion)
    eval(rnn, test_iterator, criterion)
