pytorch - Text Classification

This article implements text classification with PyTorch and torchtext, using the IMDB dataset.



For an introduction to torchtext and its common usage, see the following two blog posts:
A Comprehensive Introduction to Torchtext (Practical Torchtext part 1)
Language modeling tutorial in torchtext (Practical Torchtext part 2)


First, import the required packages and apply some routine settings:

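import torch 
import torch.nn as nn
# Legacy torchtext API: in torchtext 0.9-0.11 these modules live under torchtext.legacy,
# and they were removed in 0.12
from torchtext import data, datasets
import os
import random
import torch.nn.functional as F

import spacy
# The 'en' shortcut only works in spaCy 2.x; spaCy 3.x needs a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')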

SEED = 1234

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Loading the IMDB dataset: torchtext.datasets provides commonly used datasets for many types of tasks, for example:

  • Language Modeling:WikiText-2、WikiText103、PennTreebank
  • Sentiment Analysis:SST、IMDB

For the datasets covering Text Classification, Question Classification, Entailment, and Machine Translation, see: TORCHTEXT.DATASETS

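torchtext's Dataset inherits from PyTorch's Dataset and provides a method that downloads and extracts compressed data (.zip, .gz, and .tgz are supported). The splits method reads the training, validation, and test sets in a single call. TabularDataset makes it easy to read files in CSV, TSV, or JSON format.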

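# Create the Fields
tokenize = lambda x: x.split()
TEXT = data.Field(tokenize=tokenize)
LABEL = data.LabelField(dtype=torch.float)

# Load the datasets
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

# Split off a validation set
# random.seed() returns None, so seed first and pass the generator state explicitly
random.seed(SEED)
train_data, valid_data = train_data.split(random_state=random.getstate())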
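The validation split above is a seeded random partition of the training set. As a rough illustration of the idea only, here is a pure-Python sketch; `split_examples` is a hypothetical helper, not torchtext's implementation, and the 0.7 ratio and the 25,000-example size are assumptions chosen to mirror torchtext's default split ratio and the IMDB training set.

```python
import random

def split_examples(examples, split_ratio=0.7, seed=1234):
    """Shuffle a copy of the examples with a private seeded RNG, then cut it in two."""
    rng = random.Random(seed)          # private generator: the global RNG is untouched
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(round(split_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Stand-in for 25,000 training examples
train_part, valid_part = split_examples(range(25000))
print(len(train_part), len(valid_part))

# The same seed reproduces exactly the same partition
train_again, valid_again = split_examples(range(25000))
```

Passing an explicit seed or generator state is what makes the partition repeatable across runs.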

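Field: the configuration for data preprocessing, such as tokenization, lowercasing, start and end tokens, the padding token, and the vocabulary. It holds the general text-processing settings and also owns a vocabulary object that maps text to numbers, so that the text can then be turned into the tensor type you need.
Main parameters
– sequential: whether the data is sequential. If False, no tokenization is applied. Default: True.
– use_vocab: whether to use a vocabulary object. If False, the data must already be numeric. Default: True.
– init_token: a token prepended to every example. Default: None.
– eos_token: a token appended to every example. Default: None.
– fix_length: pad every example to this length using pad_token. Default: None.
– tensor_type: the tensor type the data is converted to. Default: torch.LongTensor. (Newer releases use dtype instead, as in the LabelField call above.)
– preprocessing: a pipeline applied after tokenization and before numericalization. Default: None.
– postprocessing: a pipeline applied after numericalization and before conversion to a tensor. Default: None.
– lower: whether to lowercase the text. Default: False.
– tokenize: the tokenization function. Default: str.split.
– include_lengths: whether to return a tuple of the padded minibatch and a list with the length of each example, rather than just the padded minibatch. Default: False.
– batch_first: whether to produce tensors with the batch dimension first. Default: False.
– pad_token: the token used for padding. Default: "<pad>".
– unk_token: the token used for words that are not in the vocabulary. Default: "<unk>".
– pad_first: whether to pad at the beginning of the sequence instead of the end. Default: False.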


Main methods
– pad(minibatch): pads the examples in a batch to a common length
– build_vocab(): builds the vocabulary
– numericalize(): converts text data to numbers and returns a tensor
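To make the three methods above concrete, here is a minimal pure-Python sketch of the same pipeline: build a vocabulary, pad a batch, then map tokens to ids. The helper functions are hypothetical stand-ins that borrow the method names. They are not torchtext's implementation, and they return nested lists rather than tensors.

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"

def build_vocab(tokenized_texts, max_size=None):
    """Give the most frequent tokens integer ids; ids 0 and 1 are reserved for UNK and PAD."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = [UNK, PAD] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

def pad(batch, pad_token=PAD):
    """Right-pad every example to the length of the longest one in the batch."""
    max_len = max(len(example) for example in batch)
    return [example + [pad_token] * (max_len - len(example)) for example in batch]

def numericalize(batch, stoi):
    """Replace each token with its id, falling back to the UNK id for unseen tokens."""
    return [[stoi.get(tok, stoi[UNK]) for tok in example] for example in batch]

corpus = ["a great movie", "a boring movie", "great"]
stoi, itos = build_vocab([s.split() for s in corpus])

batch = [s.split() for s in ["a great film", "boring"]]
ids = numericalize(pad(batch), stoi)
print(itos)
print(ids)
```

Here "film" never appeared in the corpus, so it maps to the UNK id, and the shorter example is filled with the PAD id.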

Once we have the datasets, we need to build the vocabulary:

# Build the vocabulary
TEXT.build_vocab(train_data, max_size=25000, vectors="glove.6B.100d", unk_init=torch.Tensor.normal_)
LABEL.build_vocab(train_data)

To train the model later, we next create Iterators from the datasets:

# Create the Iterators
BATCH_SIZE = 64
train_iterator, val_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    device=device)

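The Iterator is torchtext's output to the model. It provides the usual data handling, such as shuffling and sorting, can adjust the batch size dynamically, and also has a splits method that produces the training, validation, and test iterators in a single call.
Main parameters

  • dataset: the dataset to load
  • batch_size: the batch size
  • batch_size_fn: a function that produces a dynamic batch size
  • sort_key: the key used for sorting
  • train: whether this is a training set
  • repeat: whether to keep iterating across epochs
  • shuffle: whether to shuffle the data
  • sort: whether to sort the data
  • sort_within_batch: whether to sort the examples within each batch
  • device: the device on which batches are created. Older releases used -1 for the CPU and 0, 1, … for the corresponding GPU; the code above passes a torch.device instead.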
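The point of a bucketing iterator is to batch examples of similar length together so that little padding is wasted. The sketch below illustrates that idea in pure Python. `bucket_batches` and `padding_needed` are hypothetical helpers, and the real BucketIterator differs in detail, so treat this as an approximation of the technique rather than its implementation.

```python
import random

def bucket_batches(examples, batch_size, sort_key=len, seed=0):
    """Sort by length, cut into consecutive batches, then shuffle the batch order."""
    ordered = sorted(examples, key=sort_key)
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    random.Random(seed).shuffle(batches)   # batch order is random; batch contents stay length-sorted
    return batches

def padding_needed(batches):
    """Count the pad positions needed to make every batch rectangular."""
    return sum(max(len(x) for x in b) * len(b) - sum(len(x) for x in b) for b in batches)

# 64 fake "sentences" with lengths between 5 and 50 tokens
rng = random.Random(1234)
examples = [[0] * rng.randint(5, 50) for _ in range(64)]

naive = [examples[i:i + 8] for i in range(0, len(examples), 8)]
bucketed = bucket_batches(examples, batch_size=8)
print("padding without bucketing:", padding_needed(naive))
print("padding with bucketing:   ", padding_needed(bucketed))
```

Sorting before batching can only reduce the total padding, at the cost of batches that are less random than fully shuffled ones.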

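With these three steps done, the data is in the format needed for training. The next step is to build the classification model.

Below is a simple RNN-based model consisting only of an Embedding layer, a 2-layer Bi-LSTM, and a final fully connected layer.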

# Build the model
class RNN(nn.Module):
    def __init__(self, vocab_size, 
                 embedding_dim, 
                 hidden_dim, 
                 output_dim, 
                 n_layers = 2, 
                 bidirectional = True, 
                 dropout = 0.5):
        super().__init__()
        self.bidirectional = bidirectional
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           bidirectional=bidirectional, 
                           dropout=dropout)
        # A bidirectional LSTM yields a forward and a backward final state
        self.fc = nn.Linear(hidden_dim * (2 if bidirectional else 1), output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):
        embedded = self.dropout(self.embedding(text)) # [sent len, batch size, emb dim]
        output, (hidden, cell) = self.rnn(embedded)
        if self.bidirectional:
            # Concatenate the last layer's final forward and backward hidden states
            hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)
        else:
            hidden = hidden[-1,:,:]
        hidden = self.dropout(hidden) # [batch size, hid dim * num directions]
        return self.fc(hidden) # [batch size, output dim]
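The indexing `hidden[-2]` and `hidden[-1]` in the model above relies on how `nn.LSTM` lays out its final hidden states. The small check below, with arbitrary toy sizes chosen purely for illustration, sketches why those two slices are the last layer's forward and backward states.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
SENT_LEN, BATCH, EMB, HID, LAYERS = 7, 3, 4, 5, 2  # toy sizes, chosen arbitrarily

lstm = nn.LSTM(EMB, HID, num_layers=LAYERS, bidirectional=True)
x = torch.randn(SENT_LEN, BATCH, EMB)               # [sent len, batch size, emb dim]
output, (hidden, cell) = lstm(x)

# output: [sent len, batch size, hid dim * 2]   (top layer, every time step)
# hidden: [num layers * 2, batch size, hid dim] (every layer, final time step)
final = torch.cat((hidden[-2], hidden[-1]), dim=1)  # [batch size, hid dim * 2]

# Forward direction finishes at the last time step, backward direction at the first
fwd_matches = torch.allclose(hidden[-2], output[-1, :, :HID])
bwd_matches = torch.allclose(hidden[-1], output[0, :, HID:])
print(tuple(output.shape), tuple(hidden.shape), tuple(final.shape))
print(fwd_matches, bwd_matches)
```

Because the backward direction ends at time step 0, simply taking `output[-1]` would not give the same vector as the concatenation used in the model.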

The next model classifies using the average of the representation vectors of all tokens.

class WordAVGModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.fc = nn.Linear(embedding_dim, output_dim)
        
    def forward(self, text):
        embedded = self.embedding(text) # [sent len, batch size, emb dim]
        embedded = embedded.permute(1, 0, 2) # [batch size, sent len, emb dim]
        pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1) # [batch size, embedding_dim]
        return self.fc(pooled)
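The `F.avg_pool2d` call in the model above uses a window spanning the whole sentence, so it should amount to a plain mean over the sentence dimension. A small check under toy sizes (the tensor below is random data, not real embeddings):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
embedded = torch.randn(4, 6, 10)  # [batch size, sent len, emb dim], toy sizes

# A (sent len, 1) window averages over all tokens for each embedding dimension
pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1)  # [batch size, emb dim]
mean = embedded.mean(dim=1)                                         # [batch size, emb dim]

same = torch.allclose(pooled, mean, atol=1e-6)
print(tuple(pooled.shape), same)
```

Either way, padding positions are averaged in along with real tokens, which is worth keeping in mind when batches contain sentences of very different lengths.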

An implementation of TextCNN:

class CNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.convs = nn.ModuleList([nn.Conv2d(in_channels = 1, 
                                              out_channels = n_filters, 
                                              kernel_size = (fs, embedding_dim)) for fs in filter_sizes])
        self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):
        text = text.permute(1, 0) # [batch size, sent len]
        embedded = self.embedding(text) # [batch size, sent len, emb dim]
        embedded = embedded.unsqueeze(1) # [batch size, 1, sent len, emb dim]
        conved = [F.relu(conv(embedded)).squeeze(3) for conv in self.convs]  #conv_n = [batch size, n_filters, sent len - filter_sizes[n] + 1]
        pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved] #pooled_n = [batch size, n_filters]
        cat = self.dropout(torch.cat(pooled, dim=1))  #cat = [batch size, n_filters * len(filter_sizes)]
            
        return self.fc(cat)
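Each convolution in the model above slides a window of `fs` consecutive tokens over the sentence, so a sentence of length n yields n - fs + 1 positions per filter. Max pooling then keeps one value per filter. The pure-Python sketch below only counts windows; `ngram_windows` is a hypothetical helper, and the sentence is made up for illustration.

```python
def ngram_windows(tokens, filter_size):
    """Return every run of filter_size consecutive tokens, as a convolution would see them."""
    return [tokens[i:i + filter_size] for i in range(len(tokens) - filter_size + 1)]

FILTER_SIZES = [3, 4, 5]
N_FILTERS = 100

tokens = "this film is surprisingly good and very funny".split()  # 8 tokens

# Number of positions each filter size is applied at: sent len - filter size + 1
positions = {fs: len(ngram_windows(tokens, fs)) for fs in FILTER_SIZES}
print(positions)

# After max-over-time pooling, each filter contributes one number,
# so the fully connected layer sees len(FILTER_SIZES) * N_FILTERS features
fc_in_features = len(FILTER_SIZES) * N_FILTERS
print(fc_in_features)

# A sentence shorter than the largest filter has no valid window for that filter
too_short = ngram_windows("too short".split(), 5)
```

The last line hints at a practical caveat: inputs shorter than the largest filter size need padding before they can pass through such a model.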

Model training

VOCAB_SIZE = len(TEXT.vocab)
EMBED_DIM = 100
HIDDEN_DIM = 256
NUM_LABELS = 1
N_FILTERS = 100
FILTER_SIZES = [3,4,5]
DROPOUT = 0.5

# Instantiate the model
model = RNN(VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, NUM_LABELS)
model = model.to(device)

# Load the pretrained GloVe weights
pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

# Zero the embeddings of the <unk> and <pad> tokens
model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBED_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBED_DIM)

# Choose the optimizer and loss function
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()
criterion = criterion.to(device)
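BCEWithLogitsLoss takes raw model outputs (logits), applies a sigmoid internally, and averages the binary cross-entropy over the batch by default. As a rough pure-Python reference for what that computes, and for how an accuracy can be read off the same logits, consider the sketch below. `bce_with_logits_ref` and `accuracy_from_logits` are hypothetical helpers, the numbers are made up, and this naive formula lacks the numerical safeguards of the PyTorch version.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits_ref(logits, targets):
    """Mean of -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))] over the batch."""
    losses = [-(y * math.log(sigmoid(x)) + (1 - y) * math.log(1 - sigmoid(x)))
              for x, y in zip(logits, targets)]
    return sum(losses) / len(losses)

def accuracy_from_logits(logits, targets):
    """Predict 1 when sigmoid(logit) > 0.5, then compare with the targets."""
    preds = [1.0 if sigmoid(x) > 0.5 else 0.0 for x in logits]
    return sum(p == y for p, y in zip(preds, targets)) / len(targets)

logits = [2.0, -1.0, 0.5, -3.0]    # made-up model outputs
targets = [1.0, 0.0, 0.0, 1.0]     # made-up labels

loss = bce_with_logits_ref(logits, targets)
acc = accuracy_from_logits(logits, targets)
print(loss, acc)
```

A logit of 0 corresponds to a probability of 0.5, where the loss is log 2 whatever the label is.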

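# Accuracy: apply a sigmoid to the logits, round to 0 or 1, and compare with the labels
def binary_accuracy(preds, y):
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    return correct.sum() / len(correct)

# Train
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    
    for index, batch in enumerate(iterator):
        optimizer.zero_grad()
        predictions = model(batch.text).squeeze(1)
        loss = criterion(predictions, batch.label)
        acc = binary_accuracy(predictions, batch.label)
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
         
        if index % 1000 == 0:
            print("loss: {} , acc: {}".format(loss.item(), acc.item()))
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)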
    
# Evaluate
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    
    with torch.no_grad():
        for batch in iterator:
            predictions = model(batch.text).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    
    model.train()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

# Predict
def predict_sentiment(model, sentence):
    model.eval()
    # Note: TEXT above was built with a whitespace tokenizer, so spaCy may split some text differently
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1) # [sent len, batch size = 1]
    with torch.no_grad():
        # The models above take only the token tensor, so no length tensor is passed
        prediction = torch.sigmoid(model(tensor))
    return prediction.item()

import time

# Split an elapsed time in seconds into whole minutes and seconds
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

if __name__ == '__main__':
    N_EPOCHS = 1
    best_valid_loss = float('inf')

    for epoch in range(N_EPOCHS):
        start_time = time.time()

        train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
        valid_loss, valid_acc = evaluate(model, val_iterator, criterion)

        end_time = time.time()

        epoch_mins, epoch_secs = epoch_time(start_time, end_time)

        # Keep the checkpoint with the lowest validation loss
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), 'wordavg-model.pt')

        print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
        print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
        print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Source: pytorch-sentiment-analysis
