PyTorch Notes 6

Text Classification with the torchtext library

In this tutorial, we show how to use the torchtext library to build the dataset for text classification analysis. Users will have the flexibility to:

  • Access the raw data as an iterator
  • Build a data processing pipeline to convert the raw text strings into torch.Tensor that can be used to train the model
  • Shuffle and iterate over the data with torch.utils.data.DataLoader

1. Access to the raw dataset iterators

The torchtext library provides a few raw dataset iterators that yield the raw text strings. For example, the AG_NEWS dataset iterators yield the raw data as a tuple of label and text.

# Running the code as shown in the original tutorial raises a ConnectionError
import torch
from torchtext.datasets import AG_NEWS
train_iter = AG_NEWS(split='train')

The code above raises: ConnectionError: ('Connection aborted.', OSError(22, 'Invalid argument'))

So I downloaded the AG_NEWS dataset from the web instead; download link:

https://download.csdn.net/download/hanfeixue2001/16261579?spm=1001.2014.3001.5501

from torchtext.utils import unicode_csv_reader
import io
def read_iter(path):
    # Each CSV row is (label, title, description); yield it as a (label, text)
    # tuple, the same format produced by the official AG_NEWS iterator
    with io.open(path, encoding='utf-8') as f:
        reader = unicode_csv_reader(f)
        for row in reader:
            yield int(row[0]), ' '.join(row[1:])

            
train_path = './AG_NEWS/train.csv'
test_path = './AG_NEWS/test.csv'
train_iter = read_iter(train_path)
next(train_iter)
(3,
 "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.")
next(train_iter)
(3,
 'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.')

2. Prepare data processing pipelines

We will use the most basic components of the torchtext library, including vocab, word vectors, and the tokenizer, to perform basic preprocessing on the raw text strings.

Here is an example of NLP data preprocessing with the tokenizer and vocabulary. The first step is to build the vocabulary from the raw training dataset. Users can customize the vocab by passing arguments to the constructor of the Vocab class; for example, min_freq sets the minimum frequency a token must have to be included (a short sketch of this parameter follows the code below).

from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import Vocab
# get_tokenizer creates a tokenizer that splits text according to the chosen rule
# supported rules include 'basic_english', 'spacy', 'moses', 'toktok', 'revtok', 'subword', etc.
tokenizer = get_tokenizer('basic_english')
train_iter = read_iter(train_path)
counter = Counter()
for label, line in train_iter:
    # Feed each line of the corpus to the tokenizer and count token frequencies
    counter.update(tokenizer(line))
# Create a Vocab object from a collections.Counter
vocab = Vocab(counter, min_freq=1)
# vocab has three attributes: freqs, stoi, and itos; here is one of them
vocab.itos[:5]
['<unk>', '<pad>', '.', 'the', ',']
# Convert tokens into integers: each token is replaced by its unique index in the vocab
[vocab[token] for token in ['here', 'is', 'an', 'example']]

[476, 22, 31, 5298]
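As a quick illustration of the min_freq parameter (a minimal sketch, not from the original post; the toy Counter below is made up, and the behavior assumes the same Vocab class imported above), tokens whose frequency falls below min_freq are dropped from the vocabulary and fall back to the '<unk>' index:

from collections import Counter
from torchtext.vocab import Vocab

# Toy counter for illustration: 'rare' occurs only once
toy_counter = Counter({'the': 10, 'cat': 3, 'rare': 1})
v1 = Vocab(toy_counter, min_freq=1)  # keeps every token
v2 = Vocab(toy_counter, min_freq=2)  # filters out 'rare'
print(len(v1), len(v2))              # v2 is smaller; the specials '<unk>'/'<pad>' are always kept
print(v2['rare'])                    # filtered or unseen tokens fall back to the '<unk>' index (0 here)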

Prepare the text processing pipeline with the tokenizer and vocabulary.

The text and label pipelines will be used to process the raw data strings.

text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
label_pipeline = lambda x: int(x) - 1
text_pipeline('here is an example')
[476, 22, 31, 5298]
label_pipeline('3')
2

3. Generate data batch and iterator

torch.utils.data.DataLoader is used to generate data batches and iterators.
Before the data is fed to the model, the collate_fn function processes a batch of samples produced by the DataLoader; its input is a batch of data whose size is the batch_size specified in the DataLoader (a quick look at one batch follows the code below).

from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for _label, _text in batch:
        # Convert the label to an integer index with label_pipeline
        label_list.append(label_pipeline(_label))
        # Convert the text to a tensor of token indices with text_pipeline
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        # Record the length of each text in offsets
        offsets.append(processed_text.size(0))

    label_list = torch.tensor(label_list, dtype=torch.int64)
    # Drop the last length and take the cumulative sum,
    # e.g. [0, 25, 20, 18] -> [0, 25, 20] -> cumsum -> [0, 25, 45]
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    # Flatten text_list into one tensor, e.g. [tensor([1,2]), tensor([3,4])] -> tensor([1,2,3,4])
    text_list = torch.cat(text_list)
    # All labels and all texts of the batch are packed together; offsets marks where each
    # text starts, so the concatenated tensor can be split back into individual texts.
    # Return the labels, the concatenated texts, and the starting offset of each text.
    return label_list.to(device), text_list.to(device), offsets.to(device)

train_iter = read_iter(train_path)
# Materialize the generator into a list so DataLoader can index it as a map-style dataset
dataloader = DataLoader(list(train_iter), batch_size=8, shuffle=False, collate_fn=collate_batch)
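As a quick sanity check (not in the original notebook), we can pull one batch out of the dataloader above and look at the three tensors produced by collate_batch:

labels, texts, offsets = next(iter(dataloader))
print(labels.shape)  # torch.Size([8]): one label per sample
print(texts.shape)   # 1-D tensor: all 8 texts concatenated into one long sequence
print(offsets)       # 8 values: the starting position of each text within texts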

4. Define the model

The model is composed of an nn.EmbeddingBag layer and a linear layer.

The default mode of nn.EmbeddingBag is 'mean', which computes the mean of a bag of embeddings. Although the texts have different lengths, the nn.EmbeddingBag module requires no padding, because the text lengths are stored in offsets.

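To make the offsets mechanism concrete, here is a small sketch (not part of the tutorial; the toy indices are made up) showing that nn.EmbeddingBag in 'mean' mode with offsets gives the same result as embedding each variable-length text separately and averaging, with no padding involved:

import torch
from torch import nn

torch.manual_seed(0)
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode='mean')

# Two texts of different lengths concatenated into one tensor: [1, 2, 3] and [4, 5]
text = torch.tensor([1, 2, 3, 4, 5])
offsets = torch.tensor([0, 3])               # starting position of each text
pooled = bag(text, offsets)                  # shape (2, 4): one vector per text

# The same result computed by hand: embed each text and take the mean
emb = nn.Embedding.from_pretrained(bag.weight, freeze=True)
manual = torch.stack([emb(torch.tensor([1, 2, 3])).mean(0),
                      emb(torch.tensor([4, 5])).mean(0)])
print(torch.allclose(pooled, manual))        # True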

from torch import nn

class TextClassificationModel(nn.Module):
    
    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()
    
    def init_weights(self):
        # Initialize the weights uniformly in [-initrange, initrange]
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()
    
    def forward(self, text, offsets):
        # text: concatenated token ids of the whole batch; offsets: starting position of each text
        # embedded has shape (batch_size, embed_dim)
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

5. Initiate an instance

The AG_NEWS dataset has four labels and therefore four classes:

1 : World

2 : Sports

3 : Business

4 : Sci/Tec

We build a model with an embedding dimension of 64. The vocab size equals the length of the vocabulary instance, and the number of classes equals the number of labels.

train_iter = read_iter(train_path)
# number of classes (unique labels)
num_class = len(set([label for (label, text) in train_iter]))
# size of the vocabulary
vocab_size = len(vocab)
emsize = 64
model = TextClassificationModel(vocab_size, emsize, num_class).to(device)
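As a quick check that the pieces fit together (a sketch, not in the original post), we can push one batch from the dataloader of section 3 through the freshly initialized model; it should return one row of 4 class scores per sample:

# Forward pass of the untrained model: raw logits only, no training involved yet
labels, texts, offsets = next(iter(dataloader))
with torch.no_grad():
    logits = model(texts, offsets)
print(logits.shape)  # torch.Size([8, 4]): (batch_size, num_class)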

6. Define functions to train the model and evaluate results

import time
def train(dataloader):
    # Switch the model to training mode
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        loss = criterion(predicted_label, label)
        loss.backward()
        # Gradient clipping to prevent exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        # Update the parameters
        optimizer.step()
        # Accumulate the number of correctly predicted samples
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        # Accumulate the total number of samples
        total_count += label.size(0)
        # Print training progress every log_interval batches
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches '
                  '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                                             total_acc/total_count))
            total_acc, total_count = 0, 0
            start_time = time.time()

def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0
    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            loss = criterion(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc/total_count

7. Split the dataset and run the model

Since the original AG_NEWS has no valid dataset, we split the training dataset into train/valid sets with a ratio of 0.95 (train) to 0.05 (valid), using the torch.utils.data.dataset.random_split function.

from torch.utils.data.dataset import random_split
# Hyperparameters
EPOCHS = 10     # epoch
LR = 5          # learning rate
BATCH_SIZE = 64 # batch size for training
# Loss function
criterion = torch.nn.CrossEntropyLoss()
# Optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
# Learning-rate scheduler: multiply the learning rate by gamma at each step
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)


total_accu = None
train_iter = read_iter(train_path)
test_iter = read_iter(test_path)

train_dataset = list(train_iter)
test_dataset = list(test_iter)
num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = random_split(train_dataset, 
                                        [num_train, len(train_dataset)-num_train])


train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(valid_dataloader)
    
    # If validation accuracy did not improve, decay the learning rate; otherwise keep the new best accuracy
    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
         'valid accuracy {:8.3f} '.format(epoch,
                                         time.time() - epoch_start_time,
                                         accu_val))
    print('-' * 59)
| epoch   1 |   500/ 1782 batches | accuracy    0.683
| epoch   1 |  1000/ 1782 batches | accuracy    0.853
| epoch   1 |  1500/ 1782 batches | accuracy    0.875
-----------------------------------------------------------
| end of epoch   1 | time: 17.74s | valid accuracy    0.890 
-----------------------------------------------------------
| epoch   2 |   500/ 1782 batches | accuracy    0.897
| epoch   2 |  1000/ 1782 batches | accuracy    0.901
| epoch   2 |  1500/ 1782 batches | accuracy    0.901
-----------------------------------------------------------
| end of epoch   2 | time: 16.61s | valid accuracy    0.901 
-----------------------------------------------------------
| epoch   3 |   500/ 1782 batches | accuracy    0.913
| epoch   3 |  1000/ 1782 batches | accuracy    0.913
| epoch   3 |  1500/ 1782 batches | accuracy    0.914
-----------------------------------------------------------
| end of epoch   3 | time: 15.95s | valid accuracy    0.902 
-----------------------------------------------------------
| epoch   4 |   500/ 1782 batches | accuracy    0.922
| epoch   4 |  1000/ 1782 batches | accuracy    0.923
| epoch   4 |  1500/ 1782 batches | accuracy    0.922
-----------------------------------------------------------
| end of epoch   4 | time: 16.52s | valid accuracy    0.907 
-----------------------------------------------------------
| epoch   5 |   500/ 1782 batches | accuracy    0.930
| epoch   5 |  1000/ 1782 batches | accuracy    0.931
| epoch   5 |  1500/ 1782 batches | accuracy    0.927
-----------------------------------------------------------
| end of epoch   5 | time: 17.86s | valid accuracy    0.909 
-----------------------------------------------------------
| epoch   6 |   500/ 1782 batches | accuracy    0.939
| epoch   6 |  1000/ 1782 batches | accuracy    0.935
| epoch   6 |  1500/ 1782 batches | accuracy    0.933
-----------------------------------------------------------
| end of epoch   6 | time: 18.22s | valid accuracy    0.906 
-----------------------------------------------------------
| epoch   7 |   500/ 1782 batches | accuracy    0.945
| epoch   7 |  1000/ 1782 batches | accuracy    0.948
| epoch   7 |  1500/ 1782 batches | accuracy    0.947
-----------------------------------------------------------
| end of epoch   7 | time: 17.89s | valid accuracy    0.916 
-----------------------------------------------------------
| epoch   8 |   500/ 1782 batches | accuracy    0.951
| epoch   8 |  1000/ 1782 batches | accuracy    0.949
| epoch   8 |  1500/ 1782 batches | accuracy    0.947
-----------------------------------------------------------
| end of epoch   8 | time: 17.18s | valid accuracy    0.916 
-----------------------------------------------------------
| epoch   9 |   500/ 1782 batches | accuracy    0.950
| epoch   9 |  1000/ 1782 batches | accuracy    0.950
| epoch   9 |  1500/ 1782 batches | accuracy    0.950
-----------------------------------------------------------
| end of epoch   9 | time: 17.25s | valid accuracy    0.917 
-----------------------------------------------------------
| epoch  10 |   500/ 1782 batches | accuracy    0.950
| epoch  10 |  1000/ 1782 batches | accuracy    0.951
| epoch  10 |  1500/ 1782 batches | accuracy    0.949
-----------------------------------------------------------
| end of epoch  10 | time: 17.41s | valid accuracy    0.917 
-----------------------------------------------------------

8. Evaluate the model with test dataset

Evaluate with the test dataset.

print('Checking the results of test dataset.')
accu_test = evaluate(test_dataloader)
print('test accuracy {:8.3f}'.format(accu_test))
Checking the results of test dataset.
test accuracy    0.909

9. Test on a random news item

ag_news_label = {1: 'World',
                 2: 'Sports',
                 3: 'Business',
                 4: 'Sci/Tec'}

def predict(text, text_pipeline):
    with torch.no_grad():
        text = torch.tensor(text_pipeline(text))
        # offsets=[0] cleverly treats the whole input as a single text
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item() + 1

ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

model = model.to("cpu")
print('This is a %s news' % ag_news_label[predict(ex_text_str,
                                                 text_pipeline)])
This is a Sports news
