In this tutorial, we show how to use the torchtext library to build the dataset for text classification analysis. Users will have the flexibility to access the raw data as an iterator, build a data processing pipeline that converts the raw text strings into torch.Tensor that can be used to train the model, and shuffle and iterate the data with torch.utils.data.DataLoader.
The torchtext library provides a few raw dataset iterators that yield the raw text strings. For example, the AG_NEWS dataset iterators yield the raw data as a tuple of label and text.
# running the code from the original tutorial raises a ConnectionError
import torch
from torchtext.datasets import AG_NEWS
train_iter = AG_NEWS(split='train')
The code above returns the error ConnectionError: ('Connection aborted.', OSError(22, 'Invalid argument')).
Therefore, I downloaded the AG_NEWS dataset manually from:
https://download.csdn.net/download/hanfeixue2001/16261579?spm=1001.2014.3001.5501
from torchtext.utils import unicode_csv_reader
import io

def read_iter(path):
    with io.open(path, encoding='utf-8') as f:
        reader = unicode_csv_reader(f)
        for row in reader:
            yield int(row[0]), ' '.join(row[1:])
train_path = './AG_NEWS/train.csv'
test_path = './AG_NEWS/test.csv'
train_iter = read_iter(train_path)
next(train_iter)
(3,
"Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.")
next(train_iter)
(3,
'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.')
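For reference, each row of the AG_NEWS CSV files has three columns, the class index (1-4), the title, and the description, which is why read_iter joins row[1:] into a single string. A raw row looks roughly like this:
"3","Wall St. Bears Claw Back Into the Black (Reuters)","Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again."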
We will use the most basic components of the torchtext library, including the vocab, word vectors, and tokenizer, to perform basic preprocessing on the raw text strings.
Here is an example of NLP data preprocessing with the tokenizer and vocabulary. The first step is to build the vocabulary from the raw training dataset. Users can customize the vocab by setting arguments in the constructor of the Vocab class, for example min_freq, the minimum frequency required for a token to be included.
from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import Vocab
# get_tokenizer creates a tokenizer that splits text according to the chosen rule
# supported rules include 'basic_english', 'spacy', 'moses', 'toktok', 'revtok', 'subword', etc.
tokenizer = get_tokenizer('basic_english')
train_iter = read_iter(train_path)
counter = Counter()
for label, line in train_iter:
    # feed each line of the corpus to the tokenizer and count the resulting tokens
    counter.update(tokenizer(line))
# Create a Vocab object from a collections.Counter
vocab = Vocab(counter, min_freq=1)
# vocab has three attributes: freqs, stoi and itos; here we show one of them
vocab.itos[:5]
['<unk>', '<pad>', '.', 'the', ',']
# convert tokens into integers (each token is replaced by its unique index)
[vocab[token] for token in ['here', 'is', 'an', 'example']]
[476, 22, 31, 5298]
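The other two attributes can be inspected the same way, and min_freq controls how aggressively rare tokens are dropped. The snippet below is an illustrative sketch rather than part of the original tutorial; the printed numbers will vary, and it assumes that out-of-vocabulary tokens fall back to the '<unk>' index, as in the legacy torchtext Vocab.
# string-to-index mapping and raw training-set frequency of a token
print(vocab.stoi['the'], vocab.freqs['the'])
# a stricter min_freq keeps only tokens seen at least 10 times, shrinking the vocabulary
vocab_small = Vocab(counter, min_freq=10)
print(len(vocab), len(vocab_small))
# a token filtered out by min_freq maps to the '<unk>' index (assumed default: 0)
print(vocab_small['some-extremely-rare-token'])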
Prepare the text processing pipeline with the tokenizer and vocabulary. The text pipeline and the label pipeline will be used to process the raw data strings.
text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
label_pipeline = lambda x: int(x) - 1
text_pipeline('here is an example')
[476, 22, 31, 5298]
label_pipeline('3')
2
torch.utils.data.DataLoader is used to build data batches and iterators. Before the data is fed to the model, the collate_fn function processes the batch of samples generated by the DataLoader; its input is a batch of data with the batch size specified in the DataLoader.
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for _label, _text in batch:
        # map the label to an integer class index with label_pipeline
        label_list.append(label_pipeline(_label))
        # map the text to a tensor of token indices with text_pipeline
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        # record the length of each text
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    # drop the last length and take the cumulative sum so that offsets holds the
    # starting position of each text, e.g. [0, 25, 20, 18] -> [0, 25, 45]
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    # flatten the per-sample tensors into a single tensor,
    # e.g. [tensor([1, 2]), tensor([3, 4])] -> tensor([1, 2, 3, 4])
    text_list = torch.cat(text_list)
    # return all labels, the concatenated texts, and the starting offset of each text
    return label_list.to(device), text_list.to(device), offsets.to(device)
train_iter = read_iter(train_path)
# DataLoader's default sampler needs __len__/__getitem__, so materialize the generator into a list
dataloader = DataLoader(list(train_iter), batch_size=8, shuffle=False, collate_fn=collate_batch)
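As a quick sanity check (not part of the original tutorial), we can pull one batch from this DataLoader to see exactly what collate_batch returns:
label, text, offsets = next(iter(dataloader))
print(label.shape)   # torch.Size([8]): one label per sample
print(text.shape)    # all 8 texts flattened into a single 1-D tensor of token ids
print(offsets)       # 8 starting positions; offsets[0] is always 0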
The model is composed of an nn.EmbeddingBag layer and a linear layer. The default mode of nn.EmbeddingBag is 'mean', which computes the mean of a bag of embeddings. Although the texts have different lengths, nn.EmbeddingBag needs no padding, because the text lengths are stored in offsets.
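To make the role of offsets concrete, here is a small self-contained toy sketch (not part of the original tutorial) that mean-pools two variable-length "texts" without any padding:
import torch
from torch import nn

# toy example: a 10-token vocabulary with 3-dimensional embeddings
bag = nn.EmbeddingBag(10, 3, mode='mean')
# two "texts" of token ids flattened into one tensor: [1, 2, 4] and [4, 3]
text = torch.tensor([1, 2, 4, 4, 3], dtype=torch.int64)
offsets = torch.tensor([0, 3], dtype=torch.int64)  # the second text starts at position 3
print(bag(text, offsets).shape)  # torch.Size([2, 3]): one mean-pooled vector per text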
from torch import nn
class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        # initialize the weights
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        # text holds the concatenated token ids of a batch; offsets marks where each text starts
        # embedded has shape (batch_size, embed_dim)
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)
The AG_NEWS dataset has four labels and therefore four classes:
1 : World
2 : Sports
3 : Business
4 : Sci/Tec
We build the model with an embedding dimension of 64. The vocab size is equal to the length of the vocabulary instance, and the number of classes is equal to the number of labels.
train_iter = read_iter(train_path)
# number of classes (labels)
num_class = len(set([label for (label, text) in train_iter]))
# size of the vocabulary
vocab_size = len(vocab)
emsize = 64
model = TextClassificationModel(vocab_size, emsize, num_class).to(device)
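As a quick smoke test (hypothetical, not in the original tutorial), we can push a tiny fake batch through the untrained model and confirm that the output shape is (batch_size, num_class):
# two fake "texts" of token ids flattened into one tensor, with their starting offsets
dummy_text = torch.tensor([1, 2, 4, 4, 3, 2, 9], dtype=torch.int64).to(device)
dummy_offsets = torch.tensor([0, 3], dtype=torch.int64).to(device)
with torch.no_grad():
    print(model(dummy_text, dummy_offsets).shape)  # torch.Size([2, 4])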
import time
def train(dataloader):
    # set the model to training mode
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()
    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        loss = criterion(predicted_label, label)
        loss.backward()
        # clip the gradients to prevent them from exploding
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        # update the parameters
        optimizer.step()
        # accumulate the number of correctly predicted samples
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        # accumulate the total number of samples
        total_count += label.size(0)
        # print the training status every log_interval batches
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches '
                  '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                                              total_acc/total_count))
            total_acc, total_count = 0, 0
            start_time = time.time()
def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0
    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            loss = criterion(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc/total_count
Since the original AG_NEWS has no valid dataset, we split the training dataset into train/valid sets with a ratio of 0.95 (train) and 0.05 (valid), using the torch.utils.data.dataset.random_split function.
from torch.utils.data.dataset import random_split
# Hyperparameters
EPOCHS = 10 # epoch
LR = 5 # learning rate
BATCH_SIZE = 64 # batch size for training
# define the loss function
criterion = torch.nn.CrossEntropyLoss()
# define the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
# define the learning-rate scheduler
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
total_accu = None
train_iter = read_iter(train_path)
test_iter = read_iter(test_path)
train_dataset = list(train_iter)
test_dataset = list(test_iter)
num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = random_split(train_dataset,
                                          [num_train, len(train_dataset) - num_train])
train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)
for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(valid_dataloader)
    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           accu_val))
    print('-' * 59)
| epoch 1 | 500/ 1782 batches | accuracy 0.683
| epoch 1 | 1000/ 1782 batches | accuracy 0.853
| epoch 1 | 1500/ 1782 batches | accuracy 0.875
-----------------------------------------------------------
| end of epoch 1 | time: 17.74s | valid accuracy 0.890
-----------------------------------------------------------
| epoch 2 | 500/ 1782 batches | accuracy 0.897
| epoch 2 | 1000/ 1782 batches | accuracy 0.901
| epoch 2 | 1500/ 1782 batches | accuracy 0.901
-----------------------------------------------------------
| end of epoch 2 | time: 16.61s | valid accuracy 0.901
-----------------------------------------------------------
| epoch 3 | 500/ 1782 batches | accuracy 0.913
| epoch 3 | 1000/ 1782 batches | accuracy 0.913
| epoch 3 | 1500/ 1782 batches | accuracy 0.914
-----------------------------------------------------------
| end of epoch 3 | time: 15.95s | valid accuracy 0.902
-----------------------------------------------------------
| epoch 4 | 500/ 1782 batches | accuracy 0.922
| epoch 4 | 1000/ 1782 batches | accuracy 0.923
| epoch 4 | 1500/ 1782 batches | accuracy 0.922
-----------------------------------------------------------
| end of epoch 4 | time: 16.52s | valid accuracy 0.907
-----------------------------------------------------------
| epoch 5 | 500/ 1782 batches | accuracy 0.930
| epoch 5 | 1000/ 1782 batches | accuracy 0.931
| epoch 5 | 1500/ 1782 batches | accuracy 0.927
-----------------------------------------------------------
| end of epoch 5 | time: 17.86s | valid accuracy 0.909
-----------------------------------------------------------
| epoch 6 | 500/ 1782 batches | accuracy 0.939
| epoch 6 | 1000/ 1782 batches | accuracy 0.935
| epoch 6 | 1500/ 1782 batches | accuracy 0.933
-----------------------------------------------------------
| end of epoch 6 | time: 18.22s | valid accuracy 0.906
-----------------------------------------------------------
| epoch 7 | 500/ 1782 batches | accuracy 0.945
| epoch 7 | 1000/ 1782 batches | accuracy 0.948
| epoch 7 | 1500/ 1782 batches | accuracy 0.947
-----------------------------------------------------------
| end of epoch 7 | time: 17.89s | valid accuracy 0.916
-----------------------------------------------------------
| epoch 8 | 500/ 1782 batches | accuracy 0.951
| epoch 8 | 1000/ 1782 batches | accuracy 0.949
| epoch 8 | 1500/ 1782 batches | accuracy 0.947
-----------------------------------------------------------
| end of epoch 8 | time: 17.18s | valid accuracy 0.916
-----------------------------------------------------------
| epoch 9 | 500/ 1782 batches | accuracy 0.950
| epoch 9 | 1000/ 1782 batches | accuracy 0.950
| epoch 9 | 1500/ 1782 batches | accuracy 0.950
-----------------------------------------------------------
| end of epoch 9 | time: 17.25s | valid accuracy 0.917
-----------------------------------------------------------
| epoch 10 | 500/ 1782 batches | accuracy 0.950
| epoch 10 | 1000/ 1782 batches | accuracy 0.951
| epoch 10 | 1500/ 1782 batches | accuracy 0.949
-----------------------------------------------------------
| end of epoch 10 | time: 17.41s | valid accuracy 0.917
-----------------------------------------------------------
Evaluate the model on the test dataset.
print('Checking the results of test dataset.')
accu_test = evaluate(test_dataloader)
print('test accuracy {:8.3f}'.format(accu_test))
Checking the results of test dataset.
test accuracy 0.909
ag_news_label = {1: 'World',
2: 'Sports',
3: 'Business',
4: 'Sci/Tec'}
def predict(text, text_pipeline):
    with torch.no_grad():
        text = torch.tensor(text_pipeline(text))
        # the offsets argument is simply [0]: the whole input is treated as one text
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item() + 1
ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
enduring the season’s worst weather conditions on Sunday at The \
Open on his way to a closing 75 at Royal Portrush, which \
considering the wind and the rain was a respectable showing. \
Thursday’s first round at the WGC-FedEx St. Jude Invitational \
was another story. With temperatures in the mid-80s and hardly any \
wind, the Spaniard was 13 strokes better in a flawless round. \
Thanks to his best putting performance on the PGA Tour, Rahm \
finished with an 8-under 62 for a three-stroke lead, which \
was even more impressive considering he’d never played the \
front nine at TPC Southwind."
model = model.to("cpu")
print('This is a %s news' % ag_news_label[predict(ex_text_str,
text_pipeline)])
This is a Sports news
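As a hypothetical extension of predict() (not part of the original tutorial), several texts can be classified in one forward pass by flattening their token ids and passing the cumulative starting offsets, exactly as collate_batch does; predict_batch below is an illustrative helper name.
def predict_batch(texts, text_pipeline):
    with torch.no_grad():
        encoded = [torch.tensor(text_pipeline(t), dtype=torch.int64) for t in texts]
        # starting position of each text inside the concatenated tensor
        offsets = torch.tensor([0] + [len(e) for e in encoded[:-1]]).cumsum(dim=0)
        output = model(torch.cat(encoded), offsets)
        return [ag_news_label[i + 1] for i in output.argmax(1).tolist()]

print(predict_batch([ex_text_str, "Stocks rallied on Wall Street today."], text_pipeline))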