Text Classification Series (1): TextCNN and Its PyTorch Implementation
Text Classification Series (2): TextRNN and Its PyTorch Implementation
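Although TextCNN performs well on many tasks, the main limitation of a CNN is the fixed receptive field set by filter_size: it cannot model longer-range sequence information, and tuning filter_size as a hyperparameter is tedious. A CNN essentially extracts a feature representation of the text, whereas the more common choice in natural language processing is the recurrent neural network (RNN), which captures contextual information better. For text classification specifically, a bi-directional RNN (in practice, a bidirectional LSTM) can be understood, in a sense, as capturing variable-length, bidirectional "n-gram" information.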
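The difference in receptive field can be checked with a small experiment. The sketch below is illustrative only: the sizes, the random input, and the variable names are assumptions, not part of the original project. It perturbs the first token of a sequence and compares how the last output position of a convolution with filter_size 3 and of an LSTM react.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

emb_dim, hidden, seq_len = 8, 16, 6           # arbitrary sizes for illustration
x = torch.randn(1, seq_len, emb_dim)          # (batch, seq_len, emb_dim)
x_changed = x.clone()
x_changed[0, 0] += 1.0                        # perturb only the first token

# Convolution with filter_size=3: each output position covers 3 adjacent tokens.
conv = nn.Conv1d(emb_dim, hidden, kernel_size=3)
conv_a = conv(x.transpose(1, 2))              # Conv1d expects (batch, channels, seq_len)
conv_b = conv(x_changed.transpose(1, 2))
# The last window covers tokens 3..5, so it never sees token 0.
conv_last_unchanged = torch.allclose(conv_a[:, :, -1], conv_b[:, :, -1])

# LSTM: the state at the last step depends on every earlier token.
lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
out_a, _ = lstm(x)
out_b, _ = lstm(x_changed)
lstm_last_diff = (out_a[:, -1] - out_b[:, -1]).abs().max().item()

print("conv last output unchanged:", conv_last_unchanged)
print("max change in LSTM last output:", lstm_last_diff)
```

The convolution output at the last position is unaffected by the change, while the LSTM output at the last step shifts, which is the sense in which the recurrent model sees a variable-length context.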
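The RNN is a standard network in natural language processing, used in sequence labeling, named entity recognition, seq2seq models, and many other settings. The paper Recurrent Neural Network for Text Classification with Multi-Task Learning describes how to design an RNN for classification. As the architecture diagram shows, an LSTM reads the sentence, the hidden state at the last step serves as the encoding of the whole sentence, and that encoding is fed directly into a fully connected layer with softmax output.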
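This many-to-one design can be written compactly. The notation is introduced here for illustration and is not taken from the paper: $T$ is the sequence length, $x_t$ the embedding of token $t$, and $W$, $b$ the weights and bias of the fully connected layer.

```latex
\begin{aligned}
(h_t, c_t) &= \mathrm{LSTM}(x_t, h_{t-1}, c_{t-1}), \quad t = 1, \dots, T \\
\hat{y} &= \mathrm{softmax}(W h_T + b)
\end{aligned}
```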
PyTorch implementation (see the GitHub repository for details):
import torch
import torch.nn as nn


# Recurrent neural network (many-to-one)
class TextRNN(nn.Module):
    def __init__(self, args):
        super(TextRNN, self).__init__()
        embedding_dim = args.embedding_dim
        label_num = args.label_num
        vocab_size = args.vocab_size
        self.hidden_size = args.hidden_size
        self.layer_num = args.layer_num
        self.bidirectional = args.bidirectional
        self.direction_num = 2 if self.bidirectional else 1

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        if args.static:
            # Load pretrained word vectors; freeze them unless fine-tuning is requested
            self.embedding = nn.Embedding.from_pretrained(args.vectors, freeze=not args.fine_tune)
        self.lstm = nn.LSTM(embedding_dim,     # feature size of x, i.e. embedding_dim
                            self.hidden_size,  # number of hidden units
                            self.layer_num,    # number of stacked layers
                            batch_first=True,  # input is (batch_size, seq_length, embedding_dim)
                            bidirectional=self.bidirectional)
        self.fc = nn.Linear(self.hidden_size * self.direction_num, label_num)

    def forward(self, x):
        # x has shape (batch_size, max_len); max_len can be fixed through torchtext
        # or left to default to the longest sample in the batch
        x = self.embedding(x)  # (batch_size, time_step, input_size=embedding_dim)
        # Initial hidden and cell states, both of shape
        # (num_layers * direction_num, batch_size, hidden_size), created on the same device as x
        h0 = torch.zeros(self.layer_num * self.direction_num, x.size(0), self.hidden_size, device=x.device)
        c0 = torch.zeros_like(h0)
        # out has shape (batch_size, seq_length, hidden_size * direction_num);
        # hn and cn hold the final hidden and cell states and have the same shape as h0 and c0
        out, (hn, cn) = self.lstm(x, (h0, c0))
        # Keep only the last time step: (batch_size, hidden_size * direction_num)
        out = self.fc(out[:, -1, :])
        return out
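The listing above takes the sentence encoding from out[:, -1, :] and also receives hn from the LSTM. How the two relate can be checked with a short experiment. This is a sketch with arbitrary sizes and random input, not code from the original repository.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len, emb_dim, hidden, layers = 4, 7, 8, 16, 2  # arbitrary sizes
lstm = nn.LSTM(emb_dim, hidden, layers, batch_first=True, bidirectional=True)
x = torch.randn(batch_size, seq_len, emb_dim)

# When the initial states are omitted, PyTorch uses zeros.
out, (hn, cn) = lstm(x)
# out: (batch, seq_len, 2 * hidden); hn, cn: (layers * 2, batch, hidden)

fwd_final = hn[-2]  # last layer, forward direction (has read tokens 0..T-1)
bwd_final = hn[-1]  # last layer, backward direction (has read tokens T-1..0)

# The forward half of `out` at the last step equals the forward final state.
fwd_matches = torch.allclose(out[:, -1, :hidden], fwd_final)
# The backward half finishes at the first step, not the last one.
bwd_matches = torch.allclose(out[:, 0, hidden:], bwd_final)

# One possible sentence encoding that uses both completed directions.
sentence_vec = torch.cat([fwd_final, bwd_final], dim=1)  # (batch, 2 * hidden)

print(fwd_matches, bwd_matches, tuple(sentence_vec.shape))
```

So hn does hold the final state of each layer and direction. With bidirectional=True, the backward half of out[:, -1, :] has only processed the last token, so concatenating hn[-2] and hn[-1] is a common alternative input for the fully connected layer.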
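import re

import jieba
# Field, TabularDataset and Iterator belong to the legacy torchtext API: they live in
# torchtext.data before torchtext 0.9 and in torchtext.legacy.data in 0.9 to 0.11
from torchtext import data
from torchtext.vocab import Vectors


def tokenizer(text):  # create a tokenizer function
    # Replace everything except Chinese characters, ASCII letters and digits with a space
    regex = re.compile(r'[^\u4e00-\u9fa5A-Za-z0-9]')
    text = regex.sub(' ', text)
    # Segment with jieba and drop whitespace-only tokens
    return [word for word in jieba.cut(text) if word.strip()]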
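# Load the stop-word list used to filter tokens (one word per line)
def get_stop_words():
    stop_words = []
    # Assumes the file is UTF-8 encoded; adjust `encoding` if it is not
    with open('data/stopwords.txt', encoding='utf-8') as file_object:
        for line in file_object:
            line = line.strip()  # strip() removes the newline without cutting the last character
            if line:
                stop_words.append(line)
    return stop_words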
def load_data(args):
    print('Loading data...')
    stop_words = get_stop_words()  # load the stop-word list
    # To pad or truncate every text to a fixed length, pass fix_length;
    # otherwise torchtext pads each batch to its longest sample:
    # text = data.Field(sequential=True, tokenize=tokenizer, fix_length=args.max_len, stop_words=stop_words)
    # Note: Field defaults to batch_first=False, so batches arrive as (seq_len, batch_size)
    text = data.Field(sequential=True, lower=True, tokenize=tokenizer, stop_words=stop_words)
    label = data.Field(sequential=False)

    train, val = data.TabularDataset.splits(
        path='data/',
        skip_header=True,
        train='train.tsv',
        validation='validation.tsv',
        format='tsv',
        fields=[('index', None), ('label', label), ('text', text)],
    )

    if args.static:
        # Build the vocabulary together with pretrained word vectors
        text.build_vocab(train, val, vectors=Vectors(name="/brucewu/projects/pytorch_tutorials/chinese_text_cnn/data/eco_article.vector"))
        args.embedding_dim = text.vocab.vectors.size()[-1]
        args.vectors = text.vocab.vectors
    else:
        text.build_vocab(train, val)
    # The label vocabulary may contain special tokens such as <unk>,
    # so label_num can be larger than the number of classes
    label.build_vocab(train, val)

    train_iter, val_iter = data.Iterator.splits(
        (train, val),
        sort_key=lambda x: len(x.text),
        # Mini-batches for training; the whole validation set as a single batch
        batch_sizes=(args.batch_size, len(val)),
        # -1 selects the CPU in early torchtext releases; later releases expect
        # a torch.device or a string such as 'cpu'
        device=-1
    )
    args.vocab_size = len(text.vocab)
    args.label_num = len(label.vocab)
    return train_iter, val_iter
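Because torchtext pads each batch to its longest sample, a model that reads the last time step can end up reading padding positions for the shorter texts. One possible remedy, sketched here under assumptions, is pack_padded_sequence from PyTorch. The tensors and lengths below are made up for illustration; with the legacy torchtext API the lengths could come from Field(include_lengths=True), which the original code does not use.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

emb_dim, hidden = 8, 16                      # arbitrary sizes for illustration
lstm = nn.LSTM(emb_dim, hidden, batch_first=True)

# Two embedded sequences with true lengths 5 and 3, padded to length 5.
lengths = torch.tensor([5, 3])               # hypothetical lengths, kept on the CPU
x = torch.randn(2, 5, emb_dim)
x[1, 3:] = 0.0                               # stand-in for padding vectors

# Packing tells the LSTM where each sequence really ends.
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
_, (hn, _) = lstm(packed)
last_real = hn[-1]                           # (batch, hidden), in the original batch order

# Reference: run the short sequence on its own, without padding.
out_short, _ = lstm(x[1:2, :3])
packed_is_exact = torch.allclose(last_real[1], out_short[0, -1], atol=1e-6)

# Without packing, the last step of the short sequence has also consumed the padding.
out_padded, _ = lstm(x)
padded_differs = not torch.allclose(out_padded[1, -1], out_short[0, -1], atol=1e-6)

print(packed_is_exact, padded_differs)
```

With packing, hn holds the state at each text's true last token, so it can be passed to the fully connected layer in place of the slice at the last padded position.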
References: