PyTorch NLP Basic Models, Part 3: TextCNN

1. Model Principles

TextCNN is a variant of the CNN. CNNs have mainly been used for image classification; Yoon Kim proposed TextCNN in the paper "Convolutional Neural Networks for Sentence Classification" (2014), applying convolutional neural networks to text classification. It uses multiple kernels of different sizes to extract key information from a sentence (similar to n-grams with multiple window sizes), which lets it capture local correlations better.
The detailed pipeline of the model is as follows (a small shape walkthrough in code appears after the list):
(Figure 1: the TextCNN architecture)
Embedding: the first layer is the 7×5 sentence matrix on the far left of the figure. Each row is a word vector of dimension 5; this is analogous to the raw pixels of an image.
Convolution: the matrix then passes through one-dimensional convolutions with kernel_sizes = (2, 3, 4), where each kernel size has two output channels.
MaxPooling: the third layer is a 1-max pooling layer, so sentences of different lengths all become fixed-length representations after pooling.
FullConnection and Softmax: finally, a fully connected softmax layer outputs the probability of each class.
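To make the shape bookkeeping concrete, here is a minimal sketch (my own illustration, not code from the paper) that reproduces the figure's numbers: a 7×5 sentence matrix, kernel sizes (2, 3, 4) with 2 output channels each, and 1-max pooling:

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 1, 7, 5)              # [batch, channel, words=7, embed=5]
features = []
for k in (2, 3, 4):
    conv = nn.Conv2d(1, 2, (k, 5))       # each kernel spans k words x the full embedding
    h = F.relu(conv(x))                  # [1, 2, 7-k+1, 1]
    p = F.max_pool2d(h, (7 - k + 1, 1))  # 1-max pooling -> [1, 2, 1, 1]
    features.append(p.flatten(1))        # [1, 2] per kernel size
v = torch.cat(features, dim=1)           # [1, 6]: the fixed-length sentence vector
print(v.shape)                           # torch.Size([1, 6])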

2. Code Implementation

This post implements TextCNN on a tiny dataset with a small network. The hyperparameters differ from those in the original paper; the goal is to help readers understand how the network works.

1. Import the required libraries

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

2. Create the data and dictionary

# 3-word sentences (sequence_length is 3)
sentences = ["i love you", "he loves me", "she likes baseball",
             "i hate you", "sorry for that", "this is awful"]
labels = [1, 1, 1, 0, 0, 0]  # 1 is good, 0 is not good.

word_list = " ".join(sentences).split()
word_list = list(set(word_list))
word_dict = {w: i for i, w in enumerate(word_list)}
vocab_size = len(word_dict)
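A quick look at the resulting vocabulary for this toy corpus (the exact indices vary between runs, because set() ordering is arbitrary):

print(vocab_size)         # 16 distinct words
print(word_dict['love'])  # some index in [0, 15]; varies from run to run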

3. Create the batch

inputs = []
for sen in sentences:
    inp = [word_dict[w] for w in sen.split()]
    inputs.append(inp)
inputs = np.asarray(inputs)

targets = []
for out in labels:
    targets.append(out)  # class indices, as expected by nn.CrossEntropyLoss

input_batch = torch.LongTensor(inputs)
target_batch = torch.LongTensor(targets)
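A quick sanity check on the batch tensors (assuming the steps above have run):

print(input_batch.shape)   # torch.Size([6, 3]) - 6 sentences, 3 word ids each
print(target_batch.shape)  # torch.Size([6])    - one class index per sentence
print(input_batch.dtype)   # torch.int64, as required for embedding lookups and CrossEntropyLoss targets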

4. Define the network parameters

# TextCNN parameters
embedding_size = 2   # dimension of each word vector
sequence_length = 3  # words per sentence
n_classes = 2        # 0 or 1
filter_sizes = [2, 2, 2]  # n-gram window sizes
num_filters = 3      # output channels per filter size
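With these settings each convolution sees sequence_length - filter_size + 1 = 2 positions, and 1-max pooling keeps a single value per filter, so the three branches together produce the flattened feature that feeds the classifier:

print(num_filters * len(filter_sizes))  # 9 features go into the final linear layer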

5. Build the network

class TextCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.num_filters_total = num_filters * len(filter_sizes)
        # torch.empty(rows, cols) builds an uninitialized matrix;
        # uniform_(x, y) fills it with random reals drawn from [x, y]
        self.W = nn.Parameter(torch.empty(vocab_size, embedding_size).uniform_(-1, 1))
        self.Weight = nn.Parameter(torch.empty(self.num_filters_total, n_classes).uniform_(-1, 1))
        self.Bias = nn.Parameter(0.1 * torch.ones(n_classes))
        # one Conv2d per filter size; registering them here (instead of creating
        # them inside forward) lets the optimizer actually train their weights
        self.filter_list = nn.ModuleList(
            [nn.Conv2d(1, num_filters, (size, embedding_size)) for size in filter_sizes])

    def forward(self, X):
        # embedded: [batch_size, sequence_length, embedding_size]
        embedded = self.W[X]
        # add a channel dim: [batch_size, channel(=1), sequence_length, embedding_size]
        embedded = embedded.unsqueeze(1)

        pooled_outputs = []
        for filter_size, conv in zip(filter_sizes, self.filter_list):
            # conv output: [batch_size, num_filters(=3), sequence_length - filter_size + 1, 1]
            feature_map = F.relu(conv(embedded))
            # 1-max pooling over all positions: kernel (filter_height, filter_width)
            max_pool = nn.MaxPool2d((sequence_length - filter_size + 1, 1))
            pooled = max_pool(feature_map)
            # permute reorders the dimensions without changing the data itself
            pooled = pooled.permute(0, 3, 1, 2)
            pooled_outputs.append(pooled)

        # cat concatenates the pooled tensors along one dimension
        h_pool = torch.cat(pooled_outputs, len(filter_sizes))
        # [batch_size(=6), num_filters * len(filter_sizes)]
        h_pool_flat = torch.reshape(h_pool, [-1, self.num_filters_total])

        out = torch.mm(h_pool_flat, self.Weight) + self.Bias

        return out
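Before training, a quick throwaway shape check (my own addition, not part of the original tutorial) confirms that the forward pass returns one logit per class:

check_model = TextCNN()  # temporary instance, used only for this check
with torch.no_grad():
    logits = check_model(input_batch)
print(logits.shape)  # torch.Size([6, 2]): one row per sentence, one column per class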

6. Train the model

model = TextCNN()

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training
for epoch in range(5000):
    optimizer.zero_grad()
    output = model(input_batch)

    # output : [batch_size, n_classes]
    # target_batch : [batch_size] (LongTensor, not one-hot)
    loss = criterion(output, target_batch)
    if (epoch + 1) % 1000 == 0:
        print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss.item()))

    loss.backward()
    optimizer.step()

The output is as follows:

Epoch: 1000 cost = 0.709715
Epoch: 2000 cost = 0.593492
Epoch: 3000 cost = 0.156285
Epoch: 4000 cost = 0.092555
Epoch: 5000 cost = 0.076225

7. Test the model

# Test
test_text = 'sorry hate you'
tests = [np.asarray([word_dict[n] for n in test_text.split()])]
test_batch = torch.LongTensor(tests)

# max(1, keepdim=True) returns each row's maximum value and its column index;
# [1] takes the index, i.e. the predicted class
predict = model(test_batch).data.max(1, keepdim=True)[1]
if predict[0][0] == 0:
    print(test_text, "is Bad Mean...")
else:
    print(test_text, "is Good Mean!!")

The test output:

sorry hate you is Bad Mean...
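As an optional extra check (not in the original post), you can score all of the training sentences in one batch:

with torch.no_grad():
    preds = model(input_batch).argmax(dim=1)
for sen, p in zip(sentences, preds.tolist()):
    print(sen, '->', 'Good' if p == 1 else 'Bad')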

References
https://www.cnblogs.com/bymo/p/9675654.html
https://github.com/graykode/nlp-tutorial
