TextCNN Text Classification: Code Implementation

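TextCNN applies a convolutional neural network to text classification. It was first proposed in the paper "Convolutional Neural Networks for Sentence Classification" (Yoon Kim, 2014). The experiments in that paper show that a CNN built on top of pretrained word vectors achieves good text classification results without much hyperparameter tuning.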

1. Model Structure

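The model consists of four parts: an embedding layer, a convolutional layer, a pooling layer, and a fully connected layer.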
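
To make the four layers concrete, the following minimal sketch traces the tensor shape after each stage. The helper `textcnn_shapes` is hypothetical (it is not part of the original code), and the sizes passed to it are the toy values used by the demo in the code section: 6 sentences of 3 tokens, 2-dimensional embeddings, three kernels of height 2 with 3 filters each, and 2 classes.

```python
def textcnn_shapes(batch_size, sequence_length, embedding_size,
                   filter_sizes, num_filters, num_classes):
    """Return the tensor shape after each TextCNN stage (hypothetical helper)."""
    shapes = {}
    # Embedding layer: one vector per token.
    shapes["embedding"] = (batch_size, sequence_length, embedding_size)
    # Convolutional layer: a kernel of height k spans the full embedding
    # width and slides over the tokens, giving sequence_length - k + 1 positions.
    shapes["conv"] = [
        (batch_size, num_filters, sequence_length - k + 1, 1)
        for k in filter_sizes
    ]
    # Pooling layer: the max over all positions leaves one value per filter.
    shapes["pooled"] = [(batch_size, num_filters, 1, 1) for _ in filter_sizes]
    # All branches are concatenated and flattened into one feature vector.
    shapes["features"] = (batch_size, num_filters * len(filter_sizes))
    # Fully connected layer: one score per class.
    shapes["logits"] = (batch_size, num_classes)
    return shapes


shapes = textcnn_shapes(6, 3, 2, [2, 2, 2], 3, 2)
for stage, shape in shapes.items():
    print(stage, shape)
```

These shapes should match the shape comments in the forward pass of the implementation, up to the order of the axes after the permute step.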

2. Code Implementation

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, num_filters, filter_sizes, vocab_size, embedding_size, sequence_length, num_classes):
        super(TextCNN, self).__init__()

        self.sequence_length = sequence_length
        self.filter_sizes = filter_sizes
        self.num_filters_total = num_filters * len(filter_sizes)
        self.W = nn.Embedding(vocab_size, embedding_size)  # embedding table, randomly initialized
        self.Weight = nn.Linear(self.num_filters_total, num_classes, bias=False)
        self.Bias = nn.Parameter(torch.ones([num_classes]))
        # one Conv2d per kernel size; with the demo settings each is Conv2d(1, 3, (2, 2))
        self.filter_list = nn.ModuleList([nn.Conv2d(1, num_filters, (size, embedding_size)) for size in filter_sizes])

    def forward(self, X):
        # embedding layer
        embedded_chars = self.W(X)  # [6, 3, 2] = [batch_size, sequence_length, embedding_size]
        embedded_chars = embedded_chars.unsqueeze(1)  # [6, 1, 3, 2], adds a channel dimension

        # one convolution + max-pooling branch per kernel size
        pooled_outputs = []
        for i, conv in enumerate(self.filter_list):
            h = F.relu(conv(embedded_chars))  # [6, 3, 2, 1]
            mp = nn.MaxPool2d((self.sequence_length - self.filter_sizes[i] + 1, 1))  # pooling window [2, 1]
            pooled = mp(h).permute(0, 3, 2, 1)  # [6, 1, 1, 3]
            pooled_outputs.append(pooled)
        # concatenate along the last (filter) dimension; the original passed
        # len(self.filter_sizes) as the dim, which only works when there are exactly 3 kernel sizes
        h_pool = torch.cat(pooled_outputs, dim=3)  # [6, 1, 1, 9]
        h_pool_flat = torch.reshape(h_pool, [-1, self.num_filters_total])  # [6, 9]

        # output layer
        output = self.Weight(h_pool_flat) + self.Bias  # [6, 2]
        return output

if __name__ == '__main__':
    embedding_size = 2  # word embedding dimension
    sequence_length = 3  # number of tokens per sentence
    num_classes = 2  # number of classes
    filter_sizes = [2, 2, 2]  # kernel heights, in tokens
    num_filters = 3  # number of filters per kernel size

    sentences = ["i love you", "he loves me", "she likes baseball", "i hate you", "sorry for that", "this is awful"]
    labels = [1, 1, 1, 0, 0, 0]  # 1 is good, 0 is not good.

    word_list = " ".join(sentences).split()
    word_list = sorted(set(word_list))  # sorted so that word indices are reproducible across runs
    word_dict = {w: i for i, w in enumerate(word_list)}
    vocab_size = len(word_dict)

    # num_classes is passed explicitly instead of being read from a module-level variable
    model = TextCNN(num_filters, filter_sizes, vocab_size, embedding_size, sequence_length, num_classes)

    criterion = nn.CrossEntropyLoss()  # cross-entropy loss
    optimizer = optim.Adam(model.parameters(), lr=0.001)  # optimizer

    # map each sentence to a row of word indices: [batch_size, sequence_length]
    inputs = torch.tensor(np.asarray([[word_dict[n] for n in sen.split()] for sen in sentences]), dtype=torch.long)
    targets = torch.tensor(labels, dtype=torch.long)

    for epoch in range(10000):
        optimizer.zero_grad()
        output = model(inputs)

        # output: [batch_size, num_classes], targets: [batch_size] (LongTensor, not one-hot)
        loss = criterion(output, targets)
        if (epoch + 1) % 1000 == 0:
            print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss.item()))
        loss.backward()
        optimizer.step()

    # test input (every word must already be in word_dict)
    test_text = 'sorry hate you'
    tests = np.asarray([[word_dict[n] for n in test_text.split()]])
    test_batch = torch.tensor(tests, dtype=torch.long)

    # predict: index of the highest-scoring class, computed without tracking gradients
    with torch.no_grad():
        predict = model(test_batch).argmax(dim=1, keepdim=True)
    if predict[0][0] == 0:
        print(test_text, "is Bad Mean...")
    else:
        print(test_text, "is Good Mean!!")

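
The convolution and pooling loop in the forward pass can be hard to picture from shape comments alone. The sketch below reproduces what a single filter does in plain Python, under simplifying assumptions: one sentence, one filter, no batching. The function `conv_relu_maxpool` and the numbers fed to it are made up for illustration and are not taken from the trained model.

```python
def conv_relu_maxpool(embedded, kernel, bias=0.0):
    """Slide one filter over token windows, apply ReLU, then take the max."""
    k = len(kernel)  # filter height, in tokens
    activations = []
    for start in range(len(embedded) - k + 1):
        window = embedded[start:start + k]
        # Element-wise product of the window and the kernel, summed over
        # both the token axis and the embedding axis.
        total = sum(
            w * x
            for kernel_row, token_vec in zip(kernel, window)
            for w, x in zip(kernel_row, token_vec)
        )
        activations.append(max(0.0, total + bias))  # ReLU
    # Max-over-time pooling keeps only the strongest response.
    return activations, max(activations)


# Three tokens with 2-dimensional embeddings (illustrative values).
embedded = [[1.0, 0.0], [0.0, 2.0], [-3.0, 0.0]]
# One filter of height 2 spanning the full embedding width.
kernel = [[1.0, 1.0], [1.0, 1.0]]

activations, pooled = conv_relu_maxpool(embedded, kernel)
print(activations, pooled)  # [3.0, 0.0] 3.0
```

A 3-token sentence and a kernel of height 2 give 3 - 2 + 1 = 2 window positions, which is why the pooling window in the model has height sequence_length - filter_size + 1.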

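The above is a simple implementation of TextCNN. TextCNN can be used for classification tasks such as sentiment classification and general text classification, and it can also serve to measure how much different word vector models improve a classification task.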
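
One way to compare word vector models is to replace the randomly initialized embedding table with pretrained vectors. The sketch below shows only the preparation step: aligning pretrained vectors with the indices in a vocabulary dictionary such as word_dict. The function `build_embedding_matrix`, the tiny `pretrained` dictionary, and its values are all hypothetical; real vectors would be loaded from a file produced by a model such as word2vec or GloVe.

```python
import random


def build_embedding_matrix(word_dict, pretrained, embedding_size, seed=0):
    """Build one row per vocabulary index (hypothetical helper).

    Words found in `pretrained` reuse their vector; other words get small
    random values so that they can still be learned during training.
    """
    rng = random.Random(seed)
    matrix = [None] * len(word_dict)
    for word, index in word_dict.items():
        if word in pretrained:
            matrix[index] = list(pretrained[word])
        else:
            matrix[index] = [rng.uniform(-0.1, 0.1) for _ in range(embedding_size)]
    return matrix


word_dict = {"i": 0, "love": 1, "you": 2}
pretrained = {"love": [0.9, 0.1], "you": [0.2, 0.3]}  # made-up vectors
matrix = build_embedding_matrix(word_dict, pretrained, embedding_size=2)
print(matrix[1])  # [0.9, 0.1]
```

In PyTorch, a matrix like this can be converted to a float tensor and passed to nn.Embedding.from_pretrained, whose freeze argument controls whether the vectors are updated during training.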
