Text Sentiment Classification with fastNLP

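Most neural networks used for NLP tasks can be viewed as a word-embedding layer (embeddings) plus two kinds of modules: an encoder and a decoder. Taking text classification as an example, the figure below shows the model flow of a text classifier built with BiLSTM+Attention: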

The basic approach to text classification (reference: https://fastnlp.readthedocs.io/zh/latest/index.html )
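The embedding, encoder, decoder split can be sketched in a few lines of plain Python. This is only a toy under stated assumptions: the vectors, the attention query and the decoder weights are hypothetical hand-picked numbers, and the encoder is reduced to attention pooling with no BiLSTM in front of it.

```python
import math

# Hypothetical toy parameters; a real model learns these during training.
EMBEDDINGS = {              # embedding layer: token -> vector
    "good": [1.0, 0.0],
    "bad": [0.0, 1.0],
    "very": [0.5, 0.5],
}
QUERY = [1.0, -1.0]         # attention query used by the encoder
WEIGHTS = [[-1.0, 1.0],     # decoder weights, row 0 scores class 0
           [1.0, -1.0]]     # row 1 scores class 1

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def embed(tokens):
    """Embedding layer: look up one vector per token."""
    return [EMBEDDINGS[t] for t in tokens]

def encode(vectors):
    """Encoder stand-in: attention-weighted sum of the token vectors."""
    scores = [sum(q * x for q, x in zip(QUERY, v)) for v in vectors]
    attn = softmax(scores)
    dim = len(vectors[0])
    return [sum(a * v[d] for a, v in zip(attn, vectors)) for d in range(dim)]

def decode(sentence_vec):
    """Decoder: linear layer followed by argmax over the class scores."""
    logits = [sum(w * x for w, x in zip(row, sentence_vec)) for row in WEIGHTS]
    return logits.index(max(logits))

def classify(tokens):
    return decode(encode(embed(tokens)))

print(classify(["very", "good"]), classify(["very", "bad"]))
```

Whatever the encoder is (CNN, BiLSTM, attention), the sentence passes through the same three stages: tokens become vectors, the vectors become one sentence representation, and that representation becomes a class.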

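fastNLP is a lightweight natural language processing (NLP) toolkit. You can use it to finish an NLP task quickly, or to build more complex models quickly in research. fastNLP has the following features:

(1) A unified tabular data container that simplifies data preprocessing;

(2) Built-in Loaders and Pipes for many datasets, which saves writing preprocessing code;

(3) A range of convenient NLP tools, such as embedding loading (including ElmoEmbedding and BertEmbedding) and caching of intermediate data;

(4) Automatic download of some datasets and pre-trained models;

(5) A variety of neural network components and reproduced models (covering Chinese word segmentation, named entity recognition, syntactic parsing, text classification, text matching, coreference resolution, summarization and other tasks);

(6) A Trainer with many built-in callback functions, convenient for experiment logging, exception catching and so on.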

Task description

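The goal of this recommended-review display task is to mine, from real user reviews, short sentences that are suitable as recommendation reasons. The recommendation reasons shown by a review app are limited in length, whereas real user reviews are fluent and carry complete information. Overall, both carry positive or negative user sentiment, but a displayed recommendation reason has to be more relevant in content than an ordinary review and needs to be strongly appealing as text.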

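The dataset files are divided into a training set and a test set. The corresponding files are:

Labeled training data: train_shuffle.txt

Unlabeled test data: test_handout.txt

Note that the row index of test_handout.txt starts from 0 and corresponds to the ID column, and the predicted probability that a review is "display" goes in the Prediction column.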
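As an illustration of the ID and Prediction columns described above, here is a minimal sketch that writes such a file with the standard csv module. The probabilities and the four-decimal formatting are assumptions made for the example; only the two column names and the 0-based ID come from the task description.

```python
import csv
import io

def write_submission(probabilities, out):
    """Write one row per test line: its 0-based ID and its predicted probability."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ID", "Prediction"])
    for row_id, prob in enumerate(probabilities):
        writer.writerow([row_id, f"{prob:.4f}"])

# Hypothetical probabilities for the first three lines of the test file.
buffer = io.StringIO()
write_submission([0.91, 0.07, 0.5], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

Because the ID is just the position of the review in test_handout.txt, the predictions must be written in the same order as the test lines.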

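It should be noted that, because the annotation involves subjective preference, a review labeled "do not display" (0) is not necessarily a truly negative review, and vice versa. This does not introduce much ambiguity into the task, however, and a baseline algorithm can reach high performance on the test set.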

# Import PyTorch
import torch
import torch.nn as nn

from fastNLP.io.loader import CSVLoader

# Each line of the training file is split on '\t': the first item fills the
# 'target' field and the second item fills the 'raw_words' field.
dataset_loader = CSVLoader(headers=('target', 'raw_words'), sep='\t')
# The test file has no label, so each line only fills 'raw_words'.
testset_loader = CSVLoader(headers=['raw_words'])

train_path = r'train_shuffle.txt'
test_path = r'test_handout.txt'

# _load() reads one file and returns a DataSet.
dataset = dataset_loader._load(train_path)
testset = testset_loader._load(test_path)
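To make the role of the headers and sep arguments concrete, the following sketch parses tab-separated lines into named fields with the standard csv module. It is a simplified imitation and not fastNLP's implementation; load_tsv is a hypothetical helper and the two sample reviews are invented.

```python
import csv
import io

def load_tsv(lines, headers, sep="\t"):
    """Turn delimited lines into a list of {field_name: value} records."""
    reader = csv.reader(lines, delimiter=sep, quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        if len(row) != len(headers):
            raise ValueError(f"expected {len(headers)} fields, got {len(row)}: {row!r}")
        records.append(dict(zip(headers, row)))
    return records

# Hypothetical two-line sample in the assumed "label<TAB>text" layout.
sample = io.StringIO("1\t环境不错，菜也好吃\n0\t等了很久才上菜\n")
records = load_tsv(sample, headers=("target", "raw_words"))
print(records)
```

Every value comes back as a string, including the label, which is why the label field has to be converted with int() before training.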

# Split each sentence into tokens; see the DataSet.apply() method for details.
import jieba

print(jieba.__version__)

def get_tokenized(text, words=True):
    '''
    @params:
        text: the raw text of one review (a string)
        words: True to tokenize by word, False to tokenize by character
    @return: the list of tokens of this review
    '''
    if words:
        # Encode by word, using jieba's accurate mode
        return list(jieba.cut(text, cut_all=False))
    else:
        # Encode by character
        return list(text)

# For character-level input, pass words=False to get_tokenized() in the two 'words' calls below.
dataset.apply(lambda ins: get_tokenized(ins['raw_words']), new_field_name='words', is_input=True)
dataset.apply(lambda ins: len(ins['words']), new_field_name='seq_len', is_input=True)
dataset.apply(lambda x: int(x['target']), new_field_name='target', is_target=True)

testset.apply(lambda ins: get_tokenized(ins['raw_words']), new_field_name='words', is_input=True)
testset.apply(lambda ins: len(ins['words']), new_field_name='seq_len', is_input=True)
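The difference between word-level and character-level input can be shown without jieba. In this sketch the word segmentation is a hypothetical hand-written result standing in for what jieba.cut would return; flattening it with itertools.chain.from_iterable gives one token per character.

```python
from itertools import chain

# Hypothetical segmentation of one review; jieba.cut would produce the real one.
word_tokens = ["味道", "很", "好"]

# Iterating over each word yields its characters, so flattening the word
# tokens gives the character-level sequence of the same review.
char_tokens = list(chain.from_iterable(word_tokens))

print(word_tokens, len(word_tokens))
print(char_tokens, len(char_tokens))
```

The two choices give different 'words' and 'seq_len' values for the same review, and the embedding has to match the choice: a word-level vocabulary pairs with word vectors, a character-level one with character vectors.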

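from fastNLP import Vocabulary

# Split the DataSet into two DataSets according to ratio.
# ratio (float): 0 < ratio < 1
train_data, dev_data = dataset.split(0.1, shuffle=False)
print(len(train_data), len(dev_data), len(testset))

# Build the vocabulary from the 'words' field, keeping words that occur at least twice,
# then replace the words in every split with their vocabulary indices.
vocab = Vocabulary(min_freq=2).from_dataset(dataset, field_name='words')
vocab.index_dataset(train_data, dev_data, testset, field_name='words', new_field_name='words')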
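What building a vocabulary with min_freq=2 and then indexing a dataset amounts to can be sketched with collections.Counter. This is a simplified imitation, not fastNLP's Vocabulary: the helper names, the tiny corpus and the exact index order are assumptions made for the example.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=2, padding="<pad>", unknown="<unk>"):
    """Map each sufficiently frequent token to an integer index."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    word2idx = {padding: 0, unknown: 1}
    for tok, freq in counts.most_common():
        if freq >= min_freq:
            word2idx[tok] = len(word2idx)
    return word2idx

def to_indices(tokens, word2idx, unknown="<unk>"):
    """Replace tokens by indices; rare or unseen tokens share the unknown index."""
    return [word2idx.get(tok, word2idx[unknown]) for tok in tokens]

# Hypothetical tokenized reviews.
corpus = [["味道", "很", "好"], ["服务", "很", "好"], ["味道", "一般"]]
vocab = build_vocab(corpus, min_freq=2)
indexed = to_indices(["服务", "很", "好"], vocab)
print(vocab)
print(indexed)
```

Here "服务" occurs only once, so it falls below min_freq and is mapped to the unknown index. Raising min_freq shrinks the vocabulary at the cost of sending more tokens to that single index.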

from fastNLP.embeddings import StaticEmbedding, StackEmbedding

# Character-level embedding (not used by the CNN model below)
fastnlp_embed = StaticEmbedding(vocab, model_dir_or_name='cn-char-fastnlp-100d', min_freq=2)

# The pre-trained word embeddings are in Tencent_AILab_ChineseEmbedding.txt
cn_tencent = r'Tencent_AILab_ChineseEmbedding.txt'
tencent_embed_word = StaticEmbedding(vocab, model_dir_or_name=cn_tencent, min_freq=2)

from fastNLP.models import CNNText

# Two-class CNN text classifier on top of the Tencent word embeddings
model_CNN = CNNText(tencent_embed_word, num_classes=2, dropout=0.1)
print(model_CNN)

from fastNLP import Trainer, CrossEntropyLoss, AccuracyMetric, BCELoss

# Train on train_data and evaluate accuracy on dev_data
trainer_CNN = Trainer(model=model_CNN, train_data=train_data, dev_data=dev_data,
                      loss=CrossEntropyLoss(), metrics=AccuracyMetric())
trainer_CNN.train()
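The two quantities tracked during training, cross-entropy loss and accuracy, can be written out in plain Python for a two-class problem. This sketch shows the standard definitions and not fastNLP's own code; the logits and labels are hypothetical.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def accuracy(batch_logits, targets):
    """Fraction of examples whose highest-scoring class equals the label."""
    correct = sum(
        1 for logits, t in zip(batch_logits, targets)
        if logits.index(max(logits)) == t
    )
    return correct / len(targets)

# Hypothetical two-class logits for a batch of three reviews.
batch_logits = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.2]]
targets = [0, 1, 1]

mean_loss = sum(cross_entropy(l, t) for l, t in zip(batch_logits, targets)) / len(targets)
acc = accuracy(batch_logits, targets)
print(f"loss={mean_loss:.4f} acc={acc:.4f}")
```

The loss is what the optimizer minimizes on the training set, while accuracy on the dev set is what tells you whether the model generalizes.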

# Predict over the whole test set
import os
import pandas as pd
import torch

def batch_predict(model, data):
    model.eval()  # switch off dropout for inference
    rows = []
    with torch.no_grad():
        for i in range(len(data)):
            tensor = torch.tensor(data.words[i])
            # The model expects a batch, so reshape to (1, seq_len)
            pred = model.predict(tensor.view(1, -1))
            # pred['pred'] holds the predicted class (0 or 1) of this review
            label = pred['pred'].cpu().numpy()[0]
            rows.append({'ID': i, 'Prediction': float(label)})
    # Build the frame once; DataFrame.append was removed in pandas 2.0
    submission = pd.DataFrame(rows, columns=['ID', 'Prediction'])
    submission['ID'] = submission.ID.astype(int)
    submission['Prediction'] = submission.Prediction.astype(float)
    # Return the results as a pd.DataFrame
    return submission

# Run the prediction and save the results in the submission format for upload to the platform
submission_path = r'data\Comments9120'
submission = batch_predict(model_CNN, testset)
submission.to_csv(os.path.join(submission_path, 'submission-CNN-20200229-words.csv'), index=False)
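The task description asks for a probability in the Prediction column. One common way to obtain it from a two-class classifier is to apply softmax to the two raw class scores and keep the value for class 1. The sketch below assumes such scores (logits) are available for each review; the numbers are hypothetical and positive_probability is an illustrative helper, not part of fastNLP.

```python
import math

def positive_probability(logits):
    """Softmax probability of class 1 given [score_class0, score_class1]."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)

# Hypothetical logits for three test reviews.
test_logits = [[0.0, 0.0], [-1.0, 2.0], [3.0, -1.0]]
probs = [positive_probability(l) for l in test_logits]
print([round(p, 4) for p in probs])
```

Unlike a hard 0/1 label, a probability keeps the model's confidence, which matters when the submission is scored with a ranking-style metric.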

References:

https://www.kesci.com/org/boyuai/workspace/project

https://www.boyuai.com/elites/course/cZu18YmweLv10OeV/jupyter/pPLJ2YtrFxECbsSqd9l-Y

https://fastnlp.readthedocs.io/zh/latest/index.html
