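These notes cover sentence classification: using deep learning methods to classify sentence datasets.
Sentence classification is the task of assigning a given sentence exactly one label from a predefined set of categories.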
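Sentence classification covers tasks such as sentiment analysis and question classification. Sentiment analysis, also known as polarity analysis, opinion extraction, opinion mining, sentiment mining, or subjectivity analysis, is the process of analyzing, processing, summarizing, and reasoning about subjective text that carries sentiment. One example is inferring from review text how users feel about attributes of a digital camera such as zoom, price, size, weight, flash, and ease of use.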
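Typical applications include gauging positive and negative opinions about movies, products, tweets, and so on, and using them to improve products and services, identify competitors' strengths and weaknesses, or predict stock movements.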
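Dataset statistics (c: number of target classes; l: average sentence length; N: dataset size; |V|: vocabulary size; |V_pre|: number of words found in the pre-trained word vectors; Test: test set size, where CV means there is no standard train/test split and cross-validation is used):

Data | c | l | N | \|V\| | \|V_pre\| | Test |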
---|---|---|---|---|---|---|
MR | 2 | 20 | 10662 | 18765 | 16448 | CV |
SST-1 | 5 | 18 | 11855 | 17836 | 16262 | 2210 |
SST-2 | 2 | 19 | 9613 | 16185 | 14838 | 1821 |
Subj | 2 | 23 | 10000 | 21323 | 17913 | CV |
TREC | 6 | 10 | 5952 | 9592 | 9125 | 500 |
CR | 2 | 19 | 3775 | 5340 | 5046 | CV |
MPQA | 2 | 3 | 10606 | 6246 | 6083 | CV |
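- MR: Movie reviews, with one sentence per review. [1]
- SST-1: Stanford Sentiment Treebank, an extension of MR that provides train/dev/test splits and five fine-grained labels (very positive, positive, neutral, negative, very negative).
- SST-2: Same as SST-1, but with neutral reviews removed and binary labels. [2]
- Subj: Subjectivity dataset; the task is to classify a sentence as subjective or objective. [3]
- TREC: TREC question dataset; the task is to classify a question into one of 6 types (about a person, a location, numeric information, etc.). [4]
- CR: Customer reviews of various products; the task is to predict positive/negative reviews. [5]
- MPQA: Opinion polarity detection task of the MPQA dataset. [6]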
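For the datasets marked CV in the table (MR, Subj, CR, MPQA), accuracy is usually reported by cross-validation rather than on a fixed test set. The sketch below shows one way to produce the folds. The function name `k_fold_indices`, the choice of `k=10`, and the fixed seed are assumptions for illustration, not something the datasets prescribe.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# MR contains 10662 sentences according to the table above.
splits = list(k_fold_indices(n_samples=10662, k=10))
```

Each sentence lands in exactly one test fold, so averaging the accuracy over the folds uses every example for evaluation once.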
The task is usually broken down into several subtasks:
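Word segmentation (tokenization)
Split the sentence into words according to its meaning. Further steps such as removing stop words, tagging parts of speech, or converting words to word vectors may also be needed.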
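A minimal sketch of this step for English text is shown here. It assumes words can be found with a simple regular expression, which does not hold for languages written without spaces such as Chinese (those need a dedicated segmenter). The `STOP_WORDS` set and the `tokenize` helper are hypothetical names invented for this example.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "this"}


def tokenize(sentence, stop_words=STOP_WORDS):
    """Lowercase the sentence, extract word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    return [t for t in tokens if t not in stop_words]


tokens = tokenize("This is a great movie, and the acting is superb!")
print(tokens)  # ['great', 'movie', 'acting', 'superb']
```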
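Feature extraction
Sometimes the tokens are not fed to the classifier directly; in that case features are extracted first to make classification easier.
Common features: TF-IDF, LDA, LSI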
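To make the TF-IDF feature concrete, here is a small sketch using only the standard library. It assumes one common variant (term frequency as count divided by document length, inverse document frequency as log(N / df)). Libraries often apply extra smoothing or normalization, so their numbers will differ. The function name `tf_idf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


docs = [
    ["great", "movie", "great", "acting"],
    ["boring", "movie"],
    ["great", "camera"],
]
weights = tf_idf(docs)
```

In the first document, "acting" gets a higher weight than "movie" because it appears in only one of the three documents, while a term that occurs in every document gets weight 0.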
Building the classifier
Feed the features or word vectors into a model that assigns the sentence to a class.
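As one concrete instance of this step, the sketch below implements a bag-of-words Multinomial Naive Bayes classifier, the idea behind the MNB baseline in these notes. It assumes add-one (Laplace) smoothing and ignores words not seen during training. The class layout and the toy training data are invented for illustration and are not taken from any cited paper.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Bag-of-words Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        self.vocab = {w for doc in docs for w in doc}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_label / total_docs)  # log prior
            for word in doc:
                if word in self.vocab:  # skip words unseen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train_docs = [
    ["great", "movie"], ["superb", "acting"],
    ["boring", "plot"], ["terrible", "movie"],
]
train_labels = ["pos", "pos", "neg", "neg"]
clf = MultinomialNB().fit(train_docs, train_labels)
print(clf.predict(["great", "acting"]))  # pos
print(clf.predict(["boring", "movie"]))  # neg
```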
- NBSVM: Naive Bayes SVM
- MNB: Multinomial Naive Bayes [7]
- combine-skip
- combine-skip + NB [8]
Model | MR | SST-1 | SST-2 | Subj | TREC | CR | MPQA |
---|---|---|---|---|---|---|---|
NBSVM | 79.4 | - | - | 93.2 | - | 81.8 | 86.3 |
MNB | 79.0 | - | - | 93.6 | - | 80.0 | 86.3 |
combine-skip | 76.5 | - | - | 93.6 | 92.2 | 80.1 | 87.1 |
combine-skip+NB | 80.4 | - | - | 93.6 | - | 81.3 | 87.5 |
- RCNN: Recurrent Convolutional Neural Networks [9]
- S-LSTM: Long Short-Term Memory Over Recursive Structures [10]
- LSTM: Long Short-Term Memory
- BLSTM: Bidirectional Long Short-Term Memory
- Tree-LSTM: Tree-structured Long Short-Term Memory [11]
- LSTMN: Long Short-Term Memory-Network [12]
- Multi-Task: Recurrent Neural Network for Text Classification with Multi-Task Learning [13]
- BLSTM-Att: Bidirectional Long Short-Term Memory, attention-based model
- BLSTM-2DPooling: Bidirectional Long Short-Term Memory Networks with Two-Dimensional Max Pooling
- BLSTM-2DCNN: Bidirectional Long Short-Term Memory Networks with 2D convolution [14]
Model | MR | SST-1 | SST-2 | Subj | TREC | CR | MPQA |
---|---|---|---|---|---|---|---|
RCNN | - | 47.21 | - | - | - | - | - |
S-LSTM | - | - | 81.9 | - | - | - | - |
LSTM | - | 46.4 | 84.9 | - | - | - | - |
BLSTM | - | 49.1 | 87.5 | - | - | - | - |
Tree-LSTM | - | 51.0 | 88.0 | - | - | - | - |
LSTMN | - | 49.3 | 87.3 | - | - | - | - |
Multi-Task | - | 49.6 | 87.9 | 94.1 | - | - | - |
BLSTM | 80.0 | 49.1 | 87.6 | 92.1 | 93.0 | - | - |
BLSTM-Att | 81.0 | 49.8 | 88.2 | 93.5 | 93.8 | - | - |
BLSTM-2DPooling | 81.5 | 50.5 | 88.3 | 93.7 | 94.8 | - | - |
BLSTM-2DCNN | 82.3 | 52.4 | 89.5 | 94.0 | 96.1 | - | - |
- DCNN: Dynamic Convolutional Neural Network [15]
- CNN-non-static: Convolutional Neural Networks, the pretrained vectors are fine-tuned for each task
- CNN-multichannel: Convolutional Neural Networks with two sets of word vectors [16]
- TBCNN: Tree-based Convolutional Neural Network [17]
- Molding-CNN: Molding Convolutional Neural Networks [18]
- CNN-Ana: Non-static GloVe+word2vec CNN [19]
- MVCNN: Multichannel Variable-Size Convolution [20]
- DSCNN: Dependency Sensitive Convolutional Neural Networks [21]
Model | MR | SST-1 | SST-2 | Subj | TREC | CR | MPQA |
---|---|---|---|---|---|---|---|
DCNN | - | 48.5 | 86.8 | - | 93.0 | - | - |
CNN-non-static | 81.5 | 48.0 | 87.2 | 93.4 | 93.6 | 84.3 | 89.5 |
CNN-multichannel | 81.1 | 47.4 | 88.1 | 93.2 | 92.2 | 85.0 | 89.4 |
TBCNN | - | 51.4 | 87.9 | - | 96.0 | - | - |
Molding-CNN | - | 51.2 | 88.6 | - | - | - | - |
CNN-Ana | 81.02 | 45.98 | 85.45 | 93.66 | 91.37 | 84.65 | 89.55 |
MVCNN | - | 49.6 | 89.4 | - | - | - | - |
DSCNN | 81.5 | 49.7 | 89.1 | 93.2 | 95.4 | - | - |
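CNN sentence classifiers of the CNN-non-static / CNN-multichannel kind typically slide filters over windows of consecutive word vectors and then keep the strongest response of each filter (max-over-time pooling). The sketch below shows that building block for a single filter in plain Python. It is not the implementation of any model in the table. The function name, the ReLU nonlinearity, and all numeric values are assumptions chosen for illustration.

```python
def conv_max_over_time(embeddings, filt, bias=0.0):
    """Apply one filter to every window of h consecutive word vectors,
    pass each response through ReLU, and return the maximum.

    embeddings: list of word vectors, one per token.
    filt: list of h weight vectors, one per position in the window.
    Assumes the sentence has at least h tokens (padding is omitted).
    """
    h = len(filt)
    activations = []
    for start in range(len(embeddings) - h + 1):
        window = embeddings[start:start + h]
        total = bias + sum(
            w * x
            for w_vec, x_vec in zip(filt, window)
            for w, x in zip(w_vec, x_vec)
        )
        activations.append(max(0.0, total))  # ReLU
    return max(activations)  # max-over-time pooling


# Toy 2-dimensional "word vectors" for a 4-token sentence (made-up values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# One filter spanning a window of 2 tokens (made-up weights).
bigram_filter = [[1.0, -1.0], [0.5, 0.5]]
feature = conv_max_over_time(sentence, bigram_filter)
print(feature)  # 1.5
```

A full model would use many filters of several window sizes and feed the pooled features into a softmax layer. Pooling to one value per filter is what lets sentences of different lengths map to a fixed-size feature vector.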
- RAE: Recursive Autoencoders with pre-trained word vectors from Wikipedia [22]
- AdaSent: Self-adaptive hierarchical sentence model [23]
- RNTN: Recursive Neural Tensor Network [24]
- DRNN: Deep Recursive Neural Networks [25]
Model | MR | SST-1 | SST-2 | Subj | TREC | CR | MPQA |
---|---|---|---|---|---|---|---|
RAE | 77.7 | 43.2 | 82.4 | - | - | - | 86.4 |
AdaSent | 83.1 | - | - | 95.5 | 92.4 | 86.3 | 93.3 |
RNTN | - | 45.7 | 85.4 | - | - | - | - |
DRNN | - | 49.8 | 86.6 | - | - | - | - |