A Survey of Text Classification

Continuously updated

Introduction

1. Definition

Text classification is the task of assigning a piece of text to one or more predefined categories. It falls within the scope of document classification.
Input:
a document d
a fixed set of classes C = {c1, c2, ..., cn}
Output:
a predicted class ci from C

2. Some simple applications

spam detection
authorship attribution
age/gender identification
sentiment analysis
assigning subject categories, topics or genres
......

Traditional methods

1. Naive Bayes

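Two assumptions:

Bag-of-words assumption:
position doesn't matter
Conditional independence:
the feature probabilities P(xi | c) are independent given the class c, so P(x1, ..., xn | c) = P(x1 | c) * P(x2 | c) * ... * P(xn | c)

To compute these probabilities, use relative frequencies from the training data:
P(c) = (number of documents of class c) / (total number of documents)
P(w | c) = count(w, c) / (total count of all words in documents of class c)

Add-one smoothing keeps P(w | c) from being zero for a word never seen with class c (a constant other than 1 can be added as well):
P(w | c) = (count(w, c) + 1) / (total count of all words in documents of class c + |V|), where V is the vocabulary

To deal with unknown (unseen) words: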

Main features:

very fast, low storage requirements
robust to irrelevant features
good in domains with many equally important features
optimal if the independence assumptions hold
generally less accurate than more sophisticated classifiers
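
The Naive Bayes steps above (counting, add-one smoothing, handling unknown words) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `train_naive_bayes` and `predict` are hypothetical, documents are assumed to be pre-tokenized lists of words, and unknown words are simply skipped at prediction time, which is one common choice rather than the only one.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """docs: list of token lists; labels: one class name per document."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # word_counts[c][w] = count of w in class c
    vocab = set()
    for tokens, c in zip(docs, labels):
        word_counts[c].update(tokens)
        vocab.update(tokens)

    # log P(c): fraction of training documents that belong to class c
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}

    # log P(w | c) with add-one smoothing over the training vocabulary
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
            for w in vocab
        }
    return log_prior, log_likelihood, vocab

def predict(model, tokens):
    """Return the class with the highest log posterior score."""
    log_prior, log_likelihood, vocab = model
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for w in tokens:
            if w in vocab:  # assumption: words unseen in training are skipped
                score += log_likelihood[c][w]
        scores[c] = score
    return max(scores, key=scores.get)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.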

2. SVM

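Cost function of SVM:

SVM decision boundary:
When C (the weight on the training-error term of the cost function) is very large, minimizing the cost forces the training-error term toward zero. The SVM then behaves like a hard-margin classifier that picks the large-margin boundary, and that boundary becomes sensitive to outliers.

About kernels:

So far, the SVM seems applicable only to two-class classification. Multi-class problems are usually handled by combining several binary SVMs, for example one-vs-rest.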
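
As a rough sketch of how a binary SVM can be extended to several classes, the code below assumes the standard soft-margin linear objective, C * sum(max(0, 1 - y * (w.x + b))) + 0.5 * ||w||^2, trains one binary classifier per class (one-vs-rest) with plain subgradient descent, and predicts the class with the highest score. All function names, the learning rate, and the epoch count are illustrative assumptions; a real project would normally rely on an established SVM library.

```python
import numpy as np

def svm_cost(w, b, X, y, C):
    """Soft-margin objective for labels y in {-1, +1}."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return C * hinge.sum() + 0.5 * np.dot(w, w)

def train_binary_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize svm_cost by subgradient descent (illustrative, not optimized)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        active = y * (X @ w + b) < 1.0          # examples inside the margin
        grad_w = w - C * (X[active].T @ y[active])
        grad_b = -C * y[active].sum()
        w = w - lr * grad_w
        b = b - lr * grad_b
    return w, b

def train_one_vs_rest(X, labels, n_classes, **kwargs):
    """One binary SVM per class: class k is +1, every other class is -1."""
    return [
        train_binary_svm(X, np.where(labels == k, 1.0, -1.0), **kwargs)
        for k in range(n_classes)
    ]

def predict_one_vs_rest(models, X):
    """Pick the class whose binary SVM gives the highest score."""
    scores = np.stack([X @ w + b for w, b in models], axis=1)
    return scores.argmax(axis=1)
```

A larger C puts more weight on the hinge term relative to the ||w||^2 term, which is why very large values push the solution toward fitting every training example.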

Comparison with logistic regression:

To apply an SVM or logistic regression to text classification, all you need is labeled data and a suitable way to represent the texts as vectors (for example one-hot representations, word2vec, or doc2vec).
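
One simple way to turn texts into vectors is a bag-of-words count vector, which is the sum of the one-hot vectors of the words in a document. The sketch below is only an illustration: `build_vocabulary` and `bag_of_words` are hypothetical helper names, and the input is assumed to be already tokenized.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Map each distinct token to a column index (sorted for determinism)."""
    tokens = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: i for i, tok in enumerate(tokens)}

def bag_of_words(tokens, vocab):
    """Count vector of length len(vocab); tokens outside vocab are ignored."""
    vec = [0] * len(vocab)
    for tok, n in Counter(tokens).items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec
```

The resulting vectors can be fed directly to a linear classifier; dense representations such as word2vec or doc2vec would replace this step rather than the classifier.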

Neural network methods

1. CNN

(1) The paper "Convolutional Neural Networks for Sentence Classification", which appeared at EMNLP 2014
(2) The paper "A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification"

The model uses multiple filters to obtain multiple features. These features form the penultimate layer and are passed to a fully connected softmax layer whose output is the probability distribution over labels.
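
The forward pass described above can be sketched with NumPy. This is a simplified illustration under stated assumptions: each filter spans a window of h consecutive word vectors, a tanh nonlinearity is applied, and each filter contributes one feature through max-over-time pooling, as in the cited paper. The function names and parameter shapes are hypothetical, and training is not shown.

```python
import numpy as np

def conv_max_pool(sentence, filt, bias):
    """sentence: (n_words, dim); filt: (h, dim) with h <= n_words.

    Slides the filter over every window of h words and keeps the
    largest response (max-over-time pooling), giving one feature.
    """
    h = filt.shape[0]
    n_windows = sentence.shape[0] - h + 1
    feature_map = np.array([
        np.tanh(np.sum(sentence[i:i + h] * filt) + bias)
        for i in range(n_windows)
    ])
    return feature_map.max()

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cnn_forward(sentence, filters, biases, W_out, b_out):
    """One feature per filter forms the penultimate layer, then softmax."""
    features = np.array([
        conv_max_pool(sentence, f, b) for f, b in zip(filters, biases)
    ])
    return softmax(W_out @ features + b_out)
```

Because pooling keeps a single value per filter, the penultimate layer has a fixed size regardless of sentence length.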

For regularization, the authors employ dropout on the penultimate layer together with a constraint on the l2-norms of the weight vectors. Dropout prevents co-adaptation of hidden units by randomly dropping out (i.e., setting to zero) a proportion of the hidden units during training.
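
Both regularizers are easy to illustrate. In the hedged sketch below, `dropout` zeroes each unit with an assumed drop probability p at training time, and `clip_l2_norm` rescales a weight vector so that its l2-norm never exceeds an assumed threshold s. The names, and the convention that p is the drop probability, are choices made for this example only.

```python
import numpy as np

def dropout(features, p, rng):
    """Training-time dropout: zero each unit independently with probability p."""
    keep_mask = rng.random(features.shape) >= p
    return features * keep_mask

def clip_l2_norm(w, s):
    """Rescale w to have l2-norm s whenever its norm exceeds s."""
    norm = np.linalg.norm(w)
    if norm > s:
        return w * (s / norm)
    return w
```

In a training loop, the norm constraint would typically be applied to the weight vectors after each gradient update.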

Pre-trained Word Vectors
The authors use the publicly available word2vec vectors, which were trained on 100 billion words from Google News.

Results

There is a simplified implementation using TensorFlow on GitHub: https://github.com/dennybritz/cnn-text-classification-tf

2. RNN

The paper "Hierarchical Attention Networks for Document Classification", which appeared at NAACL 2016

The paper tests the hypothesis that better representations can be obtained by incorporating knowledge of document structure into the model architecture.

It is observed that different words and sentences in a document are differentially informative.
Moreover, the importance of words and sentences is highly context dependent,
i.e. the same word or sentence may be differentially important in different contexts.

Attention serves two benefits: not only does it often result in better performance, but it also provides insight into which words and sentences contribute to the classification decision, which can be of value in applications and analysis.

Hierarchical Attention Network

If you want to learn more about attention mechanisms: http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/

The model uses a GRU-based sequence encoder.
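
For readers unfamiliar with the GRU, here is a minimal NumPy sketch of a single GRU step and of a bidirectional encoder built from it, following the usual update-gate and reset-gate formulation. The parameter dictionary layout, the helper `init_gru_params`, and the other function names are assumptions made for this illustration; the use of a bidirectional encoder follows the cited paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru_params(input_dim, hidden_dim, rng):
    """Hypothetical helper: small random weights and zero biases."""
    p = {}
    for gate in ("z", "r", "h"):
        p["W" + gate] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        p["U" + gate] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        p["b" + gate] = np.zeros(hidden_dim)
    return p

def gru_step(x, h_prev, p):
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])   # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])   # reset gate
    h_cand = np.tanh(p["Wh"] @ x + r * (p["Uh"] @ h_prev) + p["bh"])
    return (1.0 - z) * h_prev + z * h_cand                   # new hidden state

def gru_encode(xs, p, reverse=False):
    """Run the GRU over xs (T, input_dim); return all states (T, hidden_dim)."""
    T = len(xs)
    h = np.zeros(p["bz"].shape[0])
    states = [None] * T
    order = range(T - 1, -1, -1) if reverse else range(T)
    for t in order:
        h = gru_step(xs[t], h, p)
        states[t] = h
    return np.stack(states)

def bi_gru_encode(xs, p_fwd, p_bwd):
    """Concatenate forward and backward states for every position."""
    forward = gru_encode(xs, p_fwd)
    backward = gru_encode(xs, p_bwd, reverse=True)
    return np.concatenate([forward, backward], axis=1)
```

The same encoder can be applied at both levels: over the words of a sentence and over the sentence vectors of a document.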
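1. Word Encoder:
Each word is embedded, and a bidirectional GRU produces an annotation h_it for word t of sentence i by concatenating the forward and backward hidden states.

2. Word Attention:
u_it = tanh(Ww h_it + bw), a_it = softmax over t of (u_it^T uw), s_i = sum over t of a_it h_it, where uw is a word-level context vector learned jointly with the model and s_i is the sentence vector.

3. Sentence Encoder:
A bidirectional GRU over the sentence vectors s_i produces an annotation h_i for each sentence.

4. Sentence Attention:
u_i = tanh(Ws h_i + bs), a_i = softmax over i of (u_i^T us), v = sum over i of a_i h_i, where us is a sentence-level context vector and v is the document vector.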

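5. Document Classification:
Because the document vector v is a high-level representation of document d, it can be used as features for classification: p = softmax(Wc v + bc).

The training loss is the negative log-likelihood of the correct labels: L = -(sum over documents d of log p_dj), where j is the label of document d.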
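
The attention and classification steps can be sketched as follows. This is a simplified NumPy illustration, not the authors' code: word-level encoder states are assumed to be precomputed, the sentence encoder is passed in as a callable, and all names (`attention_pool`, `classify_document`, `nll_loss`) and parameter shapes are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_pool(H, W, b, u_ctx):
    """H: (T, d) encoder states. Returns (weights (T,), pooled vector (d,))."""
    U = np.tanh(H @ W.T + b)      # hidden representation of each state
    alpha = softmax(U @ u_ctx)    # similarity to the context vector, normalized
    return alpha, alpha @ H       # attention-weighted sum of the states

def classify_document(word_states, sentence_encoder, word_att, sent_att, W_c, b_c):
    """word_states: list of (T_i, d) arrays, one per sentence."""
    # Word attention: one vector per sentence
    sent_vecs = np.stack([attention_pool(H, *word_att)[1] for H in word_states])
    # Sentence encoder (for example a bidirectional GRU), supplied by the caller
    sent_states = sentence_encoder(sent_vecs)
    # Sentence attention: one vector v for the whole document
    _, v = attention_pool(sent_states, *sent_att)
    return softmax(W_c @ v + b_c)

def nll_loss(p, j):
    """Negative log-likelihood of the correct label j for one document."""
    return -np.log(p[j])
```

The attention weights returned by `attention_pool` are what make the model inspectable: large weights mark the words or sentences that contributed most to the decision.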