A Survey of Text Classification

Continuously updated

Introduction

1. Definition

What is text classification? Put simply, it means assigning a piece of text to one or more predefined categories. It can be regarded as part of document classification.
Input:
a document d
a fixed set of classes C = {c1, c2, ... , cn}
Output:
a predicted class ci from C

2. Some simple applications

  1. spam detection
  2. authorship attribution
  3. age/gender identification
  4. sentiment analysis
  5. assigning subject categories, topics or genres
    ......

Traditional methods

1. Naive Bayes

[Figure 1]

two assumptions:

  1. Bag of words assumption:
    position doesn't matter
  2. Conditional independence:
[Figure 2]
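As a hedged reconstruction, assuming the standard multinomial Naive Bayes formulation (the symbols x_1, ..., x_n for the words of document d and c_NB for the predicted class are notation chosen for this sketch), the two assumptions lead to the following factorization and decision rule:

```latex
P(x_1, \dots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c)
\qquad\Longrightarrow\qquad
c_{NB} = \operatorname*{argmax}_{c \in C} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

In practice the product is usually computed as a sum of logarithms to avoid floating-point underflow.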

to compute these probabilities:

[Figure 3]

add-one (Laplace) smoothing prevents zero probabilities for words that never occur with a class in the training data (a constant other than one can be added as well):

[Figure 4]
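A sketch of the estimates in their usual textbook form, assuming N_c is the number of training documents of class c, N the total number of training documents, count(w, c) the number of times word w occurs in documents of class c, and V the vocabulary (all of these names are assumptions for this sketch):

```latex
\hat{P}(c) = \frac{N_c}{N},
\qquad
\hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V|}
```

Without the +1 in the numerator, a single word unseen in class c would drive the whole product for that class to zero.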

to deal with unknown (previously unseen) words:

[Figure 5]
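The pieces above can be put together in a minimal sketch of a multinomial Naive Bayes classifier. It assumes whitespace-tokenized input and handles unknown words by reserving one extra vocabulary slot in the smoothing denominator, which is one common treatment and not necessarily the one in the original figure. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """docs: list of (tokens, label) pairs. Returns the counts needed for prediction."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)  # word_counts[label][word] = count(word, label)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in docs for w in tokens}
    log_prior = {c: math.log(n / len(docs)) for c, n in label_counts.items()}
    totals = {c: sum(word_counts[c].values()) for c in label_counts}
    return log_prior, word_counts, totals, vocab


def predict(model, tokens):
    """Return the class with the highest log posterior under add-one smoothing."""
    log_prior, word_counts, totals, vocab = model
    v = len(vocab) + 1  # assumption: one extra slot reserved for unknown words
    best_label, best_score = None, -math.inf
    for c in log_prior:
        score = log_prior[c]
        for w in tokens:
            # A word never seen with class c has count 0 and still gets 1 / (total + v).
            score += math.log((word_counts[c][w] + 1) / (totals[c] + v))
        if score > best_score:
            best_label, best_score = c, score
    return best_label


# Hypothetical toy training set
TRAIN = [
    ("win cash prize now".split(), "spam"),
    ("cheap cash offer".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("project meeting notes".split(), "ham"),
]
model = train_naive_bayes(TRAIN)
print(predict(model, "cash prize offer".split()))
print(predict(model, "monday meeting zebra".split()))  # "zebra" is an unknown word
```

Working in log space keeps the scores numerically stable even for long documents.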

main features:

  1. very fast, low storage requirements
  2. robust to irrelevant features
  3. good in domains with many equally important features
  4. optimal if the independence assumptions hold
  5. often less accurate than more sophisticated classifiers

2. SVM

1. cost function of SVM:

2. SVM decision boundary
when C is very large:

[Figure 6]
[Figure 7]
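A minimal sketch of the SVM cost, assuming the formulation used in the Coursera machine learning course listed in the references: labels y in {0, 1}, hinge-style costs cost1 and cost0, and an L2 penalty on the non-bias weights. The function names and the toy data are hypothetical. It illustrates why a very large C forces the data term toward zero, so that every example must respect the margin.

```python
def cost1(z):
    """Cost for a positive example (y = 1): zero once z >= 1."""
    return max(0.0, 1.0 - z)


def cost0(z):
    """Cost for a negative example (y = 0): zero once z <= -1."""
    return max(0.0, 1.0 + z)


def svm_cost(theta, xs, ys, c):
    """C * sum of per-example costs + 0.5 * squared norm of the weights.

    theta[0] is the bias term and is not regularized.
    """
    data_term = 0.0
    for x, y in zip(xs, ys):
        z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
        data_term += y * cost1(z) + (1 - y) * cost0(z)
    reg_term = 0.5 * sum(t * t for t in theta[1:])
    return c * data_term + reg_term


# Hypothetical 1-D data: one positive example at x = 2, one negative at x = -2
xs, ys = [[2.0], [-2.0]], [1, 0]
wide_margin = [0.0, 1.0]    # satisfies the margin for both examples
small_weights = [0.0, 0.25]  # smaller norm, but violates the margin

for c in (0.01, 100.0):
    print(c, svm_cost(wide_margin, xs, ys, c), svm_cost(small_weights, xs, ys, c))
```

With a small C the cheaper solution is the one with small weights and margin violations; with a large C the margin-respecting solution wins.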

about kernels:

[Figure 8]
[Figure 9]
[Figure 10]
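As an illustration, here is a sketch of the Gaussian (RBF) kernel, one widely used kernel; whether the original figures showed exactly this kernel is an assumption. Each input is mapped to its similarities with a set of landmark points, and the SVM is then trained on those similarity features. The function names and the landmarks are hypothetical.

```python
import math


def gaussian_kernel(x, landmark, sigma=1.0):
    """Similarity exp(-||x - l||^2 / (2 * sigma^2)); equals 1 when x == landmark."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def kernel_features(x, landmarks, sigma=1.0):
    """Map x to one similarity feature per landmark."""
    return [gaussian_kernel(x, l, sigma) for l in landmarks]


landmarks = [[0.0, 0.0], [3.0, 4.0]]
print(kernel_features([0.0, 0.0], landmarks))
print(kernel_features([0.0, 0.0], landmarks, sigma=5.0))  # larger sigma: slower decay
```

A larger sigma makes the similarity fall off more slowly with distance, which gives a smoother decision boundary.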

Up to this point, it seems that SVMs are applicable only to two-class classification.

[Figure 11]
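One common way to extend a two-class classifier to several classes is one-vs-rest: train one binary classifier per class and predict the class whose classifier gives the highest score. The sketch below assumes linear scorers whose weights have already been trained; the weights, labels and function names are hypothetical and hand-picked for illustration.

```python
def linear_score(weights, bias, x):
    """Signed score of a linear classifier; larger means more confident in the class."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def binary_labels(labels, positive):
    """Relabel a multi-class problem as 'positive class vs. everything else'."""
    return [1 if y == positive else 0 for y in labels]


def one_vs_rest_predict(classifiers, x):
    """classifiers: dict mapping label -> (weights, bias). Returns the top-scoring label."""
    return max(classifiers, key=lambda label: linear_score(*classifiers[label], x))


# Hypothetical pre-trained binary classifiers, one per class
classifiers = {
    "sports": ([1.0, -1.0, 0.0], 0.0),
    "politics": ([-1.0, 1.0, 0.0], 0.0),
    "tech": ([-0.5, -0.5, 1.0], 0.0),
}
print(binary_labels(["sports", "tech", "sports"], "sports"))
print(one_vs_rest_predict(classifiers, [0.2, 0.1, 0.9]))
```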

Comparison with logistic regression:

[Figure 12]

When applying SVM or logistic regression to text classification, all you need is labeled data and a suitable way to represent the texts as vectors (for example one-hot representation, word2vec, doc2vec, ...).
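As a small illustration of the vector representation step, the sketch below builds a bag-of-words count vector, which amounts to summing the one-hot vectors of the words in a text. The function names are hypothetical, tokenization is assumed to be a plain whitespace split, and words outside the vocabulary are simply ignored.

```python
def build_vocabulary(tokenized_docs):
    """Assign each distinct word an index in order of first appearance."""
    vocab = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def bag_of_words(tokens, vocab):
    """Count vector of length len(vocab); out-of-vocabulary words are skipped."""
    vector = [0] * len(vocab)
    for token in tokens:
        index = vocab.get(token)
        if index is not None:
            vector[index] += 1
    return vector


docs = ["the movie was great".split(), "the plot was dull".split()]
vocab = build_vocabulary(docs)
print(vocab)
print(bag_of_words("the great great movie".split(), vocab))
```

The resulting vectors can be fed directly to a linear SVM or a logistic regression model.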

Neural network methods

1. CNN

(1) the paper Convolutional Neural Networks for Sentence Classification, which appeared at EMNLP 2014
(2) the paper A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification

[Figure 13]

The model uses multiple filters to obtain multiple features. These features form the penultimate layer and are passed to a fully connected softmax layer whose output is the probability distribution over labels.
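The forward pass described above can be sketched in plain Python. This is an illustration under assumptions, not the paper's implementation: each filter spans h consecutive word vectors, ReLU is assumed as the nonlinearity, each filter contributes one feature via max-over-time pooling, and the sentence is assumed to be at least as long as the tallest filter. All names and numbers are hypothetical.

```python
import math


def conv_max_pool(sentence, filt, bias=0.0):
    """sentence: list of n word vectors; filt: list of h rows, one per word in the window.

    Returns the maximum over all windows of ReLU(filter . window + bias).
    """
    h = len(filt)
    activations = []
    for i in range(len(sentence) - h + 1):
        s = bias
        for row, word in zip(filt, sentence[i:i + h]):
            s += sum(w * x for w, x in zip(row, word))
        activations.append(max(0.0, s))
    return max(activations)


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cnn_forward(sentence, filters, out_weights, out_bias):
    """One feature per filter (penultimate layer), then a fully connected softmax layer."""
    features = [conv_max_pool(sentence, f) for f in filters]
    logits = [b + sum(w * f for w, f in zip(row, features))
              for row, b in zip(out_weights, out_bias)]
    return features, softmax(logits)


# Hypothetical 4-word sentence with 2-dimensional word vectors
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filters = [
    [[1.0, 0.0], [0.0, 1.0]],              # window size 2
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],  # window size 3
]
features, probs = cnn_forward(sentence, filters, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(features, probs)
```

Because of the pooling step, the length of the feature vector depends only on the number of filters, not on the sentence length.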

[Figure 14]

For regularization we employ dropout on the penultimate layer with a constraint on l2-norms of the weight vectors. Dropout prevents co-adaptation of hidden units by randomly dropping out (i.e., setting to zero) a proportion p of the hidden units during forward-backpropagation.
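A minimal sketch of the two regularizers, under stated assumptions: drop_prob is the probability of zeroing a unit during training, and the l2 constraint is applied by rescaling a weight vector back to norm s whenever its norm exceeds s after an update. The function names and values are hypothetical.

```python
import math
import random


def dropout(units, drop_prob, rng):
    """Training-time mask: each unit is set to zero with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else u for u in units]


def clip_l2_norm(weights, s):
    """Rescale weights to have l2-norm s if the norm exceeds s; otherwise leave them."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm > s:
        return [w * s / norm for w in weights]
    return list(weights)


rng = random.Random(0)  # fixed seed so the mask is reproducible
penultimate = [0.7, 1.2, 0.0, 2.5, 0.3, 1.1]
print(dropout(penultimate, 0.5, rng))
print(clip_l2_norm([3.0, 4.0], 3.0))  # norm 5 is rescaled to norm 3
```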

Pre-trained Word Vectors
We use the publicly available word2vec vectors that were trained on 100 billion words from Google News.

Results

[Figure 15]

A simplified implementation using TensorFlow is available on GitHub: https://github.com/dennybritz/cnn-text-classification-tf

2. RNN

the paper Hierarchical Attention Networks for Document Classification, which appeared at NAACL 2016

In this paper we test the hypothesis that better representations can be obtained by incorporating knowledge of document structure in the model architecture.

  1. It is observed that different words and sentences in a document are differentially informative.
  2. Moreover, the importance of words and sentences is highly context dependent,
    i.e. the same word or sentence may be differentially important in different contexts.

Attention serves two benefits: not only does it often result in better performance, but it also provides insight into which words and sentences contribute to the classification decision, which can be of value in applications and analysis.

Hierarchical Attention Network

[Figure 16]

To learn more about attention mechanisms, see: http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/

The model uses a GRU-based sequence encoder.
1. Word Encoder:

[Figure 17]
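A sketch of the word encoder under assumptions: a GRU step with update gate z and reset gate r, run once left to right and once right to left, with the two hidden states concatenated per word to form its annotation. The parameter names (W_z, U_z, b_z and so on), the dictionary layout and the toy values are hypothetical.

```python
import math


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def _matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def gru_step(x, h_prev, p):
    """One GRU update; p maps names such as "W_z" to weight matrices or bias vectors."""
    wz, uz = _matvec(p["W_z"], x), _matvec(p["U_z"], h_prev)
    wr, ur = _matvec(p["W_r"], x), _matvec(p["U_r"], h_prev)
    wh, uh = _matvec(p["W_h"], x), _matvec(p["U_h"], h_prev)
    z = [_sigmoid(a + b + c) for a, b, c in zip(wz, uz, p["b_z"])]  # update gate
    r = [_sigmoid(a + b + c) for a, b, c in zip(wr, ur, p["b_r"])]  # reset gate
    # Candidate state: the reset gate scales the contribution of the previous state.
    cand = [math.tanh(a + ri * b + c) for a, ri, b, c in zip(wh, r, uh, p["b_h"])]
    # New state: interpolation between the previous state and the candidate.
    return [(1.0 - zi) * hp + zi * ci for zi, hp, ci in zip(z, h_prev, cand)]


def bidirectional_gru(xs, p_fwd, p_bwd, hidden_dim):
    """Return one annotation per input: forward state concatenated with backward state."""
    def run(seq, p):
        h, states = [0.0] * hidden_dim, []
        for x in seq:
            h = gru_step(x, h, p)
            states.append(h)
        return states

    fwd = run(xs, p_fwd)
    bwd = run(list(reversed(xs)), p_bwd)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


# Hypothetical parameters: input and hidden size 1, only W_h is non-zero
params = {
    "W_z": [[0.0]], "U_z": [[0.0]], "b_z": [0.0],
    "W_r": [[0.0]], "U_r": [[0.0]], "b_r": [0.0],
    "W_h": [[1.0]], "U_h": [[0.0]], "b_h": [0.0],
}
annotations = bidirectional_gru([[1.0], [0.0], [1.0]], params, params, hidden_dim=1)
print(annotations)
```

The same encoder structure can be reused at the sentence level, with sentence vectors as inputs instead of word embeddings.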

2. Word Attention:

[Figure 18]
[Figure 19]
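A sketch of attention pooling as used at the word level, under assumptions: each hidden state is passed through a one-layer tanh projection, compared with a learned context vector by dot product, normalized with softmax, and the hidden states are summed with those weights. The function name, argument names and toy values are hypothetical.

```python
import math


def attention_pool(hidden_states, w, b, context):
    """hidden_states: T vectors of size d; w: a x d matrix; b and context: size a.

    Returns (attention weights, weighted sum of the hidden states).
    """
    scores = []
    for h in hidden_states:
        u = [math.tanh(bj + sum(wjk * hk for wjk, hk in zip(row, h)))
             for row, bj in zip(w, b)]
        scores.append(sum(ui * ci for ui, ci in zip(u, context)))
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(a * h[k] for a, h in zip(alphas, hidden_states)) for k in range(dim)]
    return alphas, pooled


# Hypothetical annotations for a 3-word sentence
hidden = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
alphas, sentence_vector = attention_pool(hidden, identity, [0.0, 0.0], [1.0, 0.0])
print(alphas, sentence_vector)
```

The attention weights can be inspected directly to see which words contributed most to the sentence vector; the same function applies at the sentence level with a separate set of parameters.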

3. Sentence Encoder:

[Figure 20]

4. Sentence Attention:

[Figure 21]

5. Document Classification:
Because the document vector v is a high-level representation of document d, it can be used as features for document classification.

j denotes the label of document d.

[Figure 22]
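A sketch of the classification step under assumptions: the document vector v is passed through a linear layer followed by softmax, and training minimizes the negative log likelihood of the correct labels, summed over documents. The names w_c and b_c and the toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(v, w_c, b_c):
    """Class probabilities p = softmax(W_c v + b_c)."""
    logits = [b + sum(w * x for w, x in zip(row, v)) for row, b in zip(w_c, b_c)]
    return softmax(logits)


def nll_loss(batch_probs, labels):
    """Negative log likelihood: -sum over documents d of log p_dj, j the label of d."""
    return -sum(math.log(p[j]) for p, j in zip(batch_probs, labels))


v = [1.0, 2.0]  # hypothetical document vector
p = classify(v, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(p, nll_loss([p], [1]))
```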

Results

[Figure 23]

A simplified implementation written in Python is available on GitHub: https://github.com/richliao/textClassifier

References

https://www.cs.cmu.edu/%7Ediyiy/docs/naacl16.pdf
https://www.coursera.org/learn/machine-learning/home/
https://www.youtube.com/playlist?list=PL6397E4B26D00A269
