Semi-supervised Text Categorization by Considering Sufficiency and Diversity

Following are some excerpts from the paper Semi-supervised Text Categorization by Considering Sufficiency and Diversity by Shoushan Li et al. These excerpts summarize the main idea of the paper.

Paper name: Semi-supervised Text Categorization by Considering Sufficiency and Diversity
Paper authors: Shoushan Li et al.
Key words: Semi-supervised, Text Categorization, Bootstrapping, Sufficiency, Diversity

Overview

The paper Semi-supervised Text Categorization by Considering Sufficiency and Diversity by Shoushan Li et al. proposed a novel bootstrapping approach to semi-supervised text categorization (TC) that takes two basic preferences into account: sufficiency and diversity. Experimental evaluation shows the effectiveness of the modified bootstrapping approach on both topic-based and sentiment-based TC tasks.

Bootstrapping

In bootstrapping, a classifier is first trained on a small amount of labeled data and then iteratively retrained by adding the most confidently predicted unlabeled samples as new labeled data.
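The sketch below illustrates this loop under some assumptions not stated in the paper: a linear SVM as the base classifier, binary 0/1 labels, dense feature matrices, and illustrative parameters n_per_iter and n_iters.

```python
# A minimal self-training sketch of bootstrapping, assuming a linear SVM
# and binary 0/1 labels; the classifier choice, n_per_iter, and n_iters
# are illustrative assumptions, not the paper's exact settings.
import numpy as np
from sklearn.svm import LinearSVC

def bootstrap(X_labeled, y_labeled, X_unlabeled, n_per_iter=10, n_iters=20):
    """Iteratively move the most confidently predicted unlabeled
    samples into the labeled set and retrain the classifier."""
    X_l, y_l, X_u = X_labeled, y_labeled, X_unlabeled
    clf = LinearSVC()
    for _ in range(n_iters):
        if len(X_u) == 0:
            break
        clf.fit(X_l, y_l)
        margins = clf.decision_function(X_u)  # signed distance to the hyperplane
        pick = np.argsort(-np.abs(margins))[:n_per_iter]  # most confident first
        X_l = np.vstack([X_l, X_u[pick]])
        y_l = np.concatenate([y_l, (margins[pick] > 0).astype(int)])  # pseudo-labels
        X_u = np.delete(X_u, pick, axis=0)
    return clf.fit(X_l, y_l)
```

Here the absolute margin serves as the confidence score; any comparable confidence measure (e.g., class probabilities) could play the same role.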

Sufficiency

For bootstrapping to succeed, the labels of the newly added data should be predicted as accurately as possible; otherwise, the many wrongly predicted samples would make bootstrapping fail completely. For clarity, we refer to this preference as sufficiency.

Diversity

When the newly added data is too close to the initial labeled data, the trained hyperplane might be far from the optimal one. One possible way to overcome this concentration drawback is to make the added data more different from the initial data, so that it better reflects the natural data distribution. For clarity, we refer to this preference of making newly labeled data more different from the existing labeled data as diversity.

Bootstrapping by Considering Sufficiency and Diversity

To take sufficiency and diversity into consideration, the paper proposed three methods:

  • Bootstrapping with Random Subspace

In bootstrapping, the classifier used to choose high-confidence samples is usually trained over the whole feature space. Such a classifier tends to choose samples very similar to the initial labeled data in terms of the whole feature space.

Generally, how much two classifiers differ largely depends on the difference in the features they use. One straightforward way to obtain different classifiers is to randomly select r features from the whole feature set in each bootstrapping iteration. A classifier trained on the resulting subspace training data is called a subspace classifier; a sketch of this step follows below.

The size of the feature subset, r, is an important parameter in this algorithm. The smaller r is, the more the subspace classifiers differ from each other. However, r should not be too small, because a classifier trained with too few features is not capable of predicting samples correctly.
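The following is a hedged sketch of the random-subspace step; the function and parameter names are illustrative, and rng is assumed to be a NumPy random generator (e.g., np.random.default_rng(0)).

```python
# A sketch of training one subspace classifier: select r random features
# and fit the classifier on that subspace only. Names are illustrative.
import numpy as np
from sklearn.svm import LinearSVC

def fit_subspace_classifier(X_labeled, y_labeled, r, rng):
    """Train a classifier on a random r-dimensional feature subspace."""
    subspace = rng.choice(X_labeled.shape[1], size=r, replace=False)
    clf = LinearSVC().fit(X_labeled[:, subspace], y_labeled)
    return clf, subspace  # keep the indices to project unlabeled data later
```

When selecting confident unlabeled samples, the unlabeled data must be projected onto the same indices, e.g., clf.decision_function(X_unlabeled[:, subspace]).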

  • Bootstrapping with Excluded Subspace

To better satisfy the diversity preference, the paper improved the random subspace generation strategy with a constraint requiring that any two adjacent subspace classifiers share no features.
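One way to realize this constraint, assuming r is at most half the feature count, is to draw each iteration's subspace only from the features the previous iteration did not use; this is a sketch, not necessarily the paper's exact sampling procedure.

```python
# A sketch of the excluded-subspace constraint: the next subspace is
# drawn only from features NOT used by the previous subspace classifier.
# Assumes r <= n_features - r so that enough candidates remain.
import numpy as np

def next_excluded_subspace(n_features, r, prev_subspace, rng):
    """Sample r features disjoint from the previous subspace."""
    candidates = np.setdiff1d(np.arange(n_features), prev_subspace)
    return rng.choice(candidates, size=r, replace=False)
```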

  • Diversity Consideration among Different Types of Features

The paper introduced another constraint, which requires that any two adjacent subspace classifiers share no similar features. Here, two features are considered similar when they contain the same informative unigram; for example, the unigram feature "good" and the bigram feature "very good" both contain "good" and would be treated as similar.
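A possible realization of this constraint is sketched below. It assumes features are word n-grams whose names are space-separated tokens (as a bag-of-n-grams vectorizer typically produces) and that `informative` is a precomputed set of informative unigrams; all names here are hypothetical.

```python
# An illustrative sketch of the similar-feature constraint: exclude every
# feature that shares an informative unigram with the previous subspace.
# Assumes feature_names are space-separated n-gram strings (hypothetical).
import numpy as np

def next_dissimilar_subspace(feature_names, r, prev_subspace, informative, rng):
    """Sample r features sharing no informative unigram with prev_subspace."""
    used = {tok for i in prev_subspace
            for tok in feature_names[i].split() if tok in informative}
    candidates = [i for i, name in enumerate(feature_names)
                  if not any(tok in used for tok in name.split())]
    return rng.choice(candidates, size=r, replace=False)
```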




