[Paper Reading] Analyzing the Effectiveness and Applicability of Co-training

Paper download
bib:

@INPROCEEDINGS{NigamGhani2000CoEM,
	title		= {Analyzing the Effectiveness and Applicability of Co-Training},
	author 		= {Kamal Nigam and Rayid Ghani},
	booktitle 	= {CIKM},
	year 		= {2000},
	pages 		= {86--93}
}


1. Abstract

Recently there has been significant interest in supervised learning algorithms that combine labeled and unlabeled data for text learning tasks.

The co-training setting [1] applies to datasets that have a natural separation of their features into two disjoint sets.

We demonstrate that when learning from labeled and unlabeled data, algorithms explicitly leveraging a natural independent split of the features outperform algorithms that do not.

When a natural split does not exist, co-training algorithms that manufacture a feature split may out-perform algorithms not using a split.

These results help explain why co-training algorithms are both discriminative in nature and robust to the assumptions of their embedded classifiers.

2. Algorithm Description

Probably because this paper is fairly old (2000), it took me quite a while to find the description of the algorithm. There is very little of it; if you don't read carefully you won't find it at all. I only located it with a keyword search.

A quick note on two terms: Incremental and Iterative. Incremental means that new data (unlabeled examples carrying pseudo-labels) is added to the training set at every round. Iterative means the total amount of training data does not change across iterations. Note that co-EM counts as Iterative, because at initialization the algorithm pseudo-labels all of the unlabeled data, so the size of the training set never grows afterwards.

This suggests that incremental algorithms may outperform iterative algorithms, so long as they are not led astray by a few mislabeled documents in the early rounds of using the unlabeled data.

In short, co-EM simply fuses co-training and EM. In the paper, to control for the effect of the feature split on the experimental results, the algorithms come in two versions: one that uses a feature split and one that does not.

The first, co-EM, is an iterative algorithm that uses the feature split. It proceeds by initializing the A-feature-set naive Bayes classifier from the labeled data only. Then, A probabilistically labels all the unlabeled data. The B-feature-set classifier then trains using the labeled data and the unlabeled data with A’s labels. B then relabels the data for use by A, and this process iterates until the classifiers converge. A and B predictions are combined together as co-training embedded classifiers are. In practice, co-EM converges as quickly as EM does, and experimentally we run co-EM for 10 iterations.

  • Co-EM:
  1. Split the features into two disjoint subsets; initialize classifier A on the labeled data using its feature subset (classifier B is built in the next step).
  2. A pseudo-labels all of the unlabeled data; B trains on the labeled data plus the unlabeled data carrying A's pseudo-labels.
  3. B pseudo-labels all of the unlabeled data; A trains on the labeled data plus the unlabeled data carrying B's pseudo-labels.
  4. Repeat steps 2 and 3 until the classifiers converge (a minimal sketch follows this list).
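To make the loop concrete, here is a minimal co-EM sketch in Python. It assumes two pre-computed count-feature views (XA, XB) stored as dense NumPy arrays and uses scikit-learn's MultinomialNB as the embedded classifier; the function and variable names (co_em, soft_fit, co_predict, etc.) are mine for illustration and not from the paper.

```python
# A minimal co-EM sketch, assuming dense count-feature matrices for two views.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_em(XA_l, XB_l, y_l, XA_u, XB_u, n_iters=10):
    """Iterative co-EM with a feature split (view A / view B)."""
    classes = np.unique(y_l)

    # Initialize the view-A classifier from the labeled data only.
    clf_a = MultinomialNB().fit(XA_l, y_l)
    clf_b = MultinomialNB()

    def soft_fit(clf, X_l, y_lab, X_u, probs):
        # Labeled data plus ALL unlabeled data; each unlabeled document is
        # repeated once per class and weighted by the other view's posterior
        # probability (the "probabilistic labeling" of co-EM).
        X = np.vstack([X_l] + [X_u] * len(classes))
        y = np.concatenate([y_lab] + [np.full(X_u.shape[0], c) for c in classes])
        w = np.concatenate([np.ones(X_l.shape[0])] +
                           [probs[:, k] for k in range(len(classes))])
        return clf.fit(X, y, sample_weight=w)

    for _ in range(n_iters):
        # A labels the unlabeled pool; B trains on labeled + A-labeled data ...
        clf_b = soft_fit(clf_b, XB_l, y_l, XB_u, clf_a.predict_proba(XA_u))
        # ... then B relabels the pool and A retrains; repeat until convergence.
        clf_a = soft_fit(clf_a, XA_l, y_l, XA_u, clf_b.predict_proba(XB_u))
    return clf_a, clf_b

def co_predict(clf_a, clf_b, XA, XB):
    # Combine A and B the way co-training combines its embedded classifiers:
    # multiply the per-view class posteriors (sum of logs) and take the argmax.
    log_post = clf_a.predict_log_proba(XA) + clf_b.predict_log_proba(XB)
    return clf_a.classes_[log_post.argmax(axis=1)]
```

The soft_fit helper is one way to realize the "probabilistically labels" step: each unlabeled document is added once per class, weighted by the other view's posterior. Replacing the soft weights with hard argmax labels would recover a plain hard-label retraining loop instead.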

Self-training is an incremental algorithm that does not use the split of the features. Initially, self-training builds a single naive Bayes classifier using the labeled training data and all the features. Then it labels the unlabeled data and converts the most confidently predicted document of each class into a labeled training example. This iterates until all the unlabeled documents are given labels.

  • self-training
    Since the features are not split, there is only one classifier, and it pseudo-labels data for itself. Rather than pseudo-labeling all of the unlabeled data at once, this version picks only the most confidently predicted unlabeled examples in each round (see the sketch below).
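For comparison, a minimal self-training sketch under the same assumptions (a single dense count-feature matrix, MultinomialNB); the names are again illustrative only. Each round it moves the single most confidently predicted document of each class from the unlabeled pool into the labeled set.

```python
# A minimal self-training sketch (no feature split), assuming dense NumPy arrays.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def self_train(X_l, y_l, X_u):
    """Incremental self-training: one naive Bayes classifier, no feature split."""
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()

    while X_u.shape[0] > 0:
        # Retrain on everything labeled so far (original + pseudo-labeled).
        clf = MultinomialNB().fit(X_l, y_l)
        probs = clf.predict_proba(X_u)

        # Move the most confidently predicted document of each class from the
        # unlabeled pool into the labeled set, together with its pseudo-label.
        picked = []
        for k, c in enumerate(clf.classes_):
            i = int(np.argmax(probs[:, k]))
            if i in picked:  # same document topped two classes; skip it
                continue
            X_l = np.vstack([X_l, X_u[i:i + 1]])
            y_l = np.append(y_l, c)
            picked.append(i)
        X_u = np.delete(X_u, picked, axis=0)

    # Final classifier trained on all (pseudo-)labeled documents.
    return MultinomialNB().fit(X_l, y_l)
```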
Method        Uses feature split    No feature split
Incremental   co-training           self-training
Iterative     co-EM                 EM

3. Summary

The paper itself is not hard to understand. I still need to look into the details of the EM algorithm later.
