Semi-Supervised Learning (Wikipedia)

The following is excerpted from the Wikipedia article on semi-supervised learning, to build an intuitive, conceptual picture of the topic.


Semi-supervised learning is a class of supervised learning tasks and techniques that also make use of unlabeled data for training - typically a small amount of labeled data with a large amount of unlabeled data. Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). Many machine-learning researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy. The acquisition of labeled data for a learning problem often requires a skilled human agent (e.g. to transcribe an audio segment) or a physical experiment (e.g. determining the 3D structure of a protein or determining whether there is oil at a particular location). The cost associated with the labeling process thus may render a fully labeled training set infeasible, whereas acquisition of unlabeled data is relatively inexpensive. In such situations, semi-supervised learning can be of great practical value. Semi-supervised learning is also of theoretical interest in machine learning and as a model for human learning.


The main point is that in practice, labeled examples require human effort and are expensive to obtain, whereas unlabeled examples are comparatively easy to collect. A natural compromise is to spend part of the budget labeling a small set of examples and use the rest to gather unlabeled data; learning from both together is what constitutes semi-supervised learning, and it has real practical value. It also mirrors, to some extent, how humans learn.


As in the supervised learning framework, we are given a set of $l$ independently identically distributed examples $x_1, \dots, x_l \in X$ with corresponding labels $y_1, \dots, y_l \in Y$. Additionally, we are given $u$ unlabeled examples $x_{l+1}, \dots, x_{l+u} \in X$. Semi-supervised learning attempts to make use of this combined information to surpass the classification performance that could be obtained either by discarding the unlabeled data and doing supervised learning or by discarding the labels and doing unsupervised learning.


This formalizes the two parts of the semi-supervised training set, and says that semi-supervised learning aims to do better than either supervised learning that throws away the unlabeled examples or unsupervised learning that throws away the labels.
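To make the setup concrete, here is a minimal self-training sketch in Python. Everything specific in it (the make_moons toy data, the choice of l = 20 labeled points, logistic regression, the 0.95 confidence threshold) is an illustrative assumption of mine, not something from the article: a classifier is fit on the labeled examples, then repeatedly pseudo-labels the unlabeled examples it is most confident about and is retrained on the enlarged set.

```python
# A minimal self-training sketch: l labeled + u unlabeled examples.
# Dataset, model, and threshold are illustrative choices, not from the article.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
l = 20                                     # a small number of labeled examples
rng = np.random.RandomState(0)
labeled_idx = rng.choice(len(X), size=l, replace=False)

X_lab, y_lab = X[labeled_idx], y[labeled_idx]
unlabeled_mask = np.ones(len(X), dtype=bool)
unlabeled_mask[labeled_idx] = False
X_unl = X[unlabeled_mask]                  # the u unlabeled examples

clf = LogisticRegression()
for _ in range(5):                         # a few self-training rounds
    clf.fit(X_lab, y_lab)
    if len(X_unl) == 0:
        break
    proba = clf.predict_proba(X_unl)
    confident = proba.max(axis=1) > 0.95   # keep only confident pseudo-labels
    if not confident.any():
        break
    X_lab = np.vstack([X_lab, X_unl[confident]])
    y_lab = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
    X_unl = X_unl[~confident]

print("training set grew to", len(X_lab), "examples")
```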


Semi-supervised learning may refer to either transductive learning or inductive learning. The goal of transductive learning is to infer the correct labels for the given unlabeled data $x_{l+1}, \dots, x_{l+u}$ only. The goal of inductive learning is to infer the correct mapping from X to Y.


Semi-supervised learning may refer to either transductive learning or inductive learning: transductive learning infers the correct labels for the given unlabeled examples only, while inductive learning infers the mapping from X to Y. The next paragraph gives an intuitive explanation.


Intuitively, we can think of the learning problem as an exam and labeled data as the few example problems that the teacher solved in class. The teacher also provides a set of unsolved problems. In the transductive setting, these unsolved problems are a take-home exam and you want to do well on them in particular. In the inductive setting, these are practice problems of the sort you will encounter on the in-class exam.


This uses a student exam as an analogy: the labeled examples are the problems the teacher has already solved in class, and the unlabeled examples are the unsolved problems handed out afterwards. In the transductive setting, those unsolved problems are themselves the take-home exam, so what matters is answering exactly those problems correctly. In the inductive setting, they are practice problems of the same kind as the in-class exam, so what matters is learning a general rule from them that will also work on the new problems seen later.


It is unnecessary (and, according to Vapnik’s principle, imprudent) to perform transductive learning by way of inferring a classification rule over the entire input space; however, in practice, algorithms formally designed for transduction or induction are often used interchangeably.


It is unnecessary (and, by Vapnik's principle, imprudent) to do transduction by first inferring a full classification rule over the whole input space; in practice, though, algorithms designed for transduction and algorithms designed for induction are often used interchangeably.
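As a rough illustration of the two settings (a sketch only; scikit-learn's LabelPropagation and SelfTrainingClassifier are used as convenient stand-ins, the article does not prescribe any particular algorithm): a transductive method outputs labels for exactly the given unlabeled points, while an inductive method returns a model that can label brand-new points from X.

```python
# Transductive vs. inductive semi-supervised learning (illustrative sketch).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelPropagation, SelfTrainingClassifier
from sklearn.svm import SVC

X, y_true = make_moons(n_samples=200, noise=0.15, random_state=1)
y = np.full(len(X), -1)              # -1 marks an unlabeled example
y[:10] = y_true[:10]                 # only 10 labeled examples

# Transductive: infer labels for exactly these unlabeled points.
lp = LabelPropagation().fit(X, y)
print(lp.transduction_[:10])         # labels assigned to the given points

# Inductive: learn a mapping X -> Y usable on brand-new points.
st = SelfTrainingClassifier(SVC(probability=True)).fit(X, y)
new_point = np.array([[0.5, 0.0]])
print(st.predict(new_point))
```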


Assumptions used in semi-supervised learning

In order to make any use of unlabeled data, we must assume some structure to the underlying distribution of data. Semi-supervised learning algorithms make use of at least one of the following assumptions.

Smoothness assumption
Points which are close to each other are more likely to share a label. This is also generally assumed in supervised learning and yields a preference for geometrically simple decision boundaries. In the case of semi-supervised learning, the smoothness assumption additionally yields a preference for decision boundaries in low-density regions, so that there are fewer points close to each other but in different classes.

Cluster assumption
The data tend to form discrete clusters, and points in the same cluster are more likely to share a label (although data sharing a label may be spread across multiple clusters). This is a special case of the smoothness assumption and gives rise to feature learning with clustering algorithms.
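A minimal cluster-then-label sketch of this assumption (KMeans, the three-blob toy data, and the majority-vote rule are my own illustrative choices): cluster labeled and unlabeled data together, then give every point in a cluster the majority label of the labeled points that fall into it.

```python
# Cluster-then-label: a small sketch of the cluster assumption.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=300, centers=3, random_state=2)
y = np.full(len(X), -1)              # -1 = unlabeled
y[:15] = y_true[:15]                 # a handful of labeled points

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

y_pred = np.empty(len(X), dtype=int)
for c in np.unique(clusters):
    members = clusters == c
    known = y[members][y[members] != -1]   # labeled points in this cluster
    # Majority vote among the labeled members (fall back to 0 if none).
    y_pred[members] = np.bincount(known).argmax() if len(known) else 0

print("accuracy on all points:", (y_pred == y_true).mean())
```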

Manifold assumption
The data lie approximately on a manifold of much lower dimension than the input space. In this case we can attempt to learn the manifold using both the labeled and unlabeled data to avoid the curse of dimensionality. Then learning can proceed using distances and densities defined on the manifold.

The manifold assumption is practical when high-dimensional data are being generated by some process that may be hard to model directly, but which only has a few degrees of freedom. For instance, human voice is controlled by a few vocal folds,[2] and images of various facial expressions are controlled by a few muscles. We would like in these cases to use distances and smoothness in the natural space of the generating problem, rather than in the space of all possible acoustic waves or images respectively.
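A rough sketch of how this assumption might be used (Isomap and a k-NN classifier are illustrative stand-ins; the text only says that learning proceeds with distances defined on the manifold): the manifold is estimated from labeled and unlabeled points together, and a simple classifier then works in the learned low-dimensional coordinates.

```python
# Manifold assumption sketch: embed all points (labeled + unlabeled) into a
# low-dimensional space, then classify using distances in that space.
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap
from sklearn.neighbors import KNeighborsClassifier

X, t = make_swiss_roll(n_samples=500, random_state=3)
y_true = (t > t.mean()).astype(int)     # a synthetic 2-class label along the roll

labeled = np.zeros(len(X), dtype=bool)
labeled[np.random.RandomState(3).choice(len(X), 25, replace=False)] = True

# The manifold is estimated from ALL points, labeled and unlabeled alike.
Z = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

# A simple classifier then uses distances in the learned low-dim coordinates.
knn = KNeighborsClassifier(n_neighbors=3).fit(Z[labeled], y_true[labeled])
print("accuracy on unlabeled points:", knn.score(Z[~labeled], y_true[~labeled]))
```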


The main point here is that for unlabeled data to help, semi-supervised learning relies on at least one of three assumptions. First, the smoothness assumption: examples that are close to each other are more likely to share a label. This resembles the notion of generalization in supervised learning, where similar inputs should map to similar outputs (otherwise the machine cannot learn), i.e. the true hypothesis generating the samples should be smooth; in the semi-supervised case it additionally pushes decision boundaries toward low-density regions. Second, the cluster assumption: the data tend to form discrete clusters, and examples within the same cluster are more likely to share a label; this is related to feature learning with clustering algorithms in unsupervised learning. Finally, the manifold assumption: the input space is redundant, and the data actually lie near a manifold of much lower dimension, so both labeled and unlabeled examples can be used to learn this manifold and avoid the curse of dimensionality. It is like the human voice and facial expressions: voicing is controlled by only a few vocal folds, and expressions by only a few muscles, so we would rather work with these few key factors than learn directly in the space of all possible sound waves or face images.


2015-8-28
艺少
