When you take a machine learning class, there’s a good chance it’s divided into a unit on supervised learning and a unit on unsupervised learning. We certainly care about this distinction for a practical reason: often there are orders of magnitude more data available if we don’t need to collect ground-truth labels. But we also tend to think it matters for more fundamental reasons. In particular, the following are some common intuitions:

1. In supervised learning, the particular algorithm usually matters less than feature engineering and model tuning. In unsupervised learning, we think harder about the structure of the data and try to build a model that reflects that structure.
2. In supervised learning, except in small-data settings, we throw in whatever features we think might be relevant to the problem. In unsupervised learning, we carefully choose the features we think best capture the aspects of the data we care about.
3. Supervised learning algorithms seem to enjoy much stronger theoretical support than unsupervised ones.
4. Off-the-shelf algorithms perform very well on most supervised tasks, whereas unsupervised learning requires more care and expertise to build a useful model.
I’d argue that this framing is deceptive. I think the real division in machine learning isn’t between supervised and unsupervised, but between what I’ll term predictive learning and representation learning. I haven’t heard it described in precisely this way before, but I think this distinction reflects a lot of our intuitions about how to approach a given machine learning problem.
In predictive learning, we observe data drawn from some distribution, and we are interested in predicting some aspect of this distribution. In textbook supervised learning, for instance, we observe a bunch of pairs $(x_1, y_1), \ldots, (x_N, y_N)$, and given some new example $x$, we’re interested in predicting something about the corresponding $y$. In density modeling (a form of unsupervised learning), we observe unlabeled data $x_1, \ldots, x_N$, and we are interested in modeling the distribution the data comes from, perhaps so we can perform inference in that distribution. In each of these cases, there is a well-defined predictive task, where we try to predict some aspect of the observables, possibly given some other aspect.
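To make the two flavors of predictive task concrete, here is a minimal sketch using scikit-learn. The specific models (logistic regression for the supervised case, a Gaussian mixture for density modeling) are stand-ins I’ve chosen for illustration, not models prescribed anywhere above.

```python
# Two predictive tasks, sketched with off-the-shelf scikit-learn models.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Supervised predictive learning: observe pairs (x_i, y_i), then predict
# something about y for a new x.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y)
x_new = np.array([[0.5, -0.2]])
print("p(y=1 | x_new) =", clf.predict_proba(x_new)[0, 1])

# Density modeling, unsupervised but still predictive: observe unlabeled
# x_1, ..., x_N and model the distribution they were drawn from.
density = GaussianMixture(n_components=3, random_state=0).fit(X)
print("log p(x_new) =", density.score_samples(x_new)[0])
```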
In representation learning, our goal isn’t to predict observables, but to learn something about the underlying structure. In cognitive science and AI, a representation is a formal system which maps to some domain of interest in systematic ways. A good representation allows us to answer queries about the domain by manipulating that system. In machine learning, representations often take the form of vectors, either real- or binary-valued, and we can manipulate these representations with operations like Euclidean distance and matrix multiplication. For instance, PCA learns representations of data points as vectors. We can ask how similar two data points are by computing the Euclidean distance between their representation vectors.
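As a concrete version of the PCA example, here is a minimal sketch (the dataset and dimensionalities are arbitrary choices of mine): the learned two-dimensional vectors are the representation, and a similarity query is answered purely by operating on them.

```python
# PCA as representation learning: map each data point to a low-dimensional
# vector, then answer similarity queries with Euclidean distance in that space.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))   # 100 points in a 50-dimensional space

pca = PCA(n_components=2)
Z = pca.fit_transform(X)         # the learned 2-D representations

def similarity_query(i, j):
    """How similar are points i and j? Answered by manipulating the
    representation: Euclidean distance between their PCA vectors."""
    return np.linalg.norm(Z[i] - Z[j])

print(similarity_query(0, 1))
```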
In representation learning, the goal isn’t to make predictions about observables, but to learn a representation that will later help us answer various queries. Sometimes the representations are meant for people, such as when we visualize data as a two-dimensional embedding. Sometimes they’re meant for machines, such as when the binary vector representations learned by deep Boltzmann machines are fed into a supervised classifier. In either case, what’s important is that mathematical operations map to the underlying relationships in the data in systematic ways.
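For the representations-meant-for-machines case, the general pattern can be sketched as follows: features learned without labels are handed to a downstream supervised classifier. In this sketch a single BernoulliRBM from scikit-learn stands in for the deep Boltzmann machine mentioned above; that substitution is mine, made purely for convenience.

```python
# Unsupervised feature learning feeding a supervised classifier.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
X = X / 16.0   # scale pixel values to [0, 1] for the RBM

model = Pipeline([
    # Learn latent features from the inputs alone (no labels used here).
    ("features", BernoulliRBM(n_components=64, learning_rate=0.05,
                              n_iter=20, random_state=0)),
    # Feed those learned representations into a supervised classifier.
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```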
Whether your goal is prediction or representation learning influences the sorts of techniques you’ll use to solve the problem. If you’re doing predictive learning, you’ll probably try to engineer a system which exploits as much information as possible about the data, carefully using a validation set to tune parameters and monitor overfitting. If you’re doing representation learning, there’s no good quantitative criterion, so you’ll more likely build a model based on your intuitions about the domain, and then keep staring at the learned representations to see if they make intuitive sense.
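To make the predictive-learning workflow concrete, here is a small sketch of my own (not anything prescribed above): hold out a validation set, sweep a hyperparameter, and keep whatever scores best. The representation-learning workflow has no such quantitative loop; it mostly comes down to inspecting the learned vectors by eye.

```python
# Validation-set tuning, the routine workflow of predictive learning.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

best_C, best_score = None, -float("inf")
for C in [0.01, 0.1, 1.0, 10.0]:
    clf = LogisticRegression(C=C, max_iter=2000).fit(X_train, y_train)
    score = clf.score(X_val, y_val)   # held-out accuracy guards against overfitting
    if score > best_score:
        best_C, best_score = C, score

print("best C:", best_C, "validation accuracy:", best_score)
```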
In other words, the predictive/representation distinction parallels the differences I listed above between supervised and unsupervised learning. This shouldn’t be surprising, because the two dimensions are strongly correlated: most supervised learning is predictive learning, and most unsupervised learning is representation learning. So to see which dimension is really the crux of the issue, let’s look at cases where the two differ.
Language modeling is a perfect example of an application which is unsupervised but predictive. The goal is to take a large corpus of unlabeled text (such as Wikipedia) and learn a distribution over English sentences. The problem is motivated by Bayesian models for speech recognition: a distribution over sentences can be used as a prior for what a person is likely to say. The goal, then, is to model the distribution, and any additional structure is unnecessary. Log-linear models, such as that of Mnih et al. [1], are very good at this, and recurrent neural nets [2] are even better. These are the sorts of approaches we’d normally apply in a supervised setting: very good at making predictions, but often hard to interpret. One state-of-the-art algorithm for density modeling of text is PAQ [3], which is a heavily engineered ensemble of sequential predictors, somewhat reminiscent of the winning entries of the Netflix competition.
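To show the shape of the task (and nothing more), here is a toy bigram language model in plain Python. It is vastly simpler than the log-linear, recurrent, and PAQ-style models cited above, but it illustrates the point: an unlabeled corpus defines a purely predictive objective, namely assigning probabilities to sentences.

```python
# A toy bigram language model: estimate p(next word | previous word) from
# unlabeled text, then score whole sentences.
import math
from collections import Counter, defaultdict

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat chased the dog",
]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    for prev, cur in zip(words, words[1:]):
        bigram_counts[prev][cur] += 1

def log_prob(sentence, alpha=1.0, vocab_size=50):
    """Add-alpha smoothed log-probability of a sentence under the bigram model."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        counts = bigram_counts[prev]
        total += math.log((counts[cur] + alpha) /
                          (sum(counts.values()) + alpha * vocab_size))
    return total

print(log_prob("the cat sat on the rug"))   # less negative = more probable
print(log_prob("rug the on sat cat the"))
```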
On the flip side, supervised neural nets are often used to learn representations. One example is Collobert-Weston networks [4], which attempt to solve a number of supervised NLP tasks by learning representations that are shared between them. Some of the tasks are fairly simple and have a large amount of labeled data, such as predicting which of two words should be used to fill in the blank. Others are harder and have less data available, such as semantic role labeling. The simpler tasks are artificial, and they are there to help learn a representation of words and phrases as vectors, where similar words and phrases map to nearby vectors; this representation should then help performance on the harder tasks. We don’t care about the performance on those simpler tasks per se; we care whether the learned embeddings reflect the underlying structure. To debug and tune the algorithm, we’d focus on whether the representations make intuitive sense, rather than on quantitative performance. There are no theoretical guarantees that such an approach will work; it all depends on our intuitions about how the different tasks are related.
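The shared-representation structure can be sketched in a few lines of PyTorch. This is an illustrative stand-in of my own, not the actual Collobert-Weston architecture: a single embedding table is shared by two task-specific heads, so gradients from both tasks shape the same word vectors, and those vectors are what we ultimately care about.

```python
# One shared embedding table, two task heads. Vocabulary size, dimensions,
# and the two "tasks" are made up for illustration.
import torch
import torch.nn as nn

class SharedEmbeddingModel(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=50, context=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # shared across tasks
        self.fill_in_blank_head = nn.Linear(context * embed_dim, 2)    # easy task
        self.role_label_head = nn.Linear(context * embed_dim, 20)      # harder task

    def forward(self, word_ids, task):
        h = self.embed(word_ids).flatten(start_dim=1)      # (batch, context * embed_dim)
        if task == "fill_in_blank":
            return self.fill_in_blank_head(h)
        return self.role_label_head(h)

model = SharedEmbeddingModel()
windows = torch.randint(0, 1000, (4, 5))                   # a batch of 4 five-word windows
print(model(windows, task="fill_in_blank").shape)          # torch.Size([4, 2])
print(model(windows, task="role_label").shape)             # torch.Size([4, 20])

# After training, the embedding vectors themselves (model.embed.weight) are
# the object of interest, e.g. for nearest-neighbour queries over words.
```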
Based on these two examples, it seems that the predictive/representation dimension, rather than supervised/unsupervised, is what determines how we should approach a given problem.
In machine learning, we tend to think there’s no solid theoretical framework for unsupervised learning. But really, the issue is that we haven’t begun to formally characterize the problem of representation learning. If you just want to build a density modeler, that’s about as well understood as the supervised case. But if the goal is to learn representations which capture the underlying structure, that’s much harder to formalize. In my next post, I’ll take a stab at characterizing what representation learning is actually about.