A College Slacker's ISLR Notes (10) - Unsupervised Learning

Most of this book concerns supervised learning methods such as regression and classification. In the supervised learning setting, we typically have access to a set of p features X1, X2, ..., Xp, measured on n observations, and a response Y also measured on those same n observations. The goal is then to predict Y using X1, X2, ..., Xp.

This chapter will instead focus on unsupervised learning, a set of statistical tools intended for the setting in which we have only a set of features X1, X2, ..., Xp measured on n observations.

We are not interested in prediction, because we do not have an associated response variable Y. Rather, the goal is to discover interesting things about the measurements on X1, X2, ..., Xp. Is there an informative way to visualize the data? Can we discover subgroups among the variables or among the observations? Unsupervised learning refers to a diverse set of techniques for answering questions such as these. In this chapter, we will focus on two particular types of unsupervised learning: principal components analysis, a tool used for data visualization or data pre-processing before supervised techniques are applied, and clustering, a broad class of methods for discovering unknown subgroups in data.

The Challenge of Unsupervised Learning

Compared with supervised learning, unsupervised learning is often much more challenging. The exercise tends to be more subjective, and there is no simple goal for the analysis, such as prediction of a response.

Unsupervised learning is often performed as part of an exploratory data analysis. Furthermore, it can be hard to assess the results obtained from unsupervised learning methods, since there is no universally accepted mechanism for performing cross-validation or validating results on an independent data set. In unsupervised learning, there is no way to check our work because we do not know the true answer: the problem is unsupervised.

Principal Components Analysis

When faced with a large set of correlated variables, principal components allow us to summarize this set with a smaller number of representative variables that collectively explain most of the variability in the original set.
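As a rough illustration of this summarizing effect, here is a minimal Python sketch using NumPy. The data are synthetic and purely hypothetical: five correlated features generated from two hidden factors plus a little noise. The names `factors`, `mixing`, and `cumulative` are assumptions made for this example and do not come from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations of 5 features that all depend on
# 2 hidden factors plus a little noise, so the features are correlated.
n, p, k = 200, 5, 2
factors = rng.normal(size=(n, k))
mixing = rng.normal(size=(k, p))
X = factors @ mixing + 0.1 * rng.normal(size=(n, p))

# Center each column, then take the singular values of the centered matrix.
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)

# Squared singular values are proportional to the variance captured by
# each principal component, so their shares sum to 1.
explained = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained)
print(np.round(cumulative, 3))
```

In this made-up setting the first two entries of `cumulative` should already be close to 1, which is what "a smaller number of representative variables" means in practice: two derived variables carry nearly all the variability of the five original ones.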

What Are Principal Components?

PCA provides a tool to do just this. It finds a low-dimensional representation of a data set that contains as much as possible of the variation. The idea is that each of the n observations lives in p-dimensional space, but not all of these dimensions are equally interesting. PCA seeks a small number of dimensions that are as interesting as possible, where the concept of interesting is measured by the amount that the observations vary along each dimension. Each of the dimensions found by PCA is a linear combination of the p features. We now explain the manner in which these dimensions, or principal components, are found.
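Before the formal definition, the idea of "interesting means high variance along a direction" can be checked numerically. The following is only a sketch on hypothetical synthetic data. It assumes the first direction is obtained from the singular value decomposition of the centered data, which is one common way to compute PCA, and compares it with randomly chosen unit-length directions. The helper `variance_along` is a name introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: n = 300 observations in p = 3 dimensions,
# stretched mostly along the first coordinate.
X = rng.normal(size=(300, 3)) * np.array([3.0, 1.0, 0.3])
Xc = X - X.mean(axis=0)  # center each column

# Rows of Vt are unit-length directions, ordered by decreasing variance.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
first_direction = Vt[0]


def variance_along(direction):
    """Sample variance of the observations projected onto `direction`."""
    return np.var(Xc @ direction)


# Compare against 1000 random unit-length directions.
random_dirs = rng.normal(size=(1000, 3))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
best_random = max(variance_along(d) for d in random_dirs)

pc1_variance = variance_along(first_direction)
print(pc1_variance, best_random)
```

The projection `Xc @ direction` is a linear combination of the p features, and no random unit-length direction should yield a larger variance than `first_direction`.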

The first principal component of a set of features X1, X2, ..., Xp is the normalized linear combination of the features:

$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p$$

that has the largest variance. By normalized, we mean that

$$\sum_{j=1}^{p} \phi_{j1}^2 = 1.$$

Given an n × p data set X, how do we compute the first principal component? Since we are only interested in variance, we assume that each of the variables in X has been centered to have mean zero (that is, the column means of X are zero). We then look for the linear combination of the sample feature values of the form: