Chalmers Course: Statistical learning for big data

Course literature:

Most of the course content can be found in

Hastie T, Tibshirani R, and Friedman J (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer Science+Business Media, LLC

This book is freely available online. If I refer to ESL, then I mean this book.

Other helpful books

  • Books that have a more practical angle than ESL:

    • James G, Witten D, Hastie T, and Tibshirani R (2013) An Introduction to Statistical Learning: With Applications in R. New York: Springer Science+Business Media, LLC
      The little sibling of ESL, freely available online.
    • Kuhn M and Johnson K (2013) Applied Predictive Modeling. New York: Springer Science+Business Media, LLC. This book is also freely available via SpringerLink if you are on the university network.
  • For a theoretical angle on the more traditional (small to medium data) parts of the course:

    • Falk M, Marohn F, and Tewes B (2002) Foundations of Statistical Analyses and Applications with SAS. Basel: Birkhäuser
  • A very pedagogic book with a (mostly) Bayesian angle

    • Bishop CM (2006) Pattern Recognition and Machine Learning. New York: Springer
  • A compendium with a stronger focus on the Bayesian angle. Tougher to learn from, but great as a reference

    • Murphy KP (2012) Machine Learning: A Probabilistic Perspective. Cambridge, Massachusetts: The MIT Press
  • For the parts of the course concerning sparse modelling and penalisation methods the following book can be useful

    • Hastie T, Tibshirani R, and Wainwright M (2016) Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press
      This book is freely available online.

Projects

  • There are 3 projects in total in this course. Students can choose between R and Python.

Lecture plan

ESL chapters covered by each lecture are given in parentheses.

  1. Introduction (1, 2)
  2. Model-based classification (4)
  3. Model evaluation and bias-variance tradeoff (7)
  4. Rule-based classification and regression: CART and random forests (9.2, 8.7, 15.1-15.3)
  5. Classification and dimension reduction: singular value decomposition, PCA, regularized discriminant analysis (15.3.1-15.3.2, 4.3.1)
  6. Clustering (14.3)
  7. Clustering, continued (8.5, 12.7)
  8. Density-based and high-dimensional clustering
  9. Feature selection and regularised regression (3.3, 3.4, 3.8.5, 3.8.6)
  10. Regularised regression, continued (3.8.4, 18.1-18.3.2, 18.4)
  11. Data representations - linear methods (14.5.1, 14.6, 14.7.1)
  12. Data representations - kernel methods (14.5.4)
  13. Data representations - preserving local geometry (14.8, 14.9)
  14. Large-scale methods for data analysis
  15. Advanced topics (14.5.3)

Lecture 1 - Introduction

Chapter 1 - Introduction

  • Statistical learning, features, training set, learner, supervised learning problem, unsupervised learning problem
  • Some real examples:
    1. Email spam (a classification problem)
    2. Prostate cancer (a regression problem)
    3. Handwritten digit recognition (a classification problem)
    4. DNA expression microarrays (a clustering problem)

Chapter 2 - Overview of supervised learning

2.1 Introduction

  • For the first three examples given in Chapter 1, the common goal is to use the inputs to predict the values of the outputs, which is called supervised learning.
  • In the statistical literature the inputs are often called the predictors, a term we will use interchangeably with inputs, and more classically the independent variables. In the pattern recognition literature the term features is preferred, which we use as well. The outputs are called the responses, or classically the dependent variables.

2.2 Variable types and terminology

  • Outputs: could be a quantitative measurement, or a qualitative measurement. Qualitative variables are also referred to as categorical or discrete variables as well as factors. A third variable type is ordered categorical, such as small, medium, large. They are further discussed in Chapter 4.
  • Inputs also vary in measurement type.
  • This distinction in output type has led to a naming convention for the prediction tasks:
    1. Regression when we predict quantitative outputs
    2. Classification when we predict qualitative outputs
  • We will typically denote an input variable by the symbol X. If X is a vector, its components can be accessed by subscripts X_j. Quantitative outputs will be denoted by Y, and qualitative outputs by G (for group). We use uppercase letters such as X, Y or G when referring to the generic aspects of a variable. Observed values are written in lowercase; hence the ith observed value of X is written as x_i (where x_i is again a scalar or vector).

2.3 Two simple approaches to prediction: Least squares and nearest neighbors

  • In this section we develop two simple but powerful prediction methods:
    1. The linear model fit by least squares: The linear model makes huge assumptions about structure and yields stable but possibly inaccurate predictions
    2. The k-nearest-neighbor prediction rule: makes very mild structural assumptions; its predictions are often accurate but can be unstable

2.3.1 Linear models and least squares (parametric method)

  • Given a vector of inputs X^T = (X_1, X_2, \ldots, X_p), we predict the output Y via the model

    \hat{Y} = \hat\beta_0 + \sum_{j=1}^{p} X_j\hat\beta_j

    The term \hat\beta_0 is the intercept, also known as the bias in machine learning. We assume that the intercept is included in \hat\beta by including the constant variable 1 in X, so that the model can be written as the inner product \hat{Y} = X^T\hat\beta.

    Viewed as a function over the p-dimensional input space, f(X) = X^T\hat\beta is linear, and the gradient f'(X) = \hat\beta is a vector in input space that points in the steepest uphill direction.
  • How do we fit the linear model to a set of training data? There are many different methods, but by far the most popular is the method of least squares. In this approach, we pick the coefficients \beta to minimize the residual sum of squares

    \mathrm{RSS}(\beta) = \sum_{i=1}^{N}\big(y_i - x_i^T\beta\big)^2
  • Normal equations: differentiating RSS(\beta) gives X^T(\mathbf{y} - X\beta) = 0, so if X^TX is nonsingular the unique solution is \hat\beta = (X^TX)^{-1}X^T\mathbf{y} (a small sketch follows this list).
  • In a classification context (with a 0/1 coded response), the linear model's two predicted classes are separated by the linear decision boundary \{x : x^T\hat\beta = 0.5\}.
  • A mixture of Gaussians is best described in terms of the generative model.
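
As a quick, hedged illustration of the least-squares fit and normal equations above (my own sketch in Python/NumPy; the simulated data and variable names are not from ESL):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated training data: N observations, p inputs (illustrative only)
N, p = 100, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])        # intercept + p slopes
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.3, size=N)

# Include the constant variable 1 in X so the intercept is part of beta
X1 = np.column_stack([np.ones(N), X])

# Least squares: solve the normal equations X^T X beta = X^T y
beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)

# Predictions and residual sum of squares
y_hat = X1 @ beta_hat
rss = np.sum((y - y_hat) ** 2)
print("beta_hat:", beta_hat, "RSS:", rss)
```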

2.3.2 Nearest-neighbor methods (nonparametric method)

  • Nearest-neighbor methods use those observations in the training set T closest in input space to x to form \hat{Y}. Specifically, the k-nearest-neighbor fit for \hat{Y} is defined as follows:

    \hat{Y}(x) = \frac{1}{k}\sum_{x_i \in N_k(x)} y_i

    where N_k(x) is the neighborhood of x defined by the k closest points x_i in the training sample. Closeness implies a metric, which for the moment we assume is Euclidean distance. So, in words, we find the k observations with x_i closest to x in input space, and average their responses (a small sketch follows this list).
  • It appears that k-nearest-neighbor fits have a single parameter, the number of neighbors k, compared to the p parameters in least-squares fits.
  • The effective number of parameters of k-nearest-neighbors is N/k and is generally bigger than p, and decreases with increasing k.
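
The corresponding k-nearest-neighbor fit, written out as a minimal sketch (mine, not ESL's; the helper name and data are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k):
    """k-nearest-neighbor fit at a query point x0: average the responses of
    the k training points closest to x0 in Euclidean distance."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    neighbors = np.argsort(dists)[:k]        # indices of the k closest points
    return y_train[neighbors].mean()

# Illustrative data
rng = np.random.default_rng(1)
X_train = rng.uniform(-3, 3, size=(200, 2))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=200)

x0 = np.array([0.5, -1.0])
print(knn_predict(X_train, y_train, x0, k=15))
```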

2.3.3 From least squares to nearest neighbors

  • The linear decision boundary from least squares is very smooth, and apparently stable to fit. It does appear to rely heavily on the assumption that a linear decision boundary is appropriate. In language we will develop shortly, it has low variance and potentially high bias.
  • On the other hand, the k-nearest-neighbor procedures do not appear to rely on any stringent assumptions about the underlying data, and can adapt to any situation. However, any particular subregion of the decision boundary depends on a handful of input points and their particular positions, and is thus wiggly and unstable - high variance and low bias.
  • A large subset of the most popular techniques in use today are variants of these two simple procedures (linear models and k-nearest-neighbors).
  • In fact, 1-nearest-neighbor, the simplest of all, captures a large percentage of the market for low-dimensional problems (few input variables).
  • The following list describes some ways in which these simple procedures have been enhanced:
    1. Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the effective 0/1 weights used by k-nearest neighbors.
    2. In high-dimensional spaces the distance kernels are modified to emphasize some variables more than others.
    3. Local regression fits linear models by locally weighted least squares, rather than fitting constants locally.
    4. Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models
    5. Projection pursuit and neural network models consist of sums of nonlinearly transformed linear models.

2.4 Statistical decision theory

  • We first consider the case of a quantitative output, and place ourselves in the world of random variables and probability spaces. Let X denote a real valued random input vector, and Y a real valued random output variable, with joint distribution Pr(X, Y). We seek a function f(X) for predicting Y given values of the input X. This theory requires a loss function L(Y, f(X)) for penalizing errors in prediction, and by far the most common and convenient is squared error loss: L(Y, f(X)) = (Y - f(X))^2. This leads us to a criterion for choosing f:

    \mathrm{EPE}(f) = E\big(Y - f(X)\big)^2 = \int \big[y - f(x)\big]^2 \Pr(dx, dy)

    The solution is

    f(x) = E(Y \mid X = x),

    the conditional mean, also known as the regression function. Thus the best prediction of Y at any point X = x is the conditional mean, when best is measured by average squared error.
  • The nearest-neighbor methods attempt to directly implement this recipe using the training data. At each point x, we settle for

    \hat{f}(x) = \mathrm{Ave}\big(y_i \mid x_i \in N_k(x)\big)

    where "Ave" denotes average, and N_k(x) is the neighborhood containing the k points in the training sample closest to x. Two approximations are happening here:
    1. expectation is approximated by averaging over sample data
    2. conditioning at a point is relaxed to conditioning on some region "close" to the target point (similarly to tree-based models)
      For large training sample size N, the points in the neighborhood are likely to be close to x, and as k gets large the average will get more stable. In fact, under mild regularity conditions on the joint probability distribution Pr(X, Y), one can show that as N, k \to \infty such that k/N \to 0, \hat{f}(x) \to E(Y \mid X = x). In light of this, why look further, since it seems we have a universal approximator? We often do not have very large samples. If the linear or some more structured model is appropriate, then we can usually get a more stable estimate than k-nearest neighbors, although such knowledge has to be learned from the data as well.
  • There are some disastrous problems too: in Section 2.5 we see that as the dimension p gets large, so does the metric size of the k-nearest neighborhood. So settling for nearest-neighborhood averaging as a surrogate for conditioning will fail us miserably. The convergence above still holds, but the rate of convergence decreases as the dimension increases.
  • Both k-nearest neighbors and least squares end up approximating conditional expectations by averages. But they differ dramatically in terms of model assumptions:
    1. Least squares assumes f(x) is well approximated by a globally linear function.
    2. k-nearest neighbors assumes f(x) is well approximated by a locally constant function.
      Although the latter seems more palatable, we have already seen that we may pay a price for this flexibility.
  • The problems of estimating a conditional expectation in high dimensions can be swept away by imposing some (often unrealistic) model assumptions, in this case additivity: f(X) = \sum_{j=1}^{p} f_j(X_j).
  • What if we replace the L_2 loss function with the L_1 loss E|Y - f(X)|? The solution in this case is the conditional median,

    \hat{f}(x) = \mathrm{median}(Y \mid X = x),

    which is a different measure of location, and its estimates are more robust than those for the conditional mean. L_1 criteria have discontinuities in their derivatives, which have hindered their widespread use. Squared error is analytically convenient and the most popular.
  • In the context of classification problems, we often use the zero-one loss function, which charges one unit for every misclassification; minimizing the expected loss then leads to the Bayes classifier (sketched below).
  • The error rate of the Bayes classifier is called the Bayes rate.
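
For the classification bullets above, it may help to write out the zero-one-loss analogue of the squared-error derivation (my own transcription of the standard argument, not a quote from ESL): minimizing the expected prediction error pointwise gives the Bayes classifier.

```latex
\mathrm{EPE} = E_X \sum_{k=1}^{K} L\big(\mathcal{G}_k, \hat{G}(X)\big)\,\Pr(\mathcal{G}_k \mid X)
\quad\Longrightarrow\quad
\hat{G}(x) = \mathcal{G}_k \ \text{ if }\ \Pr(\mathcal{G}_k \mid X = x) = \max_{g}\,\Pr(g \mid X = x).
```

In words: classify to the most probable class given X = x; the error rate of this rule is the Bayes rate mentioned above.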

2.5 Local methods in high dimensions

  • The stable but biased linear model and the less stable but apparently less biased class of k-nearest-neighbor estimates.
  • It would seem that with a reasonably large set of training data, we could always approximate the theoretically optimal conditional expectation by k-nearest-neighbor averaging, since we should be able to find a fairly large neighborhood of observations close to any x and average them. This approach and our intuition break down in high dimensions, and the phenomenon is commonly referred to as the curse of dimensionality (Bellman, 1961).
  • Here are some manifestations of the curse of dimensionality:
    1. Consider the nearest-neighbor procedure for inputs uniformly distributed in a p-dimensional unit hypercube. In order to capture 1% or 10% of the data to form a local average when p = 10, we must cover 63% or 80% of the range of each input variable, since the edge length of a sub-cube containing a fraction r of the data is e_p(r) = r^{1/p}. Such neighborhoods are no longer "local". Reducing the fraction r does not help much either, since the fewer observations we average, the higher is the variance of our fit (a numerical check follows this list).
    2. Another consequence of the sparse sampling in high dimensions is that all sample points are close to an edge of the sample. The reason that this presents a problem is that prediction is much more difficult near the edges of the training sample. One must extrapolate from neighboring sample points rather than interpolate between them.
    3. Another manifestation of the curse is that the sampling density is proportional to N^{1/p}, where p is the dimension of the input space and N is the sample size. In high dimensions all feasible training samples sparsely populate the input space.
    4. Bias-variance decomposition.
    5. The complexity of functions of many variables can grow exponentially with the dimension, and if we wish to be able to estimate such functions with the same accuracy as functions in low dimensions, then we need the size of our training set to grow exponentially as well.
    6. By imposing some heavy restrictions on the class of models being fitted, we have avoided the curse of dimensionality.
    7. By relying on rigid assumptions, the linear model has no bias at all and negligible variance, while the error in 1-nearest neighbor is substantially larger. However, if the assumptions are wrong, all bets are off and the 1-nearest neighbor may dominate.
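
A quick numerical check of the hypercube example in item 1 (my own sketch): the edge length needed to capture a fraction r of uniformly distributed data in p dimensions is e_p(r) = r^(1/p).

```python
import numpy as np

def edge_length(r, p):
    """Edge length of a sub-cube of the unit hypercube in p dimensions
    whose volume is a fraction r of the total."""
    return r ** (1.0 / p)

for p in (1, 3, 10):
    for r in (0.01, 0.10):
        print(f"p={p:2d}, r={r:4.2f}: edge length = {edge_length(r, p):.2f}")
# For p=10 this reproduces the 0.63 and 0.80 figures quoted above.
```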

2.6 Statistical models, supervised learning

  • Our goal is to find a useful approximation \hat{f}(x) to the function f(x) that underlies the predictive relationship between the inputs and outputs.
  • In the theoretical setting of Section 2.4, we saw that squared error loss leads us to the regression function f(x) = E(Y \mid X = x) for a quantitative response.
  • The class of nearest-neighbor methods can be viewed as direct estimates of this conditional expectation, but we have seen that they can fail in at least two ways:
    1. if the dimension of the input space is high, the nearest neighbors need not be close to the target point, and can result in large errors.
    2. if special structure is known to exist, this can be used to reduce both the bias and the variance of the estimates.
      Note: Random forests can be viewed as an adaptive nearest neighbor algorithm (Lin and Jeon, 2006).

2.6.1 A statistical model for the joint distribution

  • Suppose in fact that our data arose from a statistical model

    Y = f(X) + \varepsilon,

    where the random error \varepsilon has E(\varepsilon) = 0 and is independent of X. Note that for this model, f(x) = E(Y \mid X = x), and in fact the conditional distribution \Pr(Y \mid X) depends on X only through the conditional mean f(x).
  • The additive error model is a useful approximation to the truth. For most systems the input-output pairs (X, Y) will not have a deterministic relationship Y = f(X). Generally there will be other unmeasured variables that also contribute to Y, including measurement error. The additive model assumes that we can capture all these departures from a deterministic relationship via the error \varepsilon.
  • The assumption in (2.29) that the errors are independent and identically distributed is not strictly necessary. With such a model it becomes natural to use least squares as a data criterion for model estimation as in (2.1). Simple modifications can be made to avoid the independence assumption; for example, we can have \mathrm{Var}(Y \mid X = x) = \sigma^2(x), and now both the mean and variance depend on X. In general, the conditional distribution \Pr(Y \mid X) can depend on X in complicated ways, but the additive error model precludes these.
  • For qualitative outputs G, additive error models are typically not used; in this case the target function p(X) is the conditional density \Pr(G \mid X), and this is modeled directly.

2.6.2 Supervised learning

  • Suppose for simplicity that the errors are additive and that the model Y = f(X) + \varepsilon is a reasonable assumption. Supervised learning attempts to learn f by example through a teacher.
  • Learning by example

2.6.3 Function approximation

  • Another useful class of approximators can be expressed as linear basis expansions

    f_\theta(x) = \sum_{k=1}^{K} h_k(x)\,\theta_k

    where the h_k are a suitable set of functions or transformations of the input vector x. The h_k could be:
    1. Polynomial expansions
    2. Trigonometric expansions
    3. Nonlinear expansions, such as the sigmoid transformation common to neural network models, h_k(x) = 1/(1 + e^{-x^T\beta_k})
  • We can use least squares to estimate the parameters \theta in f_\theta as we did for the linear model, by minimizing the residual sum-of-squares

    \mathrm{RSS}(\theta) = \sum_{i=1}^{N}\big(y_i - f_\theta(x_i)\big)^2

    as a function of \theta (a small sketch follows this list).
  • While least squares is generally very convenient, it is not the only criterion used and in some cases would not make much sense. A more general principle for estimation is maximum likelihood estimation.
  • Suppose we have a random sample y_i, i = 1, \ldots, N from a density \Pr_\theta(y) indexed by some parameters \theta. The log-probability of the observed sample is

    L(\theta) = \sum_{i=1}^{N}\log\Pr_\theta(y_i).

    The principle of maximum likelihood assumes that the most reasonable values for \theta are those for which the probability of the observed sample is largest.
  • Least squares for the additive error model Y = f_\theta(X) + \varepsilon, with \varepsilon \sim N(0, \sigma^2), is equivalent to maximum likelihood using the conditional likelihood

    \Pr(Y \mid X, \theta) = N\big(f_\theta(X), \sigma^2\big).

    So although the additional assumption of normality seems more restrictive, the results are the same.
  • The log-likelihood of the data is

    L(\theta) = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\big(y_i - f_\theta(x_i)\big)^2,

    and the only term involving \theta is the last, which is \mathrm{RSS}(\theta) up to a scalar negative multiplier.
  • In the case of a multinomial likelihood for the regression function \Pr(G \mid X) for a qualitative output G: suppose we have a model \Pr(G = \mathcal{G}_k \mid X = x) = p_{k,\theta}(x), k = 1, \ldots, K, for the conditional probability of each class given X, indexed by the parameter vector \theta. Then the log-likelihood (also referred to as the cross-entropy) is

    L(\theta) = \sum_{i=1}^{N}\log p_{g_i,\theta}(x_i),

    and when maximized it delivers values of \theta that best conform with the data in this likelihood sense.
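
A minimal sketch (my own, not from ESL) of a linear basis expansion fitted by least squares: a polynomial basis in one input, with the coefficients estimated exactly as in the linear model.

```python
import numpy as np

def poly_basis(x, K):
    """Basis expansion h_k(x) = x**k, k = 0..K (the column of ones is included)."""
    return np.column_stack([x ** k for k in range(K + 1)])

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-2, 2, size=80))
y = np.sin(2 * x) + 0.2 * rng.normal(size=x.size)      # illustrative data

H = poly_basis(x, K=5)                                 # design matrix of basis functions
theta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)      # minimize RSS(theta)
y_hat = H @ theta_hat
print("RSS:", np.sum((y - y_hat) ** 2))
```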

2.7 Structured regression models

  • We have seen that although nearest-neighbor and other local methods focus directly on estimating the function at a point, they face problems in high dimensions. They may also be inappropriate even in low dimensions in cases where more structured approaches can make more efficient use of the data. This section introduces classes of such structured approaches.

2.7.1 Difficulty of the problem

  • Consider the RSS criterion for an arbitrary function f,

    \mathrm{RSS}(f) = \sum_{i=1}^{N}\big(y_i - f(x_i)\big)^2.

    Minimizing (2.37) leads to infinitely many solutions: any function \hat{f} passing through the training points (x_i, y_i) is a solution. If the sample size N were sufficiently large such that repeats were guaranteed and densely arranged, it would seem that these solutions might all tend to the limiting conditional expectation.
  • In order to obtain useful results for finite N, we must restrict the eligible solutions to (2.37) to a smaller set of functions. These restrictions are sometimes encoded via the parametric representation of f_\theta, or may be built into the learning method itself, either implicitly or explicitly.
  • Any restrictions imposed on f that lead to a unique solution to (2.37) do not really remove the ambiguity caused by the multiplicity of solutions. There are infinitely many possible restrictions, each leading to a unique solution, so the ambiguity has simply been transferred to the choice of constraint.
  • The nature of the constraint depends on the metric used. Some methods, such as kernel and local regression and tree-based methods, directly specify the metric and size of the neighborhood. The nearest-neighbor methods discussed so far are based on the assumption that locally the function is constant; close to the target input x_0, the function does not change much, and so close outputs can be averaged to produce \hat{f}(x_0). Other methods such as splines, neural networks and basis-function methods implicitly define neighborhoods of local behavior.
  • One fact should be clear by now. Any method that attempts to produce locally varying functions in small isotropic neighborhoods will run into problems in high dimensions - again the curse of dimensionality. And conversely, all methods that overcome the dimensionality problems have an associated - and often implicit or adaptive - metric for measuring neighborhoods, which basically does not allow the neighborhood to be simultaneously small in all directions.

2.8 Classes of restricted estimators

  • The variety of nonparametric regression techniques or learning methods fall into a number of different classes depending on the nature of the restrictions imposed.
  • Each of the classes has associated with it one or more parameters, sometimes appropriately called smoothing parameters, that control the effective size of the local neighborhood. Here we describe three broad classes.

2.8.1 Roughness penalty and Bayesian methods

  • Here the class of functions is controlled by explicitly penalizing RSS(f) with a roughness penalty

    \mathrm{PRSS}(f; \lambda) = \mathrm{RSS}(f) + \lambda J(f).

    The user-selected functional J(f) will be large for functions f that vary too rapidly over small regions of input space.
  • We discuss roughness-penalty approaches in Chapter 5, and the Bayesian paradigm in Chapter 8.

2.8.2 Kernel methods and local regression

  • These methods can be thought of as explicitly providing estimates of the regression function or conditional expectation by specifying the nature of the local neighborhood, and of the class of regular functions fitted locally. The local neighborhood is specified by a kernel function K_\lambda(x_0, x) which assigns weights to points x in a region around x_0 (see Figure 6.1 on page 192).
  • In general, we can define a local regression estimate of f(x_0) as f_{\hat\theta}(x_0), where \hat\theta minimizes

    \mathrm{RSS}(f_\theta, x_0) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\big(y_i - f_\theta(x_i)\big)^2,

    and f_\theta is some parameterized function, such as a low-order polynomial.
  • Nearest-neighbor methods can be thought of as kernel methods having a more data-dependent metric. Indeed, the metric for k-nearest neighbors is

    K_k(x, x_0) = I\big(\lVert x - x_0\rVert \le \lVert x_{(k)} - x_0\rVert\big),

    where x_{(k)} is the training observation ranked kth in distance from x_0, and I(S) is the indicator of the set S.
  • These methods of course need to be modified in high dimensions, to avoid the curse of dimensionality. Various adaptations are discussed in Chapter 6.

2.8.3 Basis functions and dictionary methods

  • This class of methods includes the familiar linear and polynomial expansions, but more importantly a wide variety of more flexible models. The model for f is a linear expansion of basis functions

    f_\theta(x) = \sum_{m=1}^{M}\theta_m h_m(x),

    where each of the h_m is a function of the input x, and the term linear here refers to the action of the parameters \theta.
  • For one-dimensional x, polynomial splines of degree K can be represented by an appropriate sequence of M spline basis functions, determined in turn by M - K knots. The parameter M controls the degree of the polynomial or the number of knots in the case of splines.
  • Radial basis functions are symmetric p-dimensional kernels located at particular centroids,

    f_\theta(x) = \sum_{m=1}^{M} K_{\lambda_m}(\mu_m, x)\,\theta_m;

    for example, the Gaussian kernel K_\lambda(\mu, x) = e^{-\lVert x-\mu\rVert^2/(2\lambda)} is popular. Radial basis functions have centroids \mu_m and scales \lambda_m that have to be determined.
  • A single-layer feed-forward neural network model with linear output weights can be thought of as an adaptive basis function method.
  • These adaptively chosen basis function methods are also known as dictionary methods.

2.9 Model selection and the bias-variance tradeoff

  • All the models described above, and many others discussed in later chapters, have a smoothing or complexity parameter that has to be determined:
    1. the multiplier \lambda of the penalty term
    2. the width of the kernel
    3. or the number M of basis functions.
  • We cannot use residual sum-of-squares on the training data to determine these parameters either, since we would always pick those that gave interpolating fits and hence zero residuals. Such a model is unlikely to predict future data well at all.
  • Irreducible error, a bias component and a variance component.
  • Bias-variance tradeoff
  • For k-nearest neighbors, the model complexity is controlled by k.
  • In Chapter 7, we discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity for a given prediction method and training set.

Chapter 3 - Linear methods for regression

3.3 Subset selection

3.4 Shrinkage methods

3.8 More on the Lasso and related path algorithms

3.8.5 Further properties of the Lasso

3.8.6 Pathwise coordinate optimization


Lecture 3 - Model evaluation and bias-variance tradeoff

Chapter 7 - Model assessment and selection

7.1 Introduction

  • The generalization performance of a learning method relates to its prediction capability on independent test data.
  • We start this chapter with a discussion of the interplay between bias, variance and model complexity.

7.2 Bias, Variance and Model Complexity

  • We have a target variable Y, a vector of inputs X, and a prediction model \hat{f}(X) that has been estimated from a training sample. The loss function for measuring errors between Y and \hat{f}(X) is denoted by L(Y, \hat{f}(X)). Typical choices are:
    1. L(Y, \hat{f}(X)) = (Y - \hat{f}(X))^2 - squared error.
    2. L(Y, \hat{f}(X)) = |Y - \hat{f}(X)| - absolute error.
  • Test error, also referred to as generalization error, is the prediction error Err_T = E[L(Y, \hat{f}(X)) | T] over an independent test sample, where both X and Y are drawn from their joint distribution (population) and T denotes the training sample. The expected test error Err = E[Err_T] additionally averages over anything that is random, including the randomness in the training sample that produced \hat{f}.
  • Training error is the average loss over the training sample, \overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat{f}(x_i)). Unfortunately training error is not a good estimate of the test error. Training error consistently decreases with model complexity, but an overly complex model is prone to overfitting and will typically generalize poorly. In between there is an optimal model complexity that gives minimum test error.
  • For a qualitative or categorical response G taking one of K values in a set \mathcal{G}: typically we model the probabilities p_k(X) = \Pr(G = k \mid X), and then \hat{G}(X) = \arg\max_k \hat{p}_k(X). Typical loss functions are:
    1. L(G, \hat{G}(X)) = I(G \ne \hat{G}(X)) - 0-1 loss
    2. L(G, \hat{p}(X)) = -2\sum_{k=1}^{K} I(G = k)\log\hat{p}_k(X) = -2\log\hat{p}_G(X) - log-likelihood
      Note: The log-likelihood is sometimes referred to as cross-entropy loss or deviance
  • Again, test error is given by \mathrm{Err}_T = E[L(G, \hat{G}(X)) \mid T], the expected misclassification rate, or \mathrm{Err}_T = E[L(G, \hat{p}(X)) \mid T].
  • Training error is the sample analogue, for example \overline{\mathrm{err}} = -\frac{2}{N}\sum_{i=1}^{N}\log\hat{p}_{g_i}(x_i).
  • The log-likelihood can be used as a loss function for general response densities. If \Pr_{\theta(X)}(Y) is the density of Y, indexed by a parameter \theta(X) that depends on the predictor X, then

    L(Y, \theta(X)) = -2\log\Pr_{\theta(X)}(Y).

    The "2" in the definition makes the log-likelihood loss for the Gaussian distribution match squared-error loss.
  • Typically our model will have one or more tuning parameters \alpha that vary its complexity, so we can write our predictions as \hat{f}_\alpha(x), and we wish to find the value of \alpha that minimizes the estimated test error.
  • Two separate goals:
    1. Model selection: estimating the performance of different models in order to choose the (approximate) best one.
    2. Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.
  • If we are in a data-rich situation, the best approach for both problems above is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model.
  • It is difficult to give a general rule on how to choose the number of observations in the training, validation and test parts, as this depends on the signal-to-noise ratio in the data and the training sample size. A typical split might be 50% for training, and 25% each for validation and testing.
  • The methods in this chapter are designed for situations where there is insufficient data to split it into three parts.

7.3 The Bias-Variance Decomposition

  • As in Chapter 2, if we assume that Y = f(X) + \varepsilon where E(\varepsilon) = 0 and \mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2, we can derive an expression for the expected prediction error of a regression fit \hat{f}(X) at an input point X = x_0, using squared-error loss:

    \mathrm{Err}(x_0) = E\big[(Y - \hat{f}(x_0))^2 \mid X = x_0\big]
                      = \sigma_\varepsilon^2 + \big[E\hat{f}(x_0) - f(x_0)\big]^2 + E\big[\hat{f}(x_0) - E\hat{f}(x_0)\big]^2
                      = \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}.

    The first term is the variance of the target around its true mean f(x_0), and cannot be avoided no matter how well we estimate f(x_0), unless \sigma_\varepsilon^2 = 0. The second term is the squared bias, the amount by which the average of our estimate differs from the true mean; the last term is the variance, the expected squared deviation of \hat{f}(x_0) around its mean. Typically, the more complex we make the model \hat{f}, the lower the (squared) bias but the higher the variance.
  • For a linear model family such as ridge regression, we can break down the bias more finely into average squared model bias and the average squared estimation bias.
  • The overall point is that the bias-variance tradeoff behaves differently for 0-1 loss (classification) than it does for squared error loss (regression). This in turn means that the best choices of tuning parameters may differ substantially in the two settings. One should base the choice of tuning parameter on an estimate of prediction error, as described in the following sections.

7.4 Optimism of the Training Error Rate

  • Typically, the training error rate

    \overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat{f}(x_i)\big)

    will be less than the true error Err_T, because the same data is being used to fit the method and assess its error.
  • Err_T is a kind of extra-sample error, since the test feature vectors don't need to coincide with the training feature vectors. The nature of the optimism in \overline{\mathrm{err}} is easiest to understand when we focus not on Err_T but on the in-sample error

    \mathrm{Err_{in}} = \frac{1}{N}\sum_{i=1}^{N} E_{Y^0}\big[L(Y_i^0, \hat{f}(x_i)) \mid T\big].

    The Y^0 notation indicates that we observe N new response values at each of the training points x_i, i = 1, 2, \ldots, N. We define the optimism as the difference between Err_in and the training error \overline{\mathrm{err}}, and the average optimism as its expectation over the training responses:

    \mathrm{op} \equiv \mathrm{Err_{in}} - \overline{\mathrm{err}}, \qquad \omega \equiv E_{\mathbf{y}}(\mathrm{op}).

    The optimism is typically positive since \overline{\mathrm{err}} is usually biased downward as an estimate of prediction error.
  • For squared error, 0-1, and other loss functions, one can show quite generally that

    \omega = \frac{2}{N}\sum_{i=1}^{N}\mathrm{Cov}(\hat{y}_i, y_i),

    where Cov indicates covariance. The harder we fit the data (namely, the more strongly y_i affects its own prediction \hat{y}_i), the greater \mathrm{Cov}(\hat{y}_i, y_i) will be, thereby increasing the optimism.
  • In summary, we have the important relation:

    E_{\mathbf{y}}(\mathrm{Err_{in}}) = E_{\mathbf{y}}(\overline{\mathrm{err}}) + \frac{2}{N}\sum_{i=1}^{N}\mathrm{Cov}(\hat{y}_i, y_i).

    This expression simplifies if \hat{y}_i is obtained by a linear fit with d inputs or basis functions. For example,

    \sum_{i=1}^{N}\mathrm{Cov}(\hat{y}_i, y_i) = d\,\sigma_\varepsilon^2

    for the additive error model Y = f(X) + \varepsilon, and so

    E_{\mathbf{y}}(\mathrm{Err_{in}}) = E_{\mathbf{y}}(\overline{\mathrm{err}}) + 2\cdot\frac{d}{N}\,\sigma_\varepsilon^2.
  • The optimism increases linearly with the number d of inputs or basis functions we use, but decreases as the training sample size N increases.
  • An obvious way to estimate prediction error is to estimate the optimism and then add it to the training error rate \overline{\mathrm{err}}.
  • In contrast, the cross-validation and bootstrap methods are direct estimates of the extra-sample (true) error. These general tools can be used with any loss function, and with nonlinear, adaptive fitting techniques.
  • In-sample error is not usually of direct interest since future values of the features are not likely to coincide with their training set values.

7.5 Estimates of In-Sample Prediction Error

  • The general form of the in-sample estimates is

    \widehat{\mathrm{Err_{in}}} = \overline{\mathrm{err}} + \hat\omega,

    where \hat\omega is an estimate of the average optimism.
  • C_p statistic: for a linear fit with d parameters under squared-error loss, C_p = \overline{\mathrm{err}} + 2\cdot\frac{d}{N}\,\hat\sigma_\varepsilon^2, where \hat\sigma_\varepsilon^2 is an estimate of the noise variance, obtained from the mean squared error of a low-bias model.
  • The Akaike information criterion (AIC) is a similar but more generally applicable estimate of Err_in when a log-likelihood loss function is used. It relies on the asymptotic relationship (as N \to \infty)

    -2\,E\big[\log\Pr_{\hat\theta}(Y)\big] \approx -\frac{2}{N}\,E[\mathrm{loglik}] + 2\cdot\frac{d}{N}.

    Here \Pr_\theta(Y) is a family of densities for Y (containing the "true" density), \hat\theta is the maximum-likelihood estimate of \theta, and "loglik" is the maximized log-likelihood \sum_{i=1}^{N}\log\Pr_{\hat\theta}(y_i).
  • Given a set of models f_\alpha(x) indexed by a tuning parameter \alpha, denote by \overline{\mathrm{err}}(\alpha) the training error and d(\alpha) the number of parameters for each model. Then for this set of models we define

    \mathrm{AIC}(\alpha) = \overline{\mathrm{err}}(\alpha) + 2\cdot\frac{d(\alpha)}{N}\,\hat\sigma_\varepsilon^2.

    The function AIC(\alpha) provides an estimate of the test error curve, and we find the tuning parameter \hat\alpha that minimizes it. Our final chosen model is f_{\hat\alpha}(x). For nonlinear and other complex models, we need to replace d by some measure of model complexity (see the sketch after this list).
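
A small illustration (my own sketch, not from ESL) of the in-sample estimates above for least-squares fits of increasing complexity: training error plus the optimism correction 2(d/N)·sigma^2.

```python
import numpy as np

def cp_statistic(X, y, sigma2_hat):
    """C_p / AIC-style in-sample error estimate for a least-squares fit:
    training error plus the optimism correction 2*(d/N)*sigma2_hat."""
    N, d = X.shape                                    # d columns = number of fitted parameters
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    err_bar = np.mean((y - X @ beta_hat) ** 2)        # training error under squared loss
    return err_bar + 2.0 * d / N * sigma2_hat

rng = np.random.default_rng(3)
N = 200
x = rng.uniform(-1, 1, size=N)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=N)
sigma2_hat = 0.25                                     # noise variance assumed known here

for d in range(1, 6):                                 # polynomial models of increasing complexity
    X = np.column_stack([x ** k for k in range(d)])
    print(f"d={d}: Cp = {cp_statistic(X, y, sigma2_hat):.4f}")
```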

7.6 The Effective Number of Parameters

  • Suppose we stack the outcomes y_1, y_2, \ldots, y_N into a vector \mathbf{y}, and similarly for the predictions \hat{\mathbf{y}}. Then a linear fitting method is one for which we can write

    \hat{\mathbf{y}} = \mathbf{S}\mathbf{y},

    where \mathbf{S} is an N \times N matrix depending on the input vectors x_i but not on the y_i. Then the effective number of parameters is defined as

    \mathrm{df}(\mathbf{S}) = \mathrm{trace}(\mathbf{S}),

    the sum of the diagonal elements of \mathbf{S} (the trace of the matrix). Note that if \mathbf{S} is an orthogonal-projection matrix onto a basis set spanned by M features, then \mathrm{trace}(\mathbf{S}) = M. It turns out that \mathrm{trace}(\mathbf{S}) is exactly the correct quantity to replace d as the number of parameters in the C_p statistic.
  • For models like neural networks, in which we minimize an error function R(\omega) with weight decay penalty (regularization) \alpha\sum_m \omega_m^2, the effective number of parameters has the form

    \mathrm{df}(\alpha) = \sum_{m=1}^{M}\frac{\theta_m}{\theta_m + \alpha},

    where the \theta_m are the eigenvalues of the Hessian matrix of R(\omega). This expression follows from \mathrm{df} = \mathrm{trace}(\mathbf{S}) if we make a quadratic approximation to the error function at the solution (Bishop, 1995).
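
A sketch (mine, not ESL's) of the effective number of parameters for a concrete linear smoother, ridge regression, where the smoother matrix is S = X(X^T X + lambda*I)^{-1} X^T and df = trace(S):

```python
import numpy as np

def ridge_effective_df(X, lam):
    """Effective number of parameters df = trace(S) for the ridge smoother
    S = X (X^T X + lam * I)^{-1} X^T."""
    p = X.shape[1]
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return np.trace(S)

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))

for lam in (0.0, 1.0, 10.0, 100.0):
    print(f"lambda={lam:6.1f}: df = {ridge_effective_df(X, lam):.2f}")
# lambda = 0 gives df = p = 10 (an orthogonal projection); df shrinks as lambda grows.
```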

7.7 The Bayesian Approach and BIC

  • The Bayesian information criterion (BIC), like AIC, is applicable in settings where the fitting is carried out by maximization of a log-likelihood. The generic form of BIC is

    \mathrm{BIC} = -2\,\mathrm{loglik} + (\log N)\,d.

    The BIC statistic (times 1/2) is also known as the Schwarz criterion (Schwarz, 1978).
    Under the Gaussian model, assuming the variance \sigma_\varepsilon^2 is known, -2\,\mathrm{loglik} equals (up to a constant) \sum_i (y_i - \hat{f}(x_i))^2/\sigma_\varepsilon^2, which is N\,\overline{\mathrm{err}}/\sigma_\varepsilon^2 for squared error loss. Hence we can write

    \mathrm{BIC} = \frac{N}{\sigma_\varepsilon^2}\Big[\overline{\mathrm{err}} + (\log N)\cdot\frac{d}{N}\,\sigma_\varepsilon^2\Big].

    Therefore BIC is proportional to AIC, with the factor 2 replaced by \log N. Assuming N > e^2 \approx 7.4, BIC tends to penalize complex models more heavily, giving preference to simpler models in selection.
  • Define the posterior probability of a given model M_m as \Pr(M_m \mid \mathbf{Z}), where \mathbf{Z} represents the training data \{x_i, y_i\}_1^N. To compare two models M_m and M_l, we form the posterior odds

    \frac{\Pr(M_m \mid \mathbf{Z})}{\Pr(M_l \mid \mathbf{Z})} = \frac{\Pr(M_m)}{\Pr(M_l)}\cdot\frac{\Pr(\mathbf{Z} \mid M_m)}{\Pr(\mathbf{Z} \mid M_l)}.

    If the odds are greater than one we choose model M_m, otherwise we choose model M_l. The rightmost quantity

    \mathrm{BF}(\mathbf{Z}) = \frac{\Pr(\mathbf{Z} \mid M_m)}{\Pr(\mathbf{Z} \mid M_l)}

    is called the Bayes factor, the contribution of the data toward the posterior odds.
  • The posterior probability of a given model M_m is

    \Pr(M_m \mid \mathbf{Z}) \propto \Pr(M_m)\cdot\Pr(\mathbf{Z} \mid M_m).

    Typically, we assume that the prior over models is uniform, so that \Pr(M_m) is constant. We need some way of approximating \Pr(\mathbf{Z} \mid M_m). A so-called Laplace approximation to the integral followed by some other simplifications (Ripley, 1996, page 64) gives

    \log\Pr(\mathbf{Z} \mid M_m) = \log\Pr(\mathbf{Z} \mid \hat\theta_m, M_m) - \frac{d_m}{2}\log N + O(1).

    Here \hat\theta_m is a maximum likelihood estimate and d_m is the number of free parameters in model M_m. If we define our loss function to be

    -2\log\Pr(\mathbf{Z} \mid \hat\theta_m, M_m),

    this is equivalent to the BIC criterion above.
    Therefore, choosing the model with minimum BIC is equivalent to choosing the model with largest (approximate) posterior probability. But this framework gives us more. If we compute the BIC criterion for a set of M models, giving \mathrm{BIC}_m, m = 1, \ldots, M, then we can estimate the posterior probability of each model M_m as (see the small sketch after this list)

    \frac{e^{-\frac{1}{2}\mathrm{BIC}_m}}{\sum_{l=1}^{M} e^{-\frac{1}{2}\mathrm{BIC}_l}}.

    Thus we can estimate not only the best model, but also assess the relative merits of the models considered.
  • For model selection purposes, there is no clear choice between AIC and BIC. BIC is asymptotically consistent as a selection criterion: given a family of models that includes the true model, the probability that BIC selects the correct model approaches one as the sample size N \to \infty. This is not the case for AIC, which tends to choose models that are too complex as N \to \infty. On the other hand, for finite samples, BIC often chooses models that are too simple, because of its heavy penalty on complexity.
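
A tiny sketch (my own) of the posterior-probability point above: converting BIC values for a set of candidate models into approximate posterior model probabilities.

```python
import numpy as np

def bic_posterior_probs(bic_values):
    """Approximate posterior model probabilities exp(-BIC_m/2) / sum_l exp(-BIC_l/2),
    computed stably by subtracting the minimum BIC first."""
    b = np.asarray(bic_values, dtype=float)
    w = np.exp(-0.5 * (b - b.min()))
    return w / w.sum()

# Illustrative BIC values for three candidate models
print(bic_posterior_probs([1012.3, 1009.8, 1021.5]))
```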

7.8 Minimum Description Length

  • The minimum description length (MDL) approach gives a selection criterion formally identical to the BIC approach, but is motivated from an optimal coding viewpoint.
  • In general, if messages are sent with probabilities \Pr(z_i), i = 1, 2, \ldots, a famous theorem due to Shannon says we should use code lengths l_i = -\log_2\Pr(z_i), and the average message length satisfies

    E(\text{length}) \ge -\sum_i \Pr(z_i)\log_2\Pr(z_i).

    The right-hand side above is also called the entropy of the distribution \Pr(z_i). The inequality is an equality when the probabilities satisfy \Pr(z_i) = 2^{-l_i}. In general, the lower bound cannot be achieved, but procedures like the Huffman coding scheme can get close to the bound.
    From this result we glean the following:
    To transmit a random variable z having probability density function \Pr(z), we require about -\log_2\Pr(z) bits of information.
  • Now we apply this result to the problem of model selection. We have a model M with parameters \theta, and data \mathbf{Z} consisting of both inputs and outputs. Let the (conditional) probability of the outputs under the model be \Pr(\mathbf{y} \mid \theta, M, \mathbf{X}), assume the receiver knows all of the inputs, and we wish to transmit the outputs. Then the message length required to transmit the outputs is

    \text{length} = -\log\Pr(\mathbf{y} \mid \theta, M, \mathbf{X}) - \log\Pr(\theta \mid M),

    the log-probability of the target values given the inputs. The first term is the average code length for transmitting the discrepancy between the model and the actual target values, while the second term is the average code length for transmitting the model parameters \theta.
  • The MDL principle says that we should choose the model that minimizes the length equation above. We recognize the length equation as the (negative) log-posterior distribution, and hence minimizing description length is equivalent to maximizing posterior probability. Hence, the BIC criterion, derived as approximation to log-posterior probability, can also be viewed as a device for (approximate) model choice by minimum description length.
  • The preceding view of MDL for model selection says that we should choose the model with highest posterior probability. However, many Bayesians would instead do inference by sampling from the posterior distribution.

7.9 Vapnik-Chervonenkis Dimension

  • A difficulty in using estimates of in-sample error is the need to specify the number of parameters (or the complexity) used in the fit.
  • Although the effective number of parameters introduced in Section 7.6 is useful for some nonlinear models, it is not fully general.
  • The Vapnik-Chervonenkis (VC) theory provides such a general measure of complexity, and gives associated bounds on the optimism.
  • The VC dimension is a way of measuring the complexity of a class of functions by assessing how wiggly its members can be:
    The VC dimension of the class \{f(x, \alpha)\} is defined to be the largest number of points (in some configuration) that can be shattered by members of \{f(x, \alpha)\}.
  • A set of points is said to be shattered by a class of functions if, no matter how we assign a binary label to each point, a member of the class can perfectly separate them.
  • In general, a linear indicator function (that is, one taking the values 0 or 1) in p dimensions has VC dimension p + 1, which is also the number of free parameters. On the other hand, it can be shown that the family \sin(\alpha x) has infinite VC dimension.
  • One can use the VC dimension in constructing an estimate of in-sample prediction error.
  • If we fit N training points using a class of functions \{f(x, \alpha)\} having VC dimension h, then with probability at least 1 - \eta over training sets, the test error is bounded above by the training error plus a complexity term that grows with h and shrinks with N (the explicit bound is given in ESL).

    The bound suggests that the optimism increases with h and decreases with N, in qualitative agreement with the AIC correction d/N.
  • Vapnik's structural risk minimization (SRM) approach fits a nested sequence of models of increasing VC dimensions h_1 < h_2 < \cdots, and then chooses the model with the smallest value of the upper bound.
  • The main drawback of this approach is the difficulty in calculating the VC dimension of a class of functions.

7.10 Cross-Validation

  • Probably the simplest and most widely used method for estimating prediction error is cross-validation. This method directly estimates the extra-sample error Err = E[L(Y, \hat{f}(X))], the generalization error when the method is applied to an independent test sample from the joint distribution of X and Y.
  • Given a set of models f(x, \alpha) indexed by a tuning parameter \alpha, denote by \hat{f}^{-k}(x, \alpha) the \alphath model fit with the kth part of the data removed. Then for this set of models we define

    \mathrm{CV}(\hat{f}, \alpha) = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat{f}^{-\kappa(i)}(x_i, \alpha)\big),

    where \kappa(i) indexes the fold to which observation i belongs. The function CV(\hat{f}, \alpha) provides an estimate of the test error curve, and we find the tuning parameter \hat\alpha that minimizes it. Our final chosen model is f(x, \hat\alpha), which we then fit to all the data (a K-fold sketch follows this list).
  • To summarize: on the one hand, leave-one-out cross-validation has low bias but can have high variance; on the other hand, with K = 5 say, CV has lower variance, but bias could be a problem. Overall, five- or tenfold cross-validation is recommended as a good compromise.
  • Generalized cross-validation provides a convenient approximation to leave-one-out cross-validation, for linear fitting under squared-error loss.
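
A minimal K-fold cross-validation loop (my own sketch; the model is plain least squares on a polynomial basis, purely for illustration):

```python
import numpy as np

def kfold_cv_mse(X, y, K=5, seed=0):
    """Estimate test error by K-fold cross-validation for a least-squares fit."""
    N = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = rng.permutation(N) % K                 # kappa(i): fold assignment of each observation
    errors = np.empty(N)
    for k in range(K):
        train, test = folds != k, folds == k
        beta = np.linalg.lstsq(X[train], y[train], rcond=None)[0]
        errors[test] = (y[test] - X[test] @ beta) ** 2
    return errors.mean()

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=150)
y = np.sin(3 * x) + 0.3 * rng.normal(size=x.size)

for degree in (1, 3, 5, 9):                        # candidate models indexed by a tuning parameter
    X = np.column_stack([x ** k for k in range(degree + 1)])
    print(f"degree={degree}: CV estimate = {kfold_cv_mse(X, y):.4f}")
```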

7.11 Bootstrap Methods

  • Given the training data Z = (z_1, z_2, \ldots, z_N) with z_i = (x_i, y_i), the basic idea of the bootstrap is to randomly draw datasets with replacement from the training data, each sample the same size as the original training set. This is done B times, producing B bootstrap datasets.
  • How can we apply the bootstrap to estimate prediction error? One approach would be to fit the model in question on a set of bootstrap samples, and then keep track of how well it predicts the original training set. If \hat{f}^{*b}(x_i) is the predicted value at x_i from the model fitted to the bth bootstrap dataset, our estimate is

    \widehat{\mathrm{Err}}_{\mathrm{boot}} = \frac{1}{B}\,\frac{1}{N}\sum_{b=1}^{B}\sum_{i=1}^{N} L\big(y_i, \hat{f}^{*b}(x_i)\big).
  • However, it is easy to see that \widehat{\mathrm{Err}}_{\mathrm{boot}} does not provide a good estimate in general. The reason is that the bootstrap datasets are acting as the training samples, while the original training set is acting as the test sample, and these two samples have observations in common. This overlap can make overfit predictions look unrealistically good, and is the reason that cross-validation explicitly uses non-overlapping data for the training and test samples.
  • By mimicking cross-validation, a better bootstrap estimate can be obtained. For each observation, we only keep track of predictions from bootstrap samples not containing that observation. The leave-one-out bootstrap estimate of prediction error is defined by

    \widehat{\mathrm{Err}}^{(1)} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{|C^{-i}|}\sum_{b \in C^{-i}} L\big(y_i, \hat{f}^{*b}(x_i)\big).

    Here C^{-i} is the set of indices of the bootstrap samples b that do not contain observation i, and |C^{-i}| is the number of such samples.
  • If the learning curve has considerable slope at sample size N, the leave-one-out bootstrap will be biased upward as an estimate of the true error. The ".632 estimator" is designed to alleviate this bias. It is defined by

    \widehat{\mathrm{Err}}^{(.632)} = .368\,\overline{\mathrm{err}} + .632\,\widehat{\mathrm{Err}}^{(1)}.

    The derivation of the .632 estimator is complex; intuitively it pulls the leave-one-out bootstrap estimate down toward the training error rate, and hence reduces its upward bias. The constant .632 is the probability (approximately 1 - e^{-1}) that a given observation appears in a bootstrap sample; details can be found in the textbook (a bootstrap sketch follows this list).
  • The .632 estimator works well in "light fitting" situations, but can break down in overfit ones. One can improve the .632 estimator by taking into account the amount of overfitting; this gives the ".632+" estimator

    \widehat{\mathrm{Err}}^{(.632+)} = (1 - \hat{w})\,\overline{\mathrm{err}} + \hat{w}\,\widehat{\mathrm{Err}}^{(1)},

    where the weight \hat{w} \ge .632 depends on an estimate of the relative overfitting rate. The derivation of the equation above is complicated: roughly speaking, it produces a compromise between the leave-one-out bootstrap and the training error rate that depends on the amount of overfitting.
  • On average, the AIC criterion overestimated prediction error of its chosen model by 38%, 37%, 51% and 30%, respectively, over the four simulation scenarios, with BIC performing similarly. In contrast, cross-validation overestimated the error by 1%, 4%, 0%, and 4%, with the bootstrap doing about the same. Hence the extra work involved in computing a cross-validation or bootstrap measure is worthwhile if an accurate estimate of test error is required. With other fitting methods like trees, cross-validation and the bootstrap can underestimate the true error by 10%, because the search for the best tree is strongly affected by the validation set. In these situations only a separate test set will provide an unbiased estimate of test error.
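
A sketch (my own) of the leave-one-out bootstrap and the .632 estimator for a least-squares fit under squared-error loss, to make the bookkeeping over the sets C^{-i} concrete:

```python
import numpy as np

def loo_bootstrap_error(X, y, B=200, seed=0):
    """Leave-one-out bootstrap estimate of prediction error for a least-squares fit,
    together with the .632 estimator (squared-error loss)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    loss_sums = np.zeros(N)   # accumulated loss for each i over bootstrap fits not containing i
    counts = np.zeros(N)      # |C^{-i}|
    for _ in range(B):
        idx = rng.integers(0, N, size=N)                  # bootstrap sample (with replacement)
        beta = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        out = np.setdiff1d(np.arange(N), idx)             # observations not in this sample
        loss_sums[out] += (y[out] - X[out] @ beta) ** 2
        counts[out] += 1
    keep = counts > 0
    err_loo = np.mean(loss_sums[keep] / counts[keep])     # leave-one-out bootstrap estimate
    beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
    err_bar = np.mean((y - X @ beta_full) ** 2)           # training error
    err_632 = 0.368 * err_bar + 0.632 * err_loo           # .632 estimator
    return err_loo, err_632

rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, size=100)
X = np.column_stack([np.ones_like(x), x, x ** 2])
y = 1 + x - 0.5 * x ** 2 + 0.3 * rng.normal(size=x.size)
print(loo_bootstrap_error(X, y))
```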

7.12 Conditional or Expected Test Error?


Chapter 8 - Model inference and averaging

8.1 Introduction

  • For most of this book, the fitting (learning) of models has been achieved by minimizing
    1. a sum of squares for regression
    2. or cross-entropy for classification.
  • In fact, both of these minimizations are instances of the maximum likelihood approach to fitting.

8.2 The Bootstrap and Maximum Likelihood Methods

8.2.1 A smoothing example

  • The bootstrap method provides a direct computational way of assessing uncertainty, by sampling from the training data.
  • There is actually a close connection between the least squares estimates (mean + covariance), the bootstrap, and maximum likelihood. Suppose we further assume that the model errors are Gaussian,

    Y = \mu(X) + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2).

    The bootstrap method described above, in which we sample with replacement from the training data, is called the nonparametric bootstrap. This really means that the method is "model-free", since it uses the raw data, not a specific parametric model, to generate new datasets.
  • Consider a variation of the bootstrap, called the parametric bootstrap, in which we simulate new responses by adding Gaussian noise to the predicted values:

    y_i^* = \hat\mu(x_i) + \varepsilon_i^*, \qquad \varepsilon_i^* \sim N(0, \hat\sigma^2), \qquad i = 1, 2, \ldots, N.

    The confidence intervals from this method will exactly equal the least squares bands in the top right panel of Figure 8.2, as the number of bootstrap samples goes to infinity.

8.2.2 Maximum likelihood inference

It turns out that the parametric bootstrap agrees with least squares in the previous example because the model (8.5) has additive Gaussian errors. In general, the parametric bootstrap agrees not with least squares but with maximum likelihood, which we now review.

8.2.3 Bootstrap versus Maximum Likelihood

  • In essence, the bootstrap is a computer implementation of nonparametric or parametric maximum likelihood.
  • The advantage of the bootstrap over the maximum likelihood formula is that it allows us to compute maximum likelihood estimates of standard errors and other quantities in settings where no formulas are available.

8.3 Bayesian Methods

  • A noninformative prior for \theta.
  • In Gaussian models, maximum likelihood and parametric bootstrap analyses tend to agree with Bayesian analyses that use a noninformative prior for the free parameters. This correspondence also extends to the nonparametric case, where the nonparametric bootstrap approximates a noninformative Bayes analysis.

8.4 Relationship between the Bootstrap and Bayesian Inference

  • There are three ingredients that make a correspondence between the parametric bootstrap and Bayes inference work:
    1. The choice of noninformative prior for \theta
    2. The dependence of the log-likelihood \ell(\theta; \mathbf{Z}) on the data \mathbf{Z} only through the maximum likelihood estimate \hat\theta. Hence we can write the log-likelihood as \ell(\theta; \hat\theta)
    3. The symmetry of the log-likelihood in \theta and \hat\theta, that is, \ell(\theta; \hat\theta) = \ell(\hat\theta; \theta)
  • Properties (2) and (3) essentially only hold for the Gaussian distribution. However, they also hold approximately for the multinomial distribution, leading to a correspondence between the nonparametric bootstrap and Bayes inference, which we outline next.
  • In this sense, the bootstrap distribution represents an (approximate) nonparametric, noninformative posterior distribution for our parameter.

8.5 The EM (Expectation Maximization) Algorithm (for solving e.g., clustering problem)

  • The EM algorithm is a popular tool for simplifying difficult maximum likelihood problems.

8.5.1 Two-component mixture model

  • In this section, we describe a simple mixture model for density estimation, and the associated EM algorithm for carrying out maximum likelihood estimation.

8.5.2 The EM algorithm in general

  • In certain classes of problems, maximizing the likelihood is difficult, but made easier by enlarging the sample with latent (unobserved) data. This is called data augmentation. Here the latent data are the model memberships \Delta_i of the mixture model in Section 8.5.1. In other problems, the latent data are actual data that should have been observed but are missing.

8.5.3 EM as a Maximization-Maximization procedure

  • Here is a different view of the EM procedure, as a joint maximization algorithm.

8.6 MCMC for Sampling from the Posterior

  • Having defined a Bayesian model, one would like to draw samples from the resulting posterior distribution, in order to make inferences about the parameters of the posterior distribution.
  • Markov chain Monte Carlo (MCMC) is an approach to posterior sampling.
  • We will see that Gibbs sampling, an MCMC procedure, is closely related to the EM algorithm: the main difference is that it samples from the conditional distributions rather than maximizing over them.

8.7 Bagging

  • Earlier we introduced the bootstrap as a way of assessing the accuracy of a parameter estimate or a prediction.
  • Here we show how to use the bootstrap to improve the estimate or prediction itself.
  • In Section 8.4 we found that the bootstrap mean is approximately a posterior average in Bayes approaches.
  • Consider first the regression problem. Suppose we fit a model to our training data Z = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}, obtaining the prediction \hat{f}(x) at input x. Bootstrap aggregation or bagging averages this prediction over a collection of bootstrap samples, thereby reducing its variance. For each bootstrap sample Z^{*b}, b = 1, 2, \ldots, B, we fit our model, giving prediction \hat{f}^{*b}(x). The bagging estimate is defined by

    \hat{f}_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}^{*b}(x).

    Denote by \hat{\mathcal{P}} the empirical distribution putting equal probability 1/N on each of the data points (x_i, y_i). In fact the "true" bagging estimate is defined by E_{\hat{\mathcal{P}}}\,\hat{f}^*(x), where Z^* = \{(x_1^*, y_1^*), \ldots, (x_N^*, y_N^*)\} and each (x_i^*, y_i^*) \sim \hat{\mathcal{P}}. Expression (8.51) is a Monte Carlo estimate of the true bagging estimate, approaching it as B \to \infty.
  • The bagged estimate (8.51) will differ from the original estimate \hat{f}(x) only when the latter is a nonlinear or adaptive function of the data.
  • Bagging requires sampling with replacement, while the bootstrap in general can be performed with or without replacement (a bagging sketch follows this list).
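
A minimal sketch of bagging regression trees (my own illustration using scikit-learn's DecisionTreeRegressor as the unstable base learner; any high-variance fit would do):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees_predict(X, y, X_test, B=50, seed=0):
    """Bagging: average the predictions of B trees, each fit to a bootstrap sample."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    preds = np.zeros((B, X_test.shape[0]))
    for b in range(B):
        idx = rng.integers(0, N, size=N)          # bootstrap sample Z^{*b} (with replacement)
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        preds[b] = tree.predict(X_test)
    return preds.mean(axis=0)                      # f_bag(x) = average over the B fits

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=300)
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(bagged_trees_predict(X, y, X_test))
```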

8.7.1 Example: Trees with simulated data

  • Bagging can dramatically reduce the variance of unstable procedures like trees, leading to improved prediction. A simple argument shows why bagging helps under squared-error loss: in short, because averaging reduces variance and leaves bias unchanged.
  • In Chapter 15, we see how random forests improve on bagging by reducing the correlation between the sampled trees.

8.8 Model Averaging and Stacking

8.9 Stochastic Search: Bumping

  • Bumping is a technique that uses bootstrap sampling to move randomly through model space for finding a better single model.

Chapter 9 - Additive models, trees, and related methods

  • Five nonparametric models are discussed in this chapter
    1. Generalized additive models
    2. Trees
    3. Multivariate adaptive regression splines
    4. The patient rule induction method
    5. Hierarchical mixtures of experts

9.1 Generalized additive models

9.1.1 Fitting additive models

9.1.2 Example: Additive logistic regression

9.1.3 Summary

  • Additive models provide a useful extension of linear models, making them more flexible while still retaining much of their interpretability.
  • However, additive models can have limitations for large data-mining applications. A forward stagewise approach such as boosting (Chapter 10) is more effective, and also allows for interactions to be included in the model.

9.2 Tree-based methods

9.2.1 Background

  • We first describe a popular method for tree-based regression and classification called CART, and later contrast it with C4.5, a major competitor.
  • Let's consider a regression problem with continuous response Y and inputs X_1 and X_2, each taking values in the unit interval. We restrict attention to recursive binary partitions like that in the top right panel of Figure 9.2. We first split the space into two regions, and model the response by the mean of Y in each region. We choose the variable and split-point to achieve the best fit. Then one or both of these regions are split into two more regions, and this process is continued until some stopping rule is applied. For example, let's assume that the result of this process is a partition into the five regions shown in Figure 9.2. The corresponding regression model predicts Y with a constant c_m in region R_m.
  • A key advantage of the recursive binary tree is its interpretability. The feature space partition is fully described by a single tree. Even with more than two inputs, the binary tree representation works in the same way.

9.2.2 Regression trees

  • How do we grow a regression tree? Our data consists of p inputs and a response, for each of N observations: that is, (x_i, y_i) for i = 1, 2, \ldots, N, with x_i = (x_{i1}, x_{i2}, \ldots, x_{ip}). The algorithm needs to automatically decide on the splitting variables and split points, and also what topology (shape) the tree should have. Suppose first that we have a partition into M regions R_1, R_2, \ldots, R_M, and we model the response as a constant c_m in each region:

    f(x) = \sum_{m=1}^{M} c_m\,I(x \in R_m).

    If we adopt as our criterion minimization of the sum of squares \sum (y_i - f(x_i))^2, it is easy to see that the best \hat{c}_m is just the average of y_i in region R_m:

    \hat{c}_m = \mathrm{ave}(y_i \mid x_i \in R_m).

    Now finding the best binary partition in terms of minimum sum of squares is generally computationally infeasible. Hence we proceed with a greedy algorithm. Starting with all of the data, consider a splitting variable j and split point s, and define the pair of half-planes

    R_1(j, s) = \{X \mid X_j \le s\} \quad\text{and}\quad R_2(j, s) = \{X \mid X_j > s\}.

    Then we seek the splitting variable j and split point s that solve

    \min_{j,\,s}\Big[\min_{c_1}\sum_{x_i \in R_1(j,s)}(y_i - c_1)^2 + \min_{c_2}\sum_{x_i \in R_2(j,s)}(y_i - c_2)^2\Big].

    For any choice j and s, the inner minimization is solved by

    \hat{c}_1 = \mathrm{ave}(y_i \mid x_i \in R_1(j, s)) \quad\text{and}\quad \hat{c}_2 = \mathrm{ave}(y_i \mid x_i \in R_2(j, s)).

    For each splitting variable, the determination of the split point s can be done very quickly, and hence by scanning through all of the inputs, determination of the best pair (j, s) is feasible (a greedy-split sketch follows this list).
    Having found the best split, we partition the data into the two resulting regions and repeat the splitting process on each of the two regions. Then this process is repeated on all of the resulting regions.
  • How large should we grow the tree? Clearly a very large tree might overfit the data, while a small tree might not capture the important structure.
  • Tree size is a tuning parameter governing the model's complexity, and the optimal tree size should be adaptively chosen from the data.
    1. One approach would be to split tree nodes only if the decrease in sum-of-squares due to the split exceeds some threshold. However, this approach is considered short-sighted since a seemingly worthless split might lead to a very good split below it.
    2. The preferred strategy is to grow a large tree T_0, stopping the splitting process only when some minimum node size (e.g., 5) is reached. Then this large tree is pruned using cost-complexity pruning, which we now describe.
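
A compact sketch (mine, not CART itself) of the greedy search for the best single split (j, s) by minimizing the two-region sum of squares described above:

```python
import numpy as np

def best_split(X, y):
    """Greedy search over splitting variable j and split point s minimizing the
    sum of squares over the two half-planes R1(j, s) and R2(j, s)."""
    N, p = X.shape
    best = (None, None, np.inf)                    # (j, s, criterion)
    for j in range(p):
        for s in np.unique(X[:, j])[:-1]:          # candidate split points
            left = X[:, j] <= s
            right = ~left
            rss = ((y[left] - y[left].mean()) ** 2).sum() + \
                  ((y[right] - y[right].mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best

rng = np.random.default_rng(8)
X = rng.uniform(0, 1, size=(200, 2))
y = np.where(X[:, 0] <= 0.4, 1.0, 3.0) + 0.2 * rng.normal(size=200)
print(best_split(X, y))                            # should recover a split on the first input near 0.4
```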

9.2.3 Classification trees

9.2.4 Other issues

  • Categorical predictors (features) - The partitioning algorithm tends to favor categorical predictors with many levels q; the number of possible partitions grows exponentially in q, and the more choices we have, the more likely we can find a good one for the data at hand. This can lead to severe overfitting if q is large, and such variables should be avoided.
  • The loss matrix - In classification problems, the consequences of misclassifying observations are more serious in some classes than others. To account for this, we define a K x K loss matrix L, with L_{kk'} being the loss incurred for classifying a class k observation as class k'. Typically no loss is incurred for correct classifications, that is, L_{kk} = 0. For the multi-class case, to incorporate the losses into the modelling process, we could modify the Gini index to \sum_{k \ne k'} L_{kk'}\,\hat{p}_{mk}\,\hat{p}_{mk'}; this would be the expected loss incurred by the randomized rule. For two classes a better approach is to weight the observations in class k by L_{kk'}.
  • Missing predictor values - Suppose our data has some missing predictor values in some or all of the variables. For tree-based models, there are two better approaches to deal with missing predictor values. The first is applicable to categorical predictors: we simply make a new category for "missing". The second, more general approach is the construction of surrogate variables. When considering a predictor for a split, we use only the observations for which that predictor is not missing. Having chosen the best (primary) predictor and split point, we form a list of surrogate predictors and split points. The general problem of missing data is discussed in Section 9.6.
  • Why binary splits? - The problem with multiway splits is that they fragment the data too quickly, leaving insufficient data at the next level down. Since multiway splits can be achieved by a series of binary splits, the latter are preferred.
  • Other tree-bulding procedures - C5.0
  • Linear combination splits - Rather than restricting splits to be of the form X_j \le s, one can allow splits along linear combinations of the form \sum_j a_j X_j \le s. The weights a_j and split point s are optimized to minimize the relevant criterion (e.g., the Gini index). A better way to incorporate linear combination splits is in the hierarchical mixtures of experts (HME) model, the topic of Section 9.5.
  • Instability of trees - One major problem with trees is their high variance. Bagging averages many trees to reduce this variance.
  • Lack of smoothness - Another limitation of trees is the lack of smoothness of the prediction surface, as can be seen in the bottom right panel of Figure 9.2. In classification with 0/1 loss this doesn't hurt much, but it can degrade performance in the regression setting, where we would normally expect the underlying function to be smooth. The MARS procedure, described in Section 9.4, can be viewed as a modification of CART designed to alleviate this lack of smoothness.
  • Difficulty in capturing additive structure - Another problem with trees is their difficulty in modeling additive structure. In regression, suppose, for example, that Y = c_1 I(X_1 < t_1) + c_2 I(X_2 < t_2) + \varepsilon, where \varepsilon is zero-mean noise. Then a binary tree might be able to capture this additive structure with sufficient data, but the model is given no special encouragement to find such structure. The "blame" here can again be attributed to the binary tree structure, which has both advantages and drawbacks. Again the MARS method (Section 9.4) gives up this tree structure in order to capture additive structure.

9.2.5 Spam example (continued)

9.3 PRIM: Bump hunting

9.4 MARS: Multivariate adaptive regression splines

9.5 Hierarchical mixtures of experts

9.6 Missing data

9.7 Computational considerations


Chapter 10 - Boosting and additive trees

10.13 Interpretation

  • Single decision trees are highly interpretable. The entire model can be completely represented by a simple two-dimensional graphic (binary tree) that is easily visualized.
  • Linear combinations of trees (e.g., Random forests and Boosting trees) lose this important feature, and must therefore be interpreted in a different way.

10.13.1 Relative importance of predictor variables

  • In data mining applications the input predictor variables are seldom equally relevant. Often only a few of them have substantial influence on the response; the vast majority are irrelevant and could just as well have been left out.
  • It is often useful to learn the relative importance or contribution of each input variable in predicting the response.
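A minimal sketch of extracting relative variable importances from a boosted tree ensemble, assuming scikit-learn; the importances are rescaled so that the largest equals 100, following the convention used in ESL. The dataset is synthetic and only for illustration.

```python
# Relative importance of predictors from a gradient boosting model.
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=500, n_features=10, random_state=0)
gbm = GradientBoostingRegressor(random_state=0).fit(X, y)

# Impurity-based importances, averaged over all trees in the ensemble.
importance = gbm.feature_importances_
relative = 100.0 * importance / importance.max()

for j in np.argsort(relative)[::-1]:
    print(f"X{j + 1}: {relative[j]:.1f}")
```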

10.13.2 Partial Dependence Plot

  • After the most relevant variables have been identified, the next step is to attempt to understand the nature of the dependence of the approximation $f(X)$ on their joint values.
  • For more than two variables, viewing functions of the corresponding higher-dimensional arguments is more difficult.
  • A useful alternative can sometimes be to view a collection of plots, each one of which shows the partial dependence of the approximation on a selected small subset of the input variables.
  • One way to define the average or partial dependence of $f(X)$ on a chosen subvector $X_S$ (with complement $X_C$) is

    $$f_S(X_S) = E_{X_C} f(X_S, X_C) \qquad (10.47)$$

    This is a marginal average of $f$, and can serve as a useful description of the effect of the chosen subset on $f(X)$ when, for example, the variables in $X_S$ do not have strong interactions with those in $X_C$.
  • In practice, partial dependence functions can be used to interpret the results of any "black box" learning method. They can be estimated by

    $$\bar f_S(X_S) = \frac{1}{N} \sum_{i=1}^{N} f(X_S, x_{iC}) \qquad (10.48)$$

    where $\{x_{1C}, x_{2C}, \ldots, x_{NC}\}$ are the values of $X_C$ occurring in the training data. This requires a pass over the data for each set of joint values of $X_S$ for which $\bar f_S$ is to be evaluated (a short sketch of this estimate is given after this list).
  • It is important to note that partial dependence functions defined in (10.47) represent the effect of $X_S$ on $f(X)$ after accounting for the (average) effects of the other variables $X_C$ on $f(X)$. They are not the effect of $X_S$ on $f(X)$ ignoring the effects of $X_C$.
  • If the effect of the chosen variable subset happens to be purely additive, $f(X) = h_1(X_S) + h_2(X_C)$ (10.50), then (10.47) recovers $h_1(X_S)$ up to an additive constant.
  • If the effect of the chosen variable subset happens to be purely multiplicative, $f(X) = h_1(X_S) \cdot h_2(X_C)$ (10.51), then (10.47) recovers $h_1(X_S)$ up to a multiplicative factor.
  • Those subsets whose effect on $f(X)$ is approximately additive (10.50) or multiplicative (10.51) will be most revealing.
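The following is a minimal sketch of estimating and plotting partial dependence for a fitted black-box model, assuming a recent scikit-learn (for `PartialDependenceDisplay.from_estimator`) and matplotlib; the model and data are illustrative.

```python
# Partial dependence (10.48): average predictions over the training values of X_C
# for a grid of values of the chosen subset X_S.
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay, partial_dependence

X, y = make_friedman1(n_samples=500, n_features=10, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Numerical partial dependence of the fit on feature 0.
pd_result = partial_dependence(model, X, features=[0], kind="average")

# Plots for two single features and their pairwise interaction.
PartialDependenceDisplay.from_estimator(model, X, features=[0, 1, (0, 1)])
```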

Chapter 15 - Random forests

15.1 Introduction

  • Bagging or bootstrap aggregation (section 8.7) is a technique for reducing the variance of an estimated prediction function.
  • Random forests is a substantial modification of bagging that builds a large collection of de-correlated trees, and then averages them.

15.2 Definition of random forests

  • The essential idea in bagging (section 8.7) is to average many noisy but approximately unbiased models, and hence reduce the variance.
  • Trees are ideal candidates for bagging, since they can capture complex interaction structures in the data, and if grown sufficiently deep, have relatively low bias. Since trees are notoriously noisy, they benefit greatly from the averaging.
  • An average of $B$ i.i.d. random variables, each with variance $\sigma^2$, has variance $\sigma^2 / B$. If the variables are simply i.d. (identically distributed, but not necessarily independent) with positive pairwise correlation $\rho$, the variance of the average is

    $$\rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2 \qquad (15.1)$$

    As $B$ increases, the second term disappears, but the first remains, and hence the size of the correlation of pairs of bagged trees limits the benefits of averaging.
  • The idea in random forests (Algorithm 15.1) is to further improve the variance reduction of bagging by reducing the correlation between the trees, without increasing the variance too much. This is achieved in the tree-growing process through random selection of the input variables considered at each split (the trees are otherwise grown by the usual recursive binary partitioning).
  • Specifically, when growing a tree on a bootstrapped dataset: before each split, select $m \le p$ of the input variables at random as candidates for splitting. Typical values for $m$ are $\sqrt{p}$ or even as low as 1.
  • After $B$ such trees $\{T(x; \Theta_b)\}_{b=1}^{B}$ are grown, the random forest (regression) predictor is

    $$\hat f_{\mathrm{rf}}^{B}(x) = \frac{1}{B} \sum_{b=1}^{B} T(x; \Theta_b) \qquad (15.2)$$

    As in Section 10.9 (page 356), $\Theta_b$ characterizes the $b$th random forest tree in terms of split variables, cutpoints at each node, and terminal-node values.
  • Intuitively, reducing $m$ will reduce the correlation between any pair of trees in the ensemble, and hence by (15.1) reduce the variance of the average (a short sketch follows this list).
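A minimal sketch, assuming scikit-learn: the number of candidate split variables $m$ corresponds to the `max_features` parameter, and the forest prediction is the average (15.2) over the $B$ = `n_estimators` trees. Data and settings are illustrative.

```python
# Random forest regression with m = floor(sqrt(p)) candidate variables per split.
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor

X, y = make_friedman1(n_samples=500, n_features=10, random_state=0)

rf = RandomForestRegressor(n_estimators=500,      # B trees
                           max_features="sqrt",   # m = floor(sqrt(p))
                           random_state=0).fit(X, y)

pred = rf.predict(X[:5])   # average of the 500 tree predictions, as in (15.2)
```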

15.3 Details of random forests

  • The inventors make the following recommendations:
    1. For classification, the default value for $m$ is $\lfloor \sqrt{p} \rfloor$ (where $\lfloor \cdot \rfloor$ is the greatest integer function) and the minimum node size is one.
    2. For regression, the default value for $m$ is $\lfloor p/3 \rfloor$ and the minimum node size is five.
      In practice, the best values for these parameters will depend on the problem, and they should be treated as tuning parameters (a tuning sketch follows).
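A minimal sketch of treating $m$ (`max_features`) and the minimum node size (`min_samples_leaf`) as tuning parameters via cross-validation, assuming scikit-learn; the grid values are illustrative.

```python
# Tune m and the minimum node size by cross-validated grid search.
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_friedman1(n_samples=300, n_features=10, random_state=0)

param_grid = {
    "max_features": [1, 3, "sqrt", None],   # m, from 1 up to all p variables
    "min_samples_leaf": [1, 5, 10],         # minimum node size
}

search = GridSearchCV(RandomForestRegressor(n_estimators=200, random_state=0),
                      param_grid, cv=5).fit(X, y)
print(search.best_params_)
```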

15.3.1 Out of bag samples

  • For each observation, an out-of-bag (OOB) prediction is obtained by averaging only the trees grown on bootstrap samples in which that observation did not appear. The resulting OOB error estimate is almost identical to that obtained by N-fold cross-validation. Hence, unlike many other nonlinear estimators, random forests can be fit in one sequence, with cross-validation being performed along the way. Once the OOB error stabilizes, the training can be terminated.
  • The OOB error is essentially an unbiased estimate of the generalization error, since the trees have not seen the OOB samples during training, and Breiman shows that the generalization error of a random forest converges to a limiting value as more trees are added. (Ref: Breiman L. Random forests. Machine Learning, 2001.)
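A minimal sketch, assuming scikit-learn: with `oob_score=True` the OOB error is computed while the forest is fit, in place of a separate cross-validation loop. The data are synthetic and for illustration only.

```python
# Out-of-bag error estimate obtained "along the way" during fitting.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                            random_state=0).fit(X, y)

# oob_score_ is the OOB accuracy, so 1 - oob_score_ is the OOB error estimate.
print("OOB error estimate:", 1.0 - rf.oob_score_)
```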

15.3.2 Variable importance

  • At each split in each tree, the improvement in the split-criterion is the importance measure attributed to the splitting variable, and is accumulated over all the trees in the forest separately for each variable.
  • Boosting (Figure 10.6 on page 354) ignores some variables completely, while the random forest does not. The candidate split-variable selection increases the chance that any single variable gets included in a random forest, while no such selection occurs with boosting.
  • Random forests also use the OOB samples to construct a different variable importance measure, apparently to measure the prediction strength of each variable.
  • There are two different variable importances discussed above (a short sketch contrasting them follows this list):
    1. Variable importance based on the mean decrease in impurity (MDI), where impurity is quantified by the splitting criterion of the decision tree (Gini index or entropy for classification, mean squared error for regression). (This variable importance is analyzed in detail in Louppe G et al., Understanding variable importances in forests of randomized trees.)
    2. Variable importance computed via OOB randomization, i.e., permuting each variable in turn in the OOB samples and measuring the drop in prediction accuracy; it tends to spread the importances more uniformly. (This variable importance was introduced in Breiman L. Random forests. 2001.)
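A minimal sketch contrasting the two measures, assuming scikit-learn: `feature_importances_` gives the impurity-based (MDI) importance, while `permutation_importance` implements the randomization idea. Note that scikit-learn permutes features on a held-out set rather than on the OOB samples, so it is a close analogue of, not identical to, Breiman's OOB measure.

```python
# MDI importance vs. permutation-based importance for a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)

mdi = rf.feature_importances_                   # 1. mean decrease in impurity
perm = permutation_importance(rf, X_te, y_te,   # 2. randomization-based importance
                              n_repeats=10, random_state=0)

print(mdi.round(3))
print(perm.importances_mean.round(3))
```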

15.3.3 Proximity plots

  • Proximity plots for random forests often look very similar, irrespective of the data, which casts doubt on their utility.

15.3.4 Random forests and overfitting

  • When the number of variables is large, but the fraction of relevant variables small, random forests are likely to perform poorly with small $m$. At each split the chance can be small that the relevant variables will be selected.
  • Another claim is that random forests "cannot overfit" the data. It is certainly true that increasing $B$ does not cause the random forest sequence to overfit; like bagging, the random forest estimate (15.2) approximates the expectation

    $$\hat f_{\mathrm{rf}}(x) = E_{\Theta} T(x; \Theta) = \lim_{B \to \infty} \hat f_{\mathrm{rf}}^{B}(x) \qquad (15.3)$$

    with an average over $B$ realizations of $\Theta$. The distribution of $\Theta$ here is conditional on the training data. However, this limit can overfit the data; the average of fully grown trees can result in too rich a model, and incur unnecessary variance.
  • Segal (2004) demonstrates small gains in performance by controlling the depths of the individual trees grown in random forests. Using fully grown trees seldom costs much, so this parameter rarely needs to be tuned (a small illustration follows).
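A minimal sketch, assuming scikit-learn, of checking whether limiting tree depth helps on a given problem; `max_depth=None` corresponds to fully grown trees, and the dataset and depths are illustrative.

```python
# Compare fully grown trees with depth-limited trees by cross-validation.
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_friedman1(n_samples=500, n_features=10, random_state=0)

for depth in (None, 10, 5, 3):   # None = fully grown trees
    rf = RandomForestRegressor(n_estimators=300, max_depth=depth, random_state=0)
    score = cross_val_score(rf, X, y, cv=5).mean()
    print(f"max_depth={depth}: CV R^2 = {score:.3f}")
```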
