Statistical Learning Notes (3): Overview of Supervised Learning (3)

Some further statements on KNN:

It appears that k-nearest-neighbor fits have a single parameter, the number of neighbors k, compared to the p parameters in least-squares fits. Although this is the case, we will see that the effective number of parameters of k-nearest neighbors is N/k and is generally bigger than p, and decreases with increasing k. To get an idea of why, note that if the neighborhoods were nonoverlapping, there would be N/k neighborhoods and we would fit one parameter (a mean) in each neighborhood.

Here N is the size of the training set. For example, if k = 1, each training point is its own neighborhood mean, so we must store N values. If k > 1, each neighborhood contains k training points, and if the neighborhoods did not overlap there would be N/k of them, so we would store N/k mean values.
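The N/k count can also be checked numerically. The sketch below is an illustration rather than part of the original notes: it assumes the usual definition of effective degrees of freedom as the trace of the smoother matrix S (where the fitted values are S y), and the helper name `knn_smoother_matrix` is made up for this example.

```python
import numpy as np

def knn_smoother_matrix(X, k):
    """Return the N x N matrix S such that y_hat = S @ y for k-NN averaging
    evaluated at the training points themselves."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances between training points.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    S = np.zeros((n, n))
    for i in range(n):
        # The k nearest points to x_i; x_i itself is always first (distance 0).
        neighbors = np.argsort(d2[i], kind="stable")[:k]
        S[i, neighbors] = 1.0 / k
    return S

rng = np.random.default_rng(0)
X = rng.uniform(size=(60, 2))  # N = 60 hypothetical training inputs
for k in (1, 5, 15):
    S = knn_smoother_matrix(X, k)
    # trace(S) is the effective number of parameters; compare with N / k.
    print(k, np.trace(S), X.shape[0] / k)
```

Each training point is among its own k nearest neighbors, so every diagonal entry of S equals 1/k and the trace is N/k, even when the neighborhoods overlap.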

To produce the following figure:
[Image 1 from the original post]

we need a way to generate the simulated data. First, 10 means m_k were generated from a bivariate Gaussian distribution N((1, 0)^T, I), and this class was labeled BLUE. Similarly, 10 more were drawn from N((0, 1)^T, I) and labeled ORANGE. Then, for each class, 100 observations were generated as follows: for each observation, pick an m_k at random with probability 1/10, then draw the observation from N(m_k, I/5). Each class is therefore a mixture of Gaussian clusters.
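The generating procedure described above can be sketched in a few lines of NumPy. The function name `simulate_class`, the seed, and the 0/1 label encoding are assumptions made for this illustration; only the distributions and sample sizes come from the text.

```python
import numpy as np

def simulate_class(center, n_obs, rng, n_means=10):
    """Draw n_obs points from a mixture of n_means Gaussian clusters."""
    # Cluster means m_k ~ N(center, I).
    means = rng.multivariate_normal(center, np.eye(2), size=n_means)
    # For each observation, pick one of the means with probability 1/n_means.
    picks = rng.integers(0, n_means, size=n_obs)
    # Each observation ~ N(m_k, I/5).
    noise = rng.multivariate_normal(np.zeros(2), np.eye(2) / 5, size=n_obs)
    return means[picks] + noise

rng = np.random.default_rng(42)
blue = simulate_class([1.0, 0.0], 100, rng)
orange = simulate_class([0.0, 1.0], 100, rng)

X = np.vstack([blue, orange])            # 200 x 2 input matrix
y = np.array([0] * 100 + [1] * 100)      # 0 = BLUE, 1 = ORANGE (assumed encoding)
print(X.shape, y.shape)
```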

Some expansion on KNN:

Linear regression and KNN can be improved in the following ways:

1. Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the effective 0/1 weights used by k-nearest neighbors.
2. In high-dimensional spaces the distance kernels are modified to emphasize some variables more than others.
3. Local regression fits linear models by locally weighted least squares, rather than fitting constants locally.
4. Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models.
The meaning of basis expansion can be explained as follows:
[Image 2 from the original post] [Image 3 from the original post]
[Image 4 from the original post]

Next, kernels are introduced on top of basis expansion.
To minimize the function

we get
[Image 5 from the original post]

Expanding the basis, we obtain

which has a form similar to that of an SVM.

[Image 6 from the original post]

[Image 7 from the original post]

[Image 8 from the original post]

A kernel serves two purposes: first, it guarantees that the features can be mapped into a high-dimensional space; second, it simplifies the computation.

5. Projection pursuit and neural network models consist of sums of nonlinearly transformed linear models.
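To make item 1 of the list concrete, here is a minimal sketch contrasting the 0/1 weights of k-nearest neighbors with smoothly decaying kernel weights. The Gaussian kernel, the bandwidth value, and the synthetic sine data are assumptions chosen for illustration; the notes do not specify a particular kernel.

```python
import numpy as np

def knn_average(x0, x, y, k):
    """k-NN estimate: weight 1/k on the k nearest points, 0 on all others."""
    idx = np.argsort(np.abs(x - x0))[:k]
    return y[idx].mean()

def kernel_average(x0, x, y, bandwidth):
    """Kernel-weighted average (Nadaraya-Watson form) with a Gaussian kernel:
    the weights decrease smoothly with distance from x0."""
    w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(4 * x) + rng.normal(scale=0.3, size=x.size)  # hypothetical data

x0 = 0.5
print("k-NN estimate:  ", knn_average(x0, x, y, k=15))
print("kernel estimate:", kernel_average(x0, x, y, bandwidth=0.05))
print("true value:     ", np.sin(4 * x0))
```

Moving x0 slightly changes the k-NN estimate in jumps, whenever a point enters or leaves the neighborhood, while the kernel estimate varies continuously.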

Statistical Decision Theory:

We seek a function f(X) for predicting Y given values of the input vector X. This theory requires a loss function L(Y, f(X)) for penalizing errors in prediction, and by far the most common and convenient is squared error loss: L(Y, f(X)) = (Y − f(X))^2.

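Our aim is to choose f so as to minimize the expected (squared) prediction error:

$$\mathrm{EPE}(f) = \mathrm{E}\,(Y - f(X))^2 = \int [y - f(x)]^2 \, \Pr(dx, dy)$$

Conditioning on X, this can be written as

$$\mathrm{EPE}(f) = \mathrm{E}_X \, \mathrm{E}_{Y|X}\big([Y - f(X)]^2 \mid X\big)$$

For a given X = x, it suffices to minimize pointwise, choosing the constant c that is closest to Y in expected squared error:

$$f(x) = \operatorname{argmin}_c \, \mathrm{E}_{Y|X}\big([Y - c]^2 \mid X = x\big)$$

This equation determines c exactly, and the solution is the conditional expectation:

$$f(x) = \mathrm{E}(Y \mid X = x)$$

Here x is the point at which we predict; in practice the expectation is estimated from the training samples at or near x.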
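As a short worked step, under the squared error loss used above, one way to see why the conditional mean is the minimizing constant is to expand around it. The shorthand \mu(x) is introduced here only for this derivation.

```latex
\begin{aligned}
\mu(x) &:= \mathrm{E}(Y \mid X = x), \\
\mathrm{E}\big[(Y - c)^2 \mid X = x\big]
  &= \mathrm{E}\big[(Y - \mu(x))^2 \mid X = x\big]
     + 2\,(\mu(x) - c)\,\underbrace{\mathrm{E}\big[Y - \mu(x) \mid X = x\big]}_{=\,0}
     + (\mu(x) - c)^2 \\
  &= \mathrm{Var}(Y \mid X = x) + (\mu(x) - c)^2 .
\end{aligned}
```

The first term does not depend on c, and the second is smallest (zero) exactly when c = \mu(x).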

To apply the above theory in practice, we can use KNN: for any input x, we estimate the conditional expectation by averaging the responses of its k closest neighbors in the training set. It would seem that with a reasonably large set of training data, we could always approximate the theoretically optimal conditional expectation by k-nearest-neighbor averaging, since the sample average approximates the expectation.
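A small simulation can illustrate this approximation in one dimension. The model Y = X^2 + noise, the choice k = sqrt(N), and the function name `knn_regress` are all assumptions made for this sketch, not taken from the notes.

```python
import numpy as np

def knn_regress(x0, x, y, k):
    """Average the responses of the k training points closest to x0."""
    idx = np.argsort(np.abs(x - x0))[:k]
    return y[idx].mean()

rng = np.random.default_rng(0)
x0 = 0.3
true_mean = x0 ** 2  # E(Y | X = x0) under the assumed model Y = X^2 + noise

for n in (100, 10_000, 100_000):
    x = rng.uniform(0, 1, n)
    y = x ** 2 + rng.normal(scale=0.5, size=n)
    k = int(np.sqrt(n))  # k grows with N while k / N shrinks
    print(n, k, knn_regress(x0, x, y, k), true_mean)
```

Letting k grow while k/N shrinks keeps the neighborhood local and still averages over more points, which is why the estimate tends to settle near the conditional mean in this low-dimensional setting.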

Local methods in high dimensions:

KNN breaks down in high dimensions, and the phenomenon is commonly referred to as the curse of dimensionality.
Consider the nearest-neighbor procedure for inputs uniformly distributed in a p-dimensional unit hypercube. Suppose we send out a hypercubical neighborhood about a target point to capture a fraction r of the observations (r is a proportion, so r < 1). Since this corresponds to a fraction r of the unit volume, the expected edge length will be e_p(r) = r^(1/p). In ten dimensions e_10(0.01) = 0.63 and e_10(0.1) = 0.80, while the entire range for each input is only 1.0. So to capture 1% or 10% of the data to form a local average, we must cover 63% or 80% of the range of each input variable. Such neighborhoods are no longer “local.” Reducing r dramatically does not help much either, since the fewer observations we average, the higher the variance of our fit.
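The edge-length formula is easy to tabulate. The sketch below simply evaluates r^(1/p) for a few dimensions; the function name `edge_length` and the chosen values of p are illustrative.

```python
def edge_length(r, p):
    """Edge length of a sub-cube that holds a fraction r of the volume
    of the p-dimensional unit hypercube."""
    return r ** (1.0 / p)

for p in (1, 2, 3, 10, 100):
    # Columns: dimension, edge needed for 1% of the data, edge needed for 10%.
    print(p, round(edge_length(0.01, p), 3), round(edge_length(0.1, p), 3))
```

For p = 10 this gives roughly 0.631 and 0.794, the latter being the figure quoted as 0.80 in the text.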

Another consequence of the sparse sampling in high dimensions is that all sample points are close to an edge of the sample. Consider N data points (training samples) uniformly distributed in a p-dimensional unit ball centered at the origin. Suppose we consider a nearest-neighbor estimate at the origin. The median distance from the origin to the closest data point is given by the expression

$$d(p, N) = \left(1 - \left(\tfrac{1}{2}\right)^{1/N}\right)^{1/p}$$
A more complicated expression exists for the mean distance to the closest point. For N = 500, p = 10, d(p, N) ≈ 0.52, more than halfway to the boundary. Hence most data points are closer to the boundary of the sample space than to any other data point. This is a problem because prediction is much more difficult near the edges of the training sample. For input points near the center of the training sample it is easy to find enough neighbors, but for points near the boundary it is not.
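The median-distance value can be reproduced numerically. The sketch below evaluates the closed form d(p, N) = (1 − (1/2)^(1/N))^(1/p) and compares it with a Monte Carlo estimate; the function names, the number of trials, and the seed are assumptions made for this illustration. The simulation relies on the fact that the distance from the origin of a uniform point in the unit p-ball is distributed as U^(1/p) with U uniform on (0, 1).

```python
import numpy as np

def median_nn_distance(p, n):
    """Closed-form median distance from the origin to the nearest of
    n points drawn uniformly from the p-dimensional unit ball."""
    return (1.0 - 0.5 ** (1.0 / n)) ** (1.0 / p)

def simulate_median(p, n, trials, rng):
    """Monte Carlo check: sample radii as U**(1/p), take the minimum over
    the n points in each trial, then the median over trials."""
    radii = rng.uniform(size=(trials, n)) ** (1.0 / p)
    return np.median(radii.min(axis=1))

rng = np.random.default_rng(0)
print("formula:   ", median_nn_distance(10, 500))
print("simulation:", simulate_median(10, 500, 2000, rng))
print("p = 2:     ", median_nn_distance(2, 500))
```

With the same N = 500, the nearest point in two dimensions lies very close to the origin, which shows how strongly the effect depends on p.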













