Chapter 8 Dimensionality Reduction

Reading notes on O'Reilly's Hands-On Machine Learning with Scikit-Learn and TensorFlow.

Dimensionality reduction speeds up training, at the cost of losing some information (which may make the system perform slightly worse).

Dimensionality reduction is also extremely useful for data visualization (or DataViz).

8.1 The Curse of Dimensionality

The more dimensions a training set has, the greater the risk of overfitting it.
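
To get an intuition for why high-dimensional space behaves so differently, here is a quick numerical sketch (not from the book's code; the point counts are arbitrary): the average distance between two random points in a unit hypercube keeps growing with the number of dimensions, so high-dimensional training sets are typically very sparse and new instances tend to lie far from every training instance.

import numpy as np

np.random.seed(42)
for n_dims in (2, 3, 100, 1000):
    a = np.random.rand(1000, n_dims)    # 1,000 random points in the unit hypercube
    b = np.random.rand(1000, n_dims)
    mean_dist = np.linalg.norm(a - b, axis=1).mean()
    print(n_dims, round(mean_dist, 2))  # the average pairwise distance grows with n_dims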

8.2 Main Approaches for Dimensionality Reduction

8.2.1 Projection

In most real-world problems, training instances are not spread out uniformly across all dimensions. As a result, all training instances actually lie within (or close to) a much lower-dimensional subspace of the high-dimensional space.

8.2.2 Manifold Learning

A 2D manifold is a 2D shape that can be bent and twisted in a higher-dimensional space. More generally, a d-dimensional manifold is a part of an n-dimensional space (where d < n) that locally resembles a d-dimensional hyperplane. In the case of the Swiss roll, d = 2 and n = 3: it locally resembles a 2D plane, but it is rolled in the third dimension.
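
For example, a point on the Swiss roll can be written with just two coordinates (this is essentially the parameterization used by Scikit-Learn's make_swiss_roll, up to scaling of the height coordinate):

$$\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} t\cos t \\ s \\ t\sin t \end{pmatrix}$$

The pair (t, s) gives the manifold coordinates, so d = 2 even though the points live in an n = 3 dimensional space.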

Many dimensionality reduction algorithms work by modeling the manifold on which the training instances lie; this is called Manifold Learning. It relies on the manifold assumption, also called the manifold hypothesis, which holds that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold.

8.3 PCA

Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm. First it identifies the hyperplane that lies closest to the data, and then it projects the data onto it.

8.3.1 Preserving the Variance

Before you can project the training set onto a lower-dimensional hyperplane, you first need to choose the right hyperplane.

It seems reasonable to select the axis that preserves the maximum amount of variance, as it will most likely lose less information than the other projections. Another way to justify this choice is that it is the axis that minimizes the mean squared distance between the original dataset and its projection onto that axis. This is the rather simple idea behind PCA.
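
As a quick illustration of this idea, here is a minimal sketch (the toy dataset and the two candidate axes are arbitrary, not from the book): projecting an elongated 2D cloud onto its long axis preserves far more variance than projecting it onto its short axis.

import numpy as np

rng = np.random.RandomState(42)
X_toy = rng.randn(200, 2) * np.array([3.0, 0.5])  # cloud stretched along the first axis
X_toy = X_toy - X_toy.mean(axis=0)                # center the data

axis_long = np.array([1.0, 0.0])
axis_short = np.array([0.0, 1.0])
print(np.var(X_toy @ axis_long))   # large: this projection preserves most of the variance
print(np.var(X_toy @ axis_short))  # small: this projection loses most of the variance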

8.3.2 Principal Components

The unit vector that defines the i-th axis is called the i-th principal component (PC).

Finding the principal components of a training set: a standard matrix factorization technique called Singular Value Decomposition (SVD) can decompose the training set matrix X into the dot product of three matrices, U · Σ · V^T, where V contains the unit vectors that define all the principal components we are looking for, as shown in Equation 8-1.

Equation 8-1. Principal components matrix
$$\mathbf{V} = \begin{pmatrix} \vert & \vert & & \vert \\ \mathbf{c}_1 & \mathbf{c}_2 & \cdots & \mathbf{c}_n \\ \vert & \vert & & \vert \end{pmatrix}$$
The following Python code uses NumPy's svd() function to obtain all the principal components of the training set, then extracts the first two PCs:

import numpy as np
np.random.seed(4)
m = 60
w1, w2 = 0.1, 0.3
noise = 0.1

angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles)/2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m)

X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered)  # np.linalg.svd returns V already transposed
c1 = Vt.T[:, 0]                       # first principal component (first column of V)
c2 = Vt.T[:, 1]                       # second principal component
c1, c2
#(array([0.93636116, 0.29854881, 0.18465208]),
# array([-0.34027485,  0.90119108,  0.2684542 ]))

PCA assumes that the dataset is centered around the origin. Scikit-Learn's PCA classes take care of centering the data for you, but if you implement PCA yourself (as above), don't forget to center the data first.

8.3.3 Projecting Down to d Dimensions

To project the training set onto the hyperplane, you can simply compute the dot product of the training set matrix X by the matrix W_d, defined as the matrix containing the first d principal components (i.e., the matrix composed of the first d columns of V), as shown in Equation 8-2.

Equation 8-2. Projecting the training set down to d dimensions
$$\mathbf{X}_{d\text{-proj}} = \mathbf{X} \cdot \mathbf{W}_d$$
The following Python code projects the training set onto the plane defined by the first two principal components:

W2 = Vt.T[:, :2]          # first two columns of V, i.e., the first two PCs
X2D = X_centered.dot(W2)

8.3.4 Using Scikit-Learn

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X2D = pca.fit_transform(X)

After fitting the PCA transformer to the dataset, you can access the principal components using the components_ variable (note that it contains the PCs as horizontal vectors, so, for example, the first principal component is equal to pca.components_.T[:, 0]).
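
For instance (a minimal sketch reusing the pca object fitted above), the rows of components_ should match the SVD result c1 and c2 up to sign, since each principal component is only defined up to a flip of its direction:

pca.components_          # shape (2, 3): one principal component per row
pca.components_.T[:, 0]  # first PC as a column vector, comparable to c1 (possibly sign-flipped)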

8.3.5 Explained Variance Ratio

The explained variance ratio of each principal component indicates the proportion of the dataset's variance that lies along the axis of that principal component.

pca.explained_variance_ratio_
#array([0.84248607, 0.14631839])

This tells you that 84.2% of the dataset’s variance lies along the first axis, and 14.6% lies along the second axis. This leaves less than 1.2% for the third axis, so it is reasonable to assume that it probably carries little information.

8.3.6 Choosing the Right Number of Dimensions

It is generally preferable to choose the number of dimensions that add up to a sufficiently large portion of the variance (e.g., 95%).

The following code computes PCA without reducing dimensionality, then computes the minimum number of dimensions required to preserve 95% of the training set’s variance:

pca = PCA()
pca.fit(X)
# cumulative sum of the explained variance ratios
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1

You could then set n_components=d and run PCA again. However, there is a much better option: instead of specifying the number of principal components you want to preserve, you can set n_components to a float between 0.0 and 1.0, indicating the ratio of variance you wish to preserve:

from sklearn.datasets import fetch_mldata
# Download mnist-original.mat from
# https://raw.githubusercontent.com/amplab/datascience-sp14/master/lab7/mldata/mnist-original.mat
# and place it in the ./datasets/mnist/mldata directory.
# (fetch_mldata has been removed from recent Scikit-Learn versions;
# use fetch_openml('mnist_784', version=1) instead.)
mnist = fetch_mldata('MNIST original', data_home='./datasets/mnist')
X_mnist, y_mnist = mnist.data, mnist.target

pca = PCA(n_components=0.95)
X_mnist_reduced = pca.fit_transform(X_mnist)
pca.n_components   # 0.95
pca.n_components_  # 154

Similar to the difference between estimators and estimators_ (for example in ensemble models), n_components and n_components_ are a parameter and a learned attribute respectively. More generally in Scikit-Learn, a name without a trailing underscore is a hyperparameter you set before training, while the same name with a trailing underscore is a value the estimator learns during fitting.
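
A minimal sketch of this convention (the AttributeError behavior is the standard Scikit-Learn behavior for any attribute that is only set during fitting):

pca = PCA(n_components=0.95)
pca.n_components    # 0.95: the hyperparameter you set, available before training
# pca.n_components_ would raise AttributeError here, because the model is not fitted yet
pca.fit(X_mnist)
pca.n_components_   # the number of components determined during fit (154 in the run above)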

8.3.7 PCA for Compression

The mean squared distance between the original data and the reconstructed data (compressed and then decompressed) is called the reconstruction error.

The following code compresses the MNIST dataset down to 154 dimensions, then uses the inverse_transform() method to decompress it back to 784 dimensions.

pca = PCA(n_components=154)
X_mnist_reduced = pca.fit_transform(X_mnist)
X_mnist_recovered = pca.inverse_transform(X_mnist_reduced)

The equation of the inverse transformation is shown in Equation 8-3.
Equation 8-3. PCA inverse transformation, back to the original number of dimensions
$$\mathbf{X}_{\text{recovered}} = \mathbf{X}_{d\text{-proj}} \cdot \mathbf{W}_d^T$$
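
The reconstruction error defined above can be estimated directly from the arrays of the previous snippet; a minimal sketch (reusing the variable names defined there):

# Mean squared distance between the original and the reconstructed instances
reconstruction_error = np.mean(np.sum(np.square(X_mnist - X_mnist_recovered), axis=1))
reconstruction_error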

8.3.8 Incremental PCA

The preceding implementation of PCA requires the whole training set to fit in memory in order for the SVD algorithm to run. With Incremental PCA (IPCA) algorithms, you can split the training set into mini-batches and feed an IPCA algorithm one mini-batch at a time. This is useful to apply PCA online (i.e., on the fly, as new instances arrive).

The following code splits the MNIST dataset into 100 mini-batches (using NumPy's array_split() function) and feeds them to Scikit-Learn's IncrementalPCA class to reduce the dimensionality of the MNIST dataset down to 154 dimensions (just like before). Note that you must call the partial_fit() method with each mini-batch rather than the fit() method with the whole training set:

from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_mnist, n_batches):
    inc_pca.partial_fit(X_batch)

X_mnist_reduced = inc_pca.transform(X_mnist)

Alternatively, you can use NumPy's memmap class, which allows you to manipulate a large array stored in a binary file on disk as if it were entirely in memory; the class loads only the data it needs in memory, when it needs it. Since the IncrementalPCA class uses only a small part of the array at any given time, the memory usage remains under control. This makes it possible to call the usual fit() method, as you can see in the following code:

from tempfile import mkdtemp
import os.path as path
filename = path.join(mkdtemp(), 'my_mnist.data')

m, n = X_mnist.shape
X_mm = np.memmap(filename, dtype='float32', mode='write', shape=(m, n))
X_mm[:] = X_mnist
batch_size = m // n_batches
inc_pca = IncrementalPCA(n_components=154, batch_size=batch_size)
inc_pca.fit(X_mm)
X_reduced = inc_pca.transform(X_mm)

8.3.9 Randomized PCA

Scikit-Learn offers yet another option to perform PCA, called Randomized PCA. This is a stochastic algorithm that quickly finds an approximation of the first d principal components. Its computational complexity is O(m × d²) + O(d³), instead of O(m × n²) + O(n³), so it is dramatically faster than the previous algorithms when d is much smaller than n.

rnd_pca = PCA(n_components=154, svd_solver="randomized")
X_reduced = rnd_pca.fit_transform(X_mnist) 

8.4 Kernel PCA

Kernel PCA (kPCA) applies the kernel trick to PCA, making it possible to perform complex nonlinear projections for dimensionality reduction.

from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
X, y = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)

8.4.1 Selecting a Kernel and Tuning Hyperparameters

As kPCA is an unsupervised learning algorithm, there is no obvious performance measure to help you select the best kernel and hyperparameter values. However, dimensionality reduction is often a preparation step for a supervised learning task (e.g., classification), so you can simply use grid search to select the kernel and hyperparameters that lead to the best performance on that task.

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

clf = Pipeline([
        ("kpca", KernelPCA(n_components=2)),
        ("log_reg", LogisticRegression(solver="liblinear"))
    ])

param_grid = [{
        "kpca__gamma": np.linspace(0.03, 0.05, 10),
        "kpca__kernel": ["rbf", "sigmoid"]
    }]

grid_search = GridSearchCV(clf, param_grid, cv=3)
# The swiss-roll target y is continuous, so binarize it into two classes for the
# logistic regression classifier (threshold as in the book's companion notebook)
grid_search.fit(X, y > 6.9)
print(grid_search.best_params_)
#{'kpca__gamma': 0.043333333333333335, 'kpca__kernel': 'rbf'}

Another approach, this time entirely unsupervised, is to select the kernel and hyper-parameters that yield the lowest reconstruction error.

kPCA has the same effect as first mapping the training set to an infinite-dimensional feature space using the feature map φ, then projecting the transformed training set down to 2D using linear PCA.

$$\text{original space} \xrightarrow{\text{kPCA}} \text{reduced space}$$

$$\text{original space} \xrightarrow{\varphi} \text{feature space} \xrightarrow{\text{linear PCA}} \text{reduced space}$$

Notice that if we could invert the linear PCA step for a given instance in the reduced space, the reconstructed point would lie in feature space, not in the original space (it is the point represented by an x in the book's figure). Since the feature space is infinite-dimensional, we cannot compute the reconstructed point, and therefore we cannot compute the true reconstruction error.

Fortunately, it is possible to find a point in the original space that would map close to the reconstructed point. This is called the reconstruction pre-image. Once you have this pre-image, you can measure its squared distance to the original instance. You can then select the kernel and hyperparameters that minimize this reconstruction pre-image error.

You may be wondering how to perform this reconstruction. One solution is to train a supervised regression model, with the projected instances as the training set and the original instances as the targets. Scikit-Learn will do this automatically if you set fit_inverse_transform=True, as shown in the following code:

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433,
                    fit_inverse_transform=True)
X_reduced = rbf_pca.fit_transform(X)
X_preimage = rbf_pca.inverse_transform(X_reduced)

from sklearn.metrics import mean_squared_error
mean_squared_error(X, X_preimage)  # 32.786308795766104
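
Putting the two previous snippets together, a minimal (and fairly slow) sketch of the fully unsupervised approach is to loop over a small grid of gamma values and keep the one with the lowest pre-image reconstruction error (the grid below is an arbitrary choice for illustration):

best_gamma, best_mse = None, float("inf")
for gamma in np.linspace(0.03, 0.05, 10):
    kpca = KernelPCA(n_components=2, kernel="rbf", gamma=gamma,
                     fit_inverse_transform=True)
    X_reduced = kpca.fit_transform(X)
    X_preimage = kpca.inverse_transform(X_reduced)
    mse = mean_squared_error(X, X_preimage)
    if mse < best_mse:
        best_gamma, best_mse = gamma, mse
best_gamma, best_mse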

8.5 LLE

Locally Linear Embedding (LLE) is another powerful nonlinear dimensionality reduction (NLDR) technique. It is a Manifold Learning technique that does not rely on projections like the previous algorithms. LLE works by first measuring how each training instance linearly relates to its closest neighbors (c.n.), and then looking for a low-dimensional representation of the training set where these local relationships are best preserved.

from sklearn.manifold import LocallyLinearEmbedding

lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_reduced = lle.fit_transform(X)

Here's how LLE works: first, for each training instance $\mathbf{x}^{(i)}$, the algorithm identifies its k closest neighbors (in the preceding code k = 10), then tries to reconstruct $\mathbf{x}^{(i)}$ as a linear function of these neighbors. More specifically, it finds the weights $w_{i,j}$ such that the squared distance between $\mathbf{x}^{(i)}$ and $\sum_{j=1}^m w_{i,j}\mathbf{x}^{(j)}$ is as small as possible, assuming $w_{i,j} = 0$ if $\mathbf{x}^{(j)}$ is not one of the k closest neighbors of $\mathbf{x}^{(i)}$. Thus the first step of LLE is the constrained optimization problem described in Equation 8-4, where W is the weight matrix containing all the weights $w_{i,j}$. The second constraint simply normalizes the weights for each training instance $\mathbf{x}^{(i)}$.

Equation 8-4. LLE step 1: linearly modeling local relationships
$$\widehat{\mathbf{W}} = \underset{\mathbf{W}}{\arg\min} \sum_{i=1}^{m} \Big\| \mathbf{x}^{(i)} - \sum_{j=1}^{m} w_{i,j}\,\mathbf{x}^{(j)} \Big\|^2$$

$$\text{subject to } \begin{cases} w_{i,j} = 0 & \text{if } \mathbf{x}^{(j)} \text{ is not one of the } k \text{ c.n. of } \mathbf{x}^{(i)} \\[4pt] \sum_{j=1}^{m} w_{i,j} = 1 & \text{for } i = 1, 2, \cdots, m \end{cases}$$
After this step, the weight matrix $\widehat{\mathbf{W}}$ (containing the weights $w_{i,j}$) encodes the local linear relationships between the training instances. Now the second step is to map the training instances into a d-dimensional space (where d < n) while preserving these local relationships as much as possible. If $\mathbf{z}^{(i)}$ is the image of $\mathbf{x}^{(i)}$ in this d-dimensional space, then we want the squared distance between $\mathbf{z}^{(i)}$ and $\sum_{j=1}^m w_{i,j}\mathbf{z}^{(j)}$ to be as small as possible. This idea leads to the unconstrained optimization problem described in Equation 8-5. It looks very similar to the first step, but instead of keeping the instances fixed and finding the optimal weights, we are doing the reverse: keeping the weights fixed and finding the optimal position of the instances' images in the low-dimensional space. Note that $\mathbf{Z}$ is the matrix containing all $\mathbf{z}^{(i)}$.

Equation 8-5. LLE step 2: reducing dimensionality while preserving relationships
$$\widehat{\mathbf{Z}} = \underset{\mathbf{Z}}{\arg\min} \sum_{i=1}^{m} \Big\| \mathbf{z}^{(i)} - \sum_{j=1}^{m} w_{i,j}\,\mathbf{z}^{(j)} \Big\|^2$$
Scikit-Learn's LLE implementation has the following computational complexity: $O(m \log(m)\, n \log(k))$ for finding the k nearest neighbors, $O(m n k^3)$ for optimizing the weights, and $O(d m^2)$ for constructing the low-dimensional representations. Unfortunately, the $m^2$ in the last term makes this algorithm scale poorly to very large datasets.

8.6 Other Dimensionality Reduction Techniques

  • Multidimensional Scaling (MDS) reduces dimensionality while trying to preserve the distances between the instances.
  • Isomap creates a graph by connecting each instance to its nearest neighbors, then reduces dimensionality while trying to preserve the geodesic distances between the instances.
  • t-Distributed Stochastic Neighbor Embedding (t-SNE) reduces dimensionality while trying to keep similar instances close and dissimilar instances apart. It is mostly used for visualization, in particular to visualize clusters of instances in high-dimensional space (e.g., to visualize the MNIST images in 2D).
  • Linear Discriminant Analysis (LDA) is actually a classification algorithm, but during training it learns the most discriminative axes between the classes, and these axes can then be used to define a hyperplane onto which to project the data. The benefit is that the projection will keep classes as far apart as possible, so LDA is a good technique to reduce dimensionality before running another classification algorithm such as an SVM classifier. (A short usage sketch for the first three techniques follows this list.)
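
As a quick illustration (a minimal sketch; the subset size is arbitrary, and X is the Swiss-roll dataset generated earlier in this chapter), the Scikit-Learn estimators for MDS, Isomap, and t-SNE all follow the usual fit_transform() interface:

from sklearn.manifold import MDS, Isomap, TSNE

X_subset = X[:500]  # MDS and t-SNE are slow, so use only a small subset
X_mds = MDS(n_components=2).fit_transform(X_subset)
X_isomap = Isomap(n_components=2).fit_transform(X_subset)
X_tsne = TSNE(n_components=2).fit_transform(X_subset)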
