Dimensionality reduction speeds up training.
Dimensionality reduction is also extremely useful for data visualization (or DataViz).
The more dimensions a training set has, the greater the risk of overfitting it.
In most real-world problems, training instances are not spread out uniformly across all dimensions. As a result, all training instances actually lie within (or close to) a much lower-dimensional subspace of the high-dimensional space.
A 2D manifold is a 2D shape that can be bent and twisted in a higher-dimensional space. More generally, a $d$-dimensional manifold is a part of an $n$-dimensional space (where $d < n$) that locally resembles a $d$-dimensional hyperplane. In the case of the Swiss roll, $d = 2$ and $n = 3$: it locally resembles a 2D plane, but it is rolled in the third dimension.
Many dimensionality reduction algorithms work by modeling the manifold on which the training instances lie; this is called Manifold Learning. It relies on the manifold assumption, also called the manifold hypothesis, which holds that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold.
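To make the manifold assumption concrete, here is a minimal sketch that builds the Swiss roll dataset used later in this chapter (the variable names X_sr and t_sr are just for illustration): each 3D point is fully determined by two intrinsic coordinates on the rolled-up 2D sheet.
from sklearn.datasets import make_swiss_roll
# X_sr has 3 columns (the ambient 3D coordinates); t_sr is the position along
# the roll, i.e., one of the two coordinates on the underlying 2D manifold
X_sr, t_sr = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
print(X_sr.shape)   # (1000, 3)
print(t_sr.shape)   # (1000,)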
Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm. First it identifies the hyperplane that lies closest to the data, and then it projects the data onto it.
Before you can project the training set onto a lower-dimensional hyperplane, you first need to choose the right hyperplane.
It seems reasonable to select the axis that preserves the maximum amount of variance, as it will most likely lose less information than the other projections. Another way to justify this choice is that it is the axis that minimizes the mean squared distance between the original dataset and its projection onto that axis. This is the rather simple idea behind PCA.
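As a quick illustration of this idea, the following sketch (with a hypothetical toy 2D dataset X_toy) compares the variance preserved by projecting onto three different unit vectors; the axis that keeps the most variance is the one PCA would pick as the first principal component:
import numpy as np
rng = np.random.RandomState(42)
X_toy = rng.randn(200, 2) * np.array([3.0, 0.5])   # toy dataset stretched along the horizontal axis
X_toy = X_toy - X_toy.mean(axis=0)                 # center the data
for axis in (np.array([1.0, 0.0]),                 # horizontal axis
             np.array([0.0, 1.0]),                 # vertical axis
             np.array([1.0, 1.0]) / np.sqrt(2)):   # diagonal axis
    proj = X_toy @ axis                            # 1D projection onto the unit vector
    print(axis, proj.var())                        # the horizontal axis preserves the most variance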
The unit vector that defines the $i$-th axis is called the $i$-th principal component (PC).
Finding the principal components of a training set: a standard matrix factorization technique called Singular Value Decomposition (SVD) can decompose the training set matrix $\mathbf{X}$ into the matrix product $\mathbf{U} \cdot \mathbf{\Sigma} \cdot \mathbf{V}^T$, where the columns of $\mathbf{V}$ are all the principal components that we are looking for, as shown in Equation 8-1.
Equation 8-1. Principal components matrix
$$\mathbf{V} = \begin{pmatrix} \mid & \mid & & \mid \\ \mathbf{c}_1 & \mathbf{c}_2 & \cdots & \mathbf{c}_n \\ \mid & \mid & & \mid \end{pmatrix}$$
The following Python code uses NumPy's svd() function to obtain all the principal components of the training set, then extracts the first two PCs:
import numpy as np
np.random.seed(4)
m = 60
w1, w2 = 0.1, 0.3
noise = 0.1
angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles)/2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m)
X_centered = X - X.mean(axis=0)       # center the data around the origin
U, s, Vt = np.linalg.svd(X_centered)  # np.linalg.svd() returns V already transposed
c1 = Vt.T[:, 0]   # first principal component (first column of V)
c2 = Vt.T[:, 1]   # second principal component (second column of V)
c1, c2
#(array([0.93636116, 0.29854881, 0.18465208]),
# array([-0.34027485, 0.90119108, 0.2684542 ]))
Don’t forget to center the data first.
To project the training set onto the hyperplane, you can simply multiply the training set matrix $\mathbf{X}$ by the matrix $\mathbf{W}_d$, defined as the matrix containing the first $d$ principal components (i.e., the matrix composed of the first $d$ columns of $\mathbf{V}$), as shown in Equation 8-2.
Equation 8-2. Projecting the training set down to d dimensions
$$\mathbf{X}_{d\text{-proj}} = \mathbf{X} \cdot \mathbf{W}_d$$
The following Python code projects the training set onto the plane defined by the first two principal components:
W2 = Vt.T[:, :2]            # first two principal components (first two columns of V)
X2D = X_centered.dot(W2)    # project the training set onto the 2D plane
Equivalently, Scikit-Learn's PCA class does the same projection for you (it uses SVD internally and automatically takes care of centering the data):
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X2D = pca.fit_transform(X)
After fitting the PCA transformer to the dataset, you can access the principal components via the components_ attribute (note that it contains the PCs as horizontal vectors, so, for example, the first principal component is equal to pca.components_.T[:, 0]).
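As a quick sanity check (reusing the pca object fitted above and the c1 vector obtained from the SVD), the components found by Scikit-Learn should match the SVD results, possibly up to a sign flip, since the direction of a principal component is only defined up to its sign:
print(pca.components_)           # shape (2, 3): the two PCs as rows
print(pca.components_.T[:, 0])   # first PC as a column vector
# should match c1 up to a possible sign flip
print(np.allclose(np.abs(pca.components_.T[:, 0]), np.abs(c1)))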
The explained variance ratio of each principal component, available via the explained_variance_ratio_ attribute, indicates the proportion of the dataset's variance that lies along the axis of each principal component.
print(pca.explained_variance_ratio_)
#array([ 0.84248607, 0.14631839])
This tells you that 84.2% of the dataset’s variance lies along the first axis, and 14.6% lies along the second axis. This leaves less than 1.2% for the third axis, so it is reasonable to assume that it probably carries little information.
It is generally preferable to choose the number of dimensions that add up to a sufficiently large portion of the variance (e.g., 95%).
The following code computes PCA without reducing dimensionality, then computes the minimum number of dimensions required to preserve 95% of the training set’s variance:
pca = PCA()
pca.fit(X)
# cumulative sum of the explained variance ratios
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1   # smallest number of dimensions preserving at least 95% of the variance
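To see where the 95% threshold falls, you can also plot the cumulative explained variance against the number of dimensions (a minimal sketch, assuming matplotlib is installed):
import matplotlib.pyplot as plt
plt.plot(np.arange(1, len(cumsum) + 1), cumsum)   # cumulative explained variance
plt.axhline(y=0.95, color='k', linestyle='--')    # the 95% threshold
plt.xlabel("Number of dimensions")
plt.ylabel("Cumulative explained variance")
plt.show()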
You could then set n_components=d and run PCA again. However, there is a much better option: instead of specifying the number of principal components you want to preserve, you can set n_components to a float between 0.0 and 1.0, indicating the ratio of variance you wish to preserve:
from sklearn.datasets import fetch_mldata
# https://raw.githubusercontent.com/amplab/datascience-sp14/master/lab7/mldata/mnist-original.mat
# Download the mnist-original.mat file and place it in the ./datasets/mnist/mldata directory
mnist = fetch_mldata('MNIST original', data_home='./datasets/mnist')
mnist
X_mnist,y_mnist=mnist.data, mnist.target
pca=PCA(n_components=0.95)
X_mnist_reduced=pca.fit_transform(X_mnist)
pca.n_components    # 0.95
pca.n_components_   # 154
As with the difference between estimators and estimators_, n_components and n_components_ are a parameter and an attribute, respectively. In general, a name without a trailing underscore is a hyperparameter you set before training, while a name ending in an underscore holds a value learned during training.
The mean squared distance between the original data and the reconstructed data (compressed and then decompressed) is called the reconstruction error.
The following code compresses the MNIST dataset down to 154 dimensions, then uses the inverse_transform() method to decompress it back to 784 dimensions.
pca = PCA(n_components=154)
X_mnist_reduced = pca.fit_transform(X_mnist)
X_mnist_recovered = pca.inverse_transform(X_mnist_reduced)
The equation of the inverse transformation is shown in Equation 8-3.
Equation 8-3. PCA inverse transformation, back to the original number of dimensions
$$\mathbf{X}_{\text{recovered}} = \mathbf{X}_{d\text{-proj}} \cdot \mathbf{W}_d^T$$
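With the arrays computed above, the reconstruction error defined earlier can be estimated directly (a small sketch; the exact value depends on the dataset and on n_components):
# mean squared distance between each original digit and its reconstruction
reconstruction_error = np.mean(np.sum(np.square(X_mnist - X_mnist_recovered), axis=1))
print(reconstruction_error)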
The preceding implementation of PCA requires the whole training set to fit in memory in order for the SVD algorithm to run. With Incremental PCA (IPCA) algorithms, you can split the training set into mini-batches and feed an IPCA algorithm one mini-batch at a time. This is useful to apply PCA online (i.e., on the fly, as new instances arrive).
The following code splits the MNIST dataset into 100 mini-batches (using NumPy's array_split() function) and feeds them to Scikit-Learn's IncrementalPCA class to reduce the dimensionality of the MNIST dataset down to 154 dimensions (just like before). Note that you must call the partial_fit() method with each mini-batch rather than the fit() method with the whole training set:
from sklearn.decomposition import IncrementalPCA
n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_mnist, n_batches):
    inc_pca.partial_fit(X_batch)   # feed one mini-batch at a time
X_mnist_reduced = inc_pca.transform(X_mnist)
Alternatively, you can use NumPy's memmap class, which allows you to manipulate a large array stored in a binary file on disk as if it were entirely in memory; the class loads only the data it needs, when it needs it. Since the IncrementalPCA class uses only a small part of the array at any given time, the memory usage remains under control. This makes it possible to call the usual fit() method, as you can see in the following code:
from tempfile import mkdtemp
import os.path as path
filename = path.join(mkdtemp(), 'my_mnist.data')
m, n = X_mnist.shape
X_mm = np.memmap(filename, dtype='float32', mode='write', shape=(m, n))
X_mm[:] = X_mnist
batch_size = m // n_batches
inc_pca = IncrementalPCA(n_components=154, batch_size=batch_size)
inc_pca.fit(X_mm)
X_reduced=inc_pca.transform(X_mm)
Scikit-Learn offers yet another option to perform PCA, called Randomized PCA. This is a stochastic algorithm that quickly finds an approximation of the first $d$ principal components. Its computational complexity is $O(m \times d^2) + O(d^3)$, instead of $O(m \times n^2) + O(n^3)$, so it is dramatically faster than the previous algorithms when $d$ is much smaller than $n$.
rnd_pca = PCA(n_components=154, svd_solver="randomized")
X_reduced = rnd_pca.fit_transform(X_mnist)
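To get a rough idea of the speed-up on MNIST, the two solvers can be timed side by side (a quick, unscientific sketch; actual numbers depend on your machine and library versions):
import time
for solver in ("full", "randomized"):
    pca = PCA(n_components=154, svd_solver=solver)
    t0 = time.time()
    pca.fit(X_mnist)
    print(solver, "solver:", round(time.time() - t0, 2), "seconds")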
Kernel PCA (kPCA): apply the kernel trick to PCA, making it possible to perform complex nonlinear projections for dimensionality reduction.
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
X, y = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
rbf_pca=KernelPCA(n_components=2,kernel='rbf',gamma=0.04)
X_reduced=rbf_pca.fit_transform(X)
As kPCA is an unsupervised learning algorithm, there is no obvious performance measure to help you select the best kernel and hyperparameter values. However, dimensionality reduction is often a preparation step for a supervised learning task (e.g., classification), so you can simply use grid search to select the kernel and hyper‐parameters that lead to the best performance on that task.
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# the Swiss roll target is continuous, so turn it into two classes (arbitrary threshold) for the classification task
y_class = (y > 6.9)
clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression(solver="liblinear"))
])
param_grid = [{
    "kpca__gamma": np.linspace(0.03, 0.05, 10),
    "kpca__kernel": ["rbf", "sigmoid"]
}]
grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y_class)
print(grid_search.best_params_)
#{'kpca__gamma': 0.043333333333333335, 'kpca__kernel': 'rbf'}
Another approach, this time entirely unsupervised, is to select the kernel and hyper-parameters that yield the lowest reconstruction error.
kPCA has the same effect as first mapping the training set to an infinite-dimensional feature space using the feature map $\varphi$, then projecting the transformed training set down to 2D using linear PCA.
original space $\xrightarrow{\text{kPCA}}$ reduced space
original space $\xrightarrow{\varphi}$ feature space $\xrightarrow{\text{linear PCA}}$ reduced space
Notice that if we could invert the linear PCA step for a given instance in the reduced space, the reconstructed point would lie in feature space, not in the original space. Since the feature space is infinite-dimensional, we cannot compute the reconstructed point, and therefore we cannot compute the true reconstruction error.
Fortunately, it is possible to find a point in the original space that would map close to the reconstructed point. This is called the reconstruction pre-image. Once you have this pre-image, you can measure its squared distance to the original instance. You can then select the kernel and hyperparameters that minimize this reconstruction pre-image error.
You may be wondering how to perform this reconstruction. One solution is to train a supervised regression model, with the projected instances as the training set and the original instances as the targets. Scikit-Learn will do this automatically if you set fit_inverse_transform=True, as shown in the following code:
rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433,
                    fit_inverse_transform=True)
X_reduced = rbf_pca.fit_transform(X)
X_preimage = rbf_pca.inverse_transform(X_reduced)
from sklearn.metrics import mean_squared_error
mean_squared_error(X, X_preimage)   # 32.786308795766104
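Putting this together, the fully unsupervised selection described above can be done by looping over candidate values and keeping the one with the lowest pre-image reconstruction error (a minimal sketch using an arbitrary grid of gamma values for the RBF kernel):
best_gamma, best_error = None, float("inf")
for gamma in np.linspace(0.03, 0.05, 10):           # arbitrary candidate values
    kpca = KernelPCA(n_components=2, kernel="rbf", gamma=gamma,
                     fit_inverse_transform=True)
    X_kpca = kpca.fit_transform(X)
    X_pre = kpca.inverse_transform(X_kpca)
    error = mean_squared_error(X, X_pre)            # pre-image reconstruction error
    if error < best_error:
        best_gamma, best_error = gamma, error
print(best_gamma, best_error)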
Locally Linear Embedding (LLE) is another powerful nonlinear dimensionality reduction (NLDR) technique. It is a Manifold Learning technique that does not rely on projections like the previous algorithms. LLE works by first measuring how each training instance linearly relates to its closest neighbors (c.n.), and then looking for a low-dimensional representation of the training set where these local relationships are best preserved.
from sklearn.manifold import LocallyLinearEmbedding
lle=LocallyLinearEmbedding(n_components=2,n_neighbors=10)
X_reduced=lle.fit_transform(X)
Here's how LLE works: first, for each training instance $\mathbf{x}^{(i)}$, the algorithm identifies its $k$ closest neighbors (in the preceding code $k = 10$), then tries to reconstruct $\mathbf{x}^{(i)}$ as a linear function of these neighbors. More specifically, it finds the weights $w_{i,j}$ such that the squared distance between $\mathbf{x}^{(i)}$ and $\sum_{j=1}^{m} w_{i,j}\mathbf{x}^{(j)}$ is as small as possible, assuming $w_{i,j} = 0$ if $\mathbf{x}^{(j)}$ is not one of the $k$ closest neighbors of $\mathbf{x}^{(i)}$. Thus the first step of LLE is the constrained optimization problem described in Equation 8-4, where $\mathbf{W}$ is the weight matrix containing all the weights $w_{i,j}$. The second constraint simply normalizes the weights for each training instance $\mathbf{x}^{(i)}$.
Equation 8-4. LLE step 1: linearly modeling local relationships
$$\widehat{\mathbf{W}} = \mathop{\arg\min}\limits_{\mathbf{W}} \sum_{i=1}^{m} \left\| \mathbf{x}^{(i)} - \sum_{j=1}^{m} w_{i,j}\,\mathbf{x}^{(j)} \right\|^2 \quad \text{subject to } \left\{\begin{array}{ll} w_{i,j} = 0 & \text{if } \mathbf{x}^{(j)} \text{ is not one of the } k \text{ c.n. of } \mathbf{x}^{(i)} \\ \sum\limits_{j=1}^{m} w_{i,j} = 1 & \text{for } i = 1, 2, \cdots, m \end{array}\right.$$
After this step, the weight matrix $\widehat{\mathbf{W}}$ (containing the weights $w_{i,j}$) encodes the local linear relationships between the training instances. Now the second step is to map the training instances into a $d$-dimensional space (where $d < n$) while preserving these local relationships as much as possible. If $\mathbf{z}^{(i)}$ is the image of $\mathbf{x}^{(i)}$ in this $d$-dimensional space, then we want the squared distance between $\mathbf{z}^{(i)}$ and $\sum_{j=1}^{m} w_{i,j}\mathbf{z}^{(j)}$ to be as small as possible. This idea leads to the unconstrained optimization problem described in Equation 8-5. It looks very similar to the first step, but instead of keeping the instances fixed and finding the optimal weights, we are doing the reverse: keeping the weights fixed and finding the optimal position of the instances' images in the low-dimensional space. Note that $\mathbf{Z}$ is the matrix containing all $\mathbf{z}^{(i)}$.
Equation 8-5. LLE step 2: reducing dimensionality while preserving relationships
$$\widehat{\mathbf{Z}} = \mathop{\arg\min}\limits_{\mathbf{Z}} \sum_{i=1}^{m} \left\| \mathbf{z}^{(i)} - \sum_{j=1}^{m} w_{i,j}\,\mathbf{z}^{(j)} \right\|^2$$
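To make step 1 concrete, the weights for a single instance can be computed by hand: find its $k$ nearest neighbors, then solve the constrained least-squares problem, which boils down to a small linear system built from the local Gram matrix. The following is only an illustrative sketch of the math (the regularization constant is an assumed value added for numerical stability), not Scikit-Learn's actual implementation:
from sklearn.neighbors import NearestNeighbors
k = 10
i = 0                                              # reconstruct the first instance of the Swiss roll
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)    # +1: the query point is its own nearest neighbor
neighbors = nn.kneighbors([X[i]], return_distance=False)[0][1:]   # drop the instance itself
G = X[neighbors] - X[i]                            # k x n matrix of offsets to the neighbors
C = G @ G.T                                        # local Gram matrix (k x k)
C += 1e-3 * np.trace(C) * np.eye(k)                # regularize in case C is singular (assumed constant)
w = np.linalg.solve(C, np.ones(k))
w /= w.sum()                                       # enforce the sum-to-one constraint
# x_i is (approximately) a weighted combination of its k closest neighbors
print(np.linalg.norm(X[i] - w @ X[neighbors]))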
Scikit-Learn's LLE implementation has the following computational complexity: $O(m \log(m)\, n \log(k))$ for finding the $k$ nearest neighbors, $O(m n k^3)$ for optimizing the weights, and $O(d m^2)$ for constructing the low-dimensional representations. Unfortunately, the $m^2$ in the last term makes this algorithm scale poorly to very large datasets.