14. Dimensionality Reduction


Motivation I: Data compression

Reduce data from 2D to 1D: project each example $x^{(i)} \in \mathbb{R}^2$ onto a line, so that it can be represented by a single number $z^{(i)} \in \mathbb{R}$.

Reduce data from 3D to 2D: project each example $x^{(i)} \in \mathbb{R}^3$ onto a plane, representing it by $z^{(i)} \in \mathbb{R}^2$.

Motivation II: Data Visualization

Reduce high-dimensional data to 2D or 3D so that it can be plotted and inspected.

Principal Component Analysis (PCA) problem formulation

Reduce from 2D to 1D: Find a direction (a vector $u^{(1)} \in \mathbb{R}^n$) onto which to project the data so as to minimize the projection error.

Reduce from nD to kD: Find $k$ vectors $u^{(1)}, u^{(2)}, \ldots, u^{(k)}$ onto which to project the data, so as to minimize the projection error.

PCA is not linear regression: linear regression minimizes the vertical distances from the points to the fitted line (there is a distinguished variable $y$ being predicted), whereas PCA minimizes the orthogonal projection distances and treats all features symmetrically.

Principal Component Analysis algorithm

Data preprocessing

Training set: $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$
Preprocessing (feature scaling/mean normalization):

$\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}$

Replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$.
If different features are on different scales (e.g., $x_1$ = size of house, $x_2$ = number of bedrooms), scale features to have a comparable range of values: $x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j}$, where $s_j$ is the standard deviation (or range) of feature $j$.
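
A minimal Octave sketch of this preprocessing, assuming the data is stored in an m x n matrix X with one example per row (the variable names are illustrative):

    % Mean normalization and feature scaling
    mu = mean(X);            % 1 x n row vector of feature means
    s  = std(X);             % 1 x n row vector of feature standard deviations
    X_norm = (X - mu) ./ s;  % broadcasting: each feature gets mean 0, unit scale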

Principal Component Analysis (PCA) algorithm

Reduce data from $n$ dimensions to $k$ dimensions.
Compute the "covariance matrix":

$\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)})(x^{(i)})^T$

Compute the "eigenvectors" of matrix $\Sigma$:

[U,S,V] = svd(Sigma)

$U$ will be an $n \times n$ matrix; take its first $k$ columns to get $U_{\text{reduce}} \in \mathbb{R}^{n \times k}$. Then each example is projected as $z^{(i)} = U_{\text{reduce}}^T x^{(i)} \in \mathbb{R}^k$.
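
Putting the steps together, a minimal Octave sketch (assuming X_norm is the preprocessed m x n data matrix and k has already been chosen):

    m = size(X_norm, 1);                 % number of examples
    Sigma = (1/m) * (X_norm' * X_norm);  % n x n covariance matrix
    [U, S, V] = svd(Sigma);              % columns of U are the eigenvectors
    Ureduce = U(:, 1:k);                 % first k columns of U (n x k)
    Z = X_norm * Ureduce;                % m x k; row i is z = Ureduce' * x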

Choosing the number of principal components

Choosing k

Average squared projection error: $\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} - x_{\text{approx}}^{(i)} \|^2$
Total variation in the data: $\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} \|^2$
Typically, choose $k$ to be the smallest value so that

$\frac{\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} - x_{\text{approx}}^{(i)} \|^2}{\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} \|^2} \le 0.01$

i.e., "99% of variance is retained."

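Computed directly from these definitions, the check looks like this in Octave (a sketch; Xapprox holds the reconstructed examples, one per row, and the 1/m factors cancel in the ratio):

    err_ratio = sum(sum((X_norm - Xapprox).^2)) / sum(sum(X_norm.^2));
    retained = 1 - err_ratio;   % want retained >= 0.99, i.e. err_ratio <= 0.01
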
A cheaper procedure: call

[U,S,V] = svd(Sigma)

once, and for a given $k$ check

$1 - \frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \le 0.01$

or, equivalently,

$\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \ge 0.99$

where $S_{ii}$ are the diagonal entries of $S$.
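Since a single svd call already yields $S$, the smallest acceptable $k$ can be found with a simple loop; a sketch:

    sv = diag(S);             % diagonal entries S11, ..., Snn
    total = sum(sv);
    for k = 1:length(sv)
      if sum(sv(1:k)) / total >= 0.99   % 99% of variance retained
        break;                          % smallest such k
      end
    end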
Reconstruction from compressed representation

Given $z^{(i)} = U_{\text{reduce}}^T x^{(i)}$, the original example can be approximately reconstructed as $x_{\text{approx}}^{(i)} = U_{\text{reduce}} \, z^{(i)} \approx x^{(i)}$.
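In Octave this is a single matrix product (a sketch reusing Ureduce and Z from above):

    Xapprox = Z * Ureduce';   % m x n; row i approximates the original x^(i)
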
Advice for applying PCA

Supervised learning speedup

  1. Extract inputs: take the inputs $x^{(1)}, \ldots, x^{(m)}$ from the labeled training set $(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})$ to get an unlabeled dataset.
  2. PCA: run PCA on these inputs to get $z^{(1)}, \ldots, z^{(m)}$, giving a new, lower-dimensional training set $(z^{(1)}, y^{(1)}), \ldots, (z^{(m)}, y^{(m)})$.

The mapping $x^{(i)} \mapsto z^{(i)}$ should be defined by running PCA only on the training set. The same mapping can then be applied to the examples $x_{\text{cv}}^{(i)}$ and $x_{\text{test}}^{(i)}$ in the cross-validation and test sets, as sketched below.
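
A sketch of that discipline in Octave, with hypothetical variable names Xtrain and Xtest: the statistics mu, s and the matrix Ureduce are fit on the training set only and then reused unchanged:

    % Fit the mapping on the training set only
    mu = mean(Xtrain);
    s  = std(Xtrain);
    Xn = (Xtrain - mu) ./ s;
    Sigma = (1 / size(Xn, 1)) * (Xn' * Xn);
    [U, S, V] = svd(Sigma);
    Ureduce = U(:, 1:k);
    Ztrain = Xn * Ureduce;
    % Apply the SAME mu, s, Ureduce to cross-validation / test examples
    Ztest = ((Xtest - mu) ./ s) * Ureduce;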

Applications of PCA

  • Compression
    • Reduce memory/disk needed to store data
    • Speed up learning algorithm
  • Visualization

Bad use of PCA: To prevent overfitting

Using $z^{(i)}$ instead of $x^{(i)}$ to reduce the number of features might seem like a way to prevent overfitting, but it is a bad use of PCA: PCA throws away some information without ever looking at the labels $y^{(i)}$. Use regularization instead.

PCA is sometimes used where it shouldn't be. When designing an ML system, before using PCA, first try running whatever you want to do with the original/raw data $x^{(i)}$; only if that doesn't do what you want should you implement PCA and consider using $z^{(i)}$.
