参考:
1、http://scikit-learn.org/stable/
2、http://scikit-learn.org/stable/modules/decomposition.html#decompositions
PCA
被实现为一个变压器对象,它可以在其拟合方法中学习n个组件,并且可以使用新数据来投影这些组件。
可选参数whiten=True使得可以将数据投影到单个空间上,同时将每个组件缩放到单位方差。 如果下游模型对信号的各向同性作出强烈的假设,这通常是有用的:例如,具有RBF内核和K-Means聚类算法的支持向量机的情况。
以下是 iris 数据集的一个例子,它由4个特征组成,它们在二维上预测出可解释大多数方差:
PCA
对象还提供了PCA的概率解释,可以根据其解释的方差量给出数据的可能性。 因此,它实现了可用于交叉验证的分数方法:
Examples:
应用于此数据的主成分分析(PCA)标识了数据中最大差异的属性(主要组分或特征空间中的方向)的组合。 这里我们绘制2个第一主成分的不同样本。
线性判别分析(LDA)尝试识别考虑类之间差异最大的属性。 特别地,LDA与PCA相反,是使用已知类标签的监督方法。
print(__doc__) import matplotlib.pyplot as plt from sklearn import datasets from sklearn.decomposition import PCA from sklearn.discriminant_analysis import LinearDiscriminantAnalysis iris = datasets.load_iris() X = iris.data y = iris.target target_names = iris.target_names pca = PCA(n_components=2) X_r = pca.fit(X).transform(X) lda = LinearDiscriminantAnalysis(n_components=2) X_r2 = lda.fit(X, y).transform(X) # Percentage of variance explained for each components print('explained variance ratio (first two components): %s' % str(pca.explained_variance_ratio_)) plt.figure() colors = ['navy', 'turquoise', 'darkorange'] lw = 2 for color, i, target_name in zip(colors, [0, 1, 2], target_names): plt.scatter(X_r[y == i, 0], X_r[y == i, 1], color=color, alpha=.8, lw=lw, label=target_name) plt.legend(loc='best', shadow=False, scatterpoints=1) plt.title('PCA of IRIS dataset') plt.figure() for color, i, target_name in zip(colors, [0, 1, 2], target_names): plt.scatter(X_r2[y == i, 0], X_r2[y == i, 1], alpha=.8, color=color, label=target_name) plt.legend(loc='best', shadow=False, scatterpoints=1) plt.title('LDA of IRIS dataset') plt.show()
# Authors: Alexandre Gramfort # Denis A. Engemann # License: BSD 3 clause import numpy as np import matplotlib.pyplot as plt from scipy import linalg from sklearn.decomposition import PCA, FactorAnalysis from sklearn.covariance import ShrunkCovariance, LedoitWolf from sklearn.model_selection import cross_val_score from sklearn.model_selection import GridSearchCV print(__doc__) # ############################################################################# # Create the data n_samples, n_features, rank = 1000, 50, 10 sigma = 1. rng = np.random.RandomState(42) U, _, _ = linalg.svd(rng.randn(n_features, n_features)) X = np.dot(rng.randn(n_samples, rank), U[:, :rank].T) # Adding homoscedastic noise X_homo = X + sigma * rng.randn(n_samples, n_features) # Adding heteroscedastic noise sigmas = sigma * rng.rand(n_features) + sigma / 2. X_hetero = X + rng.randn(n_samples, n_features) * sigmas # ############################################################################# # Fit the models n_components = np.arange(0, n_features, 5) # options for n_components def compute_scores(X): pca = PCA(svd_solver='full') fa = FactorAnalysis() pca_scores, fa_scores = [], [] for n in n_components: pca.n_components = n fa.n_components = n pca_scores.append(np.mean(cross_val_score(pca, X))) fa_scores.append(np.mean(cross_val_score(fa, X))) return pca_scores, fa_scores def shrunk_cov_score(X): shrinkages = np.logspace(-2, 0, 30) cv = GridSearchCV(ShrunkCovariance(), {'shrinkage': shrinkages}) return np.mean(cross_val_score(cv.fit(X).best_estimator_, X)) def lw_score(X): return np.mean(cross_val_score(LedoitWolf(), X)) for X, title in [(X_homo, 'Homoscedastic Noise'), (X_hetero, 'Heteroscedastic Noise')]: pca_scores, fa_scores = compute_scores(X) n_components_pca = n_components[np.argmax(pca_scores)] n_components_fa = n_components[np.argmax(fa_scores)] pca = PCA(svd_solver='full', n_components='mle') pca.fit(X) n_components_pca_mle = pca.n_components_ print("best n_components by PCA CV = %d" % n_components_pca) print("best n_components by FactorAnalysis CV = %d" % n_components_fa) print("best n_components by PCA MLE = %d" % n_components_pca_mle) plt.figure() plt.plot(n_components, pca_scores, 'b', label='PCA scores') plt.plot(n_components, fa_scores, 'r', label='FA scores') plt.axvline(rank, color='g', label='TRUTH: %d' % rank, linestyle='-') plt.axvline(n_components_pca, color='b', label='PCA CV: %d' % n_components_pca, linestyle='--') plt.axvline(n_components_fa, color='r', label='FactorAnalysis CV: %d' % n_components_fa, linestyle='--') plt.axvline(n_components_pca_mle, color='k', label='PCA MLE: %d' % n_components_pca_mle, linestyle='--') # compare with other covariance estimators plt.axhline(shrunk_cov_score(X), color='violet', label='Shrunk Covariance MLE', linestyle='-.') plt.axhline(lw_score(X), color='orange', label='LedoitWolf MLE' % n_components_pca_mle, linestyle='-.') plt.xlabel('nb of components') plt.ylabel('CV scores') plt.legend(loc='lower right') plt.title(title) plt.show()
PCA
对象非常有用,但对大型数据集有一定限制。 最大的限制是PCA只支持批处理,这意味着要处理的所有数据必须适合主内存。 IncrementalPCA
对象使用不同的处理形式,并允许部分计算几乎完全匹配PCA的结果,同时以迷你形式处理数据。 IncrementalPCA使得可以通过以下方式实现核心主成分分析:partial_fit
方法。IncrementalPCA仅存储组件和噪声方差的估计,按顺序更新explained_variance_ratio_。 这就是为什么内存使用取决于每个批次的样本数,而不是数据集中要处理的样本数。
Examples:
print(__doc__) # Authors: Kyle Kastner # License: BSD 3 clause import numpy as np import matplotlib.pyplot as plt from sklearn.datasets import load_iris from sklearn.decomposition import PCA, IncrementalPCA iris = load_iris() X = iris.data y = iris.target n_components = 2 ipca = IncrementalPCA(n_components=n_components, batch_size=10) X_ipca = ipca.fit_transform(X) pca = PCA(n_components=n_components) X_pca = pca.fit_transform(X) colors = ['navy', 'turquoise', 'darkorange'] for X_transformed, title in [(X_ipca, "Incremental PCA"), (X_pca, "PCA")]: plt.figure(figsize=(8, 8)) for color, i, target_name in zip(colors, [0, 1, 2], iris.target_names): plt.scatter(X_transformed[y == i, 0], X_transformed[y == i, 1], color=color, lw=2, label=target_name) if "Incremental" in title: err = np.abs(np.abs(X_pca) - np.abs(X_ipca)).mean() plt.title(title + " of iris dataset\nMean absolute unsigned error " "%.6f" % err) else: plt.title(title + " of iris dataset") plt.legend(loc="best", shadow=False, scatterpoints=1) plt.axis([-4, 4, -1.5, 1.5]) plt.show()
例如,如果我们使用64x64像素的灰度级图像进行人脸识别,则数据的维度为4096,并且在这样大的数据上训练RBF支持向量机的速度很慢。此外,我们知道数据的固有维数远低于4096,因为人脸的所有照片都看起来有点相似。样品位于更低维度的歧管上(例如约200左右)。 PCA算法可以用于线性变换数据,同时降低维数,同时保留大多数解释的方差。
在这种情况下,与可选参数svd_solver='randomized'一起使用的PCA
类非常有用:因为我们将要删除大部分奇异向量,所以将计算限制为单个向量的近似估计将更有效,我们将保持实际执行转换。
注意:即使在whiten = False(默认)时,使用svd_solver ='randomized'的PCA中的inverse_transform的实现也不是transform
的精确逆变换。
Examples:
from __future__ import print_function from time import time import logging import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.model_selection import GridSearchCV from sklearn.datasets import fetch_lfw_people from sklearn.metrics import classification_report from sklearn.metrics import confusion_matrix from sklearn.decomposition import PCA from sklearn.svm import SVC print(__doc__) # Display progress logs on stdout logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s') # ############################################################################# # Download the data, if not already on disk and load it as numpy arrays lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4) # introspect the images arrays to find the shapes (for plotting) n_samples, h, w = lfw_people.images.shape # for machine learning we use the 2 data directly (as relative pixel # positions info is ignored by this model) X = lfw_people.data n_features = X.shape[1] # the label to predict is the id of the person y = lfw_people.target target_names = lfw_people.target_names n_classes = target_names.shape[0] print("Total dataset size:") print("n_samples: %d" % n_samples) print("n_features: %d" % n_features) print("n_classes: %d" % n_classes) # ############################################################################# # Split into a training set and a test set using a stratified k fold # split into a training and testing set X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.25, random_state=42) # ############################################################################# # Compute a PCA (eigenfaces) on the face dataset (treated as unlabeled # dataset): unsupervised feature extraction / dimensionality reduction n_components = 150 print("Extracting the top %d eigenfaces from %d faces" % (n_components, X_train.shape[0])) t0 = time() pca = PCA(n_components=n_components, svd_solver='randomized', whiten=True).fit(X_train) print("done in %0.3fs" % (time() - t0)) eigenfaces = pca.components_.reshape((n_components, h, w)) print("Projecting the input data on the eigenfaces orthonormal basis") t0 = time() X_train_pca = pca.transform(X_train) X_test_pca = pca.transform(X_test) print("done in %0.3fs" % (time() - t0)) # ############################################################################# # Train a SVM classification model print("Fitting the classifier to the training set") t0 = time() param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5], 'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1], } clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid) clf = clf.fit(X_train_pca, y_train) print("done in %0.3fs" % (time() - t0)) print("Best estimator found by grid search:") print(clf.best_estimator_) # ############################################################################# # Quantitative evaluation of the model quality on the test set print("Predicting people's names on the test set") t0 = time() y_pred = clf.predict(X_test_pca) print("done in %0.3fs" % (time() - t0)) print(classification_report(y_test, y_pred, target_names=target_names)) print(confusion_matrix(y_test, y_pred, labels=range(n_classes))) # ############################################################################# # Qualitative evaluation of the predictions using matplotlib def plot_gallery(images, titles, h, w, n_row=3, n_col=4): """Helper function to plot a gallery of portraits""" plt.figure(figsize=(1.8 * n_col, 2.4 * n_row)) plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35) for i in range(n_row * n_col): plt.subplot(n_row, n_col, i + 1) plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray) plt.title(titles[i], size=12) plt.xticks(()) plt.yticks(()) # plot the result of the prediction on a portion of the test set def title(y_pred, y_test, target_names, i): pred_name = target_names[y_pred[i]].rsplit(' ', 1)[-1] true_name = target_names[y_test[i]].rsplit(' ', 1)[-1] return 'predicted: %s\ntrue: %s' % (pred_name, true_name) prediction_titles = [title(y_pred, y_test, target_names, i) for i in range(y_pred.shape[0])] plot_gallery(X_test, prediction_titles, h, w) # plot the gallery of the most significative eigenfaces eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])] plot_gallery(eigenfaces, eigenface_titles, h, w) plt.show()
print(__doc__) # Authors: Vlad Niculae, Alexandre Gramfort # License: BSD 3 clause import logging from time import time from numpy.random import RandomState import matplotlib.pyplot as plt from sklearn.datasets import fetch_olivetti_faces from sklearn.cluster import MiniBatchKMeans from sklearn import decomposition # Display progress logs on stdout logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s') n_row, n_col = 2, 3 n_components = n_row * n_col image_shape = (64, 64) rng = RandomState(0) # ############################################################################# # Load faces data dataset = fetch_olivetti_faces(shuffle=True, random_state=rng) faces = dataset.data n_samples, n_features = faces.shape # global centering faces_centered = faces - faces.mean(axis=0) # local centering faces_centered -= faces_centered.mean(axis=1).reshape(n_samples, -1) print("Dataset consists of %d faces" % n_samples) def plot_gallery(title, images, n_col=n_col, n_row=n_row): plt.figure(figsize=(2. * n_col, 2.26 * n_row)) plt.suptitle(title, size=16) for i, comp in enumerate(images): plt.subplot(n_row, n_col, i + 1) vmax = max(comp.max(), -comp.min()) plt.imshow(comp.reshape(image_shape), cmap=plt.cm.gray, interpolation='nearest', vmin=-vmax, vmax=vmax) plt.xticks(()) plt.yticks(()) plt.subplots_adjust(0.01, 0.05, 0.99, 0.93, 0.04, 0.) # ############################################################################# # List of the different estimators, whether to center and transpose the # problem, and whether the transformer uses the clustering API. estimators = [ ('Eigenfaces - PCA using randomized SVD', decomposition.PCA(n_components=n_components, svd_solver='randomized', whiten=True), True), ('Non-negative components - NMF', decomposition.NMF(n_components=n_components, init='nndsvda', tol=5e-3), False), ('Independent components - FastICA', decomposition.FastICA(n_components=n_components, whiten=True), True), ('Sparse comp. - MiniBatchSparsePCA', decomposition.MiniBatchSparsePCA(n_components=n_components, alpha=0.8, n_iter=100, batch_size=3, random_state=rng), True), ('MiniBatchDictionaryLearning', decomposition.MiniBatchDictionaryLearning(n_components=15, alpha=0.1, n_iter=50, batch_size=3, random_state=rng), True), ('Cluster centers - MiniBatchKMeans', MiniBatchKMeans(n_clusters=n_components, tol=1e-3, batch_size=20, max_iter=50, random_state=rng), True), ('Factor Analysis components - FA', decomposition.FactorAnalysis(n_components=n_components, max_iter=2), True), ] # ############################################################################# # Plot a sample of the input data plot_gallery("First centered Olivetti faces", faces_centered[:n_components]) # ############################################################################# # Do the estimation and plot it for name, estimator, center in estimators: print("Extracting the top %d %s..." % (n_components, name)) t0 = time() data = faces if center: data = faces_centered estimator.fit(data) train_time = (time() - t0) print("done in %0.3fs" % train_time) if hasattr(estimator, 'cluster_centers_'): components_ = estimator.cluster_centers_ else: components_ = estimator.components_ # Plot an image representing the pixelwise variance provided by the # estimator e.g its noise_variance_ attribute. The Eigenfaces estimator, # via the PCA decomposition, also provides a scalar noise_variance_ # (the mean of pixelwise variance) that cannot be displayed as an image # so we skip it. if (hasattr(estimator, 'noise_variance_') and estimator.noise_variance_.ndim > 0): # Skip the Eigenfaces case plot_gallery("Pixelwise variance", estimator.noise_variance_.reshape(1, -1), n_col=1, n_row=1) plot_gallery('%s - Train time %.1fs' % (name, train_time), components_[:n_components]) plt.show()
KernelPCA
是通过使用内核实现非线性维数降低的PCA的扩展( Pairwise metrics, Affinities and Kernels )。 它具有许多应用,包括去噪,压缩和结构预测(内核依赖估计)。 KernelPCA支持 transform
and inverse_transform
。
Examples:
print(__doc__) # Authors: Mathieu Blondel # Andreas Mueller # License: BSD 3 clause import numpy as np import matplotlib.pyplot as plt from sklearn.decomposition import PCA, KernelPCA from sklearn.datasets import make_circles np.random.seed(0) X, y = make_circles(n_samples=400, factor=.3, noise=.05) kpca = KernelPCA(kernel="rbf", fit_inverse_transform=True, gamma=10) X_kpca = kpca.fit_transform(X) X_back = kpca.inverse_transform(X_kpca) pca = PCA() X_pca = pca.fit_transform(X) # Plot results plt.figure() plt.subplot(2, 2, 1, aspect='equal') plt.title("Original space") reds = y == 0 blues = y == 1 plt.scatter(X[reds, 0], X[reds, 1], c="red", s=20, edgecolor='k') plt.scatter(X[blues, 0], X[blues, 1], c="blue", s=20, edgecolor='k') plt.xlabel("$x_1$") plt.ylabel("$x_2$") X1, X2 = np.meshgrid(np.linspace(-1.5, 1.5, 50), np.linspace(-1.5, 1.5, 50)) X_grid = np.array([np.ravel(X1), np.ravel(X2)]).T # projection on the first principal component (in the phi space) Z_grid = kpca.transform(X_grid)[:, 0].reshape(X1.shape) plt.contour(X1, X2, Z_grid, colors='grey', linewidths=1, origin='lower') plt.subplot(2, 2, 2, aspect='equal') plt.scatter(X_pca[reds, 0], X_pca[reds, 1], c="red", s=20, edgecolor='k') plt.scatter(X_pca[blues, 0], X_pca[blues, 1], c="blue", s=20, edgecolor='k') plt.title("Projection by PCA") plt.xlabel("1st principal component") plt.ylabel("2nd component") plt.subplot(2, 2, 3, aspect='equal') plt.scatter(X_kpca[reds, 0], X_kpca[reds, 1], c="red", s=20, edgecolor='k') plt.scatter(X_kpca[blues, 0], X_kpca[blues, 1], c="blue", s=20, edgecolor='k') plt.title("Projection by KPCA") plt.xlabel("1st principal component in space induced by $\phi$") plt.ylabel("2nd component") plt.subplot(2, 2, 4, aspect='equal') plt.scatter(X_back[reds, 0], X_back[reds, 1], c="red", s=20, edgecolor='k') plt.scatter(X_back[blues, 0], X_back[blues, 1], c="blue", s=20, edgecolor='k') plt.title("Original space after inverse transform") plt.xlabel("$x_1$") plt.ylabel("$x_2$") plt.subplots_adjust(0.02, 0.10, 0.98, 0.94, 0.04, 0.35) plt.show()
SparsePCA
是PCA的一个变体,其目的是提取最能重建数据的稀疏组件集合。MiniBatchSparsePCA
)是SparsePCA
的一个变体,它更快,但不太准确。 通过迭代一组特征的小块来达到增加的速度,对于给定的迭代次数。
注意;虽然本着在线算法的精神,MiniBatchSparsePCA
类没有实现partial_fit
,因为算法沿着要素方向在线,而不是样本方向。
Examples:
TruncatedSVD
实现了仅计算k个最大奇异值的奇异值分解(SVD)的变体,其中k是用户指定的参数。
当截断的SVD被应用于术语文档矩阵(由CountVectorizer或TfidfVectorizer返回)时,该变换被称为潜在语义分析(LSA),因为它将这样的矩阵转换为低维度的“语义”空间。 特别地,LSA已知能够抵抗同义词和多义词的影响(两者大致意味着每个单词有多重含义),这导致术语文档矩阵过度稀疏,并且在诸如余弦相似性的度量下表现出差的相似性。
Examples:
SparseCoder
对象是一种估计器,可用于将信号转换为来自固定的预计算字典(例如离散小波)的原子的稀疏线性组合。 因此,该对象不实现拟合方法。 该转换相当于稀疏编码问题:将数据的表示尽可能少的字典原子的线性组合。 字典学习的所有变体实现以下变换方法,可通过transform_method初始化参数进行控制:
Examples:
DictionaryLearning
)是一个矩阵因式分解问题,相当于找到一个(通常是不完整的)字典,它会对拟合数据进行稀疏编码。
Examples:
MiniBatchDictionaryLearning
实现了更适合大型数据集的字典学习算法的更快,但不太准确的版本。
默认情况下,MiniBatchDictionaryLearning
将数据分成小批量,并通过在指定次数的迭代中循环使用小批量,以在线方式进行优化。 但是,目前它没有实现停止条件。
估计器还实现了partial_fit,它通过在一个迷你批处理中仅迭代一次来更新字典。 当数据从一开始就不容易获得,或者当数据不适合内存时,这可以用于在线学习。
Example: Online learning of a dictionary of parts of faces
Examples:
独立分量分析将多变量信号分解为最大独立的加性子组件。 它使用Fast ICA
算法在scikit-learn中实现。 通常,ICA不用于降低维度,而是用于分离叠加信号。 由于ICA模型不包括噪声项,因此要使模型正确,必须应用美白。 这可以在内部使用whiten参数或手动使用其中一种PCA变体进行。
通常用于分离混合信号(称为盲源分离的问题),如下例所示:
Examples:
Examples:
Latent Dirichlet Allocation是离散数据集(如文本语料库)的集合的生成概率模型。 它也是一个主题模型,用于从文档集合中发现抽象主题。
Examples: