Tips on Principal Component Analysis (PCA)

Introduction

Principal Component Analysis (PCA) is an unsupervised technique for dimensionality reduction.

What is dimensionality reduction?

Let us start with an example. In a tabular data set, each column represents a feature, or dimension. It is notoriously difficult to manipulate a tabular data set with many columns/features, especially when there are more columns than observations.

For a linearly modelable problem with p = 40 features, the best subset approach would have to fit about a trillion (2^p − 1) possible models and submodels, making their computation extremely onerous.

How does PCA come to our aid?

PCA can extract information from a high-dimensional space (i.e., a tabular data set with many columns) by projecting it onto a lower-dimensional subspace. The idea is that the projection space will have dimensions, named principal components, that will explain the majority of the variation of the original data set.

How does PCA work exactly?

PCA is the eigenvalue decomposition of the covariance matrix of the (centered) features, which finds the directions of maximum variation. The eigenvalues represent the variance explained by each principal component.
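
In symbols (a standard formulation, written out here for reference): for a centered data matrix X with n observations and p features, PCA diagonalizes the sample covariance matrix:

```latex
\Sigma = \frac{1}{n-1} X^{\top} X = W \Lambda W^{\top}
```

where the columns of W are the eigenvectors (the principal directions) and Λ is the diagonal matrix of eigenvalues, i.e., the variance explained by each component.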

The purpose of PCA is to obtain an easier and faster way to both manipulate the data set (reducing its dimensions) and retain most of the original information through the explained variance.

The question now is

How many components should I use for dimensionality reduction? What is the “right” number?

In this post, we will discuss some tips for selecting the optimal number of principal components, with practical examples in Python, by:

  1. Observing the cumulative ratio of explained variance.
  2. Observing the eigenvalues of the covariance matrix.
  3. Tuning the number of components as a hyper-parameter in a cross-validation framework where PCA is applied within a Machine Learning pipeline.

Finally, we will also apply dimensionality reduction on a new observation, in the scenario where PCA was already applied to a data set, and we would like to project the new observation on the previously obtained subspace.

Environment set-up

At first, we import the modules we will be using, and load the “Breast Cancer Data Set”: it contains 569 observations and 30 features carrying relevant clinical information (such as radius, texture, perimeter, area, etc.) computed from digitized images of aspirates of breast masses. It presents a binary classification problem, as the labels are only 0 or 1 (benign vs malignant), indicating whether a patient has breast cancer or not.

The data set is already available in scikit-learn:
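
A minimal sketch of the set-up (the variable names are my own, not necessarily those used in the original post):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

# Load the Breast Cancer Wisconsin data set: 569 observations, 30 features
data = load_breast_cancer()
X, y = data.data, data.target

print(X.shape)  # (569, 30)
```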

Without diving deep into the pre-processing task, it is important to mention that PCA is affected by the different scales of the features in the data.

Therefore, before applying PCA the data must be scaled (i.e., converted to have mean=0 and variance=1). This can be easily achieved with the scikit-learn StandardScaler object:
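
A possible sketch of the scaling step, producing the sanity check printed below:

```python
from sklearn.preprocessing import StandardScaler

# Standardize the features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print('Mean: ', np.mean(X_scaled))
print('Standard Deviation:', np.std(X_scaled))
```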

This returns:

Mean:  -6.118909323768877e-16
Standard Deviation: 1.0

Once the features are scaled, applying PCA is straightforward. In fact, scikit-learn handles almost everything by itself: the user only has to declare the number of components and then fit.

Notably, the scikit-learn user can either declare the number of components to be used, or the ratio of explained variance to be reached (a short sketch follows the list):

  • pca = PCA(n_components=5): performs PCA using 5 components.

  • pca = PCA(n_components=.95): performs PCA using the number of components needed to explain 95% of the variance.
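
For illustration, a short sketch of both options (variable names are mine):

```python
from sklearn.decomposition import PCA

# Option 1: fix the number of components explicitly
pca_5 = PCA(n_components=5)
scores_5 = pca_5.fit_transform(X_scaled)

# Option 2: keep as many components as needed to explain 95% of the variance
pca_95 = PCA(n_components=0.95)
scores_95 = pca_95.fit_transform(X_scaled)

print(scores_5.shape, scores_95.shape)
```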

Indeed, this is one way to select the number of components: asking scikit-learn to reach a certain amount of explained variance, such as 95%. But maybe we could have used a significantly lower number of dimensions and reached a similar variance, for example 92%.

So, how do we select the number of components?

1. Observing the ratio of explained variance

PCA achieves dimensionality reduction by projecting the observations on a smaller subspace, but we also want to keep as much information as possible in terms of variance.

So, one heuristic yet effective approach is to see how much variance is explained by adding the principal components one by one, and afterwards select the number of dimensions that meets our expectations.

It is very easy to follow this approach thanks to scikit-learn, which provides the explained_variance_ratio_ attribute on the (fitted) PCA object:
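
A sketch of this step, assuming a PCA fitted with all 30 components on the scaled data:

```python
# Fit PCA keeping all components
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Cumulative ratio of explained variance
cum_var = np.cumsum(pca.explained_variance_ratio_)

plt.plot(range(1, len(cum_var) + 1), cum_var, marker='o')
plt.xlabel('Number of principal components')
plt.ylabel('Cumulative explained variance ratio')
plt.grid(True)
plt.show()

print(cum_var[:6])  # cumulative ratio retained by the first 6 components
```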

From the plot, we can see that the first 6 components are sufficient to retain 89% of the original variance.

This is a good result, considering that we started with a data set of 30 features, and that we could limit further analysis to only 6 dimensions without losing too much information.

2. Using the covariance matrix

Covariance is a measure of the “spread” of a set of observations around their mean value. When we apply PCA, what happens behind the scenes is that we apply a rotation to the covariance matrix of our data, in order to obtain a diagonal covariance matrix. In this way, we obtain data whose dimensions are uncorrelated.

The diagonal covariance matrix obtained after transformation is the eigenvalue matrix, where the eigenvalues correspond to the variance explained by each component.

Therefore, another approach to the selection of the ideal number of components is to look for an “elbow” in the plot of the eigenvalues.

Let us observe the first elements of the covariance matrix of the principal components. As said, we expect it to be diagonal:
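
A possible way to compute and inspect it (continuing with the pca and X_pca objects defined above):

```python
# Covariance matrix of the data projected onto the principal components
cov_pc = np.cov(X_pca.T)

# Print the top-left corner: it should be (almost) diagonal
print(np.round(cov_pc[:5, :5], 4))
```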

Indeed, at first glance the covariance matrix appears to be diagonal. In order to be sure that the matrix is diagonal, we can verify that all the values outside of the main diagonal are almost equal to zero (up to a certain decimal, as they will not be exactly zero).

We can use the assert_almost_equal statement, which raises an exception if its inner condition is not met, and produces no visible output if the condition is met. In this case, no exception is raised (up to the tenth decimal):
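
A minimal sketch of this check, using numpy.testing.assert_almost_equal:

```python
from numpy.testing import assert_almost_equal

# All off-diagonal entries should be zero up to the 10th decimal:
# no output means no exception was raised
off_diagonal = cov_pc[~np.eye(cov_pc.shape[0], dtype=bool)]
assert_almost_equal(off_diagonal, np.zeros_like(off_diagonal), decimal=10)
```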

The matrix is diagonal. Now we can proceed to plot the eigenvalues from the covariance matrix and look for an elbow in the plot.

We use NumPy's diag function to extract the eigenvalues from the covariance matrix:
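
A sketch of the scree plot built from the diagonal of the covariance matrix:

```python
# The diagonal of the covariance matrix holds the eigenvalues
eigenvalues = np.diag(cov_pc)

# Scree plot: look for an "elbow" where the slope changes significantly
plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker='o')
plt.xlabel('Principal component')
plt.ylabel('Eigenvalue (explained variance)')
plt.grid(True)
plt.show()
```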

We may see an “elbow” around the sixth component, where the slope seems to change significantly.

Actually, none of these steps were strictly needed: scikit-learn provides, among others, the explained_variance_ attribute, defined in the documentation as “The amount of variance explained by each of the selected components. Equal to n_components largest eigenvalues of the covariance matrix of X.”:
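
For example, comparing the attribute with the eigenvalues computed above:

```python
# The eigenvalues are directly available from the fitted PCA object
print(pca.explained_variance_[:6])
print(eigenvalues[:6])  # should match, up to numerical precision
```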

In fact, we obtain the same result as from the explicit calculation of the covariance matrix and its eigenvalues.

3. Applying a cross-validation procedure

Although PCA is an unsupervised technique, it might be used together with other techniques in a broader pipeline for a supervised problem.

For instance, we might have a classification (or regression) problem in a large data set, and we might apply PCA before our classification (or regression) model in order to reduce the dimensionality of the input dataset.

In this scenario, we would tune the number of principal components as a hyper-parameter within a cross-validation procedure.

This can be achieved by using two scikit-learn objects:

  • Pipeline: allows the definition of a pipeline of sequential steps in order to cross-validate them together.

  • GridSearchCV: performs a grid search in a cross-validation framework for hyper-parameter tuning (i.e., finding the optimal parameters of the steps in the pipeline).

The process is as follows:

  1. The steps (dimensionality reduction, classification) are chained in a pipeline.
  2. The parameters to search are defined.
  3. The grid search procedure is executed.

In our example, we are facing a binary classification problem. Therefore, we apply PCA followed by logistic regression in a pipeline:
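
A sketch of the pipeline and grid search. The parameter ranges below are an assumption on my part (the post only reports the best values found), so adjust them to your problem:

```python
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Chain dimensionality reduction and classification in a single pipeline
pipe = Pipeline([
    ('pca', PCA()),
    ('log_reg', LogisticRegression(max_iter=10000)),
])

# Hyper-parameters to tune: number of components and regularization strength
param_grid = {
    'pca__n_components': list(range(1, 31)),
    'log_reg__C': np.logspace(-4, 4, 81),
}

grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_scaled, y)

print('Best parameters obtained from Grid Search:')
print(grid_search.best_params_)
```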

This returns:

Best parameters obtained from Grid Search:
{'log_reg__C': 1.2589254117941673, 'pca__n_components': 9}

The grid search finds the best number of components for the PCA during the cross-validation procedure.

For our problem and the tested parameter range, the best number of components is 9.

The grid search provides more detailed results in the cv_results_ attribute, which can be stored as a pandas DataFrame and inspected:
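
For instance (the selected columns are just an example):

```python
import pandas as pd

# Store the detailed cross-validation results in a DataFrame
cv_results = pd.DataFrame(grid_search.cv_results_)

print(cv_results[['param_pca__n_components',
                  'param_log_reg__C',
                  'mean_test_score']].head())
```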

Some of the columns of the DataFrame obtained from the cv_results_ attribute.

As we can see, it contains detailed information on the cross-validated procedure with the grid search.

But we might not be interested in seeing all the iterations performed by the grid search. Therefore, we can get the best validation score (averaged over all folds) for each number of components, and finally plot them together with the cumulative ratio of explained variance:
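
A possible sketch of this plot, reusing the cv_results DataFrame and the cum_var array from the previous steps:

```python
# Best mean validation score for each number of components
best_scores = (cv_results
               .groupby('param_pca__n_components')['mean_test_score']
               .max())

fig, ax1 = plt.subplots()
ax1.plot(best_scores.index.astype(int), best_scores.values,
         marker='o', label='Best CV accuracy')
ax1.set_xlabel('Number of principal components')
ax1.set_ylabel('Validation accuracy')

ax2 = ax1.twinx()
ax2.plot(range(1, len(cum_var) + 1), cum_var,
         linestyle='--', color='grey', label='Cumulative explained variance')
ax2.set_ylabel('Cumulative explained variance ratio')

fig.legend(loc='lower right')
plt.show()
```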

From the plot, we can notice that 6 components are enough to create a model whose validation accuracy reaches 97%, whereas considering all 30 components would lead to a 98% validation accuracy.

In a scenario with a significant number of features in an input data set, reducing the number of input features with PCA could lead to significant advantages in terms of:

  1. Reduced training and prediction time.
  2. Increased scalability.
  3. Reduced computational effort for training.

At the same time, by choosing the optimal number of principal components in a pipeline for a supervised problem, tuning this hyper-parameter in a cross-validated procedure, we ensure that performance stays close to optimal.

However, it must be taken into account that on a data set with many features, PCA itself may prove computationally expensive.

How to apply PCA to a new observation?

Now, let us suppose that we have applied the PCA to an existing data set and kept (for example) 6 components.

At some point, a new observation is added to the data set and needs to be projected on the reduced subspace obtained by PCA.

How can this be achieved?

We can perform this calculation manually through the projection matrix.

We can also estimate the error of the manual calculation by checking whether we get the same output as “fit_transform” on the original data:
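
A sketch of this verification, using the first 6 rows of pca.components_ as the projection matrix:

```python
# The rows of pca.components_ are the principal directions
W = pca.components_[:6]

# The projection matrix is orthogonal: W @ W.T should be the identity
print(np.round(W @ W.T, 4))

# Manual projection of the scaled data vs scikit-learn's transform
X_manual = X_scaled @ W.T
X_sklearn = pca.transform(X_scaled)[:, :6]

# Estimate the numerical error of the manual reduction
print('Max absolute error:', np.abs(X_manual - X_sklearn).max())
```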

The projection matrix is orthogonal, and the manual reduction introduces only a very small numerical error.

We can finally obtain the projection by multiplying the new (scaled) observation by the transposed projection matrix:
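
A sketch of this projection. The new observation below is hypothetical (here simply a copy of an existing row, for illustration); in practice it would come from new data and must be scaled with the scaler fitted on the original data:

```python
# Hypothetical new observation with the same 30 features
new_obs = X[0].reshape(1, -1)

# Scale it with the previously fitted scaler, then project it onto the subspace
new_obs_scaled = scaler.transform(new_obs)
new_obs_projected = new_obs_scaled @ W.T

print(new_obs_projected)  # a 6-dimensional representation of the new observation
```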

This returns:

[-3.22877012 -1.17207348  0.26466433 -1.00294458  0.89446764  0.62922496]

That’s it! The new observation is projected onto the 6-dimensional subspace obtained with PCA.

Conclusion

This tutorial is meant to provide a few tips on the selection of the number of components to be used for dimensionality reduction with PCA, showing practical demonstrations in Python.

Finally, it also explains how to project a new sample onto the previously obtained reduced subspace, information which is rarely found in tutorials on the subject.

This is but a brief overview. The topic is far broader and has been deeply investigated in the literature.

Translated from: https://medium.com/@nicolo_albanese/tips-on-principal-component-analysis-7116265971ad
