Principal component analysis in R with princomp

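While writing my final course project I picked up some R, and ran into a problem with principal component analysis for which I could not find a systematic solution online. The reasoning and method that solved it are worth reusing, so I am writing them down for the record.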

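Principal component analysis is a classic dimensionality-reduction technique. It uses variance as the measure of information and, after a fairly simple derivation, reduces the problem to finding the eigenvalues and eigenvectors of a matrix. That is essentially all the mathematics the basic method involves.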
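To make the eigenvalue formulation concrete, here is a minimal sketch in base R. The built-in USArrests data set is only a stand-in chosen for illustration; it is not the data from my assignment.

```r
# Stand-in data: any all-numeric data frame would do.
X <- as.matrix(USArrests)

S <- cov(X)                      # sample covariance matrix
e <- eigen(S, symmetric = TRUE)  # eigenvalues are returned in decreasing order

# Each eigenvalue is the variance captured along its principal direction,
# so the ratios below are the proportions of total variance explained.
prop_var <- e$values / sum(e$values)
print(round(prop_var, 4))

# Dimensionality reduction: project the centred data onto the first
# two eigenvectors to get a two-column score matrix.
scores <- scale(X, center = TRUE, scale = FALSE) %*% e$vectors[, 1:2]
print(dim(scores))
```

The variance of the first score column equals the largest eigenvalue, which is the sense in which variance stands for information here.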

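The book R in Action mainly relies on the principal and fa functions from the psych package, which are said to be more powerful. Since I only needed a few simple commands for my project, I did not go that route and mainly used the princomp function.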

[Figure 1: the textbook example]

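This is the demo from my mathematical statistics textbook. The textbook works directly from the covariance matrix, although problems like this are usually approached from the correlation matrix; that is a quirk of the textbook.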
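The choice between the two matrices matters whenever the variables are on different scales. The sketch below compares them on the built-in USArrests data, which is again only a stand-in and not the textbook data.

```r
# Stand-in data with variables on quite different scales.
X <- as.matrix(USArrests)

ev_cov <- eigen(cov(X), symmetric = TRUE)$values  # covariance-based PCA
ev_cor <- eigen(cor(X), symmetric = TRUE)$values  # correlation-based PCA

# Proportion of variance explained by each component under both choices.
print(round(ev_cov / sum(ev_cov), 3))
print(round(ev_cor / sum(ev_cor), 3))

# The correlation matrix is simply the covariance matrix of the
# standardised columns, so both routes to it agree.
same <- isTRUE(all.equal(cov(scale(X)), cor(X)))
print(same)
```

With the covariance matrix, the variable with the largest variance tends to dominate the first component; the correlation matrix puts every variable on an equal footing.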

Description

princomp performs a principal components analysis on the given numeric
data matrix and returns the results as an object of class princomp.

Usage

princomp(x, ...)

## S3 method for class 'formula'
princomp(formula, data = NULL, subset, na.action, ...)

## Default S3 method:
princomp(x, cor = FALSE, scores = TRUE, covmat = NULL,
         subset = rep_len(TRUE, nrow(as.matrix(x))), fix_sign = TRUE, ...)

## S3 method for class 'princomp'
predict(object, newdata, ...)

Arguments

formula a formula with no response variable, referring only to
numeric variables.

data an optional data frame (or similar: see model.frame) containing
the variables in the formula formula. By default the variables are
taken from environment(formula).

subset an optional vector used to select rows (observations) of the
data matrix x.

na.action a function which indicates what should happen when the data
contain NAs. The default is set by the na.action setting of options,
and is na.fail if that is unset. The ‘factory-fresh’ default is
na.omit.

x a numeric matrix or data frame which provides the data for the
principal components analysis.

cor a logical value indicating whether the calculation should use the
correlation matrix or the covariance matrix. (The correlation matrix
can only be used if there are no constant variables.)

scores a logical value indicating whether the score on each principal
component should be calculated.

covmat a covariance matrix, or a covariance list as returned by
cov.wt (and cov.mve or cov.mcd from package MASS). If supplied, this
is used rather than the covariance matrix of x.

fix_sign Should the signs of the loadings and scores be chosen so
that the first element of each loading is non-negative?

... arguments passed to or from other methods. If x is a formula one
might specify cor or scores.

object Object of class inheriting from “princomp”.

newdata An optional data frame or matrix in which to look for
variables with which to predict. If omitted, the scores are used. If
the original fit used a formula or a data frame or a matrix with
column names, newdata must contain columns with the same names.
Otherwise it must contain the same number of columns, to be used in
the same order.

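The above is the help page. I naively passed the covariance matrix in as x without setting any other arguments, and the result did not match the eigenvalues computed directly with eigen, so I started looking for the mistake.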

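The bug was tracked down by reading the function's source, which showed how the arguments are actually used.
First, x is a data frame or matrix of raw data. In the problem above the raw data had already been reduced to a covariance matrix; passing that matrix as x makes princomp treat it as raw data (each row an observation), which is of course wrong.
Second, to base the analysis on the correlation matrix, set cor to TRUE. When covmat is supplied, the source rescales it to a correlation matrix internally, so running cov2cor() on the matrix beforehand gives the same result but is not required. The correlation matrix is the scale-free (unit-diagonal) form of the covariance matrix. To reproduce a covariance-based analysis such as the textbook's, leave cor at its default FALSE.
Third, covmat is the argument that accepts a covariance matrix.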
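The three points can be checked with a short sketch. The matrix S here is computed from the built-in USArrests data purely as a stand-in for the textbook's covariance matrix, and the "wrong" branch reproduces by hand what, by my reading of the source, princomp does when the matrix arrives as x.

```r
# Stand-in for a covariance matrix given in a textbook.
S <- cov(USArrests)

# Reference answer: eigenvalues of S itself.
ev <- eigen(S, symmetric = TRUE)$values

# Correct call: hand the matrix to covmat.
right <- princomp(covmat = S)
ok_cov <- isTRUE(all.equal(unname(right$sdev^2), ev))

# With cor = TRUE the same call works on the correlation matrix.
right_cor <- princomp(covmat = S, cor = TRUE)
ev_cor <- eigen(cov2cor(S), symmetric = TRUE)$values
ok_cor <- isTRUE(all.equal(unname(right_cor$sdev^2), ev_cor))

# What passing S as x leads to: its rows are treated as observations,
# so the matrix that gets decomposed is the covariance of those rows.
n <- nrow(S)
cv_wrong <- cov.wt(S)$cov * (1 - 1 / n)
ev_wrong <- eigen(cv_wrong, symmetric = TRUE)$values
mismatch <- !isTRUE(all.equal(ev_wrong, ev))

print(c(ok_cov = ok_cov, ok_cor = ok_cor, mismatch = mismatch))
```

Only the covmat calls reproduce the eigenvalues from eigen; the matrix built from the rows of S is a different matrix altogether.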

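For how to view the source code of an R function, see the article "查看R源代码的六种方法" (Six ways to view R source code), which is quite a classic.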
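For princomp specifically, a few standard base-R tools are enough to reach the code that does the work. This is a sketch of common approaches and not necessarily the six from that article.

```r
# Printing the generic only shows a UseMethod() dispatch, not the real work.
print(princomp)

# List the S3 methods registered for the generic.
print(methods(princomp))

# The default method is not exported from stats, so fetch it explicitly.
f <- getS3method("princomp", "default")  # or: stats:::princomp.default
info <- getAnywhere("princomp.default")  # also reports where it was found

# Inspect the arguments, then print f itself to read the full body.
arg_names <- names(formals(f))
print(arg_names)
```

Printing f shows the branch that handles covmat and the step that applies cor, which is where the behaviour described above comes from.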

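One last point: the Standard deviation values that princomp reports are the square roots of the eigenvalues of the matrix being decomposed.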
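One detail is worth knowing when verifying this against raw data: the princomp help page notes that its default calculation uses divisor N for the covariance matrix, whereas cov() uses N - 1. A minimal check, again on the stand-in USArrests data:

```r
X <- as.matrix(USArrests)   # stand-in raw data
n <- nrow(X)

fit <- princomp(X)          # raw data goes in as x, cor left at FALSE

# Eigenvalues of the divisor-n covariance matrix (what princomp uses)
# and of the usual divisor-(n - 1) covariance matrix from cov().
ev_n  <- eigen(cov(X) * (n - 1) / n, symmetric = TRUE)$values
ev_n1 <- eigen(cov(X), symmetric = TRUE)$values

# Squared standard deviations compared with each set of eigenvalues.
match_n  <- isTRUE(all.equal(unname(fit$sdev^2), ev_n))
match_n1 <- isTRUE(all.equal(unname(fit$sdev^2), ev_n1))
print(c(divisor_n = match_n, divisor_n_minus_1 = match_n1))
```

When the matrix is passed through covmat instead, no rescaling takes place and the squared standard deviations are the eigenvalues of that matrix as given.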

Go and verify it for yourself. Enjoy!
