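The simplest functions for principal component analysis (PCA) are prcomp and princomp. Both ship with base R (in the stats package), so no extra packages are needed.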
A good introduction: http://strata.uga.edu/software/pdf/pcaTutorial.pdf
Another good introduction: http://gastonsanchez.wordpress.com/2012/06/17/principal-components-analysis-in-r-part-1/
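The result of a PCA contains the set of eigenvalues, the table of PC scores, and the table of loadings (the correlations between the variables and the PCs).
The eigenvalues carry information about the variability in the data, the scores describe the structure of the observations, and the loadings table gives a rough picture of how the variables relate to each other and to the PCs.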
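As a minimal sketch of where these three pieces live in a prcomp fit, the snippet below uses the built-in USArrests data. The variable names (eigenvalues, scores, loadings, var_pc_cor) are illustrative only. Note that prcomp stores the eigenvectors in rotation; the correlations between variables and PCs are computed separately here with cor.

```r
# Fit a PCA on the standardized USArrests data
pca <- prcomp(USArrests, scale. = TRUE)

eigenvalues <- pca$sdev^2    # variance carried by each PC
scores      <- pca$x         # one row of PC scores per observation (state)
loadings    <- pca$rotation  # columns are the eigenvectors

# Correlations between the original variables and the PCs
var_pc_cor <- cor(USArrests, scores)

print(round(eigenvalues, 3))
print(head(scores, 3))
print(round(var_pc_cor, 2))
```

For standardized data, each correlation equals the corresponding eigenvector entry multiplied by that PC's standard deviation, which is why the two tables are often used interchangeably.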
prcomp : Performs a principal components analysis on the given data matrix and returns the results as an object of class prcomp.
princomp : Performs a principal components analysis on the given numeric data matrix and returns the results as an object of class princomp.
The examples below use the built-in USArrests dataset.
> str(USArrests)
'data.frame': 50 obs. of 4 variables:
$ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
$ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
$ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
$ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
prcomp(x, ...)
prcomp(formula, data = NULL, subset, na.action, ...)
prcomp(x, retx = TRUE, center = TRUE, scale. = FALSE, tol = NULL, ...)
prcomp(USArrests) # inappropriate: the variables are not scaled
prcomp(USArrests, scale = TRUE) # data matrix interface
prcomp(~ Murder + Assault + Rape, data = USArrests, scale = TRUE) # formula interface
plot(prcomp(USArrests))
summary(prcomp(USArrests, scale = TRUE))
biplot(prcomp(USArrests, scale = TRUE))
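To see why the unscaled call is flagged as inappropriate, the following sketch compares the proportion of variance per PC with and without scaling. The helper prop_var is a hypothetical name introduced only for this illustration.

```r
unscaled <- prcomp(USArrests)
scaled   <- prcomp(USArrests, scale. = TRUE)

# Hypothetical helper: share of total variance carried by each PC
prop_var <- function(fit) fit$sdev^2 / sum(fit$sdev^2)

# The raw variances differ by orders of magnitude
print(sapply(USArrests, var))

# Without scaling, PC1 is dominated by the variable with the largest variance
print(round(rbind(unscaled = prop_var(unscaled),
                  scaled   = prop_var(scaled)), 3))
print(round(unscaled$rotation[, 1], 3))
```

In the unscaled fit the first eigenvector should be almost entirely Assault, which reflects measurement units rather than structure in the data.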
princomp :
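princomp(x, ...) # same as prcomp
princomp(formula, data = NULL, subset, na.action, ...) # also the same as prcomp
princomp(x, cor = FALSE, scores = TRUE, covmat = NULL, subset = rep(TRUE,nrow(as.matrix(x))), ...) # different arguments
princomp(USArrests, cor = TRUE) # =^= prcomp(USArrests, scale=TRUE): similar but not identical. princomp uses divisor N rather than N - 1, so with cor = FALSE the standard deviations differ by a factor of sqrt(49/50); with cor = TRUE they agree and the scores differ by sqrt(50/49) instead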
summary(pc.cr <- princomp(USArrests, cor = TRUE))
loadings(pc.cr) # a matrix whose columns are the eigenvectors; corresponds to rotation in prcomp
plot(pc.cr) # shows a screeplot.
biplot(pc.cr)
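The effect of the two divisors (N in princomp, N - 1 in prcomp) can be checked numerically. This is a sketch on USArrests; the object names are illustrative. It looks at the standard deviations for the covariance-based fit and at both the standard deviations and the scores for the correlation-based fit.

```r
n <- nrow(USArrests)  # 50 observations

# Covariance-based PCA: the divisor shows up directly in sdev
sd_ratio_cov <- unname(princomp(USArrests)$sdev) / prcomp(USArrests)$sdev
print(sd_ratio_cov)
print(sqrt((n - 1) / n))

# Correlation-based PCA: compare sdev and the spread of the scores
pc_cr <- princomp(USArrests, cor = TRUE)
pc_sc <- prcomp(USArrests, scale. = TRUE)
sd_ratio_cor <- unname(pc_cr$sdev) / pc_sc$sdev
score_ratio  <- unname(apply(pc_cr$scores, 2, sd) / apply(pc_sc$x, 2, sd))
print(sd_ratio_cor)
print(score_ratio)
print(sqrt(n / (n - 1)))
```

Comparing the spread of the scores, rather than the scores themselves, sidesteps the fact that the sign of each component is arbitrary.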
prcomp returns:
sdev (standard deviations): the standard deviations of the principal components (i.e., the square roots of the eigenvalues of the covariance/correlation matrix, though the calculation is actually done with the singular values of the data matrix).
rotation (matrix of eigenvectors): the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors). The function princomp returns this in the element loadings.
x: if retx is TRUE, the rotated data, i.e. the centred (and scaled if requested) data multiplied by the rotation matrix. Hence cov(x) is the diagonal matrix diag(sdev^2). For the formula method, napredict() is applied to handle the treatment of values omitted by the na.action.
center, scale: the centering and scaling used, or FALSE.
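The relationships among x, rotation, center, scale and sdev can be verified directly. This is a small check on USArrests rather than part of the original notes; z, rotated and cov_x are illustrative names.

```r
pca <- prcomp(USArrests, scale. = TRUE)

# Re-create the centred and scaled data from the stored center and scale
z <- scale(USArrests, center = pca$center, scale = pca$scale)

# x is the standardized data multiplied by the rotation matrix
rotated <- z %*% pca$rotation
print(max(abs(rotated - pca$x)))

# The covariance of the scores is diagonal, with sdev^2 on the diagonal
cov_x <- cov(pca$x)
print(max(abs(cov_x - diag(pca$sdev^2))))
```

Both printed differences should be at the level of floating-point rounding error.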
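PCA is sensitive to the scale of the variables, so the data are usually standardized (mean = 0, variance = 1) beforehand. The data do not need to be normally distributed.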
princomp returns:
sdev (standard deviations): the standard deviations of the principal components.
loadings (matrix of eigenvectors): the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors). This is of class "loadings": see loadings for its print method.
center: the means that were subtracted.
scale: the scalings applied to each variable.
n.obs: the number of observations.
scores: if scores = TRUE, the scores of the supplied data on the principal components. These are non-null only if x was supplied, and if covmat was also supplied if it was a covariance list. For the formula method, napredict() is applied to handle the treatment of values omitted by the na.action.
call: the matched call.
na.action: if relevant.
The print method for these objects prints the results in a nice format and the plot method produces a screeplot.
Unlike princomp, variances are computed with the usual divisor N - 1.
Note that scale = TRUE cannot be used if there are zero or constant (for center = TRUE) variables.
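The restriction on constant variables can be illustrated with a sketch like the one below. The Constant column is artificial and added only for this example; dropping zero-variance columns before scaling is one possible workaround, not the only one.

```r
# Add an artificial constant column to USArrests (for illustration only)
df <- transform(USArrests, Constant = 1)

# Scaling a constant column to unit variance is impossible, so prcomp stops
res <- tryCatch(prcomp(df, scale. = TRUE),
                error = function(e) conditionMessage(e))
print(res)

# One workaround: drop zero-variance columns first
keep   <- sapply(df, function(col) var(col) > 0)
pca_ok <- prcomp(df[, keep], scale. = TRUE)
print(summary(pca_ok))
```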
princomp is a generic function with "formula" and "default" methods.
The calculation is done using eigen on the correlation or covariance matrix, as determined by cor. This is done for compatibility with the S-PLUS result. A preferred method of calculation is to use svd on x, as is done in prcomp.
Note that the default calculation uses divisor N for the covariance matrix.
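The two computational routes can be reproduced by hand. This sketch mirrors the idea (eigen on the correlation matrix versus svd on the standardized data) and is not a copy of the internal code of either function; the variable names are illustrative.

```r
x <- scale(USArrests)  # centre and scale, using the N - 1 divisor
n <- nrow(x)

# Route 1 (princomp-style): eigen decomposition of the correlation matrix
eig <- eigen(cor(USArrests), symmetric = TRUE)
sdev_eigen <- sqrt(eig$values)

# Route 2 (prcomp-style): singular value decomposition of the data matrix
sv <- svd(x)
sdev_svd <- sv$d / sqrt(n - 1)

print(rbind(eigen = sdev_eigen, svd = sdev_svd))

# The eigenvectors and right singular vectors agree up to sign
print(max(abs(abs(eig$vectors) - abs(sv$v))))
```

The SVD route avoids forming the cross-product matrix explicitly, which is generally the numerically safer choice.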
The print method for these objects prints the results in a nice format and the plot method produces a scree plot (screeplot). There is also a biplot method.
If x is a formula then the standard NA-handling is applied to the scores (if requested): see napredict.
princomp only handles so-called R-mode PCA, that is feature extraction of variables. If a data matrix is supplied (possibly via a formula) it is required that there are at least as many units as variables. For Q-mode PCA use prcomp.
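Multivariate analysis, such as computing correlation coefficients, is usually carried out on the data columns (the features, or Questions), while each row is a sample unit, i.e. a Respondent (R-way analysis).
Sometimes the data columns (Questions) are treated as the sample units instead, which gives Q analysis. The difference perhaps lies mainly in how the data are standardized and how the results are interpreted.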
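The practical consequence of the units-versus-variables restriction can be seen with a small simulated matrix that has fewer rows than columns. The data here are random numbers generated only for this illustration.

```r
set.seed(1)

# 5 units (rows) and 10 variables (columns): more variables than units
wide <- matrix(rnorm(5 * 10), nrow = 5, ncol = 10)

# princomp refuses this shape
princomp_result <- tryCatch(princomp(wide),
                            error = function(e) conditionMessage(e))
print(princomp_result)

# prcomp works via svd and returns at most min(n, p) components
pca_wide <- prcomp(wide)
print(length(pca_wide$sdev))
print(dim(pca_wide$rotation))
```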
Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept    1            134.96790        237.81430      0.57     0.5778                    0
occup        1             -1.28377          0.80469     -1.60     0.1291              2.16276
checkin      1              1.80351          0.51624      3.49     0.0028              4.52397
hours        1              0.66915          1.84640      0.36     0.7215              1.35735
common       1            -21.42263         10.17160     -2.11     0.0504              2.33264
wings        1              5.61923         14.74609      0.38     0.7079              3.65318
cap          1            -14.48025          4.22018     -3.43     0.0032             37.12912
rooms        1             29.32475          6.36590      4.61     0.0003             63.70809
Eigenvalues of the Correlation Matrix

     Eigenvalue   Difference   Proportion   Cumulative
1    4.64302239   3.90281147       0.6633       0.6633
2    0.74021092   0.03390878       0.1057       0.7690
3    0.70630215   0.25669541       0.1009       0.8699
4    0.44960674   0.15020062       0.0642       0.9342
5    0.29940611   0.14798282       0.0428       0.9769
6    0.15142329   0.14139489       0.0216       0.9986
7    0.01002840                    0.0014       1.0000

The variance inflation among the principal components is exactly 1:
Parameter Estimates

Variable    DF   Variance Inflation
Intercept    1                    0
Prin1        1              1.00000
Prin2        1              1.00000
Prin3        1              1.00000
Prin4        1              1.00000
Prin5        1              1.00000
Prin6        1              1.00000
Prin7        1              1.00000
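The data behind the tables above are not available here, so the following sketch reproduces the same effect in R using three USArrests variables as stand-in predictors (an assumption made purely for illustration). The helper vif is a hypothetical name; it relies on the identity that the variance inflation factors are the diagonal of the inverse correlation matrix of the predictors.

```r
# Hypothetical helper: VIFs as the diagonal of the inverse correlation matrix
vif <- function(m) diag(solve(cor(m)))

# Stand-in predictors (assumption: any set of correlated variables would do)
predictors <- USArrests[, c("Assault", "UrbanPop", "Rape")]

# Principal components of the standardized predictors
pcs <- prcomp(predictors, scale. = TRUE)$x

print(round(vif(predictors), 3))  # above 1 because the predictors are correlated
print(round(vif(pcs), 3))         # 1 for every component
```

Because the components are mutually uncorrelated, regressing on them removes the collinearity among predictors, at the cost of coefficients that are harder to interpret.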