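Dimensionality Reduction
Purpose: find the right indicators for analysis while minimizing information loss (many variables may be correlated, but we cannot blindly drop variables).
n points in n-dimensional space can always be analyzed in an (affine) subspace of dimension at most n-1.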
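A quick numerical check of the statement above, as a minimal sketch on made-up random points (the data and the variable names are assumptions for illustration only): after subtracting the centroid, the matrix of n points has rank at most n-1.

```r
set.seed(1)
n <- 5
pts <- matrix(rnorm(n * n), nrow = n)      # n points (rows) in n dimensions
centered <- sweep(pts, 2, colMeans(pts))   # subtract the centroid from every point
rank_centered <- qr(centered)$rank         # dimension of the subspace the points span
rank_centered                              # never larger than n - 1
```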
PCA and FA
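- Shared objectives
- Principal components summarize the correlations among the observed variables, reducing a complex set of correlated variables to a few uncorrelated ones
- Explain the variance-covariance structure of a set of n variables using m components
- Reduce multicollinearity
- Simplify redundant information
- Differences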
PCA
Advantages
- exploit relationships in variables and re-express the multivariate data
- highlight similarities and differences
- without loss of information (as long as all components are retained)
How?
By rotating (re-orienting) the data to reveal possible relationships (e.g. rotating the x and y axes so that they align with the data), then excluding principal components whose eigenvalues are less than the average eigenvalue.
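The average-eigenvalue rule can be sketched as follows. The data here are simulated purely for illustration (x1 to x4 are hypothetical variables, not the ones used later in these notes). For a correlation matrix the average eigenvalue is 1, so the same rule then amounts to keeping eigenvalues greater than 1.

```r
set.seed(42)
# Hypothetical data: 100 observations, 4 variables, two of them strongly related
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.3)
x3 <- rnorm(100)
x4 <- rnorm(100)
dat <- cbind(x1, x2, x3, x4)

ev <- eigen(cov(dat), symmetric = TRUE)$values  # variances along the rotated axes
keep <- ev >= mean(ev)                          # exclude components below the average eigenvalue
ev
which(keep)                                     # indices of the retained components
```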
Example
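- For two-dimensional data, compute the means of x and y separately.
- Subtract the mean from the original data to get the adjusted values ("centering").
- Compute the covariance Cov (a covariance greater than 0 means x and y tend to increase together; less than 0 means one tends to decrease as the other increases. If x and y are statistically independent, their covariance is 0; however, a covariance of 0 does not imply that x and y are independent. For variables on a comparable scale, the larger the absolute covariance, the stronger the linear association, and vice versa.)
- Form the covariance matrix.
- Compute the eigenvalues and eigenvectors of the covariance matrix.
Two dimensions, two eigenvalues.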
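The steps above can be sketched on a small two-dimensional data set. The x and y values are invented for illustration; the result is cross-checked against prcomp, which should give the same component variances.

```r
# Hypothetical 2-D data (values chosen only for illustration)
x <- c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1)
y <- c(2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9)

xc <- x - mean(x)                       # center each variable
yc <- y - mean(y)
S <- cov(cbind(xc, yc))                 # 2 x 2 covariance matrix
e <- eigen(S, symmetric = TRUE)         # eigenvalues and eigenvectors
scores <- cbind(xc, yc) %*% e$vectors   # data re-expressed on the rotated axes

e$values                                # two dimensions, two eigenvalues
prcomp(cbind(x, y))$sdev^2              # same variances from prcomp
```

The rotated scores are uncorrelated, and the two eigenvalues add up to var(x) + var(y), so the rotation itself loses no information.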
when?
When the variables are correlated.
Check the correlation matrix using Bartlett’s sphericity test.
- spherical -> mutually independent
- ellipsoidal -> PCA
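A minimal base-R sketch of Bartlett's sphericity test, using the usual chi-square approximation -(n - 1 - (2p + 5)/6) * log|R| with p(p - 1)/2 degrees of freedom. The function name bartlett_sphericity and the simulated data are assumptions made for this illustration; the psych package also provides cortest.bartlett for the same purpose.

```r
# Bartlett's test of sphericity: H0 is that the correlation matrix is the identity
bartlett_sphericity <- function(dat) {
  dat <- as.matrix(dat)
  n <- nrow(dat)
  p <- ncol(dat)
  R <- cor(dat)
  stat <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  list(statistic = stat, df = df,
       p.value = pchisq(stat, df, lower.tail = FALSE))
}

set.seed(7)
indep <- matrix(rnorm(200 * 4), ncol = 4)                  # independent columns
z <- rnorm(200)
corr <- sapply(1:4, function(j) z + rnorm(200, sd = 0.5))  # columns sharing a common source

res_indep <- bartlett_sphericity(indep)   # "spherical" case
res_corr <- bartlett_sphericity(corr)     # "ellipsoidal" case, PCA is worthwhile
res_indep$p.value
res_corr$p.value
```

A small p-value rejects sphericity, which is the situation where PCA has correlations to exploit.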
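PCA on ordinal data
Treating ordinal variables as continuous ignores their discreteness.
Drawbacks: correlations are attenuated; PCA's distributional assumptions may be violated (?); high skewness and kurtosis ⇒ different asymptotic properties. Remedy: use Spearman's correlation (does not assume a linear relationship).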
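One way to act on the Spearman suggestion is to run the eigen-decomposition on the rank correlation matrix instead of the Pearson one. The sketch below uses simulated 5-point items (item1 to item4 are hypothetical, and the cut points are arbitrary).

```r
set.seed(3)
# Hypothetical 5-point ordinal items driven by one latent trait
latent <- rnorm(150)
items <- sapply(1:4, function(j) {
  cut(latent + rnorm(150, sd = 0.8),
      breaks = c(-Inf, -1, -0.3, 0.3, 1, Inf), labels = FALSE)
})
colnames(items) <- paste0("item", 1:4)

R_s <- cor(items, method = "spearman")     # rank-based correlation matrix
pca_s <- eigen(R_s, symmetric = TRUE)      # PCA on the Spearman correlations
prop_var <- pca_s$values / sum(pca_s$values)
round(prop_var, 2)                         # share of variance per component
```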
FA
- Goal: identify the common factors shared by two or more variables
- The observed variation in each variable is related to underlying common factors (IVs) and to a unique variable (δ). In PCA, there is no underlying measurement model, i.e., no variance is attributed to measurement error.
- Types:
- Exploratory Factor Analysis
Based on the common factor model. No strong prior notions regarding the structure of the factor solution; infer a factor structure from the correlation patterns.
- Confirmatory Factor Analysis
Strong prior notion (have some idea of which variables will load on which factors); no rotational indeterminacy (testing to see if our hypothesis is consistent with the data).
EFA
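- Aims to find the smallest number of common factors that describes the relationships among the variables.
- Assumes no prior structure among the observed variables.
- Only indicates the number of latent variables.
- In PCA each PC is an explicit linear combination of the original variables; in FA the factors are not.
- Repeatedly applies a "fitting function" to produce the reduced set of variables.
- The variance of a variable can be separated into "common variance" and "unique variance".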
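Steps of factor analysis
(1) Select the variables to analyze
(2) Prepare the correlation matrix and estimate the communalities
sigma - the variance explained by the factors
uniqueness - variance not shared with other variables
communality - (1 - uniqueness), variance shared with other variables; a value below 1 reflects that the variable does not perfectly capture the latent common factors (the remainder is residual)
If the communality of Xi approaches 1, Xi is a nearly perfect measure of the common factors; the closer it is to 0, the less perfect Xi is, mostly because of theta.
Assumption: half of the variation (0.5) is in the uniqueness factor.
Compute the eigenvalues and eigenvectors.
Note: in PCA all eigenvalues are non-negative; in FA, SUM(eigenvalues) != number of observed variables, because some variation is accounted for by the uniqueness factor.
Method used: iterated factor extraction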
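Step (2) can be sketched as an iterated principal axis procedure. This is one common reading of "iterated factor extraction", not necessarily the exact method intended here. The function principal_axis, its arguments, and the simulated variables v1 to v6 are all hypothetical. The 0.5 starting communality follows the assumption noted above.

```r
# Iterated principal axis factoring on a correlation matrix R (illustrative sketch)
principal_axis <- function(R, nfactors, h2_start = 0.5, max_iter = 200, tol = 1e-6) {
  h2 <- rep(h2_start, ncol(R))            # starting communalities (half unique)
  for (i in seq_len(max_iter)) {
    R_reduced <- R
    diag(R_reduced) <- h2                 # communalities replace the 1s on the diagonal
    e <- eigen(R_reduced, symmetric = TRUE)
    lam <- pmax(e$values[seq_len(nfactors)], 0)
    L <- e$vectors[, seq_len(nfactors), drop = FALSE] %*% diag(sqrt(lam), nfactors)
    h2_new <- rowSums(L^2)                # updated communalities
    done <- max(abs(h2_new - h2)) < tol
    h2 <- h2_new
    if (done) break
  }
  list(loadings = L, communality = h2, uniqueness = 1 - h2,
       eigenvalues = e$values)
}

set.seed(11)
n <- 500
f1 <- rnorm(n); f2 <- rnorm(n)            # two hypothetical latent factors
dat <- cbind(v1 = 0.8 * f1 + rnorm(n, sd = 0.6),
             v2 = 0.7 * f1 + rnorm(n, sd = 0.7),
             v3 = 0.8 * f1 + rnorm(n, sd = 0.6),
             v4 = 0.8 * f2 + rnorm(n, sd = 0.6),
             v5 = 0.7 * f2 + rnorm(n, sd = 0.7),
             v6 = 0.8 * f2 + rnorm(n, sd = 0.6))
fit <- principal_axis(cor(dat), nfactors = 2)
round(fit$communality, 2)
sum(fit$eigenvalues)   # trace of the reduced matrix, below the number of variables (6)
```

Because the diagonal of the reduced matrix holds communalities rather than 1s, its eigenvalues sum to less than the number of variables, and some of them can be negative.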
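(3) Decide the number of factors
Rules for deciding the number of factors:
1. Retain common factors whose eigenvalue λ is greater than 1.
2. Retain common factors whose eigenvalue is greater than 0.
3. Once the extracted factors explain 75% of the variance, do not retain further factors that each explain less than 5% of the variance.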
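Rules 1 and 3 can be sketched on the eigenvalues of a correlation matrix. The data are simulated with two latent factors purely for illustration, and applying the rules to the unreduced correlation matrix is an assumption of this sketch.

```r
set.seed(5)
n <- 300
f1 <- rnorm(n); f2 <- rnorm(n)            # two hypothetical latent factors
dat <- cbind(0.8 * f1 + rnorm(n, sd = 0.6), 0.8 * f1 + rnorm(n, sd = 0.6),
             0.8 * f1 + rnorm(n, sd = 0.6), 0.8 * f2 + rnorm(n, sd = 0.6),
             0.8 * f2 + rnorm(n, sd = 0.6), 0.8 * f2 + rnorm(n, sd = 0.6))

ev <- eigen(cor(dat), symmetric = TRUE)$values
kaiser <- sum(ev > 1)                     # rule 1: eigenvalue greater than 1

prop <- ev / sum(ev)                      # variance explained by each factor
cum_prop <- cumsum(prop)
# rule 3: stop once 75% is explained and the next factor adds less than 5%
k <- 1
while (k < length(ev) && !(cum_prop[k] >= 0.75 && prop[k + 1] < 0.05)) k <- k + 1

kaiser
k
```

The rules do not have to agree; when they differ, the scree plot and interpretability usually decide.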
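(4) Extract the common factors from the correlation matrix
Factor 1: para, sent, word → Verbal
Factor 2: addition, counting dots → Quantitative
Factor extraction methods (the most commonly used):
1. Principal components method: extracts the factors with maximum variance from the orthogonal components.
2. Principal axis method: factors are extracted in order of their contribution to the communalities of the variables, largest first.
3. Maximum likelihood method: does not require estimating the communalities first; instead the number of common factors is assumed in advance, and the factors and communalities are derived from that assumption.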
(5) Rotate the factors to make the relationships between variables and factors easier to interpret
- Varimax, orthogonal: forces factors to be uncorrelated
- Promax, oblique: allows factors to be correlated
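The difference between the two rotations can be seen with the base R functions varimax and promax. The sketch below fits an unrotated two-factor model with factanal on simulated data (hypothetical variables with deliberately correlated factors) and then rotates the loadings both ways.

```r
set.seed(9)
n <- 500
f1 <- rnorm(n)
f2 <- 0.5 * f1 + sqrt(0.75) * rnorm(n)    # factors correlated at about 0.5
dat <- cbind(v1 = 0.8 * f1 + rnorm(n, sd = 0.6),
             v2 = 0.7 * f1 + rnorm(n, sd = 0.7),
             v3 = 0.8 * f1 + rnorm(n, sd = 0.6),
             v4 = 0.8 * f2 + rnorm(n, sd = 0.6),
             v5 = 0.7 * f2 + rnorm(n, sd = 0.7),
             v6 = 0.8 * f2 + rnorm(n, sd = 0.6))

fit <- factanal(dat, factors = 2, rotation = "none")
L <- unclass(loadings(fit))               # unrotated loading matrix

vm <- varimax(L)                          # orthogonal rotation
pm <- promax(L)                           # oblique rotation

round(crossprod(vm$rotmat), 3)            # identity: factors stay uncorrelated
phi <- solve(crossprod(pm$rotmat))        # implied factor correlation matrix
round(phi, 2)                             # off-diagonal shows the factor correlation
```

Varimax leaves each variable's communality unchanged because the rotation matrix is orthogonal; promax gives that up in exchange for simpler loadings on correlated factors.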
(6) Interpret the results
Code:
# check correlations (Spearman; the first column is excluded)
myCors <- cor(PCA_data[,-1], method="spearman")
summary(PCA_data)
# scatterplot matrix of columns 2 to 8
plot(PCA_data[,2:8])
############################################## PCA
# center the data
## transpose so that rows are variables and columns are observations
Xoriginal = t(as.matrix(PCA_data[,-1]))
## one mean per variable (named row_means to avoid masking base::rm)
row_means = rowMeans(Xoriginal)
X = Xoriginal - matrix(rep(row_means, dim(Xoriginal)[2]), nrow = dim(Xoriginal)[1])
# cross-product (scatter) matrix; dividing by (n - 1) gives the covariance matrix
A = X %*% t(X)
# eigenvalues and eigenvectors
E = eigen(A, symmetric = TRUE)
# rows of P are the eigenvectors
P = t(E$vectors)
# project the centered data onto the eigenvectors
newdata = P %*% X
# standard deviation of each component
sdev = sqrt(diag((P %*% A %*% t(P))/(dim(X)[2]-1)))
sdev
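# PCA with prcomp (variables are standardized here, unlike the manual version above)
pr = prcomp(PCA_data[,-1], scale. = TRUE)
print(pr)
summary(pr)
plot(pr, main = "Scree Plot")
# plots
library(ggfortify)
autoplot(pr, label=TRUE)
# scores on the first principal component
B <- predict(pr)[,1]
round(B,2)
biplot(pr,col=c("blue","red"),xlim=c(-.4,.6),cex=c(.8,1.25))
# correlation between drive and the first component
cor(PCA_data$drive, pr$x[,1])
# omit components whose standard deviation is at most 0.1 times that of the first
pr2 = prcomp(PCA_data[,-1], tol = .1)
plot.ts(pr2$x)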
############################################## FA
library(psych)
library(GPArotation)
cor_data <- cor(PCA_data[,-1])
# parallel analysis; fa="both" shows results for both principal components and factor analysis
fa.parallel(cor_data, n.obs=36, fa="both", n.iter=100, show.legend = FALSE)
## principal axis factoring (fm = "pa"); these models are kept for the diagrams
fa_model <- fa(cor_data, nfactors = 2, rotate = "none", fm = "pa")
fa_model2 <- fa(cor_data, nfactors = 2, rotate = "varimax", fm = "pa")
fa_model3 <- fa(cor_data, nfactors = 2, rotate = "promax", fm = "pa")
## same rotations fitted by maximum likelihood with factanal; this simpler output is the one used
## (factanal rescales covmat to a correlation matrix, so the unscaled A is acceptable)
factanal(covmat=A,factors=2,rotation="none")
factanal(covmat=A,factors=2,rotation="varimax")
factanal(covmat=A,factors=2,rotation="promax")
# draw factor diagrams
fa.diagram(fa_model, digits = 2)
fa.diagram(fa_model2, digits = 2)
fa.diagram(fa_model3, digits = 2)
################################################# predict
# predict with PCA scores
mydata_pcr_pre <- predict(pr)
# use the first two component scores as independent variables
PCA_data$a1 <- mydata_pcr_pre[,1]
PCA_data$a2 <- mydata_pcr_pre[,2]
# build model
mydata_newols <- lm(drive~a1+a2, data=PCA_data)
summary(mydata_newols)
# predict with FA: factor scores approximated as weighted sums
# (the weights are presumably the rotated loadings read off the FA output)
PCA_data$b1 = 0.86*PCA_data$bike+0.83*PCA_data$ticket+0.76*PCA_data$truck
PCA_data$b2 = 1.03*PCA_data$leave+0.62*PCA_data$arrive-0.55*PCA_data$skill-0.43*PCA_data$traffic
mydata_newols2 <- lm(drive~b1+b2, data=PCA_data)
summary(mydata_newols2)