18 Principal Component Analysis and Factor Analysis

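Principal component analysis
In statistics, principal component analysis (PCA) is a dimension-reduction technique for simplifying a data set: it reduces the number of dimensions while keeping the directions that contribute most to the variance. Quantitative studies usually want fewer variables that carry more information, which is exactly the problem PCA addresses.
PCA reduces dimension through an orthogonal transformation: the original random vector is transformed into a new random vector whose components are uncorrelated, and the multivariate system is then reduced by keeping only some of those components. In other words, the original variables are recombined into a new set of mutually uncorrelated variables, and a small number of them are kept so that they reflect as much of the original information as possible. Mathematically, among all linear combinations of the original variables with a unit-length coefficient vector, the one with the largest variance, F1, is called the first principal component. If F1 is not enough to represent the original information, a second linear combination F2, uncorrelated with F1, is chosen to capture what remains, and so on.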
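
The construction described above can be checked by hand. The following is a minimal sketch, assuming standardized variables and using the built-in USArrests data purely as a stand-in (it is not the data set used later in this article): the eigenvectors of the correlation matrix give the linear combinations, and the eigenvalues give their variances.

```r
# PCA on standardized variables, done by hand and checked against prcomp().
X <- as.matrix(USArrests)        # built-in example data (50 rows, 4 variables)
Z <- scale(X)                    # center and scale every column
R <- cor(X)                      # correlation matrix of the variables

e <- eigen(R)                    # eigenvectors = coefficients, eigenvalues = variances
scores <- Z %*% e$vectors        # F1, F2, ... as linear combinations of the columns of Z

pca <- prcomp(X, scale. = TRUE)  # reference implementation from the stats package

# The component variances equal the eigenvalues of the correlation matrix.
print(all.equal(pca$sdev^2, e$values))
# The components are mutually uncorrelated: their covariance matrix is diagonal.
print(round(cov(scores), 10))
# Share of the total variance carried by each component.
print(round(e$values / sum(e$values), 3))
```

The signs of individual components are arbitrary, so the columns of scores may differ from pca$x by a factor of -1.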

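Exploratory factor analysis
Exploratory factor analysis (EFA) looks for common factors in order to reduce the dimensionality of the data. It is exploratory in the sense that the factors are not known in advance: they are derived from the sample data. The main steps are:
collect the observed sample data; construct the correlation or covariance matrix; determine the number of factors; extract the factors; rotate the factors; interpret the factors; compute the factor scores.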
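
The steps above can be sketched end to end in base R. This is only an illustration under stated assumptions: the data are simulated from two hypothetical latent factors, the eigenvalue-greater-than-1 rule stands in for the "determine the number of factors" step, and factanal (maximum likelihood, from the stats package) is used instead of the psych functions that appear later in this article.

```r
set.seed(1)

# Step 1: "collect" data. Here: simulate 500 observations of 6 variables
# driven by two hypothetical latent factors (assumed loadings below).
n <- 500
f <- matrix(rnorm(n * 2), n, 2)
Lambda <- rbind(c(0.8, 0), c(0.7, 0), c(0.6, 0),
                c(0, 0.8), c(0, 0.7), c(0, 0.6))
u <- sqrt(1 - rowSums(Lambda^2))              # unique standard deviations
x <- f %*% t(Lambda) + matrix(rnorm(n * 6), n, 6) %*% diag(u)
colnames(x) <- paste0("V", 1:6)

# Step 2: correlation matrix.
R <- cor(x)

# Step 3: number of factors (one simple rule: eigenvalues greater than 1).
n_factors <- sum(eigen(R)$values > 1)

# Steps 4 and 5: extract the factors and rotate them (varimax).
fit <- factanal(x, factors = n_factors, rotation = "varimax",
                scores = "regression")

# Step 6: interpret the factors from the loadings.
print(fit$loadings, cutoff = 0.3)

# Step 7: factor scores, one row per observation.
print(head(fit$scores))
```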

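Parallel analysis
Parallel analysis compares each eigenvalue of the real data with the mean of the corresponding eigenvalue from random data matrices of the same size, and chooses the number of components from the point where the two curves cross.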
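
The comparison can be sketched in a few lines of base R. This is a simplified illustration, not the algorithm used by psych's fa.parallel: the data are simulated with two strong hypothetical components, the random matrices are plain normal noise, and 100 replications are an arbitrary choice.

```r
set.seed(42)

# Simulated data: 6 variables driven by 2 hypothetical latent factors.
n <- 300
p <- 6
f <- matrix(rnorm(n * 2), n, 2)
Lambda <- rbind(c(0.8, 0), c(0.8, 0), c(0.8, 0),
                c(0, 0.8), c(0, 0.8), c(0, 0.8))
x <- f %*% t(Lambda) + 0.6 * matrix(rnorm(n * p), n, p)

# Eigenvalues of the observed correlation matrix.
obs <- eigen(cor(x), only.values = TRUE)$values

# Mean eigenvalues of random data matrices with the same dimensions.
n_iter <- 100
rand <- replicate(n_iter,
                  eigen(cor(matrix(rnorm(n * p), n, p)),
                        only.values = TRUE)$values)
rand_mean <- rowMeans(rand)

# Keep the leading components whose eigenvalue exceeds the random average.
keep <- obs > rand_mean
n_keep <- if (all(keep)) p else which(!keep)[1] - 1

print(round(rbind(observed = obs, random = rand_mean), 2))
print(n_keep)
```
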
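Scree plot
A scree plot shows the eigenvalues associated with the components or factors, in descending order, against the component or factor number. It is used in PCA and factor analysis to judge visually which components or factors account for most of the variability in the data.
The ideal pattern in a scree plot is a steep curve, followed by a bend, followed by a flat or horizontal line. Retain the components or factors on the steep part of the curve, before the first point where the flat trend begins. In practice a scree plot can be hard to interpret; use your knowledge of the data and the results of other selection methods to help decide how many components or factors matter.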

[Figure: analysis steps for choosing a factor model]

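Data preparation
The example explores the structure of a city's industry. The data cover 13 industries of the industrial sector of a certain city and 8 indicators. The 13 industries are metallurgy, electric power, coal, chemicals, machinery, building materials, forestry, food, textiles, garments, leather, papermaking, and cultural, educational and art supplies. The 8 indicators are year-end net value of fixed assets X1, number of employees X2, gross industrial output value X3, overall labor productivity X4, output value per 100 yuan of original fixed-asset value X5, profit and tax rate on funds X6, standard fuel consumption X7 and energy utilization efficiency X8.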

> options(stringsAsFactors=F)
> test <- readLines("http://labfile.oss.aliyuncs.com/courses/931/test.csv")
> test <- unlist(strsplit(test, split=","))
> test <- matrix(test, ncol=8, byrow=T)
> colnames(test) <- test[1,]
> test <- as.data.frame(test[-1,])
> test <- as.data.frame(sapply(test, as.numeric))
> head(test)
      X1     X2     X3    X4   X5   X6     X7    X8
1  90342  52455 101091 19272 82.0 16.1 197435 0.172
2   4903   1973   2035 10313 34.2  7.1 592077 0.003
3   6735  21139   3767  1780 36.1  8.2 726396 0.003
4  49454  36241  81557 22504 98.1 25.9 348226 0.985
5 139190 203505 215898 10609 93.2 12.6 139572 0.628
6  12215  16219  10351  6382 62.5  8.7 145818 0.066
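
If test.csv is an ordinary comma-separated file with a header row (an assumption; the transcript above only implies it), read.csv can do the splitting, header handling and numeric conversion in one call. The sketch below uses the first three rows printed above as inline text so that it runs without network access; the name csv_text is purely illustrative.

```r
# Inline stand-in for the remote file: header plus the first three rows shown above.
csv_text <- "X1,X2,X3,X4,X5,X6,X7,X8
90342,52455,101091,19272,82.0,16.1,197435,0.172
4903,1973,2035,10313,34.2,7.1,592077,0.003
6735,21139,3767,1780,36.1,8.2,726396,0.003"

# read.csv parses the header and converts every column to a numeric type.
demo <- read.csv(text = csv_text)
str(demo)
```

With a file or URL, the same call takes the path as its first argument instead of text =.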

Principal component analysis
The main function for principal component analysis is princomp.

> test.pr <- princomp(test, cor=T)
> summary(test.pr, loading=T)
Importance of components:
                          Comp.1    Comp.2    Comp.3     Comp.4
Standard deviation     1.7619819 1.7017737 0.9640911 0.80175884
Proportion of Variance 0.3880725 0.3620042 0.1161839 0.08035216
Cumulative Proportion  0.3880725 0.7500767 0.8662607 0.94661285
                           Comp.5     Comp.6      Comp.7       Comp.8
Standard deviation     0.55549344 0.28777809 0.182388563 0.0494213026
Proportion of Variance 0.03857162 0.01035203 0.004158198 0.0003053081
Cumulative Proportion  0.98518446 0.99553649 0.999694692 1.0000000000

Loadings:
   Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
X1  0.478  0.294  0.104         0.178         0.761  0.243
X2  0.474  0.276  0.164 -0.175 -0.300        -0.519  0.528
X3  0.426  0.376  0.156                      -0.177 -0.780
X4 -0.210  0.452         0.519  0.537 -0.289 -0.248  0.221
X5 -0.387  0.332  0.321 -0.202 -0.454 -0.583  0.222       
X6 -0.351  0.405  0.147  0.278 -0.314  0.714              
X7  0.213 -0.379  0.140  0.756 -0.424 -0.192              
X8         0.273 -0.891        -0.325 -0.119

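The summary function shows the main results of the PCA. The Standard deviation row gives the standard deviation of each component, Proportion of Variance gives each component's contribution to the total variance, and Cumulative Proportion gives the cumulative contribution. Since the first three components already account for more than 85% of the variance (86.6%), the remaining ones can be dropped, which achieves the dimension reduction.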
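
The two proportion rows can be reproduced from the standard deviations alone. The sketch below copies the eight standard deviations from the output above and applies the 85% threshold used in the text; the threshold is a convention, not a fixed rule.

```r
# Standard deviations of the eight components, copied from the summary output.
sdev <- c(1.7619819, 1.7017737, 0.9640911, 0.80175884,
          0.55549344, 0.28777809, 0.182388563, 0.0494213026)

var_comp <- sdev^2                  # variance of each component
prop <- var_comp / sum(var_comp)    # Proportion of Variance
cum <- cumsum(prop)                 # Cumulative Proportion

print(round(rbind(prop, cum), 4))

# Smallest number of components whose cumulative contribution reaches 85%.
k <- which(cum >= 0.85)[1]
print(k)   # 3
```

Because the analysis is based on the correlation matrix of 8 variables, the variances sum to 8 (up to rounding of the printed values).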

> screeplot(test.pr, type="lines")

[Figure: scree plot of test.pr]

> p <- predict(test.pr)
> p
          Comp.1      Comp.2      Comp.3      Comp.4        Comp.5        Comp.6        Comp.7       Comp.8
 [1,]  1.5383166  0.78329878  0.55914834  0.51447981  1.0939484872 -0.0187808278  4.214247e-01  0.004341591
 [2,]  0.5058551 -2.69970060  0.23469505  0.88712153  0.1600083295 -0.3019824994 -1.327024e-01  0.070443862
 [3,]  1.0828155 -3.36157747  0.42584055  0.60061666 -0.9731163100  0.0678706653  8.022067e-02 -0.025708566
 [4,]  0.4824792  1.23193366 -1.03794614  1.66277167 -0.0004448157  0.0749140127 -4.020261e-03 -0.053876926
 [5,]  4.7220539  2.33445776  0.49001510 -0.79294924 -0.5148616738  0.0219852173 -1.291664e-01  0.023661090
 [6,]  0.3335154 -1.84639307  0.03320975 -0.97490048  0.3886327205  0.2126692286 -2.315539e-02 -0.069531318
 [7,] -1.1528228 -0.32238541  0.29771274 -0.72051586  0.0979704754  0.3091870079 -6.784565e-05 -0.036518510
 [8,] -2.2807730  2.35196023  1.15228786  0.57465434 -0.6021969592 -0.0004185042 -4.292176e-02 -0.054086746
 [9,] -0.8366965  0.90656114  0.33778417  0.15505987  0.5876067643 -0.4389057657 -3.212697e-01 -0.001972479
[10,] -2.1176434  0.87407151  0.24834391 -0.54171795 -0.6801139694 -0.1965720948  2.853128e-01  0.075335379
[11,] -0.7496483 -0.78017371 -0.12474862 -1.15720353  0.2431065198 -0.4037748082  1.583684e-02 -0.030013111
[12,] -1.2531170  0.04010349  0.30197644  0.08694461  0.3886764718  0.6626465187 -1.633114e-01  0.081970107
[13,] -0.2743347  0.48784369 -2.91831915 -0.29436143 -0.1892160403  0.0111618497  1.382012e-02  0.015955626

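The principal function from the psych package performs PCA from either the raw data matrix or a correlation matrix.
The number of components is judged mainly with the fa.parallel function from the psych package, whose plot lets you compare three eigenvalue criteria: the scree test of the eigenvalues, the mean eigenvalues computed from random data matrices, and the eigenvalue-greater-than-1 rule.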

> library(psych)
> fa.parallel(test, fa="pc", n.iter=100, show.legend=F)
Parallel analysis suggests that the number of factors =  NA  and the number of components =  2 
Warning message:
In fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs,  :
  The estimated weights for the factor scores are probably incorrect.  Try a different factor score estimation method.

[Figure: parallel analysis scree plot for test]

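The three criteria suggest two components. The criteria do not always agree, however, so the number of components has to be chosen in light of the actual problem.
Extracting the principal components
The principal function extracts the components. Following the three criteria, we first extract two components.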

> pc <- principal(test, nfactors=2, rotate="none")
> pc
Principal Components Analysis
Call: principal(r = test, nfactors = 2, rotate = "none")
Standardized loadings (pattern matrix) based upon correlation matrix
     PC1   PC2   h2    u2 com
X1  0.84  0.50 0.96 0.041 1.6
X2  0.84  0.47 0.92 0.082 1.6
X3  0.75  0.64 0.97 0.028 2.0
X4 -0.37  0.77 0.73 0.271 1.4
X5 -0.68  0.57 0.78 0.216 1.9
X6 -0.62  0.69 0.86 0.143 2.0
X7  0.38 -0.64 0.56 0.444 1.6
X8  0.10  0.46 0.23 0.775 1.1

                       PC1  PC2
SS loadings           3.10 2.90
Proportion Var        0.39 0.36
Cumulative Var        0.39 0.75
Proportion Explained  0.52 0.48
Cumulative Proportion 0.52 1.00

Mean item complexity =  1.7
Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is  0.09 
 with the empirical chi square  5.64  with prob <  0.96 

Fit based upon off diagonal values = 0.96

The output shows a cumulative variance contribution of only 75%, which is not satisfactory. Next we try three components.

> pc <- principal(test, nfactors=3, rotate="none")
> pc
Principal Components Analysis
Call: principal(r = test, nfactors = 3, rotate = "none")
Standardized loadings (pattern matrix) based upon correlation matrix
     PC1   PC2   PC3   h2     u2 com
X1  0.84  0.50  0.10 0.97 0.0307 1.7
X2  0.84  0.47  0.16 0.94 0.0575 1.7
X3  0.75  0.64  0.15 0.99 0.0056 2.0
X4 -0.37  0.77 -0.01 0.73 0.2713 1.4
X5 -0.68  0.57  0.31 0.88 0.1195 2.4
X6 -0.62  0.69  0.14 0.88 0.1224 2.1
X7  0.38 -0.64  0.13 0.57 0.4258 1.7
X8  0.10  0.46 -0.86 0.96 0.0370 1.6

                       PC1  PC2  PC3
SS loadings           3.10 2.90 0.93
Proportion Var        0.39 0.36 0.12
Cumulative Var        0.39 0.75 0.87
Proportion Explained  0.45 0.42 0.13
Cumulative Proportion 0.45 0.87 1.00

Mean item complexity =  1.8
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is  0.06 
 with the empirical chi square  2.95  with prob <  0.89 

Fit based upon off diagonal values = 0.98

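The cumulative variance contribution now reaches 87%, which is better. The PC columns contain the component loadings, i.e. the correlations between each variable and each component, and are used to interpret the components. h2 is the communality, the proportion of each variable's variance explained by the retained components; u2 is the proportion left unexplained (1 - h2). SS loadings is the sum of squared loadings of each component, that is, the eigenvalue associated with it (the amount of variance of the standardized variables it accounts for).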
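
These quantities follow directly from the loading matrix. The sketch below retypes the three-component loadings printed above (rounded to two decimals, so the results agree with the printed h2, u2 and SS loadings only up to rounding).

```r
# Unrotated loadings for PC1..PC3, copied from the principal() output above.
L <- matrix(c( 0.84,  0.50,  0.10,
               0.84,  0.47,  0.16,
               0.75,  0.64,  0.15,
              -0.37,  0.77, -0.01,
              -0.68,  0.57,  0.31,
              -0.62,  0.69,  0.14,
               0.38, -0.64,  0.13,
               0.10,  0.46, -0.86),
            ncol = 3, byrow = TRUE,
            dimnames = list(paste0("X", 1:8), paste0("PC", 1:3)))

h2 <- rowSums(L^2)          # communality of each variable
u2 <- 1 - h2                # unexplained share of each variable
ss <- colSums(L^2)          # SS loadings of each component
prop_var <- ss / nrow(L)    # Proportion Var (8 standardized variables)

print(round(cbind(h2, u2), 2))
print(round(rbind(ss, prop_var, cum_var = cumsum(prop_var)), 2))
```
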
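Rotating the principal components
Rotation is a way of making the component loadings easier to interpret. It comes in two kinds: orthogonal rotation, which keeps the selected components uncorrelated, and oblique rotation, which allows them to become correlated. Here we use an orthogonal rotation, varimax.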
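
What an orthogonal rotation does and does not change can be seen with varimax from the stats package. This is a sketch on the rounded three-component loadings printed earlier; the rotated values may differ from psych's output in sign, column order and the last digit, so they are not compared with it here.

```r
# Unrotated loadings for PC1..PC3, copied (rounded) from the principal() output.
L <- matrix(c( 0.84,  0.50,  0.10,
               0.84,  0.47,  0.16,
               0.75,  0.64,  0.15,
              -0.37,  0.77, -0.01,
              -0.68,  0.57,  0.31,
              -0.62,  0.69,  0.14,
               0.38, -0.64,  0.13,
               0.10,  0.46, -0.86),
            ncol = 3, byrow = TRUE,
            dimnames = list(paste0("X", 1:8), paste0("PC", 1:3)))

rot <- varimax(L)               # stats::varimax, Kaiser-normalized by default
RL <- unclass(rot$loadings)     # rotated loadings as a plain matrix

# The rotation matrix is orthogonal, so the rotated components stay uncorrelated.
print(round(crossprod(rot$rotmat), 10))

# Communalities are unchanged; only the split across components changes.
print(round(cbind(before = rowSums(L^2), after = rowSums(RL^2)), 4))

print(round(RL, 2))
```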

> rc <- principal(test, nfactors=3, rotate="varimax")
> rc
Principal Components Analysis
Call: principal(r = test, nfactors = 3, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
     RC1   RC2   RC3   h2     u2 com
X1  0.98 -0.08  0.11 0.97 0.0307 1.0
X2  0.97 -0.09  0.04 0.94 0.0575 1.0
X3  0.99  0.09  0.09 0.99 0.0056 1.0
X4  0.12  0.82  0.21 0.73 0.2713 1.2
X5 -0.17  0.90 -0.18 0.88 0.1195 1.2
X6 -0.09  0.93  0.02 0.88 0.1224 1.0
X7 -0.02 -0.70 -0.29 0.57 0.4258 1.3
X8  0.14  0.14  0.96 0.96 0.0370 1.1

                       RC1  RC2  RC3
SS loadings           2.93 2.89 1.10
Proportion Var        0.37 0.36 0.14
Cumulative Var        0.37 0.73 0.87
Proportion Explained  0.42 0.42 0.16
Cumulative Proportion 0.42 0.84 1.00

Mean item complexity =  1.1
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is  0.06 
 with the empirical chi square  2.95  with prob <  0.89 

Fit based upon off diagonal values = 0.98

Principal component scores
Component scores are computed from the raw data.

> pc <- principal(test, nfactors=3, rotate="varimax", score=T)
> score <- pc$scores
> score
              RC1         RC2         RC3
 [1,]  1.04364745 -0.03678306 -0.34512037
 [2,] -0.55875261 -1.31380572 -0.64465715
 [3,] -0.46663927 -1.75504604 -0.91245535
 [4,]  0.35978428  0.19045559  1.20712572
 [5,]  2.90692027 -0.38189511  0.09594469
 [6,] -0.41617365 -0.91925769 -0.32025745
 [7,] -0.53962165  0.28399672 -0.38006970
 [8,] -0.01787588  1.99714514 -0.79973098
 [9,] -0.01209566  0.73557256 -0.20543438
[10,] -0.60535196  1.11405463 -0.17474919
[11,] -0.59851283 -0.13012917 -0.03742436
[12,] -0.47080587  0.47771113 -0.32879003
[13,] -0.62452261 -0.26201898  2.84561855

Obtaining the component score coefficients

> test.cov <- cov(test)
> rc <- principal(test.cov, nfactors=3, rotate="varimax")
> round(unclass(rc$weights), 3)
      RC1    RC2    RC3
X1  0.338 -0.003 -0.034
X2  0.344  0.002 -0.096
X3  0.352  0.063 -0.074
X4  0.047  0.277  0.079
X5  0.005  0.347 -0.277
X6  0.004  0.334 -0.091
X7  0.008 -0.218 -0.194
X8 -0.095 -0.073  0.931

This gives the following scoring equations for the three rotated components, where X1 to X8 denote the standardized variables:
RC1 = 0.338 * X1 + 0.344 * X2 + 0.352 * X3 + 0.047 * X4 + 0.005 * X5 + 0.004 * X6 + 0.008 * X7 - 0.095 * X8
RC2 = -0.003 * X1 + 0.002 * X2 + 0.063 * X3 + 0.277 * X4 + 0.347 * X5 + 0.334 * X6 - 0.218 * X7 - 0.073 * X8
RC3 = -0.034 * X1 - 0.096 * X2 - 0.074 * X3 + 0.079 * X4 - 0.277 * X5 - 0.091 * X6 - 0.194 * X7 + 0.931 * X8
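
One common way to obtain such coefficients is the regression formula W = R^{-1} L, where R is the correlation matrix and L the (rotated) loading matrix; whether psych computes its weights in exactly this way is an assumption here. The sketch below uses base R and the built-in USArrests data as a stand-in, and shows that the resulting scores are standardized and uncorrelated.

```r
X <- as.matrix(USArrests)       # stand-in data, not the article's data set
Z <- scale(X)                   # standardized variables
R <- cor(X)

e <- eigen(R)
k <- 2                          # number of components kept (arbitrary here)

# Unrotated loadings: eigenvectors scaled by the square roots of the eigenvalues.
L <- e$vectors[, 1:k] %*% diag(sqrt(e$values[1:k]))
rownames(L) <- colnames(X)

RL <- unclass(varimax(L)$loadings)   # orthogonally rotated loadings

W <- solve(R, RL)               # score coefficients, W = R^{-1} %*% RL
scores <- Z %*% W               # one row of component scores per observation

print(round(W, 3))
print(round(cov(scores), 10))   # identity matrix: unit variance, no correlation
```

Each column of W plays the role of one scoring equation: the score is the weighted sum of the standardized variables.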

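Exploratory factor analysis
EFA and PCA differ as follows. In PCA the components are linear combinations of the observed variables, chosen to maximize their variance subject to being mutually uncorrelated. In EFA the factors are latent variables that influence the observed variables; the part of a variable that the factors cannot explain is called the error, and neither the factors nor the errors can be observed directly. Although EFA and PCA are fundamentally different, their workflows are similar.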

> library(psych)
> fa.parallel(test, fa="both", n.iter=30)
Parallel analysis suggests that the number of factors =  2  and the number of components =  2 
There were 28 warnings (use warnings() to see them)

[Figure: parallel analysis scree plots (components and factors) for test]

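Setting fa = "both" displays the results for both principal components and factor analysis. The plot and the printed message suggest two factors (and two components).
The fa function from the psych package extracts the factors. The nfactors argument sets the number of factors to 2, rotate = "none" requests no rotation, and fm selects the extraction method. Because maximum likelihood sometimes fails to converge, the iterated principal axis method (fm = "pa") is used here.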

> fa <- fa(test, nfactors=2, rotate="none", fm="pa")
Warning messages:
1: In fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs,  :
  The estimated weights for the factor scores are probably incorrect.  Try a different factor score estimation method.
2: In fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,  :
  An ultra-Heywood case was detected.  Examine the results carefully
> fa
Factor Analysis using method =  pa
Call: fa(r = test, nfactors = 2, rotate = "none", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
     PA1   PA2   h2     u2 com
X1  0.95  0.26 0.96  0.038 1.2
X2  0.91  0.23 0.87  0.126 1.1
X3  0.91  0.44 1.02 -0.017 1.4
X4 -0.14  0.78 0.63  0.366 1.1
X5 -0.47  0.71 0.72  0.278 1.7
X6 -0.41  0.85 0.90  0.102 1.4
X7  0.16 -0.60 0.39  0.614 1.1
X8  0.15  0.30 0.11  0.886 1.5

                       PA1  PA2
SS loadings           3.00 2.60
Proportion Var        0.38 0.33
Cumulative Var        0.38 0.70
Proportion Explained  0.54 0.46
Cumulative Proportion 0.54 1.00

Mean item complexity =  1.3
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  28  and the objective function was  11.4 with Chi Square of  96.93
The degrees of freedom for the model are 13  and the objective function was  3.23 

The root mean square of the residuals (RMSR) is  0.06 
The df corrected root mean square of the residuals is  0.09 

The harmonic number of observations is  13 with the empirical chi square  2.83  with prob <  1 
The total number of observations was  13  with Likelihood Chi Square =  23.14  with prob <  0.04 

Tucker Lewis Index of factoring reliability =  0.593
RMSEA index =  0.232  and the 90 % confidence intervals are  0.054 0.421
BIC =  -10.2
Fit based upon off diagonal values = 0.98

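The two factors explain 70% of the total variance. Note the ultra-Heywood warning: the communality of X3 exceeds 1 (h2 = 1.02, u2 = -0.017), so with only 13 observations the solution should be read with caution. The unrotated loadings are hard to interpret, so a rotation is applied to help interpret the factors.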

> fa.varimax <- fa(test, nfactors=2, rotate="varimax", fm="pa")
Warning messages:
1: In fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs,  :
  The estimated weights for the factor scores are probably incorrect.  Try a different factor score estimation method.
2: In fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,  :
  An ultra-Heywood case was detected.  Examine the results carefully
> fa.varimax
Factor Analysis using method =  pa
Call: fa(r = test, nfactors = 2, rotate = "varimax", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
     PA1   PA2   h2     u2 com
X1  0.98 -0.09 0.96  0.038 1.0
X2  0.93 -0.11 0.87  0.126 1.0
X3  1.00  0.09 1.02 -0.017 1.0
X4  0.14  0.78 0.63  0.366 1.1
X5 -0.19  0.83 0.72  0.278 1.1
X6 -0.08  0.94 0.90  0.102 1.0
X7 -0.06 -0.62 0.39  0.614 1.0
X8  0.25  0.23 0.11  0.886 2.0

                       PA1  PA2
SS loadings           2.95 2.65
Proportion Var        0.37 0.33
Cumulative Var        0.37 0.70
Proportion Explained  0.53 0.47
Cumulative Proportion 0.53 1.00

Mean item complexity =  1.2
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  28  and the objective function was  11.4 with Chi Square of  96.93
The degrees of freedom for the model are 13  and the objective function was  3.23 

The root mean square of the residuals (RMSR) is  0.06 
The df corrected root mean square of the residuals is  0.09 

The harmonic number of observations is  13 with the empirical chi square  2.83  with prob <  1 
The total number of observations was  13  with Likelihood Chi Square =  23.14  with prob <  0.04 

Tucker Lewis Index of factoring reliability =  0.593
RMSEA index =  0.232  and the 90 % confidence intervals are  0.054 0.421
BIC =  -10.2
Fit based upon off diagonal values = 0.98

The factor.plot() and fa.diagram() functions plot the result of an orthogonal or oblique solution.

> factor.plot(fa.varimax, labels=rownames(fa.varimax$loadings))

[Figure: factor loading plot for fa.varimax]

X1, X2 and X3 load heavily on PA1; X4, X5, X6 and X7 load heavily on PA2 (X7 negatively); X8 loads weakly and roughly equally on both factors.

> fa.diagram(fa.varimax, simple=F)

[Figure: factor diagram for fa.varimax]

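With simple = TRUE, only the largest loading of each variable is drawn, together with the correlations between the factors. With simple = FALSE, as used here, every loading above the cut-off is drawn.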

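Compared with PCA, EFA is less concerned with factor scores: principal component scores are computed exactly, whereas factor scores can only be estimated. The weights used to estimate them can still be inspected briefly.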
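
That factor scores are estimates can be seen on simulated data, where the true factors are known. The sketch below is an illustration under assumptions: the data come from two hypothetical latent factors, factanal (maximum likelihood) replaces psych's fa, and the regression method W = R^{-1} L is used for the weights.

```r
set.seed(7)

# Simulate 6 variables from two latent factors that we keep for comparison.
n <- 500
f <- matrix(rnorm(n * 2), n, 2)               # true factors, unobservable in practice
Lambda <- rbind(c(0.8, 0), c(0.7, 0), c(0.6, 0),
                c(0, 0.8), c(0, 0.7), c(0, 0.6))
u <- sqrt(1 - rowSums(Lambda^2))
x <- f %*% t(Lambda) + sweep(matrix(rnorm(n * 6), n, 6), 2, u, `*`)
colnames(x) <- paste0("V", 1:6)

fit <- factanal(x, factors = 2, rotation = "varimax", scores = "regression")

# Regression-method weights and the scores they produce.
W <- solve(cor(x), unclass(fit$loadings))
est <- scale(x) %*% W

# The hand-computed scores track the ones returned by factanal ...
print(round(diag(cor(est, fit$scores)), 4))
# ... but neither reproduces the true factors exactly.
print(round(abs(cor(est, f)), 2))
```

The second table stays clearly below 1 even for the best-matching true factor, which is the sense in which factor scores are only estimated.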

> fa <- fa(test, nfactors=2, rotate="none", fm="pa", score=T)
Warning messages:
1: In fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs,  :
  The estimated weights for the factor scores are probably incorrect.  Try a different factor score estimation method.
2: In fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,  :
  An ultra-Heywood case was detected.  Examine the results carefully
> fa$weights
           PA1         PA2
X1 -1.62406748 -2.30332697
X2 -4.64153359 -5.75603873
X3  7.27471660  8.56053149
X4 -2.05179521 -2.35273597
X5 -0.28531209 -0.19186435
X6  0.16648968  1.14450904
X7 -0.32570711 -0.58966470
X8  0.05073991  0.07664009

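Application: analysis of bank financial data
Bank financial data are collected to analyze which financial factors influence the share price. The observed variables are the current ratio (流动比率), the ratio of net assets to liabilities (净资产负债比率), the ratio of assets to fixed assets (资产固定资产比率), earnings per share (每股收益), net profit (净利润), growth rate (增长率), share price (股价) and the announcement date.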

> dataf <- readLines("http://labfile.oss.aliyuncs.com/courses/931/bank.csv")
> dataf <- unlist(strsplit(dataf, split=","))
> dataf <- matrix(dataf, ncol=7, byrow=T)
> colnames(dataf) <- dataf[1,]
> dataf <- as.data.frame(dataf[-1,])
> dataf <- as.data.frame(sapply(dataf, as.numeric))
Warning message:
In lapply(X = X, FUN = FUN, ...) : NAs introduced by coercion
> dataf
   流动比率 净资产负债比率 资产固定资产比率 每股收益 净利润 增长率  股价
1    1.0716       0.020515            27.04   0.1925  17.77 -3.942 18.56
2    1.0181       0.009379           113.22   0.1300  14.77 46.914 18.86
3    1.0469       0.013588            85.34   0.2230  14.30 25.433 13.65
4    1.0398       0.013137            93.34   0.2752  14.72 30.732 15.21
5    1.0216       0.013970            88.40   0.1197  14.10 30.578 13.73
6    0.9607       0.013284            93.59   0.1850  16.60 14.550 12.43
7    0.9256       0.011708           102.88   0.2365  14.98 13.879 13.89
8    0.9424       0.011860           103.24   0.3040  13.52 21.894 11.10
9    0.9164       0.011641           103.53   0.0915  14.45 27.488 11.42
10   0.8754       0.010129           112.47   0.1720  13.29 19.168 12.14
11   0.9008       0.009532           127.28   0.2605  12.81 21.915 10.43
12   0.8814       0.009450           133.40   0.3240  11.51 23.687  8.56
13   0.8907       0.008080           128.08   0.1116  12.96 44.718 10.24
14   0.8629       0.009338           236.14   0.1902  11.82 37.542  9.02
15   0.8634       0.009430           117.57   0.2846  11.62 35.189  7.55
16   0.8494       0.010992           107.08   0.3470  10.31 21.157  6.65
17   0.8637       0.010824           105.04   0.1119  12.08 14.875  6.49
18   0.8577       0.011688           110.31   0.1849  10.91 10.622  6.14
19   0.8743       0.009964            98.12   0.3066  11.58 22.350  6.12
20   0.8848       0.010763           116.04   0.3744  10.42 26.894  5.67
21   0.8962       0.009194            97.98   0.1158  11.39 28.249  6.98
22   0.7740       0.009581           105.11   0.2456  11.51 64.610  6.68
23       NA       0.006989           156.92   0.3900  11.79 52.344  6.39
> is.na(dataf)
      流动比率 净资产负债比率 资产固定资产比率 每股收益 净利润 增长率  股价
 [1,]    FALSE          FALSE            FALSE    FALSE  FALSE  FALSE FALSE
 [2,]    FALSE          FALSE            FALSE    FALSE  FALSE  FALSE FALSE
 [3,]    FALSE          FALSE            FALSE    FALSE  FALSE  FALSE FALSE
 [4,]    FALSE          FALSE            FALSE    FALSE  FALSE  FALSE FALSE
 [5,]    FALSE          FALSE            FALSE    FALSE  FALSE  FALSE FALSE
 [6,]    FALSE          FALSE            FALSE    FALSE  FALSE  FALSE FALSE
 [7,]    FALSE          FALSE            FALSE    FALSE  FALSE  FALSE FALSE
 [8,]    FALSE          FALSE            FALSE    FALSE  FALSE  FALSE FALSE
 [9,]    FALSE          FALSE            FALSE    FALSE  FALSE  FALSE FALSE
[10,]    FALSE          FALSE            FALSE    FALSE  FALSE  FALSE FALSE
[11,]    FALSE          FALSE            FALSE    FALSE  FALSE  FALSE FALSE
[12,]    FALSE          FALSE            FALSE    FALSE  FALSE  FALSE FALSE
[13,]    FALSE          FALSE            FALSE    FALSE  FALSE  FALSE FALSE
[14,]    FALSE          FALSE            FALSE    FALSE  FALSE  FALSE FALSE
[15,]    FALSE          FALSE            FALSE    FALSE  FALSE  FALSE FALSE
[16,]    FALSE          FALSE            FALSE    FALSE  FALSE  FALSE FALSE
[17,]    FALSE          FALSE            FALSE    FALSE  FALSE  FALSE FALSE
[18,]    FALSE          FALSE            FALSE    FALSE  FALSE  FALSE FALSE
[19,]    FALSE          FALSE            FALSE    FALSE  FALSE  FALSE FALSE
[20,]    FALSE          FALSE            FALSE    FALSE  FALSE  FALSE FALSE
[21,]    FALSE          FALSE            FALSE    FALSE  FALSE  FALSE FALSE
[22,]    FALSE          FALSE            FALSE    FALSE  FALSE  FALSE FALSE
[23,]     TRUE          FALSE            FALSE    FALSE  FALSE  FALSE FALSE

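Factor analysis is very sensitive to missing values, so the data are checked for missing values before the analysis. The 23rd value of the current ratio (流动比率) is missing, so listwise deletion is used: all 7 values in row 23 are excluded from the factor analysis.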
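
For larger tables, scanning the full is.na() matrix by eye is impractical. A minimal sketch of the same check and the same listwise deletion, on a small hypothetical data frame (the column names a and b and the values are made up for illustration):

```r
# Small hypothetical data frame with one missing value in the last row.
df <- data.frame(a = c(1.07, 1.02, NA),
                 b = c(0.021, 0.009, 0.007))

print(colSums(is.na(df)))            # number of missing values per column
print(which(!complete.cases(df)))    # rows containing any missing value

# Listwise deletion: keep only the complete rows (same rows as na.omit(df)).
clean <- df[complete.cases(df), ]
print(clean)
```
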
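Factor analysis is then used to extract the factors that most clearly influence the share prices of listed banks, in order to analyze what determines those prices.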

> fa.parallel(dataf[-23, -7])
Parallel analysis suggests that the number of factors =  1  and the number of components =  1 
There were 25 warnings (use warnings() to see them)

[Figure: parallel analysis scree plot for the bank data]

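The printed parallel-analysis message suggests a single factor, but judging from the scree plot, two factors are retained here. The fa function is applied to the selected variables, extracting the common factors by maximum likelihood (fm = "ml") and rotating them with varimax to obtain 2 factors.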

> fa(dataf[-23, -7], nfactors=2, fm="ml", rotate="varimax", score=T)
Factor Analysis using method =  ml
Call: fa(r = dataf[-23, -7], nfactors = 2, rotate = "varimax", 
    scores = T, fm = "ml")
Standardized loadings (pattern matrix) based upon correlation matrix
                   ML1   ML2   h2    u2 com
流动比率          0.60  0.56 0.67 0.331 2.0
净资产负债比率    0.98  0.21 1.00 0.005 1.1
资产固定资产比率 -0.65 -0.17 0.45 0.547 1.1
每股收益          0.05 -0.47 0.23 0.773 1.0
净利润            0.53  0.84 1.00 0.005 1.7
增长率           -0.63  0.02 0.39 0.608 1.0

                       ML1  ML2
SS loadings           2.41 1.32
Proportion Var        0.40 0.22
Cumulative Var        0.40 0.62
Proportion Explained  0.65 0.35
Cumulative Proportion 0.65 1.00

Mean item complexity =  1.3
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  15  and the objective function was  3.13 with Chi Square of  56.79
The degrees of freedom for the model are 4  and the objective function was  0.02 

The root mean square of the residuals (RMSR) is  0.02 
The df corrected root mean square of the residuals is  0.04 

The harmonic number of observations is  22 with the empirical chi square  0.22  with prob <  0.99 
The total number of observations was  22  with Likelihood Chi Square =  0.38  with prob <  0.98 

Tucker Lewis Index of factoring reliability =  1.361
RMSEA index =  0  and the 90 % confidence intervals are  0 0
BIC =  -11.99
Fit based upon off diagonal values = 1
Measures of factor score adequacy             
                                                   ML1
Correlation of (regression) scores with factors   1.00
Multiple R square of scores with factors          0.99
Minimum correlation of possible factor scores     0.99
                                                   ML2
Correlation of (regression) scores with factors   0.99
Multiple R square of scores with factors          0.99
Minimum correlation of possible factor scores     0.98

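Interpretation: the cumulative variance contribution of the two factors (Cumulative Var) is 62%, so the two factors explain 62% of the information in the variables. Writing factor A for ML1 and factor B for ML2, the loadings give the following relations for the standardized variables (unique-factor terms omitted):
current ratio = 0.60 × factor A + 0.56 × factor B
ratio of net assets to liabilities = 0.98 × factor A + 0.21 × factor B
ratio of assets to fixed assets = -0.65 × factor A - 0.17 × factor B
earnings per share = 0.05 × factor A - 0.47 × factor B
net profit = 0.53 × factor A + 0.84 × factor B
growth rate = -0.63 × factor A + 0.02 × factor B
Factor A mainly affects the current ratio, the ratio of net assets to liabilities, the ratio of assets to fixed assets and the growth rate. It has a positive effect on the current ratio and the ratio of net assets to liabilities, and a negative effect on the ratio of assets to fixed assets and the growth rate. It is called the asset factor.
Factor B mainly affects earnings per share and net profit. It has a positive effect on net profit and a negative effect on earnings per share. It is called the earnings factor.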
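
The grouping of variables by factor can be read off mechanically from the loading table. The sketch below retypes the rotated loadings from the fa output above and assigns each variable to the factor on which its absolute loading is largest; the English row names are translations introduced here for illustration, and this rule is only a first pass, not a substitute for judgment.

```r
# Rotated loadings (ML1 = factor A, ML2 = factor B), copied from the output above.
L <- matrix(c( 0.60,  0.56,
               0.98,  0.21,
              -0.65, -0.17,
               0.05, -0.47,
               0.53,  0.84,
              -0.63,  0.02),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("current_ratio", "net_assets_to_liabilities",
                              "assets_to_fixed_assets", "earnings_per_share",
                              "net_profit", "growth_rate"),
                            c("A", "B")))

# Factor with the largest absolute loading for each variable.
idx <- apply(abs(L), 1, which.max)
dominant <- colnames(L)[idx]

result <- data.frame(variable = rownames(L),
                     factor = dominant,
                     loading = L[cbind(seq_len(nrow(L)), idx)])
print(result)
```

The current ratio is a borderline case: its loadings on the two factors (0.60 and 0.56) are close, which matches its complexity of 2.0 in the output.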
