纬度灾难
变量过多(没用的变量)
变量相关(相关的变量)
解决办法
剔除无用变量
逐步回归
向前引入法
向后剔除法
逐步筛选法
Step函数
AIC越小越好 AIC = n ln(SSE) + 2p
主成分分析
快速降维技术
降维过程中不影响解的精度
消除多重共线性
数学工具
原变量线性组合得到新变量;方差的重新分配,保留几个方差最大的变量;
矩阵对角化
R的函数
princomp 这个函数是R中的标准PCA函数,可用cor,也可用cov协方差阵来做PCA
predict
loadings
screeplot
因子分析
因子分析和主成分分析的区别
主成分分析从“方差”出发
因子分析从“相关性”出发
因子分析的方法
主成分法
主因子法
最大似然法
因子分析的步骤
观察相关系数矩阵
提取因子变量
因子变量命名
计算因子得分(降维)
R函数
factanal (极大似然法做因子分析)
psych::principal (主成分法)、psync::fa (主因子法)
psych::factor.plot、psych::fa.diagram (可视化)
逐步回归法剔除无用变量代码示例:
> mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
> carlm = lm(mpg~cyl+disp+hp+drat+wt+qsec,data = mtcars)
> summary(mtcars)
mpg cyl disp hp
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
Median :19.20 Median :6.000 Median :196.3 Median :123.0
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
drat wt qsec vs
Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
Median :3.695 Median :3.325 Median :17.71 Median :0.0000
Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
am gear carb
Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
Median :0.0000 Median :4.000 Median :2.000
Mean :0.4062 Mean :3.688 Mean :2.812
3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :1.0000 Max. :5.000 Max. :8.000
> carlm.step = step(carlm)
Start: AIC=66.19
mpg ~ cyl + disp + hp + drat + wt + qsec
Df Sum of Sq RSS AIC
- qsec 1 3.949 167.43 64.954
- drat 1 5.209 168.69 65.194
- cyl 1 6.652 170.13 65.466
- disp 1 7.870 171.35 65.695
- hp 1 8.744 172.22 65.857
163.48 66.190
- wt 1 72.580 236.06 75.947
Step: AIC=64.95
mpg ~ cyl + disp + hp + drat + wt
Df Sum of Sq RSS AIC
- drat 1 3.018 170.44 63.526
- disp 1 6.949 174.38 64.255
167.43 64.954
- cyl 1 15.411 182.84 65.772
- hp 1 21.066 188.49 66.746
- wt 1 77.476 244.90 75.124
Step: AIC=63.53
mpg ~ cyl + disp + hp + wt
Df Sum of Sq RSS AIC
- disp 1 6.176 176.62 62.665
170.44 63.526
- hp 1 18.048 188.49 64.746
- cyl 1 24.546 194.99 65.831
- wt 1 90.925 261.37 75.206
Step: AIC=62.66
mpg ~ cyl + hp + wt
Df Sum of Sq RSS AIC
176.62 62.665
- hp 1 14.551 191.17 63.198
- cyl 1 18.427 195.05 63.840
- wt 1 115.354 291.98 76.750
> summary(carlm.step)
Call:
lm(formula = mpg ~ cyl + hp + wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.9290 -1.5598 -0.5311 1.1850 5.8986
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.75179 1.78686 21.687 < 2e-16 ***
cyl -0.94162 0.55092 -1.709 0.098480 .
hp -0.01804 0.01188 -1.519 0.140015
wt -3.16697 0.74058 -4.276 0.000199 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.512 on 28 degrees of freedom
Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263
F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11
> carlm2 = lm(mpg~cyl+wt,data = mtcars)
> summary(carlm2)
Call:
lm(formula = mpg ~ cyl + wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.2893 -1.5512 -0.4684 1.5743 6.1004
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.6863 1.7150 23.141 < 2e-16 ***
cyl -1.5078 0.4147 -3.636 0.001064 **
wt -3.1910 0.7569 -4.216 0.000222 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.568 on 29 degrees of freedom
Multiple R-squared: 0.8302, Adjusted R-squared: 0.8185
F-statistic: 70.91 on 2 and 29 DF, p-value: 6.809e-12
> carlm2 = lm(mpg~cyl*wt,data = mtcars)
> summary(carlm2)
Call:
lm(formula = mpg ~ cyl * wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.2288 -1.3495 -0.5042 1.4647 5.2344
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 54.3068 6.1275 8.863 1.29e-09 ***
cyl -3.8032 1.0050 -3.784 0.000747 ***
wt -8.6556 2.3201 -3.731 0.000861 ***
cyl:wt 0.8084 0.3273 2.470 0.019882 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.368 on 28 degrees of freedom
Multiple R-squared: 0.8606, Adjusted R-squared: 0.8457
F-statistic: 57.62 on 3 and 28 DF, p-value: 4.231e-12
主成分分析R语言实例:
# 第一步:将student.csv中的数据读入到程序中
> # 30名学生的身高,体重,胸围,坐高
> student = read.csv(‘e:/student.csv’,header = T)
# 注:header = T表示将students_data.csv中的第一行数据设置为列名,这种情况下,
student.csv中的第二行到最后一行数据作为data中的有效数据。header = F
表示不将student.csv中的第一行数据设置为列名,这种情况下,student.csv
中的第一行到最后一行数据作为data中的有效数据。
> str(student)
‘data.frame’: 30 obs. of 4 variables:
$ X1: int 148 139 160 149 159 142 153 150 151 139 …
$ X2: int 41 34 49 36 45 31 43 43 42 31 …
$ X3: int 72 71 77 67 80 66 76 77 77 68 …
$ X4: int 78 76 86 79 86 76 83 79 80 74 …
# 第二步:进行主成分分析
> student.pr <- princomp(student, cor = TRUE)
# 注:cor = T的意思是用相关系数进行主成分分析。
# 第三步:观察主成分分析的详细情况
> summary(student.pr)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4
Standard deviation 1.8817805 0.55980636 0.28179594 0.25711844
Proportion of Variance 0.8852745 0.07834579 0.01985224 0.01652747
Cumulative Proportion 0.8852745 0.96362029 0.98347253 1.00000000
# 说明: 结果中的Comp.1、Comp.2、Comp.3和Comp.4是计算出来的主成分,Standard deviation代表每个主成分的标准差,
Proportion of Variance代表每个主成分的贡献率,Cumulative Proportion代表各个主成分的累积贡献率。
每个主成分都不属于X1、X2、X3和X4中的任何一个。第一主成分、第二主成分、第三主成分和第四主成分都是X1、X2、X3和X4的线性组合,
也就是说最原始数据的成分经过线性变换得到了各个主成分。然而并不是每个主成分的作用都非常关键,因此,我们只选择作用比较关键的几个,
一般地,选择累积贡献率达到八成的前几个主成分即可(这个实例中我们选择前两个,毕竟第二主成分的贡献率也比较大)。
接下来,在得到主成分的基础上进行回归也好进行聚类也好,就不再使用原始的X1、X2、X3和X4了,而是使用主成分的数据。
但现在还没有各个样本的主成分的数据,所以,最后一步就是得到各个样本主成分的数据。
# 第四步:计算得到各个样本主成分的数据
> predict(student.pr)
Comp.1 Comp.2 Comp.3 Comp.4
[1,] -0.06990950 -0.23813701 0.35509248 -0.266120139
[2,] -1.59526340 -0.71847399 -0.32813232 -0.118056646
[3,] 2.84793151 0.38956679 0.09731731 -0.279482487
[4,] -0.75996988 0.80604335 0.04945722 -0.162949298
[5,] 2.73966777 0.01718087 -0.36012615 0.358653044
[6,] -2.10583168 0.32284393 -0.18600422 -0.036456084
[7,] 1.42105591 -0.06053165 -0.21093321 -0.044223092
[8,] 0.82583977 -0.78102576 0.27557798 0.057288572
[9,] 0.93464402 -0.58469242 0.08814136 0.181037746
[10,] -2.36463820 -0.36532199 -0.08840476 0.045520127
[11,] -2.83741916 0.34875841 -0.03310423 -0.031146930
[12,] 2.60851224 0.21278728 0.33398037 0.210157574
[13,] 2.44253342 -0.16769496 0.46918095 -0.162987830
[14,] -1.86630669 0.05021384 -0.37720280 -0.358821916
[15,] -2.81347421 -0.31790107 0.03291329 -0.222035112
[16,] -0.06392983 0.20718448 -0.04334340 0.703533624
[17,] 1.55561022 -1.70439674 0.33126406 0.007551879
[18,] -1.07392251 -0.06763418 -0.02283648 0.048606680
[19,] 2.52174212 0.97274301 -0.12164633 -0.390667991
[20,] 2.14072377 0.02217881 -0.37410972 0.129548960
[21,] 0.79624422 0.16307887 -0.12781270 -0.294140762
[22,] -0.28708321 -0.35744666 0.03962116 0.080991989
[23,] 0.25151075 1.25555188 0.55617325 0.109068939
[24,] -2.05706032 0.78894494 0.26552109 0.388088643
[25,] 3.08596855 -0.05775318 -0.62110421 -0.218939612
[26,] 0.16367555 0.04317932 -0.24481850 0.560248997
[27,] -1.37265053 0.02220972 0.23378320 -0.257399715
[28,] -2.16097778 0.13733233 -0.35589739 0.093123683
[29,] -2.40434827 -0.48613137 0.16154441 -0.007914021
[30,] -0.50287468 0.14734317 0.20590831 -0.122078819
# 我们只保留Comp.1和Comp.2的数据即可。
> screeplot(student.pr,type = 'lines') #碎石图
> consumedata = read.csv('e:/consume.csv')
> lm1 = lm(Y~.,data = consumedata)
> summary(lm1)
Call:
lm(formula = Y ~ ., data = consumedata)
Residuals:
1 2 3 4 5 6 7 8 9 10
0.024803 0.079476 0.012381 -0.007025 -0.288345 0.216090 -0.142085 0.158360 -0.135964 0.082310
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.66768 5.94360 -2.973 0.03107 *
X1 0.09006 0.02095 4.298 0.00773 **
X2 -0.23132 0.07132 -3.243 0.02287 *
X3 0.01806 0.03907 0.462 0.66328
X4 0.42075 0.11847 3.552 0.01636 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2037 on 5 degrees of freedom
Multiple R-squared: 0.9988, Adjusted R-squared: 0.9978
F-statistic: 1021 on 4 and 5 DF, p-value: 1.827e-07
> # 观察自变量之间的相关性
> library(corrgram)
> corrgram(consumedata[,1:4],lower.panel = panel.conf,upper.panel = panel.pie,text.panel = panel.txt)
> consumedata.pre = princomp(consumedata[,1:4],cor = TRUE)
> summary(consumedata.pre)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4
Standard deviation 1.9859037 0.199906992 0.11218966 0.0603085506
Proportion of Variance 0.9859534 0.009990701 0.00314663 0.0009092803
Cumulative Proportion 0.9859534 0.995944090 0.99909072 1.0000000000
> screeplot(consumedata.pre,type = 'lines')
> loadings(consumedata.pre)
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4
X1 0.502 0.237 0.579 0.598
X2 0.500 -0.493 -0.610 0.367
X3 0.498 0.707 -0.368 -0.342
X4 0.501 -0.449 0.396 -0.626
Comp.1 Comp.2 Comp.3 Comp.4
SS loadings 1.00 1.00 1.00 1.00
Proportion Var 0.25 0.25 0.25 0.25
Cumulative Var 0.25 0.50 0.75 1.00
> # 做线性回归
> consumedata$z1 = predict(consumedata.pre)[,1]
> consumedata$z2 = predict(consumedata.pre)[,2]
> consumedata.lm = lm(Y~z1+z2,data = consumedata)
>
> summary(consumedata.lm)
Call:
lm(formula = Y ~ z1 + z2, data = consumedata)
Residuals:
Min 1Q Median 3Q Max
-0.74323 -0.29223 0.01746 0.30807 0.80849
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.03000 0.17125 81.927 1.06e-11 ***
z1 2.06119 0.08623 23.903 5.70e-08 ***
z2 0.62409 0.85665 0.729 0.49
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5415 on 7 degrees of freedom
Multiple R-squared: 0.9879, Adjusted R-squared: 0.9845
F-statistic: 285.9 on 2 and 7 DF, p-value: 1.945e-07
> consumedata.lm2 = lm(Y~z1,data = consumedata)
> summary(consumedata.lm2)
Call:
lm(formula = Y ~ z1, data = consumedata)
Residuals:
Min 1Q Median 3Q Max
-0.72237 -0.20946 0.05154 0.21032 0.81856
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.03000 0.16615 84.44 4.32e-13 ***
z1 2.06119 0.08367 24.64 7.87e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5254 on 8 degrees of freedom
Multiple R-squared: 0.987, Adjusted R-squared: 0.9854
F-statistic: 606.9 on 1 and 8 DF, p-value: 7.873e-09
> # install.packages('psych')
> install.packages('psych')
> library(psych)
> R = read.csv('e:/sw.csv')
> R = R[-1]
> R
身高 臂长 上肢长 腿长 体重 颈围 胸围 腰围
1 1.000 0.846 0.805 0.859 0.473 0.398 0.301 0.382
2 0.846 1.000 0.881 0.826 0.376 0.326 0.277 0.277
3 0.805 0.881 1.000 0.801 0.380 0.319 0.237 0.345
4 0.859 0.826 0.801 1.000 0.436 0.329 0.327 0.365
5 0.473 0.376 0.380 0.436 1.000 0.762 0.730 0.629
6 0.398 0.326 0.319 0.329 0.762 1.000 0.583 0.577
7 0.301 0.277 0.237 0.327 0.730 0.583 1.000 0.539
8 0.382 0.277 0.345 0.365 0.629 0.577 0.539 1.000
> # 主成分法
> pc = principal(r = R,nfactors = 2,rotate = 'varimax')
> pc
Principal Components Analysis
Call: principal(r = R, nfactors = 2, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
RC1 RC2 h2 u2 com
1 0.90 0.27 0.88 0.120 1.2
2 0.93 0.17 0.90 0.097 1.1
3 0.92 0.18 0.87 0.129 1.1
4 0.90 0.24 0.86 0.137 1.1
5 0.25 0.89 0.85 0.151 1.2
6 0.18 0.84 0.74 0.264 1.1
7 0.11 0.84 0.71 0.289 1.0
8 0.20 0.77 0.63 0.370 1.1
RC1 RC2
SS loadings 3.46 2.98
Proportion Var 0.43 0.37
Cumulative Var 0.43 0.81
Proportion Explained 0.54 0.46
Cumulative Proportion 0.54 1.00
Mean item complexity = 1.1
Test of the hypothesis that 2 components are sufficient.
The root mean square of the residuals (RMSR) is 0.05
Fit based upon off diagonal values = 0.99
> # 因子载荷图
> par(mfrow = c(2,1))
> factor.plot(pc)
> # 因子结果图
> fa.diagram(pc)
> par(mfrow = c(1,1))
> # 主因子法
> fa = fa(r = R,nfactors = 2,rotate = 'varimax')
> par(mfrow = c(2,1))
> factor.plot(fa)
> fa.diagram(fa,simple = T)
> par(mfrow = c(1,1))