R语言学习_数据降维

纬度灾难
变量过多(没用的变量)
变量相关(相关的变量)
解决办法
剔除无用变量
逐步回归
向前引入法
向后剔除法
逐步筛选法
Step函数
AIC越小越好 AIC = n ln(SSE) + 2p
主成分分析
快速降维技术
降维过程中不影响解的精度
消除多重共线性

        数学工具
            原变量线性组合得到新变量;方差的重新分配,保留几个方差最大的变量;
            矩阵对角化

        R的函数
            princomp    这个函数是R中的标准PCA函数,可用cor,也可用cov协方差阵来做PCA
            predict
            loadings
            screeplot
    因子分析
        因子分析和主成分分析的区别
            主成分分析从“方差”出发
            因子分析从“相关性”出发
        因子分析的方法
            主成分法
            主因子法
            最大似然法

        因子分析的步骤
            观察相关系数矩阵
            提取因子变量
            因子变量命名
            计算因子得分(降维)

        R函数
            factanal    (极大似然法做因子分析)
            psych::principal (主成分法)、psync::fa  (主因子法)
            psych::factor.plot、psych::fa.diagram  (可视化)

逐步回归法剔除无用变量代码示例:
> mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
> carlm = lm(mpg~cyl+disp+hp+drat+wt+qsec,data = mtcars)
> summary(mtcars)
mpg cyl disp hp
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
Median :19.20 Median :6.000 Median :196.3 Median :123.0
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
drat wt qsec vs
Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
Median :3.695 Median :3.325 Median :17.71 Median :0.0000
Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
am gear carb
Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
Median :0.0000 Median :4.000 Median :2.000
Mean :0.4062 Mean :3.688 Mean :2.812
3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :1.0000 Max. :5.000 Max. :8.000
> carlm.step = step(carlm)
Start: AIC=66.19
mpg ~ cyl + disp + hp + drat + wt + qsec

       Df Sum of Sq    RSS    AIC
- qsec  1     3.949 167.43 64.954
- drat  1     5.209 168.69 65.194
- cyl   1     6.652 170.13 65.466
- disp  1     7.870 171.35 65.695
- hp    1     8.744 172.22 65.857
              163.48 66.190
- wt    1    72.580 236.06 75.947

Step:  AIC=64.95
mpg ~ cyl + disp + hp + drat + wt

       Df Sum of Sq    RSS    AIC
- drat  1     3.018 170.44 63.526
- disp  1     6.949 174.38 64.255
              167.43 64.954
- cyl   1    15.411 182.84 65.772
- hp    1    21.066 188.49 66.746
- wt    1    77.476 244.90 75.124

Step:  AIC=63.53
mpg ~ cyl + disp + hp + wt

       Df Sum of Sq    RSS    AIC
- disp  1     6.176 176.62 62.665
              170.44 63.526
- hp    1    18.048 188.49 64.746
- cyl   1    24.546 194.99 65.831
- wt    1    90.925 261.37 75.206

Step:  AIC=62.66
mpg ~ cyl + hp + wt

       Df Sum of Sq    RSS    AIC
              176.62 62.665
- hp    1    14.551 191.17 63.198
- cyl   1    18.427 195.05 63.840
- wt    1   115.354 291.98 76.750
> summary(carlm.step)

Call:
lm(formula = mpg ~ cyl + hp + wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-3.9290 -1.5598 -0.5311  1.1850  5.8986

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
cyl         -0.94162    0.55092  -1.709 0.098480 .
hp          -0.01804    0.01188  -1.519 0.140015
wt          -3.16697    0.74058  -4.276 0.000199 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.512 on 28 degrees of freedom
Multiple R-squared:  0.8431,	Adjusted R-squared:  0.8263
F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11

> carlm2 = lm(mpg~cyl+wt,data = mtcars)
> summary(carlm2)

Call:
lm(formula = mpg ~ cyl + wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-4.2893 -1.5512 -0.4684  1.5743  6.1004

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
cyl          -1.5078     0.4147  -3.636 0.001064 **
wt           -3.1910     0.7569  -4.216 0.000222 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.568 on 29 degrees of freedom
Multiple R-squared:  0.8302,	Adjusted R-squared:  0.8185
F-statistic: 70.91 on 2 and 29 DF,  p-value: 6.809e-12

> carlm2 = lm(mpg~cyl*wt,data = mtcars)
> summary(carlm2)

Call:
lm(formula = mpg ~ cyl * wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-4.2288 -1.3495 -0.5042  1.4647  5.2344

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  54.3068     6.1275   8.863 1.29e-09 ***
cyl          -3.8032     1.0050  -3.784 0.000747 ***
wt           -8.6556     2.3201  -3.731 0.000861 ***
cyl:wt        0.8084     0.3273   2.470 0.019882 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.368 on 28 degrees of freedom
Multiple R-squared:  0.8606,	Adjusted R-squared:  0.8457
F-statistic: 57.62 on 3 and 28 DF,  p-value: 4.231e-12

主成分分析R语言实例:
# 第一步:将student.csv中的数据读入到程序中
> # 30名学生的身高,体重,胸围,坐高
> student = read.csv(‘e:/student.csv’,header = T)
# 注:header = T表示将students_data.csv中的第一行数据设置为列名,这种情况下,
student.csv中的第二行到最后一行数据作为data中的有效数据。header = F
表示不将student.csv中的第一行数据设置为列名,这种情况下,student.csv
中的第一行到最后一行数据作为data中的有效数据。
> str(student)
‘data.frame’: 30 obs. of 4 variables:
$ X1: int 148 139 160 149 159 142 153 150 151 139 …
$ X2: int 41 34 49 36 45 31 43 43 42 31 …
$ X3: int 72 71 77 67 80 66 76 77 77 68 …
$ X4: int 78 76 86 79 86 76 83 79 80 74 …
# 第二步:进行主成分分析
> student.pr <- princomp(student, cor = TRUE)
# 注:cor = T的意思是用相关系数进行主成分分析。
# 第三步:观察主成分分析的详细情况
> summary(student.pr)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4
Standard deviation 1.8817805 0.55980636 0.28179594 0.25711844
Proportion of Variance 0.8852745 0.07834579 0.01985224 0.01652747
Cumulative Proportion 0.8852745 0.96362029 0.98347253 1.00000000
# 说明: 结果中的Comp.1、Comp.2、Comp.3和Comp.4是计算出来的主成分,Standard deviation代表每个主成分的标准差,
Proportion of Variance代表每个主成分的贡献率,Cumulative Proportion代表各个主成分的累积贡献率。
每个主成分都不属于X1、X2、X3和X4中的任何一个。第一主成分、第二主成分、第三主成分和第四主成分都是X1、X2、X3和X4的线性组合,
也就是说最原始数据的成分经过线性变换得到了各个主成分。然而并不是每个主成分的作用都非常关键,因此,我们只选择作用比较关键的几个,
一般地,选择累积贡献率达到八成的前几个主成分即可(这个实例中我们选择前两个,毕竟第二主成分的贡献率也比较大)。
接下来,在得到主成分的基础上进行回归也好进行聚类也好,就不再使用原始的X1、X2、X3和X4了,而是使用主成分的数据。
但现在还没有各个样本的主成分的数据,所以,最后一步就是得到各个样本主成分的数据。

# 第四步:计算得到各个样本主成分的数据
> predict(student.pr)
           Comp.1      Comp.2      Comp.3       Comp.4
 [1,] -0.06990950 -0.23813701  0.35509248 -0.266120139
 [2,] -1.59526340 -0.71847399 -0.32813232 -0.118056646
 [3,]  2.84793151  0.38956679  0.09731731 -0.279482487
 [4,] -0.75996988  0.80604335  0.04945722 -0.162949298
 [5,]  2.73966777  0.01718087 -0.36012615  0.358653044
 [6,] -2.10583168  0.32284393 -0.18600422 -0.036456084
 [7,]  1.42105591 -0.06053165 -0.21093321 -0.044223092
 [8,]  0.82583977 -0.78102576  0.27557798  0.057288572
 [9,]  0.93464402 -0.58469242  0.08814136  0.181037746
[10,] -2.36463820 -0.36532199 -0.08840476  0.045520127
[11,] -2.83741916  0.34875841 -0.03310423 -0.031146930
[12,]  2.60851224  0.21278728  0.33398037  0.210157574
[13,]  2.44253342 -0.16769496  0.46918095 -0.162987830
[14,] -1.86630669  0.05021384 -0.37720280 -0.358821916
[15,] -2.81347421 -0.31790107  0.03291329 -0.222035112
[16,] -0.06392983  0.20718448 -0.04334340  0.703533624
[17,]  1.55561022 -1.70439674  0.33126406  0.007551879
[18,] -1.07392251 -0.06763418 -0.02283648  0.048606680
[19,]  2.52174212  0.97274301 -0.12164633 -0.390667991
[20,]  2.14072377  0.02217881 -0.37410972  0.129548960
[21,]  0.79624422  0.16307887 -0.12781270 -0.294140762
[22,] -0.28708321 -0.35744666  0.03962116  0.080991989
[23,]  0.25151075  1.25555188  0.55617325  0.109068939
[24,] -2.05706032  0.78894494  0.26552109  0.388088643
[25,]  3.08596855 -0.05775318 -0.62110421 -0.218939612
[26,]  0.16367555  0.04317932 -0.24481850  0.560248997
[27,] -1.37265053  0.02220972  0.23378320 -0.257399715
[28,] -2.16097778  0.13733233 -0.35589739  0.093123683
[29,] -2.40434827 -0.48613137  0.16154441 -0.007914021
[30,] -0.50287468  0.14734317  0.20590831 -0.122078819
# 我们只保留Comp.1和Comp.2的数据即可。
> screeplot(student.pr,type = 'lines')      #碎石图

> consumedata = read.csv('e:/consume.csv')
> lm1 = lm(Y~.,data = consumedata)
> summary(lm1)

Call:
lm(formula = Y ~ ., data = consumedata)

Residuals:
        1         2         3         4         5         6         7         8         9        10
 0.024803  0.079476  0.012381 -0.007025 -0.288345  0.216090 -0.142085  0.158360 -0.135964  0.082310

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.66768    5.94360  -2.973  0.03107 *
X1            0.09006    0.02095   4.298  0.00773 **
X2           -0.23132    0.07132  -3.243  0.02287 *
X3            0.01806    0.03907   0.462  0.66328
X4            0.42075    0.11847   3.552  0.01636 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2037 on 5 degrees of freedom
Multiple R-squared:  0.9988,	Adjusted R-squared:  0.9978
F-statistic:  1021 on 4 and 5 DF,  p-value: 1.827e-07

> # 观察自变量之间的相关性
> library(corrgram)

> corrgram(consumedata[,1:4],lower.panel = panel.conf,upper.panel = panel.pie,text.panel = panel.txt)
> consumedata.pre = princomp(consumedata[,1:4],cor = TRUE)
> summary(consumedata.pre)
Importance of components:
                          Comp.1      Comp.2     Comp.3       Comp.4
Standard deviation     1.9859037 0.199906992 0.11218966 0.0603085506
Proportion of Variance 0.9859534 0.009990701 0.00314663 0.0009092803
Cumulative Proportion  0.9859534 0.995944090 0.99909072 1.0000000000
> screeplot(consumedata.pre,type = 'lines')


> loadings(consumedata.pre)

Loadings:
   Comp.1 Comp.2 Comp.3 Comp.4
X1  0.502  0.237  0.579  0.598
X2  0.500 -0.493 -0.610  0.367
X3  0.498  0.707 -0.368 -0.342
X4  0.501 -0.449  0.396 -0.626

               Comp.1 Comp.2 Comp.3 Comp.4
SS loadings      1.00   1.00   1.00   1.00
Proportion Var   0.25   0.25   0.25   0.25
Cumulative Var   0.25   0.50   0.75   1.00
> # 做线性回归
> consumedata$z1 = predict(consumedata.pre)[,1]
> consumedata$z2 = predict(consumedata.pre)[,2]
> consumedata.lm = lm(Y~z1+z2,data = consumedata)
>
> summary(consumedata.lm)

Call:
lm(formula = Y ~ z1 + z2, data = consumedata)

Residuals:
     Min       1Q   Median       3Q      Max
-0.74323 -0.29223  0.01746  0.30807  0.80849

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.03000    0.17125  81.927 1.06e-11 ***
z1           2.06119    0.08623  23.903 5.70e-08 ***
z2           0.62409    0.85665   0.729     0.49
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5415 on 7 degrees of freedom
Multiple R-squared:  0.9879,	Adjusted R-squared:  0.9845
F-statistic: 285.9 on 2 and 7 DF,  p-value: 1.945e-07

> consumedata.lm2 = lm(Y~z1,data = consumedata)
> summary(consumedata.lm2)

Call:
lm(formula = Y ~ z1, data = consumedata)

Residuals:
     Min       1Q   Median       3Q      Max
-0.72237 -0.20946  0.05154  0.21032  0.81856

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.03000    0.16615   84.44 4.32e-13 ***
z1           2.06119    0.08367   24.64 7.87e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5254 on 8 degrees of freedom
Multiple R-squared:  0.987,	Adjusted R-squared:  0.9854
F-statistic: 606.9 on 1 and 8 DF,  p-value: 7.873e-09



> # install.packages('psych')
> install.packages('psych')
> library(psych)
> R = read.csv('e:/sw.csv')
> R = R[-1]
> R
   身高  臂长 上肢长  腿长  体重  颈围  胸围  腰围
1 1.000 0.846  0.805 0.859 0.473 0.398 0.301 0.382
2 0.846 1.000  0.881 0.826 0.376 0.326 0.277 0.277
3 0.805 0.881  1.000 0.801 0.380 0.319 0.237 0.345
4 0.859 0.826  0.801 1.000 0.436 0.329 0.327 0.365
5 0.473 0.376  0.380 0.436 1.000 0.762 0.730 0.629
6 0.398 0.326  0.319 0.329 0.762 1.000 0.583 0.577
7 0.301 0.277  0.237 0.327 0.730 0.583 1.000 0.539
8 0.382 0.277  0.345 0.365 0.629 0.577 0.539 1.000
> # 主成分法
> pc = principal(r = R,nfactors = 2,rotate = 'varimax')
> pc
Principal Components Analysis
Call: principal(r = R, nfactors = 2, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
   RC1  RC2   h2    u2 com
1 0.90 0.27 0.88 0.120 1.2
2 0.93 0.17 0.90 0.097 1.1
3 0.92 0.18 0.87 0.129 1.1
4 0.90 0.24 0.86 0.137 1.1
5 0.25 0.89 0.85 0.151 1.2
6 0.18 0.84 0.74 0.264 1.1
7 0.11 0.84 0.71 0.289 1.0
8 0.20 0.77 0.63 0.370 1.1

                       RC1  RC2
SS loadings           3.46 2.98
Proportion Var        0.43 0.37
Cumulative Var        0.43 0.81
Proportion Explained  0.54 0.46
Cumulative Proportion 0.54 1.00

Mean item complexity =  1.1
Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is  0.05

Fit based upon off diagonal values = 0.99
> # 因子载荷图
> par(mfrow = c(2,1))
> factor.plot(pc)
> # 因子结果图
> fa.diagram(pc)
> par(mfrow = c(1,1))
> # 主因子法
> fa = fa(r = R,nfactors = 2,rotate = 'varimax')
> par(mfrow = c(2,1))
> factor.plot(fa)

> fa.diagram(fa,simple = T)
> par(mfrow = c(1,1))

你可能感兴趣的:(R语言学习)