How to compute summary statistics for data (R)

1 Simulated data

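The simulated data frame has 4 grouping factors (F1 to F4) and 4 response variables (y1 to y4), with 24 rows. This post covers the following kinds of summary statistics:

  • 1 Single variable ~ single factor, one statistic (the mean, via mean)
  • 2 Single variable ~ single factor, several statistics (via the user-defined function func)
  • 3 Single variable ~ multiple factors
  • 4 Multiple variables ~ single factor
  • 5 Multiple variables ~ multiple factors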

1.1 Code for simulating the data

dat = data.frame(F1=1:24,F2=rep(1:2,12),F3=rep(1:3,8),F4=rep(1:4,6),
                 y1=rnorm(24),y2=rnorm(24),y3=rnorm(24),y4=rnorm(24))
dat

Result:

> dat
   F1 F2 F3 F4          y1          y2          y3          y4
1   1  1  1  1 -1.56638762  1.77659389  0.62182746  0.07154109
2   2  2  2  2  1.09314649  0.71375709 -2.00699087 -0.21156736
3   3  1  3  3 -0.05927923  0.37890941 -2.44829351 -0.21725814
4   4  2  1  4 -0.20873143  0.11067616  0.68841731  1.76528949
5   5  1  2  1 -1.27492917 -0.95776287 -1.68332656 -0.01702117
6   6  2  3  2 -1.05095342 -0.38322499  0.14197083 -0.78430424
7   7  1  1  3  1.37034964  0.05623515 -1.40426807  0.65247027
8   8  2  2  4  0.36660747 -0.12935219  0.35927791  0.78090801
9   9  1  3  1  0.23858612  1.40575764  0.10948955 -0.97792913
10 10  2  1  2 -1.20208996  0.83394104  0.81612552 -1.16199479
11 11  1  2  3  0.67429860  1.64004800  0.21721424  0.10194002
12 12  2  3  4 -0.28761315 -0.16285338  0.88606656  0.89780823
13 13  1  1  1  0.58100320 -0.50242117  0.69975049  0.23075716
14 14  2  2  2 -0.09756759  0.32500760  1.34954777  0.49576819
15 15  1  3  3 -0.79733970 -0.45139957  0.96597139 -2.47475726
16 16  2  1  4 -1.53313299 -1.36002014  0.06478981  0.27118850
17 17  1  2  1 -1.76762191 -1.17475175 -1.16165180  0.08503871
18 18  2  3  2 -0.32539248 -1.12102656  1.35283538  0.46963266
19 19  1  1  3 -0.29976865  1.19147376  0.38726070  0.12839759
20 20  2  2  4 -0.53285724 -0.37190046 -1.02641877 -1.71363552
21 21  1  3  1 -0.74750973 -0.69994486  1.29616246 -0.22394345
22 22  2  1  2 -0.82581172 -0.83660765  0.43636897  0.29364722
23 23  1  2  3  0.74471471  0.38635141 -0.85874012 -1.17886383
24 24  2  3  4  1.28956868 -1.41161366  0.36144567 -0.31512618
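
The y columns come from rnorm(), so rerunning the code gives different numbers from those printed above. A minimal sketch of making the simulation repeatable, assuming a seed of 123 and a helper named make_dat (both chosen here for illustration, neither appears in the original code):

```r
# Hypothetical helper: builds the same layout as dat, but fixes the random seed first.
make_dat <- function(seed) {
  set.seed(seed)  # the seed value is an arbitrary assumption
  data.frame(F1 = 1:24, F2 = rep(1:2, 12), F3 = rep(1:3, 8), F4 = rep(1:4, 6),
             y1 = rnorm(24), y2 = rnorm(24), y3 = rnorm(24), y4 = rnorm(24))
}

a <- make_dat(123)
b <- make_dat(123)
same <- identical(a, b)  # TRUE when both runs produce exactly the same data frame
print(same)
```

With a fixed seed, every aggregate result in the later sections can be regenerated exactly.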

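1.2 Define a function

Suppose the summary should report the number of observations, the mean, the standard deviation, and the coefficient of variation (as a percentage). Missing values are excluded from the calculations.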

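# n counts non-missing values only, consistent with na.rm = TRUE below
func <- function(x) c(n = sum(!is.na(x)), mean = mean(x, na.rm = TRUE),
                      sd = sd(x, na.rm = TRUE),
                      cv = sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) * 100)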
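
To see how a function of this kind treats a missing value, here is a small check on a hand-made vector. The name summary_stats, the test vector, and the decision to count only non-missing values in n are assumptions made for this sketch.

```r
# Hypothetical variant of the summary function; n counts only non-missing values.
summary_stats <- function(x) {
  c(n    = sum(!is.na(x)),
    mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE),
    cv   = sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) * 100)
}

x <- c(2, 4, NA, 6)       # illustrative vector with one missing value
res <- summary_stats(x)   # n = 3, mean = 4, sd = 2, cv = 50
print(res)
```

Because the coefficient of variation divides by the mean, it becomes very large and changes sign when the mean is close to zero, which is the case for data drawn with rnorm().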

2 Single variable ~ single factor

2.1 One statistic, using mean

Code

aggregate(y1 ~ F4, data=dat,mean)

Result

> aggregate(y1 ~ F4, data=dat,mean)
  F4         y1
1  1 -0.7561432
2  2 -0.4014448
3  3  0.2721626
4  4 -0.1510264
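
The formula interface is not the only way to get these group means. The sketch below compares it with the by = form of aggregate and with tapply. The seed and the object names are assumptions for illustration, and the simulated values differ from the ones printed above.

```r
set.seed(1)  # arbitrary seed, assumed for reproducibility
dat <- data.frame(F4 = rep(1:4, 6), y1 = rnorm(24))

# Formula interface, as used in this post
by_formula <- aggregate(y1 ~ F4, data = dat, FUN = mean)

# by = interface: the value column is named "x" by default
by_list <- aggregate(dat$y1, by = list(F4 = dat$F4), FUN = mean)

# tapply returns a named vector instead of a data frame
by_tapply <- tapply(dat$y1, dat$F4, mean)

print(by_formula)
print(by_list)
print(by_tapply)
```

All three give the same means. The formula interface keeps the variable name in the output, which is convenient when several variables are summarised at once.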

2.2 Several statistics, using the function func defined above

Code

aggregate(y1 ~ F4, data=dat,func)

Result

> aggregate(y1 ~ F4, data=dat,func)
  F4         y1.n      y1.mean        y1.sd        y1.cv
1  1    6.0000000   -0.7561432    0.9722392 -128.5787188
2  2    6.0000000   -0.4014448    0.8455661 -210.6307286
3  3    6.0000000    0.2721626    0.7964707  292.6452101
4  4    6.0000000   -0.1510264    0.9403466 -622.6370275
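
Although the printout looks like five separate columns, aggregate stores the output of a vector-valued function as a single matrix column. The sketch below shows this and one common way to flatten it into ordinary columns with do.call(data.frame, ...). The seed and the names res and flat are assumptions for illustration.

```r
set.seed(1)  # arbitrary seed, assumed for reproducibility
dat <- data.frame(F4 = rep(1:4, 6), y1 = rnorm(24))

func <- function(x) c(n = length(x), mean = mean(x, na.rm = TRUE),
                      sd = sd(x, na.rm = TRUE),
                      cv = sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) * 100)

res <- aggregate(y1 ~ F4, data = dat, FUN = func)
print(dim(res))            # 4 rows, 2 columns: F4 and the matrix column y1
print(is.matrix(res$y1))   # the four statistics live inside one matrix

# Flatten the matrix column into regular columns y1.n, y1.mean, y1.sd, y1.cv
flat <- do.call(data.frame, res)
print(names(flat))
```

After flattening, a single statistic can be accessed directly, for example flat$y1.mean.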

3 Single variable ~ multiple factors

aggregate(y1 ~ F4 + F3, data=dat,mean)
aggregate(y1 ~ F4 + F3, data=dat,func)

Result

> aggregate(y1 ~ F4 + F3, data=dat,mean)
   F4 F3          y1
1   1  1 -0.49269221
2   2  1 -1.01395084
3   3  1  0.53529049
4   4  1 -0.87093221
5   1  2 -1.52127554
6   2  2  0.49778945
7   3  2  0.70950665
8   4  2 -0.08312489
9   1  3 -0.25446181
10  2  3 -0.68817295
11  3  3 -0.42830947
12  4  3  0.50097777
> aggregate(y1 ~ F4 + F3, data=dat,func)
   F4 F3          y1.n       y1.mean         y1.sd         y1.cv
1   1  1    2.00000000   -0.49269221    1.51843461 -308.19131500
2   2  1    2.00000000   -1.01395084    0.26606889  -26.24080813
3   3  1    2.00000000    0.53529049    1.18095197  220.61889335
4   4  1    2.00000000   -0.87093221    0.93649332 -107.52769414
5   1  2    2.00000000   -1.52127554    0.34838637  -22.90093824
6   2  2    2.00000000    0.49778945    0.84196201  169.14018691
7   3  2    2.00000000    0.70950665    0.04979171    7.01779324
8   4  2    2.00000000   -0.08312489    0.63601759 -765.13499750
9   1  3    2.00000000   -0.25446181    0.69727506 -274.01953526
10  2  3    2.00000000   -0.68817295    0.51304906  -74.55234353
11  3  3    2.00000000   -0.42830947    0.52188756 -121.84824327
12  4  3    2.00000000    0.50097777    1.11523596  222.61186859
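
Each F4 by F3 cell above has n = 2, so the standard deviations rest on only two values. A quick way to inspect cell sizes before aggregating is a cross-tabulation. This sketch rebuilds only the two factor columns, and the name cell_counts is an assumption.

```r
# Same factor layout as in the simulated data
dat <- data.frame(F3 = rep(1:3, 8), F4 = rep(1:4, 6))

# Count the rows in every combination of F4 (rows) and F3 (columns)
cell_counts <- table(F4 = dat$F4, F3 = dat$F3)
print(cell_counts)
```

With 24 rows spread evenly over 12 combinations, every cell holds exactly 2 observations.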

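4 Multiple variables ~ single factor

Note: with several response variables, combine them with cbind on the left-hand side of the formula.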

aggregate(cbind(y1,y2)~F4, data=dat, mean)
aggregate(cbind(y1,y2)~F4, data=dat, func)

Result

> aggregate(cbind(y1,y2)~F4, data=dat, mean)
  F4         y1          y2
1  1 -0.7561432 -0.02542152
2  2 -0.4014448 -0.07802558
3  3  0.2721626  0.53360303
4  4 -0.1510264 -0.55417728
> aggregate(cbind(y1,y2)~F4, data=dat, func)
  F4         y1.n      y1.mean        y1.sd        y1.cv          y2.n       y2.mean         y2.sd         y2.cv
1  1    6.0000000   -0.7561432    0.9722392 -128.5787188  6.000000e+00 -2.542152e-02  1.278144e+00 -5.027804e+03
2  2    6.0000000   -0.4014448    0.8455661 -210.6307286  6.000000e+00 -7.802558e-02  8.218860e-01 -1.053355e+03
3  3    6.0000000    0.2721626    0.7964707  292.6452101  6.000000e+00  5.336030e-01  7.616742e-01  1.427417e+02
4  4    6.0000000   -0.1510264    0.9403466 -622.6370275  6.000000e+00 -5.541773e-01  6.623361e-01 -1.195170e+02
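
One thing to watch with cbind: the formula method of aggregate uses na.action = na.omit by default, so a row with a missing value in any of the listed variables is dropped for all of them. The sketch below illustrates this with one artificial NA and shows the na.action = na.pass alternative. The seed, the injected NA, and the object names are assumptions for illustration.

```r
set.seed(1)  # arbitrary seed, assumed for reproducibility
dat <- data.frame(F4 = rep(1:4, 6), y1 = rnorm(24), y2 = rnorm(24))
dat$y2[1] <- NA  # inject one missing value; row 1 belongs to F4 == 1

# Default: row 1 is removed entirely, so the y1 mean for F4 == 1 also changes
default_res <- aggregate(cbind(y1, y2) ~ F4, data = dat, FUN = mean)

# Keep all rows and let mean() skip NAs per variable instead
pass_res <- aggregate(cbind(y1, y2) ~ F4, data = dat, FUN = mean,
                      na.rm = TRUE, na.action = na.pass)

full_mean <- mean(dat$y1[dat$F4 == 1])  # y1 mean using all six rows
print(c(default = default_res$y1[1], pass = pass_res$y1[1], full = full_mean))
```

With na.pass the y1 mean uses all six rows of the group, while the default silently uses five.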

5 Multiple variables ~ multiple factors

aggregate(cbind(y1,y2,y3)~F4+F3, data=dat, mean)
aggregate(cbind(y1,y2,y3)~F4+F3, data=dat, func)

Result

> aggregate(cbind(y1,y2,y3)~F4+F3, data=dat, mean)
   F4 F3          y1           y2         y3
1   1  1 -0.49269221  0.637086358  0.6607890
2   2  1 -1.01395084 -0.001333303  0.6262472
3   3  1  0.53529049  0.623854458 -0.5085037
4   4  1 -0.87093221 -0.624671988  0.3766036
5   1  2 -1.52127554 -1.066257310 -1.4224892
6   2  2  0.49778945  0.519382343 -0.3287215
7   3  2  0.70950665  1.013199705 -0.3207629
8   4  2 -0.08312489 -0.250626325 -0.3335704
9   1  3 -0.25446181  0.352906393  0.7028260
10  2  3 -0.68817295 -0.752125774  0.7474031
11  3  3 -0.42830947 -0.036245083 -0.7411611
12  4  3  0.50097777 -0.787233521  0.6237561
> aggregate(cbind(y1,y2,y3)~F4+F3, data=dat, func)
   F4 F3          y1.n       y1.mean         y1.sd         y1.cv          y2.n       y2.mean         y2.sd         y2.cv         y3.n
1   1  1    2.00000000   -0.49269221    1.51843461 -308.19131500  2.000000e+00  6.370864e-01  1.611507e+00  2.529495e+02    2.0000000
2   2  1    2.00000000   -1.01395084    0.26606889  -26.24080813  2.000000e+00 -1.333303e-03  1.181256e+00 -8.859624e+04    2.0000000
3   3  1    2.00000000    0.53529049    1.18095197  220.61889335  2.000000e+00  6.238545e-01  8.027349e-01  1.286734e+02    2.0000000
4   4  1    2.00000000   -0.87093221    0.93649332 -107.52769414  2.000000e+00 -6.246720e-01  1.039939e+00 -1.664777e+02    2.0000000
5   1  2    2.00000000   -1.52127554    0.34838637  -22.90093824  2.000000e+00 -1.066257e+00  1.534343e-01 -1.438999e+01    2.0000000
6   2  2    2.00000000    0.49778945    0.84196201  169.14018691  2.000000e+00  5.193823e-01  2.748874e-01  5.292583e+01    2.0000000
7   3  2    2.00000000    0.70950665    0.04979171    7.01779324  2.000000e+00  1.013200e+00  8.864974e-01  8.749483e+01    2.0000000
8   4  2    2.00000000   -0.08312489    0.63601759 -765.13499750  2.000000e+00 -2.506263e-01  1.715075e-01 -6.843157e+01    2.0000000
9   1  3    2.00000000   -0.25446181    0.69727506 -274.01953526  2.000000e+00  3.529064e-01  1.488957e+00  4.219126e+02    2.0000000
10  2  3    2.00000000   -0.68817295    0.51304906  -74.55234353  2.000000e+00 -7.521258e-01  5.217045e-01 -6.936400e+01    2.0000000
11  3  3    2.00000000   -0.42830947    0.52188756 -121.84824327  2.000000e+00 -3.624508e-02  5.871171e-01 -1.619853e+03    2.0000000
12  4  3    2.00000000    0.50097777    1.11523596  222.61186859  2.000000e+00 -7.872335e-01  8.830069e-01 -1.121658e+02    2.0000000
        y3.mean        y3.sd        y3.cv
1     0.6607890    0.0550999    8.3385012
2     0.6262472    0.2685284   42.8789796
3    -0.5085037    1.2668021 -249.1234928
4     0.3766036    0.4409712  117.0916257
5    -1.4224892    0.3688798  -25.9319910
6    -0.3287215    2.3734312 -722.0187593
7    -0.3207629    0.7608146 -237.1890640
8    -0.3335704    0.9798355 -293.7417173
9     0.7028260    0.8391045  119.3900697
10    0.7474031    0.8562105  114.5580645
11   -0.7411611    2.4142499 -325.7388963
12    0.6237561    0.3709630   59.4724411
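
When many response variables are involved, listing them all in cbind gets long. The formula method also accepts a dot on the left-hand side, meaning "every column not used on the right". A sketch under the assumption that the data are first subset to the wanted columns (the seed and the names sub, explicit and shorthand are illustrative):

```r
set.seed(1)  # arbitrary seed, assumed for reproducibility
dat <- data.frame(F1 = 1:24, F2 = rep(1:2, 12), F3 = rep(1:3, 8), F4 = rep(1:4, 6),
                  y1 = rnorm(24), y2 = rnorm(24), y3 = rnorm(24), y4 = rnorm(24))

# Explicit form, as used in this section
explicit <- aggregate(cbind(y1, y2, y3) ~ F4 + F3, data = dat, FUN = mean)

# Dot form: keep only the grouping columns and the variables to summarise
sub <- dat[, c("F3", "F4", "y1", "y2", "y3")]
shorthand <- aggregate(. ~ F4 + F3, data = sub, FUN = mean)

print(head(shorthand))
```

The subset step matters: without it the dot would also pull in F1, F2 and y4 as response variables.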

6 Plotting

6.1 Treating F1 as a time point, draw a line chart of y2

library(ggplot2)  # provides ggplot(), aes(), geom_line(), geom_point()
ggplot(dat,aes(x=F1,y=y2))+geom_line()

Result

[Figure 1: line chart of y2 against F1]

Add points

ggplot(dat,aes(x=F1,y=y2))+geom_line() + geom_point()

[Figure 2: line chart of y2 against F1 with points added]

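6.2 Line charts of y1, y2, y3 and y4, one colour per variable

Use melt from the reshape2 package to reshape the data from wide to long format, keeping columns 1 to 4 (F1 to F4) as identifiers.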

dd = reshape2::melt(dat,1:4,value.name="y")
head(dd)
ggplot(dd,aes(x=F1,y=y,colour=variable))+geom_line() + geom_point()

Result

> dd = reshape2::melt(dat,1:4,value.name="y")
> head(dd)
  F1 F2 F3 F4 variable           y
1  1  1  1  1       y1 -1.56638762
2  2  2  2  2       y1  1.09314649
3  3  1  3  3       y1 -0.05927923
4  4  2  1  4       y1 -0.20873143
5  5  1  2  1       y1 -1.27492917
6  6  2  3  2       y1 -1.05095342
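
If reshape2 is not installed, the same long layout can be built with base R. The sketch below uses stack and is offered as an alternative, not as the method used in this post. The seed and the names id_cols, stacked and long are assumptions.

```r
set.seed(1)  # arbitrary seed, assumed for reproducibility
dat <- data.frame(F1 = 1:24, F2 = rep(1:2, 12), F3 = rep(1:3, 8), F4 = rep(1:4, 6),
                  y1 = rnorm(24), y2 = rnorm(24), y3 = rnorm(24), y4 = rnorm(24))

# Repeat the identifier columns once per response variable (4 times)
id_cols <- dat[rep(seq_len(nrow(dat)), times = 4), 1:4]

# stack() puts all values in one column ("values") and the source name in "ind"
stacked <- stack(dat[, c("y1", "y2", "y3", "y4")])

long <- data.frame(id_cols, variable = stacked$ind, y = stacked$values)
rownames(long) <- NULL  # drop the duplicated row names created by the repetition
print(head(long))
```

The result has 96 rows (24 rows times 4 variables) and the same column names as the melt output, so it can be passed to the same ggplot call.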

[Figure 3: line charts of y1, y2, y3 and y4 against F1, coloured by variable]

Done!
