First, let's take the mtcars dataset as an example and look at its first few rows. The main variables include miles per gallon (mpg), horsepower (hp), weight (wt), transmission type (am), and number of cylinders (cyl).
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
The main functions for descriptive statistics are summary(), sapply(), and aggregate().
summary() provides the minimum, maximum, mean, and quartiles for numeric variables, and frequency counts for factors.
1. summary()
> summary(mtcars)
mpg cyl disp hp drat
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080
Median :19.20 Median :6.000 Median :196.3 Median :123.0 Median :3.695
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 Mean :3.597
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930
wt qsec vs am gear
Min. :1.513 Min. :14.50 Min. :0.0000 Min. :0.0000 Min. :3.000
1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000
Median :3.325 Median :17.71 Median :0.0000 Median :0.0000 Median :4.000
Mean :3.217 Mean :17.85 Mean :0.4375 Mean :0.4062 Mean :3.688
3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000
Max. :5.424 Max. :22.90 Max. :1.0000 Max. :1.0000 Max. :5.000
carb
Min. :1.000
1st Qu.:2.000
Median :2.000
Mean :2.812
3rd Qu.:4.000
Max. :8.000
If we only want to look at a few of the variables:
> myvars<-c("mpg","hp","wt")
> summary(mtcars[myvars])
mpg hp wt
Min. :10.40 Min. : 52.0 Min. :1.513
1st Qu.:15.43 1st Qu.: 96.5 1st Qu.:2.581
Median :19.20 Median :123.0 Median :3.325
Mean :20.09 Mean :146.7 Mean :3.217
3rd Qu.:22.80 3rd Qu.:180.0 3rd Qu.:3.610
Max. :33.90 Max. :335.0 Max. :5.424
2. sapply(): define your own descriptive statistics
The format is sapply(x, FUN, options), where x is a data frame, FUN is an arbitrary function, and any options you specify are passed on to FUN. Typical choices for FUN include mean(), sd(), var(), min(), max(), and median(). You can also supply your own function; below we define one that computes skewness and kurtosis.
> mystats<-function(x,na.omit=FALSE){
+ if (na.omit)
+ x<-x[!is.na(x)]
+ m<-mean(x)
+ n<-length(x)
+ s<-sd(x)
+ skew<-sum((x-m)^3/s^3)/n
+ kurt<-sum((x-m)^4/s^4)/n-3
+ return(c(n=n,mean=m,stdev=s,skew=skew,kurtosis=kurt))
+ }
> myvars<-c("mpg","hp","wt")
> t<-sapply(mtcars[myvars],mystats)
> t
mpg hp wt
n 32.000000 32.0000000 32.00000000
mean 20.090625 146.6875000 3.21725000
stdev 6.026948 68.5628685 0.97845744
skew 0.610655 0.7260237 0.42314646
kurtosis -0.372766 -0.1355511 -0.02271075
Skewness reflects the symmetry of a distribution. A right-skewed (positively skewed) distribution trails off in a long tail to the right: most of the values lie on the left, with a small number of values stretched out to the right.
Kurtosis reflects how sharply peaked the distribution is: the larger the kurtosis, the sharper the central peak. With the variance held fixed, if a large share of the values cluster tightly around the center, then to match the variance of a normal distribution some values must lie far from the center. This is what is meant by "heavy tails", and it reflects an increase in outliers.
3. aggregate(): descriptive statistics by group
> aggregate(mtcars[myvars],by=list(am=mtcars$am),mean)
am mpg hp wt
1 0 17.14737 160.2632 3.768895
2 1 24.39231 126.8462 2.411000
For cars with a transmission indicator (am) of 0, the mean mpg is 17.14737.
list(am=mtcars$am) is used so that the grouping column gets the label am.
by is a list of grouping variables; the data are split according to these variables and the statistic is computed within each group.
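Because by takes a list, several grouping variables can be combined. A minimal sketch (output omitted), grouping by both am and cyl:
aggregate(mtcars[myvars], by=list(am=mtcars$am, cyl=mtcars$cyl), mean)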
The next examples use the Arthritis dataset from the vcd package, which contains data on a new treatment for rheumatoid arthritis.
> library(grid)
> library(vcd)
> head(Arthritis)
ID Treatment Sex Age Improved
1 57 Treated Male 27 Some
2 46 Treated Male 29 None
3 77 Treated Male 30 None
4 17 Treated Male 32 Marked
5 36 Treated Male 46 Marked
6 23 Treated Male 58 Marked
Use the table() and xtabs() functions to create contingency tables.
> attach(Arthritis)
> mytable<-table(Improved)
> mytable
Improved
None Some Marked
42 14 28
Here attach() is used; with() serves the same purpose.
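For comparison, the same table can be produced without attaching the data frame; a minimal sketch:
with(Arthritis, table(Improved))   # equivalent to the attach() version above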
prop.table() converts the frequencies into proportions:
> prop.table(mytable)
Improved
None Some Marked
0.5000000 0.1666667 0.3333333
Or multiply by 100 to turn the proportions into percentages, as shown below.
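A quick sketch (the values are simply the proportions above times 100):
prop.table(mytable)*100   # None becomes 50, Some about 16.7, Marked about 33.3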
Creating a two-dimensional contingency table
The format is table(A, B), where A is the row variable and B is the column variable.
In xtabs(~A+B, data=mydata), the variables to be cross-classified go on the right-hand side of the formula; a variable on the left-hand side is treated as a vector of frequencies (see the sketch after the output below).
> mydatas<-xtabs(~Treatment+Improved,data=Arthritis)
> mydatas
Improved
Treatment None Some Marked
Placebo 29 7 7
Treated 13 7 21
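To illustrate the left-hand-side case: as.data.frame() turns the table into a data frame with a Freq column of counts, and putting that column on the left of the formula rebuilds the same table. A minimal sketch (the name counts is just illustrative):
counts <- as.data.frame(mydatas)                 # columns Treatment, Improved, Freq
xtabs(Freq ~ Treatment + Improved, data=counts)  # reconstructs mydatas from the frequencies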
addmargins() adds the marginal sums:
> addmargins(mydatas)
Improved
Treatment None Some Marked Sum
Placebo 29 7 7 43
Treated 13 7 21 41
Sum 42 14 28 84
Marginal sums can also be computed with margin.table():
> mydatas
Improved
Treatment None Some Marked
Placebo 29 7 7
Treated 13 7 21
> margin.table(mydatas,2)
Improved
None Some Marked
42 14 28
To sum over the rows instead, just change the 2 to a 1, as in the sketch below.
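The row totals then match the Sum column from addmargins() above:
margin.table(mydatas,1)   # Placebo 43, Treated 41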
chisq.test() performs a chi-squared test of independence:
> mytable<-xtabs(~Treatment+Improved,data=Arthritis)
> chisq.test(mytable)
Pearson's Chi-squared test
data: mytable
X-squared = 13.055, df = 2, p-value = 0.001463
The p-value here is less than 0.01, so we reject the null hypothesis and conclude that treatment and level of improvement are related.
When expected cell counts are small (below 5), use Fisher's exact test instead of the chi-squared test; the function is fisher.test().
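A minimal sketch of the call on the same table (output omitted); the reported p-value is interpreted the same way as above:
fisher.test(mytable)   # Fisher's exact test on the Treatment x Improved table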
The most commonly used correlation coefficients are the Pearson and Spearman correlation coefficients.
The Pearson correlation coefficient measures the strength of the linear relationship between two variables.
The cor() function computes correlation coefficients, and cov() computes covariances.
About covariance:
Standard deviation and variance describe one-dimensional data, but in practice we often face multi-dimensional datasets, such as exam scores in several subjects. We could compute the variance of each dimension separately, but usually we also want to know whether two of the variables are related to each other; covariance measures the relationship between two random variables.
If the covariance is positive, the two variables are positively related (the correlation coefficient is derived from the covariance); if it is negative, they are negatively related; and if it is 0, they are uncorrelated.
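The Pearson correlation is just the covariance rescaled by the two standard deviations; a quick check in R (the choice of hp and wt here is only illustrative):
x <- mtcars$hp; y <- mtcars$wt
cov(x, y) / (sd(x) * sd(y))   # gives the same value as cor(x, y)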
Here we use the state.x77 dataset.
> head(state.x77)
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
California 21198 5114 1.1 71.71 10.3 62.6 20 156361
Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
Then select the first six columns of the dataset and compute the covariance matrix:
> state<-state.x77[,1:6]
> cov(state)
Population Income Illiteracy Life Exp Murder HS Grad
Population 19931683.7588 571229.7796 292.8679592 -407.8424612 5663.523714 -3551.509551
Income 571229.7796 377573.3061 -163.7020408 280.6631837 -521.894286 3076.768980
Illiteracy 292.8680 -163.7020 0.3715306 -0.4815122 1.581776 -3.235469
Life Exp -407.8425 280.6632 -0.4815122 1.8020204 -3.869480 6.312685
Murder 5663.5237 -521.8943 1.5817755 -3.8694804 13.627465 -14.549616
HS Grad -3551.5096 3076.7690 -3.2354694 6.3126849 -14.549616 65.237894
Compute the correlation matrix:
> cor(state,method = "pearson")
Population Income Illiteracy Life Exp Murder HS Grad
Population 1.00000000 0.2082276 0.1076224 -0.06805195 0.3436428 -0.09848975
Income 0.20822756 1.0000000 -0.4370752 0.34025534 -0.2300776 0.61993232
Illiteracy 0.10762237 -0.4370752 1.0000000 -0.58847793 0.7029752 -0.65718861
Life Exp -0.06805195 0.3402553 -0.5884779 1.00000000 -0.7808458 0.58221620
Murder 0.34364275 -0.2300776 0.7029752 -0.78084575 1.0000000 -0.48797102
HS Grad -0.09848975 0.6199323 -0.6571886 0.58221620 -0.4879710 1.00000000
Here income and high-school graduation rate are positively correlated, while illiteracy and life expectancy are negatively correlated.
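The Spearman rank correlation mentioned earlier is obtained the same way by changing the method argument; a minimal sketch (output omitted):
cor(state, method = "spearman")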
Testing correlations for significance
The usual null hypothesis is that the two variables are uncorrelated. Testing the correlation coefficient yields a p-value, which is compared with 0.01 or 0.05; if it is smaller than the chosen threshold, the null hypothesis is rejected.
> cor.test(state[,3],state[,5])
Pearson's product-moment correlation
data: state[, 3] and state[, 5]
t = 6.8479, df = 48, p-value = 1.258e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5279280 0.8207295
sample estimates:
cor
0.7029752
Here the null hypothesis is that illiteracy (state[,3]) and murder rate (state[,5]) are uncorrelated (Pearson r = 0). Since the p-value is less than 0.01, we reject the null hypothesis: the two variables are related.
The next examples use the PlantGrowth dataset.
First check whether the population is normally distributed.
One-sample t-test: shapiro.test() checks whether a sample is consistent with a normal distribution (a sketch of the t-test call itself follows the output below).
> weight <- PlantGrowth$weight[PlantGrowth$group=="ctrl"]
> weight
[1] 4.17 5.58 5.18 6.11 4.50 4.61 5.17 4.53 5.33 5.14
> shapiro.test(weight)
Shapiro-Wilk normality test
data: weight
W = 0.95668, p-value = 0.7475
> summary(weight)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.170 4.550 5.155 5.032 5.293 6.110
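Since the Shapiro-Wilk p-value (0.7475) is well above 0.05, normality is not rejected and the one-sample t-test can be applied. A minimal sketch (the null value mu = 5 is only an illustrative assumption):
t.test(weight, mu = 5)   # H0: the true mean weight equals 5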
If the population is not normally distributed, use the Wilcoxon signed-rank test, wilcox.test(), as sketched below.
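Its one-sample form takes the same kind of null value (again, mu = 5 is only illustrative):
wilcox.test(weight, mu = 5)   # Wilcoxon signed-rank test against mu = 5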
With two samples, first consider whether they are independent.
Independent-samples t-test
First check that each group is normally distributed, then check whether the variances are equal. Equality of variances is checked with an F test; the function is var.test().
> with(PlantGrowth,{
+ shapiro.test(weight[group=="ctrl"])
+ shapiro.test(weight[group=="trt1"])
+ var.test(weight[group=="ctrl"],
+ weight[group=="trt1"])
+
+
+ })
F test to compare two variances
data: weight[group == "ctrl"] and weight[group == "trt1"]
F = 0.53974, num df = 9, denom df = 9, p-value = 0.3719
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.1340645 2.1730025
sample estimates:
ratio of variances
0.5397431
Inside with(), only the value of the last expression is printed, which is why only the var.test() output appears above. Its p-value (0.3719) is greater than 0.05, so the variances can be treated as equal; the next step is the t-test itself, which gives the p-value we need:
> with(PlantGrowth,{
+ t.test(weight[group=="ctrl"],
+ weight[group=="trt1"],
+ var.equal = T )
+ })
Two Sample t-test
data: weight[group == "ctrl"] and weight[group == "trt1"]
t = 1.1913, df = 18, p-value = 0.249
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.2833003 1.0253003
sample estimates:
mean of x mean of y
5.032 4.661
When the samples are not normally distributed, use the Wilcoxon rank-sum test. (I did not use with() here, so the code looks a bit cluttered.)
> wilcox.test(PlantGrowth$weight[PlantGrowth$group=="ctrl"],
+ PlantGrowth$weight[PlantGrowth$group=="trt1"],
+ exact = F )
Wilcoxon rank sum test with continuity correction
data: PlantGrowth$weight[PlantGrowth$group == "ctrl"] and PlantGrowth$weight[PlantGrowth$group == "trt1"]
W = 67.5, p-value = 0.1986
alternative hypothesis: true location shift is not equal to 0
Dependent (paired) t-test: set paired = TRUE in t.test(). (The two PlantGrowth groups are not actually paired measurements; this example only illustrates the syntax.)
> with(PlantGrowth,{
+ t.test(weight[group=="ctrl"],
+ weight[group=="trt1"],
+ paired = T )
+ })
Paired t-test
data: weight[group == "ctrl"] and weight[group == "trt1"]
t = 0.99384, df = 9, p-value = 0.3463
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.4734609 1.2154609
sample estimates:
mean of the differences
0.371
To wrap up, a few notes on hypothesis testing.
Hypothesis testing is a method in mathematical statistics for drawing inferences about a population from a sample under stated assumptions.
Basic steps of hypothesis testing