StatQuest study notes, +1. Argh, I really had misunderstood confidence intervals before. Video: https://www.bilibili.com/video/BV1vb411p7xE
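1. Bootstrapping
To understand confidence intervals better, first learn how to compute one (bootstrapping is one of several ways to compute a confidence interval).
A bootstrapping example:
1. Resample from the weights of 12 mice, drawing 12 values each time, with replacement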
weight = c(15.4,25.3,25.6,34.7,28.8,18.9,30.0,36.7,25.8,27.7,38.7,32.5)
n1 = sample(weight,12,replace = TRUE);n1
## [1] 18.9 25.3 28.8 25.8 15.4 18.9 25.8 18.9 25.3 27.7 36.7 28.8
2. Take the mean of the 12 resampled values
mean(n1)
## [1] 24.69167
3. Repeat the first two steps 10,000 times to get 10,000 means
(just to get the idea across, 100 is enough here)
ms = sapply(1:100, function(x){mean(sample(weight,12,replace = TRUE))});head(ms)
## [1] 31.60833 29.40833 26.21667 25.99167 29.39167 28.61667
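The full 10,000 repeats from step 3 are cheap to run as well. A minimal sketch, assuming `replicate()` in place of `sapply()`; the seed value and the name `boot_means` are my own choices for illustration, not from the video:

```r
# Mouse weights from the example above
weight <- c(15.4, 25.3, 25.6, 34.7, 28.8, 18.9, 30.0, 36.7, 25.8, 27.7, 38.7, 32.5)

set.seed(42)  # arbitrary seed (assumption), only so the run is reproducible

# Steps 1 and 2 repeated 10,000 times:
# resample 12 weights with replacement, then take the mean
boot_means <- replicate(10000, mean(sample(weight, length(weight), replace = TRUE)))

length(boot_means)  # number of bootstrap means
mean(boot_means)    # should sit close to mean(weight)
```

With this many repeats the histogram of `boot_means` is much smoother than with 100, which makes the interval cut-offs more stable.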
Plot the means from the repeated resampling:
library(ggplot2)
# Invisible points (alpha = 0) that only set up the axes, x running from 10 to 50
dat = data.frame(a = 10:50,b = -20:20)
p <- ggplot(dat,aes(x=a,y=b)) + geom_point(alpha = 0)+theme_bw()+
geom_segment(aes(x = 10, y = 0, xend = 50, yend = 0),
arrow = arrow(length = unit(0.5, "cm")))
a1 = p
# Add one vertical red line per bootstrapped mean
for(i in 1:100){
a1 = a1 + geom_vline(xintercept = ms[[i]],color = "red",size = 0.3,alpha = 0.5)
}
a1
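95% confidence interval: the interval that covers 95% of the bootstrapped means
99% confidence interval: the interval that covers 99% of the bootstrapped means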
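Those two definitions translate directly into quantiles of the bootstrapped means. A rough sketch of this percentile approach, assuming 10,000 repeats and an arbitrary seed; `boot_means`, `ci95` and `ci99` are names I made up here:

```r
weight <- c(15.4, 25.3, 25.6, 34.7, 28.8, 18.9, 30.0, 36.7, 25.8, 27.7, 38.7, 32.5)

set.seed(42)  # arbitrary seed (assumption)
boot_means <- replicate(10000, mean(sample(weight, length(weight), replace = TRUE)))

# 95% interval: drop the lowest 2.5% and the highest 2.5% of the means
ci95 <- quantile(boot_means, probs = c(0.025, 0.975))

# 99% interval: drop only 0.5% on each side, so it comes out wider
ci99 <- quantile(boot_means, probs = c(0.005, 0.995))

ci95
ci99
```

The 99% interval always contains the 95% one, since it keeps more of the means.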
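2. What confidence intervals are for
The 12 mice can be seen as one sample drawn from all the mice on Earth. The sample mean is an estimate of the population mean, so the population mean (true mean) most likely falls inside the confidence interval computed above (a 95% interval is built so that it captures the true mean about 95% of the time). A candidate value for the true mean that lies outside the 95% interval has a p-value below 0.05, that is, it is significantly different from what the sample supports.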
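To make that concrete, here is a small sketch that checks a few candidate values for the true mean against a bootstrapped 95% interval. The candidate values 20, 28 and 40 are made up for illustration, and the seed and variable names are my own assumptions:

```r
weight <- c(15.4, 25.3, 25.6, 34.7, 28.8, 18.9, 30.0, 36.7, 25.8, 27.7, 38.7, 32.5)

set.seed(42)  # arbitrary seed (assumption)
boot_means <- replicate(10000, mean(sample(weight, length(weight), replace = TRUE)))
ci95 <- quantile(boot_means, probs = c(0.025, 0.975))

# Hypothetical candidate values for the true mean (not from the video)
candidates <- c(20, 28, 40)

# TRUE means the candidate lies outside the 95% interval,
# i.e. it would count as significantly different at the 0.05 level
outside <- candidates < ci95[[1]] | candidates > ci95[[2]]
data.frame(candidates, outside)
```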
What if we have two groups of data?
Bring in another 12 mice.
weight2 = c(32.5,23.4,36.7,35.7,38.7,32.5,32.4,37.0,26.7,30.0,34.4,49.8)
ms = sapply(1:100, function(x){mean(sample(weight2,12,replace = TRUE))})
dat = data.frame(a = 10:50,b = -20:20)
p <- ggplot(dat,aes(x=a,y=b)) + geom_point(alpha = 0)+theme_bw()+
geom_segment(aes(x = 10, y = 0, xend = 50, yend = 0),
arrow = arrow(length = unit(0.5, "cm")))
p +geom_vline(aes(xintercept = ms[1]),color = "red",size = 0.3,alpha = 0.5)
a2 = p
for(i in 1:100){
a2 = a2 + geom_vline(xintercept = ms[[i]],color = "blue",size = 0.3,alpha = 0.5)
}
a2
library(patchwork)
a1/a2
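-When the 95% confidence intervals of the two groups do not overlap, the weights of the two groups differ significantly (the confidence intervals alone are enough to reach this conclusion)
-When the 95% confidence intervals of the two groups do overlap, there may still be a significant difference, and a t-test is needed to decide.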
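Both rules can be tried on the two groups of mice above. A minimal sketch, assuming percentile bootstrap intervals with 10,000 repeats and Welch's t-test (the default of `t.test()`); `boot_ci` is a helper I introduce here, not something from the video:

```r
weight  <- c(15.4, 25.3, 25.6, 34.7, 28.8, 18.9, 30.0, 36.7, 25.8, 27.7, 38.7, 32.5)
weight2 <- c(32.5, 23.4, 36.7, 35.7, 38.7, 32.5, 32.4, 37.0, 26.7, 30.0, 34.4, 49.8)

set.seed(42)  # arbitrary seed (assumption)

# Hypothetical helper: percentile bootstrap interval for the mean of x
boot_ci <- function(x, n_boot = 10000, level = 0.95) {
  means <- replicate(n_boot, mean(sample(x, length(x), replace = TRUE)))
  alpha <- 1 - level
  quantile(means, probs = c(alpha / 2, 1 - alpha / 2))
}

ci1 <- boot_ci(weight)
ci2 <- boot_ci(weight2)

# Two intervals overlap when each one starts before the other one ends
overlap <- ci1[[1]] <= ci2[[2]] && ci2[[1]] <= ci1[[2]]
overlap

# If they overlap, fall back on a t-test (Welch's two-sample test by default)
tt <- t.test(weight, weight2)
tt$p.value
```

For these two samples the intervals should overlap slightly while the t-test p-value still lands a little under 0.05, which is exactly the second case: overlap alone does not rule out a significant difference.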
Ah!