本文使用的包
library(tidyverse)
library(moderndive)
使用的数据集,总共有2400个红球和白色球:
bowl
此处采用简单随机抽样,从2400个球中估算出红球所占比例。采用不同的抽取方法,一组是一次性抽取30个,重复1000次;一组是一次性抽取50个,重复1000次。重复1000次后,红色球稳定的概率,我们可以近似估计2400个红色球的概率。
# 重复抽取1000次,每次取出30个
sampling_30 <- bowl %>%
rep_sample_n(size = 30, reps = 1000)
# 计算红球的比例
sampling_red_30 <- sampling_30 %>%
group_by(replicate) %>%
summarise(red = sum(color == "red")) %>%
mutate(prep_red = red / 30)
# 画出每次抽样时红色球的概率分布
ggplot(sampling_red_30, aes(x = prep_red)) +
geom_histogram(binwidth = 0.05, boundary = 0.5, color = "white") +
labs(x = "proportion of 30 balls that were red", title = "30")
结果如下:
抽取50个:
# 重复抽取1000次,每次抽取50个
sampling_50 <- bowl %>%
rep_sample_n(size = 50, reps = 1000)
# 计算红球的比例
sampling_red_50 <- sampling_50 %>%
group_by(replicate) %>%
summarise(red = sum(color == "red")) %>%
mutate(prep_red = red / 50)
# 画出每次抽样时红色球的概率分布
ggplot(sampling_red_50, aes(x = prep_red)) +
geom_histogram(binwidth = 0.05, boundary = 0.5, color = "white") +
labs(x = "proportion of 50 balls that were red", title = "50")
结果如下:
抽取100个:
# 重复抽取1000次,每次抽取100个
sampling_100 <- bowl %>%
rep_sample_n(size = 100, reps = 1000)
# 计算红球的比例
sampling_red_100 <- sampling_100 %>%
group_by(replicate) %>%
summarise(red = sum(color == "red")) %>%
mutate(prep_red = red / 100)
# 画出每次抽样时红色球的概率分布
ggplot(sampling_red_100, aes(x = prep_red)) +
geom_histogram(binwidth = 0.05, boundary = 0.5, color = "white") +
labs(x = "proportion of 100 balls that were red", title = "100")
可以看出,随着每次抽样数量的增加(30、50、100),红球出现的概率越来越稳定。三个直方图显示的红球概率都集中在40%附近。
我们在通过标准差衡量验证一下,标准差数值不断减少。
summarise(sampling_red_30, sd = sd(prep_red))
summarise(sampling_red_50, sd = sd(prep_red))
summarise(sampling_red_100, sd = sd(prep_red))
sd(30) sd(50) sd(100)
1 0.0890 0.0657 0.0479
参考资料:https://moderndive.netlify.app/7-sampling.html#using-the-virtual-shovel-once