R - 简单随机抽样

本文使用的包

library(tidyverse)
library(moderndive)

使用的数据集,总共有2400个红球和白色球:

bowl

此处采用简单随机抽样,从2400个球中估算出红球所占比例。采用不同的抽取方法,一组是一次性抽取30个,重复1000次;一组是一次性抽取50个,重复1000次。重复1000次后,红色球稳定的概率,我们可以近似估计2400个红色球的概率。

# 重复抽取1000次,每次取出30个
sampling_30 <- bowl %>%
  rep_sample_n(size = 30, reps = 1000)

# 计算红球的比例
sampling_red_30 <- sampling_30 %>%
  group_by(replicate) %>%
  summarise(red = sum(color == "red")) %>%
  mutate(prep_red = red / 30)

# 画出每次抽样时红色球的概率分布
ggplot(sampling_red_30, aes(x = prep_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.5, color = "white") + 
  labs(x = "proportion of 30 balls that were red", title = "30")

结果如下:

R - 简单随机抽样_第1张图片

抽取50个:

# 重复抽取1000次,每次抽取50个
sampling_50 <- bowl %>%
  rep_sample_n(size = 50, reps = 1000)

# 计算红球的比例
sampling_red_50 <- sampling_50 %>%
  group_by(replicate) %>%
  summarise(red = sum(color == "red")) %>%
  mutate(prep_red = red / 50)

# 画出每次抽样时红色球的概率分布
ggplot(sampling_red_50, aes(x = prep_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.5, color = "white") + 
  labs(x = "proportion of 50 balls that were red", title = "50")

结果如下:

R - 简单随机抽样_第2张图片

抽取100个:

# 重复抽取1000次,每次抽取100个
sampling_100 <- bowl %>%
  rep_sample_n(size = 100, reps = 1000)

# 计算红球的比例
sampling_red_100 <- sampling_100 %>%
  group_by(replicate) %>%
  summarise(red = sum(color == "red")) %>%
  mutate(prep_red = red / 100)

# 画出每次抽样时红色球的概率分布
ggplot(sampling_red_100, aes(x = prep_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.5, color = "white") + 
  labs(x = "proportion of 100 balls that were red", title = "100")

R - 简单随机抽样_第3张图片

可以看出,随着每次抽样数量的增加(30、50、100),红球出现的概率越来越稳定。三个直方图显示的红球概率都集中在40%附近。

我们在通过标准差衡量验证一下,标准差数值不断减少。

summarise(sampling_red_30, sd = sd(prep_red))
summarise(sampling_red_50, sd = sd(prep_red))
summarise(sampling_red_100, sd = sd(prep_red))
      sd(30)  sd(50)   sd(100)
             
1 0.0890      0.0657   0.0479

参考资料:https://moderndive.netlify.app/7-sampling.html#using-the-virtual-shovel-once

你可能感兴趣的:(R,r语言,机器学习,深度学习)