odds = p/(1-p)
Find how many standard deviations away from the mean of large_sample .18 is. Assign the result to deviations_from_mean.
Find how many probabilities in large_sample are greater than or equal to .18. Assign the result to over_18_count.
import numpy
large_sample_std = numpy.std(large_sample)
avg = numpy.mean(large_sample)
deviations_from_mean = (.18 - avg)/ large_sample_std
over_18_count = len([p for p in large_sample if p >= .18])
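The same two calculations can be sketched with only the standard library. This is a minimal illustration, not the course's solution: the helper names and the five-value example_sample are hypothetical stand-ins for large_sample, and statistics.pstdev is used because it computes the population standard deviation, which is what numpy.std returns by default.

```python
import statistics

def standard_deviations_from_mean(values, x):
    """Return how many population standard deviations x lies from the mean of values."""
    avg = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std, matching numpy.std's default
    return (x - avg) / std

def count_at_least(values, threshold):
    """Count the values that are greater than or equal to threshold."""
    return sum(1 for v in values if v >= threshold)

# Hypothetical stand-in for large_sample (not the course data).
example_sample = [0.10, 0.12, 0.14, 0.16, 0.18]

example_deviations = standard_deviations_from_mean(example_sample, 0.18)
example_count = count_at_least(example_sample, 0.18)
print(example_deviations, example_count)
```

A positive result means .18 lies above the mean; a negative one means it lies below.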
Sample counties
Use the select_random_sample function to pick 1000 random samples of 100 counties each from the income data. Find the mean of the median_income column for each sample.
Plot a histogram with 20 bins of all the mean median incomes.
import pandas as pd
import matplotlib.pyplot as plt
import random
income = pd.read_csv('us_income.csv')
# print(income.head())
# This is the mean median income across all US counties.
mean_median_income = income['median_income'].mean()
# Mean of median_income for one contiguous block of rows.
def get_sample_mean(start, end):
    return income['median_income'][start:end].mean()
# Sample in fixed steps: iterate over the rows,
# starting at 0 and counting in blocks of row_step
# (0, row_step, row_step*2, etc.).
def find_mean_incomes(row_step):
    mean_median_sample_incomes = []
    for i in range(0, income.shape[0], row_step):
        mean_median_sample_incomes.append(
            get_sample_mean(i, i + row_step))
    return mean_median_sample_incomes
non_random_sample = find_mean_incomes(100)
plt.hist(non_random_sample,20)
plt.show()
# What you're seeing above is the result of biased sampling.
# Instead of selecting randomly, we selected counties that were
# next to each other in the data.
# This picked counties in the same state more often than not,
# and created means that didn't represent the whole country.
# This is the danger of not using random sampling --
# you end up with samples that don't reflect
# the entire population.
# This gives you a distribution that isn't normal.
# Draw one random sample: pick `count` distinct row indices at random
# and return the matching rows of income.
def select_random_sample(count):
    # Pick `count` distinct row indices (100 of them when count is 100).
    random_indices = random.sample(range(0, income.shape[0]), count)
    # Return the rows at those indices.
    return income.iloc[random_indices]
# Use the select_random_sample function to pick 1000 random samples
# of 100 counties each from the income data. Find the mean of the
# median_income column for each sample.
random.seed(1)
# Make 1000 random samples.
random_sample = [select_random_sample(100)['median_income'].mean() for _ in range(1000)]
plt.hist(random_sample,20)
plt.show()
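To see why the contiguous blocks give a wider, less bell-shaped spread of means than the random samples, here is a small sketch on made-up data. Everything in it is an assumption for illustration: the population is synthetic (30 hypothetical "states" of 100 "counties" that share a base income), and block_means and random_sample_means are hypothetical helpers that mirror find_mean_incomes and select_random_sample without needing us_income.csv.

```python
import random
import statistics

random.seed(1)

# Synthetic population (an assumption, not the income data):
# counties in the same "state" sit next to each other and have similar incomes.
population = []
for state in range(30):
    base = random.uniform(30000, 70000)
    population.extend(base + random.gauss(0, 3000) for _ in range(100))

def block_means(values, step):
    """Means of consecutive blocks of `step` values (the biased approach)."""
    return [statistics.fmean(values[i:i + step])
            for i in range(0, len(values), step)]

def random_sample_means(values, sample_size, n_samples):
    """Means of `n_samples` random samples drawn without replacement."""
    return [statistics.fmean(random.sample(values, sample_size))
            for _ in range(n_samples)]

biased = block_means(population, 100)
unbiased = random_sample_means(population, 100, 1000)

# The block means track individual states, so they vary far more
# than the means of samples drawn from the whole population.
biased_spread = statistics.pstdev(biased)
unbiased_spread = statistics.pstdev(unbiased)
print(biased_spread, unbiased_spread)
```

In this toy setup each block is exactly one "state", so the block means are as spread out as the states themselves, while each random sample mixes counties from many states and lands close to the overall mean.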
An experiment
def select_random_sample(count):
    random_indices = random.sample(range(0, income.shape[0]), count)
    return income.iloc[random_indices]
# Select 1000 random samples of 100 counties each
# from the income data using the
# select_random_sample method.
random.seed(1)
mean_ratios=[]
# For each sample:
# Divide the median_income_hs column by
# median_income_college to get ratios.
# Then, find the mean of all the ratios in the sample.
# Add it to the list, mean_ratios.
for i in range(1000):
    sample = select_random_sample(100)
    ratios = sample['median_income_hs'] / sample['median_income_college']
    mean_ratios.append(ratios.mean())
plt.hist(mean_ratios,20)
plt.show()
Statistical significance
After 5 years, we determine that the mean ratio in our random sample of 100 counties is .675 -- that is, high school graduates on average earn 67.5% of what college graduates do.
Now that we have our result, how do we know if our hypothesis is correct? Remember, our hypothesis was about the whole population, not about the sample.
Statistical significance is used to judge whether a result seen in a sample is likely to hold for the population. You usually set a significance level beforehand; it is the threshold for deciding whether the result supports your hypothesis. After conducting the experiment, you check your result against the significance level to decide whether it is significant.
A common significance level is .05. This means: "only 5% or less of the time will the result have been due to chance".
In our case, chance could be that the high school graduates in the county changed income some way other than through our program -- maybe some higher paying factory jobs came to town, or there were some other educational initiatives around.
In order to test for significance, we compare our result ratio with the mean ratios we found in the last section.
Determine how many values in mean_ratios are greater than or equal to .675.
Divide by the total number of items in mean_ratios to get the significance value.
Assign the result to significance_value.
significance_value = None
mean_higher = len([m for m in mean_ratios if m >= .675])
significance_value = mean_higher / len(mean_ratios)
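The comparison against the significance level can be sketched as a small helper. This is an illustration under assumptions: empirical_significance is a hypothetical function name, and example_means is a made-up five-value list rather than the real mean_ratios.

```python
def empirical_significance(simulated_means, observed):
    """Fraction of simulated means that are at least as large as the observed value."""
    at_least = sum(1 for m in simulated_means if m >= observed)
    return at_least / len(simulated_means)

# Made-up values for illustration only (not the real mean_ratios).
example_means = [0.60, 0.62, 0.65, 0.66, 0.68]
example_value = empirical_significance(example_means, 0.675)

# A result counts as significant when the value is at or below the chosen level.
significance_level = 0.05
is_significant = example_value <= significance_level
print(example_value, is_significant)  # prints: 0.2 False
```

With the real data you would pass mean_ratios and .675 to the helper and compare the returned value with .05 in the same way.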
Final result
Our significance value was .014. Based on the entire population, a mean ratio of .675 or higher would have occurred on its own only 1.4% of the time. Since .014 is below our significance level of .05 (lower means more significant), the result is statistically significant. Thus, our experiment indicates that the program did improve the wages of high school graduates relative to college graduates.
You may have noticed earlier that the more samples in our trials, the "steeper" the histograms of outcomes get (look back on the probability of rolling one with the die if you need a refresher). This "steepness" arose because the more trials we have, the less likely the value is to vary from the "true" value.
This same principle applies to significance testing. You need a larger deviation from the mean to have something be "significant" if your sample size is smaller. The larger the trial, the smaller the deviation needs to be to get a significant result.
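The effect of sample size can be checked numerically. The sketch below uses a synthetic population of ratios (an assumption, generated with random.gauss, not the income data) and a hypothetical helper, spread_of_sample_means, to compare how much the sample means vary for samples of 100 and of 500. The standard deviation of a sample mean shrinks roughly in proportion to one over the square root of the sample size, so the larger samples should cluster more tightly.

```python
import random
import statistics

random.seed(1)

# Synthetic population of ratios (an assumption for illustration).
population = [random.gauss(0.65, 0.05) for _ in range(3000)]

def spread_of_sample_means(values, sample_size, n_samples=1000):
    """Standard deviation of the means of `n_samples` random samples."""
    means = [statistics.fmean(random.sample(values, sample_size))
             for _ in range(n_samples)]
    return statistics.pstdev(means)

spread_100 = spread_of_sample_means(population, 100)
spread_500 = spread_of_sample_means(population, 500)

# Larger samples give means that stay closer to the population mean,
# so a smaller deviation is enough to stand out.
print(spread_100, spread_500)
```

Because the means of the 500-item samples vary less, a given observed ratio sits more standard deviations away from the center, which is why a smaller deviation can be significant with a larger sample.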
You may be asking at this point how we can determine statistical significance without knowing the population values upfront. In a lot of cases, like drug trials, you don't have the capability to measure everyone in the world to compare against your sample.
Statistics gives us tools to deal with this, and we'll learn about them in the next missions.
# This is "steeper" than the graph from before, because it has 500 items in each sample.
random.seed(1)
mean_ratios = []
for i in range(1000):
    sample = select_random_sample(500)
    ratios = sample["median_income_hs"] / sample["median_income_college"]
    mean_ratios.append(ratios.mean())
plt.hist(mean_ratios, 20)
plt.show()