Statistical Thinking in Python (Part 1)

Some basic techniques for exploratory data analysis (EDA)


  • Import matplotlib.pyplot and seaborn as their usual aliases (plt and sns).
  • Use seaborn to set the plotting defaults.
  • Plot a histogram of the Iris versicolor petal lengths using plt.hist() and the provided NumPy array versicolor_petal_length.
  • Show the histogram using plt.show().

# Import plotting modules
import matplotlib.pyplot as plt
import seaborn as sns


# Set default Seaborn style
sns.set()
# Plot histogram of versicolor petal lengths

plt.hist(versicolor_petal_length)


# Show histogram

plt.show()

  • Label the axes. Don't forget that you should always include units in your axis labels. Your y-axis label is just 'count'. Your x-axis label is 'petal length (cm)'. The units are essential!
  • Display the plot constructed in the above steps using plt.show().

    # Plot histogram of versicolor petal lengths
    _ = plt.hist(versicolor_petal_length)
    
    # Label axes
    plt.xlabel("petal length (cm)")
    plt.ylabel("count")
    
    # Show histogram
    plt.show()


A common rule of thumb: the number of histogram bins should be roughly the square root of the number of data points.

  • Import numpy as np. This gives access to the square root function, np.sqrt().
  • Determine how many data points you have using len().
  • Compute the number of bins using the square root rule.
  • Convert the number of bins to an integer using the built-in int() function.
  • Generate the histogram and make sure to use the bins keyword argument.
  • Hit 'Submit Answer' to plot the figure and see the fruit of your labors!

# Import numpy
import numpy as np

# Compute number of data points: n_data
n_data = len(versicolor_petal_length)

# Number of bins is the square root of number of data points: n_bins
n_bins = np.sqrt(n_data)

# Convert number of bins to integer: n_bins
n_bins = int(n_bins)

# Plot the histogram
plt.hist(versicolor_petal_length, bins=n_bins)

# Label axes
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('count')

# Show histogram
plt.show()


# Bee swarm plot of Democratic vote share (dem_share) by state for the swing-state data df_swing
_ = sns.swarmplot(x='state', y='dem_share', data=df_swing)
_ = plt.xlabel('state')
_ = plt.ylabel('percent of vote for Obama')
plt.show()

  • In the IPython Shell, inspect the DataFrame df using df.head(). This will let you identify which column names you need to pass as the x and y keyword arguments in your call to sns.swarmplot().
  • Use sns.swarmplot() to make a bee swarm plot from the DataFrame containing the Fisher iris data set, df. The x-axis should contain each of the three species, and the y-axis should contain the petal lengths.
  • Label the axes.
  • Show your plot.

# Create bee swarm plot with Seaborn's default settings
sns.swarmplot(x='species', y='petal length (cm)', data=df)

# Label the axes
plt.xlabel('species')
plt.ylabel('petal length (cm)')

# Show the plot
plt.show()

Define a function with the signature ecdf(data). Within the function definition,
  • Compute the number of data points, n, using the len() function.
  • The x-values are the sorted data. Use the np.sort() function to perform the sorting.
  • The y data of the ECDF go from 1/n to 1 in equally spaced increments. You can construct this using np.arange(). Remember, however, that the end value in np.arange() is not inclusive. Therefore, np.arange() will need to go from 1 to n+1. Be sure to divide this by n.
  • The function returns the values x and y.
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""

    # Number of data points: n
    n = len(data)

    # x-data for the ECDF: x
    x = np.sort(data)

    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n

    return x, y


  • Use ecdf() to compute the ECDF of versicolor_petal_length. Unpack the output into x_vers and y_vers.
  • Plot the ECDF as dots. Remember to include marker = '.' and linestyle = 'none' in addition to x_vers and y_vers as arguments inside plt.plot().
  • Set the margins of the plot with plt.margins() so that no data points are cut off. Use a 2% margin.
  • Label the axes. You can label the y-axis 'ECDF'.
  • Show your plot.


# Compute ECDF for versicolor data: x_vers, y_vers
x_vers, y_vers = ecdf(versicolor_petal_length)

# Generate plot
plt.plot(x_vers, y_vers, marker='.', linestyle='none')

# Make the margins nice
plt.margins(0.02)

# Label the axes
plt.ylabel('ECDF')
plt.xlabel('petal length (cm)')

# Display the plot
plt.show()

  • Compute ECDFs for each of the three species using your ecdf() function. The variables setosa_petal_length, versicolor_petal_length, and virginica_petal_length are all in your namespace. Unpack the ECDFs into x_set, y_set; x_vers, y_vers; and x_virg, y_virg, respectively.
  • Plot all three ECDFs on the same plot as dots. To do this, you will need three plt.plot() commands. Assign the result of each to _.
  • Specify 2% margins.
  • A legend and axis labels have been added for you, so hit 'Submit Answer' to see all the ECDFs!

# Compute ECDFs
x_set, y_set = ecdf(setosa_petal_length)
x_vers, y_vers = ecdf(versicolor_petal_length)
x_virg, y_virg = ecdf(virginica_petal_length)

# Plot all ECDFs on the same plot
_ = plt.plot(x_set, y_set, marker='.', linestyle='none')
_ = plt.plot(x_vers, y_vers, marker='.', linestyle='none')
_ = plt.plot(x_virg, y_virg, marker='.', linestyle='none')

# Make nice margins
plt.margins(0.02)

# Annotate the plot
plt.legend(('setosa', 'versicolor', 'virginica'), loc='lower right')
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')

# Display the plot
plt.show()

Computing percentiles of a DataFrame

  • Create percentiles, a NumPy array of the percentiles you want to compute: the 2.5th, 25th, 50th, 75th, and 97.5th. You can do so by creating a list containing these ints/floats and converting the list to a NumPy array using np.array(). For example, np.array([30, 50]) would create an array consisting of the 30th and 50th percentiles.
  • Use np.percentile() to compute the percentiles of the petal lengths from the Iris versicolor samples. The variable versicolor_petal_length is in your namespace.
  • Print the percentiles.
# Specify array of percentiles: percentiles
percentiles = np.array([2.5, 25, 50, 75, 97.5])

# Compute percentiles: ptiles_vers
ptiles_vers = np.percentile(versicolor_petal_length, percentiles)

# Print the result
print(ptiles_vers)
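
The heading above mentions computing percentiles on a DataFrame; np.percentile() works on the NumPy array, but the same values can be read straight from a pandas column with quantile(), which takes fractions rather than percentages. A minimal sketch, assuming the iris DataFrame df from the swarm-plot example above and that its species column uses the label 'versicolor':

# Select the versicolor rows and compute the same percentiles directly on the DataFrame column
# (pandas' quantile() expects fractions in [0, 1], not percentages)
versicolor_lengths = df.loc[df['species'] == 'versicolor', 'petal length (cm)']
ptiles_df = versicolor_lengths.quantile([0.025, 0.25, 0.5, 0.75, 0.975])
print(ptiles_df)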


Plot the ECDF and mark the percentiles on it:

  • Plot the percentiles as red diamonds on the ECDF. Pass the x and y coordinates - ptiles_vers and percentiles/100 - as positional arguments and specify the marker='D', color='red', and linestyle='none' keyword arguments. The argument for the y-axis - percentiles/100 - has been specified for you.
  • Display the plot.

# Plot the ECDF
_ = plt.plot(x_vers, y_vers, '.')
plt.margins(0.02)
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')

# Overlay percentiles as red diamonds.
_ = plt.plot(ptiles_vers, percentiles/100, marker='D', color='red',
         linestyle='none')

# Show the plot
plt.show()


Computing the variance

  • Create an array called differences that is the difference between the petal lengths (versicolor_petal_length) and the mean petal length. The variable versicolor_petal_length is already in your namespace as a NumPy array so you can take advantage of NumPy's vectorized operations.
  • Square each element in this array. For example, x**2 squares each element in the array x. Store the result as diff_sq.
  • Compute the mean of the elements in diff_sq using np.mean(). Store the result as variance_explicit.
  • Compute the variance of versicolor_petal_length using np.var(). Store the result as variance_np.
  • Print both variance_explicit and variance_np in one print call to make sure they are consistent.

# Array of differences to mean: differences
differences = versicolor_petal_length - np.mean(versicolor_petal_length)

# Square the differences: diff_sq
diff_sq = differences ** 2

# Compute the mean square difference: variance_explicit
variance_explicit = np.mean(diff_sq)

# Compute the variance using NumPy: variance_np
variance_np = np.var(versicolor_petal_length)

# Print the results
print(variance_explicit, variance_np)


Computing the Pearson correlation coefficient

  • Define a function with signature pearson_r(x, y).
    • Use np.corrcoef() to compute the correlation matrix of x and y (pass them to np.corrcoef() in that order).
    • The function returns entry [0,1] of the correlation matrix.
  • Compute the Pearson correlation between the data in the arrays versicolor_petal_length and versicolor_petal_width. Assign the result to r.
  • Print the result.
def pearson_r(x, y):
    """Compute Pearson correlation coefficient between two arrays."""
    # Compute correlation matrix: corr_mat
    corr_mat = np.corrcoef(x, y)

    # Return entry [0,1]
    return corr_mat[0,1]

# Compute Pearson correlation coefficient for I. versicolor: r
r = pearson_r(versicolor_petal_length, versicolor_petal_width)

# Print the result
print(r)

  • Draw samples out of the Binomial distribution using np.random.binomial(). You should use parameters n = 100 and p = 0.05, and set the size keyword argument to 10000.
  • Compute the CDF using your previously-written ecdf()function.
  • Plot the CDF with axis labels. The x-axis here is the number of defaults out of 100 loans, while the y-axis is the CDF.
  • Show the plot.

# Take 10,000 samples out of the binomial distribution: n_defaults
n_defaults = np.random.binomial(100, 0.05, size = 10000)


# Compute CDF: x, y
x, y = ecdf(n_defaults)


# Plot the CDF with axis labels
plt.plot(x, y, marker = '.', linestyle = 'none' )
plt.xlabel('the number of defaults out of 100 loans')
plt.ylabel('CDF')

# Show the plot
plt.show()


Constructing and plotting Normal distributions

  • Draw 100,000 samples from a Normal distribution that has a mean of 20 and a standard deviation of 1. Do the same for Normal distributions with standard deviations of 3 and 10, each still with a mean of 20. Assign the results to samples_std1, samples_std3, and samples_std10, respectively.
  • Plot a histogram of each of the samples; for each, use 100 bins, also using the keyword arguments normed=True and histtype='step'. The latter keyword argument makes the plot look much like the smooth theoretical PDF. You will need to make 3 plt.hist() calls.
  • Hit 'Submit Answer' to make a legend, showing which standard deviations you used, and show your plot! There is no need to label the axes because we have not defined what is being described by the Normal distribution; we are just looking at shapes of PDFs.
    # Draw 100,000 samples from Normal distributions with stds of interest: samples_std1, samples_std3, samples_std10
    samples_std1 = np.random.normal(20,1,100000)
    samples_std3 = np.random.normal(20,3,100000)
    samples_std10 = np.random.normal(20,10,100000)

    # Make histograms
    plt.hist(samples_std1, normed=True, histtype='step', bins=100)
    plt.hist(samples_std3, normed=True, histtype='step', bins=100)
    plt.hist(samples_std10, normed=True, histtype='step', bins=100)

    # Make a legend, set limits and show plot
    _ = plt.legend(('std = 1', 'std = 3', 'std = 10'))
    plt.ylim(-0.01, 0.42)
    plt.show()

  
Fitting a Normal distribution to the data and comparing it with the original data

  • Compute mean and standard deviation of Belmont winners' times with the two outliers removed. The NumPy array belmont_no_outliers has these data.
  • Take 10,000 samples out of a normal distribution with this mean and standard deviation using np.random.normal().
  • Compute the CDF of the theoretical samples and the ECDF of the Belmont winners' data, assigning the results to x_theor, y_theor and x, y, respectively.
  • Hit submit to plot the CDF of your samples with the ECDF, label your axes and show the plot.

# Compute mean and standard deviation: mu, sigma
mu = np.mean(belmont_no_outliers)
sigma = np.std(belmont_no_outliers)

# Sample out of a normal distribution with this mu and sigma: samples
samples = np.random.normal(mu, sigma, 10000)

# Get the CDF of the samples and of the data
x_theor, y_theor = ecdf(samples)
x, y = ecdf(belmont_no_outliers)

# Plot the CDFs and show the plot
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle='none')
plt.margins(0.02)
_ = plt.xlabel('Belmont winning time (sec.)')
_ = plt.ylabel('CDF')
plt.show()

The Exponential distribution

Define a function with call signature successive_poisson(tau1, tau2, size=1) that samples the waiting time for a no-hitter and a hit of the cycle.

  • Draw waiting times tau1 (size number of samples) for the no-hitter out of an exponential distribution and assign to t1.
  • Draw waiting times tau2 (size number of samples) for hitting the cycle out of an exponential distribution and assign to t2.
  • The function returns the sum of the waiting times for the two events.

def successive_poisson(tau1, tau2, size=1):
    """Compute waiting time for a no-hitter followed by hitting the cycle."""
    # Draw samples out of first exponential distribution: t1
    t1 = np.random.exponential(tau1, size)

    # Draw samples out of second exponential distribution: t2
    t2 = np.random.exponential(tau2, size)

    return t1 + t2

  • Use your successive_poisson() function to draw 100,000 samples out of the distribution of waiting times for observing a no-hitter and a hitting of the cycle.
  • Plot the PDF of the waiting times using the step histogram technique of a previous exercise. Don't forget the necessary keyword arguments. You should use bins=100, normed=True, and histtype='step'.
  • Label the axes.
  • Show your plot.

# Draw samples of waiting times: waiting_times
waiting_times = successive_poisson(764, 715, size=100000)

# Make the histogram
plt.hist(waiting_times, bins=100, normed=True, histtype='step')

# Label axes
plt.xlabel('total waiting time (games)')
plt.ylabel('PDF')

# Show the plot
plt.show()

Linear regression with np.polyfit()

  • Compute the slope and intercept of the regression line using np.polyfit(). Remember, fertility is on the y-axis and illiteracy on the x-axis.
  • Print out the slope and intercept from the linear regression.
  • To plot the best fit line, create an array x that consists of 0 and 100 using np.array(). Then, compute the theoretical values of y based on your regression parameters. I.e., y = a * x + b.
  • Plot the data and the regression line on the same plot. Be sure to label your axes.
  • Hit 'Submit Answer' to display your plot.

# Plot the illiteracy rate versus fertility
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')
plt.margins(0.02)
_ = plt.xlabel('percent illiterate')
_ = plt.ylabel('fertility')

# Perform a linear regression using np.polyfit(): a, b
a, b = np.polyfit(illiteracy, fertility,1)

# Print the results to the screen
print('slope =', a, 'children per woman / percent illiterate')
print('intercept =', b, 'children per woman')

# Make theoretical line to plot
x = np.array([0,100])
y = a * x + b

# Add regression line to your plot
_ = plt.plot(x, y)

# Draw the plot
plt.show()

  • Specify the values of the slope for which to compute the RSS. Use np.linspace() to get 200 points in the range between 0 and 0.1. For example, to get 100 points in the range between 0 and 0.5, you could use np.linspace() like so: np.linspace(0, 0.5, 100).
  • Initialize an array, rss, to contain the RSS using np.empty_like() and the array you created above. The empty_like() function returns a new array with the same shape and type as a given array (in this case, a_vals).
  • Write a for loop to compute the sum of RSS of the slope. Hint: the RSS is given by np.sum((y_data - a * x_data - b)**2). The variable b you computed in the last exercise is already in your namespace. Here, fertility is the y_data and illiteracy the x_data.
  • Plot the RSS (rss) versus slope (a_vals).
  • Hit 'Submit Answer' to see the plot!


# Specify slopes to consider: a_vals
a_vals = np.linspace(0, 0.1, 200)

# Initialize sum of square of residuals: rss
rss = np.empty_like(a_vals)

# Compute sum of square of residuals for each value of a_vals
for i, a in enumerate(a_vals):
    rss[i] = np.sum((fertility - a*illiteracy - b)**2)

# Plot the RSS
plt.plot(a_vals, rss, '-')
plt.xlabel('slope (children per woman / percent illiterate)')
plt.ylabel('sum of square of residuals')

plt.show()


  • Compute the parameters for the slope and intercept using np.polyfit(). The Anscombe data are stored in the arrays x and y.
  • Print the slope a and intercept b.
  • Generate theoretical x and y data from the linear regression. Your x array, which you can create with np.array(), should consist of 3 and 15. To generate the y data, multiply the slope by x_theor and add the intercept.
  • Plot the Anscombe data as a scatter plot and then plot the theoretical line. Remember to include the marker='.' and linestyle='none' keyword arguments in addition to x and y when you plot the Anscombe data as a scatter plot. You do not need these arguments when plotting the theoretical line.
  • Hit 'Submit Answer' to see the plot!
# Perform linear regression: a, b
a, b = np.polyfit(x,y,1)

# Print the slope and intercept
print(a, b)

# Generate theoretical x and y data: x_theor, y_theor
x_theor = np.array([3, 15])
y_theor = x_theor * a + b

# Plot the Anscombe data and theoretical line
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x,y,marker='.',linestyle='none')

# Label the axes
plt.xlabel('x')
plt.ylabel('y')

# Show the plot
plt.show()


Bootstrap resampling when the sample size is small

  • Write a for loop to acquire 50 bootstrap samples of the rainfall data and plot their ECDF.
    • Use np.random.choice() to generate a bootstrap sample from the NumPy array rainfall. Be sure that the size of the resampled array is len(rainfall).
    • Use the function ecdf() that you wrote in the prequel to this course to generate the x and y values for the ECDF of the bootstrap sample bs_sample.
    • Plot the ECDF values. Specify color='gray' (to make gray dots) and alpha=0.1 (to make them semi-transparent, since we are overlaying so many) in addition to the marker='.' and linestyle='none' keyword arguments.
  • Use ecdf() to generate x and y values for the ECDF of the original rainfall data available in the array rainfall.
  • Plot the ECDF values of the original data.
  • Hit 'Submit Answer' to visualize the samples!

for _ in range(50):
    # Generate bootstrap sample: bs_sample
    bs_sample = np.random.choice(rainfall, size=len(rainfall))

    # Compute and plot ECDF from bootstrap sample
    x, y = ecdf(bs_sample)
    _ = plt.plot(x, y, marker='.', linestyle='none',
                 color='gray', alpha=0.1)

# Compute and plot ECDF from original data
x, y = ecdf(rainfall)
_ = plt.plot(x, y, marker='.')

# Make margins and label axes
plt.margins(0.02)
_ = plt.xlabel('yearly rainfall (mm)')
_ = plt.ylabel('ECDF')

# Show the plot
plt.show()


  • Draw 10000 bootstrap replicates of the mean annual rainfall using your draw_bs_reps() function and the rainfall array. Hint: Pass in np.mean for func to compute the mean.
    • As a reminder, draw_bs_reps() accepts 3 arguments: data, func, and size. (A minimal sketch of this helper is included after this list.)
  • Compute and print the standard error of the mean of rainfall.
    • The formula to compute this is np.std(data) / np.sqrt(len(data)).
  • Compute and print the standard deviation of your bootstrap replicates bs_replicates.
  • Make a histogram of the replicates using the normed=True keyword argument and 50 bins.
  • Hit 'Submit Answer' to see the plot!
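
draw_bs_reps() is assumed to already be in the namespace (it comes from an earlier exercise and is not shown in these notes). A minimal sketch of one way to write it, resampling with replacement via np.random.choice() and applying func to each bootstrap sample:

def draw_bs_reps(data, func, size=1):
    """Draw bootstrap replicates of func applied to resampled data."""
    bs_replicates = np.empty(size)
    for i in range(size):
        # Resample with replacement, then summarize with func (e.g. np.mean)
        bs_replicates[i] = func(np.random.choice(data, size=len(data)))
    return bs_replicates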

# Take 10,000 bootstrap replicates of the mean: bs_replicates
bs_replicates = draw_bs_reps(rainfall, np.mean, 10000)

# Compute and print SEM
sem = np.std(rainfall) / np.sqrt(len(rainfall))
print(sem)

# Compute and print standard deviation of bootstrap replicates
bs_std = np.std(bs_replicates)
print(bs_std)

# Make a histogram of the results
_ = plt.hist(bs_replicates, bins=50, normed=True)
_ = plt.xlabel('mean annual rainfall (mm)')
_ = plt.ylabel('PDF')

# Show the plot
plt.show()

Computing confidence intervals

  • Generate 10,000 bootstrap replicates of τ from the nohitter_times data using your draw_bs_reps() function. Recall that the optimal τ is calculated as the mean of the data.
  • Compute the 95% confidence interval using np.percentile() and passing in two arguments: The array bs_replicates, and the list of percentiles - in this case 2.5 and 97.5.
  • Print the confidence interval.
  • Plot a histogram of your bootstrap replicates. This has been done for you, so hit 'Submit Answer' to see the plot!

# Draw bootstrap replicates of the mean no-hitter time (equal to tau): bs_replicates
bs_replicates = draw_bs_reps(nohitter_times,np.mean,10000)

# Compute the 95% confidence interval: conf_int
conf_int = np.percentile(bs_replicates,[2.5,97.5])

# Print the confidence interval
print('95% confidence interval =', conf_int, 'games')

# Plot the histogram of the replicates
_ = plt.hist(bs_replicates, bins=50, normed=True)
_ = plt.xlabel(r'$\tau$ (games)')
_ = plt.ylabel('PDF')

# Show the plot
plt.show()

Pairs bootstrap resampling and linear regression

Define a function with call signature draw_bs_pairs_linreg(x, y, size=1) to perform pairs bootstrap estimates on linear regression parameters.

  • Use np.arange() to set up an array of indices going from 0 to len(x). These are what you will resample and use to pick values out of the x and y arrays.
  • Use np.empty() to initialize the slope and intercept replicate arrays to be of size size.
  • Write a for loop to:
    • Resample the indices inds. Use np.random.choice() to do this.
    • Make new x and y arrays bs_x and bs_y using the resampled indices bs_inds. To do this, slice x and y with bs_inds.
    • Use np.polyfit() on the new x and y arrays and store the computed slope and intercept.
  • Return the pair bootstrap replicates of the slope and intercept.

def draw_bs_pairs_linreg(x, y, size=1):
    """Perform pairs bootstrap for linear regression."""

    # Set up array of indices to sample from: inds
    inds = np.arange(len(x))

    # Initialize replicates: bs_slope_reps, bs_intercept_reps
    bs_slope_reps = np.empty(size)
    bs_intercept_reps = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_inds = np.random.choice(inds, size=len(inds))
        bs_x, bs_y = x[bs_inds], y[bs_inds]
        bs_slope_reps[i], bs_intercept_reps[i] = np.polyfit(bs_x, bs_y, 1)

    return bs_slope_reps, bs_intercept_reps

  • Generate an array of x-values consisting of 0 and 100 for the plot of the regression lines. Use the np.array() function for this.
  • Write a for loop in which you plot a regression line with a slope and intercept given by the pairs bootstrap replicates. Do this for 100 lines.
    • When plotting the regression lines in each iteration of the for loop, recall the regression equation y = a*x + b. Here, a is bs_slope_reps[i] and b is bs_intercept_reps[i]; generating these replicate arrays is sketched just after this list.
    • Specify the keyword arguments linewidth=0.5, alpha=0.2, and color='red' in your call to plt.plot().
  • Make a scatter plot with illiteracy on the x-axis and fertility on the y-axis. Remember to specify the marker='.' and linestyle='none' keyword arguments.
  • Label the axes, set a 2% margin, and show the plot. This has been done for you, so hit 'Submit Answer' to visualize the bootstrap regressions!
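
The plotting loop below assumes the replicate arrays bs_slope_reps and bs_intercept_reps already exist. They would come from a prior call to the draw_bs_pairs_linreg() function defined above; the replicate count of 1000 here is an assumption, not part of the original notes:

# Generate pairs bootstrap replicates of slope and intercept (1000 replicates is an assumed size)
bs_slope_reps, bs_intercept_reps = draw_bs_pairs_linreg(illiteracy, fertility, size=1000)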

# Generate array of x-values for bootstrap lines: x
x = np.array([0,100])

# Plot the bootstrap lines
for i in range(100):
    _ = plt.plot(x, bs_slope_reps[i]*x + bs_intercept_reps[i],
                 linewidth=0.5, alpha=0.2, color='red')

# Plot the data
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')

# Label axes, set the margins, and show the plot
_ = plt.xlabel('illiteracy')
_ = plt.ylabel('fertility')
plt.margins(0.02)
plt.show()

Visualizing permutation sampling


  • Write a for loop to generate 50 permutation samples, compute their ECDFs, and plot them.
    • Generate a permutation sample pair from rain_july and rain_november using your permutation_sample() function (a minimal sketch of this helper follows this list).
    • Generate the x and y values for the ECDF of each of the two permutation samples using your ecdf() function.
    • Plot the ECDF of the first permutation sample (x_1 and y_1) as dots. Do the same for the second permutation sample (x_2 and y_2).
  • Generate x and y values for ECDFs for the rain_july and rain_november data and plot the ECDFs using respectively the keyword arguments color='red' and color='blue'.
  • Label your axes, set a 2% margin, and show your plot. This has been done for you, so just hit 'Submit Answer' to view the plot!
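
permutation_sample() is assumed to be defined already (it comes from an earlier exercise). A minimal sketch of the usual approach: concatenate the two arrays, permute the pooled data, and split it back at the original boundary:

def permutation_sample(data1, data2):
    """Generate a permutation sample from two data sets."""
    # Pool the data, scramble it, and split at the size of the first data set
    permuted_data = np.random.permutation(np.concatenate((data1, data2)))
    perm_sample_1 = permuted_data[:len(data1)]
    perm_sample_2 = permuted_data[len(data1):]
    return perm_sample_1, perm_sample_2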

for _ in range(50):
    # Generate permutation samples
    perm_sample_1, perm_sample_2 = permutation_sample(rain_july, rain_november)

    # Compute ECDFs
    x_1, y_1 = ecdf(perm_sample_1)
    x_2, y_2 = ecdf(perm_sample_2)

    # Plot ECDFs of permutation sample
    _ = plt.plot(x_1, y_1, marker='.', linestyle='none',
                 color='red', alpha=0.02)
    _ = plt.plot(x_2, y_2, marker='.', linestyle='none',
                 color='blue', alpha=0.02)

# Create and plot ECDFs from original data
x_1, y_1 = ecdf(rain_july)
x_2, y_2 = ecdf(rain_november)
_ = plt.plot(x_1, y_1, marker='.', linestyle='none', color='red')
_ = plt.plot(x_2, y_2, marker='.', linestyle='none', color='blue')

# Label axes, set margin, and show plot
plt.margins(0.02)
_ = plt.xlabel('monthly rainfall (mm)')
_ = plt.ylabel('ECDF')
plt.show()

Computing the p-value:

  • Define a function with call signature diff_of_means(data_1, data_2) that returns the difference in means of two data sets: mean of data_1 minus mean of data_2.
  • Use this function to compute the empirical difference of means that was observed in the frogs.
  • Draw 10,000 permutation replicates of the difference of means (a minimal sketch of the draw_perm_reps() helper used for this follows this list).
  • Compute the p-value.
  • Print the p-value.
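
draw_perm_reps() is also assumed to be available from an earlier exercise. A minimal sketch, built on the permutation_sample() helper sketched above, that applies func (here diff_of_means) to each permuted pair:

def draw_perm_reps(data_1, data_2, func, size=1):
    """Generate permutation replicates of func applied to permuted data sets."""
    perm_replicates = np.empty(size)
    for i in range(size):
        # Scramble the pooled data into two new samples and summarize with func
        perm_sample_1, perm_sample_2 = permutation_sample(data_1, data_2)
        perm_replicates[i] = func(perm_sample_1, perm_sample_2)
    return perm_replicates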

def diff_of_means(data_1, data_2):
    """Difference in means of two arrays."""

    # The difference of means of data_1, data_2: diff
    diff = np.mean(data_1) - np.mean(data_2)

    return diff

# Compute difference of mean impact force from experiment: empirical_diff_means
empirical_diff_means = diff_of_means(force_a, force_b)

# Draw 10,000 permutation replicates: perm_replicates
perm_replicates = draw_perm_reps(force_a, force_b,
                                 diff_of_means, size=10000)

# Compute p-value: p
p = np.sum( perm_replicates >= empirical_diff_means) / len(perm_replicates)

# Print the result
print('p-value =', p)


  • Translate the impact forces of Frog B such that its mean is 0.55 N.
  • Use your draw_bs_reps() function to take 10,000 bootstrap replicates of the mean of your translated forces.
  • Compute the p-value by finding the fraction of your bootstrap replicates that are less than the observed mean impact force of Frog B. Note that the variable of interest here is force_b.
  • Print your p-value.

# Make an array of translated impact forces: translated_force_b
translated_force_b = force_b - np.mean(force_b) + 0.55

# Take bootstrap replicates of Frog B's translated impact forces: bs_replicates
bs_replicates = draw_bs_reps(translated_force_b, np.mean, 10000)

# Compute fraction of replicates that are less than the observed Frog B force: p
p = np.sum(bs_replicates <= np.mean(force_b)) / 10000

# Print the p-value
print('p = ', p)

  • Construct Boolean arrays, dems and reps that contain the votes of the respective parties; e.g., dems has 153 True entries and 91 False entries.
  • Write a function, frac_yay_dems(dems, reps), that returns the fraction of Democrats that voted yay. The first input is an array of Booleans. Two inputs are required to use your draw_perm_reps() function, but the second is not used.
  • Use your draw_perm_reps() function to draw 10,000 permutation replicates of the fraction of Democrat yay votes.
  • Compute and print the p-value.

# Construct arrays of data: dems, reps
dems = np.array([True] * 153 + [False] * 91)
reps = np.array([True] * 136 + [False] * 35)

def frac_yay_dems(dems, reps):
    """Compute fraction of Democrat yay votes."""
    frac = np.sum(dems) / len(dems)
    return frac

# Acquire permutation samples: perm_replicates
perm_replicates = draw_perm_reps(dems, reps, frac_yay_dems, 10000)

# Compute and print p-value: p
p = np.sum(perm_replicates <= 153/244) / len(perm_replicates)
print('p-value =', p)


  • Compute the observed difference in mean inter-no-hitter time using diff_of_means().
  • Generate 10,000 permutation replicates of the difference of means using draw_perm_reps().
  • Compute and print the p-value.

# Compute the observed difference in mean inter-no-hitter times: nht_diff_obs
nht_diff_obs = diff_of_means(nht_dead, nht_live)

# Acquire 10,000 permutation replicates of difference in mean no-hitter time: perm_replicates
perm_replicates = draw_perm_reps(nht_dead, nht_live,
                                 diff_of_means, size=10000)

# Compute and print the p-value: p
p = np.sum(perm_replicates <= nht_diff_obs) / len(perm_replicates)
print('p-val =',p)


  • Compute the observed Pearson correlation between illiteracy and fertility.
  • Initialize an array to store your permutation replicates.
  • Write a for loop to draw 10,000 replicates:
    • Permute the illiteracy measurements using np.random.permutation().
    • Compute the Pearson correlation between the permuted illiteracy array, illiteracy_permuted, and fertility.
  • Compute and print the p-value from the replicates.

# Compute observed correlation: r_obs
r_obs = pearson_r(illiteracy, fertility)

# Initialize permutation replicates: perm_replicates
perm_replicates = np.empty(10000)

# Draw replicates
for i in range(10000):
    # Permute illiteracy measurements: illiteracy_permuted
    illiteracy_permuted = np.random.permutation(illiteracy)

    # Compute Pearson correlation
    perm_replicates[i] = pearson_r(illiteracy_permuted, fertility)

# Compute p-value: p
p = np.sum(perm_replicates >= r_obs) / len(perm_replicates)
print('p-val =', p)


  • Use your ecdf() function to generate x,y values from the control and treated arrays for plotting the ECDFs.
  • Plot the ECDFs on the same plot.
  • The margins have been set for you, along with the legend and axis labels. Hit 'Submit Answer' to see the result!

# Compute x,y values for ECDFs
x_control, y_control = ecdf(control)
x_treated, y_treated = ecdf(treated)

# Plot the ECDFs
plt.plot(x_control, y_control, marker='.', linestyle='none')
plt.plot(x_treated, y_treated, marker='.', linestyle='none')

# Set the margins
plt.margins(0.02)

# Add a legend
plt.legend(('control', 'treated'), loc='lower right')

# Label axes and show plot
plt.xlabel('millions of alive sperm per mL')
plt.ylabel('ECDF')
plt.show()


  • Compute the mean alive sperm count of control minus that of treated.
  • Compute the mean of all alive sperm counts. To do this, first concatenate control and treated and take the mean of the concatenated array.
  • Generate shifted data sets for both control and treated such that the shifted data sets have the same mean. This has already been done for you.
  • Generate 10,000 bootstrap replicates of the mean each for the two shifted arrays. Use your draw_bs_reps()function.
  • Compute the bootstrap replicates of the difference of means.
  • The code to compute and print the p-value has been written for you. Hit 'Submit Answer' to see the result!

# Compute the difference in mean sperm count: diff_means
diff_means = np.mean(control) - np.mean(treated)

# Compute mean of pooled data: mean_count
mean_count = np.mean(np.concatenate((control, treated)))

# Generate shifted data sets
control_shifted = control - np.mean(control) + mean_count
treated_shifted = treated - np.mean(treated) + mean_count

# Generate bootstrap replicates
bs_reps_control = draw_bs_reps(control_shifted,
                               np.mean, size=10000)
bs_reps_treated = draw_bs_reps(treated_shifted,
                               np.mean, size=10000)

# Get replicates of difference of means: bs_replicates
bs_replicates = bs_reps_control - bs_reps_treated

# Compute and print p-value: p
p = np.sum(bs_replicates >= np.mean(control) - np.mean(treated)) \
            / len(bs_replicates)
print('p-value =', p)


