Some basic techniques for exploratory data analysis (EDA)
Import matplotlib.pyplot and seaborn as their usual aliases (plt and sns).
Use seaborn to set the plotting defaults.
Plot a histogram of the Iris versicolor petal lengths using plt.hist() and the provided NumPy array versicolor_petal_length.
Show the histogram using plt.show().
# Import plotting modules
import matplotlib.pyplot as plt
import seaborn as sns
# Set default Seaborn style
sns.set()
# Plot histogram of versicolor petal lengths
plt.hist(versicolor_petal_length)
# Show histogram
plt.show()
Label the axes. The y-axis label is 'count'. The x-axis label is 'petal length (cm)'. The units are essential!
Display the plot using plt.show().
# Plot histogram of versicolor petal lengths
_ = plt.hist(versicolor_petal_length)
# Label axes
plt.xlabel("petal length (cm)")
plt.ylabel("count")
# Show histogram
plt.show()
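As a rule of thumb, the number of histogram bins is usually set to the square root of the number of data points.
Import numpy as np. This gives access to the square root function, np.sqrt().
Determine the number of data points using len().
Compute the number of bins using the square root rule, then convert it to an integer using the built-in int() function.
Generate the histogram, passing the number of bins through the bins keyword argument.
# Import numpy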
import numpy as np
# Compute number of data points: n_data
n_data = len(versicolor_petal_length)
# Number of bins is the square root of number of data points: n_bins
n_bins = np.sqrt(n_data)
# Convert number of bins to integer: n_bins
n_bins = int(n_bins)
# Plot the histogram
plt.hist(versicolor_petal_length, bins=n_bins)
# Label axes
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('count')
# Show histogram
plt.show()
_ = sns.swarmplot(x='state', y='dem_share', data=df_swing)
_ = plt.xlabel('state')
_ = plt.ylabel('percent of vote for Obama')
plt.show()
Inspect the DataFrame df using df.head(). This will let you identify which column names you need to pass as the x and y keyword arguments in your call to sns.swarmplot().
Use sns.swarmplot() to make a bee swarm plot from the DataFrame containing the Fisher iris data set, df. The x-axis should contain each of the three species, and the y-axis should contain the petal lengths.

Define a function with the signature ecdf(data). Within the function definition:
  Compute the number of data points, n, using the len() function.
  The x-values are the sorted data. Use the np.sort() function to perform the sorting.
  The y-values of the ECDF go from 1/n to 1 in equally spaced increments. You can construct this using np.arange(). Remember, however, that the end value in np.arange() is not inclusive. Therefore, np.arange() will need to go from 1 to n+1. Be sure to divide this by n.
  The function returns the values x and y.
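The notes describe ecdf() step by step but never show its body. A minimal sketch that follows those steps is given here; the sample values are made up for illustration and are not the iris measurements.

```python
import numpy as np


def ecdf(data):
    """Compute the ECDF of a one-dimensional array of measurements."""
    # Number of data points
    n = len(data)
    # x-values: the sorted data
    x = np.sort(data)
    # y-values: 1/n, 2/n, ..., 1 (np.arange excludes its end value, hence n + 1)
    y = np.arange(1, n + 1) / n
    return x, y


# Illustrative values only (assumption), not the real petal lengths
sample = np.array([4.7, 4.5, 4.9, 4.0])
x_demo, y_demo = ecdf(sample)
```

With four data points the y-values step up by 1/4, so the largest observation always sits at a height of exactly 1.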
Use ecdf() to compute the ECDF of versicolor_petal_length. Unpack the output into x_vers and y_vers.
Plot the ECDF as dots. Pass marker='.' and linestyle='none' in addition to x_vers and y_vers as arguments inside plt.plot().
Use plt.margins() so that no data points are cut off. Use a 2% margin.
Label the axes. The y-axis label is 'ECDF'.
Compute ECDFs for each of the three species using your ecdf() function. The variables setosa_petal_length, versicolor_petal_length, and virginica_petal_length are all in your namespace. Unpack the ECDFs into x_set, y_set, x_vers, y_vers and x_virg, y_virg, respectively.
Plot all three ECDFs on the same plot as dots. To do this, you will need three plt.plot() commands. Assign the result of each to _.
Computing quantiles for a DataFrame
Create percentiles, a NumPy array of percentiles you want to compute. These are the 2.5th, 25th, 50th, 75th, and 97.5th. You can do so by creating a list containing these ints/floats and converting the list to a NumPy array using np.array(). For example, np.array([30, 50]) would create an array consisting of the 30th and 50th percentiles.
Use np.percentile() to compute the percentiles of the petal lengths from the Iris versicolor samples. The variable versicolor_petal_length is in your namespace.
Plotting the ECDF and marking the percentile points:
Plot the percentiles as red diamonds on the ECDF. Pass the x and y coordinates, ptiles_vers and percentiles/100, as positional arguments and specify the marker='D', color='red' and linestyle='none' keyword arguments. The argument for the y-axis, percentiles/100, has been specified for you.
# Plot the ECDF
_ = plt.plot(x_vers, y_vers, '.')
plt.margins(0.02)
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')
# Overlay percentiles as red diamonds.
_ = plt.plot(ptiles_vers, percentiles/100, marker='D', color='red',
linestyle='none')
# Show the plot
plt.show()
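The plotting code above assumes that percentiles and the computed percentile values already exist. A minimal sketch of that step is given here, using stand-in data (the integers 1 to 101) in place of versicolor_petal_length; the names data_demo, ptiles_demo and ecdf_heights are illustrative only.

```python
import numpy as np

# Stand-in data (assumption): the notes use versicolor_petal_length here
data_demo = np.arange(1, 102)  # integers 1..101

# Percentiles to compute, as a NumPy array
percentiles = np.array([2.5, 25, 50, 75, 97.5])

# Percentile values of the data (default linear interpolation)
ptiles_demo = np.percentile(data_demo, percentiles)

# Heights at which the percentile markers sit on an ECDF plot
ecdf_heights = percentiles / 100
```

Dividing by 100 converts percentiles into the 0-to-1 scale of the ECDF's y-axis, which is why the overlay uses percentiles/100.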
Computing the variance
Create an array called differences that is the difference between the petal lengths (versicolor_petal_length) and the mean petal length. The variable versicolor_petal_length is already in your namespace as a NumPy array so you can take advantage of NumPy's vectorized operations.
Square each element in this array. For example, x**2 squares each element in the array x. Store the result as diff_sq.
Compute the mean of the elements in diff_sq using np.mean(). Store the result as variance_explicit.
Compute the variance of versicolor_petal_length using np.var(). Store the result as variance_np.
Print both variance_explicit and variance_np in one print call to make sure they are consistent.
# Array of differences to mean: differences
differences = versicolor_petal_length - np.mean(versicolor_petal_length)
# Square the differences: diff_sq
diff_sq = differences ** 2
# Compute the mean square difference: variance_explicit
variance_explicit = np.mean(diff_sq)
# Compute the variance using NumPy: variance_np
variance_np = np.var(versicolor_petal_length)
# Print the results
print(variance_explicit,variance_np)
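Computing the Pearson correlation coefficient
Define a function with signature pearson_r(x, y).
  Use np.corrcoef() to compute the correlation matrix of x and y (pass them to np.corrcoef() in that order).
  The function returns entry [0,1] of the correlation matrix.
Compute the Pearson correlation between the data in the arrays versicolor_petal_length and versicolor_petal_width. Assign the result to r.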
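The notes describe pearson_r() without showing it. A minimal sketch under that description is given here; the arrays length_demo and width_demo are made-up, perfectly linear data standing in for the petal measurements.

```python
import numpy as np


def pearson_r(x, y):
    """Compute the Pearson correlation coefficient between two arrays."""
    # 2x2 correlation matrix; the diagonal entries are 1
    corr_mat = np.corrcoef(x, y)
    # Off-diagonal entry [0, 1] is the correlation between x and y
    return corr_mat[0, 1]


# Made-up data (assumption): width is an exact linear function of length
length_demo = np.array([1.0, 2.0, 3.0, 4.0])
width_demo = 2.0 * length_demo + 1.0
r_demo = pearson_r(length_demo, width_demo)
```

Because the demo data lie exactly on a line with positive slope, r_demo comes out as 1 up to floating-point error; real measurements would give a value strictly between -1 and 1.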
Take 10,000 samples using np.random.binomial(). You should use parameters n = 100 and p = 0.05, and set the size keyword argument to 10000.
Compute the CDF using your previously written ecdf() function.
# Take 10,000 samples out of the binomial distribution: n_defaults
n_defaults = np.random.binomial(100, 0.05, size = 10000)
# Compute CDF: x, y
x, y = ecdf(n_defaults)
# Plot the CDF with axis labels
plt.plot(x, y, marker = '.', linestyle = 'none' )
plt.xlabel('the number of defaults out of 100 loans')
plt.ylabel('CDF')
# Show the plot
plt.show()
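Building and plotting the Normal distribution
Draw samples from a Normal distribution that has a mean of 20 and a standard deviation of 1. Do the same for Normal distributions with standard deviations of 3 and 10, each still with a mean of 20. Assign the results to samples_std1, samples_std3 and samples_std10, respectively.
Plot a histogram of each set of samples using the keyword arguments density=True (called normed=True in older Matplotlib releases) and histtype='step'. The latter keyword argument makes the plot look much like the smooth theoretical PDF. You will need to make 3 plt.hist() calls.
Compute the mean and standard deviation of the Belmont winning times with outliers removed. The array belmont_no_outliers has these data.
Take 10,000 samples out of a Normal distribution with this mean and standard deviation using np.random.normal().
Compute the CDF of the theoretical samples and the ECDF of the Belmont data, assigning the results to x_theor, y_theor and x, y, respectively.
# Compute mean and standard deviation: mu, sigma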
mu = np.mean(belmont_no_outliers)
sigma = np.std(belmont_no_outliers)
# Sample out of a normal distribution with this mu and sigma: samples
samples = np.random.normal(mu,sigma,10000)
# Get the CDF of the samples and of the data
x_theor,y_theor = ecdf(samples)
x, y = ecdf(belmont_no_outliers)
# Plot the CDFs and show the plot
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle='none')
plt.margins(0.02)
_ = plt.xlabel('Belmont winning time (sec.)')
_ = plt.ylabel('CDF')
plt.show()
The Exponential distribution
Define a function with call signature successive_poisson(tau1, tau2, size=1) that samples the waiting time for a no-hitter and a hit of the cycle.
  Draw waiting times with mean tau1 (size number of samples) for the no-hitter out of an exponential distribution and assign to t1.
  Draw waiting times with mean tau2 (size number of samples) for hitting the cycle out of an exponential distribution and assign to t2.
  The function returns the sum of the two waiting times.
def successive_poisson(tau1, tau2, size=1):
    # Draw samples out of first exponential distribution: t1
    t1 = np.random.exponential(tau1, size)
    # Draw samples out of second exponential distribution: t2
    t2 = np.random.exponential(tau2, size)
    return t1 + t2
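Use your successive_poisson() function to draw 100,000 samples out of the distribution of waiting times for observing a no-hitter and a hitting of the cycle.
Plot the PDF of the waiting times using plt.hist() with the keyword arguments bins=100, density=True (called normed=True in older Matplotlib releases), and histtype='step'.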
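The sampling and binning step just described can be sketched without drawing anything. The mean waiting times below (3 and 5) are hypothetical placeholders, since the notes do not state the values used, and np.histogram stands in for plt.hist to show the normalized bin heights that a density histogram would display.

```python
import numpy as np


def successive_poisson(tau1, tau2, size=1):
    """Total waiting time for two successive Poisson processes."""
    # Waiting time for the first event (mean tau1)
    t1 = np.random.exponential(tau1, size)
    # Waiting time for the second event (mean tau2)
    t2 = np.random.exponential(tau2, size)
    return t1 + t2


np.random.seed(42)

# Hypothetical mean waiting times (assumption, not from the notes)
tau1_demo, tau2_demo = 3.0, 5.0

# Draw 100,000 total waiting times
waiting_times = successive_poisson(tau1_demo, tau2_demo, size=100000)

# Normalized histogram with 100 bins: bar areas sum to 1, like a PDF
pdf_demo, bin_edges = np.histogram(waiting_times, bins=100, density=True)
```

The sample mean of waiting_times should be close to tau1_demo + tau2_demo, since the expected total wait is the sum of the two expected waits.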
Polynomial fitting
Compute the slope and intercept of the regression line using np.polyfit(). Remember, fertility is on the y-axis and illiteracy on the x-axis.
To plot the best fit line, create an array x that consists of 0 and 100 using np.array(). Then, compute the theoretical values of y based on your regression parameters, i.e., y = a * x + b.
# Plot the illiteracy rate versus fertility
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')
plt.margins(0.02)
_ = plt.xlabel('percent illiterate')
_ = plt.ylabel('fertility')
# Perform a linear regression using np.polyfit(): a, b
a, b = np.polyfit(illiteracy, fertility,1)
# Print the results to the screen
print('slope =', a, 'children per woman / percent illiterate')
print('intercept =', b, 'children per woman')
# Make theoretical line to plot
x = np.array([0,100])
y = a * x + b
# Add regression line to your plot
_ = plt.plot(x, y)
# Draw the plot
plt.show()
Specify the values of the slope for which to compute the RSS. Use np.linspace() to get 200 points in the range between 0 and 0.1. For example, to get 100 points in the range between 0 and 0.5, you could use np.linspace() like so: np.linspace(0, 0.5, 100).
Initialize an array, rss, to contain the RSS using np.empty_like() and the array you created above. The empty_like() function returns a new array with the same shape and type as a given array (in this case, a_vals).
Write a for loop to compute the RSS for each slope. Hint: the RSS is given by np.sum((y_data - a * x_data - b)**2). The variable b you computed in the last exercise is already in your namespace. Here, fertility is the y_data and illiteracy the x_data.
Plot the RSS (rss) versus slope (a_vals).
# Specify slopes to consider: a_vals
a_vals = np.linspace(0, 0.1, 200)
# Initialize sum of square of residuals: rss
rss = np.empty_like(a_vals)
# Compute sum of square of residuals for each value of a_vals
for i, a in enumerate(a_vals):
    rss[i] = np.sum((fertility - a*illiteracy - b)**2)
# Plot the RSS
plt.plot(a_vals, rss, '-')
plt.xlabel('slope (children per woman / percent illiterate)')
plt.ylabel('sum of square of residuals')
plt.show()
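Compute the parameters for the slope and intercept using np.polyfit(). The Anscombe data are stored in the arrays x and y.
Print the slope a and intercept b.
Generate theoretical x and y data from the linear regression. Your x array, which you can create with np.array(), should consist of 3 and 15. To generate the y data, multiply the slope by x_theor and add the intercept.
Plot the Anscombe data as a scatter plot and then plot the theoretical line. Remember to include the marker='.' and linestyle='none' keyword arguments in addition to x and y when plotting the Anscombe data as a scatter plot. You do not need these arguments when plotting the theoretical line.
Resampling when the sample is small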
Write a for loop to acquire 50 bootstrap samples of the rainfall data and plot their ECDF.
  Use np.random.choice() to generate a bootstrap sample from the NumPy array rainfall. Be sure that the size of the resampled array is len(rainfall).
  Use the function ecdf() that you wrote in the prequel to this course to generate the x and y values for the ECDF of the bootstrap sample bs_sample.
  Plot the ECDF values. Specify color='gray' (to make gray dots) and alpha=0.1 (to make them semi-transparent, since we are overlaying so many) in addition to the marker='.' and linestyle='none' keyword arguments.
Use ecdf() to generate x and y values for the ECDF of the original rainfall data available in the array rainfall.
Plot the ECDF values of the original data.
for _ in range(50):
    # Generate bootstrap sample: bs_sample
    bs_sample = np.random.choice(rainfall, size=len(rainfall))
    # Compute and plot ECDF from bootstrap sample
    x, y = ecdf(bs_sample)
    _ = plt.plot(x, y, marker='.', linestyle='none',
                 color='gray', alpha=0.1)
# Compute and plot ECDF from original data
x, y = ecdf(rainfall)
_ = plt.plot(x, y, marker='.')
# Make margins and label axes
plt.margins(0.02)
_ = plt.xlabel('yearly rainfall (mm)')
_ = plt.ylabel('ECDF')
# Show the plot
plt.show()
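Draw 10000 bootstrap replicates of the mean annual rainfall using your draw_bs_reps() function and the rainfall array. Hint: Pass in np.mean for func to compute the mean. As a reminder, draw_bs_reps() accepts 3 arguments: data, func, and size.
Compute and print the standard error of the mean of rainfall. The formula to compute this is np.std(data) / np.sqrt(len(data)).
Compute and print the standard deviation of your bootstrap replicates bs_replicates.
Make a histogram of the replicates using the density=True keyword argument (called normed=True in older Matplotlib releases) and 50 bins.
# Take 10,000 bootstrap replicates of the mean: bs_replicates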
bs_replicates = draw_bs_reps(rainfall, np.mean, 10000)
# Compute and print SEM
sem = np.std(rainfall) / np.sqrt(len(rainfall))
print(sem)
# Compute and print standard deviation of bootstrap replicates
bs_std = np.std(bs_replicates)
print(bs_std)
# Make a histogram of the results (density=True supersedes the older normed=True)
_ = plt.hist(bs_replicates, bins=50, density=True)
_ = plt.xlabel('mean annual rainfall (mm)')
_ = plt.ylabel('PDF')
# Show the plot
plt.show()
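The code above calls draw_bs_reps(data, func, size) without showing its definition. A minimal sketch consistent with that call signature is given here; the helper bootstrap_replicate_1d() and the synthetic rain_demo data are assumptions for illustration and are not taken from the notes.

```python
import numpy as np


def bootstrap_replicate_1d(data, func):
    """Resample data with replacement and apply func to the resample."""
    bs_sample = np.random.choice(data, size=len(data))
    return func(bs_sample)


def draw_bs_reps(data, func, size=1):
    """Draw `size` bootstrap replicates of the statistic `func`."""
    bs_replicates = np.empty(size)
    for i in range(size):
        bs_replicates[i] = bootstrap_replicate_1d(data, func)
    return bs_replicates


np.random.seed(0)

# Synthetic stand-in for the rainfall array (assumption)
rain_demo = np.random.normal(800.0, 100.0, size=200)

# Bootstrap replicates of the mean, and the analytic standard error
reps_demo = draw_bs_reps(rain_demo, np.mean, size=2000)
sem_demo = np.std(rain_demo) / np.sqrt(len(rain_demo))
```

The standard deviation of reps_demo should land close to sem_demo, which is the comparison the exercise above makes between bs_std and sem.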
Computing confidence intervals
Generate 10000 bootstrap replicates of τ from the nohitter_times data using your draw_bs_reps() function. Recall that the optimal τ is calculated as the mean of the data.
Compute the 95% confidence interval using np.percentile() and passing in two arguments: the array bs_replicates, and the list of percentiles, in this case 2.5 and 97.5.
# Draw bootstrap replicates of the mean no-hitter time (equal to tau): bs_replicates
bs_replicates = draw_bs_reps(nohitter_times,np.mean,10000)
# Compute the 95% confidence interval: conf_int
conf_int = np.percentile(bs_replicates,[2.5,97.5])
# Print the confidence interval
print('95% confidence interval =', conf_int, 'games')
# Plot the histogram of the replicates (density=True supersedes the older normed=True)
_ = plt.hist(bs_replicates, bins=50, density=True)
_ = plt.xlabel(r'$\tau$ (games)')
_ = plt.ylabel('PDF')
# Show the plot
plt.show()
Resampling the data repeatedly and fitting a linear regression
Define a function with call signature draw_bs_pairs_linreg(x, y, size=1) to perform pairs bootstrap estimates on linear regression parameters.
  Use np.arange() to set up an array of indices going from 0 to len(x). These are what you will resample and use to pick values out of the x and y arrays.
  Use np.empty() to initialize the slope and intercept replicate arrays to be of size size.
  Write a for loop to:
    Resample the indices inds. Use np.random.choice() to do this.
    Make new x and y arrays bs_x and bs_y using the resampled indices bs_inds. To do this, slice x and y with bs_inds.
    Use np.polyfit() on the new x and y arrays and store the computed slope and intercept.
Generate an array of x-values consisting of 0 and 100 for the plot of the regression lines. Use the np.array() function for this.
Write a for loop in which you plot a regression line with a slope and intercept given by the pairs bootstrap replicates. Do this for 100 lines.
  When plotting the regression lines in each iteration of the for loop, recall the regression equation y = a*x + b. Here, a is bs_slope_reps[i] and b is bs_intercept_reps[i].
  Specify the keyword arguments linewidth=0.5, alpha=0.2, and color='red' in your call to plt.plot().
Make a scatter plot with illiteracy on the x-axis and fertility on the y-axis. Remember to specify the marker='.' and linestyle='none' keyword arguments.
# Generate array of x-values for bootstrap lines: x
x = np.array([0,100])
# Plot the bootstrap lines
for i in range(100):
    _ = plt.plot(x, bs_slope_reps[i]*x + bs_intercept_reps[i],
                 linewidth=0.5, alpha=0.2, color='red')
# Plot the data
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')
# Label axes, set the margins, and show the plot
_ = plt.xlabel('illiteracy')
_ = plt.ylabel('fertility')
plt.margins(0.02)
plt.show()
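The plot above relies on bs_slope_reps and bs_intercept_reps, which come from the draw_bs_pairs_linreg() function described earlier but not shown. A minimal sketch following those steps is given here; the x_demo and y_demo arrays are synthetic (a noisy line with slope 0.05 and intercept 2) and are assumptions, not the illiteracy and fertility data.

```python
import numpy as np


def draw_bs_pairs_linreg(x, y, size=1):
    """Perform pairs bootstrap for linear regression."""
    # Indices to resample, so that (x, y) pairs stay together
    inds = np.arange(len(x))

    # Arrays that will hold the replicates
    bs_slope_reps = np.empty(size)
    bs_intercept_reps = np.empty(size)

    for i in range(size):
        # Resample indices with replacement and pick out the pairs
        bs_inds = np.random.choice(inds, size=len(inds))
        bs_x, bs_y = x[bs_inds], y[bs_inds]
        # Degree-1 fit returns (slope, intercept)
        bs_slope_reps[i], bs_intercept_reps[i] = np.polyfit(bs_x, bs_y, 1)

    return bs_slope_reps, bs_intercept_reps


np.random.seed(1)

# Synthetic data (assumption): y = 0.05 * x + 2 plus Gaussian noise
x_demo = np.linspace(0.0, 100.0, 50)
y_demo = 0.05 * x_demo + 2.0 + np.random.normal(0.0, 0.3, size=50)

slope_reps_demo, intercept_reps_demo = draw_bs_pairs_linreg(
    x_demo, y_demo, size=500)
```

Resampling indices rather than values is what keeps each x paired with its own y; resampling the two arrays independently would destroy the relationship being fitted.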
Write a for loop to generate 50 permutation samples, compute their ECDFs, and plot them.
  Generate permutation samples from rain_july and rain_november using your permutation_sample() function.
  Generate the x and y values for an ECDF for each of the two permutation samples using your ecdf() function.
  Plot the ECDF of the first permutation sample (x_1 and y_1) as dots. Do the same for the second permutation sample (x_2 and y_2).
Generate x and y values for ECDFs for the rain_july and rain_november data and plot the ECDFs using respectively the keyword arguments color='red' and color='blue'.
for _ in range(50):
    # Generate permutation samples
    perm_sample_1, perm_sample_2 = permutation_sample(rain_july, rain_november)
    # Compute ECDFs
    x_1, y_1 = ecdf(perm_sample_1)
    x_2, y_2 = ecdf(perm_sample_2)
    # Plot ECDFs of permutation sample
    _ = plt.plot(x_1, y_1, marker='.', linestyle='none',
                 color='red', alpha=0.02)
    _ = plt.plot(x_2, y_2, marker='.', linestyle='none',
                 color='blue', alpha=0.02)
# Create and plot ECDFs from original data
x_1, y_1 = ecdf(rain_july)
x_2, y_2 = ecdf(rain_november)
_ = plt.plot(x_1, y_1, marker='.', linestyle='none', color='red')
_ = plt.plot(x_2, y_2, marker='.', linestyle='none', color='blue')
# Label axes, set margin, and show plot
plt.margins(0.02)
_ = plt.xlabel('monthly rainfall (mm)')
_ = plt.ylabel('ECDF')
plt.show()
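The loop above calls permutation_sample(), whose definition does not appear in these notes. One plausible sketch is given here, assuming it pools the two data sets, shuffles the pool, and splits it back into pieces of the original sizes; the arrays a_demo and b_demo are made up.

```python
import numpy as np


def permutation_sample(data_1, data_2):
    """Generate a permutation sample from two data sets."""
    # Pool the two data sets
    data = np.concatenate((data_1, data_2))
    # Shuffle the pooled data
    permuted_data = np.random.permutation(data)
    # Split back into two samples with the original sizes
    perm_sample_1 = permuted_data[:len(data_1)]
    perm_sample_2 = permuted_data[len(data_1):]
    return perm_sample_1, perm_sample_2


np.random.seed(2)

# Made-up data (assumption) with different lengths
a_demo = np.array([1.0, 2.0, 3.0])
b_demo = np.array([10.0, 20.0, 30.0, 40.0])
perm_a, perm_b = permutation_sample(a_demo, b_demo)
```

Shuffling the pooled data simulates the hypothesis that both data sets come from the same distribution, which is why the permutation ECDFs overlap each other in the plot.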
Computing p-values:
Write a function with signature diff_of_means(data_1, data_2) that returns the difference in means between two data sets, mean of data_1 minus mean of data_2.
def diff_of_means(data_1, data_2):
    """Difference in means of two arrays."""
    # The difference of means of data_1, data_2: diff
    diff = np.mean(data_1) - np.mean(data_2)
    return diff
# Compute difference of mean impact force from experiment: empirical_diff_means
empirical_diff_means = diff_of_means(force_a, force_b)
# Draw 10,000 permutation replicates: perm_replicates
perm_replicates = draw_perm_reps(force_a, force_b,
diff_of_means, size=10000)
# Compute p-value: p
p = np.sum( perm_replicates >= empirical_diff_means) / len(perm_replicates)
# Print the result
print('p-value =', p)
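The p-value computation above depends on draw_perm_reps(), which is used but never defined in these notes. A minimal self-contained sketch is given here, assuming the call signature draw_perm_reps(data_1, data_2, func, size) seen in the code; the force_a_demo and force_b_demo arrays are synthetic and do not reproduce the frog data.

```python
import numpy as np


def permutation_sample(data_1, data_2):
    """Pool two data sets, shuffle, and split at the original sizes."""
    permuted = np.random.permutation(np.concatenate((data_1, data_2)))
    return permuted[:len(data_1)], permuted[len(data_1):]


def draw_perm_reps(data_1, data_2, func, size=1):
    """Generate `size` permutation replicates of the statistic `func`."""
    perm_replicates = np.empty(size)
    for i in range(size):
        perm_sample_1, perm_sample_2 = permutation_sample(data_1, data_2)
        perm_replicates[i] = func(perm_sample_1, perm_sample_2)
    return perm_replicates


def diff_of_means(data_1, data_2):
    """Difference in means of two arrays."""
    return np.mean(data_1) - np.mean(data_2)


np.random.seed(3)

# Synthetic data (assumption): two groups with clearly different means
force_a_demo = np.random.normal(0.7, 0.1, size=20)
force_b_demo = np.random.normal(0.4, 0.1, size=20)

observed = diff_of_means(force_a_demo, force_b_demo)
reps = draw_perm_reps(force_a_demo, force_b_demo, diff_of_means, size=2000)

# Fraction of replicates at least as extreme as the observed difference
p_demo = np.sum(reps >= observed) / len(reps)
```

Because the synthetic groups differ by several standard errors, p_demo comes out very small; with real data the p-value depends on how often shuffled labels reproduce the observed difference.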
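Translate the impact forces of Frog B so that their mean is 0.55.
Use your draw_bs_reps() function to take 10,000 bootstrap replicates of the mean of your translated forces.
Compute the p-value as the fraction of bootstrap replicates that are less than or equal to the observed mean impact force of Frog B, force_b.
# Make an array of translated impact forces: translated_force_b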
translated_force_b = force_b - np.mean(force_b) + 0.55
# Take bootstrap replicates of Frog B's translated impact forces: bs_replicates
bs_replicates = draw_bs_reps(translated_force_b, np.mean, 10000)
# Compute fraction of replicates that are less than the observed Frog B force: p
p = np.sum(bs_replicates <= np.mean(force_b)) / 10000
# Print the p-value
print('p = ', p)
Construct Boolean arrays, dems and reps, that contain the votes of the respective parties; e.g., dems has 153 True entries and 91 False entries.
Write a function, frac_yay_dems(dems, reps), that returns the fraction of Democrats that voted yay. The first input is an array of Booleans. Two inputs are required to use your draw_perm_reps() function, but the second is not used.
Use your draw_perm_reps() function to draw 10,000 permutation replicates of the fraction of Democrat yay votes.
# Construct arrays of data: dems, reps
dems = np.array([True] * 153 + [False] * 91)
reps = np.array([True] * 136 + [False] * 35)
def frac_yay_dems(dems, reps):
    """Compute fraction of Democrat yay votes."""
    frac = np.sum(dems) / len(dems)
    return frac
# Acquire permutation samples: perm_replicates
perm_replicates = draw_perm_reps(dems, reps, frac_yay_dems, 10000)
# Compute and print p-value: p
p = np.sum(perm_replicates <= 153/244) / len(perm_replicates)
print('p-value =', p)
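Compute the observed difference in mean inter-no-hitter times using diff_of_means().
Generate 10,000 permutation replicates of the difference of means using draw_perm_reps().
# Compute the observed difference in mean inter-no-hitter times: nht_diff_obs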
nht_diff_obs = diff_of_means(nht_dead, nht_live)
# Acquire 10,000 permutation replicates of difference in mean no-hitter time: perm_replicates
perm_replicates = draw_perm_reps(nht_dead, nht_live,
diff_of_means, size=10000)
# Compute and print the p-value: p
p = np.sum(perm_replicates <= nht_diff_obs) / len(perm_replicates)
print('p-val =',p)
Compute the observed Pearson correlation between illiteracy and fertility.
Initialize an array to store your permutation replicates.
Write a for loop to draw 10,000 replicates:
  Permute the illiteracy measurements using np.random.permutation().
  Compute the Pearson correlation between the permuted illiteracy array, illiteracy_permuted, and fertility.
# Compute observed correlation: r_obs
r_obs = pearson_r(illiteracy, fertility)
# Initialize permutation replicates: perm_replicates
perm_replicates = np.empty(10000)
# Draw replicates
for i in range(10000):
    # Permute illiteracy measurements: illiteracy_permuted
    illiteracy_permuted = np.random.permutation(illiteracy)
    # Compute Pearson correlation
    perm_replicates[i] = pearson_r(illiteracy_permuted, fertility)
# Compute p-value: p
p = np.sum(perm_replicates >= r_obs) / len(perm_replicates)
print('p-val =', p)
Use your ecdf() function to generate x,y values from the control and treated arrays for plotting the ECDFs.
# Compute x,y values for ECDFs
x_control, y_control = ecdf(control)
x_treated, y_treated = ecdf(treated)
# Plot the ECDFs
plt.plot(x_control, y_control, marker='.', linestyle='none')
plt.plot(x_treated, y_treated, marker='.', linestyle='none')
# Set the margins
plt.margins(0.02)
# Add a legend
plt.legend(('control', 'treated'), loc='lower right')
# Label axes and show plot
plt.xlabel('millions of alive sperm per mL')
plt.ylabel('ECDF')
plt.show()
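Compute the mean alive sperm count of control minus that of treated.
Compute the mean of all alive sperm counts. To do this, first concatenate control and treated and take the mean of the concatenated array.
Generate shifted data sets for both control and treated such that the shifted data sets have the same mean. This has already been done for you.
Generate 10,000 bootstrap replicates of the mean each for the two shifted arrays using your draw_bs_reps() function.
# Compute the difference in mean sperm count: diff_means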
diff_means = np.mean(control) - np.mean(treated)
# Compute mean of pooled data: mean_count
mean_count = np.mean(np.concatenate((control, treated)))
# Generate shifted data sets
control_shifted = control - np.mean(control) + mean_count
treated_shifted = treated - np.mean(treated) + mean_count
# Generate bootstrap replicates
bs_reps_control = draw_bs_reps(control_shifted,
np.mean, size=10000)
bs_reps_treated = draw_bs_reps(treated_shifted,
np.mean, size=10000)
# Get replicates of difference of means: bs_replicates
bs_replicates = bs_reps_control - bs_reps_treated
# Compute and print p-value: p
p = np.sum(bs_replicates >= np.mean(control) - np.mean(treated)) \
/ len(bs_replicates)
print('p-value =', p)