%matplotlib inline
import random
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
Anscombe's quartet comprises four datasets and is rather famous. Why? You'll find out in this exercise.
For each of the four datasets...
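The recorded output later in this notebook lists a mean and a variance of x and y for each dataset. A minimal sketch of one way to compute such figures with a pandas `groupby` follows. The frame `df` and the typed-in quartet values are assumptions made so the example runs without a data file; they are not the notebook's own loading code.

```python
import pandas as pd

# Anscombe's quartet, typed in so the sketch runs without a data file.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Long-format frame with one row per point: columns dataset, x, y.
df = pd.concat(
    [pd.DataFrame({"dataset": name, "x": xs, "y": ys}) for name, (xs, ys) in quartet.items()],
    ignore_index=True,
)

# Mean and sample variance (ddof=1, the pandas default) of x and y per dataset.
summary = df.groupby("dataset")[["x", "y"]].agg(["mean", "var"])
print(summary.round(3))
```

Grouping on the `dataset` column keeps the computation independent of how many rows each dataset has or how the rows are ordered.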
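# `anascombe` is assumed to be a DataFrame with columns `dataset`, `x`, `y`.
for i in range(4):
    # Take rows i*10 .. (i+1)*10 - 1 as group i (assumes 10 rows per group).
    x = anascombe.x[i*10:(i+1)*10]
    y = anascombe.y[i*10:(i+1)*10]
    correlation = x.corr(y)  # Pearson correlation between x and y
    print("correlation of group", i, ':', correlation)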
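The loop above picks each group by a fixed 10-row slice. The published quartet has 11 points per dataset, so if the frame holds all of them, fixed slices straddle group boundaries. A hedged alternative is to group on the `dataset` column instead. The sketch below assumes a long-format frame named `df` (a hypothetical name) built from typed-in quartet values so that it runs on its own.

```python
import pandas as pd

# Anscombe's quartet, typed in so the sketch runs without a data file.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}
df = pd.concat(
    [pd.DataFrame({"dataset": name, "x": xs, "y": ys}) for name, (xs, ys) in quartet.items()],
    ignore_index=True,
)

# Pearson correlation of x and y within each dataset, keyed by dataset label.
corr_by_dataset = pd.Series(
    {name: group["x"].corr(group["y"]) for name, group in df.groupby("dataset")}
)
print(corr_by_dataset.round(3))
```

With this grouping, all four correlations come out at roughly 0.816, which is the property the quartet is known for.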
for i in range(4):
    x = anascombe.x[i*10:(i+1)*10]
    y = anascombe.y[i*10:(i+1)*10]
    # sm.OLS does not add an intercept on its own, so this fits y = b*x.
    mod = sm.OLS(y, x)
    result = mod.fit()
    print(result.summary())
Mean of x per group
dataset
I      9.0
II     9.0
III    9.0
IV     9.0
Name: x, dtype: float64

Variance of x per group
dataset
I      11.0
II     11.0
III    11.0
IV     11.0
Name: x, dtype: float64

Mean of y per group
dataset
I      7.500909
II     7.500909
III    7.500000
IV     7.500909
Name: y, dtype: float64

Variance of y per group
dataset
I      4.127269
II     4.127629
III    4.122620
IV     4.123249
Name: y, dtype: float64

Correlation
correlation of group: 0 0.797081575906253
correlation of group: 1 0.8107567988514719
correlation of group: 2 0.828558301914895
correlation of group: 3 0.4695259621639301

Linear regression
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.965
Model:                            OLS   Adj. R-squared:                  0.962
Method:                 Least Squares   F-statistic:                     251.5
Date:                Sat, 09 Jun 2018   Prob (F-statistic):           6.95e-08
Time:                        17:13:54   Log-Likelihood:                -18.061
No. Observations:                  10   AIC:                             38.12
Df Residuals:                       9   BIC:                             38.43
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x              0.7881      0.050     15.859      0.000       0.676       0.901
==============================================================================
Omnibus:                        0.651   Durbin-Watson:                   2.507
Prob(Omnibus):                  0.722   Jarque-Bera (JB):                0.396
Skew:                          -0.424   Prob(JB):                        0.820
Kurtosis:                       2.519   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.961
Model:                            OLS   Adj. R-squared:                  0.957
Method:                 Least Squares   F-statistic:                     221.7
Date:                Sat, 09 Jun 2018   Prob (F-statistic):           1.20e-07
Time:                        17:13:54   Log-Likelihood:                -18.584
No. Observations:                  10   AIC:                             39.17
Df Residuals:                       9   BIC:                             39.47
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x              0.7894      0.053     14.889      0.000       0.669       0.909
==============================================================================
Omnibus:                        3.223   Durbin-Watson:                   2.351
Prob(Omnibus):                  0.200   Jarque-Bera (JB):                1.584
Skew:                          -0.969   Prob(JB):                        0.453
Kurtosis:                       2.795   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.963
Model:                            OLS   Adj. R-squared:                  0.959
Method:                 Least Squares   F-statistic:                     235.0
Date:                Sat, 09 Jun 2018   Prob (F-statistic):           9.34e-08
Time:                        17:13:54   Log-Likelihood:                -18.117
No. Observations:                  10   AIC:                             38.23
Df Residuals:                       9   BIC:                             38.54
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x              0.8175      0.053     15.329      0.000       0.697       0.938
==============================================================================
Omnibus:                        0.753   Durbin-Watson:                   1.401
Prob(Omnibus):                  0.686   Jarque-Bera (JB):                0.590
Skew:                          -0.489   Prob(JB):                        0.745
Kurtosis:                       2.323   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.964
Model:                            OLS   Adj. R-squared:                  0.960
Method:                 Least Squares   F-statistic:                     243.1
Date:                Sat, 09 Jun 2018   Prob (F-statistic):           8.06e-08
Time:                        17:13:54   Log-Likelihood:                -17.121
No. Observations:                  10   AIC:                             36.24
Df Residuals:                       9   BIC:                             36.54
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x              0.8537      0.055     15.591      0.000       0.730       0.978
==============================================================================
Omnibus:                        1.048   Durbin-Watson:                   1.199
Prob(Omnibus):                  0.592   Jarque-Bera (JB):                0.714
Skew:                          -0.287   Prob(JB):                        0.700
Kurtosis:                       1.823   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
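Each summary above lists a single coefficient for `x` and no constant term, which is what `sm.OLS` produces when no intercept column is supplied, and each fit uses 10 observations. A minimal sketch of a per-dataset fit with an intercept follows. It swaps in `np.polyfit` for statsmodels to keep the example dependency-light; with statsmodels, the usual route would be to wrap the regressor in `sm.add_constant`. The typed-in quartet values and the names `quartet` and `fits` are assumptions for illustration.

```python
import numpy as np

# Anscombe's quartet, typed in so the sketch runs without a data file.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Least-squares line y = intercept + slope * x for each dataset.
fits = {}
for name, (xs, ys) in quartet.items():
    # polyfit returns coefficients from the highest degree down: [slope, intercept].
    slope, intercept = np.polyfit(xs, ys, deg=1)
    fits[name] = (slope, intercept)
    print(f"{name}: y = {intercept:.3f} + {slope:.3f} * x")
```

Fitted this way, the four lines are nearly identical (slope close to 0.5, intercept close to 3), even though the scatter plots of the four datasets look very different.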
Using Seaborn, visualize all four datasets.
hint: use sns.FacetGrid combined with plt.scatter
# One scatter panel per dataset, laid out in columns.
g = sns.FacetGrid(anascombe, col='dataset')
g_map = g.map(plt.scatter, 'x', 'y')