This assignment is the homework for the IPython tutorial of the cme193 course; the original notebook is at:
https://nbviewer.jupyter.org/github/schmit/cme193-ipython-notebooks-lecture/blob/master/Exercises.ipynb
1. Using IPython
IPython is a very handy tool that combines a browser-based Python editor with an integrated terminal, making it easy to write and run code. In this assignment, the two conveniences I appreciated most were:
1. Results such as figures and tables are rendered directly in the page, which is cleaner and easier to read than the output of an ordinary Python terminal.
2. The same cell of code can be edited and re-run as many times as needed, instead of having to retype it line by line as in an ordinary terminal.
Running code in IPython is simple: press Shift+Enter to execute the code in the current cell. The whole assignment below was completed in IPython.
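For example, here is a minimal cell (a sketch, assuming a Jupyter/IPython notebook with matplotlib installed) showing a figure rendered directly beneath the cell:
# '%matplotlib inline' is an IPython magic command: figures are drawn
# in the notebook page itself rather than in a separate window
%matplotlib inline
import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [1, 4, 9, 16])  # the plot appears right below this cell
plt.show()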
2. The Assignment
This assignment works on Anscombe's quartet, a famous teaching dataset (it can be downloaded from the data folder at the link above). The dataset is widely used because of the striking effect it produces once visualized.
What effect? Let's find out through the exercises.
For each of the four datasets...
In the first part, we compute several summary statistics of x and y for each of the four datasets, and then fit a linear model relating y to x. Two data-analysis libraries handle all of this directly:
1. pandas, to read the csv file and to group and summarize the columns of the table by dataset.
2. statsmodels, to fit the linear regression on each dataset.
The code is as follows:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf

# load the Anscombe dataset (the csv from the data folder linked above)
anascombe = pd.read_csv('data/anscombe.csv')

# mean of x and y, per dataset
print('The mean of x and y:')
print(anascombe.groupby('dataset')[['x', 'y']].mean())

# variance of x and y, per dataset
print('\nThe variance of x and y:')
print(anascombe.groupby('dataset')[['x', 'y']].var())

# correlation coefficient between x and y, per dataset
print('\nThe correlation coefficient between x and y:')
print(anascombe.groupby('dataset')[['x', 'y']].corr())

# fit an ordinary least squares line y ~ x for each dataset
datasets = ['I', 'II', 'III', 'IV']
for dataset in datasets:
    lin_model = smf.ols('y ~ x', anascombe[anascombe['dataset'] == dataset]).fit()
    print('\nThe linear model for dataset', dataset)
    print(lin_model.summary())
    print('\n')
The output is as follows:
The mean of x and y:
           x         y
dataset
I        9.0  7.500909
II       9.0  7.500909
III      9.0  7.500000
IV       9.0  7.500909
The variance of x and y:
            x         y
dataset
I        11.0  4.127269
II       11.0  4.127629
III      11.0  4.122620
IV       11.0  4.123249
The correlation coefficient between x and y:
                  x         y
dataset
I       x  1.000000  0.816421
        y  0.816421  1.000000
II      x  1.000000  0.816237
        y  0.816237  1.000000
III     x  1.000000  0.816287
        y  0.816287  1.000000
IV      x  1.000000  0.816521
        y  0.816521  1.000000
The linear model for dataset I
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.667
Model:                            OLS   Adj. R-squared:                  0.629
Method:                 Least Squares   F-statistic:                     17.99
Date:                Sat, 09 Jun 2018   Prob (F-statistic):            0.00217
Time:                        20:38:34   Log-Likelihood:                -16.841
No. Observations:                  11   AIC:                             37.68
Df Residuals:                       9   BIC:                             38.48
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.0001      1.125      2.667      0.026       0.456       5.544
x              0.5001      0.118      4.241      0.002       0.233       0.767
==============================================================================
Omnibus:                        0.082   Durbin-Watson:                   3.212
Prob(Omnibus):                  0.960   Jarque-Bera (JB):                0.289
Skew:                          -0.122   Prob(JB):                        0.865
Kurtosis:                       2.244   Cond. No.                         29.1
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The linear model for dataset II
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.666
Model:                            OLS   Adj. R-squared:                  0.629
Method:                 Least Squares   F-statistic:                     17.97
Date:                Sat, 09 Jun 2018   Prob (F-statistic):            0.00218
Time:                        20:38:34   Log-Likelihood:                -16.846
No. Observations:                  11   AIC:                             37.69
Df Residuals:                       9   BIC:                             38.49
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.0009      1.125      2.667      0.026       0.455       5.547
x              0.5000      0.118      4.239      0.002       0.233       0.767
==============================================================================
Omnibus:                        1.594   Durbin-Watson:                   2.188
Prob(Omnibus):                  0.451   Jarque-Bera (JB):                1.108
Skew:                          -0.567   Prob(JB):                        0.575
Kurtosis:                       1.936   Cond. No.                         29.1
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The linear model for dataset III
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.666
Model:                            OLS   Adj. R-squared:                  0.629
Method:                 Least Squares   F-statistic:                     17.97
Date:                Sat, 09 Jun 2018   Prob (F-statistic):            0.00218
Time:                        20:38:34   Log-Likelihood:                -16.838
No. Observations:                  11   AIC:                             37.68
Df Residuals:                       9   BIC:                             38.47
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.0025      1.124      2.670      0.026       0.459       5.546
x              0.4997      0.118      4.239      0.002       0.233       0.766
==============================================================================
Omnibus:                       19.540   Durbin-Watson:                   2.144
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               13.478
Skew:                           2.041   Prob(JB):                      0.00118
Kurtosis:                       6.571   Cond. No.                         29.1
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The linear model for dataset IV
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.667
Model:                            OLS   Adj. R-squared:                  0.630
Method:                 Least Squares   F-statistic:                     18.00
Date:                Sat, 09 Jun 2018   Prob (F-statistic):            0.00216
Time:                        20:38:34   Log-Likelihood:                -16.833
No. Observations:                  11   AIC:                             37.67
Df Residuals:                       9   BIC:                             38.46
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.0017      1.124      2.671      0.026       0.459       5.544
x              0.4999      0.118      4.243      0.002       0.233       0.766
==============================================================================
Omnibus:                        0.555   Durbin-Watson:                   1.662
Prob(Omnibus):                  0.758   Jarque-Bera (JB):                0.524
Skew:                           0.010   Prob(JB):                        0.769
Kurtosis:                       1.931   Cond. No.                         29.1
==============================================================================
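All four fits land on essentially the same line, y ≈ 3.00 + 0.50x, with nearly identical R². When only the coefficients and the correlation are needed, rather than the full summaries, they can be read off the fitted model directly; a minimal sketch, reusing the anascombe DataFrame and the smf import from the cell above:
# compact per-dataset summary: fitted intercept/slope plus the x-y correlation
for dataset in ['I', 'II', 'III', 'IV']:
    subset = anascombe[anascombe['dataset'] == dataset]
    fit = smf.ols('y ~ x', subset).fit()
    r = subset['x'].corr(subset['y'])  # pairwise Pearson correlation
    print(dataset,
          'intercept=%.4f' % fit.params['Intercept'],
          'slope=%.4f' % fit.params['x'],
          'corr=%.4f' % r)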
Using Seaborn, visualize all four datasets.
hint: use sns.FacetGrid combined with plt.scatter
For this part, we can likewise rely on two libraries:
1. seaborn, to build a grid of plots with one facet per dataset.
2. matplotlib, to draw the scatter points inside each facet.
The code is as follows:
import matplotlib.pyplot as plt
import seaborn as sns

# one facet (column) per dataset; hue='y' simply colors points by their y value
g = sns.FacetGrid(anascombe, col='dataset', hue='y')
g.map(plt.scatter, 'x', 'y')
The result is shown in the figure below (four scatter panels, one per dataset):
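As a small variation (a sketch, not part of the original assignment), seaborn's lmplot draws the same per-dataset scatter plots with the fitted regression line overlaid, which makes it obvious that all four fits coincide:
# scatter plus the OLS line for each dataset; ci=None turns off the
# confidence band around the fitted line
sns.lmplot(x='x', y='y', data=anascombe, col='dataset', col_wrap=2, ci=None)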
3. Analysis of the Results
Judging from the first part alone, the four datasets look almost identical: their means, variances, and x-y correlation coefficients agree to several decimal places, and the four fitted lines are essentially the same, y ≈ 3.00 + 0.50x. The plots, however, show that the four datasets are in fact distributed very differently: one is roughly linear with noise, one follows a clear curve, and the other two are each dominated by a single outlier.
This reminds us that analyzing data through only a handful of summary statistics can easily be misleading. A better practice is to visualize the data as well, which gives a much more direct picture of how it is actually distributed.
Only by combining visualization with summary statistics can we analyze a dataset completely and reliably.