"Advanced Programming Techniques" Assignment 18: Python Data Analysis Exercises

This assignment is the exercise set from the cme193 IPython notebook lecture; the original can be found at:

https://nbviewer.jupyter.org/github/schmit/cme193-ipython-notebooks-lecture/blob/master/Exercises.ipynb

1. Using IPython

IPython (used here through the Jupyter notebook) is a handy browser-based environment that integrates Python editing with an interactive interpreter, making it very convenient to write and run code. While doing this assignment, the two conveniences I appreciated most were:

    1. Results such as figures and tables are rendered inline on the page, which is cleaner and easier to read than output in an ordinary Python terminal.

    2. The same cell can be edited and re-run as many times as needed, avoiding the one-shot nature of lines typed into an ordinary terminal.

Running code in IPython is simple: press Shift+Enter to execute the current cell. I completed all of this assignment in IPython.

[Figure 1: running a cell in the Jupyter notebook]



2. The Exercises

This assignment uses the famous teaching dataset Anscombe's quartet (available in the data folder of the repository linked above). The dataset is widely used because of the striking effect it produces when visualized.

What is that effect? Let's find out through the exercises.

Part 1

For each of the four datasets...

  • Compute the mean and variance of both x and y
  • Compute the correlation coefficient between x and y
  • Compute the linear regression line: y = β0 + β1·x + ε (hint: use statsmodels and look at the Statsmodels notebook)


In Part 1 we compute several summary statistics of x and y for each dataset, and then fit a linear model relating y to x.

Two data analysis libraries cover all three tasks directly:

    1. pandas, to read the CSV file and group and summarize the table.

    2. statsmodels, to fit the regression line on each dataset.

    The code is as follows:

import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Load the data; the path assumes the CSV from the course repo's data folder
anascombe = pd.read_csv('data/anscombe.csv')

# Mean of x and y, per dataset
print('The mean of x and y:')
print(anascombe.groupby(['dataset'])[['x', 'y']].mean())

# Variance of x and y, per dataset
print('\nThe variance of x and y:')
print(anascombe.groupby(['dataset'])[['x', 'y']].var())

# Correlation coefficient between x and y, per dataset
print('\nThe correlation coefficient between x and y:')
print(anascombe.groupby(['dataset'])[['x', 'y']].corr())

# Fit a linear regression on each dataset
datasets = ['I', 'II', 'III', 'IV']
for dataset in datasets:
    lin_model = smf.ols('y ~ x', anascombe[anascombe['dataset'] == dataset]).fit()
    print('\nThe linear model for dataset', dataset)
    print(lin_model.summary())

The output is as follows:

The mean of x and y:
           x         y
dataset               
I        9.0  7.500909
II       9.0  7.500909
III      9.0  7.500000
IV       9.0  7.500909

The variance of x and y:
            x         y
dataset                
I        11.0  4.127269
II       11.0  4.127629
III      11.0  4.122620
IV       11.0  4.123249

The correlation coefficient between x and y:
                  x         y
dataset                      
I       x  1.000000  0.816421
        y  0.816421  1.000000
II      x  1.000000  0.816237
        y  0.816237  1.000000
III     x  1.000000  0.816287
        y  0.816287  1.000000
IV      x  1.000000  0.816521
        y  0.816521  1.000000
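The `groupby().corr()` call above prints a full 2×2 matrix per dataset, which is redundant: the off-diagonal entry is all we need. A minimal sketch that extracts just the scalar r per dataset, with the quartet's values entered by hand so it runs without the CSV:

```python
import pandas as pd

# Anscombe's quartet entered by hand (the same values as the course CSV)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
data = pd.DataFrame({
    'dataset': ['I'] * 11 + ['II'] * 11 + ['III'] * 11 + ['IV'] * 11,
    'x': x123 * 3 + [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
    'y': [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68,
          9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74,
          7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73,
          6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
})

# One scalar correlation per dataset instead of the full 2x2 matrix
r = {name: g['x'].corr(g['y']) for name, g in data.groupby('dataset')}
for name, value in r.items():
    print(name, round(value, 3))   # all four come out around 0.816
```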

The linear model for dataset I
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.667
Model:                            OLS   Adj. R-squared:                  0.629
Method:                 Least Squares   F-statistic:                     17.99
Date:                Sat, 09 Jun 2018   Prob (F-statistic):            0.00217
Time:                        20:38:34   Log-Likelihood:                -16.841
No. Observations:                  11   AIC:                             37.68
Df Residuals:                       9   BIC:                             38.48
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.0001      1.125      2.667      0.026       0.456       5.544
x              0.5001      0.118      4.241      0.002       0.233       0.767
==============================================================================
Omnibus:                        0.082   Durbin-Watson:                   3.212
Prob(Omnibus):                  0.960   Jarque-Bera (JB):                0.289
Skew:                          -0.122   Prob(JB):                        0.865
Kurtosis:                       2.244   Cond. No.                         29.1
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The linear model for dataset II
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.666
Model:                            OLS   Adj. R-squared:                  0.629
Method:                 Least Squares   F-statistic:                     17.97
Date:                Sat, 09 Jun 2018   Prob (F-statistic):            0.00218
Time:                        20:38:34   Log-Likelihood:                -16.846
No. Observations:                  11   AIC:                             37.69
Df Residuals:                       9   BIC:                             38.49
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.0009      1.125      2.667      0.026       0.455       5.547
x              0.5000      0.118      4.239      0.002       0.233       0.767
==============================================================================
Omnibus:                        1.594   Durbin-Watson:                   2.188
Prob(Omnibus):                  0.451   Jarque-Bera (JB):                1.108
Skew:                          -0.567   Prob(JB):                        0.575
Kurtosis:                       1.936   Cond. No.                         29.1
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The linear model for dataset III
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.666
Model:                            OLS   Adj. R-squared:                  0.629
Method:                 Least Squares   F-statistic:                     17.97
Date:                Sat, 09 Jun 2018   Prob (F-statistic):            0.00218
Time:                        20:38:34   Log-Likelihood:                -16.838
No. Observations:                  11   AIC:                             37.68
Df Residuals:                       9   BIC:                             38.47
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.0025      1.124      2.670      0.026       0.459       5.546
x              0.4997      0.118      4.239      0.002       0.233       0.766
==============================================================================
Omnibus:                       19.540   Durbin-Watson:                   2.144
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               13.478
Skew:                           2.041   Prob(JB):                      0.00118
Kurtosis:                       6.571   Cond. No.                         29.1
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The linear model for dataset IV
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.667
Model:                            OLS   Adj. R-squared:                  0.630
Method:                 Least Squares   F-statistic:                     18.00
Date:                Sat, 09 Jun 2018   Prob (F-statistic):            0.00216
Time:                        20:38:34   Log-Likelihood:                -16.833
No. Observations:                  11   AIC:                             37.67
Df Residuals:                       9   BIC:                             38.46
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.0017      1.124      2.671      0.026       0.459       5.544
x              0.4999      0.118      4.243      0.002       0.233       0.766
==============================================================================
Omnibus:                        0.555   Durbin-Watson:                   1.662
Prob(Omnibus):                  0.758   Jarque-Bera (JB):                0.524
Skew:                           0.010   Prob(JB):                        0.769
Kurtosis:                       1.931   Cond. No.                         29.1
==============================================================================
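`summary()` prints the whole regression table; when only the fitted line matters, the same least-squares coefficients can be recovered with `np.polyfit`, which for degree 1 solves the same problem as `smf.ols('y ~ x', ...)`. A sketch on dataset I, with its values entered by hand:

```python
import numpy as np

# Dataset I of Anscombe's quartet, entered by hand
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# Degree-1 polyfit returns (slope, intercept) of the least-squares line
beta1, beta0 = np.polyfit(x, y, 1)
print(f'y = {beta0:.4f} + {beta1:.4f}x')  # matches the OLS table: y = 3.0001 + 0.5001x
```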

Part 2

Using Seaborn, visualize all four datasets.

hint: use sns.FacetGrid combined with plt.scatter

For this part we can use:

  1. seaborn, to lay out one panel per dataset with FacetGrid.

  2. matplotlib, to draw the scatter plot inside each panel.

The code is as follows:

import matplotlib.pyplot as plt
import seaborn as sns

# One scatter panel per dataset; hue='y' simply colors points by their y value
g = sns.FacetGrid(anascombe, col='dataset', hue='y')
g.map(plt.scatter, 'x', 'y')

The result is shown below:

[Figure 2: scatter plots of the four Anscombe datasets]

3. Analysis of the Results

Judging from Part 1 alone, the four datasets are almost identical in all three key statistics: mean, variance, and correlation coefficient (and even in their fitted regression lines). Yet the actual plots show that the four distributions are very different.

This is a reminder that analyzing data through a handful of summary statistics alone can easily be misleading. A better practice is to also visualize the data, which gives a far more direct picture of how it is distributed.

Only by combining visualization with summary statistics can we analyze data completely.
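The point can be made concrete with dataset IV alone: its headline statistics match the other three datasets, yet its x values are degenerate, which only becomes visible once you look at the raw points. A small numpy check, with the values entered by hand:

```python
import numpy as np

# Dataset IV of Anscombe's quartet, entered by hand
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

# Same headline numbers as the other three datasets
print(x4.mean(), x4.var(ddof=1))            # 9.0 11.0
print(round(np.corrcoef(x4, y4)[0, 1], 3))  # 0.817

# ...yet every point except one sits at x = 8
print(np.unique(x4))  # [ 8. 19.]
```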
