pandas Exercises

Source of the exercises:

https://nbviewer.jupyter.org/github/schmit/cme193-ipython-notebooks-lecture/blob/master/Exercises.ipynb


Anscombe's quartet

Anscombe's quartet comprises four datasets and is rather famous. Why? You'll find out in this exercise.

Initial data:

[Image 1: screenshot of the initial data]

Part 1

For each of the four datasets...

  • Compute the mean and variance of both x and y
  • Compute the correlation coefficient between x and y
  • Compute the linear regression line: y = β0 + β1x + ϵ (hint: use statsmodels and look at the Statsmodels notebook)
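
The three quantities in the list above can be checked by hand for a single dataset. The following is a minimal sketch using only the Python standard library, not the pandas/statsmodels route the exercise asks for. The x and y values are the commonly published numbers for Anscombe's dataset I, and it is an assumption that they match the rows labelled 'I' in Anscombe.csv.

```python
from math import sqrt
from statistics import mean, variance

# Commonly published values for Anscombe's dataset I
# (assumed to match the rows labelled 'I' in Anscombe.csv).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
mean_x, mean_y = mean(x), mean(y)
# Sample variance (divides by n - 1), the same convention as pandas .var()
var_x, var_y = variance(x), variance(y)

# Sample covariance, then Pearson's r = cov / (sd_x * sd_y)
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
r = cov_xy / sqrt(var_x * var_y)

# Least-squares line y = beta0 + beta1 * x
beta1 = cov_xy / var_x
beta0 = mean_y - beta1 * mean_x

print(f"mean x = {mean_x:.2f}, mean y = {mean_y:.2f}")
print(f"var x = {var_x:.2f}, var y = {var_y:.2f}")
print(f"r = {r:.3f}, beta0 = {beta0:.3f}, beta1 = {beta1:.3f}")
```

With these values the fitted line comes out close to y = 3.00 + 0.50x with r near 0.816, which gives a reference point for the pandas and statsmodels output.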

Results:

means: [Image 2: screenshot]

variance: [Image 3: screenshot]

correlation coefficient: [Image 4: screenshot]

model: [Images 5-8: screenshots]


Part 2

Using Seaborn, visualize all four datasets.

hint: use sns.FacetGrid combined with plt.scatter
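
As a hedged variation on the hint, sns.lmplot builds a FacetGrid itself and draws the fitted line on top of each scatter plot. Note that this swaps plt.scatter for lmplot, so it is not the approach the exercise names. The sketch below uses only datasets I and II, with the commonly published values typed in by hand (assumed to match Anscombe.csv), so that it runs without the CSV file.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless

import pandas as pd
import seaborn as sns

# Commonly published values for Anscombe's datasets I and II
# (assumed to match Anscombe.csv); both share the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

df = pd.DataFrame({
    "dataset": ["I"] * len(x) + ["II"] * len(x),
    "x": x + x,
    "y": y1 + y2,
})

# One panel per dataset: scatter plus least-squares line, no confidence band
g = sns.lmplot(data=df, x="x", y="y", col="dataset", ci=None)
g.set_titles(col_template="dataset {col_name}")
```

In an interactive session, drop the matplotlib.use("Agg") line and call plt.show() from matplotlib.pyplot to display the figure.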

[Image 9: screenshot of the resulting plot]


Full code:

import random

import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
import statsmodels.formula.api as smf

sns.set_context("talk")

# Read and display the initial data
anascombe = pd.read_csv('Anscombe.csv')
data = anascombe.head()
print(data)
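# Compute and display the means
# (select several columns with a list: [['x', 'y']], not ['x', 'y'])
means = anascombe.groupby('dataset')[['x', 'y']].mean()
print("the mean of x and y:")
print(means)
# Compute and display the variances (.var() is the sample variance, ddof=1)
variances = anascombe.groupby('dataset')[['x', 'y']].var()
print("the variance of x and y:")
print(variances)
# Compute and display the correlation coefficients
corr = anascombe.groupby('dataset')[['x', 'y']].corr()
print("the correlation coefficient of x and y:")
print(corr)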

print()
# Fit the model for each dataset and print the results
l = ['I','II','III','IV']
for i in l:
    x = anascombe[anascombe['dataset'] == i]['x']
    y = anascombe[anascombe['dataset'] == i]['y']
    # Add the constant (intercept) term
    x = sm.add_constant(x)
    model = sm.OLS(y,x).fit()
    print('the model of data '+i+' :')
    print(model.params)
    print(model.summary())

# Plot each dataset in its own panel
g = sns.FacetGrid(anascombe, col="dataset")
g.map(plt.scatter, "x", "y")
plt.show()
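
The listing above imports statsmodels.formula.api as smf but never uses it. As a possible alternative to the sm.add_constant / sm.OLS loop, the formula interface adds the intercept automatically. This is only a sketch: the helper name fit_line is hypothetical, and the data are the commonly published values for datasets I and II typed in by hand (assumed to match Anscombe.csv) so that it runs without the CSV file.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Commonly published values for Anscombe's datasets I and II
# (assumed to match Anscombe.csv); both share the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

df = pd.DataFrame({
    "dataset": ["I"] * len(x) + ["II"] * len(x),
    "x": x + x,
    "y": y1 + y2,
})


def fit_line(group):
    """Hypothetical helper: fit y ~ x and return intercept and slope."""
    params = smf.ols("y ~ x", data=group).fit().params
    return pd.Series({"intercept": params["Intercept"], "slope": params["x"]})


# One row of coefficients per dataset
coefs = pd.DataFrame(
    {name: fit_line(group) for name, group in df.groupby("dataset")}
).T
print(coefs.round(3))
```

Collecting the coefficients in one table makes it easy to see that the datasets share nearly the same regression line.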


