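This exercise applies a series of operations to a small given dataset. The exercise materials are available at:
https://github.com/schmit/cme193-ipython-notebooks-lecture
Download the repository locally and open it with Jupyter.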
In[1]:
%matplotlib inline
import random
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
sns.set_context("talk")
In[2]:
anascombe = pd.read_csv('data/anscombe.csv')
anascombe.head()
Out[2]:
  dataset     x     y
0       I  10.0  8.04
1       I   8.0  6.95
2       I  13.0  7.58
3       I   9.0  8.81
4       I  11.0  8.33
In[3]:
# Data processing: split the x and y values by dataset
lendata = len(anascombe.x)
x = [[] for i in range(4)]
y = [[] for i in range(4)]
for i in range(lendata):
    if anascombe.dataset[i] == 'I':
        x[0].append(anascombe.x[i])
        y[0].append(anascombe.y[i])
    elif anascombe.dataset[i] == 'II':
        x[1].append(anascombe.x[i])
        y[1].append(anascombe.y[i])
    elif anascombe.dataset[i] == 'III':
        x[2].append(anascombe.x[i])
        y[2].append(anascombe.y[i])
    elif anascombe.dataset[i] == 'IV':
        x[3].append(anascombe.x[i])
        y[3].append(anascombe.y[i])
# Compute the mean and variance of x and y for each dataset
# (np.var returns the population variance, ddof=0)
for i in range(4):
    print('mean of x' + str(i+1) + ': ' + str(np.mean(x[i])))
    print('mean of y' + str(i+1) + ': ' + str(np.mean(y[i])))
    print('var of x' + str(i+1) + ': ' + str(np.var(x[i])))
    print('var of y' + str(i+1) + ': ' + str(np.var(y[i])))
    co = np.corrcoef(x[i], y[i])[0][1]
    print('correlation coefficient of dataset' + str(i+1) + ': ' + str(co))
    # Least-squares fit
    # Add a constant (intercept) term to the model
    X = sm.add_constant(x[i])
    est = sm.OLS(y[i], X)
    est = est.fit()
    print('dataset' + str(i+1) + ' line: y = ' + str(est.params[0]) + ' + '+ str(est.params[1]) + 'x')
    print('\n')
Out[3]:
mean of x1: 9.0
mean of y1: 7.50090909091
var of x1: 10.0
var of y1: 3.75206280992
correlation coefficient of dataset1: 0.816420516345
dataset1 line: y = 3.00009090909 + 0.500090909091x

mean of x2: 9.0
mean of y2: 7.50090909091
var of x2: 10.0
var of y2: 3.75239008264
correlation coefficient of dataset2: 0.816236506
dataset2 line: y = 3.00090909091 + 0.5x

mean of x3: 9.0
mean of y3: 7.5
var of x3: 10.0
var of y3: 3.74783636364
correlation coefficient of dataset3: 0.81628673949
dataset3 line: y = 3.00245454545 + 0.499727272727x

mean of x4: 9.0
mean of y4: 7.50090909091
var of x4: 10.0
var of y4: 3.74840826446
correlation coefficient of dataset4: 0.816521436889
dataset4 line: y = 3.00172727273 + 0.499909090909x
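
The loop in In[3] can probably be written more compactly with pandas `groupby`. The following is only a minimal sketch, not the notebook's own solution. `summarize_by_dataset` is a hypothetical helper name, `np.polyfit` stands in for the `sm.OLS` fit, and the demo frame holds only the five rows shown in Out[2], so its numbers will differ from the full-data results. Note that `np.var` defaults to the population variance (`ddof=0`), while pandas' `.var()` defaults to the sample variance (`ddof=1`), so `ddof=0` is passed explicitly to stay consistent with Out[3].

```python
import numpy as np
import pandas as pd


def summarize_by_dataset(df):
    """Per-dataset mean, population variance, correlation and fitted line.

    Hypothetical helper: expects columns named 'dataset', 'x' and 'y'.
    """
    rows = {}
    for name, group in df.groupby("dataset"):
        # np.polyfit returns coefficients from the highest degree down,
        # so a degree-1 fit yields (slope, intercept).
        slope, intercept = np.polyfit(group["x"], group["y"], deg=1)
        rows[name] = {
            "mean_x": group["x"].mean(),
            "mean_y": group["y"].mean(),
            "var_x": group["x"].var(ddof=0),  # population variance, like np.var
            "var_y": group["y"].var(ddof=0),
            "corr": group["x"].corr(group["y"]),  # Pearson correlation
            "intercept": intercept,
            "slope": slope,
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Demo on the five rows shown in Out[2] (dataset I only, not the full file).
sample = pd.DataFrame({
    "dataset": ["I"] * 5,
    "x": [10.0, 8.0, 13.0, 9.0, 11.0],
    "y": [8.04, 6.95, 7.58, 8.81, 8.33],
})
summary = summarize_by_dataset(sample)
print(summary)
```

In the notebook itself, `summarize_by_dataset(anascombe)` should give one row per dataset, which makes the near-identical statistics easy to compare side by side.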
In[4]:
g = sns.FacetGrid(anascombe, col="dataset")
g.map(plt.scatter, "x","y")
plt.show()
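
Since all four datasets give nearly the same fitted line in Out[3], drawing that line over each scatter panel may make the differences in shape easier to see. This is a minimal sketch that assumes plain matplotlib subplots instead of `sns.FacetGrid`. `plot_with_fit` is a hypothetical helper, `np.polyfit` again stands in for the `sm.OLS` fit, and the demo frame contains only the five rows from Out[2].

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_with_fit(df):
    """Scatter each dataset in its own panel and overlay its least-squares line.

    Hypothetical helper: expects columns named 'dataset', 'x' and 'y'.
    """
    names = sorted(df["dataset"].unique())
    fig, axes = plt.subplots(
        1, len(names), figsize=(4 * len(names), 4),
        sharex=True, sharey=True, squeeze=False,
    )
    for ax, name in zip(axes[0], names):
        group = df[df["dataset"] == name]
        # Degree-1 fit: coefficients come back as (slope, intercept).
        slope, intercept = np.polyfit(group["x"], group["y"], deg=1)
        # Two endpoints are enough to draw a straight line.
        xs = np.array([group["x"].min(), group["x"].max()])
        ax.scatter(group["x"], group["y"])
        ax.plot(xs, intercept + slope * xs, color="red")
        ax.set_title("dataset = " + str(name))
        ax.set_xlabel("x")
    axes[0][0].set_ylabel("y")
    return fig


# Demo on the five rows shown in Out[2] (dataset I only, not the full file).
sample = pd.DataFrame({
    "dataset": ["I"] * 5,
    "x": [10.0, 8.0, 13.0, 9.0, 11.0],
    "y": [8.04, 6.95, 7.58, 8.81, 8.33],
})
fig = plot_with_fit(sample)
```

In the notebook, calling `plot_with_fit(anascombe)` would be expected to draw one panel per dataset, each with its own fitted line.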