Anscombe’s quartet comprises four datasets and is rather famous. Why? You’ll find out in this exercise.
Using Seaborn, visualize all four datasets.
(hint: use sns.FacetGrid combined with plt.scatter)
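import pandas as pd
import scipy as sp
import scipy.stats  # makes sp.stats available
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt

anascombe = pd.read_csv('data.csv')
anascombe.head()
print(anascombe)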
dataset x y
0 I 10.0 8.04
1 I 8.0 6.95
2 I 13.0 7.58
3 I 9.0 8.81
4 I 11.0 8.33
5 I 14.0 9.96
6 I 6.0 7.24
7 I 4.0 4.26
8 I 12.0 10.84
9 I 7.0 4.82
10 I 5.0 5.68
11 II 10.0 9.14
12 II 8.0 8.14
13 II 13.0 8.74
14 II 9.0 8.77
15 II 11.0 9.26
16 II 14.0 8.10
17 II 6.0 6.13
18 II 4.0 3.10
19 II 12.0 9.13
20 II 7.0 7.26
21 II 5.0 4.74
22 III 10.0 7.46
23 III 8.0 6.77
24 III 13.0 12.74
25 III 9.0 7.11
26 III 11.0 7.81
27 III 14.0 8.84
28 III 6.0 6.08
29 III 4.0 5.39
30 III 12.0 8.15
31 III 7.0 6.42
32 III 5.0 5.73
33 IV 8.0 6.58
34 IV 8.0 5.76
35 IV 8.0 7.71
36 IV 8.0 8.84
37 IV 8.0 8.47
38 IV 8.0 7.04
39 IV 8.0 5.25
40 IV 19.0 12.50
41 IV 8.0 5.56
42 IV 8.0 7.91
43 IV 8.0 6.89
print(anascombe.groupby('dataset')['x'].mean())
print(anascombe.groupby('dataset')['y'].mean())
dataset
I 9.0
II 9.0
III 9.0
IV 9.0
Name: x, dtype: float64
dataset
I 7.500909
II 7.500909
III 7.500000
IV 7.500909
Name: y, dtype: float64
print(anascombe.groupby('dataset')['x'].var())
print(anascombe.groupby('dataset')['y'].var())
dataset
I 11.0
II 11.0
III 11.0
IV 11.0
Name: x, dtype: float64
dataset
I 4.127269
II 4.127629
III 4.122620
IV 4.123249
Name: y, dtype: float64
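The four printouts above can also be produced in a single call. Below is a minimal sketch, assuming the table printed earlier is rebuilt inline as a hypothetical `quartet` frame (the original reads it from data.csv); the names `x_shared`, `data`, `quartet` and `summary` are illustrative only.

```python
import pandas as pd

# Anscombe's quartet, copied from the table printed earlier.
# `quartet` is a hypothetical stand-in for the frame read from data.csv.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x values of datasets I-III
data = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}
quartet = pd.concat(
    [pd.DataFrame({"dataset": name, "x": xs, "y": ys}) for name, (xs, ys) in data.items()],
    ignore_index=True,
)

# One groupby call gives the mean and sample variance of x and y per dataset.
summary = quartet.groupby("dataset")[["x", "y"]].agg(["mean", "var"])
print(summary.round(3))
```

The result has one row per dataset and a two-level column index, so a single value can be read with, for example, `summary.loc["I", ("y", "var")]`.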
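# Each dataset has 11 rows (I: 0-10, II: 11-21, III: 22-32, IV: 33-43).
# A slice end is exclusive, so x[0:10] would drop the last row of each dataset.
# The outputs recorded below came from such 10-row slices; on all 11 rows each
# correlation is about 0.816 and each fit is close to y = 3.00 + 0.50 * x.
X1 = anascombe.x[0:11].values
X2 = anascombe.x[11:22].values
X3 = anascombe.x[22:33].values
X4 = anascombe.x[33:44].values
Y1 = anascombe.y[0:11].values
Y2 = anascombe.y[11:22].values
Y3 = anascombe.y[22:33].values
Y4 = anascombe.y[33:44].values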
coefficients = [0,0,0,0]
coefficients[0] = sp.stats.pearsonr(X1, Y1)[0]  # the first returned value is the correlation coefficient
coefficients[1] = sp.stats.pearsonr(X2, Y2)[0]
coefficients[2] = sp.stats.pearsonr(X3, Y3)[0]
coefficients[3] = sp.stats.pearsonr(X4, Y4)[0]
for coefficient in coefficients:
    print(coefficient)
0.7970815759062526
0.7773093020784241
0.7985632617088811
0.8146722146933596
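As an alternative to slicing by row position, the correlation can be computed per group, so the row ranges never have to be spelled out. This is a sketch under assumptions: the data is rebuilt inline as a hypothetical `quartet` frame, and pandas' `Series.corr` (Pearson by default) stands in for `sp.stats.pearsonr`. Computed on all 11 rows of each dataset, every coefficient comes out near 0.816.

```python
import pandas as pd

# Anscombe's quartet, copied from the table printed earlier.
# `quartet` is a hypothetical stand-in for the frame read from data.csv.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x values of datasets I-III
data = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}
quartet = pd.concat(
    [pd.DataFrame({"dataset": name, "x": xs, "y": ys}) for name, (xs, ys) in data.items()],
    ignore_index=True,
)

# Pearson correlation of x and y within each dataset, using every row of the group.
correlations = {
    name: group["x"].corr(group["y"])
    for name, group in quartet.groupby("dataset")
}
for name, r in correlations.items():
    print(f"Dataset {name}: r = {r:.3f}")
```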
X_I = sm.add_constant(X1)  # fit a linear regression of y on x
model_I = sm.OLS(Y1, X_I)
result_I = model_I.fit()
params_I = result_I.params
print("DatasetI: y =", params_I[0], "+", params_I[1], "* x")
X_II = sm.add_constant(X2)
model_II = sm.OLS(Y2, X_II)
result_II = model_II.fit()
params_II = result_II.params
print("DatasetII: y =", params_II[0], "+", params_II[1], "* x")
X_III = sm.add_constant(X3)
model_III = sm.OLS(Y3, X_III)
result_III = model_III.fit()
params_III = result_III.params
print("DatasetIII: y =", params_III[0], "+", params_III[1], "* x")
X_IV = sm.add_constant(X4)
model_IV = sm.OLS(Y4, X_IV)
result_IV = model_IV.fit()
params_IV = result_IV.params
print("DatasetIV: y =", params_IV[0], "+", params_IV[1], "* x")
DatasetI: y = 2.9018181818181796 + 0.5086363636363637 * x
DatasetII: y = 3.4175974025974023 + 0.4637662337662336 * x
DatasetIII: y = 2.877099567099565 + 0.5106277056277057 * x
DatasetIV: y = 3.023030303030303 + 0.49878787878787884 * x
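The four regression cells repeat the same steps, so they can be folded into one loop. The sketch below makes two assumptions: the data is rebuilt inline in a hypothetical `data` dict, and `np.polyfit` with `deg=1` is swapped in for `sm.OLS` (for a straight-line fit, both give the ordinary least-squares coefficients). Fitted on all 11 rows, each dataset yields a line close to y = 3.00 + 0.50 * x.

```python
import numpy as np

# Anscombe's quartet, copied from the table printed earlier.
# `data` is a hypothetical inline stand-in for the contents of data.csv.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x values of datasets I-III
data = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fits = {}
for name, (xs, ys) in data.items():
    # polyfit returns coefficients from the highest power down: [slope, intercept].
    slope, intercept = np.polyfit(xs, ys, deg=1)
    fits[name] = (intercept, slope)
    print(f"Dataset {name}: y = {intercept:.2f} + {slope:.3f} * x")
```

Near-identical lines across four visibly different point clouds are the reason the exercise asks for a plot rather than summary numbers alone.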
sns.set(style='whitegrid')
g = sns.FacetGrid(anascombe, col="dataset", hue="dataset", height=3)  # `size` was renamed to `height` in seaborn 0.9
g.map(plt.scatter, 'x', 'y')
plt.show()