# Correlation between two features
pd.DataFrame({"full_data": p1,"red_data": p2}).corr()
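The one-liner above does not define `p1` and `p2`. A minimal sketch, assuming they are two equal-length numeric sequences (for example, predictions from two models); the values below are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-ins for p1 and p2 (not defined in the post):
# two equal-length sequences of numbers.
p1 = [1.0, 2.0, 3.0, 4.0, 5.0]
p2 = [1.1, 1.9, 3.2, 3.8, 5.1]

# Each sequence becomes a column; corr() returns the 2x2 Pearson correlation matrix.
corr_matrix = pd.DataFrame({"full_data": p1, "red_data": p2}).corr()
print(corr_matrix)

# The off-diagonal entry is the correlation between the two columns.
r = corr_matrix.loc["full_data", "red_data"]
```

The diagonal is always 1.0 (each column against itself), so the off-diagonal value is the number of interest.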
Analyze the correlations among the features in the provided file:
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | LandContour | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | MasVnrArea | ExterQual | ExterCond | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | EnclosedPorch | MiscVal | MoSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60 | RL | 65 | 8450 | Lvl | 5 | 2003 | 2003 | Gable | 196 | Gd | TA | GasA | Ex | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 0 | 0 | 2 | 208500 |
| 2 | 20 | RL | 80 | 9600 | Lvl | 8 | 1976 | 1976 | Gable | 0 | TA | TA | GasA | Ex | Y | SBrkr | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 0 | 5 | 181500 |
| 3 | 60 | RL | 68 | 11250 | Lvl | 5 | 2001 | 2002 | Gable | 162 | Gd | TA | GasA | Ex | Y | SBrkr | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 0 | 0 | 9 | 223500 |
| 4 | 70 | RL | 60 | 9550 | Lvl | 5 | 1915 | 1970 | Gable | 0 | TA | TA | GasA | Gd | Y | SBrkr | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 272 | 0 | 2 | 140000 |
| 5 | 60 | RL | 84 | 14260 | Lvl | 5 | 2000 | 2000 | Gable | 350 | Gd | TA | GasA | Ex | Y | SBrkr | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 0 | 0 | 12 | 250000 |
| 6 | 50 | RL | 85 | 14115 | Lvl | 5 | 1993 | 1995 | Gable | 0 | TA | TA | GasA | Ex | Y | SBrkr | 796 | 566 | 0 | 1362 | 1 | 0 | 1 | 0 | 700 | 10 | 143000 |
| 7 | 20 | RL | 75 | 10084 | Lvl | 5 | 2004 | 2005 | Gable | 186 | Gd | TA | GasA | Ex | Y | SBrkr | 1694 | 0 | 0 | 1694 | 1 | 0 | 2 | 0 | 0 | 8 | 307000 |
| 8 | 60 | RL | NA | 10382 | Lvl | 6 | 1973 | 1973 | Gable | 240 | TA | TA | GasA | Ex | Y | SBrkr | 1107 | 983 | 0 | 2090 | 1 | 0 | 2 | 228 | 350 | 11 | 200000 |
| 9 | 50 | RM | 51 | 6120 | Lvl | 5 | 1931 | 1950 | Gable | 0 | TA | TA | GasA | Gd | Y | FuseF | 1022 | 752 | 0 | 1774 | 0 | 0 | 2 | 205 | 0 | 4 | 129900 |
| 10 | 190 | RL | 50 | 7420 | Lvl | 6 | 1939 | 1950 | Gable | 0 | TA | TA | GasA | Ex | Y | SBrkr | 1077 | 0 | 0 | 1077 | 1 | 0 | 1 | 0 | 0 | 1 | 118000 |
| 11 | 20 | RL | 70 | 11200 | Lvl | 5 | 1965 | 1965 | Hip | 0 | TA | TA | GasA | Ex | Y | SBrkr | 1040 | 0 | 0 | 1040 | 1 | 0 | 1 | 0 | 0 | 2 | 129500 |
| 12 | 60 | RL | 85 | 11924 | Lvl | 5 | 2005 | 2006 | Hip | 286 | Ex | TA | GasA | Ex | Y | SBrkr | 1182 | 1142 | 0 | 2324 | 1 | 0 | 3 | 0 | 0 | 7 | 345000 |
| 13 | 20 | RL | NA | 12968 | Lvl | 6 | 1962 | 1962 | Hip | 0 | TA | TA | GasA | TA | Y | SBrkr | 912 | 0 | 0 | 912 | 1 | 0 | 1 | 0 | 0 | 9 | 144000 |
| 14 | 20 | RL | 91 | 10652 | Lvl | 5 | 2006 | 2007 | Gable | 306 | Gd | TA | GasA | Ex | Y | SBrkr | 1494 | 0 | 0 | 1494 | 0 | 0 | 2 | 0 | 0 | 8 | 279500 |
| 15 | 20 | RL | NA | 10920 | Lvl | 5 | 1960 | 1960 | Hip | 212 | TA | TA | GasA | TA | Y | SBrkr | 1253 | 0 | 0 | 1253 | 1 | 0 | 1 | 176 | 0 | 5 | 157000 |
| 16 | 45 | RM | 51 | 6120 | Lvl | 8 | 1929 | 2001 | Gable | 0 | TA | TA | GasA | Ex | Y | FuseA | 854 | 0 | 0 | 854 | 0 | 0 | 1 | 0 | 0 | 7 | 132000 |
| 17 | 20 | RL | NA | 11241 | Lvl | 7 | 1970 | 1970 | Gable | 180 | TA | TA | GasA | Ex | Y | SBrkr | 1004 | 0 | 0 | 1004 | 1 | 0 | 1 | 0 | 700 | 3 | 149000 |
| 18 | 90 | RL | 72 | 10791 | Lvl | 5 | 1967 | 1967 | Gable | 0 | TA | TA | GasA | TA | Y | SBrkr | 1296 | 0 | 0 | 1296 | 0 | 0 | 2 | 0 | 500 | 10 | 90000 |
| 19 | 20 | RL | 66 | 13695 | Lvl | 5 | 2004 | 2004 | Gable | 0 | TA | TA | GasA | Ex | Y | SBrkr | 1114 | 0 | 0 | 1114 | 1 | 0 | 1 | 0 | 0 | 6 | 159000 |
| 20 | 20 | RL | 70 | 7560 | Lvl | 6 | 1958 | 1965 | Hip | 0 | TA | TA | GasA | TA | Y | SBrkr | 1339 | 0 | 0 | 1339 | 0 | 0 | 1 | 0 | 0 | 5 | 139000 |
| 21 | 60 | RL | 101 | 14215 | Lvl | 5 | 2005 | 2006 | Gable | 380 | Gd | TA | GasA | Ex | Y | SBrkr | 1158 | 1218 | 0 | 2376 | 0 | 0 | 3 | 0 | 0 | 11 | 325300 |
| 22 | 45 | RM | 57 | 7449 | Bnk | 7 | 1930 | 1950 | Gable | 0 | TA | TA | GasA | Ex | Y | FuseF | 1108 | 0 | 0 | 1108 | 0 | 0 | 1 | 205 | 0 | 6 | 139400 |
| 23 | 20 | RL | 75 | 9742 | Lvl | 5 | 2002 | 2002 | Hip | 281 | Gd | TA | GasA | Ex | Y | SBrkr | 1795 | 0 | 0 | 1795 | 0 | 0 | 2 | 0 | 0 | 9 | 230000 |
| 24 | 120 | RM | 44 | 4224 | Lvl | 7 | 1976 | 1976 | Gable | 0 | TA | TA | GasA | TA | Y | SBrkr | 1060 | 0 | 0 | 1060 | 1 | 0 | 1 | 0 | 0 | 6 | 129900 |
| 25 | 20 | RL | NA | 8246 | Lvl | 8 | 1968 | 2001 | Gable | 0 | TA | Gd | GasA | Ex | Y | SBrkr | 1060 | 0 | 0 | 1060 | 1 | 0 | 1 | 0 | 0 | 5 | 154000 |
| 26 | 20 | RL | 110 | 14230 | Lvl | 5 | 2007 | 2007 | Gable | 640 | Gd | TA | GasA | Ex | Y | SBrkr | 1600 | 0 | 0 | 1600 | 0 | 0 | 2 | 0 | 0 | 7 | 256300 |
| 27 | 20 | RL | 60 | 7200 | Lvl | 7 | 1951 | 2000 | Gable | 0 | TA | TA | GasA | TA | Y | SBrkr | 900 | 0 | 0 | 900 | 0 | 1 | 1 | 0 | 0 | 5 | 134800 |
| 28 | 20 | RL | 98 | 11478 | Lvl | 5 | 2007 | 2008 | Gable | 200 | Gd | TA | GasA | Ex | Y | SBrkr | 1704 | 0 | 0 | 1704 | 1 | 0 | 2 | 0 | 0 | 5 | 306000 |
| 29 | 20 | RL | 47 | 16321 | Lvl | 6 | 1957 | 1997 | Gable | 0 | TA | TA | GasA | TA | Y | SBrkr | 1600 | 0 | 0 | 1600 | 1 | 0 | 1 | 0 | 0 | 12 | 207500 |
| 30 | 30 | RM | 60 | 6324 | Lvl | 6 | 1927 | 1950 | Gable | 0 | TA | TA | GasA | Fa | N | SBrkr | 520 | 0 | 0 | 520 | 0 | 0 | 1 | 87 | 0 | 5 | 68500 |
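1. See which variables are strongly related to the prediction target (in this analysis, SalePrice).
2. See which variables are strongly related to each other.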
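import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from scipy.stats import norm
# Load the provided training file
df_train = pd.read_csv('train1.csv')
print(df_train)
# Pairwise Pearson correlations; numeric_only=True skips text columns such as MSZoning,
# which would otherwise raise an error in pandas 2.x
corrmat = df_train.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)
plt.show()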
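The output, shown in the figure below, is the correlation between every pair of features.
1. The last row shows which features are most strongly correlated with SalePrice, for example GrLivArea.
2. YearBuilt and YearRemodAdd are strongly correlated with each other, so when there are many features it is worth considering keeping just one of the two.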
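The second observation can also be checked numerically rather than by reading colors off the heatmap. The sketch below is not from the original post: `strong_pairs` is a hypothetical helper and the 0.8 threshold is an arbitrary choice. The sample values are the first six rows of the table above.

```python
import numpy as np
import pandas as pd

def strong_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.corr(numeric_only=True)
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (a column against itself) is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked cells) compare as False and are filtered out here.
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, r) for (a, b), r in pairs.items()]

# First six rows of the table above, three columns only.
sample = pd.DataFrame({
    "YearBuilt": [2003, 1976, 2001, 1915, 2000, 1993],
    "YearRemodAdd": [2003, 1976, 2002, 1970, 2000, 1995],
    "MoSold": [2, 5, 9, 2, 12, 10],
})
result = strong_pairs(sample, threshold=0.8)
for a, b, r in result:
    print(f"{a} vs {b}: r = {r:.2f}")
```

On the full data, the same call would be `strong_pairs(df_train)`; pairs it reports are candidates for dropping one of the two features.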
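# Number of columns to keep; SalePrice itself is included, so this is the target plus the top 4 features
k = 5
# Columns with the largest correlation to SalePrice, in descending order
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
# Correlation matrix restricted to those columns
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()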
This selects the features most strongly correlated with SalePrice and sorts them.
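One caveat: `nlargest` ranks by signed correlation, so a feature with a strong negative correlation would land at the bottom of the list. A minimal sketch of ranking by absolute value instead, using the first six rows of the table above for three features (the variable names are illustrative, not from the original post):

```python
import pandas as pd

# First six rows of the table above, a few columns only.
sample = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717, 2198, 1362],
    "EnclosedPorch": [0, 0, 0, 272, 0, 0],
    "MoSold": [2, 5, 9, 2, 12, 10],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})

# Correlation of every feature with the target, with the target itself removed.
corr_with_target = sample.corr()["SalePrice"].drop("SalePrice")

# Rank by strength of the relationship, regardless of sign.
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

Six rows are far too few to draw conclusions from; the point is only the ranking pattern, which applies unchanged to `corrmat['SalePrice']`.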
Check a feature for outliers:
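# Scatter plot of one feature against SalePrice; isolated points far from the main cloud are outlier candidates
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000))
plt.show()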
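The scatter plot is a visual check. If a numeric rule is wanted as well, one common option is the 1.5 × IQR rule; this is an assumption on my part, not the method used in the original post, and `iqr_outliers` is a hypothetical helper. The living areas below are made up, with one deliberately extreme value.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Boolean mask: True where a value lies more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up living areas; the last value is deliberately extreme.
areas = pd.Series([1200, 1350, 1500, 1600, 1710, 1800, 5600], name="GrLivArea")
mask = iqr_outliers(areas)
print(areas[mask])
```

Flagged rows still deserve a look before removal: a very large house that sold for a correspondingly high price follows the trend and may be worth keeping.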
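An important step is to transform variables that are not normally distributed so that they are closer to a normal distribution.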
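# Histogram of SalePrice with a fitted normal curve, followed by a normal probability plot
# Note: distplot is deprecated since seaborn 0.11; histplot/displot are its replacements
sns.distplot(df_train['SalePrice'], fit=norm)
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)
plt.show()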
The plot shows that the distribution of 'SalePrice' is positively skewed. For positive skew, taking the logarithm is a suitable transformation:
# Replace SalePrice with its natural log, then redraw both plots
df_train['SalePrice'] = np.log(df_train['SalePrice'])
sns.distplot(df_train['SalePrice'], fit=norm)
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)
plt.show()
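The effect of the log transform can also be measured with a skewness statistic instead of judged by eye. A minimal sketch on synthetic data: the log-normal "prices" below are generated for illustration and only stand in for SalePrice.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices" (log-normal), standing in for SalePrice.
prices = rng.lognormal(mean=12.0, sigma=0.4, size=5000)

# Sample skewness: clearly positive for a right-skewed variable, near 0 for a symmetric one.
skew_before = skew(prices)
skew_after = skew(np.log(prices))
print(f"skewness before log: {skew_before:.2f}, after log: {skew_after:.2f}")
```

On the real data, `df_train['SalePrice'].skew()` before and after the transform gives the same kind of comparison.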
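Before the transformation, the scatter plot of GrLivArea against the target SalePrice is roughly cone-shaped:
After the transformation it looks much better.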
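This post draws on Yang Xi (杨熹)'s Kaggle competition write-up, "开发者自述:我是如何从 0 到 1 走进 Kaggle 的" (A Developer's Account: How I Got into Kaggle from 0 to 1).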