Correlation matrix

    # Correlation between two features
    pd.DataFrame({"full_data": p1, "red_data": p2}).corr()
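The one-liner above assumes that p1 and p2 already exist as two equal-length sequences. A minimal sketch with made-up values (the numbers are purely illustrative and do not come from the dataset) shows what it returns:

```python
import pandas as pd

# Hypothetical example values: in the original snippet p1 and p2 are
# assumed to be two equal-length numeric sequences.
p1 = [1.0, 2.0, 3.0, 4.0, 5.0]
p2 = [2.1, 3.9, 6.2, 8.1, 9.8]

# Put each sequence in its own column and compute the 2x2 Pearson correlation matrix
pair_corr = pd.DataFrame({"full_data": p1, "red_data": p2}).corr()
print(pair_corr)

# The off-diagonal entry is the correlation between the two columns
r = pair_corr.loc["full_data", "red_data"]
print(r)
```

The diagonal of the result is always 1, because every column is perfectly correlated with itself.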

     

Analyze the correlations among the features in the provided file:

Id,MSSubClass,MSZoning,LotFrontage,LotArea,LandContour,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,MasVnrArea,ExterQual,ExterCond,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,EnclosedPorch,MiscVal,MoSold,SalePrice
1,60,RL,65,8450,Lvl,5,2003,2003,Gable,196,Gd,TA,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,0,0,2,208500
2,20,RL,80,9600,Lvl,8,1976,1976,Gable,0,TA,TA,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,0,5,181500
3,60,RL,68,11250,Lvl,5,2001,2002,Gable,162,Gd,TA,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,0,0,9,223500
4,70,RL,60,9550,Lvl,5,1915,1970,Gable,0,TA,TA,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,272,0,2,140000
5,60,RL,84,14260,Lvl,5,2000,2000,Gable,350,Gd,TA,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,0,0,12,250000
6,50,RL,85,14115,Lvl,5,1993,1995,Gable,0,TA,TA,GasA,Ex,Y,SBrkr,796,566,0,1362,1,0,1,0,700,10,143000
7,20,RL,75,10084,Lvl,5,2004,2005,Gable,186,Gd,TA,GasA,Ex,Y,SBrkr,1694,0,0,1694,1,0,2,0,0,8,307000
8,60,RL,NA,10382,Lvl,6,1973,1973,Gable,240,TA,TA,GasA,Ex,Y,SBrkr,1107,983,0,2090,1,0,2,228,350,11,200000
9,50,RM,51,6120,Lvl,5,1931,1950,Gable,0,TA,TA,GasA,Gd,Y,FuseF,1022,752,0,1774,0,0,2,205,0,4,129900
10,190,RL,50,7420,Lvl,6,1939,1950,Gable,0,TA,TA,GasA,Ex,Y,SBrkr,1077,0,0,1077,1,0,1,0,0,1,118000
11,20,RL,70,11200,Lvl,5,1965,1965,Hip,0,TA,TA,GasA,Ex,Y,SBrkr,1040,0,0,1040,1,0,1,0,0,2,129500
12,60,RL,85,11924,Lvl,5,2005,2006,Hip,286,Ex,TA,GasA,Ex,Y,SBrkr,1182,1142,0,2324,1,0,3,0,0,7,345000
13,20,RL,NA,12968,Lvl,6,1962,1962,Hip,0,TA,TA,GasA,TA,Y,SBrkr,912,0,0,912,1,0,1,0,0,9,144000
14,20,RL,91,10652,Lvl,5,2006,2007,Gable,306,Gd,TA,GasA,Ex,Y,SBrkr,1494,0,0,1494,0,0,2,0,0,8,279500
15,20,RL,NA,10920,Lvl,5,1960,1960,Hip,212,TA,TA,GasA,TA,Y,SBrkr,1253,0,0,1253,1,0,1,176,0,5,157000
16,45,RM,51,6120,Lvl,8,1929,2001,Gable,0,TA,TA,GasA,Ex,Y,FuseA,854,0,0,854,0,0,1,0,0,7,132000
17,20,RL,NA,11241,Lvl,7,1970,1970,Gable,180,TA,TA,GasA,Ex,Y,SBrkr,1004,0,0,1004,1,0,1,0,700,3,149000
18,90,RL,72,10791,Lvl,5,1967,1967,Gable,0,TA,TA,GasA,TA,Y,SBrkr,1296,0,0,1296,0,0,2,0,500,10,90000
19,20,RL,66,13695,Lvl,5,2004,2004,Gable,0,TA,TA,GasA,Ex,Y,SBrkr,1114,0,0,1114,1,0,1,0,0,6,159000
20,20,RL,70,7560,Lvl,6,1958,1965,Hip,0,TA,TA,GasA,TA,Y,SBrkr,1339,0,0,1339,0,0,1,0,0,5,139000
21,60,RL,101,14215,Lvl,5,2005,2006,Gable,380,Gd,TA,GasA,Ex,Y,SBrkr,1158,1218,0,2376,0,0,3,0,0,11,325300
22,45,RM,57,7449,Bnk,7,1930,1950,Gable,0,TA,TA,GasA,Ex,Y,FuseF,1108,0,0,1108,0,0,1,205,0,6,139400
23,20,RL,75,9742,Lvl,5,2002,2002,Hip,281,Gd,TA,GasA,Ex,Y,SBrkr,1795,0,0,1795,0,0,2,0,0,9,230000
24,120,RM,44,4224,Lvl,7,1976,1976,Gable,0,TA,TA,GasA,TA,Y,SBrkr,1060,0,0,1060,1,0,1,0,0,6,129900
25,20,RL,NA,8246,Lvl,8,1968,2001,Gable,0,TA,Gd,GasA,Ex,Y,SBrkr,1060,0,0,1060,1,0,1,0,0,5,154000
26,20,RL,110,14230,Lvl,5,2007,2007,Gable,640,Gd,TA,GasA,Ex,Y,SBrkr,1600,0,0,1600,0,0,2,0,0,7,256300
27,20,RL,60,7200,Lvl,7,1951,2000,Gable,0,TA,TA,GasA,TA,Y,SBrkr,900,0,0,900,0,1,1,0,0,5,134800
28,20,RL,98,11478,Lvl,5,2007,2008,Gable,200,Gd,TA,GasA,Ex,Y,SBrkr,1704,0,0,1704,1,0,2,0,0,5,306000
29,20,RL,47,16321,Lvl,6,1957,1997,Gable,0,TA,TA,GasA,TA,Y,SBrkr,1600,0,0,1600,1,0,1,0,0,12,207500
30,30,RM,60,6324,Lvl,6,1927,1950,Gable,0,TA,TA,GasA,Fa,N,SBrkr,520,0,0,520,0,0,1,87,0,5,68500


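1. Identify which variables are strongly related to the prediction target (in this analysis, SalePrice).

2. Identify which variables are strongly correlated with each other.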

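import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
import scipy.stats as stats

# Load the training data (the table above)
df_train = pd.read_csv('train1.csv')
print(df_train)

# Pairwise Pearson correlations; numeric_only=True skips text columns such as MSZoning
corrmat = df_train.corr(numeric_only=True)

# Draw the whole correlation matrix as a heatmap
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)
plt.show()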

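The output, shown in the figure below, is the correlation between every pair of features.

1. The last row shows which features are most strongly correlated with SalePrice, for example GrLivArea.

2. YearBuilt and YearRemodAdd are strongly correlated with each other, so when there are many features, it can be enough to keep just one of the two.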

[Figure 1: heatmap of the correlation matrix]
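The two readings of the heatmap can also be done numerically. The sketch below uses the first six rows of the table above and only a handful of numeric columns, so the numbers it prints describe that tiny sample, not the full dataset. The 0.8 threshold is an assumed cut-off, not a value from the original post.

```python
from itertools import combinations

import pandas as pd

# First six rows of the table above, a few numeric columns only
df = pd.DataFrame({
    "GrLivArea":    [1710, 1262, 1786, 1717, 2198, 1362],
    "YearBuilt":    [2003, 1976, 2001, 1915, 2000, 1993],
    "YearRemodAdd": [2003, 1976, 2002, 1970, 2000, 1995],
    "OverallCond":  [5, 8, 5, 5, 5, 5],
    "SalePrice":    [208500, 181500, 223500, 140000, 250000, 143000],
})
corrmat = df.corr()

# 1. Features ranked by the absolute value of their correlation with the target
target_corr = corrmat["SalePrice"].drop("SalePrice")
ranked = target_corr.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)

# 2. Pairs of features whose absolute correlation exceeds the threshold
threshold = 0.8  # assumed cut-off for "strongly correlated"
features = [c for c in corrmat.columns if c != "SalePrice"]
strong_pairs = [
    (a, b, corrmat.loc[a, b])
    for a, b in combinations(features, 2)
    if abs(corrmat.loc[a, b]) > threshold
]
print(strong_pairs)
```

Ranking by absolute value matters because a strong negative correlation is as informative as a strong positive one.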

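# Pick the k columns most correlated with SalePrice (SalePrice itself is one of them)
k = 5
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
# Recompute the correlation matrix for just those columns
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
# Annotated heatmap with the coefficient printed in each cell
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()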

[Figure 2: annotated heatmap of the top-k columns]

This takes the features most strongly correlated with SalePrice, sorted in descending order.

Check a feature for outliers:

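# Scatter plot of one feature against SalePrice; points far from the main cloud are candidate outliers
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000))
plt.show()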

[Figure 3: scatter plot of GrLivArea against SalePrice]
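The post identifies outliers by eye from the scatter plot. As a complement, here is a minimal sketch of one common numeric rule, the 1.5 × IQR rule. This rule is swapped in for illustration and is not the method used in the original post; the helper name iqr_outlier_mask and the sample values are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical living-area values with one obviously extreme entry
area = pd.Series([1000, 1100, 1200, 1300, 1400, 5000], name="GrLivArea")
mask = iqr_outlier_mask(area)

# Show only the flagged rows
print(area[mask])
```

A flagged point is only a candidate: whether to drop it is still a judgment call that the scatter plot helps with.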

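An important step is to transform variables that are not normally distributed so that they come closer to a normal distribution.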

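# Histogram of SalePrice with a fitted normal curve
# (sns.distplot is deprecated in recent seaborn releases and may emit a warning)
sns.distplot(df_train['SalePrice'], fit=norm)
# Normal probability plot in a separate figure
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)
plt.show()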

[Figure 4: distribution and normal probability plot of SalePrice]

The plots show that 'SalePrice' is positively skewed. For a positively skewed variable, taking the logarithm brings the distribution closer to normal:

# Replace SalePrice with its natural logarithm, then redraw both plots
df_train['SalePrice'] = np.log(df_train['SalePrice'])
sns.distplot(df_train['SalePrice'], fit=norm)
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)
plt.show()

[Figure 5: distribution and normal probability plot of log(SalePrice)]
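The effect of the log transform can also be checked with a number instead of a plot. The sketch below measures sample skewness before and after the transform. It uses synthetic log-normally distributed prices as a stand-in (an assumption made so the example is self-contained); with the real data one would pass df_train['SalePrice'] instead.

```python
import numpy as np
import pandas as pd

# Synthetic, positively skewed "prices" (assumed stand-in for SalePrice)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000))

# Sample skewness: positive means a long right tail, near 0 means roughly symmetric
skew_raw = prices.skew()
skew_log = np.log(prices).skew()

print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_log:.2f}")
```

A skewness that moves from clearly positive toward zero is the numeric counterpart of the histogram looking more bell-shaped.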

Before the transformation, the scatter plot of GrLivArea against the target SalePrice is roughly cone-shaped:

[Figure 6: GrLivArea against SalePrice before the transformation]

After the transformation it looks much better:

[Figure 7: GrLivArea against SalePrice after the transformation]
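A cone shape means that the spread of prices grows with living area. One rough way to quantify this is to fit a straight line and compare the residual spread for large and small houses. The sketch below does so on synthetic data with multiplicative noise. The data, the helper spread_ratio, and the choice to take the log of both variables are assumptions for illustration: the code in this post only shows SalePrice being transformed.

```python
import numpy as np

# Synthetic stand-in data (not the real dataset): price proportional to area,
# with multiplicative noise, which produces a cone-shaped scatter
rng = np.random.default_rng(0)
area = rng.uniform(500, 4000, size=2000)
price = 100.0 * area * rng.lognormal(mean=0.0, sigma=0.25, size=area.size)

def spread_ratio(x: np.ndarray, y: np.ndarray) -> float:
    """Fit y ~ a*x + b, then divide the residual standard deviation of the
    upper half of x by that of the lower half."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    upper = x > np.median(x)
    return resid[upper].std() / resid[~upper].std()

ratio_raw = spread_ratio(area, price)                  # original scale
ratio_log = spread_ratio(np.log(area), np.log(price))  # both variables logged

print(f"spread ratio before log: {ratio_raw:.2f}")
print(f"spread ratio after log:  {ratio_log:.2f}")
```

A ratio well above 1 indicates the cone; a ratio near 1 indicates roughly constant spread across the range of areas.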

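This post draws on Yang Xi's (杨熹) Kaggle competition summary, "A developer's account: how I went from 0 to 1 into Kaggle" (开发者自述:我是如何从 0 到 1 走进 Kaggle 的).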
