How Feature Multicollinearity Affects the Predictive Performance of Random Forest Models

Does feature collinearity affect the predictive performance of a random forest model?

Why do we care about feature collinearity?

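Feature collinearity means that features in a dataset track each other too closely, i.e. they are highly correlated. Examples: rainfall and the size of the cloud mass, or fabric fibre and water absorbency.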
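To make "highly correlated" concrete, here is a minimal sketch on synthetic data. The `rain`, `cloud` and `wind` columns and the `vif` helper are made up for illustration and are not part of the mortgage dataset used later. It computes the pairwise correlation matrix and a hand-rolled variance inflation factor (VIF), a common diagnostic that regresses each feature on all the others.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
rain = rng.normal(50.0, 10.0, n)               # hypothetical rainfall
cloud = 2.0 * rain + rng.normal(0.0, 2.0, n)   # almost a linear function of rain
wind = rng.normal(10.0, 3.0, n)                # unrelated to the other two

X = np.column_stack([rain, cloud, wind])
corr = np.corrcoef(X, rowvar=False)            # columns are the variables


def vif(X, j):
    """VIF of column j: 1 / (1 - R^2), with R^2 from regressing column j on the rest."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)


vifs = [vif(X, j) for j in range(X.shape[1])]
print(np.round(corr, 3))
print(np.round(vifs, 1))
```

A VIF near 1 means a feature is not explained by the others; large values (a threshold of 5 or 10 is a common rule of thumb) flag collinear features.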

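In machine learning models, however, feature collinearity is usually considered a bad thing. It can bias a model toward some of the correlated features and lose information as a result, especially in regression tasks with many features.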

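In practice, though, feature collinearity has essentially no effect on the predictive performance of a random forest model. This post looks at how collinear features affect a random forest.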


# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

import warnings
warnings.filterwarnings('ignore')
# Show the current working directory
%pwd
'D:\\python code\\9日常\\--------20210723特征共线对随机森林模型的影响--------\\TheDataVolcano-master'
# Load the data; the dataset is available at the following address
# https://catalog.data.gov/dataset/state-of-new-york-mortgage-agency-sonyma-loans-purchased-beginning-2004
df=pd.read_csv('./datasets/State_of_New_York_Mortgage_Agency.csv')
df.info()

RangeIndex: 28528 entries, 0 to 28527
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Bond Series              28528 non-null  object
 1   Original Loan Amount     28528 non-null  object
 2   Loan Purchase Date       28528 non-null  object
 3   Purchase Year            28528 non-null  int64 
 4   Original Loan To Value   28528 non-null  object
 5   Loan Type                28528 non-null  object
 6   SONYMA DPAL/CCAL Amount  21012 non-null  object
 7   Original Term            28528 non-null  int64 
 8   County                   28528 non-null  object
 9   FIPS Code                28528 non-null  int64 
 10  Number of Units          28528 non-null  object
 11  Property Type            28528 non-null  object
 12  Housing Type             28528 non-null  object
 13  Household Size           28528 non-null  int64 
dtypes: int64(4), object(10)
memory usage: 3.0+ MB
df.head(10)
 | Bond Series | Original Loan Amount | Loan Purchase Date | Purchase Year | Original Loan To Value | Loan Type | SONYMA DPAL/CCAL Amount | Original Term | County | FIPS Code | Number of Units | Property Type | Housing Type | Household Size
0 | Series 109/110 | $32470 | 01/02/2004 | 2004 | 97% | Conventional | $2933 | 360 | Monroe | 36055 | 1 Family | Detached | Existing | 1
1 | Series 109/110 | $48500 | 01/02/2004 | 2004 | 97% | Conventional | $3435 | 360 | Genesee | 36037 | 1 Family | Detached | Existing | 4
2 | Series 109/110 | $49470 | 01/02/2004 | 2004 | 97% | Conventional | $4996 | 360 | Monroe | 36055 | 1 Family | Detached | Existing | 3
3 | Series 109/110 | $58200 | 01/02/2004 | 2004 | 97% | Conventional | $4170 | 360 | Erie | 36029 | 1 Family | Detached | Existing | 2
4 | Series 109/110 | $64990 | 01/02/2004 | 2004 | 97% | Conventional | $4940 | 360 | Erie | 36029 | 1 Family | Detached | Existing | 3
5 | Series 109/110 | $64990 | 01/02/2004 | 2004 | 97% | Conventional | $4772 | 360 | Schenectady | 36093 | 1 Family | Detached | Existing | 1
6 | Series 109/110 | $67900 | 01/02/2004 | 2004 | 97% | Conventional | $5000 | 360 | Orleans | 36073 | 1 Family | Detached | Existing | 3
7 | Series 109/110 | $67900 | 01/02/2004 | 2004 | 97% | Conventional | $4845 | 360 | Wayne | 36117 | 1 Family | Detached | Existing | 2
8 | Series 109/110 | $72775 | 01/02/2004 | 2004 | 97% | Conventional | $5000 | 360 | Monroe | 36055 | 1 Family | Detached | Existing | 2
9 | Series 109/110 | $77115 | 01/02/2004 | 2004 | 97% | Conventional | $5000 | 360 | Broome | 36007 | 1 Family | Detached | Existing | 2
df.columns
Index(['Bond Series', 'Original Loan Amount', 'Loan Purchase Date ',
       'Purchase Year', 'Original Loan To Value', 'Loan Type ',
       'SONYMA DPAL/CCAL Amount', 'Original Term', 'County', 'FIPS Code',
       'Number of Units', 'Property Type', 'Housing Type', 'Household Size '],
      dtype='object')
dfmod=df[['Original Loan Amount', 'Purchase Year', 'Original Loan To Value', 'SONYMA DPAL/CCAL Amount', 'Number of Units', \
'Household Size ', 'Property Type', 'County', 'Housing Type', 'Bond Series', 'Original Term']]

# turn off warnings on the slice operation we do below. 
# This is a unique factorize problem because it returns a tuple, sigh
# https://stackoverflow.com/questions/45080400/dealing-with-pandas-settingwithcopywarning-without-indexer
pd.options.mode.chained_assignment = None 

# factorize changes features from like 'condo' and 'house' to numeric (1, 2, etc.) so our model can handle it
stacked = dfmod[['Property Type', 'County', 'Housing Type', 'Bond Series']].stack()
dfmod[['Property Type', 'County', 'Housing Type', 'Bond Series']] = pd.Series(stacked.factorize()[0], \
                                                                              index=stacked.index).unstack()

# use regex replace to fix some of the columns that have partial numeric, partial text values
# raw strings avoid invalid-escape warnings for the regex patterns
dfmod=dfmod.replace(r'[\$,]', '', regex=True)
dfmod=dfmod.replace(r'[%,]', '', regex=True)
dfmod=dfmod.replace('Family', '', regex=True)

# need to convert to float
dfmod=dfmod.astype(float)
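The `stack` / `factorize` / `unstack` step above is easy to misread, so here is a small sketch of what it does on a toy frame. The `toy` values are made up for illustration. Note that the codes come from one shared pool across all the stacked columns, and that they are arbitrary integer labels with no meaningful order.

```python
import pandas as pd

# toy stand-in for the categorical columns (values are illustrative only)
toy = pd.DataFrame({
    "Property Type": ["Detached", "Condo", "Detached"],
    "Housing Type": ["Existing", "New", "Existing"],
})

stacked = toy.stack()                    # one long Series indexed by (row, column)
codes, uniques = stacked.factorize()     # integer code per distinct string, in order of appearance
encoded = pd.Series(codes, index=stacked.index).unstack()   # back to the original shape

print(list(uniques))
print(encoded)
```

Tree-based models can usually cope with such integer codes, since they only split on thresholds. For linear models, one-hot encoding would be the more usual choice.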

Which features in the original dataset are correlated?

dfmod.columns
Index(['Original Loan Amount', 'Purchase Year', 'Original Loan To Value',
       'SONYMA DPAL/CCAL Amount', 'Number of Units', 'Household Size ',
       'Property Type', 'County', 'Housing Type', 'Bond Series',
       'Original Term'],
      dtype='object')
dfmod.info()

RangeIndex: 28528 entries, 0 to 28527
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Original Loan Amount     28528 non-null  float64
 1   Purchase Year            28528 non-null  float64
 2   Original Loan To Value   28528 non-null  float64
 3   SONYMA DPAL/CCAL Amount  21012 non-null  float64
 4   Number of Units          28528 non-null  float64
 5   Household Size           28528 non-null  float64
 6   Property Type            28528 non-null  float64
 7   County                   28528 non-null  float64
 8   Housing Type             28528 non-null  float64
 9   Bond Series              28528 non-null  float64
 10  Original Term            28528 non-null  float64
dtypes: float64(11)
memory usage: 2.4 MB
dfmod.head(10)
 | Original Loan Amount | Purchase Year | Original Loan To Value | SONYMA DPAL/CCAL Amount | Number of Units | Household Size | Property Type | County | Housing Type | Bond Series | Original Term
0 | 32470.0 | 2004.0 | 97.0 | 2933.0 | 1.0 | 1.0 | 0.0 | 1.0 | 2.0 | 3.0 | 360.0
1 | 48500.0 | 2004.0 | 97.0 | 3435.0 | 1.0 | 4.0 | 0.0 | 4.0 | 2.0 | 3.0 | 360.0
2 | 49470.0 | 2004.0 | 97.0 | 4996.0 | 1.0 | 3.0 | 0.0 | 1.0 | 2.0 | 3.0 | 360.0
3 | 58200.0 | 2004.0 | 97.0 | 4170.0 | 1.0 | 2.0 | 0.0 | 5.0 | 2.0 | 3.0 | 360.0
4 | 64990.0 | 2004.0 | 97.0 | 4940.0 | 1.0 | 3.0 | 0.0 | 5.0 | 2.0 | 3.0 | 360.0
5 | 64990.0 | 2004.0 | 97.0 | 4772.0 | 1.0 | 1.0 | 0.0 | 6.0 | 2.0 | 3.0 | 360.0
6 | 67900.0 | 2004.0 | 97.0 | 5000.0 | 1.0 | 3.0 | 0.0 | 7.0 | 2.0 | 3.0 | 360.0
7 | 67900.0 | 2004.0 | 97.0 | 4845.0 | 1.0 | 2.0 | 0.0 | 8.0 | 2.0 | 3.0 | 360.0
8 | 72775.0 | 2004.0 | 97.0 | 5000.0 | 1.0 | 2.0 | 0.0 | 1.0 | 2.0 | 3.0 | 360.0
9 | 77115.0 | 2004.0 | 97.0 | 5000.0 | 1.0 | 2.0 | 0.0 | 9.0 | 2.0 | 3.0 | 360.0
# Plot a heatmap
from mlxtend.plotting import heatmap

cols = ['Original Loan Amount', 'Purchase Year', 'Original Loan To Value',\
       'SONYMA DPAL/CCAL Amount', 'Number of Units', 'Household Size ',\
       'Property Type', 'County', 'Housing Type', 'Bond Series',\
       'Original Term']
cm = np.corrcoef(dfmod[cols].values.T)
"""
The nan values in the figure below appear because 'SONYMA DPAL/CCAL Amount' contains null values
"""
hm = heatmap(cm, row_names=cols, column_names=cols, figsize=(12, 12))

# Save the figure
plt.savefig('./heatmaps.png', dpi=300)
plt.show()

[Figure 1: heatmap of the correlation coefficients between the features in dfmod]
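The nan cells in the heatmap come from how `np.corrcoef` treats missing values. A minimal sketch on a made-up toy frame: `np.corrcoef` propagates NaN to every entry that involves the incomplete column, while pandas' `DataFrame.corr` skips missing values pair by pair. Dropping incomplete rows first is another option.

```python
import numpy as np
import pandas as pd

# toy data with a single missing value in column "c" (values are illustrative only)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 5.9, 8.2, 10.0],
    "c": [1.0, np.nan, 2.5, 4.0, 5.5],
})

cm_numpy = np.corrcoef(toy.values.T)            # NaN wherever column "c" is involved
cm_pandas = toy.corr().values                   # pairwise-complete observations
cm_dropna = np.corrcoef(toy.dropna().values.T)  # complete rows only

print(np.round(cm_numpy, 3))
print(np.round(cm_pandas, 3))
print(np.round(cm_dropna, 3))
```

So passing `dfmod.corr().values` (or `dfmod.dropna()`) to the heatmap would be one way to avoid the nan cells.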

# test for correlations 
corrDF=dfmod.corr()
corrDF

 | Original Loan Amount | Purchase Year | Original Loan To Value | SONYMA DPAL/CCAL Amount | Number of Units | Household Size | Property Type | County | Housing Type | Bond Series | Original Term
Original Loan Amount | 1.000000 | 0.337831 | -0.056902 | 0.662054 | 0.112723 | 0.238369 | 0.101085 | 0.232890 | 0.124133 | 0.325947 | 0.184459
Purchase Year | 0.337831 | 1.000000 | -0.152347 | -0.062682 | -0.005365 | 0.073343 | 0.149763 | 0.105100 | 0.113273 | 0.922574 | 0.058512
Original Loan To Value | -0.056902 | -0.152347 | 1.000000 | -0.091755 | 0.028671 | -0.033281 | -0.294189 | -0.189966 | -0.294904 | -0.167013 | -0.002098
SONYMA DPAL/CCAL Amount | 0.662054 | -0.062682 | -0.091755 | 1.000000 | 0.050282 | 0.202176 | 0.025931 | 0.160047 | 0.152185 | -0.078872 | 0.199689
Number of Units | 0.112723 | -0.005365 | 0.028671 | 0.050282 | 1.000000 | -0.004223 | -0.003898 | -0.027416 | 0.013539 | -0.006489 | -0.002910
Household Size | 0.238369 | 0.073343 | -0.033281 | 0.202176 | -0.004223 | 1.000000 | -0.043792 | 0.107912 | 0.085088 | 0.073733 | 0.076269
Property Type | 0.101085 | 0.149763 | -0.294189 | 0.025931 | -0.003898 | -0.043792 | 1.000000 | 0.197686 | 0.224163 | 0.158472 | 0.030414
County | 0.232890 | 0.105100 | -0.189966 | 0.160047 | -0.027416 | 0.107912 | 0.197686 | 1.000000 | 0.174262 | 0.115525 | 0.050321
Housing Type | 0.124133 | 0.113273 | -0.294904 | 0.152185 | 0.013539 | 0.085088 | 0.224163 | 0.174262 | 1.000000 | 0.134284 | 0.046703
Bond Series | 0.325947 | 0.922574 | -0.167013 | -0.078872 | -0.006489 | 0.073733 | 0.158472 | 0.115525 | 0.134284 | 1.000000 | 0.054699
Original Term | 0.184459 | 0.058512 | -0.002098 | 0.199689 | -0.002910 | 0.076269 | 0.030414 | 0.050321 | 0.046703 | 0.054699 | 1.000000

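As the table shows, most feature pairs are not strongly correlated. The exception is 'Purchase Year' and 'Bond Series', with a coefficient of 0.92; the next strongest pair is 'Original Loan Amount' and 'SONYMA DPAL/CCAL Amount', at 0.66.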

Make up some strongly correlated fake data

Generate a new column, 'Grandmas Loan Agency', that is highly correlated with the column 'SONYMA DPAL/CCAL Amount'. The result is shown below:

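# note: despite the name, these multipliers are an evenly spaced ramp
# from 0.9 to 1.1 over the rows, not random draws
randoms = np.linspace(0.9, 1.1, len(dfmod))
dfmod['Grandmas Loan Agency']=dfmod['SONYMA DPAL/CCAL Amount']*randoms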
corrDF=dfmod.corr()
corrDF
 | Original Loan Amount | Purchase Year | Original Loan To Value | SONYMA DPAL/CCAL Amount | Number of Units | Household Size | Property Type | County | Housing Type | Bond Series | Original Term | Grandmas Loan Agency
Original Loan Amount | 1.000000 | 0.337831 | -0.056902 | 0.662054 | 0.112723 | 0.238369 | 0.101085 | 0.232890 | 0.124133 | 0.325947 | 0.184459 | 0.683658
Purchase Year | 0.337831 | 1.000000 | -0.152347 | -0.062682 | -0.005365 | 0.073343 | 0.149763 | 0.105100 | 0.113273 | 0.922574 | 0.058512 | 0.030616
Original Loan To Value | -0.056902 | -0.152347 | 1.000000 | -0.091755 | 0.028671 | -0.033281 | -0.294189 | -0.189966 | -0.294904 | -0.167013 | -0.002098 | -0.105809
SONYMA DPAL/CCAL Amount | 0.662054 | -0.062682 | -0.091755 | 1.000000 | 0.050282 | 0.202176 | 0.025931 | 0.160047 | 0.152185 | -0.078872 | 0.199689 | 0.993475
Number of Units | 0.112723 | -0.005365 | 0.028671 | 0.050282 | 1.000000 | -0.004223 | -0.003898 | -0.027416 | 0.013539 | -0.006489 | -0.002910 | 0.048050
Household Size | 0.238369 | 0.073343 | -0.033281 | 0.202176 | -0.004223 | 1.000000 | -0.043792 | 0.107912 | 0.085088 | 0.073733 | 0.076269 | 0.207818
Property Type | 0.101085 | 0.149763 | -0.294189 | 0.025931 | -0.003898 | -0.043792 | 1.000000 | 0.197686 | 0.224163 | 0.158472 | 0.030414 | 0.038477
County | 0.232890 | 0.105100 | -0.189966 | 0.160047 | -0.027416 | 0.107912 | 0.197686 | 1.000000 | 0.174262 | 0.115525 | 0.050321 | 0.167676
Housing Type | 0.124133 | 0.113273 | -0.294904 | 0.152185 | 0.013539 | 0.085088 | 0.224163 | 0.174262 | 1.000000 | 0.134284 | 0.046703 | 0.167975
Bond Series | 0.325947 | 0.922574 | -0.167013 | -0.078872 | -0.006489 | 0.073733 | 0.158472 | 0.115525 | 0.134284 | 1.000000 | 0.054699 | 0.009896
Original Term | 0.184459 | 0.058512 | -0.002098 | 0.199689 | -0.002910 | 0.076269 | 0.030414 | 0.050321 | 0.046703 | 0.054699 | 1.000000 | 0.207590
Grandmas Loan Agency | 0.683658 | 0.030616 | -0.105809 | 0.993475 | 0.048050 | 0.207818 | 0.038477 | 0.167676 | 0.167975 | 0.009896 | 0.207590 | 1.000000
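The multipliers used above are an evenly spaced ramp rather than random numbers. If genuinely random noise is wanted, a variant along the following lines should give a similarly high correlation. This is only a sketch: `amount` is a synthetic stand-in for the 'SONYMA DPAL/CCAL Amount' column, and the seed and value range are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# synthetic stand-in for the real column (range chosen arbitrarily)
amount = pd.Series(rng.uniform(1000.0, 15000.0, size=5000))

# multiply each value by an independent random factor in [0.9, 1.1)
noise = rng.uniform(0.9, 1.1, size=len(amount))
fake = amount * noise

r = amount.corr(fake)
print(round(r, 4))
```

Because each value moves by at most 10%, the fake column stays almost perfectly correlated with the original.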

Build a random forest model

from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from sklearn.model_selection import train_test_split

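def ourModel(data, result):
    # inputs
    # data = pandas data frame of features (X)
    # result = column holding the target (y)

    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        data, result, test_size=0.25, random_state=1)

    # set up and fit the model
    # (no random_state is set on the forest, so scores vary slightly between runs)
    clf = RandomForestRegressor(n_estimators=100, n_jobs=4, oob_score=True)
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)

    # sklearn metrics take (y_true, y_pred); r2_score is not symmetric, so the
    # order matters. The r2 values printed further down were recorded with the
    # arguments in the opposite order, so a re-run will differ slightly.
    print('r2: ' + str(metrics.r2_score(y_test, predictions)))
    print('mse: ' + str(metrics.mean_squared_error(y_test, predictions)))

    # feature importances, sorted from least to most important
    importances = clf.feature_importances_
    indices = np.argsort(importances)
    fp = zip(data.columns.values[indices], importances[indices])

    return fp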
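A note on the argument order of the R² metric: scikit-learn's `r2_score` is documented as `r2_score(y_true, y_pred)`, and R² is not symmetric, because the residual sum of squares is normalized by the variance of whichever array is treated as the truth. The hand-written `r2` helper below is an illustrative reimplementation of the standard formula, not scikit-learn's code, and the numbers are made up.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot, with SS_tot taken over y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y_true = [1, 2, 3, 4, 5]
y_pred = [2, 2, 3, 4, 4]

print(r2(y_true, y_pred))   # truth first: SS_res = 2, SS_tot = 10
print(r2(y_pred, y_true))   # swapped: SS_res = 2, SS_tot = 4
```

Mean squared error, by contrast, is symmetric, so swapping its arguments changes nothing.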
    

Run the function above

dfmod=dfmod.dropna()
result=dfmod['Original Loan Amount']
data_fake=dfmod.drop(['Original Loan Amount'], axis=1)

print('WITH MADE UP DATA')
fp_fake=ourModel(data_fake, result)

data_nofake=dfmod.drop(['Original Loan Amount', 'Grandmas Loan Agency'], axis=1)
print("\nWITHOUT MADE UP DATA")
fp_nofake=ourModel(data_nofake, result)
WITH MADE UP DATA
r2: 0.8805216008148813
mse: 637577470.5899562

WITHOUT MADE UP DATA
r2: 0.8784191345199206
mse: 645873903.6564941

Engineer some extra features to try to improve the model further

# let's add some financial data
# our purchase years range from 2004 to 2016. Let's get the average mortgage rate in those years
# googled and found at : http://www.freddiemac.com/pmms/pmms30.html
years=np.linspace(2004., 2016., 13)
mort30=np.array([5.84, 5.87, 6.41, 6.34, 6.03, 5.04, 4.69, 4.45, 3.66, 3.98, 4.17, 3.85, 3.65])
dfmod['mort']=[mort30[years==dfmod['Purchase Year'].iloc[item]][0] for item in range(len(dfmod['Purchase Year']))]


# what else can we add? Maybe how many houses were purchased in that year 
# from https://www.statista.com/statistics/219963/number-of-us-house-sales/
housesBought=np.array([1203, 1283, 1051, 776, 485, 375, 323, 306, 368, 429, 437, 501, 560])*1000.
dfmod['housesBought']=[housesBought[years==dfmod['Purchase Year'].iloc[item]][0] for item in range(len(dfmod['Purchase Year']))]

#  and just because we don't want all our new data depending on year, let's do one about
#  expected wealth by family size in NY
#  source: https://www.justice.gov/ust/eo/bapcpa/20130501/bci_data/median_income_table.htm
#  assume > 4 is = 4 (note: as written, the expression below assigns 0 to sizes above 4)

# let's make a "wealthy vs Poor" category
dfmod['income']=[0 if dfmod['Household Size '].iloc[x] < 3 or dfmod['Household Size '].iloc[x]\
    > 4 else 1 for x in range(len(dfmod['Household Size '])) ]

#
# run the model again
dfmod=dfmod.dropna()
result=dfmod['Original Loan Amount']
data_fake=dfmod.drop(['Original Loan Amount'], axis=1)

print('WITH MADE UP DATA, Round 2!')
fp_fake=ourModel(data_fake, result)

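# NOTE: this must be assigned to data_nofake. Assigning it to data_fake instead
# leaves the round-1 data_nofake in use, which is why the recorded "NOT fake"
# importances below lack 'mort', 'housesBought' and 'income'.
data_nofake=dfmod.drop(['Original Loan Amount', 'Grandmas Loan Agency'], axis=1)
print("\nWITHOUT MADE UP DATA, Round 2!")
fp_nofake=ourModel(data_nofake, result)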

# feature ranking for importances
print('\nFeatures in order of importance for fake:')
print(list(fp_fake))

print('\nFeatures in order of importance for NOT fake:')
print(list(fp_nofake))

WITH MADE UP DATA, Round 2!
r2: 0.8805119731776188
mse: 636127397.198097

WITHOUT MADE UP DATA, Round 2!
r2: 0.87777395649925
mse: 650974675.1996832

Features in order of importance for fake:
[('Original Term', 0.001193970255686264), ('income', 0.001962303252954665), ('housesBought', 0.0032933784659805597), ('Number of Units', 0.004629807343641515), ('Housing Type', 0.005638302615442727), ('Household Size ', 0.00818667858229475), ('Purchase Year', 0.010329245640254558), ('mort', 0.01719017704279596), ('Bond Series', 0.01786888405078895), ('Property Type', 0.02191737837312151), ('Original Loan To Value', 0.06067438107071076), ('County', 0.060777486720751416), ('SONYMA DPAL/CCAL Amount', 0.08007650880918302), ('Grandmas Loan Agency', 0.7062614977763934)]

Features in order of importance for NOT fake:
[('Original Term', 0.001331898105323796), ('Number of Units', 0.004787708555639128), ('Housing Type', 0.005980303169527446), ('Household Size ', 0.010415186933312198), ('Property Type', 0.022150192124642493), ('Bond Series', 0.02455088805708598), ('Purchase Year', 0.039911871359296906), ('Original Loan To Value', 0.062332516902338694), ('County', 0.064640597446016), ('SONYMA DPAL/CCAL Amount', 0.7638988373468174)]
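As a side note, the per-row list comprehensions used above to look up a value by purchase year can also be written as a dictionary lookup with `Series.map`, which tends to be shorter and faster. A minimal sketch on a made-up toy frame, reusing the mortgage rates quoted above (the `rate_by_year` name is just for illustration):

```python
import pandas as pd

# average 30-year mortgage rate per year, copied from the cell above
mort30 = [5.84, 5.87, 6.41, 6.34, 6.03, 5.04, 4.69, 4.45, 3.66, 3.98, 4.17, 3.85, 3.65]
rate_by_year = {float(year): rate for year, rate in zip(range(2004, 2017), mort30)}

# toy stand-in for dfmod (the real frame stores the year as a float as well)
toy = pd.DataFrame({"Purchase Year": [2004.0, 2010.0, 2016.0, 2004.0]})
toy["mort"] = toy["Purchase Year"].map(rate_by_year)

print(toy)
```

One difference to keep in mind: a year missing from the dictionary becomes NaN with `map`, whereas the indexing version raises an error.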

We can see that when the feature 'Grandmas Loan Agency' gets a high importance score (about 0.71), the corresponding feature 'SONYMA DPAL/CCAL Amount' gets a low one, about 0.08;

once 'Grandmas Loan Agency' is removed, the importance of 'SONYMA DPAL/CCAL Amount' is 0.7638988373468174. To summarize:

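The predictive power of the random forest model is essentially unaffected by multicollinearity: r2 is about 0.88 with or without the made-up feature.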

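Interpretability, however, is affected. A random forest can report feature importances, and these are distorted when multicollinearity is present. The importance is shared out among the collinear features, so each one can look less important than it would on its own, which gets in the way of interpreting and understanding the features.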
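The sharing of importance can be reproduced on synthetic data. The sketch below is an assumption-laden toy, not the mortgage data: `signal` drives the target, `other` is a weaker feature, and the second forest additionally sees an exact copy of `signal`. The seeds, sizes and coefficients are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)    # strong predictor
other = rng.normal(size=n)     # weaker, independent predictor
y = 3.0 * signal + other + rng.normal(scale=0.1, size=n)

X_single = np.column_stack([signal, other])
X_double = np.column_stack([signal, signal, other])   # column 1 duplicates column 0

rf_single = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_single, y)
rf_double = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_double, y)

imp_single = rf_single.feature_importances_
imp_double = rf_double.feature_importances_

print(np.round(imp_single, 3))   # [signal, other]
print(np.round(imp_double, 3))   # [signal, copy of signal, other]
```

The two copies together should carry roughly the importance that `signal` had on its own, while each copy individually looks much less important.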

A simple way to think about it: multicollinear features do not hurt the predictive power of decision trees or random forests.

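The most extreme case of multicollinearity is two identical features, A and B. Once a decision tree has split on feature A, splitting on feature B at the same point gains nothing, because B adds no new information. Likewise, if the tree happens to pick B first, A has nothing further to contribute there.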
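This argument can be checked with a single decision tree. In the sketch below (synthetic data, arbitrary seed and depth, all names illustrative), one tree is fitted on features `a` and `c`, and a second on the same data plus an exact copy of `a`. Since the copy offers exactly the same candidate splits as `a`, both trees should end up with the same partition of the samples and therefore the same predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
a = rng.normal(size=500)
c = rng.normal(size=500)
y = 2.0 * a - c + rng.normal(scale=0.1, size=500)

X_one = np.column_stack([a, c])
X_dup = np.column_stack([a, a, c])   # "feature B" is an exact copy of feature A

tree_one = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_one, y)
tree_dup = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_dup, y)

# compare predictions on the training inputs
same = np.allclose(tree_one.predict(X_one), tree_dup.predict(X_dup))
print(same)
```

Which of the two copies a given split uses is arbitrary, which is exactly why the importances get shared while the predictions stay the same.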

So the predictive performance of tree-based models is largely unaffected by multicollinearity.
