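This article demonstrates three commonly used feature selection methods: the Pearson correlation coefficient, recursive feature elimination (RFE), and the Granger causality test.
The dataset used here, mock_kaggle.csv, is available among the resources I have uploaded.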
import pandas as pd
import numpy as np
import math
from matplotlib import pyplot as plt
import matplotlib as mpl
import tensorflow as tf
mpl.rcParams['font.sans-serif'] = ['SimHei']  # use a font that can render Chinese labels
mpl.rcParams['axes.unicode_minus'] = False  # render minus signs correctly
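# Load the data and parse the 'datetime' column as dates
data = pd.read_csv('mock_kaggle.csv', encoding='gbk', parse_dates=['datetime'])
Date = pd.to_datetime(data.datetime)
# Drop the first column (datetime) and use the parsed dates as the index instead
data = data.iloc[:, 1:]
datanew = data.set_index(Date)
datanew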
datetime | 特价 | 股票 | 价格 |
---|---|---|---|
2014-01-01 | 0 | 4972 | 1.29 |
2014-01-02 | 70 | 4902 | 1.29 |
2014-01-03 | 59 | 4843 | 1.29 |
2014-01-04 | 93 | 4750 | 1.29 |
2014-01-05 | 96 | 4654 | 1.29 |
... | ... | ... | ... |
2016-07-27 | 98 | 3179 | 2.39 |
2016-07-28 | 108 | 3071 | 2.39 |
2016-07-29 | 128 | 4095 | 2.39 |
2016-07-30 | 270 | 3825 | 2.39 |
2016-07-31 | 183 | 3642 | 2.39 |
937 rows × 3 columns
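Pearson correlation coefficient: this measures the degree of linear correlation between two features and ranges from -1 to 1. A value greater than 0 means the two variables are positively correlated; a value less than 0 means they are negatively correlated; the larger the absolute value, the stronger the linear relationship. A value of 0 only shows that the two variables are not linearly correlated; they may still be related in some other, nonlinear way.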
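To make the last point concrete, here is a minimal sketch that computes the coefficient from its definition (the sum of products of the centered values, divided by the square root of the product of their sums of squares). The helper name `pearson_r` and the synthetic data are assumptions made for illustration only; they are not part of the original notebook.

```python
import numpy as np


def pearson_r(a, b):
    """Pearson correlation computed directly from the centered values of a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    denom = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return float((a_centered * b_centered).sum() / denom)


# Synthetic data, symmetric around zero
x = np.linspace(-1.0, 1.0, 201)

r_linear = pearson_r(x, 3.0 * x + 1.0)  # exact linear relationship
r_quadratic = pearson_r(x, x ** 2)      # exact but nonlinear relationship

print(f"linear:    r = {r_linear:.3f}")
print(f"quadratic: r = {r_quadratic:.3f}")
```

The quadratic case is fully determined by x, yet its coefficient is essentially zero, which is why a low Pearson value alone should not be used to discard a feature.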
datanew.corr(method='pearson')
 | 特价 | 股票 | 价格 |
---|---|---|---|
特价 | 1.000000 | 0.153659 | 0.094779 |
股票 | 0.153659 | 1.000000 | -0.032604 |
价格 | 0.094779 | -0.032604 | 1.000000 |
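Here '股票' is the prediction target, and '特价' and '价格' are the candidate features. According to the Pearson correlation coefficients, '特价' is more strongly linearly correlated with '股票' (0.154) than '价格' is (-0.033), although neither correlation is strong.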
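When there are more than two candidate features, reading the full matrix becomes tedious. One possible way to rank features is to take the target's column of the correlation matrix and sort it by absolute value. The sketch below uses synthetic data with hypothetical column names (`target`, `f1`, `f2`), not the columns of mock_kaggle.csv.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: 'f1' drives the target, 'f2' is unrelated noise
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
target = 2.0 * f1 + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({"target": target, "f1": f1, "f2": f2})

# Correlation of every feature with the target, sorted by absolute value
corr_with_target = df.corr(method="pearson")["target"].drop("target")
ranking = corr_with_target.abs().sort_values(ascending=False)
print(ranking)
```

Sorting by absolute value keeps strongly negative correlations near the top, since they are as informative for a linear model as strongly positive ones.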
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
x=datanew[['特价','价格']]
y=datanew[['股票']]
# estimator is the base model; common choices are RandomForestRegressor, AdaBoostRegressor, and GradientBoostingRegressor
# n_features_to_select is the number of features to keep
estimator = AdaBoostRegressor()
selector = RFE(estimator, n_features_to_select=1,step=1)
selector = selector.fit(x, np.ravel(y))
selector.ranking_
array([2, 1])
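In the RFE ranking, 1 marks the best feature and the entries follow the column order of x ('特价', '价格'), so the features rank as 价格 > 特价 and the single selected feature is '价格'. Because AdaBoostRegressor is created here without a fixed random_state, the ranking may differ between runs.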
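The following is a small, reproducible sketch of the same RFE procedure on synthetic data, where it is known in advance which feature matters. The data, the choice of RandomForestRegressor, and the fixed `random_state` are assumptions for illustration; the notebook itself uses AdaBoostRegressor on the real columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
n = 200

# Hypothetical features: column 0 is informative, column 1 is pure noise
X = rng.normal(size=(n, 2))
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=n)

# Fixing random_state makes the ranking reproducible
estimator = RandomForestRegressor(n_estimators=50, random_state=0)
selector = RFE(estimator, n_features_to_select=1, step=1).fit(X, y)

print(selector.ranking_)  # rank 1 marks the selected feature
print(selector.support_)  # boolean mask of the selected features
```

RFE fits the base model, drops the feature with the lowest importance, and repeats until only `n_features_to_select` features remain, so the result depends on how the chosen estimator scores importance.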
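Because the Granger causality test is only valid for stationary data, an ADF test is run before each Granger test, and any non-stationary series is differenced first.
A Granger test with n lags is used: if the p-value is below 0.05, the feature is considered helpful for prediction. The value of n can be chosen according to experimental results; n=5 is used here.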
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import grangercausalitytests
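# ADF test for stationarity
# 1. If the test statistic (the first value) is below the 1% critical value, the null hypothesis of a unit root is rejected at a high significance level and the series is considered stationary
# 2. The second value is the p-value of the test statistic; the closer to 0, the better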
print(adfuller(datanew['股票']))
(-4.43277525813206, 0.00025960098178191664, 1, 935, {'1%': -3.437363201927513, '5%': -2.864636122077874, '10%': -2.5684185607252137}, 13320.64266001144)
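# ADF test for stationarity
# 1. If the test statistic (the first value) is below the 1% critical value, the null hypothesis of a unit root is rejected at a high significance level and the series is considered stationary
# 2. The second value is the p-value of the test statistic; the closer to 0, the better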
print(adfuller(datanew['特价']))
(-5.326902281189413, 4.815089201365622e-06, 14, 922, {'1%': -3.437462363899248, '5%': -2.8646798473884134, '10%': -2.568441851017076}, 10080.058794702525)
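# If most of the p-values are below 0.05, the feature helps prediction
# The test checks whether the second column ('特价') Granger-causes the first column ('股票')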
grangercausalitytests(datanew[['股票', '特价']], maxlag=5)
Granger Causality
number of lags (no zero) 1
ssr based F test: F=26.4227 , p=0.0000 , df_denom=933, df_num=1
ssr based chi2 test: chi2=26.5077 , p=0.0000 , df=1
likelihood ratio test: chi2=26.1393 , p=0.0000 , df=1
parameter F test: F=26.4227 , p=0.0000 , df_denom=933, df_num=1
Granger Causality
number of lags (no zero) 2
ssr based F test: F=11.9515 , p=0.0000 , df_denom=930, df_num=2
ssr based chi2 test: chi2=24.0315 , p=0.0000 , df=2
likelihood ratio test: chi2=23.7278 , p=0.0000 , df=2
parameter F test: F=11.9515 , p=0.0000 , df_denom=930, df_num=2
Granger Causality
number of lags (no zero) 3
ssr based F test: F=7.9772 , p=0.0000 , df_denom=927, df_num=3
ssr based chi2 test: chi2=24.1124 , p=0.0000 , df=3
likelihood ratio test: chi2=23.8064 , p=0.0000 , df=3
parameter F test: F=7.9772 , p=0.0000 , df_denom=927, df_num=3
Granger Causality
number of lags (no zero) 4
ssr based F test: F=6.5840 , p=0.0000 , df_denom=924, df_num=4
ssr based chi2 test: chi2=26.5926 , p=0.0000 , df=4
likelihood ratio test: chi2=26.2206 , p=0.0000 , df=4
parameter F test: F=6.5840 , p=0.0000 , df_denom=924, df_num=4
Granger Causality
number of lags (no zero) 5
ssr based F test: F=5.7818 , p=0.0000 , df_denom=921, df_num=5
ssr based chi2 test: chi2=29.2544 , p=0.0000 , df=5
likelihood ratio test: chi2=28.8047 , p=0.0000 , df=5
parameter F test: F=5.7818 , p=0.0000 , df_denom=921, df_num=5
{1: ({'ssr_ftest': (26.422705591269718, 3.341484126488697e-07, 933.0, 1),
'ssr_chi2test': (26.507666059408848, 2.6249436820703384e-07, 1),
'lrtest': (26.139254908737712, 3.176598799055979e-07, 1),
'params_ftest': (26.422705591270745, 3.3414841264869994e-07, 933.0, 1.0)},
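[<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x...>,
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x...>,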
array([[0., 1., 0.]])]),
2: ({'ssr_ftest': (11.951474970201767, 7.50104672114872e-06, 930.0, 2),
'ssr_chi2test': (24.031460423954094, 6.048318781002187e-06, 2),
'lrtest': (23.727822721199118, 7.0399368317345415e-06, 2),
'params_ftest': (11.951474970201835, 7.501046721148414e-06, 930.0, 2.0)},
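[<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x...>,
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x...>,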
array([[0., 0., 1., 0., 0.],
[0., 0., 0., 1., 0.]])]),
3: ({'ssr_ftest': (7.977215033141381, 2.9688716140668083e-05, 927.0, 3),
'ssr_chi2test': (24.11235870858916, 2.3666419220788112e-05, 3),
'lrtest': (23.80636877228062, 2.7416430068225566e-05, 3),
'params_ftest': (7.977215033141184, 2.9688716140675465e-05, 927.0, 3.0)},
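[<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x...>,
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x...>,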
array([[0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0., 1., 0.]])]),
4: ({'ssr_ftest': (6.584011012049909, 3.170738838569464e-05, 924.0, 4),
'ssr_chi2test': (26.59256395776002, 2.4028198570111743e-05, 4),
'lrtest': (26.220641056244858, 2.8562552645710914e-05, 4),
'params_ftest': (6.5840110120498485, 3.170738838569734e-05, 924.0, 4.0)},
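[<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x...>,
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x...>,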
array([[0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 1., 0.]])]),
5: ({'ssr_ftest': (5.781830837482133, 2.891464969771178e-05, 921.0, 5),
'ssr_chi2test': (29.254431816141953, 2.066905271923411e-05, 5),
'lrtest': (28.80468706706597, 2.5325511485650853e-05, 5),
'params_ftest': (5.7818308374822704, 2.8914649697703868e-05, 921.0, 5.0)},
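[<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x...>,
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x...>,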
array([[0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.]])])}
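# ADF test for stationarity
# 1. If the test statistic (the first value) is below the 1% critical value, the null hypothesis of a unit root is rejected at a high significance level and the series is considered stationary
# 2. The second value is the p-value of the test statistic; the closer to 0, the better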
print(adfuller(datanew['价格']))
(-4.585437174184152, 0.0001373928567642462, 2, 934, {'1%': -3.4373707314972766, '5%': -2.8646394422797337, '10%': -2.5684203292233905}, -90.17618431524147)
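# If most of the p-values are below 0.05, the feature helps prediction
# The test checks whether the second column ('价格') Granger-causes the first column ('股票')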
grangercausalitytests(datanew[['股票', '价格']], maxlag=5)
Granger Causality
number of lags (no zero) 1
ssr based F test: F=2.1967 , p=0.1386 , df_denom=933, df_num=1
ssr based chi2 test: chi2=2.2038 , p=0.1377 , df=1
likelihood ratio test: chi2=2.2012 , p=0.1379 , df=1
parameter F test: F=2.1967 , p=0.1386 , df_denom=933, df_num=1
Granger Causality
number of lags (no zero) 2
ssr based F test: F=3.6519 , p=0.0263 , df_denom=930, df_num=2
ssr based chi2 test: chi2=7.3431 , p=0.0254 , df=2
likelihood ratio test: chi2=7.3144 , p=0.0258 , df=2
parameter F test: F=3.6519 , p=0.0263 , df_denom=930, df_num=2
Granger Causality
number of lags (no zero) 3
ssr based F test: F=2.6202 , p=0.0496 , df_denom=927, df_num=3
ssr based chi2 test: chi2=7.9201 , p=0.0477 , df=3
likelihood ratio test: chi2=7.8867 , p=0.0484 , df=3
parameter F test: F=2.6202 , p=0.0496 , df_denom=927, df_num=3
Granger Causality
number of lags (no zero) 4
ssr based F test: F=2.5770 , p=0.0362 , df_denom=924, df_num=4
ssr based chi2 test: chi2=10.4086 , p=0.0341 , df=4
likelihood ratio test: chi2=10.3510 , p=0.0349 , df=4
parameter F test: F=2.5770 , p=0.0362 , df_denom=924, df_num=4
Granger Causality
number of lags (no zero) 5
ssr based F test: F=2.2911 , p=0.0439 , df_denom=921, df_num=5
ssr based chi2 test: chi2=11.5921 , p=0.0408 , df=5
likelihood ratio test: chi2=11.5206 , p=0.0420 , df=5
parameter F test: F=2.2911 , p=0.0439 , df_denom=921, df_num=5
{1: ({'ssr_ftest': (2.196709767862799, 0.13864333057458358, 933.0, 1),
'ssr_chi2test': (2.203773143322165, 0.1376733862086554, 1),
'lrtest': (2.201182862143469, 0.13790487994314793, 1),
'params_ftest': (2.1967097678628362, 0.13864333057458358, 933.0, 1.0)},
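[<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x...>,
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x...>,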
array([[0., 1., 0.]])]),
2: ({'ssr_ftest': (3.651893951998595, 0.026314676493413697, 930.0, 2),
'ssr_chi2test': (7.3430555809004, 0.025437576956925986, 2),
'lrtest': (7.3143711921911745, 0.025805036418257297, 2),
'params_ftest': (3.6518939519985145, 0.026314676493415512, 930.0, 2.0)},
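[<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x...>,
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x...>,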
array([[0., 0., 1., 0., 0.],
[0., 0., 0., 1., 0.]])]),
3: ({'ssr_ftest': (2.6202477844841536, 0.049618429261709424, 927.0, 3),
'ssr_chi2test': (7.920101717502264, 0.04769214764581419, 3),
'lrtest': (7.886710047954693, 0.0484120212476112, 3),
'params_ftest': (2.6202477844841394, 0.049618429261709424, 927.0, 3.0)},
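[<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x...>,
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x...>,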
array([[0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0., 1., 0.]])]),
4: ({'ssr_ftest': (2.5770481375709697, 0.036235214476570736, 924.0, 4),
'ssr_chi2test': (10.408597023176254, 0.03407960554596128, 4),
'lrtest': (10.350965823859951, 0.034913006919153514, 4),
'params_ftest': (2.5770481375709857, 0.036235214476570736, 924.0, 4.0)},
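[<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x...>,
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x...>,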
array([[0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 1., 0.]])]),
5: ({'ssr_ftest': (2.2910515372902207, 0.04394797232838387, 921.0, 5),
'ssr_chi2test': (11.592074010610672, 0.04082564664442608, 5),
'lrtest': (11.520576028920914, 0.041981492073816254, 5),
'params_ftest': (2.291051537290241, 0.04394797232838301, 921.0, 5.0)},
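[<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x...>,
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x...>,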
array([[0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.]])])}
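According to the Granger causality tests, both '特价' and '价格' help predict '股票'. '特价' is significant at every lag from 1 to 5, while '价格' is significant at lags 2 to 5 (p between about 0.026 and 0.050) but not at lag 1 (p ≈ 0.139).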