Common Feature Selection Methods

  This article demonstrates three commonly used feature selection methods: the Pearson correlation coefficient, recursive feature elimination (RFE), and the Granger causality test.
  The dataset used in this article, mock_kaggle.csv, is available in my uploaded resources.

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.pylab import mpl
mpl.rcParams['font.sans-serif'] = ['SimHei']   # display Chinese characters
mpl.rcParams['axes.unicode_minus'] = False     # display minus signs correctly

Load the data

data=pd.read_csv('mock_kaggle.csv',encoding ='gbk',parse_dates=['datetime'])
Date=pd.to_datetime(data.datetime)
data=data.iloc[:,1:]
datanew=data.set_index(Date)
datanew
特价 股票 价格
datetime
2014-01-01 0 4972 1.29
2014-01-02 70 4902 1.29
2014-01-03 59 4843 1.29
2014-01-04 93 4750 1.29
2014-01-05 96 4654 1.29
... ... ... ...
2016-07-27 98 3179 2.39
2016-07-28 108 3071 2.39
2016-07-29 128 4095 2.39
2016-07-30 270 3825 2.39
2016-07-31 183 3642 2.39

937 rows × 3 columns

Method 1: Pearson Correlation Coefficient

  The Pearson correlation coefficient measures the degree of linear correlation between two features. A value greater than 0 means the two variables are positively correlated; a value less than 0 means they are negatively correlated; the larger the absolute value, the stronger the linear relationship. A value of 0 only means the two variables are not linearly correlated; they may still be related in some other way.

datanew.corr(method='pearson')
特价 股票 价格
特价 1.000000 0.153659 0.094779
股票 0.153659 1.000000 -0.032604
价格 0.094779 -0.032604 1.000000

  Here '股票' (stock) is the prediction target, while '特价' (special price) and '价格' (price) are the candidate features. According to the Pearson results, '特价' has the stronger linear correlation with '股票', though the correlation is still weak.
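As a quick illustration of this kind of filtering (not part of the original post), the sketch below ranks candidate features by the absolute value of their Pearson correlation with the target column. The toy DataFrame and its English column names are made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy stand-in for a real dataset: two candidate features and one target
df = pd.DataFrame({
    'special_price': rng.normal(size=200),
    'price': rng.normal(size=200),
})
# The target is built from 'special_price' plus noise, so that feature should rank first
df['stock'] = 0.5 * df['special_price'] + rng.normal(size=200)

# Correlation of every candidate feature with the target, ranked by |r|
corr_with_target = df.corr(method='pearson')['stock'].drop('stock')
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

In practice one would keep the features whose |r| exceeds some threshold, with the caveat from above that a near-zero coefficient only rules out a linear relationship.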

Method 2: RFE (Recursive Feature Elimination)

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
x = datanew[['特价', '价格']]
y = datanew[['股票']]
# estimator is the base model; common choices include RandomForestRegressor,
# AdaBoostRegressor, and GradientBoostingRegressor
# n_features_to_select is the number of features to keep
estimator = AdaBoostRegressor()
selector = RFE(estimator, n_features_to_select=1, step=1)
selector = selector.fit(x, np.ravel(y))
selector.ranking_
array([2, 1])

  According to the RFE result, the feature ranking is 价格 > 特价, so the single selected feature is '价格' (price).
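Besides ranking_, a fitted RFE selector also exposes support_, a boolean mask over the input columns, which avoids reading the ranking off by hand. A minimal self-contained sketch (with made-up data, and a DecisionTreeRegressor standing in for the base models used above):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame({'a': rng.normal(size=100), 'b': rng.normal(size=100)})
y = 2.0 * X['b'] + rng.normal(scale=0.1, size=100)   # 'b' is clearly the informative feature

selector = RFE(DecisionTreeRegressor(random_state=0), n_features_to_select=1, step=1)
selector = selector.fit(X, y)
# support_ marks kept features with True; ranking_ assigns 1 to every kept feature
selected = X.columns[selector.support_].tolist()
print(selected)
```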

Method 3: Granger Causality Test

  Since the Granger causality test only applies to stationary data, an ADF test is run before each Granger test, and any non-stationary series is differenced first.

  An n-lag Granger test is used: if the p-value is below 0.05, that series is considered helpful for prediction. The value of n can be chosen based on experimental results; here n = 5.

from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import grangercausalitytests
# ADF test for stationarity:
# 1. If the test statistic (the first value) is below the 1% critical value, the null
#    hypothesis of a unit root can be rejected at high significance, i.e. the series is stationary.
# 2. The second value is the p-value of the test statistic; the closer to 0 the better.
print(adfuller(datanew['股票']))
(-4.43277525813206, 0.00025960098178191664, 1, 935, {'1%': -3.437363201927513, '5%': -2.864636122077874, '10%': -2.5684185607252137}, 13320.64266001144)
print(adfuller(datanew['特价']))
(-5.326902281189413, 4.815089201365622e-06, 14, 922, {'1%': -3.437462363899248, '5%': -2.8646798473884134, '10%': -2.568441851017076}, 10080.058794702525)
# If most of the p-values are below 0.05, this series is considered helpful for prediction
grangercausalitytests(datanew[['股票', '特价']], maxlag=5)
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=26.4227 , p=0.0000  , df_denom=933, df_num=1
ssr based chi2 test:   chi2=26.5077 , p=0.0000  , df=1
likelihood ratio test: chi2=26.1393 , p=0.0000  , df=1
parameter F test:         F=26.4227 , p=0.0000  , df_denom=933, df_num=1

Granger Causality
number of lags (no zero) 2
ssr based F test:         F=11.9515 , p=0.0000  , df_denom=930, df_num=2
ssr based chi2 test:   chi2=24.0315 , p=0.0000  , df=2
likelihood ratio test: chi2=23.7278 , p=0.0000  , df=2
parameter F test:         F=11.9515 , p=0.0000  , df_denom=930, df_num=2

Granger Causality
number of lags (no zero) 3
ssr based F test:         F=7.9772  , p=0.0000  , df_denom=927, df_num=3
ssr based chi2 test:   chi2=24.1124 , p=0.0000  , df=3
likelihood ratio test: chi2=23.8064 , p=0.0000  , df=3
parameter F test:         F=7.9772  , p=0.0000  , df_denom=927, df_num=3

Granger Causality
number of lags (no zero) 4
ssr based F test:         F=6.5840  , p=0.0000  , df_denom=924, df_num=4
ssr based chi2 test:   chi2=26.5926 , p=0.0000  , df=4
likelihood ratio test: chi2=26.2206 , p=0.0000  , df=4
parameter F test:         F=6.5840  , p=0.0000  , df_denom=924, df_num=4

Granger Causality
number of lags (no zero) 5
ssr based F test:         F=5.7818  , p=0.0000  , df_denom=921, df_num=5
ssr based chi2 test:   chi2=29.2544 , p=0.0000  , df=5
likelihood ratio test: chi2=28.8047 , p=0.0000  , df=5
parameter F test:         F=5.7818  , p=0.0000  , df_denom=921, df_num=5





print(adfuller(datanew['价格']))
(-4.585437174184152, 0.0001373928567642462, 2, 934, {'1%': -3.4373707314972766, '5%': -2.8646394422797337, '10%': -2.5684203292233905}, -90.17618431524147)
# If most of the p-values are below 0.05, this series is considered helpful for prediction
grangercausalitytests(datanew[['股票', '价格']], maxlag=5)
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=2.1967  , p=0.1386  , df_denom=933, df_num=1
ssr based chi2 test:   chi2=2.2038  , p=0.1377  , df=1
likelihood ratio test: chi2=2.2012  , p=0.1379  , df=1
parameter F test:         F=2.1967  , p=0.1386  , df_denom=933, df_num=1

Granger Causality
number of lags (no zero) 2
ssr based F test:         F=3.6519  , p=0.0263  , df_denom=930, df_num=2
ssr based chi2 test:   chi2=7.3431  , p=0.0254  , df=2
likelihood ratio test: chi2=7.3144  , p=0.0258  , df=2
parameter F test:         F=3.6519  , p=0.0263  , df_denom=930, df_num=2

Granger Causality
number of lags (no zero) 3
ssr based F test:         F=2.6202  , p=0.0496  , df_denom=927, df_num=3
ssr based chi2 test:   chi2=7.9201  , p=0.0477  , df=3
likelihood ratio test: chi2=7.8867  , p=0.0484  , df=3
parameter F test:         F=2.6202  , p=0.0496  , df_denom=927, df_num=3

Granger Causality
number of lags (no zero) 4
ssr based F test:         F=2.5770  , p=0.0362  , df_denom=924, df_num=4
ssr based chi2 test:   chi2=10.4086 , p=0.0341  , df=4
likelihood ratio test: chi2=10.3510 , p=0.0349  , df=4
parameter F test:         F=2.5770  , p=0.0362  , df_denom=924, df_num=4

Granger Causality
number of lags (no zero) 5
ssr based F test:         F=2.2911  , p=0.0439  , df_denom=921, df_num=5
ssr based chi2 test:   chi2=11.5921 , p=0.0408  , df=5
likelihood ratio test: chi2=11.5206 , p=0.0420  , df=5
parameter F test:         F=2.2911  , p=0.0439  , df_denom=921, df_num=5






  According to the Granger causality results, both '特价' and '价格' are helpful for predicting '股票'.
