Earlier we said that ARIMA can be applied directly, because the library internally rewrites the series into t-1, t-2, ... form. But if we want to do variable selection, we should first reframe the problem as supervised learning (via shift...).
For background on converting a time series into a supervised learning problem, see: https://machinelearningmastery.com/category/time-series/
from pandas import read_csv
from pandas import DataFrame
# load dataset (Series.from_csv was removed from pandas; read_csv + squeeze replaces it)
series = read_csv('seasonally_adjusted.csv', header=None, index_col=0, parse_dates=True).squeeze('columns')
# reframe as supervised learning: 12 lag columns plus the current value t
dataframe = DataFrame()
for i in range(12, 0, -1):
    dataframe['t-' + str(i)] = series.shift(i)
dataframe['t'] = series.values
print(dataframe.head(13))
# drop the leading rows whose lag columns contain NaN ([13:] follows the original tutorial)
dataframe = dataframe[13:]
# save to new file
dataframe.to_csv('lags_12months_features.csv', index=False)
The result looks like this:
t-12 t-11 t-10 t-9 t-8 t-7 t-6 t-5 \
1961-01-01 NaN NaN NaN NaN NaN NaN NaN NaN
1961-02-01 NaN NaN NaN NaN NaN NaN NaN NaN
1961-03-01 NaN NaN NaN NaN NaN NaN NaN NaN
1961-04-01 NaN NaN NaN NaN NaN NaN NaN NaN
1961-05-01 NaN NaN NaN NaN NaN NaN NaN NaN
1961-06-01 NaN NaN NaN NaN NaN NaN NaN 687.0
1961-07-01 NaN NaN NaN NaN NaN NaN 687.0 646.0
1961-08-01 NaN NaN NaN NaN NaN 687.0 646.0 -189.0
1961-09-01 NaN NaN NaN NaN 687.0 646.0 -189.0 -611.0
1961-10-01 NaN NaN NaN 687.0 646.0 -189.0 -611.0 1339.0
1961-11-01 NaN NaN 687.0 646.0 -189.0 -611.0 1339.0 30.0
1961-12-01 NaN 687.0 646.0 -189.0 -611.0 1339.0 30.0 1645.0
1962-01-01 687.0 646.0 -189.0 -611.0 1339.0 30.0 1645.0 -276.0
t-4 t-3 t-2 t-1 t
1961-01-01 NaN NaN NaN NaN 687.0
1961-02-01 NaN NaN NaN 687.0 646.0
1961-03-01 NaN NaN 687.0 646.0 -189.0
1961-04-01 NaN 687.0 646.0 -189.0 -611.0
1961-05-01 687.0 646.0 -189.0 -611.0 1339.0
1961-06-01 646.0 -189.0 -611.0 1339.0 30.0
1961-07-01 -189.0 -611.0 1339.0 30.0 1645.0
1961-08-01 -611.0 1339.0 30.0 1645.0 -276.0
1961-09-01 1339.0 30.0 1645.0 -276.0 561.0
1961-10-01 30.0 1645.0 -276.0 561.0 470.0
1961-11-01 1645.0 -276.0 561.0 470.0 3395.0
1961-12-01 -276.0 561.0 470.0 3395.0 360.0
1962-01-01 561.0 470.0 3395.0 360.0 3440.0
1. Selecting important variables with a random forest
A quick note on how a random-forest regressor ranks variable importance: each tree is grown by choosing the split that most reduces the RSS, so every split on a feature has an associated RSS reduction. Summing those reductions per feature and averaging over the 500 trees gives each feature's importance score; the larger the average RSS reduction, the more important the feature. For details, see An Introduction to Statistical Learning (with Applications in R).
Note: when selecting variables this way, use plenty of trees (n_estimators=500), and set a random seed so the result is reproducible.
from pandas import read_csv
from sklearn.ensemble import RandomForestRegressor
from matplotlib import pyplot
# load data
dataframe = read_csv('lags_12months_features.csv', header=0)
array = dataframe.values
# split into input and output
X = array[:,0:-1]
y = array[:,-1]
# fit random forest model
model = RandomForestRegressor(n_estimators=500, random_state=1)
model.fit(X, y)
# show importance scores
print(model.feature_importances_)
# plot importance scores
names = dataframe.columns.values[0:-1]
ticks = [i for i in range(len(names))]
pyplot.bar(ticks, model.feature_importances_)
pyplot.xticks(ticks, names)
pyplot.show()
The result:
[ 0.21642244 0.06271259 0.05662302 0.05543768 0.07155573 0.08478599
0.07699371 0.05366735 0.1033234 0.04897883 0.1066669 0.06283236]
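As a sanity check on the averaging described above: sklearn's forest-level score is the impurity (here, squared-error) reduction attributed to each feature, averaged over all trees — the same idea as the per-tree RSS reduction. A minimal sketch, reusing the model fitted in the code above:
import numpy as np
# each tree carries its own normalized importance vector; the forest averages them
per_tree = np.array([tree.feature_importances_ for tree in model.estimators_])
print(np.allclose(per_tree.mean(axis=0), model.feature_importances_))  # should print True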
2. RFE
The most widely used wrapper method is recursive feature elimination (RFE). RFE trains a machine-learning model over multiple rounds; after each round it eliminates the features with the smallest weight coefficients, then retrains on the reduced feature set. In sklearn, the RFE class implements this.
Let's use the classic SVM-RFE algorithm to illustrate the idea; it uses a support vector machine as RFE's underlying model. In the first round it trains on all n features and obtains the separating hyperplane w·x + b = 0. SVM-RFE then finds the index i whose weight component has the smallest squared value w_i², and eliminates that feature. In the second round, n-1 features remain; it trains the SVM again on those n-1 features and the output, and again drops the feature with the smallest w_i². This repeats until the number of remaining features meets our requirement.
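To make the elimination loop concrete, here is a minimal from-scratch sketch (svm_rfe is a hypothetical helper name; a linear-kernel SVR is assumed so that the weight vector w is exposed via coef_ — sklearn's RFE class used below does the same bookkeeping for any estimator):
import numpy as np
from sklearn.svm import SVR

def svm_rfe(X, y, n_features_to_keep):
    # minimal SVM-RFE: drop the feature with the smallest squared weight each round
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_keep:
        svm = SVR(kernel='linear').fit(X[:, remaining], y)
        w = svm.coef_.ravel()            # weights of the fitted hyperplane w·x + b = 0
        drop = int(np.argmin(w ** 2))    # position of the smallest w_i^2
        del remaining[drop]              # eliminate that feature, then retrain
    return remaining                     # column indices of the surviving features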
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
from matplotlib import pyplot
# load dataset
dataframe = read_csv('lags_12months_features.csv', header=0)
# separate into input and output variables
array = dataframe.values
X = array[:,0:-1]
y = array[:,-1]
# perform feature selection
rfe = RFE(RandomForestRegressor(n_estimators=500, random_state=1), n_features_to_select=4)
fit = rfe.fit(X, y)
# report selected features
print('Selected Features:')
names = dataframe.columns.values[0:-1]
for i in range(len(fit.support_)):
    if fit.support_[i]:
        print(names[i])
# plot feature rank (reusing the names computed above)
ticks = [i for i in range(len(names))]
pyplot.bar(ticks, fit.ranking_)
pyplot.xticks(ticks, names)
pyplot.show()
Above, a random forest was used as the base estimator for RFE, and only 4 features are kept in the end.
The result:
Selected Features:
t-12
t-6
t-4
t-2
The plot above shows the RFE ranking of each variable; the four selected variables all have rank 1 (lower is better).
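Once fitted, the RFE object can also produce the reduced design matrix directly, which is handy for the downstream model:
# keep only the 4 selected lag columns (equivalent to X[:, fit.support_])
X_selected = fit.transform(X)
print(X_selected.shape)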
Of course, Liu Jianping's blog also covers several other variable-selection methods, all worth trying.
https://machinelearningmastery.com/feature-selection-time-series-forecasting-python/