LightGBM is one of the most popular algorithms around at the moment, frequently used for prediction and regression in Kaggle competitions. Its strong performance has earned it the nickname "Heaven-Reliant Sword" (倚天剑), while XGBoost is known as the "Dragon-Slaying Saber" (屠龙刀). Today, to get the discussion started, we offer a simple tutorial on using these two blades to forecast a time series. The parameters are not tuned for the best possible accuracy; depending on your dataset, you can tune them yourself.
1. Environment setup
Our setup is to download Anaconda, install Keras and LightGBM inside it, and then run the program in Spyder. The steps for downloading Anaconda and installing Keras were covered in our other post, "Classifying power quality disturbances with a CNN (2019-03-28)", so we won't repeat them here. For installing LightGBM, see https://zhuanlan.zhihu.com/p/38361330; in one sentence: open Anaconda Prompt, type "pip install lightgbm", and press Enter.
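Before moving on, you can sanity-check the installation from Spyder's console. This snippet is not part of the original tutorial, just a quick check; your version numbers will differ from ours:

import lightgbm
import xgboost
print(lightgbm.__version__)  # any recent 3.x or 4.x release should work
print(xgboost.__version__)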
2. Dataset download
Download the time-series dataset and the programs. The Baidu Netdisk link is:
https://pan.baidu.com/s/1atjv-Juq9j8dW_x5RKUohg, password: "ivhz".
"nihe.csv" is a time-series dataset I built myself, with 1000 rows and 4 columns; columns 1-3 can be treated as X and column 4 as Y. What we want to do is train the relationship between the three X columns and Y, and then predict Y from given X.
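Before training, it is worth confirming the file loads as described. A minimal standalone sketch (not in the original post), assuming nihe.csv sits in the current working directory:

from pandas import read_csv
dataset = read_csv('nihe.csv')
print(dataset.shape)   # expect roughly (1000, 4): columns 1-3 are X, column 4 is Y
print(dataset.head())  # eyeball the first few rows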
3. Prediction
Put the downloaded nihe.csv file in Spyder's default working directory (mine is "D:\Matlab2018a\42"), create a new .py file, paste the program into it, and run. When it finishes, the variable forecasttestY holds the test-set predictions.
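If you are not sure where Spyder's default working directory is on your machine, you can print it first (a small helper, not in the original post):

import os
print(os.getcwd())  # place nihe.csv in this folder, or pass a full path to read_csv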
4. Programs
1) The LightGBM program is as follows:
import lightgbm as lgbm
import numpy as np
#1 Build the model
model = lgbm.LGBMRegressor(
    objective='regression',
    max_depth=5,
    num_leaves=25,
    learning_rate=0.007,
    n_estimators=1000,
    min_child_samples=80,
    subsample=0.8,            # note: only takes effect when subsample_freq > 0
    colsample_bytree=1,
    reg_alpha=0,
    reg_lambda=0,
    random_state=np.random.randint(10**7))  # same bound as the float 10e6, but an int as numpy expects
#2 Load the data
from pandas import read_csv
dataset = read_csv('nihe.csv')
values = dataset.values
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
XY = scaler.fit_transform(values)  # scale all 4 columns to [0, 1]
Featurenum = 3                     # columns 0-2 are the features X, column 3 is the target Y
X = XY[:, 0:Featurenum]
Y = XY[:, Featurenum]
n_train_hours1 = 800
n_train_hours2 = 900
trainX = X[:n_train_hours1, :]                 # rows 0-799: training
trainY = Y[:n_train_hours1]
validX = X[n_train_hours1:n_train_hours2, :]   # rows 800-899: validation
validY = Y[n_train_hours1:n_train_hours2]
testX = X[n_train_hours2:, :]                  # rows 900-999: test
testY = Y[n_train_hours2:]
#3 Fit with early stopping, then predict
model.fit(
    trainX,
    trainY,
    eval_set=[(trainX, trainY), (validX, validY)],
    eval_names=['fit', 'val'],
    eval_metric='l2',
    # lightgbm >= 4 removed the early_stopping_rounds/verbose keyword
    # arguments from fit(); the callbacks below also work on 3.3+
    callbacks=[lgbm.early_stopping(stopping_rounds=200, verbose=False),
               lgbm.log_evaluation(period=0)])
forecasttestY0 = model.predict(testX)
Hangnum = len(forecasttestY0)
forecasttestY0 = np.reshape(forecasttestY0, (Hangnum, 1))  # column vector, for the concatenation below
#4 Inverse transform back to the original scale
inv_yhat = np.concatenate((testX, forecasttestY0), axis=1)  # rebuild a 4-column array for the scaler
inv_y = scaler.inverse_transform(inv_yhat)
forecasttestY = inv_y[:, Featurenum]  # unscaled test-set predictions
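The original program stops at the inverse transform. If you want a single score for the forecasts, you can compare them with the unscaled test targets. This snippet is an addition of ours, reusing values, n_train_hours2, and Featurenum from above:

actualtestY = values[n_train_hours2:, Featurenum]  # unscaled ground truth for the test rows
rmse = np.sqrt(np.mean((forecasttestY - actualtestY) ** 2))
print('test RMSE:', rmse)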
2) The XGBoost program is as follows:
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
import numpy as np
#1. load dataset
from pandas import read_csv
dataset = read_csv('nihe.csv')
values = dataset.values
#2. Scale the data to [0,1]; the first 3 columns are features, the 4th is the target
from sklearn.preprocessing import MinMaxScaler
scaler= MinMaxScaler(feature_range=(0, 1))
XY= scaler.fit_transform(values)
Featurenum=3
X= XY[:,0:Featurenum]
Y = XY[:,Featurenum]
#3. Split into train/validation/test sets: 800 rows for training, 100 for validation, 100 for testing
n_train_hours1 = 800
n_train_hours2 = 900
trainX = X[:n_train_hours1, :]
trainY =Y[:n_train_hours1]
validX=X[n_train_hours1:n_train_hours2, :]
validY=Y[n_train_hours1:n_train_hours2]
testX = X[n_train_hours2:, :]
testY =Y[n_train_hours2:]
#3 Build, fit, predict (the validation split above is kept for parity with
# the LightGBM program; this fit call does not use it).
# reg:gamma requires strictly positive labels, but the scaled target contains 0,
# so squared error is used here; silent= was replaced by verbosity in xgboost 1.0
model = xgb.XGBRegressor(max_depth=5, learning_rate=0.1, n_estimators=160,
                         verbosity=0, objective='reg:squarederror')
model.fit(trainX, trainY)
forecasttestY0 = model.predict(testX)
Hangnum=len(forecasttestY0)
forecasttestY0 = np.reshape(forecasttestY0, (Hangnum, 1))
plot_importance(model)  # feature importance bar chart
plt.show()
#4 Inverse transform back to the original scale
inv_yhat = np.concatenate((testX, forecasttestY0), axis=1)
inv_y = scaler.inverse_transform(inv_yhat)
forecasttestY = inv_y[:, Featurenum]  # unscaled test-set predictions
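To eyeball the fit rather than just score it, you can plot the forecasts against the actual series. Again an addition of ours, reusing variables from the program above (plt is already imported there):

actualtestY = values[n_train_hours2:, Featurenum]  # unscaled ground truth
plt.plot(actualtestY, label='actual')
plt.plot(forecasttestY, label='forecast')
plt.legend()
plt.show()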