In machine-learning regression we often have predictions from several models. We can either pick the single best model's result, or combine the results of multiple different models.
Clemen (1989) showed that in many cases simply averaging the predictions of different methods significantly improves forecast accuracy, matching or even exceeding the single best model.
The figure below illustrates exactly this: the combination result is close to that of the best single model.
(Figure from Forecasting: Principles and Practice)
The rest of this post presents several simple combination (ensemble) methods for regression models.
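As a toy illustration of why averaging can help (the numbers are made up, with two models whose errors partly cancel):

```python
import numpy as np

y_true = np.array([10.0, 12.0, 14.0, 16.0])
model_a = y_true + np.array([1.0, -1.0, 1.5, -0.5])   # one model's forecasts
model_b = y_true + np.array([-1.0, 1.0, -0.5, 1.5])   # another model's forecasts

mae = lambda p: np.mean(np.abs(p - y_true))
avg = (model_a + model_b) / 2

# The errors partly cancel, so the average beats both individual models.
print(mae(model_a), mae(model_b), mae(avg))  # -> 1.0 1.0 0.25
```

Real models rarely have errors this neatly anti-correlated, but the less correlated the errors are, the more averaging helps.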
First, import the libraries and define a few loss functions.
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV


class Loss_func(object):
    """Common point-forecast error metrics."""

    @staticmethod
    def mae(y_true, y_pred):
        return np.mean(np.abs(y_true - y_pred))

    @staticmethod
    def mape(y_true, y_pred):
        # Undefined when y_true contains zeros.
        return np.mean(np.abs((y_pred - y_true) / y_true)) * 100

    @staticmethod
    def smape(y_true, y_pred):
        return 2.0 * np.mean(np.abs(y_pred - y_true) / (np.abs(y_pred) + np.abs(y_true))) * 100


use_loss = Loss_func.mape
loss_name = use_loss.__name__
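A quick sanity check of the metrics on made-up numbers (the definitions below mirror the ones above, so the snippet runs on its own):

```python
import numpy as np

def mape(y_true, y_pred):
    return np.mean(np.abs((y_pred - y_true) / y_true)) * 100

def smape(y_true, y_pred):
    return 2.0 * np.mean(np.abs(y_pred - y_true) / (np.abs(y_pred) + np.abs(y_true))) * 100

y_true = np.array([100.0, 200.0])
y_pred = np.array([110.0, 180.0])

# 10% error on each point -> MAPE of exactly 10.
print(mape(y_true, y_pred))   # -> 10.0
# sMAPE uses the symmetric denominator, so it differs slightly.
print(smape(y_true, y_pred))
```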
"""
features: 为不同模型预测结果
默认真实值为y
"""
多个预测结果直接平均
def simple_avg_forecast(forecast_df):
    # Equal-weight average over all model forecast columns.
    forecast_df['naive_comb_forecast'] = 0
    for fea in features:
        forecast_df['naive_comb_forecast'] += forecast_df[fea]
    forecast_df['naive_comb_forecast'] = forecast_df['naive_comb_forecast'] / len(features)
    return forecast_df
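The same equal-weight combination on a small hand-made DataFrame (column names are hypothetical):

```python
import pandas as pd

# Hypothetical predictions from two models on the same three points.
df = pd.DataFrame({'model_a': [10.0, 20.0, 30.0],
                   'model_b': [14.0, 18.0, 34.0]})
features = ['model_a', 'model_b']

# Same idea as simple_avg_forecast: a row-wise mean over the model columns.
df['naive_comb_forecast'] = df[features].mean(axis=1)
print(df['naive_comb_forecast'].tolist())  # -> [12.0, 19.0, 32.0]
```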
Weighted average: weight each model's result according to its error. The error function can be any of the ones defined above, or a custom one; in time-series forecasting this is probably the most frequently used combination scheme. How to choose among the loss functions is a topic for a separate article.
def weight_avg_forecast(forecast_df):
    forecast_df['{}_sum'.format(loss_name)] = 0
    forecast_df['{}_max'.format(loss_name)] = 0
    # Row-wise error of each model, clipped to be non-negative.
    for fea in features:
        forecast_df["{}_{}".format(fea, loss_name)] = forecast_df.apply(
            lambda x: use_loss(x['y'], x[fea]), axis=1)
        forecast_df["{}_{}".format(fea, loss_name)] = forecast_df["{}_{}".format(fea, loss_name)].clip(lower=0)
    # The worst (largest) error in each row is the reference point.
    for fea in features:
        forecast_df['{}_max'.format(loss_name)] = forecast_df.apply(
            lambda x: max(x['{}_max'.format(loss_name)], x["{}_{}".format(fea, loss_name)]), axis=1)
    # Normalising constant: sum over models of (max error - model error).
    for fea in features:
        forecast_df['{}_sum'.format(loss_name)] += (
            forecast_df['{}_max'.format(loss_name)] - forecast_df["{}_{}".format(fea, loss_name)])
    # Lower error -> larger weight; the weights in each row sum to 1.
    for fea in features:
        forecast_df["{}_weight_{}".format(fea, loss_name)] = (
            forecast_df['{}_max'.format(loss_name)] - forecast_df["{}_{}".format(fea, loss_name)]
        ) / forecast_df['{}_sum'.format(loss_name)]
    forecast_df['weight_avg_forecast'] = 0
    for fea in features:
        forecast_df['weight_avg_forecast'] += forecast_df["{}_weight_{}".format(fea, loss_name)] * forecast_df[fea]
    return forecast_df
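The weighting rule, isolated on hypothetical per-model errors: each weight is proportional to how far that model's error sits below the worst one, so the worst model gets weight 0.

```python
import numpy as np

errors = np.array([2.0, 4.0, 6.0])   # hypothetical errors of three models
diffs = errors.max() - errors        # distance below the worst error
weights = diffs / diffs.sum()        # normalise so the weights sum to 1

print(weights)  # -> [0.667, 0.333, 0.0] approximately
```

Note the degenerate case: if all models have identical errors, `diffs.sum()` is 0 and the division fails, so real code should fall back to equal weights there.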
Normalised-error weighting, as above, is one option. Another is to compute the correlation between each model's predictions and the ground truth, normalise the correlation coefficients, and use them as weights.
With the Pearson correlation coefficient, a higher correlation between predictions and truth generally means a more accurate forecast.
def corr_comb_forecast(forecast_df):
    forecast_df_corr = forecast_df.corr()
    # Correlation of every forecast column with the ground truth y.
    df_corr = pd.DataFrame(forecast_df_corr['y'].sort_values(ascending=False)[1:])
    print(df_corr)
    # Absolute correlation of each model's forecast with y.
    corr_select_fea = forecast_df_corr.loc[['y'], features].abs()
    corr_select_fea_min = corr_select_fea[features].values[0].min()
    t_sum = 0
    for fea in features:
        # Min-shift: the least-correlated model ends up with weight 0.
        corr_select_fea['corr_norm_{}'.format(fea)] = corr_select_fea[fea] - corr_select_fea_min
        print(corr_select_fea['corr_norm_{}'.format(fea)].values[0])
        t_sum += corr_select_fea['corr_norm_{}'.format(fea)].values[0]
    # Normalise the shifted correlations so the weights sum to 1.
    for fea in features:
        corr_select_fea['corr_norm_{}'.format(fea)] = corr_select_fea['corr_norm_{}'.format(fea)] / t_sum
    forecast_df['corr_forecast'] = 0
    for fea in features:
        forecast_df['corr_forecast'] += corr_select_fea['corr_norm_{}'.format(fea)].values[0] * forecast_df[fea]
        forecast_df['corr_norm_{}'.format(fea)] = corr_select_fea['corr_norm_{}'.format(fea)].values[0]
    return forecast_df
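The same correlation-based weighting on a small made-up DataFrame. With only two models, the min-shift drives the weaker model's weight all the way to 0, which is worth knowing before using this scheme:

```python
import pandas as pd

# Hypothetical ground truth and two model forecasts.
df = pd.DataFrame({'y':  [1.0, 2.0, 3.0, 4.0, 5.0],
                   'm1': [1.1, 2.0, 2.9, 4.2, 5.1],
                   'm2': [1.0, 2.5, 2.0, 4.8, 4.0]})
features = ['m1', 'm2']

corr = df.corr()['y'][features].abs()   # |correlation| with the truth
shifted = corr - corr.min()             # min-shift, as in corr_comb_forecast
weights = shifted / shifted.sum()       # normalise to sum to 1

# With two models, the weaker one (m2 here) gets weight exactly 0.
df['corr_forecast'] = (df[features] * weights).sum(axis=1)
print(weights.to_dict())
```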
Another option: rank the models by error and assign weights that decay exponentially with rank. For example, give the most accurate model a larger weight, penalise models whose accuracy is below average, and even set the weight to 0 for any model whose accuracy score falls below 50. A simple code sketch:
# map_list holds each model's accuracy score; mean_ma is assumed to be
# defined upstream as an average accuracy, floored here at 75.
mean_ma = max(75, mean_ma)
low_mean = np.mean(map_list)
for i, value in enumerate(map_list):
    if value < 50:
        # Drop models whose accuracy is below 50.
        map_list[i] = 0
    elif value < low_mean:
        # Penalise below-average models.
        map_list[i] = value / 2
    else:
        # Boost above-average models.
        map_list[i] = value * 2
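Continuing the sketch: once the scores have been adjusted this way, they still need to be normalised into weights. The scores below are hypothetical:

```python
import numpy as np

map_list = [40.0, 60.0, 90.0]   # hypothetical accuracy scores per model
low_mean = np.mean(map_list)    # average accuracy, about 63.3

adjusted = []
for value in map_list:
    if value < 50:
        adjusted.append(0.0)          # drop models under the 50 cutoff
    elif value < low_mean:
        adjusted.append(value / 2)    # penalise below-average models
    else:
        adjusted.append(value * 2)    # boost above-average models

# Normalise the adjusted scores into combination weights.
weights = np.array(adjusted) / np.sum(adjusted)
print(weights)  # -> [0.0, 0.143, 0.857] approximately
```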
Regression-based combination: fit a simple regression model on the individual forecasts and use the fitted coefficients to weight them.
Lasso is a natural choice because the individual forecasts are usually highly collinear; the L1 penalty can drive some coefficients exactly to zero, so the solution may drop some models from the combination entirely.
In practice the Lasso combination often performs better than a plain weighted average and is well worth trying.
The code:
def lasso_comb_forecast(forecast_df, target_col='y'):
    reg_data = forecast_df[features]
    reg_target = forecast_df[target_col]
    # Pick the regularisation strength by cross-validation.
    lassocv = LassoCV()
    lassocv.fit(reg_data, reg_target)
    alpha = lassocv.alpha_
    print('best alpha is : {}'.format(alpha))
    lasso = Lasso(alpha=alpha)
    lasso.fit(reg_data, reg_target)
    num_effect_coef = np.sum(lasso.coef_ != 0)
    print('all coef num : {}. non-zero coef num : {}'.format(len(lasso.coef_), num_effect_coef))
    loss_coef_df = pd.DataFrame(list(zip(lasso.coef_, features)), columns=['coef', 'feature'])
    # Build an expression such as 'lasso_comb_forecast=m1*0.6+m2*0.4'.
    t = 'lasso_comb_forecast='
    for fea in loss_coef_df['feature'].unique():
        coef = loss_coef_df[loss_coef_df['feature'] == fea]['coef'].values[0]
        t += str(fea) + '*' + str(coef) + '+'
    forecast_df.eval(t[:-1], inplace=True)
    # Record each model's coefficient alongside the combined forecast.
    for fea in features:
        forecast_df['lasso_coef_{}'.format(fea)] = loss_coef_df[loss_coef_df['feature'] == fea]['coef'].values[0]
    return forecast_df
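A self-contained sketch of the Lasso combination on synthetic data. The column names and data-generating process are made up for illustration; the point is that the L1 penalty pushes the useless model's coefficient toward zero:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
y = rng.normal(size=200) * 10

# Three hypothetical model forecasts: two informative, one pure noise.
df = pd.DataFrame({'m1': y + rng.normal(size=200),
                   'm2': y + rng.normal(size=200) * 2,
                   'm3': rng.normal(size=200)})
features = ['m1', 'm2', 'm3']

lassocv = LassoCV(cv=5).fit(df[features], y)
print(dict(zip(features, lassocv.coef_)))

# The combined forecast is just the fitted linear combination.
comb = lassocv.predict(df[features])
```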
The complete code for all of the above can be found at