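I. Introduction
Previously, we used decision trees as the running example to walk through a variety of time series modeling approaches.
Starting with this installment, we continue building machine learning and deep learning time series forecasting models in Python.
As before, we use the following data:
The demonstration uses the public data from the 2015 PLoS One article "Comparison of Two Hybrid Models for Forecasting the Incidence of Hemorrhagic Fever with Renal Syndrome in Jiangsu Province, China". The data are the monthly incidence of hemorrhagic fever with renal syndrome in Jiangsu Province from January 2004 to December 2012. Data from January 2004 to December 2011 are used to predict the incidence for the 12 months of 2012.
In this installment, we introduce random forest regression.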
II. Random Forest Regression
(1) Code walkthrough
class sklearn.ensemble.RandomForestRegressor(n_estimators=100, *, criterion='squared_error', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None)
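At first glance, the parameters look much like those of RandomForestClassifier (used for classification), so we list what the two share and where they differ for easy comparison:
Shared parameters:
n_estimators: the number of trees.
criterion: the function used to measure the quality of a split.
RandomForestRegressor options: {'squared_error', 'absolute_error', 'friedman_mse', 'poisson'}, default 'squared_error' (older scikit-learn releases used the names 'mse' and 'mae').
RandomForestClassifier options: {'gini', 'entropy', 'log_loss'}, default 'gini'.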
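max_depth: the maximum depth of each tree.
min_samples_split: the minimum number of samples required to split an internal node.
min_samples_leaf: the minimum number of samples required at a leaf node.
min_weight_fraction_leaf: the minimum weighted fraction of the total sample weight required at a leaf node.
max_features: the number of features to consider when looking for the best split. The default is 1.0 (all features) for the regressor and 'sqrt' for the classifier.
max_leaf_nodes: the maximum number of leaf nodes per tree; trees are grown best-first until this limit is reached. None means no limit.
min_impurity_decrease: a node is split if the split reduces the impurity by at least this value.
bootstrap: whether bootstrap samples are used when building the trees.
oob_score: whether to use out-of-bag samples to estimate the generalization score (R² for the regressor, accuracy for the classifier).
n_jobs: the number of jobs to run in parallel.
random_state: the seed that controls the randomness.
verbose: controls the verbosity when fitting and predicting.
warm_start: when set to True, reuses the solution of the previous call to fit and adds more estimators to the forest.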
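ccp_alpha: the complexity parameter used for minimal cost-complexity pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha is chosen. The default is 0.0, which means no pruning.
max_samples: when bootstrap=True, the number of samples drawn from X to train each base estimator.
Parameter specific to RandomForestClassifier:
class_weight: the weights associated with the classes. Useful for imbalanced classification problems.
In summary, most parameters are the same for both estimators; only some default values and allowed options differ. The main difference lies in the options that the regressor and the classifier accept for criterion.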
(2) Single-step rolling forecast
# Import libraries and read the data
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
data = pd.read_csv('data.csv')
# Convert the time column to datetime
data['time'] = pd.to_datetime(data['time'], format='%b-%y')
# Number of lagged months used as input features
lag_period = 6
# Create the lag features (lag_6 is the previous month, lag_1 is six months back)
for i in range(lag_period, 0, -1):
    data[f'lag_{i}'] = data['incidence'].shift(lag_period - i + 1)
# Drop the rows that contain NaN
data = data.dropna().reset_index(drop=True)
# Split into training and validation sets
train_data = data[(data['time'] >= '2004-01-01') & (data['time'] <= '2011-12-31')]
validation_data = data[(data['time'] >= '2012-01-01') & (data['time'] <= '2012-12-31')]
# Define the features and the target variable
X_train = train_data[['lag_1', 'lag_2', 'lag_3', 'lag_4', 'lag_5', 'lag_6']]
y_train = train_data['incidence']
X_validation = validation_data[['lag_1', 'lag_2', 'lag_3', 'lag_4', 'lag_5', 'lag_6']]
y_validation = validation_data['incidence']
# Initialize the random forest model
rf_model = RandomForestRegressor()
# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 5, 7],
}
# Initialize the grid search
grid_search = GridSearchCV(rf_model, param_grid, cv=5, scoring='neg_mean_squared_error')
# Run the grid search
grid_search.fit(X_train, y_train)
# Get the best parameters
best_params = grid_search.best_params_
# Initialize a random forest model with the best parameters
best_rf_model = RandomForestRegressor(**best_params)
# Train the model on the training set
best_rf_model.fit(X_train, y_train)
# Forecast the validation period recursively: start from the lags observed at the end of 2011
# and feed each prediction back in as the newest lag, so no 2012 observations are used
y_validation_pred = []
window = list(X_validation.iloc[0])
for _ in range(len(X_validation)):
    pred = best_rf_model.predict([window])
    y_validation_pred.append(pred[0])
    window = window[1:] + [pred[0]]
y_validation_pred = np.array(y_validation_pred)
# Compute MAE, MAPE, MSE, and RMSE on the validation set
mae_validation = mean_absolute_error(y_validation, y_validation_pred)
mape_validation = np.mean(np.abs((y_validation - y_validation_pred) / y_validation))
mse_validation = mean_squared_error(y_validation, y_validation_pred)
rmse_validation = np.sqrt(mse_validation)
# Compute MAE, MAPE, MSE, and RMSE on the training set
y_train_pred = best_rf_model.predict(X_train)
mae_train = mean_absolute_error(y_train, y_train_pred)
mape_train = np.mean(np.abs((y_train - y_train_pred) / y_train))
mse_train = mean_squared_error(y_train, y_train_pred)
rmse_train = np.sqrt(mse_train)
print("Train Metrics:", mae_train, mape_train, mse_train, rmse_train)
print("Validation Metrics:", mae_validation, mape_validation, mse_validation, rmse_validation)
Results: the script prints MAE, MAPE, MSE, and RMSE for the training set, then for the validation set.
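The lag columns in the script above are numbered so that lag_6 is the most recent month and lag_1 is the oldest. A minimal sketch on a toy series makes this visible; the helper name add_lag_features and the toy values are assumptions for illustration and are not part of the original script.

```python
import pandas as pd


def add_lag_features(df, column, lag_period):
    """Add lag_1 ... lag_{lag_period} using the same naming rule as the script."""
    out = df.copy()
    for i in range(lag_period, 0, -1):
        # lag_{lag_period} is shifted by 1 (previous row), lag_1 by lag_period rows
        out[f'lag_{i}'] = out[column].shift(lag_period - i + 1)
    # The first lag_period rows have incomplete history and are dropped
    return out.dropna().reset_index(drop=True)


# Toy series (hypothetical values, not the HFRS data)
toy = pd.DataFrame({'incidence': [10.0, 20.0, 30.0, 40.0, 50.0]})
lagged = add_lag_features(toy, 'incidence', lag_period=3)
print(lagged[['lag_1', 'lag_2', 'lag_3', 'incidence']])
```

Read left to right, each row lists the history from oldest to newest followed by the value to be predicted, which is the order the recursive forecast relies on when it drops the first feature and appends the newest prediction.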
(3) Multi-step rolling forecast, vol. 1
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error
data = pd.read_csv('data.csv')
data['time'] = pd.to_datetime(data['time'], format='%b-%y')
n = 6
m = 2
# Create the lag features (n lags as inputs, m steps ahead as outputs)
for i in range(n, 0, -1):
    data[f'lag_{i}'] = data['incidence'].shift(n - i + 1)
data = data.dropna().reset_index(drop=True)
train_data = data[(data['time'] >= '2004-01-01') & (data['time'] <= '2011-12-31')]
validation_data = data[(data['time'] >= '2012-01-01') & (data['time'] <= '2012-12-31')]
X_train = train_data[[f'lag_{i}' for i in range(1, n+1)]]
# Create the m target variables (current month and the following m-1 months)
y_train_list = [train_data['incidence'].shift(-i) for i in range(m)]
y_train = pd.concat(y_train_list, axis=1)
y_train.columns = [f'target_{i+1}' for i in range(m)]
y_train = y_train.dropna()
X_train = X_train.iloc[:-m+1, :]
X_validation = validation_data[[f'lag_{i}' for i in range(1, n+1)]]
y_validation = validation_data['incidence']
rf_model = RandomForestRegressor()
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 3, 5, 7, 9],
}
grid_search = GridSearchCV(rf_model, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
best_rf_model = RandomForestRegressor(**best_params)
best_rf_model.fit(X_train, y_train)
def average_overlapping(preds, length):
    # Row i holds forecasts for steps i, i+1, ..., i+m-1;
    # average all forecasts that were made for the same step
    total = np.zeros(length)
    count = np.zeros(length)
    for i in range(len(preds)):
        for k in range(preds.shape[1]):
            if i + k < length:
                total[i + k] += preds[i, k]
                count[i + k] += 1
    return total / count
# Predict the validation set and average the overlapping forecasts
y_validation_pred = average_overlapping(best_rf_model.predict(X_validation), len(y_validation))
mae_validation = mean_absolute_error(y_validation, y_validation_pred)
mape_validation = np.mean(np.abs((y_validation - y_validation_pred) / y_validation))
mse_validation = mean_squared_error(y_validation, y_validation_pred)
rmse_validation = np.sqrt(mse_validation)
print(mae_validation, mape_validation, mse_validation, rmse_validation)
# Fit the training set and average the overlapping forecasts
y_train_pred = average_overlapping(best_rf_model.predict(X_train), len(y_train))
mae_train = mean_absolute_error(y_train.iloc[:, 0], y_train_pred)
mape_train = np.mean(np.abs((y_train.iloc[:, 0] - y_train_pred) / y_train.iloc[:, 0]))
mse_train = mean_squared_error(y_train.iloc[:, 0], y_train_pred)
rmse_train = np.sqrt(mse_train)
print(mae_train, mape_train, mse_train, rmse_train)
Results: the script prints MAE, MAPE, MSE, and RMSE for the validation set, then for the training set.
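The multi-output targets in this script are built by shifting the series backwards, which is also why the last m-1 rows have to be dropped from both X_train and y_train. A minimal sketch on a toy series, where the helper name make_multi_step_targets and the values are assumptions for illustration:

```python
import pandas as pd


def make_multi_step_targets(series, m):
    """Return a frame whose row t holds the values at t, t+1, ..., t+m-1."""
    targets = pd.concat([series.shift(-i) for i in range(m)], axis=1)
    targets.columns = [f'target_{i + 1}' for i in range(m)]
    # The last m-1 rows have no complete future window and are dropped
    return targets.dropna()


# Toy series (hypothetical values, not the HFRS data)
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name='incidence')
targets = make_multi_step_targets(toy, m=2)
print(targets)
```

With five observations and m=2, only four rows survive, so the feature matrix must be trimmed by the same m-1 rows before fitting.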
(4) Multi-step rolling forecast, vol. 2
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error
data = pd.read_csv('data.csv')
data['time'] = pd.to_datetime(data['time'], format='%b-%y')
n = 6
m = 2
# Create the lag features
for i in range(n, 0, -1):
    data[f'lag_{i}'] = data['incidence'].shift(n - i + 1)
data = data.dropna().reset_index(drop=True)
train_data = data[(data['time'] >= '2004-01-01') & (data['time'] <= '2011-12-31')]
validation_data = data[(data['time'] >= '2012-01-01') & (data['time'] <= '2012-12-31')]
# Keep every second row (1st, 3rd, 5th, ...) of X_train, y_train, and X_validation so the forecast windows do not overlap
X_train = train_data[[f'lag_{i}' for i in range(1, n+1)]].iloc[::2].reset_index(drop=True)
# Create the m target variables
y_train_list = [train_data['incidence'].shift(-i) for i in range(m)]
y_train = pd.concat(y_train_list, axis=1)
y_train.columns = [f'target_{i+1}' for i in range(m)]
y_train = y_train.iloc[::2].reset_index(drop=True).dropna()
X_train = X_train.head(len(y_train))
X_validation = validation_data[[f'lag_{i}' for i in range(1, n+1)]].iloc[::2].reset_index(drop=True)
y_validation = validation_data['incidence']
rf_model = RandomForestRegressor()
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 3, 5, 7, 9],
}
grid_search = GridSearchCV(rf_model, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
best_rf_model = RandomForestRegressor(**best_params)
best_rf_model.fit(X_train, y_train)
# Predict the validation set; each row yields m consecutive months
y_validation_pred = []
for i in range(len(X_validation)):
    pred = best_rf_model.predict([X_validation.iloc[i]])
    y_validation_pred.extend(pred[0])
y_validation_pred = np.array(y_validation_pred)[:len(y_validation)]
mae_validation = mean_absolute_error(y_validation, y_validation_pred)
mape_validation = np.mean(np.abs((y_validation - y_validation_pred) / y_validation))
mse_validation = mean_squared_error(y_validation, y_validation_pred)
rmse_validation = np.sqrt(mse_validation)
print(mae_validation, mape_validation, mse_validation, rmse_validation)
# Predict the training set; each row yields m consecutive months, so compare against all m target columns
y_train_pred = best_rf_model.predict(X_train).ravel()
y_train_true = y_train.to_numpy().ravel()
mae_train = mean_absolute_error(y_train_true, y_train_pred)
mape_train = np.mean(np.abs((y_train_true - y_train_pred) / y_train_true))
mse_train = mean_squared_error(y_train_true, y_train_pred)
rmse_train = np.sqrt(mse_train)
print(mae_train, mape_train, mse_train, rmse_train)
Results: the script prints MAE, MAPE, MSE, and RMSE for the validation set, then for the training set.
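Taking every second row works because, with m=2, each kept row forecasts two consecutive months, so the flattened outputs line up with the calendar without overlap. A minimal NumPy sketch of that bookkeeping, using made-up values and variable names that are assumptions for illustration:

```python
import numpy as np

# Toy series of 8 "months" (hypothetical values, not the HFRS data)
series = np.arange(10.0, 18.0)
m = 2

# Start a new window every m steps so that windows never overlap
starts = range(0, len(series) - m + 1, m)
targets = np.array([series[t:t + m] for t in starts])

# Flattening row by row restores the original month order
stitched = targets.ravel()
print(targets)
print(stitched)
```

The same reasoning would apply to other horizons if the row step were set to m instead of the hard-coded 2.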
(5) Multi-step rolling forecast, vol. 3
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Read and preprocess the data
data = pd.read_csv('data.csv')
data_y = pd.read_csv('data.csv')
data['time'] = pd.to_datetime(data['time'], format='%b-%y')
data_y['time'] = pd.to_datetime(data_y['time'], format='%b-%y')
n = 6
for i in range(n, 0, -1):
    data[f'lag_{i}'] = data['incidence'].shift(n - i + 1)
data = data.dropna().reset_index(drop=True)
train_data = data[(data['time'] >= '2004-01-01') & (data['time'] <= '2011-12-31')]
X_train = train_data[[f'lag_{i}' for i in range(1, n+1)]]
m = 3
X_train_list = []
y_train_list = []
# Build one training set per horizon: model i predicts the value i steps after the current month
for i in range(m):
    X_temp = X_train
    y_temp = data_y['incidence'].iloc[n + i:len(data_y) - m + 1 + i]
    X_train_list.append(X_temp)
    y_train_list.append(y_temp)
for i in range(m):
    X_train_list[i] = X_train_list[i].iloc[:-(m-1)]
    y_train_list[i] = y_train_list[i].iloc[:len(X_train_list[i])]
# Train the models (one tuned random forest per horizon)
rf_model = RandomForestRegressor()
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 3, 5, 7, 9],
}
best_rf_models = []
for i in range(m):
    grid_search = GridSearchCV(rf_model, param_grid, cv=5, scoring='neg_mean_squared_error')
    grid_search.fit(X_train_list[i], y_train_list[i])
    best_rf_model = RandomForestRegressor(**grid_search.best_params_)
    best_rf_model.fit(X_train_list[i], y_train_list[i])
    best_rf_models.append(best_rf_model)
validation_start_time = train_data['time'].iloc[-1] + pd.DateOffset(months=1)
validation_data = data[data['time'] >= validation_start_time]
X_validation = validation_data[[f'lag_{i}' for i in range(1, n+1)]]
y_validation_pred_list = [model.predict(X_validation) for model in best_rf_models]
y_train_pred_list = [model.predict(X_train_list[i]) for i, model in enumerate(best_rf_models)]
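def concatenate_predictions(pred_list):
    # Take every m-th row and lay its m horizon forecasts end to end,
    # so that consecutive entries correspond to consecutive months
    concatenated = []
    for j in range(0, len(pred_list[0]), m):
        for i in range(m):
            concatenated.append(pred_list[i][j])
    return concatenated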
y_validation_pred = np.array(concatenate_predictions(y_validation_pred_list))[:len(validation_data['incidence'])]
y_train_pred = np.array(concatenate_predictions(y_train_pred_list))[:len(train_data['incidence']) - m + 1]
mae_validation = mean_absolute_error(validation_data['incidence'], y_validation_pred)
mape_validation = np.mean(np.abs((validation_data['incidence'] - y_validation_pred) / validation_data['incidence']))
mse_validation = mean_squared_error(validation_data['incidence'], y_validation_pred)
rmse_validation = np.sqrt(mse_validation)
print("Validation:", mae_validation, mape_validation, mse_validation, rmse_validation)
mae_train = mean_absolute_error(train_data['incidence'][:-(m-1)], y_train_pred)
mape_train = np.mean(np.abs((train_data['incidence'][:-(m-1)] - y_train_pred) / train_data['incidence'][:-(m-1)]))
mse_train = mean_squared_error(train_data['incidence'][:-(m-1)], y_train_pred)
rmse_train = np.sqrt(mse_train)
print("Train:", mae_train, mape_train, mse_train, rmse_train)
Results: the script prints MAE, MAPE, MSE, and RMSE for the validation set, then for the training set.
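All of the grid searches above use cv=5, which for a regressor in scikit-learn means plain K-fold splitting, so some folds train on months that come after the months they are scored on. One possible alternative, shown here only as a sketch on a toy index array (X_toy and n_splits=3 are assumptions for illustration), is TimeSeriesSplit, which always keeps the training indices before the test indices:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy feature matrix with 12 time-ordered rows (hypothetical, not the HFRS data)
X_toy = np.arange(12).reshape(-1, 1)

# Each split trains on an expanding window and tests on the block that follows it
tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X_toy))

for train_idx, test_idx in splits:
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())
```

An instance such as tscv can be passed as cv=tscv to GridSearchCV in place of cv=5. Whether this changes the selected parameters on this dataset is not something the original scripts tested.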
III. Data
Link: https://pan.baidu.com/s/1EFaWfHoG14h15KCEhn1STg?pwd=q41n
Extraction code: q41n