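The author is a bit lazy: this project took a little over a day, so please just read the code, which is commented in detail throughout. That was a lot of typing, so I am off to take a break, haha.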
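# The goal of the temperature prediction task is to use weather-related data to predict the maximum temperature on a given day. This is a regression task. Start by importing the data.
# Read the data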
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
features = pd.read_csv('temps.csv')
print(features.head())
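# Three tasks to complete
# 1. Build a baseline model with the random forest algorithm: data preprocessing, feature visualization, modeling, and visual analysis of the results
# 2. Analyze how sample size and the number of features affect the results: keeping the algorithm fixed, add more samples and observe how the results change; then revisit feature engineering, introduce new features, and observe the trend again
# 3. Tune the random forest hyperparameters to find the best settings: learn two classic tuning methods in machine learning and pick the best parameters for the current model
# Feature visualization and preprocessing
# Check the size of the data
print('Data shape:', features.shape)
# Look at the summary statistics of each column with describe()
print(features.describe())
# There are no missing values
# Convert the time data to a standard format, since some packages are easier to use for plotting or computation with standard datetime objects
# Process the time data
import datetime
# Get the year, month, and day separately
years = features['year']
months = features['month']
days = features['day']
# Convert to datetime format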
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]
dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in dates]
print(dates[:5])
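# A possible alternative to the string round trip above, shown only as a sketch: datetime.datetime
# accepts the year, month, and day as integers directly. The helper name build_dates and the sample
# values are illustrative assumptions, not part of the original code.
import datetime


def build_dates(years, months, days):
    """Return one datetime per (year, month, day) triple, casting each part to int."""
    return [datetime.datetime(int(y), int(m), int(d)) for y, m, d in zip(years, months, days)]


# Hypothetical sample values; floats are accepted because each part is cast to int
sample_dates = build_dates([2016, 2016.0], [1, 1.0], [1, 2.0])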
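# Plot the data to inspect it visually
import matplotlib.pyplot as plt
# Set the default style
plt.style.use('fivethirtyeight')
# Lay out four panels: the actual maximum temperature (the label), the value from two days before, the value from the previous day, and a friend's estimate of the maximum temperature. With four panels, a 2x2 grid works well; each panel just needs a title and axis labels
# Set up the layout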
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))
fig.autofmt_xdate(rotation=45)
# Label (actual maximum temperature)
ax1.plot(dates, features['actual'])
ax1.set_xlabel('')
ax1.set_ylabel('Temperature')
ax1.set_title('Max Temp')
# Previous day's value
ax2.plot(dates, features['temp_1'])
ax2.set_xlabel('')
ax2.set_ylabel('Temperature')
ax2.set_title('Previous Max Temp')
# Value from two days before
ax3.plot(dates, features['temp_2'])
ax3.set_xlabel('')
ax3.set_ylabel('Two Days Prior Temperature')
ax3.set_title('Two Days Prior Temp')
# Friend's estimate
ax4.plot(dates, features['friend'])
ax4.set_xlabel('Date')
ax4.set_title('Friend Estimate')
plt.tight_layout(pad=2)
plt.show()
# One-hot encoding
# This can be done with the ready-made transformers in scikit-learn or with pandas; on balance, the pandas get_dummies() function is the easier option
# Apply one-hot encoding
features = pd.get_dummies(features)
print(features.head())
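# To make the effect of one-hot encoding concrete, here is a small plain-Python sketch of the idea.
# It is not how pandas implements get_dummies(); the function name one_hot_sketch, the column name
# 'weekday', and the sample rows are hypothetical.
def one_hot_sketch(rows, column):
    """Replace a categorical column with one 0/1 indicator column per distinct category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column unchanged
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator per category, named <column>_<category>
        for category in categories:
            new_row[column + '_' + category] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


sample_rows = [{'temp_1': 45, 'weekday': 'Fri'}, {'temp_1': 44, 'weekday': 'Sat'}]
sample_encoded = one_hot_sketch(sample_rows, 'weekday')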
# With preprocessing done, split the data into features and labels, extracting each from the original dataset
# Features and labels
import numpy as np
# Labels
labels = np.array(features['actual'])
# Remove the label column from the features
features = features.drop('actual', axis=1)
# Save the feature names for later use
feature_list = list(features.columns)
# Convert to a NumPy array
features = np.array(features)
# Split the dataset before training the model
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.25, random_state=42)
print('Training features:', train_features.shape)
print('Training labels:', train_labels.shape)
print('Test features:', test_features.shape)
print('Test labels:', test_labels.shape)
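# Random forest regression model
# With the preparation done, build the random forest model: import the package, create a model with 1000 trees and default values for the other parameters, and leave tuning for later
# Import the algorithm
from sklearn.ensemble import RandomForestRegressor
# Build the model
rf = RandomForestRegressor(n_estimators=1000, random_state=42)
# Train
rf.fit(train_features, train_labels)
# Predict
predictions = rf.predict(test_features)
# Compute the absolute errors
errors = abs(predictions - test_labels)
# Express the error as a percentage (MAPE) and print it
mape = 100 * (errors / test_labels)
print('MAPE:', np.mean(mape))
# Common regression metrics: RMSE, MSE, MAPE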
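# The three metrics named above can be sketched in plain Python as follows. The function names and the
# sample numbers are assumptions made for illustration; MAPE divides by the actual values, so it assumes
# none of them is zero.
import math


def mse_sketch(actual, predicted):
    """Mean squared error: the average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse_sketch(actual, predicted):
    """Root mean squared error: the square root of the MSE, in the units of the target."""
    return math.sqrt(mse_sketch(actual, predicted))


def mape_sketch(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical temperatures, used only to exercise the functions
sample_actual = [50.0, 60.0]
sample_predicted = [45.0, 66.0]
sample_mse = mse_sketch(sample_actual, sample_predicted)
sample_rmse = rmse_sketch(sample_actual, sample_predicted)
sample_mape = mape_sketch(sample_actual, sample_predicted)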
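# Visualizing a tree from the model
# Import the required packages
# If pydot cannot find the Graphviz executable, changing 'dot' to 'dot.exe' in the pydot source, then saving and closing the file, should avoid the error; this also requires Graphviz to be installed
from sklearn.tree import export_graphviz
import pydot
# Take one tree from the forest
tree = rf.estimators_[5]
# Export it to a dot file
export_graphviz(tree, out_file='tree.dot', feature_names=feature_list, rounded=True, precision=1)
# Build the graph from the dot file
(graph,) = pydot.graph_from_dot_file('tree.dot')
# Save it as an image
graph.write_png('tree.png')
# The tree is too large to read, so apply a pruning-style limit
# Constrain the size of the trees
rf_small = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=42)
rf_small.fit(train_features, train_labels)
# Extract one tree
tree_small = rf_small.estimators_[5]
# Save it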
export_graphviz(tree_small, out_file='small_tree.dot', feature_names=feature_list, rounded=True, precision=1)
(graph,) = pydot.graph_from_dot_file('small_tree.dot')
graph.write_png('small_tree.png')
# Feature importance
# Ensemble methods such as random forests make it easy to obtain feature importances, and scikit-learn exposes them through a simple attribute
# Get the feature importances
importances = list(rf.feature_importances_)
# Pair each feature name with its importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort in descending order of importance
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)
# Print each pair
for pair in feature_importances:
    print('Variable: {:20} Importance:{}'.format(*pair))
# Plot the feature importances as a bar chart
# Positions of the bars on the x-axis
x_values = list(range(len(importances)))
# Draw the bar chart
plt.figure(figsize=(12, 12))
plt.bar(x_values, importances, orientation='vertical')
# Tick labels on the x-axis
plt.xticks(x_values, feature_list, rotation='vertical')
# Axis labels and title
plt.ylabel('Importance')
plt.xlabel('Variable')
plt.title('Variable Importances')
plt.show()
# Try a model that uses only the two most important features
rf_most_important = RandomForestRegressor(n_estimators=1000, random_state=42)
# Get the indices of these two features
important_indices = [feature_list.index('temp_1'), feature_list.index('average')]
train_important = train_features[:, important_indices]
test_important = test_features[:, important_indices]
# Retrain the model
rf_most_important.fit(train_important, train_labels)
# Predict
predictions = rf_most_important.predict(test_important)
errors = abs(predictions - test_labels)
# Evaluate
print('Mean Absolute Error:', round(np.mean(errors), 2), 'degrees.')
mape = np.mean(100 * (errors / test_labels))
print('mape:', mape)
# The error goes up slightly, so the other features do carry some value; this is worth re-checking with further experiments
# Dropping low-value features speeds up model building. With a basic random forest and its predictions in hand, compare the predicted values with the actual values
# Date data
months = features[:, feature_list.index('month')]
days = features[:, feature_list.index('day')]
years = features[:, feature_list.index('year')]
# Convert to datetime format
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]
dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in dates]
# Create a table holding the dates and their actual label values
true_data = pd.DataFrame(data={'date': dates, 'actual': labels})
# Likewise, create a table holding the dates and the model's predictions
months = test_features[:, feature_list.index('month')]
days = test_features[:, feature_list.index('day')]
years = test_features[:, feature_list.index('year')]
test_dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in
              zip(years, months, days)]
test_dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in test_dates]
predictions_data = pd.DataFrame(data={'date': test_dates, 'prediction': predictions})
# Actual values
plt.figure(figsize=(10, 10))
plt.plot(true_data['date'], true_data['actual'], 'b-', label='actual')
# Predicted values
plt.plot(predictions_data['date'], predictions_data['prediction'], 'ro', label='prediction')
plt.xticks(rotation=60)
plt.legend()
# Axis labels and title
plt.xlabel('Date')
plt.ylabel('Maximum Temperature (F)')
plt.title('Actual and Predicted Values')
plt.show()
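# Analyzing how data and features affect the results
#
#######################################
# Load a larger dataset and keep the task the same, then observe how the amount of data and the choice of features each affect the results
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
# Read the data
features = pd.read_csv('temps_extended.csv')
print(features.head())
# Check the size of the data
print('Data shape:', features.shape)
# Three new weather features were added
# ws_1: wind speed on the previous day
# prcp_1: precipitation on the previous day
# snwd_1: snow depth on the previous day
# With new features available, plot them for a visual check
# Set up the overall layout
import datetime
# Get the year, month, and day separately
years = features['year']
months = features['month']
days = features['day']
# Convert to datetime format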
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]
dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in dates]
print(dates[:5])
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, figsize=(15, 10))
fig.autofmt_xdate(rotation=45)
# Historical average maximum temperature
ax1.plot(dates, features['average'])
ax1.set_xlabel('')
ax1.set_ylabel('Temperature(F)')
ax1.set_title('Historical Avg Max Temp')
# Wind speed
ax2.plot(dates, features['ws_1'], 'r-')
ax2.set_xlabel('')
ax2.set_ylabel('Wind Speed (mph)')
ax2.set_title('Prior Wind Speed')
# Precipitation
ax3.plot(dates, features['prcp_1'], 'r-')
ax3.set_xlabel('Date')
ax3.set_ylabel('Precipitation (in)')
ax3.set_title('Prior Precipitation')
# Snow depth
ax4.plot(dates, features['snwd_1'], 'ro')
ax4.set_xlabel('Date')
ax4.set_ylabel('Snow Depth (in)')
ax4.set_title('Prior Snow Depth')
plt.tight_layout(pad=2)
plt.show()
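# After adding the three new features, the plots help us inspect them and also check whether the data is clean
# Feature engineering
# Deriving meaningful features from the existing ones helps both modeling and analysis
# Create a season variable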
import warnings
warnings.filterwarnings("ignore")
seasons = []
for month in features['month']:
    if month in [1, 2, 12]:
        seasons.append('winter')
    elif month in [3, 4, 5]:
        seasons.append('spring')
    elif month in [6, 7, 8]:
        seasons.append('summer')
    elif month in [9, 10, 11]:
        seasons.append('fall')
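# The same month-to-season mapping can also be written as a dictionary lookup. This is only an
# equivalent sketch; the names MONTH_TO_SEASON and season_of are assumptions for illustration.
MONTH_TO_SEASON = {}
for season_name, season_months in [('winter', [12, 1, 2]), ('spring', [3, 4, 5]),
                                   ('summer', [6, 7, 8]), ('fall', [9, 10, 11])]:
    for season_month in season_months:
        MONTH_TO_SEASON[season_month] = season_name


def season_of(month_number):
    """Return the season name for a month number between 1 and 12."""
    return MONTH_TO_SEASON[int(month_number)]


# Example: map a few month numbers, including a float as read from a numeric column
sample_seasons = [season_of(value) for value in [1, 4.0, 7, 10]]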
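# With the season available, there is more to analyze
# Take a copy so that adding a column does not modify a view of the original DataFrame
reduced_features = features[['temp_1', 'prcp_1', 'average', 'actual']].copy()
reduced_features['season'] = seasons
# With the season feature in place, look at how the features above vary across seasons
# The pairplot() function used for this requires seaborn to be installed
# seaborn wraps matplotlib and is more convenient to use
# Import seaborn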
import seaborn as sns
sns.set(style='ticks', color_codes=True)
# Choose a color palette you like
palette = sns.xkcd_palette(['dark blue', 'dark green', 'gold', 'orange'])
# Draw the pairplot (seaborn 0.11 and later use `fill`; older versions call this option `shade`)
sns.pairplot(reduced_features, hue='season', diag_kind='kde', palette=palette, plot_kws=dict(alpha=0.7),
             diag_kws=dict(fill=True))
plt.show()
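# 9.2.2 Effect of data size on the results
# Run a series of comparison experiments: do the results change as the amount of data grows?
# One-hot encoding
features = pd.get_dummies(features)
# Extract the features and labels
labels = features['actual']
features = features.drop('actual', axis=1)
feature_list = list(features.columns)
# Convert to the required format
import numpy as np
features = np.array(features)
labels = np.array(labels)
# Split the dataset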
from sklearn.model_selection import train_test_split
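train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.25, random_state=0)
print('Training features:', train_features.shape)
print('Training labels:', train_labels.shape)
print('Test features:', test_features.shape)
print('Test labels:', test_labels.shape)
# The new dataset consists of 1643 training samples and 548 test samples. For a fair comparison the same test set must be used throughout, and because a new dataset has been loaded, the smaller dataset has to go through the same preprocessing again
# Import packages
import pandas as pd
# To remove the effect of the number of features, restrict everything to the features of the old dataset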
original_feature_indices = [feature_list.index(feature) for feature in feature_list if
feature not in ['ws_1', 'prcp_1', 'snwd_1']]
# Read the old data
original_features = pd.read_csv('temps.csv')
original_features = pd.get_dummies(original_features)
import numpy as np
# Convert the data and labels
original_labels = np.array(original_features['actual'])
original_features = original_features.drop('actual', axis=1)
original_feature_list = list(original_features.columns)
original_features = np.array(original_features)
# Split the dataset
from sklearn.model_selection import train_test_split
original_train_features, original_test_features, original_train_labels, original_test_labels = train_test_split(
    original_features, original_labels, test_size=0.25, random_state=42)
# Build the same tree model
from sklearn.ensemble import RandomForestRegressor
# Same parameters and random seed
rf = RandomForestRegressor(n_estimators=100, random_state=0)
# Train on the old dataset
rf.fit(original_train_features, original_train_labels)
# For fairness, use one consistent test set: the test split of the new dataset
# (this assumes the old dataset's columns match the selected columns in number and order)
predictions = rf.predict(test_features[:, original_feature_indices])
# Compute the mean absolute temperature error
errors = abs(predictions - test_labels)
print('Mean temperature error:', round(np.mean(errors), 2), 'degrees')
# MAPE
mape = 100 * (errors / test_labels)
# "Accuracy" here is simply 100 minus the MAPE, defined for easier reading; the larger it is, the better
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
# The output above shows a mean temperature error of 4.67 with the smaller sample; now see whether more samples improve the result
from sklearn.ensemble import RandomForestRegressor
# Drop the new features so that the feature set stays the same
original_train_features = train_features[:, original_feature_indices]
original_test_features = test_features[:, original_feature_indices]
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(original_train_features, train_labels)
# Predict
baseline_predictions = rf.predict(original_test_features)
# Results
baseline_errors = abs(baseline_predictions - test_labels)
print('Mean temperature error:', round(np.mean(baseline_errors), 2), 'degrees.')
# MAPE
baseline_mape = 100 * np.mean(baseline_errors / test_labels)
# Accuracy
baseline_accuracy = 100 - baseline_mape
print('Accuracy:', round(baseline_accuracy, 2), '%.')
# With more data, the mean temperature error drops to 4.2. In machine learning tasks more data generally helps: the model learns more thoroughly, and the risk of overfitting goes down
# 9.2.3 Effect of the number of features on the results
# Now add the new features
from sklearn.ensemble import RandomForestRegressor
rf_exp = RandomForestRegressor(n_estimators=100, random_state=0)
rf_exp.fit(train_features, train_labels)
# Same test set
predictions = rf_exp.predict(test_features)
# Evaluate
errors = abs(predictions - test_labels)
print('Mean temperature error:', round(np.mean(errors), 2), 'degrees.')
# MAPE
mape = np.mean(100 * (errors / test_labels))
# See how much it improved (abs() makes this the size of the relative change in MAPE, whatever its direction)
improvement_baseline = 100 * abs(mape - baseline_mape) / baseline_mape
print('Improvement in model performance with more features:', round(improvement_baseline, 2), '%.')
# Accuracy
accuracy = 100 - mape
print('Accuracy:', round(accuracy, 2), '%.')
# Feature importances
importances = list(rf_exp.feature_importances_)
# Pair each name with its value
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort in descending order of importance
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)
# Print the results
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))
# Show them in a chart
# Set the style
plt.style.use('fivethirtyeight')
# Bar positions
x_values = list(range(len(importances)))
# Draw the bar chart
plt.bar(x_values, importances, orientation='vertical', color='r', edgecolor='k', linewidth=1.2)
# Write the x-axis tick labels vertically
plt.xticks(x_values, feature_list, rotation='vertical')
# Axis labels and title
plt.ylabel('Importance')
plt.xlabel('Variable')
plt.title('Variable Importances')
plt.show()
# Sort the features by importance, compute the running total with cumsum(), and set a threshold on it (95% is a common choice) to select the features to keep
# Sort the features
sorted_importances = [importance[1] for importance in feature_importances]
sorted_features = [importance[0] for importance in feature_importances]
# Cumulative importance
cumulative_importances = np.cumsum(sorted_importances)
# Draw the line plot
plt.plot(x_values, cumulative_importances, 'g-')
# Draw a red dashed line at y=0.95
plt.hlines(y=0.95, xmin=0, xmax=len(sorted_importances), color='r', linestyles='dashed')
# x-axis
plt.xticks(x_values, sorted_features, rotation='vertical')
# Axis labels and title
plt.xlabel('Variable')
plt.ylabel('Cumulative Importance')
plt.title('Cumulative Importances')
plt.show()
# The cumulative total reaches 95% at the fifth feature, so run a comparison experiment using those 5 features
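# Rather than reading the cut-off from the plot, the number of features needed to reach the threshold
# can be computed. A minimal sketch in plain Python; the function name features_needed and the sample
# importances are made up for illustration.
from itertools import accumulate


def features_needed(importances_desc, threshold=0.95):
    """Return how many of the top features are needed for the running total to reach threshold."""
    for count, total in enumerate(accumulate(importances_desc), start=1):
        if total >= threshold:
            return count
    # Rounded importances may never reach the threshold; fall back to using every feature
    return len(importances_desc)


# Hypothetical importances, already sorted in descending order
sample_importances = [0.5, 0.25, 0.125, 0.125]
sample_needed = features_needed(sample_importances, threshold=0.75)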
# Select these features
important_feature_names = [feature[0] for feature in feature_importances[0:5]]
# Find their column indices
important_indices = [feature_list.index(feature) for feature in important_feature_names]
# Rebuild the training and test sets
important_train_features = train_features[:, important_indices]
important_test_features = test_features[:, important_indices]
# Data dimensions
print('Important train features shape:', important_train_features.shape)
print('Important test features shape:', important_test_features.shape)
# Retrain the model
rf_exp.fit(important_train_features, train_labels)
# Same test set
predictions = rf_exp.predict(important_test_features)
# Evaluate
errors = abs(predictions - test_labels)
print('Mean temperature error:', round(np.mean(errors), 2), 'degrees.')
mape = 100 * (errors / test_labels)
# Accuracy
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
# Performance drops slightly, so the remaining 5% of importance probably does contribute something. Since accuracy did not improve, check whether run time did
# Measure the run time
import time
# First, using all features
all_features_time = []
# A single run may be noisy, so average over 10 runs
for _ in range(10):
    start_time = time.time()
    rf_exp.fit(train_features, train_labels)
    all_features_predictions = rf_exp.predict(test_features)
    end_time = time.time()
    all_features_time.append(end_time - start_time)
all_features_time = np.mean(all_features_time)
print('Average time to train and test with all features:', round(all_features_time, 2), 'seconds.')
# Run times are somewhat longer on a laptop. Next, look at the result when only the high-importance features are used
# This time, using only the important features
reduced_features_time = []
# A single run may be noisy, so average over 10 runs
for _ in range(10):
    start_time = time.time()
    rf_exp.fit(important_train_features, train_labels)
    reduced_features_predictions = rf_exp.predict(important_test_features)
    end_time = time.time()
    reduced_features_time.append(end_time - start_time)
reduced_features_time = np.mean(reduced_features_time)
print('Average time to train and test with the reduced features:', round(reduced_features_time, 2), 'seconds.')
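# The two timing loops above repeat the same pattern, which could be factored into a helper. A sketch
# using time.perf_counter, a clock intended for measuring short durations; the helper name
# average_runtime and the toy workload are assumptions.
import time


def average_runtime(func, repeats=10):
    """Call func repeats times and return the mean wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


# Toy workload standing in for a fit-and-predict call
sample_runtime = average_runtime(lambda: sum(range(1000)), repeats=3)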
# The run time is clearly shorter, since the decision trees have far fewer features to scan. Put the two side by side for easier comparison
# Compute the evaluation results from each set of predictions
all_accuracy = 100 * (1 - np.mean(abs(all_features_predictions - test_labels) / test_labels))
reduced_accuracy = 100 * (1 - np.mean(abs(reduced_features_predictions - test_labels) / test_labels))
# Create a DataFrame to hold the results
comparison = pd.DataFrame(
    {'features': ['all(17)', 'reduced(5)'], 'run_time': [round(all_features_time, 2), round(reduced_features_time, 2)],
     'accuracy': [round(all_accuracy, 2), round(reduced_accuracy, 2)]})
comparison = comparison[['features', 'accuracy', 'run_time']]
print(comparison)
# The accuracy here is a custom measure defined for convenience and used only for comparison. In a real business setting, run time may take priority over accuracy
# Look at the concrete numbers for each change
relative_accuracy_decrease = 100 * (all_accuracy - reduced_accuracy) / all_accuracy
print('Relative decrease in accuracy:', round(relative_accuracy_decrease, 3), '%.')
relative_runtime_decrease = 100 * (all_features_time - reduced_features_time) / all_features_time
print('Relative decrease in run time:', round(relative_runtime_decrease, 3), '%.')
# Set up the overall layout; a single row reads better here
import pandas as pd
# Read the data
original_features = pd.read_csv('temps.csv')
original_features = pd.get_dummies(original_features)
# Use NumPy
import numpy as np
# The label values we want to predict
original_labels = np.array(original_features['actual'])
# Remove the label from the features
original_features = original_features.drop('actual', axis=1)
# Save the feature names for later use
original_features_list = list(original_features.columns)
# Convert to a NumPy array
original_features = np.array(original_features)
# Import the dataset splitting helper
from sklearn.model_selection import train_test_split
# Split the data
original_train_features, original_test_features, original_train_labels, original_test_labels = train_test_split(
    original_features, original_labels, test_size=0.25, random_state=42)
# Find the indices of the original features
original_features_indices = [feature_list.index(feature) for feature in
                             feature_list if feature not in
                             ['ws_1', 'prcp_1', 'snwd_1']]
# Build a test set restricted to the original features
original_test_features = test_features[:, original_features_indices]
# Time the model on the original features
original_features_time = []
# Repeat 10 times
for _ in range(10):
    start_time = time.time()
    rf.fit(original_train_features, original_train_labels)
    original_features_predictions = rf.predict(original_test_features)
    end_time = time.time()
    original_features_time.append(end_time - start_time)
original_features_time = np.mean(original_features_time)
# Compute the mean absolute error of the three models
original_mae = np.mean(abs(original_features_predictions - test_labels))
exp_all_mae = np.mean(abs(all_features_predictions - test_labels))
exp_reduced_mae = np.mean(abs(reduced_features_predictions - test_labels))
# Accuracy of the model trained on the original data
original_accuracy = 100 * (1 - np.mean(abs(original_features_predictions - test_labels) / test_labels))
# Create the model_comparison DataFrame
model_comparison = pd.DataFrame({'model': ['original', 'exp_all', 'exp_reduced'],
                                 'error (degrees)': [original_mae, exp_all_mae, exp_reduced_mae],
                                 'accuracy': [original_accuracy, all_accuracy, reduced_accuracy],
                                 'run_time (s)': [original_features_time, all_features_time, reduced_features_time]})
model_comparison = model_comparison[['model', 'error (degrees)', 'accuracy', 'run_time (s)']]
fig, (ax1, ax2, ax3) = plt.subplots(nrows=1, ncols=3, figsize=(16, 5), sharex=True)
# x-axis
x_values = [0, 1, 2]
model_names = list(model_comparison['model'])
plt.xticks(x_values, model_names)
# Font sizes
fontdict = {'fontsize': 18}
fontdict_yaxis = {'fontsize': 14}
# Compare the error between predicted and actual temperatures
ax1.bar(x_values, model_comparison['error (degrees)'], color=['b', 'r', 'g'], edgecolor='k', linewidth=1.5)
ax1.set_ylim(bottom=3.5, top=4.5)
ax1.set_ylabel('Error (degrees) (F)', fontdict=fontdict_yaxis)
ax1.set_title('Model Error Comparison', fontdict=fontdict)
# Accuracy comparison
ax2.bar(x_values, model_comparison['accuracy'], color=['b', 'r', 'g'], edgecolor='k', linewidth=1.5)
ax2.set_ylim(bottom=92, top=94)
ax2.set_ylabel('Accuracy(%)', fontdict=fontdict_yaxis)
ax2.set_title('Model Accuracy Comparison', fontdict=fontdict)
# Run-time comparison
ax3.bar(x_values, model_comparison['run_time (s)'], color=['b', 'r', 'g'], edgecolor='k', linewidth=1.5)
ax3.set_ylim(bottom=0, top=1)
ax3.set_ylabel('Run Time (sec)', fontdict=fontdict_yaxis)
ax3.set_title('Model Run-Time Comparison', fontdict=fontdict)
plt.show()
# The final choice of model has to be made in light of the actual business application, so the analysis must be done thoroughly
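# Model tuning
# The comparisons so far were about data and features. Another important piece of work remains: hyperparameter tuning. Start by looking at which parameters of the tree model can be tuned
# Print the tunable parameters
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=42)
from pprint import pprint
# Print all parameters
pprint(rf.get_params())
# Randomized parameter search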
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
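# Number of trees
import numpy as np
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=100)]
# How many features to consider at each split (1.0 means all features, which is what 'auto' meant for regressors before scikit-learn removed that option)
max_features = [1.0, 'sqrt']
# Maximum depth of each tree
max_depth = [int(x) for x in np.linspace(10, 20, num=2)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples at a leaf node; no split may leave a child with fewer samples than this
min_samples_leaf = [1, 2, 4]
# Sampling method (with or without bootstrap)
bootstrap = [True, False]
# Random parameter space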
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
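# The grid above contains 100 * 2 * 3 * 3 * 3 * 2 = 10,800 combinations, far too many to try one by
# one. To get a feel for what a randomized search does with such a grid, the sketch below counts the
# combinations and draws a few at random. It is plain Python, not scikit-learn's implementation:
# the small grid, the function names, and the seed are made up, and unlike scikit-learn (which, as far
# as I know, samples without replacement when every parameter is a list) it may draw a combination twice.
import random


def count_combinations(grid):
    """Return the number of distinct parameter combinations in a grid of lists."""
    total = 1
    for values in grid.values():
        total *= len(values)
    return total


def sample_combinations(grid, n_iter, seed=0):
    """Draw n_iter combinations by picking one value per parameter at random."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in grid.items()} for _ in range(n_iter)]


sample_grid = {'n_estimators': [200, 400, 600], 'max_depth': [10, 20, None], 'bootstrap': [True, False]}
sample_total = count_combinations(sample_grid)
sample_draws = sample_combinations(sample_grid, n_iter=4)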
# Randomly search for the best parameter combination
rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=100, scoring='neg_mean_absolute_error',
                               cv=3, verbose=2, random_state=42, n_jobs=-1)
# Run the search
rf_random.fit(train_features, train_labels)
print(rf_random)
print(rf_random.best_params_)
# For a fair comparison, first define the evaluation criterion, consistent with the earlier experiments
def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('Mean temperature error:', np.mean(errors))
    print('Accuracy = {:0.2f}%'.format(accuracy))
# Results with default parameters
base_model = RandomForestRegressor(random_state=42)
base_model.fit(train_features, train_labels)
evaluate(base_model, test_features, test_labels)
# Grid search
from sklearn.model_selection import GridSearchCV
# Candidate parameter space for the grid search
param_grid = {
    'bootstrap': [True],
    'max_depth': [8, 10, 12],
    'max_features': [1.0],  # all features; stands in for 'auto', which newer scikit-learn versions reject
    'min_samples_leaf': [2, 3, 4, 5, 6],
    'min_samples_split': [3, 5, 7],
    'n_estimators': [800, 900, 1000, 1200]
}
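# A grid search tries every combination in param_grid, so its cost can be estimated up front: the grid
# above has 1 * 3 * 1 * 5 * 3 * 4 = 180 candidates, and each one is fitted once per cross-validation
# fold. The plain-Python sketch below enumerates a grid with itertools.product; the small grid, the
# names, and the fold count of 3 are assumptions used for illustration.
from itertools import product


def expand_grid(grid):
    """Return every combination of the grid as a list of parameter dictionaries."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[name] for name in names))]


sample_param_grid = {'max_depth': [8, 10], 'min_samples_leaf': [2, 3, 4]}
sample_candidates = expand_grid(sample_param_grid)
# Number of model fits: candidates times the assumed number of cross-validation folds
sample_fits = len(sample_candidates) * 3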
# Choose the base model
rf = RandomForestRegressor()
# Grid search
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           scoring='neg_mean_absolute_error',
                           cv=3, n_jobs=-1, verbose=2)
# Run the search
grid_search.fit(train_features, train_labels)
best_grid = grid_search.best_estimator_
evaluate(best_grid, test_features, test_labels)
# Another set of grid search parameters
param_grid = {
    'bootstrap': [True],
    'max_depth': [12, 15, None],
    'max_features': [3, 4, 1.0],  # 1.0 means all features
    'min_samples_leaf': [5, 6, 7],
    'min_samples_split': [7, 10, 13],
    'n_estimators': [900, 1000, 1200]
}
# Choose the model
rf = RandomForestRegressor()
# Continue the search
grid_search_ad = GridSearchCV(estimator=rf, param_grid=param_grid, scoring='neg_mean_absolute_error',
                              cv=3, n_jobs=-1, verbose=2)
grid_search_ad.fit(train_features, train_labels)
best_grid_ad = grid_search_ad.best_estimator_
evaluate(best_grid_ad, test_features, test_labels)
print('Final model parameters:\n')
pprint(best_grid_ad.get_params())