Translation: Complete Guide to Parameter Tuning in Gradient Boosting (GBM) in Python
This article helps you understand how GBM really works. The original article was inspired by NYC Data Science Academy. Unlike bagging, which only controls high variance in a model, boosting tackles the bias-variance trade-off and is therefore considered more effective than bagging.
This article will reveal the science behind GBM and teach you how to tune its parameters to obtain reliable results.
Boosting is a sequential technique that works on the principle of ensembling: it combines a set of weak learners to improve prediction accuracy. At iteration t, the outcomes are weighted based on the results of the previous t-1 iterations: observations predicted correctly are given lower weights, while mis-classified observations are given higher weights.
Consider the following example:
Figure 2: Output of the second weak learner
Changes similar to those in Figure 2 can be seen in Figure 3. This is repeated for a number of iterations; each individual model's contribution is weighted according to its accuracy, and the outputs are finally combined into a single result.
The following articles may also be useful for reference:
Learn Gradient Boosting Algorithm for better predictions (with codes in R)
Quick Introduction to Boosting Algorithms in Machine Learning
Getting smart with Machine Learning – AdaBoost and Gradient Boost
All the parameters of GBM can be divided into the following 3 categories:
1. Tree-Specific Parameters: these affect each individual tree in the model;
2. Boosting Parameters: these affect the boosting operation of the model;
3. Miscellaneous Parameters: other parameters that affect the overall functioning.
Before looking at the parameters in detail, here is the overall pseudo-code of the GBM algorithm:
1. Initialize the outcome
2. Iterate from 1 to total number of trees
2.1 Update the weights for targets based on previous run (higher for the ones mis-classified)
2.2 Fit the model on selected subsample of data
2.3 Make predictions on the full set of observations
2.4 Update the output with current results taking into account the learning rate
3. Return the final output.
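To make the pseudo-code concrete, here is a minimal regression sketch (my own illustration, not code from the article), assuming squared-error loss, where the article's "update the weights" step is interpreted as fitting each new tree to the current residuals, and assuming X and y are NumPy arrays:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def simple_gbm(X, y, n_trees=100, learning_rate=0.1, subsample=0.8, max_depth=3):
    rng = np.random.RandomState(10)
    # 1. Initialize the outcome with a constant prediction
    pred = np.full(len(y), y.mean())
    trees = []
    # 2. Iterate from 1 to total number of trees
    for _ in range(n_trees):
        # 2.1 With squared loss, "updating the weights" amounts to taking the current residuals
        residuals = y - pred
        # 2.2 Fit the model on a selected subsample of the data
        idx = rng.choice(len(y), size=int(subsample * len(y)), replace=False)
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=10)
        tree.fit(X[idx], residuals[idx])
        # 2.3 Make predictions on the full set of observations
        update = tree.predict(X)
        # 2.4 Update the output with current results, taking the learning rate into account
        pred = pred + learning_rate * update
        trees.append(tree)
    # 3. Return the final output
    return pred, trees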
2.Boosting Parameters
The boosting parameters mainly affect step 2.2 (fitting the model) in the pseudo-code above. They are:
+ learning_rate: determines the impact of each tree on the final outcome (step 2.4); lower values make the model more robust but require more trees
+ n_estimators: the number of sequential trees to be modeled (the loop in step 2)
+ subsample: the fraction of observations used to fit each individual tree (step 2.2)
3.Miscellaneous Parameters
The author uses data from the Data Hackathon 3.x AV hackathon; the problem statement can be found on the competition page, where the data can also be downloaded. The author's data preprocessing steps were as follows:
1.City variable dropped because of too many categories
2.DOB converted to Age | DOB dropped
3.EMI_Loan_Submitted_Missing created which is 1 if EMI_Loan_Submitted was missing else 0 | Original variable EMI_Loan_Submitted dropped
4.EmployerName dropped because of too many categories
5.Existing_EMI imputed with 0 (median) since only 111 values were missing
6.Interest_Rate_Missing created which is 1 if Interest_Rate was missing else 0 | Original variable Interest_Rate dropped
7.Lead_Creation_Date dropped because made little intuitive impact on outcome
8.Loan_Amount_Applied, Loan_Tenure_Applied imputed with median values
9.Loan_Amount_Submitted_Missing created which is 1 if Loan_Amount_Submitted was missing else 0 | Original variable Loan_Amount_Submitted dropped
10.Loan_Tenure_Submitted_Missing created which is 1 if Loan_Tenure_Submitted was missing else 0 | Original variable Loan_Tenure_Submitted dropped
11.LoggedIn, Salary_Account dropped
12.Processing_Fee_Missing created which is 1 if Processing_Fee was missing else 0 | Original variable Processing_Fee dropped
13.Source – top 2 kept as is and all others combined into different category
14.Numerical and One-Hot-Coding of categorical variables performed
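For illustration, a couple of the steps above could be written in pandas as follows (a hypothetical sketch of steps 5 and 6, assuming the raw data has been loaded into a DataFrame named train; this is not the author's preprocessing code):
import pandas as pd
# Step 5: impute Existing_EMI with 0, since only 111 values were missing
train['Existing_EMI'] = train['Existing_EMI'].fillna(0)
# Step 6: create a missing-value indicator for Interest_Rate, then drop the original column
train['Interest_Rate_Missing'] = train['Interest_Rate'].isnull().astype(int)
train = train.drop('Interest_Rate', axis=1)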
As a rule of thumb, the model with default parameters is used as the baseline before tuning:
from sklearn.ensemble import GradientBoostingClassifier #GBM algorithm
from sklearn.grid_search import GridSearchCV #needed for the grid searches below; this module is from older scikit-learn versions (removed in 0.20)
#Choose all predictors except target & IDcols
predictors = [x for x in train.columns if x not in [target, IDcol]]
gbm0 = GradientBoostingClassifier(random_state=10)
modelfit(gbm0, train, predictors) #modelfit is a helper function written by the author to fit the model and report the results of interest
Fitting with the default parameters gives us a baseline, and the tuned model is expected to beat it. Tuning covers the two parameter types mentioned above; the general strategy for the learning rate is to pick a small value, provided we can afford to train enough trees.
GBM becomes fairly robust to overfitting as the number of trees grows, but a high learning rate can still cause overfitting. We can counter this by lowering the learning rate and increasing the number of trees, at the cost of heavier computation and longer training time.
Based on this, we can adopt the following strategy:
1. Choose a relatively high learning rate. 0.1 usually works; values in [0.05, 0.2] can also be considered in some cases.
2. Determine the optimal number of trees for this learning rate. A typical range is 40-70; pick a range your system can search through reasonably quickly.
3. Tune the tree-specific parameters: with the learning rate and number of trees fixed, tune the tree-specific parameters. The author gives an example under the next heading.
4. Finally, lower the learning rate and increase the number of trees proportionally to obtain a more robust model.
To find the optimal parameters, we first set some initial values, as follows:
+ min_samples_split = 500: usually set to about 0.5-1% of the total number of samples; for imbalanced classes a lower value can be used
+ min_samples_leaf = 50: can be chosen by intuition; its purpose is to prevent overfitting, and again a smaller value can be considered for imbalanced data
+ max_depth = 8: chosen around 5-8 based on the number of observations and features
+ max_features = 'sqrt': a common rule of thumb to start with
+ subsample = 0.8: a commonly used starting value
These are only initial values; with the default learning_rate = 0.1, we can grid-search for the optimal number of trees in the range [20, 80] with a step of 10. The tuning is done as follows:
#Choose all predictors except target & IDcols
predictors = [x for x in train.columns if x not in [target, IDcol]]
param_test1 = {'n_estimators':range(20,81,10)}
gsearch1 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, min_samples_split=500,min_samples_leaf=50,max_depth=8,max_features='sqrt',subsample=0.8,random_state=10),
param_grid = param_test1, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch1.fit(train[predictors],train[target])
The results of the search can be inspected like this:
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_
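Note that grid_scores_ only exists in the old sklearn.grid_search module; on scikit-learn 0.20 or later, an equivalent check (my adaptation, not part of the original article) would be:
import pandas as pd
# cv_results_ replaces grid_scores_ in sklearn.model_selection.GridSearchCV
pd.DataFrame(gsearch1.cv_results_)[['param_n_estimators', 'mean_test_score', 'std_test_score']]
gsearch1.best_params_, gsearch1.best_score_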
From the output above, the optimal number of trees is 60. But what if the optimum comes out too low or too high?
1.If the value is around 20, lower the learning rate (e.g. to 0.05) and re-run the grid search (see the sketch after this list)
2.If the value is too high, around 100, tuning the other parameters would take too long, so the quickest fix is to increase the learning rate
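For example, if the optimum had come out near 20, the same search could be re-run with a lower learning rate and a wider range of trees (a hypothetical variation, not something the author actually ran):
param_test1b = {'n_estimators':range(30,151,10)}
gsearch1b = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.05, min_samples_split=500,min_samples_leaf=50,max_depth=8,max_features='sqrt',subsample=0.8,random_state=10),
param_grid = param_test1b, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch1b.fit(train[predictors],train[target])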
The order in which to tune the tree parameters:
1. max_depth and min_samples_split
2. min_samples_leaf
3. max_features
The order of tuning should be chosen carefully: parameters with a larger impact on the result should be tuned first.
The author's example is as follows:
param_test2 = {'max_depth':range(5,16,2), 'min_samples_split':range(200,1001,200)}
gsearch2 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, n_estimators=60, max_features='sqrt', subsample=0.8, random_state=10),
param_grid = param_test2, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch2.fit(train[predictors],train[target])
gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_
The tuned result is max_depth = 9 and min_samples_split = 1000. Since min_samples_split landed on the upper bound of the search range, it is worth extending the range and tuning it again (this is optional; if the scores at max_depth = 9 differ very little, it can be skipped).
param_test3 = {'min_samples_split':range(1000,2100,200), 'min_samples_leaf':range(30,71,10)}
gsearch3 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, n_estimators=60,max_depth=9,max_features='sqrt', subsample=0.8, random_state=10),
param_grid = param_test3, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch3.fit(train[predictors],train[target])
gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_
After tuning these parameters, we can look at the feature importances and compare them with those of the baseline model; we will find that the tuned model is able to extract information from many more features.
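The author's modelfit helper produces the feature-importance plots; for reference, a minimal standalone way to inspect them could look like the following (my own sketch, relying on the fact that GridSearchCV refits the best estimator on the full training data by default):
import pandas as pd
import matplotlib.pyplot as plt
# best_estimator_ has already been refit on the full training data (refit=True is the default)
feat_imp = pd.Series(gsearch3.best_estimator_.feature_importances_, index=predictors)
feat_imp.sort_values(ascending=False).plot(kind='bar', title='Feature Importances')
plt.ylabel('Feature Importance Score')
plt.show()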
Next, we tune the last tree parameter, max_features, over the range [7, 19] with a step of 2:
param_test4 = {'max_features':range(7,20,2)}
gsearch4 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, n_estimators=60,max_depth=9, min_samples_split=1200, min_samples_leaf=60, subsample=0.8, random_state=10),
param_grid = param_test4, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch4.fit(train[predictors],train[target])
gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_
With these steps complete, all the tree-specific parameters have been tuned.
To refine the model further, the next step is to tune subsample, trying the values 0.6, 0.7, 0.75, 0.8, 0.85 and 0.9:
param_test5 = {'subsample':[0.6,0.7,0.75,0.8,0.85,0.9]}
gsearch5 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, n_estimators=60,max_depth=9,min_samples_split=1200, min_samples_leaf=60, subsample=0.8, random_state=10,max_features=7),
param_grid = param_test5, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch5.fit(train[predictors],train[target])
gsearch5.grid_scores_, gsearch5.best_params_, gsearch5.best_score_
After this step, the author found subsample = 0.85 to be the optimal value, so all the tuned parameter values are now in place. The next step is to lower the learning rate and increase the number of trees.
As the number of trees grows, the time spent on cross-validation increases quickly. The author halves the learning rate to 0.05 and doubles the number of trees to 120 (and then keeps halving the learning rate and doubling the number of trees in the same way):
predictors = [x for x in train.columns if x not in [target, IDcol]]
gbm_tuned_1 = GradientBoostingClassifier(learning_rate=0.05, n_estimators=120,max_depth=9, min_samples_split=1200,min_samples_leaf=60, subsample=0.85, random_state=10, max_features=7)
modelfit(gbm_tuned_1, train, predictors)
Another dark horse among the tuning options is warm_start. It can be used to gradually grow the number of trees without retraining from scratch every time. The author does not describe this parameter further, but a minimal sketch of how it could be used is shown below.
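A minimal sketch (my own illustration, reusing the parameter values found above): with warm_start=True, calling fit again with a larger n_estimators only trains the additional trees.
gbm_ws = GradientBoostingClassifier(learning_rate=0.05, n_estimators=120, max_depth=9, min_samples_split=1200, min_samples_leaf=60, subsample=0.85, random_state=10, max_features=7, warm_start=True)
gbm_ws.fit(train[predictors], train[target])
gbm_ws.set_params(n_estimators=240)   # double the number of trees...
gbm_ws.fit(train[predictors], train[target])   # ...only the 120 new trees are trained here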