译文:Complete Guide to Parameter Tuning in XGBoost
当模型没有达到预期效果的时候,XGBoost就是数据科学家的最终武器。XGboost是一个高度复杂的算法,有足够的能力去学习数据的各种各样的不规则特征。
用XGBoost建模很简单,但是提升XGBoost的模型效果却需要很多的努力。因为这个算法使用了多维的参数。为了提升模型效果,调参就不可避免,但是想要知道参数怎么调,什么样的参数能够得出较优的模型输出就很困难了。
这篇文章适用于XGBoost新手,教会新手学习使用XGBoost的一些有用信息来帮忙调整参数。
XGBoost(extreme Gradient Boosting) 是一个高级的梯度增强算法(gradient boosting algorithm),推荐看一下我前一篇翻译的自该作者的文章
加深对XGBoost的理解的文章:
1.XGBoost Guide – Introduction to Boosted Trees
2.Words from the Author of XGBoost
XGBoost的变量类型有三类:
booster [default=gbtree]:
silent [default=0]:
虽然XGBoost有两种boosters,作者在参数这一块只讨论了tree booster,原因是tree booster的表现总是好于 linear booster
此类变量用于定义优化目标每一次计算的需要用到的变量
有些变量在Python的sklearn的接口中对应命名如下:
1. eta -> learning rate
2. lambda ->reg_lambda
3. alpha -> reg_alpha
可能感到困惑的是这里并没有像GBM中一样提及n_estimators,这个参数实际存在于XGBClassifier中,但实际是通过num_boosting_rounds在我们调用fit函数事来体现的。
作者推荐以下链接,进一步加深对XGBOOST的了解:
1.XGBoost Parameters (official guide)
2.XGBoost Demo Codes (xgboost GitHub repository)
3.Python API Reference (official guide)
#导入需要的数据和库
#Import libraries:
import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import cross_validation, metrics #Additional scklearn functions
from sklearn.grid_search import GridSearchCV #Perforing grid search
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4
train = pd.read_csv('train_modified.csv')
target = 'Disbursed'
IDcol = 'ID'
此处作者调用了两种类型的XGBoost:
1.xgb:xgboost直接的库,可以调用cv函数
2.XGBClassifier: sklearn对XGBoost的包装,可以允许使用sklearn的网格搜索功能进行并行计算
#定义一个函数帮助产生xgboost模型及其效果
def modelfit(alg, dtrain, predictors,useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
if useTrainCV:
xgb_param = alg.get_xgb_params()
xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
metrics='auc', early_stopping_rounds=early_stopping_rounds, show_progress=False)
alg.set_params(n_estimators=cvresult.shape[0])
#Fit the algorithm on the data
alg.fit(dtrain[predictors], dtrain['Disbursed'],eval_metric='auc')
#Predict training set:
dtrain_predictions = alg.predict(dtrain[predictors])
dtrain_predprob = alg.predict_proba(dtrain[predictors])[:,1]
#Print model report:
print "\nModel Report"
print "Accuracy : %.4g" % metrics.accuracy_score(dtrain['Disbursed'].values, dtrain_predictions)
print "AUC Score (Train): %f" % metrics.roc_auc_score(dtrain['Disbursed'], dtrain_predprob)
feat_imp = pd.Series(alg.booster().get_fscore()).sort_values(ascending=False)
feat_imp.plot(kind='bar', title='Feature Importances')
plt.ylabel('Feature Importance Score')
#xgboost’s sklearn没有feature_importances,但是#get_fscore() 有相同的功能
通常的做法如下:
1.选择一个相对高一点的学习率(learning rate):通常0.1是有用的,但是根据问题的不同,可以选择范围在[0.05,0.3]之间,根据选好的学习率选择最优的树的数目,xgboost有一个非常有用的cv函数可以用于交叉验证并能返回最终的最优树的数目
2.调tree-specific parameters(max_depth, min_child_weight, gamma, subsample, colsample_bytree)
3.调regularization parameters(lambda, alpha)
4.调低学习率并决定优化的参数
1.设置参数的初始值:
#通过固定的学习率0.1和cv选择合适的树的数量
#Choose all predictors except target & IDcols
predictors = [x for x in train.columns if x not in [target, IDcol]]
xgb1 = XGBClassifier(
learning_rate =0.1,
n_estimators=1000,
max_depth=5,
min_child_weight=1,
gamma=0,
subsample=0.8,
colsample_bytree=0.8,
objective= 'binary:logistic',
nthread=4,
scale_pos_weight=1,
seed=27)
modelfit(xgb1, train, predictors)
#作者调整后得到的树的值为140,如果这个值对于当前的系统而言太大了,可以调高学习率重新训练
先调这两个参数的原因是因为这两个参数对模型的影响做大
param_test1 = {
'max_depth':range(3,10,2),
'min_child_weight':range(1,6,2)
}
gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=5,
min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
param_grid = param_test1, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch1.fit(train[predictors],train[target])
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_
最优的 max_depth=5,min_child_weight=5
因为之前的步长是2,在最优参数的基础上,在上调下调各一步,看是否能得到更好的参数
param_test2 = {
'max_depth':[4,5,6],
'min_child_weight':[4,5,6]
}
gsearch2 = GridSearchCV(estimator = XGBClassifier( learning_rate=0.1, n_estimators=140, max_depth=5,
min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),
param_grid = param_test2, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch2.fit(train[predictors],train[target])
gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_
以上结果跑出来的最优参数为:max_depth=4,min_child_weight=6,另外从作者跑出来的cv结果看,再提升结果比较困难,可以进一步对min_child_weight试着调整看一下效果:
param_test2b = {
'min_child_weight':[6,8,10,12]
}
gsearch2b = GridSearchCV(estimator = XGBClassifier( learning_rate=0.1, n_estimators=140, max_depth=4,
min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),
param_grid = param_test2b, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch2b.fit(train[predictors],train[target])
modelfit(gsearch3.best_estimator_, train, predictors)
gsearch2b.grid_scores_, gsearch2b.best_params_, gsearch2b.best_score_
param_test3 = {
'gamma':[i/10.0 for i in range(0,5)]
}
gsearch3 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=4,
min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),
param_grid = param_test3, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch3.fit(train[predictors],train[target])
gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_
基于以上调好参数的前提下,可以看一下模型的特征的表现:
xgb2 = XGBClassifier(
learning_rate =0.1,
n_estimators=1000,
max_depth=4,
min_child_weight=6,
gamma=0,
subsample=0.8,
colsample_bytree=0.8,
objective= 'binary:logistic',
nthread=4,
scale_pos_weight=1,
seed=27)
modelfit(xgb2, train, predictors)
param_test4 = {
'subsample':[i/10.0 for i in range(6,10)],
'colsample_bytree':[i/10.0 for i in range(6,10)]
}
gsearch4 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),
param_grid = param_test4, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch4.fit(train[predictors],train[target])
gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_
#上一步发现最优值均为0.8,这一步做的事情是在附近以0.05的步长做调整
param_test5 = {
'subsample':[i/100.0 for i in range(75,90,5)],
'colsample_bytree':[i/100.0 for i in range(75,90,5)]
}
gsearch5 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),
param_grid = param_test5, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch5.fit(train[predictors],train[target])
这一步的作用是通过使用过regularization 来降低过拟合问题,大部分的人选择忽略这个参数,因为gamma 有提供类似的功能
param_test6 = {
'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100]
}
gsearch6 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
min_child_weight=6, gamma=0.1, subsample=0.8, colsample_bytree=0.8,
objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),
param_grid = param_test6, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch6.fit(train[predictors],train[target])
gsearch6.grid_scores_, gsearch6.best_params_, gsearch6.best_score_
这一步调参之后结果可能会变差,方法是在获得的最优的参数0.01附近进行微调,看能否获得更好的结果
param_test7 = {
'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05]
}
gsearch7 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
min_child_weight=6, gamma=0.1, subsample=0.8, colsample_bytree=0.8,
objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),
param_grid = param_test7, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch7.fit(train[predictors],train[target])
gsearch7.grid_scores_, gsearch7.best_params_, gsearch7.best_score_
然后基于获得的更好的值,我们再看一下模型的整体表现
xgb3 = XGBClassifier(
learning_rate =0.1,
n_estimators=1000,
max_depth=4,
min_child_weight=6,
gamma=0,
subsample=0.8,
colsample_bytree=0.8,
reg_alpha=0.005,
objective= 'binary:logistic',
nthread=4,
scale_pos_weight=1,
seed=27)
modelfit(xgb3, train, predictors)
最后一步就是 降低学习率并增加更多的树
xgb4 = XGBClassifier(
learning_rate =0.01,
n_estimators=5000,
max_depth=4,
min_child_weight=6,
gamma=0,
subsample=0.8,
colsample_bytree=0.8,
reg_alpha=0.005,
objective= 'binary:logistic',
nthread=4,
scale_pos_weight=1,
seed=27)
modelfit(xgb4, train, predictors)
最后作者分享了两条经验:
1.仅仅通过调参来提升模型的效果是很难的
2.想要提升模型的效果,还可以通过特征工程、模型融合以及stacking方法