Parameter | Meaning | Needs tuning? |
---|---|---|
booster [default gbtree] | type of booster: gbtree or gblinear | No |
silent [default 0] | set to 1 to suppress log output | No |
nthread [default: max available threads] | number of parallel threads | No |
eta [default 0.3] | learning rate | Yes, 0.01-0.2 |
min_child_weight [default 1] | minimum sum of instance weights in a leaf node; for squared-error regression this amounts to the minimum number of samples a node must contain, below which it stops splitting | Yes |
max_depth [default 6] | maximum tree depth | Yes, 3-10 |
max_leaf_nodes | maximum number of leaves per tree | |
gamma [default 0] | minimum loss reduction required to split a node | Yes |
max_delta_step [default 0] | maximum step size for each tree's weight update; rarely needed | No |
subsample [default 1] | fraction of training instances sampled for each tree | Yes, 0.5-1 |
colsample_bytree [default 1] | fraction of columns sampled for each tree | Yes, 0.5-1 |
colsample_bylevel [default 1] | fraction of columns sampled at each depth level within a tree | |
lambda [default 1] | weight of the L2 regularization term | |
alpha | weight of the L1 regularization term | |
scale_pos_weight | balances positive/negative classes on imbalanced data; speeds up convergence | |
objective [default reg:linear] | loss to optimize: reg:linear / binary:logistic / multi:softmax / multi:softprob | |
eval_metric | evaluation metric; defaults to rmse for regression, error for classification | |
seed | random seed, for reproducibility | No |
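For reference, the same parameters can be passed to the native xgb.train interface as a plain dict. A minimal sketch; the values are illustrative starting points, not tuned results:

```python
import xgboost as xgb

# Plain-dict form of the table above, for the native xgb.train API.
params = {
    'booster': 'gbtree',
    'eta': 0.1,                  # learning rate
    'max_depth': 6,
    'min_child_weight': 1,
    'gamma': 0,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'lambda': 1,                 # L2 regularization
    'alpha': 0,                  # L1 regularization
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'seed': 27,
}
# dtrain = xgb.DMatrix(X, label=y)  # X, y: your own feature matrix / labels
# bst = xgb.train(params, dtrain, num_boost_round=100)
```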
The tuning procedure is as follows.
First, define a helper function to make the repeated cross-validation below more convenient.
```python
import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV  # grid search
import matplotlib.pylab as plt
%matplotlib inline

def modelfit(alg, dtrain, predictors, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    # dtrain is a pandas DataFrame; target, dtest and test_results are
    # assumed to be defined in the surrounding scope.
    if useTrainCV:
        # Use xgb.cv with early stopping to pick the number of trees.
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'],
                          nfold=cv_folds, metrics='auc',
                          early_stopping_rounds=early_stopping_rounds, verbose_eval=False)
        alg.set_params(n_estimators=cvresult.shape[0])

    # Fit the algorithm on the data
    alg.fit(dtrain[predictors], dtrain['Disbursed'], eval_metric='auc')

    # Predict training set:
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:, 1]

    # Print model report:
    print("\nModel Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(dtrain['Disbursed'].values, dtrain_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(dtrain['Disbursed'], dtrain_predprob))

    # Predict on testing data:
    dtest['predprob'] = alg.predict_proba(dtest[predictors])[:, 1]
    results = test_results.merge(dtest[['ID', 'predprob']], on='ID')
    print('AUC Score (Test): %f' % metrics.roc_auc_score(results['Disbursed'], results['predprob']))

    # Plot feature importances:
    feat_imp = pd.Series(alg.get_booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')
```
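The function above assumes that train, dtest, predictors, target, and test_results already exist. A minimal, purely illustrative setup; the file and column names are placeholders, not prescribed by xgboost:

```python
# Hypothetical data setup for the variables modelfit() expects.
train = pd.read_csv('train_modified.csv')        # placeholder file name
dtest = pd.read_csv('test_modified.csv')         # placeholder file name
test_results = pd.read_csv('test_results.csv')   # true labels for the test IDs

target = 'Disbursed'                             # binary label column
IDcol = 'ID'
predictors = [x for x in train.columns if x not in [target, IDcol]]
```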
Step 1: fix the learning rate and determine the number of trees
Give the other parameters reasonable initial values.
With the learning rate fixed at 0.1, find the ideal number of trees.
```python
xgb1 = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,       # upper bound; xgb.cv picks the actual count
    max_depth=5,             # typical range 3-10
    min_child_weight=1,
    gamma=0,                 # minimum loss reduction required to split
    subsample=0.8,           # typical values 0.5-0.9
    colsample_bytree=0.8,    # typical values 0.5-0.9
    objective='binary:logistic',
    nthread=4,
    scale_pos_weight=1,
    seed=27)
modelfit(xgb1, train, predictors)
```
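After modelfit runs with useTrainCV=True, n_estimators on the estimator has already been overwritten with the round count chosen by early stopping, so it can simply be read back:

```python
# The tree count selected by xgb.cv; this is the value (e.g. 140)
# plugged into the grid searches below.
print(xgb1.get_params()['n_estimators'])
```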
Step 2: tune max_depth and min_child_weight
Search coarsely first, then refine.
```python
param_test1 = {
    'max_depth': range(3, 10, 2),
    'min_child_weight': range(1, 6, 2)
}
gsearch1 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=5,
                            min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test1, scoring='roc_auc', n_jobs=4, cv=5)
gsearch1.fit(train[predictors], train[target])
gsearch1.cv_results_, gsearch1.best_params_, gsearch1.best_score_
```
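cv_results_ is a fairly wide dict; to scan the candidates by score, a small summary like this helps (illustrative only):

```python
# Mean and spread of the test AUC for each parameter combination.
pd.DataFrame(gsearch1.cv_results_)[['params', 'mean_test_score', 'std_test_score']] \
  .sort_values('mean_test_score', ascending=False)
```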
Once rough optima for the two values are found, refine in a narrower window around them.
```python
param_test2 = {
    'max_depth': [4, 5, 6],
    'min_child_weight': [4, 5, 6]
}
gsearch2 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=5,
                            min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test2, scoring='roc_auc', n_jobs=4, cv=5)
gsearch2.fit(train[predictors], train[target])
gsearch2.cv_results_, gsearch2.best_params_, gsearch2.best_score_
```
Step 3: tune gamma
```python
param_test3 = {
    'gamma': [i / 10.0 for i in range(0, 5)]
}
gsearch3 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=4,
                            min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test3, scoring='roc_auc', n_jobs=4, cv=5)
gsearch3.fit(train[predictors], train[target])
gsearch3.cv_results_, gsearch3.best_params_, gsearch3.best_score_
```
Step 4: tune subsample and colsample_bytree
```python
param_test4 = {
    'subsample': [i / 10.0 for i in range(6, 10)],
    'colsample_bytree': [i / 10.0 for i in range(6, 10)]
}
gsearch4 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=177, max_depth=3,
                            min_child_weight=4, gamma=0.1, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test4, scoring='roc_auc', n_jobs=4, cv=5)
gsearch4.fit(train[predictors], train[target])
gsearch4.cv_results_, gsearch4.best_params_, gsearch4.best_score_
```
Step 5: tune the regularization parameters
These reduce overfitting, playing a role similar to gamma.
```python
param_test6 = {
    'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100]
}
gsearch6 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=177, max_depth=4,
                            min_child_weight=6, gamma=0.1, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test6, scoring='roc_auc', n_jobs=4, cv=5)
gsearch6.fit(train[predictors], train[target])
gsearch6.cv_results_, gsearch6.best_params_, gsearch6.best_score_
```
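As with max_depth earlier, the coarse winner can then be refined on a tighter grid around it. A sketch; the candidate values here are hypothetical:

```python
# Refine reg_alpha around the coarse optimum (illustrative range).
param_test7 = {
    'reg_alpha': [0, 0.001, 0.005, 0.01, 0.05]
}
gsearch7 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=177, max_depth=4,
                            min_child_weight=6, gamma=0.1, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test7, scoring='roc_auc', n_jobs=4, cv=5)
gsearch7.fit(train[predictors], train[target])
gsearch7.best_params_, gsearch7.best_score_
```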
Step 6: lower the learning rate
```python
xgb4 = XGBClassifier(
    learning_rate=0.01,      # 10x lower than before
    n_estimators=5000,       # correspondingly more rounds; xgb.cv trims the excess
    max_depth=4,
    min_child_weight=6,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.005,
    objective='binary:logistic',
    nthread=4,
    scale_pos_weight=1,
    seed=27)
modelfit(xgb4, train, predictors)
```
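As an alternative to re-running xgb.cv at the low learning rate, a held-out validation set with early stopping achieves the same effect through the sklearn wrapper. A sketch assuming xgboost >= 1.6, where eval_metric and early_stopping_rounds moved into the constructor; older versions pass them to fit() instead:

```python
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(
    train[predictors], train[target], test_size=0.2, random_state=27)

# early_stopping_rounds / eval_metric in the constructor: xgboost >= 1.6 API.
xgb5 = XGBClassifier(learning_rate=0.01, n_estimators=5000, max_depth=4,
                     min_child_weight=6, subsample=0.8, colsample_bytree=0.8,
                     reg_alpha=0.005, objective='binary:logistic',
                     eval_metric='auc', early_stopping_rounds=50, random_state=27)
xgb5.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print(xgb5.best_iteration)   # round at which the validation AUC peaked
```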
The xgb.cv function
```python
def cv(params, dtrain, num_boost_round=10, nfold=3, stratified=False, folds=None,
       metrics=(), obj=None, feval=None, maximize=False, early_stopping_rounds=None,
       fpreproc=None, as_pandas=True, verbose_eval=None, show_stdv=True,
       seed=0, callbacks=None, shuffle=True)
```
- params (the xgb_param above) can be obtained with xgb.XGBClassifier().get_xgb_params().
- dtrain is built with xgb.DMatrix(x_train, y_train).
- num_boost_round is the maximum number of boosting rounds.
- early_stopping_rounds stops training once the validation metric has not improved for that many rounds (e.g. 50) and reports the best round.
- verbose_eval=10 prints the evaluation metric every 10 rounds.
- show_stdv=False suppresses printing of the cross-validation standard deviation.
- nfold is the number of folds.
- folds can instead accept a KFold or StratifiedKFold object.
- metrics is a string or list naming the evaluation metric; 'auc' is the usual choice here.
- Finally, xgb.cv returns a DataFrame with one row per boosting round.
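Putting this together, a minimal call under the same assumptions as before (train, predictors, target as defined earlier):

```python
xgb_param = XGBClassifier(learning_rate=0.1, n_estimators=1000).get_xgb_params()
xgtrain = xgb.DMatrix(train[predictors].values, label=train[target].values)

cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=1000, nfold=5,
                  metrics='auc', early_stopping_rounds=50,
                  verbose_eval=10, seed=27)

# One row per boosting round kept; the row count is the best round number.
print(cvresult.shape[0])
print(cvresult[['train-auc-mean', 'test-auc-mean']].tail())
```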