1. Use xgboost's built-in cv to find the best value of n_estimators (the number of weak classifiers)
Otto product classification data
Import the required modules
# python 3.6
from xgboost import XGBClassifier  # scikit-learn style interface to XGBoost; XGBClassifier wraps the same xgboost core
import xgboost as xgb  # native XGBoost API
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss
from matplotlib import pyplot
import seaborn as sns
%matplotlib inline
# Load the data
dpath = './logistic/'
train = pd.read_csv(dpath + "Otto_train_test.csv")
train.head()
Variable Identification
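This dataset was chosen because its features are all of one kind, so little feature engineering is needed and we can concentrate on parameter tuning.
Target distribution: check whether the classes are balanced
sns.countplot(x = train['target'])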
pyplot.xlabel('target')
pyplot.ylabel('Number of occurrences')
The classes are not evenly distributed.
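To put numbers on the imbalance seen in the count plot, the class frequencies can also be tabulated directly. The sketch below is only an illustration: it uses a small made-up label list (not the Otto data) and the standard library. With the real data, `train['target'].value_counts(normalize=True)` should give the same kind of summary.

```python
from collections import Counter

# Hypothetical labels in the same Class_x format; not taken from the Otto data.
labels = ["Class_2"] * 5 + ["Class_6"] * 3 + ["Class_1"] * 2

def class_shares(labels):
    """Return {class: fraction of samples}, ordered by class name."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: counts[cls] / total for cls in sorted(counts)}

shares = class_shares(labels)
for cls, share in shares.items():
    print(f"{cls}: {share:.0%}")

# Ratio of the largest to the smallest class; 1.0 would mean perfectly balanced.
imbalance_ratio = max(shares.values()) / min(shares.values())
print("largest / smallest class:", imbalance_ratio)
```

A ratio well above 1 is the situation in which stratified folds matter for cross-validation.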
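Encode the target labels
# Convert the class strings to integers
y_train = train['target']  # values have the form Class_x
y_train = y_train.map(lambda s: s[6:])
y_train = y_train.map(lambda s: int(s) - 1)  # turn Class_x into an integer in 0-8
train = train.drop(["id" , "target"] , axis = 1)
X_train = np.array(train)
# Prepare cross-validation
# When the classes are imbalanced, cross-validation for a classification task should use StratifiedKFold,
# which samples every fold in proportion to the class frequencies.
# scikit-learn's cross-validation helpers already default to StratifiedKFold for classifiers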
kfold = StratifiedKFold(n_splits = 5 , shuffle = True , random_state = 3)
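What "sampling in proportion to the class frequencies" means can be checked on a toy example. The sketch below assumes made-up labels with a 60 / 30 / 10 split (not the Otto data) and reuses the same `StratifiedKFold` settings as above. It counts how many samples of each class land in every validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced labels: 60 / 30 / 10 samples in classes 0, 1, 2.
y = np.array([0] * 60 + [1] * 30 + [2] * 10)
X = np.zeros((len(y), 1))  # the features play no role in the split itself

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=3)

fold_counts = []
for train_idx, test_idx in kfold.split(X, y):
    # Count how many samples of each class land in this validation fold.
    fold_counts.append(np.bincount(y[test_idx], minlength=3).tolist())

for i, counts in enumerate(fold_counts):
    print(f"fold {i}: class counts in validation part = {counts}")
```

Every fold keeps the 6 : 3 : 1 class ratio of the full label vector, so a rare class is never missing from a validation fold.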
Default parameters: the learning rate here is 0.1, which is fairly large, so we first look at the rough range of the number of weak classifiers (with the default configuration, check whether the model overfits or underfits)
# Call xgboost's built-in cross-validation (cv) directly; it cross-validates every successive value of n_estimators in a single run,
# whereas GridSearchCV is too slow and can only cross-validate a limited number of parameter values
def modelfit(alg , X_train , y_train , cv_folds = None , early_stopping_rounds = 10):
    xgb_param = alg.get_xgb_params()
    xgb_param['num_class'] = 9  # this is a 9-class classification problem
    # Call xgboost directly rather than the sklearn wrapper class
    xgtrain = xgb.DMatrix(X_train , label = y_train)
    # The evaluation metric mlogloss is better when smaller
    cvresult = xgb.cv(xgb_param , xgtrain , num_boost_round = alg.get_params()['n_estimators'] , folds = cv_folds ,
                      metrics = 'mlogloss' , early_stopping_rounds = early_stopping_rounds)
    cvresult.to_csv('l_nestimators.csv' , index_label = 'n_estimators')
    # Best n_estimators: one row per boosting round that was kept
    n_estimators = cvresult.shape[0]
    print("n_estimators :")
    print(n_estimators)
    # Train the model with the best n_estimators found by cross-validation
    alg.set_params(n_estimators = n_estimators)
    alg.fit(X_train , y_train)
    # Predict training set:
    train_predprob = alg.predict_proba(X_train)
    logloss = log_loss(y_train , train_predprob)
    # Print model report:
    print("logloss of train :")
    print(logloss)
xgb1 = XGBClassifier(
    learning_rate = 0.1,
    n_estimators = 1000,  # early_stopping_rounds is set, so a large value is harmless; cv returns a suitable n_estimators automatically
    max_depth = 5,
    min_child_weight = 1,
    gamma = 0,
    subsample = 0.3,
    colsample_bytree = 0.8,
    colsample_bylevel = 0.7,
    objective = 'multi:softprob',  # multi-class problem; output class probabilities
    seed = 3)
modelfit(xgb1 , X_train , y_train , cv_folds = kfold)
Note: the result here comes from predicting on the training set with the best parameter found by cross-validation, so it does not represent performance on the real problem.
cvresult = pd.read_csv('l_nestimators.csv' , index_col = 0)
#plot
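test_means = cvresult['test-mlogloss-mean']  # mean test error
test_stds = cvresult['test-mlogloss-std']  # standard deviation of the test error
train_means = cvresult['train-mlogloss-mean']  # mean training error
train_stds = cvresult['train-mlogloss-std']  # standard deviation of the training error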
x_axis = range(0 , cvresult.shape[0])
pyplot.errorbar(x_axis , test_means , yerr = test_stds , label = 'Test')
pyplot.errorbar(x_axis , train_means , yerr = train_stds , label = 'Train')
pyplot.title("XGBoost n_estimators vs Log Loss")
pyplot.xlabel('n_estimators')
pyplot.ylabel('Log Loss')
pyplot.legend()
pyplot.savefig('n_estimators4_1.png')
pyplot.show()
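Note: in the figure above, the stretch of the curve where Log Loss lies between 2.00 and 1.75 shows a model that is still underfitting and needs to become more complex to perform better. From about n_estimators = 40 on the horizontal axis, the gap between training error and test error keeps widening, which means the model is overfitting: performance on the test set barely changes while performance on the training set keeps improving, so at this point the model should be made simpler.
# Re-plot the curve from round 20 onward
cvresult = pd.read_csv('l_nestimators.csv' , index_col = 0)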
cvresult = cvresult.iloc[20:]
#plot
test_means = cvresult['test-mlogloss-mean']
test_stds = cvresult['test-mlogloss-std']
train_means = cvresult['train-mlogloss-mean']
train_stds = cvresult['train-mlogloss-std']
x_axis = range(20 , cvresult.shape[0] + 20)
fig = pyplot.figure(figsize=(10 , 10) , dpi = 60)
pyplot.errorbar(x_axis , test_means , yerr = test_stds , label = 'Test')
pyplot.errorbar(x_axis , train_means , yerr = train_stds , label = 'Train')
pyplot.title("XGBoost n_estimators vs Log Loss")
pyplot.xlabel('n_estimators')
pyplot.ylabel('Log Loss')
pyplot.legend()
pyplot.savefig('n_estimators4_1.png')
pyplot.show()
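Note: in the figure above, the points along each curve are the mean error and the vertical bars are the standard deviation; for n weak learners, the two ends of a bar mark the mean error plus and minus one standard deviation across the folds.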
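How early stopping turns such a validation curve into a value for n_estimators can be sketched in a few lines. This is a simplified model of the behaviour, not xgboost's own code: the function name `rounds_kept_by_early_stopping` and the loss values are made up for illustration. The idea is that training stops once the mean validation loss has failed to improve for `early_stopping_rounds` rounds, and only the rounds up to the best one are kept, which is why the number of rows in `cvresult` is used as n_estimators.

```python
def rounds_kept_by_early_stopping(test_losses, patience):
    """Return how many boosting rounds survive early stopping.

    test_losses[i] is the mean validation loss after i + 1 trees.
    Training stops once `patience` rounds in a row fail to beat the best
    loss seen so far, and only the rounds up to that best one are kept.
    """
    best_loss = float("inf")
    best_round = -1
    for i, loss in enumerate(test_losses):
        if loss < best_loss:
            best_loss, best_round = loss, i
        elif i - best_round >= patience:
            break  # no improvement for `patience` rounds in a row
    return best_round + 1

# Made-up validation curve: improves, flattens, then drifts upward.
curve = [2.0, 1.6, 1.3, 1.1, 1.0, 0.95, 0.96, 0.97, 0.99, 1.02, 1.05]
n_estimators = rounds_kept_by_early_stopping(curve, patience=3)
print("n_estimators :", n_estimators)
```

With this curve the best loss is reached at the sixth tree, so six rounds are kept even though more were trained before stopping.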
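2. Tune the tree parameters: max_depth (maximum tree depth) & min_child_weight (minimum sum of instance weights required in a leaf node)
(2.1: coarse tuning with a parameter step of 2; 2.2: fine tuning around the best values with a step of 1 or smaller)
# Suggested max_depth: 3-10; min_child_weight = 1/sqrt(ratio_rare_event) = 5.5
max_depth = range(3 , 10 , 1)  # values 3 through 9, step 1
min_child_weight = range(1 , 6 , 2)  # values 1, 3, 5, step 2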
param_test2_1 = dict(max_depth = max_depth , min_child_weight = min_child_weight)
param_test2_1
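Before running the search it helps to see how large this grid is. The sketch below expands the two ranges with the standard library only; the fold count of 5 is taken from the `StratifiedKFold` defined earlier, and the variable names are illustrative.

```python
from itertools import product

max_depth = range(3, 10, 1)        # 3, 4, ..., 9
min_child_weight = range(1, 6, 2)  # 1, 3, 5
n_folds = 5                        # n_splits of the StratifiedKFold used here

# Every combination of the two parameters is one candidate model.
candidates = [
    {"max_depth": d, "min_child_weight": w}
    for d, w in product(max_depth, min_child_weight)
]

print("candidates:", len(candidates))
print("cross-validation fits:", len(candidates) * n_folds)
print("first:", candidates[0], "last:", candidates[-1])
```

That is 21 candidates and 105 cross-validation fits, plus one final refit on the full training set because `GridSearchCV` refits the best candidate by default. This is why grid search is kept to a few values per parameter.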
xgb2_1 = XGBClassifier(
    learning_rate = 0.1,
    n_estimators = 152,  # best value found above
    max_depth = 5,
    min_child_weight = 1,
    gamma = 0,
    subsample = 0.3,
    colsample_bytree = 0.8,
    colsample_bylevel = 0.7,
    objective = 'multi:softprob',  # multi-class problem; output class probabilities
    seed = 3)
# GridSearchCV arguments: (estimator, parameter grid, scoring metric, number of CPU cores (-1 runs in parallel on all cores), cross-validation splitter, whether to also record training scores)
gsearch2_1 = GridSearchCV(xgb2_1 , param_grid = param_test2_1 , scoring = 'neg_log_loss' , n_jobs = -1 , cv = kfold , return_train_score = True)
gsearch2_1.fit(X_train , y_train)
gsearch2_1.best_params_ , gsearch2_1.best_score_
gsearch2_1.cv_results_
{'mean_fit_time': array([1.45454082, 1.97168965, 1.9824079 , 2.2223093 , 2.07778544,
1.78918533, 2.42190661, 2.36306362, 2.0501524 , 2.49945779,
2.14302115, 1.77447672, 2.92233768, 2.49135246, 1.96860585,
2.6433012 , 2.1420651 , 1.88084736, 3.19571919, 2.14432192,
1.70169439]),
'std_fit_time': array([0.24886047, 0.24884278, 0.0613107 , 0.1016457 , 0.09550288,
0.05508209, 0.02851551, 0.23187988, 0.18956871, 0.06988497,
0.08399057, 0.08938734, 0.13197995, 0.13758643, 0.06937447,
0.06387703, 0.03662388, 0.05743063, 0.13994387, 0.02977088,
0.10212321]),
'mean_score_time': array([0.01265473, 0.01481042, 0.0132092 , 0.02611961, 0.02081451,
0.01380944, 0.01901302, 0.02251544, 0.01881318, 0.02141471,
0.01691146, 0.0155108 , 0.02621841, 0.01551075, 0.0211143 ,
0.01811252, 0.01471033, 0.02031398, 0.03312268, 0.01941338,
0.01303215]),
'std_score_time': array([0.00645511, 0.0031419 , 0.00172168, 0.01261709, 0.00424094,
0.0010775 , 0.00195032, 0.01254631, 0.00741078, 0.00274748,
0.00174502, 0.00216924, 0.00658116, 0.00070793, 0.00833838,
0.00177336, 0.00067866, 0.00797758, 0.00906928, 0.01108613,
0.00142161]),
'param_max_depth': masked_array(data=[3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8,
9, 9, 9],
mask=[False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False],
fill_value='?',
dtype=object),
'param_min_child_weight': masked_array(data=[1, 3, 5, 1, 3, 5, 1, 3, 5, 1, 3, 5, 1, 3, 5, 1, 3, 5,
1, 3, 5],
mask=[False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False],
fill_value='?',
dtype=object),
'params': [{'max_depth': 3, 'min_child_weight': 1},
{'max_depth': 3, 'min_child_weight': 3},
{'max_depth': 3, 'min_child_weight': 5},
{'max_depth': 4, 'min_child_weight': 1},
{'max_depth': 4, 'min_child_weight': 3},
{'max_depth': 4, 'min_child_weight': 5},
{'max_depth': 5, 'min_child_weight': 1},
{'max_depth': 5, 'min_child_weight': 3},
{'max_depth': 5, 'min_child_weight': 5},
{'max_depth': 6, 'min_child_weight': 1},
{'max_depth': 6, 'min_child_weight': 3},
{'max_depth': 6, 'min_child_weight': 5},
{'max_depth': 7, 'min_child_weight': 1},
{'max_depth': 7, 'min_child_weight': 3},
{'max_depth': 7, 'min_child_weight': 5},
{'max_depth': 8, 'min_child_weight': 1},
{'max_depth': 8, 'min_child_weight': 3},
{'max_depth': 8, 'min_child_weight': 5},
{'max_depth': 9, 'min_child_weight': 1},
{'max_depth': 9, 'min_child_weight': 3},
{'max_depth': 9, 'min_child_weight': 5}],
'split0_test_score': array([-1.08718937, -1.09579943, -1.13354104, -1.08336137, -1.12539686,
-1.14638456, -1.09459129, -1.09724201, -1.13811383, -1.08060457,
-1.14676753, -1.17301695, -1.1145029 , -1.09377679, -1.17301695,
-1.0965587 , -1.12111028, -1.17301695, -1.12523253, -1.12111028,
-1.17301695]),
'split1_test_score': array([-0.90847575, -0.93228667, -0.98981291, -0.88871794, -0.9324894 ,
-0.97138889, -0.92866746, -0.93313133, -1.01312489, -0.8821443 ,
-0.92579839, -0.98544821, -0.88852343, -0.92211337, -0.98544821,
-0.92303341, -0.90950118, -0.98544821, -0.89260702, -0.90950118,
-0.98544821]),
'split2_test_score': array([-0.89419077, -0.94683318, -1.01680019, -0.90744574, -0.97871248,
-1.00126056, -0.93873131, -0.96280162, -1.0109172 , -0.86081739,
-0.96997059, -0.99547674, -0.91518516, -0.96057845, -0.99547674,
-0.9371314 , -0.97328839, -0.99547674, -0.93638561, -0.97328839,
-0.99547674]),
'split3_test_score': array([-0.92546119, -0.96272871, -1.00218196, -0.93792652, -0.9368132 ,
-0.99449181, -0.91112218, -0.93719629, -0.98389445, -0.94349375,
-0.96993939, -0.96746802, -0.89214913, -0.92681013, -0.94762662,
-0.8959783 , -0.93057469, -0.94762662, -0.93227858, -0.9367468 ,
-0.94762662]),
'split4_test_score': array([-0.80183262, -0.85884744, -0.96049093, -0.84301533, -0.9113355 ,
-0.97472223, -0.8592988 , -0.89658449, -0.94263649, -0.831899 ,
-0.87642984, -0.94945823, -0.83155818, -0.9191476 , -0.94945823,
-0.83659937, -0.91063024, -0.94945823, -0.85089715, -0.91063024,
-0.94945823]),
'mean_test_score': array([-0.92422275, -0.95993659, -1.02101526, -0.93271723, -0.9774968 ,
-1.01807338, -0.9471281 , -0.96591762, -1.01828851, -0.92044767,
-0.97847828, -1.01477048, -0.92913416, -0.96492305, -1.01080786,
-0.93858077, -0.96953656, -1.01080786, -0.94818459, -0.97076922,
-1.01080786]),
'std_test_score': array([0.09251479, 0.07707461, 0.05963455, 0.0818743 , 0.07766002,
0.06565731, 0.07915205, 0.06944591, 0.06548008, 0.0886047 ,
0.09151114, 0.08128094, 0.09735839, 0.06662051, 0.0839309 ,
0.08668567, 0.07979949, 0.0839309 , 0.09440646, 0.0792338 ,
0.0839309 ]),
'rank_test_score': array([ 2, 8, 21, 4, 13, 19, 6, 10, 20, 1, 14, 18, 3, 9, 15, 5, 11,
15, 7, 12, 15]),
'split0_train_score': array([-0.23506696, -0.42238518, -0.6134396 , -0.18879343, -0.40748233,
-0.62428014, -0.17596322, -0.40007491, -0.61200072, -0.16981949,
-0.4095462 , -0.61907156, -0.16201702, -0.40087629, -0.61907156,
-0.16691824, -0.40227751, -0.61907156, -0.165709 , -0.40227751,
-0.61907156]),
'split1_train_score': array([-0.25821533, -0.43929102, -0.62994285, -0.20182639, -0.42128714,
-0.63887457, -0.18338174, -0.4187483 , -0.64548383, -0.18261545,
-0.42154358, -0.62833961, -0.1719561 , -0.41369627, -0.62833961,
-0.17269233, -0.41680083, -0.62833961, -0.17563985, -0.41680083,
-0.62833961]),
'split2_train_score': array([-0.25948842, -0.44902698, -0.64283288, -0.20920136, -0.43339103,
-0.63528185, -0.18549172, -0.42097063, -0.63765519, -0.17888064,
-0.41736498, -0.64216106, -0.17466933, -0.421817 , -0.64216106,
-0.17045003, -0.42071521, -0.64216106, -0.1753699 , -0.42071521,
-0.64216106]),
'split3_train_score': array([-0.25218286, -0.43842555, -0.64139283, -0.19695715, -0.42982769,
-0.63135471, -0.18244458, -0.40960157, -0.63857148, -0.17498448,
-0.42012243, -0.6361547 , -0.17098162, -0.41027713, -0.63039844,
-0.17674043, -0.41380281, -0.63039844, -0.17157678, -0.41820101,
-0.63039844]),
'split4_train_score': array([-0.26175732, -0.43941897, -0.64548966, -0.20640497, -0.42951335,
-0.64860559, -0.18507553, -0.4265518 , -0.64860942, -0.18137506,
-0.41946124, -0.64231342, -0.17556121, -0.42167697, -0.64231342,
-0.17274134, -0.41864994, -0.64231342, -0.1752531 , -0.41864994,
-0.64231342]),
'mean_train_score': array([-0.25334218, -0.43770954, -0.63461956, -0.20063666, -0.42430031,
-0.63567937, -0.18247136, -0.41518944, -0.63646413, -0.17753502,
-0.41760769, -0.63360807, -0.17103705, -0.41366873, -0.63245682,
-0.17190847, -0.41444926, -0.63245682, -0.17270972, -0.4153289 ,
-0.63245682]),
'std_train_score': array([0.00967126, 0.00858902, 0.01184869, 0.00723663, 0.00929831,
0.00807588, 0.00343772, 0.00932592, 0.01290876, 0.00465622,
0.00424964, 0.00888244, 0.00481394, 0.00781891, 0.00885074,
0.00321513, 0.00649626, 0.00885074, 0.00380591, 0.00664524,
0.00885074])}
# Train and predict with the best max_depth and min_child_weight found by cross-validation
xgb2 = XGBClassifier(
    learning_rate = 0.1,
    n_estimators = 152,  # best value from the first cross-validation
    max_depth = 6,  # best value from the second cross-validation
    min_child_weight = 1,  # best value from the second cross-validation
    gamma = 0,
    subsample = 0.3,
    colsample_bytree = 0.8,
    colsample_bylevel = 0.7,
    objective = 'multi:softprob',  # multi-class problem; output class probabilities
    seed = 3)
xgb2.fit(X_train , y_train)
#Predict training set:
train_predprob = xgb2.predict_proba(X_train)
logloss = log_loss(y_train , train_predprob)
#Print model report:
print("logloss of train :")
print (logloss)
Note: the result is slightly better than the 0.17892266070462243 obtained the first time.
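The choice of max_depth = 6 and min_child_weight = 1 in this step can be read off the cv_results_ output printed earlier. The sketch below copies the `mean_test_score` and `mean_train_score` arrays from that output, rounded to four decimals, and picks the candidate with the highest score. Because the scoring is `neg_log_loss`, scores are negated log losses, so higher is better. The list and variable names are illustrative.

```python
# Candidates in the same order as the 'params' entry of cv_results_.
max_depths = [3, 4, 5, 6, 7, 8, 9]
min_child_weights = [1, 3, 5]
params = [(d, w) for d in max_depths for w in min_child_weights]

# Values copied from the cv_results_ output, rounded to 4 decimals.
mean_test_score = [
    -0.9242, -0.9599, -1.0210, -0.9327, -0.9775, -1.0181, -0.9471,
    -0.9659, -1.0183, -0.9204, -0.9785, -1.0148, -0.9291, -0.9649,
    -1.0108, -0.9386, -0.9695, -1.0108, -0.9482, -0.9708, -1.0108,
]
mean_train_score = [
    -0.2533, -0.4377, -0.6346, -0.2006, -0.4243, -0.6357, -0.1825,
    -0.4152, -0.6365, -0.1775, -0.4176, -0.6336, -0.1710, -0.4137,
    -0.6325, -0.1719, -0.4144, -0.6325, -0.1727, -0.4153, -0.6325,
]

# neg_log_loss: the largest (least negative) score is the best candidate.
best = max(range(len(params)), key=lambda i: mean_test_score[i])
best_depth, best_weight = params[best]

print("best params:", {"max_depth": best_depth, "min_child_weight": best_weight})
print("validation log loss:", -mean_test_score[best])
print("training log loss:", -mean_train_score[best])
```

The selected candidate has a validation log loss of about 0.92 against a training log loss of about 0.18. This is the same train/test gap that the earlier note on the n_estimators curve describes as overfitting.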
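4. Tune the tree parameters: subsample (fraction of training rows sampled) and colsample_bytree (fraction of features used to build each tree)
The tuning procedure mirrors step 2. The coarse search here uses a step of 0.1; a follow-up search can reduce the step to 0.05 for fine tuning.
subsample = [i/10.0 for i in range(3 , 9)]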
colsample_bytree = [i/10.0 for i in range(6 , 10)]
param_test4_1 = dict(subsample = subsample , colsample_bytree = colsample_bytree)
param_test4_1
xgb4_1 = XGBClassifier(
    learning_rate = 0.1,
    n_estimators = 152,  # best value found above
    max_depth = 6,
    min_child_weight = 1,
    gamma = 0,
    subsample = 0.3,
    colsample_bytree = 0.8,
    colsample_bylevel = 0.7,
    objective = 'multi:softprob',  # multi-class problem; output class probabilities
    seed = 3)
# GridSearchCV arguments: (estimator, parameter grid, scoring metric, number of CPU cores (-1 runs in parallel on all cores), cross-validation splitter)
gsearch4_1 = GridSearchCV(xgb4_1 , param_grid = param_test4_1 , scoring = 'neg_log_loss' , n_jobs = -1 , cv = kfold)
gsearch4_1.fit(X_train , y_train)
gsearch4_1.best_params_ , gsearch4_1.best_score_
Note: the best values found by the cross-validation above are the ones already in use, so these two parameters need no further adjustment.
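If a finer pass with the 0.05 step mentioned at the start of this step were wanted, the grid around the current values could be built as follows. This is a sketch under assumptions: the helper `fine_grid`, the name `param_test4_2`, and the clipping bounds are hypothetical and not part of the original notebook.

```python
def fine_grid(best, step=0.05, low=0.05, high=1.0):
    """Return [best - step, best, best + step], clipped to [low, high]."""
    # round() removes floating-point noise such as 0.8500000000000001
    values = [round(best + k * step, 2) for k in (-1, 0, 1)]
    return [v for v in values if low <= v <= high]

# 0.3 and 0.8 are the values kept after the coarse search in this step.
param_test4_2 = dict(
    subsample=fine_grid(0.3),
    colsample_bytree=fine_grid(0.8),
)
print(param_test4_2)
```

The resulting dictionary can be passed as `param_grid` to `GridSearchCV` in the same way as `param_test4_1`.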