Preface: it has been a very, very long time since my last blog post. I feel a bit guilty and a bit sad. What have I been doing with all that time instead of blogging? Sigh!
XGBoost has two kinds of interfaces:
the native API: xgboost.train and xgboost.cv
the sklearn-style API: xgboost.XGBClassifier and xgboost.XGBRegressor
The two interfaces differ slightly. For example, the learning-rate parameter is eta in the native API and learning_rate in the sklearn API; the native API passes num_round (the number of base learners, i.e. boosting rounds) to the train or cv function, whereas the sklearn API sets n_estimators when the model is defined. The sklearn API keeps the same form as the models in sklearn, which makes it easy for sklearn users to pick up. A minimal sketch of the same model written against both interfaces is shown below.
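As a rough, minimal sketch of the mapping just described (the data and hyper-parameter values here are made up purely for illustration), the same model could be expressed through both interfaces like this:
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=10)

# Native API: hyper-parameters go into a dict, the number of rounds into train()
params_native = {'eta': 0.05, 'max_depth': 5, 'objective': 'reg:squarederror'}
bst = xgb.train(params_native, xgb.DMatrix(X, y), num_boost_round=50)

# sklearn API: everything, including n_estimators, is set on the estimator itself
reg = xgb.XGBRegressor(learning_rate=0.05, max_depth=5, n_estimators=50,
                       objective='reg:squarederror')
reg.fit(X, y)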
To cross-validate an XGBoost model, you can use the native API's cross-validation function xgboost.cv; with the sklearn API, you can use the three functions cross_val_score, cross_validate, and validation_curve from sklearn.model_selection.
The differences between the three functions in sklearn.model_selection:
cross_val_score is the simplest: it returns the validation scores of a model with given parameters, but cannot return training scores.
cross_validate is a bit richer: it returns the training scores, validation scores, fit times, score times, and so on for given parameters, and can even evaluate several metrics at once (a small sketch follows this list).
validation_curve returns the training and validation scores over a series of candidate values of one specified parameter; you can judge the fit from them to tune that parameter, or use them to plot a validation curve.
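On the multiple-metric point above, here is a small sketch (the data and estimator are placeholders chosen only for illustration): cross_validate accepts a list or dict for scoring, while cross_val_score only takes a single metric.
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=1000, n_features=10)
reg = xgb.XGBRegressor(n_estimators=50, learning_rate=0.05)
res = cross_validate(reg, X, y, cv=5, return_train_score=True,
                     scoring=['neg_mean_absolute_error', 'r2'])
# res then has keys such as 'test_neg_mean_absolute_error', 'train_r2',
# 'fit_time' and 'score_time'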
Below, the usage and output of these four functions are shown on a classification task and a regression task. Comparing the results, with identical parameters the four functions produce consistent output. I did notice one problem: the outputs of validation_curve and xgboost.cv are largely the same, yet the former takes several times longer than the latter. (I have not found the reason yet, and could not find the same question online; I plan to ask on Stack Overflow and will come back and add the answer if I get one.)
Update 2020-04-02: my preliminary suspicion is warm starting. When cross-validating with xgboost.cv, the model can be trained with warm starts, so only $N$ trees need to be trained in total; but when an XGBRegressor is passed to validation_curve for cross-validation, the XGBRegressor cannot be warm-started (whereas sklearn's GBDT and random forest both support warm starts), so $1+2+\dots+N = \frac{N(N+1)}{2}$ trees have to be trained, which is naturally much slower. A sketch of the warm-start idea follows.
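To make the warm-start idea concrete, here is a minimal sketch using sklearn's GradientBoostingRegressor, which does expose a warm_start flag (the data and numbers are illustrative only; this is not what validation_curve does internally, it just shows the difference in training work):
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1000, n_features=10)
gbr = GradientBoostingRegressor(n_estimators=1, warm_start=True)
for n in range(1, 51):
    gbr.set_params(n_estimators=n)
    gbr.fit(X, y)   # with warm_start only the new trees are fitted: 50 trees in total
# Without warm starts, each fit would rebuild all n trees from scratch,
# i.e. 1 + 2 + ... + 50 = 1275 trees, which is roughly what validation_curve pays for.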
P.S. The code below was written in a Jupyter Notebook; I was too lazy to merge the cells.
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold
from sklearn.model_selection import validation_curve
# Regression problem
X, y = make_regression(n_samples=10000, n_features=10)
# sklearn API
n_estimators = 50
params = {'n_estimators':n_estimators, 'booster':'gbtree', 'max_depth':5, 'learning_rate':0.05,
'objective':'reg:squarederror', 'subsample':1, 'colsample_bytree':1}
clf = xgb.XGBRegressor(**params)
cv = KFold(n_splits=5, shuffle=True, random_state=100)
print('test_score:', cross_val_score(clf, X, y, cv=cv, scoring='neg_mean_absolute_error'))
test_score: array([-57.48422753, -59.69255262, -58.91771172, -58.44347715,
-59.8880623 ])
cross_validate(clf, X, y, cv=cv, scoring='neg_mean_absolute_error',
return_train_score=True)
{'fit_time': array([0.37278223, 0.36898613, 0.36637878, 0.36504936, 0.37162185]),
'score_time': array([0.00398517, 0.00403547, 0.00398993, 0.00398636, 0.00404048]),
'test_score': array([-57.48422753, -59.69255262, -58.91771172, -58.44347715,
-59.8880623 ]),
'train_score': array([-50.70099151, -50.43187094, -50.75229625, -50.66844022,
-50.82982251])}
%%time  # measure this cell's execution time
estimator_range = range(1, n_estimators+1)
train_score, test_score = validation_curve(
clf, X, y, param_name='n_estimators', param_range=estimator_range,
cv=cv, scoring='neg_mean_absolute_error'
)
print('train_score:',train_score[-1])
print('test_score:', test_score[-1])
train_score: [-50.70099151 -50.43187094 -50.75229625 -50.66844022 -50.82982251]
test_score: [-57.48422753 -59.69255262 -58.91771172 -58.44347715 -59.8880623 ]
Wall time: 57 s
print('train_mae_mean:\n', np.abs(train_score).mean(axis=1))
print('test_mae_mean:\n', np.abs(test_score).mean(axis=1))
train_mae_mean:
array([127.5682212 , 124.17645861, 120.96190697, 117.93807824,
115.06161926, 112.28746068, 109.60911311, 107.07263957,
104.63554663, 102.2788341 , 99.97895509, 97.82509892,
95.73958223, 93.71896245, 91.79974093, 89.94817809,
88.13495265, 86.37829884, 84.67090725, 83.05548799,
81.46903821, 79.94032864, 78.44072613, 77.00488358,
75.62191062, 74.24138916, 72.92121361, 71.66007955,
70.41908351, 69.23718699, 68.06376522, 66.92736292,
65.81928473, 64.73408044, 63.67274508, 62.65390845,
61.66069004, 60.69802867, 59.72997524, 58.7870726 ,
57.88241178, 57.01307807, 56.14014094, 55.30247271,
54.48416963, 53.69873843, 52.91791742, 52.16280788,
51.42670887, 50.67668428]),
test_mae_mean:
array([127.83738044, 124.65000719, 121.72020148, 118.90983369,
116.24294452, 113.69376675, 111.29388701, 108.94996321,
106.7553701 , 104.62401193, 102.5608943 , 100.68648486,
98.76550219, 96.97546939, 95.20893969, 93.55259092,
91.9299438 , 90.3413075 , 88.76142948, 87.34316226,
85.96043718, 84.62054143, 83.30115705, 82.07698107,
80.89857637, 79.67939585, 78.52190061, 77.37787457,
76.28248431, 75.24121599, 74.2093299 , 73.21873113,
72.19303325, 71.23265487, 70.33854865, 69.42902278,
68.57191177, 67.73459769, 66.88130101, 66.05978781,
65.26603807, 64.46357751, 63.70019472, 62.95398889,
62.25243534, 61.56164243, 60.88819753, 60.20476192,
59.55280602, 58.88520627])
%%time
params_xgb = params.copy()  # adapt the parameter dict for the native API
num_round = params_xgb['n_estimators']       # number of boosting rounds
params_xgb['eta'] = params['learning_rate']  # learning_rate -> eta
del params_xgb['n_estimators']
del params_xgb['learning_rate']
# Cross-validation with the native xgboost API
res = xgb.cv(params_xgb, xgb.DMatrix(X, y), num_round, folds=cv, metrics='mae')
print(res)
train-mae-mean train-mae-std test-mae-mean test-mae-std
0 127.568312 0.315528 127.837350 1.243183
1 124.176437 0.300477 124.649957 1.236916
2 120.962018 0.301030 121.720238 1.206761
3 117.938005 0.278763 118.909902 1.231662
4 115.061696 0.269224 116.242946 1.190097
5 112.287560 0.240412 113.693771 1.159047
6 109.609152 0.262167 111.293890 1.099815
7 107.072640 0.242916 108.949971 1.067070
8 104.635579 0.209314 106.755350 1.080068
9 102.278841 0.195815 104.624013 1.054731
10 99.978919 0.201804 102.560906 1.055403
11 97.825169 0.213528 100.686517 1.033271
12 95.739612 0.202356 98.765524 1.029646
13 93.719107 0.187538 96.975470 1.005893
14 91.799744 0.175199 95.208905 1.046983
15 89.948177 0.160738 93.552597 1.067333
16 88.134976 0.144838 91.929965 1.052541
17 86.378351 0.163211 90.341278 1.037858
18 84.670908 0.187184 88.761414 0.995875
19 83.055446 0.171080 87.343141 0.981363
20 81.469022 0.164968 85.960420 0.993623
21 79.940317 0.167554 84.620523 0.963820
22 78.440726 0.154343 83.301137 1.004986
23 77.004854 0.141827 82.076961 0.986129
24 75.621930 0.150028 80.898605 0.964261
25 74.241496 0.154140 79.679413 0.949695
26 72.921170 0.140105 78.521875 0.946750
27 71.660085 0.130937 77.377856 0.924869
28 70.419052 0.109023 76.282506 0.928389
29 69.237167 0.107013 75.241214 0.900845
30 68.063844 0.097079 74.209323 0.900476
31 66.927363 0.091163 73.218730 0.942131
32 65.819266 0.091109 72.193025 0.930880
33 64.734090 0.092792 71.232658 0.908819
34 63.672701 0.086543 70.338522 0.932795
35 62.653945 0.088487 69.429022 0.927500
36 61.660666 0.082703 68.571904 0.915664
37 60.697992 0.119144 67.734601 0.882644
38 59.729960 0.126423 66.881299 0.886910
39 58.787107 0.117820 66.059784 0.897685
40 57.882377 0.125402 65.266035 0.877481
41 57.013075 0.109192 64.463574 0.901940
42 56.140131 0.140454 63.700203 0.888990
43 55.302481 0.148805 62.953973 0.834368
44 54.484136 0.145519 62.252445 0.829440
45 53.698661 0.132748 61.561636 0.854725
46 52.917877 0.124366 60.888204 0.875071
47 52.162859 0.133974 60.204764 0.878531
48 51.426765 0.140143 59.552805 0.892451
49 50.676675 0.133987 58.885213 0.873657
Wall time: 2.25 s
validation_curve took 57 s, while xgboost.cv took only 2.25 s. The gap is huge!
# Classification dataset
X, y = make_classification(n_samples=10000, n_features=10, n_classes=2)
n_estimators = 50
params = {'n_estimators':n_estimators, 'booster':'gbtree', 'max_depth':5, 'learning_rate':0.05,
'objective':'binary:logistic', 'subsample':1, 'colsample_bytree':1}
clf = xgb.XGBClassifier(**params)
cv = KFold(n_splits=5, shuffle=True, random_state=100)
print('test_score:', cross_val_score(clf, X, y, cv=cv, scoring='accuracy'))
test_score: array([0.913 , 0.9235, 0.8955, 0.9075, 0.918 ])
cross_validate(clf, X, y, cv=cv, scoring='accuracy',
return_train_score=True)
{'fit_time': array([0.43403697, 0.43297029, 0.41813326, 0.42408895, 0.42200208]),
'score_time': array([0.00299048, 0.00203776, 0.00500631, 0.0019989 , 0.00299263]),
'test_score': array([0.913 , 0.9235, 0.8955, 0.9075, 0.918 ]),
'train_score': array([0.92425 , 0.921125, 0.9285 , 0.92325 , 0.922125])}
%%time
estimator_range = range(1, n_estimators+1)
train_score, test_score = validation_curve(
clf, X, y, param_name='n_estimators', param_range=estimator_range,
cv=cv, scoring='accuracy'
)
print('train_score:',train_score[-1])
print('test_score:', test_score[-1])
train_score: [0.92425 0.921125 0.9285 0.92325 0.922125]
test_score: [0.913 0.9235 0.8955 0.9075 0.918 ]
Wall time: 58.7 s
print('train_acc_mean:\n', train_score.mean(axis=1))
print('test_acc_mean:\n', test_score.mean(axis=1))
train_acc_mean:
array([0.912775, 0.916075, 0.91585 , 0.91695 , 0.917125, 0.917225,
0.91725 , 0.9175 , 0.91745 , 0.917925, 0.91755 , 0.918025,
0.917975, 0.91835 , 0.918225, 0.918625, 0.919 , 0.91905 ,
0.918975, 0.9191 , 0.91955 , 0.919525, 0.9198 , 0.9199 ,
0.919975, 0.920025, 0.9201 , 0.92005 , 0.920125, 0.9208 ,
0.921425, 0.9218 , 0.921875, 0.922025, 0.922125, 0.9221 ,
0.92225 , 0.922275, 0.922275, 0.92235 , 0.9226 , 0.9229 ,
0.923 , 0.9233 , 0.923375, 0.923275, 0.923325, 0.9234 ,
0.923675, 0.92385 ]),
test_acc_mean:
array([0.9049, 0.9072, 0.9082, 0.9085, 0.9087, 0.9084, 0.9082, 0.9091,
0.9087, 0.9089, 0.9091, 0.9092, 0.9089, 0.9101, 0.9102, 0.9108,
0.9102, 0.9107, 0.9105, 0.9109, 0.9104, 0.9102, 0.9109, 0.9109,
0.9103, 0.9105, 0.9105, 0.9103, 0.9106, 0.9111, 0.9121, 0.9124,
0.9124, 0.9122, 0.9119, 0.912 , 0.912 , 0.9117, 0.9114, 0.911 ,
0.911 , 0.9113, 0.9111, 0.9107, 0.9108, 0.911 , 0.9109, 0.9113,
0.9114, 0.9115])
%%time
params_xgb = params.copy()
num_round = params_xgb['n_estimators']
params_xgb['eta'] = params['learning_rate']
del params_xgb['n_estimators']
del params_xgb['learning_rate']
res = xgb.cv(params_xgb, xgb.DMatrix(X, y), num_round, folds=cv, metrics='error')
Wall time: 2.37 s
res['train-error-mean'] = 1 - res['train-error-mean']  # convert error rate to accuracy
res['test-error-mean'] = 1 - res['test-error-mean']    # so the values match the sklearn accuracy scores above
print(res)
train-error-mean train-error-std test-error-mean test-error-std
0 0.912775 0.002296 0.9049 0.007493
1 0.916075 0.003749 0.9072 0.007679
2 0.915850 0.003048 0.9082 0.006615
3 0.916950 0.002090 0.9085 0.008503
4 0.917125 0.002028 0.9087 0.008606
5 0.917225 0.002191 0.9084 0.009356
6 0.917250 0.002219 0.9082 0.009114
7 0.917500 0.002318 0.9091 0.009672
8 0.917450 0.002308 0.9087 0.009405
9 0.917925 0.002467 0.9089 0.009410
10 0.917550 0.002248 0.9091 0.009313
11 0.918025 0.002384 0.9092 0.009389
12 0.917975 0.002583 0.9089 0.009124
13 0.918350 0.002095 0.9101 0.008840
14 0.918225 0.002223 0.9102 0.008658
15 0.918625 0.002204 0.9108 0.008388
16 0.919000 0.002904 0.9102 0.009495
17 0.919050 0.002639 0.9107 0.008376
18 0.918975 0.002451 0.9105 0.008562
19 0.919100 0.002613 0.9109 0.008645
20 0.919550 0.003244 0.9104 0.008570
21 0.919525 0.003234 0.9102 0.008761
22 0.919800 0.003307 0.9109 0.008505
23 0.919900 0.003537 0.9109 0.008505
24 0.919975 0.003535 0.9103 0.008376
25 0.920025 0.003365 0.9105 0.008087
26 0.920100 0.003451 0.9105 0.008390
27 0.920050 0.003514 0.9103 0.008412
28 0.920125 0.003521 0.9106 0.007908
29 0.920800 0.003303 0.9111 0.008351
30 0.921425 0.002912 0.9121 0.009330
31 0.921800 0.002910 0.9124 0.009330
32 0.921875 0.002739 0.9124 0.009330
33 0.922025 0.002837 0.9122 0.009405
34 0.922125 0.002860 0.9119 0.009957
35 0.922100 0.002807 0.9120 0.009497
36 0.922250 0.002777 0.9120 0.009370
37 0.922275 0.002636 0.9117 0.009569
38 0.922275 0.002540 0.9114 0.009609
39 0.922350 0.002477 0.9110 0.009680
40 0.922600 0.002607 0.9110 0.009633
41 0.922900 0.002838 0.9113 0.010033
42 0.923000 0.002787 0.9111 0.009805
43 0.923300 0.002612 0.9107 0.009516
44 0.923375 0.002614 0.9108 0.009667
45 0.923275 0.002897 0.9110 0.009772
46 0.923325 0.002718 0.9109 0.009728
47 0.923400 0.002685 0.9113 0.009811
48 0.923675 0.002820 0.9114 0.009764
49 0.923850 0.002551 0.9115 0.009597
validation_curve took 58.7 s, while xgboost.cv took only 2.37 s. The gap is huge!