GBDT is one of the most popular algorithms in practice. Unlike AdaBoost, which uses the previous round's error rate to reweight the training samples, GBDT fits each new learner to the gradient of the loss; both are forward stagewise additive methods, but GBDT restricts the weak learner to the CART decision tree model.
An intuitive picture of GBDT: suppose a house is priced at 1,000,000. We first fit it with 800,000 and find a residual of 200,000; we then fit that residual with 150,000, leaving 50,000; next we fit with 40,000, leaving 10,000; and so on, iterating until the residual drops below a preset threshold.
How do we decide what each round should fit against the loss function? Friedman's answer is to use the negative gradient of the loss, evaluated at the current model, as an approximation of this round's residual, and to fit a CART regression tree to it. Since typesetting the formulas is time-consuming, I recommend a blog post where the details can be found.
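The residual-fitting loop above can be sketched in a few lines. For squared-error loss the negative gradient is exactly the residual, so each round fits a small regression tree to the current residuals and adds a shrunken step. This is a minimal illustration of the idea, not a library implementation; the function names are my own:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_regress(X, y, n_rounds=100, lr=0.1):
    """Minimal GBDT for squared-error loss: each round fits a CART
    regression tree to the residuals (the negative gradient) and
    adds a shrunken copy of its predictions to the ensemble."""
    base = y.mean()                      # initial constant fit
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residual = y - pred              # negative gradient of 1/2*(y - pred)^2
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        pred += lr * tree.predict(X)     # shrinkage, i.e. the learning rate
        trees.append(tree)
    return trees, base

def gbdt_predict(trees, base, X, lr=0.1):
    """Sum the shrunken tree predictions on top of the constant base."""
    return base + lr * sum(t.predict(X) for t in trees)
```

Each iteration shrinks the residual, exactly like the 1,000,000 → 200,000 → 50,000 house-price example.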
GBDT has many advantages; to summarize:
1. Because CART is used as the weak learner, it can handle many kinds of data, including continuous and categorical values.
2. Its predictive accuracy is relatively high.
3. Robust loss functions (such as the Huber loss) can be used, giving strong robustness to outliers.
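Point 3 refers to losses like Huber or quantile loss, which in sklearn are selected through the `loss` parameter of `GradientBoostingRegressor`. A small sketch on synthetic data with injected outliers (the data and parameter values here are my own, chosen only for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel()
y[::20] += 10  # corrupt every 20th target with a large outlier

# 'huber' behaves like squared error near the fit and like absolute
# error far from it, so the injected outliers pull the model far less
# than they would under the default squared-error loss
reg = GradientBoostingRegressor(loss="huber", alpha=0.9, random_state=0)
reg.fit(X, y)
```

`alpha` here is the Huber quantile that controls where the loss switches from quadratic to linear.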
Below is a small demo of GBDT in practice.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
%matplotlib inline
data = load_iris()
X = data.data
y = data.target
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
fig = plt.figure()
plt.scatter(x_train[y_train==0][:, 1], x_train[y_train==0][:, 2])
plt.scatter(x_train[y_train==1][:, 1], x_train[y_train==1][:, 2])
plt.scatter(x_train[y_train==2][:, 1], x_train[y_train==2][:, 2])
plt.legend(data.target_names)
plt.show()
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from IPython.display import Image
import pydotplus
from sklearn import metrics
dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)
print("Accuracy (train): %.4g" % metrics.accuracy_score(y_train, dtc.predict(x_train)))
print("Accuracy (test): %.4g" % metrics.accuracy_score(y_test, dtc.predict(x_test)))
print("Classification report:\n", metrics.classification_report(y_test, dtc.predict(x_test), target_names=data.target_names))
dot_data = export_graphviz(decision_tree=dtc,
out_file=None,
feature_names=data.feature_names,
class_names=data.target_names,
filled=True,
rounded=True,
special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
Accuracy (train): 1
Accuracy (test): 0.9474
Classification report:
precision recall f1-score support
setosa 1.00 1.00 1.00 11
versicolor 0.80 1.00 0.89 8
virginica 1.00 0.89 0.94 19
avg / total 0.96 0.95 0.95 38
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
gbc.fit(x_train, y_train)
print("Accuracy(train) : %.4g" % gbc.score(x_train, y_train))
print("Accuracy(test) : %.4g" % gbc.score(x_test, y_test))
print("Classification report:\n", metrics.classification_report(y_test, gbc.predict(x_test), target_names=data.target_names))
Accuracy(train) : 1
Accuracy(test) : 0.9474
Classification report:
precision recall f1-score support
setosa 1.00 1.00 1.00 11
versicolor 0.80 1.00 0.89 8
virginica 1.00 0.89 0.94 19
avg / total 0.96 0.95 0.95 38
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
params = {"max_depth": list(range(1,11))}
gbc = GradientBoostingClassifier()
gs = GridSearchCV(gbc, param_grid=params, cv=10)
gs.fit(x_train, y_train)
print("Accuracy(train) : %.4g" % gs.score(x_train, y_train))
print("Accuracy(test) : %.4g" % gs.score(x_test, y_test))
print("Classification report:\n", metrics.classification_report(y_test, gs.predict(x_test), target_names=data.target_names))
Accuracy(train) : 1
Accuracy(test) : 0.9474
Classification report:
precision recall f1-score support
setosa 1.00 1.00 1.00 11
versicolor 0.80 1.00 0.89 8
virginica 1.00 0.89 0.94 19
avg / total 0.96 0.95 0.95 38
gs.best_estimator_, gs.best_score_, gs.best_params_, gs.grid_scores_  # grid_scores_ is deprecated; use gs.cv_results_ in sklearn >= 0.20
D:\anaconda\setup\lib\site-packages\sklearn\model_selection\_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
DeprecationWarning)
(GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=4,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
presort='auto', random_state=None, subsample=1.0, verbose=0,
warm_start=False),
0.9642857142857143,
{'max_depth': 4},
[mean: 0.94643, std: 0.06325, params: {'max_depth': 1},
mean: 0.94643, std: 0.07307, params: {'max_depth': 2},
mean: 0.95536, std: 0.06385, params: {'max_depth': 3},
mean: 0.96429, std: 0.04335, params: {'max_depth': 4},
mean: 0.95536, std: 0.06385, params: {'max_depth': 5},
mean: 0.95536, std: 0.06385, params: {'max_depth': 6},
mean: 0.95536, std: 0.06385, params: {'max_depth': 7},
mean: 0.95536, std: 0.06385, params: {'max_depth': 8},
mean: 0.95536, std: 0.06385, params: {'max_depth': 9},
mean: 0.95536, std: 0.06385, params: {'max_depth': 10}])
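As the deprecation warning says, `grid_scores_` was removed in sklearn 0.20; the same per-setting means and standard deviations live in the `cv_results_` dict. A sketch of the replacement on a small iris grid (the tiny `n_estimators=10` setting is only to keep this example fast):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
gs = GridSearchCV(GradientBoostingClassifier(n_estimators=10, random_state=0),
                  param_grid={"max_depth": [1, 2]}, cv=3)
gs.fit(X, y)

# cv_results_ is a dict of parallel arrays, one entry per parameter setting
for params, mean, std in zip(gs.cv_results_["params"],
                             gs.cv_results_["mean_test_score"],
                             gs.cv_results_["std_test_score"]):
    print(params, "mean=%.5f std=%.5f" % (mean, std))
```

This prints the same "mean / std / params" triples that `grid_scores_` used to return.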
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier()
ada.fit(x_train, y_train)
print("Accuracy(train) : %.4g" % ada.score(x_train, y_train))
print("Accuracy(test) : %.4g" % ada.score(x_test, y_test))
print("Classification report:\n", metrics.classification_report(y_test, ada.predict(x_test), target_names=data.target_names))
Accuracy(train) : 0.9821
Accuracy(test) : 0.9474
Classification report:
precision recall f1-score support
setosa 1.00 1.00 1.00 11
versicolor 0.80 1.00 0.89 8
virginica 1.00 0.89 0.94 19
avg / total 0.96 0.95 0.95 38
import xgboost
params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',  # multi-class classification
    'num_class': 3,  # number of classes, used together with multi:softmax
    'gamma': 0.1,  # minimum loss reduction to allow a further split; larger is more conservative, typically around 0.1-0.2
    'max_depth': 3,  # maximum tree depth; deeper trees overfit more easily
    'lambda': 1,  # L2 regularization on the leaf weights; larger values make the model harder to overfit
    'subsample': 0.8,  # row subsampling of the training data
    'colsample_bytree': 0.7,  # column subsampling when building each tree
    'min_child_weight': 3,
    'silent': 1,  # 1 suppresses run-time messages; set to 0 to see them
    'eta': 0.01,  # shrinkage, i.e. the learning rate
    'seed': 1000,
    'nthread': 4,  # number of CPU threads
}
dtrain = xgboost.DMatrix(x_train, y_train)
num_rounds = 500
model = xgboost.train(params=params, dtrain=dtrain, num_boost_round=num_rounds)
dtest = xgboost.DMatrix(x_test)
ans = model.predict(dtest)
cnt1 = 0
cnt2 = 0
for i in range(len(y_test)):
    if ans[i] == y_test[i]:
        cnt1 += 1
    else:
        cnt2 += 1
print("Accuracy(test): \n", cnt1/(cnt1 + cnt2))
Accuracy(test):
0.9473684210526315
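The hand-rolled counting loop is equivalent to `metrics.accuracy_score`, which the sklearn sections already use. A sketch with hypothetical arrays (the values below are made up, not the notebook's):

```python
import numpy as np
from sklearn.metrics import accuracy_score

ans = np.array([0, 1, 2, 1])     # hypothetical predicted labels
y_test = np.array([0, 1, 2, 2])  # hypothetical true labels

# same quantity as cnt1 / (cnt1 + cnt2) in the loop above
print("Accuracy(test):", accuracy_score(y_test, ans))  # → 0.75
```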
import pandas as pd
train = pd.read_csv("./train_modified.csv")
# train.head(10)
train.describe()
# train[train.isnull().values==True]
| | Disbursed | Existing_EMI | Loan_Amount_Applied | Loan_Tenure_Applied | Monthly_Income | Var4 | Var5 | Age | EMI_Loan_Submitted_Missing | Interest_Rate_Missing | ... | Var2_2 | Var2_3 | Var2_4 | Var2_5 | Var2_6 | Mobile_Verified_0 | Mobile_Verified_1 | Source_0 | Source_1 | Source_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 20000.000000 | 20000.00000 | 2.000000e+04 | 20000.000000 | 2.000000e+04 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | ... | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 |
mean | 0.016000 | 3890.29622 | 2.424191e+05 | 2.254450 | 4.575767e+04 | 2.966350 | 5.851350 | 30.878950 | 0.644000 | 0.644000 | ... | 0.239300 | 0.009300 | 0.064950 | 0.026450 | 0.000900 | 0.354100 | 0.645900 | 0.115500 | 0.645450 | 0.239050 |
std | 0.125478 | 10534.21647 | 3.582973e+05 | 1.988467 | 4.575422e+05 | 1.575989 | 5.835997 | 6.829651 | 0.478827 | 0.478827 | ... | 0.426667 | 0.095989 | 0.246444 | 0.160473 | 0.029987 | 0.478252 | 0.478252 | 0.319632 | 0.478389 | 0.426514 |
min | 0.000000 | 0.00000 | 0.000000e+00 | 0.000000 | 1.000000e+01 | 1.000000 | 0.000000 | 18.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 0.00000 | 0.000000e+00 | 0.000000 | 1.700000e+04 | 1.000000 | 0.000000 | 26.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.000000 | 0.00000 | 1.000000e+05 | 2.000000 | 2.500000e+04 | 3.000000 | 3.000000 | 29.000000 | 1.000000 | 1.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 |
75% | 0.000000 | 4000.00000 | 3.000000e+05 | 4.000000 | 4.000000e+04 | 5.000000 | 11.000000 | 34.000000 | 1.000000 | 1.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 |
max | 1.000000 | 420000.00000 | 9.000000e+06 | 10.000000 | 5.495454e+07 | 7.000000 | 17.000000 | 65.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 50 columns
target = "Disbursed"
IDcol = "ID"
train["Disbursed"].value_counts()
0 19680
1 320
Name: Disbursed, dtype: int64
x_columns = [x for x in train.columns if x not in [target, IDcol]]
X = train[x_columns]
y = train["Disbursed"]
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)
print("Accuracy (train): %.4g" % metrics.accuracy_score(y_train, dtc.predict(x_train)))
print("Accuracy (test): %.4g" % metrics.accuracy_score(y_test, dtc.predict(x_test)))
print("Classification report:\n", metrics.classification_report(y_test, dtc.predict(x_test)))
Accuracy (train): 0.9997
Accuracy (test): 0.9682
Classification report:
precision recall f1-score support
0 0.99 0.98 0.98 4931
1 0.04 0.06 0.05 69
avg / total 0.97 0.97 0.97 5000
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics
gbm0 = GradientBoostingClassifier(random_state=10)
gbm0.fit(x_train, y_train)
y_pred = gbm0.predict(x_test)
y_predprob = gbm0.predict_proba(x_test)[:,1]
print("Accuracy (train): %.4g" % metrics.accuracy_score(y_train, gbm0.predict(x_train)))
print("Accuracy (test): %.4g" % metrics.accuracy_score(y_test, gbm0.predict(x_test)))
print("Classification report:\n", metrics.classification_report(y_test, gbm0.predict(x_test)))
Accuracy (train): 0.9844
Accuracy (test): 0.9854
Classification report:
precision recall f1-score support
0 0.99 1.00 0.99 4931
1 0.00 0.00 0.00 69
avg / total 0.97 0.99 0.98 5000
from sklearn.model_selection import GridSearchCV
params = {"n_estimators": range(20, 81, 10)}
gsearch1 = GridSearchCV(estimator=GradientBoostingClassifier(learning_rate=0.1,
min_samples_split=300, # a node with fewer samples than this will not be split further
min_samples_leaf=20, # a leaf with fewer samples than this is pruned together with its sibling
max_depth=8, # maximum depth of each tree
max_features="sqrt", # max features considered per split; "sqrt" (or "auto") means at most sqrt(N) features
subsample=0.8, # subsampling without replacement; 0.8 trains each tree on 80% of the rows, which helps prevent overfitting
random_state=10),
param_grid=params,
scoring="roc_auc",
iid=False,
cv=5)
gsearch1.fit(x_train, y_train)
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_
([mean: 0.82240, std: 0.03087, params: {'n_estimators': 20},
mean: 0.82469, std: 0.03055, params: {'n_estimators': 30},
mean: 0.82479, std: 0.03178, params: {'n_estimators': 40},
mean: 0.82445, std: 0.02968, params: {'n_estimators': 50},
mean: 0.82230, std: 0.02993, params: {'n_estimators': 60},
mean: 0.82074, std: 0.02881, params: {'n_estimators': 70},
mean: 0.81918, std: 0.02904, params: {'n_estimators': 80}],
{'n_estimators': 40},
0.8247923927911079)
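Once n_estimators is tuned at a given learning rate, a common last step (a rule of thumb, not something this notebook does) is to scale the learning rate down and n_estimators up by the same factor: the total amount of shrinkage stays comparable while the boosting takes smaller, better-averaged steps, which usually generalizes at least as well at the cost of training time. Sketched on iris for speed:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)

# the kind of setting a coarse grid search arrives at
coarse = GradientBoostingClassifier(learning_rate=0.1, n_estimators=40,
                                    random_state=10).fit(X, y)
# half the step size, twice the trees: comparable total shrinkage
fine = GradientBoostingClassifier(learning_rate=0.05, n_estimators=80,
                                  random_state=10).fit(X, y)
```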
params2 = {"max_depth": range(3, 14, 2), "min_samples_split": range(100, 801, 200)}
gsearch2 = GridSearchCV(estimator=GradientBoostingClassifier(learning_rate=0.1,
n_estimators=60,
min_samples_leaf=20,
max_features="sqrt",
subsample=0.8,
random_state=10),
param_grid=params2,
scoring="roc_auc",
iid=False,
cv=5)
gsearch2.fit(x_train, y_train)
gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_
([mean: 0.81718, std: 0.03177, params: {'max_depth': 3, 'min_samples_split': 100},
mean: 0.81821, std: 0.02824, params: {'max_depth': 3, 'min_samples_split': 300},
mean: 0.81938, std: 0.02993, params: {'max_depth': 3, 'min_samples_split': 500},
mean: 0.81850, std: 0.02894, params: {'max_depth': 3, 'min_samples_split': 700},
mean: 0.82919, std: 0.02452, params: {'max_depth': 5, 'min_samples_split': 100},
mean: 0.82704, std: 0.02582, params: {'max_depth': 5, 'min_samples_split': 300},
mean: 0.82595, std: 0.02603, params: {'max_depth': 5, 'min_samples_split': 500},
mean: 0.82930, std: 0.02581, params: {'max_depth': 5, 'min_samples_split': 700},
mean: 0.82742, std: 0.02200, params: {'max_depth': 7, 'min_samples_split': 100},
mean: 0.81882, std: 0.02066, params: {'max_depth': 7, 'min_samples_split': 300},
mean: 0.82529, std: 0.02404, params: {'max_depth': 7, 'min_samples_split': 500},
mean: 0.82395, std: 0.02940, params: {'max_depth': 7, 'min_samples_split': 700},
mean: 0.82908, std: 0.02157, params: {'max_depth': 9, 'min_samples_split': 100},
mean: 0.81857, std: 0.03291, params: {'max_depth': 9, 'min_samples_split': 300},
mean: 0.82545, std: 0.02825, params: {'max_depth': 9, 'min_samples_split': 500},
mean: 0.82815, std: 0.02859, params: {'max_depth': 9, 'min_samples_split': 700},
mean: 0.81604, std: 0.02591, params: {'max_depth': 11, 'min_samples_split': 100},
mean: 0.82513, std: 0.02261, params: {'max_depth': 11, 'min_samples_split': 300},
mean: 0.82908, std: 0.03235, params: {'max_depth': 11, 'min_samples_split': 500},
mean: 0.82534, std: 0.02583, params: {'max_depth': 11, 'min_samples_split': 700},
mean: 0.81899, std: 0.02132, params: {'max_depth': 13, 'min_samples_split': 100},
mean: 0.82667, std: 0.02806, params: {'max_depth': 13, 'min_samples_split': 300},
mean: 0.82685, std: 0.03581, params: {'max_depth': 13, 'min_samples_split': 500},
mean: 0.82662, std: 0.02611, params: {'max_depth': 13, 'min_samples_split': 700}],
{'max_depth': 5, 'min_samples_split': 700},
0.8292976819017346)
params3 = {"min_samples_split": range(800, 1900, 200), "min_samples_leaf": range(60, 101, 10)}
gsearch3 = GridSearchCV(estimator=GradientBoostingClassifier(learning_rate=0.1,
n_estimators=60,
max_depth=9,
max_features="sqrt",
subsample=0.8,
random_state=10),
param_grid=params3,
scoring="roc_auc",
iid=False,
cv=5)
gsearch3.fit(x_train, y_train)
gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_
([mean: 0.82938, std: 0.02746, params: {'min_samples_leaf': 60, 'min_samples_split': 800},
mean: 0.82748, std: 0.03127, params: {'min_samples_leaf': 60, 'min_samples_split': 1000},
mean: 0.82002, std: 0.03099, params: {'min_samples_leaf': 60, 'min_samples_split': 1200},
mean: 0.82265, std: 0.03321, params: {'min_samples_leaf': 60, 'min_samples_split': 1400},
mean: 0.82615, std: 0.02846, params: {'min_samples_leaf': 60, 'min_samples_split': 1600},
mean: 0.82273, std: 0.02671, params: {'min_samples_leaf': 60, 'min_samples_split': 1800},
mean: 0.82471, std: 0.03209, params: {'min_samples_leaf': 70, 'min_samples_split': 800},
mean: 0.82705, std: 0.03119, params: {'min_samples_leaf': 70, 'min_samples_split': 1000},
mean: 0.82525, std: 0.02723, params: {'min_samples_leaf': 70, 'min_samples_split': 1200},
mean: 0.82698, std: 0.02734, params: {'min_samples_leaf': 70, 'min_samples_split': 1400},
mean: 0.82374, std: 0.02662, params: {'min_samples_leaf': 70, 'min_samples_split': 1600},
mean: 0.82543, std: 0.02728, params: {'min_samples_leaf': 70, 'min_samples_split': 1800},
mean: 0.82468, std: 0.02681, params: {'min_samples_leaf': 80, 'min_samples_split': 800},
mean: 0.82688, std: 0.02378, params: {'min_samples_leaf': 80, 'min_samples_split': 1000},
mean: 0.82400, std: 0.02718, params: {'min_samples_leaf': 80, 'min_samples_split': 1200},
mean: 0.82635, std: 0.03008, params: {'min_samples_leaf': 80, 'min_samples_split': 1400},
mean: 0.82478, std: 0.02849, params: {'min_samples_leaf': 80, 'min_samples_split': 1600},
mean: 0.82215, std: 0.02679, params: {'min_samples_leaf': 80, 'min_samples_split': 1800},
mean: 0.82416, std: 0.02264, params: {'min_samples_leaf': 90, 'min_samples_split': 800},
mean: 0.82559, std: 0.02115, params: {'min_samples_leaf': 90, 'min_samples_split': 1000},
mean: 0.82556, std: 0.02317, params: {'min_samples_leaf': 90, 'min_samples_split': 1200},
mean: 0.82452, std: 0.02702, params: {'min_samples_leaf': 90, 'min_samples_split': 1400},
mean: 0.82319, std: 0.02409, params: {'min_samples_leaf': 90, 'min_samples_split': 1600},
mean: 0.82400, std: 0.02738, params: {'min_samples_leaf': 90, 'min_samples_split': 1800},
mean: 0.83031, std: 0.02758, params: {'min_samples_leaf': 100, 'min_samples_split': 800},
mean: 0.82296, std: 0.02450, params: {'min_samples_leaf': 100, 'min_samples_split': 1000},
mean: 0.82464, std: 0.02562, params: {'min_samples_leaf': 100, 'min_samples_split': 1200},
mean: 0.82332, std: 0.02972, params: {'min_samples_leaf': 100, 'min_samples_split': 1400},
mean: 0.82227, std: 0.02910, params: {'min_samples_leaf': 100, 'min_samples_split': 1600},
mean: 0.82231, std: 0.02642, params: {'min_samples_leaf': 100, 'min_samples_split': 1800}],
{'min_samples_leaf': 100, 'min_samples_split': 800},
0.8303082748093461)
gbm = GradientBoostingClassifier(learning_rate=0.1,
n_estimators=40,
min_samples_split=800,
min_samples_leaf=100,
max_depth=5,
max_features="sqrt",
subsample=0.8,
random_state=10)
gbm.fit(x_train, y_train)
print("Accuracy (train): %.4g" % metrics.accuracy_score(y_train, gbm.predict(x_train)))
print("Accuracy (test): %.4g" % metrics.accuracy_score(y_test, gbm.predict(x_test)))
print("Classification report:\n", metrics.classification_report(y_test, gbm.predict(x_test)))
Accuracy (train): 0.9833
Accuracy (test): 0.9862
Classification report:
precision recall f1-score support
0 0.99 1.00 0.99 4931
1 0.00 0.00 0.00 69
avg / total 0.97 0.99 0.98 5000
D:\anaconda\setup\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)