Import the packages used in this task:
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
import numpy as np
import warnings
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score,\
confusion_matrix, f1_score, roc_curve, roc_auc_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
warnings.filterwarnings(module='sklearn*', action='ignore', category=DeprecationWarning)
%matplotlib inline
plt.rc('font', family='SimHei', size=14)
plt.rcParams['axes.unicode_minus']=False
%config InlineBackend.figure_format = 'retina'
Original dataset download link: https://pan.baidu.com/s/1wO9qJRjnrm8uhaSP67K0lw
Note: this is a financial dataset (not raw data; it has already been preprocessed). The task is to predict whether a loan user will become overdue. The "status" column in the table is the label: 0 means not overdue, 1 means overdue.
Here we load the dataset that already went through feature engineering in the previous article ([One-Week Algorithm Practice, Advanced] Task 2: Feature Engineering):
data_del = pd.read_csv('data_del.csv')
data_del.head()
| | abs | apply_score | consfin_avg_limit | historical_trans_amount | history_fail_fee | latest_one_month_fail | loans_overdue_count | loans_score | max_cumulative_consume_later_1_month | repayment_capability | trans_amount_3_month | trans_fail_top_count_enum_last_1_month | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.200665 | 0.124820 | -1.201348 | -0.255030 | -0.427773 | -0.337569 | -0.098210 | 0.144596 | -0.067183 | 0.020868 | -0.049208 | -0.346369 | 1.0 |
| 1 | -0.090524 | 1.497024 | 0.238640 | 0.215237 | -0.547614 | -0.080162 | -0.733973 | 1.509325 | -0.073494 | -0.034355 | -0.274805 | -0.868380 | 0.0 |
| 2 | -0.312623 | 1.516627 | -0.671941 | -0.675385 | -0.627508 | -0.080162 | -0.733973 | 1.476440 | -0.262821 | -0.171658 | -0.321773 | 0.697653 | 1.0 |
| 3 | 1.359842 | 0.360055 | 0.736282 | 0.790524 | 0.331218 | -0.337569 | 0.537552 | -0.019829 | 0.471049 | -0.237850 | 0.505738 | -0.346369 | 0.0 |
| 4 | -0.315531 | -0.698503 | 0.042759 | -0.522714 | 0.291271 | -0.337569 | 1.173315 | -1.055708 | -0.172665 | -0.144424 | -0.282697 | 0.697653 | 1.0 |
Use sklearn to split the dataset 7:3 into a training set and a test set, with random seed 2018:
X_train, X_test, y_train, y_test = train_test_split(data_del.drop(['status'], axis=1).values,
data_del['status'].values, test_size=0.3,
random_state=2018)
Check the sizes of the resulting training and test sets:
[X_train.shape, y_train.shape, X_test.shape, y_test.shape]
[(3133, 12), (3133,), (1343, 12), (1343,)]
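Note that the split above does not stratify on the label. Since overdue users are a minority class, passing `stratify` keeps the 0/1 ratio identical in both splits; a minimal sketch on hypothetical toy labels (not the article's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a minority positive class (~25% ones), standing in for "status"
y = np.array([0] * 75 + [1] * 25)
X = np.arange(100).reshape(-1, 1)

# stratify=y preserves the class ratio in both the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=2018, stratify=y)

print(y_tr.mean(), y_te.mean())  # both close to 0.25
```

Whether to stratify is a judgment call; the article's unstratified split with a fixed seed is also reproducible.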
Run K-fold cross-validation on the training set. Taking the 5-fold case in the figure as an example: split the training set into 5 parts; in each round, one part serves as validation data and the rest as training data. After fitting the model on the training data, predict on the validation data; those predictions are the orange "Predict" in the figure. Concatenating the five rounds of out-of-fold predictions gives the orange "Predictions", which becomes the training set for the next-level model.
In each round, after fitting on the training data, we predict not only on the validation fold but also on the test set; each round's test predictions are the green "Predict" in the figure. Averaging the five rounds of test predictions gives the test set for the next-level model (the green "Predictions").
def get_stacking_data(models, X_train, y_train, X_test, y_test, k=5):
    '''Build the training and test sets for the next-level model.
    models: list of base models
    X_train: current training data
    y_train: current training labels
    X_test: current test data
    y_test: current test labels (unused here; kept for a uniform interface)
    k: number of folds for K-fold cross-validation
    return: next_train: out-of-fold predictions, the next-level training set
            next_test: averaged test predictions, the next-level test set
    '''
    kfold = KFold(n_splits=k, random_state=2018, shuffle=True)
    next_train = np.zeros((X_train.shape[0], len(models)))
    next_test = np.zeros((X_test.shape[0], len(models)))
    for j, model in enumerate(models):
        next_test_temp = np.zeros((X_test.shape[0], k))
        ksplit = kfold.split(X_train)
        for i, (train_index, val_index) in enumerate(ksplit):
            X_train_fold, y_train_fold = X_train[train_index], y_train[train_index]
            X_val = X_train[val_index]
            # Fit on k-1 folds, then predict the held-out fold (out-of-fold prediction)
            model.fit(X_train_fold, y_train_fold)
            next_train[val_index, j] = model.predict(X_val)
            # Also predict the test set with this fold's model
            next_test_temp[:, i] = model.predict(X_test)
        # Average the k test-set predictions for this base model
        next_test[:, j] = np.mean(next_test_temp, axis=1)
    return next_train, next_test
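For a single base model, the out-of-fold column that `get_stacking_data` builds is exactly what scikit-learn's `cross_val_predict` computes; a quick sanity check on hypothetical synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

# Hypothetical toy data standing in for the loan features
X, y = make_classification(n_samples=200, n_features=8, random_state=2018)

kfold = KFold(n_splits=5, shuffle=True, random_state=2018)
# Each sample's prediction comes from the fold where it was held out,
# just like next_train[:, j] in get_stacking_data above
oof = cross_val_predict(LogisticRegression(), X, y, cv=kfold)
print(oof.shape)  # one out-of-fold prediction per training sample
```

A common variant worth trying is stacking on `predict_proba` scores instead of hard 0/1 labels, which gives the meta-model a richer input.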
Evaluation results of the seven models with default parameters (code in the previous article):
| | AUC | Accuracy | F1-score | Precision | Recall |
|---|---|---|---|---|---|
| Random Forest | train: 99.92%; test: 75.32% | train: 98.44%; test: 77.29% | train: 96.78%; test: 42.99% | train: 94.36%; test: 33.24% | train: 99.33%; test: 60.85% |
| GBDT | train: 88.12%; test: 79.54% | train: 84.14%; test: 79.08% | train: 59.03%; test: 49.19% | train: 45.90%; test: 39.31% | train: 82.68%; test: 65.70% |
| XGBoost | train: 87.15%; test: 79.72% | train: 82.73%; test: 79.37% | train: 54.88%; test: 48.42% | train: 42.18%; test: 37.57% | train: 78.52%; test: 68.06% |
| LightGBM | train: 99.61%; test: 78.80% | train: 96.17%; test: 78.26% | train: 91.77%; test: 47.67% | train: 85.77%; test: 38.44% | train: 98.67%; test: 62.74% |
| Logistic Regression | train: 76.45%; test: 78.48% | train: 78.90%; test: 78.18% | train: 39.19%; test: 39.59% | train: 27.31%; test: 27.75% | train: 69.38%; test: 69.06% |
| SVM | train: 79.32%; test: 74.56% | train: 79.76%; test: 78.03% | train: 38.57%; test: 34.00% | train: 25.51%; test: 21.97% | train: 78.97%; test: 75.25% |
| Decision Tree | train: 100.00%; test: 62.97% | train: 100.00%; test: 70.66% | train: 100.00%; test: 45.28% | train: 100.00%; test: 47.11% | train: 100.00%; test: 43.58% |
The four ensemble models (Random Forest, GBDT, XGBoost, LightGBM) and logistic regression clearly perform best. We use Random Forest, GBDT, Logistic Regression, and LightGBM as base models and XGBoost as the second-level (meta) model, keeping default parameters for now.
rnd_clf = RandomForestClassifier(random_state=2018)
gbdt = GradientBoostingClassifier(random_state=2018)
xgb = XGBClassifier(random_state=2018)
lgbm = LGBMClassifier(random_state=2018)
log = LogisticRegression(random_state=2018, max_iter=1000)
svc = SVC(random_state=2018, probability=True)
tree = DecisionTreeClassifier(random_state=2018)
base_models = [rnd_clf, gbdt, lgbm, log]
next_train, next_test = get_stacking_data(base_models, X_train, y_train, X_test, y_test, k=10)
stacking_model = XGBClassifier(random_state=2018)
# Train the meta-model on the out-of-fold features and the training labels;
# fitting it on next_test/y_test would leak test labels into the model
stacking_model.fit(next_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
n_jobs=1, nthread=None, objective='binary:logistic',
random_state=2018, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
seed=None, silent=True, subsample=1)
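The metrics in the comparison table can be computed with the functions imported at the top; a sketch with hypothetical toy arrays standing in for `stacking_model.predict(next_test)` (note that `roc_auc_score` expects scores such as `predict_proba` output, not hard labels):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Hypothetical labels and predictions, standing in for the real test set
y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 0, 1, 0, 0, 1])          # hard labels, as from .predict()
y_prob = np.array([0.1, 0.3, 0.8, 0.4, 0.2, 0.9])  # scores, as from .predict_proba()[:, 1]

print(accuracy_score(y_true, y_pred))  # fraction of correct predictions
print(f1_score(y_true, y_pred))        # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_prob))   # ranking quality; takes scores, not labels
```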
| | AUC | Accuracy | F1-score | Precision | Recall |
|---|---|---|---|---|---|
| Random Forest | train: 99.92%; test: 75.32% | train: 98.44%; test: 77.29% | train: 96.78%; test: 42.99% | train: 94.36%; test: 33.24% | train: 99.33%; test: 60.85% |
| GBDT | train: 88.12%; test: 79.54% | train: 84.14%; test: 79.08% | train: 59.03%; test: 49.19% | train: 45.90%; test: 39.31% | train: 82.68%; test: 65.70% |
| XGBoost | train: 87.15%; test: 79.72% | train: 82.73%; test: 79.37% | train: 54.88%; test: 48.42% | train: 42.18%; test: 37.57% | train: 78.52%; test: 68.06% |
| LightGBM | train: 99.61%; test: 78.80% | train: 96.17%; test: 78.26% | train: 91.77%; test: 47.67% | train: 85.77%; test: 38.44% | train: 98.67%; test: 62.74% |
| Logistic Regression | train: 76.45%; test: 78.48% | train: 78.90%; test: 78.18% | train: 39.19%; test: 39.59% | train: 27.31%; test: 27.75% | train: 69.38%; test: 69.06% |
| SVM | train: 79.32%; test: 74.56% | train: 79.76%; test: 78.03% | train: 38.57%; test: 34.00% | train: 25.51%; test: 21.97% | train: 78.97%; test: 75.25% |
| Decision Tree | train: 100.00%; test: 62.97% | train: 100.00%; test: 70.66% | train: 100.00%; test: 45.28% | train: 100.00%; test: 47.11% | train: 100.00%; test: 43.58% |
| Stacking model | train: 64.89%; test: 79.04% | train: 78.58%; test: 83.54% | train: 39.06%; test: 59.15% | train: 27.56%; test: 46.24% | train: 66.98%; test: 82.05% |
[Figure: training-set (left) and test-set (right) plots]
Compared with the individual models, the stacking model improves nearly every test-set metric other than AUC, and it does so with default parameters throughout. Tuning the models should improve the results further.
[1] The stacking process explained in detail: https://blog.csdn.net/wstcjf/article/details/77989963