[One-Week Algorithm Practice, Advanced] Task 3: Model Fusion (Stacking)

Import the packages used in this task:

import pandas as pd
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
import numpy as np
import warnings
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score,\
                            confusion_matrix, f1_score, roc_curve, roc_auc_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

warnings.filterwarnings(module='sklearn*', action='ignore', category=DeprecationWarning)
%matplotlib inline
plt.rc('font', family='SimHei', size=14)
plt.rcParams['axes.unicode_minus']=False
%config InlineBackend.figure_format = 'retina'

Prepare the Data

Import the data

Download link for the original dataset: https://pan.baidu.com/s/1wO9qJRjnrm8uhaSP67K0lw

Note: this is financial data (not raw data; it has already been processed). The task is to predict whether a loan user will become overdue. In the table, "status" is the label: 0 means not overdue, 1 means overdue.

Here we load the dataset produced by the feature engineering in the previous post ([One-Week Algorithm Practice, Advanced] Task 2: Feature Engineering):

data_del = pd.read_csv('data_del.csv')
data_del.head()
abs apply_score consfin_avg_limit historical_trans_amount history_fail_fee latest_one_month_fail loans_overdue_count loans_score max_cumulative_consume_later_1_month repayment_capability trans_amount_3_month trans_fail_top_count_enum_last_1_month status
0 -0.200665 0.124820 -1.201348 -0.255030 -0.427773 -0.337569 -0.098210 0.144596 -0.067183 0.020868 -0.049208 -0.346369 1.0
1 -0.090524 1.497024 0.238640 0.215237 -0.547614 -0.080162 -0.733973 1.509325 -0.073494 -0.034355 -0.274805 -0.868380 0.0
2 -0.312623 1.516627 -0.671941 -0.675385 -0.627508 -0.080162 -0.733973 1.476440 -0.262821 -0.171658 -0.321773 0.697653 1.0
3 1.359842 0.360055 0.736282 0.790524 0.331218 -0.337569 0.537552 -0.019829 0.471049 -0.237850 0.505738 -0.346369 0.0
4 -0.315531 -0.698503 0.042759 -0.522714 0.291271 -0.337569 1.173315 -1.055708 -0.172665 -0.144424 -0.282697 0.697653 1.0
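
Overdue users are the minority class in this data, so it is worth confirming the label balance before splitting (a quick check, not part of the original post):

# Proportion of each class in the label column
data_del['status'].value_counts(normalize=True)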

Split the data

Use sklearn's train_test_split to split the dataset into training and test sets at a 7:3 ratio, with random seed 2018:

X_train, X_test, y_train, y_test = train_test_split(data_del.drop(['status'], axis=1).values, 
                                                    data_del['status'].values, test_size=0.3, 
                                                    random_state=2018)

Check the shapes of the training and test sets:

[X_train.shape, y_train.shape, X_test.shape, y_test.shape]
[(3133, 12), (3133,), (1343, 12), (1343,)]

Model Fusion (Stacking)

[Figure 1: Stacking workflow diagram]
The data is split into two parts: a training set and a test set.

  • Training set (Training Data in the figure)

Perform K-fold cross-validation on the training set. Taking the 5-fold case in the figure as an example, the training set is split into 5 parts; each time, one part is held out as validation data and the rest is used as training data. After training a model on the training data, predict on the validation data; the result is an orange Predict in the figure. The five validation-fold predictions are then stitched together into the orange Predictions, which becomes the training set for the next-layer model.

  • Test set (Testing Data in the figure)

Each time a model is trained on a fold's training data, it predicts not only on the validation fold but also on the test set; each of these predictions is a green Predict in the figure. Averaging the five Predicts gives the test set for the next-layer model (the green Predictions).

def get_stacking_data(models, X_train, y_train, X_test, y_test, k=5):
    '''Build the training and test sets for the next-layer model.
    models: list of base models
    X_train: current training features
    y_train: current training labels
    X_test: current test features
    y_test: current test labels (not used here; kept for interface symmetry)
    k: number of cross-validation folds
    return: next_train: training set for the next-layer model
            next_test: test set for the next-layer model
    '''
    kfold = KFold(n_splits=k, random_state=2018, shuffle=True)
    next_train = np.zeros((X_train.shape[0], len(models)))
    next_test = np.zeros((X_test.shape[0], len(models)))
    
    for j, model in enumerate(models):
        next_test_temp = np.zeros((X_test.shape[0], k))
        for i, (train_index, val_index) in enumerate(kfold.split(X_train)):
            X_train_fold, y_train_fold = X_train[train_index], y_train[train_index]
            X_val = X_train[val_index]
            model.fit(X_train_fold, y_train_fold)
            # Out-of-fold predictions become the next-layer training features
            next_train[val_index, j] = model.predict(X_val)
            # Test-set predictions, one column per fold
            next_test_temp[:, i] = model.predict(X_test)
        # Average the k test-set predictions for this base model
        next_test[:, j] = np.mean(next_test_temp, axis=1)
    
    return next_train, next_test
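
A common variant, not used in this post, feeds the positive-class probability rather than the hard 0/1 label to the second-layer model, which usually gives the meta-model a richer signal. A minimal sketch, changing only the two prediction lines (all base models below support predict_proba):

def get_stacking_proba(models, X_train, y_train, X_test, k=5):
    '''Like get_stacking_data, but stacks class probabilities instead of hard labels.'''
    kfold = KFold(n_splits=k, random_state=2018, shuffle=True)
    next_train = np.zeros((X_train.shape[0], len(models)))
    next_test = np.zeros((X_test.shape[0], len(models)))
    for j, model in enumerate(models):
        next_test_temp = np.zeros((X_test.shape[0], k))
        for i, (train_index, val_index) in enumerate(kfold.split(X_train)):
            model.fit(X_train[train_index], y_train[train_index])
            # Probability of the positive class instead of the 0/1 label
            next_train[val_index, j] = model.predict_proba(X_train[val_index])[:, 1]
            next_test_temp[:, i] = model.predict_proba(X_test)[:, 1]
        next_test[:, j] = np.mean(next_test_temp, axis=1)
    return next_train, next_test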

Choosing the models to fuse

The evaluation results of the seven models under default parameters are listed below; the evaluation code is in the previous post (a sketch of such a helper follows the table):

Model                AUC                            Accuracy                       F1-score                       Precision                      Recall
Random Forest        train: 99.92%; test: 75.32%    train: 98.44%; test: 77.29%    train: 96.78%; test: 42.99%    train: 94.36%; test: 33.24%    train: 99.33%; test: 60.85%
GBDT                 train: 88.12%; test: 79.54%    train: 84.14%; test: 79.08%    train: 59.03%; test: 49.19%    train: 45.90%; test: 39.31%    train: 82.68%; test: 65.70%
XGBoost              train: 87.15%; test: 79.72%    train: 82.73%; test: 79.37%    train: 54.88%; test: 48.42%    train: 42.18%; test: 37.57%    train: 78.52%; test: 68.06%
LightGBM             train: 99.61%; test: 78.80%    train: 96.17%; test: 78.26%    train: 91.77%; test: 47.67%    train: 85.77%; test: 38.44%    train: 98.67%; test: 62.74%
Logistic Regression  train: 76.45%; test: 78.48%    train: 78.90%; test: 78.18%    train: 39.19%; test: 39.59%    train: 27.31%; test: 27.75%    train: 69.38%; test: 69.06%
SVM                  train: 79.32%; test: 74.56%    train: 79.76%; test: 78.03%    train: 38.57%; test: 34.00%    train: 25.51%; test: 21.97%    train: 78.97%; test: 75.25%
Decision Tree        train: 100.00%; test: 62.97%   train: 100.00%; test: 70.66%   train: 100.00%; test: 45.28%   train: 100.00%; test: 47.11%   train: 100.00%; test: 43.58%
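
For reference, a minimal sketch of the kind of helper that produces the numbers above (the function name and layout are my own assumptions; the actual code is in the previous post):

def evaluate(model, X_train, y_train, X_test, y_test):
    '''Print train/test AUC, accuracy, F1, precision and recall for a fitted model.'''
    for name, X, y in [('train', X_train, y_train), ('test', X_test, y_test)]:
        y_pred = model.predict(X)
        y_proba = model.predict_proba(X)[:, 1]
        print('{}: AUC {:.2%}  Accuracy {:.2%}  F1 {:.2%}  Precision {:.2%}  Recall {:.2%}'.format(
            name, roc_auc_score(y, y_proba), accuracy_score(y, y_pred),
            f1_score(y, y_pred), precision_score(y, y_pred), recall_score(y, y_pred)))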

The four ensemble models (Random Forest, GBDT, XGBoost, LightGBM) and logistic regression clearly perform best. Random Forest, GBDT, logistic regression and LightGBM are therefore used as base models, with XGBoost as the second-layer model. Default parameters are used for now.

rnd_clf = RandomForestClassifier(random_state=2018)
gbdt = GradientBoostingClassifier(random_state=2018)
xgb = XGBClassifier(random_state=2018)
lgbm = LGBMClassifier(random_state=2018)
log = LogisticRegression(random_state=2018, max_iter=1000)
svc = SVC(random_state=2018, probability=True)        # only used in the seven-model comparison above
tree = DecisionTreeClassifier(random_state=2018)      # only used in the seven-model comparison above
base_models = [rnd_clf, gbdt, lgbm, log]              # first-layer (base) models
next_train, next_test = get_stacking_data(base_models, X_train, y_train, X_test, y_test, k=10)
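
next_train and next_test now contain one column per base model; a quick sanity check:

[next_train.shape, next_test.shape]
# expected: [(3133, 4), (1343, 4)]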

Train and evaluate the fused model

stacking_model = XGBClassifier(random_state=2018)
# Fit the second-layer model on the stacked out-of-fold training features
# (the original code mistakenly fit on next_test, y_test)
stacking_model.fit(next_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic',
       random_state=2018, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=1)
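
With the second-layer model fitted on next_train, its metrics can be computed the same way as the base models'; a sketch, assuming the evaluate helper above. The results appear as the last row of the table below:

# Train metrics use the stacked training features, test metrics the stacked test features
evaluate(stacking_model, next_train, y_train, next_test, y_test)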
Model                    AUC                            Accuracy                       F1-score                       Precision                      Recall
Random Forest            train: 99.92%; test: 75.32%    train: 98.44%; test: 77.29%    train: 96.78%; test: 42.99%    train: 94.36%; test: 33.24%    train: 99.33%; test: 60.85%
GBDT                     train: 88.12%; test: 79.54%    train: 84.14%; test: 79.08%    train: 59.03%; test: 49.19%    train: 45.90%; test: 39.31%    train: 82.68%; test: 65.70%
XGBoost                  train: 87.15%; test: 79.72%    train: 82.73%; test: 79.37%    train: 54.88%; test: 48.42%    train: 42.18%; test: 37.57%    train: 78.52%; test: 68.06%
LightGBM                 train: 99.61%; test: 78.80%    train: 96.17%; test: 78.26%    train: 91.77%; test: 47.67%    train: 85.77%; test: 38.44%    train: 98.67%; test: 62.74%
Logistic Regression      train: 76.45%; test: 78.48%    train: 78.90%; test: 78.18%    train: 39.19%; test: 39.59%    train: 27.31%; test: 27.75%    train: 69.38%; test: 69.06%
SVM                      train: 79.32%; test: 74.56%    train: 79.76%; test: 78.03%    train: 38.57%; test: 34.00%    train: 25.51%; test: 21.97%    train: 78.97%; test: 75.25%
Decision Tree            train: 100.00%; test: 62.97%   train: 100.00%; test: 70.66%   train: 100.00%; test: 45.28%   train: 100.00%; test: 47.11%   train: 100.00%; test: 43.58%
Fused (Stacking) model   train: 64.89%; test: 79.04%    train: 78.58%; test: 83.54%    train: 39.06%; test: 59.15%    train: 27.56%; test: 46.24%    train: 66.98%; test: 82.05%

ROC curves:
[Figure 2: ROC curves of the fused model]
Comparing the ROC curves side by side:

[Figure 3: ROC curves on the training set]  [Figure 4: ROC curves on the test set]
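
A minimal sketch of how such a test-set ROC curve can be drawn with the roc_curve import at the top (plotting choices are my own):

# Positive-class scores for the fused model on the stacked test features
y_score = stacking_model.predict_proba(next_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_score)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label='Stacking (test AUC = {:.2%})'.format(roc_auc_score(y_test, y_score)))
plt.plot([0, 1], [0, 1], 'k--')  # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()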

Summary

Compared with the other models, the fused model improves on every metric except AUC, and it does so with default parameters throughout. After parameter tuning, the results should improve further.
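
GridSearchCV is already imported at the top; a hedged sketch of what tuning the second-layer XGBoost could look like (the parameter grid is only an example, not from the original post):

# Example grid only; values are assumptions
param_grid = {
    'max_depth': [2, 3, 4],
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.05, 0.1],
}
grid = GridSearchCV(XGBClassifier(random_state=2018), param_grid,
                    scoring='roc_auc', cv=5)
grid.fit(next_train, y_train)
grid.best_params_, grid.best_score_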

References

[1] A detailed explanation of the stacking process: https://blog.csdn.net/wstcjf/article/details/77989963
