Home Credit Default Risk (Kaggle competition), advanced part, LB 0.792


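Using only the application_train dataset, the AUC can reach 0.76381. With the other two groups of datasets added, the AUC can reach about 0.8, so those two groups contribute roughly 0.04 AUC. This post analyzes those two groups of data and puts them to use.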

The Bureau (credit bureau) data mainly covers credit status, loan timing, loan purpose, total amount, and so on.
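Because one applicant can have many Bureau records, these fields usually have to be collapsed to one row per applicant before they can be joined to the application table. The following is a minimal sketch of that idea, not the method used in this post. It assumes the column names `SK_ID_CURR`, `CREDIT_ACTIVE`, `DAYS_CREDIT` and `AMT_CREDIT_SUM` from the competition's bureau.csv; the toy rows, the `aggregate_bureau` helper and the `BUREAU_*` feature names are made up for illustration.

```python
import pandas as pd

# Toy stand-in for bureau.csv; all values are invented for illustration.
bureau = pd.DataFrame({
    "SK_ID_CURR": [100001, 100001, 100002],
    "CREDIT_ACTIVE": ["Active", "Closed", "Active"],
    "DAYS_CREDIT": [-200, -1500, -30],          # days before application
    "AMT_CREDIT_SUM": [50000.0, 120000.0, 30000.0],
})

# Toy stand-in for the application table (100003 has no bureau history).
application = pd.DataFrame({"SK_ID_CURR": [100001, 100002, 100003]})


def aggregate_bureau(bureau: pd.DataFrame) -> pd.DataFrame:
    """Collapse bureau records to one row of features per applicant."""
    features = bureau.groupby("SK_ID_CURR").agg(
        BUREAU_LOAN_COUNT=("DAYS_CREDIT", "size"),
        BUREAU_DAYS_CREDIT_MAX=("DAYS_CREDIT", "max"),
        BUREAU_AMT_CREDIT_SUM_TOTAL=("AMT_CREDIT_SUM", "sum"),
    )
    # Share of each applicant's bureau loans that are still active.
    is_active = bureau["CREDIT_ACTIVE"] == "Active"
    features["BUREAU_ACTIVE_RATIO"] = is_active.groupby(bureau["SK_ID_CURR"]).mean()
    return features.reset_index()


bureau_features = aggregate_bureau(bureau)

# Left join keeps applicants without bureau history; their features are NaN.
merged = application.merge(bureau_features, on="SK_ID_CURR", how="left")
print(merged)
```

A left join is used so that applicants with no Bureau history stay in the table; tree-based models such as LightGBM can typically handle the resulting missing values directly.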

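Home Credit's (捷信) product ladder works as follows.

  • First comes POS, the entry-level product, similar to a consumer loan. It can only be used for purchases, the amount is limited, the risk is the lowest, and the customer provides the least information.
  • Next is the credit card, which this competition calls a revolving loan: a revolving credit line used mainly for purchases.
  • Last is the cash loan, where the customer receives cash; it carries the highest risk.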

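Application train and test contain only credit card and cash loan data. Perhaps POS is outside the prediction scope because its entry barrier is too low, its risk too small, and its usable information the most limited. The historical data contains all three product types (POS, Credit Card, Cash Loans), with Credit Card being the least numerous.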
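A quick way to check which product types appear in a table is to count the values of its contract-type column. The sketch below uses only the standard library and works on any iterable of labels, including a pandas column. It assumes the labels are stored in a column such as `NAME_CONTRACT_TYPE`, with values like "Cash loans", "Revolving loans" and "Consumer loans"; the toy lists and the `contract_type_share` helper are made up for illustration and do not reflect the real proportions.

```python
from collections import Counter


def contract_type_share(contract_types):
    """Return {contract type: fraction of rows}, most common first."""
    counts = Counter(contract_types)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.most_common()}


# Toy labels standing in for the contract-type column of each table.
application_types = ["Cash loans", "Cash loans", "Revolving loans"]
history_types = ["Consumer loans", "Consumer loans", "Cash loans", "Revolving loans"]

application_share = contract_type_share(application_types)
history_share = contract_type_share(history_types)

print("application:", application_share)
print("history:", history_share)
```

With the real files, the same function could be called on the contract-type column of each table, for example `contract_type_share(previous["NAME_CONTRACT_TYPE"])`, where `previous` is a hypothetical DataFrame loaded from the historical applications.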

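The datasets above are split in nearly the same proportions: in each of them, about 85% of the data belongs to the training set and about 15% to the test set. In addition, the POS and credit card data do not overlap.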


    length of train is 307511 the percent is 86.32 %
    length of test is 48744 the percent is 13.68 %
    
    length of bureau is 305811
    intersection with train is 263491 the percent is 86.16 %
    intersection with test is 42320 the percent is 13.84 %
    
    length of previous is 338857
    intersection with train is 291057 the percent is 85.89 %
    intersection with test is 47800 the percent is 14.11 %
    
    length of POS is 337252
    intersection with train is 289444 the percent is 85.82 %
    intersection with test is 47808 the percent is 14.18 %
    
    length of credit is 103558
    intersection with train is 86905 the percent is 83.92 %
    intersection with test is 16653 the percent is 16.08 %
    
    length of installment is 339587
    intersection with train is 291643 the percent is 85.88 %
    intersection with test is 47944 the percent is 14.12 %
    intersection between credit and POS is 0
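Counts like those above could be produced with a sketch along the following lines. It is not the author's original script: the `coverage_report` helper and the toy IDs are made up for illustration, and with the real files the IDs would presumably come from a key column such as `SK_ID_CURR` (an assumption). Percentages are taken relative to the number of unique IDs in the dataset, which matches the figures shown.

```python
def coverage_report(name, ids, train_ids, test_ids):
    """Print how many unique IDs of a dataset fall in train and in test.

    Returns (n_in_train, n_in_test).
    """
    ids = set(ids)
    in_train = ids & set(train_ids)
    in_test = ids & set(test_ids)
    print(f"length of {name} is {len(ids)}")
    for label, part in (("train", in_train), ("test", in_test)):
        pct = 100.0 * len(part) / len(ids) if ids else 0.0
        print(f"intersection with {label} is {len(part)} the percent is {pct:.2f} %")
    return len(in_train), len(in_test)


# Toy IDs; duplicates mimic several history rows for the same applicant.
train_ids = {1, 2, 3, 4}
test_ids = {5, 6}
bureau_ids = [1, 1, 2, 5]

result = coverage_report("bureau", bureau_ids, train_ids, test_ids)

# Overlap between two history tables can be checked the same way.
pos_ids, credit_ids = {10, 11}, {12}
overlap = len(set(pos_ids) & set(credit_ids))
print(f"intersection between credit and POS is {overlap}")
```

Converting to sets first removes repeated IDs, so the reported length is the number of distinct IDs rather than the number of rows.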
