This article uses the 2017 Q2 slice of the public data from the Lending Club website. The data consists of loan-applicant information such as age, gender, marital status, education, loan amount, and assets (independent variables), plus the loan repayment status (dependent variable). (The 2017 data is used so the results can easily be compared with other people's.)
The task is to predict, from an applicant's past behavior and attributes, whether they will become delinquent. The pipeline: handle missing values; WOE-encode the raw variables; filter variables by IV, then correlation, then significance; use SMOTE to address class imbalance; fit a logistic regression for the binary classification (does the applicant default?); and finally compute a score for each sample (similar to a Zhima Credit score, for easier business use).
Final results: auc=0.953, ks=0.802, accuracy_score=0.938.
Full code
1. Data Download and Loading
Baidu Netdisk: (netdisk link) password: let1
First, take a quick look at the data.
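The loading step itself is not shown in the post; a minimal sketch, assuming the file from the netdisk is the raw Lending Club CSV (the filename and the one-line preamble to skip are assumptions):
import pandas as pd
import numpy as np

df = pd.read_csv('LoanStats_2017Q2.csv', skiprows=1, low_memory=False)  # assumed filename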
In[1]: df.head()
Out[1]:
loan_amnt funded_amnt ... total_bc_limit total_il_high_credit_limit
0 7500.0 7500.0 ... 35000.0 92511.0
1 20000.0 20000.0 ... 22900.0 42517.0
2 12000.0 12000.0 ... 9200.0 30780.0
3 6025.0 6025.0 ... 17600.0 0.0
4 4000.0 4000.0 ... 5000.0 15523.0
Check the DataFrame's info:
[In]: df.info()
[Out]:
RangeIndex: 105453 entries, 0 to 105452
Columns: 145 entries, id to settlement_term
dtypes: float64(107), object(38)
memory usage: 116.7+ MB
The data has 105,453 rows and 145 columns; 107 columns are float64 (numeric variables) and 38 are object type.
2. Data Preprocessing
2.1 Mapping the dependent variable
In this dataset the dependent variable is the loan_status column, but its values are not binary:
In[46]: df.loan_status.value_counts() # 7 categories
Out[46]:
Current 77347
Fully Paid 19652
Charged Off 4519
Late (31-120 days) 2089
In Grace Period 1083
Late (16-30 days) 598
Default 163
Name: loan_status, dtype: int64
The business meanings:
Fully Paid: loan settled; Current: payments up to date;
Charged Off: written off as bad debt; Late (31-120 days): 31-120 days overdue; Late (16-30 days): 16-30 days overdue;
In Grace Period: overdue but within the grace period; Default: more than 90 days overdue.
Only Fully Paid and Current count as not delinquent, so map the 7 values to {0, 1}:
d = {'Current':0,
'Fully Paid':0,
'Charged Off':1,
'Late (31-120 days)':1,
'Late (16-30 days)':1,
'In Grace Period':1,
'Default':1}
df.loan_status = df.loan_status.map(d)
df = df[df['loan_status'].notnull()]
Now look at the loan_status column again:
In: df['loan_status'].value_counts(normalize=True)
Out:
0 0.919849
1 0.080151
Name: loan_status, dtype: float64
The mapping is complete. Note the class imbalance between non-delinquent (0) and delinquent (1); two remedies are available when modeling: 1. use class weights in the loss function; 2. use SMOTE to oversample the minority class, or repeatedly undersample the majority class and bag the resulting classifiers. A sketch of option 1 follows.
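A one-line sketch of the class-weight option, for reference only (this article goes the SMOTE route in section 5.3):
# reweight the loss inversely to class frequency
from sklearn.linear_model import LogisticRegression
lr_weighted = LogisticRegression(class_weight='balanced')  # hypothetical alternative, not used below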
2.2 Handling missing values
Our dependent variable is loan_status; this column contains 2 nulls, so drop those rows:
df.loan_status.isnull().sum() # out = 2
df = df[df.loan_status.notnull()]
Columns with a missing rate of 50% or more rarely carry useful information; drop them outright:
miss_large_col = \
[k for k,v in dict(df.isnull().sum()/df.shape[0]).items() if v>=0.5]
df = df.drop(miss_large_col,axis=1)
miss_large_col contains 42 columns; after dropping them the data shape is (105451, 103). Check the current missing rates:
In[53]: (df.isnull().sum() / df.shape[0]).sort_values(ascending=False)
Out[53]:
mths_since_last_delinq 0.484765
next_pymnt_d 0.229215
il_util 0.126884
mths_since_recent_inq 0.113313
emp_title 0.064314
emp_length 0.063508
num_tl_120dpd_2m 0.050052
mo_sin_old_il_acct 0.025149
mths_since_rcnt_il 0.025149
bc_util 0.011294
percent_bc_gt_75 0.010868
bc_open_to_buy 0.010839
mths_since_recent_bc 0.010270
last_pymnt_d 0.001157
revol_util 0.000711
dti 0.000711
all_util 0.000123
avg_cur_bal 0.000019
out_prncp 0.000000
total_acc 0.000000
initial_list_status 0.000000
Quite a few columns still contain missing values. mths_since_last_delinq is under the 0.5 threshold but still missing 48% of its values, so drop it as well:
df = df.drop(['mths_since_last_delinq'], axis=1)
Next, iterate over the columns and find those in which a single value accounts for 95% or more of the samples:
In[58]:
tmp_list = []
for x in df.drop(['loan_status'],axis=1).columns:
    if df[x].value_counts(normalize=True).iloc[0] >= 0.95:
        tmp_list.append((x, df[x].value_counts(normalize=True).iloc[0]))
tmp_list
Out[58]:
[('pymnt_plan', 0.9995637784373785),
('total_rec_late_fee', 0.9741111985661587),
('recoveries', 0.9834235806203829),
('collection_recovery_fee', 0.9834425467752795),
('collections_12_mths_ex_med', 0.9786156603540981),
('policy_code', 1.0),
('acc_now_delinq', 0.9948127566357834),
('chargeoff_within_12_mths', 0.9915979933808119),
('delinq_amnt', 0.9955714028316469),
('num_tl_120dpd_2m', 0.9990516406616553),
('num_tl_30dpd', 0.9965860921186144),
('tax_liens', 0.9542346682345355),
('hardship_flag', 0.9993835999658609),
('disbursement_method', 0.9998198215284825),
('debt_settlement_flag', 0.9971740429204086)]
If an attribute takes the same value for every sample, it cannot possibly affect delinquency, so drop the columns whose most frequent value covers more than 95% of the samples:
not_col = []
for x in df.drop(['loan_status'],axis=1).columns:
    if df[x].value_counts(normalize=True).iloc[0] >= 0.95:
        not_col.append(x)
df = df.drop(not_col,axis=1)
print(df.shape[1]) # out = 88
88 columns remain; inspect them:
df.dtypes.sort_values()
Of these 88 columns, loan_status is int64; the following 20 are object type: 'sub_grade', 'grade', 'initial_list_status', 'int_rate', 'term', 'emp_title', 'application_type', 'emp_length', 'issue_d', 'last_credit_pull_d', 'verification_status', 'purpose', 'title', 'zip_code', 'addr_state', 'next_pymnt_d', 'last_pymnt_d', 'revol_util', 'home_ownership', 'earliest_cr_line'; the remaining 67 are float64.
Inspect these 20 object columns:
In[68]:
object_col = list(df.select_dtypes(include=['O']).columns)
df.loc[:,object_col].describe().T
Out[68]:
count unique top freq
term 105451 2 36 months 77105
int_rate 105451 65 16.02% 4956
grade 105451 7 C 36880
sub_grade 105451 35 C1 8088
emp_title 98669 38551 Teacher 1999
emp_length 98754 11 10+ years 35438
home_ownership 105451 5 MORTGAGE 52502
verification_status 105451 3 Source Verified 42033
issue_d 105451 3 Jun-2017 38087
purpose 105451 13 debt_consolidation 58557
title 105451 12 Debt consolidation 58564
zip_code 105451 851 112xx 1100
addr_state 105451 49 CA 13751
earliest_cr_line 105451 627 Sep-2004 892
revol_util 105376 1076 0% 468
initial_list_status 105451 2 w 79488
last_pymnt_d 105329 16 Jun-2018 54794
next_pymnt_d 81280 2 Jul-2018 56176
last_credit_pull_d 105451 17 Jun-2018 84157
application_type 105451 2 Individual 98638
Drop the two categorical variables with more than 100 unique values, emp_title and zip_code. int_rate is a percentage that was read as text because of the % sign, so convert it to a number. sub_grade is a finer-grained breakdown of the grade credit rating and overlaps with grade, so drop it for now (sub_grade might actually work better than grade; this needs experimenting). emp_length can be mapped to a numeric number of years. addr_state has no bearing on repayment ability, so drop it. earliest_cr_line, last_pymnt_d, next_pymnt_d, and last_credit_pull_d are dates; they can be converted to time intervals from the current date. revol_util can be converted to a numeric variable. The operations:
df = df.drop(['emp_title', 'zip_code', 'sub_grade', 'addr_state'], axis=1)
df['revol_util'] = df['revol_util']\
.map(lambda x: float(x.split('%')[0])/100 if not pd.isnull(x) else x)
df['int_rate'] = df['int_rate']\
.map(lambda x: float(x.split('%')[0])/100 if not pd.isnull(x) else x)
df['emp_length'].unique()
d = {'10+ years':10, '< 1 year':0, '7 years':7,'2 years':2, '1 year':1,
'3 years':3, '9 years':9, '8 years':8, '5 years':5, '6 years':6, '4 years':4}
df['emp_length'] = df['emp_length'].map(d)
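The date-to-interval conversion described above is not shown in the posted code, but last_pymnt_d later appears among the numeric variables, so the date columns were presumably converted to numbers. A minimal sketch, assuming "months elapsed" relative to a fixed reference date (both the reference date and the 30-day month are assumptions):
date_cols = ['earliest_cr_line', 'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d']
ref = pd.Timestamp('2018-07-01')  # assumed "current" date for this snapshot
for c in date_cols:
    dt = pd.to_datetime(df[c], format='%b-%Y')  # values look like 'Jun-2017'
    df[c] = (ref - dt).dt.days / 30.0           # months elapsed; NaN stays NaN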
Look at the object columns again after these transformations:
object_col = list(df.select_dtypes(include=['O']).columns)
object_col
df.loc[:,object_col].describe().T
# check each remaining column in turn
for ob in object_col:
    print(ob, dict(df[ob].value_counts(normalize=True)))
Then plot, for each column, a bar chart grouped by loan_status.
In home_ownership the distribution is {'MORTGAGE': 0.50, 'RENT': 0.39, 'OWN': 0.11, 'ANY': ~4.7e-05, 'NONE': ~1.9e-05}. 'ANY' and 'NONE' are far too rare, so replace them with the most frequent value, MORTGAGE:
df.loc[df.home_ownership.isin(['ANY', 'NONE']), 'home_ownership'] = 'MORTGAGE'
for i in object_col:
    pvt = pd.pivot_table(df[['loan_status',i]],index=i,columns="loan_status",aggfunc=len)
    pvt.plot(kind="bar")
Only a few columns are shown here. The bar chart of term against loan_status below shows that most borrowers choose 36 months, and that 36-month borrowers have a lower default rate.
For grade, the effect of each rating on the default rate is not obvious from the chart. To measure whether such a variable affects loan_status we need WOE encoding, a standard tool in credit scorecard models (details in this article [TODO]).
2.3 Filling missing values
Next, fill the remaining missing values. The strategy: for numeric variables, if the missing rate exceeds 0.05, fill with -999 so that missingness becomes its own feature value; otherwise fill with the median. The categorical variables in this project have no missing values; if they did, they could be filled with a new category or the most frequent value.
rate = dict(df.isnull().sum()/df.shape[0])
rate
# 1. numeric columns: replace NaN with -999 when the missing rate >= 0.05, else use the median
# 2. categorical columns: none are missing in this dataset
cate_col = list(df.select_dtypes(include=['O']).columns)  # 4 columns
num_col = [x for x in df.columns if x not in cate_col and x!='loan_status']  # 57 columns
d1 = [k for k,v in rate.items() if k in num_col and v>=0.05]
for i in d1:
    df[i] = df[i].fillna(-999)
d2 = [x for x in num_col if x not in d1]
for i in d2:
    df[i] = df[i].fillna(df[i].median())
df.loc[:,cate_col].isnull().sum()  # no missing values among the categorical columns
3. WOE Encoding
3.1 WOE encoding for categorical variables
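As a reference for the code below: for each bin (here, each category) i, WOE_i = ln(badattr_i / goodattr_i), where badattr_i = bad_i / bad_total and goodattr_i = good_i / good_total; the information value is IV = Σ_i (badattr_i − goodattr_i) × WOE_i. These correspond exactly to the badattr, goodattr, woe, and bin_iv columns computed by binning_cate.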
def binning_cate(df,col,target):
    total = df[target].count()
    bad = df[target].sum()
    good = total-bad
    group = df.groupby([col],as_index=True)
    bin_df = pd.DataFrame()
    bin_df['total'] = group[target].count()
    bin_df['totalrate'] = bin_df['total']/total
    bin_df['bad'] = group[target].sum()
    bin_df['badrate'] = bin_df['bad']/bin_df['total']
    bin_df['good'] = bin_df['total'] - bin_df['bad']
    bin_df['goodrate'] = bin_df['good']/bin_df['total']
    bin_df['badattr'] = bin_df['bad']/bad
    bin_df['goodattr'] = (bin_df['total']-bin_df['bad'])/good
    bin_df['woe'] = np.log(bin_df['badattr']/bin_df['goodattr'])
    bin_df['bin_iv'] = (bin_df['badattr']-bin_df['goodattr'])*bin_df['woe']
    bin_df['iv'] = bin_df['bin_iv'].sum()
    return bin_df
cate_bin_df_list = []
for col in cate_col:
    bin_df = binning_cate(df, col, 'loan_status')
    cate_bin_df_list.append(bin_df)
# store each categorical variable's name and IV value
cate_iv_df = pd.DataFrame({'col':cate_col, 'iv':[x['iv'].iloc[0] for x in cate_bin_df_list]}).sort_values('iv',ascending=False).reset_index(drop=True)
cate_iv_df
The result:
Out[168]:
col iv
0 purpose inf
1 grade 0.476388
2 verification_status 0.083826
3 initial_list_status 0.022144
4 title 0.018638
5 home_ownership 0.017939
6 term 0.016072
7 issue_d 0.005004
8 application_type 0.000880
purpose has an IV of positive infinity, which is clearly unreasonable. This happens when some category of purpose has too few samples: if a category contains no good (or no bad) loans, one of badattr/goodattr is zero and the WOE log-ratio diverges. Looking at the value distribution of this column below, wedding has only one sample, so drop that row and recompute, as sketched after the output.
df['purpose'].value_counts()
Out[169]:
debt_consolidation 58557
credit_card 21261
home_improvement 9222
other 7140
major_purchase 2616
medical 1648
car 1334
vacation 1170
small_business 1034
moving 945
house 453
renewable_energy 70
wedding 1
Name: purpose, dtype: int64
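Drop the row and recompute (the recomputation is not shown in the original; a quick check):
df = df.loc[df.purpose != 'wedding']
binning_cate(df, 'purpose', 'loan_status')['iv'].iloc[0]  # should now be finite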
3.2 WOE encoding for numeric variables
For numeric variables, the approach is: take a single variable, say last_pymnt_d, fit a one-variable decision tree of it against loan_status, and use the tree's split thresholds as the bin edges.
In[181]: # bin numeric variables with a single-variable decision tree
from sklearn.tree import DecisionTreeClassifier, _tree

def tree_split(df,col,target,max_bin,min_binpct,nan_value):
    missing_rate = df[df[col]==nan_value].shape[0]/df.shape[0]
    if missing_rate < 0.05:
        x = np.array(df[col]).reshape(-1,1)
        y = np.array(df[target])
        tree = DecisionTreeClassifier(max_leaf_nodes=max_bin,min_samples_leaf=min_binpct)
        tree.fit(x,y)
        threshold = tree.tree_.threshold
        threshold = threshold[threshold!=_tree.TREE_UNDEFINED]
        split_list = sorted(threshold.tolist())
    else:
        # give the -999 placeholder its own bin and tree-split the rest
        x = np.array(df[df[col]!=nan_value][col]).reshape(-1,1)
        y = np.array(df[df[col]!=nan_value][target])
        tree = DecisionTreeClassifier(max_leaf_nodes=max_bin-1,min_samples_leaf=min_binpct)
        tree.fit(x,y)
        threshold = tree.tree_.threshold
        threshold = threshold[threshold!=_tree.TREE_UNDEFINED]
        split_list = sorted(threshold.tolist())
        split_list.insert(0,nan_value)
    return split_list
# bin a numeric feature and compute per-bin WOE and IV
def binning_num(df,col,target,cut):
    total = df[target].count()
    bad = df[target].sum()
    good = total-bad
    bucket = pd.cut(df[col],cut)
    group = df.groupby(bucket)
    bin_df = pd.DataFrame()
    bin_df['total'] = group[target].count()
    bin_df['totalrate'] = bin_df['total']/total
    bin_df['bad'] = group[target].sum()
    bin_df['badrate'] = bin_df['bad']/bin_df['total']
    bin_df['good'] = bin_df['total'] - bin_df['bad']
    bin_df['goodrate'] = bin_df['good']/bin_df['total']
    bin_df['badattr'] = bin_df['bad']/bad
    bin_df['goodattr'] = (bin_df['total']-bin_df['bad'])/good
    bin_df['woe'] = np.log(bin_df['badattr']/bin_df['goodattr'])
    bin_df['bin_iv'] = (bin_df['badattr']-bin_df['goodattr'])*bin_df['woe']
    bin_df['iv'] = bin_df['bin_iv'].sum()
    return bin_df
num_dict = {}
for col in num_col:
    split_list = tree_split(df,col,'loan_status',5,0.05,-999)
    split_list.insert(0,float('-inf'))
    split_list.append(float('inf'))
    bin_df = binning_num(df,col,'loan_status',split_list)
    num_dict.setdefault(col,{})
    num_dict[col]['bin_df'] = bin_df
    num_dict[col]['cut'] = split_list
num_iv_df = pd.DataFrame({'col':num_col,'iv':[num_dict[x]['bin_df']['iv'].iloc[0] for x in num_col]})\
    .sort_values('iv',ascending=False).reset_index(drop=True)
num_iv_df.head()
Out[181]:
col iv
0 last_pymnt_d 2.059883
1 total_rec_prncp 1.171917
2 last_pymnt_amnt 0.687479
3 out_prncp 0.567522
4 out_prncp_inv 0.567459
4. Variable Selection
4.1 Filtering by IV
Based on business experience, set the threshold at 0.03 and keep variables whose IV exceeds it; this leaves 32 numeric variables and 2 categorical variables.
# keep variables with IV > 0.03
iv_select_num_col = list(num_iv_df[num_iv_df.iv>0.03]['col'])
select_num_dict = {k:v for k,v in num_dict.items() if k in iv_select_num_col}
len(iv_select_num_col)
iv_select_cate_col = list(cate_iv_df[cate_iv_df.iv>0.03]['col'])
len(iv_select_cate_col)
iv_select_df = pd.concat([num_iv_df[num_iv_df.iv>0.03],cate_iv_df[cate_iv_df.iv>0.03]],axis=0).\
sort_values('iv',ascending=False).reset_index(drop=True)
df2 = df.loc[:,iv_select_num_col+iv_select_cate_col+['loan_status']]
df2.shape
4.2 Converting raw variables into WOE variables
for col in iv_select_num_col:
    woe_list = list(select_num_dict[col]['bin_df']['woe'])
    cut = select_num_dict[col]['cut']
    df2[col+'_woe'] = pd.cut(df2[col], bins=cut, labels=woe_list)
for col in iv_select_cate_col:
    woe_dict = dict([x for x in cate_bin_df_list if x.index.name==col][0]['woe'])
    df2[col+'_woe'] = df2[col].map(woe_dict)
df2.head()
df2_woe = df2.loc[:, [x for x in df2.columns if x.find('woe')>0]+['loan_status']]
df2_woe.head()
for col in df2_woe.columns:
    df2_woe[col] = df2_woe[col].astype('float64')
There are now 35 columns: 34 independent variables plus the dependent variable.
4.3 Forward stepwise filtering by correlation coefficient
Start from the first (highest-IV) variable and add one variable at a time, dropping any new variable whose absolute correlation with the variables already kept reaches 0.65:
# remove multicollinearity by pairwise correlation
def forward_corr_delete(data,col_list):
    corr_list = []
    corr_list.append(col_list[0])
    delete_col = []
    for col in col_list[1:]:
        corr_list.append(col)
        corr = data.loc[:,corr_list].corr()
        corr_tup = [(k,v) for k,v in zip(corr[col].index,corr[col].values)]
        corr_value = [v for k,v in corr_tup if k!=col]
        if len([x for x in corr_value if abs(x)>=0.65])>0:
            delete_col.append(col)
    select_corr_col = [x for x in col_list if x not in delete_col]
    return select_corr_col
corr_col = [x+'_woe' for x in iv_select_df.col]
select_corr_col = forward_corr_delete(df2_woe,corr_col)
len(select_corr_col)
df2_woe2 = df2_woe.loc[:,select_corr_col+['loan_status']]
df2_woe2.head()
This filtering leaves 17 variables.
4.4 Removing multicollinearity with the variance inflation factor (VIF)
No multicollinearity is found at this step.
# remove collinearity by variance inflation factor
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_delete(df,list_corr):
    col_list = list_corr.copy()
    vif_matrix = np.matrix(df[col_list])
    vifs_list = [variance_inflation_factor(vif_matrix,i) for i in range(vif_matrix.shape[1])]
    vif_high = [x for x,y in zip(col_list,vifs_list) if y>10]
    if len(vif_high)>0:
        for col in reversed(vif_high):
            col_list.remove(col)
            vif_matrix = np.matrix(df[col_list])
            vifs = [variance_inflation_factor(vif_matrix,i) for i in range(vif_matrix.shape[1])]
            if len([x for x in vifs if x>10])==0:
                break
    return col_list
vif_select_col = vif_delete(df2_woe2,select_corr_col)
len(vif_select_col)
4.5 Filtering by significance
Use the statsmodels package to test significance via p-values; the inq_fi_woe variable is removed.
# significance filtering by p-value
import statsmodels.api as sm

def forward_pvalue_delete(x,y):
    col_list = x.columns.tolist()
    pvalues_col = []
    for col in col_list:
        pvalues_col.append(col)
        x_const = sm.add_constant(x.loc[:,pvalues_col])
        sm_lr = sm.Logit(y,x_const)
        sm_lr = sm_lr.fit()
        pvalue = sm_lr.pvalues[col]
        if pvalue >= 0.05:
            pvalues_col.remove(col)
    return pvalues_col
# split into the feature set X and the label set Y
x = df2_woe2.drop(['loan_status'],axis=1)
y = df2_woe2['loan_status']
# run the significance filter
pvalues_col = forward_pvalue_delete(x,y)
df2_woe3 = df2_woe2.loc[:, pvalues_col+['loan_status']]
5. Modeling
Use sklearn's logistic regression model as the classifier.
5.1 Baseline model with default hyperparameters
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

x2 = df2_woe3.drop(['loan_status'],axis=1)
y2 = df2_woe3['loan_status']
x_train,x_test,y_train,y_test = train_test_split(x2,y2,test_size=0.2,random_state=2020)
lr_model = LogisticRegression().fit(x_train,y_train)
y_pred = lr_model.predict_proba(x_test)[:,1]  # predicted delinquency probability (used by the plots below)
Evaluate the model trained with default hyperparameters using AUC, KS, sensitivity, specificity, and precision.
# plot the ROC curve
from sklearn import metrics
import matplotlib.pyplot as plt

def plot_roc(y_label,y_pred):
    fpr,tpr,threshold = metrics.roc_curve(y_label,y_pred)  # roc_curve returns (fpr, tpr, thresholds)
    AUC = metrics.roc_auc_score(y_label,y_pred)
    fig = plt.figure(figsize=(6,4))
    ax = fig.add_subplot(1,1,1)
    ax.plot(fpr,tpr,color='blue',label='AUC=%.3f'%AUC)
    ax.plot([0,1],[0,1],'r--')
    ax.set_xlim(0,1)
    ax.set_ylim(0,1)
    ax.set_title('ROC')
    ax.legend(loc='best')
    return plt.show()
# plot the KS curve
def plot_model_ks(y_label,y_pred):
    pred_list = list(y_pred)
    label_list = list(y_label)
    total_bad = sum(label_list)
    total_good = len(label_list)-total_bad
    items = sorted(zip(pred_list,label_list),key=lambda x:x[0])
    step = (max(pred_list)-min(pred_list))/200
    pred_bin = []
    good_rate = []
    bad_rate = []
    ks_list = []
    for i in range(1,201):
        idx = min(pred_list)+i*step
        pred_bin.append(idx)
        label_bin = [x[1] for x in items if x[0] <= idx]  # samples scored below the threshold
        bad_num = sum(label_bin)
        good_num = len(label_bin)-bad_num
        badrate = bad_num/total_bad       # cumulative bad rate
        goodrate = good_num/total_good    # cumulative good rate
        good_rate.append(goodrate)
        bad_rate.append(badrate)
        ks_list.append(abs(goodrate-badrate))
    fig = plt.figure(figsize=(6,4))
    ax = fig.add_subplot(1,1,1)
    ax.plot(pred_bin,good_rate,color='green',label='good_rate')
    ax.plot(pred_bin,bad_rate,color='red',label='bad_rate')
    ax.plot(pred_bin,ks_list,color='blue',label='good-bad')
    ax.set_title('KS=%.3f'%max(ks_list))
    ax.legend(loc='best')
    return plt.show()
With default hyperparameters, auc=0.950 and ks=0.798; the ROC and KS curves are shown below.
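The sensitivity, specificity, and precision mentioned above are not computed in the original post; a minimal sketch from the confusion matrix, assuming a 0.5 probability cutoff:
from sklearn.metrics import confusion_matrix, accuracy_score

y_hat = (y_pred >= 0.5).astype(int)  # assumed cutoff
tn, fp, fn, tp = confusion_matrix(y_test, y_hat).ravel()
print('accuracy:   ', accuracy_score(y_test, y_hat))
print('sensitivity:', tp/(tp+fn))  # recall on the delinquent class
print('specificity:', tn/(tn+fp))
print('precision:  ', tp/(tp+fp))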
5.2 Choosing hyperparameters with grid search and cross-validation
In[157]:
# grid search with cross-validation
from sklearn.model_selection import GridSearchCV        # grid search
from sklearn.linear_model import LogisticRegression     # logistic regression
from sklearn.model_selection import train_test_split    # train/test split
# build the parameter grid
param_test1 = {"C":[0.01,0.1,1.0,10.0,20.0,30.0,100.0,200.0,300.0,1000.0],  # inverse regularization strength
               "penalty":["l1","l2"],  # regularization type ('l1' needs solver='liblinear' or 'saga' on newer sklearn)
               "max_iter":[100,200,300,400,500]}  # maximum iterations for convergence
gsearch1 = GridSearchCV(LogisticRegression(),param_grid=param_test1,cv=10)
gsearch1.fit(x_train,y_train)  # fit on the training set
gsearch1.best_params_, gsearch1.best_score_  # best parameter combination and its score
Out[157]:
({'C': 10.0, 'max_iter': 100, 'penalty': 'l2'}, 0.9728544333807492)
The best parameters are C=10.0, max_iter=100, penalty='l2' (the regularization term).
Retraining the classifier with these parameters does not noticeably improve AUC or KS: once logistic regression is fixed as the model for this project, changing a hyperparameter has little effect on the result.
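A minimal refit sketch with these parameters (the original does not show this step; lr_best is a hypothetical name):
lr_best = LogisticRegression(C=10.0, penalty='l2', max_iter=100).fit(x_train, y_train)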
5.3 Using SMOTE to address class imbalance
In the current data the delinquent class accounts for only about 8% of samples, which is somewhat imbalanced. Use SMOTE to oversample the minority class into a balanced training set and check whether the metrics improve. Note: SMOTE must be applied to the training set only.
In[237]: y.value_counts(normalize=True)
Out[237]:
0.0 0.919848
1.0 0.080152
# use SMOTE to fix the class imbalance
from imblearn.over_sampling import SMOTE  # SMOTE oversampling
smo = SMOTE(random_state=42)
x_train2, y_train2 = smo.fit_sample(x_train, y_train)  # fit_resample() on newer imblearn
print('After balancing the classes with SMOTE:')
n_sample = y_train2.shape[0]
n_pos_sample = y_train2[y_train2 == 0].shape[0]
n_neg_sample = y_train2[y_train2 == 1].shape[0]
print('samples: {}; class 0: {:.2%}; class 1: {:.2%}'.format(n_sample,
      n_pos_sample / n_sample,
      n_neg_sample / n_sample))
lr_model_smo = LogisticRegression().fit(x_train2,y_train2)
y_pred_smo = lr_model_smo.predict_proba(x_test)[:,1]
plot_roc(y_test,y_pred_smo)
plot_model_ks(y_test, y_pred_smo)
With SMOTE, auc=0.953 and ks=0.802, so balancing the training set does help in this project.
6. Scoring Each Sample
At this point each sample consists of its features plus a predicted delinquency probability. For better business interpretability, convert the probability into a credit score (similar to a Zhima Credit score).
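The scaling below follows the standard scorecard convention score = A − B·ln(odds): cal_scale(400, 999/1, 20, model) sets B = PDO/ln 2 = 20/ln 2 ≈ 28.85 and A = 400 + B·ln(999) ≈ 599.3, so a sample's total score is A − B·(intercept + Σ_i coef_i·WOE_i), i.e. base_score plus the per-variable contributions round(WOE × −B × coef). With PDO = 20, a doubling of the model's delinquency odds moves the score by exactly 20 points.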
# compute the base score
def cal_scale(score,odds,PDO,model):
    B = PDO/np.log(2)
    A = score+B*np.log(odds)
    base_score = A-B*model.intercept_[0]
    return A,B,base_score

A,B,base_score = cal_scale(400,999/1,20,lr_model)
# per-variable partial scores: WOE value x regression coefficient x -B
coe_dict = dict(zip(x_train.columns, lr_model.coef_[0]))  # coefficient of each WOE column (definition assumed; not in the original snippet)
x_test_score = x_test.copy()
for col in x_test_score.columns:
    col_coe = coe_dict[col]
    x_test_score[col.replace('woe','score')] = x_test_score[col].map(lambda x: round(x*-B*col_coe))
x_test_score['score'] = round(base_score)
for col in [x for x in x_test_score.columns if x.find('_score')>=0]:
    x_test_score['score'] += x_test_score[col]
x_test_score['label'] = list(y_test)
import seaborn as sns
sns.kdeplot(x_test_score[x_test_score['label']==1].score,shade=True,label='bad')
sns.kdeplot(x_test_score[x_test_score['label']==0].score,shade=True,label='good')
The plot shows that good and bad samples are well separated, but neither class follows a clean normal distribution, so the model still has limitations.
7. Other Models and Model Ensembling (TODO)
Logistic regression is widely used in this business for its parallelizability, fast training, and strong interpretability, but predicting delinquency is a classic machine-learning problem, so other models are naturally worth trying:
7.1 LightGBM
7.2 DNN
7.3 Model ensembling
Reference: https://zhuanlan.zhihu.com/p/152128764 (that article only reaches an AUC of about 0.67)