A Behavior Scorecard Model in Python

What Is a Behavior Scorecard?

 

  • Basic definition: a model that uses a borrower's post-origination repayment behavior to predict the probability of delinquency or default over a coming time window.
  • Usage scenario: after the loan is issued and before it matures, i.e., the mid-loan (account management) stage.
  • Purpose: to estimate the borrower's delinquency/default risk before the loan ends.

A worked example follows.

If some concepts in this post are unfamiliar, see the companion post 信用评分卡模型的建立 (on building a credit scorecard model).

Data and code download

About the data

 

  • Loan_Amount: total credit limit
  • OS: outstanding (unpaid) balance
  • Payment: repayment amount
  • Spend: amount spent
  • Delq: delinquency status
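Each of the monthly fields is recorded once per month across the 12-month observation window, so the raw table carries columns such as Delq1_1 … Delq1_12 (with Delq2_* and Delq3_* for the deeper delinquency states), Spend_1 … Spend_12, Payment_1 … Payment_12 and OS_0 … OS_11, plus Loan_Amount and the target column label. As a minimal loading sketch (the file names here are placeholders, not the actual names in the archive; use the ones from the download above):

import pandas as pd

# hypothetical file names -- substitute those in the downloaded archive
trainData = pd.read_csv('trainData.csv')
testData = pd.read_csv('testData.csv')

# sanity-check the monthly column layout assumed by the feature code below
assert 'Loan_Amount' in trainData.columns and 'label' in trainData.columns
assert all('Delq1_%d' % t in trainData.columns for t in range(1, 13))
assert all('Spend_%d' % t in trainData.columns for t in range(1, 13))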

Step 1: Feature engineering

Since the data has already been through an initial cleaning pass, the feature-engineering work here is mainly variable derivation.

The 12 months of observation-window data are sliced into 1-, 3-, 6- and 12-month windows, and the following features are defined:

 

  • Delinquency features

    In a behavior scorecard (predicting default behavior), delinquency features are usually highly significant variables.

    Here we take: the maximum delinquency status, plus the counts of M0, M1 and M2 delinquencies.

    Define the delinquency feature function and generate the features:

 
allFeatures = []   # running list of derived feature names

def DelqFeatures(event, window, type):
    '''
    event: one account (a row of the data)
    window: look-back window in months; this model uses [1, 3, 6, 12]
    type: feature type ('max delq', 'M0 times', 'M1 times', 'M2 times')
    '''
    current = 12
    start = 12 - window + 1
    # Delq1 / Delq2 / Delq3 are the monthly M0 / M1 / M2 delinquency indicators
    delq1 = [event[a] for a in ['Delq1_' + str(t) for t in range(current, start - 1, -1)]]
    delq2 = [event[a] for a in ['Delq2_' + str(t) for t in range(current, start - 1, -1)]]
    delq3 = [event[a] for a in ['Delq3_' + str(t) for t in range(current, start - 1, -1)]]
    if type == 'max delq':
        if max(delq3) == 1:
            return 3
        elif max(delq2) == 1:
            return 2
        elif max(delq1) == 1:
            return 1
        else:
            return 0
    if type in ['M0 times', 'M1 times', 'M2 times']:
        if type.find('M0') > -1:
            return sum(delq1)
        elif type.find('M1') > -1:
            return sum(delq2)
        else:
            return sum(delq3)

for t in [1, 3, 6, 12]:
    # (1) maximum delinquency status in the past t months
    allFeatures.append('maxDelqL' + str(t) + 'M')
    trainData['maxDelqL' + str(t) + 'M'] = trainData.apply(lambda x: DelqFeatures(x, t, 'max delq'), axis=1)

    # (2) counts of M0, M1 and M2 delinquencies in the past t months
    allFeatures.append('M0FreqL' + str(t) + 'M')
    trainData['M0FreqL' + str(t) + 'M'] = trainData.apply(lambda x: DelqFeatures(x, t, 'M0 times'), axis=1)

    allFeatures.append('M1FreqL' + str(t) + 'M')
    trainData['M1FreqL' + str(t) + 'M'] = trainData.apply(lambda x: DelqFeatures(x, t, 'M1 times'), axis=1)

    allFeatures.append('M2FreqL' + str(t) + 'M')
    trainData['M2FreqL' + str(t) + 'M'] = trainData.apply(lambda x: DelqFeatures(x, t, 'M2 times'), axis=1)

 

  • Credit-utilization features are usually highly correlated with default in a behavior scorecard model.

    Here we take: the mean utilization rate, the maximum utilization rate, and the number of months in which utilization increased.

    Define the utilization function and generate the features:

 
import numpy as np

def UrateFeatures(event, window, type):
    '''
    Credit-utilization features: mean utilization, max utilization,
    and the number of months in which utilization increased.
    '''
    current = 12
    start = 12 - window + 1
    monthlySpend = [event[a] for a in ['Spend_' + str(t) for t in range(current, start - 1, -1)]]
    limit = event['Loan_Amount']
    monthlyUrate = [x / limit for x in monthlySpend]
    if type == 'mean utilization rate':
        return np.mean(monthlyUrate)
    if type == 'max utilization rate':
        return max(monthlyUrate)
    if type == 'increase utilization rate':
        # the list runs from the most recent month backwards, so element i
        # is one month later than element i + 1
        currentUrate = monthlyUrate[0:-1]
        previousUrate = monthlyUrate[1:]
        compareUrate = [int(x[0] > x[1]) for x in zip(currentUrate, previousUrate)]
        return sum(compareUrate)

'''
Credit-utilization features are usually highly correlated with the default rate
in a behavior scorecard model.
'''
for t in [1, 3, 6, 12]:
    # (1) maximum monthly utilization in the past t months
    allFeatures.append('maxUrateL' + str(t) + 'M')
    trainData['maxUrateL' + str(t) + 'M'] = trainData.apply(lambda x: UrateFeatures(x, t, 'max utilization rate'), axis=1)

    # (2) mean monthly utilization in the past t months
    allFeatures.append('avgUrateL' + str(t) + 'M')
    trainData['avgUrateL' + str(t) + 'M'] = trainData.apply(lambda x: UrateFeatures(x, t, 'mean utilization rate'), axis=1)

    # (3) number of months with increased utilization in the past t months, t > 1
    if t > 1:
        allFeatures.append('increaseUrateL' + str(t) + 'M')
        trainData['increaseUrateL' + str(t) + 'M'] = trainData.apply(lambda x: UrateFeatures(x, t, 'increase utilization rate'), axis=1)

  • Repayment features are also commonly used in behavior scorecard models.

    Here we take: the minimum, maximum and mean payment ratios.

    Define the repayment function and generate the features:

 
def PaymentFeature(event, window, type):
    '''
    Repayment features: minimum, maximum and mean payment ratio.
    '''
    current = 12
    start = 12 - window + 1
    currentPayment = [event[a] for a in ['Payment_' + str(t) for t in range(current, start - 1, -1)]]
    # each payment is compared against the previous month's outstanding balance
    previousOS = [event[a] for a in ['OS_' + str(t) for t in range(current - 1, start - 2, -1)]]
    monthlyPayRatio = []
    for Pay_OS in zip(currentPayment, previousOS):
        if Pay_OS[1] > 0:
            payRatio = Pay_OS[0] * 1.0 / Pay_OS[1]
            monthlyPayRatio.append(payRatio)
        else:
            monthlyPayRatio.append(1)
    if type == 'min payment ratio':
        return min(monthlyPayRatio)
    if type == 'max payment ratio':
        return max(monthlyPayRatio)
    if type == 'mean payment ratio':
        total_payment = sum(currentPayment)
        total_OS = sum(previousOS)
        if total_OS > 0:
            return total_payment / total_OS
        else:
            return 1

'''
Repayment features are also commonly used in scorecard models.
'''
for t in [1, 3, 6, 12]:
    allFeatures.append('maxPayL' + str(t) + 'M')
    trainData['maxPayL' + str(t) + 'M'] = trainData.apply(lambda x: PaymentFeature(x, t, 'max payment ratio'), axis=1)

    allFeatures.append('minPayL' + str(t) + 'M')
    trainData['minPayL' + str(t) + 'M'] = trainData.apply(lambda x: PaymentFeature(x, t, 'min payment ratio'), axis=1)

    allFeatures.append('avgPayL' + str(t) + 'M')
    trainData['avgPayL' + str(t) + 'M'] = trainData.apply(lambda x: PaymentFeature(x, t, 'mean payment ratio'), axis=1)

All of the derived features:

['maxDelqL1M', 'M0FreqL1M', 'M1FreqL1M', 'M2FreqL1M', 'maxDelqL3M', 'M0FreqL3M', 'M1FreqL3M', 'M2FreqL3M', 'maxDelqL6M', 'M0FreqL6M', 'M1FreqL6M', 'M2FreqL6M', 'maxDelqL12M', 'M0FreqL12M', 'M1FreqL12M', 'M2FreqL12M', 'maxUrateL1M', 'avgUrateL1M', 'maxUrateL3M', 'avgUrateL3M', 'increaseUrateL3M', 'maxUrateL6M', 'avgUrateL6M', 'increaseUrateL6M', 'maxUrateL12M', 'avgUrateL12M', 'increaseUrateL12M', 'maxPayL1M', 'minPayL1M', 'avgPayL1M', 'maxPayL3M', 'minPayL3M', 'avgPayL3M', 'maxPayL6M', 'minPayL6M', 'avgPayL6M', 'maxPayL12M', 'minPayL12M', 'avgPayL12M']

 

 

Step 2: Binning and WOE encoding

Binning and WOE encoding are both supervised transformations: every bin must contain both good and bad samples, and the number of bins generally should not exceed 5.

scorecard_function is a custom module; see the downloadable source for details.
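The workhorse routine is CalcWOE. Here is a minimal sketch, assuming the conventional WOE/IV definitions with WOE_i = ln(good share / bad share) per bin; under this sign convention a higher WOE means lower risk, which is why the regression coefficients later in the post are expected to be negative. The actual implementation ships with the source download and may differ in details:

import numpy as np
import pandas as pd

def CalcWOE(df, col, target):
    '''Return {'WOE': {bin value -> WOE}, 'IV': total information value}.'''
    total = df.groupby([col])[target].count()
    bad = df.groupby([col])[target].sum()
    regroup = pd.DataFrame({'total': total, 'bad': bad})
    B = regroup['bad'].sum()
    G = regroup['total'].sum() - B
    regroup['good'] = regroup['total'] - regroup['bad']
    regroup['bad_pcnt'] = regroup['bad'] / B
    regroup['good_pcnt'] = regroup['good'] / G
    # WOE per bin, and each bin's contribution to the information value
    regroup['WOE'] = np.log(regroup['good_pcnt'] / regroup['bad_pcnt'])
    IV = ((regroup['good_pcnt'] - regroup['bad_pcnt']) * regroup['WOE']).sum()
    return {'WOE': regroup['WOE'].to_dict(), 'IV': IV}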

Categorical variables: the maximum delinquency status within the past t months.

We need to check how they relate to the bad rate:

 
print(trainData.groupby(['maxDelqL1M'])['label'].mean())
print(trainData.groupby(['maxDelqL3M'])['label'].mean())
print(trainData.groupby(['maxDelqL6M'])['label'].mean())
print(trainData.groupby(['maxDelqL12M'])['label'].mean())

The bad rates for the 1-, 3- and 6-month windows are monotone. Note that maxDelqL12M is not monotone (its bad rate dips from level 0 to level 1); it will reappear in the non-monotone list below.

maxDelqL1M
0    0.102087
1    0.109065
2    0.514403
3    0.956710
Name: label, dtype: float64
maxDelqL3M
0    0.047477
1    0.050318
2    0.434509
3    0.958009
Name: label, dtype: float64
maxDelqL6M
0    0.047886
1    0.050380
2    0.265044
3    0.549407
Name: label, dtype: float64
maxDelqL12M
0    0.070175
1    0.044748
2    0.182365
3    0.380687
Name: label, dtype: float64

 

 
###################################
#   2. Binning and WOE encoding   #
###################################

'''
Binning and WOE computation for the categorical variables.
Whether a variable is categorical can be judged by its number of distinct values.
'''
categoricalFeatures = []
numericalFeatures = []
WOE_IV_dict = {}
for var in allFeatures:
    if len(set(trainData[var])) > 5:
        numericalFeatures.append(var)
    else:
        categoricalFeatures.append(var)

not_monotone = []
for var in categoricalFeatures:
    # check that the bad rate is monotone across the bins
    if not scorecard_function.BadRateMonotone(trainData, var, 'label'):
        not_monotone.append(var)
print(not_monotone)
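The monotonicity test itself, BadRateMonotone, is also part of the custom module. A minimal sketch of the idea, assuming it simply checks whether the per-bin bad rate moves in a single direction (the shipped version may handle edge cases differently):

def BadRateMonotone(df, sortByVar, target):
    # bad rate per bin, with bins ordered by their value
    badRate = df.groupby([sortByVar])[target].mean().sort_index().tolist()
    diffs = [badRate[i + 1] - badRate[i] for i in range(len(badRate) - 1)]
    # monotone when all consecutive differences share one sign
    return all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)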

The non-monotone variables are ['M1FreqL3M', 'M2FreqL3M', 'maxDelqL12M']; we handle them below.

 

trainData.groupby(['M2FreqL3M'])['label'].count()   
M2FreqL3M
0    27456
1      585
2       55
3        3
Name: label, dtype: int64

Merge all M2FreqL3M >= 1 into a single group, then check the monotonicity of M1FreqL3M.

 
# merge M2FreqL3M >= 1 into one group, then compute WOE and IV
trainData['M2FreqL3M_Bin'] = trainData['M2FreqL3M'].apply(lambda x: int(x >= 1))
trainData.groupby(['M2FreqL3M_Bin'])['label'].mean()
WOE_IV_dict['M2FreqL3M_Bin'] = scorecard_function.CalcWOE(trainData, 'M2FreqL3M_Bin', 'label')

trainData.groupby(['M1FreqL3M'])['label'].mean()  # check monotonicity

 
M1FreqL3M
0    0.049511
1    0.409583
2    0.930825
3    0.927083
Name: label, dtype: float64
M1FreqL3M
0    22379
1     4800
2      824
3       96
Name: label, dtype: int64
Except for M1FreqL3M = 3, the bad rate is monotone across the groups. In addition, M1FreqL3M = 0 accounts for a very large share of the samples, so all M1FreqL3M >= 1 are merged into one group.

For the remaining monotone categorical variables, check whether any bin holds less than 5% of the samples; if so, merge that variable's bins:
 
small_bin_var = []
large_bin_var = []
N = trainData.shape[0]
for var in categoricalFeatures:
    if var not in not_monotone:
        total = trainData.groupby([var])[var].count()
        pcnt = total * 1.0 / N
        if min(pcnt) < 0.05:
            small_bin_var.append({var: pcnt.to_dict()})
        else:
            large_bin_var.append(var)

for i in small_bin_var:
    print(i)
'''
{'maxDelqL1M': {0: 0.60379372931421049, 1: 0.31880138083205806, 2: 0.069183956724438597, 3: 0.0082209331292928574}}
{'M2FreqL1M': {0: 0.99177906687070716, 1: 0.0082209331292928574}}
{'maxDelqL3M': {0: 0.22637816292394747, 1: 0.57005587387451506, 2: 0.18068258656891703, 3: 0.022883376632620377}}
{'maxDelqL6M': {0: 0.057226235809103528, 1: 0.58489625965336844, 2: 0.31285810882949572, 3: 0.045019395708032317}}
{'M2FreqL6M': {0: 0.95498060429196774, 1: 0.04003701199330937, 2: 0.0045909107085661408, 3: 0.00032029609594647497, 4: 7.1176910210327775e-05}}
{'M2FreqL12M': {0: 0.92334246770347694, 1: 0.066514822591551295, 2: 0.0092174098722374465, 3: 0.00081853446741876937, 4: 0.00010676536531549166}}
'''

# M2FreqL1M, M2FreqL6M and M2FreqL12M each have one bin taking an overwhelming share, so drop them
allFeatures.remove('M2FreqL1M')
allFeatures.remove('M2FreqL6M')
allFeatures.remove('M2FreqL12M')

# for the other variables in small_bin_var, merge the smallest bin with its neighbour, then compute WOE
trainData['maxDelqL1M_Bin'] = trainData['maxDelqL1M'].apply(lambda x: scorecard_function.MergeByCondition(x, ['==0', '==1', '>=2']))
trainData['maxDelqL3M_Bin'] = trainData['maxDelqL3M'].apply(lambda x: scorecard_function.MergeByCondition(x, ['==0', '==1', '>=2']))
trainData['maxDelqL6M_Bin'] = trainData['maxDelqL6M'].apply(lambda x: scorecard_function.MergeByCondition(x, ['==0', '==1', '>=2']))
for var in ['maxDelqL1M_Bin', 'maxDelqL3M_Bin', 'maxDelqL6M_Bin']:
    WOE_IV_dict[var] = scorecard_function.CalcWOE(trainData, var, 'label')

'''
For features that need no merging and whose raw bins already have a monotone
bad rate, compute WOE and IV directly.
'''
for var in large_bin_var:
    WOE_IV_dict[var] = scorecard_function.CalcWOE(trainData, var, 'label')

'''
Numerical variables must be binned first, then WOE and IV are computed.
The binning must satisfy:
1. no more than 5 bins;
2. a monotone bad rate;
3. each bin holding at least 5% of the samples.
'''
bin_dict = []
for var in numericalFeatures:
    binNum = 5
    newBin = var + '_Bin'
    bin = scorecard_function.ChiMerge(trainData, var, 'label', max_interval=binNum, minBinPcnt=0.05)
    trainData[newBin] = trainData[var].apply(lambda x: scorecard_function.AssignBin(x, bin))
    # if the bad rate is not monotone, reduce the number of bins
    while not scorecard_function.BadRateMonotone(trainData, newBin, 'label'):
        binNum -= 1
        bin = scorecard_function.ChiMerge(trainData, var, 'label', max_interval=binNum, minBinPcnt=0.05)
        trainData[newBin] = trainData[var].apply(lambda x: scorecard_function.AssignBin(x, bin))
    WOE_IV_dict[newBin] = scorecard_function.CalcWOE(trainData, newBin, 'label')
    bin_dict.append({var: bin})
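MergeByCondition, ChiMerge and AssignBin all live in scorecard_function. ChiMerge (chi-square based interval merging) is too long to reproduce here, but sketches of the two simpler helpers, inferred from how they are called above and therefore only indicative:

def MergeByCondition(x, conditions):
    '''Map a raw value to the index of the first condition it satisfies,
    e.g. conditions = ['==0', '==1', '>=2'] maps 0 -> 0, 1 -> 1, 5 -> 2.'''
    for i, cond in enumerate(conditions):
        if eval(str(x) + cond):
            return i

def AssignBin(x, cutOffPoints):
    '''Assign x to a bin label given the cut-off points found by ChiMerge.'''
    numBin = len(cutOffPoints) + 1
    if x <= cutOffPoints[0]:
        return 'Bin 0'
    elif x > cutOffPoints[-1]:
        return 'Bin {0}'.format(numBin - 1)
    for i in range(0, numBin - 2):
        if cutOffPoints[i] < x <= cutOffPoints[i + 1]:
            return 'Bin {0}'.format(i + 1)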

Step 3: Univariate and multivariate analysis

Logistic regression is sensitive to linear correlation among features, so correlated variables must be removed before building the model.

Univariate analysis: since there are not many features this time, the IV threshold is relaxed to > 0.02.

Multivariate analysis: compute pairwise correlations between features; when a correlation coefficient exceeds 0.7, drop the variable with the lower IV.

 
##############################
# 3. Univariate and multivariate analysis
##############################
import matplotlib.pyplot as plt
import seaborn as sns

# keep variables with IV of at least 0.02
high_IV = [(k, v['IV']) for k, v in WOE_IV_dict.items() if v['IV'] >= 0.02]
high_IV_sorted = sorted(high_IV, key=lambda k: k[1], reverse=True)
for (var, iv) in high_IV:
    newVar = var + "_WOE"
    trainData[newVar] = trainData[var].map(lambda x: WOE_IV_dict[var]['WOE'][x])

'''
Compare pairwise linear correlations. If the absolute value of the correlation
coefficient exceeds the threshold, drop the variable with the lower IV.
'''
deleted_index = []
cnt_vars = len(high_IV_sorted)
for i in range(cnt_vars):
    if i in deleted_index:
        continue
    x1 = high_IV_sorted[i][0] + "_WOE"
    for j in range(cnt_vars):
        if i == j or j in deleted_index:
            continue
        y1 = high_IV_sorted[j][0] + "_WOE"
        roh = np.corrcoef(trainData[x1], trainData[y1])[0, 1]
        if abs(roh) > 0.7:
            x1_IV = high_IV_sorted[i][1]
            y1_IV = high_IV_sorted[j][1]
            if x1_IV > y1_IV:
                deleted_index.append(j)
            else:
                deleted_index.append(i)

single_analysis_vars = [high_IV_sorted[i][0] + "_WOE" for i in range(cnt_vars) if i not in deleted_index]

X = trainData[single_analysis_vars]
f, ax = plt.subplots(figsize=(10, 8))
corr = X.corr()
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool),
            cmap=sns.diverging_palette(220, 10, as_cmap=True), square=True, ax=ax)

 

Correlation heatmap of the remaining variables:

[Figure 1: correlation heatmap]

Multivariate analysis: compute the VIF for each variable. The largest VIF is 3.429, below the usual threshold of 10, so we conclude there is no serious multicollinearity at this step.

 
'''
Multivariate analysis: VIF
'''
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = np.array(trainData[single_analysis_vars])
VIF_list = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
print(max(VIF_list))
# the largest VIF is 3.429, below 10, so no multicollinearity at this step
multi_analysis = single_analysis_vars

Step 4: Fit a logistic regression model to predict default

 
################################
# 4. Logistic regression to predict default
################################
import pandas as pd
import statsmodels.api as sm

X = trainData[multi_analysis].copy()   # copy to avoid mutating trainData
X['intercept'] = [1] * X.shape[0]
y = trainData['label']
logit = sm.Logit(y, X)
logit_result = logit.fit()
pvalues = logit_result.pvalues
params = logit_result.params
fit_result = pd.concat([params, pvalues], axis=1)
fit_result.columns = ['coef', 'p-value']
fit_result = fit_result.sort_values(by='coef')
print(fit_result)

The following variables received positive coefficients and need to be validated individually against the label with univariate logistic regressions:

  • increaseUrateL3M_WOE
  • minPayL6M_Bin_WOE
  • avgUrateL12M_Bin_WOE
  • minPayL1M_Bin_WOE
  • M0FreqL6M_Bin_WOE
  • minPayL3M_Bin_WOE
 
sm.Logit(y, trainData['increaseUrateL3M_WOE']).fit().params   # -0.995312
sm.Logit(y, trainData['minPayL6M_Bin_WOE']).fit().params      # -0.807779
sm.Logit(y, trainData['avgUrateL12M_Bin_WOE']).fit().params   # -1.0179
sm.Logit(y, trainData['minPayL1M_Bin_WOE']).fit().params      # -0.969236
sm.Logit(y, trainData['M0FreqL6M_Bin_WOE']).fit().params      # -1.032842
sm.Logit(y, trainData['minPayL3M_Bin_WOE']).fit().params      # -0.829298

Fitted individually, every coefficient is negative, which matches expectations; the sign flips in the joint model therefore indicate that multicollinearity still exists.

Next, rank variable importance with GBDT and pick out a suitable subset.
 
from sklearn import ensemble

clf = ensemble.GradientBoostingClassifier()
gbdt_model = clf.fit(X, y)
importance = gbdt_model.feature_importances_.tolist()
featureImportance = zip(multi_analysis, importance)
featureImportanceSorted = sorted(featureImportance, key=lambda k: k[1], reverse=True)
pd.DataFrame(featureImportanceSorted).plot(kind='bar')

[Figure 2: GBDT feature importances, in descending order]

The figure shows that the top 4 variables are clearly more important. We first fit a model on those four, then add one variable at a time in order of importance, dropping any variable whose coefficient turns positive or whose p-value exceeds 0.1.

 
# assume the model can hold 4 features first, then add one feature at a time,
# rejecting a variable if a coefficient turns positive or a p-value exceeds 0.1
n = 4
featureSelected = [i[0] for i in featureImportanceSorted[:n]]
X_train = X[featureSelected + ['intercept']]
logit = sm.Logit(y, X_train)
logit_result = logit.fit()
pvalues = logit_result.pvalues
params = logit_result.params
fit_result = pd.concat([params, pvalues], axis=1)
fit_result.columns = ['coef', 'p-value']
'''
                              coef       p-value
maxDelqL3M_Bin_WOE       -0.895654  0.000000e+00
increaseUrateL6M_Bin_WOE -1.084713  1.623441e-84
M0FreqL3M_WOE            -0.436273  1.556517e-74
avgUrateL1M_Bin_WOE      -0.629355  7.146665e-16
avgUrateL3M_Bin_WOE      -0.570670  8.207241e-12
intercept                -1.831752  0.000000e+00
'''
while n < len(featureImportanceSorted):
    nextVar = featureImportanceSorted[n][0]
    featureSelected = featureSelected + [nextVar]
    X_train = X[featureSelected + ['intercept']]
    logit = sm.Logit(y, X_train)
    logit_result = logit.fit()
    params = logit_result.params
    print("current var is ", nextVar, ' ', params[nextVar])
    if max(params) < 0:
        n += 1
    else:
        featureSelected.remove(nextVar)
        n += 1

X_train = X[featureSelected + ['intercept']]
logit = sm.Logit(y, X_train)
logit_result = logit.fit()
pvalues = logit_result.pvalues
params = logit_result.params
fit_result = pd.concat([params, pvalues], axis=1)
fit_result.columns = ['coef', 'p-value']
fit_result = fit_result.sort_values(by='p-value')

Variables with a p-value above 0.1 are checked for significance individually:
 
largePValueVars = pvalues[pvalues > 0.1].index
for var in largePValueVars:
    X_temp = X[[var, 'intercept']]
    logit = sm.Logit(y, X_temp)
    logit_result = logit.fit()
    pvalues = logit_result.pvalues
    print("The p-value of {0} is {1} ".format(var, str(pvalues[var])))

Individually, each of these variables is significant, which again points to remaining collinearity.

We can apply L1 regularization, scanning the penalty from strong to weak and keeping the smallest penalty under which all coefficients stay negative and significant:
 
 
from sklearn.metrics import roc_auc_score

X2 = X[featureSelected + ['intercept']]
# scan the L1 penalty from strong to weak; stop just before the constraints break
for alpha in range(100, 0, -1):
    l1_logit = sm.Logit(y, X2).fit_regularized(start_params=None, method='l1', alpha=alpha)
    pvalues = l1_logit.pvalues
    params = l1_logit.params
    if max(pvalues) >= 0.1 or max(params) > 0:
        break

bestAlpha = alpha + 1
l1_logit = sm.Logit(y, X2).fit_regularized(start_params=None, method='l1', alpha=bestAlpha)
params = l1_logit.params
params2 = params.to_dict()
featuresInModel = [k for k, v in params2.items() if k != 'intercept' and v < -0.0000001]
print(featuresInModel)

X_train = X[featuresInModel + ['intercept']]
logit = sm.Logit(y, X_train)
logit_result = logit.fit()
trainData['pred'] = logit_result.predict(X_train)

ks = scorecard_function.KS(trainData, 'pred', 'label')
auc = roc_auc_score(trainData['label'], trainData['pred'])
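scorecard_function.KS computes the Kolmogorov-Smirnov statistic. A minimal sketch under the standard definition (the maximum gap between the cumulative bad and good distributions when accounts are ordered by predicted probability); the shipped version may differ in detail:

def KS(df, score, target):
    '''KS = max |cumulative bad share - cumulative good share| with the
    sample ordered by the model score.'''
    total_bad = df[target].sum()
    total_good = df.shape[0] - total_bad
    df_sorted = df.sort_values(by=score)
    cum_bad = df_sorted[target].cumsum() / total_bad
    cum_good = (1 - df_sorted[target]).cumsum() / total_good
    return (cum_bad - cum_good).abs().max()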

Step 5: Evaluate the model on the test set

 
 
###################################
# 5. Evaluate the logistic regression on the test set
###################################
# prepare the WOE-encoded variables
modelFeatures = [i.replace('_Bin', '').replace('_WOE', '') for i in featuresInModel]

numFeatures = [i for i in modelFeatures if i in numericalFeatures]
charFeatures = [i for i in modelFeatures if i in categoricalFeatures]

# derive the same behavioral features on the test set
testData['maxDelqL1M'] = testData.apply(lambda x: DelqFeatures(x, 1, 'max delq'), axis=1)
testData['maxDelqL3M'] = testData.apply(lambda x: DelqFeatures(x, 3, 'max delq'), axis=1)
testData['M0FreqL3M'] = testData.apply(lambda x: DelqFeatures(x, 3, 'M0 times'), axis=1)
testData['M1FreqL6M'] = testData.apply(lambda x: DelqFeatures(x, 6, 'M1 times'), axis=1)
testData['M2FreqL3M'] = testData.apply(lambda x: DelqFeatures(x, 3, 'M2 times'), axis=1)
testData['avgUrateL1M'] = testData.apply(lambda x: UrateFeatures(x, 1, 'mean utilization rate'), axis=1)
testData['avgUrateL3M'] = testData.apply(lambda x: UrateFeatures(x, 3, 'mean utilization rate'), axis=1)
testData['increaseUrateL6M'] = testData.apply(lambda x: UrateFeatures(x, 6, 'increase utilization rate'), axis=1)
testData['M2FreqL3M_Bin'] = testData['M2FreqL3M'].apply(lambda x: int(x >= 1))
testData['maxDelqL1M_Bin'] = testData['maxDelqL1M'].apply(lambda x: scorecard_function.MergeByCondition(x, ['==0', '==1', '>=2']))
testData['maxDelqL3M_Bin'] = testData['maxDelqL3M'].apply(lambda x: scorecard_function.MergeByCondition(x, ['==0', '==1', '>=2']))
for var in numFeatures:
    newBin = var + "_Bin"
    # look up the cut-off points saved during training
    bin = list([i.values() for i in bin_dict if var in i][0])[0]
    testData[newBin] = testData[var].apply(lambda x: scorecard_function.AssignBin(x, bin))

finalFeatures = [i + '_Bin' for i in numFeatures] + ['M2FreqL3M_Bin', 'maxDelqL1M_Bin', 'maxDelqL3M_Bin', 'M0FreqL3M']
for var in finalFeatures:
    var2 = var + "_WOE"
    testData[var2] = testData[var].apply(lambda x: WOE_IV_dict[var]['WOE'][x])

X_test = testData[featuresInModel].copy()
X_test['intercept'] = [1] * X_test.shape[0]
testData['pred'] = logit_result.predict(X_test)

ks = scorecard_function.KS(testData, 'pred', 'label')
auc = roc_auc_score(testData['label'], testData['pred'])
print(ks, auc)

 

Step 6: Compute scorecard scores on the test set

 
 
##########################
# 6. Compute scores on the test set
##########################
BasePoint, PDO = 500, 50
testData['score'] = testData['pred'].apply(lambda x: scorecard_function.Prob2Score(x, BasePoint, PDO))
plt.hist(testData['score'], bins=100)
plt.show()

Score distribution of test-set users:

 

[Figure 3: histogram of test-set scores]
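Prob2Score converts the predicted default probability into a score. Its exact implementation is in the source download; a plausible sketch under the common convention that BasePoint is the score at 1:1 good-to-bad odds and PDO is the number of points that doubles the odds (both anchoring choices are assumptions here):

import numpy as np

def Prob2Score(prob, basePoint, PDO):
    # log-odds of being good; every PDO points doubles the good:bad odds
    y = np.log((1 - prob) / prob)
    return basePoint + PDO / np.log(2) * y

Under these assumptions, a user with 2:1 good odds (default probability 1/3) scores 550, and a lower default probability always maps to a higher score.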
