---
title: "Credit Scoring Model"
author: "junjun"
date: "October 3, 2016"
output: html_document
---
# 1. Data acquisition and integration
Data source: the data come from Kaggle; cs-training.csv contains 150,000 sample records. Download address: https://www.kaggle.com/c/GiveMeSomeCredit/data
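• Data description: the data cover personal consumer loans. Only data that will be available when the scorecard is finally deployed should be considered, drawn from the following areas:
– Basic attributes: the borrower's age at the time.
– Repayment capacity: the borrower's monthly income and debt ratio.
– Credit history: number of times 30-59 days past due in the past two years, number of times 60-89 days past due in the past two years, and number of times 90 days or more past due in the past two years.
– Assets: number of open credit lines and loans, and number of real estate loans or lines.
– Loan attributes: none for now.
– Other factors: the borrower's number of dependents (excluding the borrower).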
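• Original variables:

| Variable name | Type | Description |
|---|---|---|
| SeriousDlqin2yrs | Y/N | Experienced a delinquency of 90 days past due or worse |
| RevolvingUtilizationOfUnsecuredLines | percentage | Revolving utilization of unsecured lines: total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans), divided by the sum of credit limits |
| age | integer | Borrower's age at the time |
| NumberOfTime30-59DaysPastDueNotWorse | integer | Number of times 30-59 days past due, but no worse |
| DebtRatio | percentage | Debt ratio |
| MonthlyIncome | real | Monthly income |
| NumberOfOpenCreditLinesAndLoans | integer | Number of open loans (installment loans such as car loans or mortgages) and lines of credit (such as credit cards) |
| NumberOfTimes90DaysLate | integer | Number of times the borrower has been 90 days or more past due |
| NumberRealEstateLoansOrLines | integer | Number of mortgage and real estate loans, including home equity lines of credit |
| NumberOfTime60-89DaysPastDueNotWorse | integer | Number of times in the past two years the borrower has been 60-89 days past due, but no worse |
| NumberOfDependents | integer | Number of dependents, excluding the borrower |
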
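• Time window: the observation window for the independent variables is the past two years; the performance window for the dependent variable is the following two years.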
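# 2. Data processing
First drop the sequence variable in the raw data, i.e. the id in the first column. Because the target of the prediction is SeriousDlqin2yrs, it becomes the response variable y, and the remaining variables are renamed x1 to x10.

1. Missing-value analysis and handling

Once the data set is loaded, we need to look at how the data are distributed. Many models are sensitive to missing values, so checking for them is an important step. Before the formal analysis, a plot gives a first impression of how much is missing in each field.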
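
Before plotting, the amount of missing data per column can also be counted directly. The following is a minimal base-R sketch on a toy data frame; `toy` and its values are hypothetical and only stand in for the renamed data set.

```r
# Toy data frame standing in for the renamed data (hypothetical values).
toy <- data.frame(
  y   = c(0, 1, 0, 0, 1),
  x5  = c(3000, NA, 5200, NA, 4100),  # plays the role of MonthlyIncome
  x10 = c(0, 2, NA, 1, 0)             # plays the role of NumberOfDependents
)

# Number and share of missing values per column.
na_count <- colSums(is.na(toy))
na_share <- round(colMeans(is.na(toy)), 3)
print(data.frame(na_count, na_share))

# Number of rows with no missing value at all.
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

Comparing `n_complete` with the total row count shows how many observations would be lost if incomplete rows were simply dropped.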
```{r warning=FALSE}
# 1. Read the data set
data <- read.csv(file = "F:\\R\\数据集\\P2P\\信用评分模型\\cs-training.csv")
# Drop the id column
data1 <- data[, -1]
head(data1)
# Rename the columns
names(data1) <- c("y", paste("x", 1:10, sep = ""))
str(data1)
# 2. Inspect the distribution of missing values
library(mice)
#matrixplot(data1)
md.pattern(data1)
# Two variables have missing values: x5 (MonthlyIncome) with 29,731 and x10 (NumberOfDependents) with 3,924.
# 3. There are many ways to handle missing values, e.g. cluster-based, regression-based or mean-based methods.
# The simplest is to drop the incomplete rows, but here the share of missing values is high and dropping
# them would lose many observations, so it is not the best choice. KNN imputation is used instead.
library(DMwR)
traindata <- knnImputation(data1, k = 10, meth = "weighAvg")
#write.csv(traindata, "F:\\R\\数据集\\P2P\\信用评分模型\\cs-training-na.csv")
str(traindata)
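# 4. Outlier analysis and handling
# Outliers in monthly income
out <- boxplot.stats(traindata$x5)
boxplot(traindata$x5)
# which(traindata$x5 %in% out$out)
# traindata1 <- traindata[-which(traindata$x5 %in% out$out), ]
# boxplot(traindata1$x5)
# For x2, the customer's age, list the distinct values
unique(traindata$x2)
# Age contains the value 0, which is clearly an outlier, so those rows are dropped.
# Logical indexing is used because x[-which(cond), ] returns zero rows when nothing matches.
traindata <- traindata[traindata$x2 != 0, ]
# Boxplots of x3, x7 and x9 show outliers in all three, and unique() reveals the two outlying
# values 96 and 98 in each, so they are dropped. Removing the 96/98 rows for one variable
# also removes the 96/98 values of the other variables.
unique(traindata$x3)
traindata <- traindata[!(traindata$x3 %in% c(96, 98)), ]
unique(traindata$x7)
traindata <- traindata[!(traindata$x7 %in% c(96, 98)), ]
# Once the outliers in x3 and x7 are removed, those in x9 are gone as well
unique(traindata$x9)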
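# 5. Variable analysis
# 1) Univariate analysis
# A quick look at the distribution of some variables, for example age:
library(ggplot2)
ggplot(traindata, aes(x = x2, y = ..density..)) + geom_histogram(fill = "blue", colour = "grey60", size = 0.2, alpha = 0.2) + geom_density()
# Age is roughly normally distributed, which suits the statistical analysis. Monthly income can be plotted the same way:
ggplot(traindata, aes(x = x5, y = ..density..)) + geom_histogram(fill = "blue", colour = "grey60", size = 0.2, alpha = 0.2) + geom_density() + xlim(1, 20000)
# Monthly income is also roughly normally distributed, which meets the needs of the analysis.
# 2) Correlation between variables: before modelling, check the correlations between variables, because
# strongly correlated variables harm the model's predictions. corrplot draws the correlations between
# all variables, response and predictors alike.
cor1 <- cor(traindata[, 1:11])
library(corrplot)
corrplot(cor1)
corrplot(cor1, method = "number")
# The plots show that the correlations between variables are very small. Logistic regression also needs a
# multicollinearity check, but with correlations this low we can tentatively conclude there is none. After
# modelling, VIF (variance inflation factor) can be used to check again. If multicollinearity is present,
# i.e. two variables may be highly correlated, dimension reduction or variable removal is needed.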
```
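
The VIF check mentioned at the end of the chunk can be illustrated without extra packages. The sketch below is an assumption-laden example on simulated predictors, not the author's procedure: `vif_manual` is a hypothetical helper that regresses each predictor on the others and returns 1 / (1 - R²).

```r
set.seed(1)
n  <- 500
z1 <- rnorm(n)
z2 <- rnorm(n)
z3 <- z1 + rnorm(n, sd = 0.1)  # nearly a copy of z1, so collinear with it
X  <- data.frame(z1, z2, z3)

# VIF of predictor v: regress it on the remaining predictors and
# take 1 / (1 - R^2) of that auxiliary regression.
vif_manual <- function(X) {
  sapply(names(X), function(v) {
    others <- setdiff(names(X), v)
    fit <- lm(reformulate(others, response = v), data = X)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_manual(X)
print(round(vifs, 2))
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity; here `z1` and `z3` are far above that, while `z2` stays near 1.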
# 3. Splitting the data set
```{r warning=FALSE}
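# 1. Distribution of the dependent variable
table(traindata$y)
prop.table(table(traindata$y))
# The table shows a clear class imbalance in the response SeriousDlqin2yrs: only 9,879 observations
# equal 1, just 6.6% of all observations. The imbalanced data therefore need treatment; one option is
# the SMOTE algorithm, which oversamples the rare events in R.
# 2. Use createDataPartition from the caret package to split the data at random into two equal halves
library(caret)
index <- createDataPartition(traindata$y, times = 1, p = 0.5, list = FALSE)
train <- traindata[index, ]
test <- traindata[-index, ]
# After the split, the training and test sets each contain 74,865 records; the class balance is as follows
prop.table(table(train$y))
prop.table(table(test$y))
# The class proportions match in both sets, with the positive class still at about 6.6%,
# so this split can be used for modelling and prediction.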
```
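
The chunk above mentions SMOTE for the class imbalance but does not apply it. As a simpler stand-in (plain random oversampling, not SMOTE), the sketch below duplicates minority rows until both classes are the same size. The data are simulated and all names are hypothetical; in practice such resampling would be applied to the training half only.

```r
set.seed(42)
# Toy training set with roughly 7% positives (hypothetical data).
toy_train <- data.frame(y = rbinom(2000, 1, 0.07), x = rnorm(2000))

minority <- toy_train[toy_train$y == 1, ]
majority <- toy_train[toy_train$y == 0, ]

# Resample minority rows with replacement until both classes have the same size.
idx      <- sample(nrow(minority), size = nrow(majority), replace = TRUE)
balanced <- rbind(majority, minority[idx, ])

print(prop.table(table(toy_train$y)))
print(prop.table(table(balanced$y)))
```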
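# 4. Modeling
Logistic regression plays a central role in credit scorecard development. Because of its form, and because the independent variables are transformed with weight of evidence (WOE), the result of a logistic regression can be converted directly into a summary table, the so-called standard scorecard format.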
```{r warning=FALSE}
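# 1. First fit a logistic regression on all variables with glm
fit <- glm(y ~ ., train, family = "binomial")
summary(fit)
# The full model does not fit very well: the p-values of x1, x4 and x6 fail the significance test,
# so these three variables are dropped and y is regressed on the remaining ones.
# 2. Refine the model
fit2 <- glm(y ~ x2 + x3 + x5 + x7 + x8 + x9 + x10, train, family = "binomial")
summary(fit2)
# In the second model every variable passes the test and the AIC (Akaike information criterion)
# is even smaller, so this model fits better.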
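# 3. Model evaluation
# A binary classifier is usually judged by its ROC (Receiver Operating Characteristic) curve and AUC value.
# Many binary classifiers output a predicted probability rather than just a 0/1 label. A cut-off (for
# example 0.5) then decides which cases are predicted as 1 and which as 0. With the binary predictions, a
# confusion matrix can be built to assess the classifier. Every observation falls into one cell of this
# matrix, and the diagonal holds the number of correct predictions, i.e. true positives + true negatives.
# From it one can compute TPR (true positive rate, or sensitivity) and TNR (true negative rate, or
# specificity). Ideally both would be as large as possible, but they trade off against each other. Besides
# the training parameters of the classifier, the choice of cut-off strongly affects TPR and TNR, and it
# can be chosen to suit the problem at hand.
# Choosing a series of cut-offs gives a series of (TPR, TNR) pairs; joining the corresponding points gives
# the ROC curve. By convention the horizontal axis is 1 - TNR, i.e. FPR (false positive rate), and the
# vertical axis is TPR. The ROC curve shows the classifier's performance clearly and makes it easy to
# compare classifiers. AUC (Area Under Curve) is the area under the ROC curve, so it cannot exceed 1.
# Because the ROC curve generally lies above the line y = x, AUC usually falls between 0.5 and 1. AUC is
# used as the criterion because the ROC curves alone often do not show clearly which classifier is
# better, whereas as a single number a larger AUC indicates a better classifier.
# First predict on the test data to obtain predicted probabilities
pre <- predict(fit2, test, type = "response")
# In R the pROC package is convenient: it makes it easy to compare two classifiers, marks the best
# cut-off automatically, and produces a good-looking plot. In the plot, the best point has specificity
# TNR = 0.845 (so FPR = 1 - TNR = 0.155) and TPR = 0.638, and the AUC is 0.8102, which indicates that
# the model predicts reasonably well.
library(pROC)
modelroc <- roc(test$y, pre)
plot(modelroc, print.auc = TRUE, auc.polygon = TRUE, grid = c(0.1, 0.2),
     grid.col = c("green", "red"), max.auc.polygon = TRUE,
     auc.polygon.col = "skyblue", print.thres = TRUE)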
```
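
To make the evaluation terms concrete, the sketch below builds a confusion matrix at one cut-off and then traces the ROC curve and AUC by hand in base R. The labels and scores are simulated, and every name here is hypothetical; it illustrates the definitions and is not a replacement for pROC.

```r
set.seed(7)
# Simulated labels and scores (hypothetical): positives tend to score higher.
label <- rbinom(1000, 1, 0.2)
score <- rnorm(1000, mean = ifelse(label == 1, 1, 0))

# Confusion matrix at a single cut-off.
cutoff <- 0.5
pred   <- as.integer(score > cutoff)
cm     <- table(actual = label, predicted = pred)
tpr    <- cm["1", "1"] / sum(cm["1", ])  # sensitivity
tnr    <- cm["0", "0"] / sum(cm["0", ])  # specificity

# Use every observed score as a cut-off to trace the ROC curve.
ths     <- sort(unique(score), decreasing = TRUE)
roc_tpr <- sapply(ths, function(t) mean(score[label == 1] >= t))
roc_fpr <- sapply(ths, function(t) mean(score[label == 0] >= t))

# Area under the curve by the trapezoid rule, starting from the point (0, 0).
fx <- c(0, roc_fpr)
fy <- c(0, roc_tpr)
auc_trap <- sum(diff(fx) * (head(fy, -1) + tail(fy, -1)) / 2)

# The same quantity from ranks (Mann-Whitney statistic).
n1 <- sum(label == 1)
n0 <- sum(label == 0)
auc_rank <- (sum(rank(score)[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

print(c(TPR = tpr, TNR = tnr, AUC = auc_trap))
```

The two AUC computations agree because, with no tied scores, the area under the step-shaped ROC curve counts exactly the share of positive-negative pairs in which the positive case scores higher.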
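# 5. WOE transformation
The weight of evidence (WOE) transformation lets a logistic regression model be expressed in standard scorecard format. Its purpose is not to improve model quality; rather, some variables should not enter the model, either because they add no value to it or because the error associated with their model coefficients is large. A standard credit scorecard can in fact be built without the WOE transformation; in that case the logistic regression has to handle a larger number of independent variables. This makes the modelling procedure more complex, but the resulting scorecard is the same.
Each variable x is replaced by WOE(x), where for each bin of x: WOE = ln[(bad / total bad) / (good / total good)]. Here "bad" and "good" are the numbers of defaulting and non-defaulting borrowers in the bin, and the totals are taken over the whole sample.
Following the logistic regression above, x1, x4 and x6 are dropped and the remaining variables are WOE-transformed.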
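
The WOE formula can be checked on a small example. The sketch below uses a made-up sample of 12 borrowers and three hypothetical age bins; none of the values come from the data set.

```r
# Toy sample (hypothetical): y = 1 marks a default, x is the variable to bin.
toy_y <- c(1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0)
toy_x <- c(22, 25, 28, 33, 37, 38, 41, 44, 52, 58, 63, 70)

# Right-closed bins (a, b], which is cut()'s default convention.
bins <- cut(toy_x, breaks = c(-Inf, 30, 45, Inf))

# Count goods (y = 0) and bads (y = 1) per bin.
counts <- table(bins, toy_y)
good   <- counts[, "0"]
bad    <- counts[, "1"]

# WOE = ln[(bad / total bad) / (good / total good)]
woe <- log((bad / sum(bad)) / (good / sum(good)))
print(round(woe, 3))
```

A positive WOE marks a bin with more than its share of defaults. A bin with no defaults or no good borrowers makes the logarithm infinite, so bins are usually chosen wide enough to contain both.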
```{r warning=FALSE}
# 1. Binning
# 1) age variable (x2):
cutx2 <- c(-Inf, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, Inf)
plot(cut(train$x2, cutx2))
# 2) NumberOfTime30-59DaysPastDueNotWorse variable (x3):
cutx3 <- c(-Inf, 0, 1, 3, 5, Inf)
plot(cut(train$x3, cutx3))
# 3) MonthlyIncome variable (x5):
cutx5 <- c(-Inf, 1000, 2000, 3000, 4000, 5000, 6000, 7500, 9500, 12000, Inf)
plot(cut(train$x5, cutx5))
# 4) NumberOfTimes90DaysLate variable (x7):
cutx7 <- c(-Inf, 0, 1, 3, 5, 10, Inf)
plot(cut(train$x7, cutx7))
# 5) NumberRealEstateLoansOrLines variable (x8):
cutx8 <- c(-Inf, 0, 1, 3, 5, Inf)
plot(cut(train$x8, cutx8))
# 6) NumberOfTime60-89DaysPastDueNotWorse variable (x9):
cutx9 <- c(-Inf, 0, 1, 3, 5, Inf)
plot(cut(train$x9, cutx9))
# 7) NumberOfDependents variable (x10):
cutx10 <- c(-Inf, 0, 1, 2, 3, 5, Inf)
plot(cut(train$x10, cutx10))
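# 2. Compute the WOE values
totalgood <- as.numeric(table(train$y))[1]
totalbad <- as.numeric(table(train$y))[2]
# Function that computes the WOE of the bin (p, q] of variable a
getWOE <- function(a, p, q) {
  in_bin <- a > p & a <= q
  # Count goods (y = 0) and bads (y = 1) explicitly, so that a bin
  # containing only one class is not miscounted
  Good <- sum(train$y[in_bin] == 0)
  Bad <- sum(train$y[in_bin] == 1)
  WOE <- log((Bad / totalbad) / (Good / totalgood), base = exp(1))
  return(WOE)
}
# For example, the age variable (x2)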
Agelessthan30.WOE<- getWOE(train$x2, -Inf, 30)
Age30to35.WOE <-getWOE(train$x2, 30, 35)
Age35to40.WOE=getWOE(train$x2,35,40)
Age40to45.WOE=getWOE(train$x2,40,45)
Age45to50.WOE=getWOE(train$x2,45,50)
Age50to55.WOE=getWOE(train$x2,50,55)
Age55to60.WOE=getWOE(train$x2,55,60)
Age60to65.WOE=getWOE(train$x2,60,65)
Age65to75.WOE=getWOE(train$x2,65,75)
Agemorethan.WOE=getWOE(train$x2,75,Inf)
(age.WOE=c(Agelessthan30.WOE,Age30to35.WOE,Age35to40.WOE,Age40to45.WOE,Age45to50.WOE,Age50to55.WOE,Age55to60.WOE,Age60to65.WOE,Age65to75.WOE,Agemorethan.WOE))
# NumberOfTime30-59DaysPastDueNotWorse variable (x3)
# ## [1] -0.5324915 0.9106018 1.7645290 2.4432903 2.5682332
NumOfTimeless0.WOE <- getWOE(train$x3, -Inf, 0)
NumOfTime0to1.WOE<- getWOE(train$x3, 0, 1)
NumOfTime1to3.WOE<- getWOE(train$x3, 1, 3)
NumOfTime3to5.WOE<- getWOE(train$x3, 3, 5)
NumOfTimethan5.WOE<- getWOE(train$x3, 5, Inf)
(NumOfTime.WOE <- c(NumOfTimeless0.WOE, NumOfTime0to1.WOE, NumOfTime1to3.WOE, NumOfTime3to5.WOE, NumOfTimethan5.WOE))
# MonthlyIncome variable (x5)
# ## [1] -1.128862326 0.448960482 0.312423080 0.350846777 0.247782295
# ## [6] 0.114417168 -0.001808106 -0.237224039 -0.389158800 -0.462438653
MonIncomeless1000.WOE<- getWOE(train$x5, -Inf, 1000)
MonIncome1000to2000.WOE<- getWOE(train$x5, 1000, 2000)
MonIncome2000to3000.WOE<- getWOE(train$x5, 2000, 3000)
MonIncome3000to4000.WOE<- getWOE(train$x5, 3000, 4000)
MonIncome4000to5000.WOE<- getWOE(train$x5, 4000, 5000)
MonIncome5000to6000.WOE<- getWOE(train$x5, 5000, 6000)
MonIncome6000to7500.WOE<- getWOE(train$x5, 6000, 7500)
MonIncome7500to9500.WOE<- getWOE(train$x5, 7500, 9500)
MonIncome9500to12000.WOE<- getWOE(train$x5, 9500, 12000)
MonIncomethan12000.WOE<- getWOE(train$x5, 12000, Inf)
(MonIncome.WOE <-c(MonIncomeless1000.WOE, MonIncome1000to2000.WOE, MonIncome2000to3000.WOE,MonIncome3000to4000.WOE, MonIncome4000to5000.WOE, MonIncome5000to6000.WOE,MonIncome6000to7500.WOE, MonIncome7500to9500.WOE, MonIncome9500to12000.WOE,MonIncomethan12000.WOE))
# NumberOfTimes90DaysLate variable (x7)
# ## [1] -0.3694044 1.9400973 2.7294448 3.3090003 3.3852925 2.3483738
NumOfTime90Dayless0.WOE<- getWOE(train$x7, -Inf, 0)
NumOfTime90Day0to1.WOE<- getWOE(train$x7, 0, 1)
NumOfTime90Day1to3.WOE<- getWOE(train$x7, 1, 3)
NumOfTime90Day3to5.WOE<- getWOE(train$x7, 3, 5)
NumOfTime90Day5to10.WOE<- getWOE(train$x7, 5, 10)
NumOfTime90Daythan10.WOE<- getWOE(train$x7, 10, Inf)
(NumOfTimeDay.WOE<- c(NumOfTime90Dayless0.WOE, NumOfTime90Day0to1.WOE,NumOfTime90Day1to3.WOE, NumOfTime90Day3to5.WOE, NumOfTime90Day5to10.WOE,NumOfTime90Daythan10.WOE))
# NumberRealEstateLoansOrLines variable (x8)
# ## [1] 0.21490691 -0.24386987 -0.15568385 0.02906876 0.41685234 1.12192809
NumRealless0.WOE<- getWOE(train$x8, -Inf, 0)
NumReal0to1.WOE<- getWOE(train$x8, 0, 1)
NumReal1to3.WOE<- getWOE(train$x8, 1, 3)
NumReal3to5.WOE<- getWOE(train$x8, 3, 5)
NumRealthan5.WOE<- getWOE(train$x8, 5, Inf)
(NumReal.WOE <-c(NumRealless0.WOE, NumReal0to1.WOE, NumReal1to3.WOE, NumReal3to5.WOE,NumRealthan5.WOE))
# NumberOfTime60-89DaysPastDueNotWorse variable (x9)
# ## [1] -0.2784605 1.8329078 2.7775343 3.5805174 3.4469860
NumOfTime6089Dayless0.WOE<- getWOE(train$x9, -Inf, 0)
NumOfTime6089Day0to1.WOE<- getWOE(train$x9, 0, 1)
NumOfTime6089Day1to3.WOE<- getWOE(train$x9, 1, 3)
NumOfTime6089Day3to5.WOE<- getWOE(train$x9, 3, 5)
NumOfTime6089Daythan5.WOE<- getWOE(train$x9, 5, Inf)
(NumOfTime6089.WOE<- c(NumOfTime6089Dayless0.WOE, NumOfTime6089Day0to1.WOE,NumOfTime6089Day1to3.WOE, NumOfTime6089Day3to5.WOE, NumOfTime6089Daythan5.WOE))
# NumberOfDependents variable (x10)
# ## [1] -0.15525081 0.08669961 0.19618098 0.33162486 0.40469824 0.76425365
NumOfDepless0.WOE<- getWOE(train$x10, -Inf, 0)
NumOfDep0to1.WOE<- getWOE(train$x10, 0, 1)
NumOfDep1to3.WOE<- getWOE(train$x10, 1, 3)
NumOfDep3to5.WOE<- getWOE(train$x10, 3, 5)
NumOfDepthan5.WOE<- getWOE(train$x10, 5, Inf)
(NumOfDep.WOE <-c(NumOfDepless0.WOE, NumOfDep0to1.WOE, NumOfDep1to3.WOE, NumOfDep3to5.WOE,NumOfDepthan5.WOE))
# 3. Apply the WOE transformation to the variables
# For example, the age variable (x2)
tmp.age <- 0
for (i in 1:nrow(train)) {
if(train$x2[i] <= 30)
tmp.age[i] <- Agelessthan30.WOE
else if(train$x2[i] <= 35)
tmp.age[i] <- Age30to35.WOE
else if(train$x2[i] <= 40)
tmp.age[i] <- Age35to40.WOE
else if(train$x2[i] <= 45)
tmp.age[i] <- Age40to45.WOE
else if(train$x2[i] <= 50)
tmp.age[i] <- Age45to50.WOE
else if(train$x2[i] <= 55)
tmp.age[i] <- Age50to55.WOE
else if(train$x2[i] <= 60)
tmp.age[i] <- Age55to60.WOE
else if(train$x2[i] <= 65)
tmp.age[i] <- Age60to65.WOE
else if(train$x2[i] <= 75)
tmp.age[i] <- Age65to75.WOE
else
tmp.age[i] <- Agemorethan.WOE
}
table(tmp.age)
tmp.age[1:10]
train$x2[1:10]
# NumberOfTime30-59DaysPastDueNotWorse variable (x3)
# ##tmp.NumberOfTime30.59DaysPastDueNotWorse
# ##-0.53249146131578 0.910601840444591 1.76452904024992 2.44329031065646
# ## 62948 8077 3160 562
# ## 2.56823323027274
# ## 118
# ## [1] 0.9106018 -0.5324915 -0.5324915 -0.5324915 -0.5324915 -0.5324915
# ## [7] -0.5324915 -0.5324915 -0.5324915 -0.5324915
# ## [1] 1 0 0 0 0 0 0 0 0 0
tmp.NumberOfTime30.59DaysPastDueNotWorse <- 0
for (i in 1:nrow(train)) {
if(train$x3[i] <= 0)
tmp.NumberOfTime30.59DaysPastDueNotWorse[i] <- NumOfTimeless0.WOE
else if(train$x3[i] <= 1)
tmp.NumberOfTime30.59DaysPastDueNotWorse[i] <- NumOfTime0to1.WOE
else if(train$x3[i] <= 3)
tmp.NumberOfTime30.59DaysPastDueNotWorse[i] <- NumOfTime1to3.WOE
else if(train$x3[i] <= 5)
tmp.NumberOfTime30.59DaysPastDueNotWorse[i] <- NumOfTime3to5.WOE
else
tmp.NumberOfTime30.59DaysPastDueNotWorse[i] <- NumOfTimethan5.WOE
}
table(tmp.NumberOfTime30.59DaysPastDueNotWorse)
# MonthlyIncome variable (x5)
# ##tmp.MonthlyIncome
# ## -1.12886232582259 -0.462438653207328 -0.389158799506996
# ## 10201 5490 5486
# ## -0.237224038650003 -0.00180810632297072 0.114417167554772
# ## 7048 8076 7249
# ## 0.247782294610166 0.312423079500641 0.350846777249291
# ## 9147 8118 9680
# ## 0.448960482499888
# ## 4370
# ## [1] 0.350846777 0.350846777 0.350846777 0.312423080 -0.001808106
# ## [6] -0.462438653 -0.237224039 0.350846777 0.312423080 -0.237224039
# ## [1] 3042 3300 3500 2500 6501 12454 8800 3280 2500 7916
tmp.MonthlyIncome<- 0
for (i in 1:nrow(train)) {
if(train$x5[i] <= 1000)
tmp.MonthlyIncome[i] <-MonIncomeless1000.WOE
else if(train$x5[i] <= 2000)
tmp.MonthlyIncome[i] <-MonIncome1000to2000.WOE
else if(train$x5[i] <= 3000)
tmp.MonthlyIncome[i] <-MonIncome2000to3000.WOE
else if(train$x5[i] <= 4000)
tmp.MonthlyIncome[i] <-MonIncome3000to4000.WOE
else if(train$x5[i] <= 5000)
tmp.MonthlyIncome[i] <-MonIncome4000to5000.WOE
else if(train$x5[i] <= 6000)
tmp.MonthlyIncome[i] <-MonIncome5000to6000.WOE
else if(train$x5[i] <= 7500)
tmp.MonthlyIncome[i] <-MonIncome6000to7500.WOE
else if(train$x5[i] <= 9500)
tmp.MonthlyIncome[i] <-MonIncome7500to9500.WOE
else if(train$x5[i] <= 12000)
tmp.MonthlyIncome[i] <-MonIncome9500to12000.WOE
else
tmp.MonthlyIncome[i] <-MonIncomethan12000.WOE
}
table(tmp.MonthlyIncome)
# NumberOfTimes90DaysLate variable (x7)
# ##tmp.NumberOfTimes90DaysLate
# ##-0.369404425455224 1.94009728631401 2.34837375415972
# ## 70793 2669 7
# ## 2.72944477623793 3.30900029985393 3.38529247382249
# ## 1093 222 81
# ## [1] 1.9400973 -0.3694044 -0.3694044 -0.3694044 -0.3694044 -0.3694044
# ## [7] -0.3694044 -0.3694044 -0.3694044 -0.3694044
# ## [1] 1 0 0 0 0 0 0 0 0 0
tmp.NumberOfTimes90DaysLate <- 0
for (i in 1:nrow(train)) {
if(train$x7[i] <= 0)
tmp.NumberOfTimes90DaysLate[i] <-NumOfTime90Dayless0.WOE
else if(train$x7[i] <= 1)
tmp.NumberOfTimes90DaysLate[i] <-NumOfTime90Day0to1.WOE
else if(train$x7[i] <= 3)
tmp.NumberOfTimes90DaysLate[i] <-NumOfTime90Day1to3.WOE
else if(train$x7[i] <= 5)
tmp.NumberOfTimes90DaysLate[i] <-NumOfTime90Day3to5.WOE
else if(train$x7[i] <= 10)
tmp.NumberOfTimes90DaysLate[i] <-NumOfTime90Day5to10.WOE
else
tmp.NumberOfTimes90DaysLate[i] <-NumOfTime90Daythan10.WOE
}
table(tmp.NumberOfTimes90DaysLate)
# NumberRealEstateLoansOrLines variable (x8)
# ##tmp.NumberRealEstateLoansOrLines
# ##-0.243869874062293 -0.155683851792327 0.0290687559545721
# ## 26150 15890 3130
# ## 0.214906905417014 1.12192809398173
# ## 27901 1794
# ## [1] 0.2149069 0.2149069 0.2149069 0.2149069 -0.1556839 -0.1556839
# ## [7] 0.2149069 -0.2438699 0.2149069 0.2149069
# ## [1] 0 0 0 0 2 2 0 1 0 0
tmp.NumberRealEstateLoansOrLines<- 0
for (i in 1:nrow(train)) {
if(train$x8[i] <= 0)
tmp.NumberRealEstateLoansOrLines[i]<- NumRealless0.WOE
else if(train$x8[i] <= 1)
tmp.NumberRealEstateLoansOrLines[i]<- NumReal0to1.WOE
else if(train$x8[i] <= 3)
tmp.NumberRealEstateLoansOrLines[i]<- NumReal1to3.WOE
else if(train$x8[i] <= 5)
tmp.NumberRealEstateLoansOrLines[i]<- NumReal3to5.WOE
else
tmp.NumberRealEstateLoansOrLines[i]<- NumRealthan5.WOE
}
table(tmp.NumberRealEstateLoansOrLines)
# NumberOfTime60-89DaysPastDueNotWorse variable (x9)
# ##tmp.NumberOfTime60.89DaysPastDueNotWorse
# ##-0.278460464730538 1.83290775083723 2.77753428092856
# ## 71150 2919 708
# ## 3.44698604282783 3.58051743545235
# ## 13 75
# ## [1] -0.2784605 -0.2784605 -0.2784605 -0.2784605 -0.2784605 -0.2784605
# ## [7] -0.2784605 -0.2784605 -0.2784605 -0.2784605
# ## [1] 0 0 0 0 0 0 0 0 0 0
tmp.NumberOfTime60.89DaysPastDueNotWorse <- 0
for (i in 1:nrow(train)) {
if(train$x9[i] <= 0)
tmp.NumberOfTime60.89DaysPastDueNotWorse[i] <-NumOfTime6089Dayless0.WOE
else if(train$x9[i] <= 1)
tmp.NumberOfTime60.89DaysPastDueNotWorse[i] <-NumOfTime6089Day0to1.WOE
else if(train$x9[i] <= 3)
tmp.NumberOfTime60.89DaysPastDueNotWorse[i] <-NumOfTime6089Day1to3.WOE
else if(train$x9[i] <= 5)
tmp.NumberOfTime60.89DaysPastDueNotWorse[i] <-NumOfTime6089Day3to5.WOE
else
tmp.NumberOfTime60.89DaysPastDueNotWorse[i] <-NumOfTime6089Daythan5.WOE
}
table(tmp.NumberOfTime60.89DaysPastDueNotWorse)
# NumberOfDependents variable (x10)
# ##tmp.NumberOfDependents
# ##-0.155250809857344 0.0866996065110081 0.196180980387687
# ## 43498 14544 10102
# ## 0.331624863227172 0.404698242905824 0.76425364970991
# ## 4771 1815 135
# ## [1] -0.1552508 -0.1552508 -0.1552508 -0.1552508 0.1961810 0.1961810
# ## [7] -0.1552508 0.1961810 -0.1552508 -0.1552508
# ## [1] 0 0 0 0 2 2 0 2 0 0
tmp.NumberOfDependents <- 0
for (i in 1:nrow(train)) {
if(train$x10[i] <= 0)
tmp.NumberOfDependents[i] <-NumOfDepless0.WOE
else if(train$x10[i] <= 1)
tmp.NumberOfDependents[i] <-NumOfDep0to1.WOE
else if(train$x10[i] <= 3)
tmp.NumberOfDependents[i] <-NumOfDep1to3.WOE
else if(train$x10[i] <= 5)
tmp.NumberOfDependents[i] <-NumOfDep3to5.WOE
else
tmp.NumberOfDependents[i] <-NumOfDepthan5.WOE
}
table(tmp.NumberOfDependents)
# 4. Build the WOE data frame:
trainWOE=cbind.data.frame(tmp.age,tmp.NumberOfTime30.59DaysPastDueNotWorse,tmp.MonthlyIncome,tmp.NumberOfTime60.89DaysPastDueNotWorse,tmp.NumberOfTimes90DaysLate,tmp.NumberRealEstateLoansOrLines,tmp.NumberOfDependents)
summary(trainWOE)
```
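
The row-by-row loops above can also be written as a vectorised lookup. The sketch below shows the idea with `cut()`; the bin edges and WOE values are hypothetical placeholders rather than the values computed from the training data.

```r
# Hypothetical bin edges and per-bin WOE values, for illustration only.
edges   <- c(-Inf, 0, 1, 3, 5, Inf)
woe_tab <- c(-0.5, 0.9, 1.8, 2.4, 2.6)

# A few raw values of a count variable such as x3.
raw <- c(0, 1, 2, 3, 4, 7, 0)

# cut() gives the bin of each value (right-closed intervals); converting the
# factor to integer yields the bin index, which is used to look up the WOE.
bin_id  <- as.integer(cut(raw, breaks = edges))
woe_vec <- woe_tab[bin_id]
print(woe_vec)
```

Because `cut()` uses right-closed intervals by default, this reproduces the `<=` comparisons of the loops while handling the whole column in one call.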
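# 6. Creating and applying the scorecard
Put simply, scoring requires reference points that are set in advance, for example:
a person whose predicted probability of "no default" is 0.8 is assigned 500 points;
another person whose predicted probability of "no default" is 0.9 is assigned 600 points.
These reference points need to be tracked and adjusted continually based on industry experience; the score settings below reflect personal experience only.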
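Now set up the scoring. Assume that good-to-bad odds of 15 correspond to 600 points and that every additional 20 points doubles the odds; p and q follow from these two conditions. If the later results do not discriminate clearly enough, the odds can be made to double every 30-50 points instead.
Score = q + p * log(odds)
which gives the equations:
600 = q + p * log(15)
620 = q + p * log(30)
so that p = 20 / log(2) and q = 600 - 20 * log(15) / log(2).
```{r warning=FALSE}
# 1. Logistic regression model: obtain p and q
# In the data "1" means default, so modelling y directly predicts the probability of default and
# log(odds) is the log of the bad-to-good odds. To match the usual convention that a higher score
# means better credit, "0" and "1" are swapped so that the model predicts the probability of no
# default and log(odds) is the log of the good-to-bad odds.
trainWOE$y <- 1 - train$y
glm.fit <- glm(y ~ ., data = trainWOE, family = binomial(link = logit))
summary(glm.fit)
coe <- glm.fit$coefficients
# p > 0, so the score rises with the good-to-bad odds and riskier bins (higher WOE) lose points.
# Note: the bin scores recorded as comments further down have the opposite sign; they appear to
# come from a run with a negative p.
p <- 20 / log(2)
q <- 600 - 20 * log(15) / log(2)
Score <- q + p * (as.numeric(coe[1]) +
  as.numeric(coe[2]) * tmp.age +
  as.numeric(coe[3]) * tmp.NumberOfTime30.59DaysPastDueNotWorse +
  as.numeric(coe[4]) * tmp.MonthlyIncome +
  as.numeric(coe[5]) * tmp.NumberOfTime60.89DaysPastDueNotWorse +
  as.numeric(coe[6]) * tmp.NumberOfTimes90DaysLate +
  as.numeric(coe[7]) * tmp.NumberRealEstateLoansOrLines +
  as.numeric(coe[8]) * tmp.NumberOfDependents)
# Total personal score = base score + score of each component
base <- q + p * as.numeric(coe[1])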
# 2. Score each variable
# 1) age variable (x2)
Agelessthan30.SCORE <- p * as.numeric(coe[2]) * Agelessthan30.WOE
Age30to35.SCORE <- p * as.numeric(coe[2]) * Age30to35.WOE
Age35to40.SCORE <- p * as.numeric(coe[2]) * Age35to40.WOE
Age40to45.SCORE <- p * as.numeric(coe[2]) * Age40to45.WOE
Age45to50.SCORE <- p * as.numeric(coe[2]) * Age45to50.WOE
Age50to55.SCORE <- p * as.numeric(coe[2]) * Age50to55.WOE
Age55to60.SCORE <- p * as.numeric(coe[2]) * Age55to60.WOE
Age60to65.SCORE <- p * as.numeric(coe[2]) * Age60to65.WOE
Age65to75.SCORE <- p * as.numeric(coe[2]) * Age65to75.WOE
Agemorethan.SCORE <- p * as.numeric(coe[2]) * Agemorethan.WOE
(Age.SCORE <- c(Agelessthan30.SCORE, Age30to35.SCORE, Age35to40.SCORE, Age40to45.SCORE, Age45to50.SCORE, Age50to55.SCORE, Age55to60.SCORE, Age60to65.SCORE, Age65to75.SCORE, Agemorethan.SCORE))
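# 2) Helper that computes the score of a bin from coefficient i and the bin's WOE x
getscore <- function(i, x) {
  score <- round(p * as.numeric(coe[i]) * x, 0)
  return(score)
}
# 3) Compute the bin scores of each variable
# age variable (x2)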
Agelessthan30.SCORE= getscore(2,Agelessthan30.WOE)
Age30to35.SCORE =getscore(2,Age30to35.WOE)
Age35to40.SCORE =getscore(2,Age35to40.WOE)
Age40to45.SCORE =getscore(2,Age40to45.WOE)
Age45to50.SCORE =getscore(2,Age45to50.WOE)
Age50to55.SCORE =getscore(2,Age50to55.WOE)
Age55to60.SCORE =getscore(2,Age55to60.WOE)
Age60to65.SCORE =getscore(2,Age60to65.WOE)
Age65to75.SCORE =getscore(2,Age65to75.WOE)
Agemorethan.SCORE =getscore(2,Agemorethan.WOE)
(Age.SCORE =c(Agelessthan30.SCORE,Age30to35.SCORE,Age35to40.SCORE,Age40to45.SCORE,Age45to50.SCORE,Age50to55.SCORE,Age55to60.SCORE,Age60to65.SCORE,Age65to75.SCORE,Agemorethan.SCORE))
# NumberOfTime30-59DaysPastDueNotWorse variable (x3)
## [1] -10 18 34 47 50
NumOfTimeless0.SCORE<- getscore(3,NumOfTimeless0.WOE)
NumOfTime0to1.SCORE<- getscore(3,NumOfTime0to1.WOE)
NumOfTime1to3.SCORE<- getscore(3,NumOfTime1to3.WOE)
NumOfTime3to5.SCORE<- getscore(3,NumOfTime3to5.WOE)
NumOfTimethan5.SCORE<- getscore(3,NumOfTimethan5.WOE)
(NumOfTime.SCORE<- c(NumOfTimeless0.SCORE, NumOfTime0to1.SCORE, NumOfTime1to3.SCORE,NumOfTime3to5.SCORE, NumOfTimethan5.SCORE))
# MonthlyIncome variable (x5)
## [1] -25 10 7 8 5 3 0 0 -9 -10
MonIncomeless1000.SCORE= getscore(4,MonIncomeless1000.WOE)
MonIncome1000to2000.SCORE= getscore(4,MonIncome1000to2000.WOE)
MonIncome2000to3000.SCORE= getscore(4,MonIncome2000to3000.WOE)
MonIncome3000to4000.SCORE= getscore(4,MonIncome3000to4000.WOE)
MonIncome4000to5000.SCORE= getscore(4,MonIncome4000to5000.WOE)
MonIncome5000to6000.SCORE= getscore(4,MonIncome5000to6000.WOE)
MonIncome6000to7500.SCORE= getscore(4,MonIncome6000to7500.WOE)
MonIncome7500to9500.SCORE= getscore(4,MonIncome7500to9500.WOE)
MonIncome9500to12000.SCORE= getscore(4,MonIncome9500to12000.WOE)
MonIncomethan12000.SCORE= getscore(4,MonIncomethan12000.WOE)
(MonIncome.SCORE<- c(MonIncomeless1000.SCORE, MonIncome1000to2000.SCORE,MonIncome2000to3000.SCORE, MonIncome3000to4000.SCORE,MonIncome4000to5000.SCORE, MonIncome5000to6000.SCORE,MonIncome6000to7500.SCORE, MonIncome7500to9500.SCORE, MonIncome9500to12000.SCORE,MonIncomethan12000.SCORE))
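# NumberOfTimes90DaysLate variable (x7)
## [1] -5 27 38 47 48 33
# The coefficient is looked up by its column name in trainWOE, because the column order
# there (x2, x3, x5, x9, x7, x8, x10) puts this variable in position 6, not 5.
NumOfTime90Dayless0.SCORE <- getscore("tmp.NumberOfTimes90DaysLate", NumOfTime90Dayless0.WOE)
NumOfTime90Day0to1.SCORE <- getscore("tmp.NumberOfTimes90DaysLate", NumOfTime90Day0to1.WOE)
NumOfTime90Day1to3.SCORE <- getscore("tmp.NumberOfTimes90DaysLate", NumOfTime90Day1to3.WOE)
NumOfTime90Day3to5.SCORE <- getscore("tmp.NumberOfTimes90DaysLate", NumOfTime90Day3to5.WOE)
NumOfTime90Day5to10.SCORE <- getscore("tmp.NumberOfTimes90DaysLate", NumOfTime90Day5to10.WOE)
NumOfTime90Daythan10.SCORE <- getscore("tmp.NumberOfTimes90DaysLate", NumOfTime90Daythan10.WOE)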
(NumOfTime90Day.SCORE<- c(NumOfTime90Dayless0.SCORE, NumOfTime90Day0to1.SCORE,NumOfTime90Day1to3.SCORE, NumOfTime90Day3to5.SCORE, NumOfTime90Day5to10.SCORE,NumOfTime90Daythan10.SCORE))
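# NumberRealEstateLoansOrLines variable (x8)
## [1] 4 -5 -3 1 8 21
# The coefficient is looked up by its column name in trainWOE, where this variable is in position 7, not 6.
NumRealless0.SCORE <- getscore("tmp.NumberRealEstateLoansOrLines", NumRealless0.WOE)
NumReal0to1.SCORE <- getscore("tmp.NumberRealEstateLoansOrLines", NumReal0to1.WOE)
NumReal1to3.SCORE <- getscore("tmp.NumberRealEstateLoansOrLines", NumReal1to3.WOE)
NumReal3to5.SCORE <- getscore("tmp.NumberRealEstateLoansOrLines", NumReal3to5.WOE)
NumRealthan5.SCORE <- getscore("tmp.NumberRealEstateLoansOrLines", NumRealthan5.WOE)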
(NumReal.SCORE <-c(NumRealless0.SCORE, NumReal0to1.SCORE, NumReal1to3.SCORE, NumReal3to5.SCORE,NumRealthan5.SCORE))
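# NumberOfTime60-89DaysPastDueNotWorse variable (x9)
## [1] -5 32 48 62 60
# The coefficient is looked up by its column name in trainWOE, where this variable is in position 5, not 7.
NumOfTime6089Dayless0.SCORE <- getscore("tmp.NumberOfTime60.89DaysPastDueNotWorse", NumOfTime6089Dayless0.WOE)
NumOfTime6089Day0to1.SCORE <- getscore("tmp.NumberOfTime60.89DaysPastDueNotWorse", NumOfTime6089Day0to1.WOE)
NumOfTime6089Day1to3.SCORE <- getscore("tmp.NumberOfTime60.89DaysPastDueNotWorse", NumOfTime6089Day1to3.WOE)
NumOfTime6089Day3to5.SCORE <- getscore("tmp.NumberOfTime60.89DaysPastDueNotWorse", NumOfTime6089Day3to5.WOE)
NumOfTime6089Daythan5.SCORE <- getscore("tmp.NumberOfTime60.89DaysPastDueNotWorse", NumOfTime6089Daythan5.WOE)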
(NumOfTime6089Day.SCORE<- c(NumOfTime6089Dayless0.SCORE, NumOfTime6089Day0to1.SCORE,NumOfTime6089Day1to3.SCORE, NumOfTime6089Day3to5.SCORE,NumOfTime6089Daythan5.SCORE))
# NumberOfDependents variable (x10)
## [1] -2 1 2 3 4 8
NumOfDepless0.SCORE= getscore(8,NumOfDepless0.WOE)
NumOfDep0to1.SCORE =getscore(8,NumOfDep0to1.WOE)
NumOfDep1to3.SCORE =getscore(8,NumOfDep1to3.WOE)
NumOfDep3to5.SCORE =getscore(8,NumOfDep3to5.WOE)
NumOfDepthan5.SCORE= getscore(8,NumOfDepthan5.WOE)
(NumOfDept.SCORE<- c(NumOfDepless0.SCORE, NumOfDep0to1.SCORE, NumOfDep1to3.SCORE,NumOfDep3to5.SCORE, NumOfDepthan5.SCORE))
```
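
The scaling rule of this section (600 points at good-to-bad odds of 15, and 20 more points each time the odds double) can be checked numerically. The sketch below is a minimal illustration; `pdo`, `factor_`, `offset_` and `score_from_odds` are hypothetical names introduced here and are not taken from the code above.

```r
pdo        <- 20   # points needed to double the odds
base_score <- 600  # score assigned at the reference odds
base_odds  <- 15   # reference good-to-bad odds

# Score = offset_ + factor_ * ln(odds)
factor_ <- pdo / log(2)
offset_ <- base_score - factor_ * log(base_odds)

score_from_odds <- function(odds) offset_ + factor_ * log(odds)

s15 <- score_from_odds(15)   # reference point
s30 <- score_from_odds(30)   # odds doubled
s7  <- score_from_odds(7.5)  # odds halved
print(c(s15, s30, s7))

# Converting predicted probabilities of "no default" into scores.
prob_good <- c(0.80, 0.90, 0.95)
scores    <- round(score_from_odds(prob_good / (1 - prob_good)))
print(scores)
```

Doubling the odds moves the score up by exactly `pdo` points and halving them moves it down by the same amount, which is the property the scorecard scaling is designed to have.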