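Background:
Lending Club was founded in 2006 and is headquartered in San Francisco. Its main business is operating a platform that acts as an intermediary for P2P loans. In its early years the company offered only personal loans. When a borrower applies through the platform, Lending Club has them fill in an application form, online or offline, to collect basic information, and supplements it with data from third-party credit bureaus.
A logistic regression model is fitted on these attributes to predict whether an applicant will default, and that prediction decides whether the loan is granted.
Data source: loan data for 2007 to 2011 from the LendingClub website: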
https://www.lendingclub.com/statistics/additional-statistics?
Import packages and load the dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
%matplotlib inline
warnings.filterwarnings('ignore')
plt.style.use('ggplot')
Loandata = pd.read_csv('C:/Users/Jason/Desktop/DAdata/LoanStats3a_securev2.csv',skiprows=1)
I. A first look at the dataset
Loandata.shape
(39786, 150)
Each row is one loan record, with 150 fields. The fields are listed below:
Loandata.iloc[0]
This displays every field of the first record.
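II. Preprocessing before the visual analysis
1. Drop columns that contain only a single value
orig_columns = Loandata.columns
drop_columns = []
for col in orig_columns:
    col_series = Loandata[col].dropna().unique()  # distinct non-null values
    if len(col_series) == 1:  # a single distinct value carries no information, so drop the column
        drop_columns.append(col)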
Loandata = Loandata.drop(drop_columns, axis=1)
print(drop_columns)
['pymnt_plan', 'out_prncp', 'next_pymnt_d', 'collections_12_mths_ex_med', 'mths_since_last_major_derog', 'application_type', 'verification_status_joint', 'acc_now_delinq', 'bc_util', 'chargeoff_within_12_mths', 'delinq_amnt', 'percent_bc_gt_75', 'tax_liens', 'sec_app_mths_since_last_major_derog', 'hardship_flag', 'hardship_last_payment_amount']
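The same filter can be written without an explicit loop. The following is a minimal sketch on a made-up toy frame (the column names `a`, `b`, `c` are illustrative and not from the dataset), relying on the fact that `DataFrame.nunique()` ignores NaN by default:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns, standing in for Loandata
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": ["x", "x", np.nan],   # one distinct non-null value
    "c": [0.5, 0.5, 0.7],
})

# nunique() skips NaN by default, like dropna().unique() in the loop
n_unique = df.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
df = df.drop(columns=constant_cols)
print(constant_cols)
```

As in the loop, a column that is entirely NaN has zero distinct values and is left for the missing-value step to remove.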
2. Drop fields in which more than half of the values are missing
half_count = len(Loandata)/2
Loandata = Loandata.dropna(thresh=half_count,axis=1)
Loandata.shape
(39786, 50)
50 fields remain.
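`dropna(thresh=...)` keeps a column only if it has at least `thresh` non-null values. An equivalent way to express "more than half missing" is to compute the missing fraction per column directly. A small sketch on toy data (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame: one column is 75% missing, the other 25% missing
df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "mostly_present": [1.0, 2.0, np.nan, 4.0],
})

# Fraction of missing values per column
missing_ratio = df.isnull().mean()

# Keep columns whose missing fraction does not exceed one half
kept = df.loc[:, missing_ratio <= 0.5]
print(missing_ratio)
print(list(kept.columns))
```

Printing `missing_ratio` first makes it easy to see how close each column is to the cut-off before anything is dropped.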
Loandata.isnull().sum()
Check which fields contain null values:
id 0
loan_amnt 0
funded_amnt 0
funded_amnt_inv 0
term 0
int_rate 0
installment 0
grade 0
sub_grade 0
emp_title 2467
emp_length 1078
home_ownership 0
annual_inc 0
verification_status 0
issue_d 0
loan_status 0
url 0
desc 12967
purpose 0
title 11
zip_code 0
addr_state 0
dti 0
delinq_2yrs 0
earliest_cr_line 0
fico_range_low 0
fico_range_high 0
inq_last_6mths 0
open_acc 0
pub_rec 0
revol_bal 0
revol_util 50
total_acc 0
initial_list_status 0
out_prncp_inv 0
total_pymnt 0
total_pymnt_inv 0
total_rec_prncp 0
total_rec_int 0
total_rec_late_fee 0
recoveries 0
collection_recovery_fee 0
last_pymnt_d 71
last_pymnt_amnt 1
last_credit_pull_d 2
last_fico_range_high 0
last_fico_range_low 0
policy_code 0
pub_rec_bankruptcies 697
debt_settlement_flag 1
dtype: int64
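Free-text columns with many nulls, such as desc and emp_title, do not help the analysis or the model, so they are dropped. title and the identifier-like columns id, url, and zip_code are dropped along with them.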
Loandata = Loandata.drop(['id','url','desc','title','emp_title','zip_code'],axis=1)
Loandata.isnull().sum()
loan_amnt 0
funded_amnt 0
funded_amnt_inv 0
term 0
int_rate 0
installment 0
grade 0
sub_grade 0
emp_length 1078
home_ownership 0
annual_inc 0
verification_status 0
issue_d 0
loan_status 0
purpose 0
addr_state 0
dti 0
delinq_2yrs 0
earliest_cr_line 0
fico_range_low 0
fico_range_high 0
inq_last_6mths 0
open_acc 0
pub_rec 0
revol_bal 0
revol_util 50
total_acc 0
initial_list_status 0
out_prncp_inv 0
total_pymnt 0
total_pymnt_inv 0
total_rec_prncp 0
total_rec_int 0
total_rec_late_fee 0
recoveries 0
collection_recovery_fee 0
last_pymnt_d 71
last_pymnt_amnt 1
last_credit_pull_d 2
last_fico_range_high 0
last_fico_range_low 0
policy_code 0
pub_rec_bankruptcies 697
debt_settlement_flag 1
dtype: int64
# Encode emp_length with a hand-written mapping (not sklearn's LabelEncoder).
# Missing values are mapped to 0, the same code as "< 1 year".
label_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        None: 0
    }
}
Loandata = Loandata.replace(label_dict)
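An alternative that does not depend on how `replace` treats a `None` key is `Series.map` followed by `fillna`. This is a sketch on a toy Series, not the author's code. The name `emp_map` is introduced here for illustration:

```python
import numpy as np
import pandas as pd

# Same mapping as above, built programmatically
emp_map = {"< 1 year": 0, "1 year": 1, "10+ years": 10}
emp_map.update({f"{n} years": n for n in range(2, 10)})

# Toy values standing in for Loandata["emp_length"]
s = pd.Series(["10+ years", "< 1 year", "3 years", np.nan, "1 year"])

# Unmapped entries (including NaN) become NaN, then are filled with 0
encoded = s.map(emp_map).fillna(0).astype(int)
print(encoded.tolist())
```

Note that `map` turns any string missing from the dictionary into NaN as well, so a typo in the raw data would silently become 0 after `fillna`.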
3. Convert the issue_d column from string to datetime, check whether the conversion produced any nulls, then sort chronologically
Loandata['issue_d'] = pd.to_datetime(Loandata['issue_d'])
Loandata['issue_d'].isnull().any()
-->> False
# Sort by issue date
Loandata = Loandata.sort_values(by=['issue_d'],ascending=True)
Loandata = Loandata.reset_index(drop=True)
Clean the interest-rate field: strip the percent sign and convert to float
Loandata["int_rate"] = Loandata["int_rate"].str.rstrip("%").astype("float")
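III. A preliminary look at the data
1. States with the most borrowers:
Loandata.addr_state.value_counts()[:20].plot(kind='bar', figsize=(8, 4),title='State Loan Count')
California has far more loans than any other state, likely because Lending Club is headquartered there and has developed its local market more deeply. New York, Florida, and Texas follow.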
2. Bad-loan rate
Loandata['loan_status'].value_counts()
-->>Fully Paid 34116
Charged Off 5670
Name: loan_status, dtype: int64
# Encode repayment status: 0 = bad loan (Charged Off), 1 = good loan
badloan = ['Charged Off']
Loandata['loan_condition'] = np.nan
def loan_condition(status):
    if status in badloan:
        return 0
    else:
        return 1
Loandata['loan_condition'] = Loandata['loan_status'].apply(loan_condition)
print('goodloan 1: badloan 0')
print(Loandata['loan_condition'].value_counts())
-->>goodloan 1: badloan 0
1 34116
0 5670
Name: loan_condition, dtype: int64
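The counts above give the bad-loan rate directly. A short sketch using the two numbers from the output (on the real frame, `Loandata['loan_condition'].value_counts(normalize=True)` would give the same fractions):

```python
# Counts copied from the value_counts() output above
charged_off = 5670
fully_paid = 34116

# Share of loans that were charged off
bad_rate = charged_off / (charged_off + fully_paid)
print(f"{bad_rate:.2%}")
```

Roughly one loan in seven was charged off, so the two classes are clearly imbalanced, which matters for the modeling later on.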
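3. Loans issued per year
Loandata['year'] = Loandata['issue_d'].dt.year
sns.countplot(x='year', data=Loandata)
plt.title('Loan Count by Year', fontsize=10)
The number of loans rises every year. The total amount lent rises as well, though that is not visible in this chart, which only counts loans.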
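A countplot shows the number of loans, not dollar volume. To check both trends, one option is to aggregate `loan_amnt` by year. The sketch below uses a tiny made-up frame with the same two column names, not the real data:

```python
import pandas as pd

# Toy data standing in for Loandata[["year", "loan_amnt"]]
toy = pd.DataFrame({
    "year": [2007, 2007, 2008, 2008, 2008],
    "loan_amnt": [5000, 7500, 3000, 12000, 10000],
})

# Number of loans and total amount lent, per year
per_year = toy.groupby("year")["loan_amnt"].agg(["count", "sum"])
print(per_year)
```

On the real frame, `per_year["sum"].plot(kind="bar")` would give the amount chart that the countplot cannot.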
4. Loan amounts and terms chosen by borrowers
plt.hist(Loandata.loan_amnt,bins=10,edgecolor='white',color='dodgerblue')
Loandata['term'].value_counts()
-->> 36 months 29096
60 months 10690
Name: term, dtype: int64
Loans of roughly 4,000 to 12,000 are the most common, and most borrowers choose a 36-month term.
5. Interest-rate range
print(Loandata.int_rate.describe())
sns.distplot(Loandata.int_rate)
-->>count 39786.000000
mean 12.027873
std 3.727466
min 5.420000
25% 9.250000
50% 11.860000
75% 14.590000
max 24.590000
Name: int_rate, dtype: float64
The mean interest rate is about 12%, and rates range from 5.42% to 24.59%.
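IV. With the preliminary analysis done, we move on to modeling. Before that, the data needs more processing: fields that contribute little to the model are dropped to reduce computation, and because scikit-learn does not accept string data, missing values, punctuation, percent signs, and string values all have to be handled.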
Loandata.columns
-->>Index(['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate',
'installment', 'grade', 'sub_grade', 'emp_length', 'home_ownership',
'annual_inc', 'verification_status', 'issue_d', 'loan_status',
'purpose', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line',
'fico_range_low', 'fico_range_high', 'inq_last_6mths', 'open_acc',
'pub_rec', 'revol_bal', 'revol_util', 'total_acc',
'initial_list_status', 'out_prncp_inv', 'total_pymnt',
'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
'last_pymnt_d', 'last_pymnt_amnt', 'last_credit_pull_d',
'last_fico_range_high', 'last_fico_range_low', 'policy_code',
'pub_rec_bankruptcies', 'debt_settlement_flag', 'loan_condition',
'year'],
dtype='object')
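Quite a few fields remain. In real work, deciding which fields to keep is a substantial task in its own right. Here I simply drop fields that are of no use for modeling, such as the principal received to date and the funded amount. Several of them (payments received, recoveries, last payment date) only become known after the loan has been issued, so they would leak the outcome into the model.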
Loandata = Loandata.drop(["funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "issue_d"], axis=1)
Loandata = Loandata.drop(["out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)
Loandata = Loandata.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
Loandata.head(1)
-->>
loan_amnt term int_rate installment emp_length home_ownership annual_inc verification_status loan_status purpose ... total_acc initial_list_status last_credit_pull_d last_fico_range_high last_fico_range_low policy_code pub_rec_bankruptcies debt_settlement_flag loan_condition year
0 7500 36 months 13.75 255.43 0 OWN 22000.0 Not Verified Fully Paid debt_consolidation ... 8 f 20-Jan 719 715 1 NaN N 1 2007
31 fields remain.
null_counts = Loandata.isnull().sum()
null_counts
-->>
loan_amnt 0
term 0
int_rate 0
installment 0
emp_length 0
home_ownership 0
annual_inc 0
verification_status 0
loan_status 0
purpose 0
addr_state 0
dti 0
delinq_2yrs 0
earliest_cr_line 0
fico_range_low 0
fico_range_high 0
inq_last_6mths 0
open_acc 0
pub_rec 0
revol_bal 0
revol_util 50
total_acc 0
initial_list_status 0
last_credit_pull_d 2
last_fico_range_high 0
last_fico_range_low 0
policy_code 0
pub_rec_bankruptcies 697
debt_settlement_flag 1
loan_condition 0
year 0
dtype: int64
revol_util: strip the percent sign and convert to float
Loandata["revol_util"] = Loandata["revol_util"].str.rstrip("%").astype("float")
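Few values are missing, so dropping the affected rows does no harm. They could also be filled with the maximum, minimum, mean, and so on. pub_rec_bankruptcies, which has the most gaps, is dropped as a whole column first.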
Loandata = Loandata.drop("pub_rec_bankruptcies", axis=1)
Loandata = Loandata.dropna(axis=0)
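# last_credit_pull_d (a date string) and year are dropped as well; the 22-column table below contains neither
Loandata = Loandata.drop(['debt_settlement_flag', 'policy_code','initial_list_status','earliest_cr_line','addr_state','loan_status','last_credit_pull_d','year'],axis=1)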
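For comparison, here is what filling instead of dropping could look like. This is a sketch of the mean-fill option mentioned above, applied to a toy column. The values are made up and the choice of the mean is only one of the listed options:

```python
import numpy as np
import pandas as pd

# Toy column standing in for Loandata["revol_util"] after the % sign is stripped
toy = pd.DataFrame({"revol_util": [51.5, np.nan, 10.2, 75.8]})

# Mean of the observed values (NaN is skipped by default)
mean_util = toy["revol_util"].mean()

# Fill the gap instead of dropping the row
toy["revol_util"] = toy["revol_util"].fillna(mean_util)
print(toy)
```

With only 50 of nearly 40,000 rows affected, either approach should have little effect on the model. Filling matters more when the missing share is large.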
Label-encode the remaining string fields
import sklearn.preprocessing as sp
lbe = sp.LabelEncoder()
Loandata['home_ownership'] = lbe.fit_transform(Loandata['home_ownership'])
lbe = sp.LabelEncoder()
Loandata['verification_status'] = lbe.fit_transform(Loandata['verification_status'])
lbe = sp.LabelEncoder()
Loandata['purpose'] = lbe.fit_transform(Loandata['purpose'])
lbe = sp.LabelEncoder()
Loandata['term'] = lbe.fit_transform(Loandata['term'])
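To see which integer each category receives, the fitted encoder can be inspected. `LabelEncoder` sorts the distinct values and numbers them from 0. A small sketch on toy values (not the full column):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A few home_ownership-style values for illustration
toy = pd.Series(["RENT", "OWN", "MORTGAGE", "RENT", "OTHER"])

encoder = LabelEncoder()
codes = encoder.fit_transform(toy)

# classes_ holds the sorted categories; their positions are the codes
mapping = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))
print(mapping)
print(list(codes))
```

These codes impose an arbitrary order on the categories (MORTGAGE < OTHER < OWN < RENT here). For a linear model such as logistic regression, one-hot encoding, for example with `pd.get_dummies`, is a common alternative that avoids that implied order.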
Convert the remaining numeric fields to int
Loandata['total_acc'] = Loandata['total_acc'].astype('int64')
Loandata['revol_bal'] = Loandata['revol_bal'].astype('int64')
Loandata['delinq_2yrs'] = Loandata['delinq_2yrs'].astype('int64')
Loandata.head()
-->>
loan_amnt term int_rate installment emp_length home_ownership annual_inc verification_status purpose dti ... fico_range_high inq_last_6mths open_acc pub_rec revol_bal revol_util total_acc last_fico_range_high last_fico_range_low loan_condition
0 7500 0 13.75 255.43 0 3 22000.0 0 2 14.29 ... 664 0 7 0 4175 51.5 8 719 715 1
1 3500 0 10.28 113.39 0 4 20000.0 0 8 1.50 ... 684 0 17 0 1882 32.4 18 829 825 1
2 5750 0 7.43 178.69 10 0 125000.0 0 2 0.27 ... 794 0 10 0 2817 10.2 16 799 795 1
3 5000 0 7.43 155.38 6 4 40000.0 0 0 2.55 ... 774 2 4 0 2562 14.0 7 729 725 1
4 1200 0 11.54 39.60 0 4 20000.0 0 1 2.04 ... 664 2 3 0 1153 75.8 4 704 700 1
5 rows × 22 columns
Data cleaning is done. The remaining 22 fields are used for training. Save the clean data and read it back:
Loandata.to_csv("C:/Users/Jason/Desktop/CleanLoanData.csv", index=False)
Loandata=pd.read_csv("C:/Users/Jason/Desktop/CleanLoanData.csv")
V. Predicting borrower default with logistic regression
5.1
import sklearn.linear_model as lm
model = lm.LogisticRegression()
cols = Loandata.columns
train_cols = cols.drop('loan_condition')
x = Loandata[train_cols]
y = Loandata['loan_condition']
model.fit(x,y)
predict = model.predict(x)
predict[:10]
-->>array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
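0 means the loan was not repaid and 1 means it was. Predicting repayment this often looks suspicious, so let's look at the probabilities the model assigns.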
model.predict_proba(x)
-->>
array([[0.03725216, 0.96274784],
[0.00711186, 0.99288814],
[0.02119685, 0.97880315],
...,
[0.18928953, 0.81071047],
[0.04177887, 0.95822113],
[0.06569009, 0.93430991]])
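The two columns are P(class 0) and P(class 1). For a binary `LogisticRegression`, `predict` effectively labels a row 1 when the second column exceeds 0.5. The sketch below checks this on synthetic, imbalanced data generated with `make_classification`. The data is a stand-in and not the loan set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data: about 15% class 0, 85% class 1
X, y = make_classification(n_samples=500, n_features=5,
                           weights=[0.15, 0.85], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)
proba_repaid = proba[:, 1]                      # column 1 = P(class 1)
labels_from_proba = (proba_repaid > 0.5).astype(int)

# The thresholded probabilities reproduce predict()
same = np.array_equal(labels_from_proba, clf.predict(X))
print(same)
```

When most rows belong to class 1, most probabilities sit well above 0.5, which is why nearly every prediction comes out as 1.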
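5.2 Wait: how should we judge whether the model is any good? Think about the business. We lend to people who can repay and earn about 10% on each loan. If one borrower in ten defaults, we lose 100% of that loan, so it takes ten correct approvals to make up for one bad one. Plain accuracy is therefore a poor metric for this model. To maximize profit we want to approve as many good borrowers as possible while approving as few defaulters as possible, so we track two metrics: a high TPR (True Positive Rate, i.e. recall), where TPR = TP / (TP + FN), and a low FPR (False Positive Rate), where FPR = FP / (FP + TN).
Actual  Predicted  Profit/loss  Outcome
0       1          -1000        FP
1       1          100          TP
1       0          0            FN
0       0          0            TN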
fp_ = (predict ==1) & (Loandata['loan_condition']==0)
fp = len(predict[fp_])
print(fp)
tp_ = (predict ==1) & (Loandata['loan_condition']==1)
tp = len(predict[tp_])
print(tp)
fn_ = (predict == 0) & (Loandata["loan_condition"] == 1)
fn = len(predict[fn_])
print(fn)
tn_ = (predict ==0) & (Loandata['loan_condition']==0)
tn = len(predict[tn_])
print(tn)
-->>
4414
33118
962
1239
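The four counts can also be obtained in one call with `sklearn.metrics.confusion_matrix`. A sketch on a handful of made-up labels, passing `labels=[0, 1]` so that the flattened order is tn, fp, fn, tp:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = repaid, 0 = charged off
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 1, 1])

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

tpr = tp / (tp + fn)   # share of good loans that were approved
fpr = fp / (fp + tn)   # share of bad loans that were approved
print(tn, fp, fn, tp, tpr, fpr)
```

On the real data, `confusion_matrix(y, predict, labels=[0, 1])` would replace the four boolean masks above.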
5.3 Build a confusion matrix from 10-fold cross-validated predictions
import sklearn.model_selection as sm
model = lm.LogisticRegression()
predict = sm.cross_val_predict(model,x,y,cv=10)
predict = pd.Series(predict)
predict[:100]
-->>
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
11 1
12 1
13 1
14 1
15 1
16 1
17 1
18 1
19 1
20 1
21 1
22 1
23 1
24 1
25 1
26 1
27 1
28 1
29 1
..
70 1
71 1
72 1
73 1
74 1
75 1
76 1
77 1
78 1
79 1
80 1
81 1
82 0
83 0
84 1
85 1
86 1
87 1
88 1
89 1
90 1
91 1
92 1
93 1
94 1
95 0
96 1
97 1
98 1
99 1
Length: 100, dtype: int64
fp_ = (predict ==1) & (Loandata['loan_condition']==0)
fp = len(predict[fp_])
print(fp)
tp_ = (predict ==1) & (Loandata['loan_condition']==1)
tp = len(predict[tp_])
print(tp)
fn_ = (predict == 0) & (Loandata["loan_condition"] == 1)
fn = len(predict[fn_])
print(fn)
tn_ = (predict ==0) & (Loandata['loan_condition']==0)
tn = len(predict[tn_])
print(tn)
-->>
4420
33127
953
1233
tpr = tp/float((tp+fn))
fpr = fp/float((fp+tn))
print(tpr)
print(fpr)
-->>
0.9720363849765258
0.781885724394127
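5.4 Both TPR and FPR are very high, which is not what we want: the model approves almost everyone, including about 78% of the borrowers who went on to default. Because the two classes are heavily imbalanced, the next step is to retrain with adjusted class weights, here scikit-learn's built-in class_weight='balanced'.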
model = lm.LogisticRegression(class_weight='balanced')
predict = sm.cross_val_predict(model,x,y,cv=10)
predict = pd.Series(predict)
fp_ = (predict ==1) & (Loandata['loan_condition']==0)
fp = len(predict[fp_])
print(fp)
tp_ = (predict ==1) & (Loandata['loan_condition']==1)
tp = len(predict[tp_])
print(tp)
tn_ = (predict ==0) & (Loandata['loan_condition']==0)
tn = len(predict[tn_])
print(tn)
fn_ = (predict == 0) & (Loandata["loan_condition"] == 1)
fn = len(predict[fn_])
print(fn)
tpr = tp/float((tp+fn))
fpr = fp/float((fp+tn))
print(tpr)
print(fpr)
-->>
1517
26393
4136
7687
0.7744424882629108
0.26835308685653636
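For reference, `class_weight='balanced'` sets each class weight to n_samples / (n_classes * class_count). The sketch below reproduces this with `compute_class_weight`, using class sizes reconstructed from the counts in 5.3 (bad loans = fp + tn, good loans = tp + fn):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class sizes reconstructed from the 5.3 counts
n_bad = 4420 + 1233     # actual 0 = fp + tn
n_good = 33127 + 953    # actual 1 = tp + fn
y = np.array([0] * n_bad + [1] * n_good)

# Weights as scikit-learn computes them for 'balanced'
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)

# The same formula written out by hand
manual = len(y) / (2 * np.bincount(y))
print(weights, weights[0] / weights[1])
```

The bad-loan class ends up weighted about six times as heavily as the good-loan class, which is simply the ratio of the two class sizes.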
5.5 Custom class weights
penalty = {
0:6,
1:1
}
model = lm.LogisticRegression(class_weight=penalty)
predict = sm.cross_val_predict(model,x,y,cv=10)
predict = pd.Series(predict)
fp_ = (predict ==1) & (Loandata['loan_condition']==0)
fp = len(predict[fp_])
print(fp)
tp_ = (predict ==1) & (Loandata['loan_condition']==1)
tp = len(predict[tp_])
print(tp)
tn_ = (predict ==0) & (Loandata['loan_condition']==0)
tn = len(predict[tn_])
print(tn)
fn_ = (predict == 0) & (Loandata["loan_condition"] == 1)
fn = len(predict[fn_])
print(fn)
tpr = tp/float((tp+fn))
fpr = fp/float((fp+tn))
print(tpr)
print(fpr)
-->>
1521
26382
4132
7698
0.7741197183098592
0.2690606757473908
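To tie the results back to the profit table in 5.2, the sketch below applies its illustrative payoffs (+100 per TP, -1000 per FP, 0 otherwise) to the cross-validated counts reported in 5.3, 5.4, and 5.5. The payoffs are the stylized numbers from that table, not real loan economics, and the `profit` helper is introduced here for illustration:

```python
# Stylized payoffs from the table in 5.2
GAIN_PER_TP = 100
LOSS_PER_FP = 1000

# tp / fp counts copied from the outputs of 5.3, 5.4 and 5.5
results = {
    "default (5.3)": {"tp": 33127, "fp": 4420},
    "balanced (5.4)": {"tp": 26393, "fp": 1517},
    "custom 6:1 (5.5)": {"tp": 26382, "fp": 1521},
}

def profit(tp, fp):
    """Total payoff: gains on repaid loans minus losses on defaults."""
    return tp * GAIN_PER_TP - fp * LOSS_PER_FP

profits = {name: profit(**counts) for name, counts in results.items()}
for name, value in profits.items():
    print(f"{name}: {value:,}")
```

Under these assumptions the unweighted model loses money despite its higher TPR, while both weighted models come out positive and nearly identical. That is consistent with the custom 6:1 weights being close to the class ratio that 'balanced' uses.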