Data Mining Project: Financial Anti-Fraud

    • Preface
    • I. Obtaining the Dataset
    • II. Feature Engineering
      • 1. Reading the Data
      • 2. Stripping Special Characters
      • 3. Dropping Columns
      • 4. Extracting the Label
    • III. Building the Model

Preface

This project comes from a Beifeng (北风网) course. The modeling itself is simple; this post records the process and summarizes the general workflow.

I. Obtaining the Dataset

https://www.lendingclub.com/info/demand-and-credit-profile.action

II. Feature Engineering

A note up front: the feature-processing techniques used in this project are very simple, yet the results still meet business requirements.

1. Reading the Data

import pandas as pd

df = pd.read_csv("./data/LoanStats_2016Q2.csv", skiprows=1, low_memory=False)  # skip the first row; low_memory=False disables chunked parsing so dtypes are inferred consistently
print(df.info(verbose=True, null_counts=True))

You can see that the dataset has 97856 samples with 145 attributes each. Since this is an open dataset, the data is quite "dirty": many columns are mostly null, and many are highly correlated with one another.

2. Stripping Special Characters

"Special characters" here refers to the symbols that appear in the following columns of this dataset:

print(df["int_rate"])
>>0         14.49%
1         13.49%
2         13.99%
3         19.99%
4          8.59%
5          9.49%
6          9.49%
7         15.59%
8         24.49%
9         15.59%
10        10.99%
……
print(df["revol_util"])
>>0        26.4%
1        74.6%
2          61%
3        20.6%
4        43.9%
5        58.9%
6        34.2%
7        56.7%
8        36.9%
9         9.3%
10       28.2%
print(df["emp_length"])
>>0          8 years
1          2 years
2        10+ years
3          9 years
4          3 years
5        10+ years
6          9 years
7          7 years
8          2 years
9        10+ years
10         4 years
# Use regex to strip the special characters from these columns, then cast them to numeric
# (the original replaced with " " and never converted, which leaves them as strings)
df["term"] = df["term"].replace(to_replace="[^0-9]+", value="", regex=True).astype(float)
df["int_rate"] = df["int_rate"].replace("%", value="", regex=True).astype(float)
df["revol_util"] = df["revol_util"].replace("%", value="", regex=True).astype(float)
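The emp_length column is printed above but never converted in the original code. A minimal sketch of one way to map it to numbers, assuming the usual LendingClub categories such as "10+ years" and "< 1 year" (the mapping choice is an assumption, not from the original post):

```python
import pandas as pd

def emp_length_to_num(s: pd.Series) -> pd.Series:
    """Map '10+ years' -> 10.0, '< 1 year' -> 0.0, '8 years' -> 8.0."""
    return (
        s.replace("< 1 year", "0 year")          # normalize the special category first
         .str.extract(r"(\d+)", expand=False)    # pull out the leading digits
         .astype(float)
    )

# toy example
col = pd.Series(["8 years", "10+ years", "< 1 year", "2 years"])
print(emp_length_to_num(col).tolist())  # [8.0, 10.0, 0.0, 2.0]
```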

3. Dropping Columns

A column should be dropped if it falls into any of these cases:

  1. too many nulls
  2. low variance
  3. high correlation with another column
  4. no practical meaning

1. Use the following to find the columns with many nulls:

## drop columns that are entirely null
df.dropna(axis=1, inplace=True, how="all")  # how="any" drops a column containing any null; how="all" drops it only if every value is null
print(df.info(verbose=True, null_counts=True))  # check the null count of each column; columns with too many nulls can be dropped
## drop rows that are entirely null
df.dropna(axis=0, inplace=True, how="all")  # same how semantics, applied to rows

Here you need to choose a threshold: drop a column if its fraction of missing values exceeds the threshold, otherwise keep it.
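Picking that threshold can be applied mechanically; a minimal sketch (the 0.5 cutoff and the toy columns are assumptions, not from the original post):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "mostly_null": [np.nan, np.nan, np.nan, 1.0],  # 75% null -> drop
    "ok":          [1.0, 2.0, np.nan, 4.0],        # 25% null -> keep
})
null_ratio = df_demo.isnull().mean()  # fraction of nulls per column
df_demo = df_demo.drop(columns=null_ratio[null_ratio > 0.5].index)
print(list(df_demo.columns))  # ['ok']
```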

2. Dropping low-variance columns

print(df["emp_title"].value_counts())  # no discriminative power, so drop it
df.drop("emp_title", axis=1, inplace=True)

Alternatively, VarianceThreshold from sklearn.feature_selection can compute the variance of every numeric column.
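A sketch of that alternative on a toy frame (the threshold value is an assumption; the default of 0.0 only removes constant columns):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df_num = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0],  # zero variance -> removed
    "varied":   [1.0, 5.0, 9.0, 2.0],
})
selector = VarianceThreshold(threshold=0.0)
kept = df_num.columns[selector.fit(df_num).get_support()]
print(list(kept))  # ['varied']
```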

You can also take a "heuristic" approach and drop columns whose number of distinct values is very small:

for col in df.select_dtypes(include=['float']).columns:
    print("col {} has {}".format(col, len(df[col].unique())))
# From the output, the following columns have relatively few distinct values:
#inq_last_6mths has 6
#pub_rec has 19
#collections_12_mths_ex_med has 8
#policy_code has 1
#acc_now_delinq has 5
#open_acc_6m has 17
#open_act_il has 40
#open_il_12m has 17
#open_il_24m has 26
#open_rv_12m has 21
#open_rv_24m has 34
#inq_fi has 21
#total_cu_tl has 42
#inq_last_12m has 31
#acc_open_past_24mths has 42
#chargeoff_within_12_mths has 8
#mort_acc has 23
#mths_since_recent_inq has 27
#num_accts_ever_120_pd has 31
#num_actv_bc_tl has 31
#num_actv_rev_tl has 44
#num_bc_sats has 42
#num_bc_tl has 53
#num_il_tl has 84
#num_op_rev_tl has 57
#num_rev_accts has 77
#num_rev_tl_bal_gt_0 has 40
#num_sats has 64
#num_tl_120dpd_2m has 5
#num_tl_30dpd has 5
#num_tl_90g_dpd_24m has 19
#num_tl_op_past_12m has 25
#pub_rec_bankruptcies has 9
#tax_liens has 19
#settlement_term has 28
# selectively drop the columns above (those with < 10 distinct values, plus a few related ones)
# drop the float columns with few distinct values (duplicate entries in the original list removed)
df.drop([
         "inq_last_6mths", "collections_12_mths_ex_med", "policy_code", "acc_now_delinq",
         "chargeoff_within_12_mths", "num_tl_120dpd_2m", "num_tl_30dpd", "pub_rec_bankruptcies",
         "delinq_2yrs", "mths_since_last_delinq", "mths_since_last_record", "open_acc",
         "pub_rec", "total_acc", "out_prncp", "out_prncp_inv", "tax_liens"
         ], axis=1, inplace=True)

Using the same method, inspect the object-typed columns:

## after the floats, now the object columns
for col in df.select_dtypes(include = ['object']).columns:
    print("col {} has {}".format(col,len(df[col].unique())))

#col term has 2
#col int_rate has 74
#col grade has 7
#col emp_length has 11
#col home_ownership has 3
#col verification_status has 3
#col issue_d has 3
#col loan_status has 7
#col pymnt_plan has 2
#col desc has 7
#col purpose has 12
#col title has 13
#col zip_code has 875
#col addr_state has 50
#col earliest_cr_line has 612
#col revol_util has 1085
#col initial_list_status has 2
#col last_pymnt_d has 29
#col next_pymnt_d has 3
#col last_credit_pull_d has 30
#col application_type has 2
#col verification_status_joint has 2
#col hardship_flag has 2
#col disbursement_method has 2
#col debt_settlement_flag has 2
#col debt_settlement_flag_date has 21
#col settlement_status has 4
#col settlement_date has 23

df.drop(["term", "grade", "home_ownership", "verification_status",
         "issue_d", "pymnt_plan", "purpose", "desc", "initial_list_status", "next_pymnt_d",
         "application_type", "verification_status_joint", "hardship_flag",
         "disbursement_method", "debt_settlement_flag"], axis=1, inplace=True)

3. Dropping highly correlated columns

## correlation
import numpy as np

cor = df.corr()
cor.iloc[:, :] = np.tril(cor, k=-1)  # keep only the strictly lower triangle so each pair appears once
cor = cor.stack()
print(cor[(cor > 0.95) | (cor < -0.95)])  # be conservative: only flag |correlation| above 0.95
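The snippet above only prints the highly correlated pairs; a sketch that also drops one column from each flagged pair (the column names and 0.95 cutoff here are toy assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df_demo = pd.DataFrame({"a": a, "a_copy": a * 2.0, "b": rng.normal(size=100)})

cor = df_demo.corr()
cor.iloc[:, :] = np.tril(cor, k=-1)   # strictly lower triangle: each pair once
pairs = cor.stack()
# for every pair over the cutoff, drop the row-side column
to_drop = {row for row, _ in pairs[pairs.abs() > 0.95].index}
df_demo = df_demo.drop(columns=list(to_drop))
print(sorted(df_demo.columns))  # ['a', 'b']
```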

4. Columns with no practical meaning

df.drop(["id", "member_id"], axis=1, inplace=True)
# grade and sub_grade are linearly related, so drop one of them
df.drop("sub_grade", axis=1, inplace=True)
df.drop(["title"], axis=1, inplace=True)

4. Extracting the Label

## binarize the label
# print(df.info())
df.loan_status.replace("Fully Paid", value=1, inplace=True)
df.loan_status.replace("Charged Off", value=0, inplace=True)
df.loan_status.replace("Current", value=np.nan, inplace=True)
df.loan_status.replace("Late (31-120 days)", value=np.nan, inplace=True)
df.loan_status.replace("In Grace Period", value=np.nan, inplace=True)
df.loan_status.replace("Late (16-30 days)", value=np.nan, inplace=True)
df.loan_status.replace("Default", value=np.nan, inplace=True)

## drop rows whose label is null
df.dropna(subset=["loan_status"], inplace=True)
Y = df.loan_status
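The seven replace calls above can be condensed into a single map: statuses not in the dict become NaN automatically, which matches the original intent.

```python
import pandas as pd

# toy slice of the loan_status column
status = pd.Series(["Fully Paid", "Charged Off", "Current", "Late (31-120 days)"])
label = status.map({"Fully Paid": 1, "Charged Off": 0})  # unlisted statuses -> NaN
print(label.tolist())  # [1.0, 0.0, nan, nan]
```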

III. Building the Model
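The training code below assumes every remaining feature is numeric. If any object-typed columns survive the drops above (addr_state, for instance, is never dropped), they would need encoding first; a sketch with pd.get_dummies on toy data (column names here are illustrative assumptions):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "addr_state": ["CA", "NY", "CA"],
    "loan_amnt":  [1000, 2000, 1500],
})
# one-hot encode the categorical column: one indicator column per category
df_demo = pd.get_dummies(df_demo, columns=["addr_state"])
print(sorted(df_demo.columns))  # ['addr_state_CA', 'addr_state_NY', 'loan_amnt']
```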

import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X = df.drop("loan_status", axis=1, inplace=False)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=2019)
scaler = StandardScaler()          # optional; fit on the training set only to avoid leakage
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)  # reuse the training-set statistics on the test set
lr = LogisticRegression()
start = time.time()

lr.fit(x_train, y_train)
train_predict = lr.predict(x_train)
# sklearn metrics take (y_true, y_pred) in that order
train_f1 = metrics.f1_score(y_train, train_predict)
train_acc = metrics.accuracy_score(y_train, train_predict)
train_rec = metrics.recall_score(y_train, train_predict)
print("Logistic regression results:")
print("f1_score on the training set: {}".format(train_f1))
print("accuracy on the training set: {}".format(train_acc))
print("recall on the training set: {}".format(train_rec))

test_predict = lr.predict(x_test)
test_f1 = metrics.f1_score(y_test, test_predict)
test_acc = metrics.accuracy_score(y_test, test_predict)
test_rec = metrics.recall_score(y_test, test_predict)

print("f1_score on the test set: {}".format(test_f1))
print("accuracy on the test set: {}".format(test_acc))
print("recall on the test set: {}".format(test_rec))

end = time.time()
print(end - start)

rf = RandomForestClassifier()
start = time.time()
rf.fit(x_train, y_train)
train_predict = rf.predict(x_train)
print("=" * 100)
print("Random forest results:")
train_f1 = metrics.f1_score(y_train, train_predict)
train_acc = metrics.accuracy_score(y_train, train_predict)
train_rec = metrics.recall_score(y_train, train_predict)

test_predict = rf.predict(x_test)
test_f1 = metrics.f1_score(y_test, test_predict)
test_acc = metrics.accuracy_score(y_test, test_predict)
test_rec = metrics.recall_score(y_test, test_predict)

print("f1_score on the training set: {}".format(train_f1))
print("accuracy on the training set: {}".format(train_acc))
print("recall on the training set: {}".format(train_rec))
print("f1_score on the test set: {}".format(test_f1))
print("accuracy on the test set: {}".format(test_acc))
print("recall on the test set: {}".format(test_rec))

feature_importance = rf.feature_importances_
index = np.argsort(feature_importance)[-10:]  # indices of the 10 most important features
plt.barh(np.arange(10), feature_importance[index], color="dodgerblue")
plt.yticks(np.arange(10), np.array(X.columns)[index])  # one tick per bar (the original np.arange(10 + 0.25) yields 11 ticks)
plt.xlabel("relative importance")
plt.title("Top 10 Importance Variable")
plt.show()

Results:

Logistic regression results:
f1_score on the training set: 0.9991685321394307
accuracy on the training set: 0.9987865588052272
recall on the training set: 0.998338445807771
f1_score on the test set: 0.9992096423631693
accuracy on the test set: 0.9988384754990925
recall on the test set: 0.9984205330700888
2.0413224697113037
====================================================================================================
Random forest results:
f1_score on the training set: 0.9998292932741549
accuracy on the training set: 0.9997510889856877
recall on the training set: 0.9998719644914856
f1_score on the test set: 0.9928557253423298
accuracy on the test set: 0.989546279491833
recall on the test set: 0.9964150567616012
