该项目来自北风网,模型搭建很简单,该篇记录过程总结套路。
https://www.lendingclub.com/info/demand-and-credit-profile.action
首先声明,该项目使用到的特征处理手段十分简单,但结果却能达到商业要求。
df = pd.read_csv("./data/LoanStats_2016Q2.csv",skiprows = 1,low_memory = False)#不要第一行,low_memory降内存
print(df.info(verbose = True,null_counts = True))
开一看到该数据集有97856个样本,每个样本有145个属性,由于该数据集是开源的,数据非常“脏”,存在大量的空值属性,相关属性。
这里的特殊字符特指出现在该数据集里的如下属性:
print(df["int_rate"])
>>0 14.49%
1 13.49%
2 13.99%
3 19.99%
4 8.59%
5 9.49%
6 9.49%
7 15.59%
8 24.49%
9 15.59%
10 10.99%
……
print(df["revol_util"])
>>0 26.4%
1 74.6%
2 61%
3 20.6%
4 43.9%
5 58.9%
6 34.2%
7 56.7%
8 36.9%
9 9.3%
10 28.2%
print(df["emp_length"])
>>0 8 years
1 2 years
2 10+ years
3 9 years
4 3 years
5 10+ years
6 9 years
7 7 years
8 2 years
9 10+ years
10 4 years
#使用正则将上述含有特殊字符的属性转成数值型
df["term"].replace(to_replace= "[^0-9]+",value = " ",inplace=True,regex = True )
df["int_rate"].replace("%",value=" ",inplace = True,regex= True)
df["revol_util"].replace("%",value=" ",inplace = True,regex= True)
这里需要删除的属性应属于以下几种情况:
1、使用如下挑选出空值较多的属性:
##删掉全部为空值的列
df.dropna(axis = 1,inplace = True,how = "all")#how有两个属性:any是有一个空值就把该列删除,all是该列都为空值就删除
print(df.info(verbose=True,null_counts=True))#使用该API来查看每一列的空值个数,如果空值个数过多,则可以删除掉该列
##删掉全部为空值的行
df.dropna(axis = 0,inplace = True,how = "all")#how有两个属性:any是有一个空值就把该列删除,all是该列都为空值就删除
这里需要选择一个阈值,空缺值大于该阈值的属性删除,否则保留。
2、删除方差小的属性
#使用
print(df["emp_title"].value_counts())#发现没有区分度,所以删除
df.drop("emp_title",axis = True,inplace = True)
当然也可以使用sklearn.feature_selection中的VarianceThreshold来计算每个数值属性的方差。
也可以使用“启发式”方式来drop掉变量个数较单一的属性:
for col in df.select_dtypes(include = ['float']).columns:
print("col {} has {}".format(col,len(df[col].unique())))
#经过上面的结果,可以发现:一下几列是分类比较单一的
#inq_last_6mths 6
#pub_rec 19
#collections_12_mths_ex_med 8
#policy_code 1
#acc_now_delinq 5
#open_acc_6m 17
#open_act_il 40
#open_il_12m 17
#open_il_24m 26
#open_rv_12m 21
#open_rv_24m 34
#inq_fi 21
#total_cu_tl 42
#inq_last_12m 31
#acc_open_past_24mths 42
#chargeoff_within_12_mths 8
#mort_acc 23
#mths_since_recent_inq 27
#num_accts_ever_120_pd 31
#num_actv_bc_tl 31
#num_actv_rev_tl has 44
#num_bc_sats has 42
#num_bc_tl has 53
#num_il_tl has 84
#num_op_rev_tl has 57
#num_rev_accts has 77
#num_rev_tl_bal_gt_0 has 40
#num_sats has 64
#num_tl_120dpd_2m has 5
#num_tl_30dpd has 5
#num_tl_90g_dpd_24m has 19
#num_tl_op_past_12m has 25
#pub_rec_bankruptcies has 9
#tax_liens has 19
#settlement_term has 28
#选择性的删除上面的列(<10的)
#删除float中,重复较少的列
df.drop([
"inq_last_6mths","collections_12_mths_ex_med","policy_code","acc_now_delinq",
"chargeoff_within_12_mths","num_tl_120dpd_2m","num_tl_30dpd","pub_rec_bankruptcies",
"delinq_2yrs","mths_since_last_delinq","inq_last_6mths","mths_since_last_record","open_acc",
"pub_rec","total_acc","out_prncp","out_prncp_inv","collections_12_mths_ex_med","policy_code",
"acc_now_delinq","chargeoff_within_12_mths",'pub_rec_bankruptcies',"tax_liens"
],axis = 1 ,inplace = True)
使用上面的方法,找那些object类型的属性
##找完float,现在找object
for col in df.select_dtypes(include = ['object']).columns:
print("col {} has {}".format(col,len(df[col].unique())))
#col term has 2
#col int_rate has 74
#col grade has 7
#col emp_length has 11
#col home_ownership has 3
#col verification_status has 3
#col issue_d has 3
#col loan_status has 7
#col pymnt_plan has 2
#col desc has 7
#col purpose has 12
#col title has 13
#col zip_code has 875
#col addr_state has 50
#col earliest_cr_line has 612
#col revol_util has 1085
#col initial_list_status has 2
#col last_pymnt_d has 29
#col next_pymnt_d has 3
#col last_credit_pull_d has 30
#col application_type has 2
#col verification_status_joint has 2
#col hardship_flag has 2
#col disbursement_method has 2
#col debt_settlement_flag has 2
#col debt_settlement_flag_date has 21
#col settlement_status has 4
#col settlement_date has 23
df.drop(["term","grade","home_ownership","verification_status",
"issue_d","pymnt_plan","purpose","desc","initial_list_status","next_pymnt_d","application_type","application_type","verification_status_joint","hardship_flag",
"disbursement_method","debt_settlement_flag"],axis = 1 ,inplace = True)
3、删除相关性较高的列
##相关性
cor= df.corr()
cor.iloc[:,:] = np.tril(cor,k = -1)
cor = cor.stack()
print(cor[(cor > 0.95) | (cor < -0.95)])#保守一些,删掉0.95以上的
4、没有实际意义的属性
df.drop(["id","member_id"],axis = 1 ,inplace= True)
#删除掉grade或者sub_grade中的一列,这两列线性相关
df.drop("sub_grade",axis = 1,inplace= True)
df.drop(["title"],axis = 1 ,inplace= True)
##标签二值化
#print(df.info())
df.loan_status.replace("Fully Paid",value = 1,inplace = True)
df.loan_status.replace("Charged Off",value = 0,inplace = True)
df.loan_status.replace("Current",value = np.nan,inplace = True)
df.loan_status.replace("Late (31-120 days)",value = np.nan,inplace = True)
df.loan_status.replace("In Grace Period",value = np.nan,inplace = True)
df.loan_status.replace("Late (16-30 days)",value = np.nan,inplace = True)
df.loan_status.replace("Default",value = np.nan,inplace = True)
##删除标签为空值的实例
df.dropna(subset = ["loan_status"],inplace = True)
Y = df.loan_status
X = df.drop("loan_status",axis = 1,inplace = False)
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size = 0.3,random_state=2019)
x_train = StandardScaler().fit_transform(x_train)#可选
x_test = StandardScaler().fit_transform(x_test)#可选
lr = LogisticRegression()
start = time.time()
lr.fit(x_train,y_train)
train_predict = lr.predict(x_train)
train_f1 = metrics.f1_score(train_predict,y_train)
train_acc = metrics.accuracy_score(train_predict,y_train)
train_rec = metrics.recall_score(train_predict,y_train)
print("逻辑回归的效果如下")
print("在训练集上f1_score的值为:{}".format(train_f1))
print("在训练集上精确率的值为:{}".format(train_acc))
print("在训练集上查全率的值为:{}".format(train_rec))
test_predict = lr.predict(x_test)
test_f1 = metrics.f1_score(test_predict,y_test)
test_acc = metrics.accuracy_score(test_predict,y_test)
test_rec = metrics.recall_score(test_predict,y_test)
print("在测试集上f1_score的值为:{}".format(test_f1))
print("在测试集上精确率的值为:{}".format(test_acc))
print("在测试集上查全率的值为:{}".format(test_rec))
end = time.time()
print(end - start)
rf = RandomForestClassifier()
start = time.time()
rf.fit(x_train,y_train)
train_predict = rf.predict(x_train)
print("=" * 100)
print("随机森林的效果如下")
train_f1 = metrics.f1_score(train_predict,y_train)
train_acc = metrics.accuracy_score(train_predict,y_train)
train_rec = metrics.recall_score(train_predict,y_train)
test_predict = rf.predict(x_test)
test_f1 = metrics.f1_score(test_predict,y_test)
test_acc = metrics.accuracy_score(test_predict,y_test)
test_rec = metrics.recall_score(test_predict,y_test)
print("在训练集上f1_score的值为:{}".format(train_f1))
print("在训练集上精确率的值为:{}".format(train_acc))
print("在训练集上查全率的值为:{}".format(train_rec))
print("在测试集上f1_score的值为:{}".format(test_f1))
print("在测试集上精确率的值为:{}".format(test_acc))
print("在测试集上查全率的值为:{}".format(test_rec))
feature_importance = rf.feature_importances_
index = np.argsort(feature_importance)[-10:]
plt.barh(np.arange(10),feature_importance[index],color = "dodgerblue")
plt.yticks(np.arange(10+0.25),np.array(X.columns)[index])
plt.xlabel("relative importance")
plt.title("Top 10 Importance Variable")
plt.show()
结果:
逻辑回归的效果如下
在训练集上f1_score的值为:0.9991685321394307
在训练集上精确率的值为:0.9987865588052272
在训练集上查全率的值为:0.998338445807771
在测试集上f1_score的值为:0.9992096423631693
在测试集上精确率的值为:0.9988384754990925
在测试集上查全率的值为:0.9984205330700888
2.0413224697113037
====================================================================================================
随机森林的效果如下
在训练集上f1_score的值为:0.9998292932741549
在训练集上精确率的值为:0.9997510889856877
在训练集上查全率的值为:0.9998719644914856
在测试集上f1_score的值为:0.9928557253423298
在测试集上精确率的值为:0.989546279491833
在测试集上查全率的值为:0.9964150567616012