Contents
Competition Background
I. Data Preprocessing
II. Feature Engineering
(1) Features from the tag table
(2) Features from the trd table
(3) Features from the beh table
(4) Joining the tables
III. Modeling and Prediction
IV. Summary
The organizers provide two data sets (a training set and a scoring set) containing user profile tags, transaction records from the past 60 days, and APP behavior logs from the past 30 days. Contestants are asked to build a credit-default prediction model on the training set through effective feature extraction, apply it to the scoring set, and output a default probability for every user in the scoring set.
(1) Inspect the data distribution. Concatenate the train and test portions of the tag table so the feature engineering below is applied to both at once; the other tables are handled the same way.
import pandas as pd

# Load the tag table
train_tag = pd.read_csv('train_data/train_tag.csv')
test_tag = pd.read_csv('test_data/test_tag.csv')
# Concatenate train and test so features are built on both at once
df_tag = pd.concat([train_tag, test_tag], axis=0, ignore_index=True)
# Get a rough picture of each feature's distribution
for feature in list(df_tag):
    print(feature + ' has the following distribution:')
    print('{} has {} distinct values'.format(feature, df_tag[feature].nunique()))
    print(df_tag[feature].value_counts())
(2) Check for missing values. The two education-related features and the credit-card repayment-type feature have a large share of missing values, so they are dropped.
df_tag.isnull().sum()  # inspect missing values
df_tag = df_tag.drop(['edu_deg_cd', 'deg_cd', 'atdd_type'], axis=1)  # drop the three features with many missing values
(3) Fill the placeholder string '\N'. For some features '\N' appears only about 500 times and may carry a special meaning, so those are filled with -1; the rest are filled with 0. String features are then integer-encoded.
# Fill '\N' in numeric features: based on the value distributions, the features
# listed below are filled with 0, and all remaining features with -1
col_0 = ['job_year','frs_agn_dt_cnt','fin_rsk_ases_grd_cd','confirm_rsk_ases_lvl_typ_cd',
         'tot_ast_lvl_cd','pot_ast_lvl_cd','hld_crd_card_grd_cd','l1y_crd_card_csm_amt_dlm_cd',
         'l12mon_buy_fin_mng_whl_tms','l12_mon_fnd_buy_whl_tms','l12_mon_insu_buy_whl_tms','l12_mon_gld_buy_whl_tms']
for col in col_0:
    df_tag[col] = df_tag[col].replace('\\N', '0')
df_tag.replace('\\N', -1, inplace=True)
# Integer-encode the categorical string features
str_col = ['gdr_cd', 'mrg_situ_cd', 'acdm_deg_cd']
for col in str_col:
    df_tag[col] = pd.factorize(df_tag[col])[0]
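A quick aside on why the string features are factorized on the concatenated frame rather than on train and test separately: the integer codes stay consistent across both splits, and missing values receive the sentinel -1. A tiny illustrative demo:

import numpy as np
import pandas as pd

# pd.factorize assigns codes in order of first appearance; NaN maps to -1
codes, uniques = pd.factorize(pd.Series(['M', 'F', 'M', np.nan]))
print(codes)    # [ 0  1  0 -1]
print(uniques)  # Index(['M', 'F'], dtype='object')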
(4) Bin and clip the features, and mean-encode 'age' and 'job_year' against the target.
# Cap features whose distributions are extremely unbalanced
df_tag['l12mon_buy_fin_mng_whl_tms'] = df_tag['l12mon_buy_fin_mng_whl_tms'].astype('int').map(lambda x:x if x<5 else 5)
df_tag['l12_mon_fnd_buy_whl_tms'] = df_tag['l12_mon_fnd_buy_whl_tms'].astype('int').map(lambda x:x if x<6 else 6)
df_tag['l12_mon_insu_buy_whl_tms'] = df_tag['l12_mon_insu_buy_whl_tms'].astype('int').map(lambda x: x if x==0 else 1)
df_tag['l12_mon_gld_buy_whl_tms'] = df_tag['l12_mon_gld_buy_whl_tms'].astype('int').map(lambda x:x if x==0 else 1)
df_tag['age'] = df_tag['age'].astype('int').map(lambda x:x if x<60 else 60)
df_tag['fin_rsk_ases_grd_cd'] = df_tag['fin_rsk_ases_grd_cd'].astype('int').map(lambda x:x if x<=5 else 7)
df_tag['confirm_rsk_ases_lvl_typ_cd'] = df_tag['confirm_rsk_ases_lvl_typ_cd'].astype('int').map(lambda x:x if x<=5 else 6)
df_tag['ovd_30d_loan_tot_cnt'] = df_tag['ovd_30d_loan_tot_cnt'].astype('int').map(lambda x:x if x==0 else 1)
df_tag['his_lng_ovd_day'] = df_tag['his_lng_ovd_day'].astype('int').map(lambda x: x if x<2 else 2)
df_tag['cur_debit_cnt'] = df_tag['cur_debit_cnt'].astype('int').map(lambda x:x if x<10 else 10)
df_tag['cur_credit_cnt'] = df_tag['cur_credit_cnt'].astype('int').map(lambda x: x if x<6 else 6)
# Bin 'job_year'
def group_bin(x):
    if x < 20:
        return x
    elif 20 <= x < 25:
        return 23
    elif 25 <= x < 30:
        return 27
    else:
        return 30
df_tag['job_year'] = df_tag['job_year'].astype('int').map(group_bin)
# Mean (target) encoding
df_tag['job_year_mean_code'] = df_tag.groupby('job_year')['flag'].transform('mean')
df_tag['age_mean_code'] = df_tag.groupby('age')['flag'].transform('mean')
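A tiny self-contained demo (illustrative only) of why this mean encoding is safe on the concatenated frame: transform('mean') skips NaN targets, so test rows (flag = NaN) receive group means computed from the train rows alone.

import numpy as np
import pandas as pd

demo = pd.DataFrame({'age':  [30, 30, 40, 40],
                     'flag': [1.0, 0.0, 1.0, np.nan]})  # last row plays the "test" role
print(demo.groupby('age')['flag'].transform('mean'))
# 0    0.5
# 1    0.5
# 2    1.0
# 3    1.0   <- the "test" row gets the train-only mean for age 40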
Multiply card-tenure, card-count, and card-grade features pairwise; the goal is to widen the gap between users.
# Credit card: holding days * grade
df_tag['credit_level1'] = df_tag['cur_credit_min_opn_dt_cnt'] * pd.to_numeric(df_tag['l1y_crd_card_csm_amt_dlm_cd'])
df_tag['credit_level2'] = df_tag['cur_credit_min_opn_dt_cnt'] * pd.to_numeric(df_tag['perm_crd_lmt_cd'])
df_tag['credit_level3'] = df_tag['cur_credit_min_opn_dt_cnt'] * pd.to_numeric(df_tag['hld_crd_card_grd_cd'])
df_tag['level_level'] = pd.to_numeric(df_tag['l1y_crd_card_csm_amt_dlm_cd']) * pd.to_numeric(df_tag['perm_crd_lmt_cd'])  # grade * grade
df_tag['credit_amount'] = df_tag['cur_credit_min_opn_dt_cnt'] * df_tag['cur_credit_cnt']  # holding days * number of cards
# Credit card: number of cards * grade
df_tag['amount_level1'] = df_tag['cur_credit_cnt'] * pd.to_numeric(df_tag['perm_crd_lmt_cd'])
df_tag['amount_level2'] = df_tag['cur_credit_cnt'] * pd.to_numeric(df_tag['l1y_crd_card_csm_amt_dlm_cd'])
df_tag['amount_level3'] = df_tag['cur_credit_cnt'] * pd.to_numeric(df_tag['hld_crd_card_grd_cd'])
df_tag.shape
This is the most important table in the competition: it provides transaction category, transaction amount, and transaction timestamp, from which a large number of statistics can be derived. Using the RFM model (Recency, Frequency, Monetary) as a guide, statistics on the most recent transactions, transaction frequency, and transaction amounts noticeably improve the model.
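The original never shows how df_trd is assembled; a minimal sketch, assuming the trd files follow the same naming pattern as the tag files above:

# Assumed file names, mirroring the tag table; concatenate train and test so
# the time features below are built on both at once
train_trd = pd.read_csv('train_data/train_trd.csv')
test_trd = pd.read_csv('test_data/test_trd.csv')
df_trd = pd.concat([train_trd, test_trd], axis=0, ignore_index=True)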
df_trd['trx_tm'] = pd.to_datetime(df_trd['trx_tm'])
df_trd['trx_month'] = df_trd['trx_tm'].dt.month  # transaction month
df_trd['trx_day'] = df_trd['trx_tm'].apply(lambda x: x.strftime('%Y-%m-%d'))  # transaction date
df_trd['week_no'] = df_trd['trx_tm'].map(lambda x: x.isocalendar()[1])  # ISO week number
df_trd['last_trx_tm'] = df_trd.groupby('id')['trx_tm'].transform('max')  # each user's last transaction time
df_trd['last_day'] = df_trd.groupby('id')['trx_day'].transform('max')  # each user's last transaction date
df_trd['last_week'] = df_trd.groupby('id')['week_no'].transform('max')  # each user's last transaction week
print('df_trd is done')
# Holiday flag: holidays are 0, workdays are 1
# (assuming the business_calendar package: pip install business_calendar)
from business_calendar import Calendar, MO, TU, WE, TH, FR
cal = Calendar(workdays=[MO, TU, WE, TH, FR],
               holidays=['2019-06-07', '2019-06-08', '2019-06-09',
                         '2019-05-01', '2019-05-02', '2019-05-03', '2019-05-04'])
df_trd['is_weekday'] = df_trd['trx_day'].map(lambda x: 1 if cal.isbusday(x) else 0)
# t1: per-user net amount, count, min, max, std, and mean of transactions
t0 = df_trd[['id', 'flag']].drop_duplicates().reset_index(drop=True)
t1 = df_trd.groupby('id', as_index=False)['cny_trx_amt'].agg(
    trx_sum='sum', trx_count='count', trx_min='min', trx_max='max', trx_std='std', trx_mean='mean')
t2 = df_trd.groupby('id', as_index=False)['trx_day'].agg(trx_days='nunique')  # number of active days
# t3: net amount of the very last transaction
t3 = df_trd[df_trd['trx_tm'] == df_trd['last_trx_tm']].groupby('id', as_index=False)['cny_trx_amt'].agg(last_amt='sum')
# t4: last-day net amount, count, min, max, std, and mean
t4 = df_trd[df_trd['trx_day'] == df_trd['last_day']].groupby('id', as_index=False)['cny_trx_amt'].agg(
    last_day_sum='sum', last_day_count='count', last_day_min='min',
    last_day_max='max', last_day_mean='mean', last_day_std='std')
# t5: last-week net amount, count, min, max, std, and mean
t5 = df_trd[df_trd['week_no'] == df_trd['last_week']].groupby('id', as_index=False)['cny_trx_amt'].agg(
    last_week_sum='sum', last_week_count='count', last_week_min='min',
    last_week_max='max', last_week_mean='mean', last_week_std='std')
merge1 = pd.merge(t0, t1, on='id', how='left')
merge2 = pd.merge(merge1, t2, on='id', how='left')
merge3 = pd.merge(merge2, t3, on='id', how='left')
merge4 = pd.merge(merge3, t4, on='id', how='left')
merge5 = pd.merge(merge4, t5, on='id', how='left')
t = merge5
t.shape
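The chain of merge1…merge5 intermediates can be written more compactly with functools.reduce; an equivalent stylistic sketch:

from functools import reduce
# Left-join every partial table onto the id/flag skeleton in one pass
t = reduce(lambda left, right: pd.merge(left, right, on='id', how='left'),
           [t0, t1, t2, t3, t4, t5])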
# Detailed per-category statistics by transaction direction
df_trd[['is_weekday', 'last_week', 'Trx_Cod1_Cd', 'Trx_Cod2_Cd', 'trx_month']] = df_trd[
    ['is_weekday', 'last_week', 'Trx_Cod1_Cd', 'Trx_Cod2_Cd', 'trx_month']].astype('str')
cols = ['Dat_Flg1_Cd']
for col in cols:
    for fea in list(df_trd[col].unique()):
        t1 = df_trd[df_trd[col] == fea].groupby('id', as_index=False)['cny_trx_amt'].agg(
            **{col+'_'+fea+'_sum': 'sum', col+'_'+fea+'_count': 'count', col+'_'+fea+'_min': 'min',
               col+'_'+fea+'_max': 'max', col+'_'+fea+'_std': 'std', col+'_'+fea+'_mean': 'mean'})
        t2 = df_trd[df_trd[col] == fea].groupby('id', as_index=False)['trx_day'].agg(
            **{col+'_'+fea+'_days': 'nunique'})
        merge1 = pd.merge(t, t1, on='id', how='left')
        merge2 = pd.merge(merge1, t2, on='id', how='left')
        t = merge2
    print(col + ' is done')
# By transaction month, workday flag, payment channel, and level-1 category code
cols = ['trx_month', 'is_weekday', 'Dat_Flg3_Cd', 'Trx_Cod1_Cd']
for col in cols:
    for fea in list(df_trd[col].unique()):
        # amount statistics over all transactions
        t1 = df_trd[df_trd[col] == fea].groupby('id', as_index=False)['cny_trx_amt'].agg(
            **{col+'_'+fea+'_sum': 'sum', col+'_'+fea+'_count': 'count', col+'_'+fea+'_min': 'min',
               col+'_'+fea+'_max': 'max', col+'_'+fea+'_std': 'std', col+'_'+fea+'_mean': 'mean'})
        # number of active days
        t2 = df_trd[df_trd[col] == fea].groupby('id', as_index=False)['trx_day'].agg(
            **{col+'_'+fea+'_days': 'nunique'})
        # income (positive amount) statistics
        t3 = df_trd[(df_trd[col] == fea) & (df_trd['cny_trx_amt'] > 0)].groupby('id', as_index=False)['cny_trx_amt'].agg(
            **{col+'_'+fea+'_income_sum': 'sum', col+'_'+fea+'_income_count': 'count', col+'_'+fea+'_income_min': 'min',
               col+'_'+fea+'_income_max': 'max', col+'_'+fea+'_income_std': 'std', col+'_'+fea+'_income_mean': 'mean'})
        # spending (non-positive amount) statistics
        t4 = df_trd[(df_trd[col] == fea) & (df_trd['cny_trx_amt'] <= 0)].groupby('id', as_index=False)['cny_trx_amt'].agg(
            **{col+'_'+fea+'_pay_sum': 'sum', col+'_'+fea+'_pay_count': 'count', col+'_'+fea+'_pay_min': 'min',
               col+'_'+fea+'_pay_max': 'max', col+'_'+fea+'_pay_std': 'std', col+'_'+fea+'_pay_mean': 'mean'})
        merge1 = pd.merge(t, t1, on='id', how='left')
        merge2 = pd.merge(merge1, t2, on='id', how='left')
        merge3 = pd.merge(merge2, t3, on='id', how='left')
        merge4 = pd.merge(merge3, t4, on='id', how='left')
        t = merge4
    print(col + ' is done')
# The level-2 category code takes too many values; only count transactions per category
col = 'Trx_Cod2_Cd'
for fea in list(df_trd[col].unique()):
    t1 = df_trd[df_trd[col] == fea].groupby('id', as_index=False)['cny_trx_amt'].agg(**{col+'_'+fea+'_count': 'count'})
    t = pd.merge(t, t1, on='id', how='left')
# Mean transaction count per level-2 category code
t0 = df_trd.groupby(['id', 'Trx_Cod2_Cd'], as_index=False)['cny_trx_amt'].agg(Cod2_count='count')
t1 = t0.groupby(['id'], as_index=False)['Cod2_count'].agg(mean_Cod2_count='mean')
t = pd.merge(t, t1, on='id', how='left')
print(col + ' is done')
t.shape
t['day_trx_amt'] = t['trx_sum'] / t['trx_days']  # net amount per active day
t['one_trx_amt'] = t['trx_count'] / t['trx_days']  # transactions per active day
t['total_trx_amt'] = t['Dat_Flg1_Cd_C_sum'] - t['Dat_Flg1_Cd_B_sum']  # total turnover (income plus absolute spending)
t['income_ratio'] = t['Dat_Flg1_Cd_C_sum'] / t['total_trx_amt']  # income share of turnover
t['expend_ratio'] = -t['Dat_Flg1_Cd_B_sum'] / t['total_trx_amt']  # spending share of turnover
t['day_income_amt'] = t['Dat_Flg1_Cd_C_sum'] / t['Dat_Flg1_Cd_C_days']  # income per active income day
t['day_expend_amt'] = -t['Dat_Flg1_Cd_B_sum'] / t['Dat_Flg1_Cd_B_days']  # spending per active spending day
This table covers too few users: no matter how it is processed, joining it to the tag table leaves large numbers of missing values, so only a handful of statistics are computed.
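As with df_trd, the loading step is not shown in the original; a minimal sketch under the same assumed naming pattern:

# Assumed file names, mirroring the other two tables
train_beh = pd.read_csv('train_data/train_beh.csv')
test_beh = pd.read_csv('test_data/test_beh.csv')
df_beh = pd.concat([train_beh, test_beh], axis=0, ignore_index=True)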
df_beh['visit_tm'] = pd.to_datetime(df_beh['Unnamed: 3'])  # the visit timestamp arrives in an unnamed column
df_beh['visit_day'] = df_beh['visit_tm'].dt.day  # day of month of the visit
df_beh['last_visit_day'] = df_beh.groupby('id')['visit_day'].transform('max')  # each user's last visit day
df_beh['last_visit_tm'] = df_beh.groupby('id')['visit_tm'].transform('max')  # each user's last visit time
df_beh.head()
t0 = df_beh[['id', 'flag']].drop_duplicates().reset_index(drop=True)
t1 = df_beh.groupby('id', as_index=False)['page_no'].agg(page_count='count', page_unique='nunique')  # visits and distinct pages
t2 = df_beh[df_beh['visit_day'] == df_beh['last_visit_day']].groupby('id', as_index=False)['page_no'].agg(
    last_page_cnt='count', last_page_unique='nunique')  # visits and distinct pages on the last day
t3 = df_beh.groupby('id', as_index=False)['visit_day'].agg(day_unique='nunique')  # number of active days
merge1 = pd.merge(t0, t1, on='id', how='left')
merge2 = pd.merge(merge1, t2, on='id', how='left')
merge3 = pd.merge(merge2, t3, on='id', how='left')
t = merge3
# Visit counts broken down by page
col = 'page_no'
for fea in list(df_beh[col].unique()):
    t1 = df_beh[df_beh[col] == fea].groupby('id', as_index=False)['page_no'].agg(**{col+'_'+fea+'_count': 'count'})
    t = pd.merge(t, t1, on='id', how='left')
print(col + ' is done')
Join the tag, trd, and beh tables, then split back into training and test sets. The joins produce many missing values, which are filled with 0.
# Join the three tables (df_tag1, df_trd1, df_beh1 are presumably the per-table feature frames built above)
temp = pd.merge(df_tag1, df_trd1, on=['id', 'flag'], how='left')
df = pd.merge(temp, df_beh1, on=['id', 'flag'], how='left')
# Fill the feature columns only: flag must keep its NaNs so the split below still works
feat_cols = df.columns.drop('flag')
df[feat_cols] = df[feat_cols].fillna(0)
# Split back into train and test: test rows have no flag
df_train = df[df['flag'].notnull()]
df_test = df[df['flag'].isnull()].drop(['flag'], axis=1)
Feature engineering yields 373 features; xgboost importance is used to keep the top 200.
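The training code below references dftrain_X and dftrain_Y, which are never defined in the original; one plausible construction from df_train (an assumption) is:

# Assumed split of df_train into features and target; .values makes the
# positional indexing dftrain_Y[trn_idx] in the CV loop safe
dftrain_X = df_train.drop(['id', 'flag'], axis=1)
dftrain_Y = df_train['flag'].values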
from xgboost import XGBClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

model = XGBClassifier(colsample_bytree=0.6, max_depth=6, reg_alpha=0.05, subsample=0.7,
                      objective='binary:logistic', n_jobs=-1, booster='gbtree',
                      n_estimators=1000, learning_rate=0.02, gamma=0.7)
folds = KFold(n_splits=5, shuffle=True, random_state=6)
mean_auc = 0
for fold_, (trn_idx, val_idx) in enumerate(folds.split(dftrain_X, dftrain_Y)):
    print('Fold:', fold_ + 1)
    tr_x, tr_y = dftrain_X.iloc[trn_idx, :], dftrain_Y[trn_idx]
    vl_x, vl_y = dftrain_X.iloc[val_idx, :], dftrain_Y[val_idx]
    model.fit(tr_x, tr_y,
              eval_set=[(vl_x, vl_y)],
              eval_metric='auc',
              early_stopping_rounds=100,
              verbose=50)
    y_prob = model.predict_proba(vl_x)[:, 1]
    print(roc_auc_score(vl_y, y_prob))
    mean_auc += roc_auc_score(vl_y, y_prob) / 5
print('Mean CV AUC:', mean_auc)