BI & Data Mining Case: Insurance Fraud Prediction in Python

Dependencies

import pandas as pd
import numpy as np

Data Loading

train = pd.read_csv('insurance/train.csv')
test = pd.read_csv('insurance/test.csv')
# Stack train and test so preprocessing runs once on both;
# test rows carry NaN in the fraud column and are split back out later
train = pd.concat([train,test]).reset_index(drop = True)

Data Profiling and Cleaning

Data Profiling

Data profiling: use business knowledge to drop variables that are clearly irrelevant or clearly redundant.

  1. Policy ID policy_id: auto-incrementing and unrelated to fraud; set it as the index;
  2. Insured ZIP code insured_zip: duplicates the region information already carried by the policy location; redundant, so drop it;
  3. Policy bind date policy_bind_date and incident date incident_date: subtract the two to create the derived variable insured_days (days insured), then drop the originals;
  4. policy_csl has the form 'num1/num2'; split it on '/' into two numeric fields.
train['policy_bind_date'] = pd.to_datetime(train.policy_bind_date,format = '%Y-%m-%d')
train['incident_date'] = pd.to_datetime(train.incident_date,format = '%Y-%m-%d')
train['insured_days'] = (train['incident_date'] - train['policy_bind_date']).dt.days
train['policy_csl_0'] = train.policy_csl.str.split('/',expand = True)[0].astype('int64')
train['policy_csl_1'] = train.policy_csl.str.split('/',expand = True)[1].astype('int64')
selected_var = train.columns.difference(['policy_bind_date','incident_date','policy_csl','insured_zip'])
train = train[selected_var].set_index('policy_id')

Data Cleaning

Data cleaning: handle missing and abnormal values.

Run value_counts() on every field to inspect its distribution;

the collision type collision_type, property damage flag property_damage, and police report flag police_report_available fields have many missing values;

here missing values are treated as one more category of the categorical variable, with no special imputation.

#train.collision_type.value_counts()
#train.property_damage.value_counts()
#train.police_report_available.value_counts()
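The "missing value as its own category" idea can be sketched on a toy column (a minimal sketch; the column values below are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy column standing in for collision_type (values are illustrative).
s = pd.Series(['Rear Collision', np.nan, 'Side Collision', np.nan])

# Turn true NaNs into an explicit 'missing' label so a label encoder
# later treats them as one more category instead of failing on NaN.
s = s.fillna('missing')
print(s.value_counts()['missing'])  # 2
```

If the raw file already records missing entries as a placeholder string such as '?', no fillna is needed: the placeholder is simply encoded as its own class.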

Feature Engineering

Encoding Categorical Variables

Encoding ordinal categories

Ordinal categories:

  1. Insured's education level insured_education_level,
  2. Incident severity incident_severity
train.insured_education_level = train.insured_education_level.astype('category')
train.incident_severity = train.incident_severity.astype('category')
#Encode ordinal categories by mapping each category label to its rank
train['insured_education_level_val'] = train.insured_education_level.cat.rename_categories({'High School':1,'Associate':2,'College':3,'Masters':4,'JD':5,'PhD':6,'MD':7})
train['incident_severity_val'] = train.incident_severity.cat.rename_categories({'Trivial Damage':1,'Minor Damage':2,'Major Damage':3,'Total Loss':4})
train['insured_education_level_val'] = train['insured_education_level_val'].astype('int64')
train['incident_severity_val'] = train['incident_severity_val'].astype('int64')
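The rename_categories trick in miniature (a sketch on toy data; the dict mirrors the education-level ranks used above):

```python
import pandas as pd

# Toy ordinal column; the dict mirrors the education-level ranks above.
edu = pd.Series(['College', 'High School', 'PhD']).astype('category')
rank = {'High School': 1, 'Associate': 2, 'College': 3,
        'Masters': 4, 'JD': 5, 'PhD': 6, 'MD': 7}

# Dict keys absent from the column's categories are simply ignored,
# so one full mapping covers any subset of levels.
codes = edu.cat.rename_categories(rank).astype('int64')
print(list(codes))  # [3, 1, 6]
```

Unlike LabelEncoder's alphabetical coding, this keeps the domain-meaningful ordering (High School < ... < MD) in the numeric values.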

Encode nominal (unordered) categories with LabelEncoder from sklearn's preprocessing module

Nominal categories: the remaining columns whose dtype is object

from sklearn import preprocessing as pp
le = pp.LabelEncoder()
obj_list = train.dtypes[train.dtypes == 'object'].index
for col in obj_list:
    train[col+'_val'] = le.fit_transform(train[col])
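LabelEncoder assigns integer codes in alphabetical order of the class labels; a minimal self-contained sketch (the labels are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Classes are sorted before coding: 'FEMALE' -> 0, 'MALE' -> 1.
codes = le.fit_transform(['MALE', 'FEMALE', 'MALE'])
print(list(codes))        # [1, 0, 1]
print(list(le.classes_))  # ['FEMALE', 'MALE']
```

Note the resulting integers carry no meaningful order, which is acceptable for the tree-based models compared below but would be questionable for a linear model.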

Selecting the Feature Variables

feats = train.dtypes[(train.dtypes != 'object')&(train.dtypes != 'category')].index
feats = feats[feats!='fraud']

Splitting into Training and Test Sets

train_x = train.loc[train.fraud.notna()][feats]
test_x = train.loc[train.fraud.isna()][feats]
train_y = train.loc[train.fraud.notna()]['fraud']
test_y = train.loc[train.fraud.isna()]['fraud']  # all NaN: the test labels are unknown
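Because train and test were concatenated at the start, the notna/isna masks on fraud recover the original split; a toy illustration (the column names are hypothetical stand-ins):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the combined table: rows with a known label are
# training data, NaN-labeled rows are the test set awaiting prediction.
df = pd.DataFrame({'x': [1, 2, 3, 4], 'fraud': [0, 1, np.nan, np.nan]})
train_part = df.loc[df.fraud.notna()]
test_part = df.loc[df.fraud.isna()]
print(len(train_part), len(test_part))  # 2 2
```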

Model Building

Model Selection

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score
import time

clf1 = LogisticRegression()
clf2 = DecisionTreeClassifier()
clf3 = RandomForestClassifier()
clf4 = AdaBoostClassifier()
clf5 = GradientBoostingClassifier()
clf6 = XGBClassifier()
clf7 = LGBMClassifier()
for clf,name in zip([clf1,clf2,clf3,clf4,clf5,clf6,clf7],['Logistic','DecisionTree','RandomForest','AdaBoost','GradientBoosting','XGB','LGBM']):
    start = time.time()
    score = cross_val_score(clf,train_x,train_y,scoring = 'roc_auc',cv = 5)
    end = time.time()
    time_cost = end - start
    print("AUC: %0.8f (+/- %0.2f), time: %0.2fs, model [%s]" % (score.mean(), score.std(), time_cost, name))

(Note: the output below requires the xgboost and lightgbm packages)
AUC: 0.55280733 (+/- 0.02), time: 0.25s, model [Logistic]
AUC: 0.69942906 (+/- 0.03), time: 0.09s, model [DecisionTree]
AUC: 0.78689206 (+/- 0.01), time: 1.42s, model [RandomForest]
AUC: 0.77235847 (+/- 0.05), time: 0.79s, model [AdaBoost]
AUC: 0.83080416 (+/- 0.02), time: 1.48s, model [GradientBoosting]
AUC: 0.83164124 (+/- 0.02), time: 0.85s, model [XGB]
AUC: 0.81864285 (+/- 0.02), time: 0.60s, model [LGBM]
With the default parameters above, the XGB model achieves the highest AUC with a relatively short runtime, so XGB is selected for tuning and the final prediction.

Hyperparameter Tuning

  1. Method: use sklearn's GridSearchCV (grid search with cross-validation), printing the best parameters and best score at each step;
  2. Order: max_depth → min_child_weight → gamma → colsample_bytree → alpha → lambda → learning_rate
from sklearn.model_selection import GridSearchCV
param_1 = {'max_depth':[3, 5, 6, 7, 9, 12, 15, 17, 25]}
gsearch1 = GridSearchCV(estimator = XGBClassifier(random_state = 2022),param_grid = param_1,scoring = 'roc_auc',cv = 5)
gsearch1.fit(train_x,train_y)
print(gsearch1.best_params_,gsearch1.best_score_)
param_2 = {'min_child_weight':range(1,60,1)}
gsearch2 = GridSearchCV(estimator = XGBClassifier(max_depth=6,random_state = 2022),param_grid = param_2,scoring = 'roc_auc',cv = 5)
gsearch2.fit(train_x,train_y)
print(gsearch2.best_params_,gsearch2.best_score_)
param_3 = {'gamma':[i*0.01 for i in range(0,100,1)]}
gsearch3 = GridSearchCV(estimator = XGBClassifier(random_state = 2022,max_depth = 6,min_child_weight = 17),param_grid = param_3,scoring = 'roc_auc',cv = 5)
gsearch3.fit(train_x,train_y)
print(gsearch3.best_params_,gsearch3.best_score_)
param_4 = {'colsample_bytree':[0.6,0.7,0.8,0.9,1]}
gsearch4 = GridSearchCV(estimator = XGBClassifier(random_state = 2022,max_depth = 6,min_child_weight = 17,gamma = 0.78),param_grid = param_4,scoring = 'roc_auc',cv = 5)
gsearch4.fit(train_x,train_y)
print(gsearch4.best_params_,gsearch4.best_score_)
param_5 = {'alpha':[0,0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.1,1]}
gsearch5 = GridSearchCV(estimator = XGBClassifier(random_state = 2022,max_depth = 6,min_child_weight = 17,gamma = 0.78),param_grid = param_5,scoring = 'roc_auc',cv = 5)
gsearch5.fit(train_x,train_y)
print(gsearch5.best_params_,gsearch5.best_score_)
param_6 = {'lambda':[0.1*i for i in range(11)]}
gsearch6 = GridSearchCV(estimator = XGBClassifier(random_state = 2022,max_depth = 6,min_child_weight = 17,gamma = 0.78),param_grid = param_6,scoring = 'roc_auc',cv = 5)
gsearch6.fit(train_x,train_y)
print(gsearch6.best_params_,gsearch6.best_score_)
param_7 = {'learning_rate':[0.01, 0.015, 0.025, 0.05, 0.1],'lambda':[0.4]}
gsearch7 = GridSearchCV(estimator = XGBClassifier(random_state = 2022,max_depth = 6,min_child_weight = 17,gamma = 0.78),param_grid = param_7,scoring = 'roc_auc',cv = 5)
gsearch7.fit(train_x,train_y)
print(gsearch7.best_params_,gsearch7.best_score_)
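The search pattern above in a self-contained miniature. DecisionTreeClassifier and synthetic data are swapped in here purely so the sketch runs without xgboost or the competition files; the grid values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for train_x / train_y.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

gs = GridSearchCV(DecisionTreeClassifier(random_state=0),
                  param_grid={'max_depth': [2, 4, 6]},
                  scoring='roc_auc', cv=5)
gs.fit(X, y)
# best_params_ holds the winning grid point, best_score_ its mean CV AUC.
print(gs.best_params_, gs.best_score_)
```

In the sequential scheme above, each step fixes the winner of the previous search into the estimator before tuning the next parameter; this is cheaper than one joint grid, at the cost of possibly missing parameter interactions.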

Judging by the tuning results (AUC: 0.83080416 → 0.83982359), parameter tuning yields only a modest score improvement; choosing a suitable model in the first place is the decisive factor.

Model Fitting and Prediction

clf_bg = XGBClassifier(random_state = 2022,max_depth = 6,min_child_weight = 17,gamma = 0.78)
clf_bg.fit(train_x,train_y)
predict_y = clf_bg.predict(test_x)
predict_y = pd.DataFrame(predict_y,columns = ['fraud'],index = test_x.index)
res = predict_y.reset_index()
res.to_csv('insurance/submission.csv', index = False)  # forward slash for portability; drop the redundant RangeIndex column
