Building a Credit Card Fraud Detection Model

1. Background

In recent years, as credit cards have become ubiquitous, banks have faced a growing risk of financial fraud such as stolen-card transactions. With advances in technology, commercial banks and other lending companies increasingly turn to big data and machine learning to guard against this kind of financial risk and keep their customers safe.
In this post we will explore a dataset of 280,000+ credit card transactions, look for features that separate the two classes, and build a random forest model on those features to classify each transaction as normal or fraudulent.
Dataset source: https://www.kaggle.com/mlg-ulb/creditcardfraud

2. Importing the Modules and the Dataset

import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import matplotlib.gridspec as gridspec
import warnings
warnings.filterwarnings('ignore')
plt.style.use('ggplot')
df_credit = pd.read_csv('C:/Users/Jason/Desktop/creditcard.csv')

Take a look at the first five rows of the data

df_credit.head()
-->>
    Time  V1        V2            V3              V4         V5        V6             V7         V8           V9    ... V21           V22           V23           V24       V25         V26         V27          V28     Amount Class
0   0.0 -1.359807   -0.072781   2.536347    1.378155    -0.338321   0.462388    0.239599    0.098698    0.363787    ... -0.018307   0.277838    -0.110474   0.066928    0.128539    -0.189115   0.133558    -0.021053   149.62  0
1   0.0 1.191857    0.266151    0.166480    0.448154    0.060018    -0.082361   -0.078803   0.085102    -0.255425   ... -0.225775   -0.638672   0.101288    -0.339846   0.167170    0.125895    -0.008983   0.014724    2.69    0
2   1.0 -1.358354   -1.340163   1.773209    0.379780    -0.503198   1.800499    0.791461    0.247676    -1.514654   ... 0.247998    0.771679    0.909412    -0.689281   -0.327642   -0.139097   -0.055353   -0.059752   378.66  0
3   1.0 -0.966272   -0.185226   1.792993    -0.863291   -0.010309   1.247203    0.237609    0.377436    -1.387024   ... -0.108300   0.005274    -0.190321   -1.175575   0.647376    -0.221929   0.062723    0.061458    123.50  0
4   2.0 -1.158233   0.877737    1.548718    0.403034    -0.407193   0.095921    0.592941    -0.270533   0.817739    ... -0.009431   0.798278    -0.137458   0.141267    -0.206010   0.502292    0.219422    0.215153    69.99   0

Next, check whether there are any missing values

df_credit.info()
-->>

RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time      284807 non-null float64
V1        284807 non-null float64
V2        284807 non-null float64
V3        284807 non-null float64
V4        284807 non-null float64
V5        284807 non-null float64
V6        284807 non-null float64
V7        284807 non-null float64
V8        284807 non-null float64
V9        284807 non-null float64
V10       284807 non-null float64
V11       284807 non-null float64
V12       284807 non-null float64
V13       284807 non-null float64
V14       284807 non-null float64
V15       284807 non-null float64
V16       284807 non-null float64
V17       284807 non-null float64
V18       284807 non-null float64
V19       284807 non-null float64
V20       284807 non-null float64
V21       284807 non-null float64
V22       284807 non-null float64
V23       284807 non-null float64
V24       284807 non-null float64
V25       284807 non-null float64
V26       284807 non-null float64
V27       284807 non-null float64
V28       284807 non-null float64
Amount    284807 non-null float64
Class     284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB

The dataset turns out to be fairly clean, so there is little data cleaning to do.
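
As a quick sanity check, we can also count the missing values explicitly (a minimal sketch; it only touches the df_credit frame loaded above):

# Total number of NaN cells across the whole frame; per the info() output above this should be 0
df_credit.isnull().sum().sum()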

3. Exploratory Analysis

3.1 Distribution of normal vs. fraudulent transactions

print('Normal (0) vs. fraud (1):')
print(df_credit.Class.value_counts())
plt.figure(figsize=(8,4))
sns.countplot(df_credit.Class)
plt.title('Distribution',fontsize=16)
plt.ylabel('Count',fontsize=14)
-->>
Normal (0) vs. fraud (1):
0    284315
1       492
Name: Class, dtype: int64
[Figure count.png: bar plot of transaction counts by class]

The distribution of fraudulent vs. normal transactions is extremely imbalanced, and this imbalance tends to cause real trouble for both analysis and modeling. Later on we will therefore use resampling techniques to rebalance the dataset, amplifying the influence of the minority class so the model can learn to separate the classes more effectively.
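
To put a number on the imbalance before we go further, here is a minimal sketch (using only the df_credit frame already loaded):

# Share of each class; fraud (1) is roughly 0.17% of all transactions (492 / 284807)
print(df_credit.Class.value_counts(normalize=True))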

3.2 Analysis of transaction amounts

df_fake = df_credit[df_credit['Class'] ==1]
df_normal = df_credit[df_credit['Class'] ==0]
print('Fraud amount stats:', df_fake.Amount.describe())
print('Normal amount stats:', df_normal.Amount.describe())
-->>
Fraud amount stats: count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64
Normal amount stats: count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64

For fraudulent transactions, the spread between the 25th and 75th percentiles is fairly wide and the amounts skew relatively large, whereas normal transactions show a very long right tail.
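
One way to make that long tail easier to see is to compare the two Amount distributions on a log scale. This is a small sketch, reusing the df_fake and df_normal frames defined above:

plt.figure(figsize=(12,5))
# log1p avoids log(0) for zero-amount transactions
sns.distplot(np.log1p(df_normal['Amount']), label='normal')
sns.distplot(np.log1p(df_fake['Amount']), label='fraud')
plt.xlabel('log(1 + Amount)')
plt.legend()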

3.3 Analysis of transaction time

First convert the Time column (seconds elapsed since the first transaction) into a timedelta and extract hour and minute components

timedata = pd.to_timedelta(df_credit['Time'],unit='s')
df_credit['min'] = (timedata.dt.components.minutes).astype(int)
df_credit['hour'] = (timedata.dt.components.hours).astype(int)
plt.figure(figsize=(12,5))
sns.distplot(df_credit[df_credit['Class'] == 0]['hour'])
sns.distplot(df_credit[df_credit['Class'] == 1]['hour'])
plt.xlim([-1,25])
[Figure hour.png: hourly distribution of normal vs. fraudulent transactions]

Normal spending mostly happens between 9 a.m. and 10 p.m., which matches typical daily activity. Fraudulent transactions are more frequent around 9 to 11 a.m. and again from about 3 p.m. to 9 p.m., with a noticeable bump between midnight and 3 a.m.; otherwise they roughly track the normal pattern.
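
To back that up with numbers, we can compute the fraction of transactions in each hour that are fraudulent (a sketch built on the hour column created above):

# Hours with the highest fraud rate
fraud_rate_by_hour = df_credit.groupby('hour')['Class'].mean()
print(fraud_rate_by_hour.sort_values(ascending=False).head())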

3.4 Distribution analysis of the remaining variables

columns = df_credit.iloc[:,1:29].columns
fake = df_credit.Class == 1
normal = df_credit.Class == 0
grid = gridspec.GridSpec(14,2)
plt.figure(figsize=(15,80))
for n, col in enumerate(df_credit[columns]):
    ax = plt.subplot(grid[n])
    ax.set_title(col)
    sns.distplot(df_credit[col][fake], bins=50, color='g')    # fraud in green
    sns.distplot(df_credit[col][normal], bins=50, color='r')  # normal in red
[Figure v1-v28.png: class-wise distributions of V1 through V28]

We can see that variables such as V4, V9, V11, and V16 differ dramatically between the two classes, which should help the model identify fraudulent transactions.
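
A rough way to rank the V-features by how well they separate the two classes (a sketch, not part of the original analysis) is to measure the distance between the class means in units of each feature's overall standard deviation:

# |mean(fraud) - mean(normal)| / std; larger values indicate a bigger gap between the classes
separation = (df_credit.loc[fake, columns].mean()
              - df_credit.loc[normal, columns].mean()).abs() / df_credit[columns].std()
print(separation.sort_values(ascending=False).head(10))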

4. Building a Random Forest Model to Detect Fraud

4.1 Data preprocessing

Import the required libraries

from imblearn.pipeline import make_pipeline as make_pipeline_imb   
from imblearn.over_sampling import SMOTE                            
from sklearn.pipeline import make_pipeline                          
from imblearn.metrics import classification_report_imbalanced      

from sklearn.model_selection import train_test_split               
from collections import Counter                                    

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier                 
from sklearn.metrics import precision_score, recall_score, fbeta_score, confusion_matrix, precision_recall_curve, accuracy_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.model_selection import GridSearchCV
import sklearn.preprocessing as sp

The Amount (transaction value) field spans a very wide range, which is awkward for many machine learning algorithms, so we standardize it with sklearn's StandardScaler and then keep only the feature columns we want for modeling.

from sklearn.preprocessing import StandardScaler
df_credit['Amount'] = StandardScaler().fit_transform(df_credit['Amount'].values.reshape(-1,1))
df_credit = df_credit[["hour","min","V2","V3","V4","V9","V10","V11","V12","V14","V16","V17","V18","V19","V27","Amount","Class"]]
df_credit.head()
-->>
  hour  min       V2          V3        V4        V9            V10       V11         V12           V14         V16         V17       V18         V19         V27         Amount Class
0   0   0   -0.072781   2.536347    1.378155    0.363787    0.090794    -0.551600   -0.617801   -0.311169   -0.470401   0.207971    0.025791    0.403993    0.133558    0.244964    0
1   0   0   0.266151    0.166480    0.448154    -0.255425   -0.166974   1.612727    1.065235    -0.143772   0.463917    -0.114805   -0.183361   -0.145783   -0.008983   -0.342475   0
2   0   0   -1.340163   1.773209    0.379780    -1.514654   0.207643    0.624501    0.066084    -0.165946   -2.890083   1.109969    -0.121359   -2.261857   -0.055353   1.160686    0
3   0   0   -0.185226   1.792993    -0.863291   -1.387024   -0.054952   -0.226487   0.178228    -0.287924   -1.059647   -0.684093   1.965775    -1.232622   0.062723    0.140534    0
4   0   0   0.877737    1.548718    0.403034    0.817739    0.753074    -0.822843   0.538196    -1.119670   -0.451449   -0.237033   -0.038195   0.803487    0.219422    -0.073403   0

4.2 Because the classes are so imbalanced, we try both undersampling and oversampling and compare the results

Let's look at undersampling first.

feature_col = df_credit.iloc[:,:-1].columns
x = df_credit[feature_col]
y = df_credit['Class']
fake_number = len(df_credit[df_credit.Class==1])
fake_index = np.array(df_credit[df_credit.Class==1].index)
normal_index = df_credit[df_credit.Class == 0].index
random_normal_index = np.random.choice(normal_index,fake_number,replace = False)
random_normal_index = np.array(random_normal_index)
under_sample_index = np.concatenate([fake_index,random_normal_index])
under_sample_data = df_credit.iloc[under_sample_index,:]
x_undersample = under_sample_data.iloc[:,under_sample_data.columns !='Class']
y_undersample = under_sample_data.iloc[:,under_sample_data.columns =='Class']
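
For reference, imblearn ships a RandomUnderSampler that does the same random down-sampling of the majority class in fewer lines; this is just a sketch of that alternative, and the result differs only by the random draw:

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=2)
# fit_sample was renamed fit_resample in newer imblearn releases
x_rus, y_rus = rus.fit_sample(x, y)
print(Counter(y_rus))   # both classes end up with 492 samples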

Split the dataset

x_train_under,x_test_under,y_train_under,y_test_under = train_test_split(x_undersample,y_undersample,test_size=0.2,random_state=2)

Build a random forest model and tune its hyperparameters automatically with grid search

param_grid = {
    'max_depth':[3,5,None],
    'n_estimators':[5,10,100],
    'max_features':[5,6,7]
}
model = RandomForestClassifier(max_depth=3, n_estimators=10, random_state=7,max_features=3)
grid_search = GridSearchCV(model,param_grid=param_grid,cv=5,scoring='recall_weighted')
grid_search.fit(x_train_under,y_train_under)
y_under_predict = grid_search.predict(x_test_under)

print(grid_search.best_score_)
print(grid_search.best_params_)
-->>
0.9301136821736676
{'max_depth': 5, 'max_features': 5, 'n_estimators': 100}
print(confusion_matrix(y_test_under,y_under_predict))
-->>
 [[98  4]
 [10 85]]

Print the scores

print("accuracy: {}".format(accuracy_score(y_test_under, y_under_predict)))
print("precision: {}".format(precision_score(y_test_under, y_under_predict)))
print("recall: {}".format(recall_score(y_test_under, y_under_predict)))
print("f2: {}".format(fbeta_score(y_test_under, y_under_predict, beta=2)))
-->>
accuracy: 0.9289340101522843
precision: 0.9550561797752809
recall: 0.8947368421052632
f2: 0.9061833688699361
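
Since the grid search keeps the best fitted model in best_estimator_, we can also peek at which features the tuned forest leans on (a small sketch, not part of the original write-up):

# Feature importances of the best random forest found on the undersampled data
best_rf = grid_search.best_estimator_
importances = pd.Series(best_rf.feature_importances_, index=feature_col)
print(importances.sort_values(ascending=False).head(10))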

Now let's look at how oversampling (SMOTE) performs

over_sample = SMOTE(random_state=4)
# fit_sample was renamed fit_resample in newer imblearn releases
os_features, os_labels = over_sample.fit_sample(x, y)

Check how many minority-class samples we have after resampling:

len(os_labels[os_labels==1])
-->>
284315
x_train_over,x_test_over,y_train_over,y_test_over = train_test_split(os_features,os_labels,test_size=0.2,random_state=2)
param_grid = {
    'max_depth':[3,5,None],
    'n_estimators':[5,10,100],
    'max_features':[5,6,7]
}
model = RandomForestClassifier(max_depth=3, n_estimators=10, random_state=7,max_features=3)
grid_search = GridSearchCV(model,param_grid=param_grid,cv=5,scoring='recall_weighted')
grid_search.fit(x_train_over,y_train_over)
y_over_predict = grid_search.predict(x_test_over)

print(grid_search.best_score_)
print(grid_search.best_params_)
-->>
0.9997911648500677
{'max_depth': 5, 'max_features': 5, 'n_estimators': 100}

The cross-validated score looks slightly higher than with undersampling; let's check recall and the other metrics as well.

print("accuracy: {}".format(accuracy_score(y_test_over, y_under_predict)))
print("precision: {}".format(precision_score(y_test_over, y_under_predict)))
print("recall: {}".format(recall_score(y_test_over, y_under_predict)))
print("f2: {}".format(fbeta_score(y_test_over, y_under_predict, beta=2)))
-->>
accuracy: 0.9998593109755025
precision: 0.9997728464092259
recall: 0.9999475707794477
f2: 0.9999126210198874

Looks pretty good.
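
One caveat worth flagging: SMOTE was applied to the full dataset before the train/test split, so the test set also contains synthetic points and the scores above are probably optimistic. A leakage-safe variant, sketched below with the make_pipeline_imb helper already imported (and the hyperparameters the grid search picked), splits the original data first and lets SMOTE resample only the training portion:

# Split the original, imbalanced data first; stratify keeps the class ratio in both splits
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=2)

# SMOTE lives inside the pipeline, so it only ever sees the training data
pipeline = make_pipeline_imb(
    SMOTE(random_state=4),
    RandomForestClassifier(max_depth=5, n_estimators=100, max_features=5, random_state=7))
pipeline.fit(x_train, y_train)
y_pred = pipeline.predict(x_test)
print(classification_report_imbalanced(y_test, y_pred))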

Comments and corrections are welcome. Thanks!
