1. Background
In recent years, as credit cards have become ubiquitous, banks have faced a growing risk of financial fraud such as card theft. Commercial banks and lending companies increasingly rely on big data and machine learning to guard against this risk and keep customers safe.
In this post we explore a dataset of 280,000+ credit card transactions, look for features that separate normal and fraudulent transactions, and build a random forest model on those features to classify each transaction as legitimate or fraudulent.
Dataset source: https://www.kaggle.com/mlg-ulb/creditcardfraud
2. Import the modules and the dataset
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import matplotlib.gridspec as gridspec
import warnings
warnings.filterwarnings('ignore')
plt.style.use('ggplot')
df_credit = pd.read_csv('C:/Users/Jason/Desktop/creditcard.csv')
Take a look at the first five rows
df_credit.head()
-->>
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99 0
Next, check for missing values
df_credit.info()
-->>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time 284807 non-null float64
V1 284807 non-null float64
V2 284807 non-null float64
V3 284807 non-null float64
V4 284807 non-null float64
V5 284807 non-null float64
V6 284807 non-null float64
V7 284807 non-null float64
V8 284807 non-null float64
V9 284807 non-null float64
V10 284807 non-null float64
V11 284807 non-null float64
V12 284807 non-null float64
V13 284807 non-null float64
V14 284807 non-null float64
V15 284807 non-null float64
V16 284807 non-null float64
V17 284807 non-null float64
V18 284807 non-null float64
V19 284807 non-null float64
V20 284807 non-null float64
V21 284807 non-null float64
V22 284807 non-null float64
V23 284807 non-null float64
V24 284807 non-null float64
V25 284807 non-null float64
V26 284807 non-null float64
V27 284807 non-null float64
V28 284807 non-null float64
Amount 284807 non-null float64
Class 284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
The dataset is clean: no missing values, so very little data cleaning is needed.
3. Exploratory analysis
3.1 Class distribution: normal vs. fraudulent transactions
print('Normal (0) : Fraud (1)')
print(df_credit.Class.value_counts())
plt.figure(figsize=(8,4))
sns.countplot(df_credit.Class)
plt.title('Distribution',fontsize=16)
plt.ylabel('Count',fontsize=14)
-->>
Normal (0) : Fraud (1)
0 284315
1 492
Name: Class, dtype: int64
The class distribution is extremely imbalanced: fraud accounts for only a tiny fraction of all transactions. Such imbalance makes both analysis and modelling difficult, so later we will resample the data to balance the two classes, amplifying the influence of the minority class so the model can learn to distinguish it properly.
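Before moving on, a minimal sketch (reusing the df_credit frame loaded above) makes the degree of imbalance explicit:
# Quick sanity check: frauds are only a tiny fraction of all rows
fraud_ratio = df_credit['Class'].mean()
print('Fraud ratio: {:.4%}'.format(fraud_ratio))
print('Normal : fraud ~ {:.0f} : 1'.format((1 - fraud_ratio) / fraud_ratio))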
3.2 Transaction amount
df_fake = df_credit[df_credit['Class'] ==1]
df_normal = df_credit[df_credit['Class'] ==0]
print('Fraud statistics:', df_fake.Amount.describe())
print('Normal transaction statistics:', df_normal.Amount.describe())
-->>
Fraud statistics: count     492.000000
mean 122.211321
std 256.683288
min 0.000000
25% 1.000000
50% 9.250000
75% 105.890000
max 2125.870000
Name: Amount, dtype: float64
Normal transaction statistics: count    284315.000000
mean 88.291022
std 250.105092
min 0.000000
25% 5.650000
50% 22.000000
75% 77.050000
max 25691.160000
Name: Amount, dtype: float64
The fraudulent transactions show a wider spread between the 25th and 75th percentiles and a relatively higher typical amount, while normal transactions have a much longer tail (maximum above 25,000).
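To back up that observation visually, a small optional sketch (reusing the df_fake / df_normal frames defined above) overlays the two Amount distributions; the x-axis is clipped because the normal class has a very long tail:
plt.figure(figsize=(12,5))
sns.distplot(df_normal['Amount'], bins=100, label='Normal (0)')
sns.distplot(df_fake['Amount'], bins=100, label='Fraud (1)')
plt.xlim([0, 2500])   # zoom in; normal amounts extend far beyond this range
plt.xlabel('Amount')
plt.legend()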
3.3 Transaction time
First convert the Time column (seconds elapsed since the first transaction) into a timedelta and extract the hour and minute components
timedata = pd.to_timedelta(df_credit['Time'], unit='s')   # Time is seconds since the first transaction
df_credit['min'] = timedata.dt.components.minutes.astype(int)
df_credit['hour'] = timedata.dt.components.hours.astype(int)
plt.figure(figsize=(12,5))
sns.distplot(df_credit[df_credit['Class'] == 0]['hour'], label='Normal (0)')
sns.distplot(df_credit[df_credit['Class'] == 1]['hour'], label='Fraud (1)')
plt.xlim([-1,25])
plt.legend()
Normal transactions mostly occur between 9 a.m. and 10 p.m., matching typical daily activity. Fraudulent transactions peak around 9-11 a.m. and again from 3 p.m. to 9 p.m., with another rise between midnight and 3 a.m.; otherwise the pattern is close to that of normal transactions.
3.4 Distribution of the remaining variables
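The hourly picture can also be summarised numerically. The following short sketch tabulates the fraud rate per hour of day with pandas' crosstab (an extra check, not part of the original analysis):
hourly = pd.crosstab(df_credit['hour'], df_credit['Class'])
hourly['fraud_rate'] = hourly[1] / (hourly[0] + hourly[1])
print(hourly.sort_values('fraud_rate', ascending=False).head(10))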
columns = df_credit.iloc[:,1:29].columns
fake = df_credit.Class == 1
normal = df_credit.Class == 0
grid = gridspec.GridSpec(14,2)
plt.figure(figsize=(15,80))
for n, col in enumerate(df_credit[columns]):
    ax = plt.subplot(grid[n])
    sns.distplot(df_credit[col][fake], bins=50, color='g')
    sns.distplot(df_credit[col][normal], bins=50, color='r')
Variables such as V4, V9, V11 and V16 show clearly different distributions between the two classes, which should help the model identify fraudulent transactions.
4. Build a random forest model to predict fraud
4.1 Data preprocessing
Import the required libraries
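The plots above are judged by eye. As a hedged alternative, one could quantify how separable each variable is with a two-sample Kolmogorov-Smirnov statistic; the sketch below uses scipy, which is not imported in the original notebook:
from scipy import stats

ks_scores = {}
for col in columns:
    # larger statistic = the fraud and normal distributions of this feature differ more
    stat, _ = stats.ks_2samp(df_credit.loc[fake, col], df_credit.loc[normal, col])
    ks_scores[col] = stat
print(pd.Series(ks_scores).sort_values(ascending=False).head(10))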
from imblearn.pipeline import make_pipeline as make_pipeline_imb
from imblearn.over_sampling import SMOTE
from sklearn.pipeline import make_pipeline
from imblearn.metrics import classification_report_imbalanced
from sklearn.model_selection import train_test_split
from collections import Counter
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, fbeta_score, confusion_matrix, precision_recall_curve, accuracy_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.model_selection import GridSearchCV
import sklearn.preprocessing as sp
The Amount column has a much larger numeric range than the other features, which is unhelpful for many learning algorithms, so we standardise it with sklearn's StandardScaler and then keep only the feature columns selected from the exploratory analysis.
from sklearn.preprocessing import StandardScaler
df_credit['Amount'] = StandardScaler().fit_transform(df_credit['Amount'].values.reshape(-1,1))
df_credit = df_credit[["hour","min","V2","V3","V4","V9","V10","V11","V12","V14","V16","V17","V18","V19","V27","Amount","Class"]]
df_credit.head()
-->>
hour min V2 V3 V4 V9 V10 V11 V12 V14 V16 V17 V18 V19 V27 Amount Class
0 0 0 -0.072781 2.536347 1.378155 0.363787 0.090794 -0.551600 -0.617801 -0.311169 -0.470401 0.207971 0.025791 0.403993 0.133558 0.244964 0
1 0 0 0.266151 0.166480 0.448154 -0.255425 -0.166974 1.612727 1.065235 -0.143772 0.463917 -0.114805 -0.183361 -0.145783 -0.008983 -0.342475 0
2 0 0 -1.340163 1.773209 0.379780 -1.514654 0.207643 0.624501 0.066084 -0.165946 -2.890083 1.109969 -0.121359 -2.261857 -0.055353 1.160686 0
3 0 0 -0.185226 1.792993 -0.863291 -1.387024 -0.054952 -0.226487 0.178228 -0.287924 -1.059647 -0.684093 1.965775 -1.232622 0.062723 0.140534 0
4 0 0 0.877737 1.548718 0.403034 0.817739 0.753074 -0.822843 0.538196 -1.119670 -0.451449 -0.237033 -0.038195 0.803487 0.219422 -0.073403 0
4.2 Because the class distribution is imbalanced, we will try both undersampling and oversampling and compare the results
Let's look at undersampling first.
feature_col = df_credit.iloc[:,:-1].columns
x = df_credit[feature_col]
y = df_credit['Class']
# Undersampling: randomly pick as many normal transactions as there are frauds
fake_number = len(df_credit[df_credit.Class==1])
fake_index = np.array(df_credit[df_credit.Class==1].index)
normal_index = df_credit[df_credit.Class == 0].index
random_normal_index = np.random.choice(normal_index, fake_number, replace=False)
random_normal_index = np.array(random_normal_index)
# Combine the fraud rows with the sampled normal rows
under_sample_index = np.concatenate([fake_index,random_normal_index])
under_sample_data = df_credit.iloc[under_sample_index,:]
x_undersample = under_sample_data.iloc[:,under_sample_data.columns !='Class']
y_undersample = under_sample_data.iloc[:,under_sample_data.columns =='Class']
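As a quick sanity check, the undersampled frame should now hold roughly a 1:1 class split:
print(under_sample_data['Class'].value_counts())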
Split into training and test sets
x_train_under,x_test_under,y_train_under,y_test_under = train_test_split(x_undersample,y_undersample,test_size=0.2,random_state=2)
Build a random forest model and tune its hyperparameters automatically with grid search
param_grid = {
    'max_depth': [3, 5, None],
    'n_estimators': [5, 10, 100],
    'max_features': [5, 6, 7]
}
model = RandomForestClassifier(max_depth=3, n_estimators=10, random_state=7,max_features=3)
grid_search = GridSearchCV(model,param_grid=param_grid,cv=5,scoring='recall_weighted')
grid_search.fit(x_train_under,y_train_under)
y_under_predict = grid_search.predict(x_test_under)
print(grid_search.best_score_)
print(grid_search.best_params_)
-->>
0.9301136821736676
{'max_depth': 5, 'max_features': 5, 'n_estimators': 100}
print(confusion_matrix(y_test_under,y_under_predict))
-->>
[[98 4]
[10 85]]
Print the scores
print("accuracy: {}".format(accuracy_score(y_test_under, y_under_predict)))
print("precision: {}".format(precision_score(y_test_under, y_under_predict)))
print("recall: {}".format(recall_score(y_test_under, y_under_predict)))
print("f2: {}".format(fbeta_score(y_test_under, y_under_predict, beta=2)))
-->>
accuracy: 0.9289340101522843
precision: 0.9550561797752809
recall: 0.8947368421052632
f2: 0.9061833688699361
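Since precision_recall_curve is already imported, an optional sketch can also show the precision-recall trade-off of the tuned model on the same undersampled test set (this step is not in the original write-up; it assumes the grid_search object fitted above):
probs_under = grid_search.predict_proba(x_test_under)[:, 1]   # predicted fraud probability
precision, recall, _ = precision_recall_curve(y_test_under.values.ravel(), probs_under)
plt.figure(figsize=(8,4))
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall curve (undersampling)')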
Now let's look at oversampling with SMOTE
over_sample = SMOTE(random_state=4)
# note: in newer versions of imbalanced-learn this method is called fit_resample
os_features,os_labels = over_sample.fit_sample(x,y)
Check the size of the minority class after resampling:
len(os_labels[os_labels==1])
-->>
284315
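SMOTE synthesises new minority samples until both classes are the same size; Counter (imported above) confirms the balance:
print(Counter(os_labels))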
x_train_over,x_test_over,y_train_over,y_test_over = train_test_split(os_features,os_labels,test_size=0.2,random_state=2)
param_grid = {
    'max_depth': [3, 5, None],
    'n_estimators': [5, 10, 100],
    'max_features': [5, 6, 7]
}
model = RandomForestClassifier(max_depth=3, n_estimators=10, random_state=7,max_features=3)
grid_search = GridSearchCV(model,param_grid=param_grid,cv=5,scoring='recall_weighted')
grid_search.fit(x_train_over,y_train_over)
y_over_predict = grid_search.predict(x_test_over)
print(grid_search.best_score_)
print(grid_search.best_params_)
-->>
0.9997911648500677
{'max_depth': 5, 'max_features': 5, 'n_estimators': 100}
That is slightly higher than the undersampling score; now let's check recall and the other metrics.
print("accuracy: {}".format(accuracy_score(y_test_over, y_under_predict)))
print("precision: {}".format(precision_score(y_test_over, y_under_predict)))
print("recall: {}".format(recall_score(y_test_over, y_under_predict)))
print("f2: {}".format(fbeta_score(y_test_over, y_under_predict, beta=2)))
-->>
accuracy: 0.9998593109755025
precision: 0.9997728464092259
recall: 0.9999475707794477
f2: 0.9999126210198874
The scores look very good, but note one caveat: SMOTE was applied to the whole dataset before splitting, so synthetic samples leak into the test set and these numbers are likely optimistic compared with performance on the original, untouched data.
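A hedged sketch of a more conservative check: split the original data first, oversample only the training portion, and evaluate on untouched real transactions, reusing the best hyperparameters reported by the grid search above (this extra step is not part of the original write-up):
x_train_raw, x_test_raw, y_train_raw, y_test_raw = train_test_split(
    x, y, test_size=0.2, random_state=2)
# oversample only the training split (fit_resample in newer imbalanced-learn versions)
x_res, y_res = SMOTE(random_state=4).fit_sample(x_train_raw, y_train_raw)
clf = RandomForestClassifier(n_estimators=100, max_depth=5, max_features=5, random_state=7)
clf.fit(x_res, y_res)
y_raw_predict = clf.predict(x_test_raw)
print(confusion_matrix(y_test_raw, y_raw_predict))
print("recall on original data: {}".format(recall_score(y_test_raw, y_raw_predict)))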