An Online Transaction Fraud Detection Model

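Case Introduction

IEEE-CIS (the IEEE Computational Intelligence Society) conducts research across many areas of artificial intelligence and machine learning. It has partnered with Vesta, a leading global payment services company, to seek the best solutions for the fraud prevention industry. This case study builds a fraud detection model on e-commerce data, with the aim of making fraud alerts more accurate so that businesses lose less to fraud and earn more revenue.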

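Data Description

This is a binary classification problem: the target variable indicates whether a transaction is fraudulent or not. The data comes in two files, identity and transaction, which are joined on the shared key TransactionID. Note that not every transaction has corresponding identity information.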

Transaction categorical features

  • ProductCD

  • card1 - card6

  • addr1, addr2

  • P_emaildomain

  • R_emaildomain

  • M1 - M9

Identity categorical features

  • DeviceType

  • DeviceInfo

  • id_12 - id_38
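
The feature ranges listed above can be expanded into explicit column names, which is handy when those columns need special treatment later. The following is only a minimal sketch: it assumes the columns are named exactly as the ranges suggest (card1 to card6, M1 to M9, id_12 to id_38), and the helper `expand_range` is hypothetical, not part of the original notebook.

```python
def expand_range(prefix, start, stop):
    """Return [prefix + str(start), ..., prefix + str(stop)], both ends included."""
    return [f"{prefix}{i}" for i in range(start, stop + 1)]


# Categorical columns of the transaction table, as listed in the text.
transaction_cat_cols = (
    ["ProductCD"]
    + expand_range("card", 1, 6)
    + ["addr1", "addr2", "P_emaildomain", "R_emaildomain"]
    + expand_range("M", 1, 9)
)

# Categorical columns of the identity table, as listed in the text.
identity_cat_cols = ["DeviceType", "DeviceInfo"] + expand_range("id_", 12, 38)

print(len(transaction_cat_cols), "transaction categorical columns")
print(len(identity_cat_cols), "identity categorical columns")
```

Once the data is loaded, these lists could be used to select the categorical columns, for example before encoding them.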

Data Exploration

Load the required libraries.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.metrics import roc_auc_score
import matplotlib.gridspec as gridspec

# Standard plotly imports
# plotly >= 4 moved plotly.plotly into the separate chart_studio package
import chart_studio.plotly as py
import plotly.graph_objs as go
import plotly.tools as tls

import warnings
warnings.filterwarnings("ignore")

import gc
gc.enable()

import logging
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.
print ("Ready!")

['ieee-fraud-detection.zip', 'sample_submission.csv', 'test_identity.csv', 'test_transaction.csv', 'train_identity.csv', 'train_transaction.csv']

Ready!

Loading the Data

print('# File sizes')
for f in os.listdir('../input'):
    if 'zip' not in f:
        print(f.ljust(30) + str(round(os.path.getsize('../input/' + f) / 1000000, 2)) + 'MB')

# File sizes

sample_submission.csv 6.08MB

test_identity.csv 25.8MB

test_transaction.csv 613.19MB

train_identity.csv 26.53MB

train_transaction.csv 683.35MB

train_transaction = pd.read_csv('../input/train_transaction.csv', index_col='TransactionID')
test_transaction = pd.read_csv('../input/test_transaction.csv', index_col='TransactionID')
train_identity = pd.read_csv('../input/train_identity.csv', index_col='TransactionID')
test_identity = pd.read_csv('../input/test_identity.csv', index_col='TransactionID')
print ("Data is loaded!")

Data is loaded!

Wall time: 1min 11s

print('train_transaction shape is {}'.format(train_transaction.shape))
print('test_transaction shape is {}'.format(test_transaction.shape))
print('train_identity shape is {}'.format(train_identity.shape))
print('test_identity shape is {}'.format(test_identity.shape))

train_transaction shape is (590540, 393)

test_transaction shape is (506691, 392)

train_identity shape is (144233, 40)

test_identity shape is (141907, 40)

Display the first 5 rows of the training set.

train_transaction.head()

[Image 1: output of train_transaction.head()]

train_identity.head()

[Image 2: output of train_identity.head()]

Missing Values

Let's look at the number of missing values for each variable.

  • train_transaction
missing_values_count = train_transaction.isnull().sum()
print (missing_values_count[0:10])
total_cells = np.prod(train_transaction.shape)  # np.product was removed in NumPy 2.0
total_missing = missing_values_count.sum()
print ("% of missing data = ",(total_missing/total_cells) * 100)

[Image 3: missing-value counts for train_transaction]

  • train_identity
missing_values_count = train_identity.isnull().sum()
print (missing_values_count[0:10])
total_cells = np.prod(train_identity.shape)  # np.product was removed in NumPy 2.0
total_missing = missing_values_count.sum()
print ("% of missing data = ",(total_missing/total_cells) * 100)

[Image 4: missing-value counts for train_identity]
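
The raw counts are easier to compare as a fraction per column. Here is a minimal sketch on a toy DataFrame: the values are made up for illustration, and only the column names are borrowed from the transaction table. It computes the same overall percentage as total_missing / total_cells.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_transaction; the values are invented for illustration.
toy = pd.DataFrame({
    "TransactionAmt": [10.0, 25.5, 3.2, 99.0],
    "card2": [111.0, np.nan, 222.0, np.nan],
    "dist1": [np.nan, np.nan, np.nan, 5.0],
})

# Fraction of missing cells in each column, largest first.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)

# Overall share of missing cells (equivalent to total_missing / total_cells).
overall = toy.isnull().to_numpy().mean()
print("% of missing data = ", overall * 100)
```

Sorting by ratio makes it easy to spot columns that are mostly empty.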

Class Imbalance

x=train_transaction['isFraud'].value_counts().values
sns.barplot(x=[0, 1], y=x)  # recent seaborn versions require keyword arguments here
plt.title('Target variable count')

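[Image 5: bar plot of the target variable counts]

The target variable isFraud is clearly imbalanced: the value 0 dominates, meaning that most transactions are not fraudulent. If we train a predictive model directly on this data frame, the model is likely to be biased toward the majority class, because predicting "not fraud" is correct most of the time. That is not the behavior we want: the model we build should be able to pick up the signals of fraud!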
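
To make the imbalance concern concrete, here is a small sketch on synthetic labels. The 3% positive rate is an assumption chosen for illustration, not a figure taken from the dataset. A trivial "model" that always predicts the majority class reaches high accuracy while catching no fraud at all.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Synthetic labels with roughly 3% positives (assumed rate, illustration only).
y_true = (rng.random(n) < 0.03).astype(int)

# Trivial baseline: always predict "not fraud".
y_pred = np.zeros(n, dtype=int)

# Share of all predictions that are correct.
accuracy = (y_pred == y_true).mean()

# Recall on the fraud class: frauds caught divided by all frauds.
recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy = {accuracy:.3f}, fraud recall = {recall:.3f}")
```

A threshold-free metric such as ROC AUC (roc_auc_score is among the imports at the top) is less easily fooled by this kind of baseline than plain accuracy.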

Delete the temporary variable x and reclaim the memory.

del x
gc.collect()

Time Span

Let's look at the time span covered by the transactions in the dataset. The feature TransactionDT is a timedelta: the time elapsed since some reference date.

Train: min = 86400 max = 15811131

Test: min = 18403224 max = 34214345

The difference between train.min() and test.max(), x = 34214345 - 86400 = 34127945, is the time span of the whole dataset. But is the unit seconds, minutes, or hours?

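  • If the unit were hours, the span would be x/(24*365) = 3895.884132 years, which is impossible!

  • If the unit were minutes, the span would be x/(60*24*365) = 64.931402 years, which is also impossible, because the data provider Vesta was founded in 1995, only 24 years before this was written.

  • If the unit is seconds, the span is x/(3600*24*365) = 1.0822 years, which is reasonable.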
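
The three candidate units above can be checked with a few lines of arithmetic. This sketch only reproduces the calculation in the list, using a 365-day year as the text does; the variable names are illustrative.

```python
# Test max minus train min, in the (so far unknown) unit of TransactionDT.
SPAN = 34214345 - 86400

HOURS_PER_YEAR = 24 * 365
units_per_year = {
    "hours": HOURS_PER_YEAR,
    "minutes": 60 * HOURS_PER_YEAR,
    "seconds": 3600 * HOURS_PER_YEAR,
}

# Span expressed in years under each candidate unit.
span_in_years = {unit: SPAN / per_year for unit, per_year in units_per_year.items()}

for unit, years in span_in_years.items():
    print(f"if the unit is {unit}: {years:.4f} years")
```

Only the "seconds" reading gives a span of about one year.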

Therefore, the unit of TransactionDT is seconds, which gives:

Time span of the total dataset is 394.9993634259259 days

Time span of Train dataset is 181.99920138888888 days

Time span of Test dataset is 182.99908564814814 days

The gap between train and test is 30.00107638888889 days
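
The code that produced these figures is not shown, so the following is only a sketch of how they can be derived from the min and max values listed earlier, assuming the unit is seconds.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Min and max of TransactionDT, as reported in the text.
train_min, train_max = 86400, 15811131
test_min, test_max = 18403224, 34214345


def to_days(seconds):
    """Convert a duration in seconds to (fractional) days."""
    return seconds / SECONDS_PER_DAY


total_days = to_days(test_max - train_min)
train_days = to_days(train_max - train_min)
test_days = to_days(test_max - test_min)
gap_days = to_days(test_min - train_max)

print("Time span of the total dataset is", total_days, "days")
print("Time span of Train dataset is", train_days, "days")
print("Time span of Test dataset is", test_days, "days")
print("The gap between train and test is", gap_days, "days")
```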

Linking Transactions to Identity

Using the TransactionIDs, we find the identity records associated with each transaction.

# Count how many TransactionIDs in the transaction tables also appear in the identity tables
print(np.sum(train_transaction.index.isin(train_identity.index.unique())))
print(np.sum(test_transaction.index.isin(test_identity.index.unique())))

144233

141907

24.4% of TransactionIDs in train (144233 / 590540) have an associated train_identity.

28.0% of TransactionIDs in test (141907 / 506691) have an associated test_identity.
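
Because only part of the transactions have identity information, a left join is a natural way to combine the two tables without losing rows. Here is a minimal sketch on toy tables indexed by TransactionID. The values are made up, and this is one possible approach rather than the notebook's own merge step.

```python
import pandas as pd

# Toy tables indexed by TransactionID; the values are invented for illustration.
tx = pd.DataFrame(
    {"TransactionAmt": [10.0, 20.0, 30.0, 40.0]},
    index=pd.Index([1, 2, 3, 4], name="TransactionID"),
)
idn = pd.DataFrame(
    {"DeviceType": ["mobile", "desktop"]},
    index=pd.Index([2, 4], name="TransactionID"),
)

# A left join keeps every transaction; rows without identity get NaN.
# indicator=True adds a "_merge" column saying where each row came from.
merged = tx.merge(idn, how="left", left_index=True, right_index=True, indicator=True)

has_identity = merged["_merge"] == "both"
share = has_identity.mean()

print(merged)
print(f"{share:.1%} of transactions have an associated identity row")
```

On the real tables, the share computed this way should match the percentages quoted above.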

train_transaction['TransactionDT'].shape[0] , train_transaction['TransactionDT'].nunique()

(590540, 573349)

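TransactionDT is not an actual timestamp (the counts above show that many transactions share the same value), but it can still be used to measure elapsed time.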
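
As a sketch of how TransactionDT can be used to measure time without knowing the reference date, the snippet below derives a day index and an hour of day from the raw seconds. The sample values are illustrative (the first and last are the train min and max reported earlier), and the hour-of-day reading assumes the unknown reference date falls on a midnight boundary.

```python
import pandas as pd

# Sample TransactionDT values in seconds (illustrative).
transaction_dt = pd.Series([86400, 86401, 90000, 172800, 15811131], name="TransactionDT")

SECONDS_PER_DAY = 24 * 60 * 60

# Whole days elapsed since the (unknown) reference date.
day_index = transaction_dt // SECONDS_PER_DAY

# Hour within the day, assuming the reference date starts at midnight.
hour_of_day = (transaction_dt % SECONDS_PER_DAY) // 3600

# The same durations as pandas timedeltas, for readable display.
elapsed = pd.to_timedelta(transaction_dt, unit="s")

print(pd.DataFrame({"elapsed": elapsed, "day": day_index, "hour": hour_of_day}))
```

Relative features like these can be computed without ever recovering the real calendar dates.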
