IEEE-CIS (the IEEE Computational Intelligence Society) conducts research in many areas of artificial intelligence and machine learning. It is currently partnering with Vesta, a world-leading payment service company, to find the best solutions for the financial fraud prevention industry. In this case study we will build an anti-fraud model on e-commerce data to make fraud-transaction alerts more effective, helping businesses reduce fraud losses and increase revenue.
This is a binary classification problem: the target variable indicates whether a user's transaction is fraudulent or not. The data consists of two files, identity and transaction, which are joined by the common key TransactionID. Note that not every transaction has corresponding identity information. The categorical features in the two files are:
ProductCD
card1 - card6
addr1, addr2
P_emaildomain
R_emaildomain
M1 - M9
DeviceType
DeviceInfo
id_12 - id_38
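Before modelling, these categorical columns usually have to be converted to numeric codes. The sketch below is only a hedged illustration (not part of the original notebook): it assumes train_transaction and test_transaction have already been loaded as in the later cells, and the column subset is chosen arbitrarily from the list above.
import pandas as pd
from sklearn import preprocessing
# Illustrative subset of the categorical columns listed above
categorical_cols = ['ProductCD', 'card4', 'card6', 'P_emaildomain', 'M1']
for col in categorical_cols:
    le = preprocessing.LabelEncoder()
    # Fit on train and test together so categories unseen in train still get a code;
    # astype(str) turns NaN into the string 'nan', giving missing values their own code.
    combined = pd.concat([train_transaction[col], test_transaction[col]]).astype(str)
    le.fit(combined)
    train_transaction[col] = le.transform(train_transaction[col].astype(str))
    test_transaction[col] = le.transform(test_transaction[col].astype(str))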
Load the required libraries.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.metrics import roc_auc_score
import matplotlib.gridspec as gridspec
# Standard plotly imports
import plotly.plotly as py
import plotly.graph_objs as go
import plotly.tools as tls
import warnings
warnings.filterwarnings("ignore")
import gc
gc.enable()
import logging
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
import os
print(os.listdir("../input"))
# Any results you write to the current directory are saved as output.
print ("Ready!")
['ieee-fraud-detection.zip', 'sample_submission.csv', 'test_identity.csv', 'test_transaction.csv', 'train_identity.csv', 'train_transaction.csv']
Ready!
print('# File sizes')
for f in os.listdir('../input'):
    if 'zip' not in f:
        print(f.ljust(30) + str(round(os.path.getsize('../input/' + f) / 1000000, 2)) + 'MB')
# File sizes
sample_submission.csv         6.08MB
test_identity.csv             25.8MB
test_transaction.csv          613.19MB
train_identity.csv            26.53MB
train_transaction.csv         683.35MB
train_transaction = pd.read_csv('../input/train_transaction.csv', index_col='TransactionID')
test_transaction = pd.read_csv('../input/test_transaction.csv', index_col='TransactionID')
train_identity = pd.read_csv('../input/train_identity.csv', index_col='TransactionID')
test_identity = pd.read_csv('../input/test_identity.csv', index_col='TransactionID')
print ("Data is loaded!")
Data is loaded!
Wall time: 1min 11s
print('train_transaction shape is {}'.format(train_transaction.shape))
print('test_transaction shape is {}'.format(test_transaction.shape))
print('train_identity shape is {}'.format(train_identity.shape))
print('test_identity shape is {}'.format(test_identity.shape))
train_transaction shape is (590540, 393)
test_transaction shape is (506691, 392)
train_identity shape is (144233, 40)
test_identity shape is (141907, 40)
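The train transaction table has 393 columns and the test table has 392; the single extra column is the target isFraud, which exists only in the training data. A quick check (an addition, not in the original notebook) confirms this:
# The only column present in train_transaction but not in test_transaction should be the target
print(set(train_transaction.columns) - set(test_transaction.columns))  # expected: {'isFraud'}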
Display the first five rows of the training set.
train_transaction.head()
train_identity.head()
Let's look at the number of missing values in each variable.
# Missing values in the transaction table
missing_values_count = train_transaction.isnull().sum()
print(missing_values_count[0:10])
total_cells = np.prod(train_transaction.shape)
total_missing = missing_values_count.sum()
print("% of missing data = ", (total_missing / total_cells) * 100)

# Missing values in the identity table
missing_values_count = train_identity.isnull().sum()
print(missing_values_count[0:10])
total_cells = np.prod(train_identity.shape)
total_missing = missing_values_count.sum()
print("% of missing data = ", (total_missing / total_cells) * 100)
x = train_transaction['isFraud'].value_counts().values
sns.barplot(x=[0, 1], y=x)  # counts of non-fraud (0) vs fraud (1)
plt.title('Target variable count')
plt.show()
The target variable isFraud is clearly imbalanced: the value 0 dominates, i.e. most transactions are not fraudulent. If we train a model naively on this data frame, it can score well simply by predicting "not fraud" for almost every transaction, effectively fitting the majority class. That is not the behaviour we want; the model we build should be able to detect the fraud signal.
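As a hedged illustration (not in the original notebook), the imbalance can be quantified explicitly, and the class ratio can be preserved across validation folds with the StratifiedKFold already imported above:
# Show the class proportions explicitly
print(train_transaction['isFraud'].value_counts(normalize=True))
# Stratified folds preserve the fraud/non-fraud ratio in every split
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
y = train_transaction['isFraud']
for fold, (trn_idx, val_idx) in enumerate(folds.split(train_transaction, y)):
    print(fold, y.iloc[val_idx].mean())  # fraud rate in each validation fold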
Delete the temporary variable x and reclaim the memory.
del x
gc.collect()
Let's look at the time span of the transactions covered by the dataset. The feature TransactionDT is a timedelta, i.e. a time offset from some reference datetime.
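As a hedged sketch (an addition, not shown in the original text), the minimum and maximum of TransactionDT quoted below can be obtained like this:
# Range of TransactionDT in train and test (produces the figures quoted below)
print('Train: min =', train_transaction['TransactionDT'].min(),
      'max =', train_transaction['TransactionDT'].max())
print('Test: min =', test_transaction['TransactionDT'].min(),
      'max =', test_transaction['TransactionDT'].max())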
Train: min = 86400 max = 15811131
Test: min = 18403224 max = 34214345
The difference between train.min() and test.max(), x = 34214345 - 86400 = 34127945, is the time span of the whole dataset, but we do not know whether the unit is seconds, minutes or hours.
If it were hours, the span would be x/(24*365) = 3895.884132 years, which is impossible.
If it were minutes, the span would be x/(60*24*365) = 64.931402 years, which is also impossible, because the data provider Vesta was founded in 1995, only 24 years ago.
If it is seconds, the span is x/(3600*24*365) = 1.0821 years, which is reasonable.
Therefore, the unit of TransactionDT is seconds.
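A hedged sketch (an addition, not shown in the original) of converting those spans from seconds to days, which produces the figures below:
SECONDS_PER_DAY = 60 * 60 * 24
print('Time span of the total dataset is',
      (test_transaction['TransactionDT'].max() - train_transaction['TransactionDT'].min()) / SECONDS_PER_DAY, 'days')
print('Time span of Train dataset is',
      (train_transaction['TransactionDT'].max() - train_transaction['TransactionDT'].min()) / SECONDS_PER_DAY, 'days')
print('Time span of Test dataset is',
      (test_transaction['TransactionDT'].max() - test_transaction['TransactionDT'].min()) / SECONDS_PER_DAY, 'days')
print('The gap between train and test is',
      (test_transaction['TransactionDT'].min() - train_transaction['TransactionDT'].max()) / SECONDS_PER_DAY, 'days')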
Time span of the total dataset is 394.9993634259259 days
Time span of Train dataset is 181.99920138888888 days
Time span of Test dataset is 182.99908564814814 days
The gap between train and test is 30.00107638888889 days
Based on the TransactionIDs, we look up the identity records associated with them.
# Here we confirm that all of the transactions in `train_identity` and `test_identity`
# have a matching TransactionID in the transaction tables, i.e. we count how many
# transactions carry identity information.
print(np.sum(train_transaction.index.isin(train_identity.index.unique())))
print(np.sum(test_transaction.index.isin(test_identity.index.unique())))
144233
141907
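As a hedged illustration (not in the original notebook), the percentages stated below follow directly from these counts:
# Share of transactions that have identity information
print(round(100 * len(train_identity) / len(train_transaction), 1), '% of train TransactionIDs')
print(round(100 * len(test_identity) / len(test_transaction), 1), '% of test TransactionIDs')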
24.4% of TransactionIDs in train (144233 / 590540) have an associated train_identity.
28.0% of TransactionIDs in test (141907 / 506691) have an associated test_identity.
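To actually use the identity features alongside the transaction features, the two tables can be joined on TransactionID. A minimal left-merge sketch, assuming the frames loaded above (the names train and test are introduced only for this sketch; rows without identity information simply get NaN in the identity columns):
# Left-join keeps every transaction; identity columns are NaN where no identity exists
train = train_transaction.merge(train_identity, how='left', left_index=True, right_index=True)
test = test_transaction.merge(test_identity, how='left', left_index=True, right_index=True)
print(train.shape, test.shape)  # expected: (590540, 433) (506691, 432)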
train_transaction['TransactionDT'].shape[0] , train_transaction['TransactionDT'].nunique()
(590540, 573349)
TransactionDT is not a real timestamp (and, as the counts above show, its values are not all unique), but we can still use it to measure time.
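Since TransactionDT is an offset in seconds, simple time features can be derived from it. A hedged sketch (an illustrative addition, not the original author's feature engineering):
# Derive coarse time features from the second-valued offset TransactionDT (illustrative)
dt = train_transaction['TransactionDT']
transaction_day = dt // (24 * 3600)       # whole days since the reference point
transaction_hour = (dt // 3600) % 24      # hour of day, 0-23
print(transaction_day.head())
print(transaction_hour.value_counts().sort_index().head())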