Kaggle Competition: IEEE-CIS Fraud Detection
Competition description:
In this competition, you’ll benchmark machine learning models on a challenging large-scale dataset. The data comes from Vesta’s real-world e-commerce transactions and contains a wide range of features from device type to product features. You also have the opportunity to create new features to improve your results.
In this competition you are predicting the probability that an online transaction is fraudulent, as denoted by the binary target isFraud.
The data is broken into two files identity and transaction, which are joined by TransactionID. Not all transactions have corresponding identity information.
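Because not every transaction has identity information, joining the two files calls for a left join that keeps all transactions. The following is a minimal sketch on made-up toy data; the `has_identity` flag is an illustrative name and not part of the dataset.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity files (values are made up).
transaction = pd.DataFrame({
    "TransactionID": [1, 2, 3],
    "TransactionAmt": [50.0, 117.5, 29.0],
})
identity = pd.DataFrame({
    "TransactionID": [2],
    "DeviceType": ["mobile"],
})

# A left join keeps every transaction; rows without identity data get NaN.
merged = transaction.merge(identity, on="TransactionID", how="left")

# Hypothetical flag: 1 if identity information was available for the row.
merged["has_identity"] = merged["DeviceType"].notna().astype(int)
print(merged)
```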
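LB (Public Leaderboard): the AUC score computed on the first 20% of the test set.
Private Leaderboard (final score): the AUC score computed on the remaining 80% of the test set.
Two submissions may be submitted as final results in this competition.
Having taken part in several introductory Kaggle competitions, I tried this binary classification competition hosted by IEEE and Vesta, building a LightGBM model in Python in a Jupyter Notebook. The key to improving the score lies in mining the data and in choosing the strategies that turn it into features, which requires very careful EDA (exploratory data analysis) and FE (feature engineering).
Result: bronze medal, 373/6381 (top 6%), Private Leaderboard score 0.928512.
The approach given here is meant to aid understanding of the problem and to help explain the posted Python code; it is not the optimal solution. The ideas and code are for reference only; for the methods and detailed steps involved, please see the reference links. Variable names, comments, and experiment notes in the code are fairly messy.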
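Both leaderboards are scored by AUC, which can be read as the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one. The sketch below computes it directly from that definition in plain Python; `auc_score` is an illustrative helper, and its quadratic loop is only suitable for small examples (in practice a library routine such as scikit-learn's `roc_auc_score` would be used).

```python
def auc_score(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_score(labels, scores))  # 3 of the 4 positive/negative pairs are ordered correctly
```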
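Please refer to the following Kaggle kernels:
Nanashi: Fraud complete EDA_Nanashi
Official data description and Q&A: Data Description (Details and Discussion)
First, the Transaction table:
TransactionDT: not an actual timestamp, but a time delta in seconds from some reference time.
TransactionAMT: transaction payment amount in USD; the decimal part deserves attention.
ProductCD: product code, one of five values: W, H, C, S, R. It is not necessarily a physical product and may refer to a service.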
card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.
addr1 - addr2: billing region and billing country.
dist: distances between (not limited) billing address, mailing address, zip code, IP address, phone area, etc.
P_emaildomain and R_emaildomain: purchaser and recipient email domain. Some transactions do not need a recipient, in which case R_emaildomain is empty.
C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. Plus like device, ipaddr, billingaddr, etc. Also these are for both purchaser and recipient, which doubles the number.
D1-D15: timedelta, such as days between previous transaction, etc.
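M1-M9: match, such as names on card and address, etc. These are mostly binary variables (M4 is handled separately in the feature code in this post).
Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations. Different groups of V features have different proportions of missing values; their true meaning and the right way to process them remain unclear.
Next, the Identity table: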
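Two of the descriptions above translate directly into arithmetic: TransactionDT is an offset in seconds, and the decimal part of the transaction amount is worth extracting. The sketch below shows both in plain Python; the function names are illustrative, and rounding is used for the decimal part as an assumption to sidestep floating-point truncation.

```python
SECONDS_PER_DAY = 24 * 3600

def split_transaction_dt(transaction_dt):
    """Split a seconds offset into (day index, hour of day)."""
    day = transaction_dt // SECONDS_PER_DAY
    hour = (transaction_dt % SECONDS_PER_DAY) // 3600
    return day, hour

def decimal_part_thousandths(amount):
    """Decimal part of an amount, expressed in thousandths."""
    return round(amount * 1000) % 1000

# One day, three hours and fifteen seconds after the reference time.
print(split_transaction_dt(SECONDS_PER_DAY + 3 * 3600 + 15))
print(decimal_part_thousandths(117.5))
```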
id01 to id11 are numerical features for identity, which is collected by Vesta and security partners such as device rating, ip_domain rating, proxy rating, etc. Also it recorded behavioral fingerprint like account login times/failed to login times, how long an account stayed on the page, etc. All of these are not able to elaborate due to security partner T&C.
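DeviceType, DeviceInfo, and id12 - id38 are categorical features.
Many EDA kernels reveal characteristics of the data, especially how its distribution changes over time and where the training and test distributions differ.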
def check_dependency(independent_var, dependent_var):
    independent_uniques = []
    temp_df = pd.concat([train_df[[independent_var, dependent_var]], test_df[[independent_var, dependent_var]]])
    for value in temp_df[independent_var].unique():
        independent_uniques.append(temp_df[temp_df[independent_var] == value][dependent_var].value_counts().shape[0])
    values = pd.Series(data=independent_uniques, index=temp_df[independent_var].unique())
    N = len(values)
    N_dependent = len(values[values == 1])
    N_notdependent = len(values[values > 1])
    N_null = len(values[values == 0])
    print(f'In {independent_var}, there are {N} unique values')
    print(f'{N_dependent}/{N} have one unique {dependent_var} value')
    print(f'{N_notdependent}/{N} have more than one unique {dependent_var} values')
    print(f'{N_null}/{N} have only missing {dependent_var} values\n')
For example:
check_dependency('R_emaildomain', 'C5')
print(train_df['C10'].isnull().sum()/train_df.shape[0])
print(test_df['C10'].isnull().sum()/test_df.shape[0])
print(test_df[~test_df['R_emaildomain'].isnull()]['C5'].value_counts())
In R_emaildomain, there are 61 unique values
60/61 have one unique C5 value
0/61 have more than one unique C5 values
1/61 have only missing C5 values
0.0
5.920768278891869e-06
0.0 135867
Name: C5, dtype: int64
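This shows that R_emaildomain and C5 are strongly dependent: every non-missing R_emaildomain value maps to a single C5 value. C5 has a few missing values in the test set, and they occur where R_emaildomain is present. Since C5 is always 0 whenever R_emaildomain is present, filling the missing C5 values with 0 is reasonable.
Following this approach, I found several groups of strongly dependent features and filled in the missing values in the test set:
#1.1 find dependency and fillna
# 'dist1', 'C3': only test has missing C3, and only where dist1 is present; C3 is always 0 where dist1 is present
test_df['C3'] = test_df['C3'].fillna(0)
# 'R_emaildomain', 'C5': only test has missing C5, almost always where R_emaildomain is present; only 3 missing C5 values occur where R_emaildomain is missing
test_df['C5'] = test_df['C5'].fillna(0)
# 'id_30', 'C7': only test has missing C7, only where id_30 is present; only 3 C7 values are missing there, all others are 0 (Device)
test_df['C7'] = test_df['C7'].fillna(0)
# 'id_31', 'C9': only test has missing C9, only where id_31 is present; only 3 C9 values are missing there, all others are 0 (Browser)
test_df['C9'] = test_df['C9'].fillna(0)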
#1. More interaction between card features + fill nans
i_cols = ['TransactionID','card1','card2','card3','card4','card5','card6']
full_df = pd.concat([train_df[i_cols], test_df[i_cols]])
## I've used frequency encoding before so we have ints here
## we will drop very rare cards
full_df['card6'] = np.where(full_df['card6']==30, np.nan, full_df['card6'])
full_df['card6'] = np.where(full_df['card6']==16, np.nan, full_df['card6'])
i_cols = ['card2','card3','card4','card5','card6']
## We will find best match for nan values and fill with it
## (filling card2 through card6 helped a lot)
for col in i_cols:
    temp_df = full_df.groupby(['card1',col])[col].agg(['count']).reset_index()
    temp_df = temp_df.sort_values(by=['card1','count'], ascending=False).reset_index(drop=True)
    del temp_df['count']
    # Keep one row per card1: the first row is the most frequent value after the sort.
    # (Without subset=['card1'] no rows are dropped and to_dict() keeps the LAST, least frequent value.)
    temp_df = temp_df.drop_duplicates(subset=['card1'], keep='first').reset_index(drop=True)
    temp_df.index = temp_df['card1'].values
    temp_df = temp_df[col].to_dict()
    full_df[col] = np.where(full_df[col].isna(), full_df['card1'].map(temp_df), full_df[col])
i_cols = ['card1','card2','card3','card4','card5','card6']
for col in i_cols:
    train_df[col] = full_df[full_df['TransactionID'].isin(train_df['TransactionID'])][col].values
    test_df[col] = full_df[full_df['TransactionID'].isin(test_df['TransactionID'])][col].values
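The card-filling step above boils down to "fill a missing value with the most frequent value seen for the same card1". A minimal sketch of that idea on made-up toy data follows; `best_match` and `card2_filled` are illustrative names, and this compact form is an alternative phrasing rather than the code actually used for the competition.

```python
import numpy as np
import pandas as pd

# Toy data: card2 is missing in one row of each card1 group.
df = pd.DataFrame({
    "card1": [100, 100, 100, 100, 200, 200],
    "card2": [321.0, 321.0, 490.0, np.nan, 111.0, np.nan],
})

# Most frequent non-missing card2 value per card1
# (value_counts sorts by count, most frequent first).
best_match = (
    df.dropna(subset=["card2"])
      .groupby("card1")["card2"]
      .agg(lambda s: s.value_counts().index[0])
)

# Fill only the missing entries with the per-card1 best match.
df["card2_filled"] = df["card2"].fillna(df["card1"].map(best_match))
print(df)
```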
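To protect user information, the organizers processed the features heavily and concealed their real meaning. Careful observation and analysis of the data is needed to infer what each feature means and what information it carries, so that a sensible processing method can be chosen.
Choosing a processing strategy: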
dt_df[new_col+'_min_max'] = (dt_df[col]-dt_df['temp_min'])/(dt_df['temp_max']-dt_df['temp_min'])
dt_df[new_col+'_std_score'] = (dt_df[col]-dt_df['temp_mean'])/(dt_df['temp_std'])
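The two lines above are the min-max and standard-score formulas, with `temp_min`, `temp_max`, `temp_mean`, and `temp_std` presumably holding statistics computed within a time block. The sketch below shows one way to produce such columns with a grouped transform; the helper name `normalize_within_period` and the choice of `D15` and `DT_M` in the toy data are assumptions for illustration only.

```python
import pandas as pd

def normalize_within_period(df, col, period):
    """Add min-max and standard-score versions of `col`, computed inside each `period` group."""
    grouped = df.groupby(period)[col]
    g_min, g_max = grouped.transform("min"), grouped.transform("max")
    g_mean, g_std = grouped.transform("mean"), grouped.transform("std")
    out = df.copy()
    out[col + "_min_max"] = (df[col] - g_min) / (g_max - g_min)
    out[col + "_std_score"] = (df[col] - g_mean) / g_std
    return out

# Toy data: the same relative pattern at two very different scales.
toy = pd.DataFrame({
    "DT_M": [12, 12, 12, 13, 13, 13],
    "D15": [0.0, 5.0, 10.0, 100.0, 150.0, 200.0],
})
result = normalize_within_period(toy, "D15", "DT_M")
print(result)
```

After normalization both months yield the same values, which is the point of scaling inside each time block: a feature that drifts over time becomes comparable between the training and test periods.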
The numerical V features are:
'V126' 'V127' 'V128' 'V130' 'V131' 'V132' 'V133' 'V134' 'V136' 'V137'
'V143' 'V144' 'V145' 'V150' 'V159' 'V160' 'V164' 'V165' 'V166' 'V202'
'V203' 'V204' 'V205' 'V206' 'V207' 'V208' 'V209' 'V210' 'V211' 'V212'
'V213' 'V214' 'V215' 'V216' 'V263' 'V264' 'V265' 'V266' 'V267' 'V268'
'V270' 'V271' 'V272' 'V273' 'V274' 'V275' 'V276' 'V277' 'V278' 'V306'
'V307' 'V308' 'V309' 'V310' 'V312' 'V313' 'V314' 'V315' 'V316' 'V317'
'V318' 'V320' 'V321' 'V331' 'V332' 'V333' 'V335'
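Choosing processing methods:
For the initial feature-processing approach (LB -> 0.9487), see:
Konstantin Yakovlev: IEEE - Internal Blend
David Cairuz: Feature Engineering & LightGBM
For the later feature-processing work (LB: 0.9487 -> 0.9526), see the other experiment notes. The feature-engineering code finally adopted is as follows: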
import numpy as np
import pandas as pd
import gc
import os, sys, random, datetime
Shrink the datasets so that they use less memory and process faster; see: Konstantin Yakovlev, IEEE Data minification
def seed_everything(seed=0):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
## Memory Reducer
# :df pandas dataframe to reduce size # type: pd.DataFrame()
# :verbose # type: bool
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
Load the training and test sets and reduce their memory footprint.
print('Load Data')
train_df = pd.read_csv('../input/train_transaction.csv')
test_df = pd.read_csv('../input/test_transaction.csv')
test_df['isFraud'] = 0
train_identity = pd.read_csv('../input/train_identity.csv')
test_identity = pd.read_csv('../input/test_identity.csv')
print('Reduce Memory')
train_df = reduce_mem_usage(train_df)
test_df = reduce_mem_usage(test_df)
train_identity = reduce_mem_usage(train_identity)
test_identity = reduce_mem_usage(test_identity)
Load Data
Reduce Memory
Mem. usage decreased to 542.35 Mb (69.4% reduction)
Mem. usage decreased to 473.07 Mb (68.9% reduction)
Mem. usage decreased to 25.86 Mb (42.7% reduction)
Mem. usage decreased to 25.44 Mb (42.7% reduction)
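Initial processing of the identity data: split string features such as DeviceInfo, id_30 (OS information), and id_31 (browser information) into new features, and derive device features from id_33 (screen resolution); convert the remaining categorical features from strings to numeric values, and bin some of the information: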
def id_split(dataframe):
    dataframe['device_name'] = dataframe['DeviceInfo'].str.split('/', expand=True)[0]
    dataframe['device_version'] = dataframe['DeviceInfo'].str.split('/', expand=True)[1]
    dataframe['OS_id_30'] = dataframe['id_30'].str.split(' ', expand=True)[0]
    dataframe['version_id_30'] = dataframe['id_30'].str.split(' ', expand=True)[1]
    dataframe['browser_id_31'] = dataframe['id_31'].str.split(' ', expand=True)[0]
    dataframe['version_id_31'] = dataframe['id_31'].str.split(' ', expand=True)[1]
    dataframe['screen_width'] = dataframe['id_33'].str.split('x', expand=True)[0]
    dataframe['screen_height'] = dataframe['id_33'].str.split('x', expand=True)[1]
    # Map categorical string values to small integers
    dataframe['id_12'] = dataframe['id_12'].map({'Found':1, 'NotFound':0})
    dataframe['id_15'] = dataframe['id_15'].map({'New':2, 'Found':1, 'Unknown':0})
    dataframe['id_16'] = dataframe['id_16'].map({'Found':1, 'NotFound':0})
    dataframe['id_23'] = dataframe['id_23'].map({'TRANSPARENT':4, 'IP_PROXY':3, 'IP_PROXY:ANONYMOUS':2, 'IP_PROXY:HIDDEN':1})
    dataframe['id_27'] = dataframe['id_27'].map({'Found':1, 'NotFound':0})
    dataframe['id_28'] = dataframe['id_28'].map({'New':2, 'Found':1})
    dataframe['id_29'] = dataframe['id_29'].map({'Found':1, 'NotFound':0})
    dataframe['id_35'] = dataframe['id_35'].map({'T':1, 'F':0})
    dataframe['id_36'] = dataframe['id_36'].map({'T':1, 'F':0})
    dataframe['id_37'] = dataframe['id_37'].map({'T':1, 'F':0})
    dataframe['id_38'] = dataframe['id_38'].map({'T':1, 'F':0})
    # id_34: keep the number after the colon, treating missing as NaN
    dataframe['id_34'] = dataframe['id_34'].fillna(':0')
    dataframe['id_34'] = dataframe['id_34'].apply(lambda x: x.split(':')[1]).astype(np.int8)
    dataframe['id_34'] = np.where(dataframe['id_34']==0, np.nan, dataframe['id_34'])
    # id_33: screen resolution as two integer columns (0 where missing)
    dataframe['id_33'] = dataframe['id_33'].fillna('0x0')
    dataframe['id_33_0'] = dataframe['id_33'].apply(lambda x: x.split('x')[0]).astype(int)
    dataframe['id_33_1'] = dataframe['id_33'].apply(lambda x: x.split('x')[1]).astype(int)
    dataframe['id_33'] = np.where(dataframe['id_33']=='0x0', np.nan, dataframe['id_33'])
    for feature in ['id_01', 'id_31', 'id_33', 'id_36']:
        dataframe[feature + '_count_dist'] = dataframe[feature].map(dataframe[feature].value_counts(dropna=False))
    # NOTE: the result of this map is not assigned, so DeviceType is left unchanged as written
    dataframe['DeviceType'].map({'desktop':1, 'mobile':0})
    # Collapse device names into manufacturers
    dataframe.loc[dataframe['device_name'].str.contains('SM', na=False), 'device_name'] = 'Samsung'
    dataframe.loc[dataframe['device_name'].str.contains('SAMSUNG', na=False), 'device_name'] = 'Samsung'
    dataframe.loc[dataframe['device_name'].str.contains('GT-', na=False), 'device_name'] = 'Samsung'
    dataframe.loc[dataframe['device_name'].str.contains('Moto G', na=False), 'device_name'] = 'Motorola'
    dataframe.loc[dataframe['device_name'].str.contains('Moto', na=False), 'device_name'] = 'Motorola'
    dataframe.loc[dataframe['device_name'].str.contains('moto', na=False), 'device_name'] = 'Motorola'
    dataframe.loc[dataframe['device_name'].str.contains('LG-', na=False), 'device_name'] = 'LG'
    dataframe.loc[dataframe['device_name'].str.contains('rv:', na=False), 'device_name'] = 'RV'
    dataframe.loc[dataframe['device_name'].str.contains('HUAWEI', na=False), 'device_name'] = 'Huawei'
    dataframe.loc[dataframe['device_name'].str.contains('ALE-', na=False), 'device_name'] = 'Huawei'
    dataframe.loc[dataframe['device_name'].str.contains('-L', na=False), 'device_name'] = 'Huawei'
    dataframe.loc[dataframe['device_name'].str.contains('Blade', na=False), 'device_name'] = 'ZTE'
    dataframe.loc[dataframe['device_name'].str.contains('BLADE', na=False), 'device_name'] = 'ZTE'
    dataframe.loc[dataframe['device_name'].str.contains('Linux', na=False), 'device_name'] = 'Linux'
    dataframe.loc[dataframe['device_name'].str.contains('XT', na=False), 'device_name'] = 'Sony'
    dataframe.loc[dataframe['device_name'].str.contains('HTC', na=False), 'device_name'] = 'HTC'
    dataframe.loc[dataframe['device_name'].str.contains('ASUS', na=False), 'device_name'] = 'Asus'
    # Group device names seen fewer than 200 times
    dataframe.loc[dataframe.device_name.isin(dataframe.device_name.value_counts()[dataframe.device_name.value_counts() < 200].index), 'device_name'] = "Others"
    dataframe['had_id'] = 1
    gc.collect()
    return dataframe
train_identity = id_split(train_identity)
test_identity = id_split(test_identity)
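To make the string splitting concrete, here is a small sketch on made-up values in the style of id_30 and id_33. It splits each column once and reuses the result, which is an illustrative variation on the function above rather than the code used for the competition.

```python
import pandas as pd

# Toy rows in the style of id_30 (OS) and id_33 (screen resolution); None marks a missing value.
toy = pd.DataFrame({
    "id_30": ["Windows 10", "iOS 11.2.1", None],
    "id_33": ["1920x1080", "2208x1242", None],
})

# Split once per column; expand=True returns one DataFrame column per piece.
os_parts = toy["id_30"].str.split(" ", expand=True)
toy["OS_id_30"] = os_parts[0]
toy["version_id_30"] = os_parts[1]

screen = toy["id_33"].str.split("x", expand=True)
# Convert the pieces to numbers; missing rows stay NaN.
toy["screen_width"] = pd.to_numeric(screen[0])
toy["screen_height"] = pd.to_numeric(screen[1])
print(toy)
```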
Initial processing of the Transaction data:
#new features trans
def gen_new(train_trans,test_trans):
    # New feature - log of transaction amount.
    train_trans['TransactionAmt_Log'] = np.log1p(train_trans['TransactionAmt'])
    test_trans['TransactionAmt_Log'] = np.log1p(test_trans['TransactionAmt'])
    # New feature - decimal part of the transaction amount.
    train_trans['TransactionAmt_decimal'] = ((train_trans['TransactionAmt'] - train_trans['TransactionAmt'].astype(int)) * 1000).astype(int)
    test_trans['TransactionAmt_decimal'] = ((test_trans['TransactionAmt'] - test_trans['TransactionAmt'].astype(int)) * 1000).astype(int)
    # New feature - day of week in which a transaction happened.
    train_trans['Transaction_day_of_week'] = np.floor((train_trans['TransactionDT'] / (3600 * 24) - 1) % 7)
    test_trans['Transaction_day_of_week'] = np.floor((test_trans['TransactionDT'] / (3600 * 24) - 1) % 7)
    # New feature - hour of the day in which a transaction happened.
    train_trans['Transaction_hour'] = np.floor(train_trans['TransactionDT'] / 3600) % 24
    test_trans['Transaction_hour'] = np.floor(test_trans['TransactionDT'] / 3600) % 24
    #New feature - emaildomain with suffix
    emails = {'gmail': 'google', 'att.net': 'att', 'twc.com': 'spectrum', 'scranton.edu': 'other', 'optonline.net': 'other', 'hotmail.co.uk': 'microsoft', 'comcast.net': 'other', 'yahoo.com.mx': 'yahoo', 'yahoo.fr': 'yahoo', 'yahoo.es': 'yahoo', 'charter.net': 'spectrum', 'live.com': 'microsoft', 'aim.com': 'aol', 'hotmail.de': 'microsoft', 'centurylink.net': 'centurylink', 'gmail.com': 'google', 'me.com': 'apple', 'earthlink.net': 'other', 'gmx.de': 'other', 'web.de': 'other', 'cfl.rr.com': 'other', 'hotmail.com': 'microsoft', 'protonmail.com': 'other', 'hotmail.fr': 'microsoft', 'windstream.net': 'other', 'outlook.es': 'microsoft', 'yahoo.co.jp': 'yahoo', 'yahoo.de': 'yahoo', 'servicios-ta.com': 'other', 'netzero.net': 'other', 'suddenlink.net': 'other', 'roadrunner.com': 'other', 'sc.rr.com': 'other', 'live.fr': 'microsoft', 'verizon.net': 'yahoo', 'msn.com': 'microsoft', 'q.com': 'centurylink', 'prodigy.net.mx': 'att', 'frontier.com': 'yahoo', 'anonymous.com': 'other', 'rocketmail.com': 'yahoo', 'sbcglobal.net': 'att', 'frontiernet.net': 'yahoo', 'ymail.com': 'yahoo', 'outlook.com': 'microsoft', 'mail.com': 'other', 'bellsouth.net': 'other', 'embarqmail.com': 'centurylink', 'cableone.net': 'other', 'hotmail.es': 'microsoft', 'mac.com': 'apple', 'yahoo.co.uk': 'yahoo', 'netzero.com': 'other', 'yahoo.com': 'yahoo', 'live.com.mx': 'microsoft', 'ptd.net': 'other', 'cox.net': 'other', 'aol.com': 'aol', 'juno.com': 'other', 'icloud.com': 'apple','uknown':'uknown'}
    us_emails = ['gmail', 'net', 'edu']
    for c in ['P_emaildomain', 'R_emaildomain']:
        train_trans[c] = train_trans[c].fillna('uknown')
        test_trans[c] = test_trans[c].fillna('uknown')
        train_trans[c + '_bin'] = train_trans[c].map(emails)
        test_trans[c + '_bin'] = test_trans[c].map(emails)
        train_trans[c + '_suffix'] = train_trans[c].apply(lambda x: str(x).split('.')[-1])
        test_trans[c + '_suffix'] = test_trans[c].apply(lambda x: str(x).split('.')[-1])
        train_trans[c + '_prefix'] = train_trans[c].apply(lambda x: str(x).split('.')[0])
        test_trans[c + '_prefix'] = test_trans[c].apply(lambda x: str(x).split('.')[0])
        train_trans[c + '_suffix_us'] = train_trans[c + '_suffix'].map(lambda x: x if str(x) not in us_emails else 'us')
        test_trans[c + '_suffix_us'] = test_trans[c + '_suffix'].map(lambda x: x if str(x) not in us_emails else 'us')
    train_trans['email_check'] = np.where((train_trans['P_emaildomain']==train_trans['R_emaildomain'])&(train_trans['P_emaildomain']!='uknown'),1,0)
    test_trans['email_check'] = np.where((test_trans['P_emaildomain']==test_trans['R_emaildomain'])&(test_trans['P_emaildomain']!='uknown'),1,0)
    #New feature - dates
    START_DATE = datetime.datetime.strptime('2017-11-30', '%Y-%m-%d')
    from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
    dates_range = pd.date_range(start='2017-10-01', end='2019-01-01')
    us_holidays = calendar().holidays(start=dates_range.min(), end=dates_range.max())
    for df in [train_trans, test_trans]:
        # Temporary
        df['DT'] = df['TransactionDT'].apply(lambda x: (START_DATE + datetime.timedelta(seconds = x)))
        df['DT_M'] = (df['DT'].dt.year-2017)*12 + df['DT'].dt.month
        # isocalendar().week replaces the deprecated .dt.weekofyear (removed in pandas 2.0)
        df['DT_W'] = (df['DT'].dt.year-2017)*52 + df['DT'].dt.isocalendar().week.astype(int)
        df['DT_D'] = (df['DT'].dt.year-2017)*365 + df['DT'].dt.dayofyear
        df['DT_hour'] = df['DT'].dt.hour
        df['DT_day_week'] = df['DT'].dt.dayofweek
        df['DT_day'] = df['DT'].dt.day
        df['DT_day_month'] = (df['DT'].dt.day).astype(np.int8)
        # Possible solo feature
        df['is_december'] = df['DT'].dt.month
        df['is_december'] = (df['is_december']==12).astype(np.int8)
        # Holidays (normalize() drops the time of day so dates compare against the holiday index)
        df['is_holiday'] = df['DT'].dt.normalize().isin(us_holidays).astype(np.int8)
    #New feature - binary encoded 1/0 gen new
    i_cols = ['M1','M2','M3','M5','M6','M7','M8','M9']
    for df in [train_trans, test_trans]:
        df['M_sum'] = df[i_cols].sum(axis=1).astype(np.int8)
        df['M_na'] = df[i_cols].isna().sum(axis=1).astype(np.int8)
    #New feature - ProductCD and M4 Target mean
    for col in ['ProductCD','M4']:
        temp_dict = train_trans.groupby([col])['isFraud'].agg(['mean']).reset_index().rename(columns={'mean': col+'_target_mean'})
        temp_dict.index = temp_dict[col].values
        temp_dict = temp_dict[col+'_target_mean'].to_dict()
        train_trans[col+'_target_mean'] = train_trans[col].map(temp_dict)
        test_trans[col+'_target_mean'] = test_trans[col].map(temp_dict)
    #New feature - use it for aggregations
    train_trans['uid1'] = train_trans['card1'].astype(str)+'_'+train_trans['card2'].astype(str)
    test_trans['uid1'] = test_trans['card1'].astype(str)+'_'+test_trans['card2'].astype(str)
    train_trans['uid2'] = train_trans['uid1'].astype(str)+'_'+train_trans['card3'].astype(str)+'_'+train_trans['card5'].astype(str)
    test_trans['uid2'] = test_trans['uid1'].astype(str)+'_'+test_trans['card3'].astype(str)+'_'+test_trans['card5'].astype(str)
    train_trans['uid3'] = train_trans['uid2'].astype(str)+'_'+train_trans['addr1'].astype(str)+'_'+train_trans['addr2'].astype(str)
    test_trans['uid3'] = test_trans['uid2'].astype(str)+'_'+test_trans['addr1'].astype(str)+'_'+test_trans['addr2'].astype(str)
    # Check if the Transaction Amount is common or not (we can use freq encoding here)
    # In our dialog with the model we are telling it whether or not to trust these values
    # Clip Values
    train_trans['TransactionAmt'] = train_trans['TransactionAmt'].clip(0,5000)
    test_trans['TransactionAmt'] = test_trans['TransactionAmt'].clip(0,5000)
    train_trans['TransactionAmt_check'] = np.where(train_trans['TransactionAmt'].isin(test_trans['TransactionAmt']), 1, 0)
    test_trans['TransactionAmt_check'] = np.where(test_trans['TransactionAmt'].isin(train_trans['TransactionAmt']), 1, 0)
    return train_trans,test_trans
train_df,test_df = gen_new(train_df,test_df)
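Define the aggregation helpers: timeblock_frequency_encoding, which computes occurrence frequency within a given time block; uid_aggregation and uid_aggregation_and_normalization, which aggregate by uid; and frequency_encoding, which encodes values by their frequency: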
def timeblock_frequency_encoding(train_df, test_df, periods, columns,
                                 with_proportions=True, only_proportions=False):
    for period in periods:
        for col in columns:
            new_col = col +'_'+ period
            train_df[new_col] = train_df[col].astype(str)+'_'+train_df[period].astype(str)
            test_df[new_col] = test_df[col].astype(str)+'_'+test_df[period].astype(str)
            temp_df = pd.concat([train_df[[new_col]], test_df[[new_col]]])
            fq_encode = temp_df[new_col].value_counts().to_dict()
            train_df[new_col] = train_df[new_col].map(fq_encode)
            test_df[new_col] = test_df[new_col].map(fq_encode)
            if only_proportions:
                train_df[new_col] = train_df[new_col]/train_df[period+'_total']
                test_df[new_col] = test_df[new_col]/test_df[period+'_total']
            if with_proportions:
                train_df[new_col+'_proportions'] = train_df[new_col]/train_df[period+'_total']
                test_df[new_col+'_proportions'] = test_df[new_col]/test_df[period+'_total']
    return train_df, test_df
def uid_aggregation(train_df, test_df, main_columns, uids, aggregations):
    for main_column in main_columns:
        for col in uids:
            for agg_type in aggregations:
                new_col_name = col+'_'+main_column+'_'+agg_type
                temp_df = pd.concat([train_df[[col, main_column]], test_df[[col,main_column]]])
                temp_df = temp_df.groupby([col])[main_column].agg([agg_type]).reset_index().rename(
                    columns={agg_type: new_col_name})
                temp_df.index = list(temp_df[col])
                temp_df = temp_df[new_col_name].to_dict()
                train_df[new_col_name] = train_df[col].map(temp_df)
                test_df[new_col_name] = test_df[col].map(temp_df)
    return train_df, test_df
def uid_aggregation_and_normalization(train_df, test_df, main_columns, uids, aggregations):
    # The first aggregation is used as the center and the second as the scale (e.g. ['mean', 'std'])
    for main_column in main_columns:
        for col in uids:
            new_norm_col_name = col+'_'+main_column+'_std_norm'
            norm_cols = []
            for agg_type in aggregations:
                new_col_name = col+'_'+main_column+'_'+agg_type
                temp_df = pd.concat([train_df[[col, main_column]], test_df[[col,main_column]]])
                temp_df = temp_df.groupby([col])[main_column].agg([agg_type]).reset_index().rename(
                    columns={agg_type: new_col_name})
                temp_df.index = list(temp_df[col])
                temp_df = temp_df[new_col_name].to_dict()
                train_df[new_col_name] = train_df[col].map(temp_df)
                test_df[new_col_name] = test_df[col].map(temp_df)
                norm_cols.append(new_col_name)
            train_df[new_norm_col_name] = (train_df[main_column]-train_df[norm_cols[0]])/train_df[norm_cols[1]]
            test_df[new_norm_col_name] = (test_df[main_column]-test_df[norm_cols[0]])/test_df[norm_cols[1]]
            del train_df[norm_cols[0]], train_df[norm_cols[1]]
            del test_df[norm_cols[0]], test_df[norm_cols[1]]
    return train_df, test_df
def frequency_encoding(train_df, test_df, columns, self_encoding=False):
    for col in columns:
        temp_df = pd.concat([train_df[[col]], test_df[[col]]])
        fq_encode = temp_df[col].value_counts(dropna=False).to_dict()
        if self_encoding:
            train_df[col] = train_df[col].map(fq_encode)
            test_df[col] = test_df[col].map(fq_encode)
        else:
            train_df[col+'_fq_enc'] = train_df[col].map(fq_encode)
            test_df[col+'_fq_enc'] = test_df[col].map(fq_encode)
    return train_df, test_df
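As a quick illustration of what frequency encoding produces, the sketch below counts each card1 value over train and test together and maps the counts back, on made-up toy data. It mirrors the idea of frequency_encoding but is a standalone example, not a call to the function above.

```python
import pandas as pd

train = pd.DataFrame({"card1": [1000, 1000, 2000]})
test = pd.DataFrame({"card1": [1000, 3000]})

# Count each value over train and test together, then map the counts back.
counts = pd.concat([train["card1"], test["card1"]]).value_counts(dropna=False)
train["card1_fq_enc"] = train["card1"].map(counts)
test["card1_fq_enc"] = test["card1"].map(counts)

print(train)
print(test)
```

Counting over both sets keeps the encoding consistent between training and prediction, at the cost of using test-set information when building features.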
Next, further feature engineering:
#2. Keep intersections
for col in ['card1']:
    valid_card = pd.concat([train_df[[col]], test_df[[col]]])
    valid_card = valid_card[col].value_counts()
    valid_card_std = valid_card.values.std()
    invalid_cards = valid_card[valid_card<=2]
    print('Rare cards',len(invalid_cards))
    valid_card = valid_card[valid_card>2]
    valid_card = list(valid_card.index)
    print('No intersection in Train', len(train_df[~train_df[col].isin(test_df[col])]))
    print('Intersection in Train', len(train_df[train_df[col].isin(test_df[col])]))
    train_df[col] = np.where(train_df[col].isin(test_df[col]), train_df[col], np.nan)
    test_df[col] = np.where(test_df[col].isin(train_df[col]), test_df[col], np.nan)
    train_df[col] = np.where(train_df[col].isin(valid_card), train_df[col], np.nan)
    test_df[col] = np.where(test_df[col].isin(valid_card), test_df[col], np.nan)
    print('#'*20)
for col in ['card2','card3','card4','card5','card6']:
    print('No intersection in Train', col, len(train_df[~train_df[col].isin(test_df[col])]))
    print('Intersection in Train', col, len(train_df[train_df[col].isin(test_df[col])]))
    train_df[col] = np.where(train_df[col].isin(test_df[col]), train_df[col], np.nan)
    test_df[col] = np.where(test_df[col].isin(train_df[col]), test_df[col], np.nan)
    print('#'*20)
Rare cards 5993
No intersection in Train 10396
Intersection in Train 580144
####################
No intersection in Train card2 0
Intersection in Train card2 590540
####################
No intersection in Train card3 47
Intersection in Train card3 590493
####################
No intersection in Train card4 0
Intersection in Train card4 590540
####################
No intersection in Train card5 176
Intersection in Train card5 590364
####################
No intersection in Train card6 30
Intersection in Train card6 590510
####################
#3.generate accurate userids and cardids
train_df['uid4'] = train_df['uid3'].astype(str)+'_'+train_df['P_emaildomain'].astype(str)
test_df['uid4'] = test_df['uid3'].astype(str)+'_'+test_df['P_emaildomain'].astype(str)
train_df['uid5'] = train_df['uid3'].astype(str)+'_'+train_df['R_emaildomain'].astype(str)
test_df['uid5'] = test_df['uid3'].astype(str)+'_'+test_df['R_emaildomain'].astype(str)
train_df['uid6'] = train_df['card1'].astype(str)+'_'+train_df['D15'].astype(str)
test_df['uid6'] = test_df['card1'].astype(str)+'_'+test_df['D15'].astype(str)
# try to generate more accurate card_id and user_id
# uid1/uid2 are no longer of much use
#guess_card_id
train_df['TransactionDTday'] = (train_df['TransactionDT']/(60*60*24)).map(int)
test_df['TransactionDTday'] = (test_df['TransactionDT']/(60*60*24)).map(int)
train_df['D1minusday'] = train_df['D1'] - train_df['TransactionDTday'] # card issue day
test_df['D1minusday'] = test_df['D1'] - test_df['TransactionDTday']
train_df['D4minusday'] = train_df['D4'] - train_df['TransactionDTday'] # card issue day
test_df['D4minusday'] = test_df['D4'] - test_df['TransactionDTday']
# This should work for D1/D2/D3/D8; D2 need not be changed; D3/D8 probably have other uses
train_df['cid_1'] = train_df['uid4'].astype(str)+'_'+train_df['D1minusday'].astype(str)
test_df['cid_1'] = test_df['uid4'].astype(str)+'_'+test_df['D1minusday'].astype(str)
# guess_user_id using D4
train_df['uid7'] = train_df['uid4'].astype(str)+'_'+train_df['D4minusday'].astype(str)
test_df['uid7'] = test_df['uid4'].astype(str)+'_'+test_df['D4minusday'].astype(str)
print('#'*10)
print('Most common uIds:')
new_columns = ['uid1','uid2','uid3','uid4','uid5','uid6','uid7','cid_1']
for col in new_columns:
    print('#'*10, col)
    print(train_df[col].value_counts()[:10])
# Do Global frequency encoding
i_cols = ['card1','card2','card3','card5'] + new_columns
train_df, test_df = frequency_encoding(train_df, test_df, i_cols, self_encoding=False)
##########
Most common uIds:
########## uid1
7919_194.0 14891
9500_321.0 14112
15885_545.0 10332
17188_321.0 10312
15066_170.0 7918
12695_490.0 7079
6019_583.0 6766
12544_321.0 6760
2803_100.0 6126
7585_553.0 5325
Name: uid1, dtype: int64
########## uid2
9500_321.0_150.0_226.0 14112
15885_545.0_185.0_138.0 10332
17188_321.0_150.0_226.0 10312
7919_194.0_150.0_166.0 8844
15066_170.0_150.0_102.0 7918
12695_490.0_150.0_226.0 7079
6019_583.0_150.0_226.0 6766
12544_321.0_150.0_226.0 6760
2803_100.0_150.0_226.0 6126
7919_194.0_150.0_202.0 6047
Name: uid2, dtype: int64
########## uid3
15885_545.0_185.0_138.0_nan_nan 9900
17188_321.0_150.0_226.0_299.0_87.0 5862
12695_490.0_150.0_226.0_325.0_87.0 5766
9500_321.0_150.0_226.0_204.0_87.0 4647
3154_408.0_185.0_224.0_nan_nan 4398
12839_321.0_150.0_226.0_264.0_87.0 3538
16132_111.0_150.0_226.0_299.0_87.0 3523
15497_490.0_150.0_226.0_299.0_87.0 3419
9500_321.0_150.0_226.0_272.0_87.0 2715
5812_408.0_185.0_224.0_nan_nan 2639
Name: uid3, dtype: int64
########## uid4
15885_545.0_185.0_138.0_nan_nan_hotmail.com 4002
15885_545.0_185.0_138.0_nan_nan_gmail.com 3830
17188_321.0_150.0_226.0_299.0_87.0_gmail.com 2235
12695_490.0_150.0_226.0_325.0_87.0_gmail.com 2045
9500_321.0_150.0_226.0_204.0_87.0_gmail.com 1947
3154_408.0_185.0_224.0_nan_nan_hotmail.com 1890
3154_408.0_185.0_224.0_nan_nan_gmail.com 1537
12839_321.0_150.0_226.0_264.0_87.0_gmail.com 1473
15775_481.0_150.0_102.0_330.0_87.0_uknown 1453
15497_490.0_150.0_226.0_299.0_87.0_gmail.com 1383
Name: uid4, dtype: int64
########## uid5
12695_490.0_150.0_226.0_325.0_87.0_uknown 5446
17188_321.0_150.0_226.0_299.0_87.0_uknown 5322
9500_321.0_150.0_226.0_204.0_87.0_uknown 4403
15885_545.0_185.0_138.0_nan_nan_hotmail.com 4002
15885_545.0_185.0_138.0_nan_nan_gmail.com 3830
12839_321.0_150.0_226.0_264.0_87.0_uknown 3365
16132_111.0_150.0_226.0_299.0_87.0_uknown 3212
15497_490.0_150.0_226.0_299.0_87.0_uknown 3027
9500_321.0_150.0_226.0_272.0_87.0_uknown 2601
7664_490.0_150.0_226.0_264.0_87.0_uknown 2396
Name: uid5, dtype: int64
########## uid6
15885.0_0.0 7398
7919.0_0.0 4170
6019.0_nan 3962
nan_0.0 3754
9500.0_0.0 3414
3154.0_0.0 3016
15066.0_0.0 2995
9633.0_0.0 2968
nan_nan 2794
17188.0_0.0 2434
Name: uid6, dtype: int64
########## uid7
15775_481.0_150.0_102.0_330.0_87.0_uknown_nan 1453
12695_490.0_150.0_226.0_325.0_87.0_uknown_nan 928
17188_321.0_150.0_226.0_299.0_87.0_uknown_nan 923
9500_321.0_150.0_226.0_204.0_87.0_uknown_nan 622
16132_111.0_150.0_226.0_299.0_87.0_uknown_nan 622
12839_321.0_150.0_226.0_264.0_87.0_uknown_nan 580
7207_111.0_150.0_226.0_204.0_87.0_uknown_nan 551
7664_490.0_150.0_226.0_264.0_87.0_uknown_nan 545
15497_490.0_150.0_226.0_299.0_87.0_uknown_nan 480
9112_250.0_150.0_226.0_441.0_87.0_uknown_nan 439
Name: uid7, dtype: int64
########## cid_1
15775_481.0_150.0_102.0_330.0_87.0_uknown_-129.0 1414
9500_321.0_150.0_226.0_126.0_87.0_aol.com_85.0 404
8528_215.0_150.0_226.0_387.0_87.0_uknown_159.0 207
7207_111.0_150.0_226.0_204.0_87.0_uknown_465.0 189
12741_106.0_150.0_226.0_143.0_87.0_gmail.com_202.0 156
13597_198.0_150.0_226.0_191.0_87.0_yahoo.com_48.0 145
4121_361.0_150.0_226.0_476.0_87.0_hotmail.com_8.0 141
8900_385.0_150.0_226.0_231.0_87.0_uknown_60.0 132
9323_111.0_150.0_226.0_191.0_87.0_charter.net_50.0 109
3898_281.0_150.0_226.0_181.0_87.0_hotmail.com_188.0 106
Name: cid_1, dtype: int64
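The D1minusday trick used for cid_1 rests on a guess: if D1 counts the days since some fixed event for a card (the code comments call it the card issue day), then D1 minus the transaction day stays constant across that card's transactions and can serve as part of an identifier. The toy numbers below are made up to show the effect under that assumption.

```python
import pandas as pd

SECONDS_PER_DAY = 60 * 60 * 24

# Toy transactions from one hypothetical card: D1 grows by one for each elapsed day.
toy = pd.DataFrame({
    "TransactionDT": [1 * SECONDS_PER_DAY + 100, 4 * SECONDS_PER_DAY + 50, 9 * SECONDS_PER_DAY],
    "D1": [30.0, 33.0, 38.0],
})

# Whole days elapsed since the reference time.
toy["TransactionDTday"] = toy["TransactionDT"] // SECONDS_PER_DAY

# If D1 counts days since a fixed event, this difference is constant per card.
toy["D1minusday"] = toy["D1"] - toy["TransactionDTday"]
print(toy)
```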
#4. period counts
for col in ['DT_M','DT_W','DT_D']:
    temp_df = pd.concat([train_df[[col]], test_df[[col]]])
    fq_encode = temp_df[col].value_counts().to_dict()
    train_df[col+'_total'] = train_df[col].map(fq_encode)
    test_df[col+'_total'] = test_df[col].map(fq_encode)
#User period counts
periods = ['DT_M','DT_W','DT_D']
i_cols = ['uid4','uid5','uid6','uid7','cid_1']
for period in periods:
    for col in i_cols:
        new_column = col + '_' + period
        temp_df = pd.concat([train_df[[col,period]], test_df[[col,period]]])
        temp_df[new_column] = temp_df[col].astype(str) + '_' + (temp_df[period]).astype(str)
        fq_encode = temp_df[new_column].value_counts().to_dict()
        train_df[new_column] = (train_df[col].astype(str) + '_' + train_df[period].astype(str)).map(fq_encode)
        test_df[new_column] = (test_df[col].astype(str) + '_' + test_df[period].astype(str)).map(fq_encode)
        train_df[new_column] /= train_df[period+'_total']
        test_df[new_column] /= test_df[period+'_total']
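The period counts above express, for each transaction, what share of the period's total traffic belongs to the same uid. A compact sketch of the same quantity on made-up toy data, using grouped transforms instead of string keys (an alternative phrasing, with illustrative column names):

```python
import pandas as pd

toy = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4, 5, 6],
    "uid": ["a", "a", "b", "a", "b", "b"],
    "DT_D": [1, 1, 1, 2, 2, 2],
})

# Transactions per day, and transactions per (uid, day) pair.
day_total = toy.groupby("DT_D")["TransactionID"].transform("count")
uid_day_count = toy.groupby(["uid", "DT_D"])["TransactionID"].transform("count")

# Share of the day's traffic that belongs to this uid.
toy["uid_DT_D"] = uid_day_count / day_total
print(toy)
```

Dividing by the period total matters because some days carry far more transactions than others, so raw counts are not comparable across periods.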
#5. Prepare bank type feature
for df in [train_df, test_df]:
    df['bank_type'] = df['card3'].astype(str) +'_'+ df['card5'].astype(str)
encoding_mean = {
    1: ['DT_D','DT_hour','_hour_dist','DT_hour_mean'],
    2: ['DT_W','DT_day_week','_week_day_dist','DT_day_week_mean'],
    3: ['DT_M','DT_day_month','_month_day_dist','DT_day_month_mean'],
}
encoding_best = {
    1: ['DT_D','DT_hour','_hour_dist_best','DT_hour_best'],
    2: ['DT_W','DT_day_week','_week_day_dist_best','DT_day_week_best'],
    3: ['DT_M','DT_day_month','_month_day_dist_best','DT_day_month_best'],
}
train_df['DT_day_month'] = (train_df['DT'].dt.day).astype(np.int8)
test_df['DT_day_month'] = (test_df['DT'].dt.day).astype(np.int8)
# Some ugly code here (even worse than in other parts)
for col in ['card3','card5','bank_type']:
    for df in [train_df, test_df]:
        # Distance of each transaction from the group's mean hour / weekday / day of month
        for encode in encoding_mean:
            encode = encoding_mean[encode].copy()
            new_col = col + '_' + encode[0] + encode[2]
            df[new_col] = df[col].astype(str) +'_'+ df[encode[0]].astype(str)
            temp_dict = df.groupby([new_col])[encode[1]].agg(['mean']).reset_index().rename(
                columns={'mean': encode[3]})
            temp_dict.index = temp_dict[new_col].values
            temp_dict = temp_dict[encode[3]].to_dict()
            df[new_col] = df[encode[1]] - df[new_col].map(temp_dict)
        # Distance from the group's most frequent ("best") hour / weekday / day of month
        for encode in encoding_best:
            encode = encoding_best[encode].copy()
            new_col = col + '_' + encode[0] + encode[2]
            df[new_col] = df[col].astype(str) +'_'+ df[encode[0]].astype(str)
            temp_dict = df.groupby([col,encode[0],encode[1]])[encode[1]].agg(['count']).reset_index().rename(
                columns={'count': encode[3]})
            temp_dict.sort_values(by=[col,encode[0],encode[3]], inplace=True)
            temp_dict = temp_dict.drop_duplicates(subset=[col,encode[0]], keep='last')
            temp_dict[new_col] = temp_dict[col].astype(str) +'_'+ temp_dict[encode[0]].astype(str)
            temp_dict.index = temp_dict[new_col].values
            temp_dict = temp_dict[encode[1]].to_dict()
            df[new_col] = df[encode[1]] - df[new_col].map(temp_dict)
#6. BankType timeblock_frequency_encoding
i_cols = ['bank_type']
periods = ['DT_M','DT_W','DT_D']
# We have a few options to encode it here:
# - Just count transactions
#   (but some timeblocks have more transactions than others)
# - Divide by total transactions per timeblock (proportions)
# - Use both
# - Use only proportions
train_df, test_df = timeblock_frequency_encoding(train_df, test_df, periods, i_cols,
                                                 with_proportions=False, only_proportions=True)
#7. Ds uid aggregations (maybe not useful)
i_cols = ['D'+str(i) for i in range(1,16)]
# 'cid_1' (with underscore) is the column name used everywhere else in this notebook
uids = ['uid3','uid4','uid5','bank_type','cid_1','uid6','uid7']
aggregations = ['mean','min']
####### Cleaning negative values and column transformations
for df in [train_df, test_df]:
    for col in i_cols:
        df[col] = df[col].clip(0)
    # Let's transform the D8 and D9 columns,
    # as we are almost sure they are connected with hours
    df['D9_not_na'] = np.where(df['D9'].isna(),0,1)
    df['D8_not_same_day'] = np.where(df['D8']>=1,1,0)
    df['D8_D9_decimal_dist'] = df['D8'].fillna(0)-df['D8'].fillna(0).astype(int)
    df['D8_D9_decimal_dist'] = ((df['D8_D9_decimal_dist']-df['D9'])**2)**0.5
    df['D8'] = df['D8'].fillna(-1).astype(int)
def values_normalization(dt_df, periods, columns):
    for period in periods:
        for col in columns:
            new_col = col +'_'+ period
            dt_df[col] = dt_df[col].astype(float)
            temp_min = dt_df.groupby([period])[col].agg(['min']).reset_index()
            temp_min.index = temp_min[period].values
            temp_min = temp_min['min'].to_dict()
            temp_max = dt_df.groupby([period])[col].agg(['max']).reset_index()
            temp_max.index = temp_max[period].values
            temp_max = temp_max['max'].to_dict()
            temp_mean = dt_df.groupby([period])[col].agg(['mean']).reset_index()
            temp_mean.index = temp_mean[period].values
            temp_mean = temp_mean['mean'].to_dict()
            temp_std = dt_df.groupby([period])[col].agg(['std']).reset_index()
            temp_std.index = temp_std[period].values
            temp_std = temp_std['std'].to_dict()
            dt_df['temp_min'] = dt_df[period].map(temp_min)
            dt_df['temp_max'] = dt_df[period].map(temp_max)
            dt_df['temp_mean'] = dt_df[period].map(temp_mean)
            dt_df['temp_std'] = dt_df[period].map(temp_std)
            dt_df[new_col+'_min_max'] = (dt_df[col]-dt_df['temp_min'])/(dt_df['temp_max']-dt_df['temp_min'])
            dt_df[new_col+'_std_score'] = (dt_df[col]-dt_df['temp_mean'])/(dt_df['temp_std'])
            del dt_df['temp_min'],dt_df['temp_max'],dt_df['temp_mean'],dt_df['temp_std']
    return dt_df
#8. Ds period calculation (maybe not useful)
####### Values Normalization
i_cols.remove('D1')
i_cols.remove('D2')
i_cols.remove('D9')
periods = ['DT_D','DT_W','DT_M']
for df in [train_df, test_df]:
    df = values_normalization(df, periods, i_cols)
for col in ['D1','D2']:
    for df in [train_df, test_df]:
        df[col+'_scaled'] = df[col]/train_df[col].max()
####### Global Self frequency encoding
# self_encoding=True because
# we don't need original values anymore
i_cols = ['D'+str(i) for i in range(1,16)]
train_df, test_df = frequency_encoding(train_df, test_df, i_cols, self_encoding=True)
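The per-period scaling done by values_normalization above can be written more compactly with groupby(...).transform. The following is only a sketch of the same idea on toy data, not the code used for the submission; the helper name normalize_within_period and the toy values are assumptions made for illustration.

```python
import pandas as pd

def normalize_within_period(df, period, col):
    """Return a copy of df with min-max and z-score versions of `col`,
    both computed inside each `period` block (hypothetical helper)."""
    grouped = df.groupby(period)[col]
    g_min = grouped.transform('min')
    g_max = grouped.transform('max')
    g_mean = grouped.transform('mean')
    g_std = grouped.transform('std')  # pandas default: sample std (ddof=1)
    out = df.copy()
    out[f'{col}_{period}_min_max'] = (df[col] - g_min) / (g_max - g_min)
    out[f'{col}_{period}_std_score'] = (df[col] - g_mean) / g_std
    return out

# Toy data: two month blocks with very different value ranges
toy = pd.DataFrame({
    'DT_M': [12, 12, 12, 13, 13, 13],
    'D4':   [0.0, 5.0, 10.0, 100.0, 150.0, 200.0],
})
result = normalize_within_period(toy, 'DT_M', 'D4')
print(result)
```

After scaling, both months map to the same 0 to 1 range, which is the point of the period normalization: the feature describes where a value sits inside its own time block rather than its absolute level.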
Various transformations of TransactionAmt:
#9. TransAmt uids/cids aggregations and calculations (need more fe)
i_cols = ['TransactionAmt','TransactionAmt_decimal']
#uids = ['card1','card2','card3','card5','uid1','uid2','uid3','uid4','uid5','bank_type','uid6']
uids = ['card1','card2','card3','card5','uid3','uid4','uid5','bank_type','uid6','uid7','cid_1']
aggregations = ['mean','std','min']
# uIDs aggregations
train_df, test_df = uid_aggregation(train_df, test_df, i_cols, uids, aggregations)
for df in [train_df,test_df]:
    df['transAmt_mut_C1'] = df['TransactionAmt'] * df['C1']
    df['transAmt_mut_C13'] = df['TransactionAmt'] * df['C13']
    df['transAmt_mut_C14'] = df['TransactionAmt'] * df['C14']
    df['transAmt_dec_diff'] = df['TransactionAmt_decimal'] - ((df['uid4_TransactionAmt_mean']-df['uid4_TransactionAmt_mean'].astype(int)) * 1000).astype(int)
    df['Transdiff_in_uid'] = df['transAmt_dec_diff']*df['uid4_TransactionAmt_mean']/1000
# TransactionAmt Normalization-period scaling
periods = ['DT_D','DT_W','DT_M']
for df in [train_df, test_df]:
    df = values_normalization(df, periods, i_cols)
# Product type
train_df['product_type'] = train_df['ProductCD'].astype(str)+'_'+train_df['TransactionAmt'].astype(str)
test_df['product_type'] = test_df['ProductCD'].astype(str)+'_'+test_df['TransactionAmt'].astype(str)
i_cols = ['product_type']
periods = ['DT_D','DT_W','DT_M']
train_df, test_df = timeblock_frequency_encoding(train_df, test_df, periods, i_cols,
                                                 with_proportions=False, only_proportions=True)
train_df, test_df = frequency_encoding(train_df, test_df, i_cols, self_encoding=True)
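The uid aggregations above attach a group statistic of TransactionAmt to every row. As a rough illustration of that pattern, here is a minimal sketch under the assumption that the statistic is computed over train and test together and mapped back by key; the helper group_stat_feature, the column uid and the toy values are hypothetical and not taken from the notebook.

```python
import pandas as pd

def group_stat_feature(train, test, value_col, key_col, agg):
    """Map each row to the `agg` of `value_col` over all rows
    (train + test) that share its `key_col` (hypothetical helper)."""
    both = pd.concat([train[[key_col, value_col]], test[[key_col, value_col]]])
    stats = both.groupby(key_col)[value_col].agg(agg)
    name = f'{key_col}_{value_col}_{agg}'
    train = train.assign(**{name: train[key_col].map(stats)})
    test = test.assign(**{name: test[key_col].map(stats)})
    return train, test

# Toy data: user 'c' appears only in test
train = pd.DataFrame({'uid': ['a', 'a', 'b'], 'TransactionAmt': [10.0, 30.0, 50.0]})
test = pd.DataFrame({'uid': ['a', 'b', 'c'], 'TransactionAmt': [20.0, 70.0, 5.0]})
train, test = group_stat_feature(train, test, 'TransactionAmt', 'uid', 'mean')
print(train)
print(test)
```

Because the statistic is computed on the concatenated frames, a user seen only in the test set still gets a value, and the train and test columns stay on the same scale.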
To categorize the V features, see: Rajesh Vikraman, Understanding V columns
def column_value_freq(sel_col,cum_per):
    # For each column, count the most frequent values whose cumulative share stays below cum_per
    count_col = 'num_values_'+str(round(cum_per,2))
    rows = []
    for col in sel_col:
        col_value = train_df[col].value_counts(normalize=True)
        colpercount = pd.DataFrame({'value' : col_value.index,'per_count' : col_value.values})
        colpercount['cum_per_count'] = colpercount['per_count'].cumsum()
        if len(colpercount.loc[colpercount['cum_per_count'] < cum_per,] ) < 2:
            num_col_99 = len(colpercount.loc[colpercount['per_count'] > (1- cum_per),]) # count the dominant values
        else:
            num_col_99 = len(colpercount.loc[colpercount['cum_per_count']< cum_per,] ) # count the values below the cumulative threshold
        rows.append({'col_name': col, count_col: num_col_99})
    # DataFrame.append was removed in pandas 2.0, so the frame is built from a list of rows
    dfpercount = pd.DataFrame(rows, columns=['col_name', count_col])
    dfpercount['unique_values'] = train_df[sel_col].nunique().values
    dfpercount['unique_value_to_num_values'+str(round(cum_per,2))+'_ratio'] = 100 * (dfpercount[count_col]/dfpercount.unique_values)
    #dfpercount['percent_missing'] = percent_na(train_transaction[sel_col])['percent_missing'].round(3).values
    return dfpercount
#10. V cols
#Understand V cols
v_cols = ['V'+str(i) for i in range(1,340)]
cum_per = 0.965
colfreq=column_value_freq(v_cols,cum_per)
print(colfreq.head())
colfreq_bool = colfreq[colfreq.unique_values==2]['col_name'].values
colfreq_pseudobool = colfreq[(colfreq.unique_values !=2) & (colfreq['num_values_'+str(round(cum_per,2))] <= 2)]
colfreq_pseudobool_cat = colfreq_pseudobool[colfreq_pseudobool.unique_values <=15]['col_name'].values
colfreq_pseudobool_num = colfreq_pseudobool[colfreq_pseudobool.unique_values >15]['col_name'].values
colfreq_cat = colfreq[(colfreq.unique_values >15) & (colfreq['num_values_'+str(round(cum_per,2))] <= 15) & (colfreq['num_values_'+str(round(cum_per,2))]> 2)]['col_name'].values
colfreq_num = colfreq[colfreq['num_values_'+str(round(cum_per,2))]>15]['col_name'].values
  col_name  num_values_0.96  unique_values  unique_value_to_num_values0.96_ratio
0       V1                1              2                                    50
1       V2                2              9                               22.2222
2       V3                2             10                                    20
3       V4                2              7                               28.5714
4       V5                2              7                               28.5714
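To make the counting rule inside column_value_freq easier to follow, here is a single-column sketch of the same logic on synthetic data. The function name n_values_below_share and the two toy Series are assumptions for illustration only.

```python
import pandas as pd

def n_values_below_share(series, cum_per):
    """Count the most frequent values whose cumulative share stays below
    cum_per; if fewer than two qualify, count the values whose individual
    share exceeds 1 - cum_per instead (mirrors column_value_freq)."""
    shares = series.value_counts(normalize=True)  # sorted by share, descending
    n = int((shares.cumsum() < cum_per).sum())
    if n < 2:
        n = int((shares > 1 - cum_per).sum())
    return n

# A column dominated by one value: shares 0.90, 0.08, 0.02
skewed = pd.Series([0] * 90 + [1] * 8 + [2] * 2)
# A column spread evenly over 20 values: 0.05 each
uniform = pd.Series(list(range(20)) * 5)

print(n_values_below_share(skewed, 0.965))
print(n_values_below_share(uniform, 0.965))
```

With the thresholds used in the grouping code, the skewed column (3 unique values, count of 2) would land in the pseudo-boolean group, while the evenly spread column (count above 15) would be treated as numerical.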
Based on EDA, clip some of the outliers:
#clipping v_num_cats
vcol_spike = ['V96', 'V97','V167', 'V168','V177', 'V178','V179', 'V217', 'V218', 'V219','V231','V280', 'V282','V294', 'V322', 'V323', 'V324']
cols = list(colfreq_pseudobool_num) + vcol_spike
for df in [train_df, test_df]:
    for col in cols :
        max_value = train_df[train_df['DT_M']==train_df['DT_M'].min()][col].max()
        df[col] = df[col].clip(None,max_value)
Standardize and apply PCA to the numerical V features only. See Konstantin Yakovlev, IEEE - V columns pv; unlike that reference, this model processes only the numerical V features:
#Dealing with V cols
#Scaling with pca - Numerical V cols - scaling still needs care
from sklearn.preprocessing import StandardScaler
v_cols = colfreq_num
print(v_cols)
test_group = list(v_cols)
train_df['group_sum'] = train_df[test_group].to_numpy().sum(axis=1)
train_df['group_mean'] = train_df[test_group].to_numpy().mean(axis=1)
test_df['group_sum'] = test_df[test_group].to_numpy().sum(axis=1)
test_df['group_mean'] = test_df[test_group].to_numpy().mean(axis=1)
compact_cols = ['group_sum','group_mean']
for col in test_group:
    sc = StandardScaler()
    sc.fit(train_df[[col]].fillna(0))
    train_df[col] = sc.transform(train_df[[col]].fillna(0))
    test_df[col] = sc.transform(test_df[[col]].fillna(0))
sc_test_group = test_group
# check -> same obviously
features_check = []
from scipy.stats import ks_2samp # two-sample test of whether two samples share a distribution
for col in sc_test_group:
    features_check.append(ks_2samp(train_df[col], test_df[col])[1])
features_check = pd.Series(features_check, index=sc_test_group).sort_values()
print(features_check)
from sklearn.decomposition import PCA
# PCA is still needed: an orthogonal linear transform used here for denoising
pca = PCA(random_state=42)
pca.fit(train_df[sc_test_group])
print(len(sc_test_group), pca.transform(train_df[sc_test_group]).shape[-1])
train_df[sc_test_group] = pca.transform(train_df[sc_test_group])
test_df[sc_test_group] = pca.transform(test_df[sc_test_group])
sc_variance =pca.explained_variance_ratio_
print(sc_variance)
# check
features_check = []
for col in sc_test_group:
    features_check.append(ks_2samp(train_df[col], test_df[col])[1])
features_check = pd.Series(features_check, index=sc_test_group).sort_values()
print(features_check)
train_df[col], test_df[col]
['V126' 'V127' 'V128' 'V130' 'V131' 'V132' 'V133' 'V134' 'V136' 'V137'
'V143' 'V144' 'V145' 'V150' 'V159' 'V160' 'V164' 'V165' 'V166' 'V202'
'V203' 'V204' 'V205' 'V206' 'V207' 'V208' 'V209' 'V210' 'V211' 'V212'
'V213' 'V214' 'V215' 'V216' 'V263' 'V264' 'V265' 'V266' 'V267' 'V268'
'V270' 'V271' 'V272' 'V273' 'V274' 'V275' 'V276' 'V277' 'V278' 'V306'
'V307' 'V308' 'V309' 'V310' 'V312' 'V313' 'V314' 'V315' 'V316' 'V317'
'V318' 'V320' 'V321' 'V331' 'V332' 'V333' 'V335']
V130 1.069942e-100
V136 3.228515e-87
V317 1.542568e-65
V133 9.980904e-61
V127 1.833679e-60
...
V206 2.495735e-01
V332 2.967873e-01
V333 2.998484e-01
V331 4.952229e-01
V335 5.364810e-01
Length: 67, dtype: float64
67 67
[3.95243705e-01 1.20604713e-01 9.08724136e-02 7.99145695e-02
5.93916129e-02 5.00332241e-02 4.54584312e-02 2.89428818e-02
2.32736617e-02 1.84120687e-02 1.45453003e-02 1.14526355e-02
7.81445065e-03 7.34786437e-03 5.85068362e-03 4.37949141e-03
4.02093888e-03 3.46559896e-03 3.18676729e-03 2.44932594e-03
2.27222589e-03 2.24909807e-03 1.99991560e-03 1.95987640e-03
1.71973338e-03 1.53540490e-03 1.51142993e-03 1.03556294e-03
9.71562292e-04 8.96170111e-04 8.77193459e-04 7.01838736e-04
6.96799764e-04 6.61501420e-04 5.95271089e-04 5.05089251e-04
4.25160153e-04 3.66978537e-04 3.32092303e-04 3.14215803e-04
3.07523162e-04 2.74325442e-04 2.09382351e-04 1.37651414e-04
1.31921650e-04 1.06460734e-04 9.48509909e-05 8.59092947e-05
7.21478151e-05 6.68975933e-05 5.36152094e-05 4.34649260e-05
3.15145841e-05 2.53136017e-05 2.00119609e-05 1.58136115e-05
1.04702022e-05 9.50144927e-06 5.76505723e-06 4.72613777e-06
2.37874562e-06 2.01718484e-06 5.65667558e-07 2.11127749e-07
7.10596417e-08 2.13648711e-08 9.31242653e-09]
V216 0.000000e+00
V215 0.000000e+00
V333 0.000000e+00
V316 0.000000e+00
V265 0.000000e+00
...
V310 9.931409e-60
V130 1.223724e-54
V204 6.822684e-54
V209 3.987460e-53
V312 5.351877e-44
Length: 67, dtype: float64
(0 0.000023
1 0.000009
2 0.000009
3 0.000015
4 0.000141
...
590535 0.000013
590536 0.000009
590537 0.000009
590538 0.000012
590539 0.000016
Name: V335, Length: 590540, dtype: float64, 0 0.000009
1 -0.000035
2 -0.000038
3 0.000018
4 -0.000004
...
506686 0.000009
506687 0.000007
506688 0.000009
506689 0.000009
506690 0.000009
Name: V335, Length: 506691, dtype: float64)
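The scaling and PCA step above fits both transforms on the training set and then applies them unchanged to the test set. A minimal sketch of that fit-on-train pattern follows, using synthetic arrays in place of the V columns; the array names and sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic stand-ins for the numerical V columns (4 features)
rng = np.random.default_rng(0)
train_v = rng.normal(size=(200, 4))
train_v[:, 1] = 2.0 * train_v[:, 0] + 0.1 * train_v[:, 1]  # make two columns correlated
test_v = rng.normal(size=(50, 4))

# Fit the scaler and the PCA on train only, then reuse them for test
scaler = StandardScaler().fit(train_v)
pca = PCA(random_state=42).fit(scaler.transform(train_v))
train_pc = pca.transform(scaler.transform(train_v))
test_pc = pca.transform(scaler.transform(test_v))

print(train_pc.shape, test_pc.shape)
print(pca.explained_variance_ratio_.round(3))
```

With no n_components given, PCA keeps as many components as there are input columns, which matches the "67 67" line in the output above; the explained variance ratios are returned in decreasing order and sum to 1.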
Clip the C features. EDA shows that many C features are distributed very differently in the training and test sets: in the training set they show obvious outliers in winter, so these outliers are clipped to bring the distributions closer.
#12. Cs frequency encode and clip
i_cols = ['C'+str(i) for i in range(1,15)]
####### Global Self frequency encoding
# self_encoding=False because
# I want to keep original values
train_df, test_df = frequency_encoding(train_df, test_df, i_cols, self_encoding=False)
####### Clip max values - this amounts to throwing away the winter peak
for df in [train_df, test_df]:
    for col in i_cols:
        max_value = train_df[train_df['DT_M']==train_df['DT_M'].max()][col].max()
        df[col] = df[col].clip(None,max_value)
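The clipping rule above takes the maximum of a column within one reference month of the training set and uses it as an upper cap for both sets. Here is a small sketch of that rule on made-up numbers; the toy frames and the spike in month 13 are assumptions for illustration.

```python
import pandas as pd

# Toy data: month 13 plays the role of the winter spike
train = pd.DataFrame({'DT_M': [12, 12, 13, 13, 17, 17],
                      'C1':   [1, 2, 50, 400, 3, 6]})
test = pd.DataFrame({'DT_M': [19, 19], 'C1': [4, 90]})

# Cap = maximum of the column in the last training month
ref_month = train['DT_M'].max()
cap = train.loc[train['DT_M'] == ref_month, 'C1'].max()

train_clipped = train['C1'].clip(upper=cap)
test_clipped = test['C1'].clip(upper=cap)
print(cap)
print(train_clipped.tolist())
print(test_clipped.tolist())
```

Using the last training month as the reference is a design choice: it is the block closest in time to the test period, so its range is assumed to be the most representative of what the model will see.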
Try various combinations:
#13. More combinations
## Identity columns
from sklearn.preprocessing import LabelEncoder
for col in ['id_33']:
    train_identity[col] = train_identity[col].fillna('unseen_before_label')
    test_identity[col] = test_identity[col].fillna('unseen_before_label')
    le = LabelEncoder()
    le.fit(list(train_identity[col])+list(test_identity[col]))
    train_identity[col] = le.transform(train_identity[col])
    test_identity[col] = le.transform(test_identity[col])
print('train_set shape before merge:',train_df.shape)
train_df1 = train_df.merge(train_identity,how='left',on=['TransactionID'])
# report the merged frame, not the frame from before the merge
print('train_set shape after merge:',train_df1.shape)
print('test_set shape before merge:',test_df.shape)
test_df1 = test_df.merge(test_identity,how='left',on=['TransactionID'])
print('test_set shape after merge:',test_df1.shape)
# New feature - ratio of a value to its group mean / std
columns_a = ['TransactionAmt', 'id_02', 'D15']
columns_b = ['card1', 'card4', 'addr1']
for col_a in columns_a:
    for col_b in columns_b:
        for df in [train_df1, test_df1]:
            df[f'{col_a}_to_mean_{col_b}'] = df[col_a] / df.groupby([col_b])[col_a].transform('mean')
            df[f'{col_a}_to_std_{col_b}'] = df[col_a] / df.groupby([col_b])[col_a].transform('std')
del columns_a,columns_b
gc.collect()
# Some arbitrary feature interactions (trial combined features, unverified)
for feature in ['id_02__id_20', 'id_02__D8', 'D11__DeviceInfo', 'DeviceInfo__P_emaildomain', 'P_emaildomain__C2',
                'card2__dist1', 'card1__card5', 'card2__id_20', 'card5__P_emaildomain', 'addr1__card1','card1__id_02']:
    f1, f2 = feature.split('__')
    train_df1[feature] = train_df1[f1].astype(str) + '_' + train_df1[f2].astype(str)
    test_df1[feature] = test_df1[f1].astype(str) + '_' + test_df1[f2].astype(str)
    le = LabelEncoder()
    le.fit(list(train_df1[feature].astype(str).values) + list(test_df1[feature].astype(str).values))
    train_df1[feature] = le.transform(list(train_df1[feature].astype(str).values))
    test_df1[feature] = le.transform(list(test_df1[feature].astype(str).values))
train_df = train_df1
test_df = test_df1
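Two patterns from the block above are shown here on toy data: the ratio of a value to its group mean, and a combined key built from two columns and turned into integer codes. This is only a sketch; pd.factorize with sort=True is used as a lightweight stand-in for LabelEncoder, and the toy values are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    'card1': [1000, 1000, 1000, 2000, 2000],
    'addr1': [315, 315, 204, 204, 315],
    'TransactionAmt': [10.0, 20.0, 30.0, 100.0, 300.0],
})

# Ratio of each amount to the mean amount of its card1 group
group_mean = df.groupby('card1')['TransactionAmt'].transform('mean')
df['TransactionAmt_to_mean_card1'] = df['TransactionAmt'] / group_mean

# Combined key of two columns, then integer-coded in sorted order
df['addr1__card1'] = df['addr1'].astype(str) + '_' + df['card1'].astype(str)
codes, uniques = pd.factorize(df['addr1__card1'], sort=True)
df['addr1__card1_code'] = codes

print(df)
```

The ratio feature expresses how unusual an amount is for a given card, while the combined key lets the tree model split on the pair of values as a single categorical-like feature.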
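Use had_id to split transactions into traditional and online. This is only a hypothesis, but the transaction distributions after the split are consistent with it, and it explains the winter surge in transaction amount and transaction count well (Black Friday and Cyber Monday in winter, plus the offline sales peak around Chinese New Year in February). Build agg features separated by had_id; these features raised the LB slightly: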
train_df['had_id'] = train_df['had_id'].fillna(0)
test_df['had_id'] = test_df['had_id'].fillna(0)
def uid_sep_aggregation(train_df, test_df, main_columns, uids, aggregations):
    for main_column in main_columns:
        for col in uids:
            for agg_type in aggregations:
                new_col_name = col+'_'+main_column+'_sep_'+agg_type
                train_df[col+'_sep'] = train_df[col].astype(str)+train_df['had_id'].astype(str)
                test_df[col+'_sep'] = test_df[col].astype(str)+test_df['had_id'].astype(str)
                temp_df = pd.concat([train_df[[col+'_sep', main_column]], test_df[[col+'_sep',main_column]]])
                temp_df = temp_df.groupby([col+'_sep'])[main_column].agg([agg_type]).reset_index().rename(
                    columns={agg_type: new_col_name})
                temp_df.index = list(temp_df[col+'_sep'])
                temp_df = temp_df[new_col_name].to_dict()
                train_df[new_col_name] = train_df[col+'_sep'].map(temp_df)
                test_df[new_col_name] = test_df[col+'_sep'].map(temp_df)
                del train_df[col+'_sep'],test_df[col+'_sep']
    return train_df, test_df
def values_sep_normalization(dt_df, periods, columns):
    for period in periods:
        for col in columns:
            new_col = col +'_sep_'+ period
            dt_df[col] = dt_df[col].astype(float)
            dt_df[period+'_sep'] = dt_df[period].astype(str)+dt_df['had_id'].astype(str)
            temp_min = dt_df.groupby([period+'_sep'])[col].agg(['min']).reset_index()
            temp_min.index = temp_min[period+'_sep'].values
            temp_min = temp_min['min'].to_dict()
            temp_max = dt_df.groupby([period+'_sep'])[col].agg(['max']).reset_index()
            temp_max.index = temp_max[period+'_sep'].values
            temp_max = temp_max['max'].to_dict()
            temp_mean = dt_df.groupby([period+'_sep'])[col].agg(['mean']).reset_index()
            temp_mean.index = temp_mean[period+'_sep'].values
            temp_mean = temp_mean['mean'].to_dict()
            temp_std = dt_df.groupby([period+'_sep'])[col].agg(['std']).reset_index()
            temp_std.index = temp_std[period+'_sep'].values
            temp_std = temp_std['std'].to_dict()
            dt_df['temp_min'] = dt_df[period+'_sep'].map(temp_min)
            dt_df['temp_max'] = dt_df[period+'_sep'].map(temp_max)
            dt_df['temp_mean'] = dt_df[period+'_sep'].map(temp_mean)
            dt_df['temp_std'] = dt_df[period+'_sep'].map(temp_std)
            dt_df[new_col+'_min_max'] = (dt_df[col]-dt_df['temp_min'])/(dt_df['temp_max']-dt_df['temp_min'])
            dt_df[new_col+'_std_score'] = (dt_df[col]-dt_df['temp_mean'])/(dt_df['temp_std'])
            del dt_df['temp_min'],dt_df['temp_max'],dt_df['temp_mean'],dt_df['temp_std'],dt_df[period+'_sep']
    return dt_df
#9.1 TransAmt separated by had_id (Online/Traditional)
# group by uid separately for online / traditional transactions
i_cols = ['TransactionAmt','TransactionAmt_decimal']
uids = ['uid3','uid4','uid5','bank_type','uid6','uid7','cid_1']
aggregations = ['mean','std','min']
# call the had_id-separated helper defined above; plain uid_aggregation would only repeat step 9
train_df, test_df = uid_sep_aggregation(train_df, test_df, i_cols, uids, aggregations)
# normalize separately for online / traditional transactions
periods = ['DT_D','DT_W','DT_M']
for df in [train_df, test_df]:
    df = values_sep_normalization(df, periods, i_cols)
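The had_id-separated aggregation can be pictured as a group statistic computed over the pair (uid, had_id) instead of uid alone. The sketch below shows this on a single toy frame; the notebook concatenates the two keys into one string and aggregates over train and test together, whereas this illustration groups by the two columns directly, and the column uid and the amounts are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    'uid': ['a', 'a', 'a', 'a', 'b', 'b'],
    'had_id': [1, 1, 0, 0, 0, 0],   # 1 = has identity info (assumed online)
    'TransactionAmt': [100.0, 200.0, 10.0, 30.0, 5.0, 15.0],
})

# Mean amount per user, computed separately for online and traditional rows
df['uid_TransactionAmt_sep_mean'] = (
    df.groupby(['uid', 'had_id'])['TransactionAmt'].transform('mean')
)
# Plain per-user mean, for comparison
df['uid_TransactionAmt_mean'] = df.groupby('uid')['TransactionAmt'].transform('mean')
print(df)
```

For user 'a' the separated mean is 150 for the online rows and 20 for the traditional rows, while the plain mean of 85 describes neither group well, which is the motivation for splitting by had_id.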
Frequency-encode some features, or convert strings to numerical values:
#14. Category Encoding
print('Category Encoding')
from sklearn.preprocessing import LabelEncoder
## card4, card6, ProductCD
# Converting Strings to ints(or floats if nan in column) using frequency encoding
# We will be able to use these columns as category or as numerical feature
for col in ['card4', 'card6', 'ProductCD']:
    print('Encoding', col)
    temp_df = pd.concat([train_df[[col]], test_df[[col]]])
    col_encoded = temp_df[col].value_counts().to_dict()
    train_df[col] = train_df[col].map(col_encoded) # encode each category by its occurrence count
    test_df[col] = test_df[col].map(col_encoded)
    print(col_encoded)
    del temp_df,col_encoded
    gc.collect()
## M columns
# Converting Strings to ints(or floats if nan in column)
for col in ['M1','M2','M3','M5','M6','M7','M8','M9']:
    train_df[col] = train_df[col].map({'T':1, 'F':0})
    test_df[col] = test_df[col].map({'T':1, 'F':0})
for col in ['P_emaildomain', 'R_emaildomain','M4']:
    print('Encoding', col)
    temp_df = pd.concat([train_df[[col]], test_df[[col]]])
    col_encoded = temp_df[col].value_counts().to_dict()
    train_df[col] = train_df[col].map(col_encoded)
    test_df[col] = test_df[col].map(col_encoded)
    print(col_encoded)
    del temp_df,col_encoded
    gc.collect()
i_cols = ['TransactionAmt']
uids = ['card2__id_20','card1__id_02']
aggregations = ['mean','std']
# uIDs aggregations
train_df, test_df = uid_aggregation(train_df, test_df, i_cols, uids, aggregations)
## Reduce Mem One More Time
train_df = reduce_mem_usage(train_df)
test_df = reduce_mem_usage(test_df)
Category Encoding
Encoding card4
{'visa': 722693, 'mastercard': 348803, 'american express': 16078, 'discover': 9572}
Encoding card6
{'debit': 828379, 'credit': 268753, 'charge card': 16}
Encoding ProductCD
{'W': 800657, 'C': 137785, 'R': 73346, 'H': 62397, 'S': 23046}
Encoding P_emaildomain
{'gmail.com': 435803, 'yahoo.com': 182784, 'uknown': 163648, 'hotmail.com': 85649, 'anonymous.com': 71062, 'aol.com': 52337, 'comcast.net': 14474, 'icloud.com': 12316, 'outlook.com': 9934, 'att.net': 7647, 'msn.com': 7480, 'sbcglobal.net': 5767, 'live.com': 5720, 'verizon.net': 5011, 'ymail.com': 4075, 'bellsouth.net': 3437, 'yahoo.com.mx': 2827, 'me.com': 2713, 'cox.net': 2657, 'optonline.net': 1937, 'live.com.mx': 1470, 'charter.net': 1443, 'mail.com': 1156, 'rocketmail.com': 1105, 'gmail': 993, 'earthlink.net': 979, 'outlook.es': 863, 'mac.com': 862, 'hotmail.fr': 674, 'hotmail.es': 627, 'frontier.com': 594, 'roadrunner.com': 583, 'juno.com': 574, 'windstream.net': 552, 'web.de': 518, 'aim.com': 468, 'embarqmail.com': 464, 'twc.com': 439, 'frontiernet.net': 397, 'netzero.com': 387, 'centurylink.net': 386, 'q.com': 362, 'yahoo.fr': 344, 'hotmail.co.uk': 334, 'suddenlink.net': 323, 'netzero.net': 319, 'cfl.rr.com': 318, 'cableone.net': 311, 'prodigy.net.mx': 303, 'gmx.de': 298, 'sc.rr.com': 277, 'yahoo.es': 272, 'protonmail.com': 159, 'ptd.net': 140, 'yahoo.de': 137, 'hotmail.de': 130, 'live.fr': 106, 'yahoo.co.uk': 103, 'yahoo.co.jp': 101, 'servicios-ta.com': 80, 'scranton.edu': 2}
Encoding R_emaildomain
{'uknown': 824070, 'gmail.com': 118885, 'hotmail.com': 53166, 'anonymous.com': 39644, 'yahoo.com': 21405, 'aol.com': 7239, 'outlook.com': 5011, 'comcast.net': 3513, 'icloud.com': 2820, 'yahoo.com.mx': 2743, 'msn.com': 1698, 'live.com.mx': 1464, 'live.com': 1444, 'verizon.net': 1202, 'sbcglobal.net': 1163, 'me.com': 1095, 'att.net': 870, 'cox.net': 854, 'outlook.es': 853, 'bellsouth.net': 795, 'hotmail.fr': 667, 'hotmail.es': 595, 'web.de': 514, 'mac.com': 430, 'ymail.com': 405, 'optonline.net': 350, 'mail.com': 341, 'hotmail.co.uk': 317, 'yahoo.fr': 315, 'prodigy.net.mx': 303, 'gmx.de': 297, 'charter.net': 263, 'gmail': 196, 'earthlink.net': 170, 'embarqmail.com': 140, 'yahoo.de': 139, 'hotmail.de': 130, 'rocketmail.com': 126, 'yahoo.es': 124, 'juno.com': 111, 'frontier.com': 110, 'live.fr': 105, 'windstream.net': 104, 'yahoo.co.jp': 104, 'roadrunner.com': 101, 'yahoo.co.uk': 82, 'servicios-ta.com': 80, 'aim.com': 77, 'protonmail.com': 75, 'ptd.net': 70, 'scranton.edu': 69, 'twc.com': 61, 'cfl.rr.com': 57, 'suddenlink.net': 55, 'cableone.net': 46, 'q.com': 45, 'frontiernet.net': 38, 'centurylink.net': 28, 'netzero.com': 24, 'netzero.net': 19, 'sc.rr.com': 14}
Encoding M4
{'M0': 357789, 'M2': 122947, 'M1': 97306}
Mem. usage decreased to 1086.67 Mb (55.8% reduction)
Mem. usage decreased to 939.08 Mb (55.5% reduction)
Save the results as .pkl files, which take little space and can be loaded quickly with pd.read_pickle:
## Export
train_df.to_pickle('train_transaction_15.pkl')
test_df.to_pickle('test_transaction_15.pkl')
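Block-wise PCA on the V features: the best number of dimensions could not be determined, and many V features are not numerical and therefore unsuitable for PCA. Adding only the block-wise PCA features did not improve the LB in experiments, so this was abandoned.
Since V126-V138 may be cumulative values, I took diff() grouped by ProductCD, card1 and addr1. Among them, 'V126' 'V127' 'V128' 'V130' 'V131' 'V132' 'V133' 'V134' 'V136' 'V137' are numerical. The LB did not improve in experiments, so this was abandoned.
Dimensionality reduction:
(1) Recursive feature elimination for block of features: built three blocks (UID_block, D_block, Trans_block) and removed 30 features. LB 0.9488 -> 0.9487, which is acceptable.
(2) PCA on v_cols: scaling plus PCA was applied only to the numerical features among V126-335, with train and test normalized separately. The train/test distribution mismatch seems to be largely resolved and the distributions no longer differ much, but test still has a great many very high values and a much larger range. The other V features are not suitable for PCA.
(3) Permutation importance
Tried on solo features only. Features with importance = 0 over 5 folds: ['C7', 'C7_fq_enc', 'C10_fq_enc', 'addr2', 'M7', 'C4']. I expected little fluctuation and tried removing them anyway. LB: 0.9486 -> 0.9482, a drop, so they cannot be removed yet; they seem to mitigate overfitting.
(4) I wanted to filter the mean, std and uidagg combination features carefully, since they seemed to be the source of the overfitting.
Added the Di_uidagg part and removed the Duid part, cutting about 200 dimensions at once. LB: 0.9486 -> 0.9478, a drop of less than 0.001. Duidagg carries no meaning and is without doubt a cause of overfitting, but it cannot be dropped without good features to replace it. (Later, generating new uids from the D features and aggregating TransactionAmt over them raised the LB substantially.)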
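change dist:
1) Pseudo labeling (2 approaches) - oversampling
(1) Take the extreme samples (within 0.01) from one round of test predictions, add them to train, and rerun the 6-fold LightGBM:
The positive rate fell from 0.03499 to 0.019082, almost halved, and the distribution gap widened; abandoned.
In summary, oversampling this dataset would most likely make the class imbalance worse.
(2) Negative downsampling
It saves time and is useful for testing new features, but it does not improve model performance and discards part of the training data.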
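Negative downsampling, as mentioned above, keeps every fraud row and only a random fraction of the non-fraud rows. A minimal sketch on synthetic data follows; the helper negative_downsample, the 20% fraction and the toy frame are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def negative_downsample(df, target, neg_fraction, seed=0):
    """Keep all positive rows and a random `neg_fraction` of the negative
    rows, preserving the original row order (hypothetical helper)."""
    pos = df[df[target] == 1]
    neg = df[df[target] == 0].sample(frac=neg_fraction, random_state=seed)
    return pd.concat([pos, neg]).sort_index()

# Toy data: 5 fraud rows out of 100
toy = pd.DataFrame({'isFraud': [1] * 5 + [0] * 95, 'x': np.arange(100)})
small = negative_downsample(toy, 'isFraud', neg_fraction=0.2)

print(len(toy), toy['isFraud'].mean())
print(len(small), small['isFraud'].mean())
```

The reduced frame trains much faster, which makes it convenient for quick feature experiments, but the predicted probabilities are no longer calibrated to the original fraud rate and some information in the dropped negatives is lost.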
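deeper fe: (find magic)
(1) Built a rolling window of duplicates: the ratios in the test and training sets look about the same. LB: 0.9487 -> 0.9486; abandoned.
(2) Improve the distributions and try to raise prediction accuracy for winter (where the distributions differ most)
a. Cs: 15- and 30-day shifts plus a 30-day rolling mean and median. LB: 0.9487 -> 0.9486; essentially useless.
b. Vs: 'V144' 'V145' 'V150' 'V159' 'V160' show a small hump in winter; applied 15- and 30-day shifts before scaling and PCA.
c. Lag features on userid: my laptop does not have enough memory to generate rolling-window lag features; abandoned.
d. uid_aggregation on some of the C features: could this separate merchant users from ordinary users?
In a sense this also amplifies the influence of the winter data, but the corresponding samples are too few; it did not raise the AUC of the first fold and does not look promising; abandoned.
f. Binned the email features and added a few TransactionAmt*C combinations.
The features all rank as important, but LB: 0.9487 -> 0.9486; CV is fairly high, so they are kept for now.
g. Clipped the multi-category V columns to remove outliers; no LB gain, kept for now.
h. The earlier handling of the D features was too crude, so I reprocessed them; the mean and std previously obtained from uidagg were practically noise. Keeping only Dsuidagg: LB 0.9478 -> 0.9480 (+0.0002); then adding scaling: LB 0.9480 -> 0.9486 (+0.0006). From this I conclude that Dsuidagg is the main culprit behind the overfitting.
i. Build DminusDT; consider dropping uid1 and uid2. Some D features fall into two classes, "to id cards" and "to id users"; consider generating a new uid and cardid, add the min of TransactionAmt to Trans_agg, and possibly keep the scaling part depending on the results. Current dimensionality: 502.
j. Analyzing the Christmas surge through had_id: is it driven by Black Friday and Cyber Monday?
Does the February surge in offline transaction amount correspond to Chinese New Year? Does had_id indicate an online transaction?
See: miguel perez, Physical vs e-commerce (real dates, clearer)
Splitting traditional and online by had_id: had_id is missing for 75.6% of train, i.e. traditional, and for 71.9% of test. Both daily_trans_count and daily_trans_amt change in a way that is consistent with the shopping-holiday hypothesis, so consider generating new features from had_id.
One idea: where had_id is present (online), compute count/uidagg over the online transactions; for the traditional part, compute count/uidagg over the corresponding transactions; then merge both into a single column as a new feature. Also add Ds_period_normalization.
k. Try PCA on Vs_group as new features?
Not useful: Mean AUC = 0.9466500259256587, Out of folds AUC = 0.9458734168996746, a tiny increase, but the LB stayed at exactly 0.9526.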
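Items a and b above mention shift and rolling-window features. As a rough illustration, the sketch below applies shift and rolling to a per-day aggregate; how these features were actually built in the experiments is not shown in this post, so the per-day aggregation, the column C1_daily_mean and the short windows (2 and 3 days instead of 15 and 30) are all assumptions.

```python
import pandas as pd

# Toy per-day aggregate of a C feature, one row per day
daily = pd.DataFrame({'DT_D': range(1, 8),
                      'C1_daily_mean': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]})
daily = daily.sort_values('DT_D')

# Value from 2 days earlier
daily['C1_shift_2'] = daily['C1_daily_mean'].shift(2)
# Mean and median over the current day and the 2 days before it
daily['C1_roll_3_mean'] = daily['C1_daily_mean'].rolling(3).mean()
daily['C1_roll_3_median'] = daily['C1_daily_mean'].rolling(3).median()
print(daily)
```

The first rows of each new column are NaN because there is not enough history yet; the daily table can then be merged back onto the transactions by DT_D.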
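Trying different CV strategies (currently GroupKFold):
(1) The earlier CV strategies all leak future information into training (time travel); tried a single-fold time_split, and the score dropped -> 0.9389; abandoned.
(2) Tried sklearn's time_series_split directly; the score dropped -> 0.9339; abandoned.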
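For reference, a single time-ordered holdout of the kind mentioned in (1) can be built by validating on the latest month block only. The exact time_split used in the experiment is not shown in this post, so the helper time_holdout_split below is a hypothetical form of the idea.

```python
import numpy as np

def time_holdout_split(months, n_valid_months=1):
    """Return (train_idx, valid_idx) where the validation rows are the
    last `n_valid_months` month blocks (hypothetical helper)."""
    months = np.asarray(months)
    valid_months = np.sort(np.unique(months))[-n_valid_months:]
    valid_mask = np.isin(months, valid_months)
    return np.where(~valid_mask)[0], np.where(valid_mask)[0]

# Toy month labels in chronological order
dt_m = [12, 12, 13, 13, 14, 14, 15, 15]
train_idx, valid_idx = time_holdout_split(dt_m)
print(train_idx, valid_idx)
```

Unlike GroupKFold over months, this split never trains on data that comes after the validation block, so it avoids the time-travel issue at the cost of using less data for training and giving only one validation score.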
First, load the preprocessed .pkl files:
import lightgbm as lgb
import pandas as pd
import numpy as np
import os, sys
import logging
import operator
import gc
from sklearn import metrics
train_trans = pd.read_pickle('train_transaction_14.pkl')
test_trans = pd.read_pickle('test_transaction_14.pkl')
print('train_set shape after merge:',train_trans.shape)
print('test_set shape after merge:',test_trans.shape)
train_set shape after merge: (590540, 765)
test_set shape after merge: (506691, 765)
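Then drop the intermediate feature variables that are unneeded, uninformative, or cannot be processed. At the same time, label-encode the object-type features that have not been encoded yet so that LightGBM can handle them.
Some features are dropped because they tend to cause overfitting, act as noise that hurts model performance, or receive a very low feature importance from LightGBM. For the other dropped features, see: Roman, Recursive feature elimination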
#drop cols
not_use = ['dist2', 'C3', 'D7', 'M1', 'id_04', 'id_07', 'id_08', 'id_10', 'id_16', 'id_18', 'id_21', 'id_22', 'id_23', 'id_24', 'id_25', 'id_26', 'id_27', 'id_28', 'id_29', 'id_34', 'id_35']
rm_cols = ['bank_type','uid1','uid2','uid3','uid4','uid5','DT','DT_W','DT_D','DT_hour','DT_day_week','DT_day','DT_D_total','DT_W_total','DT_M_total','id_30','id_31','id_33']
drop_v_vols = ['V1', 'V2', 'V14', 'V15', 'V16', 'V18', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V31', 'V32', 'V39', 'V41', 'V42', 'V43', 'V50', 'V55', 'V57', 'V65', 'V66', 'V67', 'V68', 'V77', 'V79', 'V86', 'V88', 'V89', 'V98', 'V101', 'V102', 'V103', 'V104', 'V105', 'V106', 'V107', 'V108', 'V109', 'V110', 'V111', 'V112', 'V113', 'V114', 'V115', 'V116', 'V117', 'V118', 'V119', 'V120', 'V121', 'V122', 'V123', 'V124', 'V125', 'V129', 'V132', 'V133', 'V134', 'V135', 'V136', 'V137', 'V141', 'V142', 'V144', 'V148', 'V153', 'V155', 'V157', 'V168', 'V174', 'V179', 'V181', 'V183', 'V185', 'V186', 'V190', 'V191', 'V192', 'V193', 'V194', 'V196', 'V198', 'V199', 'V211', 'V218', 'V230', 'V232', 'V235', 'V236', 'V237', 'V240', 'V241', 'V248', 'V250', 'V252', 'V254', 'V255', 'V260', 'V269', 'V281', 'V284', 'V286', 'V290', 'V293', 'V295', 'V296', 'V297', 'V298', 'V299', 'V300', 'V301', 'V302', 'V305', 'V309', 'V311', 'V316', 'V318', 'V319', 'V320', 'V321', 'V325', 'V327', 'V328', 'V330', 'V334', 'V337', 'V339']
drop_cols2 = ['had_id','M_sum','D9','V138','D9_not_na','card1','TransactionDTday','card1_TransactionAmt_decimal_min', 'bank_type_TransactionAmt_decimal_min', 'card2_TransactionAmt_decimal_min', 'card5_TransactionAmt_decimal_min', 'card3_TransactionAmt_decimal_min']
#rfe_not1 = ['card3_fq_enc','bank_type_D1_mean','bank_type_D7_mean','bank_type_D10_mean','bank_type_D11_mean','D6_DT_W_min_max','D7_DT_W_min_max', 'D7_DT_W_std_score', 'D12_DT_W_min_max', 'D13_DT_W_min_max', 'D6_DT_M_min_max', 'D7_DT_M_std_score','D12_DT_M_min_max','D12_DT_M_std_score','D13_DT_M_min_max']
not_use = not_use + rm_cols + drop_cols2 + drop_v_vols
train_trans = train_trans.drop(not_use,axis=1)
test_trans = test_trans.drop(not_use,axis=1)
from sklearn.preprocessing import LabelEncoder
for col in train_trans.columns:
    if train_trans[col].dtype == 'object':
        print(col)  # list the columns being encoded (matches the output below)
        le = LabelEncoder()
        le.fit(list(train_trans[col].astype(str).values) + list(test_trans[col].astype(str).values))
        train_trans[col] = le.transform(list(train_trans[col].astype(str).values))
        test_trans[col] = le.transform(list(test_trans[col].astype(str).values))
P_emaildomain_bin
P_emaildomain_suffix
P_emaildomain_prefix
P_emaildomain_suffix_us
R_emaildomain_bin
R_emaildomain_suffix
R_emaildomain_prefix
R_emaildomain_suffix_us
uid6
cid_1
uid7
DeviceType
DeviceInfo
device_name
device_version
OS_id_30
version_id_30
browser_id_31
version_id_31
screen_width
screen_height
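Drop the pure-noise columns 'TransactionDT' and 'TransactionID', and also the label 'isFraud'. Split the training set with 6-fold GroupKFold, convert the data into a type that lgb can handle, set the lgb parameters, and start training; the average of the per-fold predictions is taken as the final prediction.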
#fit_lgb
X = train_trans.sort_values('TransactionDT').drop(['isFraud', 'TransactionDT','TransactionID'], axis=1)
y = train_trans.sort_values('TransactionDT')['isFraud']
X_test = test_trans.drop(['TransactionDT', 'isFraud','TransactionID'], axis=1)
print('the shape of train_df is:',X.shape)
print('the shape of test_df is:',X_test.shape)
the shape of train_df is: (590540, 580)
the shape of test_df is: (506691, 580)
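The split used below groups rows by month (DT_M), so every fold validates on one month that is held out entirely, and the test predictions of all folds are averaged. Here is a small sketch of that scheme on synthetic data; the month labels, array sizes and the constant stand-in for the model prediction are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Six toy months with ten rows each
months = np.repeat([12, 13, 14, 15, 16, 17], 10)
X = np.arange(len(months)).reshape(-1, 1)
y = np.zeros(len(months), dtype=int)

folds = GroupKFold(n_splits=6)
held_out_months = []
fold_preds = []
for fold_n, (train_idx, valid_idx) in enumerate(folds.split(X, y, groups=months)):
    # Each validation fold contains whole months only
    held_out_months.append(np.unique(months[valid_idx]).tolist())
    # Stand-in for clf.predict(X_test): a constant vector per fold
    fold_preds.append(np.full(3, fold_n / 10))

# Final prediction = average over the per-fold predictions
y_preds = np.mean(fold_preds, axis=0)
print(held_out_months)
print(y_preds)
```

With six months and six folds, each fold holds out exactly one month, so no month contributes rows to both the training and the validation side of the same fold.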
#fit_lgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold
NFOLDS = 6
folds = GroupKFold(n_splits=NFOLDS)
params = {'num_leaves': 491,
'min_child_weight': 0.03454472573214212,
'feature_fraction': 0.3797454081646243,
'bagging_fraction': 0.4181193142567742,
'min_data_in_leaf': 106,
'objective': 'binary',
'max_depth': -1,
'learning_rate': 0.006883242363721497,
"boosting_type": "gbdt",
"bagging_seed": 11,
"metric": 'auc',
"verbosity": -1,
'reg_alpha': 0.3899927210061127,
'reg_lambda': 0.6485237330340494,
'random_state': 47,
'num_threads':4,
'n_estimators':1800
}
columns = X.columns
split_groups = X['DT_M']
splits = folds.split(X, y,groups=split_groups)
y_preds = np.zeros(X_test.shape[0])
y_oof = np.zeros(X.shape[0])
score = 0
feature_importances = pd.DataFrame()
feature_importances['feature'] = columns
for fold_n, (train_index, valid_index) in enumerate(splits):
    print('Fold:',fold_n)
    X_train, X_valid = X[columns].iloc[train_index], X[columns].iloc[valid_index]
    y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]
    dtrain = lgb.Dataset(X_train, label=y_train)
    dvalid = lgb.Dataset(X_valid, label=y_valid)
    # Note: verbose_eval and early_stopping_rounds are arguments of older LightGBM releases;
    # LightGBM 4.x expects callbacks=[lgb.early_stopping(200), lgb.log_evaluation(200)] instead
    clf = lgb.train(params, dtrain, 10000, valid_sets = [dtrain, dvalid], verbose_eval=200, early_stopping_rounds=200)
    feature_importances[f'fold_{fold_n + 1}'] = clf.feature_importance()
    y_pred_valid = clf.predict(X_valid)
    y_oof[valid_index] = y_pred_valid
    print(f"Fold {fold_n + 1} | AUC: {roc_auc_score(y_valid, y_pred_valid)}")
    score += roc_auc_score(y_valid, y_pred_valid) / NFOLDS
    y_preds += clf.predict(X_test) / NFOLDS
    del X_train, X_valid, y_train, y_valid
    gc.collect()
print(f"\nMean AUC = {score}")
print(f"Out of folds AUC = {roc_auc_score(y, y_oof)}")
Fold: 0
//anaconda3/lib/python3.7/site-packages/lightgbm/engine.py:113: UserWarning: Found `n_estimators` in params. Will use it instead of argument
warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
Training until validation scores don't improve for 200 rounds.
[200] training's auc: 0.970841 valid_1's auc: 0.888492
[400] training's auc: 0.989367 valid_1's auc: 0.902119
[600] training's auc: 0.99698 valid_1's auc: 0.909347
[800] training's auc: 0.999369 valid_1's auc: 0.913024
[1000] training's auc: 0.999881 valid_1's auc: 0.915294
[1200] training's auc: 0.999982 valid_1's auc: 0.916427
[1400] training's auc: 0.999998 valid_1's auc: 0.916971
[1600] training's auc: 1 valid_1's auc: 0.917535
[1800] training's auc: 1 valid_1's auc: 0.917986
Did not meet early stopping. Best iteration is:
[1800] training's auc: 1 valid_1's auc: 0.917986
Fold 1 | AUC: 0.9179858958040653
Fold: 1
Training until validation scores don't improve for 200 rounds.
[200] training's auc: 0.96973 valid_1's auc: 0.920247
[400] training's auc: 0.988597 valid_1's auc: 0.934074
[600] training's auc: 0.996845 valid_1's auc: 0.941598
[800] training's auc: 0.999326 valid_1's auc: 0.944648
[1000] training's auc: 0.999883 valid_1's auc: 0.946392
[1200] training's auc: 0.999983 valid_1's auc: 0.947476
[1400] training's auc: 0.999998 valid_1's auc: 0.948062
[1600] training's auc: 1 valid_1's auc: 0.94845
[1800] training's auc: 1 valid_1's auc: 0.948716
Did not meet early stopping. Best iteration is:
[1798] training's auc: 1 valid_1's auc: 0.948727
Fold 2 | AUC: 0.9487270375003357
Fold: 2
//anaconda3/lib/python3.7/site-packages/lightgbm/engine.py:113: UserWarning: Found `n_estimators` in params. Will use it instead of argument
warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
Training until validation scores don't improve for 200 rounds.
[200] training's auc: 0.967541 valid_1's auc: 0.920002
[400] training's auc: 0.987839 valid_1's auc: 0.935714
[600] training's auc: 0.996483 valid_1's auc: 0.943642
[800] training's auc: 0.999224 valid_1's auc: 0.94732
[1000] training's auc: 0.999857 valid_1's auc: 0.949401
[1200] training's auc: 0.999978 valid_1's auc: 0.950679
[1400] training's auc: 0.999998 valid_1's auc: 0.951512
[1600] training's auc: 1 valid_1's auc: 0.952117
[1800] training's auc: 1 valid_1's auc: 0.952431
Did not meet early stopping. Best iteration is:
[1798] training's auc: 1 valid_1's auc: 0.95242
Fold 3 | AUC: 0.9524195779901877
Fold: 3
//anaconda3/lib/python3.7/site-packages/lightgbm/engine.py:113: UserWarning: Found `n_estimators` in params. Will use it instead of argument
warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
Training until validation scores don't improve for 200 rounds.
[200] training's auc: 0.968861 valid_1's auc: 0.917606
[400] training's auc: 0.987988 valid_1's auc: 0.931882
[600] training's auc: 0.996424 valid_1's auc: 0.93957
[800] training's auc: 0.999185 valid_1's auc: 0.94231
[1000] training's auc: 0.999843 valid_1's auc: 0.943998
[1200] training's auc: 0.999974 valid_1's auc: 0.945387
[1400] training's auc: 0.999996 valid_1's auc: 0.946162
[1600] training's auc: 1 valid_1's auc: 0.946454
[1800] training's auc: 1 valid_1's auc: 0.946702
Did not meet early stopping. Best iteration is:
[1800] training's auc: 1 valid_1's auc: 0.946702
Fold 4 | AUC: 0.9467022988334145
Fold: 4
//anaconda3/lib/python3.7/site-packages/lightgbm/engine.py:113: UserWarning: Found `n_estimators` in params. Will use it instead of argument
warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
Training until validation scores don't improve for 200 rounds.
[200] training's auc: 0.967885 valid_1's auc: 0.933123
[400] training's auc: 0.987962 valid_1's auc: 0.945288
[600] training's auc: 0.996395 valid_1's auc: 0.950617
[800] training's auc: 0.999154 valid_1's auc: 0.952375
[1000] training's auc: 0.999834 valid_1's auc: 0.953323
[1200] training's auc: 0.999972 valid_1's auc: 0.954082
[1400] training's auc: 0.999997 valid_1's auc: 0.954569
[1600] training's auc: 1 valid_1's auc: 0.954912
[1800] training's auc: 1 valid_1's auc: 0.955194
Did not meet early stopping. Best iteration is:
[1800] training's auc: 1 valid_1's auc: 0.955194
Fold 5 | AUC: 0.9551941380371436
Fold: 5
//anaconda3/lib/python3.7/site-packages/lightgbm/engine.py:113: UserWarning: Found `n_estimators` in params. Will use it instead of argument
warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
Training until validation scores don't improve for 200 rounds.
[200] training's auc: 0.96762 valid_1's auc: 0.92463
[400] training's auc: 0.987262 valid_1's auc: 0.942792
[600] training's auc: 0.996126 valid_1's auc: 0.951216
[800] training's auc: 0.999056 valid_1's auc: 0.954502
[1000] training's auc: 0.999806 valid_1's auc: 0.956104
[1200] training's auc: 0.999966 valid_1's auc: 0.957193
[1400] training's auc: 0.999995 valid_1's auc: 0.958021
[1600] training's auc: 0.999999 valid_1's auc: 0.958572
[1800] training's auc: 1 valid_1's auc: 0.958972
Did not meet early stopping. Best iteration is:
[1800] training's auc: 1 valid_1's auc: 0.958972
Fold 6 | AUC: 0.9589715765186857
Mean AUC = 0.9465119707733309
Out of folds AUC = 0.9457792719899325
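The two summary lines measure different things: Mean AUC averages the per-fold scores, while the out-of-fold AUC ranks the validation predictions of all folds together, so it is also sensitive to folds whose score scales differ. A minimal, dependency-free sketch with made-up toy numbers (the `auc` helper and the data are hypothetical; in the notebook `roc_auc_score` plays this role):

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy folds (made-up numbers): each fold orders its own rows perfectly,
# but fold B's scores sit on a higher scale than fold A's.
fold_a = ([0, 0, 1, 1], [0.10, 0.20, 0.30, 0.40])
fold_b = ([0, 0, 1, 1], [0.35, 0.50, 0.80, 0.90])

mean_auc = (auc(*fold_a) + auc(*fold_b)) / 2
oof_auc = auc(fold_a[0] + fold_b[0], fold_a[1] + fold_b[1])

print(f"Mean AUC = {mean_auc}")          # 1.0
print(f"Out of folds AUC = {oof_auc}")   # 0.8125
```

In the toy data every fold is perfect on its own, yet pooling the folds puts some of fold B's negatives above fold A's positives, which pulls the pooled score down.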
Next, write the predictions to a file:
#prediction
submission = pd.DataFrame({'TransactionID':test_trans['TransactionID'],'isFraud':y_preds})
print('submission_shape is:',submission.shape)
submission.to_csv('submission_16.3.csv',index = False)
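Uploads from within China have been getting slower. The results can also be submitted to Kaggle from a Linux shell with the Kaggle CLI, for example: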
kaggle competitions submit -c ieee-fraud-detection -f submission.csv -m "Message"
Next, a quick look at feature importance, starting with the most important features and their scores.
feature_importances['average'] = feature_importances[[f'fold_{fold_n + 1}' for fold_n in range(folds.n_splits)]].mean(axis=1)
feature_importances.to_csv('feature_importances.csv')
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
plt.figure(figsize=(16, 16))
sns.barplot(data=feature_importances.sort_values(by='average', ascending=False).head(50), x='average', y='feature');
plt.title('50 TOP feature importance over {} folds average'.format(folds.n_splits));
Then look at the least important features to decide whether to drop them or process them differently during optimization.
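A minimal sketch of how the low end could be inspected, assuming a `feature_importances` frame shaped like the one built above (a `feature` column plus one `fold_k` column per fold). The toy values and the `threshold` cutoff are made up for illustration and are not the author's actual selection rule:

```python
import pandas as pd

# Toy stand-in for the feature_importances frame (values are made up)
feature_importances = pd.DataFrame({
    'feature': ['TransactionAmt', 'card1', 'V107', 'V305', 'C1'],
    'fold_1': [900, 750, 0, 2, 400],
    'fold_2': [880, 790, 0, 0, 420],
})
fold_cols = [c for c in feature_importances.columns if c.startswith('fold_')]
feature_importances['average'] = feature_importances[fold_cols].mean(axis=1)

# Least important features first
lowest = feature_importances.sort_values(by='average').head(3)
print(lowest[['feature', 'average']])

# Hypothetical cutoff: candidates to drop or re-engineer, to be confirmed by CV
threshold = 1.0
drop_candidates = feature_importances.loc[
    feature_importances['average'] <= threshold, 'feature'].tolist()
print(drop_candidates)
```

`clf.feature_importance()` defaults to split counts, so a value of zero only means the feature was never chosen for a split. Whether removing it actually helps is worth checking against the CV score before committing.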
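With the feature engineering improvements in place, I trained LightGBM to get predictions and kept refining the feature processing by comparing the CV AUC against the LB score. The single-model LB score eventually rose to 0.9526.
Blend considerations:
(1) submission16 CV: Mean AUC = 0.9460381025052613, Out of folds AUC = 0.9449767448300119, LB: 0.9525
(2) submission16.2 CV: Mean AUC = 0.9464855222093469, Out of folds AUC = 0.9457199040811567, LB: 0.9525 --> winter seems to have improved (0.917)
(3) submission16.3 CV: Mean AUC = 0.9465119707733309, Out of folds AUC = 0.9457792719899325 (a slight increase), LB: 0.9525 --> 0.9526. Performance on the winter fold improved: Fold 1 | AUC: 0.9179858958040653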
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
sub_1 = pd.read_csv('blends/submission_16.9449.csv')
sub_2 = pd.read_csv('blends/submission_16.94571.csv')
sub_3 = pd.read_csv('blends/submission_16.94577.csv')
sub_4 = pd.read_csv('output/submission_09.9487.csv')
sub_5 = pd.read_csv('output/submission_07.9477.csv')
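# Blend by summing the five predictions (not divided by 5; the ranking, and hence AUC, is unchanged)
sub_1['isFraud'] = sub_1['isFraud'] + sub_2['isFraud'] + sub_3['isFraud'] + sub_4['isFraud'] + sub_5['isFraud']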
sub_1.to_csv('submission_blend1.csv', index=False)
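Although the blend's LB score of 0.9532 is noticeably higher than the single model's 0.9526, the single model still did better in the final result. Even though the blend combines several models, it can still overfit the LB.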
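Averaging ranks rather than raw probabilities is a common alternative to a plain sum, since it keeps any one model's score scale from dominating, and aligning on `TransactionID` guards against files whose rows are in a different order. A minimal sketch with made-up in-memory frames (the `blend` helper and the toy data are hypothetical, not the blend that was submitted):

```python
import pandas as pd

def blend(subs, use_rank=False):
    """Average 'isFraud' across submissions, aligned on TransactionID."""
    cols = []
    for sub in subs:
        s = sub.set_index('TransactionID')['isFraud']
        if use_rank:
            s = s.rank(pct=True)  # map each model's scores to (0, 1] ranks
        cols.append(s)
    mean = pd.concat(cols, axis=1).mean(axis=1)
    return mean.rename('isFraud').rename_axis('TransactionID').reset_index()

sub_a = pd.DataFrame({'TransactionID': [1, 2, 3], 'isFraud': [0.10, 0.80, 0.30]})
# Same rows in a different order: index alignment keeps the IDs matched
sub_b = pd.DataFrame({'TransactionID': [3, 1, 2], 'isFraud': [0.50, 0.20, 0.60]})

mean_blend = blend([sub_a, sub_b])
rank_blend = blend([sub_a, sub_b], use_rank=True)
print(mean_blend)
print(rank_blend)
```

Neither variant removes the risk described above: choosing blend members by their public LB score can still overfit the 20% of the test set behind it, so the CV score remains the safer guide.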