[Kaggle Competition] IEEE-CIS Fraud Detection

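Table of Contents

  • 0.Preface
  • 1.EDA
    • 1.1 Inspecting the data
    • 1.2 Handling missing values
    • 1.3 Uncovering hidden information for the model to use
  • 2.Deep Feature Engineering
  • 3.Feature selection + dimensionality reduction (experiment log)
  • 4.lightGBM+best_parameters
  • 5. Internal blend
  • 6.Final results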

0.Preface

Kaggle competition: IEEE-CIS Fraud Detection

  • Competition description:
    In this competition, you’ll benchmark machine learning models on a challenging large-scale dataset. The data comes from Vesta’s real-world e-commerce transactions and contains a wide range of features from device type to product features. You also have the opportunity to create new features to improve your results.
    In this competition you are predicting the probability that an online transaction is fraudulent, as denoted by the binary target isFraud.
    The data is broken into two files identity and transaction, which are joined by TransactionID. Not all transactions have corresponding identity information.

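  • LB: the AUC score computed on the first 20% of the test set.

  • Final Private Leaderboard score: the AUC score computed on the remaining 80% of the test set.

  • Two final submissions are allowed in this competition.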

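    After a few entry-level Kaggle competitions, I tried this binary classification competition hosted by IEEE and Vesta, building a LightGBM model in Python in a Jupyter Notebook. The key to a higher score here lies in mining the data and choosing the strategies for turning it into features, which calls for very careful EDA and FE.
    The result was a bronze medal: 373/6381 (top 6%), Private Leaderboard score 0.928512.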
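    The approach described here is meant to aid understanding of the problem and to explain the posted Python code; it is not the optimal solution. The ideas and code are for reference only; for the methods involved and their detailed steps, please follow the reference links. Variable names, comments, and experiment notes in the code are somewhat messy.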
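Both leaderboards are scored by AUC. As a rough illustration of what that number measures, here is a minimal rank-based sketch; the function name `roc_auc` and the toy labels and scores are made up for this example and are not part of the competition code.

```python
import numpy as np
import pandas as pd


def roc_auc(y_true, y_score):
    """Rank-based ROC AUC (equivalent to the Mann-Whitney U statistic)."""
    y_true = np.asarray(y_true)
    # Average ranks so that tied scores are treated symmetrically.
    ranks = pd.Series(y_score).rank(method="average").to_numpy()
    n_pos = int((y_true == 1).sum())
    n_neg = int((y_true == 0).sum())
    rank_sum_pos = ranks[y_true == 1].sum()
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy labels and predicted fraud probabilities, made up for illustration.
auc_example = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_example)
```

The value is the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one, so only the ordering of the predictions matters, not their scale.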

1.EDA

Please refer to the following Kaggle kernel:
Nanashi: Fraud complete EDA_Nanashi

1.1 Inspecting the data

Official data description and related Q&A: Data Description (Details and Discussion)

  1. First, the Transaction table:
    TransactionDT: not an actual timestamp, but a time delta in seconds from some reference point in time.
    TransactionAmt: transaction payment amount in USD; the decimal part deserves attention.
    ProductCD: product code, one of five values (W, H, C, S, R). It is not necessarily a physical product and may denote a service.
    card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.
    addr1-addr2: billing region and billing country, respectively.
    dist: distances between (not limited) billing address, mailing address, zip code, IP address, phone area, etc.
    P_emaildomain and R_emaildomain: purchaser and recipient email domain. Some transactions need no recipient, and their R_emaildomain is empty.
    C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. Plus like device, ipaddr, billingaddr, etc. Also these are for both purchaser and recipient, which doubles the number.
    D1-D15: timedelta, such as days between previous transaction, etc.
    M1-M9: match, such as names on card and address, etc. All are binary (T/F) flags except M4, which is categorical.
    Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations. Different blocks of V features have different missing ratios; their real meaning and the right way to process them remain unclear.

  2. Next, the Identity table:
    id01 to id11 are numerical features for identity, which is collected by Vesta and security partners such as device rating, ip_domain rating, proxy rating, etc. Also it recorded behavioral fingerprint like account login times/failed to login times, how long an account stayed on the page, etc. All of these are not able to elaborate due to security partner T&C.
    DeviceType, DeviceInfo, and id12 - id38 are categorical features.

Many EDA kernels reveal characteristics of the data, especially how its distribution changes over time and how the training and test distributions differ.

1.2 Handling missing values

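  1. Missing ratios: for the proportion of missing values, also see the EDA kernels.
  2. Use feature dependencies to find related features and fill missing values from them
    See Gunes Evitan: IEEE-CIS Fraud Detection Dependency Check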
def check_dependency(independent_var, dependent_var):
    
    independent_uniques = []
    temp_df = pd.concat([train_df[[independent_var, dependent_var]], test_df[[independent_var, dependent_var]]])
    
    for value in temp_df[independent_var].unique():
        independent_uniques.append(temp_df[temp_df[independent_var] == value][dependent_var].value_counts().shape[0])

    values = pd.Series(data=independent_uniques, index=temp_df[independent_var].unique())
    
    N = len(values)
    N_dependent = len(values[values == 1])
    N_notdependent = len(values[values > 1])
    N_null = len(values[values == 0])
        
    print(f'In {independent_var}, there are {N} unique values')
    print(f'{N_dependent}/{N} have one unique {dependent_var} value')
    print(f'{N_notdependent}/{N} have more than one unique {dependent_var} values')
    print(f'{N_null}/{N} have only missing {dependent_var} values\n')

For example:

check_dependency('R_emaildomain', 'C5')
print(train_df['C5'].isnull().sum()/train_df.shape[0])
print(test_df['C5'].isnull().sum()/test_df.shape[0])
print(test_df[~test_df['R_emaildomain'].isnull()]['C5'].value_counts())
In R_emaildomain, there are 61 unique values
60/61 have one unique C5 value
0/61 have more than one unique C5 values
1/61 have only missing C5 values
0.0
5.920768278891869e-06
0.0    135867
Name: C5, dtype: int64

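As shown, R_emaildomain and C5 are strongly related. C5 has a few missing values in the test set, in rows where R_emaildomain is present; since every observed C5 value in such rows is 0, filling the missing C5 values with 0 is reasonable.
Following this idea, I found several groups of strongly related features and filled the corresponding missing values in the test set: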

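#1.1 find dependency and fillna
#'dist1', 'C3': only test has missing C3, and only where dist1 is present; whenever dist1 is present, C3 is otherwise always 0
test_df['C3'] = test_df['C3'].fillna(0)
#'R_emaildomain', 'C5': only test has missing C5, almost all of it where R_emaildomain is present; only 3 rows have C5 missing with R_emaildomain missing
test_df['C5'] = test_df['C5'].fillna(0)
#'id_30','C7': only test has missing C7, and only where id_30 is present; just 3 such rows have missing C7, all others are 0 (Device)
test_df['C7'] = test_df['C7'].fillna(0)
#'id_31','C9': only test has missing C9, and only where id_31 is present; just 3 such rows have missing C9, all others are 0 (Browser)
test_df['C9'] = test_df['C9'].fillna(0)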
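The dependency check can also be expressed with a single groupby. The following is only a sketch on made-up toy data; `dependency_summary` is a hypothetical helper, not the function from the referenced kernel, and unlike the loop-based version it skips rows whose key column is NaN (groupby drops them by default).

```python
import numpy as np
import pandas as pd


def dependency_summary(df, independent_var, dependent_var):
    """Count, per value of independent_var, how many distinct
    dependent_var values occur (NaN is not counted as a value)."""
    n_unique = df.groupby(independent_var)[dependent_var].nunique()
    return {
        "total": len(n_unique),
        "one_value": int((n_unique == 1).sum()),
        "many_values": int((n_unique > 1).sum()),
        "only_missing": int((n_unique == 0).sum()),
    }


# Toy data, made up for illustration.
toy = pd.DataFrame({
    "R_emaildomain": ["a.com", "a.com", "b.com", "b.com", "c.com", None],
    "C5": [0.0, 0.0, 0.0, np.nan, np.nan, 1.0],
})
summary = dependency_summary(toy, "R_emaildomain", "C5")
print(summary)

# Fill missing C5 with 0 where the key column is present,
# mirroring the fillna(0) calls used in the article.
mask = toy["R_emaildomain"].notna() & toy["C5"].isna()
toy.loc[mask, "C5"] = 0.0
print(toy)
```

If `many_values` is 0 and the single observed value is the same for every key, a constant fill is a defensible choice; otherwise a per-key fill would be needed.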
  3. Use the values of the other card features associated with each card1 to fill the missing values of card2 to card6
#1. More interaction between card features + fill nans
i_cols = ['TransactionID','card1','card2','card3','card4','card5','card6']

full_df = pd.concat([train_df[i_cols], test_df[i_cols]])

## I've used frequency encoding before so we have ints here
## we will drop very rare cards
full_df['card6'] = np.where(full_df['card6']==30, np.nan, full_df['card6'])
full_df['card6'] = np.where(full_df['card6']==16, np.nan, full_df['card6'])

i_cols = ['card2','card3','card4','card5','card6']

## We will find best match for nan values and fill with it; filling card2 to card6 this way helped a lot
for col in i_cols:
    temp_df = full_df.groupby(['card1',col])[col].agg(['count']).reset_index()
    temp_df = temp_df.sort_values(by=['card1','count'], ascending=False).reset_index(drop=True)
    del temp_df['count']
    temp_df = temp_df.drop_duplicates(keep='first').reset_index(drop=True)
    temp_df.index = temp_df['card1'].values
    temp_df = temp_df[col].to_dict()
    full_df[col] = np.where(full_df[col].isna(), full_df['card1'].map(temp_df), full_df[col])
    
    
i_cols = ['card1','card2','card3','card4','card5','card6']
for col in i_cols:
    train_df[col] = full_df[full_df['TransactionID'].isin(train_df['TransactionID'])][col].values
    test_df[col] = full_df[full_df['TransactionID'].isin(test_df['TransactionID'])][col].values
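To make the fill rule concrete, here is a small sketch of the same idea on toy data: each missing card2 is replaced by the most frequent card2 seen for the same card1. The values are made up, and the column `card2_filled` is introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "card1": [1001, 1001, 1001, 1001, 2002, 2002, 3003],
    "card2": [111.0, 111.0, 222.0, np.nan, 333.0, np.nan, np.nan],
})

# Count each (card1, card2) pair and put the most frequent pair first per card1.
counts = (
    toy.dropna(subset=["card2"])
       .groupby(["card1", "card2"])
       .size()
       .reset_index(name="count")
       .sort_values(["card1", "count"], ascending=[True, False])
)
best_match = counts.drop_duplicates("card1").set_index("card1")["card2"]

# Fill only the missing entries; a card1 with no observed card2 stays NaN.
toy["card2_filled"] = toy["card2"].fillna(toy["card1"].map(best_match))
print(toy)
```

Observed values are never overwritten, so the fill can only add information where card2 was missing.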

1.3 Uncovering hidden information for the model to use

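To protect user information, the organizers heavily processed the features and concealed their real meaning. Working out what each feature means and what information it carries requires careful observation and analysis of the data, which in turn guides the choice of a sensible processing method.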

  1. Dates
    Kevin: TransactionDT startdate
    With 2017-11-30 as the start date, Black Friday and Cyber Monday line up better, so that date is used here; adding the TransactionDT time delta to it gives the calendar date of each transaction.
  2. D features
    Akasyanama: EDA what’s behind D features?
    A Humphrey: Understanding the D features (updated)
    tuttifrutti: Creating features from D columns (guessing userID)
    A few whose meaning is fairly clear:
    D1: timedelta (days, rounded down) since first transaction for one card.
    D2: this appears to be the same as D1, except D1 = 0 values have been replaced by NaN.
    D3: timedelta since the previous transaction for one card. As with D1 and D2, this feature appears to count different cards separately.
    D4: timedelta since first transaction for all cards on the account. Using the example of a husband and wife each using their own card on a joint credit card account, this feature would not distinguish between which card was used.
    D5: timedelta since the previous transaction for all cards on the account.
    D6 and D7: some combination or variant of D4 and D5; dropping either one lowers the AUC.
    D8: timedelta (float) since some event.
    D9: the fractional part of D8, i.e. the hour of day. Because the fraud rate (the mean of isFraud) varies very little from hour to hour, this feature adds little to the model's predictions, so the plan is to drop it.
    D10: some kind of timedelta for domestic transactions.
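The two ideas above (a fixed start date for TransactionDT, and D1 counting days since a card's first transaction) can be illustrated with a short sketch. The three transactions are made up; the start date is the one assumed in this article.

```python
import pandas as pd

START_DATE = pd.Timestamp("2017-11-30")  # start date assumed in this article

# Three made-up transactions of one card, roughly one day apart.
toy = pd.DataFrame({
    "TransactionDT": [86400 * 10 + 3600, 86400 * 11 + 7200, 86400 * 12],
    "D1": [40, 41, 42],  # days since the card's first transaction
})

# Calendar datetime of each transaction.
toy["DT"] = START_DATE + pd.to_timedelta(toy["TransactionDT"], unit="s")

# Day index of the transaction, and D1 expressed relative to it.
toy["TransactionDTday"] = toy["TransactionDT"] // 86400
toy["D1minusday"] = toy["D1"] - toy["TransactionDTday"]
print(toy)
```

D1 grows by one per day while the day index does the same, so their difference stays constant for a given card. That constant is what makes it usable as part of a card or user identifier.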

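Chosen processing strategy:

  • The D features are time-dependent and drift with TransactionDT. One option is to take some of them (such as D1 and D4) together with TransactionDT and subtract one from the other, which exposes factors such as the date the card was opened or the actual time of the previous transaction. Using the raw D features alone only reflects the accumulated time since some event and introduces a time trend. The resulting DminusDT features can then be used to synthesize the user uid and the card cardid, identifying users more precisely. These DminusDT features may carry a risk of overfitting, but this model keeps them anyway.
  • The D features can also be min-max scaled and standard-scored within different time periods, implemented with a custom value_normalization function.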
dt_df[new_col+'_min_max'] = (dt_df[col]-dt_df['temp_min'])/(dt_df['temp_max']-dt_df['temp_min'])
dt_df[new_col+'_std_score'] = (dt_df[col]-dt_df['temp_mean'])/(dt_df['temp_std'])
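The full helper is not shown in this article, so the following is only a sketch of how such a per-period normalization might look. The function name `value_normalization_sketch`, its signature, and the toy data are assumptions; the two formulas are the ones shown above.

```python
import numpy as np
import pandas as pd


def value_normalization_sketch(dt_df, periods, columns):
    """Add per-period min-max and z-score versions of each column."""
    for period in periods:
        for col in columns:
            new_col = col + "_" + period
            grouped = dt_df.groupby(period)[col]
            # Per-period statistics, broadcast back to every row.
            temp_min = grouped.transform("min")
            temp_max = grouped.transform("max")
            temp_mean = grouped.transform("mean")
            temp_std = grouped.transform("std")
            dt_df[new_col + "_min_max"] = (dt_df[col] - temp_min) / (temp_max - temp_min)
            dt_df[new_col + "_std_score"] = (dt_df[col] - temp_mean) / temp_std
    return dt_df


# Toy data: one D column observed in two months with different scales.
toy = pd.DataFrame({
    "DT_M": [12, 12, 12, 13, 13, 13],
    "D15": [0.0, 5.0, 10.0, 100.0, 150.0, 200.0],
})
toy = value_normalization_sketch(toy, ["DT_M"], ["D15"])
print(toy)
```

After normalization both months map to the same range, which removes the drift over time. Periods in which a column is constant produce NaN or infinite values and would need separate handling.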
  3. C features
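    For the distributions, see the EDA kernels.
    As mentioned above, the C features count items of purchaser and recipient information (such as billing addresses and email addresses), and some of them are strongly related to other features; this can be used to fill their missing values in the test set.
    The training and test distributions differ considerably; removing outliers is one way to bring them closer.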
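  4. V features
    See Laevatein: Interesting finding about the V columns
    The V features can be split into blocks by missing rate; the features within a block were presumably generated from the same data.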
    V1 ~ V11
    V12 ~ V34
    V35 ~ V52
    V53 ~ V74
    V75 ~ V94
    V95 ~ V137 (highly correlated)
    V126-V138
    V138 ~ V166 (high null ratio)
    V167 ~ V216 (high null ratio)
    V217 ~ V278 (high null ratio, 2 different null ratios)
    V279 ~ V321 (2 different null ratios)
    V289-V318
    V319-V321 (highly correlated)
    V322 ~ V339 (high null ratio)
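Blocks like the ones listed above can be found by grouping columns that share the same number of missing values. This is a minimal sketch on a made-up frame standing in for a few V columns; it is not the code from the referenced discussion.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a few V columns.
toy = pd.DataFrame({
    "V1": [1.0, np.nan, 0.0, 1.0],
    "V2": [0.0, np.nan, 1.0, 1.0],
    "V3": [np.nan, np.nan, 1.0, np.nan],
    "V4": [np.nan, np.nan, 0.0, np.nan],
    "V5": [1.0, 2.0, 3.0, 4.0],
})

nan_counts = toy.isna().sum()

# Map each distinct NaN count to the columns that share it.
nan_groups = {}
for col, n_missing in nan_counts.items():
    nan_groups.setdefault(int(n_missing), []).append(col)

for n_missing, cols in sorted(nan_groups.items()):
    print(f"{n_missing} missing: {cols}")
```

Columns with an identical NaN count are usually missing on the same rows, which is what suggests a shared origin; checking that the NaN masks actually coincide would make the grouping stricter.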

The numerical V features are:

'V126' 'V127' 'V128' 'V130' 'V131' 'V132' 'V133' 'V134' 'V136' 'V137'
 'V143' 'V144' 'V145' 'V150' 'V159' 'V160' 'V164' 'V165' 'V166' 'V202'
 'V203' 'V204' 'V205' 'V206' 'V207' 'V208' 'V209' 'V210' 'V211' 'V212'
 'V213' 'V214' 'V215' 'V216' 'V263' 'V264' 'V265' 'V266' 'V267' 'V268'
 'V270' 'V271' 'V272' 'V273' 'V274' 'V275' 'V276' 'V277' 'V278' 'V306'
 'V307' 'V308' 'V309' 'V310' 'V312' 'V313' 'V314' 'V315' 'V316' 'V317'
 'V318' 'V320' 'V321' 'V331' 'V332' 'V333' 'V335'

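Chosen processing:

  • Apply scaling and PCA to the numerical V features
  • Group PCA and some other processing of the V features were also tried, but were dropped because LB did not improve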
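The scaling-plus-PCA step is not shown in this article. The following is a rough sketch of one way to do it with plain NumPy; the helper `pca_sketch`, the mean imputation, the component count, and the random toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def pca_sketch(df, cols, n_components, prefix):
    """Standardize cols, then project them onto the top principal components."""
    values = df[cols].astype(float)
    values = values.fillna(values.mean())        # simple mean imputation (assumption)
    std = values.std(ddof=0).replace(0, 1.0)     # avoid dividing by zero
    scaled = ((values - values.mean()) / std).to_numpy()
    # Rows of vt are the principal axes, ordered by explained variance.
    _, singular_values, vt = np.linalg.svd(scaled, full_matrices=False)
    components = scaled @ vt[:n_components].T
    explained = singular_values**2 / np.sum(singular_values**2)
    out = pd.DataFrame(
        components,
        index=df.index,
        columns=[f"{prefix}_pca_{i}" for i in range(n_components)],
    )
    return out, explained[:n_components]


# Random toy columns: two strongly correlated, one independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "V126": base * 100 + rng.normal(size=200),
    "V127": base * 50 + rng.normal(size=200),
    "V128": rng.normal(size=200),
})
pca_features, explained_ratio = pca_sketch(toy, ["V126", "V127", "V128"], 2, "V_num")
print(pca_features.shape, explained_ratio)
```

Running this per block of V features (instead of over all of them at once) would correspond to the group PCA variant mentioned above.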

2.Deep Feature Engineering

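For the initial feature-processing ideas (LB up to 0.9487), see:
Konstantin Yakovlev: IEEE - Internal Blend
David Cairuz: Feature Engineering & LightGBM
For the later feature-processing ideas (LB: 0.9487 -> 0.9526), see the other experiment notes. The feature engineering code finally adopted is given below: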

import numpy as np
import pandas as pd
import gc
import os, sys, random, datetime

Shrink the dataset so that it uses less memory and can be processed more efficiently; see Konstantin Yakovlev: IEEE Data minification

def seed_everything(seed=0):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)

## Memory Reducer
# :df pandas dataframe to reduce size             # type: pd.DataFrame()
# :verbose                                        # type: bool
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

Load the training and test sets and reduce their memory footprint.

print('Load Data')
train_df = pd.read_csv('../input/train_transaction.csv')
test_df = pd.read_csv('../input/test_transaction.csv')
test_df['isFraud'] = 0
train_identity = pd.read_csv('../input/train_identity.csv')
test_identity = pd.read_csv('../input/test_identity.csv')
print('Reduce Memory')
train_df = reduce_mem_usage(train_df)
test_df  = reduce_mem_usage(test_df)
train_identity = reduce_mem_usage(train_identity)
test_identity  = reduce_mem_usage(test_identity)
Load Data
Reduce Memory
Mem. usage decreased to 542.35 Mb (69.4% reduction)
Mem. usage decreased to 473.07 Mb (68.9% reduction)
Mem. usage decreased to 25.86 Mb (42.7% reduction)
Mem. usage decreased to 25.44 Mb (42.7% reduction)

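Do some initial processing of the identity data: mainly split string features such as DeviceInfo, id_30 (operating system) and id_31 (browser) into new features, and derive device features from id_33 (screen resolution); convert the remaining categorical features from strings to numeric values, and bin some of the information: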

def id_split(dataframe):
    
    dataframe['device_name'] = dataframe['DeviceInfo'].str.split('/', expand=True)[0]
    dataframe['device_version'] = dataframe['DeviceInfo'].str.split('/', expand=True)[1]

    dataframe['OS_id_30'] = dataframe['id_30'].str.split(' ', expand=True)[0]
    dataframe['version_id_30'] = dataframe['id_30'].str.split(' ', expand=True)[1]
 
    dataframe['browser_id_31'] = dataframe['id_31'].str.split(' ', expand=True)[0]
    dataframe['version_id_31'] = dataframe['id_31'].str.split(' ', expand=True)[1]

    dataframe['screen_width'] = dataframe['id_33'].str.split('x', expand=True)[0]
    dataframe['screen_height'] = dataframe['id_33'].str.split('x', expand=True)[1]
    dataframe['id_12'] = dataframe['id_12'].map({'Found':1, 'NotFound':0})
    dataframe['id_15'] = dataframe['id_15'].map({'New':2, 'Found':1, 'Unknown':0})
    dataframe['id_16'] = dataframe['id_16'].map({'Found':1, 'NotFound':0})

    dataframe['id_23'] = dataframe['id_23'].map({'TRANSPARENT':4, 'IP_PROXY':3, 'IP_PROXY:ANONYMOUS':2, 'IP_PROXY:HIDDEN':1})

    dataframe['id_27'] = dataframe['id_27'].map({'Found':1, 'NotFound':0})
    dataframe['id_28'] = dataframe['id_28'].map({'New':2, 'Found':1})

    dataframe['id_29'] = dataframe['id_29'].map({'Found':1, 'NotFound':0})

    dataframe['id_35'] = dataframe['id_35'].map({'T':1, 'F':0})
    dataframe['id_36'] = dataframe['id_36'].map({'T':1, 'F':0})
    dataframe['id_37'] = dataframe['id_37'].map({'T':1, 'F':0})
    dataframe['id_38'] = dataframe['id_38'].map({'T':1, 'F':0})

    dataframe['id_34'] = dataframe['id_34'].fillna(':0')
    dataframe['id_34'] = dataframe['id_34'].apply(lambda x: x.split(':')[1]).astype(np.int8)
    dataframe['id_34'] = np.where(dataframe['id_34']==0, np.nan, dataframe['id_34'])
    
    dataframe['id_33'] = dataframe['id_33'].fillna('0x0')
    dataframe['id_33_0'] = dataframe['id_33'].apply(lambda x: x.split('x')[0]).astype(int)
    dataframe['id_33_1'] = dataframe['id_33'].apply(lambda x: x.split('x')[1]).astype(int)
    dataframe['id_33'] = np.where(dataframe['id_33']=='0x0', np.nan, dataframe['id_33'])
    
    for feature in ['id_01', 'id_31', 'id_33', 'id_36']:
        dataframe[feature + '_count_dist'] = dataframe[feature].map(dataframe[feature].value_counts(dropna=False))
    
    # Assign the result back; the bare .map() call had no effect.
    dataframe['DeviceType'] = dataframe['DeviceType'].map({'desktop':1, 'mobile':0})
    
    dataframe.loc[dataframe['device_name'].str.contains('SM', na=False), 'device_name'] = 'Samsung'
    dataframe.loc[dataframe['device_name'].str.contains('SAMSUNG', na=False), 'device_name'] = 'Samsung'
    dataframe.loc[dataframe['device_name'].str.contains('GT-', na=False), 'device_name'] = 'Samsung'
    dataframe.loc[dataframe['device_name'].str.contains('Moto G', na=False), 'device_name'] = 'Motorola'
    dataframe.loc[dataframe['device_name'].str.contains('Moto', na=False), 'device_name'] = 'Motorola'
    dataframe.loc[dataframe['device_name'].str.contains('moto', na=False), 'device_name'] = 'Motorola'
    dataframe.loc[dataframe['device_name'].str.contains('LG-', na=False), 'device_name'] = 'LG'
    dataframe.loc[dataframe['device_name'].str.contains('rv:', na=False), 'device_name'] = 'RV'
    dataframe.loc[dataframe['device_name'].str.contains('HUAWEI', na=False), 'device_name'] = 'Huawei'
    dataframe.loc[dataframe['device_name'].str.contains('ALE-', na=False), 'device_name'] = 'Huawei'
    dataframe.loc[dataframe['device_name'].str.contains('-L', na=False), 'device_name'] = 'Huawei'
    dataframe.loc[dataframe['device_name'].str.contains('Blade', na=False), 'device_name'] = 'ZTE'
    dataframe.loc[dataframe['device_name'].str.contains('BLADE', na=False), 'device_name'] = 'ZTE'
    dataframe.loc[dataframe['device_name'].str.contains('Linux', na=False), 'device_name'] = 'Linux'
    dataframe.loc[dataframe['device_name'].str.contains('XT', na=False), 'device_name'] = 'Sony'
    dataframe.loc[dataframe['device_name'].str.contains('HTC', na=False), 'device_name'] = 'HTC'
    dataframe.loc[dataframe['device_name'].str.contains('ASUS', na=False), 'device_name'] = 'Asus'

    dataframe.loc[dataframe.device_name.isin(dataframe.device_name.value_counts()[dataframe.device_name.value_counts() < 200].index), 'device_name'] = "Others"
    dataframe['had_id'] = 1
    gc.collect()
    return dataframe
train_identity = id_split(train_identity)
test_identity = id_split(test_identity)

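Initial processing of the Transaction data:

  • Transform TransactionAmt (log, decimal part)
  • Process the emaildomain information: bins, prefix/suffix, US flag, missing-value marker
  • Process TransactionDT (convert it to an explicit datetime; the search for the timedelta start date is described in 1.3). The DT features are kept only for aggregating other features; on their own they carry no value and act as noise for the model
  • Summarize the binary M features per row (sum of flags and number of missing values)
  • Target-mean encode ProductCD and M4
  • Combine the card features with other features (addr, email, Ds) into simulated uids, kept for aggregating other features; card1 and most uid features are also noise for the model and may cause overfitting
  • Clip TransactionAmt to remove outliers, and check which TransactionAmt values occur in both the training and test sets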
#new features trans
def gen_new(train_trans,test_trans):

    # New feature - log of transaction amount.
    train_trans['TransactionAmt_Log'] = np.log1p(train_trans['TransactionAmt'])
    test_trans['TransactionAmt_Log'] = np.log1p(test_trans['TransactionAmt'])

    # New feature - decimal part of the transaction amount.
    train_trans['TransactionAmt_decimal'] = ((train_trans['TransactionAmt'] - train_trans['TransactionAmt'].astype(int)) * 1000).astype(int)
    test_trans['TransactionAmt_decimal'] = ((test_trans['TransactionAmt'] - test_trans['TransactionAmt'].astype(int)) * 1000).astype(int)

    
    # New feature - day of week in which a transaction happened.
    train_trans['Transaction_day_of_week'] = np.floor((train_trans['TransactionDT'] / (3600 * 24) - 1) % 7)
    test_trans['Transaction_day_of_week'] = np.floor((test_trans['TransactionDT'] / (3600 * 24) - 1) % 7)

    # New feature - hour of the day in which a transaction happened.
    train_trans['Transaction_hour'] = np.floor(train_trans['TransactionDT'] / 3600) % 24
    test_trans['Transaction_hour'] = np.floor(test_trans['TransactionDT'] / 3600) % 24
    
    #New feature - emaildomain with suffix
    emails = {'gmail': 'google', 'att.net': 'att', 'twc.com': 'spectrum', 'scranton.edu': 'other', 'optonline.net': 'other', 'hotmail.co.uk': 'microsoft', 'comcast.net': 'other', 'yahoo.com.mx': 'yahoo', 'yahoo.fr': 'yahoo', 'yahoo.es': 'yahoo', 'charter.net': 'spectrum', 'live.com': 'microsoft', 'aim.com': 'aol', 'hotmail.de': 'microsoft', 'centurylink.net': 'centurylink', 'gmail.com': 'google', 'me.com': 'apple', 'earthlink.net': 'other', 'gmx.de': 'other', 'web.de': 'other', 'cfl.rr.com': 'other', 'hotmail.com': 'microsoft', 'protonmail.com': 'other', 'hotmail.fr': 'microsoft', 'windstream.net': 'other', 'outlook.es': 'microsoft', 'yahoo.co.jp': 'yahoo', 'yahoo.de': 'yahoo', 'servicios-ta.com': 'other', 'netzero.net': 'other', 'suddenlink.net': 'other', 'roadrunner.com': 'other', 'sc.rr.com': 'other', 'live.fr': 'microsoft', 'verizon.net': 'yahoo', 'msn.com': 'microsoft', 'q.com': 'centurylink', 'prodigy.net.mx': 'att', 'frontier.com': 'yahoo', 'anonymous.com': 'other', 'rocketmail.com': 'yahoo', 'sbcglobal.net': 'att', 'frontiernet.net': 'yahoo', 'ymail.com': 'yahoo', 'outlook.com': 'microsoft', 'mail.com': 'other', 'bellsouth.net': 'other', 'embarqmail.com': 'centurylink', 'cableone.net': 'other', 'hotmail.es': 'microsoft', 'mac.com': 'apple', 'yahoo.co.uk': 'yahoo', 'netzero.com': 'other', 'yahoo.com': 'yahoo', 'live.com.mx': 'microsoft', 'ptd.net': 'other', 'cox.net': 'other', 'aol.com': 'aol', 'juno.com': 'other', 'icloud.com': 'apple','uknown':'uknown'}
    us_emails = ['gmail', 'net', 'edu']

    for c in ['P_emaildomain', 'R_emaildomain']:
        train_trans[c] = train_trans[c].fillna('uknown')
        test_trans[c] = test_trans[c].fillna('uknown')
        
        train_trans[c + '_bin'] = train_trans[c].map(emails)
        test_trans[c + '_bin'] = test_trans[c].map(emails)
    
        train_trans[c + '_suffix'] = train_trans[c].apply(lambda x: str(x).split('.')[-1])
        test_trans[c + '_suffix'] = test_trans[c].apply(lambda x: str(x).split('.')[-1])
        
        train_trans[c + '_prefix'] = train_trans[c].apply(lambda x: str(x).split('.')[0])
        test_trans[c + '_prefix'] = test_trans[c].apply(lambda x: str(x).split('.')[0])

        train_trans[c + '_suffix_us'] = train_trans[c + '_suffix'].map(lambda x: x if str(x) not in us_emails else 'us')
        test_trans[c + '_suffix_us'] = test_trans[c + '_suffix'].map(lambda x: x if str(x) not in us_emails else 'us')
    train_trans['email_check'] = np.where((train_trans['P_emaildomain']==train_trans['R_emaildomain'])&(train_trans['P_emaildomain']!='uknown'),1,0)
    test_trans['email_check'] = np.where((test_trans['P_emaildomain']==test_trans['R_emaildomain'])&(test_trans['P_emaildomain']!='uknown'),1,0)
    
    #New feature - dates
    START_DATE = datetime.datetime.strptime('2017-11-30', '%Y-%m-%d')
    from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
    dates_range = pd.date_range(start='2017-10-01', end='2019-01-01')
    us_holidays = calendar().holidays(start=dates_range.min(), end=dates_range.max())

    for df in [train_trans, test_trans]:
        # Temporary
        df['DT'] = df['TransactionDT'].apply(lambda x: (START_DATE + datetime.timedelta(seconds = x)))
        df['DT_M'] = (df['DT'].dt.year-2017)*12 + df['DT'].dt.month
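        # dt.weekofyear was removed in pandas 2.0; isocalendar().week is its replacement
        df['DT_W'] = (df['DT'].dt.year-2017)*52 + df['DT'].dt.isocalendar().week.astype(int)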
        df['DT_D'] = (df['DT'].dt.year-2017)*365 + df['DT'].dt.dayofyear

        df['DT_hour'] = df['DT'].dt.hour
        df['DT_day_week'] = df['DT'].dt.dayofweek
        df['DT_day'] = df['DT'].dt.day
        df['DT_day_month'] = (df['DT'].dt.day).astype(np.int8)
        # Possible solo feature
        df['is_december'] = df['DT'].dt.month
        df['is_december'] = (df['is_december']==12).astype(np.int8)

        # Holidays
        # Compare at day resolution; normalize() sets the time part to midnight
        df['is_holiday'] = df['DT'].dt.normalize().isin(us_holidays).astype(np.int8)
   
    #New feature - binary encoded 1/0 gen new
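    i_cols = ['M1','M2','M3','M5','M6','M7','M8','M9']
    for df in [train_trans, test_trans]:
        # The raw CSV stores these flags as 'T'/'F' strings; map them to 1/0 first
        # (a no-op if they are already numeric) so that the row-wise sum is numeric.
        m_flags = df[i_cols].replace({'T': 1, 'F': 0}).astype(float)
        df['M_sum'] = m_flags.sum(axis=1).astype(np.int8)
        df['M_na'] = df[i_cols].isna().sum(axis=1).astype(np.int8)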

    #New feature - ProductCD and M4 Target mean
    for col in ['ProductCD','M4']:
        temp_dict = train_trans.groupby([col])['isFraud'].agg(['mean']).reset_index().rename(columns={'mean': col+'_target_mean'})
        temp_dict.index = temp_dict[col].values
        temp_dict = temp_dict[col+'_target_mean'].to_dict()

        train_trans[col+'_target_mean'] = train_trans[col].map(temp_dict)
        test_trans[col+'_target_mean']  = test_trans[col].map(temp_dict)
    
    #New feature - use it for aggregations
    train_trans['uid1'] = train_trans['card1'].astype(str)+'_'+train_trans['card2'].astype(str) 
    test_trans['uid1'] = test_trans['card1'].astype(str)+'_'+test_trans['card2'].astype(str)

    train_trans['uid2'] = train_trans['uid1'].astype(str)+'_'+train_trans['card3'].astype(str)+'_'+train_trans['card5'].astype(str)
    test_trans['uid2'] = test_trans['uid1'].astype(str)+'_'+test_trans['card3'].astype(str)+'_'+test_trans['card5'].astype(str)

    train_trans['uid3'] = train_trans['uid2'].astype(str)+'_'+train_trans['addr1'].astype(str)+'_'+train_trans['addr2'].astype(str)
    test_trans['uid3'] = test_trans['uid2'].astype(str)+'_'+test_trans['addr1'].astype(str)+'_'+test_trans['addr2'].astype(str)

    # Check if the Transaction Amount is common or not (we can use freq encoding here)
    # In our dialog with a model we are telling to trust or not to these values   
    # Clip Values
    train_trans['TransactionAmt'] = train_trans['TransactionAmt'].clip(0,5000)
    test_trans['TransactionAmt']  = test_trans['TransactionAmt'].clip(0,5000)

    train_trans['TransactionAmt_check'] = np.where(train_trans['TransactionAmt'].isin(test_trans['TransactionAmt']), 1, 0)
    test_trans['TransactionAmt_check']  = np.where(test_trans['TransactionAmt'].isin(train_trans['TransactionAmt']), 1, 0)

    return train_trans,test_trans
train_df,test_df = gen_new(train_df,test_df)
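One detail of the decimal-part feature is worth a note: the expression in gen_new truncates after multiplying by 1000, so binary floating-point error may leave a value one below the intended integer. The sketch below compares truncation with a rounding variant on made-up amounts; the rounding variant is an alternative for illustration, not what the original code does.

```python
import numpy as np
import pandas as pd

# Made-up amounts for illustration.
amounts = pd.Series([4.35, 59.95, 100.0, 25.124])

# Truncating variant, same arithmetic as in gen_new.
truncated = ((amounts - amounts.astype(int)) * 1000).astype(int)

# Rounding variant: round before casting so that floating-point noise
# cannot push a value just below the intended integer.
rounded = ((amounts - np.floor(amounts)) * 1000).round().astype(int)

print(pd.DataFrame({"amount": amounts, "truncated": truncated, "rounded": rounded}))
```

For a tree model the difference is small, but the rounded version keeps amounts with the same printed cents in the same bucket.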

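Define the helper functions used for aggregation: timeblock_frequency_encoding, which computes occurrence frequencies within a given time period; uid_aggregation and uid_aggregation_and_normalization, which aggregate by uid; and frequency_encoding, which encodes values by their frequency: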

def timeblock_frequency_encoding(train_df, test_df, periods, columns, 
                                 with_proportions=True, only_proportions=False):
    for period in periods:
        for col in columns:
            new_col = col +'_'+ period
            train_df[new_col] = train_df[col].astype(str)+'_'+train_df[period].astype(str)
            test_df[new_col]  = test_df[col].astype(str)+'_'+test_df[period].astype(str)

            temp_df = pd.concat([train_df[[new_col]], test_df[[new_col]]])
            fq_encode = temp_df[new_col].value_counts().to_dict()

            train_df[new_col] = train_df[new_col].map(fq_encode)
            test_df[new_col]  = test_df[new_col].map(fq_encode)
            
            if only_proportions:
                train_df[new_col] = train_df[new_col]/train_df[period+'_total']
                test_df[new_col]  = test_df[new_col]/test_df[period+'_total']

            if with_proportions:
                train_df[new_col+'_proportions'] = train_df[new_col]/train_df[period+'_total']
                test_df[new_col+'_proportions']  = test_df[new_col]/test_df[period+'_total']

    return train_df, test_df
def uid_aggregation(train_df, test_df, main_columns, uids, aggregations):
    for main_column in main_columns:  
        for col in uids:
            for agg_type in aggregations:
                new_col_name = col+'_'+main_column+'_'+agg_type
                temp_df = pd.concat([train_df[[col, main_column]], test_df[[col,main_column]]])
                temp_df = temp_df.groupby([col])[main_column].agg([agg_type]).reset_index().rename(
                                                        columns={agg_type: new_col_name})

                temp_df.index = list(temp_df[col])
                temp_df = temp_df[new_col_name].to_dict()   

                train_df[new_col_name] = train_df[col].map(temp_df)
                test_df[new_col_name]  = test_df[col].map(temp_df)
    return train_df, test_df

def uid_aggregation_and_normalization(train_df, test_df, main_columns, uids, aggregations):
    for main_column in main_columns:  
        for col in uids:
            
            new_norm_col_name = col+'_'+main_column+'_std_norm'
            norm_cols = []
            
            for agg_type in aggregations:
                new_col_name = col+'_'+main_column+'_'+agg_type
                temp_df = pd.concat([train_df[[col, main_column]], test_df[[col,main_column]]])
                temp_df = temp_df.groupby([col])[main_column].agg([agg_type]).reset_index().rename(
                                                        columns={agg_type: new_col_name})

                temp_df.index = list(temp_df[col])
                temp_df = temp_df[new_col_name].to_dict()   

                train_df[new_col_name] = train_df[col].map(temp_df)
                test_df[new_col_name]  = test_df[col].map(temp_df)
                norm_cols.append(new_col_name)
            
            train_df[new_norm_col_name] = (train_df[main_column]-train_df[norm_cols[0]])/train_df[norm_cols[1]]
            test_df[new_norm_col_name]  = (test_df[main_column]-test_df[norm_cols[0]])/test_df[norm_cols[1]]          
            
            del train_df[norm_cols[0]], train_df[norm_cols[1]]
            del test_df[norm_cols[0]], test_df[norm_cols[1]]
                                              
    return train_df, test_df

def frequency_encoding(train_df, test_df, columns, self_encoding=False):
    for col in columns:
        temp_df = pd.concat([train_df[[col]], test_df[[col]]])
        fq_encode = temp_df[col].value_counts(dropna=False).to_dict()
        if self_encoding:
            train_df[col] = train_df[col].map(fq_encode)
            test_df[col]  = test_df[col].map(fq_encode)            
        else:
            train_df[col+'_fq_enc'] = train_df[col].map(fq_encode)
            test_df[col+'_fq_enc']  = test_df[col].map(fq_encode)
    return train_df, test_df
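timeblock_frequency_encoding reads a `<period>_total` column that is not created in the code shown here. The sketch below assumes it holds the number of transactions per period over train and test combined, and walks through the encoding on toy data; the frames and the temporary `key` column are made up for this example.

```python
import pandas as pd

# Toy frames, made up for illustration.
train_toy = pd.DataFrame({"DT_M": [12, 12, 13], "ProductCD": ["W", "W", "C"]})
test_toy = pd.DataFrame({"DT_M": [13, 13, 14], "ProductCD": ["W", "C", "C"]})

# Transactions per period over train and test together; this is the assumed
# meaning of the `<period>_total` column.
period_counts = pd.concat([train_toy["DT_M"], test_toy["DT_M"]]).value_counts()
for df in (train_toy, test_toy):
    df["DT_M_total"] = df["DT_M"].map(period_counts)

# Frequency of each (ProductCD, period) pair, then its share within the period.
for df in (train_toy, test_toy):
    df["key"] = df["ProductCD"].astype(str) + "_" + df["DT_M"].astype(str)
pair_counts = pd.concat([train_toy["key"], test_toy["key"]]).value_counts()
for df in (train_toy, test_toy):
    df["ProductCD_DT_M"] = df["key"].map(pair_counts)
    df["ProductCD_DT_M_proportions"] = df["ProductCD_DT_M"] / df["DT_M_total"]
    del df["key"]

print(train_toy)
print(test_toy)
```

The proportion column is less sensitive to how busy a period was than the raw count, which matters here because transaction volume changes over time.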

Next comes further feature engineering:

#2. Keep intersections
for col in ['card1']: 
    valid_card = pd.concat([train_df[[col]], test_df[[col]]])
    valid_card = valid_card[col].value_counts()
    valid_card_std = valid_card.values.std()

    invalid_cards = valid_card[valid_card<=2]
    print('Rare cards',len(invalid_cards))

    valid_card = valid_card[valid_card>2]
    valid_card = list(valid_card.index)

    print('No intersection in Train', len(train_df[~train_df[col].isin(test_df[col])]))
    print('Intersection in Train', len(train_df[train_df[col].isin(test_df[col])]))
    
    train_df[col] = np.where(train_df[col].isin(test_df[col]), train_df[col], np.nan)
    test_df[col]  = np.where(test_df[col].isin(train_df[col]), test_df[col], np.nan)

    train_df[col] = np.where(train_df[col].isin(valid_card), train_df[col], np.nan)
    test_df[col]  = np.where(test_df[col].isin(valid_card), test_df[col], np.nan)
    print('#'*20)

for col in ['card2','card3','card4','card5','card6']: 
    print('No intersection in Train', col, len(train_df[~train_df[col].isin(test_df[col])]))
    print('Intersection in Train', col, len(train_df[train_df[col].isin(test_df[col])]))
    
    train_df[col] = np.where(train_df[col].isin(test_df[col]), train_df[col], np.nan)
    test_df[col]  = np.where(test_df[col].isin(train_df[col]), test_df[col], np.nan)
    print('#'*20)
Rare cards 5993
No intersection in Train 10396
Intersection in Train 580144
####################
No intersection in Train card2 0
Intersection in Train card2 590540
####################
No intersection in Train card3 47
Intersection in Train card3 590493
####################
No intersection in Train card4 0
Intersection in Train card4 590540
####################
No intersection in Train card5 176
Intersection in Train card5 590364
####################
No intersection in Train card6 30
Intersection in Train card6 590510
####################
#3.generate accurate userids and cardids
train_df['uid4'] = train_df['uid3'].astype(str)+'_'+train_df['P_emaildomain'].astype(str)
test_df['uid4'] = test_df['uid3'].astype(str)+'_'+test_df['P_emaildomain'].astype(str)

train_df['uid5'] = train_df['uid3'].astype(str)+'_'+train_df['R_emaildomain'].astype(str)
test_df['uid5'] = test_df['uid3'].astype(str)+'_'+test_df['R_emaildomain'].astype(str)

train_df['uid6'] = train_df['card1'].astype(str)+'_'+train_df['D15'].astype(str)
test_df['uid6'] = test_df['card1'].astype(str)+'_'+test_df['D15'].astype(str)

#try to generate more accurate card_id and user_id
#uid1/uid2 are no longer of much use

#guess_card_id
train_df['TransactionDTday'] = (train_df['TransactionDT']/(60*60*24)).map(int)
test_df['TransactionDTday'] = (test_df['TransactionDT']/(60*60*24)).map(int)
train_df['D1minusday'] = train_df['D1'] - train_df['TransactionDTday'] # offset of the card's first transaction (roughly the card issue day)
test_df['D1minusday'] = test_df['D1'] - test_df['TransactionDTday']
train_df['D4minusday'] = train_df['D4'] - train_df['TransactionDTday'] # offset of the account's first transaction
test_df['D4minusday'] = test_df['D4'] - test_df['TransactionDTday']

#This should work for D1/D2/D3/D8; D2 need not be touched, and D3/D8 probably have other uses
train_df['cid_1'] = train_df['uid4'].astype(str)+'_'+train_df['D1minusday'].astype(str)
test_df['cid_1'] = test_df['uid4'].astype(str)+'_'+test_df['D1minusday'].astype(str)

#guess_user_id using D4
train_df['uid7'] = train_df['uid4'].astype(str)+'_'+train_df['D4minusday'].astype(str)
test_df['uid7'] = test_df['uid4'].astype(str)+'_'+test_df['D4minusday'].astype(str)

print('#'*10)
print('Most common uIds:')
new_columns = ['uid1','uid2','uid3','uid4','uid5','uid6','uid7','cid_1']
for col in new_columns:
    print('#'*10, col)
    print(train_df[col].value_counts()[:10])

# Do Global frequency encoding 

i_cols = ['card1','card2','card3','card5'] + new_columns
train_df, test_df = frequency_encoding(train_df, test_df, i_cols, self_encoding=False)
##########
Most common uIds:
########## uid1
7919_194.0     14891
9500_321.0     14112
15885_545.0    10332
17188_321.0    10312
15066_170.0     7918
12695_490.0     7079
6019_583.0      6766
12544_321.0     6760
2803_100.0      6126
7585_553.0      5325
Name: uid1, dtype: int64
########## uid2
9500_321.0_150.0_226.0     14112
15885_545.0_185.0_138.0    10332
17188_321.0_150.0_226.0    10312
7919_194.0_150.0_166.0      8844
15066_170.0_150.0_102.0     7918
12695_490.0_150.0_226.0     7079
6019_583.0_150.0_226.0      6766
12544_321.0_150.0_226.0     6760
2803_100.0_150.0_226.0      6126
7919_194.0_150.0_202.0      6047
Name: uid2, dtype: int64
########## uid3
15885_545.0_185.0_138.0_nan_nan       9900
17188_321.0_150.0_226.0_299.0_87.0    5862
12695_490.0_150.0_226.0_325.0_87.0    5766
9500_321.0_150.0_226.0_204.0_87.0     4647
3154_408.0_185.0_224.0_nan_nan        4398
12839_321.0_150.0_226.0_264.0_87.0    3538
16132_111.0_150.0_226.0_299.0_87.0    3523
15497_490.0_150.0_226.0_299.0_87.0    3419
9500_321.0_150.0_226.0_272.0_87.0     2715
5812_408.0_185.0_224.0_nan_nan        2639
Name: uid3, dtype: int64
########## uid4
15885_545.0_185.0_138.0_nan_nan_hotmail.com     4002
15885_545.0_185.0_138.0_nan_nan_gmail.com       3830
17188_321.0_150.0_226.0_299.0_87.0_gmail.com    2235
12695_490.0_150.0_226.0_325.0_87.0_gmail.com    2045
9500_321.0_150.0_226.0_204.0_87.0_gmail.com     1947
3154_408.0_185.0_224.0_nan_nan_hotmail.com      1890
3154_408.0_185.0_224.0_nan_nan_gmail.com        1537
12839_321.0_150.0_226.0_264.0_87.0_gmail.com    1473
15775_481.0_150.0_102.0_330.0_87.0_uknown       1453
15497_490.0_150.0_226.0_299.0_87.0_gmail.com    1383
Name: uid4, dtype: int64
########## uid5
12695_490.0_150.0_226.0_325.0_87.0_uknown      5446
17188_321.0_150.0_226.0_299.0_87.0_uknown      5322
9500_321.0_150.0_226.0_204.0_87.0_uknown       4403
15885_545.0_185.0_138.0_nan_nan_hotmail.com    4002
15885_545.0_185.0_138.0_nan_nan_gmail.com      3830
12839_321.0_150.0_226.0_264.0_87.0_uknown      3365
16132_111.0_150.0_226.0_299.0_87.0_uknown      3212
15497_490.0_150.0_226.0_299.0_87.0_uknown      3027
9500_321.0_150.0_226.0_272.0_87.0_uknown       2601
7664_490.0_150.0_226.0_264.0_87.0_uknown       2396
Name: uid5, dtype: int64
########## uid6
15885.0_0.0    7398
7919.0_0.0     4170
6019.0_nan     3962
nan_0.0        3754
9500.0_0.0     3414
3154.0_0.0     3016
15066.0_0.0    2995
9633.0_0.0     2968
nan_nan        2794
17188.0_0.0    2434
Name: uid6, dtype: int64
########## uid7
15775_481.0_150.0_102.0_330.0_87.0_uknown_nan    1453
12695_490.0_150.0_226.0_325.0_87.0_uknown_nan     928
17188_321.0_150.0_226.0_299.0_87.0_uknown_nan     923
9500_321.0_150.0_226.0_204.0_87.0_uknown_nan      622
16132_111.0_150.0_226.0_299.0_87.0_uknown_nan     622
12839_321.0_150.0_226.0_264.0_87.0_uknown_nan     580
7207_111.0_150.0_226.0_204.0_87.0_uknown_nan      551
7664_490.0_150.0_226.0_264.0_87.0_uknown_nan      545
15497_490.0_150.0_226.0_299.0_87.0_uknown_nan     480
9112_250.0_150.0_226.0_441.0_87.0_uknown_nan      439
Name: uid7, dtype: int64
########## cid_1
15775_481.0_150.0_102.0_330.0_87.0_uknown_-129.0       1414
9500_321.0_150.0_226.0_126.0_87.0_aol.com_85.0          404
8528_215.0_150.0_226.0_387.0_87.0_uknown_159.0          207
7207_111.0_150.0_226.0_204.0_87.0_uknown_465.0          189
12741_106.0_150.0_226.0_143.0_87.0_gmail.com_202.0      156
13597_198.0_150.0_226.0_191.0_87.0_yahoo.com_48.0       145
4121_361.0_150.0_226.0_476.0_87.0_hotmail.com_8.0       141
8900_385.0_150.0_226.0_231.0_87.0_uknown_60.0           132
9323_111.0_150.0_226.0_191.0_87.0_charter.net_50.0      109
3898_281.0_150.0_226.0_181.0_87.0_hotmail.com_188.0     106
Name: cid_1, dtype: int64
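The idea behind `D1minusday` can be illustrated on toy data. The sketch below assumes that D1 counts the days elapsed since some fixed event for a card (the code comment above calls it the card issue day), so `D1 - TransactionDTday` stays constant across all transactions of the same card. The values and the column name `cid_toy` are invented for illustration only.

```python
import pandas as pd

# Hypothetical data: four transactions sharing one card1 value,
# but actually belonging to two different cards.
toy = pd.DataFrame({
    'card1':         [1000, 1000, 1000, 1000],
    'TransactionDT': [86400 * 10, 86400 * 25, 86400 * 12, 86400 * 40],
    'D1':            [10.0, 25.0, 2.0, 30.0],
})

# Same transformation as in the write-up: seconds -> whole days
toy['TransactionDTday'] = (toy['TransactionDT'] / (60 * 60 * 24)).map(int)
# Constant per card if D1 really counts days since a fixed event
toy['D1minusday'] = toy['D1'] - toy['TransactionDTday']
toy['cid_toy'] = toy['card1'].astype(str) + '_' + toy['D1minusday'].astype(str)

print(toy[['TransactionDTday', 'D1', 'D1minusday', 'cid_toy']])
```

Rows 0 and 1 end up with one id and rows 2 and 3 with another, even though `card1` alone cannot tell them apart.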
#4. period counts
for col in ['DT_M','DT_W','DT_D']:
    temp_df = pd.concat([train_df[[col]], test_df[[col]]])
    fq_encode = temp_df[col].value_counts().to_dict()
            
    train_df[col+'_total'] = train_df[col].map(fq_encode)
    test_df[col+'_total']  = test_df[col].map(fq_encode)
        
#User period counts
periods = ['DT_M','DT_W','DT_D']
i_cols = ['uid4','uid5','uid6','uid7','cid_1']
for period in periods:
    for col in i_cols:
        new_column = col + '_' + period
            
        temp_df = pd.concat([train_df[[col,period]], test_df[[col,period]]])
        temp_df[new_column] = temp_df[col].astype(str) + '_' + (temp_df[period]).astype(str)
        fq_encode = temp_df[new_column].value_counts().to_dict()
            
        train_df[new_column] = (train_df[col].astype(str) + '_' + train_df[period].astype(str)).map(fq_encode)
        test_df[new_column]  = (test_df[col].astype(str) + '_' + test_df[period].astype(str)).map(fq_encode)
        
        train_df[new_column] /= train_df[period+'_total']
        test_df[new_column]  /= test_df[period+'_total']
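The user period counts above divide the number of rows for a (uid, period) pair by the total number of rows in that period. A minimal sketch on invented data, using a hypothetical `uid` column and a single period `DT_D`:

```python
import pandas as pd

# Hypothetical toy frames: train covers day 1, test covers day 2
train = pd.DataFrame({'uid': ['a', 'a', 'b'], 'DT_D': [1, 1, 1]})
test = pd.DataFrame({'uid': ['a', 'b', 'b'], 'DT_D': [2, 2, 2]})

both = pd.concat([train, test], ignore_index=True)
# Rows per day, and rows per (uid, day) pair, counted over train + test
day_total = both['DT_D'].value_counts().to_dict()
pair_key = both['uid'].astype(str) + '_' + both['DT_D'].astype(str)
pair_count = pair_key.value_counts().to_dict()

for df in (train, test):
    key = df['uid'].astype(str) + '_' + df['DT_D'].astype(str)
    # Share of that day's transactions made by this uid
    df['uid_DT_D'] = key.map(pair_count) / df['DT_D'].map(day_total)

print(train)
print(test)
```

Dividing by the period total keeps busy days from inflating the feature, which is the concern raised in the comments of the timeblock encoding step.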
#5. Prepare bank type feature
for df in [train_df, test_df]:
    df['bank_type'] = df['card3'].astype(str) +'_'+ df['card5'].astype(str)

encoding_mean = {
    1: ['DT_D','DT_hour','_hour_dist','DT_hour_mean'],
    2: ['DT_W','DT_day_week','_week_day_dist','DT_day_week_mean'],
    3: ['DT_M','DT_day_month','_month_day_dist','DT_day_month_mean'],
    }

encoding_best = {
    1: ['DT_D','DT_hour','_hour_dist_best','DT_hour_best'],
    2: ['DT_W','DT_day_week','_week_day_dist_best','DT_day_week_best'],
    3: ['DT_M','DT_day_month','_month_day_dist_best','DT_day_month_best'],   
    }

train_df['DT_day_month'] = (train_df['DT'].dt.day).astype(np.int8)
test_df['DT_day_month'] = (test_df['DT'].dt.day).astype(np.int8)
# Some ugly code here (even worse than in other parts)
for col in ['card3','card5','bank_type']:
    for df in [train_df, test_df]:
        for encode in encoding_mean:
            encode = encoding_mean[encode].copy()
            new_col = col + '_' + encode[0] + encode[2]
            df[new_col] = df[col].astype(str) +'_'+ df[encode[0]].astype(str)

            temp_dict = df.groupby([new_col])[encode[1]].agg(['mean']).reset_index().rename(
                                                                    columns={'mean': encode[3]})
            temp_dict.index = temp_dict[new_col].values
            temp_dict = temp_dict[encode[3]].to_dict()
            df[new_col] = df[encode[1]] - df[new_col].map(temp_dict)

        for encode in encoding_best:
            encode = encoding_best[encode].copy()
            new_col = col + '_' + encode[0] + encode[2]
            df[new_col] = df[col].astype(str) +'_'+ df[encode[0]].astype(str)
            temp_dict = df.groupby([col,encode[0],encode[1]])[encode[1]].agg(['count']).reset_index().rename(
                                                                    columns={'count': encode[3]})

            temp_dict.sort_values(by=[col,encode[0],encode[3]], inplace=True)
            temp_dict = temp_dict.drop_duplicates(subset=[col,encode[0]], keep='last')
            temp_dict[new_col] = temp_dict[col].astype(str) +'_'+ temp_dict[encode[0]].astype(str)
            temp_dict.index = temp_dict[new_col].values
            temp_dict = temp_dict[encode[1]].to_dict()
            df[new_col] = df[encode[1]] - df[new_col].map(temp_dict)
#6. BankType timeblock_frequency_encoding
i_cols = ['bank_type'] 
periods = ['DT_M','DT_W','DT_D']

# We have few options to encode it here:
# - Just count transactions
# (but some timeblocks have more transactions than others)
# - Divide by total transactions per timeblock (proportions)
# - Use both
# - Use only proportions
train_df, test_df = timeblock_frequency_encoding(train_df, test_df, periods, i_cols, 
                                 with_proportions=False, only_proportions=True)
#7. Ds uid aggregations (maybe not useful)
i_cols = ['D'+str(i) for i in range(1,16)]
uids = ['uid3','uid4','uid5','bank_type','cid_1','uid6','uid7']
aggregations = ['mean','min']

####### Cleaning Negative values and columns transformations
for df in [train_df, test_df]:

    for col in i_cols:
        df[col] = df[col].clip(0) 
    
    # Let's transform the D8 and D9 columns,
    # as they are almost certainly connected with hours
    df['D9_not_na'] = np.where(df['D9'].isna(),0,1)
    df['D8_not_same_day'] = np.where(df['D8']>=1,1,0)
    df['D8_D9_decimal_dist'] = df['D8'].fillna(0)-df['D8'].fillna(0).astype(int)
    df['D8_D9_decimal_dist'] = ((df['D8_D9_decimal_dist']-df['D9'])**2)**0.5
    df['D8'] = df['D8'].fillna(-1).astype(int)
def values_normalization(dt_df, periods, columns):
    for period in periods:
        for col in columns:
            new_col = col +'_'+ period
            dt_df[col] = dt_df[col].astype(float)  

            temp_min = dt_df.groupby([period])[col].agg(['min']).reset_index()
            temp_min.index = temp_min[period].values
            temp_min = temp_min['min'].to_dict()

            temp_max = dt_df.groupby([period])[col].agg(['max']).reset_index()
            temp_max.index = temp_max[period].values
            temp_max = temp_max['max'].to_dict()

            temp_mean = dt_df.groupby([period])[col].agg(['mean']).reset_index()
            temp_mean.index = temp_mean[period].values
            temp_mean = temp_mean['mean'].to_dict()

            temp_std = dt_df.groupby([period])[col].agg(['std']).reset_index()
            temp_std.index = temp_std[period].values
            temp_std = temp_std['std'].to_dict()

            dt_df['temp_min'] = dt_df[period].map(temp_min)
            dt_df['temp_max'] = dt_df[period].map(temp_max)
            dt_df['temp_mean'] = dt_df[period].map(temp_mean)
            dt_df['temp_std'] = dt_df[period].map(temp_std)

            dt_df[new_col+'_min_max'] = (dt_df[col]-dt_df['temp_min'])/(dt_df['temp_max']-dt_df['temp_min'])
            dt_df[new_col+'_std_score'] = (dt_df[col]-dt_df['temp_mean'])/(dt_df['temp_std'])
            del dt_df['temp_min'],dt_df['temp_max'],dt_df['temp_mean'],dt_df['temp_std']
    return dt_df
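What `values_normalization` computes can be shown more compactly with `groupby(...).transform`. This is a sketch on invented data, not a replacement for the function above; the column `D4` and the period `DT_M` are just examples.

```python
import pandas as pd

toy = pd.DataFrame({
    'DT_M': [1, 1, 1, 2, 2, 2],
    'D4':   [0.0, 5.0, 10.0, 100.0, 150.0, 200.0],
})

grouped = toy.groupby('DT_M')['D4']
g_min = grouped.transform('min')
g_max = grouped.transform('max')

# Min-max scaling within each month
toy['D4_DT_M_min_max'] = (toy['D4'] - g_min) / (g_max - g_min)
# Z-score within each month (pandas uses the sample std, ddof=1)
toy['D4_DT_M_std_score'] = (toy['D4'] - grouped.transform('mean')) / grouped.transform('std')

print(toy)
```

Both months map to the same scaled values here even though their raw levels differ by an order of magnitude, which is the point of normalizing per period. A period whose values are all equal would produce a zero denominator, so real data may need a guard for that case.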
#8. Ds period calculation (maybe not useful)
####### Values Normalization
i_cols.remove('D1')
i_cols.remove('D2')
i_cols.remove('D9')
periods = ['DT_D','DT_W','DT_M']

for df in [train_df, test_df]:
    df = values_normalization(df, periods, i_cols)


for col in ['D1','D2']:
    for df in [train_df, test_df]:
        df[col+'_scaled'] = df[col]/train_df[col].max()
        
####### Global Self frequency encoding
# self_encoding=True because 
# we don't need original values anymore
i_cols = ['D'+str(i) for i in range(1,16)]
train_df, test_df = frequency_encoding(train_df, test_df, i_cols, self_encoding=True)

Various processing of TransactionAmt:

#9. TransAmt uids/cids aggregations and calculations(need more fe)
i_cols = ['TransactionAmt','TransactionAmt_decimal']
#uids = ['card1','card2','card3','card5','uid1','uid2','uid3','uid4','uid5','bank_type','uid6']
uids = ['card1','card2','card3','card5','uid3','uid4','uid5','bank_type','uid6','uid7','cid_1']
aggregations = ['mean','std','min']

# uIDs aggregations
train_df, test_df = uid_aggregation(train_df, test_df, i_cols, uids, aggregations)

for df in [train_df,test_df]:
    df['transAmt_mut_C1'] = df['TransactionAmt'] * df['C1']
    df['transAmt_mut_C13'] = df['TransactionAmt'] * df['C13']
    df['transAmt_mut_C14'] = df['TransactionAmt'] * df['C14']
    df['transAmt_dec_diff'] = df['TransactionAmt_decimal'] - ((df['uid4_TransactionAmt_mean']-df['uid4_TransactionAmt_mean'].astype(int)) * 1000).astype(int)
    df['Transdiff_in_uid'] = df['transAmt_dec_diff']*df['uid4_TransactionAmt_mean']/1000

# TransactionAmt Normalization-period scaling
periods = ['DT_D','DT_W','DT_M']
for df in [train_df, test_df]:
    df = values_normalization(df, periods, i_cols)
# Product type
train_df['product_type'] = train_df['ProductCD'].astype(str)+'_'+train_df['TransactionAmt'].astype(str)
test_df['product_type'] = test_df['ProductCD'].astype(str)+'_'+test_df['TransactionAmt'].astype(str)

i_cols = ['product_type']
periods = ['DT_D','DT_W','DT_M']
train_df, test_df = timeblock_frequency_encoding(train_df, test_df, periods, i_cols, 
                                                 with_proportions=False, only_proportions=True)
train_df, test_df = frequency_encoding(train_df, test_df, i_cols, self_encoding=True)

To categorize the V features, see Rajesh Vikraman's kernel "Understanding V columns":

def column_value_freq(sel_col,cum_per):
    dfpercount = pd.DataFrame(columns=['col_name','num_values_'+str(round(cum_per,2))])
    for col in sel_col:
        col_value = train_df[col].value_counts(normalize=True)
        colpercount = pd.DataFrame({'value' : col_value.index,'per_count' : col_value.values})
        colpercount['cum_per_count'] = colpercount['per_count'].cumsum()
        if len(colpercount.loc[colpercount['cum_per_count'] < cum_per,] ) < 2:
            num_col_99 = len(colpercount.loc[colpercount['per_count'] > (1- cum_per),]) # count the dominant values
        else:
            num_col_99 = len(colpercount.loc[colpercount['cum_per_count']< cum_per,] ) # count the values below the cumulative threshold
        # DataFrame.append was removed in pandas 2.0, so the row is added with pd.concat
        dfpercount = pd.concat([dfpercount, pd.DataFrame([{'col_name': col,'num_values_'+str(round(cum_per,2)): num_col_99}])], ignore_index=True)
    dfpercount['unique_values'] = train_df[sel_col].nunique().values
    dfpercount['unique_value_to_num_values'+str(round(cum_per,2))+'_ratio'] = 100 * (dfpercount['num_values_'+str(round(cum_per,2))]/dfpercount.unique_values)
    #dfpercount['percent_missing'] = percent_na(train_transaction[sel_col])['percent_missing'].round(3).values
    return dfpercount
#10. V cols
#Understand V cols
v_cols = ['V'+str(i) for i in range(1,340)]
cum_per = 0.965
colfreq=column_value_freq(v_cols,cum_per)
print(colfreq.head())
colfreq_bool = colfreq[colfreq.unique_values==2]['col_name'].values
colfreq_pseudobool = colfreq[(colfreq.unique_values !=2) & (colfreq['num_values_'+str(round(cum_per,2))] <= 2)]
colfreq_pseudobool_cat = colfreq_pseudobool[colfreq_pseudobool.unique_values <=15]['col_name'].values
colfreq_pseudobool_num = colfreq_pseudobool[colfreq_pseudobool.unique_values >15]['col_name'].values
colfreq_cat = colfreq[(colfreq.unique_values >15) & (colfreq['num_values_'+str(round(cum_per,2))] <= 15) & (colfreq['num_values_'+str(round(cum_per,2))]> 2)]['col_name'].values
colfreq_num = colfreq[colfreq['num_values_'+str(round(cum_per,2))]>15]['col_name'].values
  col_name num_values_0.96  unique_values unique_value_to_num_values0.96_ratio
0       V1               1              2                                   50
1       V2               2              9                              22.2222
2       V3               2             10                                   20
3       V4               2              7                              28.5714
4       V5               2              7                              28.5714

Based on the EDA, clip some of the outliers:

#clipping v_num_cats
vcol_spike = ['V96', 'V97','V167', 'V168','V177', 'V178','V179', 'V217', 'V218', 'V219','V231','V280', 'V282','V294', 'V322', 'V323', 'V324']
cols = list(colfreq_pseudobool_num) + vcol_spike
for df in [train_df, test_df]:
    for col in cols :
        max_value = train_df[train_df['DT_M']==train_df['DT_M'].min()][col].max()
        df[col] = df[col].clip(None,max_value) 

Standardize the V features and run PCA on them, following Konstantin Yakovlev's kernel "IEEE - V columns pv". Unlike that kernel, this model processes only the numerical V features:

#Dealing with V cols
#Scaling with pca - Numerical V cols - scaling still needs care
from sklearn.preprocessing import StandardScaler

v_cols = colfreq_num
print(v_cols)
test_group = list(v_cols)
train_df['group_sum'] = train_df[test_group].to_numpy().sum(axis=1)
train_df['group_mean'] = train_df[test_group].to_numpy().mean(axis=1)
    
test_df['group_sum'] = test_df[test_group].to_numpy().sum(axis=1)
test_df['group_mean'] = test_df[test_group].to_numpy().mean(axis=1)
compact_cols = ['group_sum','group_mean']
 
for col in test_group:
    sc = StandardScaler()
    sc.fit(train_df[[col]].fillna(0))
    train_df[col] = sc.transform(train_df[[col]].fillna(0))
    test_df[col] = sc.transform(test_df[[col]].fillna(0))
    
sc_test_group = test_group

# check -> same obviously
features_check = []
from scipy.stats import ks_2samp # two-sample KS test: checks whether two samples come from the same distribution
for col in sc_test_group:
    features_check.append(ks_2samp(train_df[col], test_df[col])[1])
    
features_check = pd.Series(features_check, index=sc_test_group).sort_values() 
print(features_check)

from sklearn.decomposition import PCA
#PCA is still needed: an orthogonal linear transform that helps denoise
pca = PCA(random_state=42)
pca.fit(train_df[sc_test_group])
print(len(sc_test_group), pca.transform(train_df[sc_test_group]).shape[-1])
train_df[sc_test_group] = pca.transform(train_df[sc_test_group])
test_df[sc_test_group] = pca.transform(test_df[sc_test_group])

sc_variance =pca.explained_variance_ratio_
print(sc_variance)

# check
features_check = []

for col in sc_test_group:
    features_check.append(ks_2samp(train_df[col], test_df[col])[1])
    
features_check = pd.Series(features_check, index=sc_test_group).sort_values() 
print(features_check)
train_df[col], test_df[col]
['V126' 'V127' 'V128' 'V130' 'V131' 'V132' 'V133' 'V134' 'V136' 'V137'
 'V143' 'V144' 'V145' 'V150' 'V159' 'V160' 'V164' 'V165' 'V166' 'V202'
 'V203' 'V204' 'V205' 'V206' 'V207' 'V208' 'V209' 'V210' 'V211' 'V212'
 'V213' 'V214' 'V215' 'V216' 'V263' 'V264' 'V265' 'V266' 'V267' 'V268'
 'V270' 'V271' 'V272' 'V273' 'V274' 'V275' 'V276' 'V277' 'V278' 'V306'
 'V307' 'V308' 'V309' 'V310' 'V312' 'V313' 'V314' 'V315' 'V316' 'V317'
 'V318' 'V320' 'V321' 'V331' 'V332' 'V333' 'V335']
V130    1.069942e-100
V136     3.228515e-87
V317     1.542568e-65
V133     9.980904e-61
V127     1.833679e-60
            ...      
V206     2.495735e-01
V332     2.967873e-01
V333     2.998484e-01
V331     4.952229e-01
V335     5.364810e-01
Length: 67, dtype: float64
67 67
[3.95243705e-01 1.20604713e-01 9.08724136e-02 7.99145695e-02
 5.93916129e-02 5.00332241e-02 4.54584312e-02 2.89428818e-02
 2.32736617e-02 1.84120687e-02 1.45453003e-02 1.14526355e-02
 7.81445065e-03 7.34786437e-03 5.85068362e-03 4.37949141e-03
 4.02093888e-03 3.46559896e-03 3.18676729e-03 2.44932594e-03
 2.27222589e-03 2.24909807e-03 1.99991560e-03 1.95987640e-03
 1.71973338e-03 1.53540490e-03 1.51142993e-03 1.03556294e-03
 9.71562292e-04 8.96170111e-04 8.77193459e-04 7.01838736e-04
 6.96799764e-04 6.61501420e-04 5.95271089e-04 5.05089251e-04
 4.25160153e-04 3.66978537e-04 3.32092303e-04 3.14215803e-04
 3.07523162e-04 2.74325442e-04 2.09382351e-04 1.37651414e-04
 1.31921650e-04 1.06460734e-04 9.48509909e-05 8.59092947e-05
 7.21478151e-05 6.68975933e-05 5.36152094e-05 4.34649260e-05
 3.15145841e-05 2.53136017e-05 2.00119609e-05 1.58136115e-05
 1.04702022e-05 9.50144927e-06 5.76505723e-06 4.72613777e-06
 2.37874562e-06 2.01718484e-06 5.65667558e-07 2.11127749e-07
 7.10596417e-08 2.13648711e-08 9.31242653e-09]
V216    0.000000e+00
V215    0.000000e+00
V333    0.000000e+00
V316    0.000000e+00
V265    0.000000e+00
            ...     
V310    9.931409e-60
V130    1.223724e-54
V204    6.822684e-54
V209    3.987460e-53
V312    5.351877e-44
Length: 67, dtype: float64
(0         0.000023
 1         0.000009
 2         0.000009
 3         0.000015
 4         0.000141
             ...   
 590535    0.000013
 590536    0.000009
 590537    0.000009
 590538    0.000012
 590539    0.000016
 Name: V335, Length: 590540, dtype: float64, 0         0.000009
 1        -0.000035
 2        -0.000038
 3         0.000018
 4        -0.000004
             ...   
 506686    0.000009
 506687    0.000007
 506688    0.000009
 506689    0.000009
 506690    0.000009
 Name: V335, Length: 506691, dtype: float64)
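The scale-then-PCA pipeline above fits both steps on the training set and applies them unchanged to the test set. The sketch below reproduces that pattern with NumPy only, on random data standing in for the numerical V columns. The mixing matrix `mix` and all array names are assumptions made for illustration; the write-up itself uses scikit-learn's `StandardScaler` and `PCA`.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical mixing matrix so that the toy columns are correlated
mix = np.array([[2.0, 0.5, 0.0, 0.0],
                [0.0, 1.0, 0.3, 0.0],
                [0.0, 0.0, 0.5, 0.1],
                [0.0, 0.0, 0.0, 0.2]])
train = rng.normal(size=(500, 4)) @ mix
test = rng.normal(size=(300, 4)) @ mix

# Standardize with statistics from train only (population std, ddof=0)
mean, std = train.mean(axis=0), train.std(axis=0)
train_sc = (train - mean) / std
test_sc = (test - mean) / std

# PCA via SVD of the standardized training matrix (already centred)
_, s, vt = np.linalg.svd(train_sc, full_matrices=False)
explained_variance_ratio = s ** 2 / np.sum(s ** 2)

# Project both sets onto the components learned from train
train_pca = train_sc @ vt.T
test_pca = test_sc @ vt.T

print(explained_variance_ratio)
```

The projected training columns are uncorrelated by construction. The test columns are only approximately so, which is one reason the KS check after PCA can still report differences between train and test.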

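Clip the C features. The EDA shows that many C features are distributed very differently in the training and test sets: in the training set they show pronounced outliers in winter, so those outliers are clipped to bring the distributions closer.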

#12. Cs frequency encode and clip
i_cols = ['C'+str(i) for i in range(1,15)]

####### Global Self frequency encoding
# self_encoding=False because 
# I want to keep original values
train_df, test_df = frequency_encoding(train_df, test_df, i_cols, self_encoding=False)

####### Clip max values - this amounts to throwing away the winter
for df in [train_df, test_df]:
    for col in i_cols:
        max_value = train_df[train_df['DT_M']==train_df['DT_M'].max()][col].max()
        df[col] = df[col].clip(None,max_value) 
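The clipping rule above takes the maximum of each C column in the last training month and uses it as an upper bound everywhere. A small sketch with invented month numbers and values:

```python
import pandas as pd

# Hypothetical data: months 12-13 play the role of the winter spike
train = pd.DataFrame({'DT_M': [12, 12, 13, 17, 17],
                      'C1':   [3.0, 400.0, 900.0, 5.0, 20.0]})
test = pd.DataFrame({'DT_M': [19, 19], 'C1': [7.0, 350.0]})

# Upper bound = largest value seen in the last training month
ref_max = train.loc[train['DT_M'] == train['DT_M'].max(), 'C1'].max()

for df in (train, test):
    df['C1'] = df['C1'].clip(upper=ref_max)

print(ref_max)
print(train['C1'].tolist(), test['C1'].tolist())
```

Note that the bound is also applied to the test set, so large but legitimate test values are flattened too. Whether that trade-off helps is an empirical question for each column.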

Try a variety of feature combinations:

#13. More combinations
## Identity columns
from sklearn.preprocessing import LabelEncoder
for col in ['id_33']:
    train_identity[col] = train_identity[col].fillna('unseen_before_label')
    test_identity[col]  = test_identity[col].fillna('unseen_before_label')
    
    le = LabelEncoder()
    le.fit(list(train_identity[col])+list(test_identity[col]))
    train_identity[col] = le.transform(train_identity[col])
    test_identity[col]  = le.transform(test_identity[col])
    
print('train_set shape before merge:',train_df.shape)
train_df1 = train_df.merge(train_identity,how='left',on=['TransactionID'])
print('train_set shape after merge:',train_df1.shape)

print('test_set shape before merge:',test_df.shape)
test_df1 = test_df.merge(test_identity,how='left',on=['TransactionID'])
print('test_set shape after merge:',test_df1.shape)

# New feature - mean of sth
columns_a = ['TransactionAmt', 'id_02', 'D15']
columns_b = ['card1', 'card4', 'addr1']
for col_a in columns_a:
    for col_b in columns_b:
        for df in [train_df1, test_df1]:
            df[f'{col_a}_to_mean_{col_b}'] = df[col_a] / df.groupby([col_b])[col_a].transform('mean')
            df[f'{col_a}_to_std_{col_b}'] = df[col_a] / df.groupby([col_b])[col_a].transform('std')
del columns_a,columns_b
gc.collect()

# Some arbitrary features interaction - experimental combined features (?????)
for feature in ['id_02__id_20', 'id_02__D8', 'D11__DeviceInfo', 'DeviceInfo__P_emaildomain', 'P_emaildomain__C2', 
                    'card2__dist1', 'card1__card5', 'card2__id_20', 'card5__P_emaildomain', 'addr1__card1','card1__id_02']:

    f1, f2 = feature.split('__')
    train_df1[feature] = train_df1[f1].astype(str) + '_' + train_df1[f2].astype(str)
    test_df1[feature] = test_df1[f1].astype(str) + '_' + test_df1[f2].astype(str)

    le = LabelEncoder()
    le.fit(list(train_df1[feature].astype(str).values) + list(test_df1[feature].astype(str).values))
    train_df1[feature] = le.transform(list(train_df1[feature].astype(str).values))
    test_df1[feature] = le.transform(list(test_df1[feature].astype(str).values))
train_df = train_df1
test_df = test_df1
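The "mean of sth" features above express each value relative to the mean and standard deviation of its group. A toy sketch for one pair of columns, with invented amounts:

```python
import pandas as pd

toy = pd.DataFrame({
    'card1':          [1, 1, 1, 2, 2],
    'TransactionAmt': [10.0, 20.0, 30.0, 100.0, 300.0],
})

grp = toy.groupby('card1')['TransactionAmt']
# How large is this amount compared with the card's typical amount?
toy['TransactionAmt_to_mean_card1'] = toy['TransactionAmt'] / grp.transform('mean')
# Same amount expressed in units of the card's spread (sample std)
toy['TransactionAmt_to_std_card1'] = toy['TransactionAmt'] / grp.transform('std')

print(toy)
```

A group containing a single row has an undefined sample standard deviation, so the `_to_std_` feature is NaN there. LightGBM handles missing values natively, which is presumably why the write-up leaves them in place.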

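Use had_id to separate transactions into traditional (offline) and online. This is only a hypothesis, but the transaction distributions after the split are consistent with it, and it neatly explains the winter surge in transaction amounts and counts (Black Friday and Cyber Monday in winter, and the offline peak around Chinese New Year in February). Aggregation features split by had_id are built below; they raised the LB score slightly: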

train_df['had_id'] = train_df['had_id'].fillna(0)
test_df['had_id'] = test_df['had_id'].fillna(0)
def uid_sep_aggregation(train_df, test_df, main_columns, uids, aggregations):
    for main_column in main_columns:  
        for col in uids:
            for agg_type in aggregations:
                new_col_name = col+'_'+main_column+'_sep_'+agg_type
                
                train_df[col+'_sep'] = train_df[col].astype(str)+train_df['had_id'].astype(str)
                test_df[col+'_sep'] = test_df[col].astype(str)+test_df['had_id'].astype(str)
                
                temp_df = pd.concat([train_df[[col+'_sep', main_column]], test_df[[col+'_sep',main_column]]])
                temp_df = temp_df.groupby([col+'_sep'])[main_column].agg([agg_type]).reset_index().rename(
                                                        columns={agg_type: new_col_name})
                
                temp_df.index = list(temp_df[col+'_sep'])
                temp_df = temp_df[new_col_name].to_dict()   
                
                train_df[new_col_name] = train_df[col+'_sep'].map(temp_df)
                test_df[new_col_name]  = test_df[col+'_sep'].map(temp_df)
                del train_df[col+'_sep'],test_df[col+'_sep']
    return train_df, test_df

def values_sep_normalization(dt_df, periods, columns):
    for period in periods:
        for col in columns:
            new_col = col +'_sep_'+ period
            dt_df[col] = dt_df[col].astype(float)  

            dt_df[period+'_sep'] = dt_df[period].astype(str)+dt_df['had_id'].astype(str)     
                
            temp_min = dt_df.groupby([period+'_sep'])[col].agg(['min']).reset_index()
            temp_min.index = temp_min[period+'_sep'].values
            temp_min = temp_min['min'].to_dict()

            temp_max = dt_df.groupby([period+'_sep'])[col].agg(['max']).reset_index()
            temp_max.index = temp_max[period+'_sep'].values
            temp_max = temp_max['max'].to_dict()

            temp_mean = dt_df.groupby([period+'_sep'])[col].agg(['mean']).reset_index()
            temp_mean.index = temp_mean[period+'_sep'].values
            temp_mean = temp_mean['mean'].to_dict()

            temp_std = dt_df.groupby([period+'_sep'])[col].agg(['std']).reset_index()
            temp_std.index = temp_std[period+'_sep'].values
            temp_std = temp_std['std'].to_dict()

            dt_df['temp_min'] = dt_df[period+'_sep'].map(temp_min)
            dt_df['temp_max'] = dt_df[period+'_sep'].map(temp_max)
            dt_df['temp_mean'] = dt_df[period+'_sep'].map(temp_mean)
            dt_df['temp_std'] = dt_df[period+'_sep'].map(temp_std)

            dt_df[new_col+'_min_max'] = (dt_df[col]-dt_df['temp_min'])/(dt_df['temp_max']-dt_df['temp_min'])
            dt_df[new_col+'_std_score'] = (dt_df[col]-dt_df['temp_mean'])/(dt_df['temp_std'])
            del dt_df['temp_min'],dt_df['temp_max'],dt_df['temp_mean'],dt_df['temp_std'],dt_df[period+'_sep']
    return dt_df
#9.1 TransAmt separated by had_id (Online/Traditional)
# group by uid separately for online/traditional transactions
i_cols = ['TransactionAmt','TransactionAmt_decimal']
uids = ['uid3','uid4','uid5','bank_type','uid6','uid7','cid_1']
aggregations = ['mean','std','min']

train_df, test_df = uid_sep_aggregation(train_df, test_df, i_cols, uids, aggregations)

# normalize separately for online/traditional transactions
periods = ['DT_D','DT_W','DT_M']
for df in [train_df, test_df]:
    df = values_sep_normalization(df, periods, i_cols)
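The effect of splitting an aggregation by had_id can be seen on a few invented rows. The sketch below mirrors the key-building approach of `uid_sep_aggregation` for a single hypothetical `uid` column and the mean only.

```python
import pandas as pd

train = pd.DataFrame({'uid': ['u1', 'u1', 'u1', 'u2'],
                      'had_id': [1.0, 1.0, 0.0, 0.0],
                      'TransactionAmt': [10.0, 30.0, 100.0, 50.0]})
test = pd.DataFrame({'uid': ['u1', 'u2'],
                     'had_id': [0.0, 0.0],
                     'TransactionAmt': [200.0, 70.0]})

both = pd.concat([train, test], ignore_index=True)
# One key per (uid, online/traditional) pair
both['uid_sep'] = both['uid'].astype(str) + '_' + both['had_id'].astype(str)
sep_mean = both.groupby('uid_sep')['TransactionAmt'].mean().to_dict()

for df in (train, test):
    key = df['uid'].astype(str) + '_' + df['had_id'].astype(str)
    df['uid_TransactionAmt_sep_mean'] = key.map(sep_mean)

print(train)
print(test)
```

User `u1` gets a mean of 20 for the online rows and 150 for the traditional rows, whereas a plain uid aggregation would assign 85 to all of them.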

Frequency-encode some features, or convert strings to numerical values:

#14. Category Encoding
print('Category Encoding')
from sklearn.preprocessing import LabelEncoder
## card4, card6, ProductCD
# Converting Strings to ints(or floats if nan in column) using frequency encoding
# We will be able to use these columns as category or as numerical feature


for col in ['card4', 'card6', 'ProductCD']:
    print('Encoding', col)
    temp_df = pd.concat([train_df[[col]], test_df[[col]]])
    col_encoded = temp_df[col].value_counts().to_dict()   
    train_df[col] = train_df[col].map(col_encoded) # multi-class: encode by occurrence count
    test_df[col]  = test_df[col].map(col_encoded)
    print(col_encoded)
    del temp_df,col_encoded
    gc.collect()

## M columns
# Converting Strings to ints(or floats if nan in column)

for col in ['M1','M2','M3','M5','M6','M7','M8','M9']:
    train_df[col] = train_df[col].map({'T':1, 'F':0})
    test_df[col]  = test_df[col].map({'T':1, 'F':0})

for col in ['P_emaildomain', 'R_emaildomain','M4']:
    print('Encoding', col)
    temp_df = pd.concat([train_df[[col]], test_df[[col]]])
    col_encoded = temp_df[col].value_counts().to_dict()   
    train_df[col] = train_df[col].map(col_encoded)
    test_df[col]  = test_df[col].map(col_encoded)
    print(col_encoded)
    del temp_df,col_encoded
    gc.collect()
    
i_cols = ['TransactionAmt']
uids = ['card2__id_20','card1__id_02']
aggregations = ['mean','std']

# uIDs aggregations
train_df, test_df = uid_aggregation(train_df, test_df, i_cols, uids, aggregations)
 
    
## Reduce Mem One More Time
train_df = reduce_mem_usage(train_df)
test_df  = reduce_mem_usage(test_df)
Category Encoding
Encoding card4
{'visa': 722693, 'mastercard': 348803, 'american express': 16078, 'discover': 9572}
Encoding card6
{'debit': 828379, 'credit': 268753, 'charge card': 16}
Encoding ProductCD
{'W': 800657, 'C': 137785, 'R': 73346, 'H': 62397, 'S': 23046}
Encoding P_emaildomain
{'gmail.com': 435803, 'yahoo.com': 182784, 'uknown': 163648, 'hotmail.com': 85649, 'anonymous.com': 71062, 'aol.com': 52337, 'comcast.net': 14474, 'icloud.com': 12316, 'outlook.com': 9934, 'att.net': 7647, 'msn.com': 7480, 'sbcglobal.net': 5767, 'live.com': 5720, 'verizon.net': 5011, 'ymail.com': 4075, 'bellsouth.net': 3437, 'yahoo.com.mx': 2827, 'me.com': 2713, 'cox.net': 2657, 'optonline.net': 1937, 'live.com.mx': 1470, 'charter.net': 1443, 'mail.com': 1156, 'rocketmail.com': 1105, 'gmail': 993, 'earthlink.net': 979, 'outlook.es': 863, 'mac.com': 862, 'hotmail.fr': 674, 'hotmail.es': 627, 'frontier.com': 594, 'roadrunner.com': 583, 'juno.com': 574, 'windstream.net': 552, 'web.de': 518, 'aim.com': 468, 'embarqmail.com': 464, 'twc.com': 439, 'frontiernet.net': 397, 'netzero.com': 387, 'centurylink.net': 386, 'q.com': 362, 'yahoo.fr': 344, 'hotmail.co.uk': 334, 'suddenlink.net': 323, 'netzero.net': 319, 'cfl.rr.com': 318, 'cableone.net': 311, 'prodigy.net.mx': 303, 'gmx.de': 298, 'sc.rr.com': 277, 'yahoo.es': 272, 'protonmail.com': 159, 'ptd.net': 140, 'yahoo.de': 137, 'hotmail.de': 130, 'live.fr': 106, 'yahoo.co.uk': 103, 'yahoo.co.jp': 101, 'servicios-ta.com': 80, 'scranton.edu': 2}
Encoding R_emaildomain
{'uknown': 824070, 'gmail.com': 118885, 'hotmail.com': 53166, 'anonymous.com': 39644, 'yahoo.com': 21405, 'aol.com': 7239, 'outlook.com': 5011, 'comcast.net': 3513, 'icloud.com': 2820, 'yahoo.com.mx': 2743, 'msn.com': 1698, 'live.com.mx': 1464, 'live.com': 1444, 'verizon.net': 1202, 'sbcglobal.net': 1163, 'me.com': 1095, 'att.net': 870, 'cox.net': 854, 'outlook.es': 853, 'bellsouth.net': 795, 'hotmail.fr': 667, 'hotmail.es': 595, 'web.de': 514, 'mac.com': 430, 'ymail.com': 405, 'optonline.net': 350, 'mail.com': 341, 'hotmail.co.uk': 317, 'yahoo.fr': 315, 'prodigy.net.mx': 303, 'gmx.de': 297, 'charter.net': 263, 'gmail': 196, 'earthlink.net': 170, 'embarqmail.com': 140, 'yahoo.de': 139, 'hotmail.de': 130, 'rocketmail.com': 126, 'yahoo.es': 124, 'juno.com': 111, 'frontier.com': 110, 'live.fr': 105, 'windstream.net': 104, 'yahoo.co.jp': 104, 'roadrunner.com': 101, 'yahoo.co.uk': 82, 'servicios-ta.com': 80, 'aim.com': 77, 'protonmail.com': 75, 'ptd.net': 70, 'scranton.edu': 69, 'twc.com': 61, 'cfl.rr.com': 57, 'suddenlink.net': 55, 'cableone.net': 46, 'q.com': 45, 'frontiernet.net': 38, 'centurylink.net': 28, 'netzero.com': 24, 'netzero.net': 19, 'sc.rr.com': 14}
Encoding M4
{'M0': 357789, 'M2': 122947, 'M1': 97306}
Mem. usage decreased to 1086.67 Mb (55.8% reduction)
Mem. usage decreased to 939.08 Mb (55.5% reduction)
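The frequency encoding used for card4, card6, ProductCD and the email domains replaces each category with its occurrence count over train and test combined. A toy sketch with invented values, writing to a separate hypothetical column `card4_fq` so the original stays visible:

```python
import pandas as pd

train = pd.DataFrame({'card4': ['visa', 'visa', 'mastercard', None]})
test = pd.DataFrame({'card4': ['visa', 'discover']})

# value_counts ignores missing values by default
counts = pd.concat([train['card4'], test['card4']]).value_counts().to_dict()

train['card4_fq'] = train['card4'].map(counts)
test['card4_fq'] = test['card4'].map(counts)

print(counts)
print(train)
print(test)
```

Missing categories stay NaN after the mapping, and two categories with the same count become indistinguishable, which is a known limitation of this encoding.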

Save the results as .pkl files, which take little space and can be loaded quickly with pd.read_pickle:

## Export
train_df.to_pickle('train_transaction_15.pkl')
test_df.to_pickle('test_transaction_15.pkl')

3.Feature selection + dimensionality reduction (experiment log)

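  1. Ran PCA on blocks of V features. The best number of dimensions could not be determined, and many V features are not numerical and therefore unsuitable for PCA. Adding only the block-wise PCA features did not improve the LB either, so this was dropped.

  2. Since V126-V138 may be cumulative values, took diff() grouped by ProductCD, card1 and addr1. Among them 'V126' 'V127' 'V128' 'V130' 'V131' 'V132' 'V133' 'V134' 'V136' 'V137' are numerical. The LB did not improve, so this was dropped.

  3. Dimensionality reduction:
    (1) Recursive feature elimination for blocks of features: built three blocks (UID_block, D_block, Trans_block) and dropped 30 features. LB 0.9488 -> 0.9487, acceptable.
    (2) PCA on v_cols: applied scaling plus PCA only to the numerical features among V126-V335, with train and test normalized separately. This seems to have resolved most of the train/test distribution mismatch; the distributions no longer differ much, but test still has a great many very high values and a much larger threshold. The other V features are not suitable for PCA.
    (3) Permutation importance
    Tried it on solo features only. Features with importance = 0 across 5 folds: ['C7', 'C7_fq_enc', 'C10_fq_enc', 'addr2', 'M7', 'C4']. The change was expected to be small, so tried removing them. LB: 0.9486 -> 0.9482, a drop, so they cannot be removed yet; they seem to mitigate overfitting.
    (4) Wanted to filter the mean, std and uid-aggregation combination features carefully, since they seemed to be what causes the overfitting.
    The Di_uidagg part was the excess; deleting the D-uid part cut 200 dimensions at once. LB: 0.9486 -> 0.9478, a drop of less than one thousandth... The D uid aggregations are meaningless and are without doubt a cause of overfitting, but removing them does not work without good features to replace them. (Later, generating new uids from the D features and aggregating TransactionAmt over them raised the LB substantially.)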

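  4. change dist:
    1) Pseudo labeling (2 approaches) - oversampling
    (1) Took the extreme samples (within 0.01) from one round of test predictions, added them to train, and reran the 6-fold lgbm:
    the positive rate fell from 0.03499 to 0.019082, almost half, which widened the distribution gap, so this was dropped.
    In summary, oversampling on this dataset would make the class imbalance worse.
    (2) negative downsampling
    Saves time and is useful for testing new features, but it does not improve model performance and discards part of the training data.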

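  5. deeper fe: (find magic)
    (1) Built a rolling window of duplicates: the ratio looked about the same in the test and training sets. LB: 0.9487 --> 0.9486, dropped.
    (2) Improve the distribution + find ways to raise prediction accuracy for the winter period (where the distribution differs most)
    a. Cs: added 15- and 30-day shifts and a 30-day rolling mean and median. LB: 0.9487 --> 0.9486, basically no use.
    b. Vs: 'V144', 'V145', 'V150', 'V159' and 'V160' are V features that show a small hump in winter; added 15- and 30-day shifts before the scaling and PCA step.
    c. Lag features on userid: my laptop does not have enough memory to generate rolling-window lag features, so this was dropped.
    d. Applied uid_aggregation to some of the C features; could this separate merchant users from ordinary users?
    In a sense this also amplifies the influence of the winter data, but there are too few matching samples and it did not raise the AUC of the first fold. It does not seem to work, so it was dropped.
    f. Binned the email features and added a few Transamt*C combinations.
    These features all rank as fairly important, but the LB went 0.9487 --> 0.9486. CV is quite high, so they are kept for now.
    g. Clipped the multi-class columns among the V columns to remove outliers. No LB gain; kept for now.
    h. The earlier handling of the D features was too sloppy, so they were reprocessed; the mean and std obtained from uidagg earlier were practically noise... Keeping only Dsuidagg gives LB: 0.9478 --> 0.9480 (+0.0002); adding scaling then gives LB: 0.9480 --> 0.9486 (+0.0006). This identifies Dsuidagg as the main culprit behind the overfitting.
    i. Built DminusDT; could also try removing features such as uid1 and uid2. Some of the D features fall into two classes, those that identify cards and those that identify users, so consider generating a new uid and cardid. Trans_agg additionally gets the min of Trans, and the scaling part may be kept depending on results. Current dimensionality is 502: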

    • CV: Mean AUC = 0.9460381025052613, Out of folds AUC = 0.9449767448300119, a sharp increase
    • LB: 0.9487 --> 0.9525 (MAGIC HERE)
    • Winter performance: Fold 1 | AUC: 0.9147271019498214

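    j. Analyzing the Xmas surge through had_id: could it be driven by Black Friday and Cyber Monday?
    Does the February surge in offline transaction volume correspond to Chinese New Year? Does had_id indicate online transactions?
    See: miguel perez, Physical vs e-commerce (real dates, clearer)
    [Kaggle Competition] IEEE-CIS Fraud Detection, image 2
    [Kaggle Competition] IEEE-CIS Fraud Detection, image 3
    Using had_id to separate traditional from online: had_id is empty for 75.6% of train, i.e. traditional, and for 71.9% of test. Both daily_trans_count and daily_trans_amt show that the changes in the data fit the shopping-festival hypothesis, so consider using had_id to generate new features.
    How about this: for rows with non-empty had_id, i.e. the online part, compute count/uidagg over the online transactions; for the traditional part, compute count/uidagg over the corresponding traditional transactions; then merge the two into a single column as the new feature. Ds_period_normalization is added as well.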

    • CV: Mean AUC = 0.9465119707733309, Out of folds AUC = 0.9457792719899325, a slight increase
    • LB: 0.9525 --> 0.9526 (Best Single Model)
    • Winter performance improved: Fold 1 | AUC: 0.9179858958040653

    k. Try PCA on Vs_group as new features?
    Not much use: Mean AUC = 0.9466500259256587, Out of folds AUC = 0.9458734168996746, a tiny increase, but the LB stayed at exactly 0.9526.

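  6. Tried changing the CV strategy (currently GroupKFold):
    (1) The earlier CV strategies all suffer from look-ahead leakage across time, so tried 1 fold with time_split. The score fell to 0.9389, so this was abandoned.
    (2) What about sklearn's time_series_split directly? The score fell to 0.9339, so this was abandoned.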
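
To make the difference between the CV strategies in item 6 concrete, the sketch below compares how GroupKFold (grouped by a month column such as DT_M) and scikit-learn's TimeSeriesSplit assign rows to folds. The toy month column and row counts are assumptions for illustration; this is not the competition data and not necessarily the exact split used in the experiments.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# Toy data: 60 rows already sorted by time, 10 rows per month (6 months).
# "month" plays the role of the DT_M column used as the group key.
month = np.repeat(np.arange(6), 10)
X = np.arange(60).reshape(-1, 1)

# GroupKFold: each fold validates on whole months, but the training part
# may contain months that come AFTER the validation month.
gkf_future_leak = []
for train_idx, valid_idx in GroupKFold(n_splits=6).split(X, groups=month):
    gkf_future_leak.append(month[train_idx].max() > month[valid_idx].max())

# TimeSeriesSplit: training rows always precede the validation rows,
# at the price of training on less data in the early folds.
tss_future_leak = []
for train_idx, valid_idx in TimeSeriesSplit(n_splits=5).split(X):
    tss_future_leak.append(train_idx.max() > valid_idx.min())

print("GroupKFold folds that train on later months:", sum(gkf_future_leak))
print("TimeSeriesSplit folds that train on later rows:", sum(tss_future_leak))
```

The lower scores under the time-based splits may partly reflect the smaller training window in each fold rather than a worse model, so the numbers from the two schemes are not directly comparable.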

4.lightGBM+best_parameters

First, load the preprocessed .pkl files:

import lightgbm as lgb
import pandas as pd
import numpy as np
import os, sys
import logging
import operator
import gc
from sklearn import metrics

train_trans = pd.read_pickle('train_transaction_14.pkl')
test_trans = pd.read_pickle('test_transaction_14.pkl')
print('train_set shape after merge:',train_trans.shape)
print('test_set shape after merge:',test_trans.shape)
train_set shape after merge: (590540, 765)
test_set shape after merge: (506691, 765)

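Next, drop the intermediate feature variables that are unneeded, of no value, or impossible to process. At the same time, label-encode the object-typed features that have not been encoded yet so that LightGBM can handle them.
Some features are dropped because they tend to cause overfitting, because they are noise that hurts model performance, or because the feature importance reported by LightGBM is too low; for the remaining dropped features, see: Roman, Recursive feature elimination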

#drop cols
not_use = ['dist2', 'C3', 'D7', 'M1', 'id_04', 'id_07', 'id_08', 'id_10', 'id_16', 'id_18', 'id_21', 'id_22', 'id_23', 'id_24', 'id_25', 'id_26', 'id_27', 'id_28', 'id_29', 'id_34', 'id_35']
rm_cols = ['bank_type','uid1','uid2','uid3','uid4','uid5','DT','DT_W','DT_D','DT_hour','DT_day_week','DT_day','DT_D_total','DT_W_total','DT_M_total','id_30','id_31','id_33']
drop_v_vols = ['V1', 'V2', 'V14', 'V15', 'V16', 'V18', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V31', 'V32', 'V39', 'V41', 'V42', 'V43', 'V50', 'V55', 'V57', 'V65', 'V66', 'V67', 'V68', 'V77', 'V79', 'V86', 'V88', 'V89', 'V98', 'V101', 'V102', 'V103', 'V104', 'V105', 'V106', 'V107', 'V108', 'V109', 'V110', 'V111', 'V112', 'V113', 'V114', 'V115', 'V116', 'V117', 'V118', 'V119', 'V120', 'V121', 'V122', 'V123', 'V124', 'V125', 'V129', 'V132', 'V133', 'V134', 'V135', 'V136', 'V137', 'V141', 'V142', 'V144', 'V148', 'V153', 'V155', 'V157', 'V168', 'V174', 'V179', 'V181', 'V183', 'V185', 'V186', 'V190', 'V191', 'V192', 'V193', 'V194', 'V196', 'V198', 'V199', 'V211', 'V218', 'V230', 'V232', 'V235', 'V236', 'V237', 'V240', 'V241', 'V248', 'V250', 'V252', 'V254', 'V255', 'V260', 'V269', 'V281', 'V284', 'V286', 'V290', 'V293', 'V295', 'V296', 'V297', 'V298', 'V299', 'V300', 'V301', 'V302', 'V305', 'V309', 'V311', 'V316', 'V318', 'V319', 'V320', 'V321', 'V325', 'V327', 'V328', 'V330', 'V334', 'V337', 'V339']
drop_cols2 = ['had_id','M_sum','D9','V138','D9_not_na','card1','TransactionDTday','card1_TransactionAmt_decimal_min', 'bank_type_TransactionAmt_decimal_min', 'card2_TransactionAmt_decimal_min', 'card5_TransactionAmt_decimal_min', 'card3_TransactionAmt_decimal_min']
#rfe_not1 = ['card3_fq_enc','bank_type_D1_mean','bank_type_D7_mean','bank_type_D10_mean','bank_type_D11_mean','D6_DT_W_min_max','D7_DT_W_min_max', 'D7_DT_W_std_score', 'D12_DT_W_min_max', 'D13_DT_W_min_max', 'D6_DT_M_min_max', 'D7_DT_M_std_score','D12_DT_M_min_max','D12_DT_M_std_score','D13_DT_M_min_max']

not_use = not_use + rm_cols + drop_cols2 + drop_v_vols 
train_trans = train_trans.drop(not_use,axis=1)
test_trans = test_trans.drop(not_use,axis=1)
from sklearn.preprocessing import LabelEncoder
# Label-encode the remaining object columns, fitting on train and test together
# so that both share one mapping.
for col in train_trans.columns:
    if train_trans[col].dtype == 'object':
        print(col)  # assumed: the column names listed below imply a print here
        le = LabelEncoder()
        le.fit(list(train_trans[col].astype(str).values) + list(test_trans[col].astype(str).values))
        train_trans[col] = le.transform(list(train_trans[col].astype(str).values))
        test_trans[col] = le.transform(list(test_trans[col].astype(str).values))
P_emaildomain_bin
P_emaildomain_suffix
P_emaildomain_prefix
P_emaildomain_suffix_us
R_emaildomain_bin
R_emaildomain_suffix
R_emaildomain_prefix
R_emaildomain_suffix_us
uid6
cid_1
uid7
DeviceType
DeviceInfo
device_name
device_version
OS_id_30
version_id_30
browser_id_31
version_id_31
screen_width
screen_height

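Drop the pure-noise columns 'TransactionDT' and 'TransactionID', as well as the label 'isFraud'. Split the training set with 6-fold GroupKFold, convert the training and validation parts of each fold into the Dataset type that lgb can handle, set the lgb parameters, and start training. The average of the per-fold predictions is taken as the final prediction.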

#fit_lgb
X = train_trans.sort_values('TransactionDT').drop(['isFraud', 'TransactionDT','TransactionID'], axis=1)
y = train_trans.sort_values('TransactionDT')['isFraud']
X_test = test_trans.drop(['TransactionDT', 'isFraud','TransactionID'], axis=1)
print('the shape of train_df is:',X.shape)
print('the shape of test_df is:',X_test.shape)
the shape of train_df is: (590540, 580)
the shape of test_df is: (506691, 580)
#fit_lgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold
NFOLDS = 6
folds = GroupKFold(n_splits=NFOLDS)

params = {'num_leaves': 491,
          'min_child_weight': 0.03454472573214212,
          'feature_fraction': 0.3797454081646243,
          'bagging_fraction': 0.4181193142567742,
          'min_data_in_leaf': 106,
          'objective': 'binary',
          'max_depth': -1,
          'learning_rate': 0.006883242363721497,
          "boosting_type": "gbdt",
          "bagging_seed": 11,
          "metric": 'auc',
          "verbosity": -1,
          'reg_alpha': 0.3899927210061127,
          'reg_lambda': 0.6485237330340494,
          'random_state': 47,
          'num_threads':4,
          'n_estimators':1800  # takes precedence over the 10000 rounds passed to lgb.train (see the warning in the log)
         }

columns = X.columns
split_groups = X['DT_M']    
splits = folds.split(X, y,groups=split_groups)
y_preds = np.zeros(X_test.shape[0])
y_oof = np.zeros(X.shape[0])
score = 0
feature_importances = pd.DataFrame()
feature_importances['feature'] = columns

for fold_n, (train_index, valid_index) in enumerate(splits):
    print('Fold:',fold_n)
    X_train, X_valid = X[columns].iloc[train_index], X[columns].iloc[valid_index]
    y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]
    
    dtrain = lgb.Dataset(X_train, label=y_train)
    dvalid = lgb.Dataset(X_valid, label=y_valid)

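    # Note: `verbose_eval` and `early_stopping_rounds` are arguments of older LightGBM releases;
    # LightGBM 4.0+ expects callbacks=[lgb.early_stopping(200), lgb.log_evaluation(200)] instead.
    clf = lgb.train(params, dtrain, 10000, valid_sets = [dtrain, dvalid], verbose_eval=200, early_stopping_rounds=200)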
    
    feature_importances[f'fold_{fold_n + 1}'] = clf.feature_importance()
    
    y_pred_valid = clf.predict(X_valid)
    y_oof[valid_index] = y_pred_valid
    print(f"Fold {fold_n + 1} | AUC: {roc_auc_score(y_valid, y_pred_valid)}")
    
    score += roc_auc_score(y_valid, y_pred_valid) / NFOLDS
    y_preds += clf.predict(X_test) / NFOLDS
    
    del X_train, X_valid, y_train, y_valid
    gc.collect()

print(f"\nMean AUC = {score}")
print(f"Out of folds AUC = {roc_auc_score(y, y_oof)}")
Fold: 0
//anaconda3/lib/python3.7/site-packages/lightgbm/engine.py:113: UserWarning: Found `n_estimators` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
Training until validation scores don't improve for 200 rounds.
[200]	training's auc: 0.970841	valid_1's auc: 0.888492
[400]	training's auc: 0.989367	valid_1's auc: 0.902119
[600]	training's auc: 0.99698	valid_1's auc: 0.909347
[800]	training's auc: 0.999369	valid_1's auc: 0.913024
[1000]	training's auc: 0.999881	valid_1's auc: 0.915294
[1200]	training's auc: 0.999982	valid_1's auc: 0.916427
[1400]	training's auc: 0.999998	valid_1's auc: 0.916971
[1600]	training's auc: 1	valid_1's auc: 0.917535
[1800]	training's auc: 1	valid_1's auc: 0.917986
Did not meet early stopping. Best iteration is:
[1800]	training's auc: 1	valid_1's auc: 0.917986
Fold 1 | AUC: 0.9179858958040653
Fold: 1
//anaconda3/lib/python3.7/site-packages/lightgbm/engine.py:113: UserWarning: Found `n_estimators` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
Training until validation scores don't improve for 200 rounds.
[200]	training's auc: 0.96973	valid_1's auc: 0.920247
[400]	training's auc: 0.988597	valid_1's auc: 0.934074
[600]	training's auc: 0.996845	valid_1's auc: 0.941598
[800]	training's auc: 0.999326	valid_1's auc: 0.944648
[1000]	training's auc: 0.999883	valid_1's auc: 0.946392
[1200]	training's auc: 0.999983	valid_1's auc: 0.947476
[1400]	training's auc: 0.999998	valid_1's auc: 0.948062
[1600]	training's auc: 1	valid_1's auc: 0.94845
[1800]	training's auc: 1	valid_1's auc: 0.948716
Did not meet early stopping. Best iteration is:
[1798]	training's auc: 1	valid_1's auc: 0.948727
Fold 2 | AUC: 0.9487270375003357
Fold: 2
//anaconda3/lib/python3.7/site-packages/lightgbm/engine.py:113: UserWarning: Found `n_estimators` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
Training until validation scores don't improve for 200 rounds.
[200]	training's auc: 0.967541	valid_1's auc: 0.920002
[400]	training's auc: 0.987839	valid_1's auc: 0.935714
[600]	training's auc: 0.996483	valid_1's auc: 0.943642
[800]	training's auc: 0.999224	valid_1's auc: 0.94732
[1000]	training's auc: 0.999857	valid_1's auc: 0.949401
[1200]	training's auc: 0.999978	valid_1's auc: 0.950679
[1400]	training's auc: 0.999998	valid_1's auc: 0.951512
[1600]	training's auc: 1	valid_1's auc: 0.952117
[1800]	training's auc: 1	valid_1's auc: 0.952431
Did not meet early stopping. Best iteration is:
[1798]	training's auc: 1	valid_1's auc: 0.95242
Fold 3 | AUC: 0.9524195779901877
Fold: 3
//anaconda3/lib/python3.7/site-packages/lightgbm/engine.py:113: UserWarning: Found `n_estimators` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
Training until validation scores don't improve for 200 rounds.
[200]	training's auc: 0.968861	valid_1's auc: 0.917606
[400]	training's auc: 0.987988	valid_1's auc: 0.931882
[600]	training's auc: 0.996424	valid_1's auc: 0.93957
[800]	training's auc: 0.999185	valid_1's auc: 0.94231
[1000]	training's auc: 0.999843	valid_1's auc: 0.943998
[1200]	training's auc: 0.999974	valid_1's auc: 0.945387
[1400]	training's auc: 0.999996	valid_1's auc: 0.946162
[1600]	training's auc: 1	valid_1's auc: 0.946454
[1800]	training's auc: 1	valid_1's auc: 0.946702
Did not meet early stopping. Best iteration is:
[1800]	training's auc: 1	valid_1's auc: 0.946702
Fold 4 | AUC: 0.9467022988334145
Fold: 4
//anaconda3/lib/python3.7/site-packages/lightgbm/engine.py:113: UserWarning: Found `n_estimators` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
Training until validation scores don't improve for 200 rounds.
[200]	training's auc: 0.967885	valid_1's auc: 0.933123
[400]	training's auc: 0.987962	valid_1's auc: 0.945288
[600]	training's auc: 0.996395	valid_1's auc: 0.950617
[800]	training's auc: 0.999154	valid_1's auc: 0.952375
[1000]	training's auc: 0.999834	valid_1's auc: 0.953323
[1200]	training's auc: 0.999972	valid_1's auc: 0.954082
[1400]	training's auc: 0.999997	valid_1's auc: 0.954569
[1600]	training's auc: 1	valid_1's auc: 0.954912
[1800]	training's auc: 1	valid_1's auc: 0.955194
Did not meet early stopping. Best iteration is:
[1800]	training's auc: 1	valid_1's auc: 0.955194
Fold 5 | AUC: 0.9551941380371436
Fold: 5
//anaconda3/lib/python3.7/site-packages/lightgbm/engine.py:113: UserWarning: Found `n_estimators` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
Training until validation scores don't improve for 200 rounds.
[200]	training's auc: 0.96762	valid_1's auc: 0.92463
[400]	training's auc: 0.987262	valid_1's auc: 0.942792
[600]	training's auc: 0.996126	valid_1's auc: 0.951216
[800]	training's auc: 0.999056	valid_1's auc: 0.954502
[1000]	training's auc: 0.999806	valid_1's auc: 0.956104
[1200]	training's auc: 0.999966	valid_1's auc: 0.957193
[1400]	training's auc: 0.999995	valid_1's auc: 0.958021
[1600]	training's auc: 0.999999	valid_1's auc: 0.958572
[1800]	training's auc: 1	valid_1's auc: 0.958972
Did not meet early stopping. Best iteration is:
[1800]	training's auc: 1	valid_1's auc: 0.958972
Fold 6 | AUC: 0.9589715765186857

Mean AUC = 0.9465119707733309 
Out of folds AUC = 0.9457792719899325

Next, write the predictions to a file:

#prediction
submission = pd.DataFrame({'TransactionID':test_trans['TransactionID'],'isFraud':y_preds})
print('submission_shape is:',submission.shape)
submission.to_csv('submission_16.3.csv',index = False)

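Uploads from within China have been getting slower and slower; the results can also be submitted to Kaggle from Linux on the command line, for example: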

kaggle competitions submit -c ieee-fraud-detection -f submission.csv -m "Message"

Take a quick look at feature importance, starting with the most important features and their scores.

feature_importances['average'] = feature_importances[[f'fold_{fold_n + 1}' for fold_n in range(folds.n_splits)]].mean(axis=1)
feature_importances.to_csv('feature_importances.csv')
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
plt.figure(figsize=(16, 16))
sns.barplot(data=feature_importances.sort_values(by='average', ascending=False).head(50), x='average', y='feature');
plt.title('50 TOP feature importance over {} folds average'.format(folds.n_splits));

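[Kaggle Competition] IEEE-CIS Fraud Detection, image 4
Then look at the least important features and decide whether to remove them or process them differently during optimization.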
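
The post does not show code for this step. A minimal sketch, assuming a table shaped like the feature_importances frame built in the training loop (one fold_k column per fold plus an average column); the toy feature values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the feature_importances frame (hypothetical numbers).
toy_importances = pd.DataFrame({
    "feature": ["TransactionAmt", "card2", "C7", "addr2", "M7"],
    "fold_1": [900, 400, 0, 3, 0],
    "fold_2": [950, 380, 0, 1, 0],
})
fold_cols = [c for c in toy_importances.columns if c.startswith("fold_")]
toy_importances["average"] = toy_importances[fold_cols].mean(axis=1)

# Features with the lowest average split importance across folds.
lowest = toy_importances.sort_values(by="average").head(3)

# Features never used for a split in any fold: candidates for removal,
# to be confirmed against CV/LB before actually dropping them.
never_used = toy_importances.loc[
    (toy_importances[fold_cols] == 0).all(axis=1), "feature"
].tolist()

print(lowest[["feature", "average"]])
print(never_used)
```

Low importance alone is not a safe removal criterion: in the permutation importance experiment in section 3, removing the zero-importance features lowered the LB from 0.9486 to 0.9482.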

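After the feature engineering improvements, LightGBM was trained to produce predictions, and the feature processing was refined repeatedly by analyzing AUC and LB; the single-model LB score eventually rose to 0.9526.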

5. Internal blend

Blend considerations:
(1) submission16 CV: Mean AUC = 0.9460381025052613, Out of folds AUC = 0.9449767448300119, LB: 0.9525
(2) submission16.2 CV: Mean AUC = 0.9464855222093469, Out of folds AUC = 0.9457199040811567, LB: 0.9525 --> winter seems to have improved, 0.917
(3) submission16.3 CV: Mean AUC = 0.9465119707733309, Out of folds AUC = 0.9457792719899325, a slight increase. LB: 0.9525 --> 0.9526. Winter performance improved: Fold 1 | AUC: 0.9179858958040653

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
sub_1 = pd.read_csv('blends/submission_16.9449.csv')
sub_2 = pd.read_csv('blends/submission_16.94571.csv')
sub_3 = pd.read_csv('blends/submission_16.94577.csv')
sub_4 = pd.read_csv('output/submission_09.9487.csv')
sub_5 = pd.read_csv('output/submission_07.9477.csv')

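# Equal-weight blend as an unnormalized sum; AUC is rank-based, so dividing by 5 would not change the score.
sub_1['isFraud'] = sub_1['isFraud'] + sub_2['isFraud'] + sub_3['isFraud'] + sub_4['isFraud'] + sub_5['isFraud']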
sub_1.to_csv('submission_blend1.csv', index=False)
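
Adding the columns row by row relies on every submission file having the same row order. A minimal sketch of an equal-weight blend that matches rows by TransactionID instead, using small in-memory frames in place of the submission files (the toy values are assumptions for illustration, and each ID is assumed to appear exactly once per file):

```python
import pandas as pd

# Toy submissions (hypothetical values); the second one is in a different row order.
sub_a = pd.DataFrame({"TransactionID": [1, 2, 3], "isFraud": [0.10, 0.80, 0.30]})
sub_b = pd.DataFrame({"TransactionID": [3, 1, 2], "isFraud": [0.50, 0.20, 0.60]})
subs = [sub_a, sub_b]

# Stack all submissions, then average the predictions per TransactionID,
# so rows are matched by ID rather than by position.
stacked = pd.concat(subs, ignore_index=True)
blend = stacked.groupby("TransactionID", as_index=False)["isFraud"].mean()

print(blend)
```

Averaging also keeps the blended score in the [0, 1] range, which makes the output easier to inspect even though it ranks the rows the same way as the sum.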

6.Final result

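[Kaggle Competition] IEEE-CIS Fraud Detection, image 5
Although the blend's LB of 0.9532 is well above the single model's 0.9526, the single model performed better in the final result. Even though the blend combines several models, it can still overfit the LB.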
