Downsampling code

For binary classification tasks on tabular data

The code from [1] is as follows:

import numpy as np

def downsample(df_train):
    number_records_fraud = len(df_train[df_train.isFraud == 1])  # number of isFraud == 1 rows

    fraud_indices = np.array(df_train[df_train.isFraud == 1].index)  # index labels of the isFraud == 1 rows
    normal_indices = df_train[df_train.isFraud == 0].index  # index labels of the isFraud == 0 rows

    random_normal_indices = np.random.choice(normal_indices, number_records_fraud*14, replace = False)
    # replace=True: the same element of a may be picked more than once.
    # replace=False: each element of a can be picked at most once.
    random_normal_indices = np.array(random_normal_indices)

    under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])  # index labels of all kept rows
    # Use .loc, not .iloc: these are index labels, which equal row positions only for a default RangeIndex.
    # Result: every isFraud == 1 row plus the sampled isFraud == 0 rows.
    under_sample_data = df_train.loc[under_sample_indices,:].reset_index(drop=True)
    return under_sample_data
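
One consequence of replace=False is that the requested sample cannot be larger than the number of majority-class rows. The sketch below illustrates this on made-up index values; the names `normal_indices`, `n_fraud`, and `safe_size` are illustrative and the guard with `min` is only one possible way to handle the case, not something taken from [1].

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded generator, for reproducibility
normal_indices = np.arange(100)         # pretend there are 100 majority rows
n_fraud = 10                            # and 10 minority rows

# Without replacement, every picked index is unique.
picked = rng.choice(normal_indices, n_fraud * 5, replace=False)
n_unique = len(np.unique(picked))

# Asking for more rows than exist raises ValueError when replace=False.
try:
    rng.choice(normal_indices, n_fraud * 14, replace=False)
    oversample_failed = False
except ValueError:
    oversample_failed = True

# One possible guard: cap the request at the number of available majority rows.
safe_size = min(n_fraud * 14, len(normal_indices))
capped = rng.choice(normal_indices, safe_size, replace=False)
```

With the cap in place, a multiplier that is too large simply keeps all majority rows instead of failing.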

 

Usage:
Change the 14 in the code above.

The 14 means that the number of sampled isFraud=0 rows is 14 times the number of isFraud=1 rows.
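
To get a feel for what a given multiplier does to the class balance: keeping r majority rows per minority row gives a minority share of 1/(1+r). The helper name `minority_share` below is hypothetical and only serves this small calculation.

```python
def minority_share(ratio: int) -> float:
    """Fraction of isFraud == 1 rows when `ratio` isFraud == 0 rows are kept per fraud row."""
    return 1 / (1 + ratio)

# Print the fraud percentage for a few multipliers.
for r in (3, 9, 14):
    print(r, round(100 * minority_share(r), 2))
```

A multiplier of 14 therefore leaves roughly one fraud row in every fifteen rows.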

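According to [2], based on that author's experience:

Do not use upsampling (sampling with replacement from the smaller class).

Prefer downsampling (sampling without replacement from the larger class).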
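
The difference between the two approaches can be sketched with pandas `DataFrame.sample` on toy data. The column `amount`, the sample sizes, and the variable names are made up for illustration; this is not the code from [2].

```python
import pandas as pd

# Toy data: 90 majority rows and 10 minority rows (made-up values).
df = pd.DataFrame({"isFraud": [0] * 90 + [1] * 10, "amount": range(100)})
minority = df[df.isFraud == 1]
majority = df[df.isFraud == 0]

# Downsampling: draw majority rows WITHOUT replacement, so no row is repeated.
down = pd.concat([minority, majority.sample(n=30, replace=False, random_state=0)])

# Upsampling: draw minority rows WITH replacement, so rows are duplicated.
up = pd.concat([majority, minority.sample(n=90, replace=True, random_state=0)])

print(down.isFraud.value_counts().to_dict())
print(up.isFraud.value_counts().to_dict())
print("duplicated rows after upsampling:", up.index.duplicated().any())
```

Downsampling discards data but keeps every remaining row unique, while upsampling keeps all data at the cost of repeating minority rows.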

 

The code from [3] is:

import numpy as np

def Negativedownsampling(train, ratio) :

    # Number of data points in the minority class
    number_records_fraud = len(train[train.isFraud == 1])
    fraud_indices = np.array(train[train.isFraud == 1].index)

    # Picking the indices of the normal classes
    normal_indices = train[train.isFraud == 0].index

    # Out of the indices we picked, randomly select number_records_fraud*ratio of them
    random_normal_indices = np.random.choice(normal_indices, number_records_fraud*ratio, replace = False)
    random_normal_indices = np.array(random_normal_indices)

    # Appending the 2 indices
    under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

    # Under sample dataset (.loc because these are index labels, not row positions)
    under_sample_data = train.loc[under_sample_indices,:]

    # Showing ratio
    print("Percentage of normal transactions: ", round(100 * len(under_sample_data[under_sample_data.isFraud == 0])/len(under_sample_data), 2), "%")
    print("Percentage of fraud transactions: ", round(100 * len(under_sample_data[under_sample_data.isFraud == 1])/len(under_sample_data), 2), "%")
    print("Total number of transactions in resampled data: ", len(under_sample_data))

    return under_sample_data

Usage:

df_train_resampling_1 = Negativedownsampling(df_train, 9)
Percentage of normal transactions:  90.0 %
Percentage of fraud transactions:  10.0 %
Total number of transactions in resampled data:  206630
df_train_resampling_2 = Negativedownsampling(df_train, 3)
Percentage of normal transactions:  75.0 %
Percentage of fraud transactions:  25.0 %
Total number of transactions in resampled data:  82652

 

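If you use this together with GroupKFold, remember to call reset_index(drop=True) on the downsampled data, so that the index labels match the positional indices that GroupKFold returns.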
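
A minimal sketch of that combination, assuming scikit-learn is installed. The toy frame, the grouping column `card_id`, and the `oof` array are made up for illustration; the index with gaps stands in for what is left after rows have been dropped.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy stand-in for downsampled data: the index has gaps because rows were dropped.
df = pd.DataFrame(
    {"isFraud": [1, 0, 0, 1, 0, 0], "card_id": [1, 1, 2, 2, 3, 3]},
    index=[0, 5, 9, 12, 20, 31],
)

# After this call the labels are 0..n-1 and coincide with row positions.
df = df.reset_index(drop=True)

oof = np.zeros(len(df))  # one slot per row, filled fold by fold
gkf = GroupKFold(n_splits=3)
for fold, (train_idx, valid_idx) in enumerate(gkf.split(df, df.isFraud, groups=df.card_id)):
    # split() yields positional indices; with a reset index, .loc and .iloc agree.
    assert df.loc[valid_idx].equals(df.iloc[valid_idx])
    # No group appears on both sides of a split.
    assert set(df.card_id.iloc[train_idx]).isdisjoint(df.card_id.iloc[valid_idx])
    oof[valid_idx] = fold
```

Without the reset, label-based lookups such as `df.loc[valid_idx]` would either fail or pick the wrong rows, because the positions returned by `split` would no longer correspond to the index labels.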

 

Reposted from:

[1]https://www.kaggle.com/c/ieee-fraud-detection/discussion/108616#latest-625841

[2]https://www.kaggle.com/c/ieee-fraud-detection/discussion/100268#latest-581989

[3]https://www.kaggle.com/zero92/negative-downsampling-improve-the-score
