For binary classification tasks on tabular data, the code from [1] is as follows:
import numpy as np

def under_sample(df_train):
    # Number of fraud (isFraud == 1) records
    number_records_fraud = len(df_train[df_train.isFraud == 1])
    # Index labels of the isFraud == 1 rows
    fraud_indices = np.array(df_train[df_train.isFraud == 1].index)
    # Index labels of the isFraud == 0 rows
    normal_indices = df_train[df_train.isFraud == 0].index
    # replace=True: the same element may be drawn more than once.
    # replace=False: each element can be drawn at most once.
    random_normal_indices = np.random.choice(normal_indices, number_records_fraud * 14, replace=False)
    random_normal_indices = np.array(random_normal_indices)
    # Index labels of all rows to keep
    under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
    # All isFraud == 1 rows plus the sampled isFraud == 0 rows.
    # .loc is used because the values above are index labels, not positions.
    under_sample_data = df_train.loc[under_sample_indices, :].reset_index(drop=True)
    return under_sample_data
Usage:
Change the 14 in the code above.
The 14 means that the number of sampled isFraud=0 rows is 14 times the number of isFraud=1 rows.
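According to [2], based on that author's own experience:
do not use upsampling (sampling with replacement from the minority class);
prefer downsampling (sampling without replacement from the majority class).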
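To make the difference between the two strategies concrete, here is a minimal sketch on toy data. The DataFrame, its `amount` column, the seed, and the 3:1 ratio are all assumptions made for illustration; only the `isFraud` column name comes from the code in this post.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 2 fraud rows and 8 normal rows.
df = pd.DataFrame({
    "amount": np.arange(10),
    "isFraud": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
})

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

fraud_idx = df.index[df.isFraud == 1].to_numpy()
normal_idx = df.index[df.isFraud == 0].to_numpy()

# Upsampling: draw minority rows WITH replacement until they match the majority.
up_fraud_idx = rng.choice(fraud_idx, size=len(normal_idx), replace=True)
upsampled = df.loc[np.concatenate([up_fraud_idx, normal_idx])]

# Downsampling: draw majority rows WITHOUT replacement, here 3 per fraud row.
down_normal_idx = rng.choice(normal_idx, size=len(fraud_idx) * 3, replace=False)
downsampled = df.loc[np.concatenate([fraud_idx, down_normal_idx])]

# Upsampling repeats rows, downsampling only drops rows.
print(len(upsampled), upsampled.index.is_unique)
print(len(downsampled), downsampled.index.is_unique)
```

The upsampled frame necessarily contains repeated fraud rows, while the downsampled frame contains each kept row exactly once.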
The code in [3] is:
import numpy as np

def Negativedownsampling(train, ratio):
    # Number of data points in the minority class
    number_records_fraud = len(train[train.isFraud == 1])
    fraud_indices = np.array(train[train.isFraud == 1].index)
    # Picking the indices of the normal classes
    normal_indices = train[train.isFraud == 0].index
    # Out of the indices we picked, randomly select number_records_fraud * ratio of them
    random_normal_indices = np.random.choice(normal_indices, number_records_fraud * ratio, replace=False)
    random_normal_indices = np.array(random_normal_indices)
    # Appending the 2 indices
    under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
    # Under sample dataset (.loc, because these are index labels, not positions)
    under_sample_data = train.loc[under_sample_indices, :]
    # Showing ratio
    print("Percentage of normal transactions: ", round(len(under_sample_data[under_sample_data.isFraud == 0]) / len(under_sample_data), 2) * 100, "%")
    print("Percentage of fraud transactions: ", round(len(under_sample_data[under_sample_data.isFraud == 1]) / len(under_sample_data), 2) * 100, "%")
    print("Total number of transactions in resampled data: ", len(under_sample_data))
    return under_sample_data
Usage:
df_train_resampling_1 = Negativedownsampling(df_train, 9)
Percentage of normal transactions: 90.0 %
Percentage of fraud transactions: 10.0 %
Total number of transactions in resampled data: 206630
df_train_resampling_2 = Negativedownsampling(df_train, 3)
Percentage of normal transactions: 75.0 %
Percentage of fraud transactions: 25.0 %
Total number of transactions in resampled data: 82652
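The printed percentages follow directly from the ratio: keeping `ratio` normal rows per fraud row gives a normal share of ratio / (ratio + 1). A small sketch of that arithmetic follows; the helper name `expected_shares` is hypothetical, and the fraud count 20663 is not stated in the post but inferred from the totals above (206630 / 10 and 82652 / 4).

```python
def expected_shares(n_fraud, ratio):
    """Class shares and total row count after keeping `ratio` normal rows per fraud row."""
    n_normal = n_fraud * ratio
    total = n_fraud + n_normal
    return n_normal / total, n_fraud / total, total

# 20663 fraud rows is an inferred value, see the note above.
for ratio in (9, 3):
    normal_share, fraud_share, total = expected_shares(20663, ratio)
    print(ratio, round(normal_share * 100, 1), round(fraud_share * 100, 1), total)
```

With these inputs the totals come out to 206630 and 82652, matching the output shown above.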
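If you use this together with GroupKFold, remember to call reset_index(drop=True) on the resampled data. GroupKFold.split returns positional indices, so a clean 0..n-1 index keeps positions and index labels aligned.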
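A minimal sketch of that combination, assuming scikit-learn is available. The toy frame, its `feature` and `group` columns, the scattered index labels, and the choice of 4 splits are all hypothetical and stand in for a downsampled training set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy stand-in for a downsampled frame: the surviving index labels are scattered.
sampled = pd.DataFrame(
    {
        "feature": np.arange(8),
        "isFraud": [1, 0, 0, 0, 1, 0, 0, 0],
        "group": [0, 0, 1, 1, 2, 2, 3, 3],  # hypothetical grouping column
    },
    index=[5, 42, 7, 99, 13, 64, 21, 80],
)

# After the reset, index labels coincide with positions 0..n-1.
sampled = sampled.reset_index(drop=True)

X = sampled[["feature"]]
y = sampled["isFraud"]
groups = sampled["group"]

# Count how often each row lands in a validation fold.
valid_counts = np.zeros(len(sampled), dtype=int)

for train_idx, valid_idx in GroupKFold(n_splits=4).split(X, y, groups):
    # split() yields positional indices, so select rows with .iloc.
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_valid, y_valid = X.iloc[valid_idx], y.iloc[valid_idx]
    valid_counts[valid_idx] += 1
    # No group appears on both sides of a split.
    assert set(groups.iloc[train_idx]).isdisjoint(groups.iloc[valid_idx])
```

Because the index was reset, arrays filled by position (such as `valid_counts` here, or out-of-fold predictions) line up with the rows of the frame.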
Sources:
[1]https://www.kaggle.com/c/ieee-fraud-detection/discussion/108616#latest-625841
[2]https://www.kaggle.com/c/ieee-fraud-detection/discussion/100268#latest-581989
[3]https://www.kaggle.com/zero92/negative-downsampling-improve-the-score