Undersampling and Oversampling Methods

1. Oversampling with SMOTE

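When oversampling with SMOTE, split the data into training and validation sets first, and then oversample only the training set. If you oversample before splitting, synthetic points built from validation rows end up in the training data, the validation score becomes far too optimistic, and the model overfits badly.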
https://beckernick.github.io/oversampling-modeling/
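The effect of the ordering can be seen in a small standard-library sketch. It uses random duplication of minority rows as a stand-in for SMOTE (it is not SMOTE), and the helper names make_rows, oversample, split and leaked_ids are made up for this illustration. Each row carries an id so that copies can be traced back to their source row.

```python
import random

def make_rows(n_major=90, n_minor=10):
    # Each row is (row_id, label); row_id lets us trace copies to their source.
    rows = [(i, 0) for i in range(n_major)]
    rows += [(n_major + i, 1) for i in range(n_minor)]
    return rows

def oversample(rows, rng):
    # Duplicate random minority rows until both classes have the same size.
    # This is a simple stand-in for SMOTE, not SMOTE itself.
    major = [r for r in rows if r[1] == 0]
    minor = [r for r in rows if r[1] == 1]
    if not minor or len(minor) >= len(major):
        return list(rows)
    extra = [rng.choice(minor) for _ in range(len(major) - len(minor))]
    return rows + extra

def split(rows, rng, val_fraction=0.15):
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]  # train, validation

def leaked_ids(train, val):
    # Source rows that show up on both sides of the split.
    return {r[0] for r in train} & {r[0] for r in val}

rng = random.Random(1234)
rows = make_rows()

# Wrong order: oversample, then split.
train_bad, val_bad = split(oversample(rows, rng), rng)

# Right order: split, then oversample the training part only.
train_raw, val_ok = split(rows, rng)
train_ok = oversample(train_raw, rng)

print('leaked (oversample first):', len(leaked_ids(train_bad, val_bad)))
print('leaked (split first):', len(leaked_ids(train_ok, val_ok)))
```

With the split done first, no source row can appear on both sides, so the second count is zero by construction. SMOTE interpolates instead of copying, but the synthetic points are still derived from rows that may sit in the validation set, so the same reasoning applies.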

Usage:

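from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Split first; train_df, predictors and target are defined elsewhere.
X_train, X_val, y_train, y_val = train_test_split(train_df[predictors], train_df[target], test_size=0.15, random_state=1234)
# Recent imbalanced-learn: sampling_strategy replaces ratio, fit_resample replaces
# fit_sample, and SMOTE no longer takes kind / m_neighbors. A fixed seed keeps runs reproducible.
oversampler = SMOTE(sampling_strategy='auto', random_state=1234, k_neighbors=5)
os_X_train, os_y_train = oversampler.fit_resample(X_train, y_train)
print('Resampled dataset shape {}'.format(Counter(os_y_train)))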
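As a rough picture of what k_neighbors controls, here is a minimal standard-library sketch of the interpolation idea behind SMOTE. It is not imbalanced-learn's implementation: the function name smote_like and the toy points are invented for illustration. Each synthetic point is placed on the line segment between a minority point and one of its k nearest minority neighbours.

```python
import math
import random

def smote_like(minority, n_new, k_neighbors=5, seed=0):
    """Create n_new synthetic points by interpolating between minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of base (base itself excluded).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k_neighbors])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority-class points (hypothetical data).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.8), (1.3, 1.0)]
new_points = smote_like(minority, n_new=4)
print(new_points)
```

Because every synthetic point lies between two existing minority points, a larger k_neighbors spreads the new points over a wider region, while a smaller one keeps them close to their source point.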

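Note: after oversampling, do not pass a pandas.DataFrame straight to the model, because the feature names have changed. Older imbalanced-learn releases return plain NumPy arrays from resampling, so the column names are lost. Pass the validation features as an array as well (X_val.values) so that the training and validation data match.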

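import numpy as np
from xgboost import XGBClassifier

# Assumes xgboost >= 1.6, where eval_metric and early_stopping_rounds
# are set on the estimator instead of being passed to fit().
model = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    n_jobs=-1,             # replaces the deprecated nthread
    scale_pos_weight=1,
    random_state=27,       # replaces the deprecated seed
    eval_metric='auc',
    early_stopping_rounds=3,
)

# Pass plain arrays on both sides so training and validation features line up.
model.fit(
    np.asarray(os_X_train),
    os_y_train,
    eval_set=[(X_val.values, y_val)],
    verbose=True,
)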

https://www.kaggle.com/ktattan/recall-97-with-smote-random-forest-tsne

2. Undersampling (also called downsampling)

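import pandas as pd

def down_sample(df, frac=0.1, random_state=None):
    """
    Undersampling: keep every row with acc_now_delinq == 1 and a random
    fraction (10% by default) of the rows with acc_now_delinq == 0,
    which are assumed to be the majority class.
    """
    positives = df[df['acc_now_delinq'] == 1]
    negatives = df[df['acc_now_delinq'] == 0]
    # Pass random_state to get the same sample on every run.
    sampled_negatives = negatives.sample(frac=frac, random_state=random_state)
    return pd.concat([positives, sampled_negatives], ignore_index=True)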
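The 0.1 sampling fraction in down_sample is a fixed choice. If you would rather aim for a given number of negatives per positive, the fraction can be derived from the class counts. The helper below is a small sketch of that arithmetic. The name negative_fraction and the example counts are made up for illustration and do not come from the original data.

```python
def negative_fraction(n_pos, n_neg, neg_per_pos=1.0):
    """Fraction of negative rows to keep so that kept negatives ~= neg_per_pos * n_pos."""
    if n_neg <= 0:
        raise ValueError('n_neg must be positive')
    # Never ask for more than all of the negatives.
    return min(1.0, neg_per_pos * n_pos / n_neg)

# Hypothetical counts: 500 positives, 50,000 negatives, 10 negatives per positive.
frac = negative_fraction(n_pos=500, n_neg=50_000, neg_per_pos=10)
print(frac)
```

The result can be used as the frac argument of the .sample() call in down_sample. The class counts themselves can be read from df['acc_now_delinq'].value_counts().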
