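When oversampling with SMOTE, split the data into training and validation sets first, then oversample only the training set. Oversampling before the split lets samples derived from validation rows leak into training, so the validation score looks far better than it really is and severe overfitting goes unnoticed.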
https://beckernick.github.io/oversampling-modeling/
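The effect of the ordering can be seen in a small standard-library sketch. It swaps SMOTE for plain random duplication of minority rows (an assumption made to keep the example short; SMOTE interpolates instead of copying, but leaks in the same way), and the helper names split, oversample and shared_ids are hypothetical.

```python
import random

def split(rows, val_fraction, rng):
    """Shuffle a copy of rows and return (train, val)."""
    rows = list(rows)
    rng.shuffle(rows)
    n_val = int(len(rows) * val_fraction)
    return rows[n_val:], rows[:n_val]

def oversample(rows, rng):
    """Duplicate random minority rows until both classes have equal size."""
    minority = [r for r in rows if r[1] == 1]
    majority = [r for r in rows if r[1] == 0]
    if not minority:
        return list(rows)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(rows) + extra

def shared_ids(train, val):
    """Row ids that appear in both the training and the validation set."""
    return {r[0] for r in train} & {r[0] for r in val}

# Each row is (row_id, label); ids 0..9 are the minority class.
rows = [(i, 1 if i < 10 else 0) for i in range(100)]

# Wrong order: oversample everything, then split.
rng = random.Random(1234)
train, val = split(oversample(rows, rng), 0.15, rng)
leaked_wrong = shared_ids(train, val)

# Right order: split, then oversample the training part only.
rng = random.Random(1234)
train, val = split(rows, 0.15, rng)
train = oversample(train, rng)
leaked_right = shared_ids(train, val)

print(len(leaked_wrong), len(leaked_right))
```

With the right order the second number is zero by construction, because the validation rows are set aside before any copies are made.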
Usage:
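from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Split first so the validation set contains only real samples.
X_train, X_val, y_train, y_val = train_test_split(train_df[predictors], train_df[target], test_size=0.15, random_state=1234)
# Recent imbalanced-learn releases use sampling_strategy (formerly ratio) and fit_resample
# (formerly fit_sample); kind and m_neighbors are no longer SMOTE arguments. Fixed seed for reproducibility.
oversampler = SMOTE(sampling_strategy='auto', random_state=1234, k_neighbors=5)
os_X_train, os_y_train = oversampler.fit_resample(X_train, y_train)
print('Resampled dataset shape {}'.format(Counter(os_y_train)))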
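For intuition about what the oversampler generates, here is a simplified standard-library sketch of the core SMOTE step: pick a minority point, pick one of its k nearest minority neighbours, and create a new point somewhere on the segment between them. This is not imbalanced-learn's implementation; the function name smote_like and the toy points are assumptions made for illustration.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_like(minority_points, n_new=6)
for point in new_points:
    print(point)
```

Every synthetic point is built from real minority rows, which is why the rows used to build it must all come from the training set.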
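Note: after oversampling, do not pass a pandas.DataFrame straight to the model alongside the resampled data. Older imbalanced-learn versions return plain NumPy arrays, so the feature names no longer match; pass plain arrays instead (for example X_val.values).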
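import numpy as np
from xgboost import XGBClassifier

model = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    n_jobs=-1,           # current name for nthread
    scale_pos_weight=1,
    random_state=27,     # current name for seed
    # XGBoost >= 1.6 takes the next two in the constructor; older versions take them in fit().
    eval_metric='auc',
    early_stopping_rounds=3,
)
model.fit(
    np.asarray(os_X_train),  # plain arrays on both sides, so feature names cannot mismatch
    os_y_train,
    eval_set=[(X_val.values, y_val)],  # the validation set is never oversampled
    verbose=True,
)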
https://www.kaggle.com/ktattan/recall-97-with-smote-random-forest-tsne
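import pandas as pd

def down_sample(df):
    """
    Undersampling: keep every row with acc_now_delinq == 1 and a random 10% of the rest.
    """
    df1 = df[df['acc_now_delinq'] == 1]  # rows kept in full
    df2 = df[df['acc_now_delinq'] == 0]  # rows to be thinned out
    df3 = df2.sample(frac=0.1)           # random 10% of those rows
    return pd.concat([df1, df3], ignore_index=True)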
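A slightly more general variant of the undersampling function might look like the sketch below. The name down_sample_by, its parameters and the toy DataFrame are assumptions for illustration; only pandas calls that the original function already relies on (sample and concat) are used, plus random_state to make the draw repeatable.

```python
import pandas as pd

def down_sample_by(df, label_col, majority_frac=0.1, seed=0):
    """Keep all rows with label 1 and a random fraction of the rows with label 0."""
    positives = df[df[label_col] == 1]
    negatives = df[df[label_col] == 0]
    kept = negatives.sample(frac=majority_frac, random_state=seed)
    return pd.concat([positives, kept], ignore_index=True)

# Toy data: 10 rows with label 1 and 90 rows with label 0.
toy = pd.DataFrame({
    "feature": range(100),
    "acc_now_delinq": [1] * 10 + [0] * 90,
})
balanced = down_sample_by(toy, "acc_now_delinq")
counts = balanced["acc_now_delinq"].value_counts().to_dict()
print(counts)
```

As with oversampling, this should be applied to the training split only, so that the validation set keeps the real class distribution.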