From the earlier discussion of the Blending ensemble method, we know that Blending only ever uses the validation-set data when combining the base learners, which wastes a substantial part of the data. To fix this, let us look more closely at exactly where Blending falls short and how it can be improved. In Blending, the validation set comes from a single split that produces one training set and one validation set, which naturally brings cross-validation to mind. Following that idea, we arrive at the Stacking model.
Stacking is a layered model-ensembling framework. Taking a two-layer setup as an example: the dataset is first split into a training part and a hold-out part, and several first-level (base) learners are trained on the training part. The base learners then make predictions on the hold-out part, and those predictions serve as the input features, with the true labels as the targets, for training the second-level (meta) learner (the final layer is usually logistic regression). Because the two stages are trained on different data, this limits overfitting to some extent.
Because multiple rounds of training are required, this approach needs a fairly large amount of data. To avoid the situation where, after splitting, the hold-out part is too small and the resulting second-level learner generalizes poorly, Stacking usually generates the first-level predictions with the cross-validation (or leave-one-out) procedure introduced earlier.
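Before the full worked example below, here is a minimal sketch of this idea using scikit-learn's StackingClassifier, which implements exactly this K-fold out-of-fold scheme internally (cv=5 below). The toy dataset and the particular base learners are placeholders chosen for illustration, not the models used in the rest of this section.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy binary-classification data (placeholder for a real dataset).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# First-level learners; their out-of-fold predictions (cv=5) become the
# meta-learner's input features, mirroring the description above.
estimators = [
    ('rf', RandomForestClassifier(n_estimators=200, random_state=0)),
    ('svc', SVC(probability=True, random_state=0)),
]

# Second level: logistic regression trained on the stacked predictions.
stack = StackingClassifier(estimators=estimators,
                           final_estimator=LogisticRegression(),
                           cv=5)
stack.fit(X_train, y_train)
print('stacked accuracy:', stack.score(X_test, y_test))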
Below is an example of stacking several models:
# -- First-level model: LightGBM with 5-fold out-of-fold (OOF) predictions --
# df_feature is assumed to be the combined train/test feature table built in
# earlier preprocessing steps.
import gc

import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder

seed = 2020  # assumed value; the original code defines `seed` earlier

# Label-encode the remaining object (categorical) columns.
for f in list(df_feature.select_dtypes('object')):
    if f in ['carid', 'regdate']:
        continue
    le = LabelEncoder()
    df_feature[f] = le.fit_transform(
        df_feature[f].astype('str')).astype('int')

# Split the combined table back into train / test rows
# (-999 marks rows whose label is unknown, i.e. the test set).
# df_train = df_feature[df_feature['y1_is_purchase'].notnull()]
# df_test = df_feature[df_feature['y1_is_purchase'].isnull()]
df_train = df_feature[df_feature['y1_is_purchase'] != -999].copy()
df_test = df_feature[df_feature['y1_is_purchase'] == -999].copy()

test_prob = np.zeros(len(df_test))   # averaged test-set predictions
prob = np.zeros(len(df_train))       # out-of-fold predictions on the train set

ycol = 'y1_is_purchase'
feature_names = list(
    filter(lambda x: x not in [ycol, 'regdate', 'carid'], df_train.columns))
test_data = df_test[feature_names].values

model = lgb.LGBMClassifier(num_leaves=64,
                           max_depth=10,
                           learning_rate=0.01,
                           n_estimators=10000,
                           subsample=0.8,
                           feature_fraction=0.8,
                           reg_alpha=0.5,
                           reg_lambda=0.5,
                           random_state=seed,
                           metric=None)

oof = []
prediction = df_test[['carid']].copy()
prediction['label'] = 0
df_importance_list = []

kfold = StratifiedKFold(n_splits=5, random_state=seed, shuffle=True)
for fold_id, (trn_idx, val_idx) in enumerate(
        kfold.split(df_train[feature_names], df_train[ycol])):
    X_train = df_train.iloc[trn_idx][feature_names]
    Y_train = df_train.iloc[trn_idx][ycol]
    X_val = df_train.iloc[val_idx][feature_names]
    Y_val = df_train.iloc[val_idx][ycol]
    print('\nFold_{} Training ================================\n'.format(fold_id + 1))
    # Passing early stopping / verbosity as fit arguments works with
    # LightGBM < 4.0; newer versions expect callbacks instead.
    lgb_model = model.fit(X_train,
                          Y_train,
                          eval_names=['valid'],
                          eval_set=[(X_val, Y_val)],
                          verbose=500,
                          eval_metric='auc',
                          early_stopping_rounds=50)
    # Out-of-fold predictions for the validation fold.
    pred_val = lgb_model.predict_proba(
        X_val, num_iteration=lgb_model.best_iteration_)[:, 1]
    df_oof = df_train.iloc[val_idx][['carid', ycol]].copy()
    df_oof['pred'] = pred_val
    oof.append(df_oof)
    # Test-set predictions, averaged over the 5 folds.
    pred_test = lgb_model.predict_proba(
        df_test[feature_names], num_iteration=lgb_model.best_iteration_)[:, 1]
    prediction['label'] += pred_test / 5
    df_importance = pd.DataFrame({
        'column': feature_names,
        'importance': lgb_model.feature_importances_,
    })
    df_importance_list.append(df_importance)
    prob[val_idx] = lgb_model.predict_proba(X_val)[:, 1]
    test_prob += lgb_model.predict_proba(test_data)[:, 1] / 5
    del lgb_model, pred_val, pred_test, X_train, Y_train, X_val, Y_val
    gc.collect()

# Average feature importance across folds.
df_importance = pd.concat(df_importance_list)
df_importance = df_importance.groupby(['column'])['importance'].agg(
    'mean').sort_values(ascending=False).reset_index()
print(df_importance)

# Out-of-fold AUC of the first-level model.
df_oof = pd.concat(oof)
score = roc_auc_score(df_oof['y1_is_purchase'], df_oof['pred'])
print('oof auc:', score)
print(df_oof.head(20))
print(prediction.head())

# Attach the first-level predictions as a new feature for the next level.
df_train['lgb_prob2'] = prob
df_test['lgb_prob2'] = test_prob
The block above is the LightGBM stage. Its out-of-fold predictions (the lgb_prob2 column) are now added to the feature list so that the next XGBoost model can use them as an input:
# -- Second-level model: XGBoost, trained on the original features plus the
# -- LightGBM out-of-fold prediction produced above.
import xgboost as xgb

# Include the first-level prediction so the XGBoost model actually consumes
# the stacked feature.
feature_names = feature_names + ['lgb_prob2']
test = df_test[feature_names]

params = {
    'booster': 'gbtree',
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'gamma': 0.1,
    'max_depth': 8,
    'alpha': 0,
    'lambda': 0,
    'subsample': 0.7,
    'colsample_bytree': 0.5,
    'min_child_weight': 3,
    'silent': 1,   # deprecated in newer XGBoost versions (use 'verbosity')
    'eta': 0.02,
    'nthread': 8,
    'missing': 1,
    'seed': 2019,
}

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2020)
xgb_prob = np.zeros(df_test.shape[0])   # averaged test-set predictions
prob = np.zeros(len(df_train))          # out-of-fold predictions

## train and predict
feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(
        folds.split(df_train[feature_names], df_train[ycol])):
    print("fold {}".format(fold_ + 1))
    trn_data = xgb.DMatrix(df_train.iloc[trn_idx][feature_names],
                           label=df_train.iloc[trn_idx][ycol])
    val_data = xgb.DMatrix(df_train.iloc[val_idx][feature_names],
                           label=df_train.iloc[val_idx][ycol])
    watchlist = [(trn_data, 'train'), (val_data, 'valid')]
    clf = xgb.train(params, trn_data, 5000, watchlist,
                    verbose_eval=200, early_stopping_rounds=50)
    # Test-set predictions, averaged over the folds.
    xgb_prob += clf.predict(xgb.DMatrix(test[feature_names]),
                            ntree_limit=clf.best_ntree_limit) / folds.n_splits
    # Out-of-fold predictions for the validation fold.
    # prob = clf.predict(xgb.DMatrix(df_train[feature_names]), ntree_limit=clf.best_ntree_limit)
    prob[val_idx] = clf.predict(xgb.DMatrix(df_train.iloc[val_idx][feature_names]),
                                ntree_limit=clf.best_ntree_limit)
    # Per-fold feature importance.
    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = list(clf.get_fscore().keys())
    fold_importance_df["importance"] = list(clf.get_fscore().values())
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

# Attach the second-level out-of-fold predictions as a new feature.
df_train['xgb_prob'] = prob
df_test['xgb_prob'] = xgb_prob
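Both out-of-fold probability columns (lgb_prob2 and xgb_prob) are now attached to df_train and df_test, so a final meta-learner can be fit on them, as described at the start of this section. Below is a minimal sketch, assuming the DataFrames produced above; the choice of logistic regression as the meta-model (and the stack_prob column name) is an assumption for illustration, not part of the original code.

from sklearn.linear_model import LogisticRegression

# Meta-features: the out-of-fold predictions of the two base models.
stack_features = ['lgb_prob2', 'xgb_prob']

# Final-level (meta) learner trained on the stacked features.
meta_model = LogisticRegression(max_iter=1000)
meta_model.fit(df_train[stack_features], df_train[ycol])

# Final stacked prediction for the test set.
df_test['stack_prob'] = meta_model.predict_proba(df_test[stack_features])[:, 1]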