Ensemble Learning Task 2: Stacking

From our earlier discussion of the Blending ensemble method, we know that Blending uses only the hold-out validation data to train its second-level learner, which in practice wastes a large part of the data. To fix this, let us look more closely at exactly where Blending falls short and how to improve it. In Blending, the validation set comes from a single split of the data into one training set and one validation set, which naturally brings cross-validation to mind. Following this thread, we arrive at the Stacking model.

Stacking is a layered model-ensembling framework. Taking two layers as an example: the dataset is first split into a training set and a test set, and several first-level (base) learners are trained on the training data. The base learners' predictions on held-out data then serve as the input features for the next stage, with the original labels as the targets, to train a second-level (meta) learner; the final level is usually a logistic regression. Because the two stages are trained on different data, this guards against overfitting to some extent.

Since several rounds of training are required, this approach needs plenty of data. If the split leaves the held-out portion too small, the resulting second-level learner will generalize poorly, so Stacking is usually trained with the cross-validation (or leave-one-out) scheme covered in the previous lesson.
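
To make the procedure concrete, here is a minimal sketch using scikit-learn's StackingClassifier on synthetic data; the base estimators and parameters here are illustrative choices, not taken from the example further below. With cv=5, each base learner's out-of-fold predictions from 5-fold cross-validation become the meta-features, and a logistic regression acts as the second-level learner.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary-classification data, purely for illustration.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=0)),
                ('knn', KNeighborsClassifier())],
    final_estimator=LogisticRegression(),
    cv=5)  # meta-features come from 5-fold out-of-fold predictions
stack.fit(X_train, y_train)
print('test accuracy:', stack.score(X_test, y_test))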

Below is an example that applies stacking with multiple models:

import gc

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

seed = 2020  # random seed; the original snippet uses `seed` without defining it

# Label-encode every object-typed column except the id and date columns.
for f in list(df_feature.select_dtypes('object')):
    if f in ['carid', 'regdate']:
        continue
    le = LabelEncoder()
    df_feature[f] = le.fit_transform(
        df_feature[f].astype('str')).astype('int')

# Rows labeled -999 are the unlabeled test set; the rest are training data
# (the commented-out lines show the equivalent null-based split).
# df_train = df_feature[df_feature['y1_is_purchase'].notnull()]
# df_test = df_feature[df_feature['y1_is_purchase'].isnull()]
df_train = df_feature[df_feature['y1_is_purchase'] != -999].copy()
df_test = df_feature[df_feature['y1_is_purchase'] == -999].copy()

# Out-of-fold predictions on the training set and fold-averaged test
# predictions; these later become features for the second-level model.
test_prob = np.zeros(len(df_test))
prob = np.zeros(len(df_train))

ycol = 'y1_is_purchase'
feature_names = list(
    filter(lambda x: x not in [ycol, 'regdate', 'carid'], df_train.columns))
test_data = df_test[feature_names].values
model = lgb.LGBMClassifier(num_leaves=64,
                           max_depth=10,
                           learning_rate=0.01,
                           n_estimators=10000,
                           subsample=0.8,
                           feature_fraction=0.8,
                           reg_alpha=0.5,
                           reg_lambda=0.5,
                           random_state=seed,
                           metric=None)

oof = []
prediction = df_test[['carid']].copy()  # .copy() avoids SettingWithCopyWarning
prediction['label'] = 0
df_importance_list = []

kfold = StratifiedKFold(n_splits=5, random_state=seed, shuffle=True)
for fold_id, (trn_idx, val_idx) in enumerate(kfold.split(
        df_train[feature_names], df_train[ycol])):
    X_train = df_train.iloc[trn_idx][feature_names]
    Y_train = df_train.iloc[trn_idx][ycol]

    X_val = df_train.iloc[val_idx][feature_names]
    Y_val = df_train.iloc[val_idx][ycol]

    print('\nFold_{} Training ================================\n'.format(fold_id+1))

    # fit() keyword style of lightgbm < 4.0; newer versions pass logging
    # and early stopping through callbacks instead.
    lgb_model = model.fit(X_train,
                          Y_train,
                          eval_names=['valid'],
                          eval_set=[(X_val, Y_val)],
                          verbose=500,
                          eval_metric='auc',
                          early_stopping_rounds=50)

    pred_val = lgb_model.predict_proba(
        X_val, num_iteration=lgb_model.best_iteration_)[:, 1]
    df_oof = df_train.iloc[val_idx][[
        'carid', ycol]].copy()
    df_oof['pred'] = pred_val
    oof.append(df_oof)

    pred_test = lgb_model.predict_proba(
        df_test[feature_names], num_iteration=lgb_model.best_iteration_)[:, 1]
    prediction['label'] += pred_test / kfold.n_splits

    df_importance = pd.DataFrame({
        'column': feature_names,
        'importance': lgb_model.feature_importances_,
    })
    df_importance_list.append(df_importance)

    # Reuse the best-iteration predictions computed above for the
    # out-of-fold and fold-averaged test probabilities.
    prob[val_idx] = pred_val
    test_prob += pred_test / kfold.n_splits

    del lgb_model, pred_val, pred_test, X_train, Y_train, X_val, Y_val
    gc.collect()

# Average each feature's importance across the five folds.
df_importance = pd.concat(df_importance_list)
df_importance = df_importance.groupby(['column'])['importance'].agg(
    'mean').sort_values(ascending=False).reset_index()
df_importance

# Overall out-of-fold AUC of the first-level LightGBM model.
df_oof = pd.concat(oof)
score = roc_auc_score(df_oof['y1_is_purchase'], df_oof['pred'])
score

df_oof.head(20)

prediction.head()

# Attach the out-of-fold / test probabilities as new columns; they will
# serve as stacking features for the next level.
df_train['lgb_prob2'] = prob
df_test['lgb_prob2'] = test_prob

The first block is the LightGBM stage; its predicted probabilities now feed the next XGBoost model as an extra feature.

import xgboost as xgb

# Rebuild the feature list so that the lgb_prob2 column added above is
# included as a stacking feature for the second model.
feature_names = list(
    filter(lambda x: x not in [ycol, 'regdate', 'carid'], df_train.columns))
test = df_test[feature_names]
params = {
    'booster': 'gbtree',
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'gamma': 0.1,
    'max_depth': 8,
    'alpha': 0,
    'lambda': 0,
    'subsample': 0.7,
    'colsample_bytree': 0.5,
    'min_child_weight': 3,
    'silent': 1,
    'eta': 0.02,
    'nthread': 8,
    'missing': 1,
    'seed': 2019,
}

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2020)
xgb_prob = np.zeros((df_test.shape[0]))
prob = np.zeros(len(df_train))

## train and predict
feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train[feature_names], df_train[ycol])):
    print("fold {}".format(fold_ + 1))
    trn_data = xgb.DMatrix(df_train.iloc[trn_idx][feature_names], label=df_train.iloc[trn_idx][ycol])
    val_data = xgb.DMatrix(df_train.iloc[val_idx][feature_names], label=df_train.iloc[val_idx][ycol])
    watchlist = [(trn_data, 'train'), (val_data, 'valid')]
    
    clf = xgb.train(params, trn_data,  5000, watchlist, verbose_eval=200, early_stopping_rounds=50)
    xgb_prob += clf.predict(xgb.DMatrix(test[feature_names]), ntree_limit=clf.best_ntree_limit) / folds.n_splits
    #prob=clf.predict(xgb.DMatrix(df_train[feature_names]), ntree_limit=clf.best_ntree_limit)
    prob[val_idx] = clf.predict(xgb.DMatrix(df_train.iloc[val_idx][feature_names]), ntree_limit=clf.best_ntree_limit)
    
    # Record this fold's feature importances inside the loop so every
    # fold contributes, not just the last one.
    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = list(clf.get_fscore().keys())
    fold_importance_df["importance"] = list(clf.get_fscore().values())
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
df_train['xgb_prob'] = prob
df_test['xgb_prob'] = xgb_prob 
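
The example stops after producing the two second-level feature columns, lgb_prob2 and xgb_prob. To finish the stack as described at the start, a final logistic-regression meta-learner can be trained on them; the sketch below is an assumed continuation, not part of the original code.

from sklearn.linear_model import LogisticRegression

# Hypothetical final level: fit a logistic regression on the two
# model-probability columns produced by the LightGBM and XGBoost stages.
meta_features = ['lgb_prob2', 'xgb_prob']
meta_model = LogisticRegression()
meta_model.fit(df_train[meta_features], df_train[ycol])
final_pred = meta_model.predict_proba(df_test[meta_features])[:, 1]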

 
