XGBoost quick start: first competition, Jinnan Digital Manufacturing Algorithm Challenge

1. Background / process flow / field meanings

2. Feature engineering approaches

3. Baseline

4. Model stacking: uses BayesianRidge regression

5. Competition introduction and follow-along, part 1
Competition introduction and follow-along, part 2
Advanced data mining: top Kaggle competition code shared

6. Experience with Kaggle, JData, and Tianchi competitions: this one post is all you need.

Initial improvement points:

1. Remove yield outliers

2. The dataset's timestamps also contain outliers

A11 - A12 = 1:00 (occasionally anomalous)

3. Convert timestamps to time intervals

4. Bin the yield, then for each feature construct the mean yield of each category level (see the sketch after this list)
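Items 2 to 4 can be roughed out as follows. This is a minimal sketch, assuming a pandas DataFrame data whose time fields (A11, A12, ...) are 'HH:MM' strings, whose yield sits in a target column that is NaN on test rows, and where B14/A25 stand in for whichever categorical columns you pick; all of these names are placeholders, not the competition's actual schema.

import numpy as np
import pandas as pd

def to_minutes(t):
    """Parse an 'HH:MM' time string into minutes since midnight."""
    if pd.isna(t):
        return np.nan
    h, m = str(t).split(':')[:2]
    return int(h) * 60 + int(m)

# 3. turn paired timestamps into a duration feature, wrapping past midnight
a11 = data['A11'].map(to_minutes)
a12 = data['A12'].map(to_minutes)
data['A11_A12_dur'] = (a12 - a11) % (24 * 60)

# 4. bin the yield, then encode each category level of a feature by
#    its mean yield and by its share of each yield bin
train_part = data[data['target'].notna()].copy()
train_part['target_bin'] = pd.cut(train_part['target'], bins=5, labels=False)

for col in ['B14', 'A25']:                    # placeholder categorical columns
    mean_map = train_part.groupby(col)['target'].mean()
    data[col + '_target_mean'] = data[col].map(mean_map)
    bin_rates = pd.crosstab(train_part[col], train_part['target_bin'], normalize='index')
    for b in bin_rates.columns:
        data['{}_bin{}_rate'.format(col, b)] = data[col].map(bin_rates[b])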

To be optimized:

1. Results from multiple models

2. Mean features

3. Optimize the time-related part

General template: boilerplate code for mainstream machine-learning models + experience notes [xgb, lgb, Keras, LR]

-----lgb default parameters-----

param = {'num_leaves': 30,
         'min_data_in_leaf': 30,    # alias of min_child_samples below; one of the two is redundant
         'objective': 'regression',
         'max_depth': -1,
         'learning_rate': 0.01,
         'min_child_samples': 30,
         'boosting': 'gbdt',
         'feature_fraction': 0.9,
         'bagging_freq': 1,
         'bagging_fraction': 0.9,
         'bagging_seed': 11,
         'metric': 'mse',           # LightGBM alias for l2
         'lambda_l1': 0.1,
         'verbosity': -1}


# Cross-validation logs from earlier runs:
#[2095]	training's l2: 0.000183012	valid_1's l2: 0.000238902
#CV score: 0.00031

#	training's l2: 0.000179558	valid_1's l2: 0.000251254
#CV score: 0.00031 

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

folds = KFold(n_splits=10, shuffle=True, random_state=2018)

oof_lgb = np.zeros(len(train))          # out-of-fold predictions on the train set
predictions_lgb = np.zeros(len(test))   # fold-averaged predictions on the test set

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):
    print("fold n°{}".format(fold_ + 1))
    trn_data = lgb.Dataset(X_train[trn_idx], y_train[trn_idx])
    val_data = lgb.Dataset(X_train[val_idx], y_train[val_idx])

    num_round = 10000
    # note: in LightGBM >= 4.0, replace verbose_eval/early_stopping_rounds with
    # callbacks=[lgb.log_evaluation(200), lgb.early_stopping(200)]
    clf = lgb.train(param,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=200,
                    early_stopping_rounds=200)
    # out-of-fold prediction for this fold's validation slice
    oof_lgb[val_idx] = clf.predict(X_train[val_idx], num_iteration=clf.best_iteration)

    # average the test predictions across folds
    predictions_lgb += clf.predict(X_test, num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(mean_squared_error(oof_lgb, target)))





-----xgb parameters-----

# note: in XGBoost >= 1.0, 'reg:linear' is renamed 'reg:squarederror'
# and 'silent' is replaced by 'verbosity'
xgb_params = {'eta': 0.005, 'max_depth': 10, 'subsample': 0.8, 'colsample_bytree': 0.8,
              'objective': 'reg:linear', 'eval_metric': 'rmse', 'silent': True, 'nthread': 4}

import xgboost as xgb

folds = KFold(n_splits=15, shuffle=True, random_state=2018)
oof_xgb = np.zeros(len(train))           # out-of-fold predictions on the train set
predictions_xgb = np.zeros(len(test))    # fold-averaged predictions on the test set

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):
    print("fold n°{}".format(fold_ + 1))
    trn_data = xgb.DMatrix(X_train[trn_idx], y_train[trn_idx])
    val_data = xgb.DMatrix(X_train[val_idx], y_train[val_idx])

    watchlist = [(trn_data, 'train'), (val_data, 'valid_data')]
    clf = xgb.train(dtrain=trn_data, num_boost_round=20000, evals=watchlist,
                    early_stopping_rounds=200, verbose_eval=100, params=xgb_params)
    # predict with the best iteration found by early stopping
    # (in newer XGBoost, best_ntree_limit is deprecated in favor of iteration_range)
    oof_xgb[val_idx] = clf.predict(xgb.DMatrix(X_train[val_idx]), ntree_limit=clf.best_ntree_limit)
    predictions_xgb += clf.predict(xgb.DMatrix(X_test), ntree_limit=clf.best_ntree_limit) / folds.n_splits
    
print("CV score: {:<8.8f}".format(mean_squared_error(oof_xgb, target)))

Stack the lgb and xgb out-of-fold results with a BayesianRidge meta-model:

from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import RepeatedKFold

# level-2 training data: one column per base model's OOF predictions
train_stack = np.vstack([oof_lgb, oof_xgb]).transpose()
test_stack = np.vstack([predictions_lgb, predictions_xgb]).transpose()

folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=4590)
oof_stack = np.zeros(train_stack.shape[0])
predictions = np.zeros(test_stack.shape[0])

for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack, target)):
    print("fold {}".format(fold_))
    trn_data, trn_y = train_stack[trn_idx], target.iloc[trn_idx].values
    val_data, val_y = train_stack[val_idx], target.iloc[val_idx].values

    clf_3 = BayesianRidge()
    clf_3.fit(trn_data, trn_y)

    oof_stack[val_idx] = clf_3.predict(val_data)
    # average over all 10 fits (5 splits x 2 repeats)
    predictions += clf_3.predict(test_stack) / 10
    
print("stacking CV score: {:.8f}".format(mean_squared_error(target.values, oof_stack)))
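The stacked predictions then go into the submission file; a minimal sketch, assuming the test ids live in test['样本id'] and the competition expects a headerless two-column CSV (both are assumptions; check the sample submission):

import pandas as pd

# build the submission from the stacked test predictions
sub = pd.DataFrame({'id': test['样本id'], 'target': predictions})
sub.to_csv('submission.csv', index=False, header=False)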

Variant 1: data['B11'] = 1.0
Submission result: offline 0.00011862717410224899, online:

Variant 2: pad data['B11'] with B10 plus one hour
Submission result: offline 0.00011745711197741756
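A rough sketch of that second fill rule, assuming B10 and B11 hold 'HH:MM' strings (the helper name here is made up for illustration):

import pandas as pd

def plus_one_hour(t):
    """Shift an 'HH:MM' string forward one hour, wrapping past midnight."""
    h, m = str(t).split(':')[:2]
    return '{:02d}:{:02d}'.format((int(h) + 1) % 24, int(m))

# pad missing B11 with B10 + 1 hour instead of a constant
mask = data['B11'].isna()
data.loc[mask, 'B11'] = data.loc[mask, 'B10'].map(plus_one_hour)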

xgb tuning process

Tuning helps model accuracy to a degree, but the gains are limited. The biggest improvements still come from data cleaning, feature selection, feature combination, model ensembling, and similar work!
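For the tuning itself, a small grid search per parameter group is the usual pattern; a minimal sketch with sklearn's GridSearchCV and the XGBoost sklearn wrapper (the grid values here are illustrative, not tuned):

from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# search tree-shape and sampling parameters with a fixed low learning rate
param_grid = {
    'max_depth': [6, 8, 10],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9],
}

grid = GridSearchCV(
    XGBRegressor(n_estimators=1000, learning_rate=0.01, n_jobs=4),
    param_grid,
    scoring='neg_mean_squared_error',
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)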

Parameter explanations
Tuning guide: grid search

Tuning guide 2

https://blog.csdn.net/sb19931201/article/details/52577592
