Machine Learning: A Ready-to-Use Roundup of LGB (LightGBM) Tuning Frameworks

After positive/negative sampling at the data level, business-driven data filtering, and outlier handling, we train the model and tune its hyperparameters to improve accuracy. This post summarizes several off-the-shelf tuning frameworks.

1. Bayesian Optimization

pip install bayesian-optimization

General workflow

  • Build an objective function LGB_bayesian:

    • its arguments are the hyperparameters to tune
    • it trains a model with those parameters
    • it returns the evaluation metric (the larger the better)
  • Define the parameter bounds bounds_LGB

  • Initialize the Bayesian optimizer (lgb_opt = BayesianOptimization(LGB_bayesian, bounds_LGB, random_state=42))

  • Run the optimization (lgb_opt.maximize(init_points=5, n_iter=5, acq='ucb', xi=0.0, alpha=1e-6))
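
Before the full LGB example below, here is a minimal sketch of these four steps on a toy objective (the function and bounds are purely illustrative, not from the original post):

from bayes_opt import BayesianOptimization

def toy_objective(x, y):
    # bayes_opt always maximizes, so negate anything you want to minimize
    return -((x - 1.0) ** 2 + (y + 2.0) ** 2)

toy_bounds = {'x': (-5, 5), 'y': (-5, 5)}
toy_opt = BayesianOptimization(toy_objective, toy_bounds, random_state=42)
toy_opt.maximize(init_points=3, n_iter=10)   # random warm-up points, then Bayesian steps
print(toy_opt.max)                           # {'target': ..., 'params': {'x': ..., 'y': ...}}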

Example: an LGB regression model

import lightgbm as lgb
import numpy as np
from bayes_opt import BayesianOptimization
from sklearn.metrics import mean_squared_error

# NOTE: train_df / val_df (pandas DataFrames), used_features and target are assumed
# to be defined beforehand; they are reused throughout this post.
def LGB_bayesian(
    num_leaves,
    min_data_in_leaf,
    min_sum_hessian_in_leaf,
    feature_fraction,
    lambda_l1,
    lambda_l2,
    min_gain_to_split
    ):
    # the optimizer passes floats, so round the integer-valued parameters
    num_leaves = int(np.round(num_leaves))
    min_data_in_leaf = int(np.round(min_data_in_leaf))
    param = {
        'num_leaves': num_leaves,
        'max_bin':128,
        'min_data_in_leaf': min_data_in_leaf,
        'learning_rate':0.01,
        
        'bagging_fraction': 0.95,
        'bagging_freq':5,
        'bagging_seed':66,
        'feature_fraction': feature_fraction,
        'feature_fraction_seed': 66,
        # loss
        'lambda_l1':lambda_l1,
        'lambda_l2':lambda_l2,
        'min_gain_to_split': min_gain_to_split,
        # greedy
        'min_sum_hessian_in_leaf': min_sum_hessian_in_leaf,
        # object-metric
        'objective': 'regression',
        'metric': 'rmse',
        'n_jobs':25,
        'boosting_type': 'gbdt',
        'verbose': 1,
        'early_stopping_rounds':50,
        'n_estimators': 500
    }
    lgb_train = lgb.Dataset(train_df[used_features], label=np.log1p(train_df[target]))
    lgb_valid = lgb.Dataset(val_df[used_features], label=np.log1p(val_df[target]))
    lgb_estimator = lgb.train(param, lgb_train, valid_sets=[lgb_train, lgb_valid], verbose_eval=200)
    pred_ = lgb_estimator.predict(val_df[used_features], num_iteration = lgb_estimator.best_iteration) 
    loss = np.sqrt(mean_squared_error(val_df[target].values, np.round(np.expm1(pred_))))
    return -loss


bounds_LGB = {
    'num_leaves': (10, 30),
    'min_data_in_leaf': (5, 30),
    'min_sum_hessian_in_leaf': (0, 5),
    'feature_fraction': (0.55, 1),
    'lambda_l1': (0, 3),
    'lambda_l2': (0, 3),
    'min_gain_to_split': (0, 1)
}

lgb_opt = BayesianOptimization(LGB_bayesian, bounds_LGB, random_state=42)
print(lgb_opt.space.keys)

print('=='*30)
lgb_opt.maximize(init_points=5, n_iter=5, acq='ucb', xi=0.0, alpha=1e-6)

# ---------------------------------------------------------
print(lgb_opt.max['target'])
rest_dict = lgb_opt.max['params']
lgb_param = {
        'num_leaves': int(np.round(rest_dict['num_leaves'])),
        'max_bin':128,
        'min_data_in_leaf': int(np.round(rest_dict['min_data_in_leaf'])),
        'learning_rate':0.01,

        'bagging_fraction': 0.95,
        'bagging_freq':5,
        'bagging_seed':66,
        'feature_fraction':rest_dict['feature_fraction'] ,
        'feature_fraction_seed': 66,
        # loss
        'lambda_l1':rest_dict['lambda_l1'] ,
        'lambda_l2':rest_dict['lambda_l2'] ,
        'min_gain_to_split': rest_dict['min_gain_to_split'] ,
        # greedy
        'min_sum_hessian_in_leaf':  rest_dict['min_sum_hessian_in_leaf'] ,
        # object-metric
        'objective': 'regression',
        'metric': 'rmse',
        'n_jobs':25,
        'boosting_type': 'gbdt',
        'verbose': 1,
        'early_stopping_rounds':50,
        'n_estimators': 500
    }
print(lgb_param)
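
With lgb_param assembled, the final model can be retrained; a minimal sketch, assuming the same train_df, val_df, used_features and target as above:

lgb_train = lgb.Dataset(train_df[used_features], label=np.log1p(train_df[target]))
lgb_valid = lgb.Dataset(val_df[used_features], label=np.log1p(val_df[target]))
final_model = lgb.train(lgb_param, lgb_train, valid_sets=[lgb_train, lgb_valid], verbose_eval=200)
# predictions are on the log1p scale, so invert with expm1
final_pred = np.expm1(final_model.predict(val_df[used_features], num_iteration=final_model.best_iteration))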

2. Random Search

2.1 Example: an LGB regression model

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint
import numpy as np
import lightgbm as lgb


lgb_model = lgb.LGBMRegressor(
    objective='regression',
    metric='rmse',
    max_bin=100,
    n_estimators=500,
    learning_rate=0.01,
    bagging_fraction=0.95,
    bagging_freq=5,
    bagging_seed=66,
    feature_fraction_seed=66,
    boosting='gbdt',
    n_jobs=25,
    verbose=0,
    early_stopping_rounds=50
)


param_dict = {
    "num_leaves": sp_randint(5, 40),
    "min_data_in_leaf": sp_randint(5, 64),
    "min_sum_hessian_in_leaf": np.linspace(0, 10, 30),
    "feature_fraction": np.linspace(0.55, 1, 30),
    'lambda_l1': np.linspace(0, 10, 30),
    'lambda_l2': np.linspace(0, 10, 30),
    "min_gain_to_split": np.linspace(0., 1, 30)
}

random_search = RandomizedSearchCV(
    lgb_model,
    param_distributions=param_dict,
    n_iter=15,
    cv=3
)

reg_cv = random_search.fit(
    train_df[used_features], np.log1p(train_df[target]), 
    eval_set=[(train_df[used_features], np.log1p(train_df[target])), (val_df[used_features], np.log1p(val_df[target]))],
    verbose=200
)

reg_cv.best_params_ 
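
By default RandomizedSearchCV refits the estimator on the training data with the best parameters found, so the fitted model can be used directly; a minimal sketch (the variable names below are illustrative):

best_model = reg_cv.best_estimator_   # already refit with best_params_
# the target was trained on the log1p scale, so invert with expm1
val_pred = np.expm1(best_model.predict(val_df[used_features]))
print(reg_cv.best_params_, reg_cv.best_score_)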

3. Optuna

3.1 Overview

Optuna is an automatic hyperparameter optimization framework designed specifically for machine learning. It offers an imperative, define-by-run style user API: code written with Optuna is highly modular, and the hyperparameter search space can be constructed dynamically while the objective runs. See the Optuna GitHub repository for details.
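
pip install optuna

To illustrate the define-by-run idea, here is a minimal sketch (not from the original post) in which part of the search space only exists conditionally on an earlier suggestion:

import optuna

def toy_objective(trial):
    # the space is defined while the objective runs, so it can branch dynamically
    booster = trial.suggest_categorical('boosting_type', ['gbdt', 'dart'])
    x = trial.suggest_uniform('x', -10.0, 10.0)
    if booster == 'dart':
        # this parameter is only part of the space when 'dart' is sampled
        drop_rate = trial.suggest_uniform('drop_rate', 0.05, 0.5)
        return (x - 2.0) ** 2 + drop_rate
    return (x - 2.0) ** 2

study = optuna.create_study(direction='minimize')
study.optimize(toy_objective, n_trials=20)
print(study.best_params)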

3.2 Example

import optuna
import numpy as np
import lightgbm as lgb
from sklearn.metrics import mean_squared_error
from functools import partial

def lgb_optuna(trial, train_x, train_y, test_x, test_y):
    param = {
        'num_leaves': trial.suggest_int(
            'num_leaves', 5, 40
        ),
        'max_bin':100,
        'min_data_in_leaf': trial.suggest_int(
            'min_data_in_leaf', 5, 64
        ),
        'learning_rate':0.01,
        'bagging_fraction': 0.95,
        'bagging_freq':5,
        'bagging_seed':66,
        'feature_fraction': trial.suggest_loguniform(
            'feature_fraction', 0.55 , 0.99
        ),
        'feature_fraction_seed': 66,
        # loss
        'lambda_l1':trial.suggest_discrete_uniform(
            'lambda_l1', 0.0 , 10.0, 0.1
        ),
        'lambda_l2':trial.suggest_discrete_uniform(
            'lambda_l2', 0.0 , 10.0, 0.1
        ),
        'min_gain_to_split': trial.suggest_uniform(
            'min_gain_to_split', 0.0, 1.0
        ),
        # greedy
        'min_sum_hessian_in_leaf':   trial.suggest_discrete_uniform(
            'min_sum_hessian_in_leaf', 0.55 , 20.0, 0.1
        ),
        # object-metric
        'objective': 'regression',
        'metric': 'rmse',
        'n_jobs':25,
        'boosting': 'gbdt',
        'verbose': 1,
        'early_stopping_rounds':50,
        'n_estimators': 500
    }
    model = lgb.LGBMRegressor(**param)  
    model.fit(train_x, np.log1p(train_y),
            eval_set=[
                (train_x, np.log1p(train_y)), (test_x, np.log1p(test_y))
            ],
            early_stopping_rounds=50,
            verbose=200
            )
    pred_ = model.predict(test_x)
    # the target was trained on the log1p scale, so invert with expm1 before computing RMSE
    loss = np.sqrt(mean_squared_error(test_y, np.round(np.expm1(pred_))))
    return loss

study = optuna.create_study(direction='minimize')
lgb_op_partial = partial(lgb_optuna,
                         train_x=train_df[used_features],
                         train_y=train_df[target].values,
                         test_x=val_df[used_features],
                         test_y=val_df[target].values
                        )
study.optimize(lgb_op_partial, n_trials=15)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
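
The fixed parameters (learning rate, bagging settings, etc.) are not part of the search space, so to retrain a final model they need to be merged back with study.best_trial.params; a minimal sketch, assuming the same data variables as above:

best_param = {
    'objective': 'regression', 'metric': 'rmse', 'boosting': 'gbdt',
    'learning_rate': 0.01, 'max_bin': 100, 'n_estimators': 500, 'n_jobs': 25,
    'bagging_fraction': 0.95, 'bagging_freq': 5, 'bagging_seed': 66, 'feature_fraction_seed': 66,
}
best_param.update(study.best_trial.params)   # layer the tuned values on top of the fixed ones
final_model = lgb.LGBMRegressor(**best_param)
final_model.fit(train_df[used_features], np.log1p(train_df[target]),
                eval_set=[(val_df[used_features], np.log1p(val_df[target]))],
                early_stopping_rounds=50, verbose=200)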

4. Hyperopt

4.1 Overview

Hyperopt is a Python library for distributed asynchronous hyperparameter optimization. With it we can escape the tedium of manual tuning and obtain good hyperparameters automatically. Broadly speaking, a model's response to its hyperparameters can be viewed as a non-convex function, so hyperopt reliably reaches more sensible settings than hand tuning; for models that are complicated to tune, it typically reaches better final performance than manual tuning, and far faster.
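
pip install hyperopt

A minimal sketch of the fmin/hp/tpe workflow on a toy objective (the function and bounds here are purely illustrative, not from the original post):

from hyperopt import fmin, tpe, hp, STATUS_OK

def toy_objective(args):
    # fmin always minimizes the returned loss
    x, y = args['x'], args['y']
    return {'loss': (x - 1.0) ** 2 + (y + 2.0) ** 2, 'status': STATUS_OK}

toy_space = {'x': hp.uniform('x', -5, 5), 'y': hp.uniform('y', -5, 5)}
toy_best = fmin(toy_objective, toy_space, algo=tpe.suggest, max_evals=30)
print(toy_best)   # e.g. {'x': ..., 'y': ...}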

4.2 Example

import numpy as np
import lightgbm as lgb
from functools import partial
from sklearn.metrics import mean_squared_error
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials


def lgb_hyp_opt(hpy_space_dict, train_x, train_y, test_x, test_y):
    param = {
        'num_leaves': hpy_space_dict['num_leaves'],
        'max_bin':100,
        'min_data_in_leaf': hpy_space_dict['min_data_in_leaf'],
        'learning_rate':0.01,
        'bagging_fraction': 0.95,
        'bagging_freq':5,
        'bagging_seed':66,
        'feature_fraction': hpy_space_dict['feature_fraction'],
        'feature_fraction_seed': 66,
        # loss
        'lambda_l1':hpy_space_dict['lambda_l1'],
        'lambda_l2':hpy_space_dict['lambda_l2'],
        'min_gain_to_split': hpy_space_dict['min_gain_to_split'],
        # greedy
        'min_sum_hessian_in_leaf':   hpy_space_dict['min_sum_hessian_in_leaf'],
        # object-metric
        'objective': 'regression',
        'metric': 'rmse',
        'n_jobs':25,
        'boosting': 'gbdt',
        'verbose': 1,
        'early_stopping_rounds':50,
        'n_estimators': 500
    }
    model = lgb.LGBMRegressor(**param)
    model.fit(train_x,  np.log1p(train_y),
            eval_set=[
                (train_x,  np.log1p(train_y)), (test_x, np.log1p(test_y))
            ],
            early_stopping_rounds=50,     
            verbose=200
            )
    pred_ = model.predict(test_x)
    loss = np.sqrt(mean_squared_error(test_y, np.round(np.expm1(pred_))))
    return {"loss": loss, "status": STATUS_OK}


space = {
        'num_leaves': hp.randint(
            'num_leaves', 5, 40
        ),
        'min_data_in_leaf': hp.randint(
            'min_data_in_leaf', 5, 64
        ),
        'feature_fraction': hp.uniform(
            'feature_fraction', 0.55 , 0.99
        ),
        # loss
        'lambda_l1': hp.loguniform(
            'lambda_l1', -2.5 , 2.5
        ),
        'lambda_l2': hp.loguniform(
            'lambda_l2', -2.5 , 2.5
        ),
        # hp.loguniform takes its bounds in log space
        'min_gain_to_split': hp.loguniform(
            'min_gain_to_split', np.log(1e-3), np.log(1.0)
        ),
        # greedy
        'min_sum_hessian_in_leaf': hp.loguniform(
            'min_sum_hessian_in_leaf', -2.5, 3
        )
}
lgb_opt = partial(lgb_hyp_opt,
                         train_x=train_df[used_features], 
                         train_y=train_df[target].values, 
                         test_x=val_df[used_features], 
                         test_y=val_df[target].values
                        )

algo = partial(tpe.suggest, n_startup_jobs=1)
best = fmin(lgb_opt, space, algo=algo, 
            max_evals=20, 
            pass_expr_memo_ctrl=None)

print(best)
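
Note that fmin returns the raw sampled values keyed by parameter label (for hp.choice these are indices); hyperopt.space_eval maps them back onto the original search space. A minimal sketch:

from hyperopt import space_eval

best_in_space = space_eval(space, best)   # best point expressed in the original search space
print(best_in_space)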

5. Summary

In general, optuna and bayes_opt are the recommended choices; compared with the other frameworks they tend to reach slightly higher model accuracy.
What all four have in common: the workflow splits into search-space construction, a search evaluator (the objective), and a search optimizer.

  • Search-space construction (ranges can be set with the algorithm's convergence behaviour in mind)

    • Bayesian optimization - search space bounds_LGB
      • a dict whose keys are parameter names and whose values are (low, high) tuples
    • Random search - search space param_dict
      • a dict whose keys are parameter names and whose values enumerate the candidate values (scipy distributions can also be used)
    • optuna - the search space is defined together with the search evaluator
      • built inside the objective with the API's own distributions, e.g. trial.suggest_loguniform
    • hyperopt - search space space
      • a dict whose keys are parameter names and whose values describe each parameter's distribution via the hp module
  • Search evaluator (objective) construction

    • Bayesian optimization - evaluator LGB_bayesian
      • takes the values drawn for the keys of the search space
      • returns the metric of interest; it is maximized, so negate any metric that should be minimized
    • Random search - evaluator RandomizedSearchCV
      • nothing extra to build; the sklearn interface wraps it already
    • optuna - evaluator lgb_optuna(trial, train_x, train_y, test_x, test_y)
      • takes the optimizer's trial object; the training data is bound in advance with partial
      • returns the metric of interest
    • hyperopt - evaluator lgb_hyp_opt(hpy_space_dict, train_x, train_y, test_x, test_y)
      • takes a dict sampled from the search space by the optimizer; the training data is bound in advance with partial
      • returns the metric of interest
  • Search optimizer

    • Bayesian optimization - optimizer BayesianOptimization(LGB_bayesian, bounds_LGB, random_state=42)
      • lgb_opt.maximize(init_points=5, n_iter=5, acq='ucb', xi=0.0, alpha=1e-6)
    • Random search - optimizer RandomizedSearchCV.fit
      • simply call fit
    • optuna - optimizer optuna.create_study(direction='minimize')
      • study.optimize(lgb_op_partial, n_trials=15, n_jobs=2) can run trials in parallel
    • hyperopt - optimizer fmin (see the Trials sketch after this list)
      • best = fmin(lgb_opt, space, algo=algo, max_evals=20, pass_expr_memo_ctrl=None)
      • the search algorithm itself is configurable, e.g. algo = partial(tpe.suggest, n_startup_jobs=1)
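
The Trials object imported in the hyperopt example (but not used above) records the full search history and can be passed to fmin; a minimal sketch:

from hyperopt import Trials

trials = Trials()                             # records every evaluation
best = fmin(lgb_opt, space, algo=algo, max_evals=20, trials=trials)
print(trials.best_trial['result']['loss'])    # best loss observed
print(len(trials.trials))                     # number of evaluations run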

References

  • Zhihu: Bayesian optimization, a better way to tune hyperparameters
  • LGB-Parameters-Tuning
  • Optuna: A hyperparameter optimization framework
  • Hyperopt: Distributed Asynchronous Hyper-parameter Optimization
  • Zhihu: How to use hyperopt to auto-tune xgboost
