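In my day-to-day work I mostly use tree-based regression models. LightGBM comes up especially often because it does not require preprocessing of categorical data, and when hyperparameter optimization is involved I usually fall back on random search. Optuna was mentioned in my earlier post on automatic hyperparameter optimization methods for algorithm models, and I also use it at work, so today's post organizes my notes on how to use Optuna.
Optuna is an automatic hyperparameter optimization framework for machine learning. Models currently supported:
The three core concepts in Optuna:
The objective defines the function to be optimized and specifies the ranges of the parameters/hyperparameters. A trial corresponds to a single execution of the objective. A study manages the optimization: it decides how to optimize, how many trials to run in total, how trial results are recorded, and so on.
A simple example: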
    import optuna
    import sklearn.datasets
    import sklearn.ensemble
    import sklearn.metrics
    import sklearn.model_selection
    import sklearn.svm


    # Define an objective function to be minimized.
    def objective(trial):
        # Invoke suggest methods of a Trial object to generate hyperparameters.
        regressor_name = trial.suggest_categorical('regressor', ['SVR', 'RandomForest'])
        if regressor_name == 'SVR':
            svr_c = trial.suggest_float('svr_c', 1e-10, 1e10, log=True)
            regressor_obj = sklearn.svm.SVR(C=svr_c)
        else:
            rf_max_depth = trial.suggest_int('rf_max_depth', 2, 32)
            regressor_obj = sklearn.ensemble.RandomForestRegressor(max_depth=rf_max_depth)

        # The original used load_boston, which was removed in scikit-learn 1.2;
        # load_diabetes is a bundled regression dataset used here as a stand-in.
        X, y = sklearn.datasets.load_diabetes(return_X_y=True)
        X_train, X_val, y_train, y_val = sklearn.model_selection.train_test_split(X, y, random_state=0)

        regressor_obj.fit(X_train, y_train)
        y_pred = regressor_obj.predict(X_val)

        error = sklearn.metrics.mean_squared_error(y_val, y_pred)
        return error  # An objective value linked with the Trial object.


    study = optuna.create_study()  # Create a new study.
    study.optimize(objective, n_trials=100)  # Invoke optimization of the objective function.
    print(study.best_params)
    print(study.best_value)
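Above we defined an objective whose return value is the model's mean squared error; in practice, return whichever metric you want to optimize. Inside it we also used trial.suggest_* to define the parameter space and the sampling distribution of each parameter. Supported methods: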
optuna.create_study()
Creates a study. The main things to configure here are the database storage and the optimization direction.
Function prototype:
optuna.study.create_study(storage: Union[str, optuna.storages._base.BaseStorage, None] = None, sampler: Optional[samplers.BaseSampler] = None, pruner: Optional[optuna.pruners._base.BasePruner] = None, study_name: Optional[str] = None, direction: Optional[str] = None, load_if_exists: bool = False, *, directions: Optional[Sequence[str]] = None)
Parameters:
An example using SQLite storage:
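    study_name = 'example-study'
    # load_if_exists=True lets a rerun pick up the existing study instead of raising an error.
    study = optuna.create_study(study_name=study_name, storage='sqlite:///example.db', load_if_exists=True)
    study.optimize(objective, n_trials=300)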
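If the database file behind the URL 'sqlite:///example.db' does not exist, Optuna creates it and starts a new optimization. If the optimization is interrupted, rerunning the script continues the unfinished study, as long as the database file is still at that path, it contains a study named 'example-study', and create_study is called with load_if_exists=True. Without load_if_exists=True, calling create_study again with an existing study name raises an error instead of resuming. To inspect the data stored in the database: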
df = study.trials_dataframe(attrs=('number', 'value', 'params', 'state'))
study.optimize()
Starts the optimization. Function prototype:
optimize(func: Callable[[optuna.trial._trial.Trial], Union[float, Sequence[float]]], n_trials: Optional[int] = None, timeout: Optional[float] = None, n_jobs: int = 1, catch: Tuple[Type[Exception], ...] = (), callbacks: Optional[List[Callable[[Study, optuna.trial._frozen.FrozenTrial], None]]] = None, gc_after_trial: bool = False, show_progress_bar: bool = False)
Parameters:
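Pruning
To make pruning as easy as possible to use, Optuna provides integration modules for a number of libraries; for the full list, see optuna.integration. For example, XGBoostPruningCallback adds pruning without changing the logic of the training loop.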
    pruning_callback = optuna.integration.XGBoostPruningCallback(trial, 'validation-error')
    bst = xgb.train(param, dtrain, evals=[(dvalid, 'validation')], callbacks=[pruning_callback])
A complete example with LightGBM:

    import lightgbm as lgb
    import optuna
    import pandas as pd
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # The original post does not show how cols_to_encode is defined. "auto" is
    # LightGBM's default and treats pandas category columns as categorical;
    # replace it with a list of your categorical column names if needed.
    cols_to_encode = "auto"


    def objective(trial):
        # The last column of example.csv is the binary target.
        df_data = pd.read_csv("example.csv")
        data = df_data.iloc[:, :-1]
        target = df_data.iloc[:, -1]
        train_x, valid_x, train_y, valid_y = train_test_split(data, target, test_size=0.25)
        dtrain = lgb.Dataset(train_x, label=train_y, categorical_feature=cols_to_encode)
        dvalid = lgb.Dataset(valid_x, label=valid_y, categorical_feature=cols_to_encode)

        param = {
            "objective": "binary",
            "is_unbalance": True,
            "metric": "auc",
            "verbosity": -1,
            "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
            "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
            "num_leaves": trial.suggest_int("num_leaves", 2, 256),
            "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 1.0),
            "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
            "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
            "learning_rate": trial.suggest_float("learning_rate", 1e-4, 1, log=True),
            "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
            "cat_smooth": trial.suggest_int("cat_smooth", 0, 100),
        }

        # Reports the validation AUC after each boosting round so the pruner can stop
        # weak trials early. On recent Optuna releases this import may require the
        # optional optuna-integration package.
        pruning_callback = optuna.integration.LightGBMPruningCallback(trial, "auc")
        # verbose_eval was removed from lgb.train in LightGBM 4.0; "verbosity": -1
        # in param already silences the training log.
        gbm = lgb.train(param, dtrain, valid_sets=[dvalid], callbacks=[pruning_callback])

        # Score the predicted probabilities; the original passed an undefined pred_labels.
        preds = gbm.predict(valid_x)
        auc = roc_auc_score(valid_y, preds)
        return auc


    study = optuna.create_study(
        pruner=optuna.pruners.MedianPruner(n_warmup_steps=10),
        direction="maximize",
        study_name='example',
        storage='sqlite:///example.db',
    )
    study.optimize(objective, n_trials=50)

    print("Number of finished trials: {}".format(len(study.trials)))
    print("Best trial:")
    trial = study.best_trial
    print("  Value: {}".format(trial.value))
    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))
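To give a feel for what MedianPruner(n_warmup_steps=10) is deciding in the example above, here is a simplified, plain-Python sketch of the median stopping idea: after a warm-up period, a trial is stopped if its intermediate value is worse than the median of earlier trials at the same step. This is my own toy version for illustration, not Optuna's actual implementation, and the function name and data layout are assumptions.

```python
from statistics import median


def should_prune_median(history, current, step, n_warmup_steps=10, maximize=True):
    """Toy median rule.

    history: list of per-step value lists from trials that already finished.
    current: the running trial's intermediate value at `step`.
    """
    if step < n_warmup_steps:
        return False  # never prune during the warm-up steps
    # Values that earlier trials had reached at this same step.
    peers = [values[step] for values in history if len(values) > step]
    if not peers:
        return False  # nothing to compare against yet
    threshold = median(peers)
    # Prune when the running trial is on the wrong side of the median.
    return current < threshold if maximize else current > threshold


# Three finished trials with their AUC after steps 0, 1, and 2 (made-up numbers).
history = [[0.5, 0.6, 0.7], [0.6, 0.7, 0.8], [0.4, 0.5, 0.6]]
print(should_prune_median(history, 0.55, step=1, n_warmup_steps=0))
print(should_prune_median(history, 0.65, step=1, n_warmup_steps=0))
```

A longer warm-up gives slow-starting trials more time to catch up, at the cost of spending more compute on trials that would have been pruned anyway.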
Optuna's visualization module uses plotly to create its charts, but JupyterLab cannot render them by default. I have not found a solution for this yet.