Learning LightGBM Parameter Tuning Methods

I. Understanding LightGBM's parameters

LightGBM (LGBM) is Microsoft's light gradient boosting machine. Its defining trait is speed, and it builds regression and classification tree models. The first step in using LGBM is to look up what its parameters mean:
Microsoft's official documentation on GitHub:
https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst#early_stopping_round
LightGBM manual in Chinese:
http://lightgbm.apachecn.org/#/docs/2

This gives a rough picture of the parameter settings:
1. Core parameters:
'objective': 'regression',
'boosting': 'gbdt',
'learning_rate': 0.05,
'num_iterations': 100,
'num_leaves': 31,
'num_threads': -1,
2. Learning-control parameters:
'max_depth': 10,
'feature_fraction': 1,
'bagging_fraction': 0.8,
'bagging_freq': 8,
'lambda_l1': 0,
'lambda_l2': 0,
'min_data_in_leaf': 20,
'min_sum_hessian_in_leaf': 1e-3,
'early_stopping_round'
3. Metric:
'metric': 'rmse'  (many options are available, depending on the problem)

Given the nature of the problem, 'metric' and 'objective' (here 'regression') can be fixed up front. The parameters that actually need tuning are: learning_rate; num_iterations; max_depth together with num_leaves; min_data_in_leaf together with min_sum_hessian_in_leaf; feature_fraction together with bagging_fraction; and the regularization parameters lambda_l1 (reg_alpha) and lambda_l2 (reg_lambda).
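The parameter lists above can be collected into a plain Python dict as a starting point. This is only a sketch with the illustrative default values from the lists, not tuned settings; the closing assertion encodes the common rule of thumb relating num_leaves to max_depth:

```python
# Baseline LightGBM parameter dictionary assembled from the lists above.
# These are illustrative starting values, not tuned results.
params = {
    # core parameters
    'objective': 'regression',
    'boosting': 'gbdt',
    'learning_rate': 0.05,
    'num_iterations': 100,
    'num_leaves': 31,
    'num_threads': -1,
    # learning-control parameters
    'max_depth': 10,
    'feature_fraction': 1.0,
    'bagging_fraction': 0.8,
    'bagging_freq': 8,
    'lambda_l1': 0.0,
    'lambda_l2': 0.0,
    'min_data_in_leaf': 20,
    'min_sum_hessian_in_leaf': 1e-3,
    # metric
    'metric': 'rmse',
}

# LightGBM grows trees leaf-wise, so num_leaves (not depth) controls model
# complexity; keeping num_leaves < 2**max_depth lets max_depth still act as
# a real constraint instead of being ignored.
assert params['num_leaves'] < 2 ** params['max_depth']
```

A dict like this can be passed straight to the native lgb.train API, while the sklearn wrapper takes the same names as keyword arguments.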

II. The tuning process

The code found online is a bit scattered, since different posts use different tools; here is a summary.
References:
A Jianshu post (which also covers XGBoost along the way):
https://www.jianshu.com/p/1100e333fcab
A CSDN post:
https://blog.csdn.net/ssswill/article/details/85235074
An IMOOC article:
https://www.imooc.com/article/43784?block_id=tuijian_wz

1. Using sklearn's GridSearchCV

An example first:

import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV  # performing grid search
from sklearn.model_selection import train_test_split

train_data = pd.read_csv('train.csv')   # load the data
y = train_data.pop('30').values   # pop the label column out as the training target; '30' is the label's column name
col = train_data.columns
x = train_data[col].values  # the remaining columns are the training features
train_x, valid_x, train_y, valid_y = train_test_split(x, y, test_size=0.333, random_state=0)   # split into training and validation sets
train = lgb.Dataset(train_x, train_y)                   # native-API Datasets; not used by GridSearchCV below
valid = lgb.Dataset(valid_x, valid_y, reference=train)


parameters = {
              'max_depth': [15, 20, 25, 30, 35],
              'learning_rate': [0.01, 0.02, 0.05, 0.1, 0.15],
              'feature_fraction': [0.6, 0.7, 0.8, 0.9, 0.95],
              'bagging_fraction': [0.6, 0.7, 0.8, 0.9, 0.95],
              'bagging_freq': [2, 4, 5, 6, 8],
              'lambda_l1': [0, 0.1, 0.4, 0.5, 0.6],
              'lambda_l2': [0, 10, 15, 35, 40],
              'cat_smooth': [1, 10, 15, 20, 35]
}
gbm = lgb.LGBMClassifier(boosting_type='gbdt',
                         objective='binary',
                         metric='auc',
                         verbose=0,
                         learning_rate=0.01,
                         num_leaves=35,
                         feature_fraction=0.8,
                         bagging_fraction=0.9,
                         bagging_freq=8,
                         lambda_l1=0.6,
                         lambda_l2=0)
# with GridSearchCV there is no need to call gbm.fit directly; gsearch.fit below runs the whole search
gsearch = GridSearchCV(gbm, param_grid=parameters, scoring='accuracy', cv=3)
gsearch.fit(train_x, train_y)

print("Best score: %0.3f" % gsearch.best_score_)
print("Best parameters set:")
best_parameters = gsearch.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

As the parameters show, this code targets a binary classification problem (objective='binary') with AUC as the model's evaluation metric; verbose controls how much logging LightGBM prints during training.
Now the key part, the GridSearchCV() function itself.
Official documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
Only a few of its parameters really matter:
GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=None, iid='warn', refit=True, cv='warn', verbose=0, pre_dispatch='2*n_jobs', error_score='raise-deprecating', return_train_score='warn')
estimator: the estimator object
param_grid: the parameter grid as a dictionary
scoring: the scoring metric; multiple metrics can be passed as a list or dict
n_jobs: number of parallel jobs; -1 uses all cores
cv: number of cross-validation folds; the (train_x, train_y) passed to gsearch.fit is split into cv equal parts, each taking a turn as the validation fold

For scoring, check which metrics GridSearchCV supports:
https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
To use a different metric, import the corresponding function and pass the matching string to scoring. One thing must be stressed here: every scoring parameter in sklearn's model evaluation follows the convention that higher return values are better than lower return values.
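This sign convention is why error metrics show up with a neg_ prefix (e.g. 'neg_mean_squared_error'). A small sketch of the convention using sklearn's make_scorer; the DummyRegressor and the tiny dataset here are purely illustrative:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_error

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])

# DummyRegressor always predicts the mean of y (here 1.5),
# so the MSE is fixed and easy to check by hand.
est = DummyRegressor(strategy='mean').fit(X, y)

# greater_is_better=False tells sklearn this is an error (lower is better);
# the scorer then returns the NEGATED value, so that for GridSearchCV
# "higher" still means "better".
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

mse = mean_squared_error(y, est.predict(X))   # 1.25
score = mse_scorer(est, X, y)                 # -1.25
```

Passing scoring=mse_scorer to GridSearchCV would then rank candidates correctly even though the underlying metric is an error.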
Drawback: this approach cannot use the estimator's built-in early stopping, so the tuned parameters may well overfit, even though some of the parameters, together with cross-validation, can be set to guard against overfitting.
