Reference: the LightGBM GitHub repository:
https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst
For the source of the code, see my other blog post:
https://blog.csdn.net/ssswill/article/details/85217702
Grid search for hyperparameters:
import lightgbm as lgb
from sklearn.model_selection import (cross_val_score, train_test_split,
                                     GridSearchCV, RandomizedSearchCV)
from sklearn.metrics import r2_score
from lightgbm.sklearn import LGBMRegressor

# Candidate values per hyperparameter; the grid is their Cartesian product.
hyper_space = {'n_estimators': [1000, 1500, 2000, 2500],
               'max_depth': [4, 5, 8, -1],
               'num_leaves': [15, 31, 63, 127],
               'subsample': [0.6, 0.7, 0.8, 1.0],
               'colsample_bytree': [0.6, 0.7, 0.8, 1.0],
               'learning_rate': [0.01, 0.02, 0.03]
               }

# train_X / train_y are the training features and target prepared beforehand.
est = lgb.LGBMRegressor(n_jobs=-1, random_state=2018)
gs = GridSearchCV(est, hyper_space, scoring='r2', cv=4, verbose=1)
gs_results = gs.fit(train_X, train_y)
print("BEST PARAMETERS: " + str(gs_results.best_params_))
print("BEST CV SCORE: " + str(gs_results.best_score_))
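Before launching a search like this, it helps to estimate what it costs. The sketch below counts the combinations in `hyper_space` and the number of model fits under `cv=4`, using only the standard library. The names `n_candidates` and `n_fits` are mine, introduced just for illustration.

```python
import math

# Same grid as in the search above.
hyper_space = {'n_estimators': [1000, 1500, 2000, 2500],
               'max_depth': [4, 5, 8, -1],
               'num_leaves': [15, 31, 63, 127],
               'subsample': [0.6, 0.7, 0.8, 1.0],
               'colsample_bytree': [0.6, 0.7, 0.8, 1.0],
               'learning_rate': [0.01, 0.02, 0.03]}
cv = 4

# A grid search tries every combination: multiply the list lengths.
n_candidates = math.prod(len(values) for values in hyper_space.values())
# Each candidate is fitted once per cross-validation fold.
n_fits = n_candidates * cv

print(n_candidates, n_fits)  # 3072 12288
```

With up to 2500 trees per fit, that is a lot of training. `RandomizedSearchCV`, which is already imported above, is one way to cap the cost: its `n_iter` argument fixes how many candidates are sampled.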
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
import lightgbm as lgb

lgb_params = {"objective": "regression", "metric": "rmse",
              "max_depth": 7, "min_child_samples": 20,
              "reg_alpha": 1, "reg_lambda": 1,
              "num_leaves": 64, "learning_rate": 0.01,
              "subsample": 0.8, "colsample_bytree": 0.8,
              "verbosity": -1}

# 5-fold cross-validation with shuffling.
FOLDs = KFold(n_splits=5, shuffle=True, random_state=42)
oof_lgb = np.zeros(len(train_X))         # out-of-fold predictions on the training set
predictions_lgb = np.zeros(len(test_X))  # test predictions, averaged over the folds
features_lgb = list(train_X.columns)
feature_importance_df_lgb = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(FOLDs.split(train_X)):
    trn_data = lgb.Dataset(train_X.iloc[trn_idx], label=train_y.iloc[trn_idx])
    val_data = lgb.Dataset(train_X.iloc[val_idx], label=train_y.iloc[val_idx])

    print("-" * 20 + "LGB Fold:" + str(fold_) + "-" * 20)
    num_round = 10000
    # Stop once the validation RMSE has not improved for 50 rounds; log every 1000 rounds.
    # LightGBM versions before 4.0 also accepted early_stopping_rounds=50 and
    # verbose_eval=1000 as arguments to lgb.train instead of these callbacks.
    clf = lgb.train(lgb_params, trn_data, num_round, valid_sets=[trn_data, val_data],
                    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(1000)])
    oof_lgb[val_idx] = clf.predict(train_X.iloc[val_idx], num_iteration=clf.best_iteration)

    # Record this fold's feature importances.
    fold_importance_df_lgb = pd.DataFrame()
    fold_importance_df_lgb["feature"] = features_lgb
    fold_importance_df_lgb["importance"] = clf.feature_importance()
    fold_importance_df_lgb["fold"] = fold_ + 1
    feature_importance_df_lgb = pd.concat([feature_importance_df_lgb, fold_importance_df_lgb], axis=0)

    predictions_lgb += clf.predict(test_X, num_iteration=clf.best_iteration) / FOLDs.n_splits

# RMSE of the out-of-fold predictions over the whole training set.
print("OOF RMSE: ", np.sqrt(mean_squared_error(train_y, oof_lgb)))
hyper_space = {'n_estimators': [1000, 1500, 2000, 2500],
               'max_depth': [4, 5, 8, -1],
               'num_leaves': [15, 31, 63, 127],
               'subsample': [0.6, 0.7, 0.8, 1.0],
               'colsample_bytree': [0.6, 0.7, 0.8, 1.0],
               'learning_rate': [0.01, 0.02, 0.03]
               }
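According to the parameter documentation, n_estimators is an alias of num_iterations, and its default is 100. It is the number of boosting iterations, in other words the number of trees.
The documentation also adds a note: for multiclass problems, the number of trees actually built is the number of classes times the number of trees you set.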
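To make the note concrete, here is a tiny sketch of the arithmetic. `total_trees` is a hypothetical helper of mine, not a LightGBM function.

```python
def total_trees(num_iterations: int, num_class: int = 1) -> int:
    """Trees built in total: one per class for every boosting iteration."""
    return num_iterations * max(num_class, 1)

# Regression or binary classification: one tree per iteration.
print(total_trees(100))      # 100
# 3-class classification with the same setting: three trees per iteration.
print(total_trees(100, 3))   # 300
```

So the n_estimators values in the grid above (1000 to 2500) are iteration counts, and for a multiclass model the real tree count would be a multiple of them.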
"max_depth": the maximum depth of a tree.
-1 means no limit.
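"num_leaves": the maximum number of leaves in one tree (default 31).
"subsample": an alias of bagging_fraction, the fraction of the training rows that is randomly selected, without resampling.
The documentation lists its benefits: it speeds up training and helps against overfitting. It also notes that bagging only takes effect when bagging_freq (subsample_freq in the scikit-learn wrapper) is set to a nonzero value. The same entry says it is similar to feature_fraction, so let us see what that parameter is: when feature_fraction is below 1.0, LightGBM randomly selects that fraction of the features for each tree.
So that is what it is. It is much like the corresponding mechanism in RF (random forest).
Figure source: https://www.cnblogs.com/harvey888/p/6512312.html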
Next, "colsample_bytree":
It turns out to be the parameter above: colsample_bytree is an alias of feature_fraction.
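The two sampling parameters can be pictured with a few lines of plain Python. This is only an illustration of the idea (rows sampled without replacement, then a random subset of columns), not LightGBM's actual implementation. `sample_rows_and_features` and its arguments are hypothetical names of mine.

```python
import random

def sample_rows_and_features(n_rows, feature_names, subsample, colsample_bytree, seed=0):
    """Pick the rows and columns a single tree would be allowed to see."""
    rng = random.Random(seed)
    n_keep_rows = max(1, round(n_rows * subsample))
    n_keep_cols = max(1, round(len(feature_names) * colsample_bytree))
    # random.sample draws without replacement, so no row or column repeats.
    row_idx = sorted(rng.sample(range(n_rows), n_keep_rows))
    cols = sorted(rng.sample(feature_names, n_keep_cols))
    return row_idx, cols

rows, cols = sample_rows_and_features(
    n_rows=10, feature_names=["f0", "f1", "f2", "f3", "f4"],
    subsample=0.8, colsample_bytree=0.6)
print(len(rows), len(cols))  # 8 3
```

Each tree is then trained on a smaller slice of the data, which is where both the speed-up and the regularizing effect come from.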
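Moving on: "learning_rate"
This one hardly needs explaining: it is the learning rate, also called the shrinkage rate. Its default is 0.1.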
Now look at these two lines of code:
est = lgb.LGBMRegressor(n_jobs=-1, random_state=2018)
gs = GridSearchCV(est, hyper_space, scoring='r2', cv=4, verbose=1)
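n_jobs: the number of parallel jobs. This matters a great deal for ensemble algorithms, especially bagging, where the base learners are independent and can be trained in parallel. Boosting is harder to parallelize across iterations because each iteration depends on the previous one, so for LGBMRegressor n_jobs sets the number of threads LightGBM uses inside each tree build. 1 = no parallelism; n = n parallel jobs; -1 = as many jobs as the CPU has cores. Note that GridSearchCV has its own separate n_jobs argument, which is left at its default here.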
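As a rough sketch of how such an n_jobs value maps to a worker count, here is the joblib convention that scikit-learn follows. `resolve_n_jobs` is a hypothetical helper written for this post, and LightGBM's own handling of n_jobs is not reproduced here.

```python
import os
from typing import Optional

def resolve_n_jobs(n_jobs: int, cpu_count: Optional[int] = None) -> int:
    """Translate an n_jobs setting into a concrete number of workers."""
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    if n_jobs < 0:
        # joblib convention: -1 means all CPUs, -2 all but one, and so on.
        return max(cpus + 1 + n_jobs, 1)
    return n_jobs

# On a machine with 8 cores:
print(resolve_n_jobs(1, cpu_count=8))    # 1 (no parallelism)
print(resolve_n_jobs(4, cpu_count=8))    # 4
print(resolve_n_jobs(-1, cpu_count=8))   # 8 (one job per core)
```

Passing `cpu_count` explicitly just keeps the example deterministic; leaving it out falls back to `os.cpu_count()`.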
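"random_state": the random seed. Fixing it makes the results reproducible.
"verbose":
It controls how verbose the output is. No need to worry much about it; the default is fine.
It produces log output of this kind:
What the messages mean: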
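Because LightGBM grows trees leaf-wise, the complexity of a tree is tuned with num_leaves rather than max_depth.
The rough conversion is num_leaves = 2^(max_depth). The value of num_leaves should be set below 2^(max_depth); otherwise the model may overfit.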
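Applying that rule of thumb to the grid used earlier shows that several combinations break it. The sketch below lists every (max_depth, num_leaves) pair where num_leaves is not below 2^(max_depth). `exceeds_depth_budget` is a hypothetical helper of mine, and a max_depth of -1 is treated as "no limit".

```python
from itertools import product

max_depths = [4, 5, 8, -1]
num_leaves_values = [15, 31, 63, 127]

def exceeds_depth_budget(max_depth: int, num_leaves: int) -> bool:
    """True when num_leaves is not strictly below 2^(max_depth)."""
    if max_depth <= 0:  # -1 means unlimited depth, so nothing to compare against
        return False
    return num_leaves >= 2 ** max_depth

flagged = [(d, n) for d, n in product(max_depths, num_leaves_values)
           if exceeds_depth_budget(d, n)]
print(flagged)
# [(4, 31), (4, 63), (4, 127), (5, 63), (5, 127)]
```

Five of the sixteen pairs are flagged, so dropping them from the search space would shrink the grid without losing any combination the rule considers sensible.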