Like xgboost, lightgbm is implemented in C++ with a Python interface on top. It likewise comes in two flavors: the native lightgbm API, and a sklearn wrapper provided for consistency with scikit-learn, the most widely used machine-learning library.
The native lgb API and the sklearn interface still differ in a few ways, which we can compare with the following simple test.
First, generate a multi-class dataset with sklearn's make_classification:
import pandas as pd
import lightgbm as lgb
from sklearn.datasets import make_classification

# Build an imbalanced 5-class dataset with heavy label noise (flip_y=0.4)
data = make_classification(n_samples=10000, n_features=20, n_informative=4, n_redundant=2,
                           n_repeated=0, n_classes=5, n_clusters_per_class=2,
                           weights=[0.05, 0.1, 0.1, 0.5], flip_y=0.4, class_sep=1.0,
                           hypercube=True, shift=0.0, scale=1.0,
                           shuffle=True, random_state=2018)
df = pd.DataFrame(data[0])
df['label'] = data[1]
df.label.value_counts()
3 3789
4 2266
2 1418
1 1413
0 1114
Name: label, dtype: int64
Split into training and validation sets:
from sklearn.model_selection import train_test_split

# default split: 75% train (x1, y1), 25% validation (x2, y2)
x1, x2, y1, y2 = train_test_split(df.drop(['label'], axis=1), df.label)
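Since the classes are imbalanced, one might prefer a stratified split so that both sets keep the same label distribution; a minimal variant of the same call (the random_state here is an arbitrary choice echoing the seed above):

# stratified variant (sketch): keeps class proportions identical in both splits
x1, x2, y1, y2 = train_test_split(df.drop(['label'], axis=1), df.label,
                                  stratify=df.label, random_state=2018)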
With the native API, a model is trained via lgb.train, and the model parameters are passed in as a dict:
params_naive = {
    'learning_rate': 0.1,
    'max_bin': 150,
    'num_leaves': 32,
    'max_depth': 11,
    'lambda_l1': 0.1,           # L1 regularization
    'lambda_l2': 0.2,           # L2 regularization
    'objective': 'multiclass',
    'num_class': 5,             # required by the native API for multiclass
}
For consistency with sklearn, lightgbm also provides a sklearn-style interface, where the model parameters are set when the classifier is constructed:
params_sklearn = {
    'learning_rate': 0.1,
    'max_bin': 150,
    'num_leaves': 32,
    'max_depth': 11,
    'reg_alpha': 0.1,           # sklearn alias for lambda_l1
    'reg_lambda': 0.2,          # sklearn alias for lambda_l2
    'objective': 'multiclass',
    'n_estimators': 300,        # number of boosting rounds
    # 'class_weight': weight,
}
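The commented-out class_weight points at the imbalance seen above. LGBMClassifier accepts either 'balanced' or an explicit dict mapping labels to weights; a sketch of both options (the dict values are illustrative, not tuned):

# option 1: reweight classes inversely proportional to their frequency
clf_balanced = lgb.LGBMClassifier(class_weight='balanced', **params_sklearn)
# option 2: hand-picked weights, e.g. upweighting the rare class 0
clf_weighted = lgb.LGBMClassifier(class_weight={0: 3.0, 1: 2.0, 2: 2.0, 3: 1.0, 4: 1.0},
                                  **params_sklearn)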
watchlist = [(x1, y1), (x2, y2)]

# sklearn wrapper: parameters at construction, early stopping in fit()
clf = lgb.LGBMClassifier(**params_sklearn)
clf.fit(x1, y1, early_stopping_rounds=10, eval_set=watchlist, verbose=10)
print('-' * 100)

# native API: wrap the data in lgb.Dataset, then call lgb.train
dtrain = lgb.Dataset(x1, label=y1)
dtest = lgb.Dataset(x2, label=y2)
x = lgb.train(params=params_naive, train_set=dtrain, valid_sets=[dtrain, dtest],
              verbose_eval=10, early_stopping_rounds=10, num_boost_round=300)
The training logs are as follows:
Training until validation scores don't improve for 10 rounds.
[10] training's multi_logloss: 1.29012 valid_1's multi_logloss: 1.36028
[20] training's multi_logloss: 1.16467 valid_1's multi_logloss: 1.29577
[30] training's multi_logloss: 1.0841 valid_1's multi_logloss: 1.28103
[40] training's multi_logloss: 1.01747 valid_1's multi_logloss: 1.27543
[50] training's multi_logloss: 0.961691 valid_1's multi_logloss: 1.27711
Early stopping, best iteration is:
[40] training's multi_logloss: 1.01747 valid_1's multi_logloss: 1.27543
----------------------------------------------------------------------------------------------------
Training until validation scores don't improve for 10 rounds.
[10] training's multi_logloss: 1.29012 valid_1's multi_logloss: 1.36028
[20] training's multi_logloss: 1.16467 valid_1's multi_logloss: 1.29577
[30] training's multi_logloss: 1.0841 valid_1's multi_logloss: 1.28103
[40] training's multi_logloss: 1.01747 valid_1's multi_logloss: 1.27543
[50] training's multi_logloss: 0.961691 valid_1's multi_logloss: 1.27711
Early stopping, best iteration is:
[40] training's multi_logloss: 1.01747 valid_1's multi_logloss: 1.27543
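The two logs are identical, confirming that both interfaces drive the same underlying booster. One caveat: the early_stopping_rounds/verbose/verbose_eval keyword arguments used above belong to older lightgbm releases; in recent versions (4.x) they were removed in favor of callbacks. A sketch of the equivalent calls, assuming lightgbm >= 4.0:

# sklearn wrapper with callbacks instead of the removed keyword arguments
clf = lgb.LGBMClassifier(**params_sklearn)
clf.fit(x1, y1, eval_set=watchlist,
        callbacks=[lgb.early_stopping(10), lgb.log_evaluation(10)])

# native API, same idea
x = lgb.train(params=params_naive, train_set=dtrain, valid_sets=[dtrain, dtest],
              num_boost_round=300,
              callbacks=[lgb.early_stopping(10), lgb.log_evaluation(10)])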
A few points deserve attention (a small parameter-translation sketch follows the list):
1. For multiclass, lgb.train requires both 'objective':'multiclass' and 'num_class':5, while the sklearn interface only needs 'objective':'multiclass' (it infers the number of classes from the labels).
2. In lgb.train the regularization parameters are 'lambda_l1' and 'lambda_l2'; in the sklearn interface they are 'reg_alpha' and 'reg_lambda'.
3. The number of boosting rounds is set as 'n_estimators':300 in the sklearn interface, at model construction time; with lgb.train it can be given either inside params or as the num_boost_round function argument.
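To make the aliasing concrete, here is a small hypothetical helper (SKLEARN_TO_NATIVE and to_native_params are names invented for this illustration, not part of lightgbm) that rewrites a sklearn-style dict into one lgb.train accepts:

# hypothetical mapping, for illustration only
SKLEARN_TO_NATIVE = {
    'reg_alpha': 'lambda_l1',
    'reg_lambda': 'lambda_l2',
    'n_estimators': 'num_boost_round',  # also accepted as an alias inside params
}

def to_native_params(sk_params, num_class):
    """Rewrite a sklearn-style param dict for lgb.train (hypothetical helper)."""
    native = {SKLEARN_TO_NATIVE.get(k, k): v for k, v in sk_params.items()}
    native['num_class'] = num_class  # lgb.train needs this explicitly for multiclass
    return native

print(to_native_params(params_sklearn, num_class=5))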