Comparing LightGBM's native API with its sklearn wrapper

Table of contents

  • 1. Preparing the data
  • 2. The LightGBM native API
  • 3. The LightGBM sklearn wrapper
  • 4. Training with early stopping
  • 5. Summary

Like xgboost, LightGBM is implemented in C++ and exposes a Python interface. And, as with xgboost, the interface comes in two flavors: the native API (lgb.train), and a sklearn wrapper provided for consistency with sklearn, the most widely used machine-learning library.
The native API and the sklearn wrapper still differ in a few ways, which the following simple experiment demonstrates.

1. Preparing the data

First, generate a multiclass dataset with sklearn's make_classification:

import pandas as pd
import lightgbm as lgb
from sklearn.datasets import make_classification

# 5 imbalanced classes: weights covers the first four, the fifth gets the remainder
data = make_classification(n_samples=10000, n_features=20, n_informative=4, n_redundant=2,
                           n_repeated=0, n_classes=5, n_clusters_per_class=2,
                           weights=[0.05, 0.1, 0.1, 0.5], flip_y=0.4, class_sep=1.0,
                           hypercube=True, shift=0.0, scale=1.0,
                           shuffle=True, random_state=2018)
df = pd.DataFrame(data[0])
df['label'] = data[1]

df.label.value_counts()

3    3789
4    2266
2    1418
1    1413
0    1114
Name: label, dtype: int64

Split the data into training and validation sets:

from sklearn.model_selection import train_test_split

# default 75/25 split: (x1, y1) is the training set, (x2, y2) the validation set
x1, x2, y1, y2 = train_test_split(df.drop(['label'], axis=1), df.label)

2. The LightGBM native API

With the native API, a model is trained via lgb.train, and the model parameters are passed in as a plain dict:

params_naive = {
    "learning_rate": 0.1,
    "max_bin": 150,
    "num_leaves": 32,
    "max_depth": 11,

    "lambda_l1": 0.1,
    "lambda_l2": 0.2,

    "objective": "multiclass",
    "num_class": 5,  # the native API needs the number of classes explicitly
}
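
As a minimal sketch (the full training run with early stopping follows in section 4), the native workflow wraps the data in a lgb.Dataset and calls lgb.train; note that for a multiclass objective, the returned booster's predict gives class probabilities rather than labels:

# wrap the training split in a Dataset and train a booster
dtrain = lgb.Dataset(x1, label=y1)
booster = lgb.train(params=params_naive, train_set=dtrain, num_boost_round=100)

# predict returns class probabilities of shape (n_samples, num_class)
proba = booster.predict(x2)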

3. The LightGBM sklearn wrapper

For consistency with sklearn, LightGBM also provides a sklearn-style interface; here the model parameters are supplied when the classifier is constructed:

params_sklearn = {
    'learning_rate': 0.1,
    'max_bin': 150,
    'num_leaves': 32,
    'max_depth': 11,

    'reg_alpha': 0.1,
    'reg_lambda': 0.2,

    'objective': 'multiclass',
    'n_estimators': 300,
    # 'class_weight': weight
}
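
A minimal sketch of the sklearn-style workflow with these parameters; unlike the native booster, here predict returns class labels, while predict_proba returns the probability matrix:

# the familiar sklearn fit/predict workflow
clf = lgb.LGBMClassifier(**params_sklearn)
clf.fit(x1, y1)

labels = clf.predict(x2)       # shape (n_samples,): predicted class labels
proba = clf.predict_proba(x2)  # shape (n_samples, 5): class probabilities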

4. Training with early stopping

watchlist = [(x1, y1), (x2, y2)]

# sklearn wrapper: early stopping is configured in fit()
clf = lgb.LGBMClassifier(**params_sklearn)
clf.fit(x1, y1, early_stopping_rounds=10, eval_set=watchlist, verbose=10)
print('-' * 100)

# native API: early stopping is configured in lgb.train()
dtrain = lgb.Dataset(x1, label=y1)
dtest = lgb.Dataset(x2, label=y2)

x = lgb.train(params=params_naive, train_set=dtrain, valid_sets=[dtrain, dtest],
              verbose_eval=10, early_stopping_rounds=10, num_boost_round=300)

The training logs are shown below; the two APIs produce identical results:

Training until validation scores don't improve for 10 rounds.
[10]	training's multi_logloss: 1.29012	valid_1's multi_logloss: 1.36028
[20]	training's multi_logloss: 1.16467	valid_1's multi_logloss: 1.29577
[30]	training's multi_logloss: 1.0841	valid_1's multi_logloss: 1.28103
[40]	training's multi_logloss: 1.01747	valid_1's multi_logloss: 1.27543
[50]	training's multi_logloss: 0.961691	valid_1's multi_logloss: 1.27711
Early stopping, best iteration is:
[40]	training's multi_logloss: 1.01747	valid_1's multi_logloss: 1.27543
----------------------------------------------------------------------------------------------------
Training until validation scores don't improve for 10 rounds.
[10]	training's multi_logloss: 1.29012	valid_1's multi_logloss: 1.36028
[20]	training's multi_logloss: 1.16467	valid_1's multi_logloss: 1.29577
[30]	training's multi_logloss: 1.0841	valid_1's multi_logloss: 1.28103
[40]	training's multi_logloss: 1.01747	valid_1's multi_logloss: 1.27543
[50]	training's multi_logloss: 0.961691	valid_1's multi_logloss: 1.27711
Early stopping, best iteration is:
[40]	training's multi_logloss: 1.01747	valid_1's multi_logloss: 1.27543
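
Note that the fit()/train() signatures above are for older LightGBM releases. In LightGBM 4.0 and later, early_stopping_rounds, verbose, and verbose_eval were removed from the sklearn fit() and from lgb.train() in favor of callbacks; a sketch of the equivalent calls:

# LightGBM >= 4.0: early stopping and logging go through callbacks
callbacks = [lgb.early_stopping(stopping_rounds=10), lgb.log_evaluation(period=10)]

clf = lgb.LGBMClassifier(**params_sklearn)
clf.fit(x1, y1, eval_set=watchlist, callbacks=callbacks)

x = lgb.train(params=params_naive, train_set=dtrain, valid_sets=[dtrain, dtest],
              num_boost_round=300, callbacks=callbacks)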

5. Summary

A few differences worth noting:
1. For multiclass problems, lgb.train requires "num_class": 5 in addition to 'objective': 'multiclass', whereas the sklearn wrapper only needs 'objective': 'multiclass' and infers the number of classes from the labels.
2. The regularization parameters are called "lambda_l1" and "lambda_l2" in lgb.train, but 'reg_alpha' and 'reg_lambda' in the sklearn wrapper.
3. The number of boosting rounds is set as 'n_estimators': 300 when constructing the sklearn model, while with lgb.train it can be given either in the params dict or as a function argument (see the sketch below).
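
For point 3, a small sketch of the two equivalent ways to set the round count for the native API. To my understanding, if both are supplied, LightGBM uses the value found in params and warns that the function argument is ignored:

# option 1: as a function argument to lgb.train
booster = lgb.train(params=params_naive, train_set=dtrain, num_boost_round=300)

# option 2: inside the params dict ("num_iterations"; "num_boost_round" and
# "n_estimators" are accepted aliases)
params_naive["num_iterations"] = 300
booster = lgb.train(params=params_naive, train_set=dtrain)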
