Contents
Loading Data
Parameter Settings
Training the Model
Cross-Validation
Early Stopping
Prediction
GOSS
EFB
LightGBM is an open-source gradient boosting framework from Microsoft that uses tree-based learning algorithms.
Documentation: official docs
Source code: GitHub
Chinese documentation: 中文文档
Paper: lightgbm-a-highly-efficient-gradient-boosting-decision-tree
Reference blog posts: "lightgbm,xgboost,gbdt的区别与联系" (Mata, 博客园)
"LightGBM原理之论文详解" (u010242233, CSDN)
Loading Data

LightGBM can load data of the following types.
Load a LibSVM text file or a LightGBM binary file:
import lightgbm as lgb
train_data = lgb.Dataset('train.svm.bin')
Load a NumPy array:
import numpy as np
data = np.random.rand(500, 10)  # 500 entities, each contains 10 features
label = np.random.randint(2, size=500)  # binary target
train_data = lgb.Dataset(data, label=label)
Load a scipy.sparse.csr_matrix array:
import scipy.sparse
csr = scipy.sparse.csr_matrix((dat, (row, col)))  # dat, row, col are pre-existing arrays
train_data = lgb.Dataset(csr)
Save a Dataset into a LightGBM binary file (this makes subsequent loading faster):
train_data = lgb.Dataset('train.svm.txt')
train_data.save_binary('train.bin')
Create validation data:
validation_data = train_data.create_valid('validation.svm')
When constructing the Dataset, categorical features need to be converted to integers first. You can also set free_raw_data=True (the default is True).
Instance weights are commonly provided as well:
w = np.random.rand(500, )
train_data = lgb.Dataset(data, label=label, weight=w)
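Putting these pieces together, here is a minimal sketch of building a Dataset with a categorical column, sample weights and free_raw_data; the DataFrame and its column names are made up for illustration:

import lightgbm as lgb
import numpy as np
import pandas as pd

df = pd.DataFrame({'f_num': np.random.rand(500),                 # numeric feature
                   'f_cat': np.random.randint(0, 4, size=500)})  # categorical feature, already integer-encoded
label = np.random.randint(2, size=500)
w = np.random.rand(500)
train_data = lgb.Dataset(df, label=label, weight=w,
                         categorical_feature=['f_cat'],
                         free_raw_data=False)  # False keeps the raw data for reuse; the default True frees it to save memory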
Parameter Settings

Booster parameters:
param = {'num_leaves': 31, 'num_trees': 100, 'objective': 'binary'}
param['metric'] = 'auc'  # or 'binary_logloss'
Core Parameters

- config, default = "", type = string, aliases: config_file
- task, default = train, type = enum, options: train, predict, convert_model, refit, aliases: task_type
  - train, for training, aliases: training
  - predict, for prediction, aliases: prediction, test
  - convert_model, for converting model file into if-else format, see more information in IO Parameters
  - refit, for refitting existing models with new data, aliases: refit_tree
- objective, default = regression, type = enum, options: regression, regression_l1, huber, fair, poisson, quantile, mape, gamma, tweedie, binary, multiclass, multiclassova, xentropy, xentlambda, lambdarank, aliases: objective_type, app, application
  - regression_l2, L2 loss, aliases: regression, mean_squared_error, mse, l2_root, root_mean_squared_error, rmse
  - regression_l1, L1 loss, aliases: mean_absolute_error, mae
  - huber, Huber loss
  - fair, Fair loss
  - poisson, Poisson regression
  - quantile, Quantile regression
  - mape, MAPE loss, aliases: mean_absolute_percentage_error
  - gamma, Gamma regression with log-link. It might be useful, e.g., for modeling insurance claims severity, or for any target that might be gamma-distributed
  - tweedie, Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any target that might be tweedie-distributed
  - binary, binary log loss classification (or logistic regression). Requires labels in {0, 1}; see cross-entropy application for general probability labels in [0, 1]
  - multiclass, softmax objective function, aliases: softmax; num_class should be set as well
  - multiclassova, One-vs-All binary objective function, aliases: multiclass_ova, ova, ovr; num_class should be set as well
  - xentropy, objective function for cross-entropy (with optional linear weights), aliases: cross_entropy
  - xentlambda, alternative parameterization of cross-entropy, aliases: cross_entropy_lambda
  - lambdarank, lambdarank application; the label should be int type, and a larger number represents higher relevance (e.g. 0:bad, 1:fair, 2:good, 3:perfect); all values in label must be smaller than the number of elements in label_gain
- boosting, default = gbdt, type = enum, options: gbdt, gbrt, rf, random_forest, dart, goss, aliases: boosting_type, boost
  - gbdt, traditional Gradient Boosting Decision Tree, aliases: gbrt
  - rf, Random Forest, aliases: random_forest
  - dart, Dropouts meet Multiple Additive Regression Trees
  - goss, Gradient-based One-Side Sampling
- data, default = "", type = string, aliases: train, train_data, train_data_file, data_filename
- valid, default = "", type = string, aliases: test, valid_data, valid_data_file, test_data, test_data_file, valid_filenames; multiple validation files are separated by ,
- num_iterations, default = 100, type = int, aliases: num_iteration, n_iter, num_tree, num_trees, num_round, num_rounds, num_boost_round, n_estimators, constraints: num_iterations >= 0
  - Note: internally LightGBM constructs num_class * num_iterations trees for multi-class classification problems
- learning_rate, default = 0.1, type = double, aliases: shrinkage_rate, eta, constraints: learning_rate > 0.0
  - in dart, it also affects the normalization weights of dropped trees
- num_leaves, default = 31, type = int, aliases: num_leaf, max_leaves, max_leaf, constraints: num_leaves > 1
- tree_learner, default = serial, type = enum, options: serial, feature, data, voting, aliases: tree, tree_type, tree_learner_type
  - serial, single machine tree learner
  - feature, feature parallel tree learner, aliases: feature_parallel
  - data, data parallel tree learner, aliases: data_parallel
  - voting, voting parallel tree learner, aliases: voting_parallel
- num_threads, default = 0, type = int, aliases: num_thread, nthread, nthreads, n_jobs
  - 0 means the default number of threads in OpenMP
- device_type, default = cpu, type = enum, options: cpu, gpu, aliases: device
  - for GPU it is recommended to use a smaller max_bin (e.g. 63) to get a better speed-up; set gpu_use_dp=true to enable 64-bit floating point, but it will slow down the training
- seed, default = None, type = int, aliases: random_seed, random_state
  - this seed is used to generate other seeds, e.g. data_random_seed, feature_fraction_seed, etc.
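As a quick illustration (the values here are arbitrary, not recommendations), a params dict touching only these core parameters might look like:

param = {
    'objective': 'binary',    # binary log loss classification
    'boosting': 'gbdt',       # traditional gradient boosting decision tree
    'num_iterations': 100,
    'learning_rate': 0.1,
    'num_leaves': 31,
    'num_threads': 0,         # 0 = default number of OpenMP threads
    'seed': 42,
}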
Learning Control Parameters
- max_depth, default = -1, type = int
  - limits the max depth of the tree; used to deal with over-fitting when #data is small. The tree still grows leaf-wise
  - < 0 means no limit
- min_data_in_leaf, default = 20, type = int, aliases: min_data_per_leaf, min_data, min_child_samples, constraints: min_data_in_leaf >= 0
- min_sum_hessian_in_leaf, default = 1e-3, type = double, aliases: min_sum_hessian_per_leaf, min_sum_hessian, min_hessian, min_child_weight, constraints: min_sum_hessian_in_leaf >= 0.0
  - like min_data_in_leaf, it can be used to deal with over-fitting
- bagging_fraction, default = 1.0, type = double, aliases: sub_row, subsample, bagging, constraints: 0.0 < bagging_fraction <= 1.0
  - like feature_fraction, but this will randomly select part of the data without resampling
  - Note: to enable bagging, bagging_freq should be set to a non-zero value as well
- bagging_freq, default = 0, type = int, aliases: subsample_freq
  - 0 means disable bagging; k means perform bagging at every k iterations
  - Note: to enable bagging, bagging_fraction should be set to a value smaller than 1.0 as well
- bagging_seed, default = 3, type = int, aliases: bagging_fraction_seed
- feature_fraction, default = 1.0, type = double, aliases: sub_feature, colsample_bytree, constraints: 0.0 < feature_fraction <= 1.0
  - LightGBM will randomly select part of the features on each iteration if feature_fraction is smaller than 1.0. For example, if you set it to 0.8, LightGBM will select 80% of features before training each tree
- feature_fraction_seed, default = 2, type = int
  - random seed for feature_fraction
- early_stopping_round, default = 0, type = int, aliases: early_stopping_rounds, early_stopping
  - will stop training if one metric of one validation data doesn't improve in the last early_stopping_round rounds
  - <= 0 means disable
- max_delta_step, default = 0.0, type = double, aliases: max_tree_output, max_leaf_output
  - used to limit the max output of tree leaves; the final max output of leaves is learning_rate * max_delta_step
  - <= 0 means no constraint
- lambda_l1, default = 0.0, type = double, aliases: reg_alpha, constraints: lambda_l1 >= 0.0
- lambda_l2, default = 0.0, type = double, aliases: reg_lambda, lambda, constraints: lambda_l2 >= 0.0
- min_gain_to_split, default = 0.0, type = double, aliases: min_split_gain, constraints: min_gain_to_split >= 0.0
- drop_rate, default = 0.1, type = double, aliases: rate_drop, constraints: 0.0 <= drop_rate <= 1.0
  - used only in dart
- max_drop, default = 50, type = int
  - used only in dart; <= 0 means no limit
- skip_drop, default = 0.5, type = double, constraints: 0.0 <= skip_drop <= 1.0
  - used only in dart
- xgboost_dart_mode, default = false, type = bool
  - used only in dart; set this to true if you want to use xgboost dart mode
- uniform_drop, default = false, type = bool
  - used only in dart; set this to true if you want to use uniform drop
- drop_seed, default = 4, type = int
  - used only in dart
- top_rate, default = 0.2, type = double, constraints: 0.0 <= top_rate <= 1.0
  - used only in goss; the retain ratio of large-gradient data
- other_rate, default = 0.1, type = double, constraints: 0.0 <= other_rate <= 1.0
  - used only in goss; the retain ratio of small-gradient data
- min_data_per_group, default = 100, type = int, constraints: min_data_per_group > 0
- max_cat_threshold, default = 32, type = int, constraints: max_cat_threshold > 0
- cat_l2, default = 10.0, type = double, constraints: cat_l2 >= 0.0
- cat_smooth, default = 10.0, type = double, constraints: cat_smooth >= 0.0
- max_cat_to_onehot, default = 4, type = int, constraints: max_cat_to_onehot > 0
  - when the number of categories of one feature is smaller than or equal to max_cat_to_onehot, the one-vs-other split algorithm will be used
- top_k, default = 20, type = int, aliases: topk, constraints: top_k > 0
- monotone_constraints, default = None, type = multi-int, aliases: mc, monotone_constraint
  - 1 means increasing, -1 means decreasing, 0 means non-constraint; e.g. mc=-1,0,1 means decreasing for the 1st feature, non-constraint for the 2nd feature and increasing for the 3rd feature
- feature_contri, default = None, type = multi-double, aliases: feature_contrib, fc, fp, feature_penalty
  - will use gain[i] = max(0, feature_contri[i]) * gain[i] to replace the split gain of the i-th feature
- forcedsplits_filename, default = "", type = string, aliases: fs, forced_splits_filename, forced_splits_file, forced_splits
  - path to a .json file that specifies splits to force at the top of every decision tree before best-first learning commences
  - the .json file can be arbitrarily nested, and each split contains feature and threshold fields, as well as left and right fields representing subsplits; for categorical splits, left represents the split containing the feature value and right represents the other values
- refit_decay_rate, default = 0.9, type = double, constraints: 0.0 <= refit_decay_rate <= 1.0
  - used only in the refit task (in the CLI version, or as an argument of the refit function in the language-specific packages); will use leaf_output = refit_decay_rate * old_leaf_output + (1.0 - refit_decay_rate) * new_leaf_output to refit trees
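For instance (again with arbitrary values), the over-fitting-related controls above are often combined like this:

param = {
    'objective': 'binary',
    'num_leaves': 31,
    'max_depth': -1,           # no depth limit, the tree still grows leaf-wise
    'min_data_in_leaf': 20,
    'feature_fraction': 0.8,   # 80% of the features for each tree
    'bagging_fraction': 0.8,   # 80% of the data ...
    'bagging_freq': 5,         # ... re-sampled every 5 iterations
    'lambda_l1': 0.1,
    'lambda_l2': 0.1,
}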
IO Parameters
- verbosity, default = 1, type = int, aliases: verbose
  - < 0: Fatal, = 0: Error (Warning), = 1: Info, > 1: Debug
- max_bin, default = 255, type = int, constraints: max_bin > 1
  - LightGBM will auto compress memory according to max_bin. For example, LightGBM will use uint8_t for feature values if max_bin=255
- min_data_in_bin, default = 3, type = int, constraints: min_data_in_bin > 0
- bin_construct_sample_cnt, default = 200000, type = int, aliases: subsample_for_bin, constraints: bin_construct_sample_cnt > 0
- histogram_pool_size, default = -1.0, type = double, aliases: hist_pool_size
  - < 0 means no limit
- data_random_seed, default = 1, type = int, aliases: data_seed
  - random seed for data partition in parallel learning (excluding the feature_parallel mode)
- output_model, default = LightGBM_model.txt, type = string, aliases: model_output, model_out
- snapshot_freq, default = -1, type = int, aliases: save_period
  - frequency of saving model file snapshots, e.g. the model file will be snapshotted at every iteration if snapshot_freq=1
- input_model, default = "", type = string, aliases: model_input, model_in
  - for the prediction task, this model will be applied to the prediction data; for the train task, training will be continued from this model
- output_result, default = LightGBM_predict_result.txt, type = string, aliases: predict_result, prediction_result, predict_name, prediction_name, pred_name, name_pred
  - used only in the prediction task
- initscore_filename, default = "", type = string, aliases: init_score_filename, init_score_file, init_score, input_init_score
  - if "", will use train_data_file + .init (if it exists)
- valid_data_initscores, default = "", type = string, aliases: valid_data_init_scores, valid_init_score_file, valid_init_score
  - if "", will use valid_data_file + .init (if it exists); separate by , for multi-validation data
- pre_partition, default = false, type = bool, aliases: is_pre_partition
  - used for parallel learning (excluding the feature_parallel mode); true if training data are pre-partitioned, and different machines use different partitions
- enable_bundle, default = true, type = bool, aliases: is_enable_bundle, bundle
  - set this to false to disable Exclusive Feature Bundling (EFB), which is described in LightGBM: A Highly Efficient Gradient Boosting Decision Tree
- max_conflict_rate, default = 0.0, type = double, constraints: 0.0 <= max_conflict_rate < 1.0
  - set this to 0.0 to disallow conflicts and provide more accurate results
- is_enable_sparse, default = true, type = bool, aliases: is_sparse, enable_sparse, sparse
- sparse_threshold, default = 0.8, type = double, constraints: 0.0 < sparse_threshold <= 1.0
- use_missing, default = true, type = bool
  - set this to false to disable the special handling of missing values
- zero_as_missing, default = false, type = bool
  - set this to true to treat all zeros as missing values (including the unshown values in libsvm/sparse matrices); set this to false to use na for representing missing values
- two_round, default = false, type = bool, aliases: two_round_loading, use_two_round_loading
  - set this to true if the data file is too big to fit in memory
- save_binary, default = false, type = bool, aliases: is_save_binary, is_save_binary_file
  - if true, LightGBM will save the dataset (including validation data) to a binary file; this speeds up data loading the next time
- header, default = false, type = bool, aliases: has_header
  - set this to true if the input data has a header
- label_column, default = "", type = int or string, aliases: label
  - use a number for index, e.g. label=0 means column_0 is the label; add the prefix name: for a column name, e.g. label=name:is_click
- weight_column, default = "", type = int or string, aliases: weight
  - use a number for index, e.g. weight=0 means column_0 is the weight; add the prefix name: for a column name, e.g. weight=name:weight
  - Note: the index starts from 0 and does not count the label column when the passed type is int, e.g. when the label is column_0 and the weight is column_1, the correct parameter is weight=0
- group_column, default = "", type = int or string, aliases: group, group_id, query_column, query, query_id
  - use a number for index, e.g. query=0 means column_0 is the query id; add the prefix name: for a column name, e.g. query=name:query_id
  - Note: the index starts from 0 and does not count the label column when the passed type is int, e.g. when the label is column_0 and the query_id is column_1, the correct parameter is query=0
- ignore_column, default = "", type = multi-int or string, aliases: ignore_feature, blacklist
  - e.g. ignore_column=0,1,2 means column_0, column_1 and column_2 will be ignored; add the prefix name: for column names, e.g. ignore_column=name:c1,c2,c3 means c1, c2 and c3 will be ignored
  - Note: the index starts from 0 and does not count the label column when the passed type is int
- categorical_feature, default = "", type = multi-int or string, aliases: cat_feature, categorical_column, cat_column
  - e.g. categorical_feature=0,1,2 means column_0, column_1 and column_2 are categorical features; add the prefix name: for column names, e.g. categorical_feature=name:c1,c2,c3 means c1, c2 and c3 are categorical features
  - Note: only supports categorical features with int type; the index starts from 0 and does not count the label column when the passed type is int; all values should be less than Int32.MaxValue (2147483647)
- predict_raw_score, default = false, type = bool, aliases: is_predict_raw_score, predict_rawscore, raw_score
  - used only in the prediction task; set this to true to predict only the raw scores, false to predict transformed scores
- predict_leaf_index, default = false, type = bool, aliases: is_predict_leaf_index, leaf_index
  - used only in the prediction task; set this to true to predict with the leaf index of all trees
- predict_contrib, default = false, type = bool, aliases: is_predict_contrib, contrib
  - used only in the prediction task; set this to true to estimate SHAP values, which represent how each feature contributes to each prediction; produces #features + 1 values where the last value is the expected value of the model output over the training data
- num_iteration_predict, default = -1, type = int
  - used only in the prediction task; <= 0 means no limit
- pred_early_stop, default = false, type = bool
  - used only in the prediction task; if true, will use early stopping to speed up the prediction, which may affect the accuracy
- pred_early_stop_freq, default = 10, type = int
  - used only in the prediction task
- pred_early_stop_margin, default = 10.0, type = double
  - used only in the prediction task
- convert_model_language, default = "", type = string
  - used only in the convert_model task; only cpp is supported yet; if convert_model_language is set and task=train, the model will also be converted
- convert_model, default = gbdt_prediction.cpp, type = string, aliases: convert_model_file
  - used only in the convert_model task
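In the Python API, the prediction-related IO parameters above correspond to keyword arguments of Booster.predict; a small sketch, assuming a trained booster bst and a feature matrix data:

raw_scores = bst.predict(data, raw_score=True)     # predict_raw_score
leaf_idx = bst.predict(data, pred_leaf=True)       # predict_leaf_index: leaf index in every tree
shap_vals = bst.predict(data, pred_contrib=True)   # predict_contrib: #features + 1 SHAP values per row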
Objective Parameters
- num_class, default = 1, type = int, aliases: num_classes, constraints: num_class > 0
  - used only in multi-class classification applications
- is_unbalance, default = false, type = bool, aliases: unbalance, unbalanced_sets
  - used only in the binary application; set this to true if the training data are unbalanced
  - Note: cannot be used together with scale_pos_weight, choose only one of them
- scale_pos_weight, default = 1.0, type = double, constraints: scale_pos_weight > 0.0
  - used only in the binary application
  - Note: cannot be used together with is_unbalance, choose only one of them
- sigmoid, default = 1.0, type = double, constraints: sigmoid > 0.0
  - used only in binary and multiclassova classification and in lambdarank applications
- boost_from_average, default = true, type = bool
  - used only in regression, binary and cross-entropy applications
- reg_sqrt, default = false, type = bool
  - used only in the regression application; fit sqrt(label) instead of the original values, and the prediction result will also be automatically converted to prediction^2
- alpha, default = 0.9, type = double, constraints: alpha > 0.0
  - used only in huber and quantile regression applications
- fair_c, default = 1.0, type = double, constraints: fair_c > 0.0
  - used only in the fair regression application
- poisson_max_delta_step, default = 0.7, type = double, constraints: poisson_max_delta_step > 0.0
  - used only in the poisson regression application
- tweedie_variance_power, default = 1.5, type = double, constraints: 1.0 <= tweedie_variance_power < 2.0
  - used only in the tweedie regression application; set this closer to 2 to shift towards a Gamma distribution, closer to 1 to shift towards a Poisson distribution
- max_position, default = 20, type = int, constraints: max_position > 0
  - used only in the lambdarank application
- label_gain, default = 0,1,3,7,15,31,63,...,2^30-1, type = multi-double
  - used only in the lambdarank application; relevant gain for labels, e.g. the gain of label 2 is 3 in case of the default label gains; separate by ,
Metric Parameters
- metric, default = "", type = multi-enum, aliases: metrics, metric_types
  - "" (empty string or not specified) means that the metric corresponding to the specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added)
  - "None" (string, not a None value) means that no metric will be registered, aliases: na, null, custom
  - l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1
  - l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression
  - l2_root, root square loss, aliases: root_mean_squared_error, rmse
  - quantile, Quantile regression
  - mape, MAPE loss, aliases: mean_absolute_percentage_error
  - huber, Huber loss
  - fair, Fair loss
  - poisson, negative log-likelihood for Poisson regression
  - gamma, negative log-likelihood for Gamma regression
  - gamma_deviance, residual deviance for Gamma regression
  - tweedie, negative log-likelihood for Tweedie regression
  - ndcg, NDCG, aliases: lambdarank
  - map, MAP, aliases: mean_average_precision
  - auc, AUC
  - binary_logloss, log loss, aliases: binary
  - binary_error, for one sample: 0 for correct classification, 1 for error classification
  - multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr
  - multi_error, error rate for multi-class classification
  - xentropy, cross-entropy (with optional linear weights), aliases: cross_entropy
  - xentlambda, "intensity-weighted" cross-entropy, aliases: cross_entropy_lambda
  - kldiv, Kullback-Leibler divergence, aliases: kullback_leibler
  - multiple metrics are supported, separated by ,
- metric_freq, default = 1, type = int, aliases: output_freq, constraints: metric_freq > 0
- is_provide_training_metric, default = false, type = bool, aliases: training_metric, is_training_metric, train_metric
  - set this to true to output metric results over the training dataset
- eval_at, default = 1,2,3,4,5, type = multi-int, aliases: ndcg_eval_at, ndcg_at, map_eval_at, map_at
  - used only with ndcg and map metrics; separated by ,
Training the Model

num_round = 10
bst = lgb.train(param, train_data, num_round, valid_sets=[validation_data])
lightgbm.train(params, train_set, num_boost_round=100, valid_sets=None, valid_names=None, fobj=None, feval=None, init_model=None, feature_name='auto', categorical_feature='auto', early_stopping_rounds=None, evals_result=None, verbose_eval=True, learning_rates=None, keep_training_booster=False, callbacks=None)
- params (dict) – Parameters for training.
- train_set (Dataset) – Data to be trained on.
- num_boost_round (int, optional (default=100)) – Number of boosting iterations.
- valid_sets (list of Datasets or None, optional (default=None)) – List of data to be evaluated on during training.
- valid_names (list of strings or None, optional (default=None)) – Names of valid_sets.
- fobj (callable or None, optional (default=None)) – Customized objective function.
- feval (callable or None, optional (default=None)) – Customized evaluation function. Should accept two parameters: preds, train_data, and return (eval_name, eval_result, is_higher_better) or list of such tuples. For multi-class task, the preds is group by class_id first, then group by row_id. If you want to get i-th row preds in j-th class, the access way is preds[j * num_data + i]. To ignore the default metric corresponding to the used objective, set the metric parameter to the string "None" in params. (A sketch is shown after this list.)
- init_model (string, Booster or None, optional (default=None)) – Filename of LightGBM model or Booster instance used for continue training.
- feature_name (list of strings or 'auto', optional (default="auto")) – Feature names. If 'auto' and data is pandas DataFrame, data columns names are used.
- categorical_feature (list of strings or int, or 'auto', optional (default="auto")) – Categorical features. If list of int, interpreted as indices. If list of strings, interpreted as feature names (need to specify feature_name as well). If 'auto' and data is pandas DataFrame, pandas categorical columns are used. All values in categorical features should be less than int32 max value (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero. All negative values in categorical features will be treated as missing values.
- early_stopping_rounds (int or None, optional (default=None)) – Activates early stopping. The model will train until the validation score stops improving. Validation score needs to improve at least every early_stopping_rounds round(s) to continue training. Requires at least one validation data and one metric. If there's more than one, will check all of them. But the training data is ignored anyway. To check only the first metric you can pass in callbacks the early_stopping callback with first_metric_only=True. The index of iteration that has the best performance will be saved in the best_iteration field if early stopping logic is enabled by setting early_stopping_rounds.
- evals_result (dict or None, optional (default=None)) – This dictionary is used to store all evaluation results of all the items in valid_sets. Example: with valid_sets = [valid_set, train_set], valid_names = ['eval', 'train'] and params = {'metric': 'logloss'}, it returns {'train': {'logloss': ['0.48253', '0.35953', ...]}, 'eval': {'logloss': ['0.480385', '0.357756', ...]}}.
- verbose_eval (bool or int, optional (default=True)) – Requires at least one validation data. If True, the eval metric on the valid set is printed at each boosting stage. If int, the eval metric on the valid set is printed at every verbose_eval boosting stage. The last boosting stage or the boosting stage found by using early_stopping_rounds is also printed. Example: with verbose_eval = 4 and at least one item in valid_sets, an evaluation metric is printed every 4 (instead of 1) boosting stages.
- learning_rates (list, callable or None, optional (default=None)) – List of learning rates for each boosting round or a customized function that calculates learning_rate in terms of current number of round (e.g. yields learning rate decay).
- keep_training_booster (bool, optional (default=False)) – Whether the returned Booster will be used to keep training. If False, the returned value will be converted into _InnerPredictor before returning. You can still use _InnerPredictor as init_model for future continue training.
- callbacks (list of callables or None, optional (default=None)) – List of callback functions that are applied at each iteration. See Callbacks in Python API for more information.
After training, the model can be saved:
bst.save_model('model.txt')  # text format
json_model = bst.dump_model()  # JSON format
A saved model can be loaded directly:
bst = lgb.Booster(model_file='model.txt')  # init model
Cross-Validation

num_round = 10
lgb.cv(param, train_data, num_round, nfold=5)  # 5-fold CV, 10 boosting rounds per fold
Early Stopping

If you have a validation set, you can use early stopping to find the optimal number of boosting rounds. The model will train until the validation score stops improving; the score needs to improve at least once every early_stopping_rounds rounds for training to continue.
bst = lgb.train(param, train_data, num_round, valid_sets=valid_sets, early_stopping_rounds=10)
bst.save_model('model.txt', num_iteration=bst.best_iteration)
Prediction

Once the model is trained, it can be used to predict on new data:
# 7 entities, each contains 10 features
data = np.random.rand(7, 10)
ypred = bst.predict(data)
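If early stopping was enabled during training, you can get predictions from the best iteration:
ypred = bst.predict(data, num_iteration=bst.best_iteration)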
I skimmed the paper; the formulas and algorithm flow are still a bit beyond me, but I got a rough idea of where LightGBM's improvements lie.
LightGBM mainly improves on computation speed and on sparse, high-dimensional data: for the speed problem it proposes GOSS (Gradient-based One-Side Sampling), and for the sparsity problem it proposes EFB (Exclusive Feature Bundling).
GOSS

The core idea is to drop part of the instances with small gradients and use the remaining large-gradient instances to estimate the information gain; the rationale is that instances with large gradients play a more important role in the information-gain computation. The rough procedure is sketched below.
The exact gain-estimation formula and the algorithm pseudocode are given in the paper (I will not pretend to fully follow them), but the sampling procedure itself is straightforward.
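Here is a rough NumPy sketch of the GOSS sampling step, my own illustration of the idea rather than LightGBM's internal code; a and b play the roles of top_rate and other_rate from the parameters above:

import numpy as np

def goss_sample(grad, a=0.2, b=0.1):
    # Keep the top a*100% instances by |gradient|, randomly sample b*100% of the rest,
    # and up-weight the sampled small-gradient instances by (1 - a) / b so that the
    # information-gain estimate stays (approximately) unbiased.
    n = len(grad)
    top_n, rand_n = int(a * n), int(b * n)
    order = np.argsort(-np.abs(grad))                  # sort by |gradient|, descending
    top_idx = order[:top_n]                            # large-gradient instances: always kept
    sampled_idx = np.random.choice(order[top_n:], size=rand_n, replace=False)
    used_idx = np.concatenate([top_idx, sampled_idx])
    weights = np.ones(len(used_idx))
    weights[top_n:] *= (1.0 - a) / b                   # amplify the small-gradient samples
    return used_idx, weights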
EFB

The core idea is to bundle mutually exclusive features together to reduce the number of features. The algorithm pseudocode is given in the paper.
Two problems need to be solved: (1) deciding which features should be bundled together, and (2) how to construct the bundle.
Partitioning features into a minimal number of exclusive bundles is NP-hard, so the bundling problem is first reduced to a graph-coloring problem: treat features as vertices and add an edge between every two features that are not mutually exclusive, then run a greedy graph-coloring algorithm, which yields reasonably good bundles (with a constant approximation ratio). If a small degree of conflict between features is allowed, even fewer bundles are produced, which further improves computational efficiency.
The key to constructing a bundle is to make sure the original feature values can still be identified from the bundled feature. Since the histogram-based algorithm stores discrete bins rather than continuous feature values, a bundle can be built by letting mutually exclusive features occupy different bin ranges, i.e. by adding offsets to their original bin values.
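And a toy sketch of EFB's merge step, again my own illustration rather than library code; it assumes the feature values are already histogram bin indices and that the bundled features are (almost) never non-zero on the same row:

import numpy as np

def merge_exclusive_features(features, num_bins):
    # Pack mutually exclusive features into a single feature by shifting each
    # feature's bin values with a cumulative offset, so the original value can
    # still be recovered from the bundled feature.
    bundled = np.zeros(len(features[0]), dtype=np.int64)
    offset = 0
    for f, bins in zip(features, num_bins):
        nonzero = f != 0
        bundled[nonzero] = f[nonzero] + offset   # this feature occupies its own bin range
        offset += bins
    return bundled

# two one-hot-style features that are never non-zero together
f1 = np.array([1, 0, 2, 0, 0])
f2 = np.array([0, 3, 0, 1, 0])
print(merge_exclusive_features([f1, f2], num_bins=[3, 4]))  # -> [1 6 2 4 0]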