1. Calling feature importance through xgboost's native interface and through the sklearn interface works differently:
bst = xgb.train(param, d1_train, num_boost_round=100, evals=watch_list)
xgc = xgb.XGBClassifier(objective='binary:logistic', seed=10086, **bst_params)
xgc.feature_importances_ is equivalent to xgc.get_booster().get_fscore(), which in turn is equivalent to xgc.get_booster().get_score(importance_type="weight").
With the native interface you call bst.get_fscore() or bst.get_score(importance_type="weight") directly.
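For concreteness, here is a minimal sketch on toy data (the dataset and parameters are made up for illustration; d1_train, watch_list and bst_params from the snippets above are not reproduced here):

import numpy as np
import xgboost as xgb

# toy data standing in for the post's d1_train
X = np.random.rand(200, 5)
y = (X[:, 0] > 0.5).astype(int)

# native interface: train a Booster and read the split counts directly
dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=10)
print(bst.get_fscore())                          # {'f0': ..., 'f1': ..., ...}
print(bst.get_score(importance_type='weight'))   # same dict

# sklearn interface: the same information sits behind get_booster()
xgc = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10)
xgc.fit(X, y)
print(xgc.get_booster().get_score(importance_type='weight'))
print(xgc.feature_importances_)   # in the version discussed here: the counts above, normalized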
2. Taking the sklearn interface as an example, xgboost exposes feature importance through two routes:
xgc.feature_importances_、xgb.plot_importance(xgc, max_num_features=10)
However, the two produce different results:
Some online posts claim that one uses importance_type gain and the other weight. That is not the case; looking at the source code:
@property
def feature_importances_(self):
    """
    Returns
    -------
    feature_importances_ : array of shape = [n_features]
    """
    b = self.get_booster()
    fs = b.get_fscore()
    all_features = [fs.get(f, 0.) for f in b.feature_names]
    all_features = np.array(all_features, dtype=np.float32)
    return all_features / all_features.sum()
def plot_importance(booster, ax=None, height=0.2,
                    xlim=None, ylim=None, title='Feature importance',
                    xlabel='F score', ylabel='Features',
                    importance_type='weight', max_num_features=None,
                    grid=True, show_values=True, **kwargs):
    """Plot importance based on fitted trees.

    Parameters
    ----------
    booster : Booster, XGBModel or dict
        Booster or XGBModel instance, or dict taken by Booster.get_fscore()
    ax : matplotlib Axes, default None
        Target axes instance. If None, new figure and axes will be created.
    grid : bool, Turn the axes grids on or off. Default is True (On).
    importance_type : str, default "weight"
        How the importance is calculated: either "weight", "gain", or "cover"
        "weight" is the number of times a feature appears in a tree
        "gain" is the average gain of splits which use the feature
        "cover" is the average coverage of splits which use the feature
            where coverage is defined as the number of samples affected by the split
    max_num_features : int, default None
        Maximum number of top features displayed on plot. If None, all features will be displayed.
    height : float, default 0.2
        Bar height, passed to ax.barh()
    xlim : tuple, default None
        Tuple passed to axes.xlim()
    ylim : tuple, default None
        Tuple passed to axes.ylim()
    title : str, default "Feature importance"
        Axes title. To disable, pass None.
    xlabel : str, default "F score"
        X axis title label. To disable, pass None.
    ylabel : str, default "Features"
        Y axis title label. To disable, pass None.
    show_values : bool, default True
        Show values on plot. To disable, pass False.
    kwargs :
        Other keywords passed to ax.barh()

    Returns
    -------
    ax : matplotlib Axes
    """
    # TODO: move this to compat.py
    try:
        import matplotlib.pyplot as plt
    except ImportError:
        raise ImportError('You must install matplotlib to plot importance')

    if isinstance(booster, XGBModel):
        importance = booster.get_booster().get_score(importance_type=importance_type)
    elif isinstance(booster, Booster):
        importance = booster.get_score(importance_type=importance_type)
    elif isinstance(booster, dict):
        importance = booster
    else:
        raise ValueError('tree must be Booster, XGBModel or dict instance')

    if len(importance) == 0:
        raise ValueError('Booster.get_score() results in empty')
From the source we can see that both use importance_type 'weight', i.e. the number of times a feature is used for splitting; the only difference is that feature_importances_ normalizes the returned values.
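A quick way to check this on a fitted sklearn-interface model (a sketch, reusing the hypothetical xgc classifier from the example above):

import numpy as np

booster = xgc.get_booster()
fs = booster.get_fscore()   # raw split counts keyed by feature name
raw = np.array([fs.get(f, 0.) for f in booster.feature_names], dtype=np.float32)

# feature_importances_ is just the split counts divided by their sum
print(np.allclose(xgc.feature_importances_, raw / raw.sum()))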
3. Comparing the different values of importance_type
It turns out that the variable importance rankings from the two methods differ.
In fact, feature importance can be measured along three different dimensions, and in practice each of the three options can yield a very different ranking (see the sketch after this list):
Weight: the number of times a feature is used to split the data across all trees.
Cover: the number of data points that pass through the splits made on the feature, across all trees.
Gain: the average reduction in training loss obtained when splitting on the feature.
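A sketch of how the three rankings can be compared on one model (again assuming the bst Booster trained earlier; feature names default to f0, f1, ...):

for imp_type in ('weight', 'gain', 'cover'):
    scores = bst.get_score(importance_type=imp_type)
    ranking = sorted(scores, key=scores.get, reverse=True)
    print(imp_type, ranking)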
Finally, SHAP is recommended for measuring feature importance; see the blog post referenced here, as well as the earlier explanation of how feature_importance works.
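A minimal SHAP sketch (assuming the shap package is installed, and reusing the hypothetical xgc and X from the example above):

import shap

explainer = shap.TreeExplainer(xgc)      # tree explainer for xgboost models
shap_values = explainer.shap_values(X)   # per-sample, per-feature contributions
shap.summary_plot(shap_values, X)        # global view of feature importance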
The early part of this article was based on this blog post.