The plot_importance method in xgboost has several built-in ways of computing feature importance.
def plot_importance(booster, ax=None, height=0.2,
                    xlim=None, ylim=None, title='Feature importance',
                    xlabel='F score', ylabel='Features',
                    importance_type='weight', max_num_features=None,
                    grid=True, show_values=True, **kwargs):
    """Plot importance based on fitted trees.

    Parameters
    ----------
    booster : Booster, XGBModel or dict
        Booster or XGBModel instance, or dict taken by Booster.get_fscore()
    ax : matplotlib Axes, default None
        Target axes instance. If None, new figure and axes will be created.
    grid : bool, Turn the axes grids on or off. Default is True (On).
    importance_type : str, default "weight"
        How the importance is calculated: either "weight", "gain", or "cover"

        * "weight" is the number of times a feature appears in a tree
        * "gain" is the average gain of splits which use the feature
        * "cover" is the average coverage of splits which use the feature
          where coverage is defined as the number of samples affected by the split
    max_num_features : int, default None
        Maximum number of top features displayed on plot. If None, all features will be displayed.
    height : float, default 0.2
        Bar height, passed to ax.barh()
    xlim : tuple, default None
        Tuple passed to axes.xlim()
    ylim : tuple, default None
        Tuple passed to axes.ylim()
    title : str, default "Feature importance"
        Axes title. To disable, pass None.
    xlabel : str, default "F score"
        X axis title label. To disable, pass None.
    ylabel : str, default "Features"
        Y axis title label. To disable, pass None.
    show_values : bool, default True
        Show values on plot. To disable, pass False.
    kwargs :
        Other keywords passed to ax.barh()

    Returns
    -------
    ax : matplotlib Axes
    """
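The signature of plot_importance is shown above.
Two things follow from it:
1. If no axis labels are given, the x-axis label defaults to "F score" and the y-axis label to "Features".
2. There are three importance types: weight, gain, and cover. Each is summarized below.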
* "weight" is the number of times a feature appears in a tree
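As the description suggests, weight measures a feature's importance by the number of times it is used to split a node, counted over all trees in the model rather than a single tree.
In general, weight assigns higher values to numerical features. A continuous feature takes many distinct values, so it offers more candidate split points and ends up being used more often. As a result, the weight metric can easily mask important categorical features.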
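To make the counting concrete, here is a minimal sketch that tallies weight from a hand-written list of splits. The `splits` list, the feature names `age` and `is_member`, and the helper `weight_importance` are all hypothetical and made up for illustration; they are not output from a real xgboost model.

```python
from collections import Counter

# Hypothetical splits from a two-tree model: (tree index, feature used).
# "age" is numerical and can be cut at many thresholds; "is_member" is binary,
# so it can be split on at most once along any root-to-leaf path.
splits = [
    (0, "is_member"),
    (0, "age"),
    (0, "age"),
    (1, "age"),
    (1, "is_member"),
    (1, "age"),
]


def weight_importance(splits):
    """Count how many times each feature is used to split, over all trees."""
    return Counter(feature for _tree, feature in splits)


weights = weight_importance(splits)
print(weights["age"], weights["is_member"])  # 4 2
```

Even in this toy model the numerical feature collects twice as many splits as the binary one, regardless of how useful each split actually is.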
* "gain" is the average gain of splits which use the feature
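gain measures how much a split improves the model. In a classic decision tree this would be the reduction in entropy; in xgboost it is the reduction in the training loss, computed from the gradient statistics of the samples in the node. The larger the improvement when splitting on a feature, the more important that feature is.
This is the same idea as ranking features by information gain in feature selection, with the training loss taking the place of entropy.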
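As a rough illustration, the sketch below computes the gain of one split using the split-gain formula from the XGBoost paper, then averages gains per feature the way the "gain" importance type is described. The function names, the feature names, and all numbers are assumptions chosen for the example, not values taken from a trained model.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of one split, from summed gradients (g) and Hessians (h)."""

    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


def gain_importance(splits):
    """Average gain per feature over (feature, gain) pairs."""
    totals, counts = {}, {}
    for feature, gain in splits:
        totals[feature] = totals.get(feature, 0.0) + gain
        counts[feature] = counts.get(feature, 0) + 1
    return {f: totals[f] / counts[f] for f in totals}


# One split that separates negative gradients from positive ones.
gain = split_gain(g_left=-4.0, h_left=4.0, g_right=4.0, h_right=4.0, reg_lambda=0.0)
print(gain)  # 4.0

# Hypothetical per-split gains: "is_member" is used once but with a large gain.
avg_gain = gain_importance([("age", 4.0), ("age", 2.0), ("is_member", 9.0)])
print(avg_gain)  # {'age': 3.0, 'is_member': 9.0}
```

Here the feature that is used only once ranks first by gain, which is the opposite of what a pure split count would report.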
* "cover" is the average coverage of splits which use the feature
where coverage is defined as the number of samples affected by the split
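cover is computed as the total number of samples covered by the nodes that split on a feature, divided by the number of times the feature is used to split. The closer a split is to the root of the tree, the larger its cover.
cover is better suited to categorical features. Because it counts samples rather than loss values, it also does not overfit the objective function and is not affected by the scale of the objective.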
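The averaging can be sketched on a toy tree as follows. The `splits` list and `cover_importance` helper are hypothetical. The sketch uses plain sample counts, as in the docstring; xgboost's documentation defines cover more precisely as the sum of second-order gradients of the samples in the node, which reduces to the sample count for squared-error loss.

```python
from collections import defaultdict

# Hypothetical single tree fitted on 100 samples:
# (feature, depth of the node, number of samples reaching the node).
splits = [
    ("is_member", 0, 100),  # the root split sees every sample
    ("age", 1, 60),
    ("age", 1, 40),
    ("age", 2, 25),
]


def cover_importance(splits):
    """Average number of samples per split, for each feature."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for feature, _depth, n_samples in splits:
        totals[feature] += n_samples
        counts[feature] += 1
    return {f: totals[f] / counts[f] for f in totals}


cover = cover_importance(splits)
print(cover)
```

The binary feature is used once, at the root, so its average cover is the full 100 samples, while the three deeper splits on `age` average about 41.7.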
Besides these, permutation_importance can also be used to measure feature importance. The official sklearn documentation describes it as follows:
Permutation feature importance is a model inspection technique that can be used for any fitted estimator when the data is tabular.
This is especially useful for non-linear or opaque estimators.
The permutation feature importance is defined to be the decrease in a model score when a single feature value is randomly shuffled.
This procedure breaks the relationship between the feature and the target,
thus the drop in the model score is indicative of how much the model depends on the feature.
This technique benefits from being model agnostic and can be calculated many times with different permutations of the feature.
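The idea is roughly as follows:
1. Train a model on the training set.
2. Evaluate the model on the test set to obtain a baseline metric, for example MSE for regression, or logloss or AUC for classification.
3. Randomly shuffle the values of one feature in the test set (permuting that column across rows), predict again with the same model, and compute the metric again. Compare it with the baseline from step 2: the larger the difference, the more important the feature.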
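The three steps above can be sketched in plain Python. This is a simplified illustration, not sklearn's implementation: `permutation_importance_sketch`, the stand-in `predict` function (playing the role of an already trained model), and the toy test set are all assumptions made for the example.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance_sketch(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MSE after shuffling each column of X in turn."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])  # step 2: baseline metric
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Step 3: shuffle one column, leave the others untouched.
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            increases.append(mse(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Stand-in for a fitted model (step 1): it uses feature 0 and ignores feature 1.
def predict(row):
    return 2.0 * row[0]


X_test = [[float(i), float((i * 7) % 10)] for i in range(10)]
y_test = [2.0 * row[0] for row in X_test]

importances = permutation_importance_sketch(predict, X_test, y_test)
print(importances)
```

Shuffling feature 0 raises the error, while shuffling feature 1 changes nothing because the model never reads it. In practice, `sklearn.inspection.permutation_importance` performs this procedure for any fitted estimator and reports the mean and standard deviation over the repeats.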