What does sklearn's feature_importances_ mean?

Addendum: this blog post also explains it very well.

Main text:

Answer from a scikit-learn author:

There are indeed several ways to get feature “importances”. As often, there is no strict consensus about what this word means.
In scikit-learn, we implement the importance as described in [1] (often cited, but unfortunately rarely read…). It is sometimes called “gini importance” or “mean decrease impurity” and is defined as the total decrease in node impurity (weighted by the probability of reaching that node (which is approximated by the proportion of samples reaching that node)) averaged over all trees of the ensemble.
In the literature or in some other packages, you can also find feature importances implemented as the “mean decrease accuracy”. Basically, the idea is to measure the decrease in accuracy on OOB data when you randomly permute the values for that feature. If the decrease is low, then the feature is not important, and vice-versa.
(Note that both algorithms are available in the randomForest R package.)
[1]: Breiman, Friedman, “Classification and regression trees”, 1984.
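
To make the "mean decrease impurity" above concrete, here is the per-tree computation as I read it from that description (the notation is mine, not scikit-learn's): for feature $j$,

$$\mathrm{imp}(j) \;=\; \sum_{t \,:\, v(t)=j} \frac{N_t}{N}\left[\, i(t) \;-\; \frac{N_{t_L}}{N_t}\, i(t_L) \;-\; \frac{N_{t_R}}{N_t}\, i(t_R) \right]$$

where $t$ runs over the internal nodes of the tree, $v(t)$ is the feature split on at node $t$, $N_t$ is the number of samples reaching $t$ (so $N_t/N$ approximates the probability of reaching the node), $t_L$ and $t_R$ are its children, and $i(\cdot)$ is the node impurity (Gini, entropy, or MSE for regression). Each tree's importances are normalized to sum to 1 and then averaged over all trees of the ensemble.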

So, there are two fairly popular methods for evaluating feature importance:
This article has code for both methods: Selecting good features – Part III: random forests

  1. Mean decrease impurity
    The principle behind this method is exactly how tree models perform classification and regression: the more important a feature is, the larger the increase in node purity it produces when it is used to split. There are several criteria for measuring purity, such as Gini, entropy, and information gain.
    This is also what sklearn's feature_importances_ means (a sketch that reproduces this computation from a fitted tree follows the code below).
    Code:
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
import numpy as np
# Load the Boston housing dataset as an example
# (load_boston was removed in scikit-learn 1.2; it is kept here to match the original post)
boston = load_boston()
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]
rf = RandomForestRegressor()
rf.fit(X, Y)
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), names),
             reverse=True))
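
For readers who want to see where those numbers come from, here is a minimal sketch (based on my reading of the formula above and scikit-learn's public tree_ attribute, not the library's actual source) that reproduces feature_importances_ for a single decision tree by summing the weighted impurity decreases of every split:

import numpy as np
from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor

boston = load_boston()
X, Y = boston["data"], boston["target"]

dt = DecisionTreeRegressor(random_state=0).fit(X, Y)
t = dt.tree_

imp = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:          # leaf node: no split, contributes nothing
        continue
    # impurity decrease of this split, weighted by the samples reaching each node
    imp[t.feature[node]] += (
        t.weighted_n_node_samples[node] * t.impurity[node]
        - t.weighted_n_node_samples[left] * t.impurity[left]
        - t.weighted_n_node_samples[right] * t.impurity[right]
    )
imp /= t.weighted_n_node_samples[0]   # divide by total samples (probability-of-reaching-node weighting)
imp /= imp.sum()                      # normalize so the importances sum to 1

print(np.allclose(imp, dt.feature_importances_))   # expected to print True

If the allclose check prints True, the tree's feature_importances_ is exactly this normalized sum; in a random forest the same quantity is averaged over the individual trees.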

  2. Mean decrease accuracy
    This method is more intuitive: it directly measures a feature's effect on model accuracy. Randomly permute the values of one feature and see how much the random forest's prediction accuracy drops; the larger the drop, the more important that feature is.
    This is the method adopted in many studies and papers (a sketch of scikit-learn's built-in equivalent follows the code below).
    Code:
from sklearn.model_selection import ShuffleSplit   # was sklearn.cross_validation in old versions
from sklearn.metrics import r2_score
from collections import defaultdict

X = boston["data"]
Y = boston["target"]

rf = RandomForestRegressor()
scores = defaultdict(list)

# cross-validate the scores on a number of different random splits of the data
for train_idx, test_idx in ShuffleSplit(n_splits=100, test_size=0.3).split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    Y_train, Y_test = Y[train_idx], Y[test_idx]
    rf.fit(X_train, Y_train)
    acc = r2_score(Y_test, rf.predict(X_test))
    for i in range(X.shape[1]):
        X_t = X_test.copy()
        np.random.shuffle(X_t[:, i])   # permute a single feature column
        shuff_acc = r2_score(Y_test, rf.predict(X_t))
        scores[names[i]].append((acc - shuff_acc) / acc)
print("Features sorted by their score:")
print(sorted([(round(np.mean(score), 4), feat) for
              feat, score in scores.items()], reverse=True))
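
As a side note, newer versions of scikit-learn (0.22 and later) ship this idea as sklearn.inspection.permutation_importance. A minimal sketch reusing rf, X_test, Y_test, and names from the loop above (the parameter values here are illustrative, not prescriptive):

from sklearn.inspection import permutation_importance

# permute each feature n_repeats times on the held-out data and record the drop in R^2
result = permutation_importance(rf, X_test, Y_test,
                                scoring="r2", n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(names[i], round(result.importances_mean[i], 4))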

On the comparison between the two methods, one answer puts it this way:
the second metric actually gives you a direct measure of this, whereas the “mean decrease impurity” is just a good proxy.
That is, mean decrease accuracy is the most direct measure, while mean decrease impurity is merely a good "proxy" for it.
