sklearn source code walkthrough: scattered notes on ensemble models; how to read sklearn code, using the tree feature_importance as an example



I have been reading a lot of sklearn source code lately. The palest ink beats the best memory, so I am writing it down.



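Overall:

1) The code is very well implemented; the modular design and the multiple inheritance are written clearly.

2) predict is usually implemented in the model's own concrete class, while fit is usually implemented in a class it inherits from, so that different subclasses can share it.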
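
The second point can be sketched in plain Python. The class names below are hypothetical and only mimic the pattern (shared fit in a base class, predict in each concrete class); they are not sklearn's actual classes.

```python
class BaseMeanModel:
    """Hypothetical base class: the shared fit logic lives here."""

    def fit(self, X, y):
        # Shared training step: remember the mean target per distinct input.
        sums, counts = {}, {}
        for x, target in zip(X, y):
            sums[x] = sums.get(x, 0.0) + target
            counts[x] = counts.get(x, 0) + 1
        self.means_ = {x: sums[x] / counts[x] for x in sums}
        return self


class MeanRegressor(BaseMeanModel):
    """Concrete model: predict returns the stored mean."""

    def predict(self, X):
        return [self.means_[x] for x in X]


class ThresholdClassifier(BaseMeanModel):
    """Concrete model: predict thresholds the stored mean at 0.5."""

    def predict(self, X):
        return [int(self.means_[x] >= 0.5) for x in X]


X = ["a", "a", "b", "b"]
y = [1, 1, 0, 1]
reg = MeanRegressor().fit(X, y)        # fit comes from the base class
clf = ThresholdClassifier().fit(X, y)  # same fit, different predict
```

Both subclasses reuse one fit and differ only in predict, which is why, when reading sklearn, the training logic is usually found one level up the inheritance chain.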





Random forests and GBDT

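1) RandomForest's bootstrap samples with replacement; GBDT's subsampling is without replacement.

2) The code is very well implemented. For example, GBDT provides some functions that beginners rarely use, such as [staged_decision_function, staged_predict], which are very helpful when debugging and inspecting the output after each DT.

3) Most models' predict relies on the proba returned by predict_proba, but GBDT's predict relies on the score returned by decision_function. The two are essentially the same; this is just for the record.

4) I have not yet looked at how adaboost is implemented, but my impression from GBDT is that this kind of sequentially trained model generally calls _fit_stages from fit, so that is where to focus when reading the source. In GBDT, _fit_stage at https://github.com/scikit-learn/scikit-learn/blob/51a765a/sklearn/ensemble/gradient_boosting.py#L747 is the real training function, and L763 shows that the base tree used in training is [tree = DecisionTreeRegressor(...)].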

In random forests (see RandomForestClassifier and RandomForestRegressor classes), each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. ===> The samples are bootstrapped once before a tree is trained; the features are sampled only when each node is split.
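
A minimal sketch of where the two kinds of randomness happen, using only the standard library. The sizes and the max_features value are arbitrary choices for illustration, and the GBDT-style line is included only to contrast sampling with and without replacement.

```python
import random

rng = random.Random(0)
n_samples, n_features = 8, 6
max_features = 2  # hypothetical setting: features considered per split

# Once per tree: bootstrap the rows WITH replacement (random-forest style),
# so duplicates are possible and some rows may be left out.
bootstrap_rows = [rng.randrange(n_samples) for _ in range(n_samples)]

# For contrast, a GBDT-style subsample picks rows WITHOUT replacement,
# so every selected row appears exactly once.
subsample_rows = rng.sample(range(n_samples), n_samples // 2)


def candidate_features():
    """Once per node: draw a fresh random subset of candidate features."""
    return rng.sample(range(n_features), max_features)


node_candidates = [candidate_features() for _ in range(3)]  # three example nodes
print(sorted(bootstrap_rows), sorted(subsample_rows), node_candidates)
```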

In extremely randomized trees (see ExtraTreesClassifier and ExtraTreesRegressor classes), randomness goes one step further in the way splits are computed. As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly-generated thresholds is picked as the splitting rule. This usually allows to reduce the variance of the model a bit more, at the expense of a slightly greater increase in bias ===> As in random forests, features are sampled when each node is split (note that in sklearn the ExtraTrees classes default to bootstrap=False, so unless that is changed each tree sees the whole training set). The difference is that the thresholds on each feature are also drawn at random, and the best of [those random candidates] is chosen rather than the best of [all possible thresholds].
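
To make the threshold difference concrete, here is a small sketch for a single feature. The helper names are made up for this note, and Gini impurity on 0/1 labels is an assumption chosen to keep the code short. Across several candidate features, extra-trees would then keep the best of the random candidates.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def split_impurity(values, labels, threshold):
    """Weighted Gini impurity of the two children after splitting at threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Random-forest style: scan every midpoint between sorted distinct values."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return min(candidates, key=lambda t: split_impurity(values, labels, t))


def random_threshold(values, rng):
    """Extra-trees style: one threshold drawn uniformly between min and max."""
    return rng.uniform(min(values), max(values))


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = [0, 0, 0, 1, 1, 1]
rng = random.Random(0)

exhaustive = best_threshold(values, labels)   # the optimal cut for this feature
randomized = random_threshold(values, rng)    # cheaper, usually a worse cut
```

The random threshold can never beat the exhaustive one on the training data, which matches the quoted trade-off: a little more bias in exchange for lower variance.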

RandomTreesEmbedding implements an unsupervised transformation of the data. Using a forest of completely random trees, RandomTreesEmbedding encodes the data by the indices of the leaves a data point ends up in. This index is then encoded in a one-of-K manner, leading to a high dimensional, sparse binary coding. This coding can be computed very efficiently and can then be used as a basis for other learning tasks. The size and sparsity of the code can be influenced by choosing the number of trees and the maximum depth per tree. For each tree in the ensemble, the coding contains one entry of one. The size of the coding is at most n_estimators * 2 ** max_depth, the maximum number of leaves in the forest. As neighboring data points are more likely to lie within the same leaf of a tree, the transformation performs an implicit, non-parametric density estimation. ===> Each sample is encoded through the trees into a one-hot representation, which is then used for training.
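
The encoding itself is easy to sketch. The function below is a hypothetical helper, not sklearn's implementation; it assumes the leaf each sample lands in is already known, with leaves numbered from 0 within each tree.

```python
def one_hot_leaves(leaf_indices, leaves_per_tree):
    """Encode one sample's leaf index in each tree as a concatenated one-hot vector.

    leaf_indices[t] is the leaf the sample lands in for tree t
    (hypothetical 0-based numbering of the leaves within that tree).
    """
    code = []
    for leaf, n_leaves in zip(leaf_indices, leaves_per_tree):
        block = [0] * n_leaves
        block[leaf] = 1      # exactly one entry of one per tree
        code.extend(block)
    return code


# Three hypothetical trees with 4, 2 and 3 leaves.
leaves_per_tree = [4, 2, 3]
sample_code = one_hot_leaves([2, 0, 1], leaves_per_tree)
print(sample_code)  # [0, 0, 1, 0, 1, 0, 0, 1, 0]
```

The code length is the total number of leaves, and the number of ones equals the number of trees, as the quoted passage says.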

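5) The adaboost algorithm is not used much any more, mainly because it is too sensitive to outliers (constantly focusing on those points is indeed not right).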

6) All of the randomized models above stress the interaction between [bootstrap, feature sampling, max_depth, and GBDT's own shrinkage (the learning_rate parameter)] and n_estimators. The figure below illustrates the effect of shrinkage and subsampling on the goodness-of-fit of the model. We can clearly see that shrinkage outperforms no-shrinkage. Subsampling with shrinkage can further increase the accuracy of the model. Subsampling without shrinkage, on the other hand, does poorly. ===> Row subsampling (0.5), feature sampling (0.8) and shrinkage (<0.1) should be used together, with early stopping on a validation set controlling the size of the ensemble, to get the best results.
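
One way to try this combination is sketched below. It uses staged_predict to pick the number of trees on a validation set; the synthetic data, the seeds and the exact hyperparameter values are arbitrary assumptions for illustration, not recommendations from the source.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data; sizes and seeds are arbitrary choices for illustration.
X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

gbdt = GradientBoostingRegressor(
    n_estimators=300,
    learning_rate=0.05,  # shrinkage
    subsample=0.5,       # row subsampling per stage
    max_features=0.8,    # fraction of features considered per split
    max_depth=3,
    random_state=0,
)
gbdt.fit(X_train, y_train)

# staged_predict yields the ensemble prediction after 1, 2, ..., n_estimators trees,
# so the validation error can be tracked stage by stage without refitting.
val_errors = [mean_squared_error(y_val, pred) for pred in gbdt.staged_predict(X_val)]
best_n_trees = int(np.argmin(val_errors)) + 1
print(best_n_trees, val_errors[best_n_trees - 1])
```

Refitting with n_estimators=best_n_trees (or simply truncating at that stage when predicting) is one simple form of early stopping.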

7) Finally, feature_importance: Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features. By averaging those expected activity rates over several randomized trees one can reduce the variance of such an estimate and use it for feature selection. ===> I am sure everyone understands what [they contribute to] means. If not, look at the code (using GBDT as the example):

First, the [averaging] part:

    @property
    def feature_importances_(self):
        """Return the feature importances (the higher, the more important the
           feature).
        Returns
        -------
        feature_importances_ : array, shape = [n_features]
        """
        self._check_initialized()

        total_sum = np.zeros((self.n_features, ), dtype=np.float64)
        for stage in self.estimators_:
            stage_sum = sum(tree.feature_importances_
                            for tree in stage) / len(stage)
            total_sum += stage_sum

        importances = total_sum / len(self.estimators_)
        return importances
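
The same averaging can be reproduced in a few lines of plain Python. StubTree is a hypothetical stand-in for a fitted tree, and the importance values are made up; the point is only to show the two levels of averaging (within a stage, then across stages).

```python
class StubTree:
    """Hypothetical stand-in for a fitted tree exposing feature_importances_."""

    def __init__(self, importances):
        self.feature_importances_ = importances


def averaged_importances(estimators, n_features):
    """Average per-tree importances within each stage, then across stages."""
    total = [0.0] * n_features
    for stage in estimators:
        for j in range(n_features):
            stage_mean = sum(t.feature_importances_[j] for t in stage) / len(stage)
            total[j] += stage_mean
    return [v / len(estimators) for v in total]


# Two stages with one tree each (made-up importances, each summing to 1).
estimators = [
    [StubTree([1.0, 0.0, 0.0])],
    [StubTree([0.5, 0.5, 0.0])],
]
importances = averaged_importances(estimators, n_features=3)
print(importances)  # [0.75, 0.25, 0.0]
```

Because every tree's importances sum to 1, the averaged vector sums to 1 as well.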

As mentioned above, the base tree GBDT uses in training is [tree = DecisionTreeRegressor(...)]. That model is defined at https://github.com/scikit-learn/scikit-learn/blob/51a765a/sklearn/tree/tree.py#L724 , but it inherits from BaseDecisionTree, which is defined at https://github.com/scikit-learn/scikit-learn/blob/51a765a/sklearn/tree/tree.py#L72 . In that class, tree.feature_importances_ is computed as follows (L460):

    @property
    def feature_importances_(self):
        """Return the feature importances.
        The importance of a feature is computed as the (normalized) total
        reduction of the criterion brought by that feature.
        It is also known as the Gini importance.
        Returns
        -------
        feature_importances_ : array, shape = [n_features]
        """
        if self.tree_ is None:
            raise NotFittedError("Estimator not fitted, call `fit` before"
                                 " `feature_importances_`.")

        return self.tree_.compute_feature_importances()
Ugh, now we also have to dig into compute_feature_importances() on self.tree_. Fortunately, at L335 there is:

self.tree_ = Tree(self.n_features_, self.n_classes_, self.n_outputs_)
Note that L42 shows which file Tree comes from:

from ._tree import Tree
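Well, from this point on, all of the real computation can only be understood by downloading the source code and digging through the Cython code. The file is E:\scikit-learn-master\sklearn\tree\_tree.pyx, at L1033. Files ending in .pyx are Cython; it is not hard, but it does test your patience.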
    cpdef compute_feature_importances(self, normalize=True):
        """Computes the importance of each feature (aka variable)."""
        cdef Node* left
        cdef Node* right
        cdef Node* nodes = self.nodes
        cdef Node* node = nodes
        cdef Node* end_node = node + self.node_count

        cdef double normalizer = 0.

        cdef np.ndarray[np.float64_t, ndim=1] importances
        importances = np.zeros((self.n_features,))
        cdef DOUBLE_t* importance_data = importances.data

        with nogil:
            while node != end_node:
                if node.left_child != _TREE_LEAF:
                    # ... and node.right_child != _TREE_LEAF:
                    left = &nodes[node.left_child]
                    right = &nodes[node.right_child]

                    importance_data[node.feature] += (
                        node.weighted_n_node_samples * node.impurity -
                        left.weighted_n_node_samples * left.impurity -
                        right.weighted_n_node_samples * right.impurity)
                node += 1

        importances /= nodes[0].weighted_n_node_samples

        if normalize:
            normalizer = np.sum(importances)

            if normalizer > 0.0:
                # Avoid dividing by zero (e.g., when root is pure)
                importances /= normalizer

        return importances

There are two key points in the code above:

First point: weighted_n_node_samples : array of int, shape [node_count]
        weighted_n_node_samples[i] holds the weighted number of training samples reaching node i.

Second point: impurity : array of double, shape [node_count]
        impurity[i] holds the impurity (i.e., the value of the splitting criterion) at node i.

Compare this with the earlier description [The expected fraction of the samples they contribute to]: it only covers [the samples of the first point], and even leaves out the fact that they are weighted.

So where is the second point described? ===> In the description of feature_importances_ for DecisionTreeRegressor and DecisionTreeClassifier: The feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) [total reduction of the criterion brought by that feature]. It is also known as the Gini importance [R70]

Now the whole logic is clear.
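
To check the two points together, here is a plain-Python port of the Cython loop run on a hand-built toy tree. The node tuples and their numbers are made up for this note (8 samples split 4 vs 4 at the root, Gini impurity); only the arithmetic follows the code above: each split adds N_node * impurity_node - N_left * impurity_left - N_right * impurity_right to its feature, and the result is divided by the root's weighted sample count and then normalized.

```python
TREE_LEAF = -1  # marker for "no child", mirroring _TREE_LEAF in the Cython code

# Hand-built toy tree (made-up numbers). Each node is
# (left_child, right_child, feature, weighted_n_node_samples, impurity).
nodes = [
    (1, 2, 0, 8.0, 0.5),                   # root: 4 vs 4 samples, splits on feature 0
    (TREE_LEAF, TREE_LEAF, -1, 2.0, 0.0),  # pure leaf
    (3, 4, 1, 6.0, 4.0 / 9.0),             # 2 vs 4 samples, splits on feature 1
    (TREE_LEAF, TREE_LEAF, -1, 2.0, 0.0),  # pure leaf
    (TREE_LEAF, TREE_LEAF, -1, 4.0, 0.0),  # pure leaf
]


def compute_feature_importances(nodes, n_features, normalize=True):
    importances = [0.0] * n_features
    for left, right, feature, n_w, impurity in nodes:
        if left == TREE_LEAF:
            continue  # leaves do not split, so they add nothing
        _, _, _, n_left, imp_left = nodes[left]
        _, _, _, n_right, imp_right = nodes[right]
        # Weighted impurity decrease produced by this split.
        importances[feature] += (
            n_w * impurity - n_left * imp_left - n_right * imp_right
        )
    root_weight = nodes[0][3]
    importances = [v / root_weight for v in importances]
    if normalize:
        total = sum(importances)
        if total > 0.0:  # avoid dividing by zero, e.g. when the root is pure
            importances = [v / total for v in importances]
    return importances


raw = compute_feature_importances(nodes, n_features=2, normalize=False)
importances = compute_feature_importances(nodes, n_features=2)
print(raw, importances)  # roughly [1/6, 1/3] and [1/3, 2/3]
```

Feature 1 ends up more important than feature 0 even though it is used lower in the tree, because its split removes more weighted impurity. That is exactly what the "fraction of samples" description alone would miss.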



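Next, a note on the two strategies for building a Tree: DepthFirstTreeBuilder and BestFirstTreeBuilder.

First, when each one is used: max_leaf_nodes : int or None, optional (default=None)

Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes. If not None then max_depth will be ignored.

In other words, if max_leaf_nodes is set, best first is used; otherwise depth first is used.

Start with depth first [depth-first builds the left child first, then the right child].

The advantage is that the resulting tree is relatively balanced (because max_depth controls the tree height in this mode). If the height is well controlled, the tree generally will not overfit.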

        with nogil:
            # push root node onto stack
            rc = stack.push(0, n_node_samples, 0, _TREE_UNDEFINED, 0, INFINITY, 0)
            if rc == -1:
                # got return code -1 - out-of-memory
                with gil:
                    raise MemoryError()

            while not stack.is_empty():
                stack.pop(&stack_record)
                ......
                if not is_leaf:
                    # Push right child on stack
                    rc = stack.push(split.pos, end, depth + 1, node_id, 0,
                                    split.impurity_right, n_constant_features)
                    if rc == -1:
                        break

                    # Push left child on stack
                    rc = stack.push(start, split.pos, depth + 1, node_id, 1,
                                    split.impurity_left, n_constant_features)
                    if rc == -1:
                        break
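
The stack discipline above can be mimicked with a Python list. This is a toy sketch under simplifying assumptions: a node is just an index range, the split position is always the midpoint (standing in for split.pos), and a node becomes a leaf at max_depth or when it holds fewer than two samples.

```python
def build_depth_first(n_samples, max_depth):
    """Toy depth-first builder: nodes cover index ranges and split at the midpoint."""
    order = []                   # (start, end, depth) in creation order
    stack = [(0, n_samples, 0)]  # push root node onto stack
    while stack:
        start, end, depth = stack.pop()
        order.append((start, end, depth))
        is_leaf = depth >= max_depth or end - start < 2
        if not is_leaf:
            mid = (start + end) // 2               # stand-in for split.pos
            stack.append((mid, end, depth + 1))    # push right child first
            stack.append((start, mid, depth + 1))  # left child is popped next
    return order


order = build_depth_first(n_samples=8, max_depth=2)
for node in order:
    print(node)
```

Because the right child is pushed before the left one, the left subtree is always finished before the right subtree is started.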



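Now look at best first:

The frontier below is a priority queue that stores the relative reduction in impurity of each tree node, so whichever node has the largest relative reduction in impurity is the one that gets split next.

The downside I can think of is that the tree is unbalanced and may grow quite deep (because max_depth is ignored in this mode), so it can overfit and generalize poorly.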

        with nogil:
            # add root to frontier
            rc = self._add_split_node(splitter, tree, 0, n_node_samples,
                                      INFINITY, IS_FIRST, IS_LEFT, NULL, 0,
                                      &split_node_left)
            if rc >= 0:
                rc = _add_to_frontier(&split_node_left, frontier)

            if rc == -1:
                with gil:
                    raise MemoryError()

            while not frontier.is_empty():
                frontier.pop(&record)
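
A toy version of the frontier can be written with heapq. Everything here is an assumption made for illustration: the data is a one-dimensional, already sorted target list, impurity is the sum of squared errors, and the leaf budget is enforced by counting the leaves the tree would have after a split. It is not sklearn's BestFirstTreeBuilder.

```python
import heapq


def sse(values):
    """Sum of squared errors around the mean (an unnormalized impurity)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(y, start, end):
    """Best split position of y[start:end] and its impurity improvement."""
    best_pos, best_gain = None, 0.0
    for pos in range(start + 1, end):
        gain = sse(y[start:end]) - sse(y[start:pos]) - sse(y[pos:end])
        if gain > best_gain:
            best_pos, best_gain = pos, gain
    return best_pos, best_gain


def build_best_first(y, max_leaf_nodes):
    """Toy best-first builder: always expand the frontier node with the largest gain."""
    leaves = []
    frontier = []  # heapq is a min-heap, so gains are stored negated
    pos, gain = best_split(y, 0, len(y))
    heapq.heappush(frontier, (-gain, 0, len(y), pos))
    while frontier:
        _, start, end, pos = heapq.heappop(frontier)
        # Splitting turns this node into two leaves; check the leaf budget first.
        n_leaves_if_split = len(leaves) + len(frontier) + 2
        if pos is None or n_leaves_if_split > max_leaf_nodes:
            leaves.append((start, end))
            continue
        for s, e in ((start, pos), (pos, end)):
            p, g = best_split(y, s, e)
            heapq.heappush(frontier, (-g, s, e, p))
    return sorted(leaves)


y = [0.0, 0.0, 0.0, 0.0, 1.0, 5.0, 9.0, 10.0]
leaves = build_best_first(y, max_leaf_nodes=3)
print(leaves)  # [(0, 5), (5, 6), (6, 8)]
```

With a budget of three leaves, both splits are spent on the high-variance right side and the left side is never refined, which illustrates the unbalanced growth described above.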




Below is the route for tracking down the key points of the random forest code:

[Figure 1: route through the key points of the random forest source code]

