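Random forest classifier (scikit-learn v0.19.1).
A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (the default).
First, the class signature: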
class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)
The parameters are described below.
Parameters:
n_estimators : integer, optional (default=10)
The number of trees in the forest.
criterion : string, optional (default="gini")
The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain.
Note: this parameter is tree-specific.
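Note that Gini impurity is unrelated to the Gini coefficient. Gini impurity measures the error rate of the following procedure: pick an element of a set at random, then assign it a label drawn at random from the label distribution of that set. For an element with a given label, the probability of being classified correctly is the probability of picking that label when choosing the element multiplied by the probability of picking the same label when assigning it. Gini impurity is therefore 1 minus the sum of these correct-classification probabilities over all labels, i.e. the probability of a wrong assignment. It reaches its minimum of 0 when every element of the set belongs to a single class.
Let the labels be 1, 2, …, m and let f_i be the fraction of elements in the set that carry label i. Then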
$$ I_G(f) = \sum_{i=1}^{m} f_i (1 - f_i) = \sum_{i=1}^{m} (f_i - f_i^2) = \sum_{i=1}^{m} f_i - \sum_{i=1}^{m} f_i^2 = 1 - \sum_{i=1}^{m} f_i^2 $$
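The formula above can be checked with a few lines of plain Python. This is only a sketch: the function name gini_impurity is made up for illustration and is not part of scikit-learn.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum_i f_i**2, where f_i is the fraction of label i."""
    n = len(labels)
    if n == 0:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# A set with a single class is pure, so its impurity is 0.
pure = gini_impurity(["a", "a", "a"])
# Two equally frequent classes: 1 - (0.5**2 + 0.5**2) = 0.5.
mixed = gini_impurity(["a", "a", "b", "b"])
```

For a two-class problem 0.5 is the largest value the impurity can take, which is why balanced nodes are the least attractive leaves.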
max_features : int, float, string or None, optional (default="auto")
The number of features to consider when looking for the best split:
- If int, then consider max_features features at each split.
- If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
- If "auto", then max_features=sqrt(n_features).
- If "sqrt", then max_features=sqrt(n_features) (same as "auto").
- If "log2", then max_features=log2(n_features).
- If None, then max_features=n_features.
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
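The max_features rules can be written out as a small function. This is a sketch of the documented rules only; resolve_max_features is a hypothetical helper, and the library's internal handling (for example of very small results) may differ.

```python
import math


def resolve_max_features(max_features, n_features):
    """Hypothetical helper: features considered per split, per the documented rules."""
    if max_features is None:
        return n_features
    if isinstance(max_features, str):
        if max_features in ("auto", "sqrt"):
            return int(math.sqrt(n_features))
        if max_features == "log2":
            return int(math.log2(n_features))
        raise ValueError("unknown max_features: %r" % max_features)
    if isinstance(max_features, int):
        return max_features
    # Remaining case: a float, interpreted as a fraction of n_features.
    return int(max_features * n_features)


# With 16 features, each documented option resolves as follows.
resolved = {
    option: resolve_max_features(option, 16)
    for option in (3, 0.5, "auto", "sqrt", "log2", None)
}
```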
max_depth : integer or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. A leaf is pure when all of its samples belong to the same class.
min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:
- If int, then consider min_samples_split as the minimum number.
- If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.
Changed in version 0.18: Added float values for fractions.
min_samples_leaf : int, float, optional (default=1)
The minimum number of samples required to be at a leaf node:
- If int, then consider min_samples_leaf as the minimum number.
- If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.
Changed in version 0.18: Added float values for fractions.
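The int and float forms of min_samples_split and min_samples_leaf differ only in how the threshold is derived. The sketch below follows the documented ceil rule; min_samples_threshold is a hypothetical name, and any extra clamping the library applies internally is not modelled here.

```python
import math


def min_samples_threshold(value, n_samples):
    """Hypothetical helper: absolute sample count implied by an int or float setting."""
    if isinstance(value, int):
        # An int is already an absolute number of samples.
        return value
    # A float is a fraction of the training set size, rounded up.
    return math.ceil(value * n_samples)


from_int = min_samples_threshold(5, 95)
from_float = min_samples_threshold(0.1, 95)  # ceil(9.5)
```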
min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
max_leaf_nodes : int or None, optional (default=None)
Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
min_impurity_split : float
Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.
Deprecated since version 0.19: min_impurity_split has been deprecated in favor of min_impurity_decrease in 0.19 and will be removed in 0.21. Use min_impurity_decrease instead.
min_impurity_decrease : float, optional (default=0.)
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
New in version 0.19.
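The weighted impurity decrease can be evaluated by hand for a single candidate split. The numbers below are invented for illustration, and weighted_impurity_decrease is a hypothetical function that simply transcribes the documented equation.

```python
def weighted_impurity_decrease(N, N_t, N_t_L, N_t_R,
                               impurity, left_impurity, right_impurity):
    """Transcription of the documented equation (not a scikit-learn function)."""
    return N_t / N * (impurity
                      - N_t_R / N_t * right_impurity
                      - N_t_L / N_t * left_impurity)


# A node holding 40 of 100 samples, evenly mixed between two classes
# (Gini impurity 0.5), split into two pure children of 20 samples each.
decrease = weighted_impurity_decrease(
    N=100, N_t=40, N_t_L=20, N_t_R=20,
    impurity=0.5, left_impurity=0.0, right_impurity=0.0,
)

# The split is accepted when the decrease reaches min_impurity_decrease.
min_impurity_decrease = 0.05  # illustrative threshold
would_split = decrease >= min_impurity_decrease
```

Here the decrease is 0.4 * 0.5 = 0.2, so the split would be made for any threshold up to 0.2.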
bootstrap : boolean, optional (default=True)
Whether bootstrap samples (samples drawn with replacement) are used when building trees.
oob_score : bool (default=False)
Whether to use out-of-bag samples to estimate the generalization accuracy.
n_jobs : integer, optional (default=1)
The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
verbose : int, optional (default=0)
Controls the verbosity of the tree building process.
warm_start : bool, optional (default=False)
When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.
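A minimal sketch of warm_start in use, on a synthetic dataset chosen only for illustration: after the first fit, n_estimators is raised and fit is called again, which keeps the existing trees and trains only the additional ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

clf = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)
clf.fit(X, y)
n_before = len(clf.estimators_)

# Ask for a larger forest; the second fit adds trees to the existing ones.
clf.set_params(n_estimators=20)
clf.fit(X, y)
n_after = len(clf.estimators_)
```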
class_weight : dict, list of dicts, "balanced", "balanced_subsample" or None, optional (default=None)
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.
Note that for multioutput (including multilabel) weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].
The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as
n_samples / (n_classes * np.bincount(y))
The "balanced_subsample" mode is the same as "balanced" except that weights are computed based on the bootstrap sample for every tree grown.
For multi-output, the weights of each column of y will be multiplied.
Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
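The "balanced" formula can be evaluated directly with NumPy. The label vector below is a made-up example with three samples of class 0 and one of class 1.

```python
import numpy as np

# Imbalanced toy labels: class 0 appears three times, class 1 once.
y = np.array([0, 0, 0, 1])

n_samples = len(y)
n_classes = len(np.unique(y))

# n_samples / (n_classes * np.bincount(y)), as in the documentation.
balanced_weights = n_samples / (n_classes * np.bincount(y))
```

The rarer class receives the larger weight: 4 / (2 * 3) for class 0 and 4 / (2 * 1) for class 1.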
Attributes:
estimators_ : list of DecisionTreeClassifier
The collection of fitted sub-estimators.
classes_ : array of shape = [n_classes] or a list of such arrays
The classes labels (single output problem), or a list of arrays of class labels (multi-output problem).
n_classes_ : int or list
The number of classes (single output problem), or a list containing the number of classes for each output (multi-output problem).
n_features_ : int
The number of features when fit is performed.
n_outputs_ : int
The number of outputs when fit is performed.
feature_importances_ : array of shape = [n_features]
The feature importances (the higher, the more important the feature).
oob_score_ : float
Score of the training dataset obtained using an out-of-bag estimate.
oob_decision_function_ : array of shape = [n_samples, n_classes]
Decision function computed with out-of-bag estimate on the training set. If n_estimators is small it might be possible that a data point was never left out during the bootstrap. In this case, oob_decision_function_ might contain NaN.
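The out-of-bag attributes are only available when the forest is built with oob_score=True. The sketch below uses a synthetic dataset and a fairly large n_estimators (both chosen for illustration) so that every sample is very likely to be left out of at least one bootstrap sample.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

clf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                             oob_score=True, random_state=0)
clf.fit(X, y)

oob_accuracy = clf.oob_score_                     # a single float
oob_shape = clf.oob_decision_function_.shape      # (n_samples, n_classes)
```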
Notes:
The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data, max_features=n_features and bootstrap=False, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.
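As a small check of the determinism note, two forests fitted on the same data with the same random_state should end up with the same feature importances. The dataset and the helper name fit_importances are assumptions made for this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)


def fit_importances(seed):
    """Fit a fresh forest with the given seed and return its importances."""
    clf = RandomForestClassifier(n_estimators=10, random_state=seed)
    return clf.fit(X, y).feature_importances_


# Same data and same seed: the two fits agree.
same_seed_equal = np.allclose(fit_importances(0), fit_importances(0))
```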
Example:
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.datasets import make_classification
>>>
>>> X, y = make_classification(n_samples=1000, n_features=4,
...                            n_informative=2, n_redundant=0,
...                            random_state=0, shuffle=False)
>>> clf = RandomForestClassifier(max_depth=2, random_state=0)
>>> clf.fit(X, y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)
>>> print(clf.feature_importances_)
[ 0.17287856  0.80608704  0.01884792  0.00218648]
>>> print(clf.predict([[0, 0, 0, 0]]))
[1]
Methods:
apply(X) | Apply trees in the forest to X, return leaf indices.
decision_path(X) | Return the decision path in the forest.
fit(X, y[, sample_weight]) | Build a forest of trees from the training set (X, y).
get_params([deep]) | Get parameters for this estimator.
predict(X) | Predict class for X.
predict_log_proba(X) | Predict class log-probabilities for X.
predict_proba(X) | Predict class probabilities for X.
score(X, y[, sample_weight]) | Returns the mean accuracy on the given test data and labels.
set_params(**params) | Set the parameters of this estimator.
__init__(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)
apply(X)
Apply trees in the forest to X, return leaf indices.
Parameters:
X : array-like or sparse matrix, shape = [n_samples, n_features]
Returns:
X_leaves : array_like, shape = [n_samples, n_estimators]
decision_path(X)
Return the decision path in the forest.
New in version 0.18.
Parameters:
X : array-like or sparse matrix, shape = [n_samples, n_features]
Returns:
indicator : sparse csr array, shape = [n_samples, n_nodes]
n_nodes_ptr : array of size (n_estimators + 1, )
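The return shapes of apply and decision_path are easiest to see on a small fitted forest. The dataset and forest size below are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=100, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
clf = RandomForestClassifier(n_estimators=5, max_depth=2, random_state=0)
clf.fit(X, y)

# One leaf index per sample and per tree.
leaves = clf.apply(X)

# indicator has one column per node over all trees; n_nodes_ptr holds the
# column offsets at which each tree's nodes start and end.
indicator, n_nodes_ptr = clf.decision_path(X)
```

The columns of indicator between n_nodes_ptr[i] and n_nodes_ptr[i + 1] belong to the i-th tree.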
feature_importances_
Returns:
feature_importances_ : array, shape = [n_features]
fit(X, y, sample_weight=None)
Build a forest of trees from the training set (X, y).
Parameters:
X : array-like or sparse matrix of shape = [n_samples, n_features]
y : array-like, shape = [n_samples] or [n_samples, n_outputs]
sample_weight : array-like, shape = [n_samples] or None
Returns:
self : object
get_params(deep=True)
Get parameters for this estimator.
Parameters:
deep : boolean, optional
Returns:
params : mapping of string to any
predict(X)
Predict class for X.
The predicted class of an input sample is a vote by the trees in the forest, weighted by their probability estimates. That is, the predicted class is the one with highest mean probability estimate across the trees.
Parameters:
X : array-like or sparse matrix of shape = [n_samples, n_features]
Returns:
y : array of shape = [n_samples] or [n_samples, n_outputs]
predict_log_proba(X)
Predict class log-probabilities for X.
The predicted class log-probabilities of an input sample is computed as the log of the mean predicted class probabilities of the trees in the forest.
Parameters:
X : array-like or sparse matrix of shape = [n_samples, n_features]
Returns:
p : array of shape = [n_samples, n_classes], or a list of n_outputs
predict_proba(X)
Predict class probabilities for X.
The predicted class probabilities of an input sample are computed as the mean predicted class probabilities of the trees in the forest. The class probability of a single tree is the fraction of samples of the same class in a leaf.
Parameters:
X : array-like or sparse matrix of shape = [n_samples, n_features]
Returns:
p : array of shape = [n_samples, n_classes], or a list of n_outputs
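The relationship between predict and predict_proba described above can be verified on a small forest. This is a sketch for the single-output case on synthetic data: the forest's probabilities are compared with the mean over the individual trees in estimators_, and the predicted labels with the class of highest mean probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=100, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

forest_proba = clf.predict_proba(X)

# Average the per-tree class probabilities by hand.
mean_tree_proba = np.mean(
    [tree.predict_proba(X) for tree in clf.estimators_], axis=0
)
matches_mean = np.allclose(forest_proba, mean_tree_proba)

# The predicted class is the one with the highest mean probability.
labels_from_proba = clf.classes_[np.argmax(forest_proba, axis=1)]
matches_predict = np.array_equal(labels_from_proba, clf.predict(X))
```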
score(X, y, sample_weight=None)
Returns the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
Parameters:
X : array-like, shape = (n_samples, n_features)
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
sample_weight : array-like, shape = [n_samples], optional
Returns:
score : float
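For a single-output problem without sample weights, score is simply the fraction of correct predictions. The sketch below checks this on a held-out split of a synthetic dataset; the split sizes are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
# The same number computed by hand from the predictions.
manual_accuracy = np.mean(clf.predict(X_test) == y_test)
```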
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns:
self
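A short sketch of get_params and set_params, first on the classifier alone and then inside a Pipeline, where nested parameters are addressed with a double underscore. The step names "scale" and "forest" are arbitrary labels chosen for this example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# On a plain estimator, parameters are addressed by name.
clf = RandomForestClassifier(n_estimators=10)
clf.set_params(n_estimators=50, max_depth=4)
plain_params = clf.get_params()

# Inside a pipeline, the step name and the parameter name are joined by "__".
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("forest", RandomForestClassifier(n_estimators=10)),
])
pipe.set_params(forest__n_estimators=25, forest__max_depth=3)
nested_params = pipe.get_params()
```

Because set_params returns the estimator itself, calls can be chained, for example clf.set_params(max_depth=4).fit(X, y) for some training data X, y.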
Original: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html