Gain, Frequency, and Cover in xgboost

Assume that you are using xgboost to fit boosted trees for binary classification. The importance matrix is actually a data.table object whose first column lists the names of all the features actually used in the boosted trees.

The meaning of the importance data table is as follows:

  1. The Gain is the relative contribution of the corresponding feature to the model, calculated from each feature's contribution in each tree of the model. A feature with a higher Gain than another feature is more important for generating a prediction.
  2. The Cover metric is the relative number of observations related to this feature. For example, suppose you have 100 observations, 4 features and 3 trees, and feature1 is used to decide the leaf node for 10, 5, and 2 observations in tree1, tree2 and tree3 respectively. The raw cover for feature1 is then 10+5+2 = 17 observations. The same count is made for all 4 features, and feature1's Cover is 17 expressed as a percentage of the total cover across all features.
  3. The Frequence (frequency) is the percentage representing the relative number of times a particular feature occurs in the trees of the model. In the above example, if feature1 occurred in 2 splits, 1 split and 3 splits in tree1, tree2 and tree3 respectively, then the weight for feature1 is 2+1+3 = 6. The frequency for feature1 is this weight expressed as a percentage of the summed weights of all features.
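The arithmetic in items 2 and 3 can be sketched in a few lines of Python. Only the feature1 counts come from the example above; the counts for feature2 through feature4 are made up purely so that there is a total to divide by, and `relative_share` is an illustrative helper, not part of xgboost.

```python
# Per-tree counts for each feature. Only the feature1 rows come from the
# example in the text; the other three rows are hypothetical.
cover_counts = {
    "feature1": [10, 5, 2],    # observations decided by feature1 in tree1..tree3
    "feature2": [40, 30, 20],  # hypothetical
    "feature3": [30, 25, 8],   # hypothetical
    "feature4": [20, 5, 5],    # hypothetical
}
split_counts = {
    "feature1": [2, 1, 3],     # splits on feature1 in tree1..tree3
    "feature2": [3, 2, 2],     # hypothetical
    "feature3": [1, 2, 1],     # hypothetical
    "feature4": [1, 1, 1],     # hypothetical
}


def relative_share(per_tree_counts):
    """Sum each feature's counts over all trees, then normalise so shares sum to 1."""
    totals = {name: sum(counts) for name, counts in per_tree_counts.items()}
    grand_total = sum(totals.values())
    return {name: total / grand_total for name, total in totals.items()}


cover = relative_share(cover_counts)
frequency = relative_share(split_counts)

for name in cover_counts:
    print(f"{name}: Cover={cover[name]:.3f}  Frequency={frequency[name]:.3f}")
```

With these made-up totals (a summed cover of 200 and 20 splits overall), feature1 ends up with Cover 17/200 = 0.085 and Frequency 6/20 = 0.30.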

The Gain is the most relevant attribute to interpret the relative importance of each feature.

 

Gain is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch, there were some wrongly classified elements. After adding the split on this feature, there are two new branches, and each of these branches is more accurate (one branch saying that if your observation is on this branch then it should be classified as 1, and the other branch saying the exact opposite).
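A minimal sketch of this intuition, on made-up data: predict the majority class before the split, then the majority class within each of the two new branches, and compare accuracy. The observations, the threshold and the helper `majority_correct` are all hypothetical. xgboost itself scores a split by the reduction in its training loss rather than by raw accuracy, so this illustrates the idea only, not the library's computation.

```python
from collections import Counter

# Hypothetical toy data: (value of feature X, true class label).
observations = [
    (0.2, 0), (0.4, 0), (0.5, 0), (0.9, 0),
    (1.1, 1), (1.3, 1), (1.6, 1), (0.7, 1),
]
threshold = 1.0  # hypothetical split point on X


def majority_correct(labels):
    """Number of labels matched when predicting the most common class."""
    if not labels:
        return 0
    return Counter(labels).most_common(1)[0][1]


all_labels = [label for _, label in observations]
left = [label for x, label in observations if x < threshold]
right = [label for x, label in observations if x >= threshold]

# Before the split: one prediction for the whole branch.
accuracy_before = majority_correct(all_labels) / len(observations)
# After the split: each new branch predicts its own majority class.
accuracy_after = (majority_correct(left) + majority_correct(right)) / len(observations)
improvement = accuracy_after - accuracy_before

print(f"before={accuracy_before:.3f} after={accuracy_after:.3f} improvement={improvement:.3f}")
```

On this toy data the unsplit branch gets 4 of 8 observations right, while the two branches together get 7 of 8 right, so the split on X improves accuracy from 0.5 to 0.875.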

Cover measures the relative quantity of observations concerned by a feature.

Frequency is a simpler way to measure the Gain. It just counts the number of times a feature is used in all generated trees. You should not use it (unless you know why you want to use it).
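One way to see why a plain count can mislead is to compare the two rankings on a handful of made-up split records. In this sketch the feature names and gain values are hypothetical: one feature is split on often but each split helps little, the other is split on rarely but each split helps a lot.

```python
from collections import defaultdict

# Hypothetical split records: (feature used, gain credited to that split).
splits = [
    ("age", 0.5), ("age", 0.4), ("age", 0.3), ("age", 0.3),
    ("income", 6.0), ("income", 2.5),
]

gain_total = defaultdict(float)
split_count = defaultdict(int)
for feature, gain in splits:
    gain_total[feature] += gain   # Gain accumulates how much each split helped
    split_count[feature] += 1     # Frequency only counts that a split happened

total_gain = sum(gain_total.values())
total_splits = sum(split_count.values())
gain_share = {f: g / total_gain for f, g in gain_total.items()}
frequency_share = {f: c / total_splits for f, c in split_count.items()}

top_by_gain = max(gain_share, key=gain_share.get)
top_by_frequency = max(frequency_share, key=frequency_share.get)
print("top by Gain:", top_by_gain)
print("top by Frequency:", top_by_frequency)
```

Here the hypothetical `age` feature wins on Frequency (4 of 6 splits) while `income` wins on Gain (8.5 of 10.0), so the two metrics rank the features in opposite order.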

 

