CatBoost: unbiased boosting with categorical features
Liudmila Prokhorenkova, Gleb Gusev, et al.
This paper presents the key algorithmic techniques behind CatBoost, a new gradient boosting toolkit. Their combination leads to CatBoost outperforming other publicly available boosting implementations in terms of quality on a variety of datasets.
Two critical algorithmic advances introduced in CatBoost are the implementation of ordered boosting, a permutation-driven alternative to the classic algorithm, and an innovative algorithm for processing categorical features. Both techniques were created to fight a prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient boosting algorithms.
In this paper, we provide a detailed analysis of this problem and demonstrate that proposed algorithms solve it effectively, leading to excellent empirical results.
Gradient boosting is a powerful machine-learning technique that achieves state-of-the-art results in a variety of practical tasks.
We show in this paper that all existing implementations of gradient boosting face the following statistical issue.
A prediction model $F$ obtained after several steps of boosting relies on the targets of all training examples. We demonstrate that this actually leads to a shift of the distribution of $F(x_k) \mid x_k$ for a training example $x_k$ from the distribution of $F(x) \mid x$ for a test example $x$. This finally leads to a prediction shift of the learned model.
Further, there is a similar issue in standard algorithms of preprocessing categorical features. One of the most effective ways to use them in gradient boosting is converting categories to their target statistics. A target statistic is a simple statistical model itself, and it can also cause target leakage and a prediction shift.
In this paper, we propose the ordering principle to solve both problems. Relying on it, we derive ordered boosting, a modification of the standard gradient boosting algorithm which avoids target leakage, and a new algorithm for processing categorical features. Their combination is implemented as an open-source library called CatBoost (for “Categorical Boosting”), which outperforms the existing state-of-the-art implementations of gradient boosted decision trees, XGBoost and LightGBM, on a diverse set of popular machine learning tasks.
CatBoost is an implementation of gradient boosting, which uses binary decision trees as base predictors.
A categorical feature is one with a discrete set of values called categories that are not comparable to each other.
One popular technique for dealing with categorical features in boosted trees is one-hot encoding, i.e., for each category, adding a new binary feature indicating it. However, in the case of high-cardinality features (e.g., a “user ID” feature), such a technique leads to an infeasibly large number of new features.
To address this issue, one can group categories into a limited number of clusters and then apply one-hot encoding. A popular method is to group categories by target statistics (TS) that estimate expected target value in each category.
Importantly, among all possible partitions of categories into two sets, an optimal split on the training data in terms of logloss, Gini index, or MSE can be found among the thresholds for the numerical TS feature.
In LightGBM, categorical features are converted to gradient statistics at each step of gradient boosting. Though providing important information for building a tree, this approach can dramatically increase (i) computation time, since it calculates statistics for each categorical value at each step, and (ii) memory consumption, to store which category belongs to which node for each split based on a categorical feature. LightGBM groups tail categories into one cluster and thus loses part of the information. Besides, the authors claim that it is still better to convert categorical features with high cardinality to numerical features.
Note that TS features require calculating and storing only one number per category.
Thus, using TS as new numerical features seems to be the most efficient method of handling categorical features with minimum information loss. TS are widely used, e.g., in the click prediction task (click-through rates), where such categorical features as user, region, ad, and publisher play a crucial role. We further focus on ways to calculate TS and leave one-hot encoding and gradient statistics out of the scope of the current paper. At the same time, we believe that the ordering principle proposed in this paper is also effective for gradient statistics.
As discussed in Section 3.1, an effective and efficient way to deal with a categorical feature $i$ is to substitute the category $x^i_k$ of the $k$-th training example with one numeric feature equal to some target statistic (TS) $\hat{x}^i_k$. Commonly, it estimates the expected target $y$ conditioned on the category: $\hat{x}^i_k \approx E(y \mid x^i = x^i_k)$.
Greedy TS
A straightforward approach is to estimate $E(y \mid x^i = x^i_k)$ as the average value of $y$ over the training examples with the same category $x^i_k$. This estimate is noisy for low-frequency categories, and one usually smooths it with some prior $p$:
$$\hat{x}^i_k = \frac{\sum_{j=1}^n I(x^i_j = x^i_k) \cdot y_j + a\,p}{\sum_{j=1}^n I(x^i_j = x^i_k) + a}$$
where $a > 0$ is a hyperparameter and the prior $p$ is commonly set to the average target value over the dataset.
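To make the formula concrete, here is a minimal sketch of the greedy TS in pandas (an illustration under assumed column names, not CatBoost's implementation; `a` is the smoothing hyperparameter above and the prior is the global target mean):

```python
import pandas as pd

def greedy_ts(train: pd.DataFrame, cat_col: str, target_col: str, a: float = 1.0) -> pd.Series:
    """Greedy target statistic: smoothed per-category mean of the target.

    Note that every row contributes to its own statistic, which is exactly
    the target leakage discussed next."""
    prior = train[target_col].mean()                      # p: average target value
    stats = train.groupby(cat_col)[target_col].agg(['sum', 'count'])
    ts = (stats['sum'] + a * prior) / (stats['count'] + a)
    return train[cat_col].map(ts)

# toy usage
df = pd.DataFrame({'user_id': ['u1', 'u1', 'u2', 'u3'], 'y': [1, 0, 1, 0]})
df['user_id_ts'] = greedy_ts(df, 'user_id', 'y')
```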
The problem with such a greedy approach is target leakage: the feature $\hat{x}^i_k$ is computed using $y_k$, the target of $x_k$. This leads to a conditional shift: the distribution of $\hat{x}^i \mid y$ differs for training and test examples.
The following extreme example illustrates how dramatically this may affect the generalization error of the learned model.
Assume that the $i$-th feature is categorical, all its values are unique (so $\sum_{j=1}^n I(x^i_j = x^i_k) = 1$), and that, for a classification task, each category $A$ satisfies $P(y = 1 \mid x^i = A) = 0.5$. Then, on the training set, $\hat{x}^i_k = \frac{y_k + ap}{1 + a}$, so a single split with threshold $t = \frac{0.5 + ap}{1 + a}$ separates the training examples perfectly. For test examples, however, the greedy TS equals $p$, so the model predicts $0$ if $p < t$ and $1$ otherwise, and its accuracy on test data is only $0.5$ in both cases.
Of course, this conditional shift can be avoided. The common idea is to compute the TS for $x_k$ on a subset of examples $D_k \subset D \setminus \{x_k\}$ that excludes $x_k$:
$$\hat{x}^i_k = \frac{\sum_{x_j \in D_k} I(x^i_j = x^i_k) \cdot y_j + a\,p}{\sum_{x_j \in D_k} I(x^i_j = x^i_k) + a} \tag{5}$$
Holdout TS
One way is to partition the training dataset into two parts $D = \hat{D}_0 \cup \hat{D}_1$ and use $D_k = \hat{D}_0$ for calculating the TS according to (5) and $\hat{D}_1$ for training (e.g., as applied for the Criteo dataset). Though such a holdout TS satisfies P1 (the TS feature has the same conditional distribution for training and test examples), this approach significantly reduces the amount of data used both for training the model and for calculating the TS.
Leave-one-out TS
At first glance, a leave-one-out technique might work well: take $D_k = D \setminus \{x_k\}$ for a training example $x_k$ and $D_k = D$ for test ones. Surprisingly, it does not prevent target leakage. Indeed, consider a constant categorical feature: $x^i_k = A$ for all examples. Let $n^+$ be the number of examples with $y = 1$; then $\hat{x}^i_k = \frac{n^+ - y_k + ap}{n - 1 + a}$, and one can perfectly classify the training dataset by making a split with threshold $\hat{x}^i_k = \frac{n^+ - 0.5 + ap}{n - 1 + a}$.
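To see the leak numerically, set $a = 0$ for simplicity:

$$y_k = 1 \;\Rightarrow\; \hat{x}^i_k = \frac{n^+ - 1}{n - 1}, \qquad y_k = 0 \;\Rightarrow\; \hat{x}^i_k = \frac{n^+}{n - 1},$$

so every positive training example receives a strictly smaller TS than every negative one, and the threshold above separates the two classes perfectly even though the feature carries no information about the target.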
Ordered TS
CatBoost uses a more effective strategy. It relies on the ordering principle, the central idea of this paper, and is inspired by online learning algorithms, which receive training examples sequentially in time. Clearly, the values of the TS for each example rely only on the observed history. To adapt this idea to the standard offline setting, we introduce an artificial “time”, i.e., a random permutation $\sigma$ of the training examples. Then, for each example, we use all the available “history” to compute its TS, i.e., we take $D_k = \{x_j : \sigma(j) < \sigma(k)\}$ in Equation (5) for a training example and $D_k = D$ for a test one. The obtained ordered TS satisfies requirement P1 and makes it possible to use all the training data for learning the model (P2). Note that, if we use only one random permutation, then preceding examples have TS with much higher variance than subsequent ones. To this end, CatBoost uses different permutations for different steps of gradient boosting; see details in Section 5.
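A minimal sketch of the ordered TS under a single permutation (an illustration only; CatBoost itself uses several permutations and efficient incremental updates). Here `cats` and `y` stand for one categorical column and the target:

```python
import numpy as np
import pandas as pd

def ordered_ts(cats: pd.Series, y: pd.Series, a: float = 1.0, seed: int = 0) -> pd.Series:
    """Ordered target statistic: each example only sees the examples that
    precede it in a random permutation (its "history")."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(cats))          # artificial "time"
    prior = y.mean()                           # p
    sums, counts = {}, {}                      # running per-category statistics
    ts = np.empty(len(cats))
    for idx in perm:                           # visit examples in "time" order
        c = cats.iloc[idx]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        ts[idx] = (s + a * prior) / (n + a)    # the example's own y is not used
        sums[c] = s + y.iloc[idx]              # update the history afterwards
        counts[c] = n + 1
    return pd.Series(ts, index=cats.index)
```

For a test example, the same running statistics accumulated over the whole training set would be used, i.e., $D_k = D$.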
In this section, we reveal the problem of prediction shift in gradient boosting, which was neither recognized nor previously addressed. Like in case of TS, prediction shift is caused by a special kind of target leakage. Our solution is called ordered boosting and resembles the ordered TS method.
As in the case of TS, these problems are caused by target leakage. Indeed, the gradients used at each step are estimated using the target values of the same data points the current model $F^{t-1}$ was built on. However, the conditional distribution $F^{t-1}(x_k) \mid x_k$ for a training example $x_k$ is shifted, in general, from the distribution $F^{t-1}(x) \mid x$ for a test example $x$. We call this a prediction shift.
A detailed example illustrating this prediction shift can be found in the paper.
Here we propose a boosting algorithm which does not suffer from the prediction shift problem described in Section 4.1. Assuming access to an unlimited amount of training data, we can easily construct such an algorithm: at each step of boosting, we sample a new dataset $D_t$ independently and obtain unshifted residuals by applying the current model to the new training examples. In practice, however, labeled data is limited. Assume that we learn a model with $I$ trees. To make the residual $r^{I-1}(x_k, y_k)$ unshifted, we need to have $F^{I-1}$ trained without the example $x_k$. Since we need unbiased residuals for all training examples, no examples may be used for training $F^{I-1}$, which at first glance makes the training process impossible.
However, it is possible to maintain a set of models differing by examples used for their training. Then, for calculating the residual on an example, we use a model trained without it. In order to construct such a set of models, we can use the ordering principle previously applied to TS in Section 3.2.
To illustrate the idea, assume that we take one random permutation $\sigma$ of the training examples and maintain $n$ different supporting models $M_1, \ldots, M_n$ such that the model $M_i$ is learned using only the first $i$ examples in the permutation. At each step, in order to obtain the residual for the $j$-th sample, we use the model $M_{j-1}$ (see Figure 1). The resulting Algorithm 1 is called ordered boosting below.
Unfortunately, this algorithm is not feasible in most practical tasks due to the need of training $n$ different models, which increases the complexity and memory requirements by a factor of $n$. In CatBoost, we implemented a modification of this algorithm on the basis of the gradient boosting algorithm with decision trees as base predictors (GBDT), described in Section 5.
Ordered boosting with categorical features
In Sections 3.2 and 4.2 we proposed to use random permutations $\sigma_{cat}$ and $\sigma_{boost}$ of the training examples for the TS calculation and for ordered boosting, respectively. Combining them in one algorithm, we should take $\sigma_{cat} = \sigma_{boost}$ to avoid prediction shift. This guarantees that the target $y_i$ is not used for training $M_i$ (neither for the TS calculation nor for the gradient estimation). See Section F of the supplementary material for theoretical guarantees. Empirical results confirming the importance of having $\sigma_{cat} = \sigma_{boost}$ are presented in Section G of the supplementary material.
Algorithm 1: Ordered boosting
input: $\{(x_k, y_k)\}_{k=1}^n, I$
$\sigma$ = random permutation of $[1, n]$
$M_i = 0$ for $i=1,...,n$
for t=1 to I do
for i = 1 to n do
$r_i = y_i - M_{\sigma(i)-1}(x_i)$
for i = 1 to n do
$h = LearnModel((x_j, r_j) : \sigma(j) \le i)$
$M_i = M_i + h$
return $M_n$
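Below is a deliberately naive Python sketch of Algorithm 1 for the squared loss (so the negative gradient is just the residual), using scikit-learn's DecisionTreeRegressor as a stand-in weak learner and a learning rate that the pseudocode omits. It keeps all $n$ supporting models explicitly, which makes the $n$-fold blow-up in time and memory obvious:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def ordered_boosting(X, y, n_trees=10, lr=0.1, seed=0):
    """Naive ordered boosting (Algorithm 1) for squared loss: the residual of
    example k is always computed by a model that never saw y_k."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    rng = np.random.default_rng(seed)
    sigma = rng.permutation(n)                 # sigma[t] = example sitting at position t+1
    pos = np.empty(n, dtype=int)
    pos[sigma] = np.arange(1, n + 1)           # pos[k] = 1-based position of example k

    # M[j] holds the predictions of supporting model M_j (trained on the
    # first j examples of the permutation) for every training example; M[0] = 0.
    M = np.zeros((n + 1, n))
    for _ in range(n_trees):
        r = y - M[pos - 1, np.arange(n)]       # residual of k uses M_{sigma(k)-1}
        for i in range(1, n + 1):
            prefix = sigma[:i]                 # the first i examples of the permutation
            tree = DecisionTreeRegressor(max_depth=3).fit(X[prefix], r[prefix])
            M[i] += lr * tree.predict(X)       # boost supporting model M_i
    return M[n]                                # training predictions of the final model M_n
```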
CatBoost has two boosting modes, Ordered and Plain. The latter mode is the standard GBDT algorithm with inbuilt ordered TS. The former mode presents an efficient modification of Algorithm 1. A formal description of the algorithm is included in Section B of the supplementary material. In this section, we overview the most important implementation details.
At the start, CatBoost generates $s + 1$ independent random permutations of the training dataset. The permutations $\sigma_1, \ldots, \sigma_s$ are used for evaluation of the splits that define tree structures (i.e., the internal nodes), while $\sigma_0$ serves for choosing the leaf values $b_j$ of the obtained trees (see Equation (3)).
For examples with a short history in a given permutation, both the TS and the predictions used by ordered boosting ($M_{\sigma(i)-1}(x_i)$ in Algorithm 1) have a high variance. Therefore, using only one permutation may increase the variance of the final model predictions, while several permutations allow us to reduce this effect in a way we further describe. The advantage of several permutations is confirmed by our experiments in Section 6.
Building a tree
In CatBoost, base predictors are oblivious decision trees, also called decision tables. The term oblivious means that the same splitting criterion is used across an entire level of the tree. Such trees are balanced, less prone to overfitting, and allow speeding up execution at testing time significantly. The procedure of building a tree in CatBoost is described in Algorithm 2.
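As a quick aside before Algorithm 2, here is a minimal sketch of why oblivious trees are so fast to evaluate: since every level shares a single (feature, threshold) pair, a tree of depth $d$ is just $d$ comparisons forming a binary index into a table of $2^d$ leaf values (a simplification, not CatBoost's actual data structures):

```python
import numpy as np

def oblivious_tree_predict(X, split_features, split_thresholds, leaf_values):
    """Evaluate an oblivious (symmetric) decision tree.

    split_features / split_thresholds hold one split per level of the tree;
    leaf_values has length 2**depth, one value per leaf."""
    index = np.zeros(len(X), dtype=int)
    for feat, thr in zip(split_features, split_thresholds):
        index = (index << 1) | (X[:, feat] > thr)   # each level contributes one bit
    return leaf_values[index]

# toy usage: a depth-2 tree over two numerical features
X = np.array([[0.1, 5.0], [0.9, 1.0]])
print(oblivious_tree_predict(X, [0, 1], [0.5, 3.0], np.array([0.0, 1.0, 2.0, 3.0])))
```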
Algorithm 2: Building a tree in CatBoost
input: M, {(x_i, y_i)}_{i=1}^n, \alpha, L, {\sigma_i}_{i=1}^s, Mode
grad = CalcGradient(L, M, y)
r = random(1, s)
if Mode = Plain then
G = (grad_r(i) for i = 1,...,n)
if Mode = Ordered then
G = (grad_{r, \sigma_r(i)-1}(i) for i = 1,...,n)
T = empty tree
for each step of top-down procedure do
for each candidate split c do
T_c = add split c to T
if Mode = Plain then
\Delta(i) = avg(grad_r(p) for p: leaf_r(p)=leaf_r(i)) for i = 1,...,n
if Mode = Ordered then
\Delta(i) = avg(grad_{r, \sigma_{r}(i)-1}(p) for p: leaf_r(p)=leaf_r(i), \sigma_r(p) < \sigma_r(i)) for i = 1,...,n
loss(T_c) = cos(\Delta, G)
T = argmin_{T_c}(loss(T_c))
if Mode = Plain then
M_{r'}(i) = M_{r'}(i) - \alpha avg(grad_{r'}(p) for p: leaf_{r'}(p)=leaf_{r'}(i)) for r'=1,...,s, i=1,...,n
if Mode = Ordered then
M_{r', j}(i) = M_{r', j}(i) - \alpha avg(grad_{r', j}(p) for p: leaf_{r'}(p)=leaf_{r'}(i), \sigma_{r'}(p) <= j) for r'=1,...,s, i=1,...,n, j >= \sigma_{r'}(i)-1
return T, M
In the Ordered boosting mode, during the learning process, we maintain the supporting models $M_{r,j}$, where $M_{r,j}(i)$ is the current prediction for the $i$-th example based on the first $j$ examples in the permutation $\sigma_r$, $r = 1, \ldots, s$.
At each iteration $t$ of the algorithm, we sample a random permutation $\sigma_r$ from $\{\sigma_1, \ldots, \sigma_s\}$ and construct a tree $T_t$ on the basis of it.
First, for categorical features, all TS are computed according to this permutation $\sigma_r$.
Second, the permutation $\sigma_r$ affects the tree learning procedure.
Namely, based on $M_{r,j}(i)$, we compute the corresponding gradients $grad_{r,j}(i) = \frac{\partial L(y_i, s)}{\partial s}\big|_{s = M_{r,j}(i)}$.
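For concreteness, with the squared loss this gradient is simply the signed residual:

$$L(y, s) = \tfrac{1}{2}(y - s)^2 \;\Rightarrow\; grad_{r,j}(i) = \frac{\partial L(y_i, s)}{\partial s}\Big|_{s = M_{r,j}(i)} = M_{r,j}(i) - y_i,$$

so the update $M \leftarrow M - \alpha \cdot \mathrm{avg}(grad)$ in Algorithm 2 moves the predictions toward the targets.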
Then, while constructing a tree, we approximate the gradient $G$ in terms of the cosine similarity $\cos(\cdot, \cdot)$, where, for each example $i$, we take the gradient $grad_{r, \sigma_r(i)-1}(i)$ (it is based only on the preceding examples in $\sigma_r$).
At the candidate splits evaluation step, the leaf value $\Delta(i)$ for example $i$ is obtained individually by averaging the gradients $grad_{r, \sigma_r(i)-1}$ of the preceding examples $p$ lying in the same leaf $leaf_r(i)$ the example $i$ belongs to.
Note that $leaf_r(i)$ depends on the chosen permutation $\sigma_r$, because $\sigma_r$ can influence the values of the ordered TS for example $i$.
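A rough sketch of this Ordered-mode split evaluation (illustrative, not CatBoost's code): running per-leaf sums give each example a leaf estimate $\Delta(i)$ that uses only its predecessors in the permutation, and candidate splits can then be compared through the cosine similarity between $\Delta$ and the gradient vector $G$. Here `grad`, `leaf_idx`, and `sigma_pos` are assumed inputs: the gradients, the leaf assignment of each example under the candidate split, and the permutation positions.

```python
import numpy as np

def ordered_leaf_deltas(grad, leaf_idx, sigma_pos):
    """Delta(i) = average gradient of the examples that share i's leaf and
    precede i in the permutation (0 if i has no predecessors in its leaf)."""
    order = np.argsort(sigma_pos)              # visit examples in permutation order
    sums, counts = {}, {}
    delta = np.zeros(len(grad))
    for i in order:
        leaf = leaf_idx[i]
        s, n = sums.get(leaf, 0.0), counts.get(leaf, 0)
        delta[i] = s / n if n > 0 else 0.0     # only preceding examples contribute
        sums[leaf] = s + grad[i]
        counts[leaf] = n + 1
    return delta

def split_score(delta, grad):
    """Cosine similarity between the leaf estimates and the gradient vector;
    values closer to 1 mean the candidate split approximates the gradients better."""
    return np.dot(delta, grad) / (np.linalg.norm(delta) * np.linalg.norm(grad) + 1e-12)
```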
When the tree structure $T_t$ (i.e., the sequence of splitting attributes) is built, we use it to boost all the models $M_{r',j}$.
Let us stress that one common tree structure $T_t$ is used for all the models, but this tree is added to different $M_{r',j}$ with different sets of leaf values depending on $r'$ and $j$, as described in Algorithm 2.
The Plain boosting mode works similarly to a standard GBDT procedure, but, if categorical features are present, it maintains $s$ supporting models $M_r$ corresponding to TS based on $\sigma_1, \ldots, \sigma_s$.
Choosing leaf values
Given all the trees constructed, the leaf values of the final model $F$ are calculated by the standard gradient boosting procedure equally for both modes. Training examples $i$ are matched to leaves $leaf_0(i)$, i.e., we use the permutation $\sigma_0$ to calculate the TS here. When the final model $F$ is applied to a new example at testing time, we use TS calculated on the whole training data according to Section 3.2.
Feature combinations
Another important detail of CatBoost is using combinations of categorical features as additional categorical features which capture high-order dependencies like joint information of user ID and ad topic in the task of ad click prediction.
The number of possible combinations grows exponentially with the number of categorical features in the dataset, and it is infeasible to process all of them.
CatBoost constructs combinations in a greedy way. Namely, for each split of a tree, CatBoost combines (concatenates) all categorical features (and their combinations) already used for previous splits in the current tree with all categorical features in the dataset. Combinations are converted to TS on the fly.
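A toy sketch of this greedy construction (illustrative names, not CatBoost internals): the categorical features, or combinations, already used by splits of the current tree are concatenated with every categorical feature of the dataset, and each resulting combination is then encoded with an ordered TS on the fly:

```python
import pandas as pd

def combine_categoricals(df, used_cats, all_cats):
    """Greedily build combination features: concatenate every already-used
    categorical feature (or combination) with every categorical feature."""
    out = df.copy()
    for used in used_cats:
        for cat in all_cats:
            if cat != used:
                out[f"{used}|{cat}"] = df[used].astype(str) + "_" + df[cat].astype(str)
    return out

# toy usage: 'user_id' was already used by a split in the current tree
df = pd.DataFrame({'user_id': ['u1', 'u2'], 'ad_topic': ['cars', 'food']})
print(combine_categoricals(df, used_cats=['user_id'], all_cats=['user_id', 'ad_topic']))
```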
Other important details
Finally, let us discuss two options of the CatBoost algorithm not covered above.
The first one is subsampling of the dataset at each iteration of the boosting procedure, as proposed by Friedman. We claimed earlier in Section 4.1 that this approach alone cannot fully avoid the problem of prediction shift. However, since it has proved effective, we implemented it in both modes of CatBoost as a Bayesian bootstrap procedure.
The second option deals with the first several examples in a permutation. For examples $i$ with small values of $\sigma_r(i)$, the variance of $grad_{r, \sigma_r(i)-1}(i)$ can be high. Therefore, we discard $\Delta(i)$ from the beginning of the permutation when we calculate $loss(T_c)$ in Algorithm 2. In particular, we eliminate the corresponding components of the vectors $G$ and $\Delta$ when calculating the cosine similarity between them.
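A minimal illustration of this truncation (the exact cutoff used by CatBoost is an implementation detail, so `skip_first` here is an assumed parameter): drop the components that correspond to the earliest positions of the permutation before computing the cosine similarity.

```python
import numpy as np

def truncated_cosine(delta, grad, sigma_pos, skip_first=10):
    """Cosine similarity between delta and grad that ignores the examples
    appearing among the first `skip_first` positions of the permutation,
    whose ordered estimates have very high variance."""
    keep = sigma_pos > skip_first
    d, g = delta[keep], grad[keep]
    return np.dot(d, g) / (np.linalg.norm(d) * np.linalg.norm(g) + 1e-12)
```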
In this paper, we identify and analyze the problem of prediction shifts present in all existing implementations of gradient boosting. We propose a general solution, ordered boosting with ordered TS, which solves the problem. This idea is implemented in CatBoost, which is a new gradient boosting library. Empirical results demonstrate that CatBoost outperforms leading GBDT packages and leads to new state-of-the-art results on common benchmarks.