XGBoost Reading Notes: Weighted Quantile Sketch

3.3 Weighted Quantile Sketch

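One important step in the approximate algorithm is to propose candidate split points. Usually percentiles of a feature are used to make candidates distribute evenly on the data. Formally, let the multi-set $\mathcal{D}_k = \{(x_{1k}, h_1), (x_{2k}, h_2), \cdots, (x_{nk}, h_n)\}$ represent the k-th feature values and second order gradient statistics of each training instance. We can define a rank function $r_k : \mathbb{R} \rightarrow [0, +\infty)$ as

$$r_k(z) = \frac{1}{\sum_{(x,h) \in \mathcal{D}_k} h} \sum_{(x,h) \in \mathcal{D}_k,\, x < z} h,$$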

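which represents the proportion of instances whose feature value k is smaller than z. The goal is to find candidate split points $\{s_{k1}, s_{k2}, \cdots, s_{kl}\}$ such that

$$|r_k(s_{k,j}) - r_k(s_{k,j+1})| < \epsilon, \quad s_{k1} = \min_i x_{ik}, \quad s_{kl} = \max_i x_{ik}.$$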

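Here ε is an approximation factor. Intuitively, this means that there are roughly 1/ε candidate points. Here each data point is weighted by $h_i$. To see why $h_i$ represents the weight, we can rewrite Eq (3) as

$$\sum_{i=1}^{n} \frac{1}{2} h_i \left(f_t(x_i) - g_i/h_i\right)^2 + \Omega(f_t) + \text{constant},$$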

which is exactly weighted squared loss with labels $g_i/h_i$ and weights $h_i$. For large datasets, it is non-trivial to find candidate splits that satisfy the criteria. When every instance has equal weights, an existing algorithm called quantile sketch [14, 24] solves the problem. However, there is no existing quantile sketch for weighted datasets. Therefore, most existing approximate algorithms either resorted to sorting on a random subset of data, which has a chance of failure, or to heuristics that do not have a theoretical guarantee.
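To make the rank function and the ε criterion above concrete, here is a minimal sketch in Python. It is not the weighted quantile sketch itself: it sorts the full feature column and picks candidates greedily, which is exactly what the sketch is designed to avoid on large or distributed data. The function names `weighted_rank` and `propose_candidates` are hypothetical, and the ε bound can only hold when no single feature value carries ε or more of the total weight.

```python
import numpy as np


def weighted_rank(x, h, z):
    """r_k(z): share of the total weight h held by instances with x < z."""
    x = np.asarray(x, dtype=float)
    h = np.asarray(h, dtype=float)
    return h[x < z].sum() / h.sum()


def propose_candidates(x, h, eps):
    """Greedy, exact candidate selection on fully sorted data (no sketch).

    Assumption: no single distinct value carries eps or more of the total
    weight; otherwise the eps criterion cannot be met by any candidate set.
    """
    x = np.asarray(x, dtype=float)
    h = np.asarray(h, dtype=float)
    # Collapse duplicate feature values and add up their weights.
    values, inverse = np.unique(x, return_inverse=True)
    weights = np.bincount(inverse, weights=h)
    # ranks[j] = r_k(values[j]): normalised weight strictly below values[j].
    ranks = (np.cumsum(weights) - weights) / weights.sum()

    candidates = [values[0]]  # s_k1 = min x
    last_rank = ranks[0]
    for j in range(1, len(values)):
        is_last = j == len(values) - 1  # s_kl = max x is always kept
        # Keep values[j] if skipping it would open a rank gap of eps or more.
        if is_last or ranks[j + 1] - last_rank >= eps:
            candidates.append(values[j])
            last_rank = ranks[j]
    return np.array(candidates)


# Usage on synthetic data: non-uniform weights stand in for the h_i.
rng = np.random.default_rng(0)
feature = rng.normal(size=1000)
hessian = rng.uniform(0.1, 1.0, size=1000)
splits = propose_candidates(feature, hessian, eps=0.1)
print(len(splits), "candidate split points")
```

With ε = 0.1 the number of candidates stays on the order of 1/ε, and instances with larger h_i pull the candidates closer together in the regions where they sit.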






To solve this problem, we introduced a novel distributed weighted quantile sketch algorithm that can handle weighted data with a provable theoretical guarantee. The general idea is to propose a data structure that supports merge and prune operations, with each operation proven to maintain a certain accuracy level. A detailed description of the algorithm as well as proofs are given in the supplementary material (see footnote 5).

5: Link to the supplementary material: https://homes.cs.washington.edu/~tqchen/pdf/xgboost-supp.pdf



3.3 Summary:

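  1. It proposes an algorithm for finding candidate bucket (split) points when the instances carry weights; the detailed principles are in the supplementary material.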

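  2. It explains why using the second derivative as the weight is reasonable, but there are two problems. First, the sign is wrong: completing the square in Eq. (3) gives labels -gi/hi, not gi/hi. Second, even with the sign corrected, the paper does not make clear why this is reasonable. That is, what does using -gi/hi as the label mean?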

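  3. My understanding of the second question: look at Eq. (3). The objective consists of the two remaining terms of the second-order Taylor expansion of the loss, gi*w + (1/2)*hi*w^2, so with a single sample the optimum is clearly w = -gi/hi. This explains why Section 3.3 rewrites the objective in a form that uses -gi/hi as the label. For a set of samples, ignoring the regularization term, the optimum is -sum(gi)/sum(hi). If the second derivative of the loss were the same constant everywhere, this would reduce to -avg(gi)/const, independent of any per-sample differences in the second derivative. Precisely because there are loss functions whose second derivative varies from point to point, the optimum over all samples is not simply a multiple of avg(gi). Since -sum(gi)/sum(hi) = sum(hi * (-gi/hi))/sum(hi), the optimal w is the average of the per-sample optima -gi/hi, weighted by the second derivatives hi.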
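The argument in the notes above can be checked numerically. The following is a small sketch under stated assumptions: it uses logistic loss on synthetic data (so g = p - y and h = p(1 - p)), ignores the regularization term, and fits a single leaf weight w. All variable and function names are illustrative and do not come from the paper or the XGBoost code base.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n).astype(float)  # synthetic binary labels
margin = rng.normal(size=n)                   # assumed current predictions
p = 1.0 / (1.0 + np.exp(-margin))

g = p - y          # first derivative of logistic loss w.r.t. the margin
h = p * (1.0 - p)  # second derivative, varies from sample to sample


def taylor_objective(w):
    """Eq. (3) without regularization, for one leaf with weight w."""
    return np.sum(g * w + 0.5 * h * w ** 2)


def weighted_squared_loss(w):
    """Squared loss with labels -g/h and weights h (corrected sign)."""
    return np.sum(0.5 * h * (w + g / h) ** 2)


# Completing the square: the two objectives differ by a constant only.
constant = -np.sum(g ** 2 / (2.0 * h))

# Closed-form optimum and the h-weighted mean of per-sample optima.
w_star = -g.sum() / h.sum()
w_weighted_mean = np.average(-g / h, weights=h)

for w in (-1.0, 0.0, 0.3, w_star):
    assert np.isclose(taylor_objective(w), weighted_squared_loss(w) + constant)
assert np.isclose(w_star, w_weighted_mean)
print("optimal leaf weight:", w_star)
```

Because the two objectives differ only by a term that does not depend on w, they share the same minimizer, and that minimizer is the h-weighted mean of the per-sample targets -gi/hi. This is the sense in which hi acts as an instance weight when choosing split candidates.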
