XGBoost: A Scalable Tree Boosting System (Reading Notes)

Abstract

Boosted trees are widely used across many areas of machine learning. This paper proposes a new tree boosting system.

1. Introduction

The paper lists the following four contributions:

  • We design and build a highly scalable end-to-end tree boosting system.
  • We propose a theoretically justified weighted quantile sketch for efficient proposal calculation.
  • We introduce a novel sparsity-aware algorithm for parallel tree learning.
  • We propose an effective cache-aware block structure for out-of-core tree learning.

2. Tree Boosting in a Nutshell

  • 2.1-2.2 cover gradient tree boosting and how XGBoost computes its objective; for details see https://www.jianshu.com/p/8fd9f6aef825 and https://www.jianshu.com/p/0bb0ff3eb2ef
  • 2.3 Techniques for preventing overfitting (a minimal parameter sketch follows this list)
    (1) Shrinkage: "Shrinkage scales newly added weights by a factor after each step of tree boosting." Each new tree's output is scaled by this factor, much like the learning rate in deep learning: a small value means slower updates, a large value means faster ones.
    (2) Column Subsampling: column (feature) subsampling is a technique commonly used in random forests. The paper's experiments show that it not only improves accuracy but also speeds up computation.
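
A minimal sketch of how these two knobs are exposed in the xgboost Python package (assuming xgboost and numpy are installed; the toy data and parameter values are made up for illustration):

```python
import numpy as np
import xgboost as xgb

# Toy regression data, only so the call is runnable end to end.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "reg:squarederror",
    "eta": 0.1,               # shrinkage: each new tree's weights are scaled by 0.1
    "colsample_bytree": 0.8,  # column subsampling: 80% of features sampled per tree
    "max_depth": 4,
}

# Each boosting round adds one tree whose contribution is shrunk by eta.
booster = xgb.train(params, dtrain, num_boost_round=100)
```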

3. Split Finding Algorithms

  • 3.1 Basic Exact Greedy Algorithm
    To find the best split, one option is to enumerate all possible splits over every feature and keep the one with the highest gain. For efficiency, the algorithm sorts the data by each continuous feature's values and accumulates the gradient statistics along the sorted order (a small NumPy sketch of this scan appears after this list).


    (Figure from the paper)
  • 3.2 Approximate Algorithm
    The exact greedy algorithm is a powerful way to find splits, but it is computationally demanding and slow, because it must enumerate every split over all features and needs the entire dataset to fit in memory. The approximate algorithm comes in two variants:
    (1) global: all candidate split points are proposed once, during the initial phase of tree construction, and the same candidates are reused for split finding at every level.
    (2) local: candidates are re-proposed after each split; this suits deeper trees, since fewer candidate points have to be prepared in advance.


    (Figure from the paper)
  • 3.3 Weighted Quantile Sketch
    A key step of the approximate algorithm is how to propose the candidate split points. XGBoost does this with a weighted quantile sketch, in which each instance is weighted by its second-order gradient statistic h_i, so that the candidates divide the data into buckets of roughly equal total weight (a toy version of this weighted proposal appears after this list).

  • 3.4 Sparsity-aware Split Finding
    Data can be sparse for the following three reasons:
    (1) missing values in the data
    (2) a large number of zero entries
    (3) artifacts of feature engineering such as one-hot encoding
    To handle sparsity, the algorithm learns a default direction for missing (sparse) values at each node: the missing-value instances are tentatively sent to the left child and then to the right child, the gain of both choices is computed, and the direction with the larger gain is kept as the default (a simplified sketch appears after this list).


    (Figure from the paper)
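
To make the exact greedy scan of 3.1 concrete, here is my own small NumPy sketch (not the paper's code): sort one feature, sweep left to right while accumulating the left-side gradient/hessian sums, and score each split with the gain formula 1/2 [G_L^2/(H_L+λ) + G_R^2/(H_R+λ) − (G_L+G_R)^2/(H_L+H_R+λ)] − γ.

```python
import numpy as np

def best_split_one_feature(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy scan over one feature.
    x: feature values; g/h: first/second-order gradients per instance."""
    order = np.argsort(x)                 # sort once, then scan in order
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    score = lambda gs, hs: gs * gs / (hs + lam)
    best_gain, best_thr = -np.inf, None
    GL = HL = 0.0
    for i in range(len(x) - 1):
        GL += g[i]; HL += h[i]
        if x[i] == x[i + 1]:              # cannot split between equal values
            continue
        GR, HR = G - GL, H - HL
        gain = 0.5 * (score(GL, HL) + score(GR, HR) - score(G, H)) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, (x[i] + x[i + 1]) / 2.0
    return best_gain, best_thr
```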
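For 3.2/3.3, a toy version of the candidate proposal: split points are chosen so that each bucket carries roughly the same total second-order gradient weight. This is only the intuition behind the weighted quantile sketch; the real data structure in the paper is a mergeable, approximate summary, which this exact toy version ignores.

```python
import numpy as np

def weighted_quantile_candidates(x, h, eps=0.1):
    """Propose candidate splits so that each bucket holds ~eps of the
    total hessian weight (toy, exact version of the weighted quantile idea)."""
    order = np.argsort(x)
    x, h = x[order], h[order]
    cum = np.cumsum(h) / h.sum()          # weighted rank in [0, 1]
    candidates, next_rank = [], eps
    for xi, r in zip(x, cum):
        if r >= next_rank:
            candidates.append(xi)
            next_rank += eps
    return np.array(candidates)
```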
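For 3.4, a simplified illustration of the default-direction idea (again my own sketch): only non-missing entries are scanned, and the gradient mass of the missing instances is tried on the left and on the right, keeping whichever direction gives the larger gain.

```python
import numpy as np

def best_split_with_default_direction(x, g, h, lam=1.0):
    """x may contain np.nan for missing values; g/h are gradient statistics."""
    miss = np.isnan(x)
    G, H = g.sum(), h.sum()                   # totals include missing instances
    xs, gs, hs = x[~miss], g[~miss], h[~miss]
    order = np.argsort(xs)
    xs, gs, hs = xs[order], gs[order], hs[order]
    Gmiss, Hmiss = G - gs.sum(), H - hs.sum() # gradient mass of missing rows
    score = lambda a, b: a * a / (b + lam)
    best = (-np.inf, None, None)              # (gain, threshold, default direction)
    GL = HL = 0.0
    for i in range(len(xs) - 1):
        GL += gs[i]; HL += hs[i]
        if xs[i] == xs[i + 1]:
            continue
        thr = (xs[i] + xs[i + 1]) / 2.0
        # Missing values go right: left sums use only non-missing entries.
        gain_r = 0.5 * (score(GL, HL) + score(G - GL, H - HL) - score(G, H))
        # Missing values go left: fold the missing mass into the left sums.
        GLm, HLm = GL + Gmiss, HL + Hmiss
        gain_l = 0.5 * (score(GLm, HLm) + score(G - GLm, H - HLm) - score(G, H))
        cand = max((gain_r, thr, "right"), (gain_l, thr, "left"))
        if cand[0] > best[0]:
            best = cand
    return best
```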

4. System Design

  • 4.1 Column Block for Parallel Learning
    To reduce the cost of repeated sorting, the data is stored in in-memory units called blocks. "Data in each block is stored in the compressed column (CSC) format, with each column sorted by the corresponding feature value. This input data layout only needs to be computed once before training, and can be reused in later iterations." (A rough illustration of this precomputed sort order follows this list.)
  • 4.2 Cache-aware Access
    Because the gradient statistics are fetched in sorted feature order, they are read from non-contiguous memory locations, so the cache hit rate is low. Two remedies are used:
    (1) "For the exact greedy algorithm, we can alleviate the problem by a cache-aware prefetching algorithm. Specifically, we allocate an internal buffer in each thread, fetch the gradient statistics into it, and then perform accumulation in a mini-batch manner."
    (2) "For approximate algorithms, we solve the problem by choosing a correct block size. We define the block size to be the maximum number of examples contained in a block, as this reflects the cache storage cost of gradient statistics."
  • 4.3 Blocks for Out-of-core Computation
    The techniques above optimize memory and CPU usage; this part is about optimizing disk I/O.
    (1) Block Compression: blocks are compressed by columns on disk and decompressed on the fly by an independent thread while being read. For row indices, only the block's starting index is stored, and each row index is kept as a 16-bit offset from it.
    (2) Block Sharding: the data is sharded across multiple disks, and multiple threads are used to read it.
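
A rough illustration of the column-block idea from 4.1 (my own sketch, not xgboost's internals): the per-feature sort order is computed once before training and then reused in every split-finding pass, instead of re-sorting at each node.

```python
import numpy as np

# X: dense n_samples x n_features matrix. (Real column blocks use the CSC
# sparse format; the point here is only the precomputed per-column order.)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))

# Built once before training: for every feature, the indices that sort it.
sorted_index_per_feature = [np.argsort(X[:, j]) for j in range(X.shape[1])]

def scan_feature(j, g, h):
    """Any later split search walks the precomputed order; no re-sorting."""
    order = sorted_index_per_feature[j]
    return X[order, j], g[order], h[order]
```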

5-6. Skipped (related work and end-to-end evaluations)

7. Conclusion

We proposed a novel sparsity-aware algorithm for handling sparse data and a theoretically justified weighted quantile sketch for approximate learning. Our experience shows that cache access patterns, data compression and sharding are essential elements for building a scalable end-to-end system for tree boosting. These lessons can be applied to other machine learning systems as well. By combining these insights, XGBoost is able to solve real-world scale problems using a minimal amount of resources.

