Adam坤

XGBoost: A Scalable Tree Boosting System


ABSTRACT

Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and a weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

Keywords

Large-scale Machine Learning

1. INTRODUCTION

Machine learning and data-driven approaches are becoming very important in many areas. Smart spam classifiers protect our email by learning from massive amounts of spam data and user feedback; advertising systems learn to match the right ads with the right context; fraud detection systems protect banks from malicious attackers; anomaly event detection systems help experimental physicists to find events that lead to new physics. There are two important factors that drive these successful applications: usage of effective (statistical) models that capture the complex data dependencies and scalable learning systems that learn the model of interest from large datasets.

Among the machine learning methods used in practice, gradient tree boosting [10] is one technique that shines in many applications. Tree boosting has been shown to give state-of-the-art results on many standard classification benchmarks [16]. LambdaMART [5], a variant of tree boosting for ranking, achieves state-of-the-art results for ranking problems. Besides being used as a stand-alone predictor, it is also incorporated into real-world production pipelines for ad click through rate prediction [15]. Finally, it is the de-facto choice of ensemble method and is used in challenges such as the Netflix prize [3].


In this paper, we describe XGBoost, a scalable machine learning system for tree boosting. The system is available as an open source package. The impact of the system has been widely recognized in a number of machine learning and data mining challenges. Take the challenges hosted by the machine learning competition site Kaggle for example. Among the 29 challenge-winning solutions published on Kaggle's blog during 2015, 17 solutions used XGBoost. Among these solutions, eight solely used XGBoost to train the model, while most others combined XGBoost with neural nets in ensembles. For comparison, the second most popular method, deep neural nets, was used in 11 solutions. The success of the system was also witnessed in KDDCup 2015, where XGBoost was used by every winning team in the top-10. Moreover, the winning teams reported that ensemble methods outperform a well-configured XGBoost by only a small amount [1].


These results demonstrate that our system gives state-of-the-art results on a wide range of problems. Examples of the problems in these winning solutions include: store sales prediction; high energy physics event classification; web text classification; customer behavior prediction; motion detection; ad click through rate prediction; malware classification; product categorization; hazard risk prediction; massive online course dropout rate prediction. While domain-dependent data analysis and feature engineering play an important role in these solutions, the fact that XGBoost is the consensus choice of learner shows the impact and importance of our system and tree boosting.


The most important factor behind the success of XGBoost is its scalability in all scenarios. The system runs more than ten times faster than existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings. The scalability of XGBoost is due to several important systems and algorithmic optimizations. These innovations include a novel tree learning algorithm for handling sparse data, and a theoretically justified weighted quantile sketch procedure that enables handling instance weights in approximate tree learning. Parallel and distributed computing makes learning faster, which enables quicker model exploration. More importantly, XGBoost exploits out-of-core computation and enables data scientists to process hundreds of millions of examples on a desktop. Finally, it is even more exciting to combine these techniques to make an end-to-end system that scales to even larger data with the least amount of cluster resources. The major contributions of this paper are listed as follows:


We design and build a highly scalable end-to-end tree boosting system.

We propose a theoretically justified weighted quantile sketch for efficient proposal calculation.

We introduce a novel sparsity-aware algorithm for parallel tree learning.

We propose an effective cache-aware block structure for out-of-core tree learning.

While there are some existing works on parallel tree boosting [22, 23, 19], directions such as out-of-core computation, cache-aware and sparsity-aware learning have not been explored. More importantly, an end-to-end system that combines all of these aspects gives a novel solution for real-world use-cases. This enables data scientists as well as researchers to build powerful variants of tree boosting algorithms [7, 8]. Besides these major contributions, we also make additional improvements in proposing a regularized learning objective, which we will include for completeness.

The remainder of the paper is organized as follows. We will first review tree boosting and introduce a regularized objective in Sec. 2. We then describe the split finding methods in Sec. 3 as well as the system design in Sec. 4, including experimental results when relevant to provide quantitative support for each optimization we describe. Related work is discussed in Sec. 5. Detailed end-to-end evaluations are included in Sec. 6. Finally we conclude the paper in Sec. 7.


2. TREE BOOSTING IN A NUTSHELL

We review gradient tree boosting algorithms in this section. The derivation follows the same idea as the existing literature on gradient boosting. Specifically, the second order method originates from Friedman et al. [12]. We make minor improvements to the regularized objective, which were found helpful in practice.


2.1 Regularized Learning Objective

Figure 1: Tree ensemble model. The final prediction for a given example is the sum of the predictions from each tree.

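An example is classified into the leaves, and the final prediction is calculated by summing up the scores in the corresponding leaves (given by w). To learn the set of functions used in the model, we minimize the following regularized objective.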

Here l is a differentiable convex loss function that measures the difference between the prediction ŷ_i and the target y_i. The second term penalizes the complexity of the model (i.e., the regression tree functions). The additional regularization term helps to smooth the final learnt weights to avoid over-fitting. Intuitively, the regularized objective will tend to select a model employing simple and predictive functions. A similar regularization technique has been used in the Regularized greedy forest (RGF) [25] model. Our objective and the corresponding learning algorithm are simpler than RGF and easier to parallelize. When the regularization parameter is set to zero, the objective falls back to the traditional gradient tree boosting.
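The objective itself is not reproduced in this text. For reference, a sketch of it as given in the published XGBoost paper is shown below. The symbols K (number of trees), f_k (the k-th tree), T (number of leaves in a tree), w (the vector of leaf weights), and the penalties γ and λ are taken from that paper and are not defined in the surrounding text.

```latex
\[
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad
\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k), \qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2
\]
```

Setting γ = λ = 0 removes the second term, which is the fallback to traditional gradient tree boosting mentioned above.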


2.2 Gradient Tree Boosting

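Figure 2: Structure score calculation. We only need to sum up the gradient and second order gradient statistics on each leaf, then apply the scoring formula to get the quality score.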

Eq (6) can be used as a scoring function to measure the quality of a tree structure q. This score is like the impurity score for evaluating decision trees, except that it is derived for a wider range of objective functions. Fig. 2 illustrates how this score can be calculated.
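The equations referred to here are not reproduced in this text. For reference, they are given below as stated in the published paper, under these notational assumptions: g_i and h_i are the first and second order gradient statistics of the loss for instance i, I_j is the set of instances in leaf j, T is the number of leaves, and λ and γ are the regularization parameters. I_L and I_R are the instance sets of the left and right child after a split, with I = I_L ∪ I_R.

```latex
\[
w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}
\tag{5}
\]
\[
\tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T}
  \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T
\tag{6}
\]
\[
\mathcal{L}_{split} = \frac{1}{2} \left[
  \frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda}
  + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda}
  - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda}
\right] - \gamma
\tag{7}
\]
```

Eq (7) is the reduction in Eq (6) obtained by replacing one leaf with two: the three fractions are the per-leaf terms of Eq (6) for the two children and the parent, and the extra leaf costs one additional γ.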

This formula is usually used in practice for evaluating the split candidates.
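As a rough illustration of how these scores are evaluated, the sketch below computes the optimal leaf weights, the structure score, and the gain of a single split from per-instance gradient statistics. The formulas follow the published paper; the function names, the default `lam` and `gamma` values, and the toy data (squared-error loss, so g = prediction - target and h = 1) are assumptions made for this example.

```python
def structure_score(grad, hess, leaf_of, n_leaves, lam=1.0, gamma=0.0):
    """Return (optimal leaf weights, structure score) for a fixed tree.

    leaf_of[i] is the index of the leaf that instance i falls into.
    """
    G = [0.0] * n_leaves
    H = [0.0] * n_leaves
    for g, h, j in zip(grad, hess, leaf_of):
        G[j] += g
        H[j] += h
    weights = [-G[j] / (H[j] + lam) for j in range(n_leaves)]
    score = -0.5 * sum(G[j] ** 2 / (H[j] + lam) for j in range(n_leaves))
    return weights, score + gamma * n_leaves


def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    """Loss reduction from splitting one leaf into a left and a right child."""
    parent = (GL + GR) ** 2 / (HL + HR + lam)
    return 0.5 * (GL ** 2 / (HL + lam) + GR ** 2 / (HR + lam) - parent) - gamma


# Toy data: four instances, two per leaf.
grad = [-1.0, -1.0, 2.0, 2.0]
hess = [1.0, 1.0, 1.0, 1.0]

weights, two_leaf_score = structure_score(grad, hess, [0, 0, 1, 1], n_leaves=2)
_, one_leaf_score = structure_score(grad, hess, [0, 0, 0, 0], n_leaves=1)
gain = split_gain(GL=-2.0, HL=2.0, GR=4.0, HR=2.0)
```

A lower structure score means a better tree, so the gain of a split equals the one-leaf score minus the two-leaf score.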

2.3 Shrinkage and Column Subsampling

Besides the regularized objective mentioned in Sec. 2.1, two additional techniques are used to further prevent over-fitting. The first technique is shrinkage, introduced by Friedman [11]. Shrinkage scales newly added weights by a factor η after each step of tree boosting. Similar to a learning rate in stochastic optimization, shrinkage reduces the influence of each individual tree and leaves space for future trees to improve the model. The second technique is column (feature) subsampling. This technique is used in RandomForest [4, 13]. It is implemented in the commercial software TreeNet for gradient boosting, but is not implemented in existing open-source packages. According to user feedback, using column sub-sampling prevents over-fitting even more so than the traditional row sub-sampling (which is also supported). The usage of column sub-samples also speeds up computations of the parallel algorithm described later.
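A minimal sketch of where the two techniques sit in a boosting loop is given below. It is not XGBoost's implementation: the base learner is a one-split stump, the loss is squared error, and the names `fit_stump`, `boost`, `eta` and `colsample` are assumptions chosen for this illustration.

```python
import random


def fit_stump(X, grad, hess, columns, lam=1.0):
    """Best single split over the given columns; returns (feature, threshold, w_left, w_right)."""
    def leaf(G, H):
        return -G / (H + lam)

    def term(G, H):
        return G * G / (H + lam)

    G_all, H_all = sum(grad), sum(hess)
    best = (0.0, None, None, leaf(G_all, H_all), leaf(G_all, H_all))
    for k in columns:
        order = sorted(range(len(X)), key=lambda i: X[i][k])
        GL = HL = 0.0
        for a, b in zip(order, order[1:]):
            GL += grad[a]
            HL += hess[a]
            if X[a][k] == X[b][k]:
                continue  # no threshold fits between equal values
            gain = term(GL, HL) + term(G_all - GL, H_all - HL) - term(G_all, H_all)
            if gain > best[0]:
                thr = (X[a][k] + X[b][k]) / 2
                best = (gain, k, thr, leaf(GL, HL), leaf(G_all - GL, H_all - HL))
    return best[1:]


def boost(X, y, rounds=50, eta=0.1, colsample=0.5, seed=0):
    rng = random.Random(seed)
    n_features = len(X[0])
    pred = [0.0] * len(X)
    for _ in range(rounds):
        grad = [p - t for p, t in zip(pred, y)]  # squared error: g = pred - y
        hess = [1.0] * len(X)                    # squared error: h = 1
        n_cols = max(1, int(colsample * n_features))
        columns = rng.sample(range(n_features), n_cols)  # column subsampling
        k, thr, w_left, w_right = fit_stump(X, grad, hess, columns)
        for i, row in enumerate(X):
            w = w_left if (k is None or row[k] < thr) else w_right
            pred[i] += eta * w                   # shrinkage scales the new weights
    return pred


X = [[0.0, 5.0], [1.0, 3.0], [2.0, 8.0], [3.0, 1.0]]
y = [0.0, 0.0, 1.0, 1.0]
pred = boost(X, y)
```

Each round only sees the sampled columns, and each new stump contributes `eta` times its leaf weight, so later rounds still have residual error to fit.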


3. SPLIT FINDING ALGORITHMS

3.1 Basic Exact Greedy Algorithm

One of the key problems in tree learning is to find the best split as indicated by Eq (7). In order to do so, a split finding algorithm enumerates over all the possible splits on all the features. We call this the exact greedy algorithm. Most existing single machine tree boosting implementations, such as scikit-learn [20], R's gbm [21] as well as the single machine version of XGBoost support the exact greedy algorithm. The exact greedy algorithm is shown in Alg. 1. It is computationally demanding to enumerate all the possible splits for continuous features. In order to do so efficiently, the algorithm must first sort the data according to feature values and visit the data in sorted order to accumulate the gradient statistics for the structure score in Eq (7).
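Alg. 1 is not reproduced in this text, so the following is only a sketch of the procedure the paragraph describes: sort each feature, sweep it once while accumulating the left-side gradient statistics, and score every boundary between distinct values. The gain formula is the split criterion from the published paper; the function name, the midpoint threshold and the toy data are assumptions.

```python
def exact_greedy_split(X, grad, hess, lam=1.0, gamma=0.0):
    """Return (gain, feature, threshold) of the best split, or (0.0, None, None)."""
    n, m = len(X), len(X[0])
    G, H = sum(grad), sum(hess)
    parent = G * G / (H + lam)
    best = (0.0, None, None)
    for k in range(m):
        order = sorted(range(n), key=lambda i: X[i][k])  # sort by feature value
        GL = HL = 0.0
        for pos in range(n - 1):
            i, nxt = order[pos], order[pos + 1]
            GL += grad[i]
            HL += hess[i]
            if X[i][k] == X[nxt][k]:
                continue  # cannot split between equal values
            GR, HR = G - GL, H - HL
            gain = 0.5 * (GL * GL / (HL + lam) + GR * GR / (HR + lam) - parent) - gamma
            if gain > best[0]:
                best = (gain, k, (X[i][k] + X[nxt][k]) / 2)
    return best


X = [[1.0, 7.0], [2.0, 3.0], [3.0, 9.0], [4.0, 1.0]]
grad = [-1.0, -1.0, 1.0, 1.0]
hess = [1.0, 1.0, 1.0, 1.0]
gain, feature, threshold = exact_greedy_split(X, grad, hess)
```

Here the sort is repeated on every call; the column block structure described in Sec. 4.1 exists to pay that cost only once.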


3.2 Approximate Algorithm

The exact greedy algorithm is very powerful since it enumerates over all possible splitting points greedily. However, it is impossible to do so efficiently when the data does not fit entirely into memory. The same problem also arises in the distributed setting. To support effective gradient tree boosting in these two settings, an approximate algorithm is needed.

We summarize an approximate framework, which resembles the ideas proposed in past literature [17, 2, 22], in Alg. 2. To summarize, the algorithm first proposes candidate splitting points according to percentiles of the feature distribution (a specific criterion will be given in Sec. 3.3). The algorithm then maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.
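The three steps (propose, bucket, aggregate) can be sketched for a single feature as follows. This is an illustration under simplifying assumptions, not Alg. 2 itself: candidates are taken at evenly spaced ranks of the unweighted sorted values, and the function names are made up for the example.

```python
from bisect import bisect_right


def propose_candidates(values, n_buckets):
    """Candidate split points at evenly spaced ranks of the sorted values."""
    s = sorted(values)
    picked = {s[(len(s) * q) // n_buckets] for q in range(1, n_buckets)}
    return sorted(picked)


def approx_split(values, grad, hess, candidates, lam=1.0):
    """Best split among the candidates, using only per-bucket sums."""
    n_buckets = len(candidates) + 1
    Gb = [0.0] * n_buckets
    Hb = [0.0] * n_buckets
    for x, g, h in zip(values, grad, hess):
        b = bisect_right(candidates, x)  # bucket b holds candidates[b-1] <= x < candidates[b]
        Gb[b] += g
        Hb[b] += h
    G, H = sum(Gb), sum(Hb)
    best_gain, best_cut = 0.0, None
    GL = HL = 0.0
    for j, cut in enumerate(candidates):
        GL += Gb[j]  # buckets 0..j are exactly the instances with x < cut
        HL += Hb[j]
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL * GL / (HL + lam) + GR * GR / (HR + lam) - G * G / (H + lam))
        if gain > best_gain:
            best_gain, best_cut = gain, cut
    return best_gain, best_cut


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
grad = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
hess = [1.0] * 8
candidates = propose_candidates(values, n_buckets=4)
best_gain, best_cut = approx_split(values, grad, hess, candidates)
```

After bucketing, the search touches only as many sums as there are candidates, which is what makes the method usable when the raw data does not fit in memory or is spread across machines.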


There are two variants of the algorithm, depending on when the proposal is given. The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split. The global method requires fewer proposal steps than the local method. However, usually more candidate points are needed for the global proposal because candidates are not refined after each split. The local proposal refines the candidates after splits, and can potentially be more appropriate for deeper trees. A comparison of different algorithms on a Higgs boson dataset is given by Fig. 3. We find that the local proposal indeed requires fewer candidates. The global proposal can be as accurate as the local one given enough candidates.

Most existing approximate algorithms for distributed tree learning also follow this framework. Notably, it is also possible to directly construct approximate histograms of gradient statistics [22]. It is also possible to use other variants of binning strategies instead of quantile [17]. The quantile strategy benefits from being distributable and recomputable, which we will detail in the next subsection. From Fig. 3, we also find that the quantile strategy can get the same accuracy as exact greedy given a reasonable approximation level.

Our system efficiently supports exact greedy for the single machine setting, as well as the approximate algorithm with both local and global proposal methods for all settings. Users can freely choose between the methods according to their needs.


3.3 Weighted Quantile Sketch

One important step in the approximate algorithm is to propose candidate split points. Usually percentiles of a feature are used to make candidates distribute evenly on the data. Formally, let the multi-set D_k = {(x_1k, h_1), (x_2k, h_2), ..., (x_nk, h_n)} represent the k-th feature values and second order gradient statistics of each training instance. We can define a rank function r_k : R → [0, +∞) over this multi-set.
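The rank function and the criterion built on it are not reproduced in this text. For reference, the first two formulas below follow the published paper, where ε is an approximation factor and s_k1, ..., s_kl are the candidate split points for feature k. The third line is a completing-the-square step added here to show why h_i acts as an instance weight. The symbols f_t (the tree added at step t) and Ω (the regularization term) are assumptions carried over from the paper's earlier sections.

```latex
\[
r_k(z) = \frac{1}{\sum_{(x,h) \in \mathcal{D}_k} h}
         \sum_{(x,h) \in \mathcal{D}_k,\; x < z} h
\]
\[
\left| r_k(s_{k,j}) - r_k(s_{k,j+1}) \right| < \epsilon, \qquad
s_{k1} = \min_i x_{ik}, \quad s_{kl} = \max_i x_{ik}
\]
\[
\sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)
= \sum_{i=1}^{n} \tfrac{1}{2} h_i \left( f_t(x_i) + \frac{g_i}{h_i} \right)^2
  + \Omega(f_t) - \sum_{i=1}^{n} \frac{g_i^2}{2 h_i}
\]
```

The last sum does not depend on f_t, so minimizing the left-hand side is a squared-loss fit to the targets -g_i/h_i in which instance i carries weight h_i. The criterion in the second line asks for roughly 1/ε candidate points.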


The second-order objective can be rewritten so that it is exactly a weighted squared loss with labels -g_i/h_i and weights h_i. For large datasets, it is non-trivial to find candidate splits that satisfy the criteria. When every instance has equal weight, an existing algorithm called quantile sketch [14, 24] solves the problem. However, there is no existing quantile sketch for weighted datasets. Therefore, most existing approximate algorithms either resort to sorting on a random subset of the data, which has a chance of failure, or to heuristics that do not have theoretical guarantees.

To solve this problem, we introduce a novel distributed weighted quantile sketch algorithm that can handle weighted data with a provable theoretical guarantee. The general idea is to propose a data structure that supports merge and prune operations, with each operation proven to maintain a certain accuracy level. A detailed description of the algorithm as well as proofs are given in the supplementary material (linked in a footnote of the original paper).
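To make the target of the sketch concrete, the following computes candidate points exactly on a small in-memory column. It is not the sketch data structure (there is no merge or prune step); it only shows what the sketch approximates: a set of points whose consecutive weighted ranks differ by less than `eps` wherever the data allows it. The function name and the greedy selection rule are assumptions of this illustration.

```python
def weighted_quantile_candidates(values, hess, eps):
    """Pick split candidates whose consecutive weighted ranks differ by < eps."""
    total = float(sum(hess))
    weight_at = {}
    for x, h in zip(values, hess):
        weight_at[x] = weight_at.get(x, 0.0) + h
    distinct = sorted(weight_at)

    # ranks[t] is the share of total weight strictly below distinct[t]
    ranks, below = [], 0.0
    for v in distinct:
        ranks.append(below / total)
        below += weight_at[v]

    chosen = [0]  # indices into distinct; the minimum is always kept
    for t in range(1, len(distinct)):
        if ranks[t] - ranks[chosen[-1]] >= eps:
            if chosen[-1] != t - 1:
                chosen.append(t - 1)  # last point still within eps of the previous pick
            if ranks[t] - ranks[chosen[-1]] >= eps:
                chosen.append(t)      # one value alone carries at least eps of the weight
    if chosen[-1] != len(distinct) - 1:
        chosen.append(len(distinct) - 1)  # the maximum is always kept
    return [distinct[t] for t in chosen]


values = [1.0, 2.0, 3.0, 4.0, 5.0]
unweighted = weighted_quantile_candidates(values, [1.0] * 5, eps=0.25)
weighted = weighted_quantile_candidates(values, [1.0, 1.0, 1.0, 1.0, 6.0], eps=0.25)
```

With equal weights every value is needed to keep rank gaps under 0.25. Once the last instance carries most of the weight, the first four values crowd together in rank and fewer candidates cover them.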


3.4 Sparsity-aware Split Finding

In many real-world problems, it is quite common for the input x to be sparse. There are multiple possible causes for sparsity: 1) presence of missing values in the data; 2) frequent zero entries in the statistics; and, 3) artifacts of feature engineering such as one-hot encoding. It is important to make the algorithm aware of the sparsity pattern in the data. In order to do so, we propose to add a default direction in each tree node, which is shown in Fig. 4. When a value is missing in the sparse matrix x, the instance is classified into the default direction. There are two choices of default direction in each branch. The optimal default directions are learnt from the data. The algorithm is shown in Alg. 3. The key improvement is to only visit the non-missing entries I_k. The presented algorithm treats the non-presence as a missing value and learns the best direction to handle missing values. The same algorithm can also be applied when the non-presence corresponds to a user specified value by limiting the enumeration only to consistent solutions.

To the best of our knowledge, most existing tree learning algorithms are either only optimized for dense data, or need specific procedures to handle limited cases such as categorical encoding. XGBoost handles all sparsity patterns in a unified way. More importantly, our method exploits the sparsity to make computation complexity linear in the number of non-missing entries in the input. Fig. 5 shows the comparison of sparsity aware and a naive implementation on an Allstate-10K dataset (description of dataset given in Sec. 6). We find that the sparsity aware algorithm runs 50 times faster than the naive version. This confirms the importance of the sparsity aware algorithm.
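The idea of learning a default direction can be sketched for one feature as below. Alg. 3 is not reproduced in this text, so this is a simplified reading of the description above: only the non-missing entries are sorted and scanned, and each threshold is scored twice, once with the missing rows sent right and once with them sent left. The dict-based column layout, the function name and the toy data are assumptions.

```python
def sparse_split(column, grad, hess, lam=1.0):
    """column maps row index -> value for the non-missing entries of one feature.

    Returns (gain, threshold, default_direction).
    """
    G, H = sum(grad), sum(hess)  # totals over all rows, missing ones included
    parent = G * G / (H + lam)
    present = sorted(column.items(), key=lambda kv: kv[1])
    Gp = sum(grad[i] for i, _ in present)
    Hp = sum(hess[i] for i, _ in present)

    best = (0.0, None, None)
    GL = HL = 0.0  # statistics of present rows below the threshold
    for (i, x), (_, x_next) in zip(present, present[1:]):
        GL += grad[i]
        HL += hess[i]
        if x == x_next:
            continue
        thr = (x + x_next) / 2
        options = (
            ("right", GL, HL),                        # missing rows go right
            ("left", G - (Gp - GL), H - (Hp - HL)),   # missing rows go left
        )
        for direction, gl, hl in options:
            gr, hr = G - gl, H - hl
            gain = 0.5 * (gl * gl / (hl + lam) + gr * gr / (hr + lam) - parent)
            if gain > best[0]:
                best = (gain, thr, direction)
    return best


# Rows 4 and 5 have no value for this feature.
column = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
grad = [-1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
hess = [1.0] * 6
gain, threshold, default_direction = sparse_split(column, grad, hess)
```

The scan length depends only on the number of present entries, and the missing rows are accounted for through the totals, which is the source of the linear complexity claimed above.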


4. SYSTEM DESIGN

4.1 Column Block for Parallel Learning

The most time consuming part of tree learning is to get the data into sorted order. In order to reduce the cost of sorting, we propose to store the data in in-memory units, which we call blocks. Data in each block is stored in the compressed column (CSC) format, with each column sorted by the corresponding feature value. This input data layout only needs to be computed once before training, and can be reused in later iterations.

In the exact greedy algorithm, we store the entire dataset in a single block and run the split search algorithm by linearly scanning over the presorted entries. We do the split finding of all leaves collectively, so one scan over the block will collect the statistics of the split candidates in all leaf branches. Fig. 6 shows how we transform a dataset into the format and find the optimal split using the block structure.

The block structure also helps when using the approximate algorithms. Multiple blocks can be used in this case, with each block corresponding to a subset of rows in the dataset. Different blocks can be distributed across machines, or stored on disk in the out-of-core setting. Using the sorted structure, the quantile finding step becomes a linear scan over the sorted columns. This is especially valuable for local proposal algorithms, where candidates are generated frequently at each branch. The binary search in histogram aggregation also becomes a linear time merge style algorithm. Collecting statistics for each column can be parallelized, giving us a parallel algorithm for split finding. Importantly, the column block structure also supports column subsampling, as it is easy to select a subset of columns in a block.
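A toy version of the layout and of the collective scan is sketched below. It keeps each column as a presorted list of (value, row index) pairs rather than a real CSC encoding, assumes no missing values, and uses made-up function names, so it illustrates the access pattern rather than XGBoost's data structure.

```python
def build_column_block(rows, n_features):
    """rows: list of dicts {feature index: value}. Returns one presorted list per feature."""
    block = [[] for _ in range(n_features)]
    for r, row in enumerate(rows):
        for k, v in row.items():
            block[k].append((v, r))
    for col in block:
        col.sort()  # done once, before training
    return block


def best_split_per_leaf(column, grad, hess, leaf_of, n_leaves, lam=1.0):
    """One linear scan of a presorted column, keeping running sums for every leaf."""
    G = [0.0] * n_leaves
    H = [0.0] * n_leaves
    for _, r in column:
        G[leaf_of[r]] += grad[r]
        H[leaf_of[r]] += hess[r]

    GL = [0.0] * n_leaves
    HL = [0.0] * n_leaves
    last = [None] * n_leaves            # previous value seen in each leaf
    best = [(0.0, None)] * n_leaves     # (gain, threshold) per leaf
    for v, r in column:
        j = leaf_of[r]
        if last[j] is not None and v != last[j]:
            gr, hr = G[j] - GL[j], H[j] - HL[j]
            gain = 0.5 * (GL[j] ** 2 / (HL[j] + lam) + gr ** 2 / (hr + lam)
                          - G[j] ** 2 / (H[j] + lam))
            if gain > best[j][0]:
                best[j] = (gain, (last[j] + v) / 2)
        GL[j] += grad[r]
        HL[j] += hess[r]
        last[j] = v
    return best


rows = [{0: 3.0}, {0: 1.0}, {0: 4.0}, {0: 2.0}]
block = build_column_block(rows, n_features=1)
leaf_of = [0, 0, 1, 1]  # current leaf of each row
grad = [1.0, -1.0, 1.0, -1.0]
hess = [1.0, 1.0, 1.0, 1.0]
best = best_split_per_leaf(block[0], grad, hess, leaf_of, n_leaves=2)
```

Note that `grad[r]` and `hess[r]` are fetched by row index in feature-sorted order; that scattered access is the cache issue discussed in Sec. 4.2.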


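Figure 7: Impact of cache-aware prefetching in the exact greedy algorithm. We find that the cache-miss effect impacts performance on the large datasets (10 million instances). Using cache-aware prefetching improves performance by a factor of two when the dataset is large.

Figure 8: Short-range data dependency pattern that can cause stalls due to cache misses.

Figure 9: The impact of block size in the approximate algorithm. We find that overly small blocks result in inefficient parallelization, while overly large blocks also slow down training due to cache misses.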

Time Complexity Analysis

4.2 Cache-aware Access

While the proposed block structure helps optimize the computation complexity of split finding, the new algorithm requires indirect fetches of gradient statistics by row index, since these values are accessed in order of feature. This is a non-continuous memory access. A naive implementation of split enumeration introduces immediate read/write dependency between the accumulation and the non-continuous memory fetch operation (see Fig. 8). This slows down split finding when the gradient statistics do not fit into CPU cache and cache misses occur.

For the exact greedy algorithm, we can alleviate the problem by a cache-aware prefetching algorithm. Specifically, we allocate an internal buffer in each thread, fetch the gradient statistics into it, and then perform accumulation in a mini-batch manner. This prefetching changes the direct read/write dependency to a longer dependency and helps to reduce the runtime overhead when the number of rows is large. Figure 7 gives the comparison of cache-aware vs. non cache-aware algorithm on the Higgs and the Allstate datasets. We find that the cache-aware implementation of the exact greedy algorithm runs twice as fast as the naive version when the dataset is large.
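The restructuring can be sketched as below. Python gives no control over CPU caches or prefetching, so this shows only the control flow (gather a mini-batch into a buffer, then accumulate from the buffer) and says nothing about the speedup. The function name and the batch size are assumptions.

```python
def accumulate_buffered(row_index, grad, hess, batch=4):
    """Sum gradient statistics in the order given by row_index, via a small buffer."""
    G = H = 0.0
    for start in range(0, len(row_index), batch):
        chunk = row_index[start:start + batch]
        # Gather step: scattered reads by row index fill the buffer.
        buf = [(grad[r], hess[r]) for r in chunk]
        # Accumulate step: reads are now contiguous and independent of the gather.
        for g, h in buf:
            G += g
            H += h
    return G, H


grad = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5]
hess = [1.0, 1.0, 2.0, 1.0, 0.5, 1.0]
row_index = [4, 0, 5, 2, 1, 3]  # rows in feature-sorted order
G, H = accumulate_buffered(row_index, grad, hess)
```

In a compiled implementation, separating the two loops is what lengthens the dependency chain: the fetch for a later row no longer has to wait on the accumulation for the current one.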

For approximate algorithms, we solve the problem by choosing a correct block size. We define the block size to be the maximum number of examples contained in a block, as this reflects the cache storage cost of gradient statistics. Choosing an overly small block size results in a small workload for each thread and leads to inefficient parallelization. On the other hand, overly large blocks result in cache misses, as the gradient statistics do not fit into the CPU cache. A good choice of block size balances these two factors. We compared various choices of block size on two data sets. The results are given in Fig. 9. This result validates our discussion and shows that choosing 2^16 examples per block balances the cache property and parallelization.


4.3 Blocks for Out-of-core Computation

One goal of our system is to fully utilize a machine's resources to achieve scalable learning. Besides processors and memory, it is important to utilize disk space to handle data that does not fit into main memory. To enable out-of-core computation, we divide the data into multiple blocks and store each block on disk. During computation, it is important to use an independent thread to pre-fetch the block into a main memory buffer, so computation can happen concurrently with disk reading. However, this does not entirely solve the problem since the disk reading takes most of the computation time. It is important to reduce the overhead and increase the throughput of disk IO. We mainly use two techniques to improve the out-of-core computation.

Block Compression. The first technique we use is block compression. The block is compressed by columns, and decompressed on the fly by an independent thread when loading into main memory. This helps to trade some of the computation in decompression with the disk reading cost. We use a general purpose compression algorithm for compressing the feature values. For the row index, we subtract the beginning index of the block from the row index and use a 16-bit integer to store each offset. This requires 2^16 examples per block, which is confirmed to be a good setting. In most of the datasets we tested, we achieve roughly a 26% to 29% compression ratio.
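The row-index part of the scheme can be illustrated as follows. The sketch stores each row id of a block as an unsigned 16-bit offset from the block's first row, which is why a block can hold at most 2^16 = 65536 rows. The function names are assumptions, and `array("H")` is used because it is an unsigned short, which is 2 bytes on common platforms.

```python
from array import array

BLOCK_SIZE = 1 << 16  # 65536 rows per block


def encode_offsets(row_ids, block_start):
    """Store the row ids of one block as 16-bit offsets from the block's first row."""
    offsets = array("H")
    for r in row_ids:
        off = r - block_start
        if not 0 <= off < BLOCK_SIZE:
            raise ValueError("row id outside this block")
        offsets.append(off)
    return offsets


def decode_offsets(offsets, block_start):
    """Recover the original row ids."""
    return [block_start + off for off in offsets]


block_start = 3 * BLOCK_SIZE                 # first row of the fourth block
row_ids = [196608, 196700, 262143]           # first, an inner, and last row of that block
offsets = encode_offsets(row_ids, block_start)
restored = decode_offsets(offsets, block_start)
```

Storing 2 bytes per entry instead of a 4- or 8-byte row id is a fixed saving on the index, separate from whatever the general purpose compressor achieves on the feature values.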

Block Sharding. The second technique is to shard the data onto multiple disks in an alternating manner. A pre-fetcher thread is assigned to each disk and fetches the data into an in-memory buffer. The training thread then alternately reads the data from each buffer. This helps to increase the throughput of disk reading when multiple disks are available.
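A small simulation of this arrangement is sketched below, with in-memory lists standing in for disks and bounded queues standing in for the in-memory buffers. The names `prefetcher` and `read_sharded`, the buffer size, and the use of `None` as an end-of-shard marker are assumptions of this illustration.

```python
import queue
import threading


def prefetcher(shard, buf):
    """Stand-in for a per-disk reader thread: push blocks, then an end marker."""
    for block in shard:
        buf.put(block)  # waits while the buffer is full
    buf.put(None)


def read_sharded(blocks, n_disks=2, buffer_blocks=2):
    """Shard blocks across disks in alternating order and read them back round-robin."""
    shards = [blocks[d::n_disks] for d in range(n_disks)]
    bufs = [queue.Queue(maxsize=buffer_blocks) for _ in shards]
    threads = [threading.Thread(target=prefetcher, args=(s, b))
               for s, b in zip(shards, bufs)]
    for t in threads:
        t.start()

    out = []
    live = [True] * n_disks
    d = 0
    while any(live):
        if live[d]:
            block = bufs[d].get()
            if block is None:
                live[d] = False
            else:
                out.append(block)  # training would consume the block here
        d = (d + 1) % n_disks
    for t in threads:
        t.join()
    return out


two_disks = read_sharded(list(range(7)), n_disks=2)
three_disks = read_sharded(list(range(7)), n_disks=3)
```

Because block b lives on disk b mod n_disks, alternating reads return the blocks in their original order while each reader thread works on its own device.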


5. RELATED WORKS

Our system implements gradient boosting [10], which performs additive optimization in functional space. Gradient tree boosting has been successfully used in classification [12], learning to rank [5], structured prediction [8] as well as other fields. XGBoost incorporates a regularized model to prevent overfitting. This resembles previous work on regularized greedy forest [25], but simplifies the objective and algorithm for parallelization. Column sampling is a simple but effective technique borrowed from RandomForest [4]. While sparsity-aware learning is essential in other types of models such as linear models [9], few works on tree learning have considered this topic in a principled way. The algorithm proposed in this paper is the first unified approach to handle all kinds of sparsity patterns. There are several existing works on parallelizing tree learning [22, 19]. Most of these algorithms fall into the approximate framework described in this paper. Notably, it is also possible to partition data by columns [23] and apply the exact greedy algorithm. This is also supported in our framework, and techniques such as cache-aware prefetching can be used to benefit this type of algorithm. While most existing works focus on the algorithmic aspect of parallelization, our work improves in two unexplored system directions: out-of-core computation and cache-aware learning. This gives us insights on how the system and the algorithm can be jointly optimized and provides an end-to-end system that can handle large scale problems with very limited computing resources. We also summarize the comparison between our system and existing open-source implementations in Table 1.

Quantile summary (without weights) is a classical problem in the database community [14, 24]. However, the approximate tree boosting algorithm reveals a more general problem: finding quantiles on weighted data. To the best of our knowledge, the weighted quantile sketch proposed in this paper is the first method to solve this problem. The weighted quantile summary is also not specific to tree learning and can benefit other applications in data science and machine learning in the future.


6. END TO END EVALUATIONS

6.1 System Implementation

We implemented XGBoost as an open source package. The package is portable and reusable. It supports various weighted classification and rank objective functions, as well as user defined objective functions. It is available in popular languages such as Python, R, and Julia, and integrates naturally with language-native data science pipelines such as scikit-learn. The distributed version is built on top of the rabit library for allreduce. The portability of XGBoost makes it available in many ecosystems, instead of only being tied to a specific platform. The distributed XGBoost runs natively on Hadoop, MPI, and Sun Grid Engine. Recently, we also enabled distributed XGBoost on JVM big data stacks such as Flink and Spark. The distributed version has also been integrated into Tianchi, the cloud platform of Alibaba. We believe that there will be more integrations in the future.


6.2 Dataset and Setup

We used four datasets in our experiments. A summary of these datasets is given in Table 2. In some of the experiments, we use a randomly selected subset of the data either due to slow baselines or to demonstrate the performance of the algorithm with varying dataset size. We use a su x to denote the size in these cases. For example Allstate-10K means a subset of the Allstate dataset with 10K instances.

The rst dataset we use is the Allstate insurance claim dataset9. The task is to predict the likelihood and cost of an insurance claim given di erent risk factors. In the exper-iment, we simpli ed the task to only predict the likelihood of an insurance claim. This dataset is used to evaluate the impact of sparsity-aware algorithm in Sec. 3.4. Most of the sparse features in this data come from one-hot encoding. We randomly select 10M instances as training set and use the rest as evaluation set.

The second dataset is the Higgs boson dataset from high energy physics. The data was produced using Monte Carlo simulations of physics events. It contains 21 kinematic properties measured by the particle detectors in the accelerator. It also contains seven additional derived physics quantities of the particles. The task is to classify whether an event corresponds to the Higgs boson. We randomly select 10M instances as the training set and use the rest as the evaluation set.

The third dataset is the Yahoo! learning to rank challenge dataset [6], which is one of the most commonly used benchmarks for learning-to-rank algorithms. The dataset contains 20K web search queries, with each query corresponding to a list of around 22 documents. The task is to rank the documents according to their relevance to the query. We use the official train/test split in our experiment.

The last dataset is the Criteo terabyte click log dataset. We use this dataset to evaluate the scaling property of the system in the out-of-core and the distributed settings. The data contains 13 integer features and 26 ID features of user, item and advertiser information. Since a tree-based model is better at handling continuous features, we preprocess the data by calculating the statistics of average CTR and count of ID features on the first ten days, and then replacing the ID features by the corresponding count statistics during the next ten days for training. The training set after preprocessing contains 1.7 billion instances with 67 features (13 integer, 26 average CTR statistics and 26 counts). The entire dataset is more than one terabyte in LibSVM format.
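
The preprocessing step for the Criteo data can be sketched as follows. This is a minimal illustration under assumptions: the record layout, the helper names `build_id_stats` and `encode_ids`, and the choice to map unseen IDs to zero are all hypothetical and not taken from the paper.

```python
from collections import defaultdict


def build_id_stats(rows):
    """Compute (average CTR, count) per ID value from (id_value, clicked) rows."""
    clicks = defaultdict(int)
    counts = defaultdict(int)
    for id_value, clicked in rows:
        counts[id_value] += 1
        clicks[id_value] += clicked
    return {k: (clicks[k] / counts[k], counts[k]) for k in counts}


def encode_ids(id_values, stats):
    """Replace each ID by its (average CTR, count); unseen IDs map to (0.0, 0)."""
    return [stats.get(v, (0.0, 0)) for v in id_values]


# Statistics come from an earlier period (the "first ten days") ...
first_period = [("ad_1", 1), ("ad_1", 0), ("ad_2", 0), ("ad_1", 1)]
stats = build_id_stats(first_period)

# ... and are applied to IDs seen in the later training period.
encoded = encode_ids(["ad_1", "ad_2", "ad_9"], stats)
```

Each ID column becomes two continuous columns in this scheme, which matches the 26 average CTR statistics and 26 counts listed above.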

We use the first three datasets for the single machine parallel setting, and the last dataset for the distributed and out-of-core settings. All the single machine experiments are conducted on a Dell PowerEdge R420 with two eight-core Intel Xeon E5-2470 (2.3 GHz) processors and 64 GB of memory. If not specified, all the experiments are run using all the available cores in the machine. The machine settings of the distributed and the out-of-core experiments will be described in the corresponding sections. In all the experiments, we boost trees with a common setting of maximum depth 8, shrinkage 0.1 and no column subsampling unless explicitly specified. We find similar results when we use other settings of maximum depth.
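
For readers who want a comparable configuration, the common settings might be written as a parameter dictionary like the one below. The keys `max_depth`, `eta`, and `colsample_bytree` follow the parameter names used by the XGBoost Python package; the helper `with_overrides` and the 0.5 subsampling ratio are assumptions made for illustration.

```python
# Common boosting settings described above, as a plain dictionary.
common_params = {
    "max_depth": 8,           # maximum tree depth
    "eta": 0.1,               # shrinkage (learning rate)
    "colsample_bytree": 1.0,  # 1.0 means no column subsampling
}


def with_overrides(base, **overrides):
    """Return a copy of the base settings with per-experiment overrides."""
    params = dict(base)
    params.update(overrides)
    return params


# Hypothetical variant for an experiment that does subsample columns.
subsampled = with_overrides(common_params, colsample_bytree=0.5)
```

Keeping the shared settings in one place and overriding only what an experiment changes makes it easier to see which runs deviate from the common setup.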

Figure 10: Comparison between XGBoost and pGBRT on the Yahoo LTRC dataset.

Table 4: Comparison of learning to rank with 500 trees on the Yahoo! LTRC dataset.

6.3 Classification

In this section, we evaluate the performance of XGBoost on a single machine using the exact greedy algorithm on Higgs-1M data, by comparing it against two other commonly used exact greedy tree boosting implementations. Since scikit-learn only handles non-sparse input, we choose the dense Higgs dataset for a fair comparison. We use the 1M subset so that scikit-learn finishes running in reasonable time. Among the methods in comparison, R’s GBM uses a greedy approach that only expands one branch of a tree, which makes it faster but can result in lower accuracy, while both scikit-learn and XGBoost learn a full tree. The results are shown in Table 3. Both XGBoost and scikit-learn give better performance than R’s GBM, while XGBoost runs more than 10x faster than scikit-learn. In this experiment, we also find that column subsampling gives slightly worse performance than using all the features. This could be due to the fact that there are few important features in this dataset, so we benefit from greedily selecting from all the features.
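
To make the term "exact greedy" concrete, here is a minimal sketch of exact greedy split finding on a single dense feature: it sorts the instances by feature value and scores every candidate threshold using gradient and hessian sums. The function name `best_split`, the regularization parameter `lam`, and the toy data are assumptions for illustration. The score is the usual regularized second-order split gain with the constant factor and the complexity penalty left out, and this is not the package's actual implementation.

```python
def best_split(values, grads, hess, lam=1.0):
    """Scan all thresholds of one feature and return (best_gain, threshold).

    Illustrative sketch only. Gain is G_L^2/(H_L+lam) + G_R^2/(H_R+lam)
    - G^2/(H+lam), i.e. the split score without the 1/2 factor and penalty.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    g_total, h_total = sum(grads), sum(hess)
    parent = g_total ** 2 / (h_total + lam)

    best_gain, best_threshold = 0.0, None
    g_left = h_left = 0.0
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        g_left += grads[i]
        h_left += hess[i]
        if values[i] == values[nxt]:
            continue  # cannot split between identical feature values
        g_right, h_right = g_total - g_left, h_total - h_left
        gain = (g_left ** 2 / (h_left + lam)
                + g_right ** 2 / (h_right + lam) - parent)
        if gain > best_gain:
            best_gain = gain
            best_threshold = (values[i] + values[nxt]) / 2.0
    return best_gain, best_threshold


# Toy data: negative gradients for small values, positive for large ones.
gain, threshold = best_split(
    values=[1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
    grads=[-1.0, -1.0, -1.0, 1.0, 1.0, 1.0],
    hess=[1.0] * 6,
)
```

On the toy data the scan places the threshold between the two clusters. A full tree learner would repeat this scan for every feature at every node, which is why column subsampling shortens the running time.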

6.4 Learning to Rank

We next evaluate the performance of XGBoost on the learning to rank problem. We compare against pGBRT [22], the best previously published system on this task. XGBoost runs the exact greedy algorithm, while pGBRT only supports an approximate algorithm. The results are shown in Table 4 and Fig. 10. We find that XGBoost runs faster. Interestingly, subsampling columns not only reduces running time, but also gives slightly higher performance for this problem. This could be due to the fact that subsampling helps prevent overfitting, an effect that many users have observed.
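
Column subsampling itself is simple to sketch: before building each tree, draw a random subset of feature indices and restrict split finding to that subset. The helper name `sample_columns`, the `colsample` ratio, and the seeding scheme below are assumptions for illustration, not the package's implementation.

```python
import random


def sample_columns(num_features, colsample, rng):
    """Pick a sorted random subset of feature indices for one tree."""
    k = max(1, int(num_features * colsample))
    return sorted(rng.sample(range(num_features), k))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# One feature subset per tree; each tree scans only half of 10 features.
subsets = [sample_columns(10, 0.5, rng) for _ in range(3)]
```

Scanning fewer columns per tree reduces the work per tree, and because different trees see different features, the ensemble may be less prone to overfitting.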

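Figure 11: Comparison of out-of-core methods on different subsets of the Criteo data. The missing data points are due to running out of disk space. We can find that the basic algorithm can only handle 200M examples. Adding compression gives a 3x speedup, and sharding into two disks gives another 2x speedup. The system runs out of file cache starting from 400M examples. After this point, the algorithm really has to rely on disk. The compression+shard method has a less dramatic slowdown when running out of file cache, and exhibits a linear trend afterwards.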

Figure 12: Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subsets of the Criteo data. XGBoost runs more than 10x faster than Spark per iteration and 2.2x as fast as H2O’s optimized version (however, H2O is slow in loading the data, which gives it a worse end-to-end time). Note that Spark suffers from a drastic slowdown when running out of memory. XGBoost runs faster and scales smoothly to the full 1.7 billion examples with the given resources by utilizing out-of-core computation.

6.5 Out-of-core Experiment

We also evaluate our system in the out-of-core setting on the Criteo data. We conducted the experiment on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSDs, 60 GB RAM). The results are shown in Figure 11. We find that compression speeds up computation by a factor of three, and sharding into two disks gives a further 2x speedup. For this type of experiment, it is important to use a very large dataset to drain the system file cache for a real out-of-core setting. This is indeed our setup. We can observe a transition point when the system runs out of file cache. Note that the transition in the final method is less dramatic. This is due to larger disk throughput and better utilization of computation resources. Our final method is able to process 1.7 billion examples on a single machine.
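
The two out-of-core techniques compared here can be mimicked in a few lines. The sketch below compresses fixed-size blocks and assigns them to shards in alternation, then reads them back in order. It makes several assumptions: `zlib` stands in for the system's own block compression scheme, in-memory lists stand in for the two disks, and the names `write_sharded` and `read_sharded` are hypothetical.

```python
import zlib


def write_sharded(data, block_size, num_shards=2):
    """Split bytes into blocks, compress each, and deal them out round-robin."""
    shards = [[] for _ in range(num_shards)]
    for n, start in enumerate(range(0, len(data), block_size)):
        block = data[start:start + block_size]
        shards[n % num_shards].append(zlib.compress(block))
    return shards


def read_sharded(shards):
    """Read blocks back in their original order, alternating between shards."""
    total = sum(len(s) for s in shards)
    blocks = []
    for n in range(total):
        compressed = shards[n % len(shards)][n // len(shards)]
        blocks.append(zlib.decompress(compressed))
    return b"".join(blocks)


data = bytes(range(16)) * 4096            # 64 KiB of repetitive toy data
shards = write_sharded(data, block_size=8192)
stored = sum(len(b) for s in shards for b in s)
restored = read_sharded(shards)
```

Compression trades some CPU time for fewer bytes read from disk, and alternating blocks across shards lets two disks contribute throughput. In a real system the shards would presumably be read concurrently rather than in a single loop as here.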

6.6 Distributed Experiment

Finally, we evaluate the system in the distributed setting. We set up a YARN cluster on EC2 with m3.2xlarge machines, which are a very common choice for clusters. Each machine contains 8 virtual cores, 30 GB of RAM and two 80 GB SSD local disks. The dataset is stored on AWS S3 instead of HDFS to avoid purchasing persistent storage.

We first compare our system against two production-level distributed systems: Spark MLLib [18] and H2O. We use 32 m3.2xlarge machines and test the performance of the systems with various input sizes. Both of the baseline systems are in-memory analytics frameworks that need to store the data in RAM, while XGBoost can switch to the out-of-core setting when it runs out of memory. The results are shown in Fig. 12. We find that XGBoost runs faster than the baseline systems. More importantly, it is able to take advantage of out-of-core computing and smoothly scale to all 1.7 billion examples with the given limited computing resources. The baseline systems are only able to handle subsets of the data with the given resources. This experiment shows the advantage of bringing all the system improvements together to solve a real-world scale problem. We also evaluate the scaling property of XGBoost by varying the number of machines. The results are shown in Fig. 13. We find that XGBoost’s performance scales linearly as we add more machines. Importantly, XGBoost is able to handle all 1.7 billion examples with only four machines. This shows the system’s potential to handle even larger data.
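
The distributed version of XGBoost is described earlier as being built on an allreduce library. The sketch below simulates the semantics of a sum-allreduce in a single process: every worker contributes a vector of local gradient statistics and every worker ends up with the element-wise sum. It is an illustration under assumptions; `allreduce_sum` is a hypothetical helper, not the rabit API, and a real implementation would exchange data over the network rather than through a shared list.

```python
def allreduce_sum(local_vectors):
    """Simulate sum-allreduce: each worker receives the element-wise total."""
    total = [sum(column) for column in zip(*local_vectors)]
    return [list(total) for _ in local_vectors]  # one copy per worker


# Each worker holds gradient sums for three candidate histogram bins,
# computed from its own shard of the data.
worker_stats = [
    [1.0, 2.0, 3.0],   # worker 0
    [0.5, 0.5, 0.5],   # worker 1
    [2.0, 0.0, 1.0],   # worker 2
]
reduced = allreduce_sum(worker_stats)
```

Because every worker sees the same totals after the reduction, each one can pick the same split without a central coordinator.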

Figure 13: Scaling of XGBoost with different numbers of machines on the full 1.7 billion example Criteo dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear. XGBoost can process the entire dataset using as few as four machines, and scales smoothly by utilizing more available resources.

7.CONCLUSION

In this paper, we described the lessons we learnt when building XGBoost, a scalable tree boosting system that is widely used by data scientists and provides state-of-the-art results on many problems. We proposed a novel sparsity-aware algorithm for handling sparse data and a theoretically justified weighted quantile sketch for approximate learning. Our experience shows that cache access patterns, data compression and sharding are essential elements for building a scalable end-to-end system for tree boosting. These lessons can be applied to other machine learning systems as well. By combining these insights, XGBoost is able to solve real-world scale problems using a minimal amount of resources.

Acknowledgments

We would like to thank Tyler B. Johnson, Marco Tulio Ribeiro, Sameer Singh, and Arvind Krishnamurthy for their valuable feedback. We also sincerely thank Tong He, Bing Xu, Michael Benesty, Yuan Tang, Hongliang Liu, Qiang Kou, Nan Zhu and all other contributors in the XGBoost community. This work was supported in part by ONR (PECASE) N000141010672, NSF IIS 1258741 and the TerraSwarm Research Center sponsored by MARCO and DARPA.

8.REFERENCES

[1] R. Bekkerman. The present and the future of the KDD Cup competition: an outsider’s perspective.

[2] R. Bekkerman, M. Bilenko, and J. Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011.

[3] J. Bennett and S. Lanning. The Netflix prize. In Proceedings of the KDD Cup Workshop 2007, pages 3-6, New York, Aug. 2007.

[4] L. Breiman. Random forests. Machine Learning, 45(1):5-32, Oct. 2001.

[5] C. Burges. From RankNet to LambdaRank to LambdaMART: An overview. Learning, 11:23-581, 2010.

[6] O. Chapelle and Y. Chang. Yahoo! Learning to Rank Challenge Overview. Journal of Machine Learning Research - W & CP, 14:1-24, 2011.

[7] T. Chen, H. Li, Q. Yang, and Y. Yu. General functional matrix factorization using gradient boosting. In Proceeding of 30th International Conference on Machine Learning (ICML’13), volume 1, pages 436-444, 2013.

[8] T. Chen, S. Singh, B. Taskar, and C. Guestrin. Efficient second-order gradient boosting for conditional random fields. In Proceeding of 18th Artificial Intelligence and Statistics Conference (AISTATS’15), volume 1, 2015.

[9] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008.

[10] J. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232, 2001.

[11] J. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367-378, 2002.

[12] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337-407, 2000.

[13] J. H. Friedman and B. E. Popescu. Importance sampled learning ensembles, 2003.

[14] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, pages 58-66, 2001.

[15] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, and J. Quiñonero Candela. Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, ADKDD’14, 2014.

[16] P. Li. Robust Logitboost and adaptive base class (ABC) Logitboost. In Proceedings of the Twenty-Sixth Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI’10), pages 302-311, 2010.

[17] P. Li, Q. Wu, and C. J. Burges. Mcrank: Learning to rank using multiple classification and gradient boosting. In Advances in Neural Information Processing Systems 20, pages 897-904. 2008.

[18] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. MLlib: Machine learning in Apache Spark. Journal of Machine Learning Research, 17(34):1-7, 2016.

[19] B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo. Planet: Massively parallel learning of tree ensembles with MapReduce. Proceeding of VLDB Endowment, 2(2):1426-1437, Aug. 2009.

[20] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

[21] G. Ridgeway. Generalized Boosted Models: A guide to the gbm package.

[22] S. Tyree, K. Weinberger, K. Agrawal, and J. Paykin. Parallel boosted regression trees for web search ranking. In Proceedings of the 20th international conference on World wide web, pages 387-396. ACM, 2011.

[23] J. Ye, J.-H. Chow, J. Chen, and Z. Zheng. Stochastic gradient boosted distributed decision trees. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM ’09.

[24] Q. Zhang and W. Wang. A fast algorithm for approximate quantiles in high speed data streams. In Proceedings of the 19th International Conference on Scientific and Statistical Database Management, 2007.

[25] T. Zhang and R. Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.
