CatBoost: unbiased boosting with categorical features
Liudmila Prokhorenkova, Gleb Gusev, et al.
This paper presents the key algorithmic techniques behind CatBoost, a new gradient boosting toolkit. Their combination leads to CatBoost outperforming other publicly available boosting implementations in terms of quality on a variety of datasets.
Two critical algorithmic advances introduced in CatBoost are the implementation of ordered boosting, a permutation-driven alternative to the classic algorithm, and an innovative algorithm for processing categorical features. Both techniques were created to fight a prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient boosting algorithms.
In this paper, we provide a detailed analysis of this problem and demonstrate that proposed algorithms solve it effectively, leading to excellent empirical results.
Gradient boosting is a powerful machine-learning technique that achieves state-of-the-art results in a variety of practical tasks.
We show in this paper that all existing implementations of gradient boosting face the following statistical issue.
A prediction model $F$ obtained after several steps of boosting relies on the targets of all training examples. We demonstrate that this actually leads to a shift of the distribution of $F(x_k) \mid x_k$ for a training example $x_k$ from the distribution of $F(x) \mid x$ for a test example $x$. This finally leads to a prediction shift of the learned model.
Further, there is a similar issue in standard algorithms of preprocessing categorical features. One of the most effective ways to use them in gradient boosting is converting categories to their target statistics. A target statistic is a simple statistical model itself, and it can also cause target leakage and a prediction shift.
In this paper, we propose the ordering principle to solve both problems. Relying on it, we derive ordered boosting, a modification of the standard gradient boosting algorithm which avoids target leakage, and a new algorithm for processing categorical features. Their combination is implemented as an open-source library called CatBoost (for “Categorical Boosting”), which outperforms the existing state-of-the-art implementations of gradient boosted decision trees, XGBoost and LightGBM, on a diverse set of popular machine learning tasks.
CatBoost is an implementation of gradient boosting, which uses binary decision trees as base predictors.
A categorical feature is one with a discrete set of values called categories that are not comparable to each other.
One popular technique for dealing with categorical features in boosted trees is one-hot encoding, i.e., for each category, adding a new binary feature indicating it. However, in the case of high-cardinality features (e.g., a “user ID” feature), such a technique leads to an infeasibly large number of new features.
To address this issue, one can group categories into a limited number of clusters and then apply one-hot encoding. A popular method is to group categories by target statistics (TS) that estimate expected target value in each category.
Importantly, among all possible partitions of categories into two sets, an optimal split on the training data in terms of logloss, Gini index, or MSE can be found among the thresholds for the numerical TS feature.
In LightGBM, categorical features are converted to gradient statistics at each step of gradient boosting. Though providing important information for building a tree, this approach can dramatically increase (i) computation time, since it calculates statistics for each categorical value at each step, and (ii) memory consumption, to store which category belongs to which node for each split based on a categorical feature. LightGBM groups tail categories into one cluster and thus loses part of the information. Besides, the authors claim that it is still better to convert categorical features with high cardinality to numerical features.
Note that TS features require calculating and storing only one number per category.
Thus, using TS as new numerical features seems to be the most efficient method of handling categorical features with minimum information loss. TS are widely used, e.g., in the click prediction task (click-through rates), where such categorical features as user, region, ad, and publisher play a crucial role. We further focus on ways to calculate TS and leave one-hot encoding and gradient statistics out of the scope of the current paper. At the same time, we believe that the ordering principle proposed in this paper is also effective for gradient statistics.
As discussed in Section 3.1, an effective and efficient way to deal with a categorical feature $i$ is to substitute the category $x^i_k$ of the $k$-th training example with one numeric feature equal to some target statistic (TS) $\hat{x}^i_k$. Commonly, it estimates the expected target $y$ conditioned on the category: $\hat{x}^i_k \approx E(y \mid x^i = x^i_k)$.
Greedy TS
A straightforward approach is to estimate $E(y \mid x^i = x^i_k)$ as the average value of $y$ over the training examples with the same category $x^i_k$. This estimate is noisy for low-frequency categories, and one usually smooths it with some prior $p$:
$$\hat{x}^i_k = \frac{\sum_{j=1}^n I(x^i_j = x^i_k) \cdot y_j + a\,p}{\sum_{j=1}^n I(x^i_j = x^i_k) + a}$$
where $a > 0$ is a hyperparameter and the prior $p$ is commonly set to the average target value over the dataset.
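To make the formula concrete, here is a minimal sketch of the greedy TS in pandas (an illustration under assumed column names, not CatBoost's implementation; `a` is the smoothing hyperparameter above and the prior is the global target mean):

```python
import pandas as pd

def greedy_ts(train: pd.DataFrame, cat_col: str, target_col: str, a: float = 1.0) -> pd.Series:
    """Greedy target statistic: smoothed per-category mean of the target.

    Note that every row contributes to its own statistic, which is exactly
    the target leakage discussed next."""
    prior = train[target_col].mean()                      # p: average target value
    stats = train.groupby(cat_col)[target_col].agg(['sum', 'count'])
    ts = (stats['sum'] + a * prior) / (stats['count'] + a)
    return train[cat_col].map(ts)

# toy usage
df = pd.DataFrame({'user_id': ['u1', 'u1', 'u2', 'u3'], 'y': [1, 0, 1, 0]})
df['user_id_ts'] = greedy_ts(df, 'user_id', 'y')
```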
The problem with such a greedy approach is target leakage: the feature $\hat{x}^i_k$ is computed using $y_k$, the target of $x_k$. This leads to a conditional shift: the distribution of $\hat{x}^i \mid y$ differs for training and test examples.
The following extreme example illustrates how dramatically this may affect the generalization error of the learned model.
Assume that the $i$-th feature is categorical, all its values are unique (so $\sum_{j=1}^n I(x^i_j = x^i_k) = 1$), and that, for a classification task, each category $A$ satisfies $P(y = 1 \mid x^i = A) = 0.5$. Then, on the training set, $\hat{x}^i_k = \frac{y_k + ap}{1 + a}$, so a single split with threshold $t = \frac{0.5 + ap}{1 + a}$ separates the training examples perfectly. For test examples, however, the greedy TS equals $p$, so the model predicts $0$ if $p < t$ and $1$ otherwise, and its accuracy on test data is only $0.5$ in both cases.
Of course, this conditional shift can be avoided. The common idea is to compute the TS for $x_k$ on a subset of examples $D_k \subset D \setminus \{x_k\}$ that excludes $x_k$:
$$\hat{x}^i_k = \frac{\sum_{x_j \in D_k} I(x^i_j = x^i_k) \cdot y_j + a\,p}{\sum_{x_j \in D_k} I(x^i_j = x^i_k) + a} \tag{5}$$
Holdout TS
One way is to partition the training dataset into two parts $D = \hat{D}_0 \cup \hat{D}_1$ and use $D_k = \hat{D}_0$ for calculating the TS according to (5) and $\hat{D}_1$ for training (e.g., as applied for the Criteo dataset). Though such a holdout TS satisfies P1 (the TS feature has the same conditional distribution for training and test examples), this approach significantly reduces the amount of data used both for training the model and for calculating the TS.
Leave-one-out TS
At first glance, a leave-one-out technique might work well: take $D_k = D \setminus \{x_k\}$ for a training example $x_k$ and $D_k = D$ for test ones. Surprisingly, it does not prevent target leakage. Indeed, consider a constant categorical feature: $x^i_k = A$ for all examples. Let $n^+$ be the number of examples with $y = 1$; then $\hat{x}^i_k = \frac{n^+ - y_k + ap}{n - 1 + a}$, and one can perfectly classify the training dataset by making a split with threshold $\hat{x}^i_k = \frac{n^+ - 0.5 + ap}{n - 1 + a}$.
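To see the leak numerically, set $a = 0$ for simplicity:

$$y_k = 1 \;\Rightarrow\; \hat{x}^i_k = \frac{n^+ - 1}{n - 1}, \qquad y_k = 0 \;\Rightarrow\; \hat{x}^i_k = \frac{n^+}{n - 1},$$

so every positive training example receives a strictly smaller TS than every negative one, and the threshold above separates the two classes perfectly even though the feature carries no information about the target.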
Ordered TS
CatBoost uses a more effective strategy. It relies on the ordering principle, the central idea of this paper, and is inspired by online learning algorithms, which receive training examples sequentially in time. Clearly, the values of the TS for each example rely only on the observed history. To adapt this idea to the standard offline setting, we introduce an artificial “time”, i.e., a random permutation $\sigma$ of the training examples. Then, for each example, we use all the available “history” to compute its TS, i.e., we take $D_k = \{x_j : \sigma(j) < \sigma(k)\}$ in Equation (5) for a training example and $D_k = D$ for a test one. The obtained ordered TS satisfies requirement P1 and makes it possible to use all the training data for learning the model (P2). Note that, if we use only one random permutation, then preceding examples have TS with much higher variance than subsequent ones. To this end, CatBoost uses different permutations for different steps of gradient boosting; see details in Section 5.
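A minimal sketch of the ordered TS under a single permutation (an illustration only; CatBoost itself uses several permutations and efficient incremental updates). Here `cats` and `y` stand for one categorical column and the target:

```python
import numpy as np
import pandas as pd

def ordered_ts(cats: pd.Series, y: pd.Series, a: float = 1.0, seed: int = 0) -> pd.Series:
    """Ordered target statistic: each example only sees the examples that
    precede it in a random permutation (its "history")."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(cats))          # artificial "time"
    prior = y.mean()                           # p
    sums, counts = {}, {}                      # running per-category statistics
    ts = np.empty(len(cats))
    for idx in perm:                           # visit examples in "time" order
        c = cats.iloc[idx]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        ts[idx] = (s + a * prior) / (n + a)    # the example's own y is not used
        sums[c] = s + y.iloc[idx]              # update the history afterwards
        counts[c] = n + 1
    return pd.Series(ts, index=cats.index)
```

For a test example, the same running statistics accumulated over the whole training set would be used, i.e., $D_k = D$.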
In this section, we reveal the problem of prediction shift in gradient boosting, which was neither recognized nor previously addressed. Like in case of TS, prediction shift is caused by a special kind of target leakage. Our solution is called ordered boosting and resembles the ordered TS method.
As in the case of TS, these problems are caused by target leakage. Indeed, the gradients used at each step are estimated using the target values of the same data points the current model $F^{t-1}$ was built on. However, the conditional distribution $F^{t-1}(x_k) \mid x_k$ for a training example $x_k$ is shifted, in general, from the distribution $F^{t-1}(x) \mid x$ for a test example $x$. We call this a prediction shift.
A detailed example illustrating this prediction shift can be found in the paper.
Here we propose a boosting algorithm which does not suffer from the prediction shift problem described in Section 4.1. Assuming access to an unlimited amount of training data, we can easily construct such an algorithm: at each step of boosting, we sample a new dataset $D_t$ independently and obtain unshifted residuals by applying the current model to the new training examples. In practice, however, labeled data is limited. Assume that we learn a model with $I$ trees. To make the residual $r^{I-1}(x_k, y_k)$ unshifted, we need to have $F^{I-1}$ trained without the example $x_k$. Since we need unbiased residuals for all training examples, no examples may be used for training $F^{I-1}$, which at first glance makes the training process impossible.
However, it is possible to maintain a set of models differing by examples used for their training. Then, for calculating the residual on an example, we use a model trained without it. In order to construct such a set of models, we can use the ordering principle previously applied to TS in Section 3.2.
To illustrate the idea, assume that we take one random permutation $\sigma$ of the training examples and maintain $n$ different supporting models $M_1, \ldots, M_n$ such that the model $M_i$ is learned using only the first $i$ examples in the permutation. At each step, in order to obtain the residual for the $j$-th sample, we use the model $M_{j-1}$ (see Figure 1). The resulting Algorithm 1 is called ordered boosting below.
Unfortunately, this algorithm is not feasible in most practical tasks due to the need of training $n$ different models, which increases the complexity and memory requirements by a factor of $n$. In CatBoost, we implemented a modification of this algorithm on the basis of the gradient boosting algorithm with decision trees as base predictors (GBDT), described in Section 5.
Ordered boosting with categorical features
In Sections 3.2 and 4.2 we proposed to use random permutations $\sigma_{cat}$ and $\sigma_{boost}$ of the training examples for the TS calculation and for ordered boosting, respectively. Combining them in one algorithm, we should take $\sigma_{cat} = \sigma_{boost}$ to avoid prediction shift. This guarantees that the target $y_i$ is not used for training $M_i$ (neither for the TS calculation nor for the gradient estimation). See Section F of the supplementary material for theoretical guarantees. Empirical results confirming the importance of having $\sigma_{cat} = \sigma_{boost}$ are presented in Section G of the supplementary material.
Algorithm 1: Ordered boosting
input: $\{(x_k, y_k)\}_{k=1}^n, I$
$\sigma$ = random permutation of $[1, n]$
$M_i = 0$ for $i=1,...,n$
for t=1 to I do
for i = 1 to n do
$r_i = y_i - M_{\sigma(i)-1}(x_i)$
for i = 1 to n do
$h = LearnModel((x_j, r_j) : \sigma(j) \le i)$
$M_i = M_i + h$
return $M_n$
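Below is a deliberately naive Python sketch of Algorithm 1 for the squared loss (so the negative gradient is just the residual), using scikit-learn's DecisionTreeRegressor as a stand-in weak learner and a learning rate that the pseudocode omits. It keeps all $n$ supporting models explicitly, which makes the $n$-fold blow-up in time and memory obvious:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def ordered_boosting(X, y, n_trees=10, lr=0.1, seed=0):
    """Naive ordered boosting (Algorithm 1) for squared loss: the residual of
    example k is always computed by a model that never saw y_k."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    rng = np.random.default_rng(seed)
    sigma = rng.permutation(n)                 # sigma[t] = example sitting at position t+1
    pos = np.empty(n, dtype=int)
    pos[sigma] = np.arange(1, n + 1)           # pos[k] = 1-based position of example k

    # M[j] holds the predictions of supporting model M_j (trained on the
    # first j examples of the permutation) for every training example; M[0] = 0.
    M = np.zeros((n + 1, n))
    for _ in range(n_trees):
        r = y - M[pos - 1, np.arange(n)]       # residual of k uses M_{sigma(k)-1}
        for i in range(1, n + 1):
            prefix = sigma[:i]                 # the first i examples of the permutation
            tree = DecisionTreeRegressor(max_depth=3).fit(X[prefix], r[prefix])
            M[i] += lr * tree.predict(X)       # boost supporting model M_i
    return M[n]                                # training predictions of the final model M_n
```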
CatBoost has two boosting modes, Ordered and Plain. The latter mode is the standard GBDT algorithm with inbuilt ordered TS. The former mode presents an efficient modification of Algorithm 1. A formal description of the algorithm is included in Section B of the supplementary material. In this section, we overview the most important implementation details.
At the start, CatBoost generates $s + 1$ independent random permutations of the training dataset. The permutations $\sigma_1, \ldots, \sigma_s$ are used for evaluation of the splits that define tree structures (i.e., the internal nodes), while $\sigma_0$ serves for choosing the leaf values $b_j$ of the obtained trees (see Equation (3)).
For examples with a short history in a given permutation, both the TS and the predictions used by ordered boosting ($M_{\sigma(i)-1}(x_i)$ in Algorithm 1) have a high variance. Therefore, using only one permutation may increase the variance of the final model predictions, while several permutations allow us to reduce this effect in a way we further describe. The advantage of several permutations is confirmed by our experiments in Section 6.
Building a tree
In CatBoost, base predictors are oblivious decision trees, also called decision tables. The term oblivious means that the same splitting criterion is used across an entire level of the tree. Such trees are balanced, less prone to overfitting, and allow speeding up execution at testing time significantly. The procedure of building a tree in CatBoost is described in Algorithm 2.
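As a quick aside before Algorithm 2, here is a minimal sketch of why oblivious trees are so fast to evaluate: since every level shares a single (feature, threshold) pair, a tree of depth $d$ is just $d$ comparisons forming a binary index into a table of $2^d$ leaf values (a simplification, not CatBoost's actual data structures):

```python
import numpy as np

def oblivious_tree_predict(X, split_features, split_thresholds, leaf_values):
    """Evaluate an oblivious (symmetric) decision tree.

    split_features / split_thresholds hold one split per level of the tree;
    leaf_values has length 2**depth, one value per leaf."""
    index = np.zeros(len(X), dtype=int)
    for feat, thr in zip(split_features, split_thresholds):
        index = (index << 1) | (X[:, feat] > thr)   # each level contributes one bit
    return leaf_values[index]

# toy usage: a depth-2 tree over two numerical features
X = np.array([[0.1, 5.0], [0.9, 1.0]])
print(oblivious_tree_predict(X, [0, 1], [0.5, 3.0], np.array([0.0, 1.0, 2.0, 3.0])))
```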
Algorithm 2: Building a tree in CatBoost
input: M, {(x_i, y_i)}_{i=1}^n, \alpha, L, {\sigma_i}_{i=1}^s, Mode
grad = CalcGradient(L, M, y)
r = random(1, s)
if Mode = Plain then
G = (grad_r(i) for i = 1,...,n)
if Mode = Ordered then
G = (grad_{r, \sigma_r(i)-1}(i) for i = 1,...,n)
T = empty tree
for each step of top-down procedure do
for each candidate split c do
T_c = add split c to T
if Mode = Plain then
\Delta(i) = avg(grad_r(p) for p: leaf_r(p)=leaf_r(i)) for i = 1,...,n
if Mode = Ordered then
\Delta(i) = avg(grad_{r, \sigma_{r}(i)-1}(p) for p: leaf_r(p)=leaf_r(i), \sigma_r(p) < \sigma_r(i)) for i = 1,...,n
loss(T_c) = cos(\Delta, G)
T = argmin_{T_c}(loss(T_c))
if Mode = Plain then
M_{r'}(i) = M_{r'}(i) - \alpha avg(grad_{r'}(p) for p: leaf_{r'}(p)=leaf_{r'}(i)) for r'=1,...,s, i=1,...,n
if Mode = Ordered then
M_{r', j}(i) = M_{r', j}(i) - \alpha avg(grad_{r', j}(p) for p: leaf_{r'}(p)=leaf_{r'}(i), \sigma_{r'}(p) <= j) for r'=1,...,s, i=1,...,n, j >= \sigma_{r'}(i)-1
return T, M
In the Ordered boosting mode, during the learning process, we maintain the supporting models $M_{r,j}$, where $M_{r,j}(i)$ is the current prediction for the $i$-th example based on the first $j$ examples in the permutation $\sigma_r$, $r = 1, \ldots, s$.
At each iteration $t$ of the algorithm, we sample a random permutation $\sigma_r$ from $\{\sigma_1, \ldots, \sigma_s\}$ and construct a tree $T_t$ on the basis of it.
First, for categorical features, all TS are computed according to this permutation $\sigma_r$.
Second, the permutation $\sigma_r$ affects the tree learning procedure.
Namely, based on $M_{r,j}(i)$, we compute the corresponding gradients $grad_{r,j}(i) = \frac{\partial L(y_i, s)}{\partial s}\big|_{s = M_{r,j}(i)}$.
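For concreteness, with the squared loss this gradient is simply the signed residual:

$$L(y, s) = \tfrac{1}{2}(y - s)^2 \;\Rightarrow\; grad_{r,j}(i) = \frac{\partial L(y_i, s)}{\partial s}\Big|_{s = M_{r,j}(i)} = M_{r,j}(i) - y_i,$$

so the update $M \leftarrow M - \alpha \cdot \mathrm{avg}(grad)$ in Algorithm 2 moves the predictions toward the targets.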
Then, while constructing a tree, we approximate the gradient $G$ in terms of the cosine similarity $\cos(\cdot, \cdot)$, where, for each example $i$, we take the gradient $grad_{r, \sigma_r(i)-1}(i)$ (it is based only on the preceding examples in $\sigma_r$).
At the candidate splits evaluation step, the leaf value $\Delta(i)$ for example $i$ is obtained individually by averaging the gradients $grad_{r, \sigma_r(i)-1}$ of the preceding examples $p$ lying in the same leaf $leaf_r(i)$ the example $i$ belongs to.
Note that $leaf_r(i)$ depends on the chosen permutation $\sigma_r$, because $\sigma_r$ can influence the values of the ordered TS for example $i$.
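A rough sketch of this Ordered-mode split evaluation (illustrative, not CatBoost's code): running per-leaf sums give each example a leaf estimate $\Delta(i)$ that uses only its predecessors in the permutation, and candidate splits can then be compared through the cosine similarity between $\Delta$ and the gradient vector $G$. Here `grad`, `leaf_idx`, and `sigma_pos` are assumed inputs: the gradients, the leaf assignment of each example under the candidate split, and the permutation positions.

```python
import numpy as np

def ordered_leaf_deltas(grad, leaf_idx, sigma_pos):
    """Delta(i) = average gradient of the examples that share i's leaf and
    precede i in the permutation (0 if i has no predecessors in its leaf)."""
    order = np.argsort(sigma_pos)              # visit examples in permutation order
    sums, counts = {}, {}
    delta = np.zeros(len(grad))
    for i in order:
        leaf = leaf_idx[i]
        s, n = sums.get(leaf, 0.0), counts.get(leaf, 0)
        delta[i] = s / n if n > 0 else 0.0     # only preceding examples contribute
        sums[leaf] = s + grad[i]
        counts[leaf] = n + 1
    return delta

def split_score(delta, grad):
    """Cosine similarity between the leaf estimates and the gradient vector;
    values closer to 1 mean the candidate split approximates the gradients better."""
    return np.dot(delta, grad) / (np.linalg.norm(delta) * np.linalg.norm(grad) + 1e-12)
```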
When the tree structure $T_t$ (i.e., the sequence of splitting attributes) is built, we use it to boost all the models $M_{r',j}$.
Let us stress that one common tree structure $T_t$ is used for all the models, but this tree is added to different $M_{r',j}$ with different sets of leaf values depending on $r'$ and $j$, as described in Algorithm 2.
The Plain boosting mode works similarly to a standard GBDT procedure, but, if categorical features are present, it maintains $s$ supporting models $M_r$ corresponding to TS based on $\sigma_1, \ldots, \sigma_s$.
Choosing leaf values
Given all the trees constructed, the leaf values of the final model $F$ are calculated by the standard gradient boosting procedure equally for both modes. Training examples $i$ are matched to leaves $leaf_0(i)$, i.e., we use the permutation $\sigma_0$ to calculate the TS here. When the final model $F$ is applied to a new example at testing time, we use TS calculated on the whole training data according to Section 3.2.
Feature combinations
Another important detail of CatBoost is using combinations of categorical features as additional categorical features which capture high-order dependencies like joint information of user ID and ad topic in the task of ad click prediction.
The number of possible combinations grows exponentially with the number of categorical features in the dataset, and it is infeasible to process all of them.
CatBoost constructs combinations in a greedy way. Namely, for each split of a tree, CatBoost combines (concatenates) all categorical features (and their combinations) already used for previous splits in the current tree with all categorical features in the dataset. Combinations are converted to TS on the fly.
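A toy sketch of this greedy construction (illustrative names, not CatBoost internals): the categorical features, or combinations, already used by splits of the current tree are concatenated with every categorical feature of the dataset, and each resulting combination is then encoded with an ordered TS on the fly:

```python
import pandas as pd

def combine_categoricals(df, used_cats, all_cats):
    """Greedily build combination features: concatenate every already-used
    categorical feature (or combination) with every categorical feature."""
    out = df.copy()
    for used in used_cats:
        for cat in all_cats:
            if cat != used:
                out[f"{used}|{cat}"] = df[used].astype(str) + "_" + df[cat].astype(str)
    return out

# toy usage: 'user_id' was already used by a split in the current tree
df = pd.DataFrame({'user_id': ['u1', 'u2'], 'ad_topic': ['cars', 'food']})
print(combine_categoricals(df, used_cats=['user_id'], all_cats=['user_id', 'ad_topic']))
```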
Other important details
Finally, let us discuss two options of the CatBoost algorithm not covered above.
The first one is subsampling of the dataset at each iteration of the boosting procedure, as proposed by Friedman. We claimed earlier in Section 4.1 that this approach alone cannot fully avoid the problem of prediction shift. However, since it has proved effective, we implemented it in both modes of CatBoost as a Bayesian bootstrap procedure.
The second option deals with the first several examples in a permutation. For examples $i$ with small values of $\sigma_r(i)$, the variance of $grad_{r, \sigma_r(i)-1}(i)$ can be high. Therefore, we discard $\Delta(i)$ from the beginning of the permutation when we calculate $loss(T_c)$ in Algorithm 2. In particular, we eliminate the corresponding components of the vectors $G$ and $\Delta$ when calculating the cosine similarity between them.
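A minimal illustration of this truncation (the exact cutoff used by CatBoost is an implementation detail, so `skip_first` here is an assumed parameter): drop the components that correspond to the earliest positions of the permutation before computing the cosine similarity.

```python
import numpy as np

def truncated_cosine(delta, grad, sigma_pos, skip_first=10):
    """Cosine similarity between delta and grad that ignores the examples
    appearing among the first `skip_first` positions of the permutation,
    whose ordered estimates have very high variance."""
    keep = sigma_pos > skip_first
    d, g = delta[keep], grad[keep]
    return np.dot(d, g) / (np.linalg.norm(d) * np.linalg.norm(g) + 1e-12)
```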
In this paper, we identify and analyze the problem of prediction shifts present in all existing implementations of gradient boosting. We propose a general solution, ordered boosting with ordered TS, which solves the problem. This idea is implemented in CatBoost, which is a new gradient boosting library. Empirical results demonstrate that CatBoost outperforms leading GBDT packages and leads to new state-of-the-art results on common benchmarks.