Authors:
Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, Joaquin Quiñonero Candela
Facebook
1601 Willow Road, Menlo Park, CA, United States
{panjunfeng, oujin, joaquinq, sbowers}@fb.com
Published: August 24-27, 2014
I've been preparing for a competition recently, so I'm reviewing a classic feature-selection method: use the paths through a GBDT as features, then fit a logistic regression on top. Facebook's engineering is impressive; the A/B testing that Kuaishou uses now also seems to have been learned from Facebook. Both A/B testing and GBDT+LR have a light, elegant feel: the underlying ideas are not complicated, but they are clever. Reading this paper deepened my understanding of both GBDT and LR.
Train a GBDT on the existing features, then use the trees it learns to construct new features, and finally train a model on these new features (with or without the original features). The constructed feature vector is binary-valued, with one element per leaf node across all trees in the GBDT. When a sample traverses a tree and lands in one of its leaves, the element for that leaf is set to 1 and the elements for the tree's other leaves are 0. The length of the new feature vector therefore equals the total number of leaves over all trees. For example, with two trees of 3 and 2 leaves, a sample ending in leaf 2 of the first tree and leaf 1 of the second is encoded as [0, 1, 0, 1, 0].
Another common feature-selection method: LASSO, which drops useless features.
Main points of the paper:
1. GBDT+LR performs better than either model alone.
2. Prediction accuracy decays over time, so learning needs to be online. The paper studies how to choose the best learning rate for the linear model; because the tree model is too expensive to retrain frequently, only the linear model (LR) is kept online.
3. The accuracy gain from adding more trees shows diminishing returns, and with little data the model may even overfit; around 500 trees the gain is nearly saturated.
4. Feature importance: the top 10 features contribute about half of the total feature importance, while the last 300 features contribute less than 1%.
5. Historical features are more important than contextual features, but contextual features are crucial for cold start.
6. The data volume is so large that downsampling is needed. Uniform subsampling from 100% down to 10% costs very little accuracy; sampling less than that hurts noticeably more. With negative down sampling, the final predicted CTR must be re-calibrated.
Takeaways:
- GBDT's weakness is that it is hard to train online; its strength is robustness.
- LR's weakness is sensitivity to outliers; its strength is that it can be learned online.
Implementation:
During training, one half of the data can be used to train the GBDT, and the other half to train the LR on the features selected by the trained GBDT.
P.S. sklearn's GBDT exposes apply(), which returns the leaf index a sample falls into for each tree, so this is quite convenient to implement; see the sketch below.
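Below is a minimal sketch of the whole pipeline with sklearn. The dataset, the 50/50 split, and the hyperparameters are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for real impression data.
X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
# One half fits the GBDT, the other half fits the LR on its leaf features.
X_gbdt, X_lr, y_gbdt, y_lr = train_test_split(X, y, test_size=0.5, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=100, max_leaf_nodes=12, random_state=0)
gbdt.fit(X_gbdt, y_gbdt)

# apply() returns the leaf index per tree; for binary classification the
# result has shape (n_samples, n_trees, 1), so we drop the last axis.
leaves = gbdt.apply(X_lr)[:, :, 0]

# One-hot encode the leaf indices: each tree becomes one categorical feature.
encoder = OneHotEncoder(handle_unknown="ignore")
lr = LogisticRegression(max_iter=1000)
lr.fit(encoder.fit_transform(leaves), y_lr)

def predict_ctr(X_new):
    """Transform raw features through the trees, then score with the LR."""
    return lr.predict_proba(encoder.transform(gbdt.apply(X_new)[:, :, 0]))[:, 1]
```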
In this paper we introduce a model which combines decision trees with logistic regression, outperforming either of these methods on its own by over 3%, an improvement with significant impact on overall system performance. (DT+LR)
We then explore how a number of fundamental parameters impact the final prediction performance of our system. Not surprisingly, the most important thing is to have the right features. Picking the optimal handling for data freshness, learning rate schema and data sampling improves the model slightly, though much less than adding a high-value feature, or picking the right model to begin with. (features matter more than model tuning)
We begin with an overview of our experimental setup in Section 2.
In Section 3 we evaluate different probabilistic linear classifiers and diverse online learning algorithms. In the context of linear classification we go on to evaluate the impact of feature transforms and data freshness. Inspired by the practical lessons learned, particularly around data freshness and online learning, we present a model architecture that incorporates an online learning layer, whilst producing fairly compact models. (incorporates an online learning layer)
Section 4 describes a key component required for the online learning layer, the online joiner, an experimental piece of infrastructure that can generate a live stream of real-time training data.
Lastly we present ways to trade accuracy for memory and compute time and to cope with massive amounts of training data. (in the online learning layer, trading accuracy for memory and compute time under massive data)
In Section 5 we describe practical ways to keep memory and latency contained for massive scale applications.
In Section 6 we delve into the tradeoff between training data volume and accuracy. (tradeoff between data volume and accuracy)
Evaluation metrics: Since we are most concerned with the impact of these factors on the machine learning model, we use prediction accuracy instead of metrics directly related to profit and revenue. In this work, we use Normalized Entropy (NE) and calibration as our major evaluation metrics.
$$NE = \frac{-\frac{1}{N}\sum_{i=1}^{N}\left(\frac{1+y_i}{2}\log(p_i) + \frac{1-y_i}{2}\log(1-p_i)\right)}{-\left(p\log(p) + (1-p)\log(1-p)\right)}$$

where $y_i \in \{-1, +1\}$ is the label, $p_i$ is the estimated probability of a click, and $p$ is the average empirical CTR.
Calibration is the ratio of the average estimated CTR (click-through rate) to the empirical CTR.
Note that Area-Under-ROC (AUC) is also a good metric for measuring ranking quality without considering calibration.
In a realistic environment, we expect the prediction to be accurate instead of merely getting the optimal ranking order, to avoid potential under-delivery or over-delivery. (we want accurate predictions, not just a correct ranking, to avoid delivering too much or too little)
NE measures the goodness of predictions and implicitly reflects calibration. For example, if a model overpredicts by 2x and we apply a global multiplier of 0.5 to fix the calibration, the corresponding NE will also improve, even though AUC remains the same.
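To make the two metrics concrete, here is a minimal sketch that computes NE and calibration (labels in {-1, +1}; the synthetic data is an assumption for the example) and reproduces the 2x-overprediction point above:

```python
import numpy as np

def normalized_entropy(y, p):
    """NE: average log loss normalized by the entropy of the background CTR.
    y has labels in {-1, +1}; p holds predicted click probabilities."""
    ctr = np.mean(y == 1)  # average empirical CTR
    logloss = -np.mean((1 + y) / 2 * np.log(p) + (1 - y) / 2 * np.log(1 - p))
    background = -(ctr * np.log(ctr) + (1 - ctr) * np.log(1 - ctr))
    return logloss / background

def calibration(y, p):
    """Ratio of average estimated CTR to empirical CTR; 1.0 is perfectly calibrated."""
    return np.mean(p) / np.mean(y == 1)

# A model that overpredicts by 2x has calibration ~2; halving its scores
# fixes calibration and lowers NE, while the ranking (hence AUC) is unchanged.
rng = np.random.default_rng(0)
p_true = rng.uniform(0.01, 0.10, size=100_000)
y = np.where(rng.uniform(size=p_true.size) < p_true, 1, -1)
p_over = 2 * p_true
print(calibration(y, p_over), normalized_entropy(y, p_over))
print(calibration(y, 0.5 * p_over), normalized_entropy(y, 0.5 * p_over))
```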
[J. Yi, Y. Chen, J. Li, S. Sett, and T. W. Yan. Predictive model performance: Offline and online evaluations. In KDD, pages 1294–1302, 2013.]
In this section we present a hybrid model structure: the concatenation of boosted decision trees and of a probabilistic sparse linear classifier.
In Section 3.1 we show that decision trees are very powerful input feature transformations that significantly increase the accuracy of probabilistic linear classifiers.
In Section 3.2 we show how fresher training data leads to more accurate predictions. This motivates the idea to use an online learning method to train the linear classifier.
In Section 3.3 we compare a number of online learning variants for two families of probabilistic linear classifiers: SGD-based LR and BOPR.
We found that boosted decision trees are a powerful and very convenient way to implement non-linear and tuple transformations. We treat each individual tree as a categorical feature that takes as value the index of the leaf an instance ends up falling in. (each tree is one categorical feature whose value is the index of the leaf the instance lands in)
We can understand boosted decision tree based transformation as a supervised feature encoding that converts a real-valued vector into a compact binary-valued vector. (the GBDT serves as a supervised encoder that turns the real-valued feature vector into a binary one)
A traversal from the root node to a leaf node represents a rule on certain features.
Fitting a linear classifier on the binary vector is essentially learning weights for the set of rules. Boosted decision trees are trained in a batch manner.
| Model Structure | NE (relative to Trees only) |
| --- | --- |
| LR + Trees | 96.58% |
| LR only | 99.43% |
| Trees only | 100% (reference) |
Prediction accuracy clearly degrades as the delay between training and test set increases. It can be seen that NE can be reduced by approximately 1% by going from training weekly to training daily.
These findings indicate that it is worth retraining on a daily basis.
Retraining the tree model every day takes too long, so daily retraining of the trees is not practical.
In order to maximize data freshness, one option is to train the linear classifier online, that is, directly as the labelled ad impressions arrive.
In this section we evaluate several ways of setting learning rates for SGD-based online learning for logistic regression.
All the tunable parameters are optimized by grid search.
From the above result, SGD with a per-coordinate learning rate achieves the best prediction accuracy, with an NE almost 5% lower than when using the per-weight learning rate, which performs worst.
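As a concrete illustration, here is a minimal sketch of online logistic regression with the per-coordinate schedule eta(t, i) = alpha / (beta + sqrt of the sum of squared past gradients in coordinate i). The stream interface and the alpha/beta values are assumptions for the example, not the paper's tuned settings.

```python
import numpy as np

def sgd_per_coordinate(stream, dim, alpha=0.1, beta=1.0):
    """Online LR with a per-coordinate learning rate.
    stream yields (x, y) pairs with x a feature vector and y in {-1, +1}."""
    w = np.zeros(dim)
    grad_sq = np.zeros(dim)  # running sum of squared per-coordinate gradients
    for x, y in stream:      # labelled impressions arriving one at a time
        p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))  # predicted click probability
        g = (p - (y + 1) / 2) * x                # gradient of the log loss
        grad_sq += g * g
        w -= alpha / (beta + np.sqrt(grad_sq)) * g
    return w
```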
This section introduces an experimental system that generates real-time training data used to train the linear classifier via online learning. We will refer to this system as the "online joiner," since the critical operation it does is to join labels (click/no-click) to training inputs (ad impressions) in an online manner.
Similar infrastructure is used for stream learning, for example in the Google advertising system. The online joiner outputs a real-time training data stream to an infrastructure called Scribe. While the positive labels (clicks) are well defined, there is no "no click" button the user can press. For this reason, an impression is considered to have a negative (no-click) label if the user did not click the ad within a fixed, sufficiently long period of time after seeing it. The length of this waiting window needs to be tuned carefully.
Using too long a waiting window delays the real-time training data and increases the memory allocated to buffering impressions while waiting for the click signal. Too short a window causes some of the clicks to be lost, since the corresponding impression may already have been flushed out and labeled as non-clicked. This negatively affects "click coverage," the fraction of all clicks successfully joined to impressions. As a result, the online joiner must strike a balance between recency and click coverage.
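The joining logic might look like the following minimal sketch. All names and the window length are hypothetical; the real joiner is distributed streaming infrastructure, not an in-process buffer.

```python
import time
from collections import OrderedDict

WINDOW_SECONDS = 600.0  # waiting window; needs careful tuning (hypothetical value)
impressions = OrderedDict()  # request_id -> (features, arrival timestamp)

def emit(features, label):
    pass  # write the joined example to the training stream (Scribe in the paper)

def on_impression(request_id, features):
    impressions[request_id] = (features, time.time())

def on_click(request_id):
    item = impressions.pop(request_id, None)
    if item is not None:
        emit(item[0], label=1)  # joined within the window: positive example
    # else: impression already flushed as negative -> a lost click (hurts coverage)

def flush_expired():
    """Impressions that outlive the window are emitted as negatives."""
    now = time.time()
    while impressions:
        request_id, (features, ts) = next(iter(impressions.items()))
        if now - ts < WINDOW_SECONDS:
            break  # OrderedDict preserves arrival order, so the rest are newer
        impressions.pop(request_id)
        emit(features, label=0)
```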
More study on the window size and efficiency can be found in [L. Golab and M. T. Ozsu. Processing sliding window multi-joins in continuous queries over data streams. In VLDB, pages 500–511, 2003.].
The more trees in the model, the longer the time required to make a prediction. We vary the number of trees from 1 to 2,000, train the models on one full day of data, and test prediction performance on the next day, constraining each tree to no more than 12 leaves. The gain from adding trees yields diminishing returns: almost all of the NE improvement comes from the first 500 trees, and the last 1,000 trees decrease NE by less than 0.1%. Moreover, the normalized entropy for submodel 2 begins to regress after 1,000 trees. The reason for this phenomenon is overfitting, since the training data for submodel 2 is 4x smaller than that for submodels 0 and 1.
Feature count is another model characteristic that can influence the trade-off between estimation accuracy and computational performance. To better understand the effect of feature count, we first compute an importance score for each feature.
In order to measure the importance of a feature we use the statistic Boosting Feature Importance, which aims to capture the cumulative loss reduction attributable to a feature. A feature's global importance is the average of its importance over the individual trees; within a single tree, importance is the reduction in impurity achieved by splits on that feature.
Typically, a small number of features contributes the majority of explanatory power while the remaining features have only a marginal contribution.
From the above result, we can see that the top 10 features are responsible for about half of the total feature importance, while the last 300 features contribute less than 1% of it. Based on this finding, we further experiment with keeping only the top 10, 20, 50, 100 and 200 features, and evaluate how performance is affected.
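A sketch of this kind of top-k selection, reusing the `gbdt` model from the earlier snippet; note that sklearn's `feature_importances_` is impurity-based, a stand-in for the paper's cumulative-loss-reduction statistic, and `k` is illustrative.

```python
import numpy as np

k = 10  # number of features to keep (illustrative)
order = np.argsort(gbdt.feature_importances_)[::-1]  # most important first
top_k = order[:k]
share = gbdt.feature_importances_[top_k].sum()  # importances sum to 1 in sklearn
print("top-%d features carry %.1f%% of total importance" % (k, 100 * share))
# Retrain on X[:, top_k] to measure the accuracy/compute tradeoff.
```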
The features used in the boosting model can be categorized into two types: contextual features and historical features. The value of a contextual feature depends exclusively on current information about the context in which an ad is to be shown, such as the device the user is on or the current page the user is visiting. On the contrary, historical features depend on previous interactions with the ad or the user, for example the click-through rate of the ad in the last week, or the average click-through rate of the user.
Historical features provide considerably more explanatory power than contextual features.
| Type of features | NE (relative to Contextual) |
| --- | --- |
| All | 95.65% |
| Historical | 96.32% |
| Contextual | 100% (reference) |
It should be noticed that contextual features are very important to handle the cold start problem. For new users and ads, contextual features are indispensable for a reasonable click through rate prediction.
The model with contextual features relies more heavily on data freshness than the one with historical features. This is in line with our intuition, since historical features describe long-term accumulated user behaviour, which is much more stable than contextual features.
In this section we evaluate two techniques for down sampling data, uniform subsampling and negative down sampling. In each case we train a set of boosted tree models with 600 trees and evaluate these using both calibration and normalized entropy.
Negative down sampling means sampling only the negative examples while keeping all the positives.
This has two benefits: it lowers the computational cost, and it rebalances the ratio of positive to negative examples.
In CTR prediction, if negative down sampling is used, the final predicted CTR must be re-calibrated.
The data volume demonstrates diminishing returns in terms of prediction accuracy. Using only 10% of the data costs only about a 1% reduction in normalized entropy relative to the entire training set, and calibration shows no degradation at this sampling rate.
From the result, we can see that the negative down sampling rate has a significant effect on the performance of the trained model. The best performance is achieved with the negative down sampling rate set to 0.025.
As noted above, after negative down sampling the predicted CTR must be corrected; a sketch of the correction follows.
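The paper's re-calibration maps a prediction p made in the downsampled space back to the original space via q = p / (p + (1 - p) / w), where w is the negative downsampling rate. A minimal sketch:

```python
def recalibrate(p, w):
    """Map a prediction p made in the negatively downsampled space back to the
    original space; w is the negative downsampling rate (e.g., w = 0.025)."""
    return p / (p + (1.0 - p) / w)

# Example: with w = 0.1 the model sees 10x fewer negatives than reality, so a
# downsampled-space prediction of 0.5 maps back to roughly 0.091.
print(recalibrate(0.5, 0.1))  # ~0.0909
```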
Aspects the paper discusses, in summary:
- Data freshness matters
- Transforming real-valued input features with boosted decision trees significantly increases the prediction accuracy of probabilistic linear classifiers.
- Best online learning method: LR with per-coordinate learning rate.
We have described tricks to keep memory and latency contained in massive scale machine learning applications:
- The tradeoff between the number of boosted decision trees and accuracy
- Boosted decision trees give a convenient way of doing feature selection by means of feature importance. One can aggressively reduce the number of active features whilst only moderately hurting prediction accuracy.
- Analyzed the effect of using historical features in combination with contextual features. For ads and users with history, these features provide more predictive power than contextual features.
Finally, we have discussed ways of subsampling the training data, both uniformly and, more interestingly, in a biased way where only the negative examples are subsampled.