A classic Kaggle discussion on Predict click-through rates on display ads, focusing on feature-processing techniques

Link to the Kaggle forum thread: https://www.kaggle.com/c/criteo-display-ad-challenge/forums/t/10555/3-idiots-solution-libffm



1. Transforming infrequent features into a special tag. Conceptually, infrequent features should carry very little or even no information, so it should be very hard for a model to extract that information. In fact, these features can be very noisy. We got a significant improvement (more than 0.0005) by transforming these features into a special tag.
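The idea in tip 1 can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `replace_infrequent`, the tag string `__RARE__`, and the default threshold of 10 are all assumptions made for the example.

```python
from collections import Counter

def replace_infrequent(rows, threshold=10, rare_tag="__RARE__"):
    """Replace feature values seen fewer than `threshold` times with `rare_tag`.

    rows: list of dicts mapping field name -> categorical value.
    Counts are kept per (field, value) pair, so the same string appearing
    in two different fields is counted separately.
    """
    counts = Counter((field, value) for row in rows for field, value in row.items())
    return [
        {
            field: (value if counts[(field, value)] >= threshold else rare_tag)
            for field, value in row.items()
        }
        for row in rows
    ]

# Toy data: "a" appears three times in field C1, "b" only once.
rows = [{"C1": "a"}, {"C1": "a"}, {"C1": "a"}, {"C1": "b"}]
cleaned = replace_infrequent(rows, threshold=2)
```

With `threshold=2`, the single occurrence of "b" is replaced by the tag while "a" is kept. In practice the counts would be computed on the training set and reused for the test set.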


2. Transforming numerical features (I1-I13) to categorical features.

Empirically, we observe that using categorical features is always better than using numerical features. However, too many features are generated if numerical features are directly transformed into categorical features, so we use v <- floor(log(v)^2) to reduce the number of features generated.

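Empirically, we prefer discrete features, which are easier to learn from than continuous ones. (Discrete features have two main advantages over continuous features: first, they are more interpretable; second, they are easier to process.) Applying v <- floor(log(v)^2) compresses the range of a continuous feature, so that converting it to a discrete feature does not produce too many distinct features. (The authors on Kaggle discretize features via hash mapping, into a space of 1,000,000 dimensions.)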
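A rough sketch of tip 2 combined with the hash mapping mentioned above. Several details are assumptions for illustration only: the natural logarithm is used (the source does not state the base), values below 1 are kept as a separate raw token because the transform is undefined or misleading there, the token format `field-bucket` is hypothetical, and MD5 is just one convenient stable hash.

```python
import hashlib
import math

NR_BINS = 1_000_000  # size of the hashed feature space (1,000,000 as noted above)

def numeric_to_token(field, v):
    """Turn a numerical value into a categorical token via v <- floor(log(v)^2).

    Assumption: values below 1 are not transformed and get their own
    'raw' token instead.
    """
    if v >= 1:
        return f"{field}-{math.floor(math.log(v) ** 2)}"
    return f"{field}-raw{v}"

def hash_index(token, nr_bins=NR_BINS):
    """Map a feature token to an index in [0, nr_bins) with a stable hash."""
    digest = hashlib.md5(token.encode("utf8")).hexdigest()
    return int(digest, 16) % nr_bins

token = numeric_to_token("I1", 100)   # ln(100)^2 is about 21.2, so the bucket is 21
index = hash_index(token)
```

The compression is substantial: every value of I1 from 99 to 121 falls into the same bucket 21, so a wide numerical range collapses into a few dozen categorical values.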

3. Data normalization. In our experience, instance-wise data normalization makes the optimization problem easier to solve. For example, for the vector x = (0, 0, 1, 1, 0, 1, 0, 1, 0), standard data normalization divides each element of x by the 2-norm of x. The normalized x becomes (0, 0, 0.5, 0.5, 0, 0.5, 0, 0.5, 0). The hashing trick does not contribute any improvement on the leaderboard. We apply the hashing trick only because it makes generating features easier for us.
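Empirically, feature vectors should be normalized. The hashing trick did not improve results much; it is simply a convenient way to discretize features. (For discretizing both continuous and categorical features, the hashing trick is a very good tool.)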
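The normalization example from tip 3 can be reproduced with a few lines. This is a minimal sketch; the function name `normalize_instance` and the handling of an all-zero vector are assumptions.

```python
import math

def normalize_instance(x):
    """Divide each element of x by the 2-norm of x (instance-wise normalization)."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        # Assumption: an all-zero instance is returned unchanged.
        return list(x)
    return [v / norm for v in x]

x = [0, 0, 1, 1, 0, 1, 0, 1, 0]
normalized = normalize_instance(x)
print(normalized)  # [0.0, 0.0, 0.5, 0.5, 0.0, 0.5, 0.0, 0.5, 0.0]
```

The vector has four ones, so its 2-norm is sqrt(4) = 2 and every non-zero entry becomes 0.5, matching the example in the text.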






Q: Is special care needed when generating the GBDT features to avoid overfitting?
I didn't think it was needed, but it looks like I do have some kind of overfitting problem when trying to implement it naively.
A: If you find that GBDT overfits, I think reducing the number of trees and the depth of each tree may help.




Q: Thanks for such a great solution! Hope you don't mind another little question.
How and when did you choose threshold for transforming infrequent features?

A: This threshold was selected based on experiments. We tried values like 2, 5, 10, 20, 50, 100, and chose the best one among them. (In other words, the threshold is determined experimentally.)


A: Suppose we have an impression whose label is 1. If the prediction for this impression is 0, then we should get an infinite logloss. In practice this is not desired, so on the submission system there is a ceiling (C) for logloss. If the logloss for an impression is greater than this ceiling, it is truncated. Note that we do not know the value of this ceiling. Using this feature, we can hack the average CTR with two submissions. The first submission contains all ones, so the logloss we get is P1 = (nr_non_click*C)/nr_instance. The second submission contains all zeros, so the logloss we get is P2 = (nr_click*C)/nr_instance.
We can then get the average CTR = P2/(P1+P2). Since P1 + P2 = C, the unknown ceiling cancels out and the ratio equals nr_click/nr_instance.
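The arithmetic can be checked with a small simulation. This is only a sketch under stated assumptions: the function `capped_logloss` stands in for the submission system's scoring, the ceiling value 30.0 is made up (the real one is unknown), and the labels are toy data with a true CTR of 0.25.

```python
import math

def capped_logloss(labels, preds, ceiling):
    """Average logloss where each per-impression loss is truncated at `ceiling`."""
    total = 0.0
    for y, p in zip(labels, preds):
        q = p if y == 1 else 1.0 - p          # probability assigned to the true label
        loss = math.inf if q <= 0.0 else -math.log(q)
        total += min(loss, ceiling)
    return total / len(labels)

labels = [1] * 25 + [0] * 75   # toy data: 25 clicks out of 100 impressions
C = 30.0                       # hypothetical ceiling; the estimate does not depend on it

p1 = capped_logloss(labels, [1.0] * len(labels), C)  # all-ones submission
p2 = capped_logloss(labels, [0.0] * len(labels), C)  # all-zeros submission
estimated_ctr = p2 / (p1 + p2)
```

Here P1 = 75*30/100 = 22.5 and P2 = 25*30/100 = 7.5, so the estimate is 7.5/30 = 0.25. Changing `C` changes P1 and P2 but not their ratio.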

