Kun Gai1, Xiaoqiang Zhu1, Han Li1, Kai Liu2†, Zhe Wang3†
[email protected], {xiaoqiang.zxq, lihan.lh}@alibaba-inc.com
† contribute to this paper while worked at Alibaba
April 19, 2017
CTR prediction in real-world business is a difficult machine learning problem with large scale nonlinear sparse data. In this paper, we introduce an industrial strength solution with model named Large Scale Piece-wise Linear Model (LS-PLM). We formulate the learning problem with L1 and L2,1 regularizers, leading to a non-convex and non-smooth optimization problem. Then, we propose a novel algorithm to solve it efficiently, based on directional derivatives and quasi-Newton method. In addition, we design a distributed system which can run on hundreds of machines parallel and provides us with the industrial scalability. LS-PLM model can capture nonlinear patterns from massive sparse data, saving us from heavy feature engineering jobs. Since 2012, LS-PLM has become the main CTR prediction model in Alibaba’s online display advertising system, serving hundreds of millions users every day.
ctr预测是一个具有大规模非线性稀疏数据的机器学习难题。本文介绍了一种工业强度的求解方法,即大型分段线性模型(LS-PLM)。我们用l1和l2,1正则化器构造了学习问题,得到了一个非凸非光滑的优化问题。然后,基于方向导数和拟牛顿法,提出了一种新的求解方法。此外,我们还设计了一个分布式系统,可以在数百台机器上并行运行,并为我们提供了工业可扩展性。LS-PLM模型可以从大量稀疏数据中捕获非线性模式,从而将我们从繁重的功能工程工作中拯救出来。自2012年以来,LS-PLM已经成为阿里巴巴在线展示广告系统中的主要ctr预测模型,每天为数亿用户提供服务。
Click-through rate (CTR) prediction is a core problem in the multi-billion dollar online advertising industry. To improve the accuracy of CTR prediction, more and more data are involved, making CTR prediction a large scale learning problem, with massive samples and high dimension features.
Traditional solution is to apply a linear logistic regression (LR) model, trained in a parallel manner (Brendan et al. 2013, Andrew & Gao 2007). LR model with L1 regularization can generate sparse solution, making it fast for online prediction. Unfortunately, CTR prediction problem is a highly nonlinear problem. In particular, user-click generation involves many complex factors, like ad quality, context information, user interests, and complex interactions of these factors. To help LR model catch the nonlinearity, feature engineering technique is explored, which is both time and humanity consuming.
Another direction, is to capture the nonlinearity with well-designed models. Facebook (He et al. 2014) uses a hybrid model which combines decision trees with logistic regression. Decision tree plays a nonlinear feature transformation role, whose output is fed to LR model. However, tree-based method is not suitable for very sparse and high dimensional data (Safavian S. R. & Landgrebe D. 1990). (Rendle S. 2010) introduces Factorization Machines(FM), which involves interactions among features using 2nd order functions (or using other given-number-order functions). However, FM can not fit all general nonlinear patterns in data (like other higher order patterns).
In this paper, we present a piece-wise linear model and its training algorithm for large scale data. We name it Large Scale Piecewise Linear Model (LS-PLM). LS-PLM follows the divide-and-conquer strategy, that is, first divides the feature space into several local regions, then fits a linear model in each region, resulting in the output with combinations of weighted linear predictions. Note that, these two steps are learned simultaneously in a supervised manner, aiming to minimize the prediction loss. LS-PLM is superior for web-scale data mining in three aspects:
The learning of LS-PLM with sparsity regularizer can be transformed to be a non-convex and nondifferential optimization problem, which is difficult to be solved. We propose an efficient optimization method for such problems, based on directional derivatives and quasi-Newton method. Due to the ability of capturing nonlinear patterns and scalability to massive data, LS-PLMs have become main CTR prediction models in the online display advertising system in alibaba, serving hundreds of millions users since 2012 early. It is also applied in recommendation systems, search engines and other product systems.
The paper is structured as follows. In Section 2, we present LS-PLM model in detail, including formulation, regularization and optimization issues. In Section 3 we introduce our parallel implementation structure. in Section 4, we evaluate the model carefully and demonstrate the advantage of LS-PLM compared with LR. Finally in Section 5, we close with conclusions.
Figure 1: A demo illustration of LS-PLM model. Figure A is the demo dataset. It is a binary classification problem, with red points belong to positive class and blue points belong to negative class. Figure B shows the classification result using LR model. Figure C shows the classification result using LS-PLM model. It’s clear that LS-PLM can capture the nonlinear distribution of data.
We focus on the large scale CTR prediction application. It is a binary classification problems, with dataset and xt ∈ Rd is usually high dimensional and sparse.
To model the nonlinearity of massive scale data, we employ a divide-and-conquer strategy, similar with
(Jordan & Jacobs 1994). We divide the whole feature space into some local regions. For each region we employ an individual generalized linear-classification model. In this way, we tackle the nonlinearity with a piece-wise linear model. We give our model as follows:
Here Θ = {u1,··· ,um,w1,··· ,wm} ∈ Rd×2m denote the model parameters. {u1,··· ,um} is the parameters for dividing function σ(·), and {w1,··· ,wm} for fitting function η(·). Given instance x, our predicating model p(y|x) consists of two parts: the first part σ(uTj x) divides feature space into m (hyper-parameter) different regions, the second part η(wjTx) gives prediction in each region. Function g(·) ensures our model to satisfy the definition of probability function.
Special Case. Take softmax ( Kivinen & Warmuth 1998) as dividing function σ(x) and sigmoid (Hilbe 2009) as fitting function η(x) and g(x) = x, we get a specific formulation:
(2)
In this case, our mixture model can be seen as a FOE model (Jordan & Jacobs 1994, Wang & Puterman 1998) as follows:
(3)
Eq.(2) is the most common used formulation in our real applications. In the reminder of the paper, without special declaration, we take Eq.(2) as our prediction model. Figure 1 illustrates the model compared with LR in a demo dataset, which shows clearly LS-PLM can capture the nonlinear pattern of data. The objective function of LS-PLM model is formalized as Eq.(4):
arg minΘf(Θ) = loss(Θ) + λkΘk2,1 + βkΘk1 (4)
n
loss(Θ)=−Xhyt log(p(yt=1|xt,Θ)) + (1 − yt)log(p(yt=0|xt,Θ))i (5)
t=1
Here loss(Θ) defined in Eq.(5) is the neg-likelihood loss function and kΘ2,1k and kΘ1k are two regulariza-
tion terms providing different properties. First, L2,1 regularization () is employed for feature selection. As in our model, each dimension of feature is associated with 2m parameters. L2,1 regularization are expected to push all the 2m parameters of one dimension of feature to be zero, that is, to suppress those less important features. Second, L1 regularization (kΘk1 = Pij |θij|) is employed for sparsity. Except with the feature selection property, L1 regularization can also force those parameters of left features to be zero as much as possible, which can help improve the interpretation ability as well as generalization performance of the model.
However, both L1 norm and L2,1 norm are non-smooth functions. This causes the objective function of Eq.(4) to be non-convex and non-smooth, making it difficult to employ those traditional gradient-descent optimization methods (Andrew & Gao 2007, Zhang 2004, Bertsekas 2003) or EM method (Wang & Puterman
1998).
Note that, while (Wang & Puterman 1998) gives the same mixture model formulation as Eq.(3), our model is more general for employing different kinds of prediction functions. Besides, we propose a different objective function for large scale industry data, taking the feature sparsity into consideration explicitly. This is crucial for real-world applications, as prediction speed and memory usage are two key indicators for online model serving. Furthermore, we give a more efficient optimization method to solve the large-scale non-convex problem, which is described in the following section.
Before we present our optimization method, we establish some notations and definitions that will be used in the reminder of the paper. Let ∂ij+f(Θ) denote the right partial derivative of f at Θ with respect to Θij:
(6)
where eij is the ijth standard basis vector. The directional derivative of f as Θ in direction d is denoted as f0(Θ;d), which is defined as:
(7)
A vector d is regarded as a descent direction if f0(Θ;d) < 0. sign(·) is the sign function takes value {−1,0,1}. The projection function
(
Θij , sign(Θij) = sign(Ωij)
πij(Θ;Ω) = (8)
0 , otherwise
means projecting Θ onto the orthant defined by Ω.
As discussed above, our objective function for large scale CTR prediction problem is both non-convex and non-smooth. Here we propose a general and efficient optimization method to solve such kind of non-convex problems. Since the negative-gradients of our objective function do not exists for all Θ, we take the direction d which minimizes the directional derivative of f with Θ as a replace. The directional derivative f0(Θ;d) exists for any Θ and direction d, whic is declared as Lemma 1.
Lemma 1. When an objective function f(Θ) is composed by a smooth loss function with L1 and L2,1 norm, for example the objective function given in Eq. (4), the directional derivative f0(Θ;d) exists for any Θ and direction d.
We leave the proof in Appendix A. Since the directional derivative f0(Θ;d) always exists, we choose the direction as the descent direction which minimizes the directional derivative f0(Θ;d) when the negative gradient of f(Θ) does not exist. The following proposition 2 explicitly gives the direction.
Proposition 2. Given a smooth loss function loss(Θ) and an objective function f(Θ) = loss(Θ)+λkΘk2,1+ βkΘk1, the bounded direction d which minimizes the directional derivative f0(Θ;d) is denoted as follows:
(9)
where and v = max{| − ∇loss(Θ)ij| − β,0}sign(−∇loss(Θ)ij).
More details about the proof can be found in Appendix B. According to the proof, we can see that the negative pseudo-gradient defined in Gao’s work (Andrew & Gao 2007) is a special case of our descent direction. Our proposed method is more general to find the descent direction for those non-smooth and non-convex objective functions.
Based on the direction dk in Eq.(9), we update the model parameters along a descent direction calculated by limited-memory quasi-newton method (LBFGS) (Wang & Puterman 1998), which approximates the inverse Hessian matrix of Eq.(4) on the given orthant. Motivated by the OWLQN method (Andrew & Algorithm 1 Optimize problem Eq.(4)
Input:Choose initial point Θ(0)
S ← {},Y ← {}
for k = 0 to MaxIters do
Gao 2007), we also restrict the signs of model parameters not changing in each iteration. Given the chosen direction dk and the old Θ(k), we constrain the orthant of current iteration as follows:
. (10)
(k)
When Θ = 0, the new Θij would not change sign in current iteration. When Θij = 0, we choose the
(k) orthant decided by the selected directionfor the new Θij .
Given the descent direction dk, we approximate the inverse-Hessian matrix Hk using LBFGS method with a sequence of y(k),s(k). Then the final update direction is Hkd(k). Here we give two tricks to adjust the update direction. First, we constrain the update direction in the orthant with respect to d(k). Second, as our objective function is non-convex, we cannot guarantee that Hk is positive-definite. We use y(k)Ts(k) > 0 as a condition to ensure Hk is a positive-definite matrix. If y(k)Ts(k) ≤ 0, we switch to d(k) as the update direction. The final update direction pk is defined as follows:
Hkd(k);d(k)), y(k)Ts(k) > 0
(11) d( ), otherwise
Given the update direction, we use backtracking line search to find the proper step size α. Same as OWLQN, we project the new Θ(k+1) onto the given orthant decided by the Eq. (10).
Θ(k+1) = π(Θ(k) + αpk;ξ(k)) (12)
A pseudo-code description of optimization is given in Algorithm 1. In fact, only a few steps of the standard LBFGS algorithm need to change. These modifications are:
In this section, we first provide our parallel implementation of LS-PLM model for large-scale data, then introduce an important trick which helps to accelerate the training procedue greatly.
Figure 2: The architecture of parallel implementation. Figure A illustrates the physical distributed topology. It’s a variant of parameter server, where each computation node runs with both a server and a worker, aiming to maximize the utility of computation power as well as memory usuage. Figure B illustrates the parameter server structure in model-parallelism and data-parallelism manner.
To apply Algorithm 1 in large-scale settings, we implement it with a distributed learning framework, as illustrated in Figure 2. It’s a variant of parameter server. In our implementation, each computation node runs with both a server node and a worker node, aiming to
In brief, there are two roles in the framework. The first role is the work node. Each node stores a part of training data and a local model, which only saves the model parameters used for the local training data. The second role is the server node. Each node stores a part of global model, which is mutually-exclusive. In each iteration, all of the worker nodes first calculate the loss and the descent direction with local model and local data in parallel(data parallelism). Then server nodes aggregate the loss and the direction d(k) as well as the corresponding entries of Θ needed to calculate the revised gradient (model parallelism). After finishing calculating the steepest descent direction in Step 1, workers synchronize the corresponding entries of Θ, and then, perform Step 2–6 locally.
Figure 3: Common feature pattern in display advertising. Usually in each page view, a user will see several different ads at the same time. In this situation, user features can be shared across these samples.
In addition to the general-purpose parallel implementation, we also optimized the implementation in online advertising context. Training samples in CTR prediction tasks usually have similar common feature pattern. Take display advertising as an example, as illustrated in Figure 3, during each page view, a user will see several different ads at the same time. For example, user U1 in Figure 3 sees three ads in one visit session, thus generates three samples. In this situation, features of user U1 can be shared across these three samples. These features include user profiles (sex, age, etc.) and user behavior histories during visits of Alibabas e-commerce websites, for example, his/her shopping item IDs, preferred brands or favorite shop IDs.
Remember the model defined in Eq. 2, most computation cost focus on µTi x and wiTx. By employing the common feature trick, we can split the calculation into common and non-common parts and rewrite them as follows:
µTi x = µTi,cxc + µTi,ncxnc (13)
wiTx = wi,cT xc + wi,ncT xnc
Hence, for the common feature part, we need just calculate once and then index the result, which will be utilized by the following samples.
Specifically, we optimize the parallel implementation with common features trick in the following three aspects:
Due to the common feature pattern of our production data, employing the common feature trick improves the performance of training procedure greatly, which will be shown in the following section 4.3.
In this session, we evaluate the performance of LS-PLM. Our datasets are generated from Alibaba’s mobile display advertising product system. As shown in Table 1, we collect seven datasets in continuous sequential Table 1: Alibaba’s mobile display advertising CTR prediction datasets
Dataset |
#features |
#samples (training/validation/testing) |
1 |
3.04 × 106 |
1.34/0.25/0.26 × 109 |
2 |
3.27 × 106 |
1.44/0.26/0.26 × 109 |
3 |
3.49 × 106 |
1.56/0.26/0.25 × 109 |
4 |
3.67 × 106 |
1.62/0.25/0.26 × 109 |
5 |
3.82 × 106 |
1.69/0.26/0.26 × 109 |
6 |
3.95 × 106 |
1.74/0.26/0.26 × 109 |
7 |
4.07 × 106 |
1.78/0.26/0.26 × 109 |
periods, aiming to evaluate the consist performance of the proposed model, which is important for online product serving. In each dataset, training/validation/testing samples are disjointly collected from different days, with proportion about 7:1:1. AUC (Fawcett 2006) metric is used to evaluate the model performance.
LS-PLM is a piece-wise linear model, with division number m controlling the model capacity. We evaluate the division effectiveness on model’s performance. It is executed on dataset 1 and results are shown in Figure
4.
Generally speaking, larger m means more parameters and thus leads to larger model capacity. But the training cost will also increase, both time and memory. Hence, in real applications we have to balance the model performance with the training cost.
Figure 4: Model performance with different divisions.
Figure 4 shows the training and testing AUC with different division number m. We try m = 6,12,24,36, the testing AUC for m = 12 is markedly better than m = 6, and improvement for m = 24,36 is relatively gentle. Thus in all the following experiments , the parameter m is set as 12 for LS-PLM model.
As stated in Session 2, in order to keep our model simpler and more generalized, we prefer to constrain the model parameters sparse by both L1 and L2,1 norms. Here we evaluate the strength of both the regularization terms.
Table 2 gives the results. As expected, both L1 and L2,1 norms can push our model to be sparse. Model trained with L2,1 norm has only 9.4% non-zero parameters left and 18.7% features are kept back. While in L1 norm case, there are only 1.9% non-zero parameters left. Combining them together, we get the sparsest Table 2: Regularization effects on model sparsity and performance β/λ(L1/L2,1) #features #non-zero parameters testing auc
Table 3: Training cost comparision with/without common feature trick
Cost |
Without CF. |
With CF. |
Cost Saving |
Memory cost/node |
89.2 GB |
31 GB |
65.2% |
Time cost/iteration |
121s |
10s |
91.7% |
result. Meanwhile, models trained with different norm get different AUC performance. Again adding the two norms together the model reaches the best AUC performance.
In this experiment, the hyper-parameter m is set to be 12. Parameters β and λ are selected by grid search. {0.01,0.1,1,10} are tried for both norms in the all cases. The model with β = 1 and λ = 1 preforms best.
We prove the effectiveness of common features trick. Specifically, we set up the experiments with 100 workers, each of which uses 12 CPU cores, with up to 110 GB memory totally. As shown in Table 3, compressing instances with common feature trick does not affect the actual dimensions of feature space. However, in practice we can significantly reduce memory usage (reduce to about 1/3) and speed up the calculation (around 12 times faster) compared to the training without common feature trick.
Figure 5: Model performance comparison on 7 different test datasets. LS-PLM owns consistent and markable promotion compared with LR.
We now compare LS-PLM with LR, the widely used CTR prediction model in product setting. The two models are trained using our distributed implementation architecture, running with hundreds of machines for speed-up. The choice of the L1 and L2,1 parameters for LS-PLM and the L1 parameter for LR are based on grid search. β = 0.01,0.1,1,10 and λ = 0.01,0.1,1,10 are tried. The best parameters are β = 1 and λ = 1 for LS-PLM, and β = 1 for LR.
As shown in Figure 5, LS-PLM outperforms LR clearly. The average improvement of AUC for LR is 1.44%, which has significant impact to the overall online ad system performance. Besides, the improvement is stable. This ensures LS-PLM can be safely deployed for daily online production system.
In this paper, a piece-wise linear model, LS-PLM for CTR prediction problem is proposed. It can capture the nonlinear pattern from sparse data and save us from heavy feature engineering jobs, which is crucial for real industry applications. Additionally, powered by our distributed and optimized implementation, our algorithm can handle problems of billion samples with ten million parameters, which is the typical industrial data volume. Regularization terms of L1 and L2,1 are utilized to keep the model sparse. Since 2012, LS-PLM has become the main CTR prediction model in alibaba’s online display advertising system, serving hundreds of millions users every day.
Acknowledgments
We would like to thank Xingya Dai and Yanghui Yan for their help for this work.
Appendix
Proof. the definition of f0(Θ;d) is given as follows:
(14)
As the gradient of loss function exists for any Θ, the directional derivative for the first part is |
|
loss(Θ + αd) − loss(Θ) T lim = ∇loss(Θ) d |
(15) |
loss(Θ + αd) − loss(Θ) =lim
α↓0 α
For the second part, we know if kΘi·k2,1 6= 0, the L2,1 norm’s partial derivative exists. So the directional derivative is
. (16)
However, when kΘi·k2,1 = 0, it means Θij = 0,1 ≤ j ≤ 2m. Then its directional derivative can be denoted as follows:
(17)
So combine the above cases in Eq. (16) and Eq. (17), we get the directional derivative for the second part:
(18)
Same as the second part, the direction derivative for the third part is:
(19)
Based on Eq. (15), Eq. (18) and Eq. (19), we get that for any Θ and direction d, f0(Θ;d) exists.
Proof. Finding the expected direction turns to an optimization problem, which is formulated as follows:
minf0(Θ;d) s.t. kdk2 ≤ C. (20) d
Here the direction d is bounded by a constant scalar C. To solve this problem, we employ lagrange function to combine the objective function and inequality function together:
L(d,µ) = f0(Θ;d) + µ(kdk2 − C). (21)
Here µ ≥ 0 is the lagrange multiplier. Setting the partial derivative of d with respect to L(d,µ) to zero has three cases.
Define
2µdij = s − βsign(Θij)
2µdij = max{|s| − β,0}sign(s)
loss(Θ).
Here we use sign(di·) = [sign(di1),...,sign(di2m)]T for simplicity. Then we get
loss(Θ)i· − βsign(di·)
which implies that sign(di·) = sign(−∇loss(Θ)i· − βsign(di·)). When dij ≥ 0, it implies −∇loss(Θ)ij −
. βsign(dij) ≥ 0. Inversely, we have −∇loss(Θ)ij−βsign(dij) ≤ 0 when dij ≤ 0. So we define v = −∇loss(Θ)i·− βsign(di·) and vj = max{| − ∇loss(Θ)ij| − β,0}sign(−∇loss(Θ)ij). So
As kdi·k ≥ 0, we have 2µkdi·k = max(kvk − λ,0). Thus 2
The lagrange multiplier u is a scalar, and it has equivalent influence for all dij. We can see that the optimal direction which is bounded by C is the same direction as we defined in Eq. (9) without considering the constant scalar µ. Here we finish our proof.