Generalized linear models with nonlinear feature transformations are widely used for large-scale regression and classification problems with sparse inputs. Memorization of feature interactions through a wide set of cross-product feature transformations is effective and interpretable, while generalization requires more feature engineering effort. With less feature engineering, deep neural networks can generalize better to unseen feature combinations through low-dimensional dense embeddings learned for the sparse features. However, deep neural networks with embeddings can over-generalize and recommend less relevant items when the user-item interactions are sparse and high-rank. In this paper, we present Wide & Deep learning, jointly trained wide linear models and deep neural networks, to combine the benefits of memorization and generalization for recommender systems. We productionized and evaluated the system on Google Play, a commercial mobile app store with over one billion active users and over one million apps. Online experiment results show that Wide & Deep significantly increased app acquisitions compared with wide-only and deep-only models. We have also open-sourced our implementation in TensorFlow.
Translation: Generalized linear models with nonlinear feature transformations are widely used for large-scale regression and classification problems with sparse inputs. Memorizing feature interactions through a set of cross-product feature transformations is effective and interpretable, while generalization requires more feature engineering effort. With less feature engineering, deep neural networks can generalize better to unseen feature combinations through the low-dimensional dense embedding vectors learned for the sparse features. However, when the user-item interactions are sparse and high-rank, deep neural networks with embeddings can over-generalize and recommend less relevant items. In this paper we present Wide & Deep learning, which jointly trains wide linear models and deep neural networks, combining the benefits of memorization and generalization for recommender systems. We productionized and evaluated the system on Google Play, a commercial mobile app store with over one billion active users and over one million apps. Online experiment results show that Wide & Deep significantly increased app acquisitions compared with wide-only and deep-only models. We have also open-sourced our implementation in TensorFlow.
Commentary: Two terms matter here and recur throughout the paper: memorization and generalization. Rendering them as "memorize" and "generalize" is not ideal, but no better alternative comes to mind. Memorization means learning how known feature transformations and feature combinations affect the outcome; generalization means learning how unseen feature transformations and feature combinations affect it. Taking the Google Play prediction task in the paper as an example: using a linear model to learn how the user's age and occupation, and the app's download count and category, affect whether the user will install the app is memorization; using a deep model to learn how unseen feature combinations (a contrived example: user age * occupation / app downloads + category) affect whether the user will install the app is generalization.
A recommender system can be viewed as a search ranking system, where the input query is a set of user and contextual information, and the output is a ranked list of items. Given a query, the recommendation task is to find the relevant items in a database and then rank the items based on certain objectives, such as clicks or purchases.
One challenge in recommender systems, similar to the general search ranking problem, is to achieve both memorization and generalization. Memorization can be loosely defined as learning the frequent co-occurrence of items or features and exploiting the correlation available in the historical data. Generalization, on the other hand, is based on transitivity of correlation and explores new feature combinations that have never or rarely occurred in the past. Recommendations based on memorization are usually more topical and directly relevant to the items on which users have already performed actions. Compared with memorization, generalization tends to improve the diversity of the recommended items. In this paper, we focus on the apps recommendation problem for the Google Play store, but the approach should apply to generic recommender systems.
For massive-scale online recommendation and ranking systems in an industrial setting, generalized linear models such as logistic regression are widely used because they are simple, scalable and interpretable. The models are often trained on binarized sparse features with one-hot encoding. E.g., the binary feature "user_installed_app=netflix" has value 1 if the user installed Netflix. Memorization can be achieved effectively using cross-product transformations over sparse features, such as AND(user_installed_app=netflix, impression_app=pandora), whose value is 1 if the user installed Netflix and then is later shown Pandora. This explains how the co-occurrence of a feature pair correlates with the target label. Generalization can be added by using features that are less granular, such as AND(user_installed_category=video, impression_category=music), but manual feature engineering is often required. One limitation of cross-product transformations is that they do not generalize to query-item feature pairs that have not appeared in the training data.
Embedding-based models, such as factorization machines [5] or deep neural networks, can generalize to previously unseen query-item feature pairs by learning a low-dimensional dense embedding vector for each query and item feature, with less burden of feature engineering. However, it is difficult to learn effective low-dimensional representations for queries and items when the underlying query-item matrix is sparse and high-rank, such as users with specific preferences or niche items with a narrow appeal. In such cases, there should be no interactions between most query-item pairs, but dense embeddings will lead to nonzero predictions for all query-item pairs, and thus can over-generalize and make less relevant recommendations. On the other hand, linear models with cross-product feature transformations can memorize these "exception rules" with much fewer parameters.
In this paper, we present the Wide & Deep learning framework to achieve both memorization and generalization in one model, by jointly training a linear model component and a neural network component as shown in Figure 1.
The main contributions of the paper include:
• The Wide & Deep learning framework for jointly training feed-forward neural networks with embeddings and linear model with feature transformations for generic recommender systems with sparse inputs.
• The implementation and evaluation of the Wide & Deep recommender system productionized on Google Play, a mobile app store with over one billion active users and over one million apps.
• We have open-sourced our implementation along with a high-level API in TensorFlow.
While the idea is simple, we show that the Wide & Deep framework significantly improves the app acquisition rate on the mobile app store, while satisfying the training and serving speed requirements.
Translation: A recommender system can be viewed as a search ranking system, where the input query is a set of user and contextual information, and the output is a ranked list of items. Given a query, the recommendation task is to find the relevant items in a database and then rank them according to certain objectives, such as clicks or purchases.

Similar to the general search ranking problem, one challenge in recommender systems is to achieve both memorization and generalization. Memorization can be loosely defined as learning how frequently items or features co-occur and exploiting the correlations available in historical data. Generalization, on the other hand, is based on the transitivity of correlations and explores new feature combinations that have never or rarely occurred in the past. Recommendations based on memorization are usually more topical and directly related to items the user has already interacted with. Compared with memorization, generalization tends to improve the diversity of the recommended items. In this paper we focus on the app recommendation problem for the Google Play store, but the approach should apply to generic recommender systems.

In industrial-scale online recommendation and ranking systems, generalized linear models such as logistic regression are widely used because they are simple, scalable, and interpretable. The models are usually trained on binarized sparse features with one-hot encoding. For example, the binary feature "user_installed_app=netflix" has value 1 if the user installed Netflix. Memorization can be achieved effectively with cross-product transformations over sparse features, such as AND(user_installed_app=netflix, impression_app=pandora), whose value is 1 if the user installed Netflix and was later shown Pandora. This captures how the co-occurrence of a feature pair correlates with the target label. Generalization can be added by using less granular features, such as AND(user_installed_category=video, impression_category=music), but manual feature engineering is often required. One limitation of cross-product transformations is that they cannot generalize to query-item feature pairs that never appeared in the training data.

Embedding-based models, such as factorization machines or deep neural networks, can generalize to previously unseen query-item feature pairs by learning a low-dimensional dense embedding vector for each query and item feature, with less feature engineering burden. However, when the underlying query-item matrix is sparse and high-rank, for example users with specific preferences or niche items with narrow appeal, it is difficult to learn effective low-dimensional representations for queries and items. In such cases there should be no interaction between most query-item pairs, but dense embeddings lead to nonzero predictions for all query-item pairs, so the model can over-generalize and make less relevant recommendations. On the other hand, linear models with cross-product feature transformations can memorize these "exception rules" with far fewer parameters.

In this paper we present the Wide & Deep model, which achieves both memorization and generalization by jointly training a wide linear model and a deep neural network (Figure 1).

The main contributions of this paper include:
(1) The Wide & Deep learning framework for jointly training feed-forward neural networks with embeddings and linear models with feature transformations, for generic recommender systems with sparse inputs.
(2) The implementation and evaluation of the Wide & Deep recommender system productionized on Google Play, a mobile app store with over one billion active users and over one million apps.
(3) We have open-sourced our implementation along with a high-level API in TensorFlow.

While the idea is simple, experiments show that the Wide & Deep framework significantly improves the app acquisition rate on the mobile app store, while satisfying the training and serving speed requirements.
An overview of the app recommender system is shown in Figure 2. A query, which can include various user and contextual features, is generated when a user visits the app store. The recommender system returns a list of apps (also referred to as impressions) on which users can perform certain actions such as clicks or purchases. These user actions, along with the queries and impressions, are recorded in the logs as the training data for the learner.
Since there are over a million apps in the database, it is intractable to exhaustively score every app for every query within the serving latency requirements (often O(10) milliseconds). Therefore, the first step upon receiving a query is retrieval. The retrieval system returns a short list of items that best match the query using various signals, usually a combination of machine-learned models and human-defined rules. After reducing the candidate pool, the ranking system ranks all items by their scores. The scores are usually P(y|x), the probability of a user action label y given the features x, including user features (e.g., country, language, demographics), contextual features (e.g., device, hour of the day, day of the week), and impression features (e.g., app age, historical statistics of an app). In this paper, we focus on the ranking model using the Wide & Deep learning framework.
Translation: An overview of the app recommender system is shown in Figure 2. When a user visits the app store, a query is generated containing various user and contextual features. The recommender system returns a list of apps (also called impressions) on which the user can perform actions such as clicks or purchases. These user actions, together with the queries and impressions, are logged as the training data for the model.

Since there are over one million apps in the database, it is intractable to exhaustively score every app for every query within the serving latency requirements (often O(10) milliseconds). Therefore, the first step upon receiving a query is retrieval. The retrieval system uses various signals, usually a combination of machine-learned models and human-defined rules, to return a short list of items that best match the query. After the candidate pool is reduced, the ranking system ranks the apps by their scores. The score is usually P(y|x), the probability of a user action label y given features x, which include user features (country, language, demographics), contextual features (device, hour of the day, day of the week), and impression features (app age, historical statistics of an app). In this paper we focus on applying the Wide & Deep learning framework to the ranking model.
Commentary: The term "impressions" may be confusing here; it refers to the list of apps shown to the user, and in "impression features" it refers to features of those shown apps.
The wide component is a generalized linear model of the form $y = w^T x + b$, as illustrated in Figure 1 (left). $y$ is the prediction, $x = [x_1, x_2, ..., x_d]$ is a vector of $d$ features, $w = [w_1, w_2, ..., w_d]$ are the model parameters and $b$ is the bias. The feature set includes raw input features and transformed features. One of the most important transformations is the cross-product transformation, which is defined as:

$$\phi_k(x) = \prod_{i=1}^{d} x_i^{c_{ki}}, \quad c_{ki} \in \{0, 1\}$$

where $c_{ki}$ is a boolean variable that is 1 if the $i$-th feature is part of the $k$-th transformation $\phi_k$, and 0 otherwise. For binary features, a cross-product transformation (e.g., "AND(gender=female, language=en)") is 1 if and only if the constituent features ("gender=female" and "language=en") are all 1, and 0 otherwise. This captures the interactions between the binary features, and adds nonlinearity to the generalized linear model.
Translation: The wide component is a generalized linear model of the form $y = w^T x + b$, as shown on the left of Figure 1. $y$ is the prediction, $x = [x_1, x_2, ..., x_d]$ is the feature vector, $w = [w_1, w_2, ..., w_d]$ are the model parameters, and $b$ is the bias. The features include raw input features and transformed features. The most important transformation is the cross-product transformation, defined as $\phi_k(x) = \prod_{i=1}^{d} x_i^{c_{ki}}$ with $c_{ki} \in \{0, 1\}$, where $c_{ki}$ is a boolean variable that is 1 if the $i$-th feature is part of the $k$-th transformation and 0 otherwise. For binary features, a cross-product feature is 1 if and only if all of its constituent features are 1 (e.g., AND(gender=female, language=en) = 1 only when "gender=female" and "language=en" both hold, and 0 in all other cases). This captures the interactions between binary features and adds nonlinearity to the generalized linear model.

Commentary: The formula $\phi_k(x) = \prod_{i=1}^{d} x_i^{c_{ki}}$ looks abstractly mathematical, but it is simply feature crossing. Take the combination of gender and language: gender is {male, female}, language is {Chinese, English}, so the crossed features are {male & Chinese, male & English, female & Chinese, female & English}. For a sample with gender=female and language=English, the crossed feature "female & English" is 1 and all the others are 0.
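To make the cross-product transformation concrete, here is a minimal Python sketch (not from the paper; the feature names follow the gender/language example above) that evaluates $\phi_k(x)$ for binary one-hot features:

```python
import numpy as np

# Binary one-hot features, indexed by name (illustrative example).
feature_index = {"gender=female": 0, "language=en": 1, "user_installed_app=netflix": 2}
x = np.array([1, 1, 0])  # gender=female and language=en are active

# c[k][i] = 1 if feature i participates in cross-product k.
# Transformation 0: AND(gender=female, language=en)
c = np.array([[1, 1, 0]])

def cross_product(x, c):
    # phi_k(x) = prod_i x_i^{c_ki}: 1 iff every constituent feature is 1.
    # Since x_i^0 == 1, features outside the transformation are ignored.
    return np.prod(np.power(x, c), axis=1)

print(cross_product(x, c))  # [1] -> AND(gender=female, language=en) fires
```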
The deep component is a feed-forward neural network, as shown in Figure 1 (right). For categorical features, the original inputs are feature strings (e.g., "language=en"). Each of these sparse, high-dimensional categorical features is first converted into a low-dimensional and dense real-valued vector, often referred to as an embedding vector. The dimensionality of the embeddings is usually on the order of O(10) to O(100). The embedding vectors are initialized randomly and then the values are trained to minimize the final loss function during model training. These low-dimensional dense embedding vectors are then fed into the hidden layers of a neural network in the forward pass. Specifically, each hidden layer performs the following computation:

$$a^{(l+1)} = f(W^{(l)} a^{(l)} + b^{(l)})$$

where $l$ is the layer number and $f$ is the activation function, often rectified linear units (ReLUs). $a^{(l)}$, $b^{(l)}$, and $W^{(l)}$ are the activations, bias, and model weights at the $l$-th layer.
Translation: The deep component is a feed-forward neural network, as shown on the right of Figure 1. For categorical features, the original inputs are feature strings (e.g., "language=en"). These sparse, high-dimensional categorical features are first converted into low-dimensional dense real-valued vectors, usually called embedding vectors. The dimensionality of the embeddings is typically on the order of O(10) to O(100). The embedding vectors are initialized randomly and their values are then trained to minimize the final loss function during model training. These low-dimensional dense vectors are fed into the hidden layers of the neural network in the forward pass. Specifically, each hidden layer computes $a^{(l+1)} = f(W^{(l)} a^{(l)} + b^{(l)})$, where $l$ is the layer number and $f$ is the activation function, usually ReLU. $a^{(l)}$, $b^{(l)}$, and $W^{(l)}$ are the activations, bias, and model weights of the $l$-th layer.
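As an illustration of this forward pass, the following NumPy sketch looks up an embedding for each categorical feature, concatenates them, and applies ReLU hidden layers. The vocabulary size, hidden sizes, and initialization scales are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim = 1000, 32  # embedding dimensionality on the order O(10)-O(100)
E = rng.normal(scale=0.01, size=(vocab_size, embed_dim))  # randomly initialized, trained later

def relu(z):
    return np.maximum(z, 0.0)

def deep_forward(category_ids, weights, biases):
    # Look up and concatenate the embedding vector of each categorical feature.
    a = np.concatenate([E[i] for i in category_ids])
    # a^{(l+1)} = f(W^{(l)} a^{(l)} + b^{(l)}) with f = ReLU
    for W, b in zip(weights, biases):
        a = relu(W @ a + b)
    return a

ids = [3, 42, 7]                       # hypothetical ids, e.g., of "language=en", etc.
dims = [len(ids) * embed_dim, 64, 32]  # hypothetical hidden-layer sizes
Ws = [rng.normal(scale=0.1, size=(dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
bs = [np.zeros(dims[i + 1]) for i in range(len(dims) - 1)]
print(deep_forward(ids, Ws, bs).shape)  # (32,) -> the final activations a^{(l_f)}
```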
The wide component and deep component are combined using a weighted sum of their output log odds as the prediction, which is then fed to one common logistic loss function for joint training. Note that there is a distinction between joint training and ensemble. In an ensemble, individual models are trained separately without knowing each other, and their predictions are combined only at inference time but not at training time. In contrast, joint training optimizes all parameters simultaneously by taking both the wide and deep part as well as the weights of their sum into account at training time. There are implications on model size too: For an ensemble, since the training is disjoint, each individual model size usually needs to be larger (e.g., with more features and transformations) to achieve reasonable accuracy for an ensemble to work. In comparison, for joint training the wide part only needs to complement the weaknesses of the deep part with a small number of cross-product feature transformations, rather than a full-size wide model.
Joint training of a Wide & Deep model is done by backpropagating the gradients from the output to both the wide and deep part of the model simultaneously using mini-batch stochastic optimization. In the experiments, we used the Follow-the-regularized-leader (FTRL) algorithm [3] with L1 regularization as the optimizer for the wide part of the model, and AdaGrad [1] for the deep part. The combined model is illustrated in Figure 1 (center). For a logistic regression problem, the model's prediction is:

$$P(Y=1 \mid x) = \sigma\left(w_{wide}^T [x, \phi(x)] + w_{deep}^T a^{(l_f)} + b\right)$$

where $Y$ is the binary class label, $\sigma(\cdot)$ is the sigmoid function, $\phi(x)$ are the cross-product transformations of the original features $x$, and $b$ is the bias term. $w_{wide}$ is the vector of all wide model weights, and $w_{deep}$ are the weights applied on the final activations $a^{(l_f)}$.
Translation: The wide part and the deep part are combined using a weighted sum of their output log odds as the prediction, which is then fed into a common logistic loss function for joint training. Note that joint training is different from ensembling. In an ensemble, the individual models are trained separately without knowing each other, and their predictions are combined only at inference time, not at training time. In contrast, joint training optimizes all parameters simultaneously, taking into account both the wide and deep parts as well as the weights of their sum at training time. This also has implications for model size: in an ensemble, since training is disjoint, each individual model usually needs to be larger (e.g., with more features and transformations) for the ensemble to reach reasonable accuracy. In joint training, by contrast, the wide part only needs to complement the weaknesses of the deep part with a small number of cross-product feature transformations, rather than being a full-size wide model.

Joint training of the Wide & Deep model is done by backpropagating the gradients from the output to both the wide and deep parts simultaneously, using mini-batch stochastic optimization. In the experiments, we used the FTRL algorithm with L1 regularization as the optimizer for the wide part, and AdaGrad for the deep part.

The combined model is illustrated in Figure 1 (center). For a logistic regression problem, the model's prediction is $P(Y=1 \mid x) = \sigma(w_{wide}^T [x, \phi(x)] + w_{deep}^T a^{(l_f)} + b)$, where $Y$ is the binary class label, $\sigma$ is the sigmoid function, $\phi(x)$ are the cross-product feature transformations, $b$ is the bias term, $w_{wide}$ are the weights of the wide model, and $w_{deep}$ are the weights applied to the final activations $a^{(l_f)}$.
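Since the implementation was open-sourced with a high-level TensorFlow API, the combined model can be sketched roughly as follows with the TF 1.x Estimator API. The bucket sizes and hidden-layer sizes are illustrative assumptions, while the default optimizers of tf.estimator.DNNLinearCombinedClassifier happen to match the paper's choices (FTRL for the wide part, AdaGrad for the deep part):

```python
import tensorflow as tf  # TF 1.x Estimator API

# Sparse categorical columns (bucket sizes are hypothetical).
installed = tf.feature_column.categorical_column_with_hash_bucket(
    "user_installed_app", hash_bucket_size=10000)
impression = tf.feature_column.categorical_column_with_hash_bucket(
    "impression_app", hash_bucket_size=10000)

# Wide part: sparse features plus a cross-product transformation
# (crossed_column takes the raw string feature keys).
wide_columns = [
    installed, impression,
    tf.feature_column.crossed_column(
        ["user_installed_app", "impression_app"], hash_bucket_size=100000),
]

# Deep part: 32-dimensional embeddings of the categorical features.
deep_columns = [
    tf.feature_column.embedding_column(installed, dimension=32),
    tf.feature_column.embedding_column(impression, dimension=32),
]

# Defaults: FTRL for the linear (wide) part, AdaGrad for the DNN (deep) part.
model = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[1024, 512, 256])  # illustrative ReLU layer sizes

# model.train(input_fn=...) would then run the joint training loop.
```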
The implementation of the apps recommendation pipeline consists of three stages: data generation, model training, and model serving as shown in Figure 3.
Translation: As shown in Figure 3, the implementation of the app recommendation pipeline consists of three stages: data generation, model training, and model serving.
In this stage, user and app impression data within a period of time are used to generate training data. Each example corresponds to one impression. The label is app acquisition: 1 if the impressed app was installed, and 0 otherwise.
Vocabularies, which are tables mapping categorical feature strings to integer IDs, are also generated in this stage. The system computes the ID space for all the string features that occurred more than a minimum number of times. Continuous real-valued features are normalized to [0, 1] by mapping a feature value $x$ to its cumulative distribution function $P(X \le x)$, divided into $n_q$ quantiles. The normalized value is $\frac{i-1}{n_q - 1}$ for values in the $i$-th quantile. Quantile boundaries are computed during data generation.
Translation: In this stage, user and app impression data within a period of time are used to generate the training data. Each example corresponds to one impression. The label is app acquisition: 1 if the impressed app was installed, 0 otherwise.

Vocabularies, tables mapping categorical feature strings to integer IDs, are also generated in this stage. The system computes the ID space for every string feature that occurred more than a minimum number of times. Continuous real-valued features are normalized to [0, 1] by mapping a feature value $x$ to its cumulative distribution function $P(X \le x)$ divided into $n_q$ quantiles; the normalized value for values in the $i$-th quantile is $\frac{i-1}{n_q - 1}$. The quantile boundaries are computed during data generation.
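A small NumPy sketch of this quantile normalization (the data and $n_q$ are illustrative): the quantile boundaries are fit once on training data, and each value in the $i$-th quantile is then mapped to $\frac{i-1}{n_q-1}$:

```python
import numpy as np

def fit_quantile_boundaries(train_values, n_q):
    # Interior boundaries splitting the training distribution into n_q quantiles.
    return np.quantile(train_values, np.linspace(0, 1, n_q + 1)[1:-1])

def normalize(x, boundaries, n_q):
    # i runs 1..n_q; searchsorted yields the 0-based quantile index i - 1.
    i_minus_1 = np.searchsorted(boundaries, x, side="right")
    return i_minus_1 / (n_q - 1)

rng = np.random.default_rng(0)
train = rng.exponential(size=10000)  # e.g., "historical statistics of an app"
bounds = fit_quantile_boundaries(train, n_q=10)
print(normalize(np.array([0.01, 0.7, 5.0]), bounds, n_q=10))  # values in [0, 1]
```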
The model structure we used in the experiment is shown in Figure 4. During training, our input layer takes in training data and vocabularies and generates sparse and dense features together with a label. The wide component consists of the cross-product transformation of user installed apps and impression apps. For the deep part of the model, a 32-dimensional embedding vector is learned for each categorical feature. We concatenate all the embeddings together with the dense features, resulting in a dense vector of approximately 1200 dimensions. The concatenated vector is then fed into 3 ReLU layers, and finally the logistic output unit.
The Wide & Deep models are trained on over 500 billion examples. Every time a new set of training data arrives, the model needs to be re-trained. However, retraining from scratch every time is computationally expensive and delays the time from data arrival to serving an updated model. To tackle this challenge, we implemented a warm-starting system which initializes a new model with the embeddings and the linear model weights from the previous model.
Before loading the models into the model servers, a dry run of the model is done to make sure that it does not cause problems in serving live traffic. We empirically validate the model quality against the previous model as a sanity check.
Translation: The model structure used in the experiments is shown in Figure 4. During training, the input layer takes in the training data and vocabularies and generates sparse and dense features together with a label. The wide component consists of the cross-product transformation of the user's installed apps and the impression apps. For the deep part, a 32-dimensional embedding vector is learned for each categorical feature. We concatenate all the embeddings together with the dense features into a dense vector of roughly 1200 dimensions, which is then fed into 3 ReLU layers and finally the logistic output unit.

The Wide & Deep models are trained on over 500 billion examples. Whenever a new set of training data arrives, the model needs to be retrained. However, retraining from scratch every time is computationally expensive and delays the time from data arrival to serving an updated model. To address this, we implemented a warm-starting system that initializes a new model with the embeddings and linear model weights from the previous model. Before loading a model into the model servers, a dry run is performed to make sure it does not cause problems when serving live traffic, and we empirically validate the model quality against the previous model as a sanity check.
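The paper does not show its warm-starting code, but the idea of initializing a new model from the previous model's embeddings and linear weights can be sketched with tf.estimator.WarmStartSettings in the TF 1.x Estimator API; the checkpoint path and feature column are hypothetical:

```python
import tensorflow as tf  # TF 1.x Estimator API

app = tf.feature_column.categorical_column_with_hash_bucket("impression_app", 10000)

# Initialize embeddings and linear weights from the previously trained model,
# instead of retraining from scratch when new data arrives.
warm_start = tf.estimator.WarmStartSettings(
    ckpt_to_initialize_from="/models/wide_deep/previous",  # hypothetical path
    vars_to_warm_start=".*")  # regex over variable names; ".*" warm-starts everything

model = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=[app],
    dnn_feature_columns=[tf.feature_column.embedding_column(app, dimension=32)],
    dnn_hidden_units=[1024, 512, 256],
    warm_start_from=warm_start)
```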
Once the model is trained and verified, we load it into the model servers. For each request, the servers receive a set of app candidates from the app retrieval system and user features to score each app. Then, the apps are ranked from the highest scores to the lowest, and we show the apps to the users in this order. The scores are calculated by running a forward inference pass over the Wide & Deep model.
In order to serve each request on the order of 10 ms, we optimized the performance using multithreading parallelism by running smaller batches in parallel, instead of scoring all candidate apps in a single batch inference step.
Translation: Once the model is trained and verified, we load it into the model servers. For each request, the servers receive a set of candidate apps from the app retrieval system together with the user features, and score each app. The apps are then ranked from the highest score to the lowest and shown to the user in that order. The scores are computed by running a forward inference pass over the Wide & Deep model.

To serve each request on the order of 10 ms, we optimized performance with multithreading parallelism, running smaller batches in parallel instead of scoring all candidate apps in a single batch inference step.
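This serving optimization can be sketched in plain Python; the scoring function below is a stand-in for the real Wide & Deep forward pass, and the batch size and worker count are illustrative:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def score_batch(candidate_features):
    # Stand-in for a forward inference pass over the Wide & Deep model.
    return candidate_features @ np.full(candidate_features.shape[1], 0.01)

def rank_candidates(features, batch_size=50, workers=8):
    # Split candidates into small batches and score them in parallel,
    # instead of one large single-batch inference step.
    batches = [features[i:i + batch_size] for i in range(0, len(features), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = np.concatenate(list(pool.map(score_batch, batches)))
    return np.argsort(-scores)  # highest score first

candidates = np.random.rand(500, 64)  # e.g., 500 retrieved apps, 64 features each
print(rank_candidates(candidates)[:10])
```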
(This section is omitted.)
The idea of combining wide linear models with cross-product feature transformations and deep neural networks with dense embeddings is inspired by previous work, such as factorization machines [5] which add generalization to linear models by factorizing the interactions between two variables as a dot product between two low-dimensional embedding vectors. In this paper, we expanded the model capacity by learning highly nonlinear interactions between embeddings via neural networks instead of dot products.
In language models, joint training of recurrent neural networks (RNNs) and maximum entropy models with n-gram features has been proposed to significantly reduce the RNN complexity (e.g., hidden layer sizes) by learning direct weights between inputs and outputs [4]. In computer vision, deep residual learning [2] has been used to reduce the difficulty of training deeper models and improve accuracy with shortcut connections which skip one or more layers. Joint training of neural networks with graphical models has also been applied to human pose estimation from images [6]. In this work we explored the joint training of feed-forward neural networks and linear models, with direct connections between sparse features and the output unit, for generic recommendation and ranking problems with sparse input data.
In the recommender systems literature, collaborative deep learning has been explored by coupling deep learning for content information and collaborative filtering (CF) for the ratings matrix [7]. There has also been previous work on mobile app recommender systems, such as AppJoy which used CF on users’ app usage records [8]. Different from the CF-based or content-based approaches in the previous work, we jointly train Wide & Deep models on user and impression data for app recommender systems.
Translation: The idea of combining wide models with cross-product feature transformations and deep models with dense embedding vectors is inspired by previous work, such as factorization machines, which add generalization to linear models by factorizing the interaction between two variables as a dot product between two low-dimensional embedding vectors. In this paper, we expand the model capacity by learning highly nonlinear interactions between embeddings via neural networks instead of dot products. In language models, joint training of recurrent neural networks (RNNs) and maximum entropy models with n-gram features has been proposed to significantly reduce RNN complexity (e.g., hidden layer sizes) by learning direct weights between inputs and outputs [4]. In computer vision, deep residual learning [2] has been used to reduce the difficulty of training deeper models and to improve accuracy through shortcut connections that skip one or more layers. Joint training of neural networks with graphical models has also been applied to human pose estimation from images [6]. In this work, we explore the joint training of feed-forward neural networks and linear models, with direct connections between sparse features and the output unit, for generic recommendation and ranking problems with sparse input data.

In the recommender systems literature, collaborative deep learning has been explored by coupling deep learning for content information with collaborative filtering (CF) for the ratings matrix [7]. There has also been previous work on mobile app recommender systems, such as AppJoy, which applied CF to users' app usage records [8]. Unlike the CF-based or content-based approaches of previous work, we jointly train the Wide & Deep model on user and impression data for app recommender systems.
Memorization and generalization are both important for recommender systems. Wide linear models can effectively memorize sparse feature interactions using cross-product feature transformations, while deep neural networks can generalize to previously unseen feature interactions through low-dimensional embeddings. We presented the Wide & Deep learning framework to combine the strengths of both types of model. We productionized and evaluated the framework on the recommender system of Google Play, a massive-scale commercial app store. Online experiment results showed that the Wide & Deep model led to significant improvement on app acquisitions over wide-only and deep-only models.
Translation: Memorization and generalization are both important for recommender systems. Wide linear models can effectively memorize sparse feature interactions using cross-product feature transformations, while deep neural networks can generalize to previously unseen feature interactions through low-dimensional embeddings. We presented the Wide & Deep learning framework to combine the strengths of both types of models. We productionized and evaluated the framework on the recommender system of Google Play, a massive-scale commercial app store. Online experiment results showed that the Wide & Deep model led to significant improvement on app acquisitions over wide-only and deep-only models.
PS: This is only a rough translation of the paper; if you spot any mistakes, corrections are welcome. Next I will study and experiment with the open-source code, and I will keep updating this post as my understanding of the paper deepens. Comments and discussion are welcome!