Author: 石塔西
Source: https://zhuanlan.zhihu.com/p/47970601
Compiled by: 深度传送门
Once you have a solid understanding of Feature Column, the Wide & Deep code becomes quite clear. I will walk through it "from the whole to the parts": first the overall flow, then the Wide side, then the Deep side. For reasons of space, readers are encouraged to download the TensorFlow source code and read it alongside this article.
TensorFlow's Wide & Deep implementation is tf.estimator.DNNLinearCombinedClassifier, implemented in tensorflow/python/estimator/canned/dnn_linear_combined.py.
This class itself does not contain much substance; it delegates the actual network construction elsewhere. However, in its constructor we can see the following code:
if n_classes == 2:
  head = head_lib._binary_logistic_head_with_sigmoid_cross_entropy_loss(
      weight_column=weight_column,
      label_vocabulary=label_vocabulary,
      loss_reduction=loss_reduction)
else:
  head = head_lib._multi_class_head_with_softmax_cross_entropy_loss(
      n_classes,
      weight_column=weight_column,
      label_vocabulary=label_vocabulary,
      loss_reduction=loss_reduction)
A head is simply a helper class. For example, binary classification always uses binary cross-entropy loss and usually monitors AUC, while regression always uses mean squared error. For such common tasks, the same boilerplate would otherwise have to be re-implemented in every estimator's model_fn, so TensorFlow factors this routine, highly repetitive code out into heads. Since this is only auxiliary code and does not affect our understanding of the Wide & Deep implementation, I will not expand on it here.
As you can see, the TensorFlow implementation supports both binary and multi-class classification,
and it also supports giving different training examples different weights (by designating one input column as weight_column); the weight is multiplied into that example's loss and therefore also into its gradient.
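For context, here is a minimal usage sketch; the column names ('age', 'city', 'sample_weight') and hidden-unit sizes are made-up assumptions for illustration only. Both the classification mode and per-example weighting are configured directly on the constructor.

import tensorflow as tf

age = tf.feature_column.numeric_column('age')
city = tf.feature_column.categorical_column_with_hash_bucket('city', 100)

estimator = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=[city],                                       # wide side
    dnn_feature_columns=[age, tf.feature_column.embedding_column(city, 8)],  # deep side
    dnn_hidden_units=[64, 32],
    n_classes=2,                    # 2 -> binary head, >2 -> multi-class head
    weight_column='sample_weight')  # this input column scales each example's loss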
The actual model construction is done in _dnn_linear_combined_model_fn.
First the Deep side is built:
dnn_logit_fn = dnn._dnn_logit_fn_builder(
    units=head.logits_dimension,
    hidden_units=dnn_hidden_units,
    feature_columns=dnn_feature_columns,
    activation_fn=dnn_activation_fn,
    dropout=dnn_dropout,
    input_layer_partitioner=input_layer_partitioner,
    batch_norm=batch_norm)
dnn_logits = dnn_logit_fn(features=features, mode=mode)
Then the Wide side:
logit_fn = linear._linear_logit_fn_builder(
    units=head.logits_dimension,
    feature_columns=linear_feature_columns,
    sparse_combiner=linear_sparse_combiner)
linear_logits = logit_fn(features=features)
The final logits are the sum of the two:
if dnn_logits is not None and linear_logits is not None:
  logits = dnn_logits + linear_logits
elif dnn_logits is not None:
  logits = dnn_logits
else:
  logits = linear_logits
dnn_logits and linear_logits must have the same dimension, and that dimension is determined by head.logits_dimension: it is 1 for binary classification and equals the number of classes for multi-class classification.
How do we let Adagrad train the Deep side while FTRL trains the Wide side at the same time? The answer is to group the two training ops:
train_ops = []
if dnn_logits is not None:
  train_ops.append(
      dnn_optimizer.minimize(
          loss,
          var_list=ops.get_collection(
              ops.GraphKeys.TRAINABLE_VARIABLES,
              scope=dnn_absolute_scope)))
if linear_logits is not None:
  train_ops.append(
      linear_optimizer.minimize(
          loss,
          var_list=ops.get_collection(
              ops.GraphKeys.TRAINABLE_VARIABLES,
              scope=linear_absolute_scope)))
train_op = control_flow_ops.group(*train_ops)
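To make the idea concrete outside the estimator, here is a minimal sketch of the same pattern in plain TF 1.x code; the variable scopes 'wide' and 'deep' and the toy loss are assumptions made up for this illustration.

import tensorflow as tf

# toy graph: one "wide" variable and one "deep" variable under separate scopes
with tf.variable_scope('wide'):
    w = tf.get_variable('w', shape=[1], initializer=tf.zeros_initializer())
with tf.variable_scope('deep'):
    d = tf.get_variable('d', shape=[1], initializer=tf.zeros_initializer())

loss = tf.reduce_sum(tf.square(w + d - 1.0))  # made-up loss

# each optimizer computes gradients of the full loss but only updates its own scope
wide_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='wide')
deep_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='deep')
train_ops = [
    tf.train.FtrlOptimizer(0.1).minimize(loss, var_list=wide_vars),
    tf.train.AdagradOptimizer(0.1).minimize(loss, var_list=deep_vars),
]
train_op = tf.group(*train_ops)  # running train_op runs both updates together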
Finally, the estimator spec is built via the head:
head.create_estimator_spec(
    features=features,
    mode=mode,
    labels=labels,
    train_op_fn=_train_op_fn,
    logits=logits)
Having covered the overall flow of the algorithm, let us zoom in on the "Wide" side first. The Deep side mostly handles dense inputs and is technically unremarkable; the Wide side, which handles high-dimensional, sparse inputs, is where the real points of interest are.
The linear_logit_fn returned by _linear_logit_fn_builder is implemented in the tensorflow/python/estimator/canned/linear.py module:
linear_model = feature_column._LinearModel(
    feature_columns=feature_columns,
    units=units,
    sparse_combiner=sparse_combiner,
    name='linear_model')
logits = linear_model(features)
The _LinearModel in the code above is defined in the feature_column.py module. In fact, both the deep side and the wide side are nothing more than fully connected (FC) layers; the wide side is a single FC layer, while the deep side stacks several. However, the two implement "full connection" in very different ways:
The Wide side's input is typically huge and sparse, so the inputs extracted by the Feature Columns feeding the wide side are NOT concatenated into one big matrix, because that would destroy the sparsity of the input. Instead, each Feature Column is connected separately (implemented as a sparse matrix multiplication), and the per-column results are summed at the end.
As we will see shortly, the Deep side's inputs are all dense (categorical features must be mapped to dense vectors via embedding before entering the deep side), so the dense tensors extracted from the individual columns are concatenated into one big dense matrix and fed into the first FC layer.
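A toy sketch of the two styles of "full connection" (all shapes and data below are made up): the wide style multiplies one column's SparseTensor directly with that column's weight matrix and sums across columns, while the deep style concatenates dense column outputs and does a single dense matmul. Note that the real wide implementation uses safe_embedding_lookup_sparse rather than an explicit sparse-dense matmul, as shown later.

import tensorflow as tf

batch_size, units = 2, 1

# ----- wide style: per-column sparse matmul, then add up -----
# one categorical column with 5 buckets, multi-hot encoded as a SparseTensor
sp_input = tf.SparseTensor(indices=[[0, 0], [0, 3], [1, 2]],
                           values=[1.0, 1.0, 1.0],
                           dense_shape=[batch_size, 5])
w_col = tf.get_variable('w_col', shape=[5, units])
wide_out_col = tf.sparse_tensor_dense_matmul(sp_input, w_col)  # [batch, units]
# ...repeat for every other wide column, then tf.add_n over all of them

# ----- deep style: concat dense columns, then one dense matmul -----
dense_a = tf.random_normal([batch_size, 3])   # e.g. a numeric column
dense_b = tf.random_normal([batch_size, 8])   # e.g. an embedding_column output
net = tf.concat([dense_a, dense_b], axis=1)   # one big dense input, [batch, 11]
w_fc = tf.get_variable('w_fc', shape=[11, 16])
deep_out = tf.matmul(net, w_fc)               # first hidden layer (bias/activation omitted)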
Following this line of thought, the _LinearModel constructor builds one FC layer per column, plus one bias layer:
column_layers = {}
for column in sorted(self._feature_columns, key=lambda x: x.name):
  column_name = _strip_leading_slashes(vs.name)
  column_layer = _FCLinearWrapper(column, units, sparse_combiner,
                                  self._weight_collections, trainable,
                                  column_name, **kwargs)
  column_layers[column_name] = column_layer
self._column_layers = self._add_layers(column_layers)
self._bias_layer = _BiasLayer(units=units,
                              trainable=trainable,
                              weight_collections=self._weight_collections,
                              name='bias_layer',
                              **kwargs)
When called, it iterates over all feature columns: each feature column independently extracts its (usually sparse) input from the builder (i.e., the input wrapped with caching), applies its own single-layer full connection, and finally the per-column results are summed.
def call(self, features):
  weighted_sums = []
  ordered_columns = []
  # lazy wrapper of the original input
  builder = _LazyBuilder(features)
  # iterate over all feature columns; each extracts its data from the builder,
  # then applies one layer of connection
  for layer in sorted(self._column_layers.values(), key=lambda x: x.name):
    column = layer._feature_column
    ordered_columns.append(column)
    weighted_sum = layer(builder)
    weighted_sums.append(weighted_sum)
  # sum up the per-column connection results
  predictions_no_bias = math_ops.add_n(weighted_sums,
                                       name='weighted_sum_no_bias')
  predictions = nn_ops.bias_add(
      predictions_no_bias,
      self._bias_layer(builder, scope=variable_scope.get_variable_scope()),
      name='weighted_sum')
  return predictions
Next, let us look at how this fully connected layer, _FCLinearWrapper, is implemented. The class does little more than check whether its input is a CategoricalColumn or a NumericColumn:
If it is a NumericColumn, it calls matmul to do a dense matrix multiplication;
If it is a CategoricalColumn, it calls safe_embedding_lookup_sparse to multiply a sparse matrix (the input extracted by the CategoricalColumn) with a dense matrix (the fully connected weights, i.e., the variables being optimized).
class _FCLinearWrapper(base.Layer):
  def build(self, _):
    if isinstance(self._feature_column, _CategoricalColumn):
      weight = self.add_variable(
          name='weights',
          shape=(self._feature_column._num_buckets, self._units),
          initializer=init_ops.zeros_initializer(),
          trainable=self.trainable)
    else:
      num_elements = self._feature_column._variable_shape.num_elements()
      weight = self.add_variable(
          name='weights',
          shape=[num_elements, self._units],
          initializer=init_ops.zeros_initializer(),
          trainable=self.trainable)
    self._weight_var = weight
    self.built = True

  def call(self, builder):
    weighted_sum = _create_weighted_sum(
        column=self._feature_column,
        builder=builder,
        units=self._units,
        sparse_combiner=self._sparse_combiner,
        weight_collections=self._weight_collections,
        trainable=self.trainable,
        weight_var=self._weight_var)
    return weighted_sum


def _create_weighted_sum(column, builder, units, sparse_combiner,
                         weight_collections, trainable, weight_var=None):
  """Creates a weighted sum for a dense/categorical column for linear_model."""
  if isinstance(column, _CategoricalColumn):
    return _create_categorical_column_weighted_sum(
        column=column,
        builder=builder,
        units=units,
        sparse_combiner=sparse_combiner,
        weight_collections=weight_collections,
        trainable=trainable,
        weight_var=weight_var)
  else:
    return _create_dense_column_weighted_sum(
        column=column,
        builder=builder,
        units=units,
        weight_collections=weight_collections,
        trainable=trainable,
        weight_var=weight_var)
The code that multiplies the sparse input with the dense weights; note the call to safe_embedding_lookup_sparse:
def _create_categorical_column_weighted_sum(column, builder, units,
                                            sparse_combiner, weight_collections,
                                            trainable, weight_var=None):
  sparse_tensors = column._get_sparse_tensors(
      builder, weight_collections=weight_collections, trainable=trainable)
  id_tensor = sparse_ops.sparse_reshape(
      sparse_tensors.id_tensor,
      [array_ops.shape(sparse_tensors.id_tensor)[0], -1])
  weight_tensor = sparse_tensors.weight_tensor
  if weight_tensor is not None:
    weight_tensor = sparse_ops.sparse_reshape(
        weight_tensor, [array_ops.shape(weight_tensor)[0], -1])
  # The multiplication of the sparse input with the dense weight matrix is done
  # by safe_embedding_lookup_sparse: for each example, take its non-zero token
  # ids from id_tensor (duplicates allowed), fetch the corresponding rows of the
  # weight matrix, and combine them weighted by weight_tensor.
  # Contrary to what one might first assume, if some token appears multiple
  # times in a wide-side column, the extra weight it receives does not come from
  # weight_tensor (which is usually all ones) but from the repeated occurrences
  # of that token being accumulated.
  # (abridged here: `weight` is this column's FC weight variable, i.e. weight_var
  # created in _FCLinearWrapper.build)
  return embedding_ops.safe_embedding_lookup_sparse(
      weight,
      id_tensor,
      sparse_weights=weight_tensor,
      combiner=sparse_combiner,
      name='weighted_sum')
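A tiny sketch (toy numbers, using the public tf.nn.embedding_lookup_sparse instead of the internal safe_ variant) illustrating the point made in the comment above: with combiner='sum' and all sparse weights equal to 1, a token that appears twice in one example simply has its weight row added twice.

import tensorflow as tf

# weight matrix of a wide-side column with 4 buckets and units=1
weights = tf.constant([[1.0], [2.0], [3.0], [4.0]])

# example 0 contains token 2 twice; example 1 contains tokens 0 and 3 once each
sp_ids = tf.SparseTensor(indices=[[0, 0], [0, 1], [1, 0], [1, 1]],
                         values=tf.constant([2, 2, 0, 3], dtype=tf.int64),
                         dense_shape=[2, 2])

out = tf.nn.embedding_lookup_sparse(weights, sp_ids, None, combiner='sum')

with tf.Session() as sess:
    print(sess.run(out))  # [[6.], [5.]] -- row 2 is counted twice for example 0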
The code that multiplies the dense input with the dense weights:
def _create_dense_column_weighted_sum(column, builder, units,
                                      weight_collections, trainable,
                                      weight_var=None):
  tensor = column._get_dense_tensor(
      builder, weight_collections=weight_collections, trainable=trainable)
  num_elements = column._variable_shape.num_elements()  # pylint: disable=protected-access
  batch_size = array_ops.shape(tensor)[0]
  tensor = array_ops.reshape(tensor, shape=(batch_size, num_elements))
  # (abridged here: `weight` is this column's FC weight variable, i.e. weight_var)
  return math_ops.matmul(tensor, weight, name='weighted_sum')
That covers the Wide part; now let us look at the Deep part. dnn._dnn_logit_fn_builder simply invokes the _DNNModel class, which is defined in tensorflow/python/estimator/canned/dnn.py.
The first thing the _DNNModel constructor does is create an InputLayer:
self._input_layer = feature_column.InputLayer(
    feature_columns=feature_columns,
    name='input_layer',
    create_scope_now=False)
self._add_layer(self._input_layer, 'input_layer')
Because the Deep side only accepts dense feature columns, the job of InputLayer is to iterate over all feature columns connected to the Deep side, have each one extract a dense tensor from the input (which is why a categorical feature must be wrapped in an indicator_column or embedding_column before it can be fed to the Deep side), and then concatenate all of these dense tensors into one big dense tensor. The actual logic lives in the _internal_input_layer function of the feature_column.py module.
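A small sketch of what that wrapping looks like from the user side; the column names, vocabulary, and toy feature values below are made up, and tf.feature_column.input_layer is the public counterpart of InputLayer.

import tensorflow as tf

age = tf.feature_column.numeric_column('age')
city = tf.feature_column.categorical_column_with_vocabulary_list(
    'city', ['beijing', 'shanghai', 'shenzhen'])

# categorical columns must be wrapped before entering the deep side
deep_columns = [
    age,
    tf.feature_column.embedding_column(city, dimension=4),
    # or: tf.feature_column.indicator_column(city)
]

features = {'age': tf.constant([[25.0], [31.0]]),
            'city': tf.constant([['beijing'], ['shenzhen']])}

# each column yields a dense tensor; they are concatenated into one [batch, 1+4] tensor
net = tf.feature_column.input_layer(features, deep_columns)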
Next, the required fully connected layers are created one by one:
for layer_id, num_hidden_units in enumerate(hidden_units):
  # Note that none of the Dense layers here set any regularization; the default
  # (None) is used. Regularization is evidently not configured here. As the doc
  # says, to add L1/L2 regularization you have to go through the optimizer:
  # To apply L1 and L2 regularization, you can set dnn_optimizer to:
  # tf.train.ProximalAdagradOptimizer(learning_rate=0.1,
  #                                   l1_regularization_strength=0.001,
  #                                   l2_regularization_strength=0.001)
  hidden_layer = core_layers.Dense(
      units=num_hidden_units,
      activation=activation_fn,
      kernel_initializer=init_ops.glorot_uniform_initializer(),
      name=hidden_layer_scope,
      _scope=hidden_layer_scope)
  self._add_layer(hidden_layer, hidden_layer_scope.name)
  self._hidden_layers.append(hidden_layer)
  if self._dropout is not None:
    dropout_layer = core_layers.Dropout(rate=self._dropout)
    self._add_layer(dropout_layer, dropout_layer.name)
    self._dropout_layers.append(dropout_layer)
  if self._batch_norm:
    batch_norm_layer = normalization.BatchNormalization(
        momentum=0.999,
        trainable=True,
        name='batchnorm_%d' % layer_id,
        _scope='batchnorm_%d' % layer_id)
    self._add_layer(batch_norm_layer, batch_norm_layer.name)
    self._batch_norm_layers.append(batch_norm_layer)

self._logits_layer = core_layers.Dense(
    units=units,
    activation=None,
    kernel_initializer=init_ops.glorot_uniform_initializer(),
    name=logits_scope,
    _scope=logits_scope)
self._add_layer(self._logits_layer, logits_scope.name)
Then these layers are invoked one after another to wire up the whole network:
def call(self, features, mode):
  is_training = mode == model_fn.ModeKeys.TRAIN
  # each feature column extracts a dense tensor from the input features;
  # the results are concatenated into one big dense tensor
  net = self._input_layer(features)
  for i in range(len(self._hidden_layers)):
    net = self._hidden_layers[i](net)
    if self._dropout is not None and is_training:
      net = self._dropout_layers[i](net, training=True)
    if self._batch_norm:
      net = self._batch_norm_layers[i](net, training=is_training)
    _add_hidden_layer_summary(net, self._hidden_layer_scope_names[i])
  logits = self._logits_layer(net)
  _add_hidden_layer_summary(logits, self._logits_scope_name)
  return logits
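For comparison, here is a rough user-level sketch of the same stack written with plain tf.layers; the hidden sizes, dropout rate, and the is_training flag are placeholders, and this is just the equivalent wiring, not the estimator's own code.

import tensorflow as tf

def deep_tower(net, hidden_units, logits_dim, is_training,
               dropout=0.5, use_batch_norm=True):
    # mirrors _DNNModel.call: dense -> dropout -> batch norm for each hidden layer
    for i, units in enumerate(hidden_units):
        net = tf.layers.dense(net, units, activation=tf.nn.relu,
                              name='hiddenlayer_%d' % i)
        if dropout is not None:
            net = tf.layers.dropout(net, rate=dropout, training=is_training)
        if use_batch_norm:
            # when training for real, remember to run tf.GraphKeys.UPDATE_OPS
            net = tf.layers.batch_normalization(net, momentum=0.999,
                                                training=is_training,
                                                name='batchnorm_%d' % i)
    # final linear layer producing the logits
    return tf.layers.dense(net, logits_dim, activation=None, name='logits')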
This process takes dropout and batch normalization into account, but adds no L1/L2 regularization. According to the DNNLinearCombinedClassifier documentation, if you want L1/L2 regularization, you introduce it by setting the optimizer argument:
# To apply L1 and L2 regularization, you can set dnn_optimizer to:
tf.train.ProximalAdagradOptimizer(
    learning_rate=0.1,
    l1_regularization_strength=0.001,
    l2_regularization_strength=0.001)
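In other words, a brief sketch of how this plugs into the estimator (wide_columns and deep_columns stand for whatever feature columns you have already defined):

estimator = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[128, 64],
    dnn_optimizer=tf.train.ProximalAdagradOptimizer(
        learning_rate=0.1,
        l1_regularization_strength=0.001,
        l2_regularization_strength=0.001))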
That wraps up this walkthrough of the Wide & Deep code that ships with TensorFlow. To summarize:
The Wide & Deep model is simple to put into practice, but you only truly understand it once you have worked out these four questions:
Why deep?
Why wide?
Which features should be fed into the deep side?
Which features should be fed into the wide side?
The Wide & Deep code itself is fairly clean and not hard to follow. The highlight is learning how the Wide side handles high-dimensional, sparse categorical features, and then applying that trick in your own code. It was after learning this trick that I worked out how to implement a DeepFM that supports multi-valued, sparse features with shared weights; interested readers can follow the link above for details.
Recommended reading
1. An industrial-scale, hands-on walkthrough of how the TensorFlow source code implements the Wide & Deep model (Part 1)
2. An industrial, hands-on walkthrough of the Wide & Deep source code in TensorFlow: Feature Column
3. Clash of the titans! A review of the Deep Interest Evolution Network
4. A picture is worth a thousand words: a look at Alibaba's Deep Image CTR Model