An Industrial-Practice Walkthrough of the Wide & Deep Source Code in TensorFlow (Part 3)

Author: 石塔西
Source: https://zhuanlan.zhihu.com/p/47970601
Curated by: 深度传送门

Once you have a solid grasp of Feature Column, the Wide & Deep code becomes quite clear. I will walk through it from the whole to the parts: the overall flow first, then the Wide side, then the Deep side. For reasons of space, please download the TensorFlow source code and read it side by side with this article.

The Overall Flow

TensorFlow's Wide & Deep implementation is called tf.estimator.DNNLinearCombinedClassifier and lives in tensorflow/python/estimator/canned/dnn_linear_combined.py.

The class itself does little of the heavy lifting; it delegates the actual network construction elsewhere. Its constructor, however, contains the following code:

if n_classes == 2:
  head = head_lib._binary_logistic_head_with_sigmoid_cross_entropy_loss(
      weight_column=weight_column,
      label_vocabulary=label_vocabulary,
      loss_reduction=loss_reduction)
else:
  head = head_lib._multi_class_head_with_softmax_cross_entropy_loss(
      n_classes,
      weight_column=weight_column,
      label_vocabulary=label_vocabulary,
      loss_reduction=loss_reduction)
  • The so-called head is a helper class. For instance, binary classification always uses binary cross-entropy loss and usually monitors AUC, while regression always uses mean squared error. For such common tasks, you would otherwise re-implement the same routine code in every estimator's model_fn, which is tedious, so TensorFlow factors this frequently repeated boilerplate out into heads. Since this is auxiliary code that does not affect our understanding of the Wide & Deep implementation, I will not expand on it here.

  • As you can see, the TensorFlow implementation supports both binary and multi-class classification.

  • It also supports giving different training examples different weights (by designating some column of the input as weight_column). The weight is multiplied into that example's loss, and therefore also into that example's gradient. A minimal usage sketch follows this list.
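As a rough, user-facing sketch of the options discussed above (the feature names and hyperparameters are invented for illustration; only n_classes, weight_column, and label_vocabulary correspond to the constructor arguments just mentioned):

import tensorflow as tf

# hypothetical columns, for illustration only
age = tf.feature_column.numeric_column('age')
gender = tf.feature_column.categorical_column_with_vocabulary_list(
    'gender', ['male', 'female'])

estimator = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=[gender],                            # wide side
    dnn_feature_columns=[
        age,
        tf.feature_column.embedding_column(gender, dimension=4)],  # deep side
    dnn_hidden_units=[32, 16],
    n_classes=2,                    # binary -> sigmoid cross-entropy head
    weight_column='sample_weight',  # per-example weight, multiplied into the loss
    label_vocabulary=['neg', 'pos'])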

The model itself is actually built in _dnn_linear_combined_model_fn.

The Deep side is built first:

dnn_logit_fn = dnn._dnn_logit_fn_builder(
    units=head.logits_dimension,
    hidden_units=dnn_hidden_units,
    feature_columns=dnn_feature_columns,
    activation_fn=dnn_activation_fn,
    dropout=dnn_dropout,
    input_layer_partitioner=input_layer_partitioner,
    batch_norm=batch_norm)
dnn_logits = dnn_logit_fn(features=features, mode=mode)

Then the Wide side:

logit_fn = linear._linear_logit_fn_builder(
    units=head.logits_dimension,
    feature_columns=linear_feature_columns,
    sparse_combiner=linear_sparse_combiner)
linear_logits = logit_fn(features=features)

The final logits are simply the sum of the two:

if dnn_logits is not None and linear_logits is not None:
  logits = dnn_logits + linear_logits
elif dnn_logits is not None:
  logits = dnn_logits
else:
  logits = linear_logits

dnn_logits and linear_logits must have the same dimension, which is determined by head.logits_dimension: 1 for binary classification, and the number of classes for multi-class classification.

How can the Deep side be trained with Adagrad while the Wide side is trained with FTRL at the same time? The answer is to group the two training ops:

train_ops = []
if dnn_logits is not None:
  train_ops.append(
      dnn_optimizer.minimize(
          loss,
          var_list=ops.get_collection(
              ops.GraphKeys.TRAINABLE_VARIABLES,
              scope=dnn_absolute_scope)))
if linear_logits is not None:
  train_ops.append(
      linear_optimizer.minimize(
          loss,
          var_list=ops.get_collection(
              ops.GraphKeys.TRAINABLE_VARIABLES,
              scope=linear_absolute_scope)))
train_op = control_flow_ops.group(*train_ops)
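Stripped of the estimator plumbing, the grouping idea itself is only a few lines. The following is merely a sketch under the assumption that a loss tensor exists and that the two sub-networks live under 'dnn' and 'linear' variable scopes; it is not the estimator's actual code:

# one loss, two optimizers, each restricted to its own variables
deep_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='dnn')
wide_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='linear')

train_op = tf.group(
    tf.train.AdagradOptimizer(0.05).minimize(loss, var_list=deep_vars),
    tf.train.FtrlOptimizer(0.1).minimize(loss, var_list=wide_vars))
# running train_op executes both minimize ops in a single session step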

Finally, the EstimatorSpec is built through the head:

head.create_estimator_spec(
    features=features,
    mode=mode,
    labels=labels,
    train_op_fn=_train_op_fn,
    logits=logits)

The Wide Side

Having covered the overall flow, let us zoom in on the Wide side. The Deep side mostly handles dense inputs and holds little of technical interest; the Wide side handles high-dimensional, sparse inputs, and that is where the attention should go.

_linear_logit_fn_builder.linear_logit_fn is implemented in the tensorflow/python/estimator/canned/linear.py module.

linear_model = feature_column._LinearModel(
    feature_columns=feature_columns,
    units=units,
    sparse_combiner=sparse_combiner,
    name='linear_model')
logits = linear_model(features)

_LinearModel in the code above is defined in the feature_column.py module. In essence, both the deep side and the wide side are nothing but fully connected (FC) layers; the wide side is a single FC layer, whereas the deep side stacks several. The two sides implement this "full connection" very differently, though:

  • The wide side's input is typically huge and sparse, so the inputs extracted by the feature columns feeding the wide side are not concatenated into one big matrix, which would destroy their sparsity. Instead, each feature column is connected on its own (implemented as a sparse-times-dense matrix multiplication), and the per-column results are summed at the end.

  • As we will see later, the deep side's inputs are all dense (categorical features must be mapped to dense vectors via embeddings before they reach the deep side), so the dense tensors extracted from each column are concatenated into one big dense matrix and fed into the first FC layer. A small sketch after this list illustrates that the two forms compute the same thing.
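The algebra behind both bullets is the same single FC layer; only the way the multiplication is carried out differs. A small NumPy sketch of that equivalence (dense arrays are used here purely to make the algebra visible; on the wide side the per-column products are really sparse lookups):

import numpy as np

batch = 4
x_a, x_b = np.random.rand(batch, 3), np.random.rand(batch, 5)  # two feature columns
w_a, w_b = np.random.rand(3, 1), np.random.rand(5, 1)          # per-column weights

# wide style: multiply each column separately, then add the results ...
per_column = x_a @ w_a + x_b @ w_b
# ... which equals concatenating the columns and using one big weight matrix
concatenated = np.concatenate([x_a, x_b], axis=1) @ np.concatenate([w_a, w_b], axis=0)

assert np.allclose(per_column, concatenated)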

Following this idea, the constructor of _LinearModel builds one FC layer per column, plus a bias layer:

column_layers = {}
for column in sorted(self._feature_columns, key=lambda x: x.name):
  column_name = _strip_leading_slashes(vs.name)
  column_layer = _FCLinearWrapper(column, units, sparse_combiner,
                                  self._weight_collections, trainable,
                                  column_name, **kwargs)
  column_layers[column_name] = column_layer
self._column_layers = self._add_layers(column_layers)
self._bias_layer = _BiasLayer(
    units=units,
    trainable=trainable,
    weight_collections=self._weight_collections,
    name='bias_layer',
    **kwargs)

When called, it iterates over all feature columns; each feature column extracts its (usually sparse) data from the builder (i.e., the input wrapped with a cache), applies one FC layer, and finally the per-column FC outputs are summed:

def call(self, features):
  weighted_sums = []
  ordered_columns = []
  # lazy wrapper of the original input
  builder = _LazyBuilder(features)
  # iterate over all feature columns; each column extracts its data from the
  # builder and applies one FC layer
  for layer in sorted(self._column_layers.values(), key=lambda x: x.name):
    column = layer._feature_column
    ordered_columns.append(column)
    weighted_sum = layer(builder)
    weighted_sums.append(weighted_sum)
  # sum the per-column FC outputs
  predictions_no_bias = math_ops.add_n(
      weighted_sums, name='weighted_sum_no_bias')
  predictions = nn_ops.bias_add(
      predictions_no_bias,
      self._bias_layer(builder, scope=variable_scope.get_variable_scope()),
      name='weighted_sum')
  return predictions

Next, let us see how this per-column FC layer, _FCLinearWrapper, is implemented. The class does little more than check whether its column is a CategoricalColumn or a NumericColumn:

  • If it is a NumericColumn, it calls matmul to perform a dense matrix multiplication.

  • If it is a CategoricalColumn, it calls safe_embedding_lookup_sparse to multiply the sparse matrix (the input extracted by the CategoricalColumn) with the dense matrix (the fully connected weights, i.e., the variables being optimized).

class _FCLinearWrapper(base.Layer):

  def build(self, _):
    if isinstance(self._feature_column, _CategoricalColumn):
      weight = self.add_variable(
          name='weights',
          shape=(self._feature_column._num_buckets, self._units),
          initializer=init_ops.zeros_initializer(),
          trainable=self.trainable)
    else:
      num_elements = self._feature_column._variable_shape.num_elements()
      weight = self.add_variable(
          name='weights',
          shape=[num_elements, self._units],
          initializer=init_ops.zeros_initializer(),
          trainable=self.trainable)
    self._weight_var = weight
    self.built = True

  def call(self, builder):
    weighted_sum = _create_weighted_sum(
        column=self._feature_column,
        builder=builder,
        units=self._units,
        sparse_combiner=self._sparse_combiner,
        weight_collections=self._weight_collections,
        trainable=self.trainable,
        weight_var=self._weight_var)
    return weighted_sum


def _create_weighted_sum(column, builder, units, sparse_combiner,
                         weight_collections, trainable, weight_var=None):
  """Creates a weighted sum for a dense/categorical column for linear_model."""
  if isinstance(column, _CategoricalColumn):
    return _create_categorical_column_weighted_sum(
        column=column,
        builder=builder,
        units=units,
        sparse_combiner=sparse_combiner,
        weight_collections=weight_collections,
        trainable=trainable,
        weight_var=weight_var)
  else:
    return _create_dense_column_weighted_sum(
        column=column,
        builder=builder,
        units=units,
        weight_collections=weight_collections,
        trainable=trainable,
        weight_var=weight_var)

The code that multiplies the sparse input with the dense weights; note the call to safe_embedding_lookup_sparse:

def _create_categorical_column_weighted_sum(column, builder, units,
                                            sparse_combiner, weight_collections,
                                            trainable, weight_var=None):
  sparse_tensors = column._get_sparse_tensors(
      builder,
      weight_collections=weight_collections,
      trainable=trainable)
  id_tensor = sparse_ops.sparse_reshape(sparse_tensors.id_tensor, [
      array_ops.shape(sparse_tensors.id_tensor)[0], -1])
  weight_tensor = sparse_tensors.weight_tensor
  if weight_tensor is not None:
    weight_tensor = sparse_ops.sparse_reshape(
        weight_tensor, [array_ops.shape(weight_tensor)[0], -1])
  # The multiplication of the sparse input with the dense weight matrix is done
  # by safe_embedding_lookup_sparse: id_tensor supplies each example's non-zero
  # token ids (duplicates allowed); those ids select rows of `weight`, which are
  # then combined, weighted by weight_tensor.
  # Note, contrary to what one might first assume: when a token appears several
  # times in a wide-side column, the extra weighting does not come from
  # weight_tensor (those weights are all 1), but from that token's row being
  # accumulated repeatedly.
  return embedding_ops.safe_embedding_lookup_sparse(
      weight,
      id_tensor,
      sparse_weights=weight_tensor,
      combiner=sparse_combiner,
      name='weighted_sum')
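A toy sketch of what this lookup does (using tf.nn.embedding_lookup_sparse, the core routine that the safe_ variant wraps after pruning invalid ids; the weights and ids below are made up): with combiner='sum' and no sparse weights, a token id that occurs twice in an example simply has its weight row added twice, which is exactly the accumulation described in the comment above.

import tensorflow as tf

# a (5 x 1) weight matrix: one scalar weight per token id
weights = tf.constant([[0.1], [0.2], [0.3], [0.4], [0.5]])

# two examples; the second contains token id 3 twice
ids = tf.SparseTensor(
    indices=[[0, 0], [1, 0], [1, 1]],
    values=tf.constant([1, 3, 3], dtype=tf.int64),
    dense_shape=[2, 2])

# sparse input x dense weights, combiner='sum' as on the wide side
logits = tf.nn.embedding_lookup_sparse(weights, ids, sp_weights=None, combiner='sum')

with tf.Session() as sess:
    print(sess.run(logits))  # approx. [[0.2], [0.8]]: the row for id 3 is counted twice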

The code that multiplies a dense input with the dense weights:

def _create_dense_column_weighted_sum(column, builder, units,
                                      weight_collections, trainable,
                                      weight_var=None):
  tensor = column._get_dense_tensor(
      builder,
      weight_collections=weight_collections,
      trainable=trainable)
  num_elements = column._variable_shape.num_elements()  # pylint: disable=protected-access
  batch_size = array_ops.shape(tensor)[0]
  tensor = array_ops.reshape(tensor, shape=(batch_size, num_elements))
  return math_ops.matmul(tensor, weight, name='weighted_sum')

The Deep Side

Having covered the Wide side, let us turn to the Deep side. dnn._dnn_logit_fn_builder directly invokes the _DNNModel class, which is defined in tensorflow/python/estimator/canned/dnn.py.

The first thing the _DNNModel constructor does is create an InputLayer:

self._input_layer = feature_column.InputLayer(
    feature_columns=feature_columns,
    name='input_layer',
    create_scope_now=False)
self._add_layer(self._input_layer, 'input_layer')

Because the Deep side accepts only dense feature columns, the job of InputLayer is to iterate over all the feature columns connected to the Deep side, let each one extract a dense tensor from the input (which is why a categorical feature must be wrapped in an indicator_column or embedding_column before it can feed the Deep side), and then concatenate all the extracted dense tensors into one big dense tensor. The concrete logic lives in the _internal_input_layer function of the feature_column.py module.
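A minimal sketch of that rule on the user side (the columns and feature values are invented; tf.feature_column.input_layer is the public API that goes through the same _internal_input_layer logic):

import tensorflow as tf

gender = tf.feature_column.categorical_column_with_vocabulary_list(
    'gender', ['male', 'female'])

deep_columns = [
    tf.feature_column.numeric_column('age'),                 # already dense
    tf.feature_column.embedding_column(gender, dimension=4)  # categorical -> dense
]

features = {'age': tf.constant([[25.0], [32.0]]),
            'gender': tf.constant([['male'], ['female']])}

# each column yields a dense tensor; they are concatenated into a single
# (batch_size, 1 + 4) dense input for the first hidden layer
net = tf.feature_column.input_layer(features, deep_columns)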

Next, the required fully connected layers are created one by one:

for layer_id, num_hidden_units in enumerate(hidden_units):
  # Note that none of the Dense layers set any regularization here; the default
  # (None) is used. As the doc says, to apply L1/L2 regularization you have to
  # configure the optimizer instead, e.g. set dnn_optimizer to:
  #   tf.train.ProximalAdagradOptimizer(learning_rate=0.1,
  #                                     l1_regularization_strength=0.001,
  #                                     l2_regularization_strength=0.001)
  hidden_layer = core_layers.Dense(
      units=num_hidden_units,
      activation=activation_fn,
      kernel_initializer=init_ops.glorot_uniform_initializer(),
      name=hidden_layer_scope,
      _scope=hidden_layer_scope)
  self._add_layer(hidden_layer, hidden_layer_scope.name)
  self._hidden_layers.append(hidden_layer)

  if self._dropout is not None:
    dropout_layer = core_layers.Dropout(rate=self._dropout)
    self._add_layer(dropout_layer, dropout_layer.name)
    self._dropout_layers.append(dropout_layer)

  if self._batch_norm:
    batch_norm_layer = normalization.BatchNormalization(
        momentum=0.999,
        trainable=True,
        name='batchnorm_%d' % layer_id,
        _scope='batchnorm_%d' % layer_id)
    self._add_layer(batch_norm_layer, batch_norm_layer.name)
    self._batch_norm_layers.append(batch_norm_layer)

self._logits_layer = core_layers.Dense(
    units=units,
    activation=None,
    kernel_initializer=init_ops.glorot_uniform_initializer(),
    name=logits_scope,
    _scope=logits_scope)
self._add_layer(self._logits_layer, logits_scope.name)

Then, in call, these layers are invoked in order to assemble the whole network:

def call(self, features, mode):
  is_training = mode == model_fn.ModeKeys.TRAIN
  # each feature column extracts a dense tensor from the input features;
  # the dense tensors are then concatenated into one big dense tensor
  net = self._input_layer(features)
  for i in range(len(self._hidden_layers)):
    net = self._hidden_layers[i](net)
    if self._dropout is not None and is_training:
      net = self._dropout_layers[i](net, training=True)
    if self._batch_norm:
      net = self._batch_norm_layers[i](net, training=is_training)
    _add_hidden_layer_summary(net, self._hidden_layer_scope_names[i])
  logits = self._logits_layer(net)
  _add_hidden_layer_summary(logits, self._logits_scope_name)
  return logits

Throughout this process, dropout and batch normalization are taken into account, but no L1/L2 regularization is added. According to the DNNLinearCombinedClassifier documentation, if you want L1/L2 regularization, you configure it through the optimizer argument:

# To apply L1 and L2 regularization, you can set dnn_optimizer to:
tf.train.ProximalAdagradOptimizer(
    learning_rate=0.1,
    l1_regularization_strength=0.001,
    l2_regularization_strength=0.001)
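In user code, regularizing the deep side therefore looks roughly like this (a sketch; wide_columns and deep_columns are assumed to be defined elsewhere, and the wide side's default FTRL optimizer exposes the same l1/l2 strength arguments on tf.train.FtrlOptimizer):

estimator = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_columns,   # assumed defined elsewhere
    dnn_feature_columns=deep_columns,      # assumed defined elsewhere
    dnn_hidden_units=[128, 64],
    dnn_optimizer=tf.train.ProximalAdagradOptimizer(
        learning_rate=0.1,
        l1_regularization_strength=0.001,
        l2_regularization_strength=0.001))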

Summary

This completes the walkthrough of TensorFlow's built-in Wide & Deep code. To sum up:

  • The Wide & Deep model is simple to put into practice, but you only truly understand it once you can answer these four questions:

    1. Why deep?

    2. Why wide?

    3. What kind of features should be fed into the deep side?

    4. What kind of features should be fed into the wide side?

  • The Wide & Deep code itself is fairly clean and not hard to follow. The highlight is learning how the Wide side handles high-dimensional, sparse categorical features, and applying that technique in your own code. It was after studying this technique that I implemented a DeepFM supporting multi-valued, sparse features with shared weights; interested readers can follow the link above for details.

Recommended Reading

1. An industrial-scale practice walkthrough of how the TensorFlow source code implements the Wide & Deep model (Part 1)

2. An industrial-practice walkthrough of the Wide & Deep source code in TensorFlow: Feature Column

3. Clash of the titans! Another take on the Deep Interest Evolution Network

4. A picture is worth a thousand words: interpreting Alibaba's Deep Image CTR Model

About 深度传送门

深度传送门 is a community focused on deep learning for recommender systems and CTR prediction. It shares first-hand industry papers, resources, and technical know-how in recommendation, advertising, NLP, and related fields. You are welcome to follow it; to join the technical discussion group, add the assistant account deepdeliver and note your name, school/company, and area of interest.
