TensorFlow Notes (2): Updating Variables with Loss and Gradients

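Background: so far I have only used TensorFlow through the training code that ships with downloaded models. Here we look inside the model to see how TensorFlow actually carries out a training step.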

References: http://wiki.jikexueyuan.com/project/tensorflow-zh/api_docs/python/train.html

https://www.cnblogs.com/wuzhitj/p/6648641.html

Contents

1. Overview of Optimizers

2. Computing and Applying Gradients

2.1 Ways to compute and apply gradients

2.2 Functions for computing and applying gradients

tf.train.Optimizer.__init__

tf.train.Optimizer.minimize

tf.train.Optimizer.compute_gradients

tf.train.Optimizer.apply_gradients

2.3 Gating Gradients: parallelism when applying gradients

2.4 Slots: extra training variables

2.5 Optimizers

3. Gradient Computation

3.1 Computing gradients

tf.gradients: sums of partial derivatives

class tf.AggregationMethod

3.2 Gradient Clipping

tf.clip_by_value: clipping by value

Other functions

4. Decaying the Learning Rate

5. Moving Averages

6. Coordinator and QueueRunner; Distributed Execution (Summary Operations)

7. Adding Summaries to Event Files

8. Training Utilities

tf.train.global_step(sess, global_step_tensor)

tf.train.write_graph(graph_def, logdir, name, as_text=True)


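1. Overview of Optimizers

An optimizer computes the gradients of a loss and then applies those gradients to variables.

class tf.train.Optimizer

This class defines the API and Ops used to train a model. You never use this class directly; instead you instantiate one of its subclasses, such as GradientDescentOptimizer, AdagradOptimizer, or MomentumOptimizer.

For example: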

# Create an optimizer with the desired parameters.
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
# Add Ops to the graph to minimize a cost by updating a list of variables.
# "cost" is a Tensor, and the list of variables contains variables.Variable
# objects.
opt_op = opt.minimize(cost, var_list=<list of variables>)

During training, run the returned op:

# Execute opt_op to do one step of training:
opt_op.run()

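2. Computing and Applying Gradients

2.1 Ways to compute and apply gradients

In one step:

minimize() both computes the gradients and applies them to the variables. It simply combines the two step-wise functions compute_gradients() and apply_gradients().

Step by step: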

  1. Compute the gradients with compute_gradients().
  2. Process the gradients as you wish.
  3. Apply the processed gradients with apply_gradients().

For example:

# Create an optimizer.
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)

# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)

# grads_and_vars is a list of tuples (gradient, variable).  Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]

# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)

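Note that capped_grads_and_vars is a list of (gradient, variable) tuples, not a single tuple. Tuples are immutable, which is why the list comprehension builds a new tuple for each pair instead of modifying the gradient in place.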
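The three steps above can be mimicked without TensorFlow. The following is a minimal plain-Python sketch, assuming a single scalar variable w and the loss (w - 3)**2. The helper names compute_gradients, cap, and apply_gradients are hypothetical stand-ins chosen to mirror the optimizer methods; they are not the library's implementation.

```python
# Toy illustration of compute -> process -> apply for the loss (w - 3)**2.
# All names here are hypothetical; this is not TensorFlow code.

def compute_gradients(variables):
    """Return a list of (gradient, name) pairs for loss = (w - 3)**2."""
    return [(2.0 * (value - 3.0), name) for name, value in variables.items()]

def cap(gradient, limit=1.0):
    """Clamp a gradient to the range [-limit, limit]."""
    return max(-limit, min(limit, gradient))

def apply_gradients(variables, grads_and_vars, learning_rate=0.1):
    """Take one gradient descent step, updating the dict in place."""
    for gradient, name in grads_and_vars:
        variables[name] -= learning_rate * gradient

variables = {"w": 10.0}
grads_and_vars = compute_gradients(variables)             # step 1: compute
capped = [(cap(g), name) for g, name in grads_and_vars]   # step 2: process
apply_gradients(variables, capped)                        # step 3: apply
print(grads_and_vars, capped, variables)
```

Because the raw gradient (14.0) exceeds the cap, the update uses the capped value, so w moves by learning_rate * 1.0 rather than learning_rate * 14.0.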

2.2 Functions for computing and applying gradients

tf.train.Optimizer.__init__

tf.train.Optimizer.__init__(use_locking, name)

Create a new Optimizer. This must be called by the constructors of subclasses.

Args:

  • use_locking: Bool. If True, use locks to prevent concurrent updates to variables.
  • name: A non-empty string. The name to use for accumulators created for the optimizer.

tf.train.Optimizer.minimize

tf.train.Optimizer.minimize(loss, global_step=None, var_list=None, gate_gradients=1, name=None)

Add operations to minimize 'loss' by updating 'var_list'. This is the one-step approach described above: a single call uses the loss to update the corresponding variables.

This method simply combines compute_gradients() and apply_gradients(). If you want to process the gradients before applying them, call compute_gradients() and apply_gradients() separately.

Args:

  • loss: A Tensor containing the value to minimize.
  • global_step: Optional Variable to increment by one after the variables have been updated.
  • var_list: Optional list of variables.Variable to update to minimize 'loss'. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.
  • gate_gradients: How to gate the computation of gradients. Can be GATE_NONE, GATE_OP, or GATE_GRAPH. (This only concerns parallelism; see 2.3.)
  • name: Optional name for the returned operation.

Returns:

An Operation that updates the variables in 'var_list'. If 'global_step' was not None, that operation also increments global_step.

tf.train.Optimizer.compute_gradients

tf.train.Optimizer.compute_gradients(loss, var_list=None, gate_gradients=1)

Compute gradients of "loss" for the variables in "var_list".

This is the first part of minimize(). It returns a list of (gradient, variable) pairs where "gradient" is the gradient for "variable". Note that "gradient" can be a Tensor, an IndexedSlices, or None if there is no gradient for the given variable.

Args:

  • loss: A Tensor containing the value to minimize.
  • var_list: Optional list of variables.Variable to update to minimize "loss". Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.
  • gate_gradients: How to gate the computation of gradients. Can be GATE_NONE, GATE_OP, or GATE_GRAPH.

Returns:

A list of (gradient, variable) pairs.

Raises:

  • TypeError: If var_list contains anything else than variables.Variable.
  • ValueError: If some arguments are invalid.

tf.train.Optimizer.apply_gradients

tf.train.Optimizer.apply_gradients(grads_and_vars, global_step=None, name=None)

Apply gradients to variables.

This is the second part of minimize(). It returns an Operation that applies gradients.

Args:

  • grads_and_vars: List of (gradient, variable) pairs as returned by compute_gradients().
  • global_step: Optional Variable to increment by one after the variables have been updated.
  • name: Optional name for the returned operation. Default to the name passed to the Optimizer constructor.

Returns:

An Operation that applies the specified gradients. If 'global_step' was not None, that operation also increments global_step.

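2.3 Gating Gradients: parallelism when applying gradients

The reasoning behind this part probably involves the backpropagation algorithm, so we will not dig into it for now.

gate_gradients can take the values GATE_NONE, GATE_OP, or GATE_GRAPH.
GATE_NONE: Compute and apply gradients in parallel. This gives maximum parallelism, at the cost of some non-reproducibility in the results. For example, the two gradients of a matmul depend on the input values; with GATE_NONE one of the gradients could be applied to one of the inputs before the other gradient is computed, which leads to non-reproducible results.
GATE_OP: For each Op, make sure all of its gradients are computed before any of them is used. This prevents race conditions for Ops that generate gradients for multiple inputs where the gradients depend on those inputs.
GATE_GRAPH: Make sure all gradients for all variables are computed before any one of them is used. This gives the least parallelism, but it is useful if you want to process all gradients before applying any of them.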

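2.4 Slots: extra training variables

Some optimizer subclasses, such as MomentumOptimizer and AdagradOptimizer, allocate and manage additional variables associated with the variables being trained. These are called slots. Slots have names, and you can ask the optimizer for the names of the slots it uses. This helps when you want to log or debug a training algorithm or report the state of the slots.

tf.train.Optimizer.get_slot_names(): get the slot names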

Return a list of the names of slots created by the Optimizer.

See get_slot().

Returns:

A list of strings


tf.train.Optimizer.get_slot(var, name): get the variable held in a slot

Return a slot named "name" created for "var" by the Optimizer.

Some Optimizer subclasses use additional variables. For example Momentum and Adagrad use variables to accumulate updates. This method gives access to these Variables if for some reason you need them.

Use get_slot_names() to get the list of slot names created by the Optimizer.

Args:

  • var: A variable passed to minimize() or apply_gradients().
  • name: A string.

Returns:

The Variable for the slot if it was created, None otherwise.
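To make the idea of a slot concrete, here is a minimal plain-Python sketch of a momentum-style optimizer that keeps one accumulator per variable in a slot named "momentum". The class ToyMomentumOptimizer and its dictionary-based storage are assumptions made for illustration; only the method names get_slot_names and get_slot are borrowed from the API described above.

```python
# Toy momentum optimizer that keeps one "momentum" slot per variable.
# Hypothetical class for illustration; not the TensorFlow implementation.

class ToyMomentumOptimizer:
    def __init__(self, learning_rate, momentum):
        self.learning_rate = learning_rate
        self.momentum = momentum
        # slot name -> {variable name -> accumulator value}
        self._slots = {"momentum": {}}

    def get_slot_names(self):
        """Return the names of the slots this optimizer maintains."""
        return sorted(self._slots)

    def get_slot(self, var_name, slot_name):
        """Return the slot value for a variable, or None if none was created."""
        return self._slots.get(slot_name, {}).get(var_name)

    def apply_gradients(self, variables, grads_and_vars):
        """accum = momentum * accum + grad; var -= learning_rate * accum."""
        for gradient, name in grads_and_vars:
            previous = self._slots["momentum"].get(name, 0.0)
            accum = self.momentum * previous + gradient
            self._slots["momentum"][name] = accum
            variables[name] -= self.learning_rate * accum

variables = {"w": 1.0}
opt = ToyMomentumOptimizer(learning_rate=0.5, momentum=0.5)
opt.apply_gradients(variables, [(2.0, "w")])
opt.apply_gradients(variables, [(2.0, "w")])
print(opt.get_slot_names(), opt.get_slot("w", "momentum"), variables)
```

Inspecting the slot after each step, as the print call does, is the kind of logging and debugging use the section describes.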

2.5 Optimizers

For details see http://wiki.jikexueyuan.com/project/tensorflow-zh/api_docs/python/train.html

class tf.train.GradientDescentOptimizer: Optimizer that uses the gradient descent algorithm.
tf.train.GradientDescentOptimizer.__init__(learning_rate, use_locking=False, name='GradientDescent'): constructs a new gradient descent optimizer.

class tf.train.AdadeltaOptimizer: Optimizer that uses the Adadelta algorithm.
tf.train.AdadeltaOptimizer.__init__(learning_rate=0.001, rho=0.95, epsilon=1e-08, use_locking=False, name='Adadelta'): constructs an Adadelta optimizer.

class tf.train.AdagradOptimizer: Optimizer that uses the Adagrad algorithm.
tf.train.AdagradOptimizer.__init__(learning_rate, initial_accumulator_value=0.1, use_locking=False, name='Adagrad'): constructs an Adagrad optimizer.

class tf.train.MomentumOptimizer: Optimizer that uses the Momentum algorithm.
tf.train.MomentumOptimizer.__init__(learning_rate, momentum, use_locking=False, name='Momentum', use_nesterov=False): constructs a Momentum optimizer. momentum is the momentum, given as a Tensor or a floating point value.

class tf.train.AdamOptimizer: Optimizer that uses the Adam algorithm.
tf.train.AdamOptimizer.__init__(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='Adam'): constructs an Adam optimizer.

class tf.train.FtrlOptimizer: Optimizer that uses the FTRL algorithm.
tf.train.FtrlOptimizer.__init__(learning_rate, learning_rate_power=-0.5, initial_accumulator_value=0.1, l1_regularization_strength=0.0, l2_regularization_strength=0.0, use_locking=False, name='Ftrl'): constructs an FTRL optimizer.

class tf.train.RMSPropOptimizer: Optimizer that uses the RMSProp algorithm.
tf.train.RMSPropOptimizer.__init__(learning_rate, decay=0.9, momentum=0.0, epsilon=1e-10, use_locking=False, name='RMSProp'): constructs an RMSProp optimizer.

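3. Gradient Computation

Lower-level functions give you more direct control over gradients.

3.1 Computing gradients

tf.gradients: sums of partial derivatives

Note that tf.train.Optimizer.compute_gradients above and tf.gradients here are different: compute_gradients returns a list of (gradient, variable) pairs, whereas tf.gradients returns a list of tensors, one per x in xs.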

tf.gradients(ys, xs, grad_ys=None, name='gradients', colocate_gradients_with_ops=False, gate_gradients=False, aggregation_method=None)

Constructs symbolic partial derivatives of ys w.r.t. x in xs. The result is the sum of the partial derivatives of all ys with respect to each x, that is, sum(dy/dx) for each x in xs.

ys and xs are each a Tensor or a list of tensors. grad_ys is a list of Tensor, holding the gradients received by the ys. The list must be the same length as ys.

gradients() adds ops to the graph to output the partial derivatives of ys with respect to xs. It returns a list of Tensor of length len(xs) where each tensor is the sum(dy/dx) for y in ys.

grad_ys is a list of tensors of the same length as ys that holds the initial gradients for each y in ys. When grad_ys is None, we fill in a tensor of '1's of the shape of y for each y in ys. A user can provide their own initial 'grad_ys' to compute the derivatives using a different initial gradient for each y (e.g., if one wanted to weight the gradient differently for each value in each y).

Args:

  • ys: A Tensor or list of tensors to be differentiated.
  • xs: A Tensor or list of tensors to be used for differentiation.
  • grad_ys: Optional. A Tensor or list of tensors the same size as ys and holding the gradients computed for each y in ys.
  • name: Optional name to use for grouping all the gradient ops together. defaults to 'gradients'.
  • colocate_gradients_with_ops: If True, try colocating gradients with the corresponding op.
  • gate_gradients: If True, add a tuple around the gradients returned for an operation. This avoids some race conditions.
  • aggregation_method: Specifies the method used to combine gradient terms. Accepted values are constants defined in the class AggregationMethod.

Returns:

A list of sum(dy/dx) for each x in xs.

Raises:

  • LookupError: if one of the operations between x and y does not have a registered gradient function.
  • ValueError: if the arguments are invalid.
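The "sum(dy/dx)" behaviour and the role of grad_ys can be checked numerically. The sketch below is plain Python rather than TensorFlow: it assumes scalar functions of one variable and uses a central-difference estimate. The function name summed_gradient is hypothetical.

```python
# Numerical illustration of the sum(dy/dx) semantics; not TensorFlow code.

def summed_gradient(ys, x, grad_ys=None, eps=1e-6):
    """Central-difference estimate of sum_i grad_ys[i] * d(ys[i])/dx at x."""
    if grad_ys is None:
        grad_ys = [1.0] * len(ys)  # default: an initial gradient of 1 per y
    total = 0.0
    for weight, y in zip(grad_ys, ys):
        total += weight * (y(x + eps) - y(x - eps)) / (2.0 * eps)
    return total

ys = [lambda x: x * x, lambda x: 3.0 * x]  # y0 = x**2, y1 = 3*x

# With the default weights the two derivatives are simply added: 2*x + 3.
unweighted = summed_gradient(ys, 2.0)

# Custom initial gradients weight each y differently: 0.5 * 2*x + 2.0 * 3.
weighted = summed_gradient(ys, 2.0, grad_ys=[0.5, 2.0])
print(unweighted, weighted)
```

At x = 2 the first call is close to 7 and the second close to 8, which matches the description of grad_ys as a per-y weighting of the gradient.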

class tf.AggregationMethod

A class listing aggregation methods used to combine gradients.

Computing partial derivatives can require aggregating gradient contributions. This class lists the various methods that can be used to combine gradients in the graph:

  • ADD_N: All of the gradient terms are summed as part of one operation using the "AddN" op. It has the property that all gradients must be ready before any aggregation is performed.
  • DEFAULT: The system-chosen default aggregation method.

tf.stop_gradient(input, name=None): stops gradient computation. It can be useful in the EM algorithm, Boltzmann machines, and similar settings.

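3.2 Gradient Clipping

You can add clipping functions to the graph and use them to perform basic numeric clipping. They are particularly useful for handling exploding or vanishing gradients.

tf.clip_by_value: clipping by value

tf.clip_by_value(t, clip_value_min, clip_value_max, name=None)

Clips tensor values to a specified min and max.

Given a tensor t, this operation returns a tensor of the same type and shape as t with its values clipped to clip_value_min and clip_value_max. Any values less than clip_value_min are set to clip_value_min. Any values greater than clip_value_max are set to clip_value_max.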

Args:

  • t: A Tensor.
  • clip_value_min: A 0-D (scalar) Tensor. The minimum value to clip by.
  • clip_value_max: A 0-D (scalar) Tensor. The maximum value to clip by.
  • name: A name for the operation (optional).

Returns:

A clipped Tensor.
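As a quick illustration of the clipping rule, the following plain-Python sketch applies the same element-wise clamp to a list of floats. It is an assumption-level stand-in that works on lists, not tensors.

```python
def clip_by_value(values, clip_value_min, clip_value_max):
    """Clamp every element of a list to [clip_value_min, clip_value_max]."""
    return [max(clip_value_min, min(clip_value_max, v)) for v in values]

clipped = clip_by_value([-5.0, -0.5, 0.0, 0.5, 5.0], -1.0, 1.0)
print(clipped)
```

Values already inside the range pass through unchanged; only the two outliers are replaced by the bounds.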

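Other functions

tf.clip_by_value(t, clip_value_min, clip_value_max, name=None): clips tensor data to the given min and max, in order to cope with exploding or vanishing gradients.

tf.clip_by_norm(t, clip_norm, axes=None, name=None): clips tensor values so that the L2 norm is at most clip_norm. If l2norm(t) is already at most clip_norm, t is returned unchanged; otherwise it returns t * clip_norm / l2norm(t).

tf.clip_by_average_norm(t, clip_norm, name=None): clips tensor values so that the average L2 norm is at most clip_norm. When the average norm exceeds clip_norm it returns t * clip_norm / l2norm_avg(t).

tf.clip_by_global_norm(t_list, clip_norm, use_norm=None, name=None): returns t_list[i] * clip_norm / max(global_norm, clip_norm) for each tensor, together with global_norm, where global_norm = sqrt(sum([l2norm(t)**2 for t in t_list])).

tf.global_norm(t_list, name=None): returns global_norm = sqrt(sum([l2norm(t)**2 for t in t_list])).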
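The global-norm formula above can be written out directly. This is a minimal plain-Python sketch, assuming each "tensor" is a flat list of floats; the function names mirror the ones listed, but the code is an illustration of the formula and not the library's implementation.

```python
import math

def l2norm(t):
    """L2 norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in t))

def global_norm(t_list):
    """sqrt(sum(l2norm(t)**2 for t in t_list))."""
    return math.sqrt(sum(l2norm(t) ** 2 for t in t_list))

def clip_by_global_norm(t_list, clip_norm):
    """Scale every list by clip_norm / max(global_norm, clip_norm)."""
    norm = global_norm(t_list)
    scale = clip_norm / max(norm, clip_norm)
    return [[v * scale for v in t] for t in t_list], norm

grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm = clip_by_global_norm(grads, clip_norm=1.0)
unchanged, _ = clip_by_global_norm(grads, clip_norm=10.0)
print(norm, clipped, unchanged)
```

Because every list is scaled by the same factor, the relative sizes of the gradients are preserved, and when the global norm is already below clip_norm the scale factor is 1 and nothing changes.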

4. Decaying the Learning Rate

tf.train.exponential_decay(learning_rate, global_step, decay_steps, decay_rate, staircase=False, name=None)

Applies exponential decay to the learning rate, using the following formula:

decayed_learning_rate = learning_rate *
                        decay_rate ^ (global_step / decay_steps)

If the argument staircase is True, then global_step / decay_steps is an integer division and the decayed learning rate follows a staircase function.

For example, to decay by a factor of 0.96 every 100000 steps:

...
global_step = tf.Variable(0, trainable=False)
starter_learning_rate = 0.1
learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step,
                                           100000, 0.96, staircase=True)
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
# Passing global_step to minimize() will increment it at each step.
optimizer.minimize(...my loss..., global_step=global_step)
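The decay formula is easy to evaluate by hand. The sketch below restates it in plain Python, assuming integer step counts, to show the difference between the smooth and the staircase schedule. It illustrates the formula only and is not the TensorFlow op.

```python
def exponential_decay(learning_rate, global_step, decay_steps, decay_rate,
                      staircase=False):
    """learning_rate * decay_rate ** (global_step / decay_steps)."""
    if staircase:
        exponent = global_step // decay_steps  # integer division
    else:
        exponent = global_step / decay_steps   # true division
    return learning_rate * decay_rate ** exponent

# Halfway through the first decay period.
smooth = exponential_decay(0.1, 50000, 100000, 0.96)
stepped = exponential_decay(0.1, 50000, 100000, 0.96, staircase=True)

# Two full decay periods have elapsed at step 250000.
later = exponential_decay(0.1, 250000, 100000, 0.96, staircase=True)
print(smooth, stepped, later)
```

With staircase=True the rate stays at its starting value until a full decay period has elapsed, then drops by a factor of decay_rate at each period boundary.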

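5. Moving Averages

Some training algorithms, such as GradientDescent and Momentum, can make use of moving averages of the variables during optimization. Using the moving averages often improves results noticeably.

Operation: Description
class tf.train.ExponentialMovingAverage: maintains moving averages of variables by applying an exponential decay.
tf.train.ExponentialMovingAverage.apply(var_list=None): maintains moving averages of the variables in var_list.
tf.train.ExponentialMovingAverage.average_name(var): returns the name of the variable that holds the average of var.
tf.train.ExponentialMovingAverage.average(var): returns the variable that holds the average of var.
tf.train.ExponentialMovingAverage.variables_to_restore(moving_avg_variables=None): returns a map from names to the variables to restore.

For example:
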
# Create variables.
var0 = tf.Variable(...)
var1 = tf.Variable(...)
# ... use the variables to build a training model...
...
# Create an op that applies the optimizer.  This is what we usually
# would use as a training op.
opt_op = opt.minimize(my_loss, var_list=[var0, var1])

# Create an ExponentialMovingAverage object
ema = tf.train.ExponentialMovingAverage(decay=0.9999)

# Create the shadow variables, and add ops to maintain moving averages
# of var0 and var1.
maintain_averages_op = ema.apply([var0, var1])

# Create an op that will update the moving averages after each training
# step.  This is what we will use in place of the usual training op.
with tf.control_dependencies([opt_op]):
    training_op = tf.group(maintain_averages_op)

...train the model by running training_op...

Restoring the shadow variable values:

# Create a Saver that loads variables from their saved shadow values.
shadow_var0_name = ema.average_name(var0)
shadow_var1_name = ema.average_name(var1)
saver = tf.train.Saver({shadow_var0_name: var0, shadow_var1_name: var1})
saver.restore(...checkpoint filename...)
# var0 and var1 now hold the moving average values
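The update that a shadow variable receives can be sketched in a few lines of plain Python. This assumes the usual exponential moving average rule, shadow -= (1 - decay) * (shadow - value), and, as a simplification, initialises the shadow value on the first call to apply. The class ToyEMA is hypothetical and only mimics the method names listed above.

```python
# Toy exponential moving average over named scalar variables.
# Hypothetical class for illustration; not the TensorFlow implementation.

class ToyEMA:
    def __init__(self, decay):
        self.decay = decay
        self.shadow = {}  # variable name -> moving average

    def apply(self, variables):
        """Create or update the shadow value of every variable."""
        for name, value in variables.items():
            if name not in self.shadow:
                # Simplification: the shadow starts at the current value.
                self.shadow[name] = value
            else:
                self.shadow[name] -= (1.0 - self.decay) * (self.shadow[name] - value)

    def average(self, name):
        """Return the moving average for a variable, or None if untracked."""
        return self.shadow.get(name)

variables = {"var0": 0.0}
ema = ToyEMA(decay=0.5)
ema.apply(variables)          # shadow initialised to 0.0
variables["var0"] = 4.0
ema.apply(variables)          # shadow moves halfway towards 4.0
ema.apply(variables)          # and halfway again
print(ema.average("var0"))
```

A decay close to 1, such as the 0.9999 used in the example above, makes the shadow value follow the variable much more slowly than the 0.5 chosen here for readability.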

6. Coordinator and QueueRunner; Distributed Execution (Summary Operations)

See https://www.cnblogs.com/wuzhitj/p/6648641.html

7. Adding Summaries to Event Files

Operation: Description
class tf.train.SummaryWriter: writes Summary protocol buffers to event files.
tf.train.SummaryWriter.__init__(logdir, graph=None, max_queue=10, flush_secs=120, graph_def=None): creates a SummaryWriter and a new event file.
tf.train.SummaryWriter.add_summary(summary, global_step=None): adds a summary to the event file.
tf.train.SummaryWriter.add_session_log(session_log, global_step=None): adds a SessionLog to the event file.
tf.train.SummaryWriter.add_event(event): adds an event to the event file.
tf.train.SummaryWriter.add_graph(graph, global_step=None, graph_def=None): adds a Graph to the event file.
tf.train.SummaryWriter.add_run_metadata(run_metadata, tag, global_step=None): adds metadata for a single session.run() call.
tf.train.SummaryWriter.flush(): flushes the event file to disk.
tf.train.SummaryWriter.close(): flushes the event file to disk and closes it.
tf.train.summary_iterator(path): an iterator for reading Event protocol buffers from an event file.

tf.train.SummaryWriter

Take the output of a summary op evaluated in a session and pass it to a SummaryWriter, which writes it to the event file. Event files contain Event protos that can contain Summary protos along with the timestamp and step. You can then use TensorBoard to visualize the contents of the event files. See TensorBoard and Summaries for more details.
If you pass a Graph to the constructor, it is added to the event file, which is equivalent to calling add_graph().
TensorBoard picks up the graph from the event file and displays it, so you can inspect the graph you built visually. You usually pass the graph from the session in which you launched it:

...create a graph...
# Launch the graph in a session.
sess = tf.Session()
# Create a summary writer, add the 'graph' to the event file.
writer = tf.train.SummaryWriter(<logdir>, sess.graph)

8. Training Utilities


tf.train.global_step(sess, global_step_tensor)

Small helper to get the global step.

# Creates a variable to hold the global_step.
global_step_tensor = tf.Variable(10, trainable=False, name='global_step')
# Creates a session.
sess = tf.Session()
# Initializes the variable.
sess.run(global_step_tensor.initializer)
print('global_step:', tf.train.global_step(sess, global_step_tensor))

global_step: 10

Args:

  • sess: A TensorFlow Session object.
  • global_step_tensor: Tensor or the name of the operation that contains the global step.

Returns:

The global step value.


tf.train.write_graph(graph_def, logdir, name, as_text=True)

Writes a graph proto on disk.

The graph is written as a binary proto unless as_text is True.

v = tf.Variable(0, name='my_variable')
sess = tf.Session()
tf.train.write_graph(sess.graph_def, '/tmp/my-model', 'train.pbtxt')

Args:

  • graph_def: A GraphDef protocol buffer.
  • logdir: Directory where to write the graph.
  • name: Filename for the graph.
  • as_text: If True, writes the graph as an ASCII proto.
