Background: when using TensorFlow before, we always worked with downloaded models and their ready-made training code. Here we want to look inside the model and see how TensorFlow actually carries out the training steps.
References: http://wiki.jikexueyuan.com/project/tensorflow-zh/api_docs/python/train.html
https://www.cnblogs.com/wuzhitj/p/6648641.html
Contents
I. Overview of Optimizers
II. Computing and applying gradients
2.1 How to compute and apply gradients
2.2 Functions for computing and applying gradients
The tf.train.Optimizer.__init__ function
The tf.train.Optimizer.minimize function
The tf.train.Optimizer.compute_gradients function
The tf.train.Optimizer.apply_gradients function
2.3 Gating gradients (parallelism when applying gradients)
2.4 Slots (extra training variables)
2.5 Optimizers
III. Computing gradients
3.1 Gradient computation
tf.gradients: sums of partial derivatives
class tf.AggregationMethod
3.2 Gradient clipping
tf.clip_by_value: clip by value
Other functions
IV. Learning rate decay
V. Moving averages
VI. Coordinator and QueueRunner (distributed execution); Summary Operations
VII. Adding Summaries to Event Files
VIII. Training utilities
tf.train.global_step(sess, global_step_tensor)
tf.train.write_graph(graph_def, logdir, name, as_text=True)
These classes compute the gradients of a loss and then apply the gradients to variables.
class tf.train.Optimizer
This class defines the API and the ops used to train a model. You never use this class directly; instead you instantiate one of its subclasses, such as GradientDescentOptimizer, AdagradOptimizer, or MomentumOptimizer.
For example:
# Create an optimizer with the desired parameters.
opt = GradientDescentOptimizer(learning_rate=0.1)
# Add Ops to the graph to minimize a cost by updating a list of variables.
# "cost" is a Tensor, and the list of variables contains variables.Variable
# objects.
opt_op = opt.minimize(cost, var_list=<list of variables>)
During training, execute the returned op:
# Execute opt_op to do one step of training:
opt_op.run()
One-shot usage:
The minimize() function both computes the gradients and applies them to the variables; it simply combines the two separate steps compute_gradients() and apply_gradients().
Step-by-step usage: call compute_gradients() first, then apply_gradients(). For example:
# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)
Note that grads_and_vars is a list of (gradient, variable) tuples. Tuples are immutable, so instead of modifying a pair in place you build a new (capped_gradient, variable) tuple, as in the list comprehension above.
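MyCapper above is only a placeholder name from the example. A minimal sketch of such a capping helper, assuming we simply clip each gradient element-wise with tf.clip_by_value (the function name and the cap of 1.0 are illustrative, not part of the API):

import tensorflow as tf

def my_capper(grad, cap=1.0):
    # Hypothetical helper: clip each gradient element to [-cap, cap].
    if grad is None:        # some variables may have no gradient
        return None
    return tf.clip_by_value(grad, -cap, cap)

# Usage with the workflow above:
# capped_grads_and_vars = [(my_capper(g), v) for g, v in grads_and_vars]
# opt.apply_gradients(capped_grads_and_vars)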
The tf.train.Optimizer.__init__ function
tf.train.Optimizer.__init__(use_locking, name)
Create a new Optimizer. This must be called by the constructors of subclasses.
Args:
use_locking: Bool. If True, use locks to prevent concurrent updates to variables.
name: A non-empty string. The name to use for accumulators created for the optimizer.
The tf.train.Optimizer.minimize function
tf.train.Optimizer.minimize(loss, global_step=None, var_list=None, gate_gradients=1, name=None)
Add operations to minimize 'loss' by updating 'var_list'. This is the one-shot usage described above: a single call that uses the loss to update the corresponding variables.
The function simply combines compute_gradients() and apply_gradients(). If you want to process the gradients before applying them, call compute_gradients() and apply_gradients() separately instead.
Args:
loss: A Tensor containing the value to minimize.
global_step: Optional Variable to increment by one after the variables have been updated.
var_list: Optional list of variables.Variable to update to minimize 'loss'. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.
gate_gradients: How to gate the computation of gradients. Can be GATE_NONE, GATE_OP, or GATE_GRAPH. (This only affects parallelism.)
name: Optional name for the returned operation.
Returns:
An Operation that updates the variables in 'var_list'. If 'global_step' was not None, that operation also increments global_step.
The tf.train.Optimizer.compute_gradients function
tf.train.Optimizer.compute_gradients(loss, var_list=None, gate_gradients=1)
Compute gradients of "loss" for the variables in "var_list".
This is the first part of minimize(). It returns a list of (gradient, variable) pairs where "gradient" is the gradient for "variable". Note that "gradient" can be a Tensor, an IndexedSlices, or None if there is no gradient for the given variable.
Args:
loss: A Tensor containing the value to minimize.
var_list: Optional list of variables.Variable to update to minimize "loss". Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.
gate_gradients: How to gate the computation of gradients. Can be GATE_NONE, GATE_OP, or GATE_GRAPH.
Returns:
A list of (gradient, variable) pairs.
Raises:
TypeError: If var_list contains anything other than variables.Variable.
ValueError: If some arguments are invalid.
The tf.train.Optimizer.apply_gradients function
tf.train.Optimizer.apply_gradients(grads_and_vars, global_step=None, name=None)
Apply gradients to variables.
This is the second part of minimize(). It returns an Operation that applies gradients.
Args:
grads_and_vars: List of (gradient, variable) pairs as returned by compute_gradients().
global_step: Optional Variable to increment by one after the variables have been updated.
name: Optional name for the returned operation. Defaults to the name passed to the Optimizer constructor.
Returns:
An Operation that applies the specified gradients. If 'global_step' was not None, that operation also increments global_step.
The derivation behind these gradient computations involves the backpropagation algorithm, which we will not dig into here.
The gate_gradients argument controls how gradient computation and application are interleaved. It can take the values GATE_NONE, GATE_OP, or GATE_GRAPH (a usage sketch follows the list below):
GATE_NONE: compute and apply gradients in parallel. This provides maximum parallelism, at the cost of results that may not be reproducible. For example, the two gradients of a matmul depend on the input values: with GATE_NONE one of the gradients could be applied to one of the inputs before the other gradient is computed, leading to non-reproducible results.
GATE_OP: for each op, make sure all gradients are computed before any of them is used. This prevents race conditions for ops that have multiple inputs whose gradients depend on those inputs.
GATE_GRAPH: make sure all gradients for all variables are computed before any one of them is used. This provides the least parallelism, but is useful if you want to process all gradients before applying any of them.
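A minimal sketch of passing one of these constants, assuming a loss tensor like the one used earlier; the constants live on the Optimizer class itself:

opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
# Compute every gradient before applying any of them (lowest parallelism,
# but all gradients are based on the same variable values).
train_op = opt.minimize(loss, gate_gradients=tf.train.Optimizer.GATE_GRAPH)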
Some optimizer subclasses, such as MomentumOptimizer and AdagradOptimizer, allocate and manage additional variables used during training. These variables are called 'slots'. Slots have names, and you can ask the optimizer for the names of the slots it uses and then fetch the corresponding slot variables. This is useful when you want to log or debug a training algorithm and report on the state of the slots.
tf.train.Optimizer.get_slot_names(): get the names of the slots
Return a list of the names of slots created by the Optimizer.
See get_slot().
Returns:
A list of strings
tf.train.Optimizer.get_slot(var, name): get the slot variable for a given variable and slot name
Return a slot named "name" created for "var" by the Optimizer.
Some Optimizer subclasses use additional variables. For example Momentum and Adagrad use variables to accumulate updates. This method gives access to these Variables if for some reason you need them.
Use get_slot_names() to get the list of slot names created by the Optimizer.
Args:
var: A variable passed to minimize() or apply_gradients().
name: A string.
Returns:
The Variable for the slot if it was created, None otherwise.
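A minimal sketch of inspecting slots, assuming a MomentumOptimizer whose minimize() has already been called, a loss tensor, and some trained variable var (MomentumOptimizer keeps one accumulator slot per trained variable):

opt = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)
train_op = opt.minimize(loss)                 # creates the slot variables

print(opt.get_slot_names())                   # e.g. ['momentum']
momentum_var = opt.get_slot(var, 'momentum')  # the accumulator for 'var', or None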
For details, see http://wiki.jikexueyuan.com/project/tensorflow-zh/api_docs/python/train.html
class tf.train.GradientDescentOptimizer | Optimizer that implements the gradient descent algorithm |
tf.train.GradientDescentOptimizer.__init__(learning_rate, use_locking=False, name='GradientDescent') |
Construct a new gradient descent optimizer |
class tf.train.AdadeltaOptimizer | Optimizer that implements the Adadelta algorithm |
tf.train.AdadeltaOptimizer.__init__(learning_rate=0.001, rho=0.95, epsilon=1e-08, use_locking=False, name='Adadelta') |
Construct a new Adadelta optimizer |
class tf.train.AdagradOptimizer | Optimizer that implements the Adagrad algorithm |
tf.train.AdagradOptimizer.__init__(learning_rate, initial_accumulator_value=0.1, use_locking=False, name='Adagrad') |
Construct a new Adagrad optimizer |
class tf.train.MomentumOptimizer | Optimizer that implements the Momentum algorithm |
tf.train.MomentumOptimizer.__init__(learning_rate, momentum, use_locking=False, name='Momentum', use_nesterov=False) |
Construct a new Momentum optimizer; momentum is the momentum value, a Tensor or a float |
class tf.train.AdamOptimizer | Optimizer that implements the Adam algorithm |
tf.train.AdamOptimizer.__init__(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='Adam') |
Construct a new Adam optimizer |
class tf.train.FtrlOptimizer | Optimizer that implements the FTRL algorithm |
tf.train.FtrlOptimizer.__init__(learning_rate, learning_rate_power=-0.5, initial_accumulator_value=0.1, l1_regularization_strength=0.0, l2_regularization_strength=0.0, use_locking=False, name='Ftrl') |
Construct a new FTRL optimizer |
class tf.train.RMSPropOptimizer | Optimizer that implements the RMSProp algorithm |
tf.train.RMSPropOptimizer.__init__(learning_rate, decay=0.9, momentum=0.0, epsilon=1e-10, use_locking=False, name='RMSProp') |
Construct a new RMSProp optimizer |
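All of these subclasses are drop-in replacements in the minimize()/apply_gradients() workflow shown earlier; only the constructor arguments differ. A minimal sketch, assuming a loss tensor already exists:

# Any of the subclasses above can be used in place of GradientDescentOptimizer.
opt = tf.train.AdamOptimizer(learning_rate=0.001)
train_op = opt.minimize(loss)

# Or, with explicit hyperparameters:
opt = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
train_op = opt.minimize(loss)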
Lower-level functions are also available for operating on gradients directly.
tf.gradients: computes sums of partial derivatives. Note the distinction: tf.train.Optimizer.compute_gradients described above and tf.gradients here are not the same function.
tf.gradients(ys, xs, grad_ys=None, name='gradients', colocate_gradients_with_ops=False, gate_gradients=False, aggregation_method=None)
Constructs symbolic partial derivatives of ys w.r.t. x in xs. In other words, it builds the symbolic sum of the partial derivatives of ys with respect to each x in xs, returning sum(dy/dx) for every x in xs.
ys and xs are each a Tensor or a list of tensors. grad_ys is a list of Tensor, holding the gradients received by the ys. The list must be the same length as ys.
gradients() adds ops to the graph to output the partial derivatives of ys with respect to xs. It returns a list of Tensor of length len(xs) where each tensor is the sum(dy/dx) for y in ys.
grad_ys is a list of tensors of the same length as ys that holds the initial gradients for each y in ys. When grad_ys is None, we fill in a tensor of '1's of the shape of y for each y in ys. A user can provide their own initial grad_ys to compute the derivatives using a different initial gradient for each y (e.g., if one wanted to weight the gradient differently for each value in each y).
Args:
ys: A Tensor or list of tensors to be differentiated.
xs: A Tensor or list of tensors to be used for differentiation.
grad_ys: Optional. A Tensor or list of tensors the same size as ys and holding the gradients computed for each y in ys.
name: Optional name to use for grouping all the gradient ops together. Defaults to 'gradients'.
colocate_gradients_with_ops: If True, try colocating gradients with the corresponding op.
gate_gradients: If True, add a tuple around the gradients returned for an operation. This avoids some race conditions.
aggregation_method: Specifies the method used to combine gradient terms. Accepted values are constants defined in the class AggregationMethod.
Returns:
A list of sum(dy/dx) for each x in xs.
Raises:
LookupError: if one of the operations between x and y does not have a registered gradient function.
ValueError: if the arguments are invalid.
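A minimal sketch of what the returned values look like, assuming the TF 1.x-style graph API used throughout these notes:

import tensorflow as tf

x = tf.Variable(3.0, name='x')
y = x * x        # y = x^2, so dy/dx = 2x
z = 2.0 * x      # z = 2x,  so dz/dx = 2

# Each entry of the result is sum(dy/dx) over all ys, so grads[0] = 2x + 2.
grads = tf.gradients([y, z], [x])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grads))   # [8.0] when x == 3.0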
class tf.AggregationMethod
A class listing aggregation methods used to combine gradients. Computing partial derivatives can require aggregating gradient contributions; this class lists the various methods that can be used to combine gradients in the graph:
ADD_N: All of the gradient terms are summed as part of one operation using the "AddN" op. It has the property that all gradients must be ready before any aggregation is performed.
DEFAULT: The system-chosen default aggregation method.
tf.gradients(ys, xs, grad_ys=None, name='gradients', colocate_gradients_with_ops=False, gate_gradients=False, aggregation_method=None) |
Builds symbolic partial derivatives of ys with respect to the xs; returns sum(dy/dx) for each x in xs |
tf.stop_gradient(input, name=None) | Stops gradient computation through this tensor; useful for example in the EM algorithm or Boltzmann machines (see the sketch below) |
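A minimal sketch of tf.stop_gradient: the wrapped tensor is treated as a constant during differentiation, so no gradient flows back through it.

a = tf.Variable(2.0)
b = tf.Variable(5.0)
y = a * tf.stop_gradient(b)      # b is treated as a constant here

grads = tf.gradients(y, [a, b])
# grads[0] is dy/da = b, but grads[1] is None: no gradient flows back to b.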
You can add clipping functions to the graph to perform basic value clipping. This is particularly useful for handling exploding or vanishing gradients.
tf.clip_by_value: clip tensor values to a given range
tf.clip_by_value(t, clip_value_min, clip_value_max, name=None)
Clips tensor values to a specified min and max.
Given a tensor t, this operation returns a tensor of the same type and shape with its values clipped to clip_value_min and clip_value_max. Any values less than clip_value_min are set to clip_value_min, and any values greater than clip_value_max are set to clip_value_max.
Args:
t: A Tensor.
clip_value_min: A 0-D (scalar) Tensor. The minimum value to clip by.
clip_value_max: A 0-D (scalar) Tensor. The maximum value to clip by.
name: A name for the operation (optional).
Returns:
A clipped Tensor.
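A quick sketch of the behaviour:

t = tf.constant([-3.0, -1.0, 0.0, 2.0, 6.0])
clipped = tf.clip_by_value(t, clip_value_min=-1.0, clip_value_max=1.0)
# clipped evaluates to [-1.0, -1.0, 0.0, 1.0, 1.0]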
tf.clip_by_value(t, clip_value_min, clip_value_max, name=None) | Clips the values of the tensor to the given min and max; used to handle exploding or vanishing gradients |
tf.clip_by_norm(t, clip_norm, axes=None, name=None) | Clips tensor t so that its L2-norm is at most clip_norm; when the norm exceeds clip_norm it returns t * clip_norm / l2norm(t) |
tf.clip_by_average_norm(t, clip_norm, name=None) | Clips tensor t using its average L2-norm, with clip_norm as the maximum; returns t * clip_norm / l2norm_avg(t) |
tf.clip_by_global_norm(t_list, clip_norm, use_norm=None, name=None) |
Returns t_list[i] * clip_norm / max(global_norm, clip_norm), where global_norm = sqrt(sum([l2norm(t)**2 for t in t_list])) |
tf.global_norm(t_list, name=None) | Returns global_norm = sqrt(sum([l2norm(t)**2 for t in t_list])) |
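A minimal sketch of plugging global-norm clipping into the compute_gradients()/apply_gradients() workflow from section II, assuming a loss tensor; the threshold 5.0 is an arbitrary choice for illustration:

opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)

grads_and_vars = opt.compute_gradients(loss)
grads = [g for g, v in grads_and_vars]
variables = [v for g, v in grads_and_vars]

# Rescale all gradients together so that their global norm is at most 5.0.
clipped_grads, global_norm = tf.clip_by_global_norm(grads, clip_norm=5.0)

train_op = opt.apply_gradients(list(zip(clipped_grads, variables)))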
tf.train.exponential_decay(learning_rate, global_step, decay_steps, decay_rate, staircase=False, name=None)
Applies exponential decay to the learning rate, according to the formula:
decayed_learning_rate = learning_rate *
                        decay_rate ^ (global_step / decay_steps)
If the argument staircase is True, then global_step / decay_steps is an integer division and the decayed learning rate follows a staircase function.
For example, decay by 0.96 every 100000 steps:
...
global_step = tf.Variable(0, trainable=False)
starter_learning_rate = 0.1
learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step,
                                           100000, 0.96, staircase=True)
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
# Passing global_step to minimize() will increment it at each step.
optimizer.minimize(...my loss..., global_step=global_step)
Some training algorithms, such as gradient descent and momentum, can make use of moving averages of the variables during optimization. Using moving averages often noticeably improves results.
Operation | Description |
---|---|
class tf.train.ExponentialMovingAverage | Maintains moving averages of variables using exponential decay |
tf.train.ExponentialMovingAverage.apply(var_list=None) | Maintains moving averages of the variables in var_list |
tf.train.ExponentialMovingAverage.average_name(var) | Returns the name of the variable holding the average of var |
tf.train.ExponentialMovingAverage.average(var) | Returns the variable holding the average of var |
tf.train.ExponentialMovingAverage.variables_to_restore(moving_avg_variables=None) | Returns a map of variable names to variables, for use when restoring saved moving averages |
# Create variables.
var0 = tf.Variable(...)
var1 = tf.Variable(...)
# ... use the variables to build a training model...
...
# Create an op that applies the optimizer. This is what we usually
# would use as a training op.
opt_op = opt.minimize(my_loss, [var0, var1])
# Create an ExponentialMovingAverage object
ema = tf.train.ExponentialMovingAverage(decay=0.9999)
# Create the shadow variables, and add ops to maintain moving averages
# of var0 and var1.
maintain_averages_op = ema.apply([var0, var1])
# Create an op that will update the moving averages after each training
# step. This is what we will use in place of the usual training op.
with tf.control_dependencies([opt_op]):
training_op = tf.group(maintain_averages_op)
...train the model by running training_op...
Restoring the shadow variable values:
# Create a Saver that loads variables from their saved shadow values.
shadow_var0_name = ema.average_name(var0)
shadow_var1_name = ema.average_name(var1)
saver = tf.train.Saver({shadow_var0_name: var0, shadow_var1_name: var1})
saver.restore(...checkpoint filename...)
# var0 and var1 now hold the moving average values
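The variables_to_restore() helper from the table above builds this name-to-variable map for you. A minimal sketch, assuming the same ema object, an existing Session sess, and a checkpoint_path placeholder:

# Map each variable to its moving-average (shadow) name, so restoring a
# checkpoint loads the averaged values into the variables themselves.
variables_to_restore = ema.variables_to_restore()
saver = tf.train.Saver(variables_to_restore)
saver.restore(sess, checkpoint_path)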
See https://www.cnblogs.com/wuzhitj/p/6648641.html
Operation | Description |
---|---|
class tf.train.SummaryWriter | Writes Summary protocol buffers to event files |
tf.train.SummaryWriter.__init__(logdir, graph=None, max_queue=10, flush_secs=120, graph_def=None) | Creates a SummaryWriter instance and a new event file |
tf.train.SummaryWriter.add_summary(summary, global_step=None) | Adds a Summary to the event file |
tf.train.SummaryWriter.add_session_log(session_log, global_step=None) | Adds a SessionLog to the event file |
tf.train.SummaryWriter.add_event(event) | Adds an Event to the event file |
tf.train.SummaryWriter.add_graph(graph, global_step=None, graph_def=None) | Adds a Graph to the event file |
tf.train.SummaryWriter.add_run_metadata(run_metadata, tag, global_step=None) | Adds metadata for a single session.run() call |
tf.train.SummaryWriter.flush() | Flushes the event file to disk |
tf.train.SummaryWriter.close() | Flushes the event file to disk and closes the file |
tf.train.summary_iterator(path) | An iterator for reading Event protocol buffers from an event file |
tf.train.SummaryWriter
Fetch the output of a summary op in a session and pass it to a SummaryWriter, which appends it to an event file. Event files contain Event protos that can contain Summary protos along with the timestamp and step. You can then use TensorBoard to visualize the contents of the event files. See TensorBoard and Summaries for more details.
If we pass a Graph to the constructor, it is added to the event file, which is equivalent to calling add_graph(). TensorBoard extracts the graph from the event file and displays it, so we can visually inspect the graph we built. We usually pass the graph of the session we launch:
...create a graph...
# Launch the graph in a session.
sess = tf.Session()
# Create a summary writer, add the 'graph' to the event file.
writer = tf.train.SummaryWriter(<some-directory>, sess.graph)
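A minimal sketch of actually writing summaries during training, assuming the pre-1.0 summary ops that go with tf.train.SummaryWriter (in TensorFlow 1.x these became tf.summary.scalar, tf.summary.merge_all and tf.summary.FileWriter); loss and train_op are assumed to exist as in the earlier examples:

loss_summary = tf.scalar_summary('loss', loss)   # a scalar summary of the loss tensor
merged = tf.merge_all_summaries()

for step in range(1000):
    summary_str, _ = sess.run([merged, train_op])
    writer.add_summary(summary_str, global_step=step)   # tag the summary with the step
writer.flush()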
tf.train.global_step(sess, global_step_tensor)
Small helper to get the global step.
# Creates a variable to hold the global_step.
global_step_tensor = tf.Variable(10, trainable=False, name='global_step')
# Creates a session.
sess = tf.Session()
# Initializes the variable.
sess.run(global_step_tensor.initializer)
print('global_step:', tf.train.global_step(sess, global_step_tensor))
global_step: 10
Args:
sess: A TensorFlow Session object.
global_step_tensor: Tensor or the name of the operation that contains the global step.
Returns:
The global step value.
tf.train.write_graph(graph_def, logdir, name, as_text=True)
Writes a graph proto on disk.
The graph is written as a binary proto unless as_text is True.
v = tf.Variable(0, name='my_variable')
sess = tf.Session()
tf.train.write_graph(sess.graph_def, '/tmp/my-model', 'train.pbtxt')
Args:
graph_def: A GraphDef protocol buffer.
logdir: Directory where to write the graph.
name: Filename for the graph.
as_text: If True, writes the graph as an ASCII proto.
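The example above writes a text proto; a minimal sketch of writing the binary form instead (the '.pb' filename is just a common convention, not a requirement):

# Write the same graph as a binary protocol buffer.
tf.train.write_graph(sess.graph_def, '/tmp/my-model', 'train.pb', as_text=False)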