Sometimes we need different learning rates for different layers of a network. For example, a common best practice when fine-tuning is to use a smaller learning rate for the backbone pretrained on ImageNet and a larger one for the newly added layers. In the computational graph shown in the figure, we would like the ResNet backbone to use a smaller learning rate while the newly added ASPP module uses a slightly larger one.
Although TensorFlow does not provide convenient built-in support for per-layer learning rates, a thin wrapper around the optimizer, built with TF's low-level APIs, gets the job done.
First, write a helper function that returns a concrete optimizer:
import tensorflow as tf

def get_solver(kind, lr):
    """Return a tf.train optimizer of the given kind with learning rate `lr`."""
    kind = kind.lower()
    if kind == 'adam':
        solver = tf.train.AdamOptimizer(lr)
    elif kind == 'sgd':
        solver = tf.train.GradientDescentOptimizer(lr)
    elif kind == 'momentum':
        solver = tf.train.MomentumOptimizer(lr, momentum=0.9)
    else:
        raise NotImplementedError(
            'Solver `%s` not available, choose from {`adam`, `sgd`, `momentum`}' % kind)
    return solver
Next, define a class that wraps the underlying optimizer(s):
class OptimizerWrapper:
    """
    A wrapper class for the TensorFlow optimizers. During network fine-tuning, we would like
    different parts of the network to have different learning rates.

    Usage:
        Case 1: optimizer = OptimizerWrapper('adam', lr=1e-3)
        Case 2: optimizer = OptimizerWrapper('momentum', params={'resnet_v1_101': 1e-2, 'other_layer': 1e-1})
    """
    def __init__(self, type, lr=1e-3, params=None):
        """
        :param type: str, choose from {'adam', 'sgd', 'momentum'}
        :param lr: float, optional, used in Case 1 of the usages above
        :param params: dict of str -> float, mapping variable scopes to learning rates
        """
        self.params = params
        if params is None:
            self.solver = get_solver(type, lr)
        else:
            self.vars = []
            self.solvers = []
            # Dicts preserve insertion order (Python 3.7+), so this iteration
            # order matches the one in `minimize` below.
            for scope, lr in params.items():
                scope_solver = get_solver(type, lr)
                scope_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope)
                self.vars.extend(scope_vars)
                self.solvers.append(scope_solver)

    def minimize(self, loss, global_step=None):
        global_step = tf.train.get_or_create_global_step() if global_step is None else global_step
        if self.params is None:
            return self.solver.minimize(loss, global_step=global_step)
        else:
            counter = 0
            # A single backward pass computes the gradients for all scopes at once.
            grads = tf.gradients(loss, self.vars)
            ops = []
            for i, (scope, lr) in enumerate(self.params.items()):
                scope_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope)
                scope_grads = grads[counter:counter + len(scope_vars)]
                ops.append(self.solvers[i].apply_gradients(zip(scope_grads, scope_vars),
                                                           global_step=global_step))
                # Set it to None so only one solver gets passed the global step;
                # otherwise, all the solvers would increment the global step.
                global_step = None
                counter += len(scope_vars)
            return tf.group(*ops)
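The gradient-slicing logic in `minimize` relies on `tf.gradients` returning gradients in the same order as `self.vars`, which was built scope by scope. The partitioning can be sketched in plain Python (no TensorFlow needed; the variable names below are made up for illustration):

```python
def partition_by_scope(all_vars, all_grads, scopes):
    """Split a flat gradient list into per-scope (grad, var) pairs, mirroring
    the counter-based slicing in OptimizerWrapper.minimize. Assumes `all_vars`
    is grouped by scope, in the order of `scopes`."""
    result = {}
    counter = 0
    for scope in scopes:
        scope_vars = [v for v in all_vars if v.startswith(scope)]
        result[scope] = list(zip(all_grads[counter:counter + len(scope_vars)],
                                 scope_vars))
        counter += len(scope_vars)
    return result

# Hypothetical variable names and dummy gradient placeholders;
# tf.gradients keeps the order of the variable list.
variables = ['backbone/w1', 'backbone/w2', 'head/w1']
gradients = ['g0', 'g1', 'g2']
parts = partition_by_scope(variables, gradients, ['backbone', 'head'])
```

Each solver then receives exactly the `(grad, var)` pairs for its own scope.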
This wrapper covers two main use cases:

    optimizer = OptimizerWrapper('adam', lr=1e-3)
    optimizer = OptimizerWrapper('momentum',
                                 params={'resnet_v1_101': 1e-3, 'new_layer': 1e-2})
When using different learning rates, global_step can only be passed to one of the optimizers; otherwise, after one training step global_step would be incremented k times, where k is the number of optimizers used.
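The k-fold increment is easy to see with a tiny simulation (plain Python, standing in for k solvers that share one step counter):

```python
def run_step(num_solvers, only_first_gets_step):
    """Simulate one training step with k solvers sharing a global step counter.
    Each solver that receives the counter increments it once, the way
    apply_gradients(..., global_step=...) does."""
    step = 0
    for i in range(num_solvers):
        gets_step = (i == 0) if only_first_gets_step else True
        if gets_step:
            step += 1
    return step

# With 3 solvers all receiving global_step, one training step bumps it 3 times;
# passing it to only the first solver keeps the count correct.
naive = run_step(3, only_first_gets_step=False)
fixed = run_step(3, only_first_gets_step=True)
```

This is exactly why `minimize` sets `global_step = None` after the first `apply_gradients` call.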
When using different learning rates, the scope of every variable that should be updated must be listed explicitly in params; variables in scopes not listed there will not be updated. (This counts as a small bug/limitation.)
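To guard against this limitation, one could check up front that every trainable variable falls under some scope listed in params. A minimal sketch using plain name-prefix matching (the variable names here are hypothetical; in real code they would come from `[v.name for v in tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)]`):

```python
def find_uncovered(var_names, param_scopes):
    """Return the variable names not covered by any scope in `param_scopes`.
    These variables would silently receive no updates from the wrapper."""
    return [name for name in var_names
            if not any(name.startswith(scope) for scope in param_scopes)]

# Hypothetical trainable variable names for a DeepLab-style model:
names = ['resnet_v1_101/conv1/weights:0', 'aspp/conv/weights:0',
         'logits/weights:0']
missing = find_uncovered(names, ['resnet_v1_101', 'aspp'])
```

Raising an error (or at least logging a warning) when `missing` is non-empty makes the silent-no-update case much harder to hit.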
Reference: how-to-set-layer-wise-learning-rate-in-tensorflow