Training a TensorFlow model: requirements
- a model represented as a computational graph.
- a loss function to minimize.
- the gradients of the loss with respect to the model weights, used to backpropagate the error signal.
- a training routine that iteratively does all of the above and updates the weights accordingly.
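The snippets below rely on two helpers that are placeholders, not TF-Slim API: LoadData(...), which returns batches of inputs, and MyModel, which builds the network. As a minimal sketch (the layer sizes and scope names are illustrative assumptions), MyModel could be defined with slim layers as:
import tensorflow as tf
import tensorflow.contrib.slim as slim

def MyModel(images, num_classes=10):
    # Hypothetical convolutional classifier built from slim layers.
    net = slim.conv2d(images, 32, [3, 3], scope='conv1')
    net = slim.max_pool2d(net, [2, 2], scope='pool1')
    net = slim.flatten(net)
    net = slim.fully_connected(net, num_classes, activation_fn=None, scope='logits')
    # log_loss below expects probabilities, so finish with a softmax.
    return tf.nn.softmax(net)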
# load data
images, labels = LoadData(...)
# Create a model and make predictions
predictions = MyModel(images)
# Define a loss function
slim.losses.log_loss(predictions, labels)
# Get the total loss (model loss plus regularization losses)
total_loss = slim.losses.get_total_loss()
# Define the optimization method (SGD, Momentum, RMSProp, AdaGrad, Adam)
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
# create_train_op ensures that at each step we:
# compute the loss, compute the gradients, and apply the update ops
train_op = slim.learning.create_train_op(total_loss, optimizer)
# Where checkpoints and event files are stored.
logdir = "/logdir/path"
slim.learning.train(
    train_op,
    logdir,
    number_of_steps=1000,    # number of gradient steps
    save_summaries_secs=60,  # compute summaries every 60 secs
    save_interval_secs=300)  # save a model checkpoint every 5 min
It is important to monitor the 'health' of the training, because the optimization can stop functioning properly, for example when the loss diverges or the gradients vanish or explode.
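One lightweight way to do this, sketched below using the total_loss and learning_rate from the snippet above: write scalar summaries before calling slim.learning.train; it merges all summaries by default and saves them to logdir every save_summaries_secs, where TensorBoard can plot them.
# Scalar summaries are picked up by slim.learning.train (which merges all
# summaries by default) and written to logdir every save_summaries_secs.
tf.summary.scalar('losses/total_loss', total_loss)
tf.summary.scalar('learning_rate', learning_rate)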
In the simplest use case, we use a model to create the predictions, then specify the metrics, and finally call the evaluation method: slim.evaluation.evaluation() performs a single evaluation run.
# Create model and obtain the predictions:
images, labels = LoadData(...)
predictions = MyModel(images)
# Choose the metrics to compute:
names_to_values, names_to_updates = slim.metrics.aggregate_metric_map({
    "accuracy": slim.metrics.streaming_accuracy(predictions, labels),
    "mse": slim.metrics.streaming_mean_squared_error(predictions, labels),
})
# Initialize global and local variables (streaming metrics use local variables)
initial_op = tf.group(
    tf.global_variables_initializer(),
    tf.local_variables_initializer())
with tf.Session() as sess:
    # Run the evaluation over num_evals batches, then fetch the final values
    metric_values = slim.evaluation.evaluation(
        sess,
        num_evals=10,
        initial_op=initial_op,
        eval_op=list(names_to_updates.values()),
        final_op=list(names_to_values.values()))
    # Print the final metric values
    for metric, value in zip(names_to_values.keys(), metric_values):
        tf.logging.info('Metric %s has value: %f', metric, value)
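Under the hood, each streaming metric returns a (value_op, update_op) pair: the update op accumulates statistics one batch at a time, and the value op reads out the aggregated result; aggregate_metric_map merely splits these pairs into the two dictionaries used above. A minimal sketch of the mechanics in isolation, reusing the predictions and labels tensors from above:
# value_op reads the running aggregate; update_op folds in one more batch.
value_op, update_op = slim.metrics.streaming_accuracy(predictions, labels)
with tf.Session() as sess:
    # Streaming metrics store their state in local variables.
    sess.run(tf.local_variables_initializer())
    for _ in range(10):
        sess.run(update_op)    # accumulate statistics over 10 batches
    print(sess.run(value_op))  # aggregated accuracy over all batches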
Often, one wants to evaluate a model checkpoint saved on disk, and the evaluation can be performed periodically during training on a set schedule. Instead of calling the evaluation() method, we now call the evaluation_loop() method, additionally providing the logging and checkpoint directories as well as an evaluation time interval.
# Load the data
images, labels = load_data(...)
# Define the network
predictions = MyModel(images)
# Choose the metrics to compute:
names_to_values, names_to_updates = slim.metrics.aggregate_metric_map({
    'accuracy': slim.metrics.streaming_accuracy(predictions, labels),
    'precision': slim.metrics.streaming_precision(predictions, labels),
    'recall': slim.metrics.streaming_recall(predictions, labels),
})
# Define the summaries to write:
for metric_name, metric_value in names_to_values.items():
    tf.summary.scalar(metric_name, metric_value)
# Define other summaries to write (loss, activations, gradients)
tf.summary.scalar(...)
tf.summary.histogram(...)
checkpoint_dir = '/tmp/my_model_dir/'
log_dir = '/tmp/my_model_eval/'
# evaluate for 1000 batches:
num_evals = 1000
# Setup the global step.
slim.get_or_create_global_step()
slim.evaluation.evaluation_loop(
    master='',
    checkpoint_dir=checkpoint_dir,
    logdir=log_dir,
    num_evals=num_evals,
    eval_op=list(names_to_updates.values()),
    summary_op=tf.summary.merge_all(),  # merge all summaries defined above
    eval_interval_secs=600)             # how often to run the evaluation
When a model has already been trained and we only wish to evaluate its last checkpoint, TF-Slim provides the evaluate_once() method, which evaluates the model at the given checkpoint path exactly once.
# CNN_model is a user-defined network returning logits and its end-point nodes
logits, nodes = CNN_model(inputs, dropout=0.5, is_training=False)
predictions = tf.argmax(logits, 1)
# Define streaming metrics
names_to_values, names_to_updates = slim.metrics.aggregate_metric_map({
    'eval/Accuracy': slim.metrics.streaming_accuracy(predictions, targets),
    'eval/Recall@3': slim.metrics.streaming_sparse_recall_at_k(
        tf.to_float(logits), tf.expand_dims(targets, 1), 3),
    'eval/Precision': slim.metrics.streaming_precision(predictions, targets),
    'eval/Recall': slim.metrics.streaming_recall(predictions, targets)
})
print('Running evaluation loop...')
# Only load latest checkpoint
checkpoint_path = tf.train.latest_checkpoint(checkpoint_dir)
metric_values = slim.evaluation.evaluate_once(
    master='',
    checkpoint_path=checkpoint_path,
    logdir=checkpoint_dir,
    num_evals=num_evals,
    eval_op=list(names_to_updates.values()),
    final_op=list(names_to_values.values()))
# Print the final metric values
names_to_values = dict(zip(names_to_values.keys(), metric_values))
for name, value in names_to_values.items():
    print('%s: %f' % (name, value))