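Notes on what I have learned, kept for future reference.
Initialize FLAGS: tf.app.flags.FLAGS.
Define flags with tf.app.flags.DEFINE_xxx, where xxx is the flag type. The first argument is the flag name, e.g. train_dir, whose value is then available as FLAGS.train_dir. The second argument is the default value and the third is the help text. If the flag is not set, FLAGS.train_dir returns the default, /tmp/cifar10_train. To set it, pass it on the command line, e.g. python cifar10_train.py --train_dir test_dir, where cifar10_train.py is the file containing this code; FLAGS.train_dir is then test_dir. Passing -h or --help prints the help text. The other flags work the same way.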
FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_string('train_dir', '/tmp/cifar10_train',
                           """Directory where to write event logs """
                           """and checkpoint.""")
tf.app.flags.DEFINE_integer('max_steps', 100000,
                            """Number of batches to run.""")
tf.app.flags.DEFINE_boolean('log_device_placement', False,
                            """Whether to log device placement.""")
tf.app.flags.DEFINE_integer('log_frequency', 10,
                            """How often to log results to the console.""")
The entry point of the program:
def main(argv=None): # pylint: disable=unused-argument
  cifar10.maybe_download_and_extract()
  if tf.gfile.Exists(FLAGS.train_dir):
    tf.gfile.DeleteRecursively(FLAGS.train_dir)
  tf.gfile.MakeDirs(FLAGS.train_dir)
  train()

if __name__ == '__main__':
  tf.app.run()
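tf.app.run(): when called without arguments, it parses the command-line flags and then calls main(argv=...). Without tf.app.run(), FLAGS would not be parsed as expected. The first step in main is to download and extract the dataset.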
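As a rough analogy for "parse the flags, then call main", here is a minimal sketch that uses Python's standard argparse module as a stand-in for tf.app.flags and tf.app.run. It is not the TensorFlow implementation: parse_flags, main and run are hypothetical names, and only the flag names and defaults are taken from the code above.

```python
import argparse


def parse_flags(argv):
    """Hypothetical helper: argparse stand-in for the DEFINE_xxx calls."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_dir', type=str, default='/tmp/cifar10_train',
                        help='Directory where to write event logs and checkpoint.')
    parser.add_argument('--max_steps', type=int, default=100000,
                        help='Number of batches to run.')
    return parser.parse_args(argv)


def main(flags):
    """Stand-in for the real main(); it only reports the directory it would use."""
    return 'training would write to %s' % flags.train_dir


def run(argv):
    """Parse the flags first, then hand control to main."""
    return main(parse_flags(argv))


# No arguments: the default value is used.
print(run([]))
# Equivalent of: python cifar10_train.py --train_dir test_dir
print(run(['--train_dir', 'test_dir']))
```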
def maybe_download_and_extract():
  """Download and extract the tarball from Alex's website."""
  dest_directory = FLAGS.data_dir
  if not os.path.exists(dest_directory):
    os.makedirs(dest_directory)
  filename = DATA_URL.split('/')[-1]
  filepath = os.path.join(dest_directory, filename)
  if not os.path.exists(filepath):
    def _progress(count, block_size, total_size):
      sys.stdout.write('\r>> Downloading %s %.1f%%' % (filename,
          float(count * block_size) / float(total_size) * 100.0))
      sys.stdout.flush()
    filepath, _ = urllib.request.urlretrieve(DATA_URL, filepath, _progress)
    print()
    statinfo = os.stat(filepath)
    print('Successfully downloaded', filename, statinfo.st_size, 'bytes.')
  extracted_dir_path = os.path.join(dest_directory, 'cifar-10-batches-bin')
  if not os.path.exists(extracted_dir_path):
    tarfile.open(filepath, 'r:gz').extractall(dest_directory)
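This function lives in cifar10.py. The download directory can be configured through a flag (FLAGS.data_dir), in the way described in the first section. The logic is simple. The data directory is created if it does not exist. DATA_URL, defined at the top of the file, is the download URL of the CIFAR-10 dataset; the part after its last slash is used as the file name and joined with the data directory to form the path of the downloaded archive. If the archive does not exist yet, it is downloaded with urllib.request.urlretrieve, whose last argument is a callback that reports progress as the amount downloaded so far divided by the total size. Finally the path of the extracted directory is built: if that directory already exists, the dataset has been downloaded and extracted and nothing more is done; otherwise the archive is extracted.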
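The path handling and the progress calculation can be tried without downloading anything. The sketch below is only an illustration: EXAMPLE_URL is a made-up address (the real DATA_URL is defined in cifar10.py), and archive_path and progress_percent are hypothetical helper names.

```python
import os

# Hypothetical URL for illustration only; not the real DATA_URL.
EXAMPLE_URL = 'http://example.com/data/cifar-10-binary.tar.gz'


def archive_path(url, dest_directory):
    """Join the data directory with the part of the URL after the last slash."""
    filename = url.split('/')[-1]
    return os.path.join(dest_directory, filename)


def progress_percent(count, block_size, total_size):
    """Percentage downloaded so far, as computed in the _progress callback."""
    return float(count * block_size) / float(total_size) * 100.0


print(archive_path(EXAMPLE_URL, 'data'))
print('%.1f%%' % progress_percent(50, 1024, 204800))
```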
def main(argv=None): # pylint: disable=unused-argument
  cifar10.maybe_download_and_extract()
  if tf.gfile.Exists(FLAGS.eval_dir):
    tf.gfile.DeleteRecursively(FLAGS.eval_dir)
  tf.gfile.MakeDirs(FLAGS.eval_dir)
  evaluate()
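The gfile module is used here to check whether the directory exists. If it does, it is deleted, so that records from any previous run are cleared, and then it is created again. Note that the snippet above uses FLAGS.eval_dir and calls evaluate(), which matches the main function of the evaluation script; the main function of the training script, shown earlier, does the same with FLAGS.train_dir and then calls train() to start training.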
def train():
  """Train CIFAR-10 for a number of steps."""
  with tf.Graph().as_default():
    global_step = tf.train.get_or_create_global_step()
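with tf.Graph().as_default() makes a new graph the default graph for the ops created inside the block. tf.train.get_or_create_global_step() creates (if needed) and returns the global step tensor. I do not fully understand these two lines yet.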
    # Get images and labels for CIFAR-10.
    # Force input pipeline to CPU:0 to avoid operations sometimes ending up on
    # GPU and resulting in a slow down.
    with tf.device('/cpu:0'):
      images, labels = cifar10.distorted_inputs()
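The comment explains this well: the images and labels are fetched inside with tf.device('/cpu:0'), which forces the input pipeline onto the CPU. Otherwise some of these ops may end up on the GPU, which slows reading down.
In cifar10.py, the code that reads the data is: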
def distorted_inputs():
  """Construct distorted input for CIFAR training using the Reader ops.
  Returns:
    images: Images. 4D tensor of [batch_size, IMAGE_SIZE, IMAGE_SIZE, 3] size.
    labels: Labels. 1D tensor of [batch_size] size.
  Raises:
    ValueError: If no data_dir
  """
  if not FLAGS.data_dir:
    raise ValueError('Please supply a data_dir')
  data_dir = os.path.join(FLAGS.data_dir, 'cifar-10-batches-bin')
  images, labels = cifar10_input.distorted_inputs(data_dir=data_dir,
                                                  batch_size=FLAGS.batch_size)
  if FLAGS.use_fp16:
    images = tf.cast(images, tf.float16)
    labels = tf.cast(labels, tf.float16)
  return images, labels
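In fact, this function calls distorted_inputs in the cifar10_input.py file to read the images and labels, and then casts them if FLAGS.use_fp16 is set. tf.cast converts a tensor to another data type.
Let's look at the implementation in cifar10_input.py.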
def distorted_inputs(data_dir, batch_size):
  """Construct distorted input for CIFAR training using the Reader ops.
  Args:
    data_dir: Path to the CIFAR-10 data directory.
    batch_size: Number of images per batch.
  Returns:
    images: Images. 4D tensor of [batch_size, IMAGE_SIZE, IMAGE_SIZE, 3] size.
    labels: Labels. 1D tensor of [batch_size] size.
  """
  filenames = [os.path.join(data_dir, 'data_batch_%d.bin' % i)
               for i in xrange(1, 6)]
  for f in filenames:
    if not tf.gfile.Exists(f):
      raise ValueError('Failed to find file: ' + f)
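filenames holds the dataset file names, data_batch_1.bin through data_batch_5.bin (xrange(1, 6) yields 1 to 5). Each file is then checked for existence, and a ValueError is raised if one is missing.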
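A quick way to confirm which names the list comprehension produces is to run it on its own. This sketch uses range, the Python 3 counterpart of xrange, and the data_dir value is just a placeholder.

```python
import os

data_dir = 'cifar-10-batches-bin'  # placeholder directory for illustration
filenames = [os.path.join(data_dir, 'data_batch_%d.bin' % i)
             for i in range(1, 6)]
for name in filenames:
    print(name)
```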
  # Create a queue that produces the filenames to read.
  filename_queue = tf.train.string_input_producer(filenames)
Next, a filename queue is created, from which data is read during training.
  with tf.name_scope('data_augmentation'):
    # Read examples from files in the filename queue.
    read_input = read_cifar10(filename_queue)
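tf.name_scope makes naming easier later on. Inside it, something can be named 'a', and another 'a' created under a different tf.name_scope elsewhere will not clash with it, because the two live in different name scopes. Next, look at the read_cifar10 function.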
def read_cifar10(filename_queue):
  """Reads and parses examples from CIFAR10 data files.
  Recommendation: if you want N-way read parallelism, call this function
  N times. This will give you N independent Readers reading different
  files & positions within those files, which will give better mixing of
  examples.
  Args:
    filename_queue: A queue of strings with the filenames to read from.
  Returns:
    An object representing a single example, with the following fields:
      height: number of rows in the result (32)
      width: number of columns in the result (32)
      depth: number of color channels in the result (3)
      key: a scalar string Tensor describing the filename & record number
        for this example.
      label: an int32 Tensor with the label in the range 0..9.
      uint8image: a [height, width, depth] uint8 Tensor with the image data
  """
  class CIFAR10Record(object):
    pass
  result = CIFAR10Record()
  # Dimensions of the images in the CIFAR-10 dataset.
  # See http://www.cs.toronto.edu/~kriz/cifar.html for a description of the
  # input format.
  label_bytes = 1 # 2 for CIFAR-100
  result.height = 32
  result.width = 32
  result.depth = 3
  image_bytes = result.height * result.width * result.depth
  # Every record consists of a label followed by the image, with a
  # fixed number of bytes for each.
  record_bytes = label_bytes + image_bytes
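A CIFAR10Record class is defined to hold the information that is read. A brief note on the CIFAR-10 binary format: each image occupies 32*32*3+1 bytes. The first byte is the label, with a value from 0 to 9 for the 10 classes (the meta file lists what each value stands for). The next 32*32=1024 bytes are the R channel, the following 1024 bytes the G channel, and the last 1024 bytes the B channel. Images are 32 pixels wide and 32 pixels high. Given this layout, the meaning of the parameters defined above is easy to see.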
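The byte layout described above can be checked on a synthetic record. This is a small sketch in plain Python; the label and pixel values are made up for illustration.

```python
LABEL_BYTES = 1
HEIGHT, WIDTH, DEPTH = 32, 32, 3
IMAGE_BYTES = HEIGHT * WIDTH * DEPTH       # 3072
RECORD_BYTES = LABEL_BYTES + IMAGE_BYTES   # 3073

# Synthetic record: label 7, then constant R, G and B planes.
plane = HEIGHT * WIDTH
record = (bytes([7]) + bytes([10]) * plane
          + bytes([20]) * plane + bytes([30]) * plane)

label = record[0]
red = record[LABEL_BYTES:LABEL_BYTES + plane]
green = record[LABEL_BYTES + plane:LABEL_BYTES + 2 * plane]
blue = record[LABEL_BYTES + 2 * plane:RECORD_BYTES]
print(len(record), label, red[0], green[0], blue[0])
```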
  # Read a record, getting filenames from the filename_queue. No
  # header or footer in the CIFAR-10 format, so we leave header_bytes
  # and footer_bytes at their default of 0.
  reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
  result.key, value = reader.read(filename_queue)
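tf.FixedLengthRecordReader sets how many bytes make up one record, and its read function then returns that many bytes. As far as I understand (this is purely my own reading), every read advances the file position by the record length configured on the reader. So each read returns one image together with its label, in value.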
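The idea of reading fixed-length records one after another can be sketched with an in-memory stream. This only illustrates the concept and is not how TensorFlow implements its reader; read_records is a hypothetical helper and the 4-byte records are synthetic.

```python
import io


def read_records(stream, record_bytes):
    """Yield consecutive fixed-length records until the stream is exhausted."""
    while True:
        record = stream.read(record_bytes)
        if len(record) < record_bytes:
            return
        yield record


# Three tiny synthetic records of 4 bytes each (1 label byte + 3 data bytes).
data = bytes([0, 1, 1, 1, 5, 2, 2, 2, 9, 3, 3, 3])
labels = [record[0] for record in read_records(io.BytesIO(data), record_bytes=4)]
print(labels)
```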
  # Convert from a string to a vector of uint8 that is record_bytes long.
  record_bytes = tf.decode_raw(value, tf.uint8)
The value is then decoded into unsigned 8-bit integers, since pixel values range from 0 to 255, which gives the decoded record.
  # The first bytes represent the label, which we convert from uint8->int32.
  result.label = tf.cast(
      tf.strided_slice(record_bytes, [0], [label_bytes]), tf.int32)
  # The remaining bytes after the label represent the image, which we reshape
  # from [depth * height * width] to [depth, height, width].
  depth_major = tf.reshape(
      tf.strided_slice(record_bytes, [label_bytes],
                       [label_bytes + image_bytes]),
      [result.depth, result.height, result.width])
  # Convert from [depth, height, width] to [height, width, depth].
  result.uint8image = tf.transpose(depth_major, [1, 2, 0])
  return result
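tf.strided_slice extracts the data in a given index range. The label comes first, so its slice starts at 0 and ends at the number of label bytes; the result is cast to a 32-bit integer. The image slice starts at the number of label bytes and ends at the total record length. The image data is reshaped to a 3*32*32 array, which splits it into three channels of 32*32 each, and the transpose that follows turns it into a 32*32*3 array, the final RGB image, which is returned.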
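The reshape-then-transpose step is easier to follow on a tiny example. The sketch below reorders a flat depth-major sequence into [height][width][depth] with plain lists; depth_major_to_hwc is a hypothetical helper, and a 2*2 image is used instead of 32*32 to keep the output readable.

```python
def depth_major_to_hwc(flat, depth, height, width):
    """Reorder a flat [depth * height * width] sequence into [height][width][depth]."""
    return [[[flat[d * height * width + h * width + w] for d in range(depth)]
             for w in range(width)]
            for h in range(height)]


# 2x2 image: R plane is 1..4, G plane is 11..14, B plane is 21..24.
flat = [1, 2, 3, 4, 11, 12, 13, 14, 21, 22, 23, 24]
image = depth_major_to_hwc(flat, depth=3, height=2, width=2)
print(image[0][0], image[1][1])
```

Each pixel ends up holding its own (R, G, B) triple, which is the layout the rest of the pipeline expects.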
    reshaped_image = tf.cast(read_input.uint8image, tf.float32)
    height = IMAGE_SIZE
    width = IMAGE_SIZE
The data is cast to 32-bit floats, and the height and width are set.
    # Randomly crop a [height, width] section of the image.
    distorted_image = tf.random_crop(reshaped_image, [height, width, 3])
Randomly crop the image. The original is 32*32; the random crop used for training is 24*24.
    # Randomly flip the image horizontally.
    distorted_image = tf.image.random_flip_left_right(distorted_image)
Randomly flip the image left to right.
    # Because these operations are not commutative, consider randomizing
    # the order their operation.
    # NOTE: since per_image_standardization zeros the mean and makes
    # the stddev unit, this likely has no effect see tensorflow#1458.
    distorted_image = tf.image.random_brightness(distorted_image,
                                                 max_delta=63)
    distorted_image = tf.image.random_contrast(distorted_image,
                                               lower=0.2, upper=1.8)
Randomly adjust the brightness by adding a random value in [-max_delta, max_delta] to the original pixel values, then randomly adjust the contrast by a factor in [0.2, 1.8].
    # Subtract off the mean and divide by the variance of the pixels.
    float_image = tf.image.per_image_standardization(distorted_image)
Standardize the image, which can speed up training. Together, the preprocessing steps above help the model generalize.
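To show what standardization does numerically, here is a plain-Python sketch that follows the formula documented for tf.image.per_image_standardization: (x - mean) / adjusted_stddev, with adjusted_stddev = max(stddev, 1/sqrt(N)). It works on a flat list of made-up pixel values rather than a tensor.

```python
import math


def per_image_standardization(pixels):
    """Scale a flat list of pixel values to zero mean and unit stddev."""
    n = len(pixels)
    mean = sum(pixels) / n
    variance = sum((p - mean) ** 2 for p in pixels) / n
    # Lower bound on the stddev, guarding against division by zero
    # when every pixel has the same value.
    adjusted_stddev = max(math.sqrt(variance), 1.0 / math.sqrt(n))
    return [(p - mean) / adjusted_stddev for p in pixels]


standardized = per_image_standardization([0.0, 50.0, 100.0, 150.0])
print(standardized)
```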
    # Set the shapes of tensors.
    float_image.set_shape([height, width, 3])
    read_input.label.set_shape([1])
Set the shapes of the image and label tensors.
    # Ensure that the random shuffling has good mixing properties.
    min_fraction_of_examples_in_queue = 0.4
    min_queue_examples = int(NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN *
                             min_fraction_of_examples_in_queue)
    print ('Filling queue with %d CIFAR images before starting to train. '
           'This will take a few minutes.' % min_queue_examples)
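Set the minimum number of examples kept in the queue. The queue shuffles the examples after they are read, before they are fed to the network for training.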
  # Generate a batch of images and labels by building up a queue of examples.
  return _generate_image_and_label_batch(float_image, read_input.label,
                                         min_queue_examples, batch_size,
                                         shuffle=True)
Finally a batch of data is returned, produced by the following function:
def _generate_image_and_label_batch(image, label, min_queue_examples,
                                    batch_size, shuffle):
  """Construct a queued batch of images and labels.
  Args:
    image: 3-D Tensor of [height, width, 3] of type.float32.
    label: 1-D Tensor of type.int32
    min_queue_examples: int32, minimum number of samples to retain
      in the queue that provides of batches of examples.
    batch_size: Number of images per batch.
    shuffle: boolean indicating whether to use a shuffling queue.
  Returns:
    images: Images. 4D tensor of [batch_size, height, width, 3] size.
    labels: Labels. 1D tensor of [batch_size] size.
  """
  # Create a queue that shuffles the examples, and then
  # read 'batch_size' images + labels from the example queue.
  num_preprocess_threads = 16
  if shuffle:
    images, label_batch = tf.train.shuffle_batch(
        [image, label],
        batch_size=batch_size,
        num_threads=num_preprocess_threads,
        capacity=min_queue_examples + 3 * batch_size,
        min_after_dequeue=min_queue_examples)
  else:
    images, label_batch = tf.train.batch(
        [image, label],
        batch_size=batch_size,
        num_threads=num_preprocess_threads,
        capacity=min_queue_examples + 3 * batch_size)
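Several preprocessing threads are used, which can read faster than a single thread. Depending on shuffle, either tf.train.shuffle_batch (shuffled order) or tf.train.batch (sequential order) is called, with the batch size, the queue capacity and, in the shuffled case, the minimum number of examples left in the queue after a dequeue. This completes data reading; the returned batch can be fed to the network for training.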
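To get a feel for the queue sizes involved, the arithmetic can be worked through with concrete numbers. The two constants below are assumptions for illustration (50000 training examples per epoch and a batch size of 128); neither value appears in the excerpts above.

```python
# Assumed values: they are not shown in the excerpts above.
NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 50000
batch_size = 128

min_fraction_of_examples_in_queue = 0.4
min_queue_examples = int(NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN *
                         min_fraction_of_examples_in_queue)
# Same expression as the capacity argument passed to shuffle_batch.
capacity = min_queue_examples + 3 * batch_size
print(min_queue_examples, capacity)
```

Under these assumptions the queue holds at least 20000 examples before batches are drawn, which is why filling it takes a while at start-up.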
Back in the train() function of cifar10_train.py, the steps after reading the data are:
    # Build a Graph that computes the logits predictions from the
    # inference model.
    logits = cifar10.inference(images)
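The graph is built through the interface in cifar10.py. The model is roughly: convolution --> max pooling --> local response normalization --> convolution --> local response normalization --> max pooling --> fully connected --> fully connected --> softmax linear.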
def inference(images):
  """Build the CIFAR-10 model.
  Args:
    images: Images returned from distorted_inputs() or inputs().
  Returns:
    Logits.
  """
  # We instantiate all variables using tf.get_variable() instead of
  # tf.Variable() in order to share variables across multiple GPU training runs.
  # If we only ran this model on a single GPU, we could simplify this function
  # by replacing all instances of tf.get_variable() with tf.Variable().
  #
  # conv1
  with tf.variable_scope('conv1') as scope:
    kernel = _variable_with_weight_decay('weights',
                                         shape=[5, 5, 3, 64],
                                         stddev=5e-2,
                                         wd=None)
    conv = tf.nn.conv2d(images, kernel, [1, 1, 1, 1], padding='SAME')
    biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.0))
    pre_activation = tf.nn.bias_add(conv, biases)
    conv1 = tf.nn.relu(pre_activation, name=scope.name)
    _activation_summary(conv1)
That is the first convolutional layer. The _variable_with_weight_decay function is:
def _variable_with_weight_decay(name, shape, stddev, wd):
  """Helper to create an initialized Variable with weight decay.
  Note that the Variable is initialized with a truncated normal distribution.
  A weight decay is added only if one is specified.
  Args:
    name: name of the variable
    shape: list of ints
    stddev: standard deviation of a truncated Gaussian
    wd: add L2Loss weight decay multiplied by this float. If None, weight
        decay is not added for this Variable.
  Returns:
    Variable Tensor
  """
  dtype = tf.float16 if FLAGS.use_fp16 else tf.float32
  var = _variable_on_cpu(
      name,
      shape,
      tf.truncated_normal_initializer(stddev=stddev, dtype=dtype))
  if wd is not None:
    weight_decay = tf.multiply(tf.nn.l2_loss(var), wd, name='weight_loss')
    tf.add_to_collection('losses', weight_decay)
  return var
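First the data type is chosen, then the variable is obtained from _variable_on_cpu and initialized with tf.truncated_normal_initializer, i.e. a truncated normal distribution with standard deviation stddev. If wd is not None, the weight decay term is computed as tf.nn.l2_loss(var) * wd, where tf.nn.l2_loss(var) = sum(var**2)/2, and added to the 'losses' collection; its value can later be retrieved with tf.get_collection. Finally the variable is returned.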
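The weight decay term is simple enough to compute by hand. The sketch below mirrors the formula from the text on a plain list of made-up weights; l2_loss and weight_decay_term are local helper names, not TensorFlow calls.

```python
def l2_loss(values):
    """Half the sum of squares, matching sum(var**2)/2 from the text."""
    return sum(v * v for v in values) / 2.0


def weight_decay_term(values, wd):
    """Weight decay contribution to the total loss, or None if wd is None."""
    if wd is None:
        return None
    return l2_loss(values) * wd


weights = [1.0, -2.0, 3.0]
print(l2_loss(weights), weight_decay_term(weights, 0.004))
```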
def _variable_on_cpu(name, shape, initializer):
  """Helper to create a Variable stored on CPU memory.
  Args:
    name: name of the variable
    shape: list of ints
    initializer: initializer for Variable
  Returns:
    Variable Tensor
  """
  with tf.device('/cpu:0'):
    dtype = tf.float16 if FLAGS.use_fp16 else tf.float32
    var = tf.get_variable(name, shape, initializer=initializer, dtype=dtype)
  return var
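This function creates the variable with tf.get_variable, specifying its name, shape, initializer and data type.
Back to the first convolutional layer: its kernel is 5*5*3 and it produces 64 outputs, i.e. 64 feature maps. The kernel is initialized with random values from a truncated normal distribution with standard deviation 5e-2. The image is convolved with this kernel with a stride of 1, and SAME padding keeps the spatial size unchanged. A 64-dimensional bias is then initialized to 0. The convolution result plus the bias is passed through a ReLU activation, which gives the output of the first convolutional layer. _activation_summary records how a value changes, so that it can be inspected in TensorBoard.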
def _activation_summary(x):
  """Helper to create summaries for activations.
  Creates a summary that provides a histogram of activations.
  Creates a summary that measures the sparsity of activations.
  Args:
    x: Tensor
  Returns:
    nothing
  """
  # Remove 'tower_[0-9]/' from the name in case this is a multi-GPU training
  # session. This helps the clarity of presentation on tensorboard.
  tensor_name = re.sub('%s_[0-9]*/' % TOWER_NAME, '', x.op.name)
  tf.summary.histogram(tensor_name + '/activations', x)
  tf.summary.scalar(tensor_name + '/sparsity',
                    tf.nn.zero_fraction(x))
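re.sub replaces text. If the op name (x.op.name) has the form TOWER_NAME_xxx.../name, where xxx... stands for zero or more digits, e.g. TOWER_NAME_/name, TOWER_NAME_0/name or TOWER_NAME_23849/name, then the TOWER_NAME_xxx.../ part is replaced with '', i.e. removed, leaving only name. tf.summary.histogram records a histogram of x, and tf.summary.scalar records a scalar summary. zero_fraction is defined as follows: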
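The substitution can be tried directly with the standard re module. TOWER_NAME is set to 'tower' here as an assumption (it is consistent with the 'tower_[0-9]/' comment in the code), and strip_tower_prefix is a hypothetical wrapper around the same re.sub call.

```python
import re

TOWER_NAME = 'tower'  # assumed value, consistent with the code comment


def strip_tower_prefix(op_name):
    """Remove a leading 'tower_<digits>/' style prefix from an op name."""
    return re.sub('%s_[0-9]*/' % TOWER_NAME, '', op_name)


for name in ['tower_0/conv1/conv1', 'tower_23849/conv1/conv1', 'conv1/conv1']:
    print(strip_tower_prefix(name))
```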
def zero_fraction(value, name=None):
  """Returns the fraction of zeros in `value`.
  If `value` is empty, the result is `nan`.
  This is useful in summaries to measure and report sparsity. For example,
  ```python
      z = tf.nn.relu(...)
      summ = tf.summary.scalar('sparsity', tf.nn.zero_fraction(z))
  ```
  Args:
    value: A tensor of numeric type.
    name: A name for the operation (optional).
  Returns:
    The fraction of zeros in `value`, with type `float32`.
  """
  with ops.name_scope(name, "zero_fraction", [value]):
    value = ops.convert_to_tensor(value, name="value")
    zero = constant_op.constant(0, dtype=value.dtype, name="zero")
    return math_ops.reduce_mean(
        math_ops.cast(math_ops.equal(value, zero), dtypes.float32))
As shown, it computes the fraction of values that are equal to 0 (a ratio, not a count).
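The same quantity is easy to compute for a plain list, which makes the "fraction, not count" point concrete. This is a sketch with made-up activations, not the TensorFlow op itself.

```python
def zero_fraction(values):
    """Fraction of entries equal to zero; NaN for an empty input."""
    if not values:
        return float('nan')
    return sum(1 for v in values if v == 0) / len(values)


relu_output = [0.0, 1.5, 0.0, 0.0, 2.0]
print(zero_fraction(relu_output))
```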
  # pool1
  pool1 = tf.nn.max_pool(conv1, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1],
                         padding='SAME', name='pool1')
  # norm1
  norm1 = tf.nn.lrn(pool1, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75,
                    name='norm1')
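The convolutional layer is followed by a max pooling layer with a 3*3 window and a stride of 2. With SAME padding and a stride of 2, the spatial size is halved (rounded up), e.g. 24*24 becomes 12*12. Local response normalization follows, to improve generalization. I do not fully understand how it is implemented.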
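For local response normalization, the tf.nn.lrn documentation gives a per-position formula: each channel value is divided by (bias + alpha * sqr_sum) ** beta, where sqr_sum is the sum of squares over the channels within depth_radius of it. The sketch below applies that formula to the channel values of a single pixel; it is a plain-Python illustration with made-up inputs, not the TensorFlow kernel.

```python
def lrn(channels, depth_radius=4, bias=1.0, alpha=0.001 / 9.0, beta=0.75):
    """Local response normalization across the channel values of one pixel."""
    normalized = []
    for i, value in enumerate(channels):
        lo = max(0, i - depth_radius)
        hi = min(len(channels), i + depth_radius + 1)
        sqr_sum = sum(c * c for c in channels[lo:hi])
        normalized.append(value / (bias + alpha * sqr_sum) ** beta)
    return normalized


print(lrn([0.0, 10.0, 20.0, 30.0]))
```

Large activations in neighbouring channels increase the denominator, so each value is damped by the strength of its neighbours.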
  # conv2
  with tf.variable_scope('conv2') as scope:
    kernel = _variable_with_weight_decay('weights',
                                         shape=[5, 5, 64, 64],
                                         stddev=5e-2,
                                         wd=None)
    conv = tf.nn.conv2d(norm1, kernel, [1, 1, 1, 1], padding='SAME')
    biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.1))
    pre_activation = tf.nn.bias_add(conv, biases)
    conv2 = tf.nn.relu(pre_activation, name=scope.name)
    _activation_summary(conv2)
  # norm2
  norm2 = tf.nn.lrn(conv2, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75,
                    name='norm2')
  # pool2
  pool2 = tf.nn.max_pool(norm2, ksize=[1, 3, 3, 1],
                         strides=[1, 2, 2, 1], padding='SAME', name='pool2')
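As above, another convolutional layer follows, with a 5*5 kernel and 64 output channels. After the bias is added, the result goes through ReLU and is summarized, followed by local response normalization and max pooling.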
  # local3
  with tf.variable_scope('local3') as scope:
    # Move everything into depth so we can perform a single matrix multiply.
    reshape = tf.reshape(pool2, [images.get_shape().as_list()[0], -1])
    dim = reshape.get_shape()[1].value
    weights = _variable_with_weight_decay('weights', shape=[dim, 384],
                                          stddev=0.04, wd=0.004)
    biases = _variable_on_cpu('biases', [384], tf.constant_initializer(0.1))
    local3 = tf.nn.relu(tf.matmul(reshape, weights) + biases, name=scope.name)
    _activation_summary(local3)
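Next comes a fully connected layer. The max pooling output is first reshaped so that each example becomes a one-dimensional vector, and the length of that vector is read off. The weights are initialized with 384 outputs; here wd is not None, so a weight decay term is computed. The flattened pool output is multiplied by the weights, the bias is added, the result goes through ReLU, and the output is summarized.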
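The value of dim can be worked out by tracing the spatial size through the layers. The sketch below assumes the 24*24 training crop mentioned earlier and uses the rule that padding='SAME' produces ceil(size / stride) outputs per dimension; same_output_size is a hypothetical helper.

```python
import math


def same_output_size(size, stride):
    """Spatial output size for padding='SAME': ceil(size / stride)."""
    return int(math.ceil(size / stride))


size = 24                          # cropped input is 24x24
size = same_output_size(size, 1)   # conv1, stride 1 -> 24
size = same_output_size(size, 2)   # pool1, stride 2 -> 12
size = same_output_size(size, 1)   # conv2, stride 1 -> 12
size = same_output_size(size, 2)   # pool2, stride 2 -> 6
dim = size * size * 64             # flattened length per example
print(size, dim)
```

Under these assumptions each example is flattened to 6 * 6 * 64 = 2304 values before the first fully connected layer.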
  # local4
  with tf.variable_scope('local4') as scope:
    weights = _variable_with_weight_decay('weights', shape=[384, 192],
                                          stddev=0.04, wd=0.004)
    biases = _variable_on_cpu('biases', [192], tf.constant_initializer(0.1))
    local4 = tf.nn.relu(tf.matmul(local3, weights) + biases, name=scope.name)
    _activation_summary(local4)
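The second fully connected layer is similar: the weights and bias are initialized with 192 outputs, the previous layer's output is multiplied by the weights, the bias is added, and the result goes through ReLU and is summarized.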
  # linear layer(WX + b),
  # We don't apply softmax here because
  # tf.nn.sparse_softmax_cross_entropy_with_logits accepts the unscaled logits
  # and performs the softmax internally for efficiency.
  with tf.variable_scope('softmax_linear') as scope:
    weights = _variable_with_weight_decay('weights', [192, NUM_CLASSES],
                                          stddev=1/192.0, wd=None)
    biases = _variable_on_cpu('biases', [NUM_CLASSES],
                              tf.constant_initializer(0.0))
    softmax_linear = tf.add(tf.matmul(local4, weights), biases, name=scope.name)
    _activation_summary(softmax_linear)
  return softmax_linear
Finally the output is computed; the number of outputs equals the number of classes.
    # Calculate loss.
    loss = cifar10.loss(logits, labels)
In cifar10_train.py, the loss function is built right after the network model.
def loss(logits, labels):
  """Add L2Loss to all the trainable variables.
  Add summary for "Loss" and "Loss/avg".
  Args:
    logits: Logits from inference().
    labels: Labels from distorted_inputs or inputs(). 1-D tensor
            of shape [batch_size]
  Returns:
    Loss tensor of type float.
  """
  # Calculate the average cross entropy loss across the batch.
  labels = tf.cast(labels, tf.int64)
  cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
      labels=labels, logits=logits, name='cross_entropy_per_example')
  cross_entropy_mean = tf.reduce_mean(cross_entropy, name='cross_entropy')
  tf.add_to_collection('losses', cross_entropy_mean)
  # The total loss is defined as the cross entropy loss plus all of the weight
  # decay terms (L2 loss).
  return tf.add_n(tf.get_collection('losses'), name='total_loss')
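tf.nn.sparse_softmax_cross_entropy_with_logits computes the sparse cross entropy; labels has shape [batch_size] and logits has shape [batch_size, num_classes]. tf.reduce_mean then takes the mean (not the median) over the batch, the mean is added to the 'losses' collection, and all losses in the collection are summed and returned.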
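For a single example, sparse softmax cross entropy is -log(softmax(logits)[label]), where the label is an integer class index rather than a one-hot vector. The sketch below computes it in plain Python for a made-up batch of two examples and then averages, mirroring the reduce_mean step.

```python
import math


def sparse_softmax_cross_entropy(logits, label):
    """Cross entropy of one example: -log(softmax(logits)[label])."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(
        sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]


batch_logits = [[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]]
batch_labels = [0, 2]
per_example = [sparse_softmax_cross_entropy(row, y)
               for row, y in zip(batch_logits, batch_labels)]
cross_entropy_mean = sum(per_example) / len(per_example)
print(per_example, cross_entropy_mean)
```

The second example has uniform logits over three classes, so its loss is log(3); the first is lower because the correct class has the largest logit.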
The training op is then built from the loss.
    # Build a Graph that trains the model with one batch of examples and
    # updates the model parameters.
    train_op = cifar10.train(loss, global_step)
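The learning rate decays during training: it starts large so that convergence is fast, and is reduced as training proceeds in order to get closer to the optimum.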
def train(total_loss, global_step):
  """Train CIFAR-10 model.
  Create an optimizer and apply to all trainable variables. Add moving
  average for all trainable variables.
  Args:
    total_loss: Total loss from loss().
    global_step: Integer Variable counting the number of training steps
      processed.
  Returns:
    train_op: op for training.
  """
  # Variables that affect learning rate.
  num_batches_per_epoch = NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN / FLAGS.batch_size
  decay_steps = int(num_batches_per_epoch * NUM_EPOCHS_PER_DECAY)
  # Decay the learning rate exponentially based on the number of steps.
  lr = tf.train.exponential_decay(INITIAL_LEARNING_RATE,
                                  global_step,
                                  decay_steps,
                                  LEARNING_RATE_DECAY_FACTOR,
                                  staircase=True)
  tf.summary.scalar('learning_rate', lr)
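The number of batches per epoch is computed and used to derive the number of decay steps. tf.train.exponential_decay makes the learning rate decay exponentially; it takes the initial learning rate, global_step, the decay steps and the decay factor. staircase=True makes the decay a step function: the rate is decayed once every decay_steps steps and then stays constant until the next multiple of decay_steps. A summary of the learning rate is also recorded.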
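The staircase schedule can be reproduced with the documented formula, lr = initial_lr * decay_rate ** (global_step / decay_steps), using integer division when staircase is on. All constants below are assumptions for illustration (50000 examples per epoch, batch size 128, 350 epochs per decay, initial rate 0.1, decay factor 0.1); none of them appear in the excerpts above.

```python
def exponential_decay(initial_lr, global_step, decay_steps, decay_rate,
                      staircase=True):
    """lr = initial_lr * decay_rate ** (global_step / decay_steps)."""
    if staircase:
        exponent = global_step // decay_steps  # whole decay periods only
    else:
        exponent = global_step / decay_steps
    return initial_lr * decay_rate ** exponent


# Assumed constants; none of these values appear in the excerpts above.
NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 50000
BATCH_SIZE = 128
NUM_EPOCHS_PER_DECAY = 350.0
num_batches_per_epoch = NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN / BATCH_SIZE
decay_steps = int(num_batches_per_epoch * NUM_EPOCHS_PER_DECAY)

for step in [0, decay_steps - 1, decay_steps, 2 * decay_steps]:
    print(step, exponential_decay(0.1, step, decay_steps, 0.1))
```

With staircase=True the rate stays at its initial value right up to the last step before decay_steps and only then drops by the decay factor.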
  # Generate moving averages of all losses and associated summaries.
  loss_averages_op = _add_loss_summaries(total_loss)
This generates moving averages of all the losses.
def _add_loss_summaries(total_loss):
  """Add summaries for losses in CIFAR-10 model.
  Generates moving average for all losses and associated summaries for
  visualizing the performance of the network.
  Args:
    total_loss: Total loss from loss().
  Returns:
    loss_averages_op: op for generating moving averages of losses.
  """
  # Compute the moving average of all individual losses and the total loss.
  loss_averages = tf.train.ExponentialMovingAverage(0.9, name='avg')
  losses = tf.get_collection('losses')
  loss_averages_op = loss_averages.apply(losses + [total_loss])
  # Attach a scalar summary to all individual losses and the total loss; do the
  # same for the averaged version of the losses.
  for l in losses + [total_loss]:
    # Name each loss as '(raw)' and name the moving average version of the loss
    # as the original loss name.
    tf.summary.scalar(l.op.name + ' (raw)', l)
    tf.summary.scalar(l.op.name, loss_averages.average(l))
  return loss_averages_op
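tf.train.ExponentialMovingAverage computes moving averages with exponential decay. My understanding is that the loss of a given batch mainly reflects that batch, so averaging it with the losses of other steps smooths out noise. Summaries are written for both the raw and the averaged losses, and the op that updates the moving averages (not the averaged loss itself) is returned.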
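The smoothing effect is visible even on a handful of numbers. The sketch below applies the standard update shadow = decay * shadow + (1 - decay) * value with decay 0.9, as in the code above. The loss values are made up, and seeding the shadow with the first value is an assumption made for this illustration; TensorFlow's own initialization may differ.

```python
def update_moving_average(shadow, value, decay=0.9):
    """One exponential moving average step."""
    return decay * shadow + (1.0 - decay) * value


# Hypothetical noisy loss values; the shadow is seeded with the first one.
losses = [2.0, 1.0, 3.0, 1.0]
shadow = losses[0]
smoothed = []
for value in losses:
    shadow = update_moving_average(shadow, value)
    smoothed.append(shadow)
print(smoothed)
```

The raw values swing between 1.0 and 3.0, while the smoothed series stays close to 2.0.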
  # Compute gradients.
  with tf.control_dependencies([loss_averages_op]):
    opt = tf.train.GradientDescentOptimizer(lr)
    grads = opt.compute_gradients(total_loss)
  # Apply gradients.
  apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)
Compute the gradients and apply them.
  # Add histograms for trainable variables.
  for var in tf.trainable_variables():
    tf.summary.histogram(var.op.name, var)
  # Add histograms for gradients.
  for grad, var in grads:
    if grad is not None:
      tf.summary.histogram(var.op.name + '/gradients', grad)
Add histogram summaries for the trainable variables and their gradients.
  # Track the moving averages of all trainable variables.
  variable_averages = tf.train.ExponentialMovingAverage(
      MOVING_AVERAGE_DECAY, global_step)
  with tf.control_dependencies([apply_gradient_op]):
    variables_averages_op = variable_averages.apply(tf.trainable_variables())
  return variables_averages_op
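Moving averages of the variables are tracked. tf.control_dependencies ensures that the gradients have been applied before the moving averages are updated, so the averages reflect the updated variables. The op that updates the variable moving averages is returned.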
    class _LoggerHook(tf.train.SessionRunHook):
      """Logs loss and runtime."""
      def begin(self):
        self._step = -1
        self._start_time = time.time()
      def before_run(self, run_context):
        self._step += 1
        return tf.train.SessionRunArgs(loss) # Asks for loss value.
      def after_run(self, run_context, run_values):
        if self._step % FLAGS.log_frequency == 0:
          current_time = time.time()
          duration = current_time - self._start_time
          self._start_time = current_time
          loss_value = run_values.results
          examples_per_sec = FLAGS.log_frequency * FLAGS.batch_size / duration
          sec_per_batch = float(duration / FLAGS.log_frequency)
          format_str = ('%s: step %d, loss = %.2f (%.1f examples/sec; %.3f '
                        'sec/batch)')
          print (format_str % (datetime.now(), self._step, loss_value,
                               examples_per_sec, sec_per_batch))
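begin initializes the step counter and the start time. before_run increments the step counter and asks the session to also fetch the loss value on each run. after_run prints the step, the loss and the throughput every log_frequency steps.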
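The two throughput figures printed by after_run follow directly from the elapsed time. The sketch below reproduces the arithmetic with assumed numbers (10 steps of 128 examples taking 2.0 seconds); throughput is a hypothetical helper name.

```python
def throughput(duration, log_frequency, batch_size):
    """Return (examples_per_sec, sec_per_batch) for one logging interval."""
    examples_per_sec = log_frequency * batch_size / duration
    sec_per_batch = float(duration / log_frequency)
    return examples_per_sec, sec_per_batch


# Assumed numbers: 10 steps of 128 examples took 2.0 seconds in total.
examples_per_sec, sec_per_batch = throughput(2.0, log_frequency=10,
                                             batch_size=128)
print('%.1f examples/sec; %.3f sec/batch' % (examples_per_sec, sec_per_batch))
```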
    with tf.train.MonitoredTrainingSession(
        checkpoint_dir=FLAGS.train_dir,
        hooks=[tf.train.StopAtStepHook(last_step=FLAGS.max_steps),
               tf.train.NanTensorHook(loss),
               _LoggerHook()],
        config=tf.ConfigProto(
            log_device_placement=FLAGS.log_device_placement)) as mon_sess:
      while not mon_sess.should_stop():
        mon_sess.run(train_op)
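tf.train.MonitoredTrainingSession is, literally, a session for monitored training. checkpoint_dir is the directory from which checkpoints are restored and to which they are saved (I do not fully understand this yet). tf.train.StopAtStepHook requests a stop once last_step is reached, and tf.train.NanTensorHook monitors whether the loss becomes NaN. As long as no stop has been requested, training continues.
That gives a rough picture of the training process. First the images are read and preprocessed, then shuffled in a queue and fed to the network. Next the model is built, the loss is computed, the learning rate decays exponentially, and the gradients are computed and used to move toward the optimum (my guess is that this happens when the gradients are applied). Finally training starts.
Attached: the address of the CIFAR-10 example source code.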