TensorFlow is a deep learning framework with built-in support for distributed computation, and with Google behind it, it is becoming increasingly popular. I recently worked through one of the TensorFlow tutorials: classifying the CIFAR-10 dataset with a CNN. After reading the source code I was left with only a half-understanding, and since this example covers most of TensorFlow's core machinery, including the QueueRunner mechanism, TensorBoard visualization, and multi-GPU data-parallel programming, I decided to re-implement it from my own understanding and memory ("knowledge from paper always feels shallow; to truly understand, you must do the thing yourself"). I ran into a few problems along the way, so I am recording my detailed understanding of the program and those problems here. In the spirit of "know not only that it works, but why it works", the comments in the code lean toward explaining why each line is written the way it is, for future reference. (The theory behind convolutional neural networks is covered extensively elsewhere online, so I will not go into it here.)
A standard machine learning program consists of four parts: data input, model definition, model training, and model evaluation, which map naturally onto four .py files.
(I) Data input (input_dataset.py)
Conceptually, this part builds the data pipeline; the data flows as "binary files -> filename queue -> example queue -> dequeued data batch". The batches feed into the deep network for forward propagation, which is discussed in the model definition part. Building the pipeline relies on TensorFlow's queue mechanism, described in detail in my earlier post "TensorFlow读取二进制文件数据到队列". Also, reading the raw data files requires following the file format itself; anyone who has written C code to read binary files will find this very familiar. The code is as follows.
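Schematically, each stage of the pipe corresponds to one TensorFlow call; the sketch below is a minimal stand-alone version (the file list and record length are placeholders, and the full CIFAR-10 specifics are filled in by the code that follows):

import tensorflow as tf

filenames = ['./cifar-10-batches-bin/data_batch_1.bin']       # placeholder file list
record_len = 1 + 32*32*3                                      # label byte + image bytes

filename_queue = tf.train.string_input_producer(filenames)    # binary files -> filename queue
reader = tf.FixedLengthRecordReader(record_bytes=record_len)  # reader for fixed-length records
key, record = reader.read(filename_queue)                     # filename queue -> raw record string
example = tf.decode_raw(record, tf.uint8)                     # raw bytes -> uint8 tensor
example.set_shape([record_len])                               # shuffle_batch needs a static shape
batch = tf.train.shuffle_batch([example], batch_size=128,     # example queue -> data batch
                               capacity=1000, min_after_dequeue=500, num_threads=4)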
import os
import tensorflow as tf

# Fixed size to which every training image is cropped before entering the network
fixed_height = 24
fixed_width = 24

train_samples_per_epoch = 50000
test_samples_per_epoch = 10000
data_dir = './cifar-10-batches-bin'  # directory holding the CIFAR-10 binary files
batch_size = 128

def read_cifar10(filename_queue):
    '''Read and parse one labelled example from the CIFAR-10 binary files.

    Each record consists of 1 label byte followed by a 32x32x3 image stored
    channel-first (depth, height, width).
    '''
    class Image(object):
        pass
    image = Image()
    image.height = 32
    image.width = 32
    image.depth = 3
    label_bytes = 1
    image_bytes = image.height*image.width*image.depth
    Bytes_to_read = label_bytes+image_bytes
    # Every record has the same length, so a fixed-length record reader matches the format
    reader = tf.FixedLengthRecordReader(record_bytes=Bytes_to_read)
    # read() dequeues a filename from the queue and returns (key, value) string tensors
    image.key, value_str = reader.read(filename_queue)
    # Decode the raw byte string into a uint8 vector of length Bytes_to_read
    value = tf.decode_raw(bytes=value_str, out_type=tf.uint8)
    # The first byte is the label; the remaining bytes are the image data
    image.label = tf.slice(input_=value, begin=[0], size=[label_bytes])
    data_mat = tf.slice(input_=value, begin=[label_bytes], size=[image_bytes])
    data_mat = tf.reshape(data_mat, (image.depth, image.height, image.width))
    # Convert from (depth, height, width) to the conventional (height, width, depth)
    transposed_value = tf.transpose(data_mat, perm=[1, 2, 0])
    image.mat = transposed_value
    return image

def get_batch_samples(img_obj, min_samples_in_queue, batch_size, shuffle_flag):
    '''Assemble single examples into batches through a second (example) queue.

    tf.train.batch/shuffle_batch add a QueueRunner to the graph; with
    num_threads=4, four threads fill the example queue concurrently.
    min_after_dequeue keeps enough examples in the queue after each dequeue
    so that shuffling mixes the data well; capacity must exceed
    min_after_dequeue by at least a few batches.
    '''
    if shuffle_flag == False:
        image_batch, label_batch = tf.train.batch(tensors=img_obj,
                                                  batch_size=batch_size,
                                                  num_threads=4,
                                                  capacity=min_samples_in_queue+3*batch_size)
    else:
        image_batch, label_batch = tf.train.shuffle_batch(tensors=img_obj,
                                                          batch_size=batch_size,
                                                          num_threads=4,
                                                          min_after_dequeue=min_samples_in_queue,
                                                          capacity=min_samples_in_queue+3*batch_size)
    tf.image_summary('input_image', image_batch, max_images=6)  # visualize a few inputs in TensorBoard
    return image_batch, tf.reshape(label_batch, shape=[batch_size])

def preprocess_input_data():
    '''Build the training input pipeline: filename queue -> reader -> augmentation -> batching.'''
    filenames = [os.path.join(data_dir, 'data_batch_%d.bin' % i) for i in range(1, 6)]
    for f in filenames:
        if not tf.gfile.Exists(f):
            raise ValueError('fail to find file:'+f)
    filename_queue = tf.train.string_input_producer(string_tensor=filenames)
    image = read_cifar10(filename_queue)
    new_img = tf.cast(image.mat, tf.float32)
    tf.image_summary('raw_input_image', tf.reshape(new_img, [1, 32, 32, 3]))
    # Data augmentation: random crop, brightness, horizontal flip and contrast perturbation
    new_img = tf.random_crop(new_img, size=(fixed_height, fixed_width, 3))
    new_img = tf.image.random_brightness(new_img, max_delta=63)
    new_img = tf.image.random_flip_left_right(new_img)
    new_img = tf.image.random_contrast(new_img, lower=0.2, upper=1.8)
    # Normalize each image to zero mean and unit norm
    final_img = tf.image.per_image_whitening(new_img)

    min_samples_ratio_in_queue = 0.4
    min_samples_in_queue = int(min_samples_ratio_in_queue*train_samples_per_epoch)
    return get_batch_samples([final_img, image.label], min_samples_in_queue, batch_size, shuffle_flag=True)
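As a quick sanity check, the pipeline can be exercised on its own before any model is attached (a minimal sketch, assuming the CIFAR-10 binaries sit in ./cifar-10-batches-bin):

import tensorflow as tf
import input_dataset

with tf.Graph().as_default():
    img_batch, label_batch = input_dataset.preprocess_input_data()
    with tf.Session() as sess:
        # Without starting the queue runners, sess.run would block forever on empty queues
        tf.train.start_queue_runners(sess=sess)
        imgs, labels = sess.run([img_batch, label_batch])
        print(imgs.shape)    # expected: (128, 24, 24, 3)
        print(labels.shape)  # expected: (128,)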
(II) Model definition (forward_prop.py)
Defining the model means fixing the network structure (the Graph) through which the input signal forward-propagates. The structure used in this example is "input layer -> convolutional layer -> pooling layer -> normalization layer -> convolutional layer -> normalization layer -> pooling layer -> fully connected layer -> fully connected layer -> softmax output layer". Unlike many other deep models, this one inserts local response normalization (LRN) layers: because the ReLU (rectified linear unit) activation maps its input to [0, +inf) and is therefore unbounded, LRN damps large responses and encourages competition between adjacent feature maps, which helps generalization. The detailed code and commentary follow.
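Before reading the code, it helps to trace the tensor shapes layer by layer; this also shows where the 6*6*64 input size of the first fully connected layer comes from (the trace assumes the 24x24 crops produced by input_dataset):

# input:           (batch, 24, 24, 3)
# conv1, 5x5/1:    (batch, 24, 24, 64)   # 'SAME' padding keeps height and width
# pool1, 3x3/2:    (batch, 12, 12, 64)   # stride 2 halves the spatial size
# norm1 (LRN):     (batch, 12, 12, 64)   # normalization does not change shape
# conv2, 5x5/1:    (batch, 12, 12, 64)
# norm2 (LRN):     (batch, 12, 12, 64)
# pool2, 3x3/2:    (batch,  6,  6, 64)
# reshape:         (batch, 6*6*64) = (batch, 2304)
# fc1:             (batch, 384)
# fc2:             (batch, 192)
# softmax layer:   (batch, 10)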
import tensorflow as tf
import input_dataset

height = input_dataset.fixed_height
width = input_dataset.fixed_width
train_samples_per_epoch = input_dataset.train_samples_per_epoch
test_samples_per_epoch = input_dataset.test_samples_per_epoch

# Hyperparameters for the variable moving averages and the learning-rate schedule
moving_average_decay = 0.9999
num_epochs_per_decay = 350.0
learning_rate_decay_factor = 0.1
initial_learning_rate = 0.1

def variable_on_cpu(name, shape, dtype, initializer):
    # Pin variables to the CPU so they can be shared across GPUs in multi-GPU training
    with tf.device("/cpu:0"):
        return tf.get_variable(name=name,
                               shape=shape,
                               initializer=initializer,
                               dtype=dtype)

def variable_on_cpu_with_collection(name, shape, dtype, stddev, wd):
    # Like variable_on_cpu, but also adds an L2 weight-decay term to the
    # 'losses' collection so that loss() can pick it up later
    with tf.device("/cpu:0"):
        weight = tf.get_variable(name=name,
                                 shape=shape,
                                 initializer=tf.truncated_normal_initializer(stddev=stddev, dtype=dtype))
        if wd is not None:
            weight_decay = tf.mul(tf.nn.l2_loss(weight), wd, name='weight_loss')
            tf.add_to_collection(name='losses', value=weight_decay)
        return weight

def losses_summary(total_loss):
    '''Attach scalar summaries to every loss.

    Raw mini-batch losses are noisy, so besides each raw value we also plot
    its exponential moving average, which gives a much smoother curve in
    TensorBoard.
    '''
    average_op = tf.train.ExponentialMovingAverage(decay=0.9)
    losses = tf.get_collection(key='losses')
    # apply() creates and maintains a shadow (averaged) variable for every listed loss
    maintain_averages_op = average_op.apply(losses+[total_loss])
    for i in losses+[total_loss]:
        tf.scalar_summary(i.op.name+'_raw', i)
        tf.scalar_summary(i.op.name, average_op.average(i))
    return maintain_averages_op

def one_step_train(total_loss, step):
    batch_count = int(train_samples_per_epoch/input_dataset.batch_size)
    decay_step = batch_count*num_epochs_per_decay
    # Decay the learning rate by learning_rate_decay_factor every
    # num_epochs_per_decay epochs; staircase=True makes the decay step-wise
    lr = tf.train.exponential_decay(learning_rate=initial_learning_rate,
                                    global_step=step,
                                    decay_steps=decay_step,
                                    decay_rate=learning_rate_decay_factor,
                                    staircase=True)
    tf.scalar_summary('learning_rate', lr)
    losses_movingaverage_op = losses_summary(total_loss)
    # Ensure the loss averages are updated before each gradient step
    with tf.control_dependencies(control_inputs=[losses_movingaverage_op]):
        trainer = tf.train.GradientDescentOptimizer(learning_rate=lr)
        gradient_pairs = trainer.compute_gradients(loss=total_loss)
    gradient_update = trainer.apply_gradients(grads_and_vars=gradient_pairs, global_step=step)
    # Maintain moving averages of all trainable variables; evaluation later
    # restores these averaged weights instead of the raw ones
    variables_average_op = tf.train.ExponentialMovingAverage(decay=moving_average_decay, num_updates=step)
    maintain_variable_average_op = variables_average_op.apply(var_list=tf.trainable_variables())
    with tf.control_dependencies(control_inputs=[gradient_update, maintain_variable_average_op]):
        # no_op does nothing itself; running it forces its dependencies to run first
        gradient_update_optimizor = tf.no_op()
    return gradient_update_optimizor

def network(images):
    '''Forward graph: conv -> pool -> norm -> conv -> norm -> pool -> fc -> fc -> linear logits.'''
    with tf.variable_scope(name_or_scope='conv1') as scope:
        weight = variable_on_cpu_with_collection(name='weight',
                                                 shape=(5, 5, 3, 64),
                                                 dtype=tf.float32,
                                                 stddev=0.05,
                                                 wd=0.0)
        bias = variable_on_cpu(name='bias', shape=(64), dtype=tf.float32, initializer=tf.constant_initializer(value=0.0))
        conv1_in = tf.nn.conv2d(input=images, filter=weight, strides=(1, 1, 1, 1), padding='SAME')
        conv1_in = tf.nn.bias_add(value=conv1_in, bias=bias)
        conv1_out = tf.nn.relu(conv1_in)

    pool1 = tf.nn.max_pool(value=conv1_out, ksize=(1, 3, 3, 1), strides=(1, 2, 2, 1), padding='SAME')
    # Local response normalization; useful because ReLU activations are unbounded
    norm1 = tf.nn.lrn(input=pool1, depth_radius=4, bias=1.0, alpha=0.001/9.0, beta=0.75)

    with tf.variable_scope(name_or_scope='conv2') as scope:
        weight = variable_on_cpu_with_collection(name='weight',
                                                 shape=(5, 5, 64, 64),
                                                 dtype=tf.float32,
                                                 stddev=0.05,
                                                 wd=0.0)
        bias = variable_on_cpu(name='bias', shape=(64), dtype=tf.float32, initializer=tf.constant_initializer(value=0.1))
        conv2_in = tf.nn.conv2d(norm1, weight, strides=(1, 1, 1, 1), padding='SAME')
        conv2_in = tf.nn.bias_add(conv2_in, bias)
        conv2_out = tf.nn.relu(conv2_in)

    norm2 = tf.nn.lrn(input=conv2_out, depth_radius=4, bias=1.0, alpha=0.001/9.0, beta=0.75)

    pool2 = tf.nn.max_pool(value=norm2, ksize=(1, 3, 3, 1), strides=(1, 2, 2, 1), padding='SAME')

    # Flatten the (batch, 6, 6, 64) feature maps for the fully connected layers
    reshaped_pool2 = tf.reshape(tensor=pool2, shape=(-1, 6*6*64))

    with tf.variable_scope(name_or_scope='fully_connected_layer1') as scope:
        weight = variable_on_cpu_with_collection(name='weight',
                                                 shape=(6*6*64, 384),
                                                 dtype=tf.float32,
                                                 stddev=0.04,
                                                 wd=0.004)
        bias = variable_on_cpu(name='bias', shape=(384), dtype=tf.float32, initializer=tf.constant_initializer(value=0.1))
        fc1_in = tf.matmul(reshaped_pool2, weight)+bias
        fc1_out = tf.nn.relu(fc1_in)

    with tf.variable_scope(name_or_scope='fully_connected_layer2') as scope:
        weight = variable_on_cpu_with_collection(name='weight',
                                                 shape=(384, 192),
                                                 dtype=tf.float32,
                                                 stddev=0.04,
                                                 wd=0.004)
        bias = variable_on_cpu(name='bias', shape=(192), dtype=tf.float32, initializer=tf.constant_initializer(value=0.1))
        fc2_in = tf.matmul(fc1_out, weight)+bias
        fc2_out = tf.nn.relu(fc2_in)

    with tf.variable_scope(name_or_scope='softmax_layer') as scope:
        weight = variable_on_cpu_with_collection(name='weight',
                                                 shape=(192, 10),
                                                 dtype=tf.float32,
                                                 stddev=1/192.0,  # note: 1/192 would be integer division (i.e. 0) in Python 2
                                                 wd=0.0)
        bias = variable_on_cpu(name='bias', shape=(10), dtype=tf.float32, initializer=tf.constant_initializer(value=0.0))
        # Return the unnormalized logits: sparse_softmax_cross_entropy_with_logits
        # in loss() applies softmax internally, so calling tf.nn.softmax here
        # would squash the scores twice and distort the loss
        classifier_out = tf.matmul(fc2_out, weight)+bias
    return classifier_out

def loss(logits, labels):
    labels = tf.cast(x=labels, dtype=tf.int32)
    # Cross entropy on the unnormalized logits; labels are class indices, not one-hot vectors
    cross_entropy_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels, name='likelihood_loss')
    cross_entropy_loss = tf.reduce_mean(cross_entropy_loss, name='cross_entropy_loss')
    tf.add_to_collection(name='losses', value=cross_entropy_loss)
    # Total loss = cross entropy + every weight-decay term in the 'losses' collection
    return tf.add_n(inputs=tf.get_collection(key='losses'), name='total_loss')
(III) Model training (train.py)
Model training is essentially a parameter-search problem: finding the model parameters that minimize the loss. The most common optimizer is mini-batch stochastic gradient descent (mini-batch as opposed to online learning, which updates on one example at a time). To guard against overfitting, the loss here includes a regularization term. The code and commentary follow.
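Concretely, the regularization term is the weight decay that forward_prop accumulated in the 'losses' collection; written out, the total loss that tf.add_n assembles is (a sketch, using the wd values from the code above):

# total_loss = cross_entropy
#            + 0.004 * l2_loss(fc1_weight)     # wd=0.004 on both fully connected layers
#            + 0.004 * l2_loss(fc2_weight)
#            + 0.0   * l2_loss(conv1_weight)   # conv and softmax layers use wd=0.0,
#            + 0.0   * l2_loss(conv2_weight)   # so they contribute nothing here
#            + 0.0   * l2_loss(softmax_weight)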
import input_dataset
import forward_prop
import tensorflow as tf
import os
import numpy as np

max_iter_num = 100000
checkpoint_path = './checkpoint'
event_log_path = './event-log'

def train():
    with tf.Graph().as_default():
        # global_step counts gradient updates; it must not be trainable itself
        global_step = tf.Variable(initial_value=0, trainable=False)
        img_batch, label_batch = input_dataset.preprocess_input_data()
        logits = forward_prop.network(img_batch)
        total_loss = forward_prop.loss(logits, label_batch)
        one_step_gradient_update = forward_prop.one_step_train(total_loss, global_step)
        # A Saver for checkpoints, and a single op that evaluates every summary at once
        saver = tf.train.Saver(var_list=tf.all_variables())
        all_summary_obj = tf.merge_all_summaries()
        initiate_variables = tf.initialize_all_variables()

        with tf.Session(config=tf.ConfigProto(log_device_placement=False)) as sess:
            sess.run(initiate_variables)
            # Start the threads that feed the filename and example queues
            tf.train.start_queue_runners(sess=sess)
            Event_writer = tf.train.SummaryWriter(logdir=event_log_path, graph=sess.graph)
            for step in range(max_iter_num):
                _, loss_value = sess.run(fetches=[one_step_gradient_update, total_loss])
                assert not np.isnan(loss_value)  # a NaN loss means training has diverged
                if step % 10 == 0:
                    print('step %d, the loss_value is %.2f' % (step, loss_value))
                if step % 100 == 0:
                    # Evaluate all summaries and write them as one event record
                    all_summaries = sess.run(all_summary_obj)
                    Event_writer.add_summary(summary=all_summaries, global_step=step)
                if step % 1000 == 0 or (step+1) == max_iter_num:
                    variables_save_path = os.path.join(checkpoint_path, 'model-parameters.bin')
                    saver.save(sess, variables_save_path, global_step=step)

if __name__ == '__main__':
    train()
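With the summaries written to ./event-log, training progress (loss curves, learning rate, sample input images) can be watched live by running tensorboard --logdir=./event-log and opening the address it prints in a browser.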
(IV) Model evaluation (evaluate.py)
Once trained, a machine learning model must be tested on a held-out test set to judge its performance; common metrics include precision (accuracy) and recall. As an aside, some workflows split the data three ways: a training set, a validation set, and a test set, where the validation set also serves to combat overfitting. Here we fight overfitting by regularizing the model parameters instead, so no such three-way split is needed. The code and commentary follow.
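The per-example correctness test is done by tf.nn.in_top_k, which deserves a tiny illustration (the values here are made up): element i of the result is True iff targets[i] is among the k highest scores in row i of predictions.

import tensorflow as tf

predictions = tf.constant([[0.1, 0.8, 0.1],
                           [0.6, 0.3, 0.1]])
targets = tf.constant([1, 2])
hits = tf.nn.in_top_k(predictions=predictions, targets=targets, k=1)
with tf.Session() as sess:
    print(sess.run(hits))  # [ True False]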
import tensorflow as tf
import input_dataset
import forward_prop
import train
import math
import numpy as np

def eval_once(summary_op, summary_writer, saver, predict_true_or_false):
    with tf.Session() as sess:
        # Load the latest checkpoint written by train.py
        checkpoint_proto = tf.train.get_checkpoint_state(checkpoint_dir=train.checkpoint_path)
        if checkpoint_proto and checkpoint_proto.model_checkpoint_path:
            saver.restore(sess, checkpoint_proto.model_checkpoint_path)
        else:
            print('checkpoint file not found!')
            return
        # The input pipeline uses queues, so queue-runner threads must be started here too
        coord = tf.train.Coordinator()
        try:
            threads = []
            for queue_runner in tf.get_collection(key=tf.GraphKeys.QUEUE_RUNNERS):
                threads.extend(queue_runner.create_threads(sess, coord=coord, daemon=True, start=True))

            test_batch_num = int(math.ceil(input_dataset.test_samples_per_epoch/input_dataset.batch_size))
            iter_num = 0
            true_test_num = 0
            # Count over whole batches, since the queue delivers full batches only
            total_test_num = test_batch_num*input_dataset.batch_size

            while iter_num < test_batch_num and not coord.should_stop():
                result_judge = sess.run([predict_true_or_false])
                true_test_num += np.sum(result_judge)
                iter_num += 1
            precision = true_test_num/total_test_num
            print("The test precision is %.3f" % precision)
        except Exception as e:
            coord.request_stop(e)
        coord.request_stop()
        coord.join(threads)

def evaluate():
    with tf.Graph().as_default() as g:
        # Assumes input_dataset also provides an eval-mode pipeline (reading
        # test_batch.bin without shuffling), analogous to preprocess_input_data
        img_batch, labels = input_dataset.input_data(eval_flag=True)
        logits = forward_prop.network(img_batch)
        # A prediction counts as correct if the true label gets the top score
        predict_true_or_false = tf.nn.in_top_k(predictions=logits, targets=labels, k=1)
        # Restore the moving-average (shadow) versions of the trained variables
        moving_average_op = tf.train.ExponentialMovingAverage(decay=forward_prop.moving_average_decay)
        variables_to_restore = moving_average_op.variables_to_restore()
        saver = tf.train.Saver(var_list=variables_to_restore)

        summary_op = tf.merge_all_summaries()
        summary_writer = tf.train.SummaryWriter(logdir='./event-log-test', graph=g)
        eval_once(summary_op, summary_writer, saver, predict_true_or_false)
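A note on variables_to_restore(): during training, apply() maintained a shadow variable holding the exponential moving average of each trainable variable, and these averaged weights typically generalize a little better than the final raw values. variables_to_restore() returns a dict mapping each shadow variable's name to the corresponding graph variable, so saver.restore() loads the averaged values into the ordinary variables, roughly:

# {'conv1/weight/ExponentialMovingAverage': <Variable conv1/weight>,
#  'conv1/bias/ExponentialMovingAverage':   <Variable conv1/bias>,
#  ...}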
In addition, here are the error messages I ran into while writing the program, recorded together for future reference.
(1) "SyntaxError: positional argument follows keyword argument"
Cause: when passing arguments to a function in Python, either use keyword arguments throughout or positional arguments throughout; when the two are mixed, all positional arguments must come before the keyword arguments. For example, given
def test(a, b, c):
return a+b+c
all three of the following calls are valid:
test(1,2,c=3)
test(1,2,3)
test(a=1, b=2, c=3)
(2) The statement "with tf.Graph.as_default()" raises "TypeError: as_default() missing 1 required positional argument: 'self'"
Cause: it should be tf.Graph().as_default(); as_default() must be called on a Graph instance, not on the Graph class itself.
(3) TensorBoard reports "inner server error" on startup
Cause: the Lantern proxy running on my machine interfered with TensorBoard's local web server.
(4) "TypeError: int() argument must be a string, a bytes-like object or a number, not 'Variable'"
Cause: the shape argument of tf.get_variable() cannot be a tensor. The two cases below illustrate the difference:
Case 1: reshape = tf.reshape(pool2, [batch_size, -1])
        dim = reshape.get_shape()[1].value  # reshape.get_shape()[1] is a Dimension object; its .value attribute is a plain Python int
Case 2: reshaped_pool2 = tf.reshape(tensor=pool2, shape=(batch_size, -1))
        dim = tf.shape(reshaped_pool2)[1]   # dim is a tensor, known only at run time
In tf.get_variable(name, shape=[dim, 384], initializer=initializer, dtype=dtype), the shape argument must be a fully defined sequence of concrete Python integers, so the dim from Case 1 works while the dim from Case 2 raises the error above.
(5) The loss printed at each iteration is absurdly large, for example:
step 0, the loss_value is 22439.82
step 10, the loss_value is 6426354679171382723219403309056.00
Cause: the standard deviation used to initialize the weights was set to an overly large constant,
e.g. weight = tf.get_variable(name=name, shape=shape, initializer=tf.truncated_normal_initializer(stddev=0.5, dtype=dtype))
(6) This program also runs under TensorFlow 0.8.0, except that the keyword argument tensors of tf.train.batch and tf.train.shuffle_batch must be renamed tensor_list.
(7) "UnboundLocalError: local variable 'CONSTANT' referenced before assignment"
Cause: consider the following snippet,
CONSTANT = 0
def modifyConstant():
    print(CONSTANT)
    CONSTANT += 1
Because CONSTANT is assigned inside the function (CONSTANT += 1), Python treats it as a local variable throughout the whole function body, so the earlier print(CONSTANT) refers to a local variable that has not been assigned yet, hence the error.
To read and modify a global variable inside a function, declare it with the keyword global, for example:
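CONSTANT = 0
def modifyConstant():
    global CONSTANT  # refer to the module-level CONSTANT instead of creating a local
    print(CONSTANT)
    CONSTANT += 1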
(8) Be careful to distinguish tf.reshape() from tf.transpose(): the former re-partitions a tensor into a new shape following the order in which its elements are stored, while the latter permutes the tensor's axes, i.e. rotates it spatially.
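A small example (values made up) makes the difference concrete; both ops receive the same 2x3 tensor but produce different results:

import tensorflow as tf

x = tf.constant([[1, 2, 3],
                 [4, 5, 6]])      # shape (2, 3), stored row by row
r = tf.reshape(x, (3, 2))         # keeps storage order: [[1, 2], [3, 4], [5, 6]]
t = tf.transpose(x, perm=[1, 0])  # swaps the axes:      [[1, 4], [2, 5], [3, 6]]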
Reference: https://www.tensorflow.org/versions/r0.11/tutorials/deep_cnn/index.html#convolutional-neural-networks