What is a batch? I have prepared two explanations; take whichever you like.
If we compare the training data to a cut of beef ready for hotpot, then an epoch is the whole cut, a batch is one slice of it after cutting, and an iteration is swishing one slice through the pot (hungry yet?).
It is not people being fed, though, but the model. After setting up the three pillars of "model, strategy, algorithm", the next step is to run (train) this framework on the data and learn the best parameters.
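In numbers, a quick sketch (55,000 is the size of MNIST's training split, which the code below feeds in):
data_size = 55000       # the whole cut of beef: one epoch walks through all of it
batch_size = 100        # the thickness of one slice
iterations_per_epoch = data_size // batch_size   # 550 dips of the chopsticks per epoch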
Once again, two approaches are provided.
1. yield→generator
For the specifics of the syntax, see the linked article.
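The gist, as a minimal runnable sketch (count_up is an illustrative toy, not part of the training code): a function containing yield does not execute when called; it returns a generator that resumes from the last yield each time the for loop asks for the next value.
def count_up(n):
    for i in range(n):
        yield i   # pause here; resume from this point on the next request
for value in count_up(3):
    print(value)  # prints 0, 1, 2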
import numpy as np
import tensorflow as tf

# -------------- Function description -----------------
# sourceData_feature : the feature part of the training set
# sourceData_label   : the label part of the training set
# batch_size         : the thickness of each beef slice
# num_epochs         : how many times the whole cut gets cooked
# shuffle            : whether to shuffle the data
def batch_iter(sourceData_feature, sourceData_label, batch_size, num_epochs, shuffle=True):
    data_size = len(sourceData_feature)
    # samples / batch size; the leftover "tail" that cannot fill a batch is dropped
    num_batches_per_epoch = int(data_size / batch_size)
    for epoch in range(num_epochs):
        # Shuffle the data at each epoch
        if shuffle:
            shuffle_indices = np.random.permutation(np.arange(data_size))
            shuffled_data_feature = sourceData_feature[shuffle_indices]
            shuffled_data_label = sourceData_label[shuffle_indices]
        else:
            shuffled_data_feature = sourceData_feature
            shuffled_data_label = sourceData_label
        for batch_num in range(num_batches_per_epoch):  # batch_num runs from 0 to num_batches_per_epoch - 1
            start_index = batch_num * batch_size
            end_index = min((batch_num + 1) * batch_size, data_size)
            yield (shuffled_data_feature[start_index:end_index],
                   shuffled_data_label[start_index:end_index])
batchSize = 100   # the concrete slice thickness
Iterations = 0    # counts the iterations
# sess
sess = tf.Session()
sess.run(tf.global_variables_initializer())
# Iterate. Note that batch_iter is a yield-based generator, so it is consumed with a for loop
for (batchInput, batchLabels) in batch_iter(mnist.train.images, mnist.train.labels, batchSize, 30, shuffle=True):
    _, trainingLoss = sess.run([opt, loss], feed_dict={X: batchInput, y: batchLabels})
    if Iterations % 1000 == 0:  # report the accuracy every 1000 iterations
        train_accuracy = sess.run(accuracy, feed_dict={X: batchInput, y: batchLabels})
        print("step %d, training accuracy %g" % (Iterations, train_accuracy))
    Iterations = Iterations + 1
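The loop ends on its own once the generator is exhausted: with 55,000 training images, a batch size of 100 and 30 epochs, that is 550 × 30 = 16,500 iterations in total.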
2. slice_input_producer + batch
This again involves some background knowledge; see the two articles linked in the original post. The picture behind slice_input_producer, in words: it slices the input tensors along their first dimension and pushes individual (image, label) examples into a queue, shuffled if requested and for at most num_epochs passes; tf.train.batch then dequeues batch_size examples at a time to assemble each batch.
def get_batch_data(images, label, batch_Size):
    # slice the tensors into single examples and queue them, shuffled, for 20 epochs
    input_queue = tf.train.slice_input_producer([images, label], shuffle=True, num_epochs=20)
    # assemble batches from the queue; the smaller final batch of an epoch is kept
    image_batch, label_batch = tf.train.batch(input_queue, batch_size=batch_Size,
                                              num_threads=2, allow_smaller_final_batch=True)
    return image_batch, label_batch
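Unlike batch_iter above, get_batch_data returns symbolic tensors rather than numpy arrays: nothing has been read yet, and every sess.run on them dequeues one fresh batch.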
batchSize = 100   # the concrete slice thickness
batchInput, batchLabels = get_batch_data(mnist.train.images, mnist.train.labels, batchSize)
Iterations = 0    # counts the iterations
# sess
sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())  # this very line: the num_epochs counter lives in a local variable
coord = tf.train.Coordinator()
# To actually put data into the queue, tf.train.start_queue_runners must be called to launch
# the threads that fill it; only then can the compute ops read anything out. Otherwise the
# queue stays empty and sess.run blocks forever.
threads = tf.train.start_queue_runners(sess, coord)
try:
    while not coord.should_stop():
        # pull one concrete batch out of the queue
        BatchInput, BatchLabels = sess.run([batchInput, batchLabels])
        _, trainingLoss = sess.run([opt, loss], feed_dict={X: BatchInput, y: BatchLabels})
        if Iterations % 1000 == 0:
            train_accuracy = accuracy.eval(session=sess, feed_dict={X: BatchInput, y: BatchLabels})
            print("step %d, training accuracy %g" % (Iterations, train_accuracy))
        Iterations = Iterations + 1
except tf.errors.OutOfRangeError:
    # raised once the queue has delivered num_epochs passes over the data
    train_accuracy = accuracy.eval(session=sess, feed_dict={X: BatchInput, y: BatchLabels})
    print("step %d, training accuracy %g" % (Iterations, train_accuracy))
    print('Done training')
finally:
    coord.request_stop()
    coord.join(threads)
    # sess.close()
Method 1: yield→generator, 30 epochs
Before the run, the python.exe process occupied 402 MB of memory.
During the run, memory stayed at roughly 865 MB.
The 30 epochs took 49.8 s.
Method 2: slice_input_producer + batch
The slice_input_producer step alone raised memory usage from 410 MB to 583 MB.
During training, memory usage fluctuated, at times topping 1 GB.
The 20 epochs took 199 s.
Summary: for now, method 1 is considerably more efficient than method 2: roughly 1.7 s per epoch versus about 10 s per epoch.
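A likely contributor to the gap, judging from the code above: in method 2 every batch is first pulled into Python with sess.run([batchInput, batchLabels]) and then copied straight back into the graph through feed_dict, so the queue pipeline gains nothing and only adds thread and copy overhead. Wiring the queue output directly into the model avoids that round trip. A rough sketch, where build_model is a hypothetical stand-in for the network defined earlier in the post:
logits = build_model(batchInput)   # hypothetical helper: the same network, applied to the queue tensors
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=batchLabels, logits=logits))
opt = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
# training then becomes just sess.run([opt, loss]), with no feed_dict at all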