Mini-batching is a technique for training on small subsets of the dataset at a time, rather than on the entire training set at once. It makes it possible to train a model on a machine whose memory is too small to hold the full dataset.
Mini-batching is computationally inefficient, since you can't compute the loss across all samples at once. But that small cost is well worth it compared to not being able to run the model at all.
It also combines well with stochastic gradient descent (SGD): shuffle the data at the start of each epoch, split it into mini-batches, and train the network weights with gradient descent on each mini-batch. Since the batches are random, you are effectively performing SGD on each one.
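For illustration, here's a minimal sketch of that loop on made-up NumPy data (the array names, sizes, and epoch count are invented for this example, and the gradient step itself is left as a comment):

import numpy as np

# Toy data: 1000 samples with 4 features each (made up for illustration)
X = np.random.randn(1000, 4).astype(np.float32)
y = np.random.randn(1000, 1).astype(np.float32)

batch_size = 128
for epoch in range(10):
    # Re-shuffle once per epoch so every batch is a random sample
    perm = np.random.permutation(len(X))
    X, y = X[perm], y[perm]
    for start in range(0, len(X), batch_size):
        X_batch = X[start:start + batch_size]
        y_batch = y[start:start + batch_size]
        # ...run one gradient-descent step on (X_batch, y_batch) here...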
Let's see whether your machine can train weights and a bias for the MNIST dataset.
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
n_input = 784 # MNIST data input (img shape: 28*28)
n_classes = 10 # MNIST total classes (0-9 digits)
# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)
# The features are already scaled and the data is shuffled
train_features = mnist.train.images
test_features = mnist.test.images
train_labels = mnist.train.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)
# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))
Question 1
Calculate how many bytes of memory train_features, train_labels, weights, and bias each occupy. Ignore any header overhead and just calculate how much memory is needed to store the data itself. (A single float32 occupies 4 bytes.)
train_features Shape: (55000, 784) Type: float32
train_labels Shape: (55000, 10) Type: float32
weights Shape: (784, 10) Type: float32
bias Shape: (10,) Type: float32
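As a sanity check, here's the arithmetic as a quick sketch (each float32 is 4 bytes):

bytes_per_float = 4
train_features_bytes = 55000 * 784 * bytes_per_float  # 172,480,000
train_labels_bytes = 55000 * 10 * bytes_per_float      # 2,200,000
weights_bytes = 784 * 10 * bytes_per_float             # 31,360
bias_bytes = 10 * bytes_per_float                      # 40

total = (train_features_bytes + train_labels_bytes
         + weights_bytes + bias_bytes)
print(total)  # 174,711,400 bytes, roughly 174 MB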
The total memory required for the inputs, weights, and bias is about 174 MB, which is not that much. You could train this whole dataset on either a CPU or a GPU.
But the datasets you'll use in the future may be measured in gigabytes or more. You could buy more memory, but it's expensive; a Titan X GPU with 12 GB of video memory, for example, costs over $1,000. So, to run large models on your own machine, you need to learn how to use mini-batching.
Let's see how mini-batching is implemented in TensorFlow.
TensorFlow Mini-batching
To use mini-batching, you first have to split your dataset into batches.
Unfortunately, it is sometimes impossible to divide the data into batches of exactly equal size. For example, imagine you have 1,000 data points and you want each batch to hold 128. Since 1,000 isn't evenly divisible by 128, you end up with 7 batches of 128 data points and 1 batch of 104 data points. (7×128 + 1×104 = 1000)
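A one-liner makes that arithmetic explicit:

full_batches, remainder = divmod(1000, 128)
print(full_batches, remainder)  # 7 full batches, 104 points left for the last one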
Because the number of data points can vary from batch to batch, you need TensorFlow's tf.placeholder() function to accept batches of varying sizes.
Continuing the example above, if each sample has n_input = 784 features and there are n_classes = 10 possible labels, the dimensions of features should be [None, n_input] and the dimensions of labels should be [None, n_classes].
# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])
What is None doing here?
The None dimension is a placeholder for the batch size. At runtime, TensorFlow will accept any batch size greater than 0.
Going back to our earlier example, this setup lets you feed features and labels into the model whether a batch contains 128 data points or 104.
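As a minimal demonstration (a sketch that feeds dummy zero-filled arrays rather than real data), the same placeholder accepts both batch sizes:

import numpy as np
import tensorflow as tf

features = tf.placeholder(tf.float32, [None, 784])
shape_op = tf.shape(features)  # reports the runtime shape the placeholder received

with tf.Session() as sess:
    for size in [128, 104]:
        batch = np.zeros([size, 784], dtype=np.float32)
        print(sess.run(shape_op, feed_dict={features: batch}))
        # prints [128 784], then [104 784]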
Question 2
Given the parameters below, how many batches will there be, and how many data points will be in the final batch?
features is (50000, 400)
labels is (50000, 10)
batch_size is 128
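One way to work it out (a quick sketch; only the sample count and the batch size matter here):

import math

n_samples = 50000
batch_size = 128

n_batches = math.ceil(n_samples / batch_size)  # 391 batches
last_batch_size = n_samples % batch_size       # 80 data points in the last batch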
Question 3
Implement a batches function for features and labels. The function should return batches, each containing at most batch_size data points. The example below illustrates what a sample batches function would output.
# 4 samples of features
example_features = [
['F11','F12','F13','F14'],
['F21','F22','F23','F24'],
['F31','F32','F33','F34'],
['F41','F42','F43','F44']]
# 4 samples of labels
example_labels = [
['L11','L12'],
['L21','L22'],
['L31','L32'],
['L41','L42']]
example_batches = batches(3, example_features, example_labels)
The example_batches variable would look like this:
[
    # 2 batches:
    #   First batch has size 3
    #   Second batch has size 1
    [
        # First batch, size 3
        [
            # 3 samples of features
            # Each sample has 4 features
            ['F11', 'F12', 'F13', 'F14'],
            ['F21', 'F22', 'F23', 'F24'],
            ['F31', 'F32', 'F33', 'F34']
        ], [
            # 3 samples of labels
            # Each sample has 2 labels
            ['L11', 'L12'],
            ['L21', 'L22'],
            ['L31', 'L32']
        ]
    ], [
        # Second batch, size 1
        # Only one of the four samples is left, because the batch size is 3
        [
            # 1 sample of features
            ['F41', 'F42', 'F43', 'F44']
        ], [
            # 1 sample of labels
            ['L41', 'L42']
        ]
    ]
]
- sandbox.py:
from quiz import batches
from pprint import pprint
# 4 Samples of features
example_features = [
['F11','F12','F13','F14'],
['F21','F22','F23','F24'],
['F31','F32','F33','F34'],
['F41','F42','F43','F44']]
# 4 Samples of labels
example_labels = [
['L11','L12'],
['L21','L22'],
['L31','L32'],
['L41','L42']]
# pprint prints data structures like 2d arrays, so they are easier to read
pprint(batches(3, example_features, example_labels))
- quiz.py:
import math


def batches(batch_size, features, labels):
    """
    Create batches of features and labels
    :param batch_size: The batch size
    :param features: List of features
    :param labels: List of labels
    :return: Batches of (Features, Labels)
    """
    assert len(features) == len(labels)
    output_batches = []

    sample_size = len(features)
    for start_i in range(0, sample_size, batch_size):
        end_i = start_i + batch_size
        # Slicing past the end of the list just returns what's left,
        # so the final batch can be smaller than batch_size
        batch = [features[start_i:end_i], labels[start_i:end_i]]
        output_batches.append(batch)

    return output_batches
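Note that Python's list slicing takes care of the final, smaller batch automatically: a slice that runs past the end of a list simply returns whatever elements remain, so no special case is needed for the last batch.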
Let's use mini-batching to feed the MNIST features and labels into a linear model in batches.
Set the batch size and use the batches function to batch up all of the data. A batch size of 128 is a reasonable default, but you can adjust it based on how much memory your machine has.
- quiz.py:
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
from helper import batches
learning_rate = 0.001
n_input = 784 # MNIST data input (img shape: 28*28)
n_classes = 10 # MNIST total classes (0-9 digits)
# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)
# The features are already scaled and the data is shuffled
train_features = mnist.train.images
test_features = mnist.test.images
train_labels = mnist.train.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)
# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])
# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))
# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)
# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)
# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
# Set the batch size (128 is the suggested default; adjust for your memory)
batch_size = 128
assert batch_size is not None, 'You must set the batch size'

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    # Train the optimizer on all batches
    for batch_features, batch_labels in batches(batch_size, train_features, train_labels):
        sess.run(optimizer, feed_dict={features: batch_features, labels: batch_labels})

    # Calculate accuracy for test dataset
    test_accuracy = sess.run(
        accuracy,
        feed_dict={features: test_features, labels: test_labels})

print('Test Accuracy: {}'.format(test_accuracy))
- helper.py:
import math


def batches(batch_size, features, labels):
    """
    Create batches of features and labels
    :param batch_size: The batch size
    :param features: List of features
    :param labels: List of labels
    :return: Batches of (Features, Labels)
    """
    assert len(features) == len(labels)
    output_batches = []

    sample_size = len(features)
    for start_i in range(0, sample_size, batch_size):
        end_i = start_i + batch_size
        batch = [features[start_i:end_i], labels[start_i:end_i]]
        output_batches.append(batch)

    return output_batches
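Note that this trains for only a single pass (epoch) over the training data, so the test accuracy will be modest and will vary with the random weight initialization. Looping over the batches for several epochs, re-shuffling the data each time as described at the start of this section, would improve the result.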