Mini-batching

Mini-batching 是一个一次训练数据集的一小部分,而不是整个训练集的技术。它可以使内存较小、不能同时训练整个数据集的电脑也可以训练模型。

Mini-batching 从运算角度来说是低效的,因为你不能在所有样本中计算 loss。但是这点小代价也比根本不能运行模型要划算。

它跟随机梯度下降(SGD)结合在一起用也很有帮助。方法是在每一代训练之前,对数据进行随机混洗,然后创建 mini-batches,对每一个 mini-batch,用梯度下降训练网络权重。因为这些 batches 是随机的,你其实是在对每个 batch 做随机梯度下降(SGD)。

让我们看看你的机器能否训练出 MNIST 数据集的权重和偏置项。

from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf

n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

问题1

计算 train_features, train_labels, weights, 和 bias 分别占用了多少字节(byte)的内存。可以忽略头部空间,只需要计算实际需要多少内存来存储数据。

你也可以看这里了解一个 float32 占用多少内存。

train_features Shape: (55000, 784) Type: float32

train_labels Shape: (55000, 10) Type: float32

weights Shape: (784, 10) Type: float32

bias Shape: (10,) Type: float32
输入、权重和偏置项总共的内存空间需求是 174MB,并不是太多。你可以在 CPU 和 GPU 上训练整个数据集。

但将来你要用到的数据集可能是以 G 来衡量,甚至更多。你可以买更多的内存,但是会很贵。例如一个 12GB 显存容量的 Titan X GPU 会超过 1000 美金。所以,为了在你自己机器上运行大模型,你需要学会用 mini-batching。

让我们看下如何在 TensorFlow 下实现 mini-batching

TensorFlow Mini-batching

要使用 mini-batching,你首先要把你的数据集分成 batch。

不幸的是,有时候不可能把数据完全分割成相同数量的 batch。例如有 1000 个数据点,你想每个 batch 有 128 个数据。但是 1000 无法被 128 整除。你得到的结果是其中 7 个 batch 有 128 个数据点,一个 batch 有 104 个数据点。(7128 + 1104 = 1000)

batch 里面的数据点数量会不同的情况下,你需要利用 TensorFlow 的 tf.placeholder() 函数来接收这些不同的 batch。

继续上述例子,如果每个样本有 n_input = 784 特征,n_classes = 10 个可能的标签,features 的维度应该是 [None, n_input]labels 的维度是 [None, n_classes]

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

None 在这里做什么用呢?

None 维度在这里是一个 batch size 的占位符。在运行时,TensorFlow 会接收任何大于 0 的 batch size。

回到之前的例子,这个设置可以让你把 featureslabels 给到模型。无论 batch 中包含 128,还是 104 个数据点。

问题二

下列参数,会有多少 batch,最后一个 batch 有多少数据点?

features is (50000, 400)

labels is (50000, 10)

batch_size is 128
问题三
对 features 和 labels 实现一个 batches 函数。这个函数返回每个有最大 batch_size 数据点的 batch。下面有例子来说明一个示例 batches 函数的输出是什么。

# 4 个特征
example_features = [
    ['F11','F12','F13','F14'],
    ['F21','F22','F23','F24'],
    ['F31','F32','F33','F34'],
    ['F41','F42','F43','F44']]
# 4 个 label
example_labels = [
    ['L11','L12'],
    ['L21','L22'],
    ['L31','L32'],
    ['L41','L42']]

example_batches = batches(3, example_features, example_labels)
example_batches 变量如下:

[
    # 分 2 个 batch:
    #   第一个 batch 的 size 是 3
    #   第二个 batch 的 size 是 1
    [
        # size 为 3 的第一个 Batch
        [
            # 3 个特征样本
            # 每个样本有四个特征
            ['F11', 'F12', 'F13', 'F14'],
            ['F21', 'F22', 'F23', 'F24'],
            ['F31', 'F32', 'F33', 'F34']
        ], [
            # 3 个标签样本
            # 每个标签有两个 label
            ['L11', 'L12'],
            ['L21', 'L22'],
            ['L31', 'L32']
        ]
    ], [
        # size 为 1 的第二个 Batch 
        # 因为 batch size 是 3。所以四个样品中只有一个在这里。
        [
            # 1 一个样本特征
            ['F41', 'F42', 'F43', 'F44']
        ], [
            # 1 个 label
            ['L41', 'L42']
        ]
    ]
]
  • sandbox.py:
from quiz import batches
from pprint import pprint

# 4 Samples of features
example_features = [
    ['F11','F12','F13','F14'],
    ['F21','F22','F23','F24'],
    ['F31','F32','F33','F34'],
    ['F41','F42','F43','F44']]
# 4 Samples of labels
example_labels = [
    ['L11','L12'],
    ['L21','L22'],
    ['L31','L32'],
    ['L41','L42']]

# PPrint prints data structures like 2d arrays, so they are easier to read
pprint(batches(3, example_features, example_labels))
  • quiz.py
import math
def batches(batch_size, features, labels):
    """
    Create batches of features and labels
    :param batch_size: The batch size
    :param features: List of features
    :param labels: List of labels
    :return: Batches of (Features, Labels)
    """
    assert len(features) == len(labels)
    # TODO: Implement batching
    output_batches = []
    
    sample_size = len(features)
    for start_i in range(0, sample_size, batch_size):
        end_i = start_i + batch_size
        batch = [features[start_i:end_i], labels[start_i:end_i]]
        output_batches.append(batch)
        
    return output_batches

让我们用 mini-batching 来把 MNIST 特征和目标分批给到线性模型。

设定 batch size,用 batches 函数来分配所有数据。建议的 batch size 是 128,你也可以根据自己内存大小来改变它。

  • quiz,py
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
from helper import batches

learning_rate = 0.001
n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))


# TODO: Set batch size
batch_size = 784
assert batch_size is not None, 'You must set the batch size'

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    
    # TODO: Train optimizer on all batches
    for batch_features, batch_labels in batches(batch_size, train_features, train_labels):
        sess.run(optimizer, feed_dict={features: batch_features, labels: batch_labels})

    # Calculate accuracy for test dataset
    test_accuracy = sess.run(
        accuracy,
        feed_dict={features: test_features, labels: test_labels})

print('Test Accuracy: {}'.format(test_accuracy))

helper.py

import math
def batches(batch_size, features, labels):
    """
    Create batches of features and labels
    :param batch_size: The batch size
    :param features: List of features
    :param labels: List of labels
    :return: Batches of (Features, Labels)
    """
    assert len(features) == len(labels)
    outout_batches = []
    
    sample_size = len(features)
    for start_i in range(0, sample_size, batch_size):
        end_i = start_i + batch_size
        batch = [features[start_i:end_i], labels[start_i:end_i]]
        outout_batches.append(batch)
        
    return outout_batches

你可能感兴趣的:(Mini-batching)