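Alright, let's write a program that learns how to recognize handwritten digits, using stochastic gradient descent and the MNIST training data. We'll do this with a short Python (2.7) program, just 74 lines of code! The first thing we need is to get the MNIST data. If you're a git user then you can obtain the data by cloning the code repository for this book: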
git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git
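Incidentally, when I described the MNIST data earlier, I said it was split into 60,000 training images and 10,000 test images. That's the official MNIST description. The data in the repository is split a little differently: the test images are left as they are, but the 60,000-image training set is divided into 50,000 images used for training and a separate set of 10,000 validation images. Later in the book we'll find the validation data useful in figuring out how to set certain hyper-parameters of the neural network, things like the learning rate, and so on. Although the validation data isn't part of the original MNIST specification, many people use MNIST in this fashion, and the use of validation data is common in neural networks. When I refer to the "MNIST training data" from now on, I mean this 50,000-image data set, not the original 60,000-image data set.
As noted earlier, the MNIST data set is based on two data sets collected by NIST, the United States' National Institute of Standards and Technology. To construct MNIST, the NIST data sets were stripped down and put into a more convenient format by Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. See this link for more details. The data set in my repository is in a form that makes it easy to load and manipulate the MNIST data in Python. I obtained this particular form of the data from the LISA machine learning laboratory at the University of Montreal (link).
Apart from the MNIST data we also need a Python library called Numpy, for doing fast linear algebra. If you don't already have Numpy installed, you can get it here.
Before giving a full listing, let me explain the core features of the neural network code. The centerpiece is a Network class, which we use to represent a neural network. Here's the code we use to initialize a Network object: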
class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]
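In this code, the list sizes contains the number of neurons in the respective layers. So, for example, if we want to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with the code:
net = Network([2, 3, 1])
The biases and weights in the Network object are all initialized randomly, using the Numpy np.random.randn function to generate Gaussian distributions with mean 0 and standard deviation 1. This random initialization gives our stochastic gradient descent algorithm a place to start from. In later chapters we'll find better ways of initializing the weights and biases, but this will do for now. Note that the Network initialization code assumes that the first layer of neurons is an input layer, and omits to set any biases for those neurons, since biases are only ever used in computing the outputs from later layers.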
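To see what this initialization produces, here is a small sketch that repeats the two list comprehensions from the constructor outside the class and inspects the resulting shapes. The seeded generator rng is an assumption added only to make the run repeatable; the book's code calls np.random.randn directly. The sketch is written so that it also runs under Python 3.

```python
import numpy as np

# Layer sizes from the text: 2 input neurons, 3 hidden neurons, 1 output neuron.
sizes = [2, 3, 1]

# Seeded generator (an assumption, for repeatability only).
rng = np.random.RandomState(0)

# One bias column vector per non-input layer.
biases = [rng.randn(y, 1) for y in sizes[1:]]
# One weight matrix per pair of adjacent layers, shaped (neurons out, neurons in).
weights = [rng.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

bias_shapes = [b.shape for b in biases]
weight_shapes = [w.shape for w in weights]
print(bias_shapes)    # [(3, 1), (1, 1)]
print(weight_shapes)  # [(3, 2), (1, 3)]
```

Each weight matrix has one row per neuron in the later layer and one column per neuron in the earlier layer, which is what lets a layer's weighted input be computed with a single matrix product.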
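With the weights and biases stored this way, the activations of one layer are obtained from those of the previous layer by $a' = \sigma(wa + b)$. There's quite a bit going on in this equation, so let's unpack it piece by piece. $a$ is the vector of activations of the second layer of neurons. To obtain $a'$ we multiply $a$ by the weight matrix $w$, and add the vector $b$ of biases. We then apply the function $\sigma$ elementwise to every entry in the vector $wa + b$. (This is called vectorizing the function $\sigma$.)
With all this in mind, it's easy to write code computing the output from a Network instance. We begin by defining the sigmoid function: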
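As a worked instance of this computation, the sketch below evaluates one layer by hand. The particular numbers in w, a and b are made up for illustration; they are chosen so that every weighted input comes out to exactly zero, which makes the result easy to check.

```python
import numpy as np

def sigmoid(z):
    """Elementwise sigmoid; works on scalars and on Numpy arrays."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values (assumptions): 2 neurons feeding a layer of 3 neurons.
w = np.array([[1.0, -1.0],
              [0.5,  0.5],
              [0.0,  2.0]])              # weight matrix, shape (3, 2)
a = np.array([[1.0], [2.0]])             # input activations, shape (2, 1)
b = np.array([[1.0], [-1.5], [-4.0]])    # biases, shape (3, 1)

# Weighted input: w a = [-1, 1.5, 4], and adding b gives [0, 0, 0].
z = np.dot(w, a) + b

# Applying sigmoid elementwise gives 0.5 for every neuron, since sigmoid(0) = 0.5.
a_next = sigmoid(z)
```

The result a_next is again a column vector, here of shape (3, 1), so it can be fed straight into the same computation for the following layer.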
def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))
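We then add a feedforward method to the Network class, which, given an input a for the network, returns the corresponding output: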
    def feedforward(self, a):
        """Return the output of the network if "a" is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a
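The loop above can be tried out without the rest of the class. The following sketch rewrites feedforward as a free function that takes the biases and weights as arguments (that signature is an assumption made for the sake of a self-contained example) and runs it on a [2, 3, 1] network whose parameters are all zero, so that every neuron outputs sigmoid(0) = 0.5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(biases, weights, a):
    """Standalone variant of the method: propagate "a" through every layer."""
    for b, w in zip(biases, weights):
        a = sigmoid(np.dot(w, a) + b)
    return a

sizes = [2, 3, 1]
# All-zero parameters (an assumption, chosen so the output is known in advance).
biases = [np.zeros((y, 1)) for y in sizes[1:]]
weights = [np.zeros((y, x)) for x, y in zip(sizes[:-1], sizes[1:])]

x = np.array([[0.3], [-0.7]])             # input as a (2, 1) column vector
output = feedforward(biases, weights, x)  # shape (1, 1), value 0.5

# Passing a flat (2,) vector instead does not raise an error, but broadcasting
# against the (n, 1) biases silently produces an output of the wrong shape.
flat_output = feedforward(biases, weights, x.ravel())
```

Here output has shape (1, 1) while flat_output has shape (1, 3), which is why the inputs are kept as (n, 1) column vectors throughout.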
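Of course, the main thing we want our Network objects to do is to learn. To that end we'll give them an SGD method which implements stochastic gradient descent. Here's the code. It's a little mysterious in a few places, but I'll break it down below, after the listing.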
    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic
        gradient descent.  The "training_data" is a list of tuples
        "(x, y)" representing the training inputs and the desired
        outputs.  The other non-optional parameters are
        self-explanatory.  If "test_data" is provided then the
        network will be evaluated against the test data after each
        epoch, and partial progress printed out.  This is useful for
        tracking progress, but slows things down substantially."""
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print "Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)
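The training_data is a list of tuples (x, y) representing the training inputs and corresponding desired outputs. The variables epochs and mini_batch_size are what you'd expect: the number of epochs to train for, and the size of the mini-batches to use when sampling. eta is the learning rate, $\eta$. If the optional argument test_data is supplied, then the program will evaluate the network after each epoch of training, and print out partial progress. This is useful for tracking progress, but slows things down substantially.
The code works as follows. In each epoch, it starts by randomly shuffling the training data, and then partitions it into mini-batches of the appropriate size. This is an easy way of sampling randomly from the training data. Then for each mini_batch we apply a single step of gradient descent. This is done by the code self.update_mini_batch(mini_batch, eta), which updates the network weights and biases according to a single iteration of gradient descent, using just the training data in mini_batch. Here's the code for the update_mini_batch method: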
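To make the role of mini_batch_size concrete, here is a sketch of the shuffle-and-slice step from the SGD listing above, run on ten toy (x, y) pairs. The toy data, the seeded random.Random instance and the copy of the list are assumptions for illustration; the listing shuffles training_data in place with random.shuffle and uses xrange, where this Python 3 compatible sketch uses range.

```python
import random

# Ten toy (x, y) pairs standing in for real training data (an assumption).
training_data = [(i, i * i) for i in range(10)]
mini_batch_size = 4

# Seeded generator and a copy of the list, so the run is repeatable and the
# original order is kept for comparison.
rng = random.Random(0)
shuffled = list(training_data)
rng.shuffle(shuffled)

# Slice the shuffled list into consecutive chunks of mini_batch_size items.
n = len(shuffled)
mini_batches = [shuffled[k:k + mini_batch_size]
                for k in range(0, n, mini_batch_size)]

batch_sizes = [len(mb) for mb in mini_batches]
print(batch_sizes)  # [4, 4, 2]
```

Every training example lands in exactly one mini-batch per epoch, and when the data set size is not a multiple of mini_batch_size the final mini-batch is simply shorter.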
    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]
Most of the work is done by the line
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
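This invokes something called the backpropagation algorithm, which is a fast way of computing the gradient of the cost function. So update_mini_batch works simply by computing these gradients for every training example in the mini_batch, and then updating self.weights and self.biases appropriately.
I'm not going to show the code for self.backprop right now. We'll study how backpropagation works in the next chapter, including the code for self.backprop. For now, just assume that it behaves as claimed, returning the appropriate gradient for the cost associated to the training example x.
Let's look at the full program, including the documentation strings, which I omitted above. Apart from self.backprop the program is self-explanatory; all the heavy lifting is done in self.SGD and self.update_mini_batch, which we've already discussed.
Note that while the program appears lengthy, much of the code is documentation strings intended to make the code easy to understand. In fact, the program contains just 74 lines of non-whitespace, non-comment code. All the code may be found on GitHub.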
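The arithmetic of the update can be followed on a single small weight matrix. In the sketch below the two per-example gradients are made-up numbers standing in for what self.backprop would return; only the accumulate-then-step pattern is taken from update_mini_batch.

```python
import numpy as np

eta = 3.0                                  # learning rate (illustrative value)
w = np.array([[0.5, -0.5]])                # current weights of one tiny layer

# Hypothetical per-example gradients, standing in for backprop's output.
per_example_grads = [np.array([[0.1,  0.2]]),
                     np.array([[0.3, -0.2]])]

# Accumulate the gradients over the mini-batch, as the loop in the method does.
nabla_w = np.zeros(w.shape)
for grad in per_example_grads:
    nabla_w = nabla_w + grad               # ends up as [[0.4, 0.0]]

# One gradient descent step using the average gradient over the mini-batch.
m = len(per_example_grads)
w_new = w - (eta / m) * nabla_w            # [[0.5 - 1.5 * 0.4, -0.5]]
```

The new weights come out as approximately [[-0.1, -0.5]]. One practical point when running the book's code under Python 2.7: an integer eta would turn eta/len(mini_batch) into integer division, so passing a float such as 3.0 is the safer choice.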
"""
network.py
A module to implement the stochastic gradient descent learning
algorithm for a feedforward neural network. Gradients are calculated
using backpropagation. Note that I have focused on making the code
simple, easily readable, and easily modifiable. It is not optimized,
and omits many desirable features.
"""
#### Libraries
# Standard library
import random
# Third-party libraries
import numpy as np
class Network(object):

    def __init__(self, sizes):
        """The list "sizes" contains the number of neurons in the
        respective layers of the network.  For example, if the list
        was [2, 3, 1] then it would be a three-layer network, with the
        first layer containing 2 neurons, the second layer 3 neurons,
        and the third layer 1 neuron.  The biases and weights for the
        network are initialized randomly, using a Gaussian
        distribution with mean 0, and variance 1.  Note that the first
        layer is assumed to be an input layer, and by convention we
        won't set any biases for those neurons, since biases are only
        ever used in computing the outputs from later layers."""
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic
        gradient descent.  The ``training_data`` is a list of tuples
        ``(x, y)`` representing the training inputs and the desired
        outputs.  The other non-optional parameters are
        self-explanatory.  If ``test_data`` is provided then the
        network will be evaluated against the test data after each
        epoch, and partial progress printed out.  This is useful for
        tracking progress, but slows things down substantially."""
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print "Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)

    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        """Return a tuple ``(nabla_b, nabla_w)`` representing the
        gradient for the cost function C_x.  ``nabla_b`` and
        ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
        to ``self.biases`` and ``self.weights``."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def evaluate(self, test_data):
        """Return the number of test inputs for which the neural
        network outputs the correct result. Note that the neural
        network's output is assumed to be the index of whichever
        neuron in the final layer has the highest activation."""
        test_results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y)

#### Miscellaneous functions
def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
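The closed form used in sigmoid_prime can be sanity-checked numerically. The sketch below compares it against a central finite difference of sigmoid at a few sample points; the sample grid and the step size h are arbitrary choices made for illustration, and the check is not part of the book's program.

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z) * (1 - sigmoid(z))

# Sample points and step size (arbitrary illustrative choices).
z = np.linspace(-5.0, 5.0, 11)
h = 1e-5

# Central difference approximation of the derivative at each sample point.
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

# Largest disagreement between the approximation and the closed form.
max_error = np.max(np.abs(numeric - sigmoid_prime(z)))
```

The two agree to well within 1e-8 on this grid, and sigmoid_prime(0.0) evaluates to 0.25, the maximum slope of the sigmoid.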
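Earlier, I skipped over the details of how the MNIST data is loaded. It's pretty straightforward. For completeness, here's the code. The data structures used to store the MNIST data are described in the documentation strings; it's straightforward stuff, tuples and lists of Numpy ndarray objects (think of them as vectors if you're not familiar with ndarrays):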
"""
mnist_loader
~~~~~~~~~~~~
A library to load the MNIST image data. For details of the data
structures that are returned, see the doc strings for ``load_data``
and ``load_data_wrapper``. In practice, ``load_data_wrapper`` is the
function usually called by our neural network code.
"""
#### Libraries
# Standard library
import cPickle
import gzip
# Third-party libraries
import numpy as np

def load_data():
    """Return the MNIST data as a tuple containing the training data,
    the validation data, and the test data.

    The ``training_data`` is returned as a tuple with two entries.
    The first entry contains the actual training images.  This is a
    numpy ndarray with 50,000 entries.  Each entry is, in turn, a
    numpy ndarray with 784 values, representing the 28 * 28 = 784
    pixels in a single MNIST image.

    The second entry in the ``training_data`` tuple is a numpy ndarray
    containing 50,000 entries.  Those entries are just the digit
    values (0...9) for the corresponding images contained in the first
    entry of the tuple.

    The ``validation_data`` and ``test_data`` are similar, except
    each contains only 10,000 images.

    This is a nice data format, but for use in neural networks it's
    helpful to modify the format of the ``training_data`` a little.
    That's done in the wrapper function ``load_data_wrapper()``, see
    below.
    """
    f = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f)
    f.close()
    return (training_data, validation_data, test_data)

def load_data_wrapper():
    """Return a tuple containing ``(training_data, validation_data,
    test_data)``. Based on ``load_data``, but the format is more
    convenient for use in our implementation of neural networks.

    In particular, ``training_data`` is a list containing 50,000
    2-tuples ``(x, y)``.  ``x`` is a 784-dimensional numpy.ndarray
    containing the input image.  ``y`` is a 10-dimensional
    numpy.ndarray representing the unit vector corresponding to the
    correct digit for ``x``.

    ``validation_data`` and ``test_data`` are lists containing 10,000
    2-tuples ``(x, y)``.  In each case, ``x`` is a 784-dimensional
    numpy.ndarray containing the input image, and ``y`` is the
    corresponding classification, i.e., the digit values (integers)
    corresponding to ``x``.

    Obviously, this means we're using slightly different formats for
    the training data and the validation / test data.  These formats
    turn out to be the most convenient for use in our neural network
    code."""
    tr_d, va_d, te_d = load_data()
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    training_results = [vectorized_result(y) for y in tr_d[1]]
    training_data = zip(training_inputs, training_results)
    validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
    validation_data = zip(validation_inputs, va_d[1])
    test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
    test_data = zip(test_inputs, te_d[1])
    return (training_data, validation_data, test_data)

def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the jth
    position and zeroes elsewhere.  This is used to convert a digit
    (0...9) into a corresponding desired output from the neural
    network."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e
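The reshaping done by load_data_wrapper can be seen without downloading MNIST. The sketch below applies the same steps to three blank synthetic "images"; the arrays images and labels are stand-ins invented for illustration. It also wraps zip in list(...), which is an adaptation for Python 3, where zip returns an iterator rather than the list that the Python 2.7 code above relies on.

```python
import numpy as np

def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the jth position."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

# Synthetic stand-in for the loaded data (an assumption): 3 flat images of
# 784 pixels each, plus their digit labels.
images = np.zeros((3, 784), dtype=np.float32)
labels = np.array([5, 0, 9])

# Same shaping as load_data_wrapper: column-vector inputs, unit-vector targets.
inputs = [np.reshape(x, (784, 1)) for x in images]
results = [vectorized_result(y) for y in labels]
training_data = list(zip(inputs, results))  # list() so len() and shuffling work

x0, y0 = training_data[0]

# np.argmax recovers the digit from its unit vector, mirroring how evaluate
# reads the network's answer off the output layer.
decoded = [int(np.argmax(y)) for _, y in training_data]
print(decoded)  # [5, 0, 9]
```

Each training pair therefore consists of a (784, 1) input and a (10, 1) target, which matches a network whose first layer has 784 neurons and whose last layer has 10.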