Theano Deep Learning Tutorials notes: Getting Started




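(1) MNIST handwritten digit dataset: each image is a 784-dimensional vector (28*28) of float pixel values between 0 and 1, and each represents a digit from 0 to 9. There are 50,000 images in the training set, 10,000 in the validation set (used to choose hyper-parameters such as the learning rate and model size), and 10,000 in the test set.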

For convenience we pickled the dataset to make it easier to use in python.

import cPickle, gzip, numpy

# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f)
f.close()



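(2) We encourage you to store the dataset in shared variables and access it based on the minibatch index, given a fixed and known batch size (batch_size = 500 in the code below).

The reason: when using a GPU, repeatedly copying data to the GPU is inefficient, so use Theano shared variables wherever possible to improve performance. The recommendation is to set up six shared variables: three for the data (training set, validation set, test set) and three for the corresponding labels.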

import theano
import theano.tensor as T

def shared_dataset(data_xy):
    # Function that loads the dataset into shared variables
    data_x, data_y = data_xy
    shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
    shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
    # Data on the GPU is stored as floats, but the labels y should be ints,
    # so cast to int when returning
    return shared_x, T.cast(shared_y, 'int32')

test_set_x, test_set_y = shared_dataset(test_set)
valid_set_x, valid_set_y = shared_dataset(valid_set)
train_set_x, train_set_y = shared_dataset(train_set)

batch_size = 500    # size of the minibatch

# accessing the third minibatch of the training set

data  = train_set_x[2 * batch_size: 3 * batch_size]
label = train_set_y[2 * batch_size: 3 * batch_size]
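
The slicing pattern above generalizes to any minibatch index. The following is a minimal plain-Python sketch (Python 3, no Theano) of the index arithmetic; `n_minibatches` and `minibatch_bounds` are hypothetical helper names that do not appear in the tutorial.

```python
def n_minibatches(n_examples, batch_size):
    """Number of complete minibatches (a trailing partial batch is dropped)."""
    return n_examples // batch_size


def minibatch_bounds(index, batch_size):
    """Return (start, stop) so that data[start:stop] is minibatch `index`."""
    return index * batch_size, (index + 1) * batch_size


batch_size = 500
n_train = 50000  # size of the MNIST training set described above

# The third minibatch has index 2, matching the slice shown above.
start, stop = minibatch_bounds(2, batch_size)
print(n_minibatches(n_train, batch_size), start, stop)
```

With these numbers the training set splits into 100 minibatches, and the third one covers rows 1000 up to (but not including) 1500.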


If the dataset is too large to fit in GPU memory, you can store a sufficiently small chunk of your data (several minibatches) in a shared variable and use that during training. Once you have gone through the chunk, update the values it stores.
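
One way to picture this chunked scheme is sketched below in plain Python 3. A list stands in for the shared variable, and `iter_chunks` and `chunk_batches` are hypothetical names introduced only for illustration; in Theano the refresh step would be a call to the shared variable's `set_value`.

```python
def iter_chunks(dataset, batch_size, chunk_batches):
    """Yield successive chunks, each holding up to `chunk_batches` minibatches."""
    chunk_size = batch_size * chunk_batches
    for start in range(0, len(dataset), chunk_size):
        yield dataset[start:start + chunk_size]


dataset = list(range(20))   # stand-in for a dataset too large for GPU memory
batch_size = 2
seen = []

for chunk in iter_chunks(dataset, batch_size, chunk_batches=3):
    # In Theano, this is where the shared variable would be refreshed,
    # e.g. shared_x.set_value(chunk); here `chunk` is used directly.
    for i in range(0, len(chunk), batch_size):
        minibatch = chunk[i:i + batch_size]
        seen.extend(minibatch)   # placeholder for one training step

print(seen == dataset)
```

Each chunk is transferred once and then sliced into minibatches locally, so the number of host-to-device copies is one per chunk rather than one per minibatch.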

Learning a Classifier

Zero-One Loss


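If f: R^D \rightarrow \{0,...,L\} is the prediction function, then this loss can be written as:

\ell_{0,1} = \sum_{i=0}^{|\mathcal{D}|} I_{f(x^{(i)}) \neq y^{(i)}}

where either \mathcal{D} is the training set (during training) or \mathcal{D} \cap \mathcal{D}_{train} = \emptyset (to avoid biasing the evaluation of validation or test error). I is the indicator function defined as:

I_x = \begin{cases} 1 & \text{if } x \text{ is True} \\ 0 & \text{otherwise} \end{cases}

In this tutorial, f is defined as:

f(x) = {\rm argmax}_k P(Y=k | x, \theta)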

# zero_one_loss is a Theano variable representing a symbolic
# expression of the zero one loss ; to get the actual value this
# symbolic expression has to be compiled into a Theano function (see
# the Theano tutorial for more details)
# axis=1 takes the argmax over classes for each example (row of p_y_given_x)
zero_one_loss = T.sum(T.neq(T.argmax(p_y_given_x, axis=1), y))
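
To make the definition concrete, here is a small plain-Python 3 sketch that evaluates the zero-one loss on actual numbers instead of symbolic variables. The `predict` helper and the toy probabilities are illustrative assumptions, not part of the tutorial.

```python
def predict(p_y_given_x):
    """f(x): index of the most probable class for each row of probabilities."""
    return [max(range(len(row)), key=row.__getitem__) for row in p_y_given_x]


def zero_one_loss(p_y_given_x, y):
    """Count the examples whose predicted class differs from the label."""
    return sum(1 for pred, label in zip(predict(p_y_given_x), y) if pred != label)


# Three examples, three classes; each row sums to 1.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.3, 0.6],
         [0.2, 0.5, 0.3]]
labels = [0, 2, 0]   # the last example is misclassified (predicted class 1)

print(zero_one_loss(probs, labels))
```

Only the third example contributes to the sum, so the loss here is 1.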

Negative Log-Likelihood Loss


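Since the zero-one loss is not differentiable, it is hard to optimize directly. Instead we maximize the log-likelihood of the classifier given the labels of the training data, which is equivalent to minimizing the negative log-likelihood (NLL), defined as: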

NLL(\theta, \mathcal{D}) = - \sum_{i=0}^{|\mathcal{D}|} \log P(Y=y^{(i)} | x^{(i)}, \theta)

# NLL is a symbolic variable ; to get the actual value of NLL, this symbolic
# expression has to be compiled into a Theano function (see the Theano
# tutorial for more details)
NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
# note on syntax: T.arange(y.shape[0]) is a vector of integers [0,1,2,...,len(y)-1].
# Indexing a matrix M by the two vectors [0,1,...,K], [a,b,...,k] returns the
# elements M[0,a], M[1,b], ..., M[K,k] as a vector.  Here, we use this
# syntax to retrieve the log-probability of the correct labels, y.
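
The same computation can be checked by hand on a tiny example. The sketch below is plain Python 3 with no Theano; the function name and the toy probabilities are assumptions made for illustration.

```python
import math


def negative_log_likelihood(p_y_given_x, y):
    """Sum of -log P(Y = y_i | x_i) over all examples."""
    return -sum(math.log(row[label]) for row, label in zip(p_y_given_x, y))


probs = [[0.5, 0.5],
         [0.25, 0.75]]
labels = [0, 1]

# Picks probs[0][0] = 0.5 and probs[1][1] = 0.75, mirroring the
# M[0,a], M[1,b] indexing described in the note above.
nll = negative_log_likelihood(probs, labels)
print(nll)
```

The result equals -(log 0.5 + log 0.75), that is, log(8/3).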

Stochastic Gradient Descent




An optimal minibatch size is model-, dataset-, and hardware-dependent, and can be anywhere from 1 to maybe several hundreds. In the tutorial we set it to 20, but this choice is almost arbitrary (though harmless).

If you are training for a fixed number of epochs, the minibatch size becomes important because it controls the number of updates done to your parameters. Training the same model for 10 epochs using a batch size of 1 yields completely different results compared to training for the same 10 epochs but with a batch size of 20.
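
The arithmetic behind this point is easy to check. The sketch below (plain Python 3; `n_updates` is a hypothetical helper) counts parameter updates for the two batch sizes mentioned, assuming the 50,000-example MNIST training set.

```python
import math


def n_updates(n_examples, batch_size, n_epochs):
    """Parameter updates performed: one per minibatch, per epoch."""
    return n_epochs * math.ceil(n_examples / batch_size)


n_train = 50000   # MNIST training set size from the notes above
print(n_updates(n_train, 1, 10))    # batch size 1  -> 500000
print(n_updates(n_train, 20, 10))   # batch size 20 -> 25000
```

For the same 10 epochs, a batch size of 1 performs 20 times as many updates as a batch size of 20.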


# Minibatch Stochastic Gradient Descent

# assume loss is a symbolic description of the loss function given
# the symbolic variables params (shared variable), x_batch, y_batch;

# compute gradient of loss with respect to params
d_loss_wrt_params = T.grad(loss, params)

# compile the MSGD step into a theano function
updates = [(params, params - learning_rate * d_loss_wrt_params)]
MSGD = theano.function([x_batch,y_batch], loss, updates=updates)

for (x_batch, y_batch) in train_batches:
    # here x_batch and y_batch are elements of train_batches and
    # therefore numpy arrays; function MSGD also updates the params
    print('Current loss is ', MSGD(x_batch, y_batch))
    if stopping_condition_is_met:
        break  # training is done; params holds the learned values
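
For readers who want to run the loop without Theano, here is a self-contained plain-Python 3 sketch of minibatch SGD on a toy problem. It is an illustration under stated assumptions, not the tutorial's implementation: the model is the one-parameter regression y = w * x with a mean squared error loss, the gradient is derived by hand instead of calling `T.grad`, and `make_batches` and `msgd_step` are hypothetical names.

```python
def make_batches(xs, ys, batch_size):
    """Split paired data into consecutive minibatches."""
    return [(xs[i:i + batch_size], ys[i:i + batch_size])
            for i in range(0, len(xs), batch_size)]


def msgd_step(w, x_batch, y_batch, learning_rate):
    """One update of w for the model y = w * x under mean squared error."""
    n = len(x_batch)
    loss = sum((w * x - y) ** 2 for x, y in zip(x_batch, y_batch)) / n
    grad = sum(2 * (w * x - y) * x for x, y in zip(x_batch, y_batch)) / n
    return w - learning_rate * grad, loss


xs = [0.1 * i for i in range(1, 21)]
ys = [2.0 * x for x in xs]          # data generated with true w = 2
train_batches = make_batches(xs, ys, batch_size=5)

w = 0.0
for epoch in range(20):
    for x_batch, y_batch in train_batches:
        # the loss is computed before the update, as in the MSGD function above
        w, loss = msgd_step(w, x_batch, y_batch, learning_rate=0.1)

print(round(w, 6))
```

Each call to `msgd_step` plays the role of one call to the compiled `MSGD` function: it returns the loss on the current minibatch and applies the update `params - learning_rate * gradient`.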




L1 and L2 regularization


Formally, if our loss function is:

NLL(\theta, \mathcal{D}) = - \sum_{i=0}^{|\mathcal{D}|} \log P(Y=y^{(i)} | x^{(i)}, \theta)

then the regularized loss will be:

E(\theta, \mathcal{D}) =  NLL(\theta, \mathcal{D}) + \lambda R(\theta)

or, in our case

E(\theta, \mathcal{D}) =  NLL(\theta, \mathcal{D}) + \lambda||\theta||_p^p





In principle, adding a regularization term to the loss will encourage smooth network mappings in a neural network (by penalizing large values of the parameters, which decreases the amount of nonlinearity that the network models). More intuitively, the two terms (NLL and R(\theta)) correspond to modelling the data well (NLL) and having “simple” or “smooth” solutions (R(\theta)). Thus, minimizing the sum of both will, in theory, correspond to finding the right trade-off between the fit to the training data and the “generality” of the solution that is found. To follow Occam’s razor principle, this minimization should find us the simplest solution (as measured by our simplicity criterion) that fits the training data.

Note that the fact that a solution is “simple” does not mean that it will generalize well. Empirically, it was found that performing such regularization in the context of neural networks helps with generalization, especially on small datasets. The code block below shows how to compute the loss in Python when it contains both an L1 regularization term weighted by \lambda_1 and an L2 regularization term weighted by \lambda_2:

# symbolic Theano variable that represents the L1 regularization term
L1  = T.sum(abs(param))

# symbolic Theano variable that represents the squared L2 term
L2 = T.sum(param ** 2)

# the loss
loss = NLL + lambda_1 * L1 + lambda_2 * L2
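
The same formula can be evaluated on concrete numbers. This plain-Python 3 sketch uses made-up parameter values and weights purely for illustration; the function names are assumptions.

```python
def l1_term(params):
    """Sum of absolute parameter values."""
    return sum(abs(p) for p in params)


def l2_squared_term(params):
    """Sum of squared parameter values (the squared L2 norm)."""
    return sum(p ** 2 for p in params)


def regularized_loss(nll, params, lambda_1, lambda_2):
    """NLL plus weighted L1 and squared-L2 penalties."""
    return nll + lambda_1 * l1_term(params) + lambda_2 * l2_squared_term(params)


params = [1.0, -2.0, 3.0]
# L1 = 6.0 and squared L2 = 14.0, so the loss is 2.0 + 0.5 * 6.0 + 0.25 * 14.0
loss = regularized_loss(nll=2.0, params=params, lambda_1=0.5, lambda_2=0.25)
print(loss)   # 8.5
```

Setting either weight to 0 switches the corresponding penalty off, which is a convenient way to compare L1-only, L2-only, and unregularized training.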


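Early stopping combats overfitting by monitoring the model's performance on a validation set. When performance on the validation set stops improving significantly, or even degrades, the optimization is stopped.

The choice of when to stop is a judgement call and a few heuristics exist, but these tutorials will make use of a strategy based on a geometrically increasing amount of patience (that is, a simulated level of patience decides when to stop).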

# early-stopping parameters
patience = 5000  # look at this many minibatches regardless
patience_increase = 2     # wait this much longer when a new best is
                              # found
improvement_threshold = 0.995  # a relative improvement of this much is
                               # considered significant
validation_frequency = min(n_train_batches, patience/2)
                              # go through this many
                              # minibatches before checking the network
                              # on the validation set; in this case we
                              # check every epoch, because n_train_batches
                              # is smaller than patience/2

best_params = None
best_validation_loss = numpy.inf
test_score = 0.
start_time = time.clock()

done_looping = False
epoch = 0
while (epoch < n_epochs) and (not done_looping):
    # Report "1" for first epoch, "n_epochs" for last epoch
    epoch = epoch + 1
    for minibatch_index in xrange(n_train_batches):

        d_loss_wrt_params = ... # compute gradient
        params -= learning_rate * d_loss_wrt_params # gradient descent

        # iteration number. We want it to start at 0.
        iter = (epoch - 1) * n_train_batches + minibatch_index
        # note that if we do `iter % validation_frequency` it will be
        # true for iter = 0 which we do not want. We want it true for
        # iter = validation_frequency - 1.
        if (iter + 1) % validation_frequency == 0:

            this_validation_loss = ... # compute zero-one loss on validation set

            if this_validation_loss < best_validation_loss:

                # improve patience if loss improvement is good enough
                if this_validation_loss < best_validation_loss * improvement_threshold:

                    patience = max(patience, iter * patience_increase)
                best_params = copy.deepcopy(params)
                best_validation_loss = this_validation_loss

        if patience <= iter:
            done_looping = True
            break

# best_params refers to the best out-of-sample parameters observed during the optimization

If we run out of batches of training data before running out of patience, then we just go back to the beginning of the training set and repeat.




(3) If the loss on the validation set drops significantly and iter * patience_increase > patience, then patience grows: patience = max(patience, iter * patience_increase). Note that patience_increase is 2, so the larger iter is, the more patience grows.


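Note: validation_frequency = min(n_train_batches, patience/2), so validation_frequency never exceeds half of the initial patience and the model is checked on the validation set at least twice before patience runs out.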


Note: This algorithm could possibly be improved by using a test of statistical significance rather than the simple comparison, when deciding whether to increase the patience.
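
The patience mechanism is easier to follow when replayed on a fixed sequence of validation losses. The sketch below is plain Python 3 and only simulates the bookkeeping: no model is trained, `run_early_stopping` is a hypothetical function, the small values for `patience` and `n_train_batches` are chosen for readability, and `patience // 2` is used as the Python 3 spelling of the integer division.

```python
def run_early_stopping(validation_losses, n_train_batches, n_epochs,
                       patience, patience_increase=2,
                       improvement_threshold=0.995):
    """Replay the patience logic on a precomputed list of validation losses.

    validation_losses[k] is the loss observed at the k-th validation check.
    Returns (best_validation_loss, number_of_minibatch_iterations_run).
    """
    validation_frequency = min(n_train_batches, patience // 2)
    best_validation_loss = float('inf')
    checks = iter(validation_losses)
    iteration = -1
    for epoch in range(n_epochs):
        for minibatch_index in range(n_train_batches):
            iteration = epoch * n_train_batches + minibatch_index
            # (a real training step on this minibatch would go here)
            if (iteration + 1) % validation_frequency == 0:
                this_validation_loss = next(checks)
                if this_validation_loss < best_validation_loss:
                    # extend patience only for a significant improvement
                    if this_validation_loss < best_validation_loss * improvement_threshold:
                        patience = max(patience, iteration * patience_increase)
                    best_validation_loss = this_validation_loss
            if patience <= iteration:
                return best_validation_loss, iteration + 1
    return best_validation_loss, iteration + 1


losses = [0.5, 0.4, 0.3, 0.3, 0.3, 0.3]
result = run_early_stopping(losses, n_train_batches=10, n_epochs=10, patience=20)
print(result)   # (0.3, 59)
```

In this run the improvements at iterations 19 and 29 raise the patience to 38 and then 58. The loss then stays flat, so training stops at iteration 58, after 59 minibatches, well before the 100 allowed by `n_epochs`.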

Theano/Python Tips

Loading and Saving Models


Read more about serialization in Theano, or Python’s pickling.

Pickle the numpy ndarrays from your shared variables

If your parameters are in shared variables w, v, u, then your save command should look something like:

import cPickle
save_file = open('path', 'wb')  # this will overwrite current contents
cPickle.dump(w.get_value(borrow=True), save_file, -1)  # the -1 is for HIGHEST_PROTOCOL
cPickle.dump(v.get_value(borrow=True), save_file, -1)  # .. and it triggers much more efficient
cPickle.dump(u.get_value(borrow=True), save_file, -1)  # .. storage than numpy's default
save_file.close()

Then later, you can load your data back like this:

save_file = open('path', 'rb')  # binary mode, matching the binary pickle protocol
w.set_value(cPickle.load(save_file), borrow=True)
v.set_value(cPickle.load(save_file), borrow=True)
u.set_value(cPickle.load(save_file), borrow=True)
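
Because several objects are dumped into one file, they must be loaded back in the same order. The following sketch shows that round trip in Python 3, where the module is called `pickle` rather than `cPickle`. It is an illustration under assumptions: plain lists stand in for the arrays returned by `get_value`, and an in-memory buffer stands in for the file.

```python
import io
import pickle

# Plain lists stand in for the arrays returned by w.get_value(), etc.
w_value, v_value, u_value = [1.0, 2.0], [3.0], [4.0, 5.0, 6.0]

buffer = io.BytesIO()   # in-memory stand-in for open('path', 'wb')
for value in (w_value, v_value, u_value):
    pickle.dump(value, buffer, pickle.HIGHEST_PROTOCOL)

buffer.seek(0)          # rewind, as if reopening the file with 'rb'
loaded = [pickle.load(buffer) for _ in range(3)]
print(loaded == [w_value, v_value, u_value])
```

Each `pickle.load` call reads exactly one object and leaves the stream positioned at the start of the next, which is why three loads recover w, v, and u in order.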


Do not pickle your training or test functions for long-term storage

Theano functions are compatible with Python’s deepcopy and pickle mechanisms, but you should not necessarily pickle a Theano function. If you update your Theano folder and one of the internals changes, then you may not be able to un-pickle your model.


Plotting Intermediate Results

Visualization is done with two libraries: PIL and matplotlib.


