See links [1] and [3].
import theano
import theano.tensor as T
import numpy as np
import matplotlib.pyplot as plt
plt.ion()

import load_mnist
import load_cifar

from theano.tensor.nnet.conv import conv2d
from theano.tensor.signal.downsample import max_pool_2d

# load data
x_train, t_train, x_test, t_test = load_cifar.cifar10(dtype=theano.config.floatX)
# x_train, x_test, t_train, t_test = load_mnist.mnist(onehot=True)
labels_test = np.argmax(t_test, axis=1)

# reshape data (single-channel 32x32 setup; see section (9) below for the
# changes needed for 3-channel color images)
x_train = x_train.reshape((x_train.shape[0], 1, 32, 32))
x_test = x_test.reshape((x_test.shape[0], 1, 32, 32))

# define symbolic Theano variables
x = T.tensor4()
t = T.matrix()

# define model: neural network
def floatX(x):
    return np.asarray(x, dtype=theano.config.floatX)

def init_weights(shape):
    return theano.shared(floatX(np.random.randn(*shape) * 0.1))

def momentum(cost, params, learning_rate, momentum):
    grads = theano.grad(cost, params)
    updates = []

    for p, g in zip(params, grads):
        mparam_i = theano.shared(np.zeros(p.get_value().shape, dtype=theano.config.floatX))
        v = momentum * mparam_i - learning_rate * g
        updates.append((mparam_i, v))
        updates.append((p, p + v))

    return updates

def model(x, w_c1, b_c1, w_c2, b_c2, w_h3, b_h3, w_o, b_o):
    c1 = T.maximum(0, conv2d(x, w_c1) + b_c1.dimshuffle('x', 0, 'x', 'x'))
    p1 = max_pool_2d(c1, (3, 3))

    c2 = T.maximum(0, conv2d(p1, w_c2) + b_c2.dimshuffle('x', 0, 'x', 'x'))
    p2 = max_pool_2d(c2, (2, 2))

    p2_flat = p2.flatten(2)
    h3 = T.maximum(0, T.dot(p2_flat, w_h3) + b_h3)
    p_y_given_x = T.nnet.softmax(T.dot(h3, w_o) + b_o)
    return p_y_given_x

w_c1 = init_weights((4, 1, 3, 3))
b_c1 = init_weights((4,))
w_c2 = init_weights((8, 4, 3, 3))
b_c2 = init_weights((8,))
w_h3 = init_weights((8 * 4 * 4, 100))
b_h3 = init_weights((100,))
w_o = init_weights((100, 10))
b_o = init_weights((10,))

params = [w_c1, b_c1, w_c2, b_c2, w_h3, b_h3, w_o, b_o]

p_y_given_x = model(x, *params)
y = T.argmax(p_y_given_x, axis=1)
cost = T.mean(T.nnet.categorical_crossentropy(p_y_given_x, t))

updates = momentum(cost, params, learning_rate=0.01, momentum=0.9)

# compile theano functions
train = theano.function([x, t], cost, updates=updates, allow_input_downcast=True)
predict = theano.function([x], y, allow_input_downcast=True)

# train model
batch_size = 50

for i in range(50):
    for start in range(0, len(x_train), batch_size):
        x_batch = x_train[start:start + batch_size]
        t_batch = t_train[start:start + batch_size]
        cost = train(x_batch, t_batch)

    predictions_test = predict(x_test)
    accuracy = np.mean(predictions_test == labels_test)
    print "epoch %d - accuracy: %.4f" % (i + 1, accuracy)
(9) Concrete CNN architecture (color images)
The input image dimensions become (50, 3, 32, 32), and the number of input feature maps of convolution layer 1 changes from 1 to 3.
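A minimal sketch of those two changes, reusing the variable names from the code above (the rest of the network is unchanged; the exact lines shown here are an assumption based on the description, not code from the original post):

x_train = x_train.reshape((x_train.shape[0], 3, 32, 32))  # 3 color channels instead of 1
x_test = x_test.reshape((x_test.shape[0], 3, 32, 32))

w_c1 = init_weights((4, 3, 3, 3))  # 4 filters, now over 3 input feature maps, 3x3 kernels
b_c1 = init_weights((4,))          # unchanged: one bias per output feature map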
(10) Dropout
See link [5].
1) Meaning
Dropout randomly drops hidden-layer units during training: each hidden unit is retained with probability p and dropped otherwise.
Why it helps: it prevents co-adaptation between hidden-layer neurons during training, so a hidden unit can no longer rely on other hidden units being present and every neuron has to learn features that are useful on their own. At the same time, it makes it feasible to train large networks in a reasonable amount of time.
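As a rough illustration of the mechanism (a NumPy sketch, with p assumed to be the retention probability; this is not code from the post):

import numpy as np

def dropout_mask(h, p=0.5, rng=np.random):
    # Training: keep each hidden activation with probability p, zero it otherwise.
    # Every forward pass samples a different "thinned" network, so no unit can
    # rely on any particular other unit being present.
    mask = rng.binomial(n=1, p=p, size=h.shape)
    return h * mask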
2) Training
Methods:
SGD
Mini-Batches
Cross-entropy objective function.
Modified penalty term: set an upper bound on the L2 norm of each hidden unit's incoming weight vector; if the constraint is violated, renormalize the weights. This keeps the weights from growing too large, allows a high learning rate at the start of training that is then decayed over the course of training, and permits a more thorough search of the weight space.
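A minimal sketch of that max-norm renormalization (the cap value and the column layout of the weight matrix are assumptions, not taken from the post):

import numpy as np

def renormalize(w, max_norm=3.0):
    # w: (n_inputs, n_hidden); each column is the incoming weight vector of one hidden unit.
    norms = np.sqrt((w ** 2).sum(axis=0))                # L2 norm per hidden unit
    scale = np.minimum(1.0, max_norm / (norms + 1e-8))   # only shrink the violating columns
    return w * scale

Applied after every gradient update, this keeps each hidden unit's incoming weights inside a ball of radius max_norm, which is what makes the initially high learning rate safe.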
3) Testing
The "mean network": it contains all of the hidden units, but each unit's output is p times its original value, which approximates averaging the predictions of the many thinned networks sampled during training.
Fine-tuning a model with dropout and a very small learning rate works better than fine-tuning with standard backpropagation.
4) In this example
A dropout function is applied to the input of the model's fully connected layer.
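A sketch of how that might be wired into the model() function from the code above, using Theano's RandomStreams (the retain probability, the seed, and the training flag are assumptions; the original post does not show this code):

import theano
from theano.tensor.shared_randomstreams import RandomStreams

srng = RandomStreams(seed=42)

def dropout(h, p_retain=0.5, training=True):
    if training:
        # training: sample a fresh binary mask for every mini-batch
        mask = srng.binomial(n=1, p=p_retain, size=h.shape,
                             dtype=theano.config.floatX)
        return h * mask
    # test time ("mean network"): keep every unit but scale its output by p_retain
    return h * p_retain

# inside model(), right before the fully connected layer:
#     p2_flat = dropout(p2.flatten(2), p_retain=0.5, training=training)
#     h3 = T.maximum(0, T.dot(p2_flat, w_h3) + b_h3)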
Dropout is widely known to work well, but it does not always help. Link [6] contains this passage:
In vision tasks, input features are commonly dense, while in our task input features are sparse and labels are noisy. In the dense setting, dropout serves to separate effects from strongly correlated features, resulting in a more robust classifier. But in our sparse, noisy setting adding in dropout appears to simply reduce the amount of data available for learning.
5) Questions
In link [6], lamthep asks why so many papers that use dropout apply it only to the fully connected layers, and suspects the reason is that the features produced by the convolutional layers are already sparse.
ZygmuntZ's reply quotes the following authors:
a. In favor of applying dropout in the convolutional layers
Nitish Srivastava: "The additional gain in performance obtained by adding dropout in the convolutional layers besides doing dropout in the fully connected layers suggests that the utility of dropout is not limited to densely connected neural networks but can be more generally applied to other specialized architectures."
b. Against applying dropout in the convolutional layers
Matt Zeiler: "A drawback to dropout is that it does not seem to have the same benefits for convolutional layers."
c. Other
zhaoyangyang: "The dropout will make the training time much longer, if applied at each layer, it might be too long to train."
Convolutional layers do not need regularization the way fully connected layers do; regularizing the convolutional layers does more harm than good (the features they produce are already sparse, so dropout there mostly just adds computational cost).