FullyConnectedNets.ipynb
In this assignment we will implement fully connected neural networks in a modular way: for every layer we write a forward() pass and a backward() pass.
forward() receives the input, the weights, and any other required parameters, and returns an output together with a cache of the values we will need during backpropagation, roughly like this:
def layer_forward(x, w):
""" Receive inputs x and weights w """
# Do some computations ...
z = # ... some intermediate value
# Do some more computations ...
out = # the output
cache = (x, w, z, out) # Values we need to compute gradients
return out, cache
backward() receives the upstream gradient and the cached values, and returns gradients with respect to the inputs and the weights:
def layer_backward(dout, cache):
"""
Receive dout (derivative of loss with respect to outputs) and cache,
and compute derivative with respect to inputs.
"""
# Unpack cache values
x, w, z, out = cache
# Use values in cache to compute derivatives
dx = # Derivative of loss with respect to x
dw = # Derivative of loss with respect to w
return dx, dw
Note: gnureadline==6.3.3 is not supported on Windows; simply skip it. Not every package in the list is strictly required either, so install selectively, but make sure NumPy, Cython, Future, etc. are present.
cd assignment2
pip install -r requirements.txt
The assignment also uses im2col.py; all we need to do is compile this file first, i.e. run setup.py in the cs231n directory:
python setup.py build_ext --inplace
Set up the Jupyter notebook environment:
# As usual, a bit of setup
from __future__ import print_function
import time
import numpy as np
import matplotlib.pyplot as plt
from cs231n.classifiers.fc_net import *
from cs231n.data_utils import get_CIFAR10_data
from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array
from cs231n.solver import Solver
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2
def rel_error(x, y):
""" returns relative error """
return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))
# Load the (preprocessed) CIFAR10 data.
data = get_CIFAR10_data()
for k, v in list(data.items()):
print(('%s: ' % k, v.shape))
('X_val: ', (1000, 3, 32, 32))
('y_test: ', (1000,))
('y_train: ', (49000,))
('X_test: ', (1000, 3, 32, 32))
('X_train: ', (49000, 3, 32, 32))
('y_val: ', (1000,))
Open the file cs231n/layers.py and implement the affine_forward function.
def affine_forward(x, w, b):
"""
Computes the forward pass for an affine (fully-connected) layer.
The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N
examples, where each example x[i] has shape (d_1, ..., d_k). We will
reshape each input into a vector of dimension D = d_1 * ... * d_k, and
then transform it to an output vector of dimension M.
Inputs:
- x: A numpy array containing input data, of shape (N, d_1, ..., d_k)
- w: A numpy array of weights, of shape (D, M)
- b: A numpy array of biases, of shape (M,)
Returns a tuple of:
- out: output, of shape (N, M)
- cache: (x, w, b)
"""
out = None
# Reshape each input into a row vector, but do not overwrite x:
# the backward pass recovers the original input shape from the cache.
x_rows = x.reshape(x.shape[0], -1)  # [N, D]
out = np.dot(x_rows, w) + b         # [N, M]
cache = (x, w, b)
return out, cache
Now implement the affine_backward function and test your implementation using numeric gradient checking.
def affine_backward(dout, cache):
"""
Computes the backward pass for an affine layer.
Inputs:
- dout: Upstream derivative, of shape (N, M)
- cache: Tuple of:
- x: Input data, of shape (N, d_1, ... d_k)
- w: Weights, of shape (D, M)
- b: Biases, of shape (M,)
Returns a tuple of:
- dx: Gradient with respect to x, of shape (N, d1, ..., d_k)
- dw: Gradient with respect to w, of shape (D, M)
- db: Gradient with respect to b, of shape (M,)
"""
x, w, b = cache
x_rows = x.reshape(x.shape[0],-1)
d_xrows = np.dot(dout,w.T)
dx = d_xrows.reshape(x.shape)
dw = np.dot(x_rows.T, dout)
# Note: db sums over the rows of dout; it does not average them
db = np.sum(dout, axis=0)
return dx, dw, db
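A quick numeric gradient check in the notebook's style: the shapes below are just illustrative, and it reuses eval_numerical_gradient_array and rel_error from the setup cell above.
np.random.seed(231)
x = np.random.randn(10, 2, 3)
w = np.random.randn(6, 5)
b = np.random.randn(5)
dout = np.random.randn(10, 5)

dx_num = eval_numerical_gradient_array(lambda x: affine_forward(x, w, b)[0], x, dout)
dw_num = eval_numerical_gradient_array(lambda w: affine_forward(x, w, b)[0], w, dout)
db_num = eval_numerical_gradient_array(lambda b: affine_forward(x, w, b)[0], b, dout)

_, cache = affine_forward(x, w, b)
dx, dw, db = affine_backward(dout, cache)

# All three relative errors should be around 1e-10 or smaller
print('dx error:', rel_error(dx_num, dx))
print('dw error:', rel_error(dw_num, dw))
print('db error:', rel_error(db_num, db))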
def relu_forward(x):
"""
Computes the forward pass for a layer of rectified linear units (ReLUs).
Input:
- x: Inputs, of any shape
Returns a tuple of:
- out: Output, of the same shape as x
- cache: x
"""
out = np.maximum(0,x)
cache = x
return out, cache
def relu_backward(dout, cache):
"""
Computes the backward pass for a layer of rectified linear units (ReLUs).
Input:
- dout: Upstream derivatives, of any shape
- cache: Input x, of same shape as dout
Returns:
- dx: Gradient with respect to x
"""
dx, x = None, cache
dx = (x > 0) * dout
return dx
We frequently follow an FC layer with a ReLU activation, so implement this combination in cs231n/layer_utils.py. Think of it as a small exercise in composing layers.
def affine_relu_forward(x, w, b):
"""
Convenience layer that performs an affine transform followed by a ReLU
Inputs:
- x: Input to the affine layer
- w, b: Weights for the affine layer
Returns a tuple of:
- out: Output from the ReLU
- cache: Object to give to the backward pass
"""
a, fc_cache = affine_forward(x, w, b)
out, relu_cache = relu_forward(a)
cache = (fc_cache, relu_cache)
return out, cache
def affine_relu_backward(dout, cache):
"""
Backward pass for the affine-relu convenience layer
"""
fc_cache, relu_cache = cache
da = relu_backward(dout, relu_cache)
dx, dw, db = affine_backward(da, fc_cache)
return dx, dw, db
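The same numeric check carries over to the composed layer; a shortened sketch (again with made-up shapes) that only checks dx:
np.random.seed(231)
x = np.random.randn(2, 3, 4)
w = np.random.randn(12, 10)
b = np.random.randn(10)
dout = np.random.randn(2, 10)

out, cache = affine_relu_forward(x, w, b)
dx, dw, db = affine_relu_backward(dout, cache)
dx_num = eval_numerical_gradient_array(lambda x: affine_relu_forward(x, w, b)[0], x, dout)
print('dx error:', rel_error(dx_num, dx))  # should be around 1e-10 or smaller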
Conveniently, the loss layers are provided for us, but it is worth comparing them with the versions you implemented yourself earlier.
def svm_loss(x, y):
"""
Computes the loss and gradient using for multiclass SVM classification.
Inputs:
- x: Input data, of shape (N, C) where x[i, j] is the score for the jth
class for the ith input.
- y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
0 <= y[i] < C
Returns a tuple of:
- loss: Scalar giving the loss
- dx: Gradient of the loss with respect to x
"""
N = x.shape[0]
correct_class_scores = x[np.arange(N), y]
margins = np.maximum(0, x - correct_class_scores[:, np.newaxis] + 1.0)
margins[np.arange(N), y] = 0
loss = np.sum(margins) / N
num_pos = np.sum(margins > 0, axis=1)
dx = np.zeros_like(x)
dx[margins > 0] = 1
dx[np.arange(N), y] -= num_pos
dx /= N
return loss, dx
def softmax_loss(x, y):
"""
Computes the loss and gradient for softmax classification.
Inputs:
- x: Input data, of shape (N, C) where x[i, j] is the score for the jth
class for the ith input.
- y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
0 <= y[i] < C
Returns a tuple of:
- loss: Scalar giving the loss
- dx: Gradient of the loss with respect to x
"""
shifted_logits = x - np.max(x, axis=1, keepdims=True)
Z = np.sum(np.exp(shifted_logits), axis=1, keepdims=True)
log_probs = shifted_logits - np.log(Z)
probs = np.exp(log_probs)
N = x.shape[0]
loss = -np.sum(log_probs[np.arange(N), y]) / N
dx = probs.copy()
dx[np.arange(N), y] -= 1
dx /= N
return loss, dx
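As a sanity check on both loss layers (tiny made-up batch): with near-zero random scores the softmax loss should be close to log(C), about 2.3 for C = 10 classes, and the SVM loss should be close to C - 1 = 9.
np.random.seed(231)
num_classes, num_inputs = 10, 50
x = 0.001 * np.random.randn(num_inputs, num_classes)
y = np.random.randint(num_classes, size=num_inputs)

loss, dx = softmax_loss(x, y)
print('softmax loss:', loss)  # about 2.3
loss, dx = svm_loss(x, y)
print('svm loss:', loss)      # about 9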
Now use the layers we have implemented to build a two-layer fully connected network.
Open the file cs231n/classifiers/fc_net.py
and complete the implementation of the TwoLayerNet class. This class will serve as a model for the other networks you will implement in this assignment, so read through it to make sure you understand the API. You can run the cell below to test your implementation.
AFFINE->RELU->AFFINE->Softmax
Note that this module does not perform gradient descent itself; it will later interact with a Solver object that runs the optimization. Its learnable parameters are stored in the dictionary self.params.
class TwoLayerNet(object):
"""
A two-layer fully-connected neural network with ReLU nonlinearity and
softmax loss that uses a modular layer design. We assume an input dimension
of D, a hidden dimension of H, and perform classification over C classes.
The architecure should be affine - relu - affine - softmax.
Note that this class does not implement gradient descent; instead, it
will interact with a separate Solver object that is responsible for running
optimization.
The learnable parameters of the model are stored in the dictionary
self.params that maps parameter names to numpy arrays.
"""
def __init__(self, input_dim=3*32*32, hidden_dim=100, num_classes=10,
weight_scale=1e-3, reg=0.0):
"""
Initialize a new network.
Inputs:
- input_dim: An integer giving the size of the input
- hidden_dim: An integer giving the size of the hidden layer
- num_classes: An integer giving the number of classes to classify
- weight_scale: Scalar giving the standard deviation for random
initialization of the weights.
- reg: Scalar giving L2 regularization strength.
"""
self.params = {}
self.reg = reg
# TODO: Initialize the weights and biases of the two-layer net. Weights #
# should be initialized from a Gaussian centered at 0.0 with #
# standard deviation equal to weight_scale, and biases should be #
# initialized to zero.
self.params["W1"] = np.random.randn(input_dim,hidden_dim) * weight_scale
self.params["b1"] = np.zeros_like(hidden_dim)
self.params["W2"] = np.random.randn(hidden_dim,num_classes) * weight_scale
self.params["b2"] = np.zeros_like(num_classes)
def loss(self, X, y=None):
"""
Compute loss and gradient for a minibatch of data.
Inputs:
- X: Array of input data of shape (N, d_1, ..., d_k)
- y: Array of labels, of shape (N,). y[i] gives the label for X[i].
Returns:
If y is None, then run a test-time forward pass of the model and return:
- scores: Array of shape (N, C) giving classification scores, where
scores[i, c] is the classification score for X[i] and class c.
If y is not None, then run a training-time forward and backward pass and
return a tuple of:
- loss: Scalar value giving the loss
- grads: Dictionary with the same keys as self.params, mapping parameter
names to gradients of the loss with respect to those parameters.
"""
############################################################################
# TODO: Implement the forward pass for the two-layer net, computing the#
# class scores for X and storing them in the scores variable. #
############################################################################
H , cache_layer1 = affine_relu_forward(X,self.params["W1"],self.params["b1"])
scores , cache_layer2 = affine_forward(H, self.params["W2"], self.params["b2"])
# If y is None then we are in test mode so just return scores
if y is None:
return scores
loss, grads = 0, {}
############################################################################
# TODO: Implement the backward pass for the two-layer net. Store the loss #
# in the loss variable and gradients in the grads dictionary. Compute data #
# loss using softmax, and make sure that grads[k] holds the gradients for #
# self.params[k]. Don't forget to add L2 regularization! #
# #
# NOTE: To ensure that your implementation matches ours and you pass the #
# automated tests, make sure that your L2 regularization includes a factor #
# of 0.5 to simplify the expression for the gradient. #
############################################################################
loss, dS = softmax_loss(scores,y)
loss += 0.5 * self.reg * np.sum(self.params["W1"] * self.params["W1"])
loss += 0.5 * self.reg * np.sum(self.params["W2"] * self.params["W2"]) # 添加正则项
dH , dW2 , grads["b2"] = affine_backward(dS,cache_layer2)
dx, dW1,grads["b1"] = affine_relu_backward(dH , cache_layer1)
grads["W1"] = dW1 + self.reg * self.params["W1"]
grads["W2"] = dW2 + self.reg * self.params["W2"] # 正则项损失
return loss, grads
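A rough way to sanity-check the class (all dimensions below are made up): with reg = 0 the initial loss of a freshly initialized net should again be about log(10), roughly 2.3, and it should increase once regularization is turned on.
np.random.seed(231)
N, D, H, C = 3, 5 * 4, 7, 10
model = TwoLayerNet(input_dim=D, hidden_dim=H, num_classes=C, weight_scale=1e-3)
X = np.random.randn(N, 5, 4)
y = np.array([0, 5, 1])

loss, grads = model.loss(X, y)
print('loss with reg = 0:', loss)    # about 2.3

model.reg = 1.0
loss, grads = model.loss(X, y)
print('loss with reg = 1.0:', loss)  # somewhat larger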
Open the file cs231n/solver.py and read through it to familiarize yourself with the API. After doing so, use a Solver instance to train a TwoLayerNet that achieves at least 50% accuracy on the validation set.
This is a clean separation of concerns: the model API above takes data and computes the loss and gradients, while the Solver is responsible for feeding the data in minibatches, collecting the gradients, and updating the weights with a chosen optimization method. The same Solver abstraction can therefore be used to optimize different models; it factors out the part that is common to all of them.
__init__(self, model, data, **kwargs)
Parameter | Meaning |
---|---|
update_rule (string) | name of the update rule, passed to optim.py |
optim_config (dict) | hyperparameters for the optimizer, e.g. the learning rate |
lr_decay | factor by which the learning rate is decayed after each epoch |
batch_size | number of samples per minibatch |
num_epochs | total number of epochs to train for |
print_every | interval, in iterations, at which the training loss is printed |
verbose | whether to print training progress |
num_train_samples | number of training samples used to measure accuracy after each epoch |
num_val_samples | number of validation samples used to measure accuracy after each epoch |
checkpoint_name | base name for saved checkpoints; a checkpoint stores the model and the solver configuration so training can be resumed later |
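Put together, a Solver is constructed roughly like this (every value below is illustrative, not tuned):
model = TwoLayerNet()
solver = Solver(model, data,
                update_rule='sgd',
                optim_config={'learning_rate': 1e-3},
                lr_decay=0.95,
                batch_size=100,
                num_epochs=10,
                print_every=100,
                verbose=True,
                num_train_samples=1000,
                num_val_samples=None,
                checkpoint_name='two_layer_net')
solver.train()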
The __init__() method simply stashes these arguments:
def __init__(self, model, data, **kwargs):
"""
Construct a new Solver instance.
"""
self.model = model
self.X_train = data['X_train']
self.y_train = data['y_train']
self.X_val = data['X_val']
self.y_val = data['y_val']
# Unpack keyword arguments
self.update_rule = kwargs.pop('update_rule', 'sgd')
self.optim_config = kwargs.pop('optim_config', {})
self.lr_decay = kwargs.pop('lr_decay', 1.0)
self.batch_size = kwargs.pop('batch_size', 100)
self.num_epochs = kwargs.pop('num_epochs', 10)
self.num_train_samples = kwargs.pop('num_train_samples', 1000)
self.num_val_samples = kwargs.pop('num_val_samples', None)
self.checkpoint_name = kwargs.pop('checkpoint_name', None)
self.print_every = kwargs.pop('print_every', 10)
self.verbose = kwargs.pop('verbose', True)
# Throw an error if there are extra keyword arguments
if len(kwargs) > 0:
extra = ', '.join('"%s"' % k for k in list(kwargs.keys()))
raise ValueError('Unrecognized arguments %s' % extra)
# Make sure the update rule exists, then replace the string
# name with the actual function
if not hasattr(optim, self.update_rule):
raise ValueError('Invalid update_rule "%s"' % self.update_rule)
# now, it's a function
self.update_rule = getattr(optim, self.update_rule)
self._reset()
__init__() then calls _reset(), which sets up bookkeeping variables for training and gives every model parameter its own copy of the optimizer configuration, so that later the parameter and its optimizer state can be passed together to optim for the weight update.
def _reset(self):
"""
Set up some book-keeping variables for optimization. Don't call this
manually.
"""
# Set up some variables for book-keeping
self.epoch = 0
self.best_val_acc = 0
self.best_params = {}
self.loss_history = []
self.train_acc_history = []
self.val_acc_history = []
# Make a deep copy of the optim_config for each parameter
self.optim_configs = {}
for p in self.model.params:
d = {k: v for k, v in self.optim_config.items()}
self.optim_configs[p] = d
The _step() method performs a single weight update using one minibatch of data:
def _step(self):
"""
Make a single gradient update. This is called by train() and should not
be called manually.
"""
# Make a minibatch of training data
num_train = self.X_train.shape[0]
batch_mask = np.random.choice(num_train, self.batch_size)
X_batch = self.X_train[batch_mask]
y_batch = self.y_train[batch_mask]
# Compute loss and gradient
loss, grads = self.model.loss(X_batch, y_batch)
self.loss_history.append(loss)
# Perform a parameter update
for p, w in self.model.params.items():
dw = grads[p]
config = self.optim_configs[p]
next_w, next_config = self.update_rule(w, dw, config)
self.model.params[p] = next_w
self.optim_configs[p] = next_config
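Every update rule in optim.py has the signature f(w, dw, config) -> (next_w, next_config). As a standalone sketch with made-up shapes, the plain sgd rule behaves like this:
import numpy as np
from cs231n import optim

w = np.ones((3, 4))
dw = 0.1 * np.ones((3, 4))
next_w, config = optim.sgd(w, dw, {'learning_rate': 1e-2})
print(next_w[0, 0])  # 1 - 1e-2 * 0.1 = 0.999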
The _save_checkpoint() method saves the current model together with the solver configuration:
def _save_checkpoint(self):
if self.checkpoint_name is None: return
checkpoint = {
'model': self.model,
'update_rule': self.update_rule,
'lr_decay': self.lr_decay,
'optim_config': self.optim_config,
'batch_size': self.batch_size,
'num_train_samples': self.num_train_samples,
'num_val_samples': self.num_val_samples,
'epoch': self.epoch,
'loss_history': self.loss_history,
'train_acc_history': self.train_acc_history,
'val_acc_history': self.val_acc_history,
}
filename = '%s_epoch_%d.pkl' % (self.checkpoint_name, self.epoch)
if self.verbose:
print('Saving checkpoint to "%s"' % filename)
with open(filename, 'wb') as f:
pickle.dump(checkpoint, f)
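A checkpoint can be loaded back with pickle; the filename below is hypothetical:
import pickle

with open('my_net_epoch_5.pkl', 'rb') as f:
    checkpoint = pickle.load(f)

model = checkpoint['model']  # the pickled model object
print('resuming from epoch', checkpoint['epoch'])
print('val acc history so far:', checkpoint['val_acc_history'])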
check_accuracy(X, y, num_samples=None, batch_size=100) is called after each epoch: it draws num_samples points from X to measure accuracy, and to avoid holding too much data in memory it computes predictions one batch_size chunk at a time.
def check_accuracy(self, X, y, num_samples=None, batch_size=100):
"""
Check accuracy of the model on the provided data.
Inputs:
- X: Array of data, of shape (N, d_1, ..., d_k)
- y: Array of labels, of shape (N,)
- num_samples: If not None, subsample the data and only test the model
on num_samples datapoints.
- batch_size: Split X and y into batches of this size to avoid using
too much memory.
Returns:
- acc: Scalar giving the fraction of instances that were correctly
classified by the model.
"""
# Maybe subsample the data
N = X.shape[0]
if num_samples is not None and N > num_samples:
mask = np.random.choice(N, num_samples)
N = num_samples
X = X[mask]
y = y[mask]
# Compute predictions in batches
num_batches = N // batch_size
if N % batch_size != 0:
num_batches += 1
y_pred = []
for i in range(num_batches):
start = i * batch_size
end = (i + 1) * batch_size
scores = self.model.loss(X[start:end])
y_pred.append(np.argmax(scores, axis=1))
y_pred = np.hstack(y_pred)
acc = np.mean(y_pred == y)
return acc
Finally, the train() method: it first computes how many iterations make up one epoch from the batch size, and from that the total number of iterations. At the end of every epoch it decays the learning rate, and it also measures accuracy on the training and validation sets.
def train(self):
"""
Run optimization to train the model.
"""
num_train = self.X_train.shape[0]
iterations_per_epoch = max(num_train // self.batch_size, 1)
num_iterations = self.num_epochs * iterations_per_epoch
for t in range(num_iterations):
self._step()
# Maybe print training loss
if self.verbose and t % self.print_every == 0:
print('(Iteration %d / %d) loss: %f' % (
t + 1, num_iterations, self.loss_history[-1]))
# At the end of every epoch, increment the epoch counter and decay
# the learning rate.
epoch_end = (t + 1) % iterations_per_epoch == 0
if epoch_end:
self.epoch += 1
for k in self.optim_configs:
self.optim_configs[k]['learning_rate'] *= self.lr_decay
# Check train and val accuracy on the first iteration, the last
# iteration, and at the end of each epoch.
first_it = (t == 0)
last_it = (t == num_iterations - 1)
if first_it or last_it or epoch_end:
train_acc = self.check_accuracy(self.X_train, self.y_train,
num_samples=self.num_train_samples)
val_acc = self.check_accuracy(self.X_val, self.y_val,
num_samples=self.num_val_samples)
self.train_acc_history.append(train_acc)
self.val_acc_history.append(val_acc)
self._save_checkpoint()
if self.verbose:
print('(Epoch %d / %d) train acc: %f; val_acc: %f' % (
self.epoch, self.num_epochs, train_acc, val_acc))
# Keep track of the best model
if val_acc > self.best_val_acc:
self.best_val_acc = val_acc
self.best_params = {}
for k, v in self.model.params.items():
self.best_params[k] = v.copy()
# At the end of training swap the best params into the model
self.model.params = self.best_params
With a learning rate of 1e-2 the weight updates were too large and the loss quickly blew up to nan, so a smaller learning rate is used below.
model = TwoLayerNet()
##############################################################################
# TODO: Use a Solver instance to train a TwoLayerNet that achieves at least #
# 50% accuracy on the validation set. #
##############################################################################
solver = Solver(model,data,
num_epochs = 30,
update_rule = 'sgd',
optim_config = {
'learning_rate': 1e-3
},
lr_decay = 0.95,
batch_size = 128,
print_every = 100
)
solver.train()
Visualize the training process:
plt.subplot(2, 1, 1)
plt.title('Training loss')
plt.plot(solver.loss_history, 'o')
plt.xlabel('Iteration')
plt.subplot(2, 1, 2)
plt.title('Accuracy')
plt.plot(solver.train_acc_history, '-o', label='train')
plt.plot(solver.val_acc_history, '-o', label='val')
plt.plot([0.5] * len(solver.val_acc_history), 'k--')
plt.xlabel('Epoch')
plt.legend(loc='lower right')
plt.gcf().set_size_inches(15, 12)
plt.show()
Next, complete the FullyConnectedNet class in the file cs231n/classifiers/fc_net.py.
# As a first step you can leave out batch normalization and dropout
class FullyConnectedNet(object):
"""
A fully-connected neural network with an arbitrary number of hidden layers,
ReLU nonlinearities, and a softmax loss function. This will also implement
dropout and batch/layer normalization as options. For a network with L layers,
the architecture will be
{affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax
where batch/layer normalization and dropout are optional, and the {...} block is
repeated L - 1 times.
Similar to the TwoLayerNet above, learnable parameters are stored in the
self.params dictionary and will be learned using the Solver class.
"""
def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10,
dropout=1, normalization=None, reg=0.0,
weight_scale=1e-2, dtype=np.float32, seed=None):
self.normalization = normalization
self.use_dropout = dropout != 1
self.reg = reg
self.num_layers = 1 + len(hidden_dims)
self.dtype = dtype
self.params = {}
input_size = input_dim
for i in range(len(hidden_dims)):
output_size = hidden_dims[i]
self.params['W' + str(i+1)] = np.random.randn(input_size,output_size) * weight_scale
self.params['b' + str(i+1)] = np.zeros(output_size)
if self.normalization is not None:
self.params['gamma' + str(i+1)] = np.ones(output_size)
self.params['beta' + str(i+1)] = np.zeros(output_size)
input_size = output_size  # input size of the next layer
# Output layer: no batch norm parameters here
self.params['W' + str(self.num_layers)] = np.random.randn(input_size,num_classes) * weight_scale
self.params['b' + str(self.num_layers)] = np.zeros(num_classes)
# When using dropout we need to pass a dropout_param dictionary to each
# dropout layer so that the layer knows the dropout probability and the mode
# (train / test). You can pass the same dropout_param to each dropout layer.
self.dropout_param = {}
if self.use_dropout:
self.dropout_param = {'mode': 'train', 'p': dropout}
if seed is not None:
self.dropout_param['seed'] = seed
self.bn_params = []
if self.normalization=='batchnorm':
self.bn_params = [{'mode': 'train'} for i in range(self.num_layers - 1)]
if self.normalization=='layernorm':
self.bn_params = [{} for i in range(self.num_layers - 1)]
# Cast all parameters to the correct datatype
for k, v in self.params.items():
self.params[k] = v.astype(dtype)
def loss(self, X, y=None):
"""
Compute loss and gradient for the fully-connected net.
Input / output: Same as TwoLayerNet above.
"""
X = X.astype(self.dtype)
mode = 'test' if y is None else 'train'
# Set train/test mode for batchnorm params and dropout param since they
# behave differently during training and testing.
if self.use_dropout:
self.dropout_param['mode'] = mode
if self.normalization=='batchnorm':
for bn_param in self.bn_params:
bn_param['mode'] = mode
cache = {}  # store everything the backward pass will need
hidden = X
for i in range(self.num_layers - 1):
if self.normalization :
pass
else:
hidden , cache[i+1] = affine_relu_forward(hidden,self.params['W' + str(i+1)],
self.params['b' + str(i+1)])
if self.use_dropout:
pass
# No activation after the last layer
scores, cache[self.num_layers] = affine_forward(hidden , self.params['W' + str(self.num_layers)],
self.params['b' + str(self.num_layers)])
# If test mode return early
if mode == 'test':
return scores
loss, grads = 0.0, {}
loss, dS = softmax_loss(scores , y)
# The last layer has no ReLU, so backprop through plain affine_backward
dhidden, grads['W' + str(self.num_layers)], grads['b' + str(self.num_layers)] \
= affine_backward(dS,cache[self.num_layers])
loss += 0.5 * self.reg * np.sum(self.params['W' + str(self.num_layers)] * self.params['W' + str(self.num_layers)])
grads['W' + str(self.num_layers)] += self.reg * self.params['W' + str(self.num_layers)]
for i in range(self.num_layers - 1, 0, -1):
loss += 0.5 * self.reg * np.sum(self.params["W" + str(i)] * self.params["W" + str(i)])
# Walk backwards through the hidden layers
if self.use_dropout:
pass
if self.normalization:
pass
else:
dhidden, dw, db = affine_relu_backward(dhidden, cache[i])
grads['W' + str(i)] = dw + self.reg * self.params['W' + str(i)]
grads['b' + str(i)] = db
return loss, grads
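As with the two-layer net, it is worth sanity-checking the initial loss with and without regularization (the sizes below are made up):
np.random.seed(231)
N, D, H1, H2, C = 2, 15, 20, 30, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size=(N,))

for reg in [0, 3.14]:
    model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,
                              reg=reg, weight_scale=5e-2, dtype=np.float64)
    loss, grads = model.loss(X, y)
    print('reg = %s, initial loss: %f' % (reg, loss))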
To find a workable learning rate and weight scale for the deeper nets, sample them log-uniformly at random:
from random import uniform
for i in range(10):
lr = 10**uniform(-1, -3)
weight_scale = 10**uniform(-1, -4)
test_it()
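A hedged sketch of how such a random search could drive the Solver in practice (the trial count, layer sizes, and epoch budget are made-up values):
best_val, best_cfg = -1, None
for _ in range(10):
    lr = 10 ** uniform(-1, -3)
    weight_scale = 10 ** uniform(-1, -4)
    model = FullyConnectedNet([100, 100, 100, 100], weight_scale=weight_scale)
    solver = Solver(model, data, num_epochs=2, batch_size=50,
                    update_rule='sgd',
                    optim_config={'learning_rate': lr},
                    verbose=False)
    solver.train()
    val_acc = max(solver.val_acc_history)
    if val_acc > best_val:
        best_val, best_cfg = val_acc, (lr, weight_scale)
print('best val acc %.3f with lr=%.2e, weight_scale=%.2e' % ((best_val,) + best_cfg))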
So far we have trained our networks with plain sgd. In this section we implement a few other update rules for the parameters; they are defined in optim.py.
def sgd_momentum(w, dw, config=None):
"""
Performs stochastic gradient descent with momentum.
config format:
- learning_rate: Scalar learning rate.
- momentum: Scalar between 0 and 1 giving the momentum value.
Setting momentum = 0 reduces to sgd.
- velocity: A numpy array of the same shape as w and dw used to store a
moving average of the gradients.
"""
if config is None: config = {}
config.setdefault('learning_rate', 1e-2)
config.setdefault('momentum', 0.9)
v = config.get('velocity', np.zeros_like(w))
# The gradient changes the velocity, and the velocity moves the position
v = config['momentum'] * v - config['learning_rate'] * dw
next_w = w + v
config['velocity'] = v
return next_w, config
def rmsprop(w, dw, config=None):
"""
Uses the RMSProp update rule, which uses a moving average of squared
gradient values to set adaptive per-parameter learning rates.
config format:
- learning_rate: Scalar learning rate.
- decay_rate: Scalar between 0 and 1 giving the decay rate for the squared
gradient cache.
- epsilon: Small scalar used for smoothing to avoid dividing by zero.
- cache: Moving average of second moments of gradients.
"""
if config is None: config = {}
config.setdefault('learning_rate', 1e-2)
config.setdefault('decay_rate', 0.99)
config.setdefault('epsilon', 1e-8)
config.setdefault('cache', np.zeros_like(w))
config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * dw * dw
next_w = w - config['learning_rate'] * dw / (np.sqrt(config['cache']) + config['epsilon'])
return next_w, config
def adam(w, dw, config=None):
"""
Uses the Adam update rule, which incorporates moving averages of both the
gradient and its square and a bias correction term.
config format:
- learning_rate: Scalar learning rate.
- beta1: Decay rate for moving average of first moment of gradient.
- beta2: Decay rate for moving average of second moment of gradient.
- epsilon: Small scalar used for smoothing to avoid dividing by zero.
- m: Moving average of gradient.
- v: Moving average of squared gradient.
- t: Iteration number.
"""
if config is None: config = {}
config.setdefault('learning_rate', 1e-3)
config.setdefault('beta1', 0.9)
config.setdefault('beta2', 0.999)
config.setdefault('epsilon', 1e-8)
config.setdefault('m', np.zeros_like(w))
config.setdefault('v', np.zeros_like(w))
config.setdefault('t', 0)
# First moment: moving average of the gradient
config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dw
# Second moment: moving average of the squared gradient
config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * dw * dw
# Bias correction
config['t'] += 1
m_unbias = config['m'] / (1 - config['beta1'] ** config['t'])
v_unbias = config['v'] / (1 - config['beta2'] ** config['t'])
# Parameter update
next_w = w - m_unbias * config['learning_rate'] / (np.sqrt(v_unbias) + config['epsilon'])
return next_w, config
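The new rules plug straight into the Solver via the update_rule string; the hyperparameters below are plausible starting points, not tuned values:
model = FullyConnectedNet([100, 100, 100, 100, 100], weight_scale=5e-2)
solver = Solver(model, data,
                num_epochs=5, batch_size=100,
                update_rule='adam',
                optim_config={'learning_rate': 1e-3},
                print_every=100, verbose=True)
solver.train()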
After working through BatchNormalization.ipynb and Dropout.ipynb, come back and build a more complex network that reaches at least 50% accuracy on the validation set.
Q1: We've only asked you to implement ReLU, but there are a number of different activation functions that one could use in neural networks, each with its pros and cons. In particular, an issue commonly seen with activation functions is getting zero (or close to zero) gradient flow during backpropagation. Which of the following activation functions have this problem? If you consider these functions in the one dimensional case, what types of input would lead to this behaviour?
A1: Sigmoid saturates at both ends: when the input is very large or very small the gradient is essentially zero. ReLU saturates on the negative half-axis, where inputs below zero give zero gradient.
Q2: Did you notice anything about the comparative difficulty of training the three-layer net vs training the five layer net? In particular, based on your experience, which network seemed more sensitive to the initialization scale? Why do you think that is the case?
A2: When training the five-layer net with sgd earlier, the weight initialization scale turned out to matter a lot: (1) if the scale is too small, some layer's neurons may receive essentially no gradient; (2) if it is too large, gradients in some layer can explode. Because the weights keep participating in matrix multiplications layer after layer, the deeper the network, the more sensitive it is to the initialization scale.
Q3: AdaGrad, like Adam, is a per-parameter optimization method that uses the following update rule:
cache += dw**2
w += - learning_rate * dw / (np.sqrt(cache) + eps)
John notices that when he was training a network with AdaGrad that the updates became very small, and that his network was learning slowly. Using your knowledge of the AdaGrad update rule, why do you think the updates would become very small? Would Adam have the same issue?
A3: AdaGrad's squared-gradient cache never decays; as it keeps accumulating, the effective step size shrinks more and more, which is why a decay factor is needed. Adam does not have this problem, because it uses exponentially decaying averages together with bias correction.
A few Python takeaways from this notebook:
A .py file is a module, and the functions inside it can be treated as its attributes: hasattr(object, name) checks whether the module has a function of that name, and getattr(object, name[, default]) returns the function called name.
raise throws an exception.
pickle.dump() serializes a dictionary (here, the checkpoint) to disk.
np.hstack() concatenates the per-batch predictions into a single array.