A neural network alternates between forward propagation and backward propagation.
The forward pass evaluates the score function and the loss function; the backward pass then computes, layer by layer, the gradient of the loss with respect to W and b, and those gradients are used to update W and b.
The overall process can be summarized as: forward –> backward –> update –> forward –> backward –> update ...
This is a two-layer fully-connected neural network consisting of an input layer, a hidden layer, and an output layer (the input layer is not counted as a layer). The 4 nodes of the input layer mean each sample is 4-dimensional, and the 3 nodes of the output layer mean there are 3 classes; the output is a score for each class.
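To make the shapes concrete, here is a minimal sketch of how the parameters of such a network could be initialized; the hidden size of 10 matches the toy 4-10-3 network used in the code below, and the small initialization scale std is an assumption (the assignment's actual TwoLayerNet.__init__ may differ):

import numpy as np

input_size, hidden_size, output_size = 4, 10, 3   # 4-10-3 toy network
std = 1e-4                                        # assumed small initialization scale
params = {
    'W1': std * np.random.randn(input_size, hidden_size),   # (4, 10)
    'b1': np.zeros(hidden_size),                             # (10,)
    'W2': std * np.random.randn(hidden_size, output_size),   # (10, 3)
    'b2': np.zeros(output_size),                             # (3,)
}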
FC (fully-connected) layer: a layer in which every neuron is connected to all neurons of the previous layer.
For neural networks, bigger is generally better: more neurons = more capacity, because in practice the model's expressive power is often the limiting factor (overfitting in a larger network is then controlled by regularization rather than by shrinking the network).
The activation function used here is ReLU.
Why do we need an activation function?
Answer: without an activation function, each layer's output is a linear function of its input, so no matter how many hidden layers are stacked, the network behaves exactly like one with no hidden layer at all. The activation function applies a non-linear transformation before the output, which is what makes a multi-layer network more expressive than a linear model.
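A small numerical illustration of this point (the shapes here are hypothetical): two stacked linear layers with no activation in between collapse into a single equivalent linear layer.

import numpy as np

np.random.seed(0)
X = np.random.randn(5, 4)
W1, b1 = np.random.randn(4, 10), np.random.randn(10)
W2, b2 = np.random.randn(10, 3), np.random.randn(3)

two_linear = (X.dot(W1) + b1).dot(W2) + b2    # two linear layers, no non-linearity
W, b = W1.dot(W2), b1.dot(W2) + b2            # one equivalent linear layer
one_linear = X.dot(W) + b
print(np.allclose(two_linear, one_linear))    # True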
This assignment uses the softmax loss function; see the softmax classifier for reference.
The key to the assignment is again computing the gradient of the loss function with respect to W. The derivation follows (note: there are some mistakes in my subscript notation, but they do not affect understanding or the code).
With the score, the loss, and the gradient all written out, the two-layer neural network can be completed.
Forward: compute the scores, then compute the loss from the scores.
Backward: compute the gradients with respect to W2, b2, W1, and b1 (see the formulas below).
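For reference (the original hand-written derivation is not reproduced here), the quantities that the code below implements can be written as follows, using $\lambda$ for reg:

$$h_1 = \max(0,\; XW_1 + b_1), \qquad s = h_1 W_2 + b_2$$

$$L = \frac{1}{N}\sum_{i=1}^{N}\Big(-s_{i,y_i} + \log\sum_{j} e^{s_{i,j}}\Big) + \frac{\lambda}{2}\big(\|W_1\|^2 + \|W_2\|^2\big)$$

$$\frac{\partial L}{\partial s_{i,j}} = \frac{1}{N}\big(p_{i,j} - \mathbb{1}[j = y_i]\big), \qquad p_{i,j} = \frac{e^{s_{i,j}}}{\sum_k e^{s_{i,k}}}$$

$$\frac{\partial L}{\partial W_2} = h_1^{\top}\,\frac{\partial L}{\partial s} + \lambda W_2, \qquad \frac{\partial L}{\partial b_2} = \sum_i \frac{\partial L}{\partial s_i}$$

$$\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial s}\,W_2^{\top}, \qquad \frac{\partial L}{\partial W_1} = X^{\top}\Big(\mathbb{1}[h_1 > 0] \odot \frac{\partial L}{\partial h_1}\Big) + \lambda W_1, \qquad \frac{\partial L}{\partial b_1} = \sum_i \Big(\mathbb{1}[h_1 > 0] \odot \frac{\partial L}{\partial h_1}\Big)_i$$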
def loss(self, X, y=None, reg=0.0):
    """
    Compute the loss and gradients for a two layer fully connected neural
    network.

    Inputs:
    - X: Input data of shape (N, D). Each X[i] is a training sample.
    - y: Vector of training labels. y[i] is the label for X[i], and each y[i] is
      an integer in the range 0 <= y[i] < C. This parameter is optional; if it
      is not passed then we only return scores, and if it is passed then we
      instead return the loss and gradients.
    - reg: Regularization strength.

    Returns:
    If y is None, return a matrix scores of shape (N, C) where scores[i, c] is
    the score for class c on input X[i].

    If y is not None, instead return a tuple of:
    - loss: Loss (data loss and regularization loss) for this batch of training
      samples.
    - grads: Dictionary mapping parameter names to gradients of those parameters
      with respect to the loss function; has the same keys as self.params.
    """
    # Unpack variables from the params dictionary
    W1, b1 = self.params['W1'], self.params['b1']
    W2, b2 = self.params['W2'], self.params['b2']
    N, D = X.shape

    # Compute the forward pass
    scores = None
    #############################################################################
    # TODO: Perform the forward pass, computing the class scores for the input. #
    # Store the result in the scores variable, which should be an array of      #
    # shape (N, C).                                                             #
    #############################################################################
    h1 = np.maximum(0, np.dot(X, W1) + b1)  # ReLU hidden layer, (N, H), here (5, 10)
    # print(h1.shape)
    scores = np.dot(h1, W2) + b2            # class scores, (N, C), here (5, 3)
    # print(scores.shape)
    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################

    # If the targets are not given then jump out, we're done
    if y is None:
        return scores
    # Compute the loss
    loss = None
    #############################################################################
    # TODO: Finish the forward pass, and compute the loss. This should include  #
    # both the data loss and L2 regularization for W1 and W2. Store the result  #
    # in the variable loss, which should be a scalar. Use the Softmax           #
    # classifier loss.                                                          #
    #############################################################################
    exp_S = np.exp(scores)                  # unnormalized probabilities, (N, C)
    sum_exp_S = np.sum(exp_S, axis=1)
    sum_exp_S = sum_exp_S.reshape(-1, 1)    # (N, 1)
    # print(sum_exp_S.shape)
    # Softmax data loss, averaged over the batch, plus L2 regularization.
    # Use np.sum (not the builtin sum) so that loss is a scalar.
    loss = np.sum(-scores[range(N), list(y)]) + np.sum(np.log(sum_exp_S))
    loss = loss / N + 0.5 * reg * np.sum(W1 * W1) + 0.5 * reg * np.sum(W2 * W2)
    # print(loss)
    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################
    # Backward pass: compute gradients
    grads = {}
    #############################################################################
    # TODO: Compute the backward pass, computing the derivatives of the weights #
    # and biases. Store the results in the grads dictionary. For example,       #
    # grads['W1'] should store the gradient on W1, and be a matrix of same size #
    #############################################################################
    '''
    The toy network is 4-10-3.
    Layer 2: input h1 (5, 10), W2 (10, 3), output scores (5, 3).
    Layer 1: input X (5, 4), W1 (4, 10), output h1 (5, 10).
    '''
    # ---------------------------------#
    # Gradient of the loss w.r.t. the scores: softmax probabilities minus the
    # one-hot encoding of the correct class, averaged over the batch.
    dscores = np.zeros(scores.shape)
    dscores[range(N), list(y)] = -1
    dscores += (exp_S / sum_exp_S)          # (N, C)
    dscores /= N
    grads['W2'] = np.dot(h1.T, dscores)
    grads['W2'] += reg * W2
    grads['b2'] = np.sum(dscores, axis=0)
    # ---------------------------------#
    # Backpropagate into the hidden layer and through the ReLU.
    dh1 = np.dot(dscores, W2.T)             # (N, H)
    dh1_ReLU = (h1 > 0) * dh1
    grads['W1'] = X.T.dot(dh1_ReLU) + reg * W1
    grads['b1'] = np.sum(dh1_ReLU, axis=0)
    # ---------------------------------#
    '''
    My earlier version. Drawbacks: it does not reuse intermediate variables,
    and the logic is harder to read.

    buf = np.zeros(scores.shape)
    buf[range(N), list(y)] = -1
    # print(buf + (exp_S / sum_exp_S))
    grads['W2'] = np.dot(h1.T, (buf + (exp_S / sum_exp_S)))   # (10, 3)
    grads['W2'] = grads['W2'] / N + reg * W2
    grads['b2'] = np.sum((exp_S / sum_exp_S) + buf, axis=0)   # (3,)
    '''
    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################

    return loss, grads
First, compute the numerical gradient:
def eval_numerical_gradient(f, x, verbose=True, h=0.00001):
    """
    a naive implementation of numerical gradient of f at x
    - f should be a function that takes a single argument
    - x is the point (numpy array) to evaluate the gradient at
    """
    fx = f(x)  # evaluate function value at original point
    # x is one of the parameter arrays: W1, b1, W2 or b2
    grad = np.zeros_like(x)
    # iterate over all indexes in x
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])  # build an iterator over x
    while not it.finished:
        # evaluate function at x+h
        ix = it.multi_index        # index of the current element
        oldval = x[ix]             # current parameter value
        x[ix] = oldval + h         # increment by h
        fxph = f(x)                # evaluate f(x + h)
        x[ix] = oldval - h
        fxmh = f(x)                # evaluate f(x - h)
        x[ix] = oldval             # restore

        # compute the partial derivative with centered formula
        grad[ix] = (fxph - fxmh) / (2 * h)  # the slope
        if verbose:
            print(ix, grad[ix])    # print the numerical gradient for this element
        it.iternext()              # step to the next element; without this the loop stays at (0, 0)

    return grad                    # the numerical gradient
Then run the check:
from cs231n.gradient_check import eval_numerical_gradient

# Use numeric gradient checking to check your implementation of the backward pass.
# If your implementation is correct, the difference between the numeric and
# analytic gradients should be less than 1e-8 for each of W1, W2, b1, and b2.
loss, grads = net.loss(X, y, reg=0.1)
# print(loss)
# print(grads)

# these should all be less than 1e-8 or so
for param_name in grads:
    f = lambda W: net.loss(X, y, reg=0.1)[0]  # input: a parameter array W; output: the loss
    param_grad_num = eval_numerical_gradient(f, net.params[param_name], verbose=False)  # numerical gradient
    # print(np.abs(param_grad_num - grads[param_name]))
    print('%s max relative error: %e' % (param_name, rel_error(param_grad_num, grads[param_name])))
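The snippet above relies on a rel_error helper that is not defined in this post. A common definition, as used in the CS231n notebooks, is:

import numpy as np

def rel_error(x, y):
    """ returns the maximum relative error between two arrays """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))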
Now that both forward and backward are implemented and the gradient check passes, the next step is to train a good model.
The basic iteration loop is: forward –> backward –> update –> forward –> backward –> update ...
def train(self, X, y, X_val, y_val,
          learning_rate=1e-3, learning_rate_decay=0.95,
          reg=5e-6, num_iters=100,
          batch_size=200, verbose=False):
    """
    Train this neural network using stochastic gradient descent.

    Inputs:
    - X: A numpy array of shape (N, D) giving training data.
    - y: A numpy array of shape (N,) giving training labels; y[i] = c means that
      X[i] has label c, where 0 <= c < C.
    - X_val: A numpy array of shape (N_val, D) giving validation data.
    - y_val: A numpy array of shape (N_val,) giving validation labels.
    - learning_rate: Scalar giving learning rate for optimization.
    - learning_rate_decay: Scalar giving factor used to decay the learning rate
      after each epoch.
    - reg: Scalar giving regularization strength.
    - num_iters: Number of steps to take when optimizing.
    - batch_size: Number of training examples to use per step.
    - verbose: boolean; if true print progress during optimization.
    """
    num_train = X.shape[0]
    iterations_per_epoch = max(num_train // batch_size, 1)

    # Use SGD to optimize the parameters in self.model
    loss_history = []
    train_acc_history = []
    val_acc_history = []

    for it in range(num_iters):
        X_batch = None
        y_batch = None
        #########################################################################
        # TODO: Create a random minibatch of training data and labels, storing  #
        # them in X_batch and y_batch respectively.                             #
        #########################################################################
        mask = np.random.choice(num_train, batch_size, replace=True)
        X_batch = X[mask]
        y_batch = y[mask]
        #########################################################################
        #                            END OF YOUR CODE                           #
        #########################################################################

        # Compute loss and gradients using the current minibatch
        loss, grads = self.loss(X_batch, y=y_batch, reg=reg)
        loss_history.append(loss)

        #########################################################################
        # TODO: Use the gradients in the grads dictionary to update the         #
        # parameters of the network (stored in the dictionary self.params)      #
        # using stochastic gradient descent. You'll need to use the gradients   #
        # stored in the grads dictionary defined above.                         #
        #########################################################################
        self.params['W1'] += -learning_rate * grads['W1']
        self.params['b1'] += -learning_rate * grads['b1']
        self.params['W2'] += -learning_rate * grads['W2']
        self.params['b2'] += -learning_rate * grads['b2']
        #########################################################################
        #                            END OF YOUR CODE                           #
        #########################################################################

        if verbose and it % 100 == 0:
            print('iteration %d / %d: loss %f' % (it, num_iters, loss))

        # Every epoch, check train and val accuracy and decay learning rate.
        if it % iterations_per_epoch == 0:
            # Check accuracy
            # print('epoch boundary at iteration %d' % it)
            train_acc = (self.predict(X_batch) == y_batch).mean()
            val_acc = (self.predict(X_val) == y_val).mean()
            train_acc_history.append(train_acc)
            val_acc_history.append(val_acc)

            # Decay learning rate
            learning_rate *= learning_rate_decay

    return {
        'loss_history': loss_history,
        'train_acc_history': train_acc_history,
        'val_acc_history': val_acc_history,
    }
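A training call might then look like the following; the hyperparameter values here are illustrative rather than taken from the assignment:

stats = net.train(X, y, X_val, y_val,
                  learning_rate=1e-1, learning_rate_decay=0.95,
                  reg=5e-6, num_iters=1000, verbose=True)
print('Final training loss:', stats['loss_history'][-1])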
Once the model is trained, we can make predictions on unseen samples.
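The predict method used above is not shown in this post; a minimal sketch consistent with the forward pass implemented earlier (the assignment's own version may differ in details):

def predict(self, X):
    """
    Predict labels by taking the argmax of the class scores from a forward pass.
    """
    h1 = np.maximum(0, X.dot(self.params['W1']) + self.params['b1'])  # ReLU hidden layer
    scores = h1.dot(self.params['W2']) + self.params['b2']            # class scores (N, C)
    y_pred = np.argmax(scores, axis=1)                                # highest-scoring class per sample
    return y_pred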
1. np.random.seed(a)
Function: fix the random seed, where a is any integer (e.g. 0), so that every subsequent sequence of random numbers is reproducible. For example:
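import numpy as np

np.random.seed(0)
print(np.random.rand(2))   # e.g. [0.5488135  0.71518937]
np.random.seed(0)
print(np.random.rand(2))   # the same numbers again, because the seed was reset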
2. np.nditer(array, flags=['multi_index'], op_flags=['readwrite'])
Function: build an iterator over an array.
Argument 1: the array; flags=['multi_index']: track a multi-dimensional index; op_flags=['readwrite']: allow read/write access.
In this assignment it is used to obtain each element's index when computing the numerical gradient, as in the small example below.
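import numpy as np

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
while not it.finished:
    print(it.multi_index, x[it.multi_index])   # (0, 0) 1.0, (0, 1) 2.0, (1, 0) 3.0, (1, 1) 4.0
    it.iternext()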