cs231n assignment2 Fully-connected Neural Network

This starts the study of convolutional neural networks, and there is quite a lot of material. The assignment is split into five parts: Q1: Fully-connected Neural Network, Q2: Batch Normalization, Q3: Dropout, Q4: Convolutional Networks, Q5: PyTorch / TensorFlow on CIFAR-10.

The section below covers the fully-connected network. It uses a modular design, which keeps the ideas very clear.

Preparation:

  • Install PyTorch (the stable release on the official site is 1.1): https://pytorch.org/get-started/locally/
  • Prepare the CIFAR-10 dataset

Contents

      • Affine layer
        • forward
        • backward
      • ReLU activation
        • forward
        • backward
      • "Sandwich" layers
      • Loss layers: Softmax and SVM
      • Two-layer network
      • Solver
      • Multilayer network
      • Update rules
        • SGD+Momentum
        • RMSProp and Adam
      • Inline Questions
      • Trained model

Affine layer

forward

Reshape x and compute the output of the fully-connected (affine) layer.

affine_forward() in layers.py

# the x returned in the cache keeps its original dimensions, i.e. x itself is not reshaped
D = w.shape[0]
new_x = x.reshape(-1,D)
out = new_x.dot(w) + b
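
A quick shape sanity check (a minimal sketch; the array sizes are just for illustration):

import numpy as np

x = np.random.randn(2, 4, 5, 6)      # N = 2, d_1 * d_2 * d_3 = 120
w = np.random.randn(120, 3)          # D = 120, M = 3
b = np.random.randn(3)

out = x.reshape(-1, w.shape[0]).dot(w) + b
print(out.shape)                     # (2, 3)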

backward

affine_backward() in layers.py

db = np.sum(dout, axis=0)                       # (M,)
# recover the shapes
N = dout.shape[0]
new_x = x.reshape(N, -1)                        # (N, D) where D = d_1 * d_2 * ... * d_k
dw = new_x.T.dot(dout)                          # (D, M)
dx = dout.dot(w.T).reshape(x.shape)             # (N, d_1, ..., d_k)
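
The analytic gradients can be checked against numerical ones. Below is a minimal, self-contained check; num_grad is an illustrative finite-difference helper, not the gradient-check utility shipped with the assignment:

import numpy as np

def num_grad(f, x, h=1e-5):
    # centered finite differences, element by element
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + h; fp = f()
        x[idx] = old - h; fm = f()
        x[idx] = old
        grad[idx] = (fp - fm) / (2 * h)
        it.iternext()
    return grad

N, D, M = 4, 6, 3
x = np.random.randn(N, D)
w = np.random.randn(D, M)
b = np.random.randn(M)
dout = np.random.randn(N, M)

dw = x.T.dot(dout)                    # analytic gradient from the formula above
dw_num = num_grad(lambda: np.sum((x.dot(w) + b) * dout), w)
print(np.max(np.abs(dw - dw_num)))    # should be tiny, around 1e-9 or smaller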

ReLU activation

forward

ReLU is max(0, x); numpy's maximum function implements it directly.
relu_forward() in layers.py

out = np.maximum(0,x)

backward

The backward pass for ReLU: the upstream gradient flows through wherever x is greater than 0 and is zero everywhere else:

dx = (x > 0) * dout
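
A tiny concrete example (illustrative values only):

import numpy as np

x = np.array([[-1.0, 2.0], [3.0, -4.0]])
dout = np.array([[10.0, 10.0], [10.0, 10.0]])
print((x > 0) * dout)    # [[ 0. 10.] [10.  0.]]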

“Sandwich” layers

This code is already provided; it simply chains an affine layer with a ReLU layer into a single composite layer.
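
For reference, the composite layer in layer_utils.py is essentially the following (a sketch written in terms of the layers.py functions above):

def affine_relu_forward(x, w, b):
    # affine layer followed by a ReLU; both caches are kept for the backward pass
    a, fc_cache = affine_forward(x, w, b)
    out, relu_cache = relu_forward(a)
    return out, (fc_cache, relu_cache)

def affine_relu_backward(dout, cache):
    # unpack the caches and run the two backward passes in reverse order
    fc_cache, relu_cache = cache
    da = relu_backward(dout, relu_cache)
    dx, dw, db = affine_backward(da, fc_cache)
    return dx, dw, db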

Loss layers: Softmax and SVM

The Softmax and SVM loss functions are already implemented; each computes both the loss and the gradient with respect to the scores.
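
For completeness, the softmax loss is roughly the following (a minimal, numerically stable sketch; the version provided in layers.py may differ in details):

import numpy as np

def softmax_loss_sketch(scores, y):
    # scores: (N, C) class scores, y: (N,) integer labels
    N = scores.shape[0]
    shifted = scores - scores.max(axis=1, keepdims=True)    # subtract the row max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(N), y].mean()
    dscores = np.exp(log_probs)
    dscores[np.arange(N), y] -= 1
    dscores /= N
    return loss, dscores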

Two-layer network

Implement a modular two-layer network.
__init__() in fc_net.py

W1 = np.random.randn(input_dim,hidden_dim) * weight_scale
b1 = np.zeros((1,hidden_dim))
W2 = np.random.randn(hidden_dim,num_classes) * weight_scale
b2 = np.zeros((1,num_classes))
self.params['W1'] = W1
self.params['b1'] = b1
self.params['W2'] = W2
self.params['b2'] = b2

loss() in fc_net.py
Compute the scores

W1 = self.params['W1']
b1 = self.params['b1']
W2 = self.params['W2']
b2 = self.params['b2']

h,cacheh = affine_forward(X,W1,b1)
h1,cacheh1 = relu_forward(h)
out,cache2 = affine_forward(h1,W2,b2)
scores = out

Compute the loss and the gradients

# compute the loss
data_loss,dscores = softmax_loss(scores,y)
reg_loss = 0.5 * self.reg * (np.sum(W1 * W1) + np.sum(W2 * W2))
loss = data_loss + reg_loss

# backward pass
dh1,dW2,db2 = affine_backward(dscores,cache2)
dh = relu_backward(dh1,cacheh1)
dX,dW1,db1 = affine_backward(dh,cacheh)

grads['W2'] = dW2 + self.reg * W2
grads['b2'] = db2
grads['W1'] = dW1 + self.reg * W1
grads['b1'] = db1
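
A quick sanity check on toy data (a sketch; the keyword arguments match the TwoLayerNet constructor, and the notebook does a more thorough check with numerical gradients):

import numpy as np
from cs231n.classifiers.fc_net import TwoLayerNet

N, D, H, C = 3, 5, 50, 7
model = TwoLayerNet(input_dim=D, hidden_dim=H, num_classes=C, weight_scale=1e-3)
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)

scores = model.loss(X)            # no labels: returns the scores, shape (N, C)
loss, grads = model.loss(X, y)    # with labels: returns loss and gradients
print(scores.shape, loss, sorted(grads.keys()))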

Solver

Train using the API of the provided solver.py.
FullyConnectedNets.ipynb

solver = Solver(model, data,
                update_rule='sgd',
                optim_config={
                    'learning_rate': 4e-4
                },
                lr_decay=0.95,
                num_epochs=10, batch_size=100,
                print_every=100)

solver.train()

The validation accuracy reaches about 50%.

Multilayer network

Next comes the FullyConnectedNet class in fc_net.py. The multilayer network also relies on batch normalization, layer normalization, and dropout; those functions are left unimplemented for now and are completed in the later parts of the assignment.
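
For context, parameter initialization in __init__() follows the same pattern as the two-layer net, just in a loop (a sketch, assuming the usual naming W1..WL, b1..bL and gamma/beta for the hidden layers):

dims = [input_dim] + hidden_dims + [num_classes]
for i in range(self.num_layers):
    self.params['W' + str(i + 1)] = weight_scale * np.random.randn(dims[i], dims[i + 1])
    self.params['b' + str(i + 1)] = np.zeros(dims[i + 1])
    # gamma/beta only exist for the hidden layers and only when normalization is used
    if self.normalization is not None and i < self.num_layers - 1:
        self.params['gamma' + str(i + 1)] = np.ones(dims[i + 1])
        self.params['beta' + str(i + 1)] = np.zeros(dims[i + 1])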

In the forward pass, loop over the first L-1 layers following the pattern affine -> normalization -> ReLU -> dropout, then finish with one final affine layer.
The loss() function

# (affine -> [batch/layer norm] -> relu -> [dropout]) repeated L-1 times -> affine -> softmax

scores = X                              # start from the input
caches = list()
for i in range(self.num_layers - 1):
	cache = list()
	scores, fc_cache = affine_forward(scores, self.params['W' + str(i + 1)], self.params['b' + str(i + 1)])
	cache.append(fc_cache)
	if self.normalization is not None:
		if self.normalization == 'batchnorm':
			scores, bn_cache = batchnorm_forward(scores, self.params['gamma' + str(i + 1)], self.params['beta' + str(i + 1)], self.bn_params[i])
			cache.append(bn_cache)
		elif self.normalization == 'layernorm':
			scores, ln_cache = layernorm_forward(scores, self.params['gamma' + str(i + 1)], self.params['beta' + str(i + 1)], self.bn_params[i])
			cache.append(ln_cache)
	scores, relu_cache = relu_forward(scores)
	cache.append(relu_cache)

	if self.use_dropout:
		scores, dropout_cache = dropout_forward(scores, self.dropout_param)
		cache.append(dropout_cache)
	caches.append(cache)
scores, fc_cache = affine_forward(scores, self.params['W' + str(self.num_layers)], self.params['b' + str(self.num_layers)])
caches.append(fc_cache)

Backward pass, computing the loss and the gradients:

loss, dx = softmax_loss(scores, y)

# add the regularization term
for i in range(self.num_layers):
	loss += 0.5 * self.reg * np.sum(self.params['W' + str(i + 1)] ** 2)

# compute the gradients
for i in range(self.num_layers, 0, -1):
	if i == self.num_layers:
		# gradient of the last affine layer
		dx, dw, db = affine_backward(dx, caches[i - 1])
		# store the gradients and add the regularization term
		grads['W%s' % i] = dw + self.reg * self.params['W' + str(i)]
		grads['b%s' % i] = db
	else:
		j = -1
		if self.use_dropout:
			dx = dropout_backward(dx, caches[i - 1][j])
			j -= 1

		dx = relu_backward(dx, caches[i - 1][j])
		j -= 1
		if self.normalization is not None:
			if self.normalization == 'batchnorm':
				dx, dgamma, dbeta = batchnorm_backward(dx, caches[i - 1][j])
				j -= 1
			elif self.normalization == 'layernorm':
				dx, dgamma, dbeta = layernorm_backward(dx, caches[i - 1][j])
				j -= 1
			else:
				raise ValueError("No such normalization")
			# store the normalization gradients
			grads['gamma%s' % i] = dgamma
			grads['beta%s' % i] = dbeta

		dx, dw, db = affine_backward(dx, caches[i - 1][j])
		# store the gradients and add the regularization term
		grads['W%s' % i] = dw + self.reg * self.params['W' + str(i)]
		grads['b%s' % i] = db

Next, choose a learning_rate and weight_scale that let the model overfit a small subset of the data, i.e. reach 100% training accuracy. Increase them gradually based on how the loss curve behaves; a sketch of the experiment follows the parameter values below.

Hyperparameter choices for the three-layer network:

weight_scale = 1e-2   # Experiment with this!
learning_rate = 1.5e-2  # Experiment with this!

Hyperparameter choices for the five-layer network:

learning_rate = 2e-3  # Experiment with this!
weight_scale = 1e-1   # Experiment with this!
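
The overfitting experiment itself looks roughly like this (a sketch based on the notebook cell; small_data is the 50-example subset prepared by the notebook, and the values are the three-layer ones above):

weight_scale = 1e-2
learning_rate = 1.5e-2
model = FullyConnectedNet([100, 100], weight_scale=weight_scale, dtype=np.float64)
solver = Solver(model, small_data,
                print_every=10, num_epochs=20, batch_size=25,
                update_rule='sgd',
                optim_config={'learning_rate': learning_rate})
solver.train()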

Update rules

All of the training above used vanilla stochastic gradient descent (SGD). One drawback of SGD is that on high-dimensional problems it tends to get stuck at saddle points, i.e. points where the gradient is zero but which are neither local minima nor local maxima.
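
For comparison, the vanilla SGD rule already provided in optim.py is essentially a single line:

def sgd(w, dw, config=None):
    # w -= learning_rate * dw, with the learning rate kept in config
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    w -= config['learning_rate'] * dw
    return w, config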

SGD+Momentum

sgd_momentum() in optim.py

v = config.get('velocity', np.zeros_like(w))   # the scaffold keeps the velocity in config
v = config['momentum'] * v - config['learning_rate'] * dw
next_w = w + v
config['velocity'] = v                         # store the velocity for the next step

RMSProp and Adam

rmsprop() in optim.py

grad_squared = config['decay_rate'] * config['cache'] + (1-config['decay_rate']) * dw * dw
next_w = w - config['learning_rate'] * dw / (np.sqrt(grad_squared) + config['epsilon'])
config['cache'] = grad_squared

adam() in optim.py

# increment the iteration counter
t = config['t'] + 1

beta1 = config['beta1']
beta2 = config['beta2']
m = config['m']
v = config['v']
# first moment (moving average of the gradient)
m = beta1 * m + (1 - beta1) * dw
# second moment (moving average of the squared gradient)
v = beta2 * v + (1 - beta2) * dw * dw

first_unbias = m / (1 - beta1 ** t)
second_unbias = v / (1 - beta2 ** t)
next_w = w - config['learning_rate'] * first_unbias / (np.sqrt(second_unbias) + config['epsilon'])

# write the updated first moment, second moment and iteration count back to config
config['m'] = m
config['v'] = v
config['t'] = t

Inline Questions

Inline Question 1

We’ve only asked you to implement ReLU, but there are a number of different activation functions that one could use in neural networks, each with its pros and cons. In particular, an issue commonly seen with activation functions is getting zero (or close to zero) gradient flow during backpropagation. Which of the following activation functions have this problem? If you consider these functions in the one dimensional case, what types of input would lead to this behaviour?

  1. Sigmoid
  2. ReLU
  3. Leaky ReLU

Answer
Sigmoid and ReLU both suffer from this. Sigmoid saturates: for inputs of large magnitude (very negative or very positive) its derivative is close to zero, so almost no gradient flows back. ReLU has exactly zero gradient for negative inputs (the "dead ReLU" case). Leaky ReLU keeps a small nonzero slope for negative inputs, so it does not have this problem.
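
A quick one-dimensional check (illustrative only):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x) * (1 - sigmoid(x)))   # ~0 at both ends, max 0.25 at x = 0
print((x > 0).astype(float))           # ReLU gradient: [0. 0. 0. 1. 1.]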

Inline Question 2
Did you notice anything about the comparative difficulty of training the three-layer net vs training the five layer net? In particular, based on your experience, which network seemed more sensitive to the initialization scale? Why do you think that is the case?

Answer
Both the three-layer and the five-layer network can overfit the small dataset, but the three-layer network generalizes a bit better to the validation set here. The five-layer network is more sensitive to the initialization scale: with more layers and more parameters its loss surface is more complex, so with a poorly chosen weight scale the optimization more easily gets stuck (for example in bad local minima) before it can fit the data.

Inline Question 3

AdaGrad, like Adam, is a per-parameter optimization method that uses the following update rule:

cache += dw**2
w += - learning_rate * dw / (np.sqrt(cache) + eps)

John notices that when he was training a network with AdaGrad that the updates became very small, and that his network was learning slowly. Using your knowledge of the AdaGrad update rule, why do you think the updates would become very small? Would Adam have the same issue?

Answer
AdaGrad: the cache accumulates dw² on every step, so it grows monotonically; the denominator therefore keeps increasing and the updates to w keep shrinking. For a convex problem this is fine, since slowing down near the global minimum is exactly what we want, but for a non-convex problem the effective learning rate can decay to almost nothing before a good solution is reached, leaving the optimization stuck.
Adam does not have the same issue. Its second moment is an exponentially decaying average, v = beta2 * v + (1 - beta2) * dw², so old squared gradients are gradually forgotten and the denominator $\sqrt{\hat{v}}$ does not grow without bound. The bias-corrected estimates are $\hat{m} = \frac{m}{1-\beta_1^t}$ and $\hat{v} = \frac{v}{1-\beta_2^t}$; they only matter early in training, when $t$ is small.
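
A tiny simulation makes the difference visible (illustrative only: a constant gradient of 1.0 is fed to both denominators, ignoring the first moment):

import numpy as np

lr, eps, beta2 = 1e-2, 1e-8, 0.999
cache, v = 0.0, 0.0
for t in range(1, 1001):
    dw = 1.0
    cache += dw ** 2                              # AdaGrad: grows without bound
    v = beta2 * v + (1 - beta2) * dw ** 2         # Adam: leaky average, stays bounded
    if t in (1, 10, 100, 1000):
        adagrad_step = lr * dw / (np.sqrt(cache) + eps)
        adam_step = lr * dw / (np.sqrt(v / (1 - beta2 ** t)) + eps)
        print(t, adagrad_step, adam_step)         # AdaGrad's step shrinks, Adam's stays ~1e-2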

Trained model

Complete FullyConnectedNets.ipynb. Dropout and batch normalization are left to the later parts of the assignment; implementing them here would raise the accuracy by roughly ten percentage points.

Code:

model = FullyConnectedNet([100, 100, 100, 100, 100], weight_scale=5e-2)

solver = Solver(model, small_data,
                num_epochs=10, batch_size=100,
                update_rule='adam',
                optim_config={
                'learning_rate': 1.2e-3
                },
                verbose=True)
solver.train()

plt.subplot(3, 1, 1)
plt.title('Training loss')
plt.xlabel('Iteration')

plt.subplot(3, 1, 2)
plt.title('Training accuracy')
plt.xlabel('Epoch')

plt.subplot(3, 1, 3)
plt.title('Validation accuracy')
plt.xlabel('Epoch')

plt.subplot(3, 1, 1)
plt.plot(solver.loss_history, 'o', label='adam')

plt.subplot(3, 1, 2)
plt.plot(solver.train_acc_history, '-o', label='adam')

plt.subplot(3, 1, 3)
plt.plot(solver.val_acc_history, '-o', label='adam')
plt.gcf().set_size_inches(15, 15)
plt.show()

best_model = model
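
Finally, best_model can be evaluated on the validation and test sets, roughly as the notebook does (a sketch; data is the dictionary returned by the CIFAR-10 loading code):

y_val_pred = np.argmax(best_model.loss(data['X_val']), axis=1)
y_test_pred = np.argmax(best_model.loss(data['X_test']), axis=1)
print('Validation set accuracy:', (y_val_pred == data['y_val']).mean())
print('Test set accuracy:', (y_test_pred == data['y_test']).mean())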

