I've recently been studying neural networks, and I'm hoping that blogging about what I learn will help me record my thoughts and keep myself motivated.
A major part of optimizing a neural network is avoiding overfitting and improving the model's ability to generalize. Common techniques include L1 and L2 regularization, dropout, and careful weight initialization. Of these, regularization is the most widely used in deep learning: a penalty term is added to the cost computed in the forward pass, and during backpropagation that term acts to penalize large weights.
The principle behind L2 regularization is actually quite simple:
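Written out, the regularized cost that the code below implements (the docstring calls it formula (2)) is the usual cross-entropy cost plus a penalty on the squared weights. The cross-entropy expression here is the standard one computed by the course's compute_cost() helper; I am reconstructing it from that convention rather than copying it from the assignment:

$$
J_{regularized} = \underbrace{-\frac{1}{m}\sum_{i=1}^{m}\Big(y^{(i)}\log a^{[3](i)} + (1-y^{(i)})\log\big(1-a^{[3](i)}\big)\Big)}_{\text{cross-entropy cost}} \;+\; \underbrace{\frac{\lambda}{2m}\sum_{l}\sum_{k}\sum_{j}\big(W^{[l]}_{k,j}\big)^{2}}_{\text{L2 regularization cost}}
$$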
As for why regularization helps a neural network avoid overfitting, here is how I understand it: as a network gets deeper and deeper, it becomes very easy to overfit, producing low bias but high variance (that is, good results on the training set but poor results on the test set), whereas a shallower network is much less prone to overfitting. We also know that when a network's weights are all driven to zero, a bit of math shows it behaves like a much smaller network, essentially a simple linear function that cannot fit the data in a complicated way. By the same logic, putting an extra penalty on w that pushes w toward 0 shrinks the effective capacity of the network and prevents overfitting, as the update rule sketched just below makes concrete.
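To make "push w toward 0" concrete: adding the L2 term to the cost adds λ/m · W to each weight gradient (this is exactly the extra lambd*W/m term in the backward pass below), so a plain gradient-descent step with learning rate α becomes (this is the standard weight-decay algebra, not anything specific to the course code):

$$
W^{[l]} := W^{[l]} - \alpha\left(\frac{\partial J_{cross\text{-}entropy}}{\partial W^{[l]}} + \frac{\lambda}{m}W^{[l]}\right) = \Big(1-\frac{\alpha\lambda}{m}\Big)W^{[l]} - \alpha\,\frac{\partial J_{cross\text{-}entropy}}{\partial W^{[l]}}
$$

The factor (1 − αλ/m) < 1 multiplies the weights at every step, which is why L2 regularization is also called "weight decay".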
Below is the corresponding implementation from the deeplearning.ai course exercises:
import numpy as np


def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """
    Implement the cost function with L2 regularization. See formula (2) above.

    Arguments:
    A3 -- post-activation, output of forward propagation, of shape (output size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    parameters -- python dictionary containing parameters of the model
    lambd -- regularization hyperparameter, scalar

    Returns:
    cost -- value of the regularized loss function (formula (2))
    """
    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]

    # compute_cost() is a helper from the course code; it returns the cross-entropy part of the cost
    cross_entropy_cost = compute_cost(A3, Y)

    ### START CODE HERE ### (approx. 1 line)
    # Sum of squared weights over all layers, scaled by lambda / (2 * m)
    L2_regularization_cost = lambd * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3))) / (2 * m)
    ### END CODE HERE ###

    cost = cross_entropy_cost + L2_regularization_cost

    return cost
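To see the function in action, here is a small sanity-check sketch. The tiny compute_cost stand-in and the parameter shapes are assumptions made just for this example; in the actual assignment, compute_cost and the parameters come from the course's helper functions and dataset.

import numpy as np

def compute_cost(A3, Y):
    # Minimal cross-entropy stand-in for the course helper, assumed for this sketch only
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(A3) + (1 - Y) * np.log(1 - A3)) / m)

np.random.seed(1)
# Hypothetical 3-layer shapes: 2 inputs -> 4 -> 3 -> 1 output, 5 examples
parameters = {"W1": np.random.randn(4, 2) * 0.1, "b1": np.zeros((4, 1)),
              "W2": np.random.randn(3, 4) * 0.1, "b2": np.zeros((3, 1)),
              "W3": np.random.randn(1, 3) * 0.1, "b3": np.zeros((1, 1))}
A3 = np.random.rand(1, 5) * 0.98 + 0.01            # fake sigmoid outputs strictly inside (0, 1)
Y = (np.random.rand(1, 5) > 0.5).astype(float)

print(compute_cost(A3, Y))                                             # unregularized cost
print(compute_cost_with_regularization(A3, Y, parameters, lambd=0.7))  # always >= the line above

Raising lambd widens the gap between the two numbers, since only the penalty term depends on it.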
def backward_propagation_with_regularization(X, Y, cache, lambd):
    """
    Implements the backward propagation of our baseline model to which we added an L2 regularization.

    Arguments:
    X -- input dataset, of shape (input size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation()
    lambd -- regularization hyperparameter, scalar

    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    ### START CODE HERE ### (approx. 1 line)
    # The extra lambd * W3 / m term is the gradient of the L2 penalty for this layer
    dW3 = 1. / m * np.dot(dZ3, A2.T) + lambd * W3 / m
    ### END CODE HERE ###
    db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)

    dA2 = np.dot(W3.T, dZ3)
    # Except for the output layer, the activations are ReLU, whose derivative is 1 where A > 0 and 0 elsewhere
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW2 = 1. / m * np.dot(dZ2, A1.T) + lambd * W2 / m
    ### END CODE HERE ###
    db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW1 = 1. / m * np.dot(dZ1, X.T) + lambd * W1 / m
    ### END CODE HERE ###
    db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients
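Finally, a minimal sketch of how these gradients would be used in a gradient-descent step. The update_parameters_sketch name, the fixed three layers, and the learning_rate value are assumptions for illustration only; the course assignment uses its own update helper.

def update_parameters_sketch(parameters, gradients, learning_rate=0.3):
    # One plain gradient-descent step; dW already contains the lambd * W / m penalty term,
    # so each step also shrinks the weights by the factor (1 - learning_rate * lambd / m)
    for l in (1, 2, 3):
        parameters["W" + str(l)] -= learning_rate * gradients["dW" + str(l)]
        parameters["b" + str(l)] -= learning_rate * gradients["db" + str(l)]
    return parameters

Because the weight gradients include the penalty term, repeated updates keep pulling every W toward zero, which is exactly the shrinking effect described in the weight-decay equation earlier.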