For an $l$-layer neural network, the forward pass is computed as:

$$\begin{aligned}
a^{0} &= x = z^{0} \\
z^{1} &= w^{1}a^{0}+b^{1} \\
a^{1} &= \mathrm{sigmoid}(z^{1}) \\
z^{2} &= w^{2}a^{1}+b^{2} \\
a^{2} &= \mathrm{sigmoid}(z^{2}) \\
&\dots \\
z^{l} &= w^{l}a^{l-1}+b^{l} \\
a^{l} &= \mathrm{sigmoid}(z^{l})
\end{aligned}$$

Because the activations $a^{i}$ are needed when computing the gradients, the forward pass stores every $a^{i}$.
```python
def feedforward(self, a):
    """Return the output of the network if ``a`` is input."""
    a_i = a
    self.a_stock = [a_i]  # cache the activations a^i; they are reused when computing gradients
    for w_i, b_i in zip(self.weights, self.biases):
        a_i = sigmoid(np.dot(w_i, a_i) + b_i)  # z^i = w^i a^{i-1} + b^i, a^i = sigmoid(z^i)
        self.a_stock.append(a_i)
    return a_i
```
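The `sigmoid` and `sigmoid_prime` helpers are not shown in this section. A minimal sketch of what they are assumed to look like is given below; in particular, `sigmoid_prime` is written in terms of the activation $a=\mathrm{sigmoid}(z)$ rather than $z$, because only the activations in `a_stock` are cached and passed to it later (this is an assumption about the missing helpers):

```python
import numpy as np

def sigmoid(z):
    # elementwise logistic function
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(a):
    # derivative of the sigmoid expressed through the activation a = sigmoid(z):
    # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) = a * (1 - a)
    return a * (1.0 - a)
```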
Backpropagation computes the gradients.

The loss is the quadratic (mean squared error) cost:

$$C = \frac{1}{2}\|y-a^{l}\|^{2} = \frac{1}{2}\sum_{i=1}^{n}\bigl(y_{i}-a_{i}^{l}\bigr)^{2}$$
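As a quick illustration (this helper is not part of the network code), the cost can be evaluated directly as:

```python
import numpy as np

def quadratic_cost(a_l, y):
    # C = 1/2 * || y - a_l ||^2
    return 0.5 * np.sum((y - a_l) ** 2)
```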
The gradients are obtained recursively. For the output layer:

$$\begin{aligned}
\nabla_{z^{l}}C &=
\begin{pmatrix}
(y_{1}-a_{1}^{l})\bigl(-\mathrm{sigmoid}'(z_{1}^{l})\bigr) \\
(y_{2}-a_{2}^{l})\bigl(-\mathrm{sigmoid}'(z_{2}^{l})\bigr) \\
\vdots \\
(y_{n}-a_{n}^{l})\bigl(-\mathrm{sigmoid}'(z_{n}^{l})\bigr)
\end{pmatrix} \\
&=
\begin{pmatrix}
\mathrm{sigmoid}'(z_{1}^{l}) & 0 & \dots & 0 \\
0 & \mathrm{sigmoid}'(z_{2}^{l}) & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & \mathrm{sigmoid}'(z_{n}^{l})
\end{pmatrix}(a^{l}-y) \\
&= D^{l}(a^{l}-y)
\end{aligned}$$
Next, compute $\nabla_{z^{l-1}}C$ directly from the definition of the gradient. Let $C = f(z^{l}) = f\bigl(w^{l}\sigma(z^{l-1})+b^{l}\bigr)$. Then, to first order in $h$,

$$\begin{aligned}
f\bigl(w^{l}\sigma(z^{l-1}+h)+b^{l}\bigr) - f\bigl(w^{l}\sigma(z^{l-1})+b^{l}\bigr)
&= \bigl\langle \nabla_{z^{l}}C,\; w^{l}\bigl(\sigma(z^{l-1}+h)-\sigma(z^{l-1})\bigr) \bigr\rangle + o\bigl(\|\sigma(z^{l-1}+h)-\sigma(z^{l-1})\|\bigr) \\
&= \bigl\langle \nabla_{z^{l}}C,\; w^{l}D^{l-1}h \bigr\rangle \\
&= \mathrm{tr}\bigl((\nabla_{z^{l}}C)^{T}\, w^{l} D^{l-1} h\bigr) \\
&= \bigl\langle \bigl((\nabla_{z^{l}}C)^{T}\, w^{l} D^{l-1}\bigr)^{T},\; h \bigr\rangle,
\end{aligned}$$

so

$$\nabla_{z^{l-1}}C = (D^{l-1})^{T}(w^{l})^{T}\nabla_{z^{l}}C.$$
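The last two lines use the fact that, for a fixed vector $u$ and matrix $M$,

$$\langle u, Mh\rangle = \mathrm{tr}\bigl(u^{T}Mh\bigr) = \bigl\langle (u^{T}M)^{T}, h\bigr\rangle = \langle M^{T}u,\, h\rangle,$$

so the vector paired with $h$ is the gradient; here $u=\nabla_{z^{l}}C$ and $M=w^{l}D^{l-1}$, giving $M^{T}u = (D^{l-1})^{T}(w^{l})^{T}\nabla_{z^{l}}C$.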
Similarly:

$$\begin{aligned}
& \nabla_{w^{l}}C = \nabla_{z^{l}}C\,(a^{l-1})^{T} \\
& \nabla_{b^{l}}C = \nabla_{z^{l}}C
\end{aligned}$$
So the gradients of all parameters can be obtained recursively:

$$\begin{aligned}
& \nabla_{z^{l}}C = D^{l}(a^{l}-y) \\
& \nabla_{w^{l}}C = \nabla_{z^{l}}C\,(a^{l-1})^{T} \\
& \nabla_{b^{l}}C = \nabla_{z^{l}}C \\
& \nabla_{z^{l-1}}C = (D^{l-1})^{T}(w^{l})^{T}\nabla_{z^{l}}C \\
& \nabla_{w^{l-1}}C = \nabla_{z^{l-1}}C\,(a^{l-2})^{T} \\
& \nabla_{b^{l-1}}C = \nabla_{z^{l-1}}C \\
& \dots \\
& \nabla_{z^{1}}C = (D^{1})^{T}(w^{2})^{T}\nabla_{z^{2}}C \\
& \nabla_{w^{1}}C = \nabla_{z^{1}}C\,(a^{0})^{T} \\
& \nabla_{b^{1}}C = \nabla_{z^{1}}C
\end{aligned}$$
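A quick way to convince yourself that this recursion is right is to compare it against finite differences on a tiny network. The following sketch is illustrative only (the layer sizes, random seed, and variable names are made up for the check and are not part of the network code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy two-layer network with sizes [3, 4, 2]
rng = np.random.default_rng(0)
w1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
w2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))
x, y = rng.standard_normal((3, 1)), rng.standard_normal((2, 1))

def forward(w1, b1, w2, b2):
    a0 = x
    z1 = w1 @ a0 + b1; a1 = sigmoid(z1)
    z2 = w2 @ a1 + b2; a2 = sigmoid(z2)
    return a0, z1, a1, z2, a2

def cost(w1, b1, w2, b2):
    a2 = forward(w1, b1, w2, b2)[-1]
    return 0.5 * np.sum((y - a2) ** 2)

# analytic gradients from the recursion above
a0, z1, a1, z2, a2 = forward(w1, b1, w2, b2)
D1 = np.diag((sigmoid(z1) * (1 - sigmoid(z1)))[:, 0])
D2 = np.diag((sigmoid(z2) * (1 - sigmoid(z2)))[:, 0])
g_z2 = D2 @ (a2 - y)                      # nabla_{z^2} C
g_w2, g_b2 = g_z2 @ a1.T, g_z2            # nabla_{w^2} C, nabla_{b^2} C
g_z1 = D1.T @ w2.T @ g_z2                 # nabla_{z^1} C
g_w1, g_b1 = g_z1 @ a0.T, g_z1            # nabla_{w^1} C, nabla_{b^1} C

# finite-difference check on one entry of w1
eps = 1e-6
w1p = w1.copy(); w1p[0, 0] += eps
numeric = (cost(w1p, b1, w2, b2) - cost(w1, b1, w2, b2)) / eps
print(numeric, g_w1[0, 0])                # the two numbers should agree to ~1e-5
```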
The code implementation is as follows:
```python
def cost_derivative(self, output_activations, y):
    """Return the vector of partial derivatives nabla_{z^l} C for the output layer.
    Assume the loss is the quadratic loss 1/2 || output_activations - y ||^2.
    """
    # nabla_{z^l} C = D^l (a^l - y), with D^l the diagonal matrix of sigmoid derivatives
    cos_deri = np.dot(np.diag(sigmoid_prime(output_activations)[:, 0]), output_activations - y)
    return np.reshape(cos_deri, (cos_deri.shape[0], 1))  # make sure the result is a column vector
```
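A note on the `[:, 0]` indexing used with `np.diag` here and in `backprop` below: `np.diag` builds a diagonal matrix only from a 1-D array; given a 2-D column vector it instead extracts that vector's diagonal. The values below are just for illustration:

```python
import numpy as np

v = np.array([[0.1], [0.2], [0.3]])  # column vector, shape (3, 1)
np.diag(v[:, 0])                     # 3x3 diagonal matrix, as intended
np.diag(v)                           # would instead extract the diagonal of v, shape (1,)
```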
```python
def backprop(self, x, y):
    """Return a tuple ``(nabla_b, nabla_w)`` representing the
    gradient for the cost function C_x.  ``nabla_b`` and
    ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
    to ``self.biases`` and ``self.weights``."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    # forward pass (also fills self.a_stock with the activations a^i)
    output_activations = self.feedforward(x)
    # output layer: nabla_{z^l} C = D^l (a^l - y)
    nabla_zi = self.cost_derivative(output_activations, y)
    assert nabla_zi.shape == nabla_b[-1].shape, \
        "shape mismatch: {} vs {}".format(nabla_b[-1].shape, nabla_zi.shape)
    nabla_b[-1] = nabla_zi                           # nabla_{b^l} C = nabla_{z^l} C
    assert nabla_w[-1].shape == np.dot(nabla_zi, self.a_stock[-2].T).shape, \
        "shape mismatch: {} vs {}".format(nabla_w[-1].shape, np.dot(nabla_zi, self.a_stock[-2].T).shape)
    nabla_w[-1] = nabla_zi @ self.a_stock[-2].T      # nabla_{w^l} C = nabla_{z^l} C (a^{l-1})^T
    # propagate backwards through the remaining layers
    for i in range(len(nabla_b) - 2, -1, -1):
        # nabla_{z^i} C = D^i (w^{i+1})^T nabla_{z^{i+1}} C
        nabla_zi = np.diag(sigmoid_prime(self.a_stock[i + 1])[:, 0]) @ self.weights[i + 1].T @ nabla_zi
        assert nabla_zi.shape == nabla_b[i].shape, \
            "shape mismatch: {} vs {}".format(nabla_b[i].shape, nabla_zi.shape)
        nabla_b[i] = nabla_zi
        assert nabla_w[i].shape == np.dot(nabla_zi, self.a_stock[i].T).shape, \
            "shape mismatch: {} vs {}".format(nabla_w[i].shape, np.dot(nabla_zi, self.a_stock[i].T).shape)
        nabla_w[i] = np.dot(nabla_zi, self.a_stock[i].T)
    return (nabla_b, nabla_w)
```
The parameters are updated with mini-batch stochastic gradient descent:
```python
def update_mini_batch(self, mini_batch, eta):
    """Update the network's weights and biases by applying
    gradient descent using backpropagation to a single mini batch.
    The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
    is the learning rate."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        # accumulate the per-sample gradients over the mini_batch
        nabla_b = [nb + delta_b for nb, delta_b in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw + delta_w for nw, delta_w in zip(nabla_w, delta_nabla_w)]
    # SGD step: move against the averaged gradient
    self.weights = [w - eta * nw / len(mini_batch) for w, nw in zip(self.weights, nabla_w)]
    self.biases = [b - eta * nb / len(mini_batch) for b, nb in zip(self.biases, nabla_b)]
```
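Written out, the update performed above is mini-batch SGD with $m = $ `len(mini_batch)`:

$$w \leftarrow w - \frac{\eta}{m}\sum_{(x,y)\in\text{mini-batch}} \nabla_{w}C_{x}, \qquad b \leftarrow b - \frac{\eta}{m}\sum_{(x,y)\in\text{mini-batch}} \nabla_{b}C_{x}$$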
The numpy network is configured as [784, 50, 30, 10].

TensorFlow reference implementation:
TensorFlow version: 1.15.0
Fully connected network: [784, 50, 30, 10]
Loss: mean_squared_error
Optimizer: GradientDescentOptimizer
Code: tf_mnist.py
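For reference, a minimal sketch of what such a TF 1.15 graph could look like; this only illustrates the settings listed above and is not the actual contents of tf_mnist.py (the sigmoid activations in particular are an assumption):

```python
import tensorflow as tf  # 1.15.x

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])

# fully connected [784, 50, 30, 10] network
h1 = tf.layers.dense(x, 50, activation=tf.nn.sigmoid)
h2 = tf.layers.dense(h1, 30, activation=tf.nn.sigmoid)
out = tf.layers.dense(h2, 10, activation=tf.nn.sigmoid)

loss = tf.losses.mean_squared_error(labels=y, predictions=out)
train_op = tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)
```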
Comparison of the two implementations:
epochs = 10; mini-batch size = 15; learning rate = 1

Average running time:
numpy implementation: 532.818 s
TensorFlow implementation: 419.38 s

The numpy implementation performs worse than the TensorFlow one; a likely explanation is that TensorFlow applies some additional strategies in its SGD implementation that lead to better convergence.
Full code: