Fully Connected Neural Network (NumPy Implementation and Derivation) on the MNIST Dataset

1. Principle and Implementation

1.1 The feedforward function

For an $l$-layer network, the forward pass is computed as follows:

$$
\begin{aligned}
& a^{0} = x = z^{0} \\
& z^{1} = w^{1}a^{0}+b^{1} \\
& a^{1} = \mathrm{sigmoid}(z^{1}) \\
& z^{2} = w^{2}a^{1}+b^{2} \\
& a^{2} = \mathrm{sigmoid}(z^{2}) \\
& \dots \\
& z^{l} = w^{l}a^{l-1}+b^{l} \\
& a^{l} = \mathrm{sigmoid}(z^{l})
\end{aligned}
$$

Because the $a^{i}$ are needed later when computing the gradients, the forward pass stores every $a^{i}$.

    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        a_i = a
        self.a_stock = [a_i]
        for w_i, b_i in zip(self.weights, self.biases):
            a_i = sigmoid(np.dot(w_i, a_i) + b_i)
            self.a_stock.append(a_i)
        return a_i
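
The code above relies on `sigmoid` and `sigmoid_prime` helpers that are not shown in this excerpt. Below is a minimal sketch of what they might look like, assuming that `sigmoid_prime` takes the already-computed activation $a = \mathrm{sigmoid}(z)$ (which is how it is called with the stored activations in `backprop` below):

    import numpy as np

    def sigmoid(z):
        """Element-wise logistic function."""
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(a):
        """sigmoid'(z) expressed through the activation a = sigmoid(z): a * (1 - a)."""
        return a * (1.0 - a)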

1.2 The backprop function

Backpropagation computes the gradients of the cost with respect to the parameters.
The loss is the quadratic (mean squared error) cost:
$$
\begin{aligned}
\text{cost function} & = \frac{1}{2}\|y-a^{l}\|^{2} \\
& = \frac{1}{2}\sum_{i}^{n}(y_{i}-a_{i}^{l})^{2}
\end{aligned}
$$
The gradients are obtained recursively:
$$
\begin{aligned}
\nabla_{z^{l}}C & =
\begin{pmatrix}
(y_{1}-a_{1}^{l})(-\mathrm{sigmoid}'(z_{1}^{l})) \\
(y_{2}-a_{2}^{l})(-\mathrm{sigmoid}'(z_{2}^{l})) \\
\vdots \\
(y_{n}-a_{n}^{l})(-\mathrm{sigmoid}'(z_{n}^{l}))
\end{pmatrix} \\
& =
\begin{pmatrix}
\mathrm{sigmoid}'(z_{1}^{l}) & 0 & \dots & 0 \\
0 & \mathrm{sigmoid}'(z_{2}^{l}) & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & \mathrm{sigmoid}'(z_{n}^{l})
\end{pmatrix}(a^{l}-y) \\
& = D^{l}(a^{l}-y)
\end{aligned}
$$
Next, compute $\nabla_{z^{l-1}}C$ directly from the definition of the gradient.
Let $C = f(z^{l}) = f(w^{l}\sigma(z^{l-1})+b^{l})$, where $\sigma$ denotes the sigmoid. Then

$$
\begin{aligned}
f(w^{l}\sigma(z^{l-1}+h)+b^{l}) - f(w^{l}\sigma(z^{l-1})+b^{l})
&= \langle \nabla_{z^{l}}C,\; w^{l}(\sigma(z^{l-1}+h)-\sigma(z^{l-1})) \rangle + o\big(\|\sigma(z^{l-1}+h)-\sigma(z^{l-1})\|\big) \\
&= \langle \nabla_{z^{l}}C,\; w^{l}D^{l-1}h \rangle \\
&= \mathrm{tr}\big((\nabla_{z^{l}}C)^{T}w^{l}D^{l-1}h\big) \\
&= \langle ((\nabla_{z^{l}}C)^{T}w^{l}D^{l-1})^{T},\; h \rangle,
\end{aligned}
$$

where the second line uses the first-order expansion $\sigma(z^{l-1}+h)-\sigma(z^{l-1}) = D^{l-1}h + o(\|h\|)$ with $D^{l-1}=\mathrm{diag}(\sigma'(z^{l-1}))$. Therefore

$$
\nabla_{z^{l-1}}C = (D^{l-1})^{T}(w^{l})^{T}\nabla_{z^{l}}C
$$
Similarly:
$$
\begin{aligned}
& \nabla_{w^{l}}C = \nabla_{z^{l}}C\,(a^{l-1})^{T} \\
& \nabla_{b^{l}}C = \nabla_{z^{l}}C
\end{aligned}
$$
So the gradients of all parameters can be obtained recursively:
$$
\begin{aligned}
& \nabla_{z^{l}}C = D^{l}(a^{l}-y) \\
& \nabla_{w^{l}}C = \nabla_{z^{l}}C\,(a^{l-1})^{T} \\
& \nabla_{b^{l}}C = \nabla_{z^{l}}C \\
& \nabla_{z^{l-1}}C = (D^{l-1})^{T}(w^{l})^{T}\nabla_{z^{l}}C \\
& \nabla_{w^{l-1}}C = \nabla_{z^{l-1}}C\,(a^{l-2})^{T} \\
& \nabla_{b^{l-1}}C = \nabla_{z^{l-1}}C \\
& \dots \\
& \nabla_{z^{1}}C = (D^{1})^{T}(w^{2})^{T}\nabla_{z^{2}}C \\
& \nabla_{w^{1}}C = \nabla_{z^{1}}C\,(a^{0})^{T} \\
& \nabla_{b^{1}}C = \nabla_{z^{1}}C
\end{aligned}
$$
The implementation is as follows:

    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives nabla_{z^l} C for the output layer.
        Assumes the quadratic loss 1/2 || output_activations - y ||^2.
        """
        # nabla_{z^l} C = D^l (a^l - y), with D^l = diag(sigmoid'(z^l));
        # sigmoid_prime is evaluated on the stored activations of the output layer.
        cos_deri = np.diag(sigmoid_prime(output_activations)[:, 0]) @ (output_activations - y)
        return np.reshape(cos_deri, (cos_deri.shape[0], 1))  # gradient as a column vector
    
    def backprop(self, x, y):
        """Return a tuple ``(nabla_b, nabla_w)`` representing the
        gradient for the cost function C_x.  ``nabla_b`` and
        ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
        to ``self.biases`` and ``self.weights``."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        output_activations = self.feedforward(x)
        nabla_zi = self.cost_derivative(output_activations, y)  # nabla_{z^l} C for the output layer
        assert nabla_zi.shape == nabla_b[-1].shape, \
            "shape mismatch: {} vs {}".format(nabla_b[-1].shape, nabla_zi.shape)
        nabla_b[-1] = nabla_zi
        assert nabla_w[-1].shape == np.dot(nabla_zi, self.a_stock[-2].T).shape, \
            "shape mismatch: {} vs {}".format(nabla_w[-1].shape, np.dot(nabla_zi, self.a_stock[-2].T).shape)
        nabla_w[-1] = nabla_zi @ self.a_stock[-2].T  # nabla_{w^l} C = nabla_{z^l} C (a^{l-1})^T
        for i in range(len(nabla_b) - 2, -1, -1):
            # nabla_{z^{i+1}} C = D^{i+1} (w^{i+2})^T nabla_{z^{i+2}} C, where
            # D^{i+1} = diag(sigmoid'(z^{i+1})) is built from the stored activations
            nabla_zi = np.diag(sigmoid_prime(self.a_stock[i + 1])[:, 0]) @ self.weights[i + 1].T @ nabla_zi
            assert nabla_zi.shape == nabla_b[i].shape, \
                "shape mismatch: {} vs {}".format(nabla_b[i].shape, nabla_zi.shape)
            nabla_b[i] = nabla_zi
            assert nabla_w[i].shape == np.dot(nabla_zi, self.a_stock[i].T).shape, \
                "shape mismatch: {} vs {}".format(nabla_w[i].shape, np.dot(nabla_zi, self.a_stock[i].T).shape)
            nabla_w[i] = np.dot(nabla_zi, self.a_stock[i].T)  # nabla_{w^{i+1}} C = nabla_{z^{i+1}} C (a^i)^T
        return (nabla_b, nabla_w)
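
A quick way to validate the `backprop` derivation is to compare its output against a finite-difference approximation of the gradient. The helper below is only an illustrative sketch; `numerical_grad_b` is a hypothetical name, and `net` is assumed to expose the `feedforward` and `biases` attributes used above:

    def numerical_grad_b(net, x, y, layer, idx, eps=1e-5):
        """Central-difference estimate of dC/db for a single bias entry."""
        def cost():
            a = net.feedforward(x)
            return 0.5 * np.sum((a - y) ** 2)
        net.biases[layer][idx, 0] += eps
        c_plus = cost()
        net.biases[layer][idx, 0] -= 2 * eps
        c_minus = cost()
        net.biases[layer][idx, 0] += eps  # restore the original value
        return (c_plus - c_minus) / (2 * eps)

If the implementation matches the formulas, `numerical_grad_b(net, x, y, -1, 0)` should agree with `backprop(x, y)[0][-1][0, 0]` up to a few decimal places.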

1.3 The update_mini_batch function

Update the parameters with mini-batch stochastic gradient descent.

    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb + delta_b for nb, delta_b in
                       zip(nabla_b, delta_nabla_b)]  # gradient computation of b in the mini_batch
            nabla_w = [nw + delta_w for nw, delta_w in
                       zip(nabla_w, delta_nabla_w)]  # gradient computation of w in the mini_batch

        self.weights = [w - eta * nw / len(mini_batch) for w, nw in
                        zip(self.weights, nabla_w)]  # sgd step update weights w
        self.biases = [b - eta * nb / len(mini_batch) for b, nb in
                       zip(self.biases, nabla_b)]  # sgd step update biases b

2. Experiments

2.1 Network configuration

The network architecture is:
[784, 50, 30, 10]
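
A [784, 50, 30, 10] configuration corresponds to weight matrices of shapes (50, 784), (30, 50), (10, 30) and column bias vectors of sizes 50, 30, 10. The constructor is not shown in this excerpt; the sketch below only illustrates one way to build `self.weights` and `self.biases` with these shapes (the Gaussian initialization is an assumption):

    class Network(object):
        def __init__(self, sizes):
            """``sizes`` lists the layer widths, e.g. [784, 50, 30, 10]."""
            self.sizes = sizes
            # one column bias vector per layer, excluding the input layer
            self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
            # weights[i] maps layer i to layer i+1 and has shape (sizes[i+1], sizes[i])
            self.weights = [np.random.randn(y, x)
                            for x, y in zip(sizes[:-1], sizes[1:])]

    net = Network([784, 50, 30, 10])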

2.2 Learning rate

[Figure 1: results for different learning rates]

2.3 Comparison with a TensorFlow implementation

TensorFlow version: 1.15.0
Fully connected network configuration: [784, 50, 30, 10]
Loss: mean_squared_error
Optimizer: GradientDescentOptimizer
The code is in tf_mnist.py.
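
The exact contents of tf_mnist.py are not reproduced here; the snippet below is only a rough sketch of the kind of TF 1.x graph implied by the settings above (fully connected [784, 50, 30, 10], mean_squared_error loss, GradientDescentOptimizer), with all variable names chosen for illustration:

    import tensorflow as tf  # TF 1.x API (the post uses 1.15.0)

    x = tf.placeholder(tf.float32, [None, 784])
    y = tf.placeholder(tf.float32, [None, 10])

    h1 = tf.layers.dense(x, 50, activation=tf.nn.sigmoid)
    h2 = tf.layers.dense(h1, 30, activation=tf.nn.sigmoid)
    out = tf.layers.dense(h2, 10, activation=tf.nn.sigmoid)

    loss = tf.losses.mean_squared_error(labels=y, predictions=out)
    train_op = tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)
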
Comparison of the training runs:
epochs = 10; mini-batch size = 15; learning rate = 1
[Figure 2: comparison of the NumPy and TensorFlow training results]
Average running time:
NumPy implementation: 532.818 s
TensorFlow implementation: 419.38 s

3. Summary

The NumPy implementation performs somewhat worse than the TensorFlow one; presumably TensorFlow applies additional strategies in its SGD implementation that lead to better convergence.

4. Appendix

Full code:
