In this post I implement only the forward and backward propagation of a single convolutional layer (with the ReLU activation), supporting multi-channel input and multi-channel output. My previous understanding of deep learning mostly stopped at the ready-made APIs in pytorch, and this is my first attempt at getting rid of import torch, so please feel free to point out any mistakes.
To represent tensors more conveniently, we use the numpy package here.
The forward pass of a convolutional layer is

$$a^{l}=\sigma\left(z^{l}\right)=\sigma\left(a^{l-1} * W^{l}+b^{l}\right)$$

where $\sigma$ is the activation function (ReLU in this post), $a^{l-1}$ is the input of the layer, $a^{l}$ is the output after the activation, $z^{l}$ is the output before the activation, $W^{l}$ is the convolution kernel, and $b^{l}$ is the bias of the layer.
The forward pass of the ReLU activation keeps the input unchanged when it is greater than or equal to zero and sets it to zero when it is negative:
def relu(z):
    z[z < 0] = 0
    return z
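A small aside: relu here works in place and also returns its argument, so the caller's array is modified. A minimal check, using a made-up test array:

import numpy as np

z = np.array([[-1.0, 2.0],
              [3.0, -4.0]])
out = relu(z)        # zeroes the negative entries of z in place
print(out)           # [[0. 2.]
                     #  [3. 0.]]
print(out is z)      # True: the same array object is returned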
The forward pass of the convolutional layer (with the ReLU activation) is implemented below. $X$, $W$ and $b$ correspond to $a^{l-1}$, $W^{l}$ and $b^{l}$ respectively, and the output $Out$ corresponds to $a^{l}$.
import numpy as np

def conv_forward(X, W, b, stride=(1,1), padding=(0,0)):
    # number of samples, number of channels, height, width (input X)
    m, c, Ih, Iw = X.shape
    # number of filters (= output channels), number of channels, height, width (kernel W)
    f, _, Kh, Kw = W.shape
    # stride and padding, given as (width, height)
    Sw, Sh = stride
    Pw, Ph = padding
    # calculate the height and width of the output
    Oh = int(1 + (Ih + 2 * Ph - Kh) / Sh)
    Ow = int(1 + (Iw + 2 * Pw - Kw) / Sw)
    # pre-allocate the output Out
    # number of samples, number of channels, height, width (output Out)
    Out = np.zeros([m, f, Oh, Ow])
    # zero-pad the input
    X_pad = np.zeros((m, c, Ih + 2 * Ph, Iw + 2 * Pw))
    X_pad[:, :, Ph:Ph+Ih, Pw:Pw+Iw] = X
    # multiple input channels (c), multiple output channels (f)
    # loop over the filters, i.e. over the output channels
    for n in range(Out.shape[1]):
        # each filter handles the multi-in / single-out case
        for i in range(Out.shape[2]):
            for j in range(Out.shape[3]):
                # the m samples are processed in parallel:
                # (m,c,Kh,Kw) window * (c,Kh,Kw) kernel, summed over (c,Kh,Kw)
                Out[:, n, i, j] = np.sum(X_pad[:, :, i*Sh : i*Sh+Kh, j*Sw : j*Sw+Kw] * W[n, :, :, :], axis=(1, 2, 3))
        # one bias per filter / output channel
        Out[:, n, :, :] += b[n]
    # ReLU forward (in place)
    Out = relu(Out)
    return Out
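As a quick sanity check of the forward pass, the following sketch (with made-up shapes: 2 samples, 3 input channels, 5x5 images, four 3x3 filters, stride 1 and padding 1) verifies the output-size formula and the effect of the ReLU:

import numpy as np

np.random.seed(0)
X = np.random.randn(2, 3, 5, 5)
W = np.random.randn(4, 3, 3, 3)
b = np.random.randn(4)

Out = conv_forward(X, W, b, stride=(1, 1), padding=(1, 1))
print(Out.shape)         # (2, 4, 5, 5): Oh = 1 + (5 + 2*1 - 3) / 1 = 5
print((Out >= 0).all())  # True, because ReLU is applied at the end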
We want to use backpropagation to obtain the partial derivatives of the loss function $J(W,b)$ with respect to $z^{l}$, $W^{l}$ and $b^{l}$. Define

$$\delta^{l}=\frac{\partial J(W, b)}{\partial z^{l}}$$

which propagates backwards as

$$\delta^{l-1}=\delta^{l} * \operatorname{rot180}\left(W^{l}\right) \odot \sigma^{\prime}\left(z^{l-1}\right)$$

Here $\odot$ denotes the Hadamard product: for two vectors of the same dimension $A=(a_{1},a_{2},...,a_{n})^{T}$ and $B=(b_{1},b_{2},...,b_{n})^{T}$, $A \odot B=(a_{1}b_{1},a_{2}b_{2},...,a_{n}b_{n})^{T}$. $\sigma^{\prime}(z^{l-1})$ is the derivative of the ReLU function, which is zero where the argument is negative and one otherwise. Why the kernel is rotated by 180 degrees is explained in the article of link 1; note that the rotation only appears when the backward pass is written as a full convolution over the whole $\delta^{l}$ map. In the window-by-window accumulation used in the code below, the kernel is applied as-is.
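In numpy, the Hadamard product is simply element-wise multiplication with *, and $\sigma^{\prime}(z^{l-1})$ is just a 0/1 mask; a minimal sketch with made-up arrays:

import numpy as np

A = np.array([1.0, -2.0, 3.0])
B = np.array([4.0, 5.0, -6.0])
print(A * B)                            # Hadamard product: [  4. -10. -18.]

z_prev = np.array([-1.0, 0.5, 2.0])
relu_grad = (z_prev > 0).astype(float)  # sigma'(z^{l-1}) as a mask: [0. 1. 1.]
print(relu_grad)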
The gradients with respect to the kernel and the bias are

$$\frac{\partial J(W, b)}{\partial W^{l}}=a^{l-1} * \delta^{l}$$

$$\frac{\partial J(W, b)}{\partial b^{l}}=\sum_{u, v}\left(\delta^{l}\right)_{u, v}$$
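In the batched implementation below, this sum over the spatial positions $u,v$ also runs over the sample axis, giving one scalar per output channel; a minimal sketch with a made-up dz tensor:

import numpy as np

dz = np.ones((2, 4, 5, 5))          # made-up gradient: 2 samples, 4 output channels, 5x5
db = np.sum(dz, axis=(0, 2, 3))     # sum over samples and spatial positions
print(db)                           # [50. 50. 50. 50.], one entry per output channel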
The partial derivatives of the loss function $J(W,b)$ with respect to the five quantities $z^{l}$, $a^{l-1}$, $W^{l}$, $b^{l}$ and $z^{l-1}$ are denoted dz, dx, dw, db and dz0 respectively in the code.
def conv_backward(dz, X, W, b, stride=(1,1), padding=(0,0)):
    """
    dz: gradient with respect to z of this layer
    dz0: gradient with respect to z of the previous convolutional layer
    dx: gradient with respect to x (the layer input a^{l-1})
    dw: gradient with respect to w
    db: gradient with respect to b
    """
    m, f, _, _ = dz.shape
    m, c, Ih, Iw = X.shape
    _, _, Kh, Kw = W.shape
    Sw, Sh = stride
    Pw, Ph = padding
    dx, dw = np.zeros_like(X), np.zeros_like(W)
    X_pad = np.pad(X, [(0,0), (0,0), (Ph,Ph), (Pw,Pw)], 'constant')
    dx_pad = np.pad(dx, [(0,0), (0,0), (Ph,Ph), (Pw,Pw)], 'constant')
    # gradient of the bias: sum dz over samples and spatial positions
    db = np.sum(dz, axis=(0, 2, 3))
    for k in range(dz.shape[0]):
        for i in range(dz.shape[2]):
            for j in range(dz.shape[3]):
                # input window that produced dz[k, :, i, j] in the forward pass
                X_w = X_pad[k, :, i * Sh : i * Sh + Kh, j * Sw : j * Sw + Kw]
                for n in range(f):
                    # accumulate the kernel gradient (f, c, Kh, Kw)
                    dw[n] += X_w * dz[k, n, i, j]
                    # distribute the gradient back onto the input window; the forward pass
                    # is a cross-correlation, so no 180-degree rotation is needed here
                    dx_pad[k, :, i * Sh : i * Sh + Kh, j * Sw : j * Sw + Kw] += W[n] * dz[k, n, i, j]
    dx = dx_pad[:, :, Ph:Ph+Ih, Pw:Pw+Iw]
    # ReLU backward: sigma'(z^{l-1}) is 1 where z^{l-1} > 0, i.e. where the input X > 0
    dz0 = dx * (X > 0)
    return dx, dw, db, dz0
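One way to sanity-check the backward pass is a finite-difference comparison. The sketch below is my own test harness (shapes, seed and the toy loss $J=\sum a^{l}$ are all made up, not part of the original layer): with this loss, dz is simply the 0/1 ReLU mask of the output, and one entry of dw is compared against a numerical gradient.

import numpy as np

np.random.seed(1)
X = np.random.randn(2, 3, 5, 5)
W = np.random.randn(4, 3, 3, 3)
b = np.random.randn(4)
stride, padding = (1, 1), (1, 1)

# analytic gradient: J = sum(relu(z)), so dJ/dz is the 0/1 ReLU mask of the output
Out = conv_forward(X, W, b, stride, padding)
dz = (Out > 0).astype(float)
dx, dw, db, dz0 = conv_backward(dz, X, W, b, stride, padding)

# central-difference numerical gradient for a single kernel entry
eps = 1e-5
idx = (0, 0, 1, 1)
W_plus, W_minus = W.copy(), W.copy()
W_plus[idx] += eps
W_minus[idx] -= eps
J_plus = conv_forward(X, W_plus, b, stride, padding).sum()
J_minus = conv_forward(X, W_minus, b, stride, padding).sum()
num_grad = (J_plus - J_minus) / (2 * eps)
print(dw[idx], num_grad)   # the two values should agree closely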
The complete code is as follows.

import numpy as np

def relu(z):
    z[z < 0] = 0
    return z

def conv_forward(X, W, b, stride=(1,1), padding=(0,0)):
    # number of samples, number of channels, height, width (input X)
    m, c, Ih, Iw = X.shape
    # number of filters (= output channels), number of channels, height, width (kernel W)
    f, _, Kh, Kw = W.shape
    # stride and padding, given as (width, height)
    Sw, Sh = stride
    Pw, Ph = padding
    # calculate the height and width of the output
    Oh = int(1 + (Ih + 2 * Ph - Kh) / Sh)
    Ow = int(1 + (Iw + 2 * Pw - Kw) / Sw)
    # pre-allocate the output Out
    # number of samples, number of channels, height, width (output Out)
    Out = np.zeros([m, f, Oh, Ow])
    # zero-pad the input
    X_pad = np.zeros((m, c, Ih + 2 * Ph, Iw + 2 * Pw))
    X_pad[:, :, Ph:Ph+Ih, Pw:Pw+Iw] = X
    # multiple input channels (c), multiple output channels (f)
    # loop over the filters, i.e. over the output channels
    for n in range(Out.shape[1]):
        # each filter handles the multi-in / single-out case
        for i in range(Out.shape[2]):
            for j in range(Out.shape[3]):
                # the m samples are processed in parallel:
                # (m,c,Kh,Kw) window * (c,Kh,Kw) kernel, summed over (c,Kh,Kw)
                Out[:, n, i, j] = np.sum(X_pad[:, :, i*Sh : i*Sh+Kh, j*Sw : j*Sw+Kw] * W[n, :, :, :], axis=(1, 2, 3))
        # one bias per filter / output channel
        Out[:, n, :, :] += b[n]
    # ReLU forward (in place)
    Out = relu(Out)
    return Out

def conv_backward(dz, X, W, b, stride=(1,1), padding=(0,0)):
    """
    dz: gradient with respect to z of this layer
    dz0: gradient with respect to z of the previous convolutional layer
    dx: gradient with respect to x (the layer input a^{l-1})
    dw: gradient with respect to w
    db: gradient with respect to b
    """
    m, f, _, _ = dz.shape
    m, c, Ih, Iw = X.shape
    _, _, Kh, Kw = W.shape
    Sw, Sh = stride
    Pw, Ph = padding
    dx, dw = np.zeros_like(X), np.zeros_like(W)
    X_pad = np.pad(X, [(0,0), (0,0), (Ph,Ph), (Pw,Pw)], 'constant')
    dx_pad = np.pad(dx, [(0,0), (0,0), (Ph,Ph), (Pw,Pw)], 'constant')
    # gradient of the bias: sum dz over samples and spatial positions
    db = np.sum(dz, axis=(0, 2, 3))
    for k in range(dz.shape[0]):
        for i in range(dz.shape[2]):
            for j in range(dz.shape[3]):
                # input window that produced dz[k, :, i, j] in the forward pass
                x_window = X_pad[k, :, i * Sh : i * Sh + Kh, j * Sw : j * Sw + Kw]
                for n in range(f):
                    # accumulate the kernel gradient (f, c, Kh, Kw)
                    dw[n] += x_window * dz[k, n, i, j]
                    # distribute the gradient back onto the input window; the forward pass
                    # is a cross-correlation, so no 180-degree rotation is needed here
                    dx_pad[k, :, i * Sh : i * Sh + Kh, j * Sw : j * Sw + Kw] += W[n] * dz[k, n, i, j]
    dx = dx_pad[:, :, Ph:Ph+Ih, Pw:Pw+Iw]
    # ReLU backward: sigma'(z^{l-1}) is 1 where z^{l-1} > 0, i.e. where the input X > 0
    dz0 = dx * (X > 0)
    return dx, dw, db, dz0
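Finally, a minimal single-step training sketch (made-up data, a toy loss $J=\frac{1}{2}\sum (a^{l})^{2}$ and a made-up learning rate), showing how the returned gradients would be used in a plain gradient-descent update:

import numpy as np

np.random.seed(2)
X = np.random.randn(4, 3, 8, 8)        # made-up batch
W = 0.1 * np.random.randn(6, 3, 3, 3)  # made-up filters
b = np.zeros(6)
lr = 0.01                              # made-up learning rate

Out = conv_forward(X, W, b, stride=(1, 1), padding=(1, 1))
# for J = 0.5 * sum(Out**2): dJ/dz = relu(z) * relu'(z), which is just Out
# (Out is already zero wherever z <= 0)
dz = Out.copy()
dx, dw, db, dz0 = conv_backward(dz, X, W, b, stride=(1, 1), padding=(1, 1))

# vanilla gradient-descent update of the layer parameters
W -= lr * dw
b -= lr * db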