References:
矩阵求导术
刘建平pinard
Input tensors generally come in one of two dimension orderings:
channels_last: (batch_size, height, width, channels)
channels_first: (batch_size, channels, height, width)
Let $s$ denote the convolution stride, $p$ the amount of padding, $n$ the height or width of the previous layer, $f$ the filter size along that dimension, and $n_c$ the number of filters.
Strictly speaking, the "convolution" used here is cross-correlation; a true mathematical convolution would also rotate the kernel by 180°. Convolutional layers skip the rotation because training can learn kernels that achieve the same effect as rotated ones, and skipping it is more efficient.
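A quick numerical check of this point (a sketch, assuming SciPy is available): cross-correlating with a kernel gives the same result as a true convolution with the kernel rotated by 180°, so a trainable kernel can absorb the flip.

import numpy as np
from scipy import signal

x = np.random.rand(5, 5)
k = np.random.rand(3, 3)

corr = signal.correlate2d(x, k, mode='valid')               # what a conv layer actually computes
conv = signal.convolve2d(x, np.rot90(k, 2), mode='valid')   # mathematical convolution with the flipped kernel
print(np.allclose(corr, conv))  # True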
Padding types:
valid: no padding
same: $p=\frac{f-1}{2}$
The height or width becomes:
$$n_{new} = \left\lfloor\frac{n+2p-f}{s} + 1\right\rfloor$$
and the number of channels becomes $n_c$ (one channel per filter).
That is, before the convolution the tensor has shape
$$(m,\ n_{h_{prev}},\ n_{w_{prev}},\ n_{c_{prev}})$$
and after the convolution
$$(m,\ n_{h_{new}},\ n_{w_{new}},\ n_c)$$
$$Z^{[l]} = W^{[l]}A^{[l-1]}+b^{[l]}$$
$$A^{[l]} = \sigma(Z^{[l]})$$
where the bias term $b^{[l]}$ has shape $(1, 1, 1, n_c)$.
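The shape bookkeeping above can be checked with a small helper; conv_output_shape below is a hypothetical name used only for illustration, assuming the channels_last layout.

import math

def conv_output_shape(input_shape, f, n_c, stride=1, pad=0):
    """(m, n_h_prev, n_w_prev, n_c_prev) -> (m, n_h_new, n_w_new, n_c)."""
    m, n_h_prev, n_w_prev, _ = input_shape
    n_h_new = math.floor((n_h_prev + 2 * pad - f) / stride) + 1
    n_w_new = math.floor((n_w_prev + 2 * pad - f) / stride) + 1
    return (m, n_h_new, n_w_new, n_c)

# 'same' padding with stride 1: p = (f - 1) / 2 keeps height and width unchanged
print(conv_output_shape((32, 28, 28, 3), f=5, n_c=8, stride=1, pad=2))  # (32, 28, 28, 8)
# 'valid' padding with stride 2 shrinks the spatial size
print(conv_output_shape((32, 28, 28, 3), f=5, n_c=8, stride=2, pad=0))  # (32, 12, 12, 8)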
Pooling normally uses max pooling or average pooling.
The pooled height and width are computed the same way as above. Because a pooling window acts on a single channel, pooling does not change the number of channels. In other words, convolutional layers increase the number of channels (extracting more features), while pooling layers shrink the spatial size of the tensor.
After the last pooling layer, the $m$ tensors of shape $(n_h, n_w, n_c)$ are flattened into one-dimensional vectors, giving a matrix of shape $(m,\ n_h n_w n_c)$.
$$Z^{[l]} = W^{[l]}A^{[l-1]}+b^{[l]}$$
$$A^{[l]} = \sigma(Z^{[l]})$$
Tensor shapes between layers:
$A^{[l-1]}: (m, p)$
$Z^{[l]}: (m, q)$
$W^{[l]}: (p, q)$
$b^{[l]}: (1, q)$
(With these batch-major shapes the affine map is computed as $A^{[l-1]}W^{[l]}+b^{[l]}$; the derivation further down uses the per-sample column-vector convention $z^l=W^la^{l-1}+b^l$.)
It is worth mentioning that in Andrew Ng's machine learning course the bias is not added as a separate $b$; instead a row of ones is prepended to the tensor, i.e.
$$A^{[l]}_{bias}=[\mathbf{1}^T;A^{[l]}]$$
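A minimal illustration of the two conventions (a sketch using the batch-major shapes $(m,p)$, $(p,q)$, $(1,q)$ listed above): prepending a column of ones to $A$ and absorbing $b$ as an extra row of $W$ gives the same affine output as keeping a separate bias.

import numpy as np

m, p, q = 4, 3, 2
A = np.random.rand(m, p)
W = np.random.rand(p, q)
b = np.random.rand(1, q)

Z_explicit = A @ W + b                         # explicit bias term
A_aug = np.hstack([np.ones((m, 1)), A])        # A_bias = [1; A]
W_aug = np.vstack([b, W])                      # bias folded in as the first row of W
print(np.allclose(Z_explicit, A_aug @ W_aug))  # True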
Choice of activation function:
Hidden layers usually use ReLU or sigmoid; the output layer uses sigmoid (binary classification) or softmax (multi-class classification).
Basic rules of the matrix differential:
1. $d(X\pm Y)=dX \pm dY$; $\quad d(XY)=(dX)Y+X\,dY$; $\quad d(X^T)=(dX)^T$; $\quad d\,\mathrm{tr}(X)=\mathrm{tr}(dX)$
2. $dX^{-1}=-X^{-1}\,dX\,X^{-1}$
3. $d|X|=\mathrm{tr}(X^{*}dX)$, where $X^{*}$ is the adjugate matrix of $X$; when $X$ is invertible, $d|X|=|X|\,\mathrm{tr}(X^{-1}dX)$
4. $d(X \odot Y)=dX \odot Y+X \odot dY$
5. $d\sigma(X)=\sigma'(X) \odot dX$, where $\sigma(X)=[\sigma(X_{ij})]$ is applied element-wise
Trace tricks:
1. A scalar equals its trace: $a=\mathrm{tr}(a)$
2. $\mathrm{tr}(A^T)=\mathrm{tr}(A)$
3. $\mathrm{tr}(A \pm B) = \mathrm{tr}(A) \pm \mathrm{tr}(B)$
4. $\mathrm{tr}(AB) = \mathrm{tr}(BA)$, where $A$ and $B^T$ have the same shape
5. $\mathrm{tr}(A^T(B \odot C))=\mathrm{tr}((A \odot B)^TC)$, where $A$, $B$, $C$ all have the same shape
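These identities are easy to spot-check numerically; the snippet below verifies tricks 4 and 5 with random matrices (a sketch, with $\odot$ written as NumPy's element-wise *).

import numpy as np

A = np.random.rand(3, 4)
B = np.random.rand(4, 3)
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))  # trick 4: tr(AB) = tr(BA)

A = np.random.rand(3, 3)
B = np.random.rand(3, 3)
C = np.random.rand(3, 3)
lhs = np.trace(A.T @ (B * C))   # tr(A^T (B ⊙ C))
rhs = np.trace((A * B).T @ C)   # tr((A ⊙ B)^T C)
print(np.isclose(lhs, rhs))     # trick 5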
Before starting the derivation, define the error of each layer as the partial derivative of the loss with respect to that layer's pre-activation, i.e. $\delta^l=\frac{\partial L}{\partial z^l}$.
To back-propagate and differentiate with respect to the weights, we first need a cost function.
Since the output layer is a softmax, $\sigma(\mathbf{z})= \frac{\exp(\mathbf{z})}{\mathbf{1}^T\exp(\mathbf{z})}$,
define the loss as:
$$L=-\mathbf{y}^T\log\sigma(\mathbf{z})$$
First derive the error of the output layer.
Expand the loss (using $\mathbf{y}^T\mathbf{1}=1$, since $\mathbf{y}$ is a one-hot label):
$$L=-\mathbf{y}^T\big(\log(\exp(\mathbf{z}))-\mathbf{1}\log(\mathbf{1}^T\exp(\mathbf{z}))\big)=-\mathbf{y}^T\mathbf{z}+\log(\mathbf{1}^T\exp(\mathbf{z}))$$
Take the differential of both sides:
$$dL=-\mathbf{y}^Td\mathbf{z}+\frac{\mathbf{1}^T(\exp(\mathbf{z})\odot d\mathbf{z})}{\mathbf{1}^T\exp(\mathbf{z})}=-\mathbf{y}^Td\mathbf{z}+\frac{\exp(\mathbf{z})^Td\mathbf{z}}{\mathbf{1}^T\exp(\mathbf{z})}$$
Wrap both sides in a trace:
$$dL=\mathrm{tr}\Big(-\mathbf{y}^Td\mathbf{z}+\frac{\exp(\mathbf{z})^Td\mathbf{z}}{\mathbf{1}^T\exp(\mathbf{z})}\Big)=\mathrm{tr}\big((\sigma(\mathbf{z})^T-\mathbf{y}^T)d\mathbf{z}\big)$$
which gives:
$$\frac{\partial L}{\partial \mathbf{z}}=(\sigma(\mathbf{z})^T-\mathbf{y}^T)^T=\sigma(\mathbf{z})-\mathbf{y}$$
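A finite-difference check of this result (a sketch, with a made-up one-hot label): the analytic gradient $\sigma(\mathbf{z})-\mathbf{y}$ should match the numerical gradient of $L=-\mathbf{y}^T\log\sigma(\mathbf{z})$.

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))          # shift for numerical stability
    return e / np.sum(e)

def loss(z, y):
    return -y @ np.log(softmax(z))     # L = -y^T log σ(z)

z = np.random.randn(5)
y = np.zeros(5); y[2] = 1.0            # one-hot label

analytic = softmax(z) - y
numeric = np.zeros_like(z)
eps = 1e-6
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (loss(zp, y) - loss(zm, y)) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-5))  # True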
Use the chain rule to obtain the errors of the earlier layers:
$$\frac{\partial L}{\partial \mathbf{z}^l}=\Big(\frac{\partial L^T}{\partial \mathbf{z}^l}\Big)^T=\Big(\frac{\partial L^T}{\partial \mathbf{z}^L}\frac{\partial \mathbf{z}^L}{\partial \mathbf{z}^{L-1}}\cdots\frac{\partial \mathbf{z}^{l+1}}{\partial \mathbf{z}^l}\Big)^T=\Big(\frac{\partial \mathbf{z}^L}{\partial \mathbf{z}^{L-1}}\cdots\frac{\partial \mathbf{z}^{l+1}}{\partial \mathbf{z}^l}\Big)^T\frac{\partial L}{\partial \mathbf{z}^L}$$
That is,
$$\delta^l=\Big(\frac{\partial z^{l+1}}{\partial z^l}\Big)^T\delta^{l+1}=(W^{l+1})^T\delta^{l+1}\odot\sigma'(z^l)$$
Because
$$z^l=W^la^{l-1}+b^l$$
the partial derivatives of the cost with respect to the weight matrix and the bias are:
$$\frac{\partial L}{\partial W^l}=\frac{\partial L}{\partial z^l}\frac{\partial z^l}{\partial W^l}=\delta^l(a^{l-1})^T$$
$$\frac{\partial L}{\partial b^l}=\frac{\partial L}{\partial z^l}\frac{\partial z^l}{\partial b^l}=\delta^l$$
With these gradients the parameters can be optimized by gradient descent.
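A per-sample sketch of these formulas in NumPy, using the column-vector convention $z^l=W^la^{l-1}+b^l$ with ReLU as $\sigma$; all layer sizes below are made up for illustration.

import numpy as np

def relu(x):
    return np.maximum(x, 0)

def relu_grad(x):
    return (x > 0).astype(float)

# forward through one hidden layer l
a_prev = np.random.randn(4, 1)           # a^{l-1}, shape (p, 1)
W = np.random.randn(3, 4)                # W^l
b = np.random.randn(3, 1)                # b^l
z = W @ a_prev + b
a = relu(z)

# suppose the next layer hands back its error delta^{l+1} and weight matrix W^{l+1}
W_next = np.random.randn(2, 3)
delta_next = np.random.randn(2, 1)

delta = (W_next.T @ delta_next) * relu_grad(z)   # δ^l = (W^{l+1})^T δ^{l+1} ⊙ σ'(z^l)
dW = delta @ a_prev.T                            # ∂L/∂W^l = δ^l (a^{l-1})^T
db = delta                                       # ∂L/∂b^l = δ^l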
The pooling layer's $\delta^l$ is obtained either by reshaping the error tensor of the first fully connected layer or from the back-propagation of a following convolutional layer. Once the pooling layer's $\delta^l$ is known, we need the error of the layer before it, $\delta^{l-1}$.
Because pooling compresses its input tensor, we first restore the pooled tensor to the input's shape and map each downsampled value back to the position it came from in the input tensor.
Denote the resulting matrix by $\mathrm{upsample}(\delta^l)$ (the explanation here considers a single channel).
It follows that
$$\frac{\partial L}{\partial a^{l-1}}=\mathrm{upsample}(\delta^l)$$
and therefore
$$\delta^{l-1}=\Big(\frac{\partial a^{l-1}}{\partial z^{l-1}}\Big)^T \frac{\partial L}{\partial a^{l-1}}=\mathrm{upsample}(\delta^l)\odot \sigma'(z^{l-1})$$
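A tiny single-channel illustration of the upsample step for max pooling (2×2 window, stride 2, values chosen by hand): each entry of $\delta^l$ is routed back to the position that produced the maximum in its window.

import numpy as np

a_prev = np.array([[1., 3., 2., 0.],
                   [4., 2., 1., 5.],
                   [0., 1., 2., 2.],
                   [6., 2., 3., 1.]])
delta = np.array([[0.1, 0.2],
                  [0.3, 0.4]])           # error arriving at the pooled 2x2 output

upsampled = np.zeros_like(a_prev)
for h in range(2):
    for w in range(2):
        window = a_prev[2*h:2*h+2, 2*w:2*w+2]
        mask = (window == window.max())  # 1 at the max position, 0 elsewhere
        upsampled[2*h:2*h+2, 2*w:2*w+2] += mask * delta[h, w]
print(upsampled)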
Given a convolutional layer's $\delta^l$, we now derive the previous layer's $\delta^{l-1}$.
Compare with the fully connected back-propagation
$$\delta^l=\Big(\frac{\partial z^{l+1}}{\partial z^l}\Big)^T\delta^{l+1}=(W^{l+1})^T\delta^{l+1}\odot\sigma'(z^l)$$
For a convolutional layer the back-propagation is written as
$$\delta^{l-1}=\delta^l*\mathrm{rot180}(W^l)\odot\sigma'(z^{l-1})$$
Why does the expression look different? Error back-propagation can be understood as mapping the later layer's error back to the previous layer through the kernel (the weight matrix), so we need this reverse mapping. In a fully connected layer the forward mapping is an ordinary matrix product, so the same matrix maps the error back directly. In a convolutional layer the forward mapping is produced by sliding the kernel, so to find the reverse mapping we write out the relations among the elements of $z^l$, $W^l$ and $a^{l-1}$, which eventually yields the expression above.
Once the errors of the convolutional layer and of the layer before it are known, the kernel gradients follow:
$$\frac{\partial L}{\partial W^l}=a^{l-1}*\delta^l$$
$$\frac{\partial L}{\partial b^l}=\sum_{u,v}(\delta^l)_{u,v}$$
Here the bias gradient is the sum of all elements of each channel of the error tensor.
The kernel gradient is likewise found by relating the previous layer's activations to the current layer's error, which gives the expressions above.
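The two convolution-layer formulas can be spot-checked numerically in the simplest setting (a sketch: single channel, stride 1, no padding, linear activation so the $\sigma'$ factor is 1), assuming SciPy is available. Here correlate2d plays the role of the layer's "convolution" and convolve2d of the rot180 correlation.

import numpy as np
from scipy import signal

a_prev = np.random.randn(6, 6)                   # a^{l-1}
W = np.random.randn(3, 3)                        # kernel
Z = signal.correlate2d(a_prev, W, mode='valid')  # forward "convolution" (cross-correlation)

D = np.random.randn(*Z.shape)                    # pretend this is δ^l = ∂L/∂Z
L = np.sum(D * Z)                                # scalar objective used in the checks below

dW_formula = signal.correlate2d(a_prev, D, mode='valid')  # ∂L/∂W = a^{l-1} ⋆ δ^l
da_formula = signal.convolve2d(D, W, mode='full')         # δ^{l-1} = δ^l * rot180(W^l)

# finite-difference check of the kernel gradient
eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(3):
    for j in range(3):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        Lp = np.sum(D * signal.correlate2d(a_prev, Wp, mode='valid'))
        Lm = np.sum(D * signal.correlate2d(a_prev, Wm, mode='valid'))
        dW_num[i, j] = (Lp - Lm) / (2 * eps)
print(np.allclose(dW_formula, dW_num, atol=1e-4))  # True

# finite-difference check of one entry of the propagated error
i, j = 2, 3
Ap = a_prev.copy(); Ap[i, j] += eps
Am = a_prev.copy(); Am[i, j] -= eps
Lp = np.sum(D * signal.correlate2d(Ap, W, mode='valid'))
Lm = np.sum(D * signal.correlate2d(Am, W, mode='valid'))
print(np.isclose(da_formula[i, j], (Lp - Lm) / (2 * eps), atol=1e-4))  # True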
To deepen my understanding of the internal structure of a CNN, I decided to write the code by hand. Passing parameters through the fully connected layers is fairly simple, but the flow of parameters through the convolution and pooling layers is more involved, so I first sketched out how the parameters are passed between layers.
import numpy as np
import scipy.io as sio
class CNN():
"""
一个CNN训练的类
alpha -- 学习率
lamda -- 正则化参数
X -- 训练集图片 shape(m(24*100), 1, 20, 20)
y -- 训练集标签 shape(24*100, 24)
maxItera -- 最大迭代次数
WSet -- 卷积核以及全连接层的权重 以字典键值对进行索引
WSet = {'convW_1':convW_1, 'convW_2':convW_2, 'FCW_1':FCW_1, 'FCW_2':FCW_2, 'FCW_3':FCW_3}
Bias -- 卷积层输出以及全连接层输出每一项的偏差
Bias = {'convB_1':convB_1, 'convB_2':convB_2, 'FCB_1':FCB_1, 'FCB_2':FCB_2, 'FCB_3':FCB_3}
"""
def __init__(self, alpha, lamda, maxItera):
self.alpha = alpha
self.lamda = lamda
self.maxItera = maxItera
        # load the data once and initialize the parameters
        self.X_train, self.y_train, self.X_test, self.y_test = self.readData()
        self.m = np.shape(self.X_train)[0]
        self.WSet, self.Bias = self.initializeWB()
def readData(self):
"""
读入训练数据
Arguments:
None
Returns:
X_train -- 一组数量为m的图片组 shape(m, 1, ?, ?)
y_train -- 训练集标签 shape(m, 1)
X_test
y_test
"""
data = sio.loadmat('MNIST_sort')
label_t10k = data['label_t10k']
label_train = data['label_train']
img_t10k = data['img_t10k'] / 255
img_train = data['img_train'] / 255
idx_train = [0]
for i in range(9):
idx = idx_train[i] + np.sum(label_train == i)
idx_train.append(idx)
idx_test = [0]
for i in range(9):
idx = idx_test[i] + np.sum(label_t10k == i)
idx_test.append(idx)
X_train = np.zeros((400, 1, 28, 28))
X_test = np.zeros((100, 1, 28, 28))
for i in range(10):
for j in range(40):
x = img_train[idx_train[i]+j, :]
x = x.reshape(1, 28, 28)
                X_train[i * 40 + j] = x
for i in range(10):
for j in range(10):
x = img_t10k[idx_test[i]+j, :]
x = x.reshape(1, 28, 28)
                X_test[i * 10 + j] = x
y_train = np.empty((0, 1))
y_test = np.empty((0, 1))
for i in range(10):
y_train_slice = np.ones((40, 1)) * i
y_test_slice = np.ones((10, 1)) * i
y_train = np.r_[y_train, y_train_slice]
y_test = np.r_[y_test, y_test_slice]
return X_train, y_train, X_test, y_test
def zeroPad(self, X, pad):
"""
扩展矩阵用于same卷积
Argument:
X -- ndarray of shape(m, n_C, n_H, n_W) 一批数量为m的图片
pad -- 图片要扩充的尺寸
Returns:
X_pad -- 扩展后的矩阵 shape(m, n_C, n_H + 2*pad, n_W + 2*pad)
"""
X_pad = np.pad(X, ((0, 0), (0, 0), (pad, pad), (pad, pad)))
return X_pad
def singleConv(self, a_slice_prev, W, b):
"""
卷积中其中一部分切片的求和
Argument:
a_slice_prev -- 要卷积数据的切片 shape(n_C_prev, f, f)
W -- 卷积核 shape(n_C_prev, f, f)
Returns:
Z -- 局部求和后的结果
"""
s = a_slice_prev * W + b
Z = np.sum(s)
return Z
def forwardConv(self, A_prev, W, b, hPara):
"""
前向传播 卷积层
Argument:
A_prev -- 前一层的输出 shape(m, n_C_prev, n_H_prev, n_W_prev)
W -- 卷积核组 shape(n_C, n_C_prev, f, f)
b -- bias项 shape(n_C, 1, 1, 1)
hPara -- 超参数 字典类型 卷积步长与填充量 {'stride': 'pad':}
Returns:
Z -- 卷积后的输出 shape(m, n_C, n_H, n_W)
cache -- 缓存数据 用于反向传播
"""
(m, n_C_prev, n_H_prev, n_W_prev) = A_prev.shape
(n_C, n_C_prev, f, f) = W.shape
stride = hPara['stride']
pad = hPara['pad']
        # compute the spatial size of the convolution output
n_H = int((n_H_prev - f + 2 * pad) / stride + 1)
n_W = int((n_W_prev - f + 2 * pad) / stride + 1)
        # initialize the output of the convolution layer
Z = np.zeros((m, n_C, n_H, n_W))
        # pad the previous layer's output
A_prev_pad = self.zeroPad(A_prev, pad)
        # compute the convolution
for i in range(m):
a_prev_pad = A_prev_pad[i]
for c in range(n_C):
for h in range(n_H):
for w in range(n_W):
                        # slice boundaries for this output position
h_start = h * stride
h_end = h_start + f
w_start = w * stride
w_end = w_start + f
a_slice_prev = a_prev_pad[:, h_start:h_end, w_start:w_end]
Z[i, c, h, w] = self.singleConv(a_slice_prev, W[c, ...], b[c, ...])
cache = (A_prev, W, b, hPara)
return Z, cache
def forwardPool(self, A_prev, hPara, mode = 'max'):
"""
前向传播 池化层
Arguments:
A_prev -- 前一层的输出 shape(m, n_C_prev, n_H_prev, n_W_prev)
hPara -- 超参数 字典形式 池化核的大小核卷积步长 {'f': , 'stride'}
mode -- 池化模式:最大值池化'max' 均值池化'mean' 默认为'max'
Returns:
A -- 池化后的输出 shape(m, n_C, n_H, n_W)
cache -- 缓存数据 用于反向传播
"""
(m, n_C_prev, n_H_prev, n_W_prev) = A_prev.shape
f = hPara['f']
stride = hPara['stride']
        # compute the spatial size of the pooled output
n_H = int((n_H_prev - f) / stride + 1)
n_W = int((n_W_prev - f) / stride + 1)
n_C = n_C_prev
        # initialize the output
A = np.zeros((m, n_C, n_H, n_W))
for i in range(m):
for c in range(n_C):
for h in range(n_H):
for w in range(n_W):
h_start = h * stride
h_end = h_start + f
w_start = w * stride
w_end = w_start + f
a_slice_prev = A_prev[i, c, h_start:h_end, w_start:w_end]
if mode == 'max':
A[i, c, h, w] = np.max(a_slice_prev)
elif mode == 'mean':
A[i, c, h, w] = np.mean(a_slice_prev)
cache = (A_prev, hPara)
return A, cache
def backwardConv(self, dZ, cache):
"""
卷积层的误差反向传播
Arguments:
dZ -- 卷积层输出Z的梯度 shape(m, n_C, n_H, n_W)
cache -- forwardConv()的输出 反向传播需要的数据
Returns:
dA_prev -- 卷积层输入的梯度 shape(m, n_C_prev, n_H_prev, n_W_prev)
dW -- 卷积核的梯度 shape(n_C, n_C_prev, f, f)
db -- bias项的梯度 shape(n_C, 1, 1, 1)
"""
(A_prev, W, b, hPara) = cache
(m, n_C_prev, n_H_prev, n_W_prev) = A_prev.shape
(n_C, n_C_prev, f, f) = W.shape
stride = hPara['stride']
pad = hPara['pad']
(m, n_C, n_H, n_W) = dZ.shape
        # initialize dA_prev, dW, db
dA_prev = np.zeros((m, n_C_prev, n_H_prev, n_W_prev))
dW = np.zeros((n_C, n_C_prev, f, f))
db = np.zeros((n_C, 1, 1, 1))
        # pad the tensors
A_prev_pad = self.zeroPad(A_prev, pad)
dA_prev_pad = self.zeroPad(dA_prev, pad)
for i in range(m):
a_prev_pad = A_prev_pad[i]
da_prev_pad = dA_prev_pad[i]
for c in range(n_C):
for h in range(n_H):
for w in range(n_W):
h_start = h * stride
h_end = h_start + f
w_start = w * stride
w_end = w_start + f
a_slice = a_prev_pad[:, h_start:h_end, w_start:w_end]
da_prev_pad[:, h_start:h_end, w_start:w_end] += W[c, :, :, :] * dZ[i, c, h, w]
dW[c, :, :, :] += a_slice * dZ[i, c, h, w]
db[c, :, :, :] += dZ[i, c, h, w]
dA_prev[i, :, :, :] = da_prev_pad[:, pad:-pad, pad:-pad]
return dA_prev, dW, db
def createMask(self, x):
"""
池化层反向传播时 为生成矩阵残剩一个掩膜 用于 max pooling
Arguments:
x -- shape(f, f)
Returns:
mask -- shape(f, f) 显示x最大值的逻辑矩阵
"""
mask = (x == np.max(x))
return mask
def distributeValue(self, dz, shape):
"""
池化层反向传播时 分散矩阵 用于 均值池化
Arguments:
dz -- 输入的梯度
shape -- 输入的dz分散后的形状 (n_H, n_W)
Returns:
a -- 分散后的矩阵 shape(n_H, n_W)
"""
(n_H, n_W) = shape
mean = dz / (n_H * n_W)
a = np.ones(shape) * mean
return a
def backwardPool(self, dA, cache, mode):
"""
池化层反向传播
Arguments:
dA -- 池化层输出的梯度
cache -- 缓存数据 由之前前向传播提供
mode -- 池化层类型 'max' 'mean'
Returns:
dA_prev -- 池化层输入的梯度
"""
(A_prev, hPara) = cache
stride = hPara['stride']
f = hPara['f']
(m, n_C_prev, n_H_prev, n_W_prev) = A_prev.shape
(m, n_C, n_H, n_W) = dA.shape
        # initialize dA_prev
dA_prev = np.zeros(A_prev.shape)
for i in range(m):
a_prev = A_prev[i]
for c in range(n_C):
for h in range(n_H):
for w in range(n_W):
h_start = h * stride
h_end = h_start + f
w_start = w * stride
w_end = w_start + f
if mode == 'max':
a_prev_slice = a_prev[c, h_start:h_end, w_start:w_end]
mask = self.createMask(a_prev_slice)
dA_prev[i, c, h_start:h_end, w_start:w_end] += mask * dA[i, c, h, w]
elif mode == 'mean':
da = dA[i, c, h, w]
shape = (f, f)
dA_prev[i, c, h_start:h_end, w_start:w_end] += self.distributeValue(da, shape)
return dA_prev
def initializeWB(self):
"""
初始化卷积核以及全连接层的权重矩阵
Arguments:
None
Returns:
WSet -- 各个初始化后的卷积核 以及全连接层的权重矩阵 以字典进行索引
Bias -- 卷积层以及全连接层的偏差项 全部初始化为0
"""
eps = 1
        # weights drawn uniformly from [-eps, eps]; biases start at zero
        convW_1 = np.random.rand(6, 1, 5, 5)
        convW_1 = 2 * eps * convW_1 - eps
        convB_1 = np.zeros((6, 1, 1, 1))
        convW_2 = np.random.rand(16, 6, 5, 5)
        convW_2 = 2 * eps * convW_2 - eps
        convB_2 = np.zeros((16, 1, 1, 1))
        FCW_1 = np.random.rand(256, 784)
        FCW_1 = 2 * eps * FCW_1 - eps
        FCB_1 = np.zeros((1, 256))
        FCW_2 = np.random.rand(128, 256)
        FCW_2 = 2 * eps * FCW_2 - eps
        FCB_2 = np.zeros((1, 128))
        FCW_3 = np.random.rand(10, 128)
        FCW_3 = 2 * eps * FCW_3 - eps
        FCB_3 = np.zeros((1, 10))
WSet = {'convW_1':convW_1, 'convW_2':convW_2, 'FCW_1':FCW_1, 'FCW_2':FCW_2, 'FCW_3':FCW_3}
Bias = {'convB_1':convB_1, 'convB_2':convB_2, 'FCB_1':FCB_1, 'FCB_2':FCB_2, 'FCB_3':FCB_3}
return WSet, Bias
def ReLU(self, x_in):
"""
ReLU激活函数
Arguments:
x_in -- 输入张量
Returns:
x_out -- 输出张量
"""
mask = x_in > 0
x_out = mask * x_in
return x_out
def computeCost(self, out, WSet):
"""
计算代价函数
Arguments:
x -- 全连接层输出
WSet -- 权重集合,用于正则化
Returns:
J -- 代价函数值
dout -- 输出层误差
"""
        # convert the integer labels to one-hot so that the softmax error is simply out - y
        y_onehot = np.eye(10)[self.y_train.astype(int).ravel()]
        dout = out - y_onehot
        reg = self.lamda / 2 * (np.sum(WSet['convW_1']**2) + np.sum(WSet['convW_2']**2) + np.sum(WSet['FCW_1']**2) + np.sum(WSet['FCW_2']**2) + np.sum(WSet['FCW_3']**2))
        J = (-np.sum(y_onehot * np.log(out)) + reg) / self.m
        return J, dout
def epochRun(self):
"""
运行一次完整的前向传播以及反向传播 更新卷积核以及全连接层权重
Arguments:
None
Yield:
J -- 前向传播输出后的代价函数
"""
while True:
            ##---------- forward propagation ----------##
            # input -> conv layer I
convZ_1, conv_cache_1 = self.forwardConv(self.X_train, self.WSet['convW_1'], self.Bias['convB_1'], {'stride':1, 'pad':2})
            # ReLU
convA_1 = self.ReLU(convZ_1)
            # conv layer I -> pooling layer I
poolA_1, pool_cache_1 = self.forwardPool(convA_1, {'f':2, 'stride':2})
            # pooling layer I -> conv layer II
convZ_2, conv_cache_2 = self.forwardConv(poolA_1, self.WSet['convW_2'], self.Bias['convB_2'], {'stride':1, 'pad':2})
            # ReLU
convA_2 = self.ReLU(convZ_2)
            # conv layer II -> pooling layer II
poolA_2, pool_cache_2 = self.forwardPool(convA_2, {'f':2, 'stride':2})
            # pooling layer II -> FC layer 0 (flatten the pooled tensor)
FClayer_0 = poolA_2.reshape((self.m, 784))
            # FC layer 0 -> FC layer 1
FClayer_1_z = np.dot(FClayer_0, self.WSet['FCW_1'].T) + self.Bias['FCB_1']
            # ReLU
FClayer_1 = self.ReLU(FClayer_1_z)
            # FC layer 1 -> FC layer 2
FClayer_2_z = np.dot(FClayer_1, self.WSet['FCW_2'].T) + self.Bias['FCB_2']
            # ReLU
FClayer_2 = self.ReLU(FClayer_2_z)
            # FC layer 2 -> output
FClayer_3_z = np.dot(FClayer_2, self.WSet['FCW_3'].T) + self.Bias['FCB_3']
            # softmax (shift by the row maximum for numerical stability)
            out = np.exp(FClayer_3_z - np.max(FClayer_3_z, 1, keepdims=True))
            out = out / np.sum(out, 1, keepdims=True)
            ##--------- compute the loss J; its gradient is the output-layer error used for backprop ---------##
J, dFC_3 = self.computeCost(out, self.WSet)
yield J
            ##---------- backward propagation ----------##
            # output -> FC layer 2
dFC_2 = np.dot(dFC_3, self.WSet['FCW_3']) * (FClayer_2 > 0)
dFCW_3 = np.dot(dFC_3.T, FClayer_2)
            # FC layer 2 -> FC layer 1
dFC_1 = np.dot(dFC_2, self.WSet['FCW_2']) * (FClayer_1 > 0)
dFCW_2 = np.dot(dFC_2.T, FClayer_1)
            # FC layer 1 -> FC layer 0
dFC_0 = np.dot(dFC_1, self.WSet['FCW_1'])
dFCW_1 = np.dot(dFC_1.T, FClayer_0)
            # reshape FC layer 0 back to the shape of pooling layer II
dpoolA_2 = dFC_0.reshape((self.m, 16, 7, 7))
            # pooling layer II -> conv layer II
dconvA_2 = self.backwardPool(dpoolA_2, pool_cache_2, 'max')
dconvA_2_z = dconvA_2 * (pool_cache_2[0] > 0)
            # conv layer II -> pooling layer I
dpoolA_1, dconvW_2, dconvB_2 = self.backwardConv(dconvA_2_z, conv_cache_2)
            # pooling layer I -> conv layer I
dconvA_1 = self.backwardPool(dpoolA_1, pool_cache_1, 'max')
            dconvA_1_z = dconvA_1 * (pool_cache_1[0] > 0)
            # conv layer I gradients
            din, dconvW_1, dconvB_1 = self.backwardConv(dconvA_1_z, conv_cache_1)
            ##---------- gradient descent parameter update ----------##
self.WSet['FCW_3'] -= self.alpha * (dFCW_3 + self.lamda * self.WSet['FCW_3']) / self.m
self.WSet['FCW_2'] -= self.alpha * (dFCW_2 + self.lamda * self.WSet['FCW_2']) / self.m
self.WSet['FCW_1'] -= self.alpha * (dFCW_1 + self.lamda * self.WSet['FCW_1']) / self.m
            self.Bias['FCB_3'] -= self.alpha * np.sum(dFC_3, 0) / self.m
            self.Bias['FCB_2'] -= self.alpha * np.sum(dFC_2, 0) / self.m
            self.Bias['FCB_1'] -= self.alpha * np.sum(dFC_1, 0) / self.m
self.WSet['convW_2'] -= self.alpha * (dconvW_2 + self.lamda * self.WSet['convW_2']) / self.m
self.WSet['convW_1'] -= self.alpha * (dconvW_1 + self.lamda * self.WSet['convW_1']) / self.m
            self.Bias['convB_2'] -= self.alpha * dconvB_2 / self.m
            self.Bias['convB_1'] -= self.alpha * dconvB_1 / self.m
def run(self):
"""
训练CNN 不断更新WSet 和 Bias
Argument:
None
Returns:
None
"""
epoch = self.epochRun()
J = next(epoch)
tempJ = 0
iteraTime = 1
while True:
tempJ = next(epoch)
iteraTime += 1
print('cost=' + str(tempJ), end=' ')
print('eps=' + str(abs(tempJ - J)), end=' ')
print('iter_time=' + str(iteraTime))
            # stop when converged or when the maximum number of iterations is reached
if (abs(tempJ - J) < 0.001) | (iteraTime > self.maxItera):
break
else:
J = tempJ
        # save the learned parameters
sio.savemat('letterCNN.mat', {'WSet':self.WSet, 'Bias':self.Bias})
def test(self):
"""
测试
"""
convZ_1, conv_cache_1 = self.forwardConv(self.X_test, self.WSet['convW_1'], self.Bias['convB_1'], {'stride':1, 'pad':2})
convA_1 = self.ReLU(convZ_1)
poolA_1, pool_cache_1 = self.forwardPool(convA_1, {'f':2, 'stride':2})
convZ_2, conv_cache_2 = self.forwardConv(poolA_1, self.WSet['convW_2'], self.Bias['convB_2'], {'stride':1, 'pad':2})
convA_2 = self.ReLU(convZ_2)
poolA_2, pool_cache_2 = self.forwardPool(convA_2, {'f':2, 'stride':2})
        FClayer_0 = poolA_2.reshape((self.X_test.shape[0], 784))
FClayer_1_z = np.dot(FClayer_0, self.WSet['FCW_1'].T) + self.Bias['FCB_1']
FClayer_1 = self.ReLU(FClayer_1_z)
FClayer_2_z = np.dot(FClayer_1, self.WSet['FCW_2'].T) + self.Bias['FCB_2']
FClayer_2 = self.ReLU(FClayer_2_z)
FClayer_3_z = np.dot(FClayer_2, self.WSet['FCW_3'].T) + self.Bias['FCB_3']
        # softmax (shift by the row maximum for numerical stability)
        out = np.exp(FClayer_3_z - np.max(FClayer_3_z, 1, keepdims=True))
        out = out / np.sum(out, 1, keepdims=True)
        y_predit = np.argmax(out, 1).reshape(self.y_test.shape)
result = np.sum(y_predit == self.y_test)
corr_rate = result / self.y_test.shape[0]
print("%.2f" %(corr_rate*100))
if __name__ == "__main__":
MNIST_CNN = CNN(0.01, 1.0, 100)
MNIST_CNN.run()
MNIST_CNN.test()