Supervised learning: each input x maps to one label y.
Sigmoid: activation function
$$
\mathrm{sigmoid}(x)=\frac{1}{1+e^{-x}}
$$
ReLU: rectified linear unit.
## Logistic Regression
--> binary classification: x --> y ∈ {0, 1}
### Notation
$$
(x,y),\ x\in{\mathbb{R}^{n_{x}}},\ y\in\{0,1\}\\\\
m_{train}\ \text{training examples},\quad m_{test}\ \text{test examples}\\\\
\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\ldots,(x^{(m)},y^{(m)})\}\\\\
X =
\left[
\begin{matrix}
x^{(1)} & x^{(2)} &\cdots & x^{(m)}
\end{matrix} \right] \leftarrow n_{x}\times m\\\\
\hat{y}=P(y=1\mid x)\quad\hat{y}=\sigma(w^Tx+b)\qquad
w\in \mathbb{R}^{n_x} \quad b\in \mathbb{R}\\
\sigma (z)=\frac{1}{1+e^{-z}}
$$
### Loss function
For a single example:
$$
\text{Squared error (non-convex, not used):}\ \mathcal{L}(\hat{y},y)=\frac{1}{2}(\hat{y}-y)^2\\\\
p(y\mid x)=\hat{y}^y(1-\hat y)^{(1-y)}\\
\min\ \mathcal{L}\rightarrow \max\ \log p(y\mid x)\\
\mathcal{L}(\hat{y},y)=-(y\log(\hat{y})+(1-y)\log(1-\hat{y}))\\\\
y=1:\mathcal{L}(\hat{y},y)=-\log\hat{y}\quad \log\hat{y}\leftarrow larger\quad\hat{y}\leftarrow larger\\
y=0:\mathcal{L}(\hat{y},y)=-\log(1-\hat{y})\quad \log(1-\hat{y})\leftarrow larger\quad\hat{y}\leftarrow smaller\\\\
$$
### cost function
$$
\mathcal{J}(w,b)=\frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)},y^{(i)})
$$
### Gradient Descent
Find $w,b$ that minimize $J(w,b)$:
Repeat:
$$
w:=w-\alpha \frac{\partial\mathcal{J}(w,b)}{\partial w}(dw)\\
b:=b-\alpha \frac{\partial\mathcal{J}(w,b)}{\partial b}(db)
$$
### Computation Graph
example:
$$
J=3(a+bc)
$$
```mermaid
graph LR
C[v=a+u]
A(a)
B(b)
D(c)
E[u=bc]
F[J=3v]
A --> C
B --> E
D --> E
E --> C
C --> F
```

One-example gradient descent on the computation graph:
recap:
$$
z=w^Tx+b\\
\hat{y}=a=\sigma(z)=\frac{1}{1+e^{-z}}\\
\mathcal{L}(a,y)=-(y\log(a)+(1-y)\log(1-a))
$$
The graph:
$$
'da'=\frac{d\mathcal{L}(a,y)}{da}=-\frac{y}{a}+\frac{1-y}{1-a}\\
'dz'=\frac{d\mathcal{L}(a,y)}{dz}=\frac{d\mathcal{L}}{da}\cdot\frac{da}{dz}=a-y\\
'dw_1'=x_1\cdot dz\quad\ldots\\
w_1:=w_1-\alpha\,dw_1\quad\ldots
$$
m-example gradient descent on the computation graph:
recap:
$$
\mathcal{J}(w,b)=\frac{1}{m}\sum_{i=1}^m\mathcal{L}(a^{(i)},y^{(i)})
$$
The graph (two features, one loop over the m examples):
$$
\frac{\partial}{\partial w_1}\mathcal{J}(w,b)=\frac{1}{m}\sum_{i=1}^m\frac{\partial}{\partial w_1}\mathcal{L}(a^{(i)},y^{(i)})\\\\
For\quad i=1\quad to\quad m:\{\\
a^{(i)}=\sigma (w^Tx^{(i)}+b)\\
\mathcal{J}+=-[y^{(i)}\log a^{(i)}+(1-y^{(i)})\log(1-a^{(i)})]\\
dz^{(i)}=a^{(i)}-y^{(i)}\\
dw_1+=x_1^{(i)}dz^{(i)}\\
dw_2+=x_2^{(i)}dz^{(i)}\\
db+=dz^{(i)}\}\\
\mathcal{J}/=m;\ dw_1/=m;\ dw_2/=m;\ db/=m\\
dw_1=\frac{\partial\mathcal{J}}{\partial w_1}\\
w_1:=w_1-\alpha\,dw_1
$$
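The loop above can be sketched in numpy (a minimal sketch for $n_x=2$ features; `grad_loop` is a name chosen here, not from the course):

```python
import numpy as np

def grad_loop(w, b, X, Y):
    """Accumulate cost and gradients over m examples with an explicit
    for-loop, mirroring the pseudocode above. X: (2, m), Y: (m,)."""
    m = X.shape[1]
    J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0
    for i in range(m):
        z = np.dot(w.T, X[:, i]) + b          # scalar pre-activation
        a = 1.0 / (1.0 + np.exp(-z))          # sigmoid
        J += -(Y[i] * np.log(a) + (1 - Y[i]) * np.log(1 - a))
        dz = a - Y[i]
        dw1 += X[0, i] * dz
        dw2 += X[1, i] * dz
        db += dz
    return J / m, dw1 / m, dw2 / m, db / m
```

The per-feature accumulators `dw1`, `dw2` are exactly what the vectorized version below replaces with a single `dw`.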
Vectorized: `z = np.dot(w, x) + b`
Vectorizing the logistic regression derivatives:
$$
dw_1=0,\ dw_2=0\rightarrow dw=np.zeros((n_x,1))\\
\begin{cases}dw_1+=x_1^{(i)}dz^{(i)}\\ dw_2+=x_2^{(i)}dz^{(i)}\end{cases}\rightarrow dw+=x^{(i)}dz^{(i)}\\\\
Z=\left(\;\begin{matrix} z^{(1)} & z^{(2)} &\cdots &z^{(m)}\end{matrix}\;\right)=w^TX+b\\
A=\sigma(Z)\\\\
dZ=A-Y=\left(\;\begin{matrix} a^{(1)}-y^{(1)} & a^{(2)}-y^{(2)} &\cdots &a^{(m)}-y^{(m)}\end{matrix}\;\right)\\
db=\frac{1}{m}\sum_{i=1}^m dz^{(i)}=\frac{1}{m}np.sum(dZ)\\
dw=\frac{1}{m}X\,dZ^T=\frac{1}{m}\left(x^{(1)}dz^{(1)}+x^{(2)}dz^{(2)}+\cdots+x^{(m)}dz^{(m)}\right)
$$
$$
Z=w^TX+b=np.dot(w^T,X)+b\\
A=\sigma(Z)\\
J=-\frac{1}{m}\sum_{i=1}^m(y^{(i)}\log(a^{(i)})+(1-y^{(i)})\log(1-a^{(i)}))\\
dZ=A-Y\\
dw=\frac{1}{m}XdZ^T\\
db=\frac{1}{m}np.sum(dZ)\\
w:=w-\alpha dw\\
b:=b-\alpha db
$$
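A runnable sketch of the vectorized iteration above (the function name `lr_step` and the toy data are illustrative assumptions):

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def lr_step(w, b, X, Y, alpha):
    """One vectorized gradient-descent step, following the equations above.
    X: (n_x, m), Y: (1, m), w: (n_x, 1)."""
    m = X.shape[1]
    Z = np.dot(w.T, X) + b                    # (1, m)
    A = sigmoid(Z)
    J = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    dZ = A - Y
    dw = np.dot(X, dZ.T) / m                  # (n_x, 1)
    db = np.sum(dZ) / m
    return w - alpha * dw, b - alpha * db, J
```

Repeated calls shrink the cost J, since logistic regression's cost is convex.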
A note on numpy
$$
a=np.random.randn(5)\ \text{(rank-1 array, avoid)}\rightarrow a=a.reshape(5,1)\\
assert(a.shape==(5,1))\\
a=np.random.randn(5,1)\rightarrow \text{column vector}
$$
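The same pitfall, runnable (assuming numpy is available):

```python
import numpy as np

a = np.random.randn(5)        # rank-1 array: shape (5,), neither row nor column
assert a.shape == (5,)
assert (a.T == a).all()       # transposing a rank-1 array does nothing

a = a.reshape(5, 1)           # fix: make it an explicit column vector
assert a.shape == (5, 1)

b = np.random.randn(5, 1)     # better: create a column vector directly
assert b.shape == (5, 1)
```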
2-layer NN:
$$
Input\;layer\rightarrow hidden\;layer\rightarrow output\;layer\\
a^{[0]}\rightarrow a^{[1]}\rightarrow a^{[2]}\\\\
z^{[1]}=W^{[1]}a^{[0]}+b^{[1]}\\
a^{[1]}=\sigma(z^{[1]})\\
z^{[2]}=W^{[2]}a^{[1]}+b^{[2]}\\
a^{[2]}=\sigma(z^{[2]})=\hat y
$$
$$
z_i^{[1]}=w_i^{[1]T}x+b_i^{[1]}\\
a_i^{[1]}=\sigma(z_i^{[1]})\\
\left[ \begin{matrix} w_1^{[1]T}\\w_2^{[1]T}\\w_3^{[1]T}\\w_4^{[1]T} \end{matrix} \right] \cdot \left[ \begin{matrix} x_1\\x_2\\x_3 \end{matrix} \right]+\left[ \begin{matrix} b_1^{[1]}\\b_2^{[1]}\\b_3^{[1]}\\b_4^{[1]} \end{matrix} \right]=\left[ \begin{matrix} z_1^{[1]}\\z_2^{[1]}\\z_3^{[1]}\\z_4^{[1]} \end{matrix} \right]
$$
$$
x^{(i)}\rightarrow a^{[2](i)}=\hat y^{(i)}\\
Z^{[1]}=W^{[1]}X+b^{[1]}\\
A^{[1]}=\sigma(Z^{[1]})\\
Z^{[2]}=W^{[2]}A^{[1]}+b^{[2]}\\
A^{[2]}=\sigma(Z^{[2]})\\
W^{[1]}\cdot \left[ \begin{matrix} x^{(1)} & x^{(2)} &\cdots & x^{(m)} \end{matrix} \right]+b^{[1]}=\left[ \begin{matrix} z^{[1](1)} & z^{[1](2)} &\cdots & z^{[1](m)} \end{matrix} \right]=Z^{[1]}
$$
$$
a=\frac{1}{1+e^{-z}},\quad a'=a(1-a)\qquad \text{(sigmoid)}\\
a=\tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}},\quad a\in (-1,1),\quad a'=1-a^2\\
a=\max(0,z)\qquad \text{(ReLU)}\\
a=\max(0.01z,z)\qquad \text{(leaky ReLU)}
$$
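The four activations and their listed derivatives, as a quick numpy sketch (function names are chosen here for illustration):

```python
import numpy as np

def sigmoid(z):     return 1.0 / (1.0 + np.exp(-z))
def d_sigmoid(z):   a = sigmoid(z); return a * (1 - a)   # a' = a(1-a)
def tanh_prime(z):  return 1 - np.tanh(z) ** 2           # a' = 1 - a^2
def relu(z):        return np.maximum(0, z)
def leaky_relu(z):  return np.maximum(0.01 * z, z)
```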
$$
z^{[1]}=W^{[1]}x+b^{[1]}\rightarrow\\
a^{[1]}=\sigma(z^{[1]})\rightarrow\\
z^{[2]}=W^{[2]}a^{[1]}+b^{[2]}\rightarrow\\
a^{[2]}=\sigma(z^{[2]})\rightarrow\\
\mathcal{L}(a^{[2]},y)\\\\
dz^{[2]}=a^{[2]}-y\\
dw^{[2]}=dz^{[2]}a^{[1]T}\\
db^{[2]}=dz^{[2]}\\
dz^{[1]}=w^{[2]T}dz^{[2]}*a^{'[1]}\\
dw^{[1]}=dz^{[1]}\cdot x^T\\
db^{[1]}=dz^{[1]}
$$
The derivation of $dz^{[1]}$ involves matrix calculus.
$$
x:(n_0,m)\quad W^{[1]}:(n_1,n_0)\rightarrow\\
a^{[1]}:(n_1,m)\quad W^{[2]}:(n_2,n_1)\rightarrow\\
a^{[2]}:(n_2,m)
$$
$$
dZ^{[2]}=A^{[2]}-Y\\
dW^{[2]}=\frac{1}{m}dZ^{[2]}A^{[1]T}\\
db^{[2]}=\frac{1}{m}np.sum(dZ^{[2]},axis=1,keepdims=True)\\
dZ^{[1]}=W^{[2]T}dZ^{[2]}*A^{'[1]}\\
dW^{[1]}=\frac{1}{m}dZ^{[1]}X^T\\
db^{[1]}=\frac{1}{m}np.sum(dZ^{[1]},axis=1,keepdims=True)
$$
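A sketch of the forward and backward passes for the 2-layer net, assuming tanh for the hidden layer and a sigmoid output (the equations leave the hidden activation generic):

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def forward(X, params):
    """Forward pass: tanh hidden layer, sigmoid output."""
    Z1 = params["W1"] @ X + params["b1"]
    A1 = np.tanh(Z1)
    Z2 = params["W2"] @ A1 + params["b2"]
    A2 = sigmoid(Z2)
    return Z1, A1, Z2, A2

def backward(X, Y, params, cache):
    """Vectorized backprop matching dZ2 = A2 - Y etc. above."""
    m = X.shape[1]
    Z1, A1, Z2, A2 = cache
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (params["W2"].T @ dZ2) * (1 - A1 ** 2)   # tanh'(Z1) = 1 - A1^2
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}
```

Each gradient has the same shape as the parameter it updates, matching the dimension table below.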
$$
w^{[1]}=np.random.randn(2,2)*0.01\\
b^{[1]}=np.zeros((2,1))
$$
$$
example:\ L\text{-layer}\ NN\\
a^{[l]}\rightarrow activations\ of\ layer\ l\\
w^{[l]}\rightarrow weights\ for\ z^{[l]}\\
\hat y=a^{[L]}
$$
$$
for\ l=1,2,\ldots,L:\\
z^{[l]}=w^{[l]}a^{[l-1]}+b^{[l]}\\
cache\ z^{[l]},w^{[l]},b^{[l]}\\
a^{[l]}=g^{[l]}(z^{[l]})
$$
$$
da^{[l]}\rightarrow da^{[l-1]}\ (dz^{[l]},dw^{[l]},db^{[l]})\\
dz^{[l]}=da^{[l]}*g^{[l]'}(z^{[l]})=w^{[l+1]T}dz^{[l+1]}*g^{[l]'}(z^{[l]})\\
dw^{[l]}=dz^{[l]}\cdot a^{[l-1]T}\\
db^{[l]}=dz^{[l]}\\
da^{[l-1]}=w^{[l]T}\cdot dz^{[l]}
$$
$$
dw,w^{[l]}:(n^{[l]},n^{[l-1]})\\
db,b^{[l]}:(n^{[l]},1)\\
Z^{[l]},A^{[l]}:(n^{[l]},m)
$$
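A minimal forward loop with caching for an L-layer net (assuming ReLU hidden layers and a sigmoid output; the `W1`, `b1`, ... parameter naming is an illustration):

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def l_forward(X, params, L):
    """Forward pass through L layers, caching (z, a) per layer
    for use in backprop, as described above."""
    caches = []
    A = X                                      # A = a^[0]
    for l in range(1, L + 1):
        Z = params["W" + str(l)] @ A + params["b" + str(l)]
        A = 1 / (1 + np.exp(-Z)) if l == L else relu(Z)
        caches.append((Z, A))
    return A, caches
```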
Train/dev/test split: 70/30 or 60/20/20 for small datasets (100–10,000 examples);
98/1/1 (or even smaller dev/test fractions) for big data.
Bias measures a single model's ability to fit the data; variance measures the same model's stability across different datasets.
high variance -> high dev set error
high bias -> high train set error
high bias -> bigger network / train longer / more advanced optimization algorithms / NN architectures
high variance -> more data / regularization / NN architecture
$$
L2\;regularization:\\
\min\mathcal{J}(w,b)\rightarrow J(w,b)=\frac{1}{m}\sum_{i=1}^m\mathcal{L}(\hat y^{(i)},y^{(i)})+\frac{\lambda}{2m}\Vert w\Vert_2^2
$$
$$
Frobenius\;norm:\ \Vert w^{[l]}\Vert^2_F=\sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}}(w_{i,j}^{[l]})^2\\\\
Dropout\;regularization\ (inverted\ dropout):\\
d3=np.random.rand(a3.shape[0],a3.shape[1])<keep\_prob\\
a3=np.multiply(a3,d3)\\
a3/=keep\_prob
$$
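The three dropout lines above, runnable (the mask shape and `keep_prob = 0.8` are example choices):

```python
import numpy as np

keep_prob = 0.8
rng = np.random.default_rng(1)
a3 = rng.standard_normal((4, 5))          # activations of layer 3

d3 = rng.random(a3.shape) < keep_prob     # boolean mask: keep with prob keep_prob
a3 = np.multiply(a3, d3)                  # zero out the dropped units
a3 /= keep_prob                           # inverted dropout: keeps E[a3] unchanged
```

Dividing by `keep_prob` means no rescaling is needed at test time.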
Normalizing inputs speeds up the training of your neural network:
$$
\mu=\frac{1}{m}\sum_{i=1}^{m}x^{(i)},\quad x:=x-\mu\\
\sigma^2=\frac{1}{m}\sum_{i=1}^m(x^{(i)})^2,\quad x/=\sigma
$$
$$
\hat y=w^{[L]}w^{[L-1]}\ldots w^{[2]}w^{[1]}x\\
w^{[l]}>I\rightarrow (w^{[l]})^L\rightarrow\infty\\
w^{[l]}<I\rightarrow (w^{[l]})^L\rightarrow 0
$$
$$
var(w)=\frac{1}{n^{[l-1]}}\\
w^{[l]}=np.random.randn(shape)*np.sqrt(\frac{1}{n^{[l-1]}})
$$
$$
f(\theta)=\theta^3\\
f'(\theta)\approx\frac{f(\theta+\varepsilon)-f(\theta-\varepsilon)}{2\varepsilon}
$$
$$
d\theta_{approx}[i]=\frac{J(\theta_1,\ldots,\theta_i+\varepsilon,\ldots)-J(\theta_1,\ldots,\theta_i-\varepsilon,\ldots)}{2\varepsilon}\approx d\theta[i]\\
check:\ \frac{\Vert d\theta_{approx}-d\theta\Vert_2}{\Vert d\theta_{approx}\Vert_2+\Vert d\theta\Vert_2}<10^{-7}
$$
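The check can be sketched on a cubic cost (here $J(\theta)=\sum_i \theta_i^3$, generalizing the $f(\theta)=\theta^3$ example; `grad_check` is a name chosen for illustration):

```python
import numpy as np

def grad_check(J, dJ, theta, eps=1e-7):
    """Compare the two-sided numerical gradient of J with the
    analytic gradient dJ, returning the relative difference above."""
    approx = np.array([
        (J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
        for e in np.eye(theta.size)            # perturb one component at a time
    ])
    d = dJ(theta)
    return np.linalg.norm(approx - d) / (np.linalg.norm(approx) + np.linalg.norm(d))

# J(theta) = sum(theta^3) has analytic gradient 3*theta^2
theta = np.array([1.0, 2.0, -0.5])
diff = grad_check(lambda t: np.sum(t ** 3), lambda t: 3 * t ** 2, theta)
assert diff < 1e-7                             # passes the threshold above
```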
$$
[x^{(1)}\ldots x^{(m)}]\rightarrow [X^{\{1\}}\ldots X^{\{m/size\}}]\\
One\ epoch:\ for\ each\ mini\text{-}batch\ X^{\{t\}}:\\
Forward\ prop\ on\ X^{\{t\}}:\\
z^{[l]}=w^{[l]}X^{\{t\}}+b^{[l]}\\
A^{[l]}=g^{[l]}(z^{[l]})\\
J^{\{t\}}=\frac{1}{size}\sum_{i}\mathcal{L}(\hat y^{(i)},y^{(i)})+\frac{\lambda}{2\cdot size}\sum_l\Vert w^{[l]}\Vert_F^2\\
Backward\ prop
$$
size = m -> Batch gradient descent <- small train set (<2000)
size = 1 -> stochastic gradient descent
typical mini-batch sizes: 64, 128, 256, …
$$
v_0=0\\
v_t:=\beta v_{t-1}+(1-\beta)\theta_t
$$
$$
\text{effective window}\approx\frac{1}{1-\beta};\quad \text{bias correction: }\frac{v_t}{1-\beta^t}
$$
$$
Momentum:\\
V_{dw}=\beta V_{dw}+(1-\beta)dw\\
V_{db}=\beta V_{db}+(1-\beta)db\\
w:=w-\alpha V_{dw},\quad b:=b-\alpha V_{db}
$$
$$
RMSprop:\\
S_{dw}=\beta_2 S_{dw}+(1-\beta_2)dw^2\\
S_{db}=\beta_2 S_{db}+(1-\beta_2)db^2\\
w:=w-\alpha \frac{dw}{\sqrt{S_{dw}}+\varepsilon}
$$
$$
Adam:\ V_{dw}=0,\ S_{dw}=0\\
V_{dw}=\beta_1 V_{dw}+(1-\beta_1)dw\quad V_{db}=\beta_1 V_{db}+(1-\beta_1)db\\
S_{dw}=\beta_2 S_{dw}+(1-\beta_2)dw^2\quad S_{db}=\beta_2 S_{db}+(1-\beta_2)db^2\\
V_{dw}^{correct}=\frac{V_{dw}}{1-\beta_1^t}\quad S_{dw}^{correct}=\frac{S_{dw}}{1-\beta_2^t}\\
W:=W-\alpha \frac{V_{dw}^{correct}}{\sqrt{S_{dw}^{correct}}+\varepsilon}\\
\beta_1=0.9,\ \beta_2=0.999
$$
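One Adam step per the equations above, as a sketch (the quadratic test function and the learning rate are illustrative choices):

```python
import numpy as np

def adam_step(w, dw, state, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter array.
    state holds the running V and S; t is the 1-based step (bias correction)."""
    state["V"] = beta1 * state["V"] + (1 - beta1) * dw
    state["S"] = beta2 * state["S"] + (1 - beta2) * dw ** 2
    V_corr = state["V"] / (1 - beta1 ** t)
    S_corr = state["S"] / (1 - beta2 ** t)
    return w - alpha * V_corr / (np.sqrt(S_corr) + eps)

# Minimize f(w) = ||w||^2, whose gradient is 2w
w = np.array([1.0, -2.0])
state = {"V": np.zeros_like(w), "S": np.zeros_like(w)}
for t in range(1, 201):
    w = adam_step(w, 2 * w, state, t, alpha=0.1)
```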
$$
\alpha=\frac{1}{1+decayRate*epochNum}\alpha_0\\
\alpha=\frac{k}{\sqrt{epochNum}}\alpha_0
$$
Hyperparameter priority: learning_rate first; then beta / hidden units / mini-batch size
Try random values
Coarse to fine
babysitting one model / training many models in parallel
Appropriate scale for hyperparameters:
$$
\alpha\in[0.0001,1]\\
r=-4*np.random.rand()\\
\alpha=10^r
$$
$$
\beta=0.9\ldots0.999\\
1-\beta=0.1\ldots0.001\\
r\in[-3,-1]\\
1-\beta=10^r
$$
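Sampling both hyperparameters on a log scale, per the formulas above (a sketch; the seeded generator is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# alpha in [1e-4, 1]: sample the exponent uniformly, not alpha itself
r = -4 * rng.random()
alpha = 10 ** r

# beta in [0.9, 0.999]: sample 1 - beta on a log scale
r = rng.uniform(-3, -1)
beta = 1 - 10 ** r
```

Sampling the exponent uniformly spends equal search effort per decade, instead of wasting most samples near the top of the range.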
$$
\mu=\frac{1}{m}\sum_{i=1}^{m}x^{(i)},\quad x:=x-\mu\\
\sigma^2=\frac{1}{m}\sum_{i=1}^m(x^{(i)})^2,\quad x/=\sigma
$$
$$
\mu=\frac{1}{m}\sum_{i=1}^{m}z^{(i)}\\
\sigma^2=\frac{1}{m}\sum_{i=1}^m(z^{(i)}-\mu)^2\\
z^{(i)}_{norm}=\frac{z^{(i)}-\mu}{\sqrt{\sigma^2+\varepsilon}}\\
\tilde z^{(i)}=\gamma z^{(i)}_{norm}+\beta\quad (if\ \gamma=\sqrt{\sigma^2+\varepsilon},\ \beta=\mu,\ then\ \tilde z^{(i)}=z^{(i)})
$$
$$
x\rightarrow z^{[1]}\rightarrow \tilde z^{[1]}\rightarrow a^{[1]}\rightarrow z^{[2]}\rightarrow \tilde z^{[2]}\rightarrow a^{[2]}\\
\beta^{[l]}:=\beta^{[l]}-\alpha\,d\beta^{[l]}
$$
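A sketch of the batch-norm forward step for one layer's Z (per-unit normalization over the mini-batch, following the equations above; `batchnorm_forward` is a name chosen here):

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize each row (hidden unit) of Z over the mini-batch axis,
    then apply the learnable scale gamma and shift beta."""
    mu = np.mean(Z, axis=1, keepdims=True)            # per-unit mean
    var = np.mean((Z - mu) ** 2, axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta
```

With gamma = 1 and beta = 0 each unit's activations come out with mean 0 and variance ≈ 1 over the batch.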
Exponentially weighted averages: at test time, estimate each layer's μ and σ² with an EWA over the mini-batches seen during training.
$$
Softmax\ activation\ (C=4\ classes):\ t=e^{z^{[L]}}\\
a^{[L]}_i=\frac{t_i}{\sum_{i=1}^4 t_i}\\
\mathcal{L}(\hat y,y)=-\sum_{j=1}^4 y_j\log \hat y_j
$$
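A softmax sketch; the max-subtraction is a standard numerical-stability addition, not part of the formula above:

```python
import numpy as np

def softmax(z):
    t = np.exp(z - np.max(z))     # subtracting max(z) avoids overflow in exp
    return t / np.sum(t)          # normalize so the outputs sum to 1
```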
Precision: of the examples predicted positive, n% actually are positive.
Recall: n% of the actual positives were correctly recognized.
F1 score : $ \frac{2}{\frac{1}{P}+\frac{1}{R}} $
cost = accuracy - 0.5 * running time
N metrics: 1 optimizing metric, (N-1) satisficing metrics that must reach a threshold