Deep Learning Notes

Deep Learning

Basic

  • Neural network: inputs (input1 … input4) feed through stacked units (algorithm1, algorithm2) to produce the output (diagram)
  • Supervised learning: each x maps to one y;

  • Sigmoid: activation function
    $sigmoid(x)=\frac{1}{1+e^{-x}}$

  • ReLU: rectified linear unit;


## Logistic Regression

--> binary classification: x --> y ∈ {0, 1}

### Notation

$$
(x,y),\quad x\in{\mathbb{R}^{n_{x}}},\quad y\in\{0,1\}\\\\
m_{train}\;training\;examples,\quad m_{test}\;test\;examples\\\\
training\;set:\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})\}\\\\
X =
\left[ 
\begin{matrix} 
x^{(1)} & x^{(2)} &\cdots & x^{(m)} 
\end{matrix} \right] \leftarrow n_{x}\times m\\\\
\hat{y}=P(y=1\mid x),\quad\hat{y}=\sigma(w^Tx+b),\qquad
w\in \mathbb{R}^{n_x}, \quad b\in \mathbb{R}\\
\sigma (z)=\frac{1}{1+e^{-z}}
$$

### Loss function

For a single example:
$$
squared\;error\;(not\;used:\;non\text{-}convex):\;\mathcal{L}(\hat{y},y)=\frac{1}{2}(\hat{y}-y)^2\\\\
p(y\mid x)=\hat{y}^y(1-\hat y)^{(1-y)}\\
\min\;cost\leftrightarrow\max\;\log p(y\mid x)\\
\mathcal{L}(\hat{y},y)=-(y\log(\hat{y})+(1-y)\log(1-\hat{y}))\\\\
y=1:\mathcal{L}(\hat{y},y)=-\log\hat{y}\quad want\;\log\hat{y}\;larger\Rightarrow\hat{y}\;larger\\
y=0:\mathcal{L}(\hat{y},y)=-\log(1-\hat{y})\quad want\;\log(1-\hat{y})\;larger\Rightarrow\hat{y}\;smaller\\\\
$$

### cost function 

$$
\mathcal{J}(w,b)=\frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)},y^{(i)})
$$

### Gradient Descent

Find $w,b$ that minimize $J(w,b)$:

Repeat:
$$
w:=w-\alpha \frac{\partial\mathcal{J}(w,b)}{\partial w}(dw)\\
b:=b-\alpha \frac{\partial\mathcal{J}(w,b)}{\partial b}(db)
$$

### Computation Graph

example:
$$
J=3(a+bc)
$$

```mermaid
graph LR
	C[v=a+u]
	A(a)
	B(b)
	D(c)
	E[u=bc]
	F[J=3v]
	
	A --> C
	B --> E
	D --> E
	E --> C
	C --> F
```

Gradient descent for one example (computation graph):

recap:
$$
z=w^Tx+b\\
\hat{y}=a=\sigma(z)=\frac{1}{1+e^{-z}}\\
\mathcal{L}(a,y)=-(y\log(a)+(1-y)\log(1-a))
$$
The graph:

$$
'da'=\frac{d\mathcal{L}(a,y)}{da}=-\frac{y}{a}+\frac{1-y}{1-a}\\
'dz'=\frac{d\mathcal{L}(a,y)}{dz}=\frac{d\mathcal{L}}{da}\cdot\frac{da}{dz}=a-y\\
'dw_1'=x_1\cdot dz\quad\dots\\
w_1:=w_1-\alpha\, dw_1\quad\dots
$$
Gradient descent over m examples (computation graph):

recap:
$$
\mathcal{J}(w,b)=\frac{1}{m}\sum_{i=1}^m\mathcal{L}(a^{(i)},y^{(i)})
$$
The graph (one pass over m examples, two features $w_1,w_2$):

$$
\frac{\partial}{\partial w_1}\mathcal{J}(w,b)=\frac{1}{m}\sum_{i=1}^m\frac{\partial}{\partial w_1}\mathcal{L}(a^{(i)},y^{(i)})\\\\
For \quad i=1 \quad to \quad m:\{\\
a^{(i)}=\sigma (w^Tx^{(i)}+b)\\
\mathcal{J}+=-[y^{(i)}\log a^{(i)}+(1-y^{(i)})\log(1-a^{(i)})] \\
dz^{(i)}=a^{(i)}-y^{(i)}\\
dw_1+=x_1^{(i)}dz^{(i)}\\
dw_2+=x_2^{(i)}dz^{(i)}\\
db+=dz^{(i)}\}\\
\mathcal{J}/=m;\;dw_1/=m;\;dw_2/=m;\;db/=m\\
dw_1=\frac{\partial\mathcal{J}}{\partial w_1}\\
w_1:=w_1-\alpha\, dw_1
$$
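The loop above, written out as a minimal NumPy sketch (the toy data `X`, `Y` and the zero initialization are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy data: m = 4 examples, n_x = 2 features (made-up values).
X = np.array([[1.0, 2.0, -1.0, 0.5],
              [0.5, -1.0, 2.0, 1.5]])   # shape (n_x, m)
Y = np.array([1.0, 0.0, 1.0, 0.0])
w = np.zeros((2, 1))
b = 0.0
m = X.shape[1]

J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0
for i in range(m):                       # accumulate over the m examples
    a = sigmoid(w[:, 0] @ X[:, i] + b)
    J += -(Y[i] * np.log(a) + (1 - Y[i]) * np.log(1 - a))
    dz = a - Y[i]                        # dz^(i) = a^(i) - y^(i)
    dw1 += X[0, i] * dz
    dw2 += X[1, i] * dz
    db += dz
J /= m; dw1 /= m; dw2 /= m; db /= m      # average over m

alpha = 0.1
w1 = w[0, 0] - alpha * dw1               # one gradient-descent step on w_1
```

With zero-initialized parameters every prediction is 0.5, so the initial cost is exactly $\log 2\approx0.693$.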

### Vectorization

vectorized:

$$
z=np.dot(w,x)+b
$$
logistic regression derivatives:

change:
$$
dw_1=0,\;dw_2=0\rightarrow dw=np.zeros((n_x,1))\\
\begin{cases}dw_1+=x_1^{(i)}dz^{(i)}\\ dw_2+=x_2^{(i)}dz^{(i)}\end{cases}\rightarrow dw+=x^{(i)}dz^{(i)}\\\\
Z=\left(\;\begin{matrix} z^{(1)} & z^{(2)} &\dots &z^{(m)}\end{matrix}\;\right)=w^TX+b\\
A=\sigma(Z)\\\\
dZ=A-Y=\left(\;\begin{matrix} a^{(1)}-y^{(1)} & a^{(2)}-y^{(2)} &\dots &a^{(m)}-y^{(m)}\end{matrix}\;\right)\\
db=\frac{1}{m}\sum_{i=1}^m dz^{(i)}=\frac{1}{m}np.sum(dZ)\\
dw=\frac{1}{m}X\,dZ^T=\frac{1}{m}\left(\;x^{(1)}dz^{(1)}+x^{(2)}dz^{(2)}+\dots+x^{(m)}dz^{(m)}\;\right)
$$

Implementing:

$$
Z=w^TX+b=np.dot(w.T,X)+b\\
A=\sigma(Z)\\
J=-\frac{1}{m}\sum_{i=1}^m\left(y^{(i)}\log(a^{(i)})+(1-y^{(i)})\log(1-a^{(i)})\right)\\
dZ=A-Y\\
dw=\frac{1}{m}X\,dZ^T\\
db=\frac{1}{m}np.sum(dZ)\\
w:=w-\alpha\, dw\\
b:=b-\alpha\, db
$$
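These vectorized update rules map line-for-line onto NumPy (a sketch on made-up toy data; `alpha` and the iteration count are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy data: X is (n_x, m), Y is (1, m) (made-up values).
X = np.array([[1.0, 2.0, -1.0, 0.5],
              [0.5, -1.0, 2.0, 1.5]])
Y = np.array([[1.0, 0.0, 1.0, 0.0]])
w = np.zeros((2, 1))
b = 0.0
m = X.shape[1]
alpha = 0.1

for _ in range(100):                     # arbitrary iteration count
    Z = np.dot(w.T, X) + b               # (1, m)
    A = sigmoid(Z)
    J = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    dZ = A - Y
    dw = np.dot(X, dZ.T) / m             # (n_x, 1)
    db = np.sum(dZ) / m
    w -= alpha * dw
    b -= alpha * db
```

Since the cost is convex, each iteration drives `J` below its initial value of $\log 2\approx0.693$.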

### Broadcasting

$$
np.dot(w.T,X)+b
$$

The scalar $b$ is broadcast across all $m$ columns of the product.

### A note on numpy

$$
a=np.random.randn(5)\quad(rank\text{-}1\;array;\;avoid)\rightarrow a=a.reshape(5,1)\\
assert(a.shape==(5,1))\\
a=np.random.randn(5,1)\rightarrow column\;vector
$$
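A quick demonstration of the rank-1-array pitfall (illustrative only):

```python
import numpy as np

np.random.seed(0)
a = np.random.randn(5)        # rank-1 array: shape (5,), neither row nor column
b = a.reshape(5, 1)           # explicit column vector
c = np.random.randn(5, 1)     # column vector from the start

assert a.T.shape == (5,)      # a rank-1 array transposes to itself
assert b.T.shape == (1, 5)    # a real column vector transposes to a row
```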

## Shallow Neural Network

### Representation

2-layer NN:

$$
Input\;layer\rightarrow hidden\;layer\rightarrow output\;layer\\
a^{[0]}\rightarrow a^{[1]}\rightarrow a^{[2]}\\\\
z^{[1]}=W^{[1]}a^{[0]}+b^{[1]}\\
a^{[1]}=\sigma(z^{[1]})\\
z^{[2]}=W^{[2]}a^{[1]}+b^{[2]}\\
a^{[2]}=\sigma(z^{[2]})=\hat y
$$

computing:

$$
z_i^{[1]}=w_i^{[1]T}x+b_i^{[1]}\\
a_i^{[1]}=\sigma(z_i^{[1]})\\
\left[ \begin{matrix} w_1^{[1]T}\\w_2^{[1]T}\\w_3^{[1]T}\\w_4^{[1]T} \end{matrix} \right] \cdot \left[ \begin{matrix} x_1\\x_2\\x_3 \end{matrix} \right]+\left[ \begin{matrix} b_1^{[1]}\\b_2^{[1]}\\b_3^{[1]}\\b_4^{[1]} \end{matrix} \right]=\left[ \begin{matrix} z_1^{[1]}\\z_2^{[1]}\\z_3^{[1]}\\z_4^{[1]} \end{matrix} \right]
$$

Vectorize:

$$
x^{(i)}\rightarrow a^{[2](i)}=\hat y^{(i)}\\
Z^{[1]}=W^{[1]}X+b^{[1]}\\
A^{[1]}=\sigma(Z^{[1]})\\
Z^{[2]}=W^{[2]}A^{[1]}+b^{[2]}\\
A^{[2]}=\sigma(Z^{[2]})\\
W^{[1]}\cdot \left[ \begin{matrix} x^{(1)} & x^{(2)} &\cdots & x^{(m)} \end{matrix} \right]+b^{[1]}=\left[ \begin{matrix} z^{[1](1)} & z^{[1](2)} &\cdots & z^{[1](m)} \end{matrix} \right]=Z^{[1]}
$$

### Activation functions

$$
sigmoid:\;a=\frac{1}{1+e^{-z}},\quad a'=a(1-a)\\
\tanh:\;a=\tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}},\;a\in (-1,1),\quad a'=1-a^2\\
ReLU:\;a=\max(0,z)\\
leaky\;ReLU:\;a=\max(0.01z,z)
$$
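The four activations and their derivatives as NumPy one-liners (the helper names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def dsigmoid(z):
    a = sigmoid(z)
    return a * (1 - a)            # a' = a(1 - a)

def dtanh(z):
    return 1 - np.tanh(z) ** 2    # a' = 1 - a^2

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z):
    return np.maximum(0.01 * z, z)
```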

### Gradient descent

computation

$$
z^{[1]}=W^{[1]}x+b^{[1]}\rightarrow
a^{[1]}=\sigma(z^{[1]})\rightarrow
z^{[2]}=W^{[2]}a^{[1]}+b^{[2]}\rightarrow
a^{[2]}=\sigma(z^{[2]})\rightarrow
\mathcal{L}(a^{[2]},y)\\\\
dz^{[2]}=a^{[2]}-y\\
dw^{[2]}=dz^{[2]}a^{[1]T}\\
db^{[2]}=dz^{[2]}\\
dz^{[1]}=w^{[2]T}dz^{[2]}*g^{[1]'}(z^{[1]})\\
dw^{[1]}=dz^{[1]}\cdot x^T\\
db^{[1]}=dz^{[1]}
$$

The derivation of dz[1] involves matrix calculus.

the dimensions:

$$
x:(n_0,m),\quad W^{[1]}:(n_1,n_0)\rightarrow a^{[1]}:(n_1,m)\\
W^{[2]}:(n_2,n_1)\rightarrow a^{[2]}:(n_2,m)
$$

vectorized:

$$
dZ^{[2]}=A^{[2]}-Y\\
dW^{[2]}=\frac{1}{m}dZ^{[2]}A^{[1]T}\\
db^{[2]}=\frac{1}{m}np.sum(dZ^{[2]},axis=1,keepdims=True)\\
dZ^{[1]}=W^{[2]T}dZ^{[2]}*g^{[1]'}(Z^{[1]})\\
dW^{[1]}=\frac{1}{m}dZ^{[1]}X^T\\
db^{[1]}=\frac{1}{m}np.sum(dZ^{[1]},axis=1,keepdims=True)
$$
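One forward/backward pass of the 2-layer network with these vectorized formulas (toy sizes and random data for illustration; tanh is used in the hidden layer, so $g^{[1]'}(Z^{[1]})=1-A^{[1]2}$):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

np.random.seed(1)
n0, n1, n2, m = 3, 4, 1, 5               # toy layer sizes and batch size
X = np.random.randn(n0, m)
Y = (np.random.rand(n2, m) > 0.5).astype(float)
W1 = np.random.randn(n1, n0) * 0.01
b1 = np.zeros((n1, 1))
W2 = np.random.randn(n2, n1) * 0.01
b2 = np.zeros((n2, 1))

# forward
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)                         # tanh hidden layer
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)

# backward
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)       # tanh'(Z1) = 1 - A1^2
dW1 = dZ1 @ X.T / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m
```

Each gradient has the same shape as the parameter it updates, which is the quickest sanity check on the implementation.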

### Random Initialization

$$
W^{[1]}=np.random.randn(2,2)*0.01\\
b^{[1]}=np.zeros((2,1))
$$

## Deep Neural Network

### Notation

$$
example:\;L\text{-}layer\;NN\\
a^{[l]}\rightarrow activations\;of\;layer\;l,\quad g^{[l]}\rightarrow its\;activation\;function\\
w^{[l]}\rightarrow weights\;for\;z^{[l]}\\
\hat y=a^{[L]}
$$

### Forward propagation

$$
for\;\;l=1,2,\dots,L:\\
z^{[l]}=w^{[l]}a^{[l-1]}+b^{[l]}\quad(cache\;z^{[l]},w^{[l]},b^{[l]})\\
a^{[l]}=g^{[l]}(z^{[l]})
$$

### Backward propagation

$$
da^{[l]}\rightarrow da^{[l-1]}\;(and\;dz^{[l]},dw^{[l]},db^{[l]})\\
dz^{[l]}=da^{[l]}*g^{[l]'}(z^{[l]})=w^{[l+1]T}dz^{[l+1]}*g^{[l]'}(z^{[l]})\\
dw^{[l]}=dz^{[l]}\cdot a^{[l-1]T}\\
db^{[l]}=dz^{[l]}\\
da^{[l-1]}=w^{[l]T}\cdot dz^{[l]}
$$

### Matrix dimensions

$$
dw,\;w^{[l]}:(n^{[l]},n^{[l-1]})\\
db,\;b^{[l]}:(n^{[l]},1)\\
Z^{[l]},A^{[l]}:(n^{[l]},m)
$$

## Improve NN

### Train/dev/test sets

70/0/30 or 60/20/20 splits -> small datasets (100 to 10,000 examples)

98/1/1 or more extreme splits -> big data

### Bias/Variance

Bias measures a single model's ability to fit the data; variance measures the same model's stability across different datasets.

high variance -> high dev set error

high bias -> high train set error

### Basic recipe

high bias -> bigger network / train longer / more advanced optimization algorithms / NN architectures

high variance -> more data / regularization / NN architecture

## Regularization

### Logistic regression

$$
L2\;\;regularization:\\
\min_{w,b}\mathcal{J}(w,b),\quad \mathcal{J}(w,b)=\frac{1}{m}\sum_{i=1}^m\mathcal{L}(\hat y^{(i)},y^{(i)})+\frac{\lambda}{2m}\Vert w\Vert_2^2
$$

### Neural network

$$
Frobenius\;\;norm:\;\Vert w^{[l]}\Vert^2_F=\sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}}(w_{i,j}^{[l]})^2
$$

Dropout (inverted dropout) regularization:

$$
d3=np.random.rand(a3.shape[0],a3.shape[1])<keep\_prob\\
a3=np.multiply(a3,d3)\\
a3/=keep\_prob
$$
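Inverted dropout for one layer, following the three lines above (toy activations; `keep_prob = 0.8` is the usual example value):

```python
import numpy as np

np.random.seed(3)
keep_prob = 0.8
a3 = np.random.randn(4, 5)            # toy activations of layer 3
a3_before = a3.copy()

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob   # keep-mask
a3 = np.multiply(a3, d3)              # zero the dropped units
a3 /= keep_prob                        # scale up so E[a3] is unchanged
```

Dividing by `keep_prob` is what makes this *inverted* dropout: no rescaling is needed at test time.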

### Other ways

  • early stopping
  • data augmentation

## Optimization problem

speed up the training of your neural network

### Normalizing inputs

  1. subtract mean

$$
\mu =\frac{1}{m}\sum _{i=1}^{m}x^{(i)}\\
x:=x-\mu
$$

  2. normalize variance

$$
\sigma ^2=\frac{1}{m}\sum_{i=1}^m(x^{(i)})^2\\
x/=\sigma
$$
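The two normalization steps in NumPy (toy data with a deliberately shifted mean and inflated variance):

```python
import numpy as np

np.random.seed(2)
X = np.random.randn(2, 100) * 5 + 3   # toy features: shifted mean, large variance

mu = np.mean(X, axis=1, keepdims=True)
X = X - mu                             # 1. subtract mean
sigma = np.sqrt(np.mean(X ** 2, axis=1, keepdims=True))
X = X / sigma                          # 2. normalize variance
```

Use the same `mu` and `sigma` (computed on the training set) to normalize the test set.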

### Vanishing/exploding gradients

$$
y=w^{[L]}w^{[L-1]}\dots w^{[2]}w^{[1]}x\\
w^{[l]}>I\rightarrow (w^{[l]})^L\rightarrow\infty \\
w^{[l]}<I\rightarrow (w^{[l]})^L\rightarrow 0
$$

### Weight initialization

$$
var(w)=\frac{1}{n^{[l-1]}}\quad(for\;ReLU:\;\frac{2}{n^{[l-1]}})\\
w^{[l]}=np.random.randn(shape)*np.sqrt(\frac{1}{n^{[l-1]}})
$$

### Gradient checking

Numerical approximation:

$$
f(\theta)=\theta^3\\
f'(\theta)\approx\frac{f(\theta+\varepsilon)-f(\theta-\varepsilon)}{2\varepsilon}
$$

grad check:

$$
d\theta_{approx}[i]=\frac{J(\theta_1,\dots,\theta_i+\varepsilon,\dots)-J(\theta_1,\dots,\theta_i-\varepsilon,\dots)}{2\varepsilon}\approx d\theta[i]\\
check:\;\frac{\Vert d\theta_{approx}-d\theta\Vert_2}{\Vert d\theta_{approx}\Vert_2+\Vert d\theta\Vert_2}<10^{-7}
$$
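A grad check on a toy cost $J(\theta)=\sum\theta^3$, whose analytic gradient $3\theta^2$ we can compare against the two-sided difference:

```python
import numpy as np

def J(theta):
    return np.sum(theta ** 3)            # toy cost (made up)

def dJ(theta):
    return 3 * theta ** 2                # its analytic gradient

theta = np.array([1.0, -2.0, 0.5])
eps = 1e-7
approx = np.zeros_like(theta)
for i in range(theta.size):
    tp, tm = theta.copy(), theta.copy()
    tp[i] += eps
    tm[i] -= eps
    approx[i] = (J(tp) - J(tm)) / (2 * eps)   # two-sided difference

grad = dJ(theta)
diff = np.linalg.norm(approx - grad) / (np.linalg.norm(approx) + np.linalg.norm(grad))
```

A `diff` below $10^{-7}$ means the analytic gradient almost certainly matches the cost.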

## Optimization algorithms

### Mini-batch gradient descent

$$
[x^{(1)}\dots x^{(m)}]\rightarrow [X^{\{1\}}\dots X^{\{m/1000\}}]\quad(mini\text{-}batch\;size\;1000)\\
one\;epoch:\;for\;each\;t,\;forward\;prop\;on\;X^{\{t\}}:\\
Z^{[l]}=W^{[l]}X^{\{t\}}+b^{[l]}\\
A^{[l]}=g^{[l]}(Z^{[l]})\\
J^{\{t\}}=\frac{1}{1000}\sum_{i=1}^{1000}\mathcal{L}(\hat y^{(i)},y^{(i)})+\frac{\lambda}{2\cdot 1000}\sum_l\Vert W^{[l]}\Vert_F^2\\
then\;backward\;prop\;on\;J^{\{t\}}
$$
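Shuffling and partitioning a training set into mini-batches (toy sizes; the last mini-batch is allowed to be smaller):

```python
import numpy as np

np.random.seed(5)
m, batch_size = 23, 8                  # toy sizes; last mini-batch is smaller
X = np.random.randn(2, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)

perm = np.random.permutation(m)        # shuffle before partitioning
X, Y = X[:, perm], Y[:, perm]

mini_batches = [(X[:, t:t + batch_size], Y[:, t:t + batch_size])
                for t in range(0, m, batch_size)]
```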

### Mini-batch size

size = m -> batch gradient descent <- small train set (< 2000)

size = 1 -> stochastic gradient descent

typical mini-batch sizes: 64, 128, 256, ...

### Exponentially weighted averages

$$
v_0 = 0\\
v_t:=\beta v_{t-1}+(1-\beta)\theta_t
$$

### Bias correction

$$
averaging\;window\approx\frac{1}{1-\beta};\quad correction:\;v_t\rightarrow\frac{v_t}{1-\beta^t}
$$
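Exponentially weighted averages with bias correction on a short toy sequence; the corrected first estimate equals the first reading, while the raw $v_1=(1-\beta)\theta_1$ is biased low:

```python
beta = 0.9
thetas = [40.0, 49.0, 45.0, 47.0, 44.0]    # toy daily readings

v = 0.0
corrected = []
for t, theta in enumerate(thetas, start=1):
    v = beta * v + (1 - beta) * theta       # raw EWA (biased toward 0 early on)
    corrected.append(v / (1 - beta ** t))   # bias-corrected estimate
```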

### Momentum

$$
V_{dw}=\beta V_{dw}+(1-\beta)dw\\
V_{db}=\beta V_{db}+(1-\beta)db\\
w:=w-\alpha V_{dw},\quad b:=b-\alpha V_{db}
$$

### RMSprop

$$
S_{dw}=\beta_2 S_{dw}+(1-\beta_2)dw^2\\
S_{db}=\beta_2 S_{db}+(1-\beta_2)db^2\\
w:=w-\alpha \frac{dw}{\sqrt{S_{dw}}+\varepsilon},\quad b:=b-\alpha \frac{db}{\sqrt{S_{db}}+\varepsilon}
$$

### Adam algorithm

$$
V_{dw}=0,\;S_{dw}=0\\
V_{dw}=\beta_1 V_{dw}+(1-\beta_1)dw,\quad V_{db}=\beta_1 V_{db}+(1-\beta_1)db\\
S_{dw}=\beta_2 S_{dw}+(1-\beta_2)dw^2,\quad S_{db}=\beta_2 S_{db}+(1-\beta_2)db^2\\
V_{dw}^{correct}=\frac{V_{dw}}{1-\beta_1^t},\quad S_{dw}^{correct}=\frac{S_{dw}}{1-\beta_2^t}\\
W:=W-\alpha \frac{V_{dw}^{correct}}{\sqrt{S_{dw}^{correct}}+\varepsilon}\\
defaults:\;\beta_1=0.9,\;\beta_2=0.999
$$
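An Adam loop on a toy quadratic cost $J(w)=\Vert w\Vert^2/2$ (so $dw=w$); the learning rate and iteration count are arbitrary:

```python
import numpy as np

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8  # common defaults; alpha arbitrary

w = np.array([2.0, -3.0])        # parameters of the toy cost, gradient dw = w
Vdw = np.zeros_like(w)
Sdw = np.zeros_like(w)

for t in range(1, 201):
    dw = w                                        # gradient of the toy cost
    Vdw = beta1 * Vdw + (1 - beta1) * dw          # momentum term
    Sdw = beta2 * Sdw + (1 - beta2) * dw ** 2     # RMS term
    Vc = Vdw / (1 - beta1 ** t)                   # bias corrections
    Sc = Sdw / (1 - beta2 ** t)
    w = w - alpha * Vc / (np.sqrt(Sc) + eps)
```

After a couple hundred steps the parameters have moved from norm 3.6 to near the minimum at 0.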

### Learning rate decay

$$
\alpha=\frac{1}{1+decayRate*epochNum}\alpha_0\\
\alpha=\frac{k}{\sqrt{epochNum}}\alpha_0
$$

### Local optima

In high dimensions, most zero-gradient points are saddle points rather than local optima; long plateaus are the bigger problem.

## Tuning process

hyperparameter search:

  • priority: learning rate first, then beta / hidden units / mini-batch size

  • Try random values

  • Coarse to fine

  • babysitting one model / training many models in parallel

Pick at random

appropriate scale for hyperparameters

  • learning rate

$$
\alpha\in[0.0001,\,1]\\
r=-4*np.random.rand()\\
\alpha=10^r
$$
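Sampling $\alpha$ on a log scale, exactly as in the formula above (1000 samples just to show the spread):

```python
import numpy as np

np.random.seed(0)
r = -4 * np.random.rand(1000)    # r uniform in (-4, 0]
alphas = 10.0 ** r               # alpha log-uniform in [1e-4, 1]
```

Each decade [1e-4, 1e-3], [1e-3, 1e-2], ... receives roughly a quarter of the samples, unlike uniform sampling which would almost never land below 1e-2.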

  • exponentially weighted averages

$$
\beta =0.9\dots0.999\\
1-\beta = 0.1\dots0.001\\
r\in[-3,-1],\quad 1-\beta=10^r,\quad \beta=1-10^r
$$

### Normalizing activations

$$
\mu =\frac{1}{m}\sum _{i=1}^{m}x^{(i)}\\
x:=x-\mu\\
\sigma ^2=\frac{1}{m}\sum_{i=1}^m(x^{(i)})^2\\
x/=\sigma
$$

### Implementing Batch Norm

$$
\mu =\frac{1}{m}\sum _{i=1}^{m}z^{(i)}\\
\sigma ^2=\frac{1}{m}\sum_{i=1}^m(z^{(i)}-\mu)^2\\
z^{(i)}_{norm}=\frac{z^{(i)}-\mu}{\sqrt{\sigma^2+\varepsilon}}\\
\tilde z^{(i)}=\gamma z^{(i)}_{norm}+\beta\quad(if\;\;\gamma=\sqrt{\sigma^2+\varepsilon},\;\beta=\mu,\;then\;\tilde z^{(i)}=z^{(i)})
$$
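Batch Norm forward pass for one layer (toy pre-activations; with $\gamma=1,\beta=0$, $\tilde z$ equals $z_{norm}$ here):

```python
import numpy as np

np.random.seed(4)
Z = np.random.randn(3, 10) * 2 + 1        # toy pre-activations, (units, m)
gamma = np.ones((3, 1))                    # learnable scale
beta = np.zeros((3, 1))                    # learnable shift
eps = 1e-8

mu = np.mean(Z, axis=1, keepdims=True)
var = np.mean((Z - mu) ** 2, axis=1, keepdims=True)
Z_norm = (Z - mu) / np.sqrt(var + eps)
Z_tilde = gamma * Z_norm + beta
```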

### Fitting Batch Norm into a network

$$
x\rightarrow z^{[1]}\rightarrow \tilde z^{[1]}\rightarrow a^{[1]}\rightarrow z^{[2]}\rightarrow \tilde z^{[2]}\rightarrow a^{[2]}\rightarrow\dots\\
\beta^{[l]}:=\beta^{[l]}-\alpha\, d\beta^{[l]}
$$

### At test time

Estimate $\mu$ and $\sigma^2$ with an exponentially weighted average over the mini-batches seen during training.

## Softmax regression

$$
Softmax\;\;layer\;\;activation\;\;function\;(C=4\;classes):\\
t=e^{z^{[l]}}\\
a^{[l]}=\frac{t}{\sum_{i=1}^4t_i},\quad a^{[l]}_i=\frac{t_i}{\sum_{i=1}^4t_i}\\
\mathcal{L}(\hat y,y)=-\sum_{j=1}^4y_j\log \hat y_j
$$
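Softmax activation and cross-entropy loss for C = 4 classes (toy logits; the max-shift inside `softmax` is a standard numerical-stability trick not shown in the formula):

```python
import numpy as np

def softmax(z):
    t = np.exp(z - np.max(z))    # shift by max for numerical stability
    return t / np.sum(t)

z = np.array([5.0, 2.0, -1.0, 3.0])   # toy z^[l] for C = 4 classes
a = softmax(z)

y = np.array([1.0, 0.0, 0.0, 0.0])    # one-hot label
loss = -np.sum(y * np.log(a))         # cross-entropy: -log(a[true class])
```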

## Structuring ML projects

### ML strategy

### Single number evaluation metric

Precision: of the examples classified as positive, n% actually are

Recall: of the actual positives, n% were correctly recognized

F1 score: $\frac{2}{\frac{1}{P}+\frac{1}{R}}$ (harmonic mean of precision and recall)

### Optimizing and satisficing metrics

cost = accuracy - 0.5 * running time

With N metrics: optimize 1, require the other (N-1) to reach a threshold.
