Regularization in Neural Networks


    • 1. Logistic regression
    • 2. Neural network “Frobenius norm”
    • 3. Inverted dropout

Adding regularization will often help to prevent overfitting (the high-variance problem).

1. Logistic regression

Recall the optimization objective used during training:

$$\min_{w,b} J(w,b), \qquad w \in \mathbb{R}^{n_x},\ b \in \mathbb{R} \tag{1-1}$$

where
$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) \tag{1-2}$$

$L_2$ regularization (most commonly used):
$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) + \frac{\lambda}{2m}\|w\|_2^2 \tag{1-3}$$

where
$$\|w\|_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^T w \tag{1-4}$$

Why do we regularize only the parameter w? Because w is usually a high-dimensional parameter vector while b is a scalar; almost all of the parameters are in w rather than in b.
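As a minimal NumPy sketch (not taken from the original notes; the names `w`, `b`, `X`, `Y`, and `lambd` are illustrative), the $L_2$-regularized cost of equation (1-3) could be computed as:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_regularized_cost(w, b, X, Y, lambd):
    """Cross-entropy cost plus the L2 penalty (lambda / (2m)) * ||w||_2^2."""
    m = X.shape[1]                       # X has shape (n_x, m)
    A = sigmoid(np.dot(w.T, X) + b)      # predictions y_hat, shape (1, m)
    cross_entropy = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    l2_penalty = (lambd / (2 * m)) * np.sum(np.square(w))
    return cross_entropy + l2_penalty
```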
$L_1$ regularization:
$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) + \frac{\lambda}{m}\|w\|_1 \tag{1-5}$$

where
$$\|w\|_1 = \sum_{j=1}^{n_x} |w_j| \tag{1-6}$$

With $L_1$ regularization, w will end up being sparse; in other words, the w vector will have a lot of zeros in it. This can help compress the model a little.
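As a hedged illustration (the helper name and its arguments are assumptions, not from the notes), the $L_1$ penalty of equation (1-5) adds a constant-magnitude pull of $(\lambda/m)\,\mathrm{sign}(w)$ to each weight's gradient, which is what drives many weights to exactly zero:

```python
import numpy as np

def l1_penalty_and_grad(w, lambd, m):
    """L1 penalty (lambda / m) * ||w||_1 and its subgradient contribution.

    The constant-size pull (lambda / m) * sign(w), added to dw from backprop,
    pushes small weights all the way to zero, producing a sparse w.
    """
    penalty = (lambd / m) * np.sum(np.abs(w))
    grad = (lambd / m) * np.sign(w)
    return penalty, grad
```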

2. Neural network “Frobenius norm”

$$J\big(w^{[1]}, b^{[1]}, \cdots, w^{[L]}, b^{[L]}\big) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) + \frac{\lambda}{2m}\sum_{l=1}^{L} \big\|w^{[l]}\big\|_F^2 \tag{2-1}$$

where
$$\big\|w^{[l]}\big\|_F^2 = \sum_{i=1}^{n^{[l-1]}} \sum_{j=1}^{n^{[l]}} \big(w_{ij}^{[l]}\big)^2 \tag{2-2}$$
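A minimal sketch, assuming the layer weights are stored in a dict `parameters` with keys `'W1'`, ..., `'WL'` (these names are not from the notes), of the penalty term in equation (2-1):

```python
import numpy as np

def frobenius_penalty(parameters, lambd, m, L):
    """Sum of squared Frobenius norms over all L layers, scaled by lambda / (2m)."""
    total = 0.0
    for l in range(1, L + 1):
        W = parameters["W" + str(l)]     # weight matrix of layer l
        total += np.sum(np.square(W))    # ||W^[l]||_F^2
    return (lambd / (2 * m)) * total
```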

$L_2$ regularization is also called weight decay:
$$\begin{aligned} dw^{[l]} &= (\text{from backprop}) + \frac{\lambda}{m} w^{[l]} \\ w^{[l]} &:= w^{[l]} - \alpha\, dw^{[l]} \\ &= \Big(1 - \frac{\alpha\lambda}{m}\Big) w^{[l]} - \alpha\,(\text{from backprop}) \end{aligned} \tag{2-3}$$

This keeps the weights $w$ from growing too large, which helps avoid overfitting.
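As a sketch under assumed names (`dW_backprop` stands for the unregularized gradient from backprop), one update step of equation (2-3) could look like:

```python
import numpy as np

def weight_decay_update(W, dW_backprop, lambd, m, alpha):
    """One gradient-descent step with L2 regularization (weight decay).

    Equivalent to shrinking W by the factor (1 - alpha * lambda / m)
    before applying the usual backprop gradient step.
    """
    dW = dW_backprop + (lambd / m) * W
    return W - alpha * dW
```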

3. Inverted dropout

For each training example, a different random subset of nodes can be dropped.
Inverted dropout (dropout must be applied in both the forward and the backward pass):

```python
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob   # dropout mask for layer 3
a3 = np.multiply(a3, d3)   # a3 * d3, element-wise multiplication
a3 /= keep_prob            # scale up so the expected value of a3 is not reduced (inverted dropout)
z4 = np.dot(w4, a3) + b4   # since a3 keeps its expected value, so does z[4]
```

By dividing by keep_prob, this inverted dropout technique ensures that the expected value of a3 remains the same. This makes test time easier because there is less of a scaling problem.
No dropout is used at test time.
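As a small numerical check (a sketch, not part of the original notes), dividing by keep_prob keeps the mean activation roughly unchanged, which is why no extra scaling is needed at test time:

```python
import numpy as np

np.random.seed(0)
keep_prob = 0.8
a3 = np.random.rand(50, 1000)                  # stand-in activations for layer 3

d3 = np.random.rand(*a3.shape) < keep_prob     # dropout mask
a3_dropped = (a3 * d3) / keep_prob             # inverted dropout scaling

print(a3.mean(), a3_dropped.mean())            # the two means are close, so the
# full network can be used at test time with no compensation for dropout
```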
