- 1. Logistic regression
- 2. Neural network “Frobenius norm”
- 3. Inverted dropout
Adding regularization will often help to prevent overfitting (the high-variance problem).
1. Logistic regression
Recall the optimization objective used during training:
$$\min_{w,b} J(w,b), \qquad w \in \mathbb{R}^{n_x},\ b \in \mathbb{R} \tag{1-1}$$
where
$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) \tag{1-2}$$
$L_2$ regularization (most commonly used):
$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) + \frac{\lambda}{2m}\|w\|_2^2 \tag{1-3}$$
where
$$\|w\|_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^T w \tag{1-4}$$
Why do we regularize only the parameter $w$? Because $w$ is usually a high-dimensional parameter vector while $b$ is a scalar; almost all of the parameters are in $w$ rather than $b$.
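A minimal numpy sketch of equation (1-3), assuming a sigmoid output with the cross-entropy loss; the function and the names `X`, `Y`, `lambd` are illustrative, not from the notes:

```python
import numpy as np

def l2_regularized_cost(w, b, X, Y, lambd):
    """Cost of eq. (1-3): cross-entropy loss plus the (lambda / 2m) * ||w||_2^2 penalty.
    Shapes: X is (n_x, m), Y is (1, m), w is (n_x, 1), b is a scalar."""
    m = X.shape[1]
    y_hat = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))                  # sigmoid(w^T x + b)
    loss = -np.sum(Y * np.log(y_hat) + (1 - Y) * np.log(1 - y_hat)) / m  # cross-entropy
    penalty = (lambd / (2 * m)) * np.sum(np.square(w))                   # (lambda / 2m) * w^T w
    return loss + penalty
```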
$L_1$ regularization:
$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) + \frac{\lambda}{m}\|w\|_1 \tag{1-5}$$
where
$$\|w\|_1 = \sum_{j=1}^{n_x} |w_j| \tag{1-6}$$
With $L_1$ regularization, $w$ will end up being sparse; in other words, the $w$ vector will have a lot of zeros in it, which can help compress the model a little.
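Only the penalty term changes relative to equation (1-3); a sketch of equation (1-5)'s penalty, with the same illustrative naming:

```python
import numpy as np

def l1_penalty(w, m, lambd):
    """Penalty term of eq. (1-5): (lambda / m) * ||w||_1.
    Training with this term tends to drive many entries of w to exactly zero."""
    return (lambd / m) * np.sum(np.abs(w))
```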
2. Neural network “Frobenius norm”
$$J\big(w^{[1]}, b^{[1]}, \cdots, w^{[L]}, b^{[L]}\big) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) + \frac{\lambda}{2m}\sum_{l=1}^{L} \big\|w^{[l]}\big\|_F^2 \tag{2-1}$$
where
$$\big\|w^{[l]}\big\|_F^2 = \sum_{i=1}^{n^{[l-1]}} \sum_{j=1}^{n^{[l]}} \big(w_{ij}^{[l]}\big)^2 \tag{2-2}$$
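A sketch of the penalty term in equation (2-1), assuming the layer weight matrices $w^{[1]}, \cdots, w^{[L]}$ are kept in a Python list (an illustrative storage choice, not from the notes):

```python
import numpy as np

def frobenius_penalty(weights, m, lambd):
    """Penalty of eq. (2-1): (lambda / 2m) * sum over layers of ||w[l]||_F^2.
    `weights` is a list of the weight matrices w[1], ..., w[L]."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
```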
$L_2$ regularization is also called weight decay:
$$\begin{aligned} dw^{[l]} &= (\text{from backprop}) + \frac{\lambda}{m} w^{[l]} \\ w^{[l]} &:= w^{[l]} - \alpha\, dw^{[l]} = \Big(1 - \frac{\alpha\lambda}{m}\Big) w^{[l]} - \alpha\,(\text{from backprop}) \end{aligned} \tag{2-3}$$
This keeps the weights $w$ from becoming too large and thus helps avoid overfitting.
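A sketch of the update in equation (2-3); `dW_backprop` stands for the gradient of the unregularized loss (an illustrative name) and `alpha` is the learning rate:

```python
def weight_decay_update(W, dW_backprop, alpha, lambd, m):
    """Gradient step of eq. (2-3). Expanding the update shows W is first shrunk by the
    factor (1 - alpha * lambd / m), which is why L2 regularization is called weight decay."""
    dW = dW_backprop + (lambd / m) * W     # regularized gradient
    return W - alpha * dW                  # = (1 - alpha*lambd/m) * W - alpha * dW_backprop
```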
3. Inverted dropout
For each training example, a different random subset of nodes can be dropped.
Inverted dropout (dropout must be applied in both the forward and backward passes):
    d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # random mask, each unit kept with probability keep_prob
    a3 = np.multiply(a3, d3)   # a3 * d3, element-wise multiplication
    a3 /= keep_prob            # inverted dropout: keeps the expected value of a3 unchanged
    z4 = np.dot(w4, a3) + b4   # so the expected value of z[4] is also unchanged
By dividing by keep_prob, the inverted dropout technique ensures that the expected value of a3 remains the same. This makes test time easier, because there is less of a scaling problem.
Dropout is not used at test time.
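A self-contained sketch of the two points above (the shapes, seed, and gradient values are illustrative): the backward pass reuses the forward-pass mask d3 and the same 1/keep_prob scaling, while nothing dropout-related appears at test time.

```python
import numpy as np

np.random.seed(0)
keep_prob = 0.8
w4, b4 = np.random.randn(5, 3), np.zeros((5, 1))
a3 = np.random.randn(3, 10)                      # layer-3 activations for 10 examples

# Training-time backward pass: apply the same mask and scaling to the gradient.
d3 = np.random.rand(*a3.shape) < keep_prob       # the mask created in the forward pass
da3 = np.random.randn(*a3.shape)                 # gradient flowing back into layer 3 (illustrative)
da3 = np.multiply(da3, d3)                       # zero out gradients of dropped units
da3 /= keep_prob                                 # same 1/keep_prob scaling as the forward pass

# Test time: no dropout, no mask, no scaling -- use the full activations as-is.
z4_test = np.dot(w4, a3) + b4
```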