Introduction to Deep Learning (2)

1. Optimization methods for finding the optimal weight parameters
Parameter update methods (gradient descent and its variants)

1) SGD (stochastic gradient descent)

import numpy as np   # NumPy is used by all of the optimizer / layer implementations below


class SGD:

    """Stochastic gradient descent: W <- W - lr * dL/dW"""

    def __init__(self, lr=0.01):
        self.lr = lr
        
    def update(self, params, grads):
        for key in params.keys():
            params[key] -= self.lr * grads[key]
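
All optimizers in this post share the same interface: params and grads are dictionaries keyed by parameter name. A minimal usage sketch (the parameter shapes and gradient values are made up for illustration):

params = {'W1': np.random.randn(2, 3), 'b1': np.zeros(3)}
grads  = {'W1': np.ones((2, 3)),       'b1': np.ones(3)}

optimizer = SGD(lr=0.1)
optimizer.update(params, grads)   # each parameter moves by -0.1 * its gradient
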
2) Momentum
class Momentum:

    """Momentum SGD"""

    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None
        
    def update(self, params, grads):
        if self.v is None:
            self.v = {}
            for key, val in params.items():                                
                self.v[key] = np.zeros_like(val)
                
        for key in params.keys():
            self.v[key] = self.momentum*self.v[key] - self.lr*grads[key] 
            params[key] += self.v[key]
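
In equations, the update the Momentum class implements is (v is the velocity, α the momentum coefficient, η the learning rate):

$$v \leftarrow \alpha v - \eta \frac{\partial L}{\partial W}, \qquad W \leftarrow W + v$$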

3) Nesterov

class Nesterov:

    """Nesterov's Accelerated Gradient (http://arxiv.org/abs/1212.0901)"""

    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None
        
    def update(self, params, grads):
        if self.v is None:
            self.v = {}
            for key, val in params.items():
                self.v[key] = np.zeros_like(val)
            
        for key in params.keys():
            self.v[key] *= self.momentum
            self.v[key] -= self.lr * grads[key]
            params[key] += self.momentum * self.momentum * self.v[key]
            params[key] -= (1 + self.momentum) * self.lr * grads[key]
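
For reference, standard Nesterov accelerated gradient evaluates the gradient at the look-ahead position W + αv rather than at the current parameters:

$$v \leftarrow \alpha v - \eta \nabla L(W + \alpha v), \qquad W \leftarrow W + v$$

The class above follows the parameter-space reformulation from the cited paper, so it can reuse the gradient computed at the current parameters and keep the same update(params, grads) interface as the other optimizers.
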
4) AdaGrad

class AdaGrad:

    """AdaGrad"""

    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None
        
    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)
            
        for key in params.keys():
            self.h[key] += grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
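
In equations (⊙ denotes element-wise multiplication; the 1e-7 in the code plays the role of ε and prevents division by zero):

$$h \leftarrow h + \frac{\partial L}{\partial W} \odot \frac{\partial L}{\partial W}, \qquad W \leftarrow W - \eta \frac{1}{\sqrt{h} + \varepsilon} \frac{\partial L}{\partial W}$$

Each weight gets its own effective learning rate, which shrinks as that weight's squared gradients accumulate in h.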
    
5) Adam

class Adam:

    """Adam"""

    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.iter = 0
        self.m = None
        self.v = None
        
    def update(self, params, grads):
        if self.m is None:
            self.m, self.v = {}, {}
            for key, val in params.items():
                self.m[key] = np.zeros_like(val)
                self.v[key] = np.zeros_like(val)
        
        self.iter += 1
        lr_t = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter)
        
        for key in params.keys():
            # equivalent to the standard form:
            #   m = beta1 * m + (1 - beta1) * grads
            #   v = beta2 * v + (1 - beta2) * grads**2
            self.m[key] += (1 - self.beta1) * (grads[key] - self.m[key])
            self.v[key] += (1 - self.beta2) * (grads[key]**2 - self.v[key])
            params[key] -= lr_t * self.m[key] / (np.sqrt(self.v[key]) + 1e-7)
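
In the standard formulation (g is the gradient, t the step count):

$$m \leftarrow \beta_1 m + (1-\beta_1) g, \qquad v \leftarrow \beta_2 v + (1-\beta_2) g^2$$
$$\hat{m} = \frac{m}{1-\beta_1^t}, \qquad \hat{v} = \frac{v}{1-\beta_2^t}, \qquad W \leftarrow W - \eta \frac{\hat{m}}{\sqrt{\hat{v}} + \varepsilon}$$

The implementation above folds the two bias-correction factors into the effective learning rate lr_t, which gives the same update up to where the small constant 1e-7 is added.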

2. Initial weight values
Xavier initialization: if the previous layer has n nodes, draw the initial values from a distribution with standard deviation 1/√n.
Xavier initialization suits the sigmoid and tanh activation functions.
He initialization: draw the initial values from a distribution with standard deviation √(2/n).
He initialization suits the ReLU activation function.
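
A minimal sketch of both initializations for one fully connected layer, assuming the previous layer has node_num nodes (the size 100 is made up for illustration):

node_num = 100  # number of nodes in the previous layer (illustrative)

# Xavier: standard deviation 1 / sqrt(n), suited to sigmoid / tanh
w_xavier = np.random.randn(node_num, node_num) / np.sqrt(node_num)

# He: standard deviation sqrt(2 / n), suited to ReLU
w_he = np.random.randn(node_num, node_num) * np.sqrt(2.0 / node_num)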

3. Batch Normalization (BN layer)
Advantages:

1) Learning proceeds quickly, so a larger learning rate can be used.
2) Learning is less dependent on the initial weight values.
3) It suppresses overfitting.

Understanding this layer: informally, BN (batch normalization) is inserted before the activation function of each layer.
Through normalization, it forces the distribution of the inputs to every neuron in a layer back to approximately the standard normal distribution with mean 0 and variance 1.
This keeps the activation inputs in the region where the nonlinear function is most sensitive to its input, so a small change in the input causes a larger change in the loss and the gradients stay large.
That avoids the vanishing-gradient problem, and larger gradients also mean faster convergence, which speeds up training considerably.

Forward-pass steps:

1) Compute the mini-batch mean.
2) Compute the mini-batch variance.
3) Normalize the input to zero mean and unit variance.
4) Scale and shift the normalized data.
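
In equations, for a mini-batch B = {x_1, ..., x_m} the four steps are:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$$
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$

γ (scale) and β (shift) are learned along with the other parameters; ε is a small constant (10e-7 in the code below) that prevents division by zero.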

Implementation:

class BatchNormalization:
    """
    http://arxiv.org/abs/1502.03167
    """
    def __init__(self, gamma, beta, momentum=0.9, running_mean=None, running_var=None):
        self.gamma = gamma
        self.beta = beta
        self.momentum = momentum
        self.input_shape = None 

        self.running_mean = running_mean
        self.running_var = running_var  
        
        self.batch_size = None
        self.xc = None
        self.std = None
        self.dgamma = None
        self.dbeta = None

    def forward(self, x, train_flg=True):
        self.input_shape = x.shape
        if x.ndim != 2:
            N, C, H, W = x.shape
            x = x.reshape(N, -1)

        out = self.__forward(x, train_flg)
        
        return out.reshape(*self.input_shape)
            
    def __forward(self, x, train_flg):
        if self.running_mean is None:
            N, D = x.shape
            self.running_mean = np.zeros(D)
            self.running_var = np.zeros(D)
                        
        if train_flg:
            mu = x.mean(axis=0)
            xc = x - mu
            var = np.mean(xc**2, axis=0)
            std = np.sqrt(var + 10e-7)
            xn = xc / std
            
            self.batch_size = x.shape[0]
            self.xc = xc
            self.xn = xn
            self.std = std
            self.running_mean = self.momentum * self.running_mean + (1-self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1-self.momentum) * var            
        else:
            xc = x - self.running_mean
            xn = xc / ((np.sqrt(self.running_var + 10e-7)))
            
        out = self.gamma * xn + self.beta 
        return out

    def backward(self, dout):
        if dout.ndim != 2:
            N, C, H, W = dout.shape
            dout = dout.reshape(N, -1)

        dx = self.__backward(dout)

        dx = dx.reshape(*self.input_shape)
        return dx

    def __backward(self, dout):
        dbeta = dout.sum(axis=0)
        dgamma = np.sum(self.xn * dout, axis=0)
        dxn = self.gamma * dout
        dxc = dxn / self.std
        dstd = -np.sum((dxn * self.xc) / (self.std * self.std), axis=0)
        dvar = 0.5 * dstd / self.std
        dxc += (2.0 / self.batch_size) * self.xc * dvar
        dmu = np.sum(dxc, axis=0)
        dx = dxc - dmu / self.batch_size
        
        self.dgamma = dgamma
        self.dbeta = dbeta
        
        return dx
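
A minimal usage sketch of the layer above (the batch size and feature count are made up for illustration); in a network it would sit between an Affine layer and its activation, as described earlier:

x = np.random.randn(16, 10)            # mini-batch of 16 samples, 10 features
gamma = np.ones(10)                    # scale parameter, learnable
beta = np.zeros(10)                    # shift parameter, learnable

bn = BatchNormalization(gamma, beta)
out = bn.forward(x, train_flg=True)    # per-feature mean ~0, variance ~1, then scaled and shifted
dx = bn.backward(np.ones_like(out))    # gradients for gamma / beta are stored in bn.dgamma / bn.dbeta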

4. Regularization
Overfitting refers to the state where a model fits the training data well but cannot generalize to data not contained in the training set.
Causes of overfitting:

1) The model has a large number of parameters and high expressive power.
2) There is too little training data.

Countermeasures:
(1) L2 regularization (weight decay)
Penalize large weights during learning to suppress overfitting. In the example it is added at the last layer (to the loss), and it must appear in both the forward and the backward pass; a minimal sketch is given after the Dropout code below.
(2) Dropout
A method that randomly deletes neurons during learning.
Implementation:

class Dropout:
    def __init__(self, dropout_ratio=0.5):
        self.dropout_ratio = dropout_ratio
        self.mask = None

    def forward(self, x, train_flg=True):
        if train_flg:
            self.mask = np.random.rand(*x.shape) > self.dropout_ratio
            return x * self.mask
        else:
            return x * (1.0 - self.dropout_ratio)

    def backward(self, dout):
        return dout * self.mask

During training, the forward pass stores the neurons to be dropped as False in self.mask and zeroes them out; the backward pass applies the same mask to dout, so gradients flow only through the neurons that transmitted a signal forward (the same behaviour as ReLU). At test time every neuron is used, and the output is scaled by (1 - dropout_ratio).
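
For countermeasure (1), here is a minimal sketch of L2 weight decay. It assumes a network object that keeps its weights in a dict called params and provides loss(x, t) and gradient(x, t) methods; network, weight_decay_lambda and these method names are illustrative, not a fixed API:

weight_decay_lambda = 0.1          # strength of the penalty (illustrative value)

def loss_with_decay(network, x, t):
    # forward pass: add 0.5 * lambda * ||W||^2 for every weight matrix to the data loss
    weight_decay = 0.0
    for key, W in network.params.items():      # hypothetical params dict, keys like 'W1', 'b1'
        if key.startswith('W'):
            weight_decay += 0.5 * weight_decay_lambda * np.sum(W ** 2)
    return network.loss(x, t) + weight_decay

def gradient_with_decay(network, x, t):
    # backward pass: the derivative of the penalty is lambda * W, added to each weight gradient
    grads = network.gradient(x, t)
    for key, W in network.params.items():
        if key.startswith('W'):
            grads[key] += weight_decay_lambda * W
    return grads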
