NNDL Experiment 5: Feedforward Neural Networks (2), Automatic Gradient Computation & Optimization Problems

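4.3 Automatic Gradient Computation and Predefined Operators
Although a modular design lets us assemble a neural network fairly cleanly, computing the gradients of each module by hand is still tedious and error-prone. Deep learning frameworks already package automatic gradient computation, so we can focus on the model architecture instead of spending effort on deriving gradients.
Paddle provides the paddle.nn.Layer class for quickly implementing custom layers and models. Both models and layers can be built by extending paddle.nn.Layer; a model is just a special kind of layer. The code in this post uses PyTorch, where torch.nn.Module plays the same role.
An operator that inherits from paddle.nn.Layer can directly call other operators that inherit from paddle.nn.Layer. The framework automatically detects these nested operators, computes their gradients, and updates their parameters during optimization.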
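To make "automatic gradient computation" concrete, here is a minimal sketch of reverse-mode differentiation on scalars, written in plain Python. It is not how Paddle or PyTorch are implemented; the `Value` class and its methods are hypothetical names used only for illustration. Each operation records its inputs and a small function that applies the chain rule, and `backward()` replays those functions in reverse order.

```python
import math


class Value:
    """A scalar that remembers how it was computed (hypothetical helper)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward_fn = None  # pushes self.grad back to the parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad

        out._backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def backward_fn():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward_fn = backward_fn
        return out

    def sigmoid(self):
        s = 1.0 / (1.0 + math.exp(-self.data))
        out = Value(s, (self,))

        def backward_fn():
            self.grad += s * (1.0 - s) * out.grad

        out._backward_fn = backward_fn
        return out

    def backward(self):
        # Collect nodes so that every node appears after its parents,
        # then apply the chain rule from the output back to the inputs.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            if node._backward_fn is not None:
                node._backward_fn()


# One "neuron": a = sigmoid(w * x + b)
w, x, b = Value(0.5), Value(2.0), Value(-1.0)
a = (w * x + b).sigmoid()
a.backward()
print(a.data, w.grad, b.grad)
```

No derivative was written for the neuron as a whole; the gradients of `w` and `b` fall out of the per-operation rules. This is the bookkeeping a framework takes over when `backward()` is called on a loss.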
4.3.1 Reimplementing the Feedforward Network with Predefined Operators


import torch
import torch.nn as nn
import torch.nn.functional as F


class Model_MLP_L2_V4(torch.nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(Model_MLP_L2_V4, self).__init__()
        # First linear layer; weights re-initialized from N(0, 0.1^2)
        self.fc1 = nn.Linear(input_size, hidden_size)
        w = torch.normal(0, 0.1, size=(hidden_size, input_size), requires_grad=True)
        self.fc1.weight = nn.Parameter(w)

        # Second linear layer; weights re-initialized from N(0, 0.1^2)
        self.fc2 = nn.Linear(hidden_size, output_size)
        w = torch.normal(0, 0.1, size=(output_size, hidden_size), requires_grad=True)
        self.fc2.weight = nn.Parameter(w)

        # Use 'torch.sigmoid' as the Logistic activation function
        self.act_fn = torch.sigmoid

    # Forward pass
    def forward(self, inputs):
        z1 = self.fc1(inputs.to(torch.float32))
        a1 = self.act_fn(z1)
        z2 = self.fc2(a1)
        a2 = self.act_fn(z2)
        return a2

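4.3.2 Completing the Runner Class
Building on the RunnerV2_1 class from the previous section, the RunnerV2_2 class in this section uses automatic gradient computation during training. When saving the model it obtains the parameters with state_dict, and when loading the model it restores them with load_state_dict.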



class RunnerV2_2(object):
    def __init__(self, model, optimizer, metric, loss_fn, **kwargs):
        self.model = model
        self.optimizer = optimizer
        self.loss_fn = loss_fn
        self.metric = metric

        # Evaluation metric history during training
        self.train_scores = []
        self.dev_scores = []

        # Loss history during training
        self.train_loss = []
        self.dev_loss = []

    def train(self, train_set, dev_set, **kwargs):
        # Number of training epochs; defaults to 0 if not given
        num_epochs = kwargs.get("num_epochs", 0)
        # Logging frequency; defaults to 100 if not given
        log_epochs = kwargs.get("log_epochs", 100)
        # Path for saving the model; defaults to "best_model.pdparams"
        save_path = kwargs.get("save_path", "best_model.pdparams")

        # Optional custom logging function; defaults to None
        custom_print_log = kwargs.get("custom_print_log", None)

        # Best dev score seen so far
        best_score = 0
        # Train for num_epochs epochs
        for epoch in range(num_epochs):
            # Switch to training mode (evaluate() below switches to eval mode)
            self.model.train()
            X, y = train_set

            # Model predictions
            logits = self.model(X.to(torch.float32))
            # Cross-entropy loss
            trn_loss = self.loss_fn(logits, y)
            self.train_loss.append(trn_loss.item())
            # Evaluation metric
            trn_score = self.metric(logits, y).item()
            self.train_scores.append(trn_score)

            # Compute parameter gradients automatically
            trn_loss.backward()
            if custom_print_log is not None:
                # Print the gradients of each layer
                custom_print_log(self)

            # Update the parameters
            self.optimizer.step()
            # Clear the gradients
            self.optimizer.zero_grad()

            dev_score, dev_loss = self.evaluate(dev_set)
            # Save the model whenever the dev score improves
            if dev_score > best_score:
                self.save_model(save_path)
                print(f"[Evaluate] best accuracy performence has been updated: {best_score:.5f} --> {dev_score:.5f}")
                best_score = dev_score

            if log_epochs and epoch % log_epochs == 0:
                print(f"[Train] epoch: {epoch}/{num_epochs}, loss: {trn_loss.item()}")

    @torch.no_grad()
    def evaluate(self, data_set):
        # Switch the model to evaluation mode
        self.model.eval()

        X, y = data_set
        # Model outputs
        logits = self.model(X)
        # Loss
        loss = self.loss_fn(logits, y).item()
        self.dev_loss.append(loss)
        # Evaluation metric
        score = self.metric(logits, y).item()
        self.dev_scores.append(score)
        return score, loss

    # Inference: 'torch.no_grad()' disables gradient computation and storage
    @torch.no_grad()
    def predict(self, X):
        # Switch the model to evaluation mode
        self.model.eval()
        return self.model(X)

    # Save the parameters obtained from 'model.state_dict()'
    def save_model(self, saved_path):
        torch.save(self.model.state_dict(), saved_path)

    # Load the parameters with 'model.load_state_dict()'
    def load_model(self, model_path):
        state_dict = torch.load(model_path)
        self.model.load_state_dict(state_dict)

4.3.3 Model Training


# Model
input_size = 2
hidden_size = 5
output_size = 1
model = Model_MLP_L2_V4(input_size=input_size, hidden_size=hidden_size, output_size=output_size)

# Loss function
loss_fn = F.binary_cross_entropy

# Optimizer
learning_rate = 0.2
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# Evaluation metric
metric = accuracy

# Other settings
epoch = 2000
saved_path = 'best_model.pdparams'

# Instantiate RunnerV2_2 with the training configuration
runner = RunnerV2_2(model, optimizer, metric, loss_fn)

runner.train([X_train, y_train], [X_dev, y_dev], num_epochs=epoch, log_epochs=50, save_path=saved_path)

Results:

[Evaluate] best accuracy performence has been updated: 0.00000 --> 0.45625
[Train] epoch: 0/2000, loss: 0.7013065218925476
[Evaluate] best accuracy performence has been updated: 0.45625 --> 0.78750
[Evaluate] best accuracy performence has been updated: 0.78750 --> 0.82500
[Train] epoch: 50/2000, loss: 0.6864297389984131
[Train] epoch: 100/2000, loss: 0.664617121219635
[Train] epoch: 150/2000, loss: 0.603905200958252
[Train] epoch: 200/2000, loss: 0.5093539953231812
[Train] epoch: 250/2000, loss: 0.4296819567680359
[Train] epoch: 300/2000, loss: 0.38168662786483765
[Train] epoch: 350/2000, loss: 0.35333654284477234
[Train] epoch: 400/2000, loss: 0.3346514403820038
[Train] epoch: 450/2000, loss: 0.32097792625427246
[Train] epoch: 500/2000, loss: 0.31027689576148987
[Evaluate] best accuracy performence has been updated: 0.82500 --> 0.83125
[Train] epoch: 550/2000, loss: 0.30156683921813965
[Evaluate] best accuracy performence has been updated: 0.83125 --> 0.83750
[Train] epoch: 600/2000, loss: 0.29430365562438965
[Evaluate] best accuracy performence has been updated: 0.83750 --> 0.84375
[Evaluate] best accuracy performence has been updated: 0.84375 --> 0.85000
[Train] epoch: 650/2000, loss: 0.2881467938423157
[Evaluate] best accuracy performence has been updated: 0.85000 --> 0.85625
[Train] epoch: 700/2000, loss: 0.282865434885025
[Train] epoch: 750/2000, loss: 0.2782956063747406
[Train] epoch: 800/2000, loss: 0.2743169665336609
[Train] epoch: 850/2000, loss: 0.27083858847618103
[Train] epoch: 900/2000, loss: 0.2677897810935974
[Train] epoch: 950/2000, loss: 0.2651140093803406
[Train] epoch: 1000/2000, loss: 0.26276475191116333
[Train] epoch: 1050/2000, loss: 0.2607026994228363
[Train] epoch: 1100/2000, loss: 0.2588939070701599
[Evaluate] best accuracy performence has been updated: 0.85625 --> 0.86250
[Train] epoch: 1150/2000, loss: 0.2573087811470032
[Train] epoch: 1200/2000, loss: 0.2559208869934082
[Train] epoch: 1250/2000, loss: 0.25470679998397827
[Evaluate] best accuracy performence has been updated: 0.86250 --> 0.86875
[Train] epoch: 1300/2000, loss: 0.25364556908607483
[Evaluate] best accuracy performence has been updated: 0.86875 --> 0.87500
[Train] epoch: 1350/2000, loss: 0.25271841883659363
[Train] epoch: 1400/2000, loss: 0.25190865993499756
[Train] epoch: 1450/2000, loss: 0.2512013018131256
[Train] epoch: 1500/2000, loss: 0.2505832314491272
[Train] epoch: 1550/2000, loss: 0.2500428557395935
[Train] epoch: 1600/2000, loss: 0.2495698183774948
[Evaluate] best accuracy performence has been updated: 0.87500 --> 0.88125
[Train] epoch: 1650/2000, loss: 0.2491552084684372
[Train] epoch: 1700/2000, loss: 0.2487911432981491
[Train] epoch: 1750/2000, loss: 0.24847082793712616
[Train] epoch: 1800/2000, loss: 0.24818828701972961
[Train] epoch: 1850/2000, loss: 0.2479383498430252
[Train] epoch: 1900/2000, loss: 0.24771662056446075
[Train] epoch: 1950/2000, loss: 0.24751918017864227

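4.3.4 Performance Evaluation
Evaluate the best model obtained from training on the test data and look at its accuracy and loss on the test set. The code is as follows:

# Evaluate the model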
runner.load_model("best_model.pdparams")
score, loss = runner.evaluate([X_test, y_test])
print("[Test] score/loss: {:.4f}/{:.4f}".format(score, loss))

Results:

[Test] score/loss: 0.9050/0.2228
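
The score above is an accuracy. The `accuracy` function passed to the Runner is not shown in this post, so the following is only a sketch of how such a metric can be computed for binary classification. It assumes the predictions are probabilities thresholded at 0.5 and works on plain Python lists rather than tensors; `binary_accuracy` and the sample values are hypothetical.

```python
def binary_accuracy(preds, labels, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the label."""
    hits = 0
    for p, y in zip(preds, labels):
        predicted_class = 1.0 if p >= threshold else 0.0
        if predicted_class == y:
            hits += 1
    return hits / len(labels)


# Made-up probabilities and targets for illustration
probs = [0.91, 0.12, 0.55, 0.43, 0.78]
labels = [1.0, 0.0, 0.0, 0.0, 1.0]
acc = binary_accuracy(probs, labels)
print(acc)
```

Only the third sample is misclassified here (0.55 is thresholded to class 1 while the label is 0), so four of the five predictions count as correct.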

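Visualize how the accuracy on the training and dev sets changes during training.

import matplotlib.pyplot as plt

# Visualize the metrics on the training and dev sets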
def plot(runner, fig_name):
    plt.figure(figsize=(10,5))
    epochs = [i for i in range(len(runner.train_scores))]

    plt.subplot(1,2,1)
    plt.plot(epochs, runner.train_loss, color='#e4007f', label="Train loss")
    plt.plot(epochs, runner.dev_loss, color='#f19ec2', linestyle='--', label="Dev loss")
    # Axis labels and legend
    plt.ylabel("loss", fontsize='large')
    plt.xlabel("epoch", fontsize='large')
    plt.legend(loc='upper right', fontsize='x-large')

    plt.subplot(1,2,2)
    plt.plot(epochs, runner.train_scores, color='#e4007f', label="Train accuracy")
    plt.plot(epochs, runner.dev_scores, color='#f19ec2', linestyle='--', label="Dev accuracy")
    # Axis labels and legend
    plt.ylabel("score", fontsize='large')
    plt.xlabel("epoch", fontsize='large')
    plt.legend(loc='lower right', fontsize='x-large')
    
    plt.savefig(fig_name)
    plt.show()

plot(runner, 'fw-acc.pdf')

Results:
[Figure 1: loss and accuracy curves on the training and dev sets]
1. Reimplement the binary classification task with PyTorch's predefined operators.

import math

# Generate 40,000 evenly spaced grid points
x1, x2 = torch.meshgrid(torch.linspace(-math.pi, math.pi, 200), torch.linspace(-math.pi, math.pi, 200))

x = torch.stack([torch.flatten(x1), torch.flatten(x2)], dim=1)

# Predict the class probability of every grid point
y = runner.predict(x)
# Optional: threshold the probabilities into hard class labels
# y = (y >= 0.5).to(torch.float32)

# Plot the class regions
plt.ylabel('x2')
plt.xlabel('x1')
plt.scatter(x[:,0].tolist(), x[:,1].tolist(), c=torch.squeeze(y, dim=-1).tolist(), cmap=plt.cm.Spectral)

plt.scatter(X_train[:, 0].tolist(), X_train[:, 1].tolist(), marker='*', c=torch.squeeze(y_train, dim=-1).tolist())
plt.scatter(X_dev[:, 0].tolist(), X_dev[:, 1].tolist(), marker='*', c=torch.squeeze(y_dev, dim=-1).tolist())
plt.scatter(X_test[:, 0].tolist(), X_test[:, 1].tolist(), marker='*', c=torch.squeeze(y_test, dim=-1).tolist())

plt.show()

Results:
[Figure 2: predicted class regions with the training, dev, and test points]

2. Add a hidden layer with 3 neurons, solve the binary classification task again, and compare with 1.

class Model_MLP_L2_V5(torch.nn.Module):
    def __init__(self, input_size, hidden_size, hidden_size2, output_size):
        super(Model_MLP_L2_V5, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        w1=torch.normal(0,0.1,size=(hidden_size,input_size),requires_grad=True)
        self.fc1.weight = nn.Parameter(w1)

        self.fc2 = nn.Linear(hidden_size, hidden_size2)
        w2 = torch.normal(0, 0.1, size=(hidden_size2, hidden_size), requires_grad=True)
        self.fc2.weight = nn.Parameter(w2)

        self.fc3 = nn.Linear(hidden_size2, output_size)
        w3 = torch.normal(0, 0.1, size=(output_size, hidden_size2), requires_grad=True)
        self.fc3.weight = nn.Parameter(w3)

        # Use 'torch.sigmoid' as the Logistic activation function
        self.act_fn = torch.sigmoid

    # Forward pass
    def forward(self, inputs):
        z1 = self.fc1(inputs.to(torch.float32))
        a1 = self.act_fn(z1)
        z2 = self.fc2(a1)
        a2 = self.act_fn(z2)
        z3 = self.fc3(a2)
        a3 = self.act_fn(z3)
        return a3

Results:
[Figures 3 and 4: results for the model with the extra hidden layer]

Compared with 1, the accuracy is higher.

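[Discussion Question]

Custom gradient computation vs. automatic gradient computation:
Compare the two in terms of computational performance, results, and other aspects, and give your own view.
Custom gradient computation: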

    def backward(self):
        # Derivative of the loss with respect to the model predictions
        loss_grad_predicts = -1.0 * (self.labels / self.predicts -
                                     (1 - self.labels) / (1 - self.predicts)) / self.num

        # Backpropagate the gradient
        self.model.backward(loss_grad_predicts)

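Results:

[Test] score/loss: 0.8500/0.3205

Automatic gradient computation:

            # Compute parameter gradients automatically
            trn_loss.backward()

Results:

[Test] score/loss: 0.8650/0.3019

In this run, automatic gradient computation gives slightly better test results (0.8650/0.3019 versus 0.8500/0.3205). Both approaches compute the same gradients mathematically, so a gap of this size may come from differences in random initialization rather than from the gradient method itself. The practical advantage of automatic gradients is that no backward code has to be written and debugged by hand.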
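
One way to gain confidence in a hand-written backward pass is to compare it against finite differences. The sketch below does this for the cross-entropy derivative used in the custom `backward()` above. It uses plain Python lists instead of tensors, and the function names and sample values are made up for illustration.

```python
import math


def bce(predicts, labels):
    """Mean binary cross-entropy over N samples."""
    n = len(predicts)
    total = 0.0
    for p, y in zip(predicts, labels):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / n


def bce_grad(predicts, labels):
    """Hand-derived dL/dp, the same expression as in backward() above."""
    n = len(predicts)
    return [-1.0 * (y / p - (1 - y) / (1 - p)) / n
            for p, y in zip(predicts, labels)]


def numeric_grad(predicts, labels, eps=1e-6):
    """Central finite differences, used only as an independent reference."""
    grads = []
    for i in range(len(predicts)):
        up = list(predicts)
        down = list(predicts)
        up[i] += eps
        down[i] -= eps
        grads.append((bce(up, labels) - bce(down, labels)) / (2 * eps))
    return grads


predicts = [0.9, 0.2, 0.6, 0.35]   # made-up probabilities
labels = [1.0, 0.0, 0.0, 1.0]      # made-up targets

analytic = bce_grad(predicts, labels)
numeric = numeric_grad(predicts, labels)
max_diff = max(abs(ga - gn) for ga, gn in zip(analytic, numeric))
print(max_diff)
```

If the two gradients agree to within a small tolerance, the hand-derived formula is very likely correct. An automatic differentiation engine removes the need for this kind of check on every new layer.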

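4.4 Optimization Problems
4.4.1 Parameter Initialization
Before implementing a neural network, its parameters must be initialized. If every layer's weights and biases are initialized to 0, then after the first forward pass all hidden neurons have the same activation, and during backpropagation all of their weights receive the same update. The hidden neurons therefore never become different from one another; this is the symmetric-weight phenomenon.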
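
The symmetry argument can be checked numerically. The sketch below trains a tiny 2-3-1 Sigmoid network from all-zero parameters with plain gradient descent, written in pure Python so every step is visible. The data, the learning rate, and the network size are made up for illustration and are not the ones used in the experiment.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Made-up toy data: two features per sample, binary labels
X = [[0.5, -1.2], [1.5, 0.3], [-0.7, 0.8], [0.2, 2.0]]
Y = [1.0, 0.0, 1.0, 1.0]

H = 3                                   # number of hidden units
W1 = [[0.0, 0.0] for _ in range(H)]     # hidden weights, all zero
b1 = [0.0] * H
w2 = [0.0] * H                          # output weights, all zero
b2 = 0.0
lr = 0.5
first_loss = None

for step in range(50):
    gW1 = [[0.0, 0.0] for _ in range(H)]
    gb1 = [0.0] * H
    gw2 = [0.0] * H
    gb2 = 0.0
    loss = 0.0
    for x, y in zip(X, Y):
        # Forward pass
        a1 = [sigmoid(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(H)]
        p = sigmoid(sum(w2[j] * a1[j] for j in range(H)) + b2)
        loss += -(y * math.log(p) + (1 - y) * math.log(1 - p)) / len(X)
        # Backward pass: dL/dz2 for sigmoid followed by cross-entropy
        d2 = (p - y) / len(X)
        gb2 += d2
        for j in range(H):
            gw2[j] += d2 * a1[j]
            d1 = d2 * w2[j] * a1[j] * (1 - a1[j])   # dL/dz1 for hidden unit j
            gb1[j] += d1
            gW1[j][0] += d1 * x[0]
            gW1[j][1] += d1 * x[1]
    if first_loss is None:
        first_loss = loss
    # Plain gradient-descent update
    b2 -= lr * gb2
    for j in range(H):
        w2[j] -= lr * gw2[j]
        b1[j] -= lr * gb1[j]
        W1[j][0] -= lr * gW1[j][0]
        W1[j][1] -= lr * gW1[j][1]

print(first_loss)   # about ln 2, because every prediction starts at 0.5
print(W1)           # the three rows are identical
```

The weights do move away from zero, but all hidden units receive the same update at every step, so the three hidden neurons stay copies of each other and the layer behaves like a single neuron.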

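Next, we initialize all model parameters to 0 and look at the result. The class Model_MLP_L2_V4 is redefined here so that the parameters of both linear layers are all initialized to 0.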

import torch
import torch.nn as nn
import torch.nn.functional as F


# Two-layer feedforward network with all parameters initialized to 0
class Model_MLP_L2_V4(torch.nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(Model_MLP_L2_V4, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        nn.init.constant_(self.fc1.weight, val=0.0)
        nn.init.constant_(self.fc1.bias, val=0.0)
        self.fc2 = nn.Linear(hidden_size, output_size)
        nn.init.constant_(self.fc2.weight, val=0.0)
        nn.init.constant_(self.fc2.bias, val=0.0)
        # Use 'torch.sigmoid' as the Logistic activation function
        self.act_fn = torch.sigmoid

    # Forward pass
    def forward(self, inputs):
        z1 = self.fc1(inputs.to(torch.float32))
        a1 = self.act_fn(z1)
        z2 = self.fc2(a1)
        a2 = self.act_fn(z2)
        return a2


def print_weights(runner):
    print('The weights of the Layers:')
    # Print every parameter together with its name
    for name, param in runner.model.named_parameters():
        print(name)
        print(param.detach().numpy())
Train the model with the Runner class:

# Model
input_size = 2
hidden_size = 5
output_size = 1
model = Model_MLP_L2_V4(input_size=input_size, hidden_size=hidden_size, output_size=output_size)

# Loss function
loss_fn = F.binary_cross_entropy

# Optimizer
learning_rate = 0.2
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# Evaluation metric
metric = accuracy

# Other settings
epoch = 2000
saved_path = 'best_model.pdparams'

# Instantiate RunnerV2_2 with the training configuration
runner = RunnerV2_2(model, optimizer, metric, loss_fn)

runner.train([X_train, y_train], [X_dev, y_dev], num_epochs=epoch, log_epochs=50, save_path=saved_path)

Results:

[Evaluate] best accuracy performence has been updated: 0.00000 --> 0.50625
[Train] epoch: 0/2000, loss: 0.6931473016738892
[Train] epoch: 50/2000, loss: 0.6931255459785461
[Train] epoch: 100/2000, loss: 0.6931201219558716
[Train] epoch: 150/2000, loss: 0.6931053400039673
[Train] epoch: 200/2000, loss: 0.6930629014968872
[Train] epoch: 250/2000, loss: 0.6929393410682678
[Train] epoch: 300/2000, loss: 0.6925800442695618
[Train] epoch: 350/2000, loss: 0.6915353536605835
[Train] epoch: 400/2000, loss: 0.6885142922401428
[Train] epoch: 450/2000, loss: 0.679978609085083
[Evaluate] best accuracy performence has been updated: 0.50625 --> 0.51250
[Evaluate] best accuracy performence has been updated: 0.51250 --> 0.52500
[Evaluate] best accuracy performence has been updated: 0.52500 --> 0.53125
[Evaluate] best accuracy performence has been updated: 0.53125 --> 0.53750
[Evaluate] best accuracy performence has been updated: 0.53750 --> 0.55625
[Evaluate] best accuracy performence has been updated: 0.55625 --> 0.58125
[Evaluate] best accuracy performence has been updated: 0.58125 --> 0.58750
[Evaluate] best accuracy performence has been updated: 0.58750 --> 0.60625
[Evaluate] best accuracy performence has been updated: 0.60625 --> 0.61250
[Evaluate] best accuracy performence has been updated: 0.61250 --> 0.62500
[Evaluate] best accuracy performence has been updated: 0.62500 --> 0.63750
[Evaluate] best accuracy performence has been updated: 0.63750 --> 0.65625
[Evaluate] best accuracy performence has been updated: 0.65625 --> 0.68750
[Evaluate] best accuracy performence has been updated: 0.68750 --> 0.69375
[Evaluate] best accuracy performence has been updated: 0.69375 --> 0.70625
[Evaluate] best accuracy performence has been updated: 0.70625 --> 0.72500
[Evaluate] best accuracy performence has been updated: 0.72500 --> 0.74375
[Evaluate] best accuracy performence has been updated: 0.74375 --> 0.75625
[Evaluate] best accuracy performence has been updated: 0.75625 --> 0.76875
[Evaluate] best accuracy performence has been updated: 0.76875 --> 0.77500
[Train] epoch: 500/2000, loss: 0.6576451063156128
[Evaluate] best accuracy performence has been updated: 0.77500 --> 0.78750
[Evaluate] best accuracy performence has been updated: 0.78750 --> 0.79375
[Evaluate] best accuracy performence has been updated: 0.79375 --> 0.80000
[Evaluate] best accuracy performence has been updated: 0.80000 --> 0.80625
[Evaluate] best accuracy performence has been updated: 0.80625 --> 0.81875
[Evaluate] best accuracy performence has been updated: 0.81875 --> 0.82500
[Evaluate] best accuracy performence has been updated: 0.82500 --> 0.83750
[Evaluate] best accuracy performence has been updated: 0.83750 --> 0.84375
[Train] epoch: 550/2000, loss: 0.6099116206169128
[Train] epoch: 600/2000, loss: 0.5398089289665222
[Train] epoch: 650/2000, loss: 0.47239741683006287
[Train] epoch: 700/2000, loss: 0.4223340153694153
[Train] epoch: 750/2000, loss: 0.38712045550346375
[Train] epoch: 800/2000, loss: 0.36122432351112366
[Train] epoch: 850/2000, loss: 0.3410833179950714
[Train] epoch: 900/2000, loss: 0.32484984397888184
[Train] epoch: 950/2000, loss: 0.3115811049938202
[Evaluate] best accuracy performence has been updated: 0.84375 --> 0.85000
[Train] epoch: 1000/2000, loss: 0.30073660612106323
[Train] epoch: 1050/2000, loss: 0.2919325530529022
[Evaluate] best accuracy performence has been updated: 0.85000 --> 0.85625
[Train] epoch: 1100/2000, loss: 0.28484243154525757
[Evaluate] best accuracy performence has been updated: 0.85625 --> 0.86250
[Train] epoch: 1150/2000, loss: 0.279169499874115
[Train] epoch: 1200/2000, loss: 0.27464812994003296
[Train] epoch: 1250/2000, loss: 0.2710496485233307
[Train] epoch: 1300/2000, loss: 0.2681841254234314
[Train] epoch: 1350/2000, loss: 0.2658982276916504
[Evaluate] best accuracy performence has been updated: 0.86250 --> 0.86875
[Train] epoch: 1400/2000, loss: 0.26407015323638916
[Train] epoch: 1450/2000, loss: 0.2626038193702698
[Train] epoch: 1500/2000, loss: 0.26142385601997375
[Evaluate] best accuracy performence has been updated: 0.86875 --> 0.87500
[Train] epoch: 1550/2000, loss: 0.2604711651802063
[Train] epoch: 1600/2000, loss: 0.259699285030365
[Train] epoch: 1650/2000, loss: 0.2590716481208801
[Evaluate] best accuracy performence has been updated: 0.87500 --> 0.88125
[Train] epoch: 1700/2000, loss: 0.25855937600135803
[Train] epoch: 1750/2000, loss: 0.25813964009284973
[Train] epoch: 1800/2000, loss: 0.2577943205833435
[Evaluate] best accuracy performence has been updated: 0.88125 --> 0.88750
[Train] epoch: 1850/2000, loss: 0.25750893354415894
[Train] epoch: 1900/2000, loss: 0.25727197527885437
[Train] epoch: 1950/2000, loss: 0.257074236869812

Visualize the accuracy and loss on the training and dev sets:

plot(runner, "fw-zero.pdf")

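Results:
[Figure 5: loss and accuracy curves with all-zero initialization]
Judging from the log, dev accuracy stays at about 50% and the loss barely moves from 0.6931 for roughly the first 400 epochs, which means that during this phase the model learns almost nothing. In this run the loss does start to fall after that and dev accuracy eventually reaches 0.88750, but far more slowly than with the random initialization used in 4.3.3, where dev accuracy already reaches 0.82500 by epoch 50.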

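To avoid symmetric weights, the parameters of a neural network can be initialized from a Gaussian or a uniform distribution.
The code below samples from a Gaussian and a uniform distribution and visualizes the samples: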

gaussian_weights = torch.normal(mean=0.0, std=1.0, size=[10000])
uniform_weights = torch.empty(10000).uniform_(-1, 1)
# Plot the two parameter distributions
plt.figure()
plt.subplot(1, 2, 1)
plt.title('Gaussian Distribution')
plt.hist(gaussian_weights.numpy(), bins=200, density=True, color='#f19ec2')
plt.subplot(1, 2, 2)
plt.title('Uniform Distribution')
plt.hist(uniform_weights.numpy(), bins=200, density=True, color='#e4007f')
plt.savefig('fw-gaussian-uniform.pdf')
plt.show()

Results:

[Figure 6: histograms of the Gaussian and uniform samples]
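4.4.2 The Vanishing Gradient Problem
When building a neural network, adding more layers should in theory improve its fitting capacity. However, as the network gets deeper, parameter learning becomes harder and the vanishing gradient problem tends to appear.

Sigmoid-type functions saturate: in the saturated regions their derivative is close to 0, so the error shrinks every time it is propagated back through a layer. When the network is very deep, the gradient keeps decaying and may vanish altogether, which makes the whole network hard to train. This is the vanishing gradient problem.
There are many ways to mitigate vanishing gradients in deep networks. A simple and effective one is to use an activation function with a larger derivative, such as ReLU.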
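
A quick calculation shows how fast the decay can be. The sketch below looks only at the activation derivatives along one path through the network and ignores the weight matrices, which also scale the gradient, so it is a simplified illustration rather than a model of the experiment that follows. The helper names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z):
    return 1.0 if z > 0 else 0.0


# Largest possible Sigmoid derivative, reached at z = 0
peak = sigmoid_grad(0.0)

# Product of activation derivatives along a path through `depth` layers:
# best case for Sigmoid versus an active ReLU unit
for depth in (1, 2, 4, 8):
    print(depth, peak ** depth, relu_grad(1.0) ** depth)

# In the saturated region the Sigmoid derivative is far below its peak
saturated = sigmoid_grad(6.0)
print(saturated)
```

Even in the best case a Sigmoid layer scales the gradient by at most 0.25, so four such layers shrink it by a factor of at least 256, while an active ReLU unit passes the gradient through unchanged.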

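The simple experiment below shows the vanishing gradient phenomenon in a feedforward network and how to improve it.
4.4.2.1 Model Construction
Define a feedforward network with 4 hidden layers and 1 output layer, where the activation function is selected by an argument. The implementation is as follows: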

from functools import partial


# Multi-layer feedforward network
class Model_MLP_L5(nn.Module):
    def __init__(self, input_size, output_size, act='sigmoid',
                 w_init=partial(nn.init.normal_, mean=0.0, std=0.01),
                 b_init=partial(nn.init.constant_, val=1.0)):
        super(Model_MLP_L5, self).__init__()
        self.fc1 = torch.nn.Linear(input_size, 3)
        self.fc2 = torch.nn.Linear(3, 3)
        self.fc3 = torch.nn.Linear(3, 3)
        self.fc4 = torch.nn.Linear(3, 3)
        self.fc5 = torch.nn.Linear(3, output_size)
        # Activation function used by the hidden layers
        if act == 'sigmoid':
            self.act = torch.sigmoid
        elif act == 'relu':
            self.act = F.relu
        elif act == 'lrelu':
            self.act = F.leaky_relu
        else:
            raise ValueError("Please enter sigmoid relu or lrelu!")
        # Initialize the weights and biases of the linear layers
        self.init_weights(w_init, b_init)

    # Initialize the weights and biases of the linear layers
    def init_weights(self, w_init, b_init):
        # Iterate over all sub-modules with 'modules()'
        for m in self.modules():
            # Apply the given initializers to every linear layer
            if isinstance(m, nn.Linear):
                w_init(m.weight)
                b_init(m.bias)

    def forward(self, inputs):
        outputs = self.fc1(inputs.to(torch.float32))
        outputs = self.act(outputs)
        outputs = self.fc2(outputs)
        outputs = self.act(outputs)
        outputs = self.fc3(outputs)
        outputs = self.act(outputs)
        outputs = self.fc4(outputs)
        outputs = self.act(outputs)
        outputs = self.fc5(outputs)
        outputs = torch.sigmoid(outputs)
        return outputs

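4.4.2.2 Training with a Sigmoid-Type Function
Use a Sigmoid-type activation function. To make the vanishing gradient easy to observe, only one optimization step is run. The implementation is as follows:

Define the gradient-printing function: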

def print_grads(runner):
    # Print the mean gradient of every parameter whose first dimension is 3 (fc1 to fc4)
    print('The gradient of the Layers:')
    for item in runner.model.named_parameters():
        if len(item[1])==3:
            print(item[0],".gard:")
            print(torch.mean(item[1].grad))
            print("=============")
 
# Learning rate
lr = 0.01

# Define the network with Sigmoid activations
model = Model_MLP_L5(input_size=2, output_size=1, act='sigmoid')

# Optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

# Loss function: cross-entropy
loss_fn = F.binary_cross_entropy

# Evaluation metric
metric = accuracy

# Gradient-printing function
custom_print_log = print_grads

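Instantiate the RunnerV2_2 class with the training configuration. The implementation is as follows:

# Instantiate the Runner class
runner = RunnerV2_2(model, optimizer, metric, loss_fn)

Train the model and print the mean gradient of each layer. The implementation is as follows:

# Start training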
runner.train([X_train, y_train], [X_dev, y_dev], 
            num_epochs=1, log_epochs=None, 
            save_path="best_model.pdparams", 
            custom_print_log=custom_print_log)

Results:

The gradient of the Layers:
fc1.weight .gard:
tensor(1.9422e-12)
=============
fc1.bias .gard:
tensor(7.6139e-12)
=============
fc2.weight .gard:
tensor(2.1942e-10)
=============
fc2.bias .gard:
tensor(3.0126e-10)
=============
fc3.weight .gard:
tensor(4.7720e-07)
=============
fc3.bias .gard:
tensor(6.5120e-07)
=============
fc4.weight .gard:
tensor(-3.8480e-05)
=============
fc4.bias .gard:
tensor(-5.2509e-05)
=============

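The results show that the gradient shrinks each time it is passed back through a layer, from about 1e-5 at fc4 to about 1e-12 at fc1. By the time it reaches the first layer it has almost completely vanished.
4.4.2.3 Training with the ReLU Function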

torch.manual_seed(102)
# Learning rate
lr = 0.01

# Define the network with ReLU activations
model = Model_MLP_L5(input_size=2, output_size=1, act='relu')

# Optimizer
optimizer = torch.optim.SGD(model.parameters(), lr)

# Loss function: cross-entropy
loss_fn = F.binary_cross_entropy

# Evaluation metric
metric = accuracy

# Gradient-printing function
custom_print_log = print_grads
# Instantiate the Runner class
runner = RunnerV2_2(model, optimizer, metric, loss_fn)
# Start training
runner.train([X_train, y_train], [X_dev, y_dev],
             num_epochs=1, log_epochs=None,
             save_path="best_model.pdparams",
             custom_print_log=custom_print_log)

Results (the recorded gradients are of the same order of magnitude as in the Sigmoid run above):

The gradient of the Layers:
fc1.weight .gard:
tensor(-9.6566e-13)
=============
fc1.bias .gard:
tensor(-4.4046e-12)
=============
fc2.weight .gard:
tensor(-5.2732e-10)
=============
fc2.bias .gard:
tensor(-7.2219e-10)
=============
fc3.weight .gard:
tensor(2.6233e-07)
=============
fc3.bias .gard:
tensor(3.5886e-07)
=============
fc4.weight .gard:
tensor(0.0002)
=============
fc4.bias .gard:
tensor(0.0002)
=============
[Evaluate] best accuracy performence has been updated: 0.00000 --> 0.41250

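4.4.3 The Dying ReLU Problem
The ReLU activation function alleviates vanishing gradients to some extent, but in some situations it is prone to the dying ReLU problem, which makes the network hard to train.

This happens because the output of ReLU is always 0 when x < 0. If, after an unlucky parameter update during training, a ReLU neuron is not activated on any training sample (that is, its output is always 0), then the gradient of that neuron's own parameters will be 0 from then on, and the neuron can never be activated again in later training.

A simple and effective remedy is to replace the activation function with a ReLU variant such as Leaky ReLU or ELU.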
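
The mechanism can be seen on a single neuron. The sketch below uses made-up inputs and weights; the bias of -8.0 mirrors the value used in the model that follows, and the 0.01 slope is the default negative slope of Leaky ReLU. It compares the local gradient of ReLU and Leaky ReLU when the pre-activation is negative for every sample.

```python
def relu_grad(z):
    """Derivative of ReLU with respect to its input."""
    return 1.0 if z > 0 else 0.0


def leaky_relu_grad(z, negative_slope=0.01):
    """Derivative of Leaky ReLU with respect to its input."""
    return 1.0 if z > 0 else negative_slope


# A single neuron z = w[0]*x1 + w[1]*x2 + b with small weights
# and a very negative bias
w = [0.01, -0.02]
b = -8.0
inputs = [[-3.0, 2.5], [0.4, -1.1], [3.1, 3.0], [-2.2, -0.5]]  # made-up data

pre_acts = [w[0] * x1 + w[1] * x2 + b for x1, x2 in inputs]
relu_grads = [relu_grad(z) for z in pre_acts]
leaky_grads = [leaky_relu_grad(z) for z in pre_acts]

print(pre_acts)      # all negative
print(relu_grads)    # all 0.0: no gradient reaches w or b
print(leaky_grads)   # all 0.01: small but non-zero
```

With ReLU no sample produces a gradient, so the neuron's parameters can never move back into the active region. Leaky ReLU keeps a small gradient flowing, which gives the neuron a chance to recover.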
4.4.3.1 Training with ReLU

# Multi-layer feedforward network
class Model_MLP_L5(torch.nn.Module):
    def __init__(self, input_size, output_size, act='relu'):
        super(Model_MLP_L5, self).__init__()
        self.fc1 = torch.nn.Linear(input_size, 3)
        self.fc2 = torch.nn.Linear(3, 3)
        self.fc3 = torch.nn.Linear(3, 3)
        self.fc4 = torch.nn.Linear(3, 3)
        self.fc5 = torch.nn.Linear(3, output_size)
        # Every linear layer gets its own weights drawn from N(0, 0.01^2)
        # and a bias of -8.0
        for layer in (self.fc1, self.fc2, self.fc3, self.fc4, self.fc5):
            nn.init.normal_(layer.weight, mean=0.0, std=0.01)
            nn.init.constant_(layer.bias, val=-8.0)
        # Activation function used by the hidden layers
        if act == 'sigmoid':
            self.act = torch.sigmoid
        elif act == 'relu':
            self.act = F.relu
        elif act == 'lrelu':
            self.act = F.leaky_relu
        else:
            raise ValueError("Please enter sigmoid relu or lrelu!")

    def forward(self, inputs):
        outputs = self.fc1(inputs.to(torch.float32))
        outputs = self.act(outputs)
        outputs = self.fc2(outputs)
        outputs = self.act(outputs)
        outputs = self.fc3(outputs)
        outputs = self.act(outputs)
        outputs = self.fc4(outputs)
        outputs = self.act(outputs)
        outputs = self.fc5(outputs)
        outputs = torch.sigmoid(outputs)
        return outputs

Results:

The gradient of the Layers:
fc1.weight .gard:
tensor(2.5608e-20)
=============
fc1.bias .gard:
tensor(6.2843e-20)
=============
fc2.weight .gard:
tensor(2.4298e-18)
=============
fc2.bias .gard:
tensor(7.2055e-15)
=============
fc3.weight .gard:
tensor(-1.3296e-12)
=============
fc3.bias .gard:
tensor(-2.4046e-12)
=============
fc4.weight .gard:
tensor(-3.0175e-10)
=============
fc4.bias .gard:
tensor(-9.0033e-07)
=============
[Evaluate] best accuracy performence has been updated: 0.00000 --> 0.50625

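The output shows that with ReLU as the activation function, the dying ReLU problem occurs when the conditions are met: the gradients of the ReLU neurons are effectively 0 (around 1e-20 at fc1), so the parameters cannot be updated. A simple and effective remedy is to replace the activation with a ReLU variant such as Leaky ReLU or ELU. Next, we look at the gradients after switching the activation function to Leaky ReLU.
4.4.3.2 Training with Leaky ReLU
Switch the activation function to Leaky ReLU, train the model, and look at the gradients. The implementation is as follows: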

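# Define the network with Leaky ReLU activations
model = Model_MLP_L5(input_size=2, output_size=1, act='lrelu')

# Create a new optimizer bound to this model's parameters
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

# Instantiate the Runner class
runner = RunnerV2_2(model, optimizer, metric, loss_fn)

# Start training
runner.train([X_train, y_train], [X_dev, y_dev],
            num_epochs=1, log_epochs=None,
            save_path="best_model.pdparams",
            custom_print_log=custom_print_log)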

Results:

The gradient of the Layers:
fc1.weight .gard:
tensor(-1.8123e-15)
=============
fc1.bias .gard:
tensor(-4.1456e-15)
=============
fc2.weight .gard:
tensor(-4.3190e-12)
=============
fc2.bias .gard:
tensor(5.3859e-11)
=============
fc3.weight .gard:
tensor(1.9313e-10)
=============
fc3.bias .gard:
tensor(1.2578e-09)
=============
fc4.weight .gard:
tensor(-1.3360e-06)
=============
fc4.bias .gard:
tensor(1.6696e-05)
=============
[Evaluate] best accuracy performence has been updated: 0.00000 --> 0.50000
[Train] epoch: 0/1, loss: 4.0374345779418945

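The output shows that after switching to Leaky ReLU the dying ReLU problem is alleviated: the gradients are non-zero again and the parameters can be updated. However, because the default slope of Leaky ReLU for x < 0 is only 0.01, the gradient still becomes smaller and smaller during backpropagation as the network gets deeper. To improve this, increase the slope of Leaky ReLU for x < 0.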
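
The effect of the slope can be estimated with a short calculation. The sketch below assumes the worst case in which every hidden unit on the path sits at x < 0, and it ignores the weight matrices; `path_gain` is a hypothetical helper, and the alternative slopes 0.1 and 0.3 are arbitrary examples.

```python
def path_gain(negative_slope, depth):
    """Gradient scale after `depth` Leaky ReLU layers that all sit at x < 0."""
    return negative_slope ** depth


# Four hidden layers, as in Model_MLP_L5
gains = {slope: path_gain(slope, 4) for slope in (0.01, 0.1, 0.3)}
for slope, gain in gains.items():
    print(slope, gain)
```

With the default slope the four activations alone scale the gradient by about 1e-8, while a slope of 0.1 gives about 1e-4. In PyTorch the slope can be set through the negative_slope argument of F.leaky_relu.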

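**Takeaways:** In this experiment I learned the difference between custom and automatic gradient computation, and how to port a number of Paddle functions to torch.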
