Study Notes | PyTorch Tutorial 13 (Weight Initialization)

These notes are drawn mainly from the "深度之眼" course and are summarized here for easy reference.
PyTorch version used: 1.2

  • Vanishing and exploding gradients
  • Xavier and Kaiming methods
  • Common initialization methods

I. Vanishing and Exploding Gradients

[Figure 1]
Gradient derivation:
$$
\begin{aligned}
\mathrm{H}_{2} &= \mathrm{H}_{1} * \mathrm{W}_{2} \\
\Delta \mathrm{W}_{2} &= \frac{\partial \mathrm{Loss}}{\partial \mathrm{W}_{2}} = \frac{\partial \mathrm{Loss}}{\partial \mathrm{out}} * \frac{\partial \mathrm{out}}{\partial \mathrm{H}_{2}} * \frac{\partial \mathrm{H}_{2}}{\partial \mathrm{W}_{2}} \\
&= \frac{\partial \mathrm{Loss}}{\partial \mathrm{out}} * \frac{\partial \mathrm{out}}{\partial \mathrm{H}_{2}} * \mathrm{H}_{1}
\end{aligned}
$$
The gradient of $\mathrm{W}_{2}$ is proportional to $\mathrm{H}_{1}$, the output of the previous layer, so the scale of $\mathrm{H}_{1}$ controls the scale of the gradient:
Vanishing gradient: $\mathrm{H}_{1} \rightarrow 0 \Rightarrow \Delta \mathrm{W}_{2} \rightarrow 0$
Exploding gradient: $\mathrm{H}_{1} \rightarrow \infty \Rightarrow \Delta \mathrm{W}_{2} \rightarrow \infty$
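As a rough illustration of the formula above, the sketch below treats every quantity as a scalar and assumes a squared-error loss. The loss, the `target` value and the function names are assumptions made for this example and are not part of the course code. It cross-checks the chain-rule gradient against a finite difference and shows how the size of H1 drives the size of the gradient of W2.

```python
def loss(h1, w2, target):
    """Squared-error loss of a one-weight layer, with out = H2 = h1 * w2."""
    out = h1 * w2
    return 0.5 * (out - target) ** 2


def grad_w2(h1, w2, target):
    """Chain rule: dLoss/dout * dout/dH2 * dH2/dW2 = (out - target) * 1 * h1."""
    out = h1 * w2
    return (out - target) * 1.0 * h1


def numeric_grad_w2(h1, w2, target, eps=1e-6):
    """Central finite difference, used only to cross-check the chain rule."""
    return (loss(h1, w2 + eps, target) - loss(h1, w2 - eps, target)) / (2 * eps)


w2, target = 0.5, 1.0
for h1 in (1e-4, 1.0, 1e4):
    # A tiny h1 gives a tiny gradient, a huge h1 gives a huge gradient.
    print(h1, grad_w2(h1, w2, target), numeric_grad_w2(h1, w2, target))
```

The last factor of the product is H1 itself, which is why the rest of these notes focus on keeping the scale of each layer's output under control.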

Test code:

import os
import torch
import random
import numpy as np
import torch.nn as nn
from tools.common_tools import set_seed  # helper from the course code, not part of PyTorch

set_seed(1)  # set the random seed


class MLP(nn.Module):
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            """
            #x = torch.relu(x)
            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):
                print("output is nan in {} layers".format(i))
                break
            """
        return x

    def initialize(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data)
                # nn.init.normal_(m.weight.data, std=np.sqrt(1/self.neural_num))    # normal: mean=0, std=1

                # a = np.sqrt(6 / (self.neural_num + self.neural_num))
                #
                # tanh_gain = nn.init.calculate_gain('tanh')
                # a *= tanh_gain
                #
                # nn.init.uniform_(m.weight.data, -a, a)

                # nn.init.xavier_uniform_(m.weight.data, gain=tanh_gain)

                # nn.init.normal_(m.weight.data, std=np.sqrt(2 / self.neural_num))
                # nn.init.kaiming_normal_(m.weight.data)

# flag = 0
flag = 1

if flag:
    layer_nums = 100
    neural_nums = 256
    batch_size = 16

    net = MLP(neural_nums, layer_nums)
    net.initialize()

    inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

    output = net(inputs)
    print(output)

Output:

tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], grad_fn=<MmBackward>)

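The output values have grown so large that they turn into nan.
To find the layer at which the values blow up, enable the per-layer printing in forward: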

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)

            # x = torch.relu(x)
            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):
                print("output is nan in {} layers".format(i))
                break
        return x

Output:

layer:0, std:15.959932327270508
layer:1, std:256.6237487792969
layer:2, std:4107.24560546875
layer:3, std:65576.8125
layer:4, std:1045011.875
layer:5, std:17110408.0
layer:6, std:275461440.0
layer:7, std:4402537984.0
layer:8, std:71323615232.0
layer:9, std:1148104736768.0
layer:10, std:17911758454784.0
layer:11, std:283574813065216.0
layer:12, std:4480599540629504.0
layer:13, std:7.196813845908685e+16
layer:14, std:1.1507761512626258e+18
layer:15, std:1.8531105202862293e+19
layer:16, std:2.9677722308204246e+20
layer:17, std:4.780375660819944e+21
layer:18, std:7.61322258007914e+22
layer:19, std:1.2092650667673597e+24
layer:20, std:1.923256845372055e+25
layer:21, std:3.134466694721031e+26
layer:22, std:5.014437175989598e+27
layer:23, std:8.066614199776408e+28
layer:24, std:1.2392660797937701e+30
layer:25, std:1.9455685681908206e+31
layer:26, std:3.02381787247178e+32
layer:27, std:4.950357261592001e+33
layer:28, std:8.150924034825315e+34
layer:29, std:1.3229830735592165e+36
layer:30, std:2.0786816651036685e+37
layer:31, std:nan
output is nan in 31 layers
tensor([[        inf, -2.6817e+38,         inf,  ...,         inf,
                 inf,         inf],
        [       -inf,        -inf,  1.4387e+38,  ..., -1.3409e+38,
         -1.9660e+38,        -inf],
        [-1.5873e+37,         inf,        -inf,  ...,         inf,
                -inf,  1.1484e+38],
        ...,
        [ 2.7754e+38, -1.6783e+38, -1.5531e+38,  ...,         inf,
         -9.9440e+37, -2.5132e+38],
        [-7.7183e+37,        -inf,         inf,  ..., -2.6505e+38,
                 inf,         inf],
        [        inf,         inf,        -inf,  ...,        -inf,
                 inf,  1.7432e+38]], grad_fn=<MmBackward>)

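The explosion shows up at layer 31, where the std becomes nan. What overflows here is the forward output of the layers; by the gradient formula above, the weight gradients scale with these outputs and explode with them.
Derivation of the relevant formulas:
Expectation: $E(X), E(Y)$
Variance: $D(X)=E\left\{[X-E(X)]^{2}\right\}$
Then, for independent random variables $X$ and $Y$: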

  1. $E(X * Y)=E(X) * E(Y)$
  2. $D(X)=E\left(X^{2}\right)-[E(X)]^{2}$
  3. $D(X+Y)=D(X)+D(Y)$
  4. From 1, 2 and 3: $D(X * Y)=D(X) * D(Y)+D(X) *[E(Y)]^{2}+D(Y) *[E(X)]^{2}$

If $E(X)=0$ and $E(Y)=0$, this reduces to:
$D(X * Y)=D(X) * D(Y)$
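The product rule for variances can be checked numerically. The following is a minimal Monte Carlo sketch in plain Python. The function name `product_variance_check`, the choice of normal distributions and the sample size are assumptions made for this illustration. It compares the empirical variance of X * Y with the value predicted by rule 4, both for the zero-mean case and for non-zero means.

```python
import random


def product_variance_check(mean_x, std_x, mean_y, std_y, samples=200_000, seed=0):
    """Return (empirical D(XY), D(XY) predicted by rule 4) for independent normals."""
    rng = random.Random(seed)
    products = [rng.gauss(mean_x, std_x) * rng.gauss(mean_y, std_y)
                for _ in range(samples)]
    mean = sum(products) / samples
    empirical = sum((p - mean) ** 2 for p in products) / samples
    var_x, var_y = std_x ** 2, std_y ** 2
    predicted = var_x * var_y + var_x * mean_y ** 2 + var_y * mean_x ** 2
    return empirical, predicted


zero_mean = product_variance_check(0.0, 2.0, 0.0, 3.0)     # predicted: 4 * 9 = 36
shifted = product_variance_check(1.0, 2.0, -2.0, 3.0)      # predicted: 36 + 16 + 9 = 61
print(zero_mean)
print(shifted)
```

With zero means the two extra terms vanish, which is the simplified form used in the derivation that follows.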
Take node $H_{11}$ in the figure as an example:
[Figure 2]
The input in this experiment is: inputs = torch.randn((batch_size, neural_nums)) # normal: mean=0, std=1
so its mean is 0 and its variance is 1. The weights are initialized with nn.init.normal_ and also have mean 0 and variance 1. The condition $E(X)=0, E(Y)=0$ therefore holds, which gives $D(X * Y)=D(X) * D(Y)$.
The derivation is as follows:
$$
\begin{aligned}
\mathrm{H}_{11} &= \sum_{i=1}^{n} X_{i} * W_{1 i} \\
D\left(\mathrm{H}_{11}\right) &= \sum_{i=1}^{n} D\left(X_{i}\right) * D\left(W_{1 i}\right) \\
&= n *(1 * 1) \\
&= n \\
\operatorname{std}\left(\mathrm{H}_{11}\right) &= \sqrt{D\left(\mathrm{H}_{11}\right)}=\sqrt{n}
\end{aligned}
$$
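Conclusion:
Each layer of the forward pass multiplies the variance by $n$ and the standard deviation by $\sqrt{n}$.
Verification: look again at the output above: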

layer:0, std:15.959932327270508
layer:1, std:256.6237487792969
layer:2, std:4107.24560546875

The std of layer 2 is about 16 ($\sqrt{256}$) times the std of layer 1.
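The same check can be done with a few lines of arithmetic. The sketch below copies the first three std values from the log above, computes the ratios between consecutive layers, and then estimates at which layer the predicted std leaves the float32 range. The helper `first_overflow_layer` is a hypothetical name, and the assumption that the std after layer k equals sqrt(n) ** (k + 1) is the idealized rule from the derivation, not an exact property of the run.

```python
import math

# std values copied from the log above (layers 0, 1 and 2)
observed_std = [15.959932327270508, 256.6237487792969, 4107.24560546875]
neural_num = 256

ratios = [later / earlier for earlier, later in zip(observed_std, observed_std[1:])]
print(ratios, math.sqrt(neural_num))  # both ratios are close to 16

FLOAT32_MAX = (2 - 2 ** -23) * 2.0 ** 127  # largest finite float32, about 3.4e38


def first_overflow_layer(n, max_value=FLOAT32_MAX):
    """0-based index of the first layer whose predicted std exceeds max_value."""
    growth = math.sqrt(n)
    layer, std = 0, growth  # predicted std after layer 0
    while std <= max_value:
        layer += 1
        std *= growth
    return layer


overflow_layer = first_overflow_layer(neural_num)
print(overflow_layer)  # 31, the layer at which the log above reports nan
```

Under this idealized growth rule the predicted std passes the float32 limit at layer 31, which agrees with the layer reported in the experiment.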
To keep the scale of each layer's output unchanged, the variance should stay equal to 1 (1 is the only value that remains 1 after being multiplied by itself any number of times), i.e. the following condition must hold:
$D\left(\mathrm{H}_{1}\right)=n * D(X) * D(W)=1$
With $D(X)=1$, this gives:
$D(W)=\frac{1}{n} \Rightarrow \operatorname{std}(W)=\sqrt{\frac{1}{n}}$
That is, the weights $W$ should be initialized with standard deviation $\sqrt{1/n}$.
Setting: nn.init.normal_(m.weight.data, std=np.sqrt(1/self.neural_num)) # normal: mean=0, std=sqrt(1/n)
Output:

layer:0, std:0.9974957704544067
layer:1, std:1.0024365186691284
layer:2, std:1.002745509147644
layer:3, std:1.0006227493286133
layer:4, std:0.9966009855270386
layer:5, std:1.019859790802002
layer:6, std:1.0261738300323486
layer:7, std:1.0250457525253296
layer:8, std:1.0378952026367188
layer:9, std:1.0441951751708984
layer:10, std:1.0181655883789062
......
layer:94, std:1.031973123550415
layer:95, std:1.0413124561309814
layer:96, std:1.0817031860351562
layer:97, std:1.1287994384765625
layer:98, std:1.1617799997329712
layer:99, std:1.2215300798416138
tensor([[-1.0696, -1.1373,  0.5047,  ..., -0.4766,  1.5904, -0.1076],
        [ 0.4572,  1.6211,  1.9660,  ..., -0.3558, -1.1235,  0.0979],
        [ 0.3909, -0.9998, -0.8680,  ..., -2.4161,  0.5035,  0.2814],
        ...,
        [ 0.1876,  0.7971, -0.5918,  ...,  0.5395, -0.8932,  0.1211],
        [-0.0102, -1.5027, -2.6860,  ...,  0.6954, -0.1858, -0.8027],
        [-0.5871, -1.3739, -2.9027,  ...,  1.6734,  0.5094, -0.9986]],
       grad_fn=<MmBackward>)

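The output values are within a reasonable range and the standard deviation stays close to 1 at every layer, which confirms the derivation.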
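The effect of the weight std can also be reproduced without PyTorch. The following is a small pure-Python sketch of a stack of bias-free linear layers. The function `layer_stds` and the reduced sizes (32 neurons, 10 layers, batch of 32) are assumptions chosen to keep the run time short, and the std here is the population std, whereas `x.std()` in PyTorch uses the unbiased estimate, a negligible difference at this size.

```python
import math
import random


def layer_stds(neural_num, layers, weight_std, batch_size=32, seed=0):
    """Std of the data after each bias-free linear layer (no activation function)."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(neural_num)] for _ in range(batch_size)]
    stds = []
    for _ in range(layers):
        # Each row of `weight` holds the incoming weights of one output neuron.
        weight = [[rng.gauss(0.0, weight_std) for _ in range(neural_num)]
                  for _ in range(neural_num)]
        x = [[sum(w * v for w, v in zip(row, sample)) for row in weight]
             for sample in x]
        flat = [v for sample in x for v in sample]
        mean = sum(flat) / len(flat)
        stds.append(math.sqrt(sum((v - mean) ** 2 for v in flat) / len(flat)))
    return stds


n = 32
unit_stds = layer_stds(n, 10, weight_std=1.0)
scaled_stds = layer_stds(n, 10, weight_std=math.sqrt(1 / n))
print(unit_stds[-1])    # grows roughly like sqrt(n) ** 10
print(scaled_stds[-1])  # stays of order 1
```

With such a narrow network the std drifts more from layer to layer than in the 256-neuron experiment, but the contrast between the two settings is the same.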

  • Next, add an activation function

Setting:

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.tanh(x)
            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):
                print("output is nan in {} layers".format(i))
                break
        return x

Output:

layer:0, std:0.6273701786994934
layer:1, std:0.48910173773765564
layer:2, std:0.4099564850330353
layer:3, std:0.35637012124061584
layer:4, std:0.32117360830307007
layer:5, std:0.2981105148792267
layer:6, std:0.27730831503868103
......
layer:94, std:0.07276967912912369
layer:95, std:0.07259567826986313
layer:96, std:0.07586522400379181
layer:97, std:0.07769151031970978
layer:98, std:0.07842091470956802
layer:99, std:0.08206240087747574
tensor([[-0.1103, -0.0739,  0.1278,  ..., -0.0508,  0.1544, -0.0107],
        [ 0.0807,  0.1208,  0.0030,  ..., -0.0385, -0.1887, -0.0294],
        [ 0.0321, -0.0833, -0.1482,  ..., -0.1133,  0.0206,  0.0155],
        ...,
        [ 0.0108,  0.0560, -0.1099,  ...,  0.0459, -0.0961, -0.0124],
        [ 0.0398, -0.0874, -0.2312,  ...,  0.0294, -0.0562, -0.0556],
        [-0.0234, -0.0297, -0.1155,  ...,  0.1143,  0.0083, -0.0675]],
       grad_fn=<TanhBackward>)

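The standard deviation, and with it the scale of the data, keeps shrinking from layer to layer, which eventually leads to vanishing gradients.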
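One way to see why the scale shrinks is to iterate the variance through the layers directly. The sketch below is a simplified model, not the course's method. It assumes wide layers whose pre-activations are approximately zero-mean Gaussian, weights with D(W) = 1/n, and it estimates E[tanh(z)^2] by Monte Carlo. The function name `tanh_std_recursion` is hypothetical.

```python
import math
import random


def tanh_std_recursion(layers, samples=20_000, seed=0):
    """Approximate std after each tanh layer when n * D(W) = 1."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(samples)]
    second_moment = 1.0  # the inputs are N(0, 1)
    stds = []
    for _ in range(layers):
        # Pre-activation variance: n * D(W) * E[x^2] = E[x^2] of the previous layer.
        scale = math.sqrt(second_moment)
        second_moment = sum(math.tanh(scale * v) ** 2 for v in z) / samples
        stds.append(math.sqrt(second_moment))
    return stds


predicted_stds = tanh_std_recursion(10)
for i, s in enumerate(predicted_stds):
    print("layer:{}, predicted std:{:.3f}".format(i, s))
```

Because |tanh(z)| is smaller than |z| for every non-zero z, each layer returns a smaller second moment than it received, so the predicted std decreases at every step, in line with the trend in the log above.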

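II. Xavier and Kaiming Methods

1. Xavier initialization

  • Variance consistency: keep the scale of the data within a suitable range, usually variance 1
  • Activation functions: saturating functions such as Sigmoid and Tanh
  • Reference: "Understanding the difficulty of training deep feedforward neural networks"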

Considering both the forward and the backward pass, together with the variance-consistency criterion, gives the equations:
$n_{i} * D(W)=1$
$n_{i+1} * D(W)=1$
where $n_{i}$ is the number of input neurons and $n_{i+1}$ is the number of output neurons.
Both conditions can hold at the same time only when $n_{i}=n_{i+1}$; as a compromise, adding the two equations gives: $\Rightarrow D(W)=\frac{2}{n_{i}+n_{i+1}}$
Xavier usually uses a uniform distribution:
$W \sim U[-a, a]$
$D(W)=\frac{(-a-a)^{2}}{12}=\frac{(2 a)^{2}}{12}=\frac{a^{2}}{3}$
Setting this $D(W)$ equal to the $D(W)$ above gives the lower and upper bounds of the distribution:
$\frac{2}{n_{i}+n_{i+1}}=\frac{a^{2}}{3} \Rightarrow a=\frac{\sqrt{6}}{\sqrt{n_{i}+n_{i+1}}}$
$\Rightarrow W \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_{i}+n_{i+1}}}, \frac{\sqrt{6}}{\sqrt{n_{i}+n_{i+1}}}\right]$
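The bound derived above is easy to verify numerically. The sketch below uses only the standard library. The helper name `xavier_uniform_bound` and its `gain` argument are assumptions for this illustration (the `gain` factor mirrors the way the notes multiply `a` by the tanh gain). It computes the bound for a 256-to-256 layer and checks that samples from U[-a, a] have variance close to 2 / (n_i + n_{i+1}).

```python
import math
import random


def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    """Bound a of U[-a, a] such that D(W) = gain^2 * 2 / (fan_in + fan_out)."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


fan_in = fan_out = 256
a = xavier_uniform_bound(fan_in, fan_out)

rng = random.Random(0)
weights = [rng.uniform(-a, a) for _ in range(100_000)]
mean = sum(weights) / len(weights)
variance = sum((w - mean) ** 2 for w in weights) / len(weights)

target_variance = 2.0 / (fan_in + fan_out)
print(a)                                       # bound of the uniform distribution
print(variance, a ** 2 / 3, target_variance)   # all three should nearly agree
```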

Setting:

                a = np.sqrt(6 / (self.neural_num + self.neural_num))                
                tanh_gain = nn.init.calculate_gain('tanh')
                a *= tanh_gain                
                nn.init.uniform_(m.weight.data, -a, a)

Output:

layer:0, std:0.7571136355400085
layer:1, std:0.6924336552619934
layer:2, std:0.6677976846694946
layer:3, std:0.6551960110664368
layer:4, std:0.655646800994873
layer:5, std:0.6536089777946472
layer:6, std:0.6500504612922668
......
layer:95, std:0.6516367793083191
layer:96, std:0.643530011177063
layer:97, std:0.6426344513893127
layer:98, std:0.6408163905143738
layer:99, std:0.6442267298698425
tensor([[ 0.1155,  0.1244,  0.8218,  ...,  0.9404, -0.6429,  0.5177],
        [-0.9576, -0.2224,  0.8576,  ..., -0.2517,  0.9361,  0.0118],
        [ 0.9484, -0.2239,  0.8746,  ..., -0.9592,  0.7936,  0.6285],
        ...,
        [ 0.7192,  0.0835, -0.4407,  ..., -0.9590,  0.2557,  0.5419],
        [-0.9546,  0.5104, -0.8002,  ..., -0.4366, -0.6098,  0.9672],
        [ 0.6085,  0.3967,  0.1099,  ...,  0.3905, -0.5264,  0.0729]],
       grad_fn=<TanhBackward>)

The standard deviation stays around 0.65, neither too large nor too small.

Using PyTorch's built-in method:
tanh_gain = nn.init.calculate_gain('tanh')
nn.init.xavier_uniform_(m.weight.data, gain=tanh_gain)
Output:

layer:0, std:0.7571136355400085
layer:1, std:0.6924336552619934
layer:2, std:0.6677976846694946
layer:3, std:0.6551960110664368
layer:4, std:0.655646800994873
layer:5, std:0.6536089777946472
layer:6, std:0.6500504612922668
......
layer:95, std:0.6516367793083191
layer:96, std:0.643530011177063
layer:97, std:0.6426344513893127
layer:98, std:0.6408163905143738
layer:99, std:0.6442267298698425
tensor([[ 0.1155,  0.1244,  0.8218,  ...,  0.9404, -0.6429,  0.5177],
        [-0.9576, -0.2224,  0.8576,  ..., -0.2517,  0.9361,  0.0118],
        [ 0.9484, -0.2239,  0.8746,  ..., -0.9592,  0.7936,  0.6285],
        ...,
        [ 0.7192,  0.0835, -0.4407,  ..., -0.9590,  0.2557,  0.5419],
        [-0.9546,  0.5104, -0.8002,  ..., -0.4366, -0.6098,  0.9672],
        [ 0.6085,  0.3967,  0.1099,  ...,  0.3905, -0.5264,  0.0729]],
       grad_fn=<TanhBackward>)

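This matches the manually computed result. Note: this initialization method mainly targets saturating activation functions. With a non-saturating activation function in the forward pass, the scale of the data grows rapidly; for example, set x = torch.relu(x) in forward.
Output: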

layer:0, std:0.9689465165138245
layer:1, std:1.0872339010238647
layer:2, std:1.2967970371246338
......
layer:95, std:3661650.25
layer:96, std:4741351.5
layer:97, std:5300344.0
layer:98, std:6797731.0
layer:99, std:7640649.5
tensor([[       0.0000,  3028669.0000, 12379584.0000,  ...,
          3593904.7500,        0.0000, 24658918.0000],
        [       0.0000,  2758812.2500, 11016996.0000,  ...,
          2970391.2500,        0.0000, 23173852.0000],
        [       0.0000,  2909405.2500, 13117483.0000,  ...,
          3867146.2500,        0.0000, 28463464.0000],
        ...,
        [       0.0000,  3913313.2500, 15489625.0000,  ...,
          5777772.0000,        0.0000, 33226552.0000],
        [       0.0000,  3673757.2500, 12739668.0000,  ...,
          4193462.0000,        0.0000, 26862394.0000],
        [       0.0000,  1913936.2500, 10243701.0000,  ...,
          4573383.5000,        0.0000, 22720464.0000]],
       grad_fn=<ReluBackward0>)
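A back-of-the-envelope estimate explains this growth. The sketch below is a simplified model, not an exact reproduction of the run. It assumes zero-mean, symmetric pre-activations, so that ReLU keeps half of their second moment, and it uses the tanh gain of 5/3 that `nn.init.calculate_gain('tanh')` prints later in these notes. The variable names are chosen for this example.

```python
import math

tanh_gain = 5.0 / 3.0  # value reported by nn.init.calculate_gain('tanh') in these notes
n = 256

# D(W) for Xavier initialization with fan_in = fan_out = n, scaled by the gain.
weight_var = tanh_gain ** 2 * 2.0 / (n + n)

# For a zero-mean symmetric pre-activation z: E[relu(z)^2] = D(z) / 2.
# The next pre-activation then has D(z') = n * D(W) * E[relu(z)^2].
second_moment_factor = n * weight_var / 2.0
scale_factor = math.sqrt(second_moment_factor)

print(scale_factor)         # per-layer growth of the data scale, about 1.18
print(scale_factor ** 100)  # order of 1e7 after 100 layers
```

A growth of roughly 18% per layer compounds to the order of 10^7 over 100 layers, the same order of magnitude as the final std in the log above. Removing this factor is the purpose of the initialization described next.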

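2. Kaiming initialization

  • Variance consistency: keep the scale of the data within a suitable range, usually variance 1
  • Activation functions: ReLU and its variants
  • Reference: "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification"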

The derivation gives the variance:
$D(W)=\frac{2}{n_{i}}$
For ReLU variants with slope $a$ on the negative half-axis (plain ReLU has slope 0 there), this becomes:
$D(W)=\frac{2}{\left(1+a^{2}\right) \cdot n_{i}}$
$\operatorname{std}(W)=\sqrt{\frac{2}{\left(1+a^{2}\right) \cdot n_{i}}}$
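The factor (1 + a^2) can be motivated numerically. The following sketch is an illustration under the assumption of standard normal pre-activations. The function names `kaiming_std` and `leaky_relu_second_moment` are hypothetical and are not PyTorch APIs. It estimates E[f(z)^2] for a leaky ReLU f and compares it with (1 + a^2) / 2, the fraction of the second moment that the activation keeps.

```python
import math
import random


def kaiming_std(fan_in, negative_slope=0.0):
    """std(W) = sqrt(2 / ((1 + a^2) * fan_in)), a being the negative-half-axis slope."""
    return math.sqrt(2.0 / ((1.0 + negative_slope ** 2) * fan_in))


def leaky_relu_second_moment(negative_slope, samples=100_000, seed=0):
    """Monte Carlo estimate of E[f(z)^2] for z ~ N(0, 1) and f = leaky ReLU."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        z = rng.gauss(0.0, 1.0)
        out = z if z > 0 else negative_slope * z
        total += out * out
    return total / samples


for slope in (0.0, 0.1):
    kept = leaky_relu_second_moment(slope)
    # n * D(W) * (kept fraction) should be close to 1, so the scale is preserved.
    print(slope, kaiming_std(256, slope), kept, (1 + slope ** 2) / 2,
          256 * kaiming_std(256, slope) ** 2 * kept)
```

Since the activation keeps a fraction (1 + a^2) / 2 of the second moment, choosing D(W) = 2 / ((1 + a^2) n_i) makes the product of the two factors equal to 1.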
Setting: nn.init.normal_(m.weight.data, std=np.sqrt(2 / self.neural_num))
Output:

layer:0, std:0.826629638671875
layer:1, std:0.878681480884552
layer:2, std:0.9134420156478882
layer:3, std:0.8892467617988586
layer:4, std:0.8344276547431946
layer:5, std:0.87453693151474
......
layer:94, std:0.595414936542511
layer:95, std:0.6624482870101929
layer:96, std:0.6377813220024109
layer:97, std:0.6079217195510864
layer:98, std:0.6579239368438721
layer:99, std:0.6668398976325989
tensor([[0.0000, 1.3437, 0.0000,  ..., 0.0000, 0.6444, 1.1867],
        [0.0000, 0.9757, 0.0000,  ..., 0.0000, 0.4645, 0.8594],
        [0.0000, 1.0023, 0.0000,  ..., 0.0000, 0.5147, 0.9196],
        ...,
        [0.0000, 1.2873, 0.0000,  ..., 0.0000, 0.6454, 1.1411],
        [0.0000, 1.3588, 0.0000,  ..., 0.0000, 0.6749, 1.2437],
        [0.0000, 1.1807, 0.0000,  ..., 0.0000, 0.5668, 1.0600]],
       grad_fn=<ReluBackward0>)

The values are moderate, neither too large nor too small.
Using PyTorch's built-in method: nn.init.kaiming_normal_(m.weight.data)
Output:

layer:0, std:0.826629638671875
layer:1, std:0.878681480884552
layer:2, std:0.9134420156478882
layer:3, std:0.8892467617988586
layer:4, std:0.8344276547431946
layer:5, std:0.87453693151474
.......
layer:94, std:0.595414936542511
layer:95, std:0.6624482870101929
layer:96, std:0.6377813220024109
layer:97, std:0.6079217195510864
layer:98, std:0.6579239368438721
layer:99, std:0.6668398976325989
tensor([[0.0000, 1.3437, 0.0000,  ..., 0.0000, 0.6444, 1.1867],
        [0.0000, 0.9757, 0.0000,  ..., 0.0000, 0.4645, 0.8594],
        [0.0000, 1.0023, 0.0000,  ..., 0.0000, 0.5147, 0.9196],
        ...,
        [0.0000, 1.2873, 0.0000,  ..., 0.0000, 0.6454, 1.1411],
        [0.0000, 1.3588, 0.0000,  ..., 0.0000, 0.6749, 1.2437],
        [0.0000, 1.1807, 0.0000,  ..., 0.0000, 0.5668, 1.0600]],
       grad_fn=<ReluBackward0>)

This matches the result of the manual initialization.

III. Common Initialization Methods

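  1. Xavier uniform distribution
  2. Xavier normal distribution
  3. Kaiming uniform distribution
  4. Kaiming normal distribution
  5. Uniform distribution
  6. Normal distribution
  7. Constant initialization
  8. Orthogonal matrix initialization
  9. Identity matrix initialization
  10. Sparse matrix initialization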

1. nn.init.calculate_gain

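Main function: compute the factor by which an activation function changes the variance scale of the data
Main parameters:

  • nonlinearity: name of the activation function
  • param: parameter of the activation function, e.g. negative_slope for Leaky ReLU

Test code: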

# ======================================= calculate gain =======================================

# flag = 0
flag = 1

if flag:

    x = torch.randn(10000)
    out = torch.tanh(x)

    gain = x.std() / out.std()
    print('gain:{}'.format(gain))

    tanh_gain = nn.init.calculate_gain('tanh')
    print('tanh_gain in PyTorch:', tanh_gain)

Output:

gain:1.5982500314712524
tanh_gain in PyTorch: 1.6666666666666667
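The measured ratio and the library value do not coincide. One likely reason is that the measured ratio depends on the scale of the input, while the library returns the fixed constant 5/3 for tanh. The sketch below repeats the measurement in plain Python for two input scales. The helper names `std` and `empirical_gain` and the `input_std` parameter are assumptions made for this illustration.

```python
import math
import random


def std(values):
    """Population standard deviation of a list of floats."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def empirical_gain(activation, input_std=1.0, samples=100_000, seed=0):
    """Ratio std(x) / std(activation(x)) for x ~ N(0, input_std^2)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, input_std) for _ in range(samples)]
    return std(x) / std([activation(v) for v in x])


gain_unit = empirical_gain(math.tanh, input_std=1.0)
gain_small = empirical_gain(math.tanh, input_std=0.1)
print(gain_unit)   # roughly 1.6 for unit-variance input
print(gain_small)  # close to 1, because tanh is nearly linear for small inputs
print(5 / 3)       # the constant reported by calculate_gain('tanh')
```

For unit-variance input the ratio is close to the 1.598 measured above, and it moves toward 1 as the input scale shrinks, so the measured value should be read as an estimate for one particular input distribution.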
