Hardware Environment(Ascend/GPU/CPU): GPU
Software Environment:
– MindSpore version (source or binary): 1.5.0
– Python version (e.g., Python 3.7.5): 3.7.6
– OS platform and distribution (e.g., Linux Ubuntu 16.04): Ubuntu 4.15.0-74-generic
– GCC/Compiler version (if compiled from source):
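The training script builds a single-operator GroupNorm network and computes group normalization over a mini-batch input. The script is as follows: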
01 context.set_context(mode=context.PYNATIVE_MODE, device_target="Ascend")
02 def get_x():
03     arr = np.random.randint(1, 20, size=(1, 4, 16, 16))
04     x = Tensor(arr, dtype=ms.float32)
05     return x
06
07 class GroupNormTest(nn.Cell):
08     def __init__(self):
09         super(GroupNormTest, self).__init__()
10         self.print = ops.Print()
11         self.norm1 = nn.GroupNorm(num_groups=4, num_channels=4, affine=False)
12
13     def construct(self, x):
14         self.print(self.norm1.gamma)
15         x = self.norm1(x)
16         return x
17
18 groupnorm_op = GroupNormTest()
19 g2 = nn.GroupNorm(num_groups=4, num_channels=4, affine=False)
20 x = get_x()
21 output = groupnorm_op(x)
22 print(f"Output: {output}")
Judging by the benchmark operator, the output Tensor should not be all zeros, yet the script prints the following:
Output: [[[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]],
[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]],
[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]],
[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]]])
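For comparison, the expected result can be approximated without MindSpore. The following is a minimal NumPy sketch of group normalization with no affine scale or shift, not the benchmark operator itself. The function name `group_norm_reference` and the default `eps=1e-5` are assumptions made for illustration.

```python
import numpy as np

def group_norm_reference(x, num_groups, eps=1e-5):
    """Group normalization without affine scale/shift (gamma=1, beta=0).

    eps=1e-5 is an assumed default, chosen only for this sketch.
    """
    n, c, h, w = x.shape
    assert c % num_groups == 0, "channels must be divisible by num_groups"
    # Flatten each group's channels and pixels into a single axis.
    grouped = x.reshape(n, num_groups, -1).astype(np.float32)
    mean = grouped.mean(axis=2, keepdims=True)
    var = grouped.var(axis=2, keepdims=True)
    normalized = (grouped - mean) / np.sqrt(var + eps)
    return normalized.reshape(n, c, h, w)

# Same input shape and value range as get_x() in the script above.
rng = np.random.default_rng(0)
x = rng.integers(1, 20, size=(1, 4, 16, 16)).astype(np.float32)
expected = group_norm_reference(x, num_groups=4)
print(np.count_nonzero(expected) > 0)
```

Each group in `expected` has a mean close to 0 and a standard deviation close to 1, so an all-zero Tensor cannot be a correct result for this input.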
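Cause analysis
In MindSpore 1.5, the Tensor is created and used in construct, as shown on line 18 of the script.
The printed result is all zeros. Debugging showed that the operator's internal implementation has a bug: when the affine parameter of GroupNorm is False, the operator only calls initializer on its internal Tensor data and never performs the actual initialization. The Tensor therefore has a corresponding data entry but no concrete values; in other words, the follow-up Parameter handling is missing.
This problem has been fixed on the master branch and in r1.5. The fix changes the operator's internal initialization flow so that the data is initialized properly.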
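The effect of a declared but never materialized buffer can be illustrated with a toy model. This is a sketch under stated assumptions, not MindSpore's implementation: `LazyTensor`, its `init_data` method, and `scale` are hypothetical names, and the placeholder buffer is assumed to read as zeros.

```python
import numpy as np

class LazyTensor:
    """Hypothetical stand-in: records how to fill a buffer without filling it."""

    def __init__(self, fill_value, shape):
        self.fill_value = fill_value
        self.shape = shape
        # Placeholder buffer; assumed to read as zeros until materialized.
        self.data = np.zeros(shape, dtype=np.float32)
        self.materialized = False

    def init_data(self):
        """Run the recorded initialization and return self."""
        self.data = np.full(self.shape, self.fill_value, dtype=np.float32)
        self.materialized = True
        return self

def scale(normalized, gamma):
    """Apply per-channel scaling, the last step of a normalization layer."""
    return normalized * gamma.data.reshape(1, -1, 1, 1)

normalized = np.ones((1, 4, 2, 2), dtype=np.float32)

buggy_gamma = LazyTensor(1.0, (4,))              # declared, never materialized
fixed_gamma = LazyTensor(1.0, (4,)).init_data()  # actually filled with ones

buggy_out = scale(normalized, buggy_gamma)
fixed_out = scale(normalized, fixed_gamma)
print(buggy_out.any(), fixed_out.any())
```

In this toy model, multiplying by the unmaterialized scale wipes out every value, which matches the all-zero symptom; once the buffer is filled, the normalized values pass through unchanged.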
With the fix described above, the problem is resolved. Running the example script from 1.2.1 above now prints:
Output: [[[[-0.9678677 -0.22491677 -0.4106545 ... -1.5250808 -1.1536053
0.7037718 ]
[-0.03917905 -0.9678677 0.14655867 ... 1.0752473 1.0752473
0.5180341 ]
[ 0.3322964 0.5180341 -1.7108185 ... 1.6324605 1.260985
-0.03917905]
...
[-0.03917905 1.0752473 -0.78212994 ... -1.1536053 -0.5963922
-0.03917905]
[ 0.14655867 -1.7108185 1.4467227 ... 1.260985 0.7037718
-1.1536053 ]
[ 1.0752473 -1.1536053 0.3322964 ... -0.4106545 -0.03917905
-0.5963922 ]]
[[ 0.8708671 1.4133049 -0.7564466 ... -0.9372592 1.2324923
1.0516797 ]
[-0.7564466 0.5092418 0.3284292 ... 0.14761657 -0.39482132
-0.9372592 ]
[ 1.5941175 1.4133049 -0.21400869 ... -0.03319607 0.8708671
0.14761657]
...
[-1.1180718 -1.4796971 0.6900544 ... 0.5092418 -0.03319607
0.14761657]
[-0.57563394 -1.4796971 1.0516797 ... -0.21400869 1.4133049
0.5092418 ]
[ 0.14761657 -0.03319607 -0.21400869 ... -0.57563394 -1.4796971
0.3284292 ]]
[[ 1.2368401 0.8667278 -1.1688899 ... -1.7240583 -0.6137214
1.2368401 ]
[ 1.4218963 -0.7987775 1.2368401 ... 1.4218963 1.6069524
1.6069524 ]
[-1.1688899 -0.7987775 -0.9838337 ... -1.353946 -1.1688899
0.49661553]
...
[ 0.8667278 -0.9838337 -1.7240583 ... 1.051784 -1.353946
-0.42866522]
[-1.7240583 -1.1688899 -1.1688899 ... -1.1688899 1.051784
0.12650323]
[ 1.051784 0.31155938 0.31155938 ... -0.24360907 -0.24360907
0.12650323]]
[[ 0.80021834 1.1713341 0.4291026 ... -1.0553604 -1.797592
0.05798684]
[ 1.356892 0.05798684 -1.0553604 ... 1.356892 0.6146605
1.356892 ]
[-0.6842447 -1.0553604 -1.2409184 ... -0.12757105 -0.6842447
0.05798684]
...
[ 0.05798684 -0.12757105 0.80021834 ... 0.24354473 1.1713341
-0.8698026 ]
[ 1.356892 0.6146605 1.356892 ... -0.8698026 0.80021834
0.98577625]
[-0.12757105 -0.31312892 0.80021834 ... 0.98577625 -0.31312892
-1.0553604 ]]]]
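Steps for locating the problem:
1. Confirm that the large gap between the benchmark operator's output and the tested operator's output is caused by the operator itself;
2. Ask the relevant people on the community platform, or file an issue, to get the bug fixed;
3. Pay particular attention to whether variables are defined and initialized correctly.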
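The first step above can be partly automated. The following sketch compares a benchmark output with the tested operator's output and singles out the all-zero case. The helper name `diagnose`, its return strings, and the tolerances are assumptions made for illustration.

```python
import numpy as np

def diagnose(benchmark, actual, rtol=1e-4, atol=1e-4):
    """Classify how an operator's output differs from a benchmark output."""
    benchmark = np.asarray(benchmark, dtype=np.float64)
    actual = np.asarray(actual, dtype=np.float64)
    if benchmark.shape != actual.shape:
        return "shape mismatch"
    if np.allclose(actual, benchmark, rtol=rtol, atol=atol):
        return "match"
    if benchmark.any() and not actual.any():
        # A constant-zero result points at uninitialized internal data.
        return "all zeros"
    return "value mismatch"

# Small hand-written arrays standing in for real operator outputs.
benchmark = np.array([[-0.97, 0.15], [0.33, 1.08]])
print(diagnose(benchmark, np.zeros_like(benchmark)))
print(diagnose(benchmark, benchmark + 1e-6))
```

An "all zeros" result is a strong hint to inspect how the operator defines and initializes its internal data before looking anywhere else.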