After migrating code from another framework to MindSpore, accuracy problems can be located by comparing the model outputs of the epoch with the highest validation accuracy. In the case below, the MindSpore validation accuracy is lower than that of the equivalent PyTorch code; the following describes how to compare the best-epoch model outputs to find the cause.
During training, save the model of every epoch in both PyTorch and MindSpore.
import os
import torch

model.eval()
if not os.path.isdir('checkpoint'):
    os.mkdir('checkpoint')
torch.save(model.state_dict(), './checkpoint/' + str(epoch) +
           "_" + str(my_val_accuracy) + '_resnet50.pth')
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig
config_ck = CheckpointConfig(save_checkpoint_steps=steps_per_epoch_train,
                             keep_checkpoint_max=epoch_max)
ckpoint_cb = ModelCheckpoint(prefix="train_resnet_cifar10", directory="./", config=config_ck)
Here save_checkpoint_steps is the number of steps executed per epoch, and keep_checkpoint_max is the maximum number of checkpoints to keep.
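For the checkpoints to actually be written, the callback has to be attached to training. A minimal sketch, assuming a mindspore.Model instance named model and a dataset named train_dataset (both names are illustrative):

# One .ckpt file is written every steps_per_epoch_train steps, i.e. once per epoch
model.train(epoch_max, train_dataset, callbacks=[ckpoint_cb])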
Then print the parameters of the best-accuracy epoch's model for MindSpore and PyTorch respectively.
from mindspore import load_checkpoint

param_dict = load_checkpoint("./train_mindspore/train_resnet_cifar10_1-162_391.ckpt")
for key in param_dict.keys():
    print(key, param_dict[key].data.asnumpy())
The output is as follows:
conv1.weight [[[[-0.06971329 -0.00755827 -0.02128288]
[-0.11358283 0.0838231 0.05153372]
[-0.16815324 0.21193054 0.2659021 ]]
...
import torch

# torch.load on a state_dict file returns an OrderedDict of parameter tensors
model = torch.load('./checkpoint/174_0_resnet50.pth')
print(model)
The output is as follows:
OrderedDict([('conv1.weight', tensor([[[[ 0.0798, 0.2771, -0.0230],
[-0.4563, -0.4013, 0.2395],
[-0.0804, 0.2683, 0.2198]],
...
Compare the two models' parameters item by item (a quick numeric check is sketched below). If some value is inconsistent, analyze the corresponding operator: search for it in the MindSpore API documentation and investigate the difference.
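Rather than eyeballing the printouts, the statistics of a parameter can also be compared numerically. A minimal sketch, reusing the param_dict and the OrderedDict model loaded above (the key conv1.weight is taken from the printouts):

import numpy as np

ms_w = param_dict["conv1.weight"].data.asnumpy()
pt_w = model["conv1.weight"].numpy()
# Compare summary statistics rather than raw values, since the two
# models were trained independently
print("MindSpore conv1.weight: mean %.5f, std %.5f" % (ms_w.mean(), ms_w.std()))
print("PyTorch   conv1.weight: mean %.5f, std %.5f" % (pt_w.mean(), pt_w.std()))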
MindSpore's built-in Print operator prints the Tensors or strings passed to it. It is used like any other operator: declare it in the network's __init__ and call it in construct.
A simple example of printing a forward output with the Print operator:
import mindspore.nn as nn
import mindspore.ops.operations as P

class ResNet(nn.Cell):
    def __init__(self, num_blocks, num_classes=10):
        super(ResNet, self).__init__()
        self.conv1 = _conv2d(3, 64, kernel_size=3, stride=1, padding=1)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU()
        self.print = P.Print()

    def construct(self, x_input):
        x_input = self.conv1(x_input)
        self.print('output_conv1:', x_input)
        out = self.relu(x_input)
        return out
The output is as follows:
output_conv1:
Tensor(shape=[1, 64, 32, 32], dtype=Float32, value=
[[[[ 2.72979736e-02 5.00488281e-02 9.80834961e-02 ... 1.38671875e-01 1.46850586e-01 5.52673340e-02]
[ 1.49291992e-01 1.49658203e-01 1.94824219e-01 ... 3.98193359e-01 3.90380859e-01 1.97875977e-01]
[ 7.49511719e-02 7.70263672e-02 1.22924805e-01 ... 2.86865234e-01 2.58056641e-01 1.22192383e-01]
...
A simple example of printing model weights with the Print operator:
class ResNet(nn.Cell):
    def __init__(self, num_blocks, num_classes=10):
        super(ResNet, self).__init__()
        self.conv1 = _conv2d(3, 64, kernel_size=3, stride=1, padding=1)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU()
        self.print = P.Print()

    def construct(self, x_input):
        x_input = self.conv1(x_input)
        self.print("self.conv1.weight:", self.conv1.weight)
        out = self.bn1(x_input)
        out = self.relu(out)
        return out
The output is as follows:
self.conv1.weight:
Tensor(shape=[64, 3, 3, 3], dtype=Float32, value=
[[[[ 1.87883265e-02 8.28264281e-02 3.95536423e-02]
[ 1.72755457e-02 -2.93852817e-02 5.61546721e-02]
[-2.40226928e-02 1.50793493e-01 1.78463876e-01]]
...
A simple example of printing a gradient in the backward pass with the Print operator. InsertGradientOf registers a function that is applied to the gradient flowing through that point, so the print is placed inside that function:
class ResNet(nn.Cell):
    def __init__(self, num_blocks, num_classes=10):
        super(ResNet, self).__init__()
        self.in_planes = 64
        self.conv1 = _conv2d(3, 64, kernel_size=3, stride=1, padding=1)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU()
        self.print = P.Print()
        self.print_grad = P.InsertGradientOf(self.save_gradient)

    def save_gradient(self, dout):
        # Called during the backward pass with the gradient at this point
        self.print("out_grad_conv1", dout)
        return dout

    def construct(self, x_input):
        x_input = self.conv1(x_input)
        x_input = self.print_grad(x_input)
        out = self.bn1(x_input)
        out = self.relu(out)
        return out
The output is as follows:
out_grad_conv1
Tensor(shape=[128, 64, 32, 32], dtype=Float32, value=
[[[[-3.97216797e-01 4.74853516e-02 2.53417969e-01 ... 5.80749512e-02 7.40356445e-02 2.26806641e-01]
[-1.38867188e+00 -1.12890625e+00 -7.66601562e-01 ... -7.64160156e-01 -8.21777344e-01 -2.39013672e-01]
[-1.13964844e+00 -9.60449219e-01 -6.69921875e-01 ... -9.59472656e-01 -1.27050781e+00 -8.31054688e-01]
...
The following walks through a problematic piece of MindSpore code and shows how to locate the issue by printing layer outputs with the Print operator.
Symptom: the MindSpore validation accuracy is lower than PyTorch's.
Validation accuracy: MindSpore 91.96%; PyTorch 92.58%.
Preparation
To guarantee that PyTorch and MindSpore receive identical input images, modify the code as follows (with matching changes on the PyTorch side).
(1) Set shuffle to False for the training dataset, as sketched below:
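A minimal sketch of what this looks like when building the CIFAR-10 dataset (the dataset path and batch size are illustrative):

import mindspore.dataset as ds

# shuffle=False keeps the sample order deterministic, so both frameworks
# see the same images in the same order
train_dataset = ds.Cifar10Dataset("./cifar-10-batches-bin", shuffle=False)
train_dataset = train_dataset.batch(128, drop_remainder=True)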
(2) Comment out the random augmentations in the data transforms:
import mindspore.dataset.transforms.c_transforms as C
import mindspore.dataset.vision.c_transforms as CV

transform_data = C.Compose([
    # CV.RandomCrop((32, 32), (4, 4, 4, 4)),
    # CV.RandomHorizontalFlip(),
    CV.Rescale(rescale, shift),
    CV.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    CV.HWC2CHW()])
Approach: print each layer's forward output with Print and compare the order of magnitude of the data between the two frameworks.
To print each layer's output, add self.print = P.Print() in __init__ and call it after each layer in the main network:
class ResNet(nn.Cell):
    def __init__(self, num_blocks, num_classes=10):
        super(ResNet, self).__init__()
        self.in_planes = 64
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, pad_mode='pad', weight_init='Uniform')
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU()
        self.layer1 = self._make_layer(64, num_blocks[0], stride=1)
        self.layer2 = self._make_layer(128, num_blocks[1], stride=2)
        self.layer3 = self._make_layer(256, num_blocks[2], stride=2)
        self.layer4 = self._make_layer(512, num_blocks[3], stride=2)
        self.avgpool2d = nn.AvgPool2d(kernel_size=4, stride=4)
        self.reshape = mindspore.ops.Reshape()
        self.linear = nn.Dense(2048, num_classes)
        self.print = P.Print()

    def _make_layer(self, planes, num_blocks, stride):
        strides = [stride] + [1]*(num_blocks-1)
        layers = []
        for stride in strides:
            layers.append(ResidualBlock(self.in_planes, planes, stride))
            self.in_planes = EXPANSION*planes
        return nn.SequentialCell(*layers)

    def construct(self, x_input):
        x_input = self.conv1(x_input)
        self.print('output_conv1:', x_input)
        out = self.bn1(x_input)
        out = self.relu(out)
        self.print('output_relu:', out)
        out = self.layer1(out)
        self.print('output_layer1:', out)
        out = self.layer2(out)
        self.print('output_layer2:', out)
        out = self.layer3(out)
        self.print('output_layer3:', out)
        out = self.layer4(out)
        self.print('output_layer4:', out)
        out = self.avgpool2d(out)
        self.print('output_avgpool2d:', out)
        out = self.reshape(out, (out.shape[0], 2048))
        out = self.linear(out)
        self.print('output_linear:', out)
        return out
Comparing the outputs for the first image shows that the order of magnitude of output_conv1 (the first convolution's output) differs between the two frameworks.
Problem 1: inconsistent convolution weight initialization
Comparing the convolution arguments in MindSpore and PyTorch shows that the weight initialization methods differ.
In MindSpore, the default weight initialization is the 'normal' method:
class mindspore.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, pad_mode="same", padding=0, dilation=1, group=1, has_bias=False, weight_init="normal", bias_init="zeros", data_format="NCHW")
In PyTorch, by contrast, nn.Conv2d initializes its weights by default with kaiming_uniform_ (a=√5), which is equivalent to sampling from U(−bound, bound) with bound = 1/√(in_channels × kernel_size²).
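For reference, PyTorch's convolution initializes its parameters roughly as follows (a paraphrase of _ConvNd.reset_parameters from the PyTorch source; details vary slightly across versions):

import math
import torch.nn.init as init

def reset_parameters(self):
    # Kaiming-uniform with a = sqrt(5) reduces to U(-1/sqrt(fan_in), 1/sqrt(fan_in))
    init.kaiming_uniform_(self.weight, a=math.sqrt(5))
    if self.bias is not None:
        fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1 / math.sqrt(fan_in)
        init.uniform_(self.bias, -bound, bound)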
Solution: change the convolution weight initialization in MindSpore to match PyTorch's.
Rewrite the convolution helper as follows:
import math
import mindspore.nn as nn
import mindspore.common.initializer

def _conv2d(in_channel, out_channel, kernel_size, stride=1, padding=0):
    # PyTorch's default bound: 1/sqrt(fan_in), with fan_in = in_channel * kernel_size^2
    scale = math.sqrt(1/(in_channel*kernel_size*kernel_size))
    if padding == 0:
        return nn.Conv2d(in_channel, out_channel, kernel_size=kernel_size,
                         stride=stride, padding=padding, pad_mode='same',
                         weight_init=mindspore.common.initializer.Uniform(scale=scale))
    else:
        return nn.Conv2d(in_channel, out_channel, kernel_size=kernel_size,
                         stride=stride, padding=padding, pad_mode='pad',
                         weight_init=mindspore.common.initializer.Uniform(scale=scale))
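To sanity-check the new range, the weights can be inspected right after construction (a minimal sketch):

conv = _conv2d(3, 64, kernel_size=3, stride=1, padding=1)
# init_data() materializes lazily initialized parameters on older MindSpore versions
w = conv.weight.init_data().asnumpy()
print(w.min(), w.max())  # values should lie within ±1/sqrt(3*3*3) ≈ ±0.19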
Problem 2: inconsistent weight and bias initialization in the fully connected layer
Following the pattern of problem 1, check whether other operators have similar issues. It turns out that the weight and bias initialization of MindSpore's nn.Dense differs from that of PyTorch's nn.Linear.
Solution: change the weight and bias initialization of MindSpore's fully connected layer to match PyTorch's.
Modify the nn.Dense construction as follows:
def _dense(in_channel, out_channel):
    scale = math.sqrt(1/in_channel)
    return nn.Dense(in_channel, out_channel,
                    weight_init=mindspore.common.initializer.Uniform(scale=scale),
                    bias_init=mindspore.common.initializer.Uniform(scale=scale))
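In the network definition, the default-initialized layers are then replaced with these helpers (a minimal sketch based on the ResNet __init__ shown above):

# In ResNet.__init__:
self.conv1 = _conv2d(3, 64, kernel_size=3, stride=1, padding=1)
self.linear = _dense(2048, num_classes)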
With the updated code, retrain from scratch and again print each layer's forward output.
Effect of the fixes
The forward outputs of all operators now agree in order of magnitude.
The MindSpore validation accuracy improves and now exceeds PyTorch's.
Validation accuracy: MindSpore 92.98%; PyTorch 92.58%.