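Reposted from: https://bbs.huaweicloud.com/forum/thread-106733-1-1.html
Author: skywalk
I was very happy to take part in the fifth two-day training camp. I learned a great deal, and there is a lot I still need to digest slowly. I started out ambitious, planning to finish all 11 assignments and win the mystery grand prize. I was simply curious what the mystery prize was, and hoped fate would satisfy my curiosity.
Unfortunately things did not go as I wished. The time for the camp assignments clashed with the Ascend model competition, and mainly because my own ability is limited, I was tuning models day and night and had no time at all for the assignments. When I finally had time over the last few days, my MindSpore environment broke, so I still could not do them. This morning I got up a little after 5 a.m. to do assignment 11, and now that the MindSpore environment is finally working again, I have uploaded the code screenshots for assignments 1 and 2, which I finished last month, so I can just about count 3 assignments as done.
What struck me is that most of the assignments cannot be done on a Mac laptop, which quietly makes them harder to complete. The good news is that I can now scrape together two MindSpore environments, and while writing this post it suddenly occurred to me that Colab should be able to run MindSpore too, so I will try that later. It goes to show that studying more and writing more study summaries not only improves learning efficiency but also helps open up new ideas!
The first two assignments follow the same approach: train the model, save a ckpt checkpoint file, then write a Python script that converts it to MindIR format and run the conversion. It looks simple, but in practice it still takes some wrestling.
Reference: AI框架中图层IR的分析 (An analysis of graph-level IR in AI frameworks)
https://zhuanlan.zhihu.com/p/263420069
First, to run LeNet, prepare the MNIST dataset: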
model_user14@34ffafe8-aacb-4c1f-87ed-966c50fed78d:~/ms/mindspore/model_zoo/official/cv/lenet/data$ gzip -d train-*
model_user14@34ffafe8-aacb-4c1f-87ed-966c50fed78d:~/ms/mindspore/model_zoo/official/cv/lenet/data$ ls
''$'\200\353' ''$'\374''ϐ993' P train-images-idx3-ubyte
''$'\375\374\374\261\226' ''$'\030\376\376''=' t10k-images-idx3-ubyte.gz train-labels-idx1-ubyte
''$'\367\360\305\027' '$_'$'\233\355\375\375\262'')' t10k-labels-idx1-ubyte.gz
model_user14@34ffafe8-aacb-4c1f-87ed-966c50fed78d:~/ms/mindspore/model_zoo/official/cv/lenet/data$ gzip -d t10k-*
model_user14@34ffafe8-aacb-4c1f-87ed-966c50fed78d:~/ms/mindspore/model_zoo/official/cv/lenet/data$
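The decompression step above can also be scripted. The following is a minimal sketch using only the Python standard library; the function name `decompress_gz_files` is hypothetical and not part of the LeNet example code.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gz_files(data_dir):
    """Decompress every *.gz file in data_dir next to its archive.

    Returns the list of paths written. The original .gz files are kept.
    """
    written = []
    for gz_path in sorted(Path(data_dir).glob("*.gz")):
        out_path = gz_path.with_suffix("")  # strip the trailing .gz
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(out_path)
    return written
```

Calling `decompress_gz_files("data")` from the lenet directory would be the equivalent of running `gzip -d` on each archive, except that the archives are left in place.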
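Run training: python train.py --data_path Data
Error: the path does not exist or permission is denied.
It turned out the path had the wrong letter case (the directory is data, not Data), so I ran it again with the corrected path.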
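A letter-case mistake like this is easy to spot with a small check before launching training. This is only a sketch; `suggest_case_match` is a hypothetical helper, not something train.py provides.

```python
import os


def suggest_case_match(parent, name):
    """Return the entries in parent whose names equal name when case is ignored."""
    return [entry for entry in os.listdir(parent) if entry.lower() == name.lower()]
```

In the lenet directory, `suggest_case_match(".", "Data")` would list `data` as the likely intended path.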
The next error came from the conversion script:
model_user14@34ffafe8-aacb-4c1f-87ed-966c50fed78d:~/ms/mindspore/model_zoo/official/cv/lenet$ python lenet2mindir.py
[WARNING] ME(34535:281473740427280,MainProcess):2021-01-25-00:54:32.303.511 [mindspore/_check_version.py:207] MindSpore version 1.1.0 and "te" wheel package version 1.0 does not match, reference to the match info on: https://www.mindspore.cn/install
MindSpore version 1.1.0 and "topi" wheel package version 0.6.0 does not match, reference to the match info on: https://www.mindspore.cn/install
Traceback (most recent call last):
File "lenet2mindir.py", line 17, in
load_checkpoint("ckpt/checkpoint_lenet-10_1875.ckpt", net=lenetwork())
File "/usr/local/lib/python3.7/site-packages/mindspore/nn/cell.py", line 360, in __call__
output = self.construct(*cast_inputs, **kwargs)
TypeError: construct() missing 1 required positional argument: 'x'
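The cause was a copying mistake when I typed in the code:
load_checkpoint("ckpt/checkpoint_lenet-10_1875.ckpt", net=lenetwork)
had been written as load_checkpoint("ckpt/checkpoint_lenet-10_1875.ckpt", net=lenetwork())
The extra parentheses call the network with no input, which is why construct() complains about the missing argument 'x'. I fixed the code by removing the parentheses.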
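The traceback shows that calling a network instance goes through `__call__` into `construct`. The sketch below reproduces the same kind of TypeError with a plain Python stand-in class; `TinyCell` and `call_without_input` are hypothetical names, and `TinyCell` is not MindSpore's `nn.Cell`.

```python
class TinyCell:
    """Stand-in for a network class; NOT MindSpore's nn.Cell."""

    def __call__(self, *inputs):
        # Calling the instance forwards its arguments to construct
        return self.construct(*inputs)

    def construct(self, x):
        return x


net = TinyCell()


def call_without_input(cell):
    """Call the cell with no arguments and return the TypeError text, if any."""
    try:
        cell()  # same mistake as passing net=lenetwork()
    except TypeError as err:
        return str(err)
    return None
```

Passing `net` hands over the object itself, while `net()` tries to run a forward pass first, so the second form fails before `load_checkpoint` is even entered.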
Next error:
Traceback (most recent call last):
File "lenet2mindir.py", line 19, in
export(lenetwork, Tensor(input), file_name="lenet", file_format="MINDIR")
File "/usr/local/lib/python3.7/site-packages/mindspore/train/serialization.py", line 537, in export
_export(net, file_name, file_format, *inputs)
File "/usr/local/lib/python3.7/site-packages/mindspore/train/serialization.py", line 575, in _export
graph_id, _ = _executor.compile(net, *inputs, phase=phase_name, do_convert=False)
File "/usr/local/lib/python3.7/site-packages/mindspore/common/api.py", line 502, in compile
result = self._executor.compile(obj, args_list, phase, use_vm)
File "/usr/local/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 388, in __infer__
out[track] = fn(*(x[track] for x in args))
File "/usr/local/lib/python3.7/site-packages/mindspore/ops/operations/nn_ops.py", line 1208, in infer_shape
Rel.EQ, self.name)
File "/usr/local/lib/python3.7/site-packages/mindspore/_checkparam.py", line 206, in check
raise excp_cls(f'{msg_prefix} `{arg_name}` should be {rel_str}, but got {arg_value}.')
ValueError: For 'Conv2D' the `x_shape[1] / group` should be == w_shape[1]: 1, but got 3.
model_user14@34ffafe8-aacb-4c1f-87ed-966c50fed78d:~/ms/mindspore/model_zoo/official/cv/lenet$
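The convolution weights expect 1 input channel but the input had 3, so I changed the channel dimension of the shape to 1: size=[32, 1, 224, 224]
A new error: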
Traceback (most recent call last):
File "lenet2mindir.py", line 19, in
export(lenetwork, Tensor(input), file_name="lenet", file_format="MINDIR")
File "/usr/local/lib/python3.7/site-packages/mindspore/train/serialization.py", line 537, in export
_export(net, file_name, file_format, *inputs)
File "/usr/local/lib/python3.7/site-packages/mindspore/train/serialization.py", line 575, in _export
graph_id, _ = _executor.compile(net, *inputs, phase=phase_name, do_convert=False)
File "/usr/local/lib/python3.7/site-packages/mindspore/common/api.py", line 502, in compile
result = self._executor.compile(obj, args_list, phase, use_vm)
File "/usr/local/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 388, in __infer__
out[track] = fn(*(x[track] for x in args))
File "/usr/local/lib/python3.7/site-packages/mindspore/ops/operations/math_ops.py", line 753, in infer_shape
+ f', x2 shape {x2}(transpose_b={self.transpose_b}).')
ValueError: For 'MatMul' evaluator shapes of inputs can not do this operator, got 44944 and 400, with x1 shape [32, 44944](transpose_a=False), x2 shape [120, 400](transpose_b=True).
I adjusted the size again, this time to 32x32 images:
input = np.random.uniform(0.0, 1.0, size=[1, 1, 32, 32]).astype(np.float32)
It finally passed.
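The two numbers in the MatMul error, 44944 and 400, can be reproduced by hand. The sketch below assumes the classic LeNet-5 layout (two 5x5 convolutions without padding, each followed by 2x2 pooling, with 16 output channels). That layout is an assumption rather than something read from src/lenet.py, but it is consistent with the numbers in the error message. The function name `lenet_flatten_size` is hypothetical.

```python
def lenet_flatten_size(height, width, kernel=5, pool=2, out_channels=16):
    """Number of features fed to the first dense layer of a LeNet-style net.

    Assumes two stages of (kernel x kernel conv, no padding, stride 1)
    followed by (pool x pool pooling).
    """
    for _ in range(2):
        height = (height - kernel + 1) // pool
        width = (width - kernel + 1) // pool
    return out_channels * height * width


print(lenet_flatten_size(224, 224))  # 44944, the x1 width in the MatMul error
print(lenet_flatten_size(32, 32))    # 400, matching x2 shape [120, 400]
```

With 224x224 images the flattened features no longer match the 400 inputs of the first dense layer, which is why only 32x32 inputs export cleanly.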
Final code:
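import numpy as np
from src.lenet import LeNet5
from mindspore import Tensor, export, load_checkpoint

# Define the network (10 output classes for MNIST)
lenetwork = LeNet5(10)
# Load the trained weights into the network instance (no parentheses after lenetwork)
load_checkpoint("ckpt/checkpoint_lenet-10_1875.ckpt", net=lenetwork)
# Dummy input used to trace the graph: single-channel 32x32 images
input = np.random.uniform(0.0, 1.0, size=[16, 1, 32, 32]).astype(np.float32)
# Export the network to lenet.mindir
export(lenetwork, Tensor(input), file_name="lenet", file_format="MINDIR")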
The steps for ResNet50 are basically the same as for LeNet:
I put the CIFAR-10 dataset in the /dataset/cifar-10-batches-bin directory.
python train.py --net=$1 --dataset=$2 --dataset_path=$PATH1
Launch with this command: python train.py --net="resnet50" --dataset="cifar10" --dataset_path="dataset/cifar-10-batches-bin"
Looking at src/config.py, the epoch setting there is:
"epoch_size": 90
When training on an NPU, one epoch took about 30 seconds, so the 90 epochs take about 45 minutes in total.
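The estimate is simple arithmetic on the two figures above. A tiny sketch, with a hypothetical helper name, makes it easy to redo for other epoch counts or per-epoch times:

```python
def total_training_minutes(epochs, seconds_per_epoch):
    """Total training time in minutes for a fixed per-epoch duration."""
    return epochs * seconds_per_epoch / 60


print(total_training_minutes(90, 30))  # 45.0
```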
This takes much longer than LeNet, so an NPU or GPU is needed to finish it in reasonable time.
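The last assignment is completed in the Huawei Cloud simulator:
Assignment: complete the Huawei Cloud visual debugging and tuning lab (because of environment limits, this lab does not include hands-on use of the debugger) and provide a screenshot of the lab report; every step should show a status of completed.
Assignment link: https://lab.huaweicloud.com/testdetail.html?testId=464
Just follow the prompts step by step and it can be completed. I found Huawei Cloud's MindSpore visualization excellent; it can be a great help when debugging networks!
Time is limited, so I will stop here for now.
There is no end to learning. Let us move forward together with Huawei!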