Training CSWin-Transformer on Ascend910 fails

Source code: https://gitee.com/lljyoyo1995/cswin.git

[Steps and symptoms]

1. Launching training with train.py fails.

[Screenshot information]

Platform: ModelArts
Image: tensorflow1.15-mindspore1.5.1-cann5.0.2-euler2.8-aarch64

RuntimeError: ({'errCode': 'E60011', 'range': '[2,4096]', 'attr_name': 'dy_w*stride_w', 'value': 1}, 'In op, the [dy_w*stride_w] must in range [[2,4096]], actual is [1]')

[Log information] (optional; upload log content or attachments)

Traceback (most recent call last):
File "/home/ma-user/modelarts/user-job-dir/cswin/train.py", line 101, in <module>
main()
File "/home/ma-user/modelarts/user-job-dir/cswin/train.py", line 92, in main
dataset_sink_mode = False)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/train/model.py", line 726, in train
sink_size=sink_size)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/train/model.py", line 498, in _train
self._train_process(epoch, train_dataset, list_callback, cb_params)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/train/model.py", line 626, in _train_process
outputs = self._train_network(*next_element)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/nn/cell.py", line 404, in __call__
out = self.compile_and_run(*inputs)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/nn/cell.py", line 682, in compile_and_run
self.compile(*inputs)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/nn/cell.py", line 669, in compile
_cell_graph_executor.compile(self, *inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/common/api.py", line 548, in compile
result = self._graph_executor.compile(obj, args_list, phase, use_vm, self.queue_name)
RuntimeError: mindspore/ccsrc/backend/kernel_compiler/tbe/ascend_kernel_compile.cc:384 ParseTargetJobStatus] Single op compile failed, op: depthwise_conv2d_backprop_input_d_4348327921004145570_0
except_msg : 2022-07-15 10:30:01.881592: Query except_msg:Traceback (most recent call last):
File "/usr/local/Ascend/nnae/latest/fwkacllib/python/site-packages/te_fusion/parallel_compilation.py", line 1471, in run
tune_param=self._tune_param)
File "/usr/local/Ascend/nnae/latest/fwkacllib/python/site-packages/te_fusion/fusion_manager.py", line 1271, in build_single_op
compile_info = call_op()
File "/usr/local/Ascend/nnae/latest/fwkacllib/python/site-packages/te_fusion/fusion_manager.py", line 1259, in call_op
opfunc(*inputs, *outputs, *new_attrs, **kwargs)
File "/usr/local/Ascend/nnae/latest/fwkacllib/python/site-packages/tbe/common/utils/para_check.py", line 539, in _in_wrapper
return func(args, **kwargs)
File "/usr/local/Ascend/nnae/latest/opp/op_impl/built-in/ai_core/tbe/impl/depthwise_conv2d_backprop_input_d.py", line 380, in depthwise_conv2d_backprop_input_d
para_dict=para_dict
File "/usr/local/Ascend/nnae/latest/fwkacllib/python/site-packages/te/lang/cce/te_compute/conv2d_backprop_input_compute.py", line 65, in conv2d_backprop_input_compute
return conv2d_backprop_input_compute(filters, out_backprop, filter_sizes, input_sizes, para_dict)
File "/usr/local/Ascend/nnae/latest/fwkacllib/python/site-packages/tbe/common/utils/para_check.py", line 986, in in_wrapper
return func(args, **kwargs)
File "/usr/local/Ascend/nnae/latest/fwkacllib/python/site-packages/tbe/dsl/compute/conv2d_backprop_input_compute.py", line 735, in conv2d_backprop_input_compute
switch_to_general_scheme=switch_to_general_scheme)
File "/usr/local/Ascend/nnae/latest/fwkacllib/python/site-packages/tbe/dsl/compute/conv2d_backprop_input_compute.py", line 594, in _check_input_params
_check_dy()
File "/usr/local/Ascend/nnae/latest/fwkacllib/python/site-packages/tbe/dsl/compute/conv2d_backprop_input_compute.py", line 417, in _check_dy
dy_filling_w_max
File "/usr/local/Ascend/nnae/latest/fwkacllib/python/site-packages/tbe/dsl/compute/conv2d_backprop_input_compute.py", line 149, in _check_variable_range
error_manager_util.get_error_message(args_dict))
RuntimeError: ({'errCode': 'E60011', 'range': '[2,4096]', 'attr_name': 'dy_w*stride_w', 'value': 1}, 'In op, the [dy_w*stride_w] must in range [[2,4096]], actual is [1]')

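The code runs successfully on my side, but I am not sure whether some settings, such as environment variables, differ between our setups. You can try to locate the problem yourself by following the tutorial below. I suspect that a particular layer goes wrong during forward propagation; is that line of reasoning correct? https://mindspore.cn/docs/zh-CN/r1.7/api_python/ops/mindspore.ops.Print.html#mindspore.ops.Print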
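
The failing op in the log is depthwise_conv2d_backprop_input_d, and the check that rejects it is that dy_w*stride_w must lie in [2,4096]. Before adding Print calls to the network, it may help to work out offline which depthwise convolution could produce that value. The sketch below is only an illustration under assumptions: it takes dy_w to be the forward output width of the convolution (the gradient has the same shape as the output), it uses the standard output-size formula without dilation, and the layer names and shapes are made up rather than taken from the cswin repository.

```python
# Range reported by the E60011 message in the log above.
DY_W_TIMES_STRIDE_RANGE = (2, 4096)


def conv_out_width(in_w, kernel_w, stride_w, pad_w):
    """Standard convolution output width (no dilation); an assumption here."""
    return (in_w + 2 * pad_w - kernel_w) // stride_w + 1


def find_offending_layers(layers):
    """Return (name, dy_w * stride_w) for layers outside the reported range.

    `layers` is a hypothetical list of dicts, one per depthwise conv, with
    the keys: name, in_w, kernel_w, stride_w, pad_w.
    """
    lo, hi = DY_W_TIMES_STRIDE_RANGE
    bad = []
    for layer in layers:
        dy_w = conv_out_width(layer["in_w"], layer["kernel_w"],
                              layer["stride_w"], layer["pad_w"])
        product = dy_w * layer["stride_w"]
        if not lo <= product <= hi:
            bad.append((layer["name"], product))
    return bad


if __name__ == "__main__":
    # Made-up shapes for illustration only; not taken from the cswin repo.
    example_layers = [
        {"name": "dwconv_a", "in_w": 56, "kernel_w": 3, "stride_w": 1, "pad_w": 1},
        {"name": "dwconv_b", "in_w": 1, "kernel_w": 3, "stride_w": 1, "pad_w": 1},
    ]
    for name, product in find_offending_layers(example_layers):
        print(f"{name}: dy_w*stride_w = {product}")
```

If a layer in the real model shows up in such a list with a value of 1, that is a reasonable place to put a Print op first and confirm the actual tensor width at run time.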
