mindspore-ascend 1.7.1
I am training ResNet50 on the ImageNet-1K dataset using a custom Sampler strategy.
The Sampler code is as follows:
from typing import Iterator

import numpy as np
import mindspore.dataset as ds
import mindspore.ops as op
from mindspore import Tensor, int32

KIND_NUM = 1000  # ImageNet-1K class count (KIND_NUM was not defined in the original snippet)

class ImagenetSampler(ds.Sampler):
    def __init__(self, data_sourse) -> None:
        super().__init__()
        if data_sourse is None:
            self.num_samples = 1281167
        else:
            self.num_samples = data_sourse.get_dataset_size()

    def __iter__(self) -> Iterator[int]:
        # Regenerate the index list on every epoch for randomness
        self.shuffled_list = self.get_shuffled_list()
        yield from self.shuffled_list

    def get_shuffled_list(self):
        def randperm(x):
            return op.Randperm(max_length=x)(Tensor([x], dtype=int32))

        num_samples = self.num_samples
        samples_per_cls = 1281167 // KIND_NUM
        num_cls = KIND_NUM
        # Per-class permutations, concatenated: class i owns indices
        # [i * samples_per_cls, (i + 1) * samples_per_cls)
        perm_list = np.array([], dtype=int)
        for i in range(num_cls):
            rand_list = (randperm(samples_per_cls) + i * samples_per_cls).asnumpy()
            perm_list = np.append(perm_list, rand_list)

        shuffled_list = []
        out_len = samples_per_cls // 2
        last_idx = 0
        # Permutation over the leftover samples beyond samples_per_cls * num_cls
        last_rand_perm = (randperm(num_samples - samples_per_cls * num_cls)
                          + samples_per_cls * num_cls).asnumpy()
        for i in range(out_len):
            perm_class = randperm(num_cls).asnumpy()
            for j in range(num_cls):
                # Take two consecutive samples from each class, classes in random order
                idx = perm_class[j] * samples_per_cls + 2 * i
                shuffled_list.append(perm_list[idx])
                shuffled_list.append(perm_list[idx + 1])
            if last_idx + 1 < num_samples - samples_per_cls * num_cls:
                shuffled_list.append(last_rand_perm[last_idx])
                shuffled_list.append(last_rand_perm[last_idx + 1])
                last_idx += 2
        # samples_per_cls is odd, so append each class's last remaining sample
        perm_class = randperm(num_cls).asnumpy()
        for j in range(num_cls):
            idx = perm_class[j] * samples_per_cls + samples_per_cls - 1
            shuffled_list.append(perm_list[idx])
        return shuffled_list

    def __len__(self):
        return self.num_samples
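To make the sampling pattern concrete, here is a miniature pure-numpy sketch of the interleaving in get_shuffled_list (the small class and per-class sample counts are illustrative, not from the post; the even per-class count means the trailing odd-sample pass and the leftover-sample handling are omitted). Each round emits two consecutive samples from every class, visiting the classes in a fresh random order, and overall every index is used exactly once:

```python
import numpy as np

NUM_CLS, SAMPLES_PER_CLS = 3, 4  # toy sizes; the real sampler uses 1000 classes
rng = np.random.default_rng(0)

# Per-class permutations, concatenated: class c owns indices
# [c * SAMPLES_PER_CLS, (c + 1) * SAMPLES_PER_CLS)
perm_list = np.concatenate(
    [rng.permutation(SAMPLES_PER_CLS) + c * SAMPLES_PER_CLS for c in range(NUM_CLS)]
)

shuffled = []
for i in range(SAMPLES_PER_CLS // 2):
    # Fresh random class order per round, two samples from each class
    for c in rng.permutation(NUM_CLS):
        idx = c * SAMPLES_PER_CLS + 2 * i
        shuffled.extend([perm_list[idx], perm_list[idx + 1]])

# The result is a permutation of all 12 indices: every sample appears exactly once
assert sorted(shuffled) == list(range(NUM_CLS * SAMPLES_PER_CLS))
```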
To keep the training data random, the index array is regenerated according to the sampling strategy every time an epoch fetches data (i.e., on each call to __iter__()):
    def __init__(self, data_sourse) -> None:
        super().__init__()
        if data_sourse is None:
            self.num_samples = 1281167
        else:
            self.num_samples = data_sourse.get_dataset_size()

    def __iter__(self) -> Iterator[int]:
        # Index list rebuilt on every epoch
        self.shuffled_list = self.get_shuffled_list()
        yield from self.shuffled_list
There are currently two open issues.
Issue 1: in data-sink mode, the custom Sampler times out (the error says get-next received no data within 25 s):
[INFO] DRV(110667,python):2022-10-18-14:15:16.432.542 [ascend][curpid: 110667, 111474][drv][tsdrv][TsDrvLogicCqIrqWait 431]end, logic_cqid=564 reporCnt=0 devId=2 tsId=0 err_code=16.
[INFO] DRV(110667,python):2022-10-18-14:15:16.432.569 [ascend][curpid: 110667, 111474][drv][tsdrv][TsDrvLogicCqIrqWait 398]start, logic_cqid=564 devId=2 tsId=0
[WARNING] DRV(110667,python):2022-10-18-14:15:21.552.464 [ascend][curpid: 110667, 111474][drv][tsdrv][TsDrvLogicCqIrqWait 415]wait logic cq failed, err_code=16 devId=2 tsId=0
The dataset's num_parallel_workers is set to 12.
When iterating with create_dict_iterator, fetching the first batch takes 14 s, well under the 25 s mentioned in the error message. What should I change to avoid the timeout?
Issue 2: in non-sink mode, the following Sampler strategy (index array generated only once, at initialization) runs fine:
    def __init__(self, data_sourse) -> None:
        super().__init__()
        if data_sourse is None:
            self.num_samples = 1281167
        else:
            self.num_samples = data_sourse.get_dataset_size()
        # Index list generated once, at construction time
        self.shuffled_list = self.get_shuffled_list()

    def __iter__(self) -> Iterator[int]:
        yield from self.shuffled_list
But with the following Sampler strategy (index array regenerated on every epoch), the run crashes with Segmentation fault (core dumped) at the end of the first epoch:
    def __init__(self, data_sourse) -> None:
        super().__init__()
        if data_sourse is None:
            self.num_samples = 1281167
        else:
            self.num_samples = data_sourse.get_dataset_size()

    def __iter__(self) -> Iterator[int]:
        # Index list rebuilt on every epoch
        self.shuffled_list = self.get_shuffled_list()
        yield from self.shuffled_list
The error log is as follows:
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.035 [stream.cc:921] 109171 AddTaskToStream: recorded public task to stream, stream_id=3, task_id=14390, task_type=2, head=4148, tail=4150
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.048 [npu_driver.cc:407] 109171 CommandOccupy: sqId=238, deviceId=0, tsId=0, command=0xf0000ee0d80, cmdCount=1.
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.061 [engine.cc:1660] 109171 SendTask: device_id=0, ts_id=0, sq_id=238, cq_id=10, stream_id=3, task_id=14390, task_type=2
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.073 [logger.cc:1564] 109171 TaskLaunchedEx: device_id=0, stream_id=3, task_id=14390, event_id=1023,task_type=EventRecord, task_launched_num=15246
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.090 [npu_driver.cc:433] 109171 CommandSend: Command send success, device_id=0, ts_id=0, sq_id=238, reportCount=1, command=0xf0000ee0d80, cmdCount=1.
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.108 [pool.cc:527] 109171 GetItemBySerial: stream_id=3, task_id=14390, serial_id=14390
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.120 [stream.cc:939] 109171 TryDelRecordedTask: del public task from stream, stream_id=3, tailTaskId=14390, delTaskId=14389, head=4149, tail=4150
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.131 [pool.cc:527] 109171 GetItemBySerial: stream_id=3, task_id=14389, serial_id=14389
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.141 [logger.cc:1624] 109171 TaskFinished: device_id=0, stream_id=3, task_id=14389, task_type=1 (KERNEL_AICPU), task_finish_num=15245
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.156 [task.cc:128] 109171 TaskFailCallBack: task ok, stream_id=3, task_id=14389, retCode=0
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.166 [pool.cc:504] 109171 GetTaskId: stream_id=3, task_id=14389, serial_id=14389
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.180 [stream.cc:939] 109171 TryDelRecordedTask: del public task from stream, stream_id=3, tailTaskId=14390, delTaskId=14390, head=4150, tail=4150
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.190 [pool.cc:527] 109171 GetItemBySerial: stream_id=3, task_id=14390, serial_id=14390
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.199 [logger.cc:1624] 109171 TaskFinished: device_id=0, stream_id=3, task_id=14390, task_type=2 (EVENT_RECORD), task_finish_num=15246
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.214 [pool.cc:504] 109171 GetTaskId: stream_id=3, task_id=14390, serial_id=14390
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.224 [engine.cc:585] 109171 ProcessTask: ProcessTask.
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.233 [engine.cc:1120] 109171 SyncTaskQueryShm: Task Wait: stream_id=3, task_id=14390, exec_id=14390, cq_id=564
[INFO] DRV(108000,python):2022-10-18-22:37:37.001.250 [ascend][curpid: 108000, 109171][drv][tsdrv][TsDrvLogicCqIrqWait 398]start, logic_cqid=564 devId=0 tsId=0
[INFO] DRV(108000,python):2022-10-18-22:37:37.001.272 [ascend][curpid: 108000, 109171][drv][tsdrv][TsDrvLogicCqIrqWait 431]end, logic_cqid=564 reporCnt=1 devId=0 tsId=0 err_code=0.
[INFO] DRV(108000,python):2022-10-18-22:37:37.001.288 [ascend][curpid: 108000, 109171][drv][tsdrv][TsDrvLogicCqReportGet 511]logic_cqid=564 devId=0 tsId=0
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.306 [engine.cc:1249] 109171 SyncTask: Notify: count=1, idx=0, stream_id=3, task_id=14390, type=0, retCode=0, payLoad=0, drvErr=0
[INFO] DRV(108000,python):2022-10-18-22:37:37.001.324 [ascend][curpid: 108000, 109171][drv][tsdrv][TsDrvLogicReportRelease 481]logic_cqid=564 devId=0 tsId=0
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.337 [logger.cc:289] 109171 StreamSynchronize: Stream synchronize exit
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.349 [api_impl.cc:495] 109171 StreamSynchronize: stream_id=555
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.362 [logger.cc:289] 109171 StreamSynchronize: Stream synchronize exit
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.371 [api_impl.cc:495] 109171 StreamSynchronize: stream_id=106
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.384 [logger.cc:289] 109171 StreamSynchronize: Stream synchronize exit
[INFO] ASCENDCL(108000,python):2022-10-18-22:37:37.001.401 [memory.cpp:257]109171 aclrtMemcpy: start to execute aclrtMemcpy, destMaxSize = 5116, srcSize = 5116, kind = 2
Segmentation fault (core dumped)
(mindspore_zyb) [root@localhost ResNet]# [INFO] TBE(108412,python):2022-10-18-22:37:37.900.577 [../../../te_fusion/parallel_compilation.py:939][deinit_multi_process_env] destory compiler
[INFO] TBE(108412,python):2022-10-18-22:37:38.102.274 [../../../te_fusion/parallel_compilation.py:939][deinit_multi_process_env] destory compiler
[INFO] TBE(108412,python):2022-10-18-22:37:38.302.716 [../../../te_fusion/parallel_compilation.py:941][deinit_multi_process_env] all compiler destoryed
What is the cause, and how can it be fixed?
**************************************************** Answer *****************************************************
Issue 1: the get-next timeout is currently fixed at 25 s; a configurable timeout is planned for version 2.1.
Issue 2 needs gdb to diagnose. We can provide a debug build of MindSpore; please tell us whether you are on ARM or x86. You can export a core file and then load it in gdb. To enable core-file export:
1. ulimit -c unlimited
2. echo '/your/custom/path/core.%p' > /proc/sys/kernel/core_pattern
Also, try replacing ops.Randperm with the equivalent numpy interface, or implement its logic yourself: mixing mindspore.ops interfaces into data processing currently leads to unpredictable errors.
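As a sketch of that suggestion, the inner randperm helper can be swapped for numpy.random.permutation, keeping the data pipeline free of mindspore.ops calls (the rest of get_shuffled_list is assumed unchanged):

```python
import numpy as np

def randperm(x):
    # numpy stand-in for op.Randperm(max_length=x)(Tensor([x], dtype=int32)):
    # returns a random permutation of the integers [0, x). Since the result is
    # already an ndarray, the .asnumpy() calls in get_shuffled_list can be dropped.
    return np.random.permutation(x)

# Mirrors rand_list = (randperm(samples_per_cls) + i * samples_per_cls) for i = 2
samples_per_cls = 5
rand_list = randperm(samples_per_cls) + 2 * samples_per_cls
assert sorted(rand_list) == list(range(10, 15))
```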