Sampler times out in data-sink mode; with a different Sampler strategy, model training fails with "Segmentation fault (core dumped)" in non-sink mode

Environment

mindspore-ascend 1.7.1

Background

I am training ResNet50 on the ImageNet-1K dataset, using a custom Sampler strategy to sample the data.

The Sampler code is as follows (imports shown for completeness; KIND_NUM, the class count, is assumed to be 1000 for ImageNet-1K):

from typing import Iterator

import numpy as np

import mindspore.dataset as ds
import mindspore.ops as op
from mindspore import Tensor, int32

KIND_NUM = 1000  # assumed: number of classes in ImageNet-1K

class ImagenetSampler(ds.Sampler):
    def __init__(self, data_sourse) -> None:
        super().__init__()
        if data_sourse is None:
            # Fall back to the ImageNet-1K training-set size.
            self.num_samples = 1281167
        else:
            self.num_samples = data_sourse.get_dataset_size()

    def __iter__(self) -> Iterator[int]:
        # Regenerate the index list at every epoch for fresh randomness.
        self.shuffled_list = self.get_shuffled_list()
        yield from self.shuffled_list

    def get_shuffled_list(self):
        def randperm(x):
            # Random permutation of 0..x-1 via the MindSpore operator.
            return op.Randperm(max_length=x)(Tensor([x], dtype=int32))

        num_samples = self.num_samples
        samples_per_cls = 1281167 // KIND_NUM
        num_cls = KIND_NUM

        # Shuffle the sample indices within each class block.
        perm_list = np.array([], dtype=int)
        for i in range(num_cls):
            rand_list = (randperm(samples_per_cls) + i * samples_per_cls).asnumpy()
            perm_list = np.append(perm_list, rand_list)

        shuffled_list = []
        out_len = samples_per_cls // 2
        last_idx = 0
        # Leftover samples that do not fit evenly into the class blocks.
        last_rand_perm = (randperm(num_samples - samples_per_cls * num_cls)
                          + samples_per_cls * num_cls).asnumpy()
        for i in range(out_len):
            # Visit the classes in a fresh random order, taking two samples from each.
            perm_class = randperm(num_cls).asnumpy()
            for j in range(num_cls):
                idx = perm_class[j] * samples_per_cls + 2 * i
                shuffled_list.append(perm_list[idx])
                shuffled_list.append(perm_list[idx + 1])
            if last_idx + 1 < num_samples - samples_per_cls * num_cls:
                shuffled_list.append(last_rand_perm[last_idx])
                shuffled_list.append(last_rand_perm[last_idx + 1])
                last_idx += 2

        # Append the final (odd) sample of each class block.
        perm_class = randperm(num_cls).asnumpy()
        for j in range(num_cls):
            idx = perm_class[j] * samples_per_cls + samples_per_cls - 1
            shuffled_list.append(perm_list[idx])
        return shuffled_list

    def __len__(self):
        return self.num_samples
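
For context, the sampler is attached to the dataset roughly as follows (a sketch only; the dataset directory and batch size below are placeholders, not the actual training configuration):

# Hypothetical path/batch size for illustration; num_parallel_workers matches
# the setting described under Issue 1 below.
dataset = ds.ImageFolderDataset("/path/to/imagenet/train",
                                sampler=ImagenetSampler(None),
                                num_parallel_workers=12)
dataset = dataset.batch(256, drop_remainder=True)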

Meanwhile, to keep the training data random, I regenerate the index array according to the Sampler strategy each time an epoch fetches data (i.e., on every call to __iter__()):

    def __init__(self, data_sourse) -> None:
        super().__init__()
        if data_sourse is None:
            self.num_samples = 1281167
        else:
            self.num_samples = data_sourse.get_dataset_size()

    def __iter__(self) -> Iterator[int]:
        self.shuffled_list = self.get_shuffled_list()
        yield from self.shuffled_list

There are currently two open issues.

Issue 1

The custom Sampler strategy times out in data-sink mode (the error reports that get next failed to fetch data within 25 s):

[INFO] DRV(110667,python):2022-10-18-14:15:16.432.542 [ascend][curpid: 110667, 111474][drv][tsdrv][TsDrvLogicCqIrqWait 431]end, logic_cqid=564 reporCnt=0 devId=2 tsId=0 err_code=16.
[INFO] DRV(110667,python):2022-10-18-14:15:16.432.569 [ascend][curpid: 110667, 111474][drv][tsdrv][TsDrvLogicCqIrqWait 398]start, logic_cqid=564 devId=2 tsId=0
[WARNING] DRV(110667,python):2022-10-18-14:15:21.552.464 [ascend][curpid: 110667, 111474][drv][tsdrv][TsDrvLogicCqIrqWait 415]wait logic cq failed, err_code=16 devId=2 tsId=0

The dataset's num_parallel_workers is set to 12.

When iterating with create_dict_iterator, fetching the first batch takes 14 s, far below the 25 s mentioned in the error message. What should I change to avoid the timeout?
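
For reference, that first-batch latency can be measured with a minimal sketch like the one below (`dataset` is a placeholder for the fully built pipeline, which is not shown in this post):

import time

start = time.time()
iterator = dataset.create_dict_iterator()
first_batch = next(iterator)  # blocks until the first batch is produced
print(f"first batch fetched in {time.time() - start:.1f} s")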

Issue 2

In non-sink mode, the following Sampler strategy (the index array is generated only once, at initialization) runs normally:

    def __init__(self, data_sourse) -> None:
        super().__init__()
        if data_sourse is None:
            self.num_samples = 1281167
        else:
            self.num_samples = data_sourse.get_dataset_size()
        self.shuffled_list = self.get_shuffled_list()

    def __iter__(self) -> Iterator[int]:
        yield from self.shuffled_list

With the following Sampler strategy (the index array is regenerated at every epoch), the run crashes with Segmentation fault (core dumped) once the first epoch finishes:

    def __init__(self, data_sourse) -> None:
        super().__init__()
        if data_sourse is None:
            self.num_samples = 1281167
        else:
            self.num_samples = data_sourse.get_dataset_size()

    def __iter__(self) -> Iterator[int]:
        self.shuffled_list = self.get_shuffled_list()
        yield from self.shuffled_list

The error log is as follows:

[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.035 [stream.cc:921] 109171 AddTaskToStream: recorded public task to stream, stream_id=3, task_id=14390, task_type=2, head=4148, tail=4150
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.048 [npu_driver.cc:407] 109171 CommandOccupy: sqId=238, deviceId=0, tsId=0, command=0xf0000ee0d80, cmdCount=1.
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.061 [engine.cc:1660] 109171 SendTask: device_id=0, ts_id=0, sq_id=238, cq_id=10, stream_id=3, task_id=14390, task_type=2
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.073 [logger.cc:1564] 109171 TaskLaunchedEx: device_id=0, stream_id=3, task_id=14390, event_id=1023,task_type=EventRecord, task_launched_num=15246
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.090 [npu_driver.cc:433] 109171 CommandSend: Command send success, device_id=0, ts_id=0, sq_id=238, reportCount=1, command=0xf0000ee0d80, cmdCount=1.
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.108 [pool.cc:527] 109171 GetItemBySerial: stream_id=3, task_id=14390, serial_id=14390
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.120 [stream.cc:939] 109171 TryDelRecordedTask: del public task from stream, stream_id=3, tailTaskId=14390, delTaskId=14389, head=4149, tail=4150
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.131 [pool.cc:527] 109171 GetItemBySerial: stream_id=3, task_id=14389, serial_id=14389
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.141 [logger.cc:1624] 109171 TaskFinished: device_id=0, stream_id=3, task_id=14389, task_type=1 (KERNEL_AICPU), task_finish_num=15245
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.156 [task.cc:128] 109171 TaskFailCallBack: task ok, stream_id=3, task_id=14389, retCode=0
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.166 [pool.cc:504] 109171 GetTaskId: stream_id=3, task_id=14389, serial_id=14389
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.180 [stream.cc:939] 109171 TryDelRecordedTask: del public task from stream, stream_id=3, tailTaskId=14390, delTaskId=14390, head=4150, tail=4150
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.190 [pool.cc:527] 109171 GetItemBySerial: stream_id=3, task_id=14390, serial_id=14390
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.199 [logger.cc:1624] 109171 TaskFinished: device_id=0, stream_id=3, task_id=14390, task_type=2 (EVENT_RECORD), task_finish_num=15246
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.214 [pool.cc:504] 109171 GetTaskId: stream_id=3, task_id=14390, serial_id=14390
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.224 [engine.cc:585] 109171 ProcessTask: ProcessTask.
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.233 [engine.cc:1120] 109171 SyncTaskQueryShm: Task Wait: stream_id=3, task_id=14390, exec_id=14390, cq_id=564
[INFO] DRV(108000,python):2022-10-18-22:37:37.001.250 [ascend][curpid: 108000, 109171][drv][tsdrv][TsDrvLogicCqIrqWait 398]start, logic_cqid=564 devId=0 tsId=0
[INFO] DRV(108000,python):2022-10-18-22:37:37.001.272 [ascend][curpid: 108000, 109171][drv][tsdrv][TsDrvLogicCqIrqWait 431]end, logic_cqid=564 reporCnt=1 devId=0 tsId=0 err_code=0.
[INFO] DRV(108000,python):2022-10-18-22:37:37.001.288 [ascend][curpid: 108000, 109171][drv][tsdrv][TsDrvLogicCqReportGet 511]logic_cqid=564 devId=0 tsId=0
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.306 [engine.cc:1249] 109171 SyncTask: Notify: count=1, idx=0, stream_id=3, task_id=14390, type=0, retCode=0, payLoad=0, drvErr=0
[INFO] DRV(108000,python):2022-10-18-22:37:37.001.324 [ascend][curpid: 108000, 109171][drv][tsdrv][TsDrvLogicReportRelease 481]logic_cqid=564 devId=0 tsId=0
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.337 [logger.cc:289] 109171 StreamSynchronize: Stream synchronize exit
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.349 [api_impl.cc:495] 109171 StreamSynchronize: stream_id=555
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.362 [logger.cc:289] 109171 StreamSynchronize: Stream synchronize exit
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.371 [api_impl.cc:495] 109171 StreamSynchronize: stream_id=106
[INFO] RUNTIME(108000,python):2022-10-18-22:37:37.001.384 [logger.cc:289] 109171 StreamSynchronize: Stream synchronize exit
[INFO] ASCENDCL(108000,python):2022-10-18-22:37:37.001.401 [memory.cpp:257]109171 aclrtMemcpy: start to execute aclrtMemcpy, destMaxSize = 5116, srcSize = 5116, kind = 2
Segmentation fault (core dumped)
(mindspore_zyb) [root@localhost ResNet]# [INFO] TBE(108412,python):2022-10-18-22:37:37.900.577 [../../../te_fusion/parallel_compilation.py:939][deinit_multi_process_env] destory compiler 
[INFO] TBE(108412,python):2022-10-18-22:37:38.102.274 [../../../te_fusion/parallel_compilation.py:939][deinit_multi_process_env] destory compiler 
[INFO] TBE(108412,python):2022-10-18-22:37:38.302.716 [../../../te_fusion/parallel_compilation.py:941][deinit_multi_process_env] all compiler destoryed

What is the cause, and how can it be fixed?

****************************************************Answer*****************************************************

Issue 1: the timeout is currently fixed at 25 s; a configurable timeout is expected in version 2.1.

Issue 2: this needs to be inspected with gdb. We can provide a debug build of MindSpore; please let us know whether you are on ARM or x86. You can export a core file and then load it in gdb. To export the core file: 1. ulimit -c unlimited  2. echo '/your/custom/path/core.%p' > /proc/sys/kernel/core_pattern

Also, try replacing ops.Randperm with the equivalent numpy API, or implement its logic yourself: mixing mindspore.ops operators into data-processing code currently causes unpredictable errors.
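
For example, the inner randperm helper could be rewritten in pure numpy, so that no MindSpore operator runs inside the data pipeline (a sketch of the suggested substitution; the surrounding sampling logic stays unchanged):

import numpy as np

def randperm(x):
    # Pure-numpy replacement for op.Randperm: a random permutation of 0..x-1.
    return np.random.permutation(x)

Since np.random.permutation already returns an ndarray, the .asnumpy() calls in get_shuffled_list become unnecessary, e.g. rand_list = randperm(samples_per_cls) + i * samples_per_cls.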
