mmdet3d + Waymo: pitfalls, and how to verify the environment is correct

Processing the new version of the Waymo dataset was already painful enough, and then the eval results and training loss kept coming out bad. At first I assumed it was just a model problem, but it turned out the environment itself had a big pit. After fixing that, evaluation started throwing an error right at the end, even though the results were clearly correct. The whole chain of issues took 20 days to sort out, with basically no rest in between. Exhausting.

When I first set up mmdet3d I used the latest mmdet3d v1.0.0rc2, and with the official configs and models neither the eval nor the train results on nuScenes came out right; things only worked after switching to the versions from a labmate's environment. But with that environment, testing on Waymo errors out. After a long bug hunt, it turned out that the new CUDA, the older torch and tensorflow conflict to some degree, so problems appear when they use the GPU together.

Reproducing the waymo evaluate error

I should file an issue when I get time.
Environment:

mmcv-full                    1.4.0
mmdet                        2.19.1
mmdet3d                      0.17.3
tensorflow                   2.6.0
torch                        1.10.2+cu113
waymo-open-dataset-tf-2-6-0  1.4.7

The problem shows up with bash tools/dist_train.sh or bash tools/dist_test.sh:
once waymo_dataset.evaluate has been called, the program always errors out when it exits:

terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: unspecified launch failure
Exception raised from create_event_internal at
…/c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42
(0x7f27dd18ed62 in
/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x1c5f3 (0x7f282083b5f3 in
/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2
(0x7f282083c002 in
/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7f27dd178314
in
/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: + 0x29eb89 (0x7f28a3b62b89 in
/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xadfbe1 (0x7f28a43a3be1 in
/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x292
(0x7f28a43a3ee2 in
/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #61: PyRun_SimpleFileExFlags + 0x1bf (0x56231eaba54f in
/home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python) frame #62:
Py_RunMain + 0x3a9 (0x56231eabaa29 in
/home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python) frame #63:
Py_BytesMain + 0x39 (0x56231eabac29 in
/home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python)

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process
57875 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process
57876 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode:
-6) local_rank: 0 (pid: 57874) of binary: /home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python Traceback
(most recent call last): File
“/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py”,
line 194, in _run_module_as_main
return _run_code(code, main_globals, None, File “/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py”,
line 87, in _run_code
exec(code, run_globals) File “/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py”,
line 193, in <module>
main() File “/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py”,
line 189, in main
launch(args) File “/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py”,
line 174, in launch
run(args) File “/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/run.py”,
line 710, in run
elastic_launch( File “/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py”, line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args)) File
“/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py”, line 259, in launch_agent
raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Note that CUDA work executes asynchronously with respect to the CPU, and the nasty part is that when a CUDA kernel fails, the CPU side does not stop right away; it keeps running until some later synchronization point. So the error message above cannot be used to locate the faulty code.
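To make this concrete, here is a tiny sketch of one way to force synchronization points so that an asynchronous CUDA failure surfaces next to the code that caused it; this is not what I did here, and the checkpoint tags and call site below are purely illustrative:

import torch


def cuda_checkpoint(tag):
    # Block the host until all queued CUDA work has finished; if a kernel
    # has already failed asynchronously, the error is raised here, next to
    # `tag`, instead of much later at some unrelated synchronization point.
    torch.cuda.synchronize()
    print(f'[cuda ok] {tag}')


# Hypothetical usage while bisecting a suspect code path:
# cuda_checkpoint('before waymo evaluate')
# dataset.evaluate(results)  # suspect call
# cuda_checkpoint('after waymo evaluate')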
CUDA_LAUNCH_BLOCKING=1 bash tools/dist_test.sh did not help here either: test.py still ran to completion before the error appeared. So I had to comment code out section by section to see where the problem really was; it took about an hour to find, as follows:

mmdet3d/mmdet3d/core/evaluation/waymo_utils/prediction_kitti_to_waymo.py:

def convert_one(self, file_idx):
    file_pathname = self.waymo_tfrecord_pathnames[file_idx]
    file_data = tf.data.TFRecordDataset(file_pathname, compression_type='')

The error is triggered by the last line, where tf.data.TFRecordDataset reads the .tfrecord file; comment that line out and the error does not occur.
convert_one is run by multiple parallel worker processes, via:

mmcv.track_parallel_progress(self.convert_one, range(len(self)),
                             self.workers)

As soon as multiple workers are used the error appears; run it serially and it does not:

for idx in range(len(self)):
    self.convert_one(idx)  # works, but far too slow

On a large dataset, though, even this serial version fails: it apparently ties up the GPUs (i.e. blocks the pending collectives) for so long that the NCCL watchdog times out:

[E ProcessGroupNCCL.cpp:587] [Rank 4] Watchdog caught collective
operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran
for 1808663 milliseconds before timing out. [E
ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed
out. Due to the asynchronous nature of CUDA kernels, subsequent GPU
operations might run on corrupted/incomplete data. To avoid this
inconsistency, we are taking the entire process down. terminate called
after throwing an instance of 'std::runtime_error' what(): [Rank 5]
Watchdog caught collective operation timeout:
WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808705
milliseconds before timing out. [E ProcessGroupNCCL.cpp:341] Some NCCL
operations have failed or timed out. Due to the asynchronous nature of
CUDA kernels, subsequent GPU operations might run on
corrupted/incomplete data. To avoid this inconsistency, we are taking
the entire process down. what(): [Rank 3] Watchdog caught
collective operation timeout: WorkNCCL(OpType=ALLREDUCE,
Timeout(ms)=1800000) ran for 1801257 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process
101573 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode:
-6) local_rank: 1 (pid: 101582) of binary: /home/zhenglt/anaconda3/envs/open-mmlab/bin/python Traceback (most
recent call last): File
“/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py”, line
194, in _run_module_as_main
return _run_code(code, main_globals, None, File “/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py”, line
87, in _run_code
exec(code, run_globals) File “/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py”,
line 193, in <module>
main() File
“/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py”,
line 189, in main
launch(args) File
“/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py”,
line 174, in launch
run(args) File “/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/run.py”,
line 710, in run
elastic_launch( File “/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py”, line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args)) File
“/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py”, line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

With mmdet3d==v1.0.0rc2 and torch==1.11.0+cu113 this error does not occur, although setting the number of parallel workers too high then leads to CUDA out of memory.

How the error behaves across versions

cuda11.3 + mmdet3d=0.17.3 + torch=1.10.2 + $waymo

waymo-open-dataset-tf-2-6-0: both standalone eval and the eval during training error out.
waymo-open-dataset-tf-2-5-0: after downgrading to 2-5-0 (which brings in tensorflow 2.5.0), a standalone dist_test.sh no longer errors, but with dist_train.sh the model evaluates after a few training epochs, and that eval errors out once it finishes.
Even lower versions are unusable as well; concretely:

Successfully uninstalled waymo-open-dataset-tf-2-4-0-1.4.1
pip uninstall waymo-open-dataset-tf-2-3-0
Successfully uninstalled waymo-open-dataset-tf-2-2-0-1.3.1

And tf-2-1-0 is too old; it produces even more errors.

cuda11.3 + mmdet3d=0.18.1 + torch=1.11.0

mmdet3d will not even compile here: the matching mmcv does support torch 1.11, but mmdet3d 0.18.1 does not.

cuda11.1 + mmdet3d=0.17.3 + torch=1.9.0 + $waymo

Same behaviour as with torch 1.10.2.

TransFusion's experiment environment

Never got it installed. The version changes are too large and would require a lot of correctness re-testing; I lost patience with switching environments and gave up.

Fixing the waymo evaluate error

Since no combination of versions works, the only option left is to bypass the TensorFlow read entirely: dump the fields that evaluation needs from each .tfrecord into a matching .pkl, and then always read the .pkl instead of the .tfrecord, as follows:

def convert_one_pkl_style(self, file_idx):
    """Convert action for single file.

    Args:
        file_idx (int): Index of the file to be converted.
    """
    pkl_pathname = self.waymo_tfrecord_pathnames[file_idx].replace(
        '.tfrecord', '.pkl')
    infos = mmcv.load(pkl_pathname)
    # return still got error here
    for info in infos:
        filename = info['filename']
        T_k2w = info['T_k2w']
        context_name = info['context_name']
        frame_timestamp_micros = info['frame_timestamp_micros']

        if filename in self.name2idx:
            kitti_result = \
                self.kitti_result_files[self.name2idx[filename]]
            objects = self.parse_objects(kitti_result, T_k2w, context_name,
                                         frame_timestamp_micros)
        else:
            print(filename, 'not found.')
            objects = metrics_pb2.Objects()

        with open(
                join(self.waymo_results_save_dir, f'{filename}.bin'),
                'wb') as f:
            f.write(objects.SerializeToString())
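The .pkl files themselves only need to be generated once, offline, so TensorFlow never has to run inside the distributed torch job again. Below is a minimal sketch of such a dump script; it mirrors the fields the original convert_one extracted from each frame. The filename scheme, the prefix (here '1', which the validation split uses in my setup), T_ref_to_front_cam and the data path are taken from my reading of prediction_kitti_to_waymo.py and are assumptions to double-check against your mmdet3d version:

# One-off, offline dump of the per-frame fields evaluation needs, so that
# convert_one_pkl_style never has to touch TensorFlow or the GPU.
from glob import glob
from os.path import join

import mmcv
import numpy as np
import tensorflow as tf
from waymo_open_dataset import dataset_pb2 as open_dataset

# Rectified KITTI reference cam -> Waymo front cam, as used by the converter.
T_ref_to_front_cam = np.array([[0.0, 0.0, 1.0, 0.0],
                               [-1.0, 0.0, 0.0, 0.0],
                               [0.0, -1.0, 0.0, 0.0],
                               [0.0, 0.0, 0.0, 1.0]])


def dump_one(tfrecord_pathname, file_idx, prefix='1'):
    """Write <segment>.pkl next to <segment>.tfrecord."""
    infos = []
    data = tf.data.TFRecordDataset(tfrecord_pathname, compression_type='')
    for frame_num, frame_data in enumerate(data):
        frame = open_dataset.Frame()
        frame.ParseFromString(bytearray(frame_data.numpy()))
        # FRONT camera is name == 1, see dataset.proto.
        for camera in frame.context.camera_calibrations:
            if camera.name == 1:
                T_front_cam_to_vehicle = np.array(
                    camera.extrinsic.transform).reshape(4, 4)
        infos.append(dict(
            filename=f'{prefix}{file_idx:03d}{frame_num:03d}',
            T_k2w=T_front_cam_to_vehicle @ T_ref_to_front_cam,
            context_name=frame.context.name,
            frame_timestamp_micros=frame.timestamp_micros))
    mmcv.dump(infos, tfrecord_pathname.replace('.tfrecord', '.pkl'))


if __name__ == '__main__':
    # Illustrative path; point it at the split you evaluate on.
    pathnames = sorted(
        glob(join('data/waymo/waymo_format/validation', '*.tfrecord')))
    for file_idx, pathname in enumerate(pathnames):
        dump_one(pathname, file_idx)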

Other attempted fixes that didn't work

I also tried passing the per-frame information through the dataset's data_infos, with each convert_one worker reading its frame's info from there. Not only were the results wrong, it was also very slow: the parallel execution never kicked in, only one child process was active at any time, and I don't know why.

waymo evaluation being very slow

The machines in Beijing kept stalling: reading files was slow, and sometimes a reboot fixed it. Calls into Waymo's official compute_metrics were also sometimes inexplicably slow, but only in DETR3D's eval; PointPillars' eval was fine. The reason is unknown. Building the gt_database during create waymo data also stalled and did not finish after several days of running.
The machines in Shanghai were fast, with no stalling at all.

Steps to verify the environment is correct

To train with peace of mind, the environment first has to be guaranteed not to break (no idea why mmdet is such a minefield...).
After setting up an environment, the things to do are:

  1. evaluation check: download a checkpoint from the model zoo and run it on the dataset, e.g. fcos3d + nuscenes: bash tools/dist_test.sh configs/fcos3d/fcos3d_r101_caffe_fpn_gn-head_dcn_2x8_1x_nus-mono3d.py ckpts/xxx.pth 4 --eval=bbox
  2. training procedure check: download the fcos3d pretrained weights provided by detr3d, then run bash tools/dist_train.sh projects/configs/detr3d/detr3d_res101_gridmask.py 4 and compare the results against the log published on GitHub.
  3. waymo eval check: bash tools/dist_test.sh configs/pointpillars/hv_pointpillars_secfpn_sbn_2x16_2x_waymoD5-3d-car.py ckpts/hv_pointpillars_secfpn_sbn_2x16_2x_waymoD5-3d-car_20200901_204315-302fc3e7.pth 4 --eval=waymo
  4. waymo training check: use the adapted detr3d: bash tools/dist_train.sh projects/configs/detr3d/detr3d_res101_gridmask_waymo_debug.py 4

I also built a debug dataset from a small subset of Waymo (a sketch of how such a subset can be built follows the list below). After modifying project code or setting up a new environment, a quick run on the subset catches obvious bugs first:

  1. training procedure check (this also produces the checkpoint used for the eval check): bash tools/dist_train.sh projects/configs/detr3d/detr3d_res101_gridmask_waymo_debug.py 4
  2. evaluation check: bash tools/dist_test.sh projects/configs/detr3d/detr3d_res101_gridmask_waymo_debug.py work_dirs/detr3d_res101_gridmask_waymo_debug/epoch_1.pth 4 --eval=waymo
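For reference, a minimal sketch of how such a debug subset can be put together, assuming the standard mmdet3d Waymo directory layout; the paths, the number of segments and the waymo_subset name are illustrative rather than my exact setup:

# Copy a handful of .tfrecord segments into a separate data root, then
# regenerate the kitti-format data and infos with mmdet3d's create_data tool.
import shutil
from glob import glob
from os import makedirs
from os.path import basename, join

SRC = 'data/waymo/waymo_format'          # full dataset (illustrative path)
DST = 'data/waymo_subset/waymo_format'   # debug subset (illustrative path)

for split, num_segments in [('training', 2), ('validation', 1)]:
    makedirs(join(DST, split), exist_ok=True)
    for pathname in sorted(glob(join(SRC, split, '*.tfrecord')))[:num_segments]:
        shutil.copy(pathname, join(DST, split, basename(pathname)))

# Then, from the mmdet3d root:
#   python tools/create_data.py waymo --root-path ./data/waymo_subset \
#       --out-dir ./data/waymo_subset --workers 8 --extra-tag waymo
# and point data_root in the *_waymo_debug.py configs at the new directory.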

If even plain mmdet3d gets it wrong, the environment is beyond saving, reinstall it; if only detr3d is wrong while mmdet3d is fine, then go look for a bug in your own code.

misc

  • Setting a sample stride for evaluation does not really make sense; Waymo's compute_metrics does not seem to support that kind of thing...

  • Eval results for PointPillars on waymo_subset, usable as a fast check without running the full dataset:

{'Vehicle/L1 mAP': 0.00619617, 'Vehicle/L1 mAPH': 0.00615365,
 'Vehicle/L2 mAP': 0.00528095, 'Vehicle/L2 mAPH': 0.00524471,
 'Pedestrian/L1 mAP': 0.0, 'Pedestrian/L1 mAPH': 0.0,
 'Pedestrian/L2 mAP': 0.0, 'Pedestrian/L2 mAPH': 0.0,
 'Sign/L1 mAP': 0.0, 'Sign/L1 mAPH': 0.0,
 'Sign/L2 mAP': 0.0, 'Sign/L2 mAPH': 0.0,
 'Cyclist/L1 mAP': 0.0, 'Cyclist/L1 mAPH': 0.0,
 'Cyclist/L2 mAP': 0.0, 'Cyclist/L2 mAPH': 0.0,
 'Overall/L1 mAP': 0.00206539, 'Overall/L1 mAPH': 0.0020512166666666666,
 'Overall/L2 mAP': 0.0017603166666666668, 'Overall/L2 mAPH': 0.0017482366666666666}
