Processing the new version of the Waymo dataset was already painful enough, and then the eval results and training loss kept coming out bad. I first assumed it was a model problem, but it turned out the environment had big pitfalls as well. After fixing that, evaluation started throwing an error at the end of every run even though the results themselves were correct. The whole string of issues took 20 days to sort out, with basically no rest in between. Exhausting.
Earlier, when setting up mmdet3d, I used the latest mmdet3d v1.0.0rc2, and with the official configs and models both the eval and train results on nuScenes came out wrong. Switching to the versions from a classmate's environment fixed that, but then testing on Waymo started crashing. After a long bug hunt I found that the new CUDA, the older torch, and tensorflow conflict to some extent, so things break when they use the GPU at the same time.
I should file an issue when I get the time.
Environment:
mmcv-full 1.4.0
mmdet 2.19.1
mmdet3d 0.17.3
tensorflow 2.6.0
torch 1.10.2+cu113
waymo-open-dataset-tf-2-6-0 1.4.7
The problem shows up when running bash tools/dist_train.sh or bash tools/dist_test.sh.
Once waymo_dataset.evaluate has been called, the program crashes at exit with:
terminate called after throwing an instance of 'c10::CUDAError'
  what(): CUDA error: unspecified launch failure
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f27dd18ed62 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1c5f3 (0x7f282083b5f3 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7f282083c002 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7f27dd178314 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x29eb89 (0x7f28a3b62b89 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xadfbe1 (0x7f28a43a3be1 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7f28a43a3ee2 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #61: PyRun_SimpleFileExFlags + 0x1bf (0x56231eaba54f in /home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python)
frame #62: Py_RunMain + 0x3a9 (0x56231eabaa29 in /home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python)
frame #63: Py_BytesMain + 0x39 (0x56231eabac29 in /home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python)

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 57875 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 57876 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 57874) of binary: /home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python
Traceback (most recent call last):
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Note that CUDA kernels execute asynchronously with respect to the CPU. The nasty part is that when CUDA hits an error, the CPU side does not stop; it only dies later at whatever point happens to synchronize. So the error message alone cannot be used to locate the faulty code.
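The usual way to localize such an asynchronous failure is to force synchronization so the error surfaces at the offending call, either via the CUDA_LAUNCH_BLOCKING=1 environment variable or by bracketing suspect code with explicit torch.cuda.synchronize() calls. A minimal sketch; the helper name and the commented call sites are mine, purely illustrative:

import torch

def cuda_checkpoint(tag):
    # Drain all queued CUDA work now, so an asynchronous kernel failure
    # is raised here instead of at some later, unrelated sync point.
    torch.cuda.synchronize()
    print(f'[cuda ok] {tag}')

# Example usage around the suspect region (names are hypothetical):
# cuda_checkpoint('before dataset.evaluate')
# metrics = dataset.evaluate(results)
# cuda_checkpoint('after dataset.evaluate')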
But CUDA_LAUNCH_BLOCKING=1 bash tools/dist_test.sh did not help in my case; the error still only appeared after test.py had finished. So I had to comment out code section by section to see where the problem was, and after about an hour I narrowed it down to the following:
mmdet3d/mmdet3d/core/evaluation/waymo_utils/prediction_kitti_to_waymo.py:

def convert_one(self, file_idx):
    file_pathname = self.waymo_tfrecord_pathnames[file_idx]
    file_data = tf.data.TFRecordDataset(file_pathname, compression_type='')
The error is triggered when tf.data.TFRecordDataset is called here to read the .tfrecord file; if this line is commented out, the error goes away.
And this runs in multiple worker processes, launched via:
mmcv.track_parallel_progress(self.convert_one, range(len(self)), self.workers)
As soon as it runs in parallel the error occurs; running it serially does not, like this:
for idx in range(len(self)):
    self.convert_one(idx)  # too slow
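One mitigation worth considering (not what I ended up doing, and untested against this particular torch/TF combination) is to hide the GPU from TensorFlow inside the conversion code, so the tf.data reader never creates a TF GPU context that can collide with torch's. tf.config.set_visible_devices is standard TensorFlow 2 API; the helper name is mine:

import tensorflow as tf

def open_tfrecord_cpu_only(file_pathname):
    # Make TensorFlow CPU-only in this process *before* it initializes CUDA,
    # so reading the .tfrecord never touches the GPU that torch is using.
    tf.config.set_visible_devices([], 'GPU')
    return tf.data.TFRecordDataset(file_pathname, compression_type='')

Note that the mmcv workers are forked from a process that already holds a torch CUDA context, so this alone may not be enough.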
However, on a large dataset the serial loop also ends up failing, apparently because the GPUs are tied up waiting for so long; the error is:
[E ProcessGroupNCCL.cpp:587] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808663 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808705 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
  what(): [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801257 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 101573 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 101582) of binary: /home/zhenglt/anaconda3/envs/open-mmlab/bin/python
Traceback (most recent call last):
  File "/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zhenglt/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
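If the long conversion is unavoidable, one knob worth knowing about is the process-group timeout that this watchdog enforces (the 1800000 ms above is the 30-minute default). torch.distributed.init_process_group accepts a timeout argument; where exactly mmcv/mmdet plumb it through is an assumption, so treat this as a sketch rather than a drop-in fix:

from datetime import timedelta
import torch.distributed as dist

# Raise the NCCL collective timeout from the default 30 minutes, so a slow
# evaluation step does not get the whole job torn down by the watchdog.
# Meant to run under the usual launcher, which sets MASTER_ADDR/RANK etc.
dist.init_process_group(backend='nccl', timeout=timedelta(hours=3))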
With mmdet3d==v1.0.0rc2 and torch==1.11.0+cu113 this error does not occur, but setting the number of parallel workers too high then leads to CUDA out-of-memory errors.
waymo-open-dataset-tf-2-6-0: both standalone eval and the eval inside training crash.
waymo-open-dataset-tf-2-5-0: after downgrading to 2-5-0 (which means installing tensorflow 2.5.0), a standalone dist_test.sh no longer crashes; but with dist_train.sh the model has to eval after a few epochs of training, and that eval still crashes when it finishes.
Even lower versions are not usable either; concretely:
Successfully uninstalled waymo-open-dataset-tf-2-4-0-1.4.1
pip uninstall waymo-open-dataset-tf-2-3-0
Successfully uninstalled waymo-open-dataset-tf-2-2-0-1.3.1
And tf-2-1-0 is simply too old; it has even more problems.
mmdet3d would not even compile: although the matching mmcv supports torch 1.11, mmdet3d 0.18.1 does not. Same situation as with torch 1.10.2. I never got it installed: the version changes were too large, there was too much correctness re-testing to do, and I lost patience with swapping environments. Gave up.
Since installing a different version combination is not feasible, the only option left is to bypass TensorFlow for reading: dump the fields needed from each tfrecord into a corresponding .pkl in advance, and then read the .pkl instead of the .tfrecord every time, like this:
def convert_one_pkl_style(self, file_idx):
    """Convert action for single file.

    Args:
        file_idx (int): Index of the file to be converted.
    """
    pkl_pathname = self.waymo_tfrecord_pathnames[file_idx].replace(
        '.tfrecord', '.pkl')
    infos = mmcv.load(pkl_pathname)
    ## return still got error here
    for info in infos:
        filename = info['filename']
        T_k2w = info['T_k2w']
        context_name = info['context_name']
        frame_timestamp_micros = info['frame_timestamp_micros']
        if filename in self.name2idx:
            kitti_result = \
                self.kitti_result_files[self.name2idx[filename]]
            objects = self.parse_objects(kitti_result, T_k2w, context_name,
                                         frame_timestamp_micros)
        else:
            print(filename, 'not found.')
            objects = metrics_pb2.Objects()
        with open(
                join(self.waymo_results_save_dir, f'{filename}.bin'),
                'wb') as f:
            f.write(objects.SerializeToString())
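For completeness, here is a sketch of how those .pkl files could be produced offline, outside the distributed job, so TensorFlow never shares a GPU with torch at eval time. The field names mirror the reader above; the filename convention and the T_k2w construction follow convert_one's original logic (FRONT camera extrinsic times the converter's fixed self.k2w matrix), so treat the details as assumptions and double-check them against your mmdet3d version:

import numpy as np
import mmcv
import tensorflow as tf
from waymo_open_dataset import dataset_pb2 as open_dataset

def dump_one_tfrecord_to_pkl(self, file_idx):
    file_pathname = self.waymo_tfrecord_pathnames[file_idx]
    file_data = tf.data.TFRecordDataset(file_pathname, compression_type='')

    infos = []
    for frame_num, frame_data in enumerate(file_data):
        frame = open_dataset.Frame()
        frame.ParseFromString(bytearray(frame_data.numpy()))

        # FRONT camera has name == 1 in the Waymo proto.
        T_front_cam_to_vehicle = None
        for camera in frame.context.camera_calibrations:
            if camera.name == 1:
                T_front_cam_to_vehicle = np.array(
                    camera.extrinsic.transform).reshape(4, 4)

        infos.append({
            # Filename convention assumed to match the KITTI-style results:
            # prefix + tfrecord index + frame index, as in convert_one.
            'filename': f'{self.prefix}{file_idx:03d}{frame_num:03d}',
            'T_k2w': T_front_cam_to_vehicle @ self.k2w,
            'context_name': frame.context.name,
            'frame_timestamp_micros': frame.timestamp_micros,
        })

    mmcv.dump(infos, file_pathname.replace('.tfrecord', '.pkl'))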
I also tried passing the frame information straight from the dataset's data_infos, letting each convert_one subprocess read one frame's info out of data_infos, but not only were the results wrong, it was also very slow: the workers never actually ran in parallel, and only one subprocess was active at any time, for reasons I never figured out.
The machine in Beijing keeps stalling: reading files is very slow, and sometimes a reboot fixes it. Calling Waymo's official compute_metrics is also sometimes inexplicably slow, but only in DETR3D's eval; PointPillars' eval is fine. Cause unknown. It also hangs when building the gt_database during Waymo data creation; several days of running did not finish it.
The machine in Shanghai is fast and never gets stuck.
To be able to train with peace of mind, the first thing to guarantee is that the environment itself is sound; I do not know why mmdet is such a minefield...
After setting up an environment, here is what should be done right away: run
bash tools/dist_test.sh configs/fcos3d/fcos3d_r101_caffe_fpn_gn-head_dcn_2x8_1x_nus-mono3d.py ckpts/xxx.pth 4 --eval=bbox
bash tools/dist_train.sh projects/configs/detr3d/detr3d_res101_gridmask.py 4
and compare the results against the logs provided on GitHub. Then run the Waymo checks:
bash tools/dist_test.sh configs/pointpillars/hv_pointpillars_secfpn_sbn_2x16_2x_waymoD5-3d-car.py ckpts/hv_pointpillars_secfpn_sbn_2x16_2x_waymoD5-3d-car_20200901_204315-302fc3e7.pth 4 --eval=waymo
bash tools/dist_train.sh projects/configs/detr3d/detr3d_res101_gridmask_waymo_debug.py 4
I built a debug dataset out of a subset of Waymo; after changing project code or setting up a new environment, running the subset first is a quick check for obvious bugs (a sketch of how such a subset can be sliced follows these commands):
bash tools/dist_train.sh projects/configs/detr3d/detr3d_res101_gridmask_waymo_debug.py 4
bash tools/dist_test.sh projects/configs/detr3d/detr3d_res101_gridmask_waymo_debug.py work_dirs/detr3d_res101_gridmask_waymo_debug/epoch_1.pth 4 --eval=waymo
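As mentioned above, a minimal way to build the debug annotations is to slice the full info file and point a *_debug config at the result; the paths below assume the usual mmdet3d data layout and are illustrative only:

import mmcv

# Keep only the first 40 samples of the full validation infos and save them
# under a new name; the *_debug config's ann_file then points at this pkl.
infos = mmcv.load('data/waymo/kitti_format/waymo_infos_val.pkl')
mmcv.dump(infos[:40], 'data/waymo/kitti_format/waymo_infos_val_debug.pkl')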
If even mmdet3d itself is broken, there is no saving it; reinstall. If only DETR3D is wrong while mmdet3d checks out, then it is time to look for bugs in my own code.
Setting a sample stride for evaluation does not really make sense, mainly because Waymo's compute_metrics does not seem to support that kind of thing...
Running the PointPillars eval on waymo_subset gives the results below, which can serve as a fast check without running the full dataset:
{'Vehicle/L1 mAP': 0.00619617, 'Vehicle/L1 mAPH': 0.00615365,
 'Vehicle/L2 mAP': 0.00528095, 'Vehicle/L2 mAPH': 0.00524471,
 'Pedestrian/L1 mAP': 0.0, 'Pedestrian/L1 mAPH': 0.0,
 'Pedestrian/L2 mAP': 0.0, 'Pedestrian/L2 mAPH': 0.0,
 'Sign/L1 mAP': 0.0, 'Sign/L1 mAPH': 0.0, 'Sign/L2 mAP': 0.0, 'Sign/L2 mAPH': 0.0,
 'Cyclist/L1 mAP': 0.0, 'Cyclist/L1 mAPH': 0.0, 'Cyclist/L2 mAP': 0.0, 'Cyclist/L2 mAPH': 0.0,
 'Overall/L1 mAP': 0.00206539, 'Overall/L1 mAPH': 0.0020512166666666666,
 'Overall/L2 mAP': 0.0017603166666666668, 'Overall/L2 mAPH': 0.0017482366666666666}