Reproducing ReDet on an RTX 3090 with PyTorch 1.8.1

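Today I started reproducing ReDet (CVPR 2021):
ReDet: A Rotation-equivariant Detector for Aerial Object Detection
Official GitHub repository: https://github.com/csuhan/ReDet
The environment is an RTX 3090 rented on AutoDL (2.6 CNY/hour) and the dataset is HRSC2016. Training for 36 epochs takes about two hours. From debugging to a working run I spent about 20 CNY in total; with a working environment, training 36 epochs alone costs about 5 CNY.
Note: my first attempt failed. That attempt and the problems I hit are recorded in the running-log section at the end. The version described first is the one that worked.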

1. Installation

Requirements
Linux
Python 3.5/3.6/3.7
PyTorch 1.1/1.3.1
CUDA 10.0/10.1
NCCL 2+
GCC 4.9+
mmcv<=0.2.14

The official requirements are listed above. The AutoDL server I chose has:
RTX 3090
PyTorch 1.8.1
Python 3.8
CUDA 11.1

Install ReDet

a. Create a conda virtual environment and activate it. Then install Cython.

First create a conda environment named redet with Python 3.7, then install Cython.

conda create -n redet python=3.7 -y
source activate redet
conda install cython

Then install the mmcv requirement listed above, pinned to mmcv==0.2.13 (mmdet 0.6.0, installed later, does not support 0.2.14, so I use 0.2.13).

pip install mmcv==0.2.13

b. Install PyTorch and torchvision following the official instructions.

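Because our versions differ from the official ones, take the install command from the previous-versions page on the PyTorch website:
https://pytorch.org/get-started/previous-versions/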

pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

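Tip: the official README lists several version-related notes here. My previous attempt failed precisely because of a version mismatch: PyTorch 1.1.0 on an RTX 3090 simply cannot work. Read the notes below if you are interested; I will not go through them in detail.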
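The version mismatch mentioned above can be checked before installing anything. Ampere GPUs such as the RTX 3090 need a CUDA 11.x build, while the officially required CUDA 10.0/10.1 builds carry no kernels for them. The sketch below is only an illustration: it assumes the `+cuXYZ` suffix convention used by the wheels on download.pytorch.org, and the helper name `wheel_targets_cuda11` is hypothetical.

```python
import re

# Ampere GPUs (the RTX 3090 included) require CUDA 11.0 or newer.
MIN_CUDA_FOR_AMPERE = (11, 0)


def wheel_targets_cuda11(version: str) -> bool:
    """Return True if a version string such as '1.8.1+cu111' names a
    CUDA 11.x (or newer) build.

    Hypothetical helper: it only inspects the '+cuXYZ' suffix and knows
    nothing about how the wheel was actually compiled.
    """
    match = re.search(r"\+cu(\d{2,})$", version)
    if match is None:
        return False  # CPU-only build or unknown tag
    digits = match.group(1)
    # 'cu111' -> (11, 1), 'cu101' -> (10, 1), 'cu92' -> (9, 2)
    major, minor = int(digits[:-1]), int(digits[-1])
    return (major, minor) >= MIN_CUDA_FOR_AMPERE
```

With the versions used in this post, `wheel_targets_cuda11("1.8.1+cu111")` is true, whereas a `cu100` or `cu101` tag is rejected.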

Note:
1. If you want to use PyTorch>1.5, you have to make some modifications to the CUDA ops. See here for a reference.
2. There is a known bug that happens to some users but not all (I have successfully run it on V100 and Titan Xp). If it occurs, please refer to here.
3. If you want to use Python<=3.6, you need to install e2cnn@legacy_py3.6 manually, see here for an instruction.

c. Clone the ReDet repository.

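It does not matter if the previous step has not finished; you can open a second connection and do this step in parallel (ignore this if it does not make sense to you).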

git clone https://github.com/csuhan/ReDet.git
cd ReDet

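Create a folder named data here (the original code expects this layout, which makes debugging easier).

Then copy the dataset in. I had uploaded it to my AutoDL network drive and ran the copy in a second terminal window (again, ignore this if it does not make sense).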

cp /root/autodl-nas/HRSC2016 /root/ReDet/data/ -r

d. Compile cuda extensions.

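Now for the exciting part: running the build script. Fair warning, this step is full of bugs; count yourself lucky if it works the first time. In my environment every AT_CHECK in mmdet/ops had to be replaced with TORCH_CHECK.
The command below comes from a GitHub issue. It walks every file under the current directory and replaces AT_CHECK with TORCH_CHECK. I ran it inside ReDet/mmdet/ops, because running it from the filesystem root would take far too long.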

find . -type f -exec sed -i 's/AT_CHECK/TORCH_CHECK/g' {} +
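The same replacement can be sketched in Python, which makes it easier to restrict the edit to source files and to see which files were touched. This is an illustrative alternative to the one-liner above, not something from the ReDet repository; the extension list and the name `replace_at_check` are assumptions.

```python
from pathlib import Path

# Extensions assumed to cover the C++/CUDA sources under mmdet/ops.
SOURCE_SUFFIXES = {".cpp", ".cu", ".cuh", ".h", ".hpp"}


def replace_at_check(root):
    """Replace AT_CHECK with TORCH_CHECK in source files under root.

    Files are handled as bytes so that encoding never matters.
    Returns the paths of the files that were modified.
    """
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in SOURCE_SUFFIXES:
            continue
        data = path.read_bytes()
        if b"AT_CHECK" not in data:
            continue
        path.write_bytes(data.replace(b"AT_CHECK", b"TORCH_CHECK"))
        changed.append(str(path))
    return changed
```

Called as `replace_at_check("mmdet/ops")` from the repository root, it returns the list of rewritten files, which is a quick way to confirm that the macro was actually present.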

Then compile:

bash compile.sh

Error:

(redet) root@container-2f3811a53c-c526a191:~/ReDet# bash compile.sh
Building roi align op...
Traceback (most recent call last):
  File "setup.py", line 2, in <module>
    from torch.utils.cpp_extension import BuildExtension, CUDAExtension
ModuleNotFoundError: No module named 'torch'

It turns out PyTorch was never installed properly.
Looking back at the install output:

(redet) root@container-2f3811a53c-c526a191:~/redet# pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.8.1+cu111
  Downloading https://download.pytorch.org/whl/cu111/torch-1.8.1%2Bcu111-cp37-cp37m-linux_x86_64.whl (1982.2 MB)
     |███████████████████████████████ | 1922.0 MB 31 kB/s eta 0:31:26Killed

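Oops: while using a proxy I had accidentally dropped the remote connection. My mistake.
Waiting for the install again, started at 16:22, watching videos in the meantime (gave up on that; I had been debugging for an hour and stood up to stretch my back).
Damn, it was Killed again. A fix I found online is to append a flag to the command: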

pip install xxxx --no-cache-dir
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html --no-cache-dir

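Commenters say this flag helps a lot, also when downloading other pip packages; I had never tried it before.
The PyTorch install now works, though I am not certain the flag was the reason.
This time it reported success: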

Successfully installed pillow-9.0.1 torch-1.8.1+cu111 torchaudio-0.8.1 torchvision-0.9.1+cu111 typing-extensions-4.1.1

Compile again.
I noted the messages reported during compilation:

/root/miniconda3/envs/redet/lib/python3.7/site-packages/torch/include/ATen/core/TensorBody.h:303:30: note: declared here
   DeprecatedTypeProperties & type() const {
                              ^~~~

  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1561, in _get_cuda_arch_flags
    arch_list[-1] += '+PTX'
IndexError: list index out of range

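This does not seem to matter much. One open question: I was running in AutoDL's no-GPU mode, and I am not sure whether that affects compilation.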

gcc: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-7/README.Bugs> for instructions.
error: command 'gcc' failed with exit status 4 

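Another big red error. People online say it means insufficient memory, so I will run the build again once a GPU is attached.
Ignoring it for now and moving on.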
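Both messages above are plausibly consequences of the no-GPU session rather than of the code. With no visible GPU, PyTorch's extension builder has no device from which to infer a target architecture, and a small instance can run out of memory when several compiler processes run at once. One hedged workaround is to set the build environment explicitly. `TORCH_CUDA_ARCH_LIST` and `MAX_JOBS` are environment variables read by `torch.utils.cpp_extension`; the value `8.6` (the RTX 3090's compute capability), the choice of a single job, and the helper names are assumptions for this setup, and `MAX_JOBS` only matters when the build goes through ninja.

```python
import os
import subprocess


def build_env(arch_list="8.6", max_jobs=1):
    """Copy the current environment and pin the CUDA target and job count."""
    env = dict(os.environ)
    env["TORCH_CUDA_ARCH_LIST"] = arch_list  # target arch without querying a GPU
    env["MAX_JOBS"] = str(max_jobs)          # fewer parallel compilers, less RAM
    return env


def compile_ops(repo_dir):
    """Run compile.sh inside repo_dir with the pinned environment.

    Returns the exit code of the script (0 on success).
    """
    result = subprocess.run(["bash", "compile.sh"], cwd=repo_dir, env=build_env())
    return result.returncode
```

In this log the rebuild on a real RTX 3090 instance succeeded without any of this, so treat the sketch as an option for building in no-GPU mode, not as a required step.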

e. Install ReDet (other dependencies will be installed automatically).

python setup.py develop
# or "pip install -e ."

If it gets stuck on a package along the way, install that package manually with pip.

Install DOTA_devkit

sudo apt-get install swig
cd DOTA_devkit
swig -c++ -python polyiou.i
python setup.py build_ext --inplace

For the first line I had to use conda install swig instead.

Off to eat.

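March 27, 2022, 10:20 in the morning: rented an RTX 3090 and ran bash compile.sh again to see whether the big red memory-related error was still there.
It was not. I saw no errors and the compilation finished, haha.
Next, prepare the txt files.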

2. Getting Started

Prepare the dataset.
My code is in

/root/ReDet

My dataset is in

/root/ReDet/data/HRSC2016

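The ImageSets shipped with HRSC2016 are unusable because they do not match the images actually present in Train and Test, so I wrote generate_txt.py by hand to generate train.txt and test.txt: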

import os


def write_image_list(images_path, txt_save_path):
    # Write one image ID (file name without extension) per line.
    with open(txt_save_path, "w") as fw:
        for filename in os.listdir(images_path):
            image_id = filename.split(".")[0]
            print(image_id)
            fw.write(image_id + '\n')


# First argument: directory containing the images; second: the list file to generate.
write_image_list('/root/ReDet/data/HRSC2016/Train/images', '/root/ReDet/data/HRSC2016/train.txt')
write_image_list('/root/ReDet/data/HRSC2016/Test/images', '/root/ReDet/data/HRSC2016/test.txt')
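A slightly more defensive variant of the script above is sketched here. It assumes that image IDs are the file names without extension and that only `.bmp` files should be listed; it also skips hidden entries such as `.ipynb_checkpoints`, which JupyterLab may leave in a data folder. The names `list_image_ids` and `write_id_list` are hypothetical.

```python
import os


def list_image_ids(images_path, suffix=".bmp"):
    """Return the sorted IDs of the image files directly under images_path."""
    ids = []
    for name in sorted(os.listdir(images_path)):
        full = os.path.join(images_path, name)
        # Skip hidden entries (e.g. .ipynb_checkpoints) and sub-directories.
        if name.startswith(".") or not os.path.isfile(full):
            continue
        stem, ext = os.path.splitext(name)
        if ext.lower() == suffix:
            ids.append(stem)
    return ids


def write_id_list(images_path, txt_save_path):
    """Write one image ID per line and return how many were written."""
    ids = list_image_ids(images_path)
    with open(txt_save_path, "w") as fw:
        fw.write("".join(image_id + "\n" for image_id in ids))
    return len(ids)
```

The returned count gives an immediate sanity check against the expected number of images in each split.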

Then run

python DOTA_devkit/HRSC20162COCO.py

Then copy the files provided by the author (the released checkpoint) into a newly created work_dirs folder:

 cp /root/autodl-nas/ReDet_re50_refpn_3x_hrsc2016/ /root/ReDet/work_dirs/ -r

Run test.py:

python tools/test.py configs/ReDet/ReDet_re50_refpn_3x_hrsc2016.py work_dirs/ReDet_re50_refpn_3x_hrsc2016/ReDet_re50_refpn_3x_hrsc2016-d1b4bd29.pth --out work_dirs/ReDet_re50_refpn_3x_hrsc2016/results.pkl

Output:


ReResNet Orientation: 8 Fix Params: False
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
/root/miniconda3/envs/redet/lib/python3.7/site-packages/e2cnn-0.2.1-py3.7.egg/e2cnn/nn/modules/r2_conv/basisexpansion_singleblock.py:80: 
UserWarning: indexing with dtype torch.uint8 is now deprecated, please use a dtype torch.bool instead. (Triggered internally at  
/pytorch/aten/src/ATen/native/IndexingUtils.h:30.)
  full_mask[mask] = norms.to(torch.uint8)
The model and loaded state dict do not match exactly

missing keys in source state_dict: neck.fpn_convs.0.conv.expanded_bias, backbone.layer3.5.conv3.filter, neck.fpn_convs.0.conv.filter, 
(about 20 lines omitted here)
backbone.layer4.0.conv3.filter, backbone.conv1.filter

Finally, it shows progress:

[                                                  ] 0/444, elapsed: 0s, ETA:/root/ReDet/mmdet/core/bbox/transforms.py:56: UserWarning: This overload of addcmul is deprecated:
        addcmul(Tensor input, Number value, Tensor tensor1, Tensor tensor2, *, Tensor out)
Consider using one of the following signatures instead:
        addcmul(Tensor input, Tensor tensor1, Tensor tensor2, *, Number value, Tensor out) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:1005.)
  gx = torch.addcmul(px, 1, pw, dx)  # gx = px + pw * dx
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 444/444, 3.2 task/s, elapsed: 138s, ETA:     0s
writing results to work_dirs/ReDet_re50_refpn_3x_hrsc2016/results.pkl
(redet) root@container-2f3811a53c-c526a191:~/ReDet# 

I guess that means it worked. I could cry.

Now try the evaluation.
First change these lines in hrsc2016_evaluation.py:

    detpath = r'work_dirs/Task1_{:s}.txt'
    annopath = r'data/HRSC2016/Test/labelTxt/{:s}.txt'  # change the directory to the path of val/labelTxt, if you want to do evaluation on the valset
    imagesetfile = r'data/HRSC2016/test.txt'

Then run

python DOTA_devkit/hrsc2016_evaluation.py

I do not understand most of the output. I only recognize the final AP50 of 90.46, which matches the result in the paper.

(redet) root@container-2f3811a53c-c526a191:~/ReDet# python DOTA_devkit/hrsc2016_evaluation.py
DOTA_devkit/hrsc2016_evaluation.py:153: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  difficult = np.array([x['difficult'] for x in R]).astype(np.bool)
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [0. 1. 0. ... 1. 1. 1.]
check tp [1. 0. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [1. 1. 0. ... 1. 1. 1.]
check tp [0. 0. 1. ... 0. 0. 0.]
npos num: 1188
AP50: 90.46     AP75: 89.46      mAP: 70.41
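For anyone equally puzzled by this output: the log contains ten repeated blocks, which would be consistent with one evaluation per IoU threshold, AP50 and AP75 being the values at 0.5 and 0.75 and mAP the average over all thresholds. That reading is an assumption; the output itself does not say so. The evaluation call in hrsc2016_evaluation.py passes `use_07_metric=True`, i.e. the PASCAL VOC 2007 11-point average precision. A minimal sketch of that metric, written independently of the ReDet code, is:

```python
def voc07_ap(recall, precision):
    """11-point interpolated average precision (PASCAL VOC 2007).

    recall and precision are equal-length sequences holding the
    cumulative values after each detection, in descending score order.
    """
    ap = 0.0
    for i in range(11):
        t = i / 10.0  # recall levels 0.0, 0.1, ..., 1.0
        # Best precision among operating points whose recall is at least t.
        candidates = [p for r, p in zip(recall, precision) if r >= t]
        ap += (max(candidates) if candidates else 0.0) / 11.0
    return ap
```

A detector that reaches full recall at precision 1.0 scores 1.0, while one that stops at recall 0.5 with perfect precision scores 6/11, because only the six recall levels from 0.0 to 0.5 contribute.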

Test the inference demo on a large image:

python demo_large_image.py

Error:

ReResNet Orientation: 8 Fix Params: False
loading annotations into memory...
Traceback (most recent call last):
  File "demo_large_image.py", line 137, in <module>
    r"work_dirs/ReDet_re50_refpn_1x_dota15_ms/ReDet_re50_refpn_1x_dota15_ms-9d1a523c.pth")
  File "demo_large_image.py", line 89, in __init__
    self.dataset = get_dataset(self.data_test)
  File "/root/ReDet/mmdet/datasets/utils.py", line 109, in get_dataset
    dset = obj_from_dict(data_info, datasets)
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/mmcv/runner/utils.py", line 78, in obj_from_dict
    return obj_type(**args)
  File "/root/ReDet/mmdet/datasets/custom.py", line 68, in __init__
    self.img_infos = self.load_annotations(ann_file)
  File "/root/ReDet/mmdet/datasets/coco.py", line 25, in load_annotations
    self.coco = COCO(ann_file)
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/pycocotools-2.0.4-py3.7-linux-x86_64.egg/pycocotools/coco.py", line 81, in __init__
    with open(annotation_file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/workfs/jmhan/dota15_1024_ms/test1024/DOTA1_5_test1024.json'

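Not fixing this right now; time to eat.
March 27, 2022, 15:55
Time to start training.

First, I tested large-image inference (prediction).

Modify the paths in the demo script slightly: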

    model = DetectorModel(
        r"configs/ReDet/ReDet_re50_refpn_3x_hrsc2016.py",
        r"work_dirs/ReDet_re50_refpn_3x_hrsc2016/ReDet_re50_refpn_3x_hrsc2016-d1b4bd29.pth")

    img_dir = "byHand/largeImage"
    out_dir = 'byHand'

I put in a single image, 100000011.bmp,
then ran the script.
The output:

(redet) root@container-2f3811a53c-c526a191:~/ReDet# python demo_large_image.py
ReResNet Orientation: 8 Fix Params: False
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
/root/miniconda3/envs/redet/lib/python3.7/site-packages/e2cnn-0.2.1-py3.7.egg/e2cnn/nn/modules/r2_conv/basisexpansion_singleblock.py:80: UserWarning: indexing with dtype torch.uint8 is now deprecated, please use a dtype torch.bool instead. (Triggered internally at  /pytorch/aten/src/ATen/native/IndexingUtils.h:30.)
  full_mask[mask] = norms.to(torch.uint8)
The model and loaded state dict do not match exactly

missing keys in source state_dict:
 backbone.layer3.4.conv2.filter,
  backbone.layer3.5.conv1.filter, 
  backbone.layer3.5.conv3.filter,
   neck.lateral_convs.3.conv.filter
   (many more lines omitted here)

100000011.bmp
  0%|                                                                                                      | 0/2 [00:00, ?it/s]/root/ReDet/mmdet/core/bbox/transforms.py:56: UserWarning: This overload of addcmul is deprecated:
        addcmul(Tensor input, Number value, Tensor tensor1, Tensor tensor2, *, Tensor out)
Consider using one of the following signatures instead:
        addcmul(Tensor input, Tensor tensor1, Tensor tensor2, *, Number value, Tensor out) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:1005.)
  gx = torch.addcmul(px, 1, pw, dx)  # gx = px + pw * dx
100%|██████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  8.33it/s]
(redet) root@container-2f3811a53c-c526a191:~/ReDet# 
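Large-image inference of this kind is commonly done by cutting the image into overlapping chips, running the detector on each chip, and merging the results. The sketch below only illustrates how chip positions can be laid out. The chip size and stride are placeholder values, not necessarily those used by demo_large_image.py, and the function names are hypothetical.

```python
def chip_origins(length, chip, stride):
    """Start offsets of chips of size `chip` that together cover [0, length)."""
    if length <= chip:
        return [0]
    origins = list(range(0, length - chip + 1, stride))
    if origins[-1] + chip < length:
        origins.append(length - chip)  # last chip sits flush with the border
    return origins


def chip_boxes(width, height, chip=1024, stride=512):
    """Return (x1, y1, x2, y2) boxes of all chips, row by row."""
    return [
        (x, y, min(x + chip, width), min(y + chip, height))
        for y in chip_origins(height, chip, stride)
        for x in chip_origins(width, chip, stride)
    ]
```

An image smaller than one chip yields a single box covering the whole image, so the same code path handles small and large inputs.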

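Then look at the image generated in the output folder:

Here is the original image for comparison.

I am thrilled: inference works, which means training will most likely work too. While the excitement lasts, let me test a few more images for my lab report, haha.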

Running training in the background

Try nohup first:

nohup python tools/test.py configs/ReDet/ReDet_re50_refpn_3x_hrsc2016.py work_dirs/ReDet_re50_refpn_3x_hrsc2016/ReDet_re50_refpn_3x_hrsc2016-d1b4bd29.pth --out work_dirs/ReDet_re50_refpn_3x_hrsc2016/results.pkl >xxxcbtest.log 2>&1 &

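OK. Earlier, after printing "nohup: ignoring input", it was just a bit slow; it works now. Training starts shortly.
First empty work_dirs.
Starting the training stumped me; I was lost everywhere: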


positional arguments:
  config                train config file path

optional arguments:
  -h, --help            show this help message and exit
  --work_dir WORK_DIR   the dir to save logs and models
  --resume_from RESUME_FROM
                        the checkpoint file to resume from
  --validate            whether to evaluate the checkpoint during training
  --gpus GPUS           number of gpus to use (only applicable to non-distributed training)
  --seed SEED           random seed
  --launcher {none,pytorch,slurm,mpi}
                        job launcher
  --local_rank LOCAL_RANK

I had left out the required positional argument (config). I simply did not know what a positional argument was.

 python tools/train.py configs/ReDet/ReDet_re50_refpn_3x_hrsc2016.py
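In argparse terms, `config` in the help text above is a positional argument: it is given by position, without a `--name`, and the parser refuses to run without it. A minimal sketch that mirrors a few of the options listed above (it is not the actual content of tools/train.py) is:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Train a detector")
    # Positional: no leading dashes, always required.
    parser.add_argument("config", help="train config file path")
    # Optional: introduced by --name, may be left out.
    parser.add_argument("--work_dir", help="the dir to save logs and models")
    parser.add_argument("--gpus", type=int, default=1,
                        help="number of gpus to use")
    return parser
```

`build_parser().parse_args(["some_config.py", "--gpus", "2"])` succeeds, whereas `parse_args([])` prints the usage message and exits, which is the behaviour seen when the config path is omitted.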

The pretrained backbone also has to be uploaded to work_dirs.
Edit the path in ReDet_re50_refpn_3x_hrsc2016.py so that it matches the pretrained .pth file just uploaded:

pretrained='work_dirs/ReResNet_pretrain/re_resnet50_c8_batch256-25b16846.pth',

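Then start training.
The README says the learning rate should be
0.01 for 4 GPUs
and
0.04 for 16 GPUs.
I had only 1 GPU and at first did not change lr, thinking that changing it might slow training down, so I left it. Later
I changed the learning rate to 0.005 and started training with two RTX 3090s.
Line 142 of the config: imgs_per_gpu is the per-GPU batch size.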

Current settings: imgs_per_gpu == 4
lr == 0.005
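The two reference points quoted above (0.01 for 4 GPUs, 0.04 for 16 GPUs) scale linearly with the number of GPUs, which is the usual linear scaling rule for the total batch size. The sketch below spells that rule out. It assumes the reference setup uses 2 images per GPU, which is not stated above, and `scaled_lr` is a hypothetical helper.

```python
def scaled_lr(num_gpus, imgs_per_gpu, ref_lr=0.01, ref_gpus=4, ref_imgs_per_gpu=2):
    """Scale the learning rate linearly with the total batch size.

    ref_imgs_per_gpu=2 is an assumption about the reference setup.
    """
    batch = num_gpus * imgs_per_gpu
    ref_batch = ref_gpus * ref_imgs_per_gpu
    return ref_lr * batch / ref_batch
```

Under that assumption, both 2 GPUs with 2 images each and 1 GPU with 4 images give a total batch of 4 and hence a learning rate of 0.005, the value used here.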

nohup python tools/train.py configs/ReDet/ReDet_re50_refpn_3x_hrsc2016.py --gpus 2 >xxxcbtest.log 2>&1 &
[2] 3974

Try distributed training:

bash tools/dist_train.sh configs/ReDet/ReDet_re50_refpn_3x_hrsc2016.py 2

It does not work: it hangs at "ReResNet Orientation: 8 Fix Params: False".
Back to a single GPU.

nohup python tools/train.py configs/ReDet/ReDet_re50_refpn_3x_hrsc2016.py >xxxcbtest.log 2>&1 &
GPU memory usage stays at around 6 GB, so the settings still need tuning.
Sun Mar 27 18:32:11 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.00    Driver Version: 470.82.00    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:3D:00.0 Off |                  N/A |
| 30%   43C    P2   216W / 350W |   6402MiB / 24268MiB |     94%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

After that I was stuck again. My guess was that I needed to run the test, convert the results, and then evaluate, so I tried that.

python tools/test.py configs/ReDet/ReDet_re50_refpn_3x_hrsc2016.py work_dirs/ReDet_re50_refpn_3x_hrsc2016/latest.pth --out work_dirs/ReDet_re50_refpn_3x_hrsc2016/results.pkl

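I wanted to run the command above in no-GPU mode, but it was far too slow, so back to the 3090.
It finished and produced the pkl file.
Then run parse_results.py, with its config argument set as follows: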

parser.add_argument('--config', default='configs/ReDet/ReDet_re50_refpn_3x_hrsc2016.py')

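The parse_results.py script above converts the pkl output into txt files, and the evaluation is then run on those txt files. Adjust the file paths yourself.

Results: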

(redet) root@container-2f3811a53c-c526a191:~/ReDet# python DOTA_devkit/hrsc2016_evaluation.py 
DOTA_devkit/hrsc2016_evaluation.py:153: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  difficult = np.array([x['difficult'] for x in R]).astype(np.bool)
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [1. 0. 0. ... 1. 1. 1.]
check tp [0. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [1. 1. 1. ... 1. 1. 1.]
check tp [0. 0. 0. ... 0. 0. 0.]
npos num: 1188
AP50: 90.37     AP75: 88.93      mAP: 69.46

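Again a pile of output I do not understand, but the final AP50 changed: it is slightly lower (90.37 versus 90.46). I consider the reproduction done. Time to review.

The running log of my first attempt follows.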

1. Rent a machine on AutoDL and set up the requirements

First rent a 2080 Ti and choose the environment image.

Configure the environment in no-GPU mode first.
[Screenshot: AutoDL pricing]
Then check the libraries required by the GitHub README.
Check NCCL.
Check GCC with the command:

gcc -v

Check mmcv: I did not find a way to check it, so I simply installed it:

pip install mmcv==0.2.14


2. Install ReDet

Done.
Because of the environment I had at the time, the install command was:

conda install pytorch==1.1.0 torchvision==0.3.0 cudatoolkit=10.0 -c pytorch

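For step (c), it is best to clone into the root directory. Otherwise, if your AutoDL GPU gets taken by someone else and the data cannot be migrated, you have to set up the environment all over again.

This step often gets stuck; install whichever package it is stuck on manually with pip install.

Dataset preparation:
First run HRSC2DOTA.py. I found this file in another of the author's repositories. For any other missing files, look through his other GitHub projects, copy them over, run the script, and finally rename the output files.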

Now rent a 3090 and run it

March 23, 2022

The first command I ran to test on HRSC2016:

python tools/test.py configs/ReDet/ReDet_re50_refpn_3x_hrsc2016.py work_dirs/ReDet_re50_refpn_3x_hrsc2016/ReDet_re50_refpn_3x_hrsc2016-d1b4bd29.pth --out work_dirs/ReDet_re50_refpn_3x_hrsc2016/results.pkl

An error occurred:

Traceback (most recent call last):
  File "tools/test.py", line 9, in <module>
    from mmcv.runner import load_checkpoint, get_dist_info
  File "/root/miniconda3/lib/python3.7/site-packages/mmcv/runner/__init__.py", line 1, in <module>
    from .runner import Runner
  File "/root/miniconda3/lib/python3.7/site-packages/mmcv/runner/runner.py", line 9, in <module>
    from .checkpoint import load_checkpoint, save_checkpoint
  File "/root/miniconda3/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 10, in <module>
    import torchvision
  File "/root/miniconda3/lib/python3.7/site-packages/torchvision/__init__.py", line 2, in <module>
    from torchvision import datasets
  File "/root/miniconda3/lib/python3.7/site-packages/torchvision/datasets/__init__.py", line 9, in <module>
    from .fakedata import FakeData
  File "/root/miniconda3/lib/python3.7/site-packages/torchvision/datasets/fakedata.py", line 3, in <module>
    from .. import transforms
  File "/root/miniconda3/lib/python3.7/site-packages/torchvision/transforms/__init__.py", line 1, in <module>
    from .transforms import *
  File "/root/miniconda3/lib/python3.7/site-packages/torchvision/transforms/transforms.py", line 17, in <module>
    from . import functional as F
  File "/root/miniconda3/lib/python3.7/site-packages/torchvision/transforms/functional.py", line 5, in <module>
    from PIL import Image, ImageOps, ImageEnhance, PILLOW_VERSION
ImportError: cannot import name 'PILLOW_VERSION' from 'PIL' (/root/miniconda3/lib/python3.7/site-packages/PIL/__init__.py)

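Check whether this problem appears in the repository's issues.
(First download the checkpoints and try again.)
Still failing, so back to the issues.
I had forgotten to activate the conda environment (although that was not the cause).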

source activate redet

Nothing in the issues, so I searched Baidu.
People online say the error is caused by a Pillow version that is too new, so I downgraded it:

conda install pillow==6.2.0
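The background to this fix: the constant `PILLOW_VERSION` was removed in Pillow 7.0.0, while the torchvision shown in the traceback still imports it, so any Pillow release below 7 avoids this particular ImportError. A tiny sketch of that check, with a hypothetical function name and operating on a version string only:

```python
def pillow_provides_pillow_version(version):
    """Return True if this Pillow release still exports PILLOW_VERSION.

    The constant was removed in Pillow 7.0.0, so only the major version
    needs to be inspected.
    """
    major = int(version.split(".")[0])
    return major < 7
```

For the pinned release, `pillow_provides_pillow_version("6.2.0")` is true; for a 7.x or later release it is false.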

Then another error, probably a wrong path:

(redet) root@container-e19b1182ac-a18adac2:~/ReDet# python tools/test.py configs/ReDet/ReDet_re50_refpn_3x_hrsc2016.py work_dirs/ReDet_re50_refpn_3x_hrsc2016/ReDet_re50_refpn_3x_hrsc2016-d1b4bd29.pth --out work_dirs/ReDet_re50_refpn_3x_hrsc2016/results.pkl
ReResNet Orientation: 8 Fix Params: False
loading annotations into memory...
Traceback (most recent call last):
  File "tools/test.py", line 208, in <module>
    main()
  File "tools/test.py", line 158, in main
    dataset = get_dataset(cfg.data.test)
  File "/root/ReDet/mmdet/datasets/utils.py", line 109, in get_dataset
    dset = obj_from_dict(data_info, datasets)
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/mmcv-0.2.13-py3.7-linux-x86_64.egg/mmcv/runner/utils.py", line 78, in obj_from_dict
    return obj_type(**args)
  File "/root/ReDet/mmdet/datasets/custom.py", line 68, in __init__
    self.img_infos = self.load_annotations(ann_file)
  File "/root/ReDet/mmdet/datasets/coco.py", line 25, in load_annotations
    self.coco = COCO(ann_file)
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/pycocotools-2.0.4-py3.7-linux-x86_64.egg/pycocotools/coco.py", line 81, in __init__
    with open(annotation_file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/HRSC2016/Test/HRSC_L1_test.json'

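So I went looking for this file. (I regret not following the official layout. Better to put the files and the dataset where the author expects them, which at least avoids some problems.)

Put the HRSC2016 dataset in /root/ReDet/data/HRSC2016: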

(redet) root@container-e19b1182ac-a18adac2:~/ReDet# python tools/test.py configs/ReDet/ReDet_re50_refpn_3x_hrsc2016.py work_dirs/ReDet_re50_refpn_3x_hrsc2016/ReDet_re50_refpn_3x_hrsc2016-d1b4bd29.pth --out work_dirs/ReDet_re50_refpn_3x_hrsc2016/results.pkl
ReResNet Orientation: 8 Fix Params: False
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!


The model and loaded state dict do not match exactly

missing keys in source state_dict: neck.lateral_convs.1.conv.filter, backbone.layer4.2.conv2.filter, backbone.layer3.3.conv3.filter, backbone.layer3.0.conv1.filter, backbone.layer2.1.conv1.filter, backbone.layer3.5.conv3.filter, backbone.layer4.1.conv2.filter, backbone.layer4.0.conv1.filter, backbone.layer3.4.conv1.filter, 
neck.lateral_convs.2.conv.expanded_bias, neck.lateral_convs.3.conv.filter, backbone.layer2.0.downsample.0.filter, backbone.conv1.filter, backbone.layer4.0.downsample.0.filter, backbone.layer2.2.conv2.filter, backbone.layer3.1.conv2.filter, backbone.layer2.3.conv1.filter, backbone.layer2.0.conv1.filter, neck.lateral_convs.1.conv.expanded_bias, 
backbone.layer4.0.conv3.filter, backbone.layer4.2.conv3.filter, backbone.layer3.1.conv3.filter, backbone.layer3.5.conv2.filter, backbone.layer3.2.conv3.filter, neck.fpn_convs.2.conv.filter, backbone.layer2.0.conv3.filter, neck.fpn_convs.3.conv.filter, backbone.layer3.4.conv2.filter, 
backbone.layer3.0.conv2.filter, backbone.layer4.1.conv1.filter, neck.fpn_convs.0.conv.filter, backbone.layer4.2.conv1.filter, backbone.layer3.0.conv3.filter, backbone.layer4.0.conv2.filter, backbone.layer3.5.conv1.filter, backbone.layer2.1.conv3.filter, backbone.layer2.1.conv2.filter, neck.fpn_convs.2.conv.expanded_bias, neck.fpn_convs.3.conv.expanded_bias, backbone.layer3.1.conv1.filter, backbone.layer4.1.conv3.filter, neck.lateral_convs.2.conv.filter, neck.fpn_convs.1.conv.expanded_bias, neck.fpn_convs.1.conv.filter, 
backbone.layer2.2.conv1.filter, neck.lateral_convs.0.conv.expanded_bias, backbone.layer3.2.conv1.filter, 
backbone.layer3.4.conv3.filter, neck.lateral_convs.0.conv.filter, neck.fpn_convs.0.conv.expanded_bias, backbone.layer2.3.conv3.filter, backbone.layer2.0.conv2.filter, 
neck.lateral_convs.3.conv.expanded_bias, backbone.layer3.3.conv1.filter, backbone.layer3.2.conv2.filter, backbone.layer2.3.conv2.filter, backbone.layer3.0.downsample.0.filter, backbone.layer2.2.conv3.filter, backbone.layer3.3.conv2.filter

According to the issues this warning is normal. Starting again at 15:33; maybe the previous run simply took too long. I am not sure.

After 15 minutes on a single RTX 3090 there is still no result.
Still no response, so I killed it and tried the test again.
It never succeeded, and editing hrsc2016_evaluation.py did not help either:

root@container-e19b1182ac-a18adac2:~/ReDet# source activate redet
(redet) root@container-e19b1182ac-a18adac2:~/ReDet# python DOTA_devkit/hrsc2016_evaluation.py
Traceback (most recent call last):
  File "DOTA_devkit/hrsc2016_evaluation.py", line 293, in <module>
    main()
  File "DOTA_devkit/hrsc2016_evaluation.py", line 282, in main
    rec, prec, ap = voc_eval(detpath, annopath, imagesetfile, 'ship', ovthresh=iou_thr, use_07_metric=True)
  File "DOTA_devkit/hrsc2016_evaluation.py", line 125, in voc_eval
    with open(imagesetfile, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/HRSC2016/Test/test.txt'
(redet) root@container-e19b1182ac-a18adac2:~/ReDet# python DOTA_devkit/hrsc2016_evaluation.py
Traceback (most recent call last):
  File "DOTA_devkit/hrsc2016_evaluation.py", line 297, in <module>
    main()
  File "DOTA_devkit/hrsc2016_evaluation.py", line 286, in main
    rec, prec, ap = voc_eval(detpath, annopath, imagesetfile, 'ship', ovthresh=iou_thr, use_07_metric=True)
  File "DOTA_devkit/hrsc2016_evaluation.py", line 134, in voc_eval
    recs[imagename] = parse_gt(annopath.format(imagename))
  File "DOTA_devkit/hrsc2016_evaluation.py", line 28, in parse_gt
    with  open(filename, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/HRSC2016/Test/labelTxt/100000624.txt'

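I asked a senior labmate: the generation step probably did not finish, since the txt file has only 500 lines.
The problem lies in how the dataset was generated and converted. The original repository does not provide HRSC2DOTA.py, and the copy I found elsewhere has problems. I will deal with it later.

Started reading HRSC2DOTA.py.
It processes objects with difficult == 0 and ignores those with difficult == 1, which I find a bit puzzling.

Finished reading HRSC2DOTA.py; it looks fine. Results of running it:
Count how many files (and folders) the current directory contains: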

ls | wc -w

Train
AllImages: 626 files
Annotations: 626
labelTxt: 626
All good.
Test
AllImages: 444
Annotations: 444
labelTxt: 444
I also downloaded the dataset on Windows and checked: 626 files in Train and 444 in Test, which matches.
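`ls | wc -w` counts whitespace-separated words, so it miscounts names that contain spaces and also counts sub-directories. A small sketch that counts regular, non-hidden files per sub-folder instead; `count_files` is a hypothetical helper, and the folder names in the usage note are the ones listed above.

```python
import os


def count_files(root):
    """Map each non-hidden sub-folder of root to its number of regular files."""
    counts = {}
    for name in sorted(os.listdir(root)):
        sub = os.path.join(root, name)
        if name.startswith(".") or not os.path.isdir(sub):
            continue
        counts[name] = sum(
            1
            for entry in os.listdir(sub)
            if not entry.startswith(".")
            and os.path.isfile(os.path.join(sub, entry))
        )
    return counts
```

For a correctly prepared Train folder one would expect the same count for AllImages, Annotations and labelTxt.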
Running HRSC2DOTA.py again gives:

(redet) root@container-e19b1182ac-a18adac2:~/ReDet/DOTA_devkit# python HRSC2DOTA.py 
Traceback (most recent call last):
  File "HRSC2DOTA.py", line 79, in <module>
    generate_txt_labels('/root/ReDet/data/HRSC2016/Train')
  File "HRSC2DOTA.py", line 59, in generate_txt_labels
    f_label = open(label)#打开原来的.xml文件
FileNotFoundError: [Errno 2] No such file or directory: '/root/ReDet/data/HRSC2016/Train/Annotations/.ipynb_checkpoints.xml'

It turns out that opening an image once in JupyterLab leaves behind a .ipy… folder containing a 100000624-checkpoint.bmp file, presumably a cached copy of the image.
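Judging from the traceback, the stray entry is named `.ipynb_checkpoints`, and the conversion script treats it like an image. A hedged sketch for clearing such folders from a dataset tree before running any conversion; the function name is hypothetical, and it deletes directories, so point it only at a path you can regenerate.

```python
import os
import shutil


def remove_ipynb_checkpoints(root):
    """Delete every .ipynb_checkpoints directory below root.

    Returns the list of removed directory paths.
    """
    removed = []
    for dirpath, dirnames, _ in os.walk(root):
        if ".ipynb_checkpoints" in dirnames:
            target = os.path.join(dirpath, ".ipynb_checkpoints")
            shutil.rmtree(target)
            dirnames.remove(".ipynb_checkpoints")  # do not descend into it
            removed.append(target)
    return removed
```

Running it on the dataset root once after browsing images in JupyterLab should be enough; ordinary files are left untouched.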
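The total printed by the ll command is the disk space used, counted in blocks (1 KiB each by default in GNU ls) rather than in bytes.

I do not understand this yet.
Started reading HRSC2COCO.py,
then ran it to completion.
hrsc2016_evaluation.py still fails: 624.bmp is missing.
Trying a fresh download of the dataset.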

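The dataset information given in the ReDet paper:
1061 images in total; in theory train, val and test contain 436, 181 and 444 images respectively. Wow, so what I downloaded earlier must have been some repackaged bundle.
It seems the author merges train and val and uses them together as train. Let me look again. There still seems to be a problem. I remember that when I reproduced DAL, its txt files were generated by a script written for that purpose; let me go find it.
Copied generate_images.py over from DAL.
After running it I found it was of no use at all. This was flailing around in exactly the wrong direction. Better to just write a small script that lists the current directory.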

find  -name '*.bmp' > train.txt

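First dump all file names in the directory into a txt file, then process it with Python. I will continue after dinner.
March 24, 2022, 19:12
Tonight's task: finish generate_txt.py and get hrsc2016_evaluation.py running.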


# March 24, 2022, 19:15, written by hand

import os
images_path = '/root/ReDet/data/HRSC2016/Train/images'   # directory containing the images
txt_save_path = '/root/ReDet/data/HRSC2016/Train/images/train.txt'  # txt file that will list the images
# Note: the list file is created inside images_path, so os.listdir() also returns train.txt itself.
with open(txt_save_path, "w") as fw:
    for filename in os.listdir(images_path):
        print(filename.split(".")[0])
        fw.write(filename.split(".")[0] + '\n')


Then run it.
Then change Train to Test and run it again.

train.txt: 626 entries
test.txt: 444 entries

(redet) root@container-e19b1182ac-a18adac2:~/ReDet# python DOTA_devkit/hrsc2016_evaluation.py 
DOTA_devkit/hrsc2016_evaluation.py:153: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  difficult = np.array([x['difficult'] for x in R]).astype(np.bool)
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [0. 0. 0. ... 1. 1. 1.]
check tp [1. 1. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [0. 1. 0. ... 1. 1. 1.]
check tp [1. 0. 1. ... 0. 0. 0.]
npos num: 1188
check fp: [1. 1. 0. ... 1. 1. 1.]
check tp [0. 0. 1. ... 0. 0. 0.]
npos num: 1188
AP50: 90.46     AP75: 89.46      mAP: 70.41
(redet) root@container-e19b1182ac-a18adac2:~/

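It runs now. Taking a break.

March 25, 2022, 10:50
I have not read the hrsc2016_evaluation.py code yet. First run test.py to see how many minutes it takes to respond; last time there was no response after 15 minutes.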

python tools/test.py configs/ReDet/ReDet_re50_refpn_3x_hrsc2016.py work_dirs/ReDet_re50_refpn_3x_hrsc2016/ReDet_re50_refpn_3x_hrsc2016-d1b4bd29.pth --out work_dirs/ReDet_re50_refpn_3x_hrsc2016/results.pkl

nohup python tools/test.py configs/ReDet/ReDet_re50_refpn_3x_hrsc2016.py work_dirs/ReDet_re50_refpn_3x_hrsc2016/ReDet_re50_refpn_3x_hrsc2016-d1b4bd29.pth --out work_dirs/ReDet_re50_refpn_3x_hrsc2016/results.pkl >redet3251056.log 2>&1 &

nohup python tools/test.py configs/ReDet/ReDet_re50_refpn_3x_hrsc2016.py work_dirs/ReDet_re50_refpn_3x_hrsc2016/ReDet_re50_refpn_3x_hrsc2016-d1b4bd29.pth --out work_dirs/ReDet_re50_refpn_3x_hrsc2016/results.pkl >redett.log 2>&1 &

nohup python tesss.py >redett.log 2>&1 &

Not using nohup for now: it keeps printing "nohup: ignoring input" and I do not know why.
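The message `nohup: ignoring input` is informational rather than an error: nohup prints it when it detaches the command from the terminal's standard input. For completeness, a Python-only way to start a long job detached from the terminal is sketched below. It is an alternative illustration, not what was used in this log, and `launch_detached` is a hypothetical name; `start_new_session` applies to POSIX systems.

```python
import subprocess


def launch_detached(cmd, log_path):
    """Start cmd in its own session with output redirected to log_path.

    Returns the Popen handle so the caller can poll or wait on it.
    """
    log = open(log_path, "w")
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,   # same effect as nohup ignoring input
        stdout=log,
        stderr=subprocess.STDOUT,   # like 2>&1
        start_new_session=True,     # detach from the terminal's session
    )
    log.close()  # the child keeps its own copy of the file descriptor
    return proc
```

The command is passed as a list, for example `["python", "tools/test.py", ...]` with the same arguments as in the nohup lines above, and progress can be followed by reading the log file.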

(redet) root@container-e19b1182ac-a18adac2:~/ReDet# python tools/test.py configs/ReDet/ReDet_re50_refpn_3x_hrsc2016.py work_dirs/ReDet_re50_refpn_3x_hrsc2016/ReDet_re50_refpn_3x_hrsc2016-d1b4bd29.pth --out work_dirs/ReDet_re50_refpn_3x_hrsc2016/results.pkl
ReResNet Orientation: 8 Fix Params: False
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
The model and loaded state dict do not match exactly

missing keys in source state_dict: neck.fpn_convs.0.conv.expanded_bias, backbone.layer3.2.conv3.filter, backbone.layer2.2.conv2.filter, backbone.layer3.5.conv2.filter, backbone.layer4.2.conv1.filter, backbone.layer4.1.conv2.filter, backbone.layer4.2.conv2.filter, backbone.layer4.0.conv3.filter, backbone.layer2.0.downsample.0.filter, backbone.layer4.0.conv1.filter, backbone.layer2.0.conv3.filter, backbone.conv1.filter, backbone.layer3.1.conv3.filter, backbone.layer2.1.conv3.filter, neck.lateral_convs.0.conv.expanded_bias, backbone.layer3.2.conv2.filter, neck.fpn_convs.1.conv.filter, backbone.layer4.2.conv3.filter, neck.lateral_convs.1.conv.filter, backbone.layer2.1.conv2.filter, backbone.layer2.0.conv1.filter, backbone.layer3.4.conv3.filter, backbone.layer3.0.downsample.0.filter, backbone.layer3.0.conv1.filter, backbone.layer3.0.conv2.filter, backbone.layer3.4.conv1.filter, backbone.layer4.1.conv3.filter, backbone.layer2.1.conv1.filter, backbone.layer3.1.conv2.filter, backbone.layer3.3.conv1.filter, backbone.layer3.3.conv3.filter, backbone.layer2.2.conv3.filter, backbone.layer3.3.conv2.filter, backbone.layer3.2.conv1.filter, neck.fpn_convs.3.conv.expanded_bias, backbone.layer4.0.downsample.0.filter, backbone.layer4.1.conv1.filter, neck.fpn_convs.2.conv.expanded_bias, backbone.layer2.3.conv1.filter, neck.lateral_convs.2.conv.filter, backbone.layer2.2.conv1.filter, neck.fpn_convs.0.conv.filter, backbone.layer3.5.conv3.filter, backbone.layer3.5.conv1.filter, neck.fpn_convs.3.conv.filter, backbone.layer3.1.conv1.filter, backbone.layer4.0.conv2.filter, neck.lateral_convs.1.conv.expanded_bias, backbone.layer2.0.conv2.filter, neck.lateral_convs.2.conv.expanded_bias, backbone.layer2.3.conv2.filter, backbone.layer3.4.conv2.filter, backbone.layer2.3.conv3.filter, neck.fpn_convs.2.conv.filter, neck.lateral_convs.3.conv.expanded_bias, neck.lateral_convs.3.conv.filter, neck.fpn_convs.1.conv.expanded_bias, neck.lateral_convs.0.conv.filter, backbone.layer3.0.conv3.filter

Traceback (most recent call last):
  File "tools/test.py", line 208, in <module>
    main()
  File "tools/test.py", line 178, in main
    outputs = single_gpu_test(model, data_loader, args.show, args.log_dir)
  File "tools/test.py", line 22, in single_gpu_test
    model.eval()
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1009, in eval
    return self.train(False)
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 998, in train
    module.train(mode)
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 998, in train
    module.train(mode)
  File "/root/ReDet/mmdet/models/backbones/re_resnet.py", line 726, in train
    super(ReResNet, self).train(mode)
  File "/root/ReDet/mmdet/models/backbones/base_backbone.py", line 56, in train
    super(BaseBackbone, self).train(mode)
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 998, in train
    module.train(mode)
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/e2cnn-0.2.1-py3.7.egg/e2cnn/nn/modules/r2_conv/r2convolution.py", line 387, in train
    _filter, _bias = self.expand_parameters()
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/e2cnn-0.2.1-py3.7.egg/e2cnn/nn/modules/r2_conv/r2convolution.py", line 304, in expand_parameters
    _filter = self.basisexpansion(self.weights)
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/e2cnn-0.2.1-py3.7.egg/e2cnn/nn/modules/r2_conv/basisexpansion_blocks.py", line 327, in forward
    _filter = self._expand_block(weights, io_pair).reshape(out_indices[2], in_indices[2], self.S)
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/e2cnn-0.2.1-py3.7.egg/e2cnn/nn/modules/r2_conv/basisexpansion_blocks.py", line 294, in _expand_block
    _filter = block_expansion(coefficients)
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/e2cnn-0.2.1-py3.7.egg/e2cnn/nn/modules/r2_conv/basisexpansion_singleblock.py", line 115, in forward
    return torch.einsum('boi...,kb->koi...', self.sampled_basis, weights) #.transpose(1, 2).contiguous()
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/torch/functional.py", line 211, in einsum
    return torch._C._VariableFunctions.einsum(equation, operands)
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1556653114079/work/aten/src/THC/THCBlas.cu:450
[1]+  Terminated              nohup python tools/test.py configs/ReDet/ReDet_re50_refpn_3x_hrsc2016.py work_dirs/ReDet_re50_refpn_3x_hrsc2016/ReDet_re50_refpn_3x_hrsc2016-d1b4bd29.pth --out work_dirs/ReDet_re50_refpn_3x_hrsc2016/results.pkl > redett.log 2>&1

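My first guess was that the GPU had run out of memory. Actually, no: the more likely cause is that the RTX 3090 and PyTorch 1.1.0 are incompatible.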

root@container-93a511873c-4353c534:~# python
Python 3.7.9 (default, Aug 31 2020, 12:42:55) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> torch.cuda.is_available()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'torch' is not defined
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> torch.cuda.get_device_name(0)
'NVIDIA GeForce RTX 3090'
>>> torch.__version__
'1.1.0'
>>> 

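This suggests the rented GPU environment can see the 3090 without problems. The issue is that the AT_CHECK macro used in the code was deprecated in torch 1.5 (#36581), so every source file that gets compiled has to be changed to use TORCH_CHECK instead.
The author's reply: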
I guess the reason is: cuda11.0 requires higher version pytorch (>1.3), while some ops in our code are designed for pytorch<1.5.

If so, to fix this, you need to replace all AT_CHECK with TORCH_CHECK in the source code (.cpp and .cu). See pytorch/pytorch#36581
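
The version mismatch described in the reply can be checked more directly. In the session above, torch.cuda.is_available() returning True only confirms that the driver and the device are visible; it does not show that the installed PyTorch binaries contain kernels for the card. Below is a minimal sketch of a stricter check. The helper name build_supports_device is hypothetical, and torch.cuda.get_arch_list() is only present in newer PyTorch releases (it is not available in 1.1.0), so the script guards for it.

```python
def build_supports_device(capability, arch_list):
    """Return True if a build compiled for `arch_list` can run on a GPU
    with compute capability `capability`.

    capability: (major, minor) tuple, e.g. (8, 6) for an RTX 3090.
    arch_list:  strings such as "sm_75" or "compute_80".
    """
    major, minor = capability
    for arch in arch_list:
        kind, _, digits = arch.partition("_")
        if not digits.isdigit():
            continue  # skip entries this sketch does not understand
        arch_major, arch_minor = int(digits[:-1]), int(digits[-1])
        if kind == "sm":
            # Binary code runs on the same major version with an
            # equal or higher minor version.
            if arch_major == major and arch_minor <= minor:
                return True
        elif kind == "compute":
            # PTX can be JIT-compiled for any equal or newer architecture.
            if (arch_major, arch_minor) <= (major, minor):
                return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        get_arch_list = getattr(torch.cuda, "get_arch_list", None)
        if get_arch_list is None:
            print("capability", cap, "- this PyTorch cannot report its arch list")
        else:
            archs = get_arch_list()
            print(cap, archs, build_supports_device(cap, archs))
```

The RTX 3090 has compute capability 8.6, so a build that only targets pre-Ampere architectures fails this check: build_supports_device((8, 6), ["sm_70", "sm_75"]) returns False, while a build that includes "sm_80" or "sm_86" passes.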

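I found a one-liner in the issue. I did not understand it at first; it finds every regular file under the current directory and uses sed to replace AT_CHECK with TORCH_CHECK in place.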

find . -type f -exec sed -i 's/AT_CHECK/TORCH_CHECK/g' {} +

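It seemed to work. I ran it under mmdet/ops/ first, then, to be safe, ran it again from the ReDet root. (PS: running it from the filesystem root took far too long, so I lost patience and hit Ctrl+C.)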
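
For readers who prefer something more targeted than running sed over every file, here is a rough Python sketch of the same replacement. It is not the command from the issue: the function name replace_at_check and the list of file suffixes are my own assumptions, and unlike the sed one-liner it only touches C++/CUDA sources and matches AT_CHECK as a whole word.

```python
import re
from pathlib import Path

# Assumed set of extension source suffixes; adjust to match the repository.
SOURCE_SUFFIXES = {".cpp", ".cu", ".h", ".cuh"}
PATTERN = re.compile(r"\bAT_CHECK\b")


def replace_at_check(root):
    """Rewrite AT_CHECK to TORCH_CHECK in source files under `root`.

    Returns a dict mapping each modified file path to the number of
    replacements made in it.
    """
    changed = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in SOURCE_SUFFIXES:
            continue
        # surrogateescape keeps any non-UTF-8 bytes intact on write-back.
        text = path.read_text(encoding="utf-8", errors="surrogateescape")
        new_text, count = PATTERN.subn("TORCH_CHECK", text)
        if count:
            path.write_text(new_text, encoding="utf-8", errors="surrogateescape")
            changed[str(path)] = count
    return changed
```

Calling replace_at_check("mmdet/ops") from the ReDet root would cover the directory mentioned above. Running it a second time is harmless, because TORCH_CHECK does not contain AT_CHECK, so the second pass finds nothing to change.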

(redet) root@container-e19b1182ac-a18adac2:~/ReDet# python tools/test.py configs/ReDet/ReDet_re50_refpn_3x_hrsc2016.py work_dirs/ReDet_re50_refpn_3x_hrsc2016/ReDet_re50_refpn_3x_hrsc2016-d1b4bd29.pth --out work_dirs/ReDet_re50_refpn_3x_hrsc2016/results.pkl
ReResNet Orientation: 8 Fix Params: False
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
The model and loaded state dict do not match exactly

missing keys in source state_dict: neck.fpn_convs.3.conv.filter, backbone.layer3.1.conv1.filter, backbone.layer3.4.conv2.filter, backbone.layer2.0.downsample.0.filter, backbone.layer4.0.conv2.filter, backbone.layer3.3.conv3.filter, backbone.layer2.1.conv2.filter, backbone.layer2.0.conv3.filter, backbone.layer4.1.conv2.filter, backbone.layer3.0.conv2.filter, neck.fpn_convs.0.conv.expanded_bias, backbone.layer2.3.conv3.filter, backbone.layer2.0.conv1.filter, neck.fpn_convs.1.conv.filter, backbone.layer2.3.conv2.filter, neck.lateral_convs.3.conv.filter, backbone.layer4.0.downsample.0.filter, backbone.layer3.4.conv1.filter, backbone.layer4.0.conv3.filter, backbone.layer3.0.conv1.filter, neck.lateral_convs.0.conv.filter, backbone.layer2.0.conv2.filter, neck.lateral_convs.2.conv.expanded_bias, backbone.layer3.3.conv1.filter, backbone.layer4.1.conv1.filter, neck.lateral_convs.3.conv.expanded_bias, backbone.layer3.5.conv3.filter, backbone.layer3.2.conv1.filter, backbone.layer4.0.conv1.filter, backbone.layer2.1.conv3.filter, backbone.layer3.1.conv2.filter, backbone.layer2.3.conv1.filter, backbone.layer3.5.conv1.filter, backbone.layer4.1.conv3.filter, neck.lateral_convs.0.conv.expanded_bias, backbone.layer4.2.conv2.filter, backbone.layer4.2.conv3.filter, neck.lateral_convs.2.conv.filter, neck.lateral_convs.1.conv.expanded_bias, neck.fpn_convs.0.conv.filter, backbone.layer3.0.conv3.filter, neck.fpn_convs.2.conv.expanded_bias, backbone.layer2.1.conv1.filter, backbone.layer2.2.conv2.filter, backbone.layer3.0.downsample.0.filter, backbone.layer2.2.conv3.filter, neck.fpn_convs.1.conv.expanded_bias, backbone.layer3.2.conv2.filter, backbone.layer2.2.conv1.filter, neck.fpn_convs.3.conv.expanded_bias, backbone.layer3.1.conv3.filter, backbone.layer3.5.conv2.filter, backbone.layer3.4.conv3.filter, neck.fpn_convs.2.conv.filter, backbone.layer4.2.conv1.filter, backbone.layer3.2.conv3.filter, neck.lateral_convs.1.conv.filter, backbone.layer3.3.conv2.filter, backbone.conv1.filter

Traceback (most recent call last):
  File "tools/test.py", line 208, in <module>
    main()
  File "tools/test.py", line 178, in main
    outputs = single_gpu_test(model, data_loader, args.show, args.log_dir)
  File "tools/test.py", line 22, in single_gpu_test
    model.eval()
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1009, in eval
    return self.train(False)
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 998, in train
    module.train(mode)
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 998, in train
    module.train(mode)
  File "/root/ReDet/mmdet/models/backbones/re_resnet.py", line 726, in train
    super(ReResNet, self).train(mode)
  File "/root/ReDet/mmdet/models/backbones/base_backbone.py", line 56, in train
    super(BaseBackbone, self).train(mode)
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 998, in train
    module.train(mode)
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/e2cnn-0.2.1-py3.7.egg/e2cnn/nn/modules/r2_conv/r2convolution.py", line 387, in train
    _filter, _bias = self.expand_parameters()
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/e2cnn-0.2.1-py3.7.egg/e2cnn/nn/modules/r2_conv/r2convolution.py", line 304, in expand_parameters
    _filter = self.basisexpansion(self.weights)
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/e2cnn-0.2.1-py3.7.egg/e2cnn/nn/modules/r2_conv/basisexpansion_blocks.py", line 327, in forward
    _filter = self._expand_block(weights, io_pair).reshape(out_indices[2], in_indices[2], self.S)
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/e2cnn-0.2.1-py3.7.egg/e2cnn/nn/modules/r2_conv/basisexpansion_blocks.py", line 294, in _expand_block
    _filter = block_expansion(coefficients)
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/e2cnn-0.2.1-py3.7.egg/e2cnn/nn/modules/r2_conv/basisexpansion_singleblock.py", line 115, in forward
    return torch.einsum('boi...,kb->koi...', self.sampled_basis, weights) #.transpose(1, 2).contiguous()
  File "/root/miniconda3/envs/redet/lib/python3.7/site-packages/torch/functional.py", line 211, in einsum
    return torch._C._VariableFunctions.einsum(equation, operands)
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1556653114079/work/aten/src/THC/THCBlas.cu:450
(redet) root@container-e19b1182ac-a18adac2:~/ReDet# python
Python 3.7.11 (default, Jul 27 2021, 14:32:16) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version>>
  File "<stdin>", line 1
    torch.__version>>
                    ^
SyntaxError: invalid syntax
>>> torch.__version__
'1.1.0'

Still not working. Time for a break and some writing.

March 26, 2022, 14:21
I checked the environment of the 3090 I rented on autodl, and I nearly cried.
The driver version there is absurdly high; PyTorch 1.1.0 cannot possibly support it.
Let me switch to a 2080 Ti and check its driver version.

It looks like I never recompiled after switching to TORCH_CHECK yesterday. Let me try again.

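AT_CHECK does indeed need to be replaced.
Yesterday's TORCH_CHECK replacement under mmdet/ops/src was incomplete.
It also looks like the 3090 needs PyTorch 1.7 or later??? So this whole environment was a waste.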
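
To catch an incomplete replacement like the one noted above before recompiling, a small scan for leftover AT_CHECK occurrences can help. This is only a sketch: the function name find_leftover_macro and the default suffix list are hypothetical choices of mine, not part of ReDet.

```python
from pathlib import Path


def find_leftover_macro(root, macro="AT_CHECK",
                        suffixes=(".cpp", ".cu", ".h", ".cuh")):
    """Return (path, line_number, line) for every line under `root`
    that still contains `macro`, sorted by path."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if macro in line:
                hits.append((str(path), number, line.strip()))
    return hits
```

An empty result from find_leftover_macro("mmdet/ops") would mean no source file in that tree still uses the old macro. The extensions still have to be rebuilt afterwards, since editing the sources alone does not change the compiled binaries.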

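Fine, time to reinstall. I looked at the A40: instances are available, but there is a lot I do not know about it, such as which PyTorch versions it works with. There is comparatively more information available for the 3090, and it also has the off-peak ("tidal") compute option.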
