Real-time Scene Text Detection with Differentiable Binarization: troubleshooting notes

Official implementation: https://github.com/MhLiao/DB
Implementation by Zhou Jun (WenmuZhou): https://github.com/WenmuZhou/DBNet.pytorch

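1. The official implementation

The official repo is easy to install by following its instructions. My environment is Ubuntu 16.04 + CUDA 8, so I have been using PyTorch 1.0.1 (py3.7), and the code does run. However, inference with the trained model produces nothing: the txt files are all empty, and the images in the visualize folder are grayish with no boxes. The loss does not converge: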

[INFO] [2020-01-18 16:24:09,584] step:   1340, epoch:   0, loss: 4.332346, lr: 0.007000
[INFO] [2020-01-18 16:24:09,585] bce_loss: 0.568492
[INFO] [2020-01-18 16:24:09,585] thresh_loss: 0.563487
[INFO] [2020-01-18 16:24:09,586] l1_loss: 0.092640
[INFO] [2020-01-18 16:24:19,117] step:   1360, epoch:   0, loss: 4.255758, lr: 0.007000
[INFO] [2020-01-18 16:24:19,120] bce_loss: 0.544069
[INFO] [2020-01-18 16:24:19,122] thresh_loss: 0.539020
[INFO] [2020-01-18 16:24:19,124] l1_loss: 0.099640
[INFO] [2020-01-18 16:24:28,766] step:   1380, epoch:   0, loss: 4.507674, lr: 0.007000
[INFO] [2020-01-18 16:24:28,767] bce_loss: 0.560643
[INFO] [2020-01-18 16:24:28,768] thresh_loss: 0.652172
[INFO] [2020-01-18 16:24:28,768] l1_loss: 0.105229

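Training on the IC15 dataset gives the same result. I don't know yet where the problem is; I'll look into it later.

There is one bug: the learning rate always starts at 0.007 (the lr shown in the log above) no matter what you configure, because the value is hard-coded in DB/DB-master/training/learning_rate.py and has to be changed there.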
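One way to confirm whether a configured learning rate actually takes effect, and whether the loss is moving at all, is to read both back from the training log. The following is a minimal sketch that assumes the log line format shown above; the function names are made up for illustration and are not part of the DB repo.

```python
import re

# Matches step lines such as:
# [INFO] [2020-01-18 16:24:09,584] step:   1340, epoch:   0, loss: 4.332346, lr: 0.007000
# The per-component lines (bce_loss, thresh_loss, l1_loss) have no "step:" and are skipped.
STEP_RE = re.compile(
    r"step:\s*(\d+),\s*epoch:\s*(\d+),\s*loss:\s*([\d.]+),\s*lr:\s*([\d.]+)"
)

SAMPLE_LOG = [
    "[INFO] [2020-01-18 16:24:09,584] step:   1340, epoch:   0, loss: 4.332346, lr: 0.007000",
    "[INFO] [2020-01-18 16:24:09,585] bce_loss: 0.568492",
    "[INFO] [2020-01-18 16:24:19,117] step:   1360, epoch:   0, loss: 4.255758, lr: 0.007000",
    "[INFO] [2020-01-18 16:24:28,766] step:   1380, epoch:   0, loss: 4.507674, lr: 0.007000",
]


def parse_training_log(lines):
    """Return one dict per step line: step, epoch, loss, lr."""
    records = []
    for line in lines:
        match = STEP_RE.search(line)
        if match:
            records.append({
                "step": int(match.group(1)),
                "epoch": int(match.group(2)),
                "loss": float(match.group(3)),
                "lr": float(match.group(4)),
            })
    return records


def summarize(records):
    """Collect the distinct learning rates and the first/last/min loss."""
    losses = [r["loss"] for r in records]
    return {
        "lrs": sorted({r["lr"] for r in records}),
        "first_loss": losses[0],
        "last_loss": losses[-1],
        "min_loss": min(losses),
    }


if __name__ == "__main__":
    print(summarize(parse_training_log(SAMPLE_LOG)))
```

If the set of learning rates in the summary never reflects the value in your config, the setting is probably being overridden somewhere, which is what happened here.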

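Suspecting that the old versions were behind the non-convergence, I installed two CUDA versions side by side on my machine (8 and 10), created a new virtual environment for CUDA 10, and then ran into CUDA errors: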
error in : invalid device function
RuntimeError: copy_if failed to synchronize: device-side assert triggered
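I struggled with this for a long time without finding a fix. While debugging, I noticed that the CUDA installed by conda was version 10.1, whereas my local one was 10.0. I also saw the author's answer in this issue: https://github.com/MhLiao/DB/issues/36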

Make sure your CUDA path of $CUDA_HOME is the same version as your CUDA in PyTorch by the command of echo $CUDA_HOME. If not, you need to change the $CUDA_HOME by export CUDA_HOME=path-of-another-version or re-install PyTorch with the same CUDA version as in CUDA_HOME.

The local CUDA version and the one conda installs must match. So I started over:

conda install numpy=1.17.4 pytorch=1.3 torchvision cudatoolkit=10.0.130 -c pytorch

With that, it works!
But it still does not seem to converge...
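The version check behind the CUDA fix above can be sketched in Python. This is only a rough sketch under assumptions: it compares `torch.version.cuda` with the release reported by `nvcc --version`, and it assumes that `nvcc` lives under `$CUDA_HOME/bin` (falling back to whatever `nvcc` is on the PATH). The helper names are made up for illustration.

```python
import os
import re
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from text like '... release 10.0, V10.0.130'."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def parse_version(text):
    """Turn '10.0.130' or '10.1' into (major, minor)."""
    parts = text.split(".")
    return (int(parts[0]), int(parts[1]))


def cuda_versions_match(torch_cuda, nvcc_output):
    """True if PyTorch's CUDA version and the local toolkit agree on major.minor."""
    local = parse_nvcc_release(nvcc_output)
    if torch_cuda is None or local is None:
        return False  # CPU-only build, or nvcc output not understood
    return parse_version(torch_cuda) == local


def check_environment():
    """Compare the CUDA version PyTorch was built with against the local nvcc."""
    import torch  # imported here so the helpers above work without PyTorch

    cuda_home = os.environ.get("CUDA_HOME")
    nvcc = os.path.join(cuda_home, "bin", "nvcc") if cuda_home else "nvcc"
    output = subprocess.run(
        [nvcc, "--version"], stdout=subprocess.PIPE, universal_newlines=True
    ).stdout
    return cuda_versions_match(torch.version.cuda, output)
```

Calling `check_environment()` inside the conda environment should return False in the situation described above (conda's 10.1 against a local 10.0) and True after reinstalling with `cudatoolkit=10.0.130`.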

2. The unofficial implementation

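Following the installation instructions and installing everything in one go does get it running, but at startup it prints: DBNet.pytorch INFO: train with device cpu and pytorch 1.3.0
My machine did not have the CUDA 10 that 1.3 needs, so it ran on the CPU, which is very slow.
Later I saw in a chat group that someone had gotten it running with PyTorch 1.1.0, although on CUDA 10. I installed 1.1.0 as well, and training then threw all kinds of errors. I gave up for a while, then came back and tinkered with it again.
Along the way I came to appreciate Anaconda more and more: inside a virtual environment, conda list shows the version of every installed package.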

_libgcc_mutex             0.1                        main  
absl-py                   0.9.0                     
anyconfig                 0.9.10                    
backcall                  0.1.0                    py36_0  
blas                      1.0                         mkl  
ca-certificates           2019.11.27                    0  
cachetools                4.0.0                     
certifi                   2019.11.28               py36_0  
cffi                      1.13.2           py36h2e261b9_0  
chardet                   3.0.4                     
cudatoolkit               8.0                           3  
cycler                    0.10.0                    
decorator                 4.4.1                      py_0  
freetype                  2.9.1                h8a8886c_1  
future                    0.18.2                    
google-auth               1.10.1                    
google-auth-oauthlib      0.4.1                     
grpcio                    1.26.0                    
idna                      2.8                       
imageio                   2.6.1                     
imgaug                    0.3.0                     
intel-openmp              2019.4                      243  
ipython                   7.11.1           py36h39e3cac_0  
ipython_genutils          0.2.0                    py36_0  
jedi                      0.15.2                   py36_0  
jpeg                      9b                   h024ee3a_2  
kiwisolver                1.1.0                     
ld_impl_linux-64          2.33.1               h53a641e_7  
libedit                   3.1.20181209         hc058e9b_0  
libffi                    3.2.1                hd88cf55_4  
libgcc-ng                 9.1.0                hdf63c60_0  
libgfortran-ng            7.3.0                hdf63c60_0  
libpng                    1.6.37               hbc83047_0  
libstdcxx-ng              9.1.0                hdf63c60_0  
libtiff                   4.1.0                h2733197_0  
Markdown                  3.1.1                     
matplotlib                3.1.2                     
mkl                       2019.4                      243  
mkl-service               2.3.0            py36he904b0f_0  
mkl_fft                   1.0.15           py36ha843d7b_0  
mkl_random                1.1.0            py36hd6b4f25_0  
natsort                   7.0.0                     
ncurses                   6.1                  he6710b0_1  
networkx                  2.4                       
ninja                     1.9.0            py36hfd86e86_0  
numpy                     1.18.1           py36h4f9e942_0  
numpy                     1.17.4                    
numpy-base                1.18.1           py36hde5b4d6_0  
oauthlib                  3.1.0                     
olefile                   0.46                       py_0  
opencv-python             4.1.2.30                  
opencv-python-headless    4.1.2.30                  
openssl                   1.1.1d               h7b6447c_3  
parso                     0.5.2                      py_0  
pexpect                   4.7.0                    py36_0  
pickleshare               0.7.5                    py36_0  
Pillow                    6.2.2                     
pillow                    7.0.0            py36hb39fc2d_0  
pip                       19.3.1                   py36_0  
Polygon3                  3.0.8                     
prompt_toolkit            3.0.2                      py_0  
protobuf                  3.11.2                    
ptyprocess                0.6.0                    py36_0  
pyasn1                    0.4.8                     
pyasn1-modules            0.2.8                     
pyclipper                 1.1.0.post3               
pycparser                 2.19                       py_0  
pygments                  2.5.2                      py_0  
pyparsing                 2.4.6                     
python                    3.6.10               h0371630_0  
python-dateutil           2.8.1                     
pytorch                   1.0.1           py3.6_cuda8.0.61_cudnn7.1.2_2    pytorch
PyWavelets                1.1.1                     
PyYAML                    5.2                       
readline                  7.0                  h7b6447c_5  
requests                  2.22.0                    
requests-oauthlib         1.3.0                     
rsa                       4.0                       
scikit-image              0.16.2                    
scipy                     1.4.1                     
setuptools                44.0.0                   py36_0  
Shapely                   1.6.4.post2               
six                       1.13.0                   py36_0  
sqlite                    3.30.1               h7b6447c_0  
tensorboard               2.1.0                     
tensorboardX              1.8                       
tk                        8.6.8                hbc83047_0  
torch                     1.1.0                     
torchvision               0.2.1                     
torchvision               0.2.2                      py_3    pytorch
tqdm                      4.40.1                    
traitlets                 4.3.3                    py36_0  
urllib3                   1.25.7                    
wcwidth                   0.1.7                    py36_0  
Werkzeug                  0.16.0                    
wheel                     0.33.6                   py36_0  
xz                        5.2.4                h14c3975_4  
zlib                      1.2.11               h7b6447c_3  
zstd                      1.3.7                h0b5b093_0  
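When only a handful of packages matter, the same information can be pulled from inside Python instead of scanning the full conda list output. A minimal sketch, assuming Python 3.8 or newer for `importlib.metadata` (the environment listed above is Python 3.6, where this module is not available); the function name is made up for illustration.

```python
from importlib import metadata  # standard library from Python 3.8 on


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # The packages that mattered most in these notes.
    for name, version in installed_versions(
        ["torch", "torchvision", "tensorboardX", "numpy"]
    ).items():
        print(name, version)
```

Note that the names here are the pip distribution names, so the conda package listed as pytorch is looked up as torch.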

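To install a package at a specific version: pip install tensorboardX==1.8
Without a version number, pip installs the latest release.
You can also run pip install 'tensorboardX<1.9' to install a version below 1.9.
There were two main errors: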

2020-01-18 16:23:24,753 DBNet.pytorch ERROR: Traceback (most recent call last):
  File "/data_1/Yang/project/2019/project/DBNet.pytorch/DBNet.pytorch-master/base/base_trainer.py", line 77, in __init__
    self.writer.add_graph(self.model, dummy_input)
  File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/tensorboardX/writer.py", line 774, in add_graph
    self._get_file_writer().add_graph(graph(model, input_to_model, verbose, **kwargs))
  File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/tensorboardX/pytorch_graph.py", line 292, in graph
    list_of_nodes, node_stats = parse(graph, args)
  File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/tensorboardX/pytorch_graph.py", line 227, in parse
    if node.debugName() == 'self':
AttributeError: 'torch._C.Value' object has no attribute 'debugName'

2020-01-18 16:23:24,753 DBNet.pytorch WARNING: add graph to tensorboard failed
2020-01-18 16:23:24,756 DBNet.pytorch INFO: train dataset has 889 samples,297 in dataloader, validate dataset has 111 samples,111 in dataloader
Traceback (most recent call last):
  File "tools/train.py", line 74, in <module>
    main(config)
  File "tools/train.py", line 58, in main
    trainer.train()
  File "/data_1/Yang/project/2019/project/DBNet.pytorch/DBNet.pytorch-master/base/base_trainer.py", line 103, in train
    self.epoch_result = self._train_epoch(epoch)
  File "/data_1/Yang/project/2019/project/DBNet.pytorch/DBNet.pytorch-master/trainer/trainer.py", line 46, in _train_epoch
    for i, batch in enumerate(self.train_loader):
  File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
TypeError: Traceback (most recent call last):
  File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
TypeError: 'NoneType' object is not callable

First, this one: File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/tensorboardX/pytorch_graph.py", line 227, in parse
if node.debugName() == 'self':
AttributeError: 'torch._C.Value' object has no attribute 'debugName'

This looked like a tensorboardX version mismatch: tensorboardX calls debugName(), which the installed torch does not provide. A quick search confirmed it and suggested pinning tensorboardX to 1.8. conda list showed that I had 1.9, so I ran:
pip install tensorboardX==1.8, which printed:
Requirement already satisfied: six in /data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages (from tensorboardX==1.8) (1.13.0)
Requirement already satisfied: protobuf>=3.2.0 in /data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages (from tensorboardX==1.8) (3.11.2)
Requirement already satisfied: numpy in /data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages (from tensorboardX==1.8) (1.17.4)
Requirement already satisfied: setuptools in /data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages (from protobuf>=3.2.0->tensorboardX==1.8) (44.0.0.post20200106)
Installing collected packages: tensorboardX
Found existing installation: tensorboardX 1.9
Uninstalling tensorboardX-1.9:
Successfully uninstalled tensorboardX-1.9
Successfully installed tensorboardX-1.8

pip automatically uninstalls 1.9 and installs 1.8.

Training again, only the last error remained, as expected.

Traceback (most recent call last):
  File "tools/train.py", line 74, in <module>
    main(config)
  File "tools/train.py", line 58, in main
    trainer.train()
  File "/data_1/Yang/project/2019/project/DBNet.pytorch/DBNet.pytorch-master/base/base_trainer.py", line 103, in train
    self.epoch_result = self._train_epoch(epoch)
  File "/data_1/Yang/project/2019/project/DBNet.pytorch/DBNet.pytorch-master/trainer/trainer.py", line 46, in _train_epoch
    for i, batch in enumerate(self.train_loader):
  File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
TypeError: Traceback (most recent call last):
  File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
TypeError: 'NoneType' object is not callable

Someone answered this question on GitHub: https://github.com/WenmuZhou/DBNet.pytorch/issues/4
In DBNet.pytorch-master/data_loader/__init__.py, line 74:
    if 'collate_fn' not in config['loader'] or config['loader']['collate_fn'] is None or len(config['loader']['collate_fn']) == 0:
        # config['loader']['collate_fn'] = None  # original line; replace it with the line below,
        # otherwise None is passed straight through and used as the DataLoader's collate_fn
        config['loader']['collate_fn'] = torch.utils.data.dataloader.default_collate
    else:
        config['loader']['collate_fn'] = eval(config['loader']['collate_fn'])()

    _dataset = get_dataset(data_path=data_path, module_name=dataset_name, transform=img_transfroms, dataset_args=dataset_args)
    sampler = None
    if distributed:
        from torch.utils.data.distributed import DistributedSampler
        # 3) use DistributedSampler
        sampler = DistributedSampler(_dataset)
        config['loader']['shuffle'] = False
        config['loader']['pin_memory'] = True
    loader = DataLoader(dataset=_dataset, sampler=sampler, **config['loader'])
    return loader
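The idea of the fix, falling back to a real callable instead of passing None, can be isolated into a small helper. This is a sketch rather than the repo's code: `resolve_collate_fn` and the `registry` argument are hypothetical names, and the registry lookup is swapped in for the `eval(...)()` call in the snippet above so that the example stays self-contained.

```python
def resolve_collate_fn(loader_cfg, default_collate, registry=None):
    """Pick the collate function for a loader config.

    loader_cfg      : dict such as config['loader'] in the snippet above
    default_collate : callable used when no collate_fn is configured
    registry        : hypothetical mapping from a name to a collate class;
                      it stands in for eval(name)() in the original code
    """
    name = loader_cfg.get('collate_fn')
    if not name:  # key missing, None, or empty string
        return default_collate
    registry = registry or {}
    if name not in registry:
        raise KeyError('unknown collate_fn: %r' % (name,))
    return registry[name]()  # instantiate the configured collate class


def loader_kwargs(loader_cfg, registry=None):
    """Return DataLoader keyword arguments with collate_fn always callable."""
    # Imported lazily so that resolve_collate_fn can be used without PyTorch.
    from torch.utils.data.dataloader import default_collate

    kwargs = dict(loader_cfg)
    kwargs['collate_fn'] = resolve_collate_fn(loader_cfg, default_collate, registry)
    return kwargs
```

With this, `DataLoader(dataset=_dataset, sampler=sampler, **loader_kwargs(config['loader']))` never receives `collate_fn=None`, which is the value behind the `'NoneType' object is not callable` error in the traceback.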

With this change, training runs fine!
Next up: get training going and try it on my own data.
