Problems encountered while training DeepLabV3 on Cityscapes data (2019-11-25)

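Sites I relied on throughout the process:

TensorFlow in Practice: Chapter 9, Part 1 (DeepLabv3+ code implementation)

(Very detailed and complete) Implementing DeepLabv3+ on TensorFlow

Official guide

Official Cityscapes download page

Official CUDA download page

Official cuDNN download page

Table of the CUDA and cuDNN versions matching each TensorFlow release

Code for converting Cityscapes data to TFRecord: https://github.com/mcordts/cityscapesScripts

DeepLab model code: https://github.com/tensorflow/models

Pretrained models for each dataset: https://github.com/tensorflow/models/blob/master/research/deeplab/g3doc/model_zoo.md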

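Main problems encountered:

1. TensorFlow environment problems

2. CUDA environment and compatibility problems

3. Version mismatches between the various code bases

4. Problems caused by differences between running on Windows and on Linux

Problem descriptions and solutions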

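1. TensorFlow environment problems (these usually show up when running the code that converts the data to TFRecord)

     tensorflow has no attribute 'app', tensorflow has no attribute 'logging', tensorflow has no attribute 'contrib', and a series of similar missing-module errors. These attributes exist in TensorFlow 1.x but are no longer exposed at the top level in 2.x.

     Training is too slow.

     Could not load dynamic library 'cudart64_100.dll' (this appears even though CUDA is installed).

The fix for all of these is to install a suitable tensorflow-gpu version. On Windows I strongly recommend installing and switching versions with Anaconda; the pip command caused me all sorts of problems, and I spent quite a while sorting out the TensorFlow version. I used tensorflow-gpu 1.15: after installing the default latest version, click the check mark on the left and choose Mark for... to select the version you need.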

[Figure 1: screenshot of selecting the package version in Anaconda]
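A quick way to confirm which TensorFlow the active environment actually provides is sketched below. It is only an illustration: the helper names has_tf1_api and describe_tensorflow are made up for this post, and the check relies on nothing more than the major version number.

```python
import importlib.util


def has_tf1_api(version):
    """Return True for 1.x releases, which still expose tf.app, tf.logging and tf.contrib."""
    major = int(version.split(".")[0])
    return major == 1


def describe_tensorflow():
    """Describe the TensorFlow installation visible to the current interpreter."""
    if importlib.util.find_spec("tensorflow") is None:
        return "tensorflow is not installed in this environment"
    import tensorflow as tf  # imported lazily so the helper above works without it
    if has_tf1_api(tf.__version__):
        note = "1.x API available"
    else:
        note = "not a 1.x release: tf.app / tf.logging / tf.contrib will be missing"
    return "tensorflow {} ({})".format(tf.__version__, note)


if __name__ == "__main__":
    print(describe_tensorflow())
```

Running this inside the environment that PyCharm or Git Bash uses helps rule out the case where a different interpreter than expected is picked up.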

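2. CUDA environment and compatibility problems

Could not load dynamic library 'cudart64_100.dll' and similar errors about missing DLL files.

Be sure to pick the matching version (and take into account which versions your graphics card supports). TensorFlow 1.15 corresponds to CUDA 10.0, which is what the 100 in cudart64_100.dll refers to. This problem also cost me a lot of time: each CUDA download and install takes a long while, so look up the relevant information first and get the CUDA environment right in one go.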
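Before reinstalling CUDA, it can be worth checking whether the DLL named in the error is reachable at all. The sketch below only searches the directories listed in PATH, which is an assumption about how the library gets located on a given machine; find_on_path is a hypothetical helper, not part of any CUDA or TensorFlow tooling.

```python
import os


def find_on_path(filename, directories=None):
    """Return the first full path to `filename` in the given directories, or None.

    When `directories` is None, the directories of the PATH variable are searched.
    """
    if directories is None:
        directories = os.environ.get("PATH", "").split(os.pathsep)
    for directory in directories:
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    location = find_on_path("cudart64_100.dll")
    print(location or "cudart64_100.dll is not in any PATH directory")
```

If the file is found in the directory of a different CUDA release, or not found at all, the installed CUDA version probably does not match what this TensorFlow build expects.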

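3. Version mismatches between the various code bases (the main problem I faced)

Because the code on GitHub is updated continuously, the official documentation and the various tutorials no longer match the actual code in places, and for some of these problems I could not find an answer anywhere (I also asked on Stack Overflow and GitHub without getting a solution).

TFRecord files are all 0 KB: this means the generation code did not run successfully. A likely cause is that the Python files called by the convert_cityscapes.sh script fail because they cannot find the submodules they import (the usual advice is to add environment variables and so on). I worked around it by adding the following to the import section of the failing .py files: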

import sys
sys.path.append("H:/dataSet/CityscapesV2/cityscapesScripts/models/research")
sys.path.append("H:/dataSet/CityscapesV2/cityscapesScripts/models/research/slim")

Adjust the paths and names to your own setup.
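To confirm that the conversion really worked after such a fix, the output directory can be scanned for zero-byte files. This is a minimal sketch; empty_tfrecords is a hypothetical helper, and matching on ".tfrecord" in the file name is an assumption about how the shards are named.

```python
from pathlib import Path


def empty_tfrecords(directory):
    """Return the sorted names of zero-byte files in `directory` whose name contains '.tfrecord'."""
    return sorted(
        path.name
        for path in Path(directory).iterdir()
        if path.is_file() and ".tfrecord" in path.name and path.stat().st_size == 0
    )
```

An empty list means every shard contains data; any name that is returned points at a shard that was created but never written.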

data split name train not recognized: the training command given in the official documentation is:

# From tensorflow/models/research/
python deeplab/train.py \
    --logtostderr \
    --training_number_of_steps=90000 \
    --train_split="train" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --train_crop_size="769,769" \
    --train_batch_size=1 \
    --dataset="cityscapes" \
    --tf_initial_checkpoint=${PATH_TO_INITIAL_CHECKPOINT} \
    --train_logdir=${PATH_TO_TRAIN_DIR} \
    --dataset_dir=${PATH_TO_DATASET}

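However, the code no longer has a "train" split; it now defines train_fine and so on. The same problem comes up when running val and vis, and adding the suffix to each (train_fine, val_fine) is enough. After changing this to train_fine, though, you will find that the program stops unexpectedly. The split definitions in the code and the resulting crash output are: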

_CITYSCAPES_INFORMATION = DatasetDescriptor(
    splits_to_sizes={'train_fine': 2975,
                     'train_coarse': 22973,
                     'trainval_fine': 3475,
                     'trainval_coarse': 23473,
                     'val_fine': 500,
                     'test_fine': 1525},
    num_classes=19,
    ignore_label=255,
)

Windows fatal exception: access violation

Thread 0x00005cd8 (most recent call first):
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1443 in _call_tf_sessionrun
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1350 in _run_fn
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1365 in _do_call
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1359 in _do_run
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1180 in _run
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 956 in run
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\saver.py", line 1176 in save
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\supervisor.py", line 1119 in run_loop
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\coordinator.py", line 495 in run
  File "G:\anaconda\lib\threading.py", line 917 in _bootstrap_inner
  File "G:\anaconda\lib\threading.py", line 885 in _bootstrap

Thread 0x00004ef4 (most recent call first):
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1443 in _call_tf_sessionrun
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1350 in _run_fn
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1365 in _do_call
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1359 in _do_run
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1180 in _run
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 956 in run
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\training_util.py", line 68 in global_step
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\supervisor.py", line 1081 in run_loop
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\coordinator.py", line 495 in run
  File "G:\anaconda\lib\threading.py", line 917 in _bootstrap_inner
  File "G:\anaconda\lib\threading.py", line 885 in _bootstrap

Thread 0x000062ec (most recent call first):
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1443 in _call_tf_sessionrun
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1350 in _run_fn
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1365 in _do_call
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1359 in _do_run
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1180 in _run
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 956 in run
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\supervisor.py", line 1045 in run_loop
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\coordinator.py", line 495 in run
  File "G:\anaconda\lib\threading.py", line 917 in _bootstrap_inner
  File "G:\anaconda\lib\threading.py", line 885 in _bootstrap

Thread 0x00006018 (most recent call first):
  File "G:\anaconda\lib\threading.py", line 296 in wait
  File "G:\anaconda\lib\queue.py", line 170 in get
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\summary\writer\event_file_writer.py", line 159 in run
  File "G:\anaconda\lib\threading.py", line 917 in _bootstrap_inner
  File "G:\anaconda\lib\threading.py", line 885 in _bootstrap

Thread 0x00005df0 (most recent call first):
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1443 in _call_tf_sessionrun
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1350 in _run_fn
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1365 in _do_call
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1359 in _do_run
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1180 in _run
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 956 in run
  File "G:\anaconda\lib\site-packages\tensorflow_core\contrib\slim\python\slim\learning.py", line 490 in train_step
  File "G:\anaconda\lib\site-packages\tensorflow_core\contrib\slim\python\slim\learning.py", line 775 in train
  File "H:/dataSet/CityscapesV2/cityscapesScripts/models/research/deeplab/train.py", line 462 in main
  File "G:\anaconda\lib\site-packages\absl\app.py", line 250 in _run_main
  File "G:\anaconda\lib\site-packages\absl\app.py", line 299 in run
  File "G:\anaconda\lib\site-packages\tensorflow_core\python\platform\app.py", line 40 in run
  File "H:/dataSet/CityscapesV2/cityscapesScripts/models/research/deeplab/train.py", line 468 in <module>

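The Windows fatal exception: access violation crash and the thread dumps above (Thread 0x00004ef4 (most recent call first): and so on) occur because the TFRecord files we generated have names starting with train, while the code reads files whose names start with train_fine. The generated TFRecord files therefore need to be renamed:

[Figure 2: file names before renaming] changed to [Figure 3: file names after renaming]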
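Renaming the shards by hand works, but it can also be scripted. The sketch below assumes the file names have the form prefix, hyphen, rest (for example a name beginning with train-); rename_split_prefix is a hypothetical helper, and the directory you pass in is your own TFRecord output folder.

```python
from pathlib import Path


def rename_split_prefix(directory, old_prefix, new_prefix):
    """Rename files named '<old_prefix>-...' to '<new_prefix>-...' and return the new names."""
    renamed = []
    # sorted() materializes the listing first, so renaming does not disturb the iteration
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.name.startswith(old_prefix + "-"):
            target = path.with_name(new_prefix + path.name[len(old_prefix):])
            path.rename(target)
            renamed.append(target.name)
    return renamed
```

Calling rename_split_prefix(tfrecord_dir, "train", "train_fine") and then the same with "val" and "val_fine" covers the two splits mentioned above. Files that already start with train_fine- are left alone, because the match requires the hyphen directly after the old prefix.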

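4. Problems caused by differences between running on Windows and on Linux

The official documentation and the tutorials all describe a Linux setup, and some problems appear on Windows. The first is running .sh files: on Windows they can be run from the Git Bash window, but for some of the .py errors the tutorials all say to extend the Python environment the Linux way; on Windows, only the sys.path.append() approach solved it for me.

You can skip the .sh scripts for the train and val commands and run train.py and the other files directly in PyCharm, but note two things: 1. Adjust each of the configuration options. 2. In common.py, change the network architecture option to xception (xception_65 in the command above; this of course depends on which architecture your downloaded pretrained model was built on). The default is mobilenet_v2: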

[Figure 4: screenshot of the network architecture setting in common.py]

Otherwise you get errors such as:

Not found: Key MobilenetV2/Conv/BatchNorm/beta not found in checkpoint

Total size of new array must be unchanged for image_pooling/weights lh_shape
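If it is unclear which architecture a downloaded checkpoint was trained with, the variable names stored in it give a hint. The sketch below is an illustration under assumptions: the MobilenetV2/ prefix comes from the error message above, while the xception_65/ prefix is my assumption about how the Xception variables are scoped. guess_model_variant and checkpoint_variable_names are hypothetical helpers; tf.train.list_variables is the TensorFlow call used to read the names.

```python
def guess_model_variant(variable_names):
    """Guess the backbone from checkpoint variable names, or return None if unknown."""
    names = list(variable_names)
    if any(name.startswith("MobilenetV2/") for name in names):
        return "mobilenet_v2"
    if any(name.startswith("xception_65/") for name in names):  # assumed scope name
        return "xception_65"
    return None


def checkpoint_variable_names(checkpoint_path):
    """Read the variable names stored in a checkpoint (requires TensorFlow)."""
    import tensorflow as tf  # imported lazily so the helper above works without it
    return [name for name, _shape in tf.train.list_variables(checkpoint_path)]
```

Passing the result of checkpoint_variable_names(...) for your own checkpoint path to guess_model_variant(...) and comparing the answer with the architecture option you set should show whether the two disagree before a long training run is started.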

 
