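TensorFlow in Practice: Chapter 9, Part 1 (DeepLabv3+ Code Implementation)
(Very detailed and complete) Implementing DeepLabv3+ on TensorFlow
Official guide documentation
Official Cityscapes dataset download page
Official CUDA download page; official cuDNN download page
Mapping between TensorFlow versions and CUDA/cuDNN versions
Code for converting Cityscapes data to TFRecord: https://github.com/mcordts/cityscapesScripts
DeepLab model code: https://github.com/tensorflow/models
Pretrained models for each dataset: https://github.com/tensorflow/models/blob/master/research/deeplab/g3doc/model_zoo.md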
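1. TensorFlow environment problems
2. CUDA environment and compatibility problems
3. Version mismatches between the various pieces of code
4. Problems caused by differences between running on Windows and on Linux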
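1. TensorFlow environment problems (these usually show up when running the code that converts the data to TFRecord)
A series of missing-module errors such as: tensorflow has no attribute 'app', tensorflow has no attribute 'logging', tensorflow has no attribute 'contrib'.
Training is too slow.
Could not load dynamic library 'cudart64_100.dll' (this appears even though CUDA is installed).
All of these are solved by installing a suitable tensorflow-gpu version; tf.app, tf.logging and tf.contrib belong to the 1.x API, so a 2.x install does not expose them. On Windows, use Anaconda to install and switch versions. The pip command caused me all sorts of problems, and I spent quite a while sorting out the TensorFlow version. I use tensorflow-gpu 1.15: after installing the default latest version, click the check mark on the left and choose "Mark for..." to select the version you need.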
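A quick way to confirm that the installed version is the cause is to check the version string and the attributes named in the errors. The following is only a minimal sketch; the helper names is_tf1_version and missing_tf1_modules are made up for illustration and are not part of TensorFlow.

```python
def is_tf1_version(version):
    """Return True if a version string such as '1.15.0' belongs to the 1.x series."""
    major = int(version.split(".")[0])
    return major == 1


def missing_tf1_modules(tf_module, names=("app", "logging", "contrib")):
    """List the attributes from the error messages that the module does not expose."""
    return [name for name in names if not hasattr(tf_module, name)]


# Usage (requires TensorFlow to be installed):
#   import tensorflow as tf
#   print(tf.__version__, is_tf1_version(tf.__version__))
#   print(missing_tf1_modules(tf))
```

If the second call prints a non-empty list, the environment is most likely not the 1.x install that the DeepLab scripts expect.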
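2. CUDA environment and compatibility problems
Could not load dynamic library 'cudart64_100.dll' and similar errors about missing DLL files.
Be sure to pick the matching version, and check which versions your graphics card supports. TensorFlow 1.15 corresponds to CUDA 10.0. This problem also cost me a lot of time: each CUDA download and installation takes a long while, so read up on the requirements first and get the CUDA environment right in one go.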
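Before starting a long run, it can help to check whether the operating system can actually load the DLLs that TensorFlow will ask for. This is a hedged sketch using the standard ctypes module. The function name unloadable_libraries is hypothetical, cudart64_100.dll is taken from the error message above, and cudnn64_7.dll is an assumed cuDNN 7 file name that may differ on your machine.

```python
import ctypes


def unloadable_libraries(names):
    """Return the library names that the OS loader cannot find or load."""
    failed = []
    for name in names:
        try:
            ctypes.CDLL(name)
        except OSError:
            failed.append(name)
    return failed


# cudart64_100.dll comes from the error message (CUDA 10.0 runtime).
# cudnn64_7.dll is an assumption; adjust it to your installation.
CUDA_10_0_DLLS = ["cudart64_100.dll", "cudnn64_7.dll"]

# Example on Windows: print(unloadable_libraries(CUDA_10_0_DLLS))
```

An empty result only means the files are on the loader's search path; it does not prove that the driver and the graphics card support that CUDA version.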
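3. Version mismatches between the various pieces of code (the main problem I faced)
Because the code on GitHub keeps changing, the official documentation and the tutorials are sometimes out of step with the actual code, and for some problems I could not find an answer anywhere (I asked on Stack Overflow and GitHub without getting a solution).
The TFRecord files are all 0 KB: this always means the generation code did not run successfully. A possible cause is that the Python files called by the convert_cityscapes.sh script fail because they cannot find the submodules they import (the usual advice is to add environment variables and the like). I solved it by adding the following to the import section of the failing .py files: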
import sys
sys.path.append("H:/dataSet/CityscapesV2/cityscapesScripts/models/research")
sys.path.append("H:/dataSet/CityscapesV2/cityscapesScripts/models/research/slim")
Adjust the paths and names to your own setup.
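After re-running the conversion, a small check can confirm that no 0 KB files remain. This is a minimal sketch; empty_tfrecords is a hypothetical helper, and the *.tfrecord pattern is an assumption about how the output files are named.

```python
import glob
import os


def empty_tfrecords(tfrecord_dir, pattern="*.tfrecord"):
    """Return sorted paths of files matching the pattern whose size is 0 bytes."""
    paths = glob.glob(os.path.join(tfrecord_dir, pattern))
    return sorted(p for p in paths if os.path.getsize(p) == 0)
```

An empty list suggests the conversion wrote data to every shard; any path it returns points to a shard that still needs to be regenerated.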
data split name train not recognized: the training command given in the official documentation is:
# From tensorflow/models/research/
python deeplab/train.py \
--logtostderr \
--training_number_of_steps=90000 \
--train_split="train" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size="769,769" \
--train_batch_size=1 \
--dataset="cityscapes" \
--tf_initial_checkpoint=${PATH_TO_INITIAL_CHECKPOINT} \
--train_logdir=${PATH_TO_TRAIN_DIR} \
--dataset_dir=${PATH_TO_DATASET}
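However, the code no longer has a "train" split; the splits are now train_fine and so on, as the dataset descriptor below shows. The same problem comes up when running val and vis. Adding the suffix to each split name (train_fine, val_fine) fixes this error, but after switching to train_fine the program stops unexpectedly with the access violation shown after the descriptor.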
_CITYSCAPES_INFORMATION = DatasetDescriptor(
    splits_to_sizes={'train_fine': 2975,
                     'train_coarse': 22973,
                     'trainval_fine': 3475,
                     'trainval_coarse': 23473,
                     'val_fine': 500,
                     'test_fine': 1525},
    num_classes=19,
    ignore_label=255,
)
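The "not recognized" error follows directly from this table: a split name that is not one of its keys is rejected. The lookup can be sketched roughly as follows. check_split and CITYSCAPES_SPLITS are hypothetical names used only for illustration, and the sizes are copied from the descriptor above.

```python
# Split sizes copied from the _CITYSCAPES_INFORMATION descriptor above.
CITYSCAPES_SPLITS = {
    "train_fine": 2975,
    "train_coarse": 22973,
    "trainval_fine": 3475,
    "trainval_coarse": 23473,
    "val_fine": 500,
    "test_fine": 1525,
}


def check_split(split_name, splits_to_sizes=CITYSCAPES_SPLITS):
    """Return the number of samples in a split, or raise if the name is unknown."""
    if split_name not in splits_to_sizes:
        raise ValueError("data split name %s not recognized" % split_name)
    return splits_to_sizes[split_name]
```

With this table, check_split("train") raises the error while check_split("train_fine") succeeds, which is why the official command needs the suffixed names.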
Windows fatal exception: access violation
Thread 0x00005cd8 (most recent call first):
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1443 in _call_tf_sessionrun
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1350 in _run_fn
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1365 in _do_call
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1359 in _do_run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1180 in _run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 956 in run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\saver.py", line 1176 in save
File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\supervisor.py", line 1119 in run_loop
File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\coordinator.py", line 495 in run
File "G:\anaconda\lib\threading.py", line 917 in _bootstrap_inner
File "G:\anaconda\lib\threading.py", line 885 in _bootstrap
Thread 0x00004ef4 (most recent call first):
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1443 in _call_tf_sessionrun
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1350 in _run_fn
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1365 in _do_call
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1359 in _do_run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1180 in _run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 956 in run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\training_util.py", line 68 in global_step
File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\supervisor.py", line 1081 in run_loop
File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\coordinator.py", line 495 in run
File "G:\anaconda\lib\threading.py", line 917 in _bootstrap_inner
File "G:\anaconda\lib\threading.py", line 885 in _bootstrap
Thread 0x000062ec (most recent call first):
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1443 in _call_tf_sessionrun
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1350 in _run_fn
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1365 in _do_call
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1359 in _do_run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1180 in _run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 956 in run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\supervisor.py", line 1045 in run_loop
File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\coordinator.py", line 495 in run
File "G:\anaconda\lib\threading.py", line 917 in _bootstrap_inner
File "G:\anaconda\lib\threading.py", line 885 in _bootstrap
Thread 0x00006018 (most recent call first):
File "G:\anaconda\lib\threading.py", line 296 in wait
File "G:\anaconda\lib\queue.py", line 170 in get
File "G:\anaconda\lib\site-packages\tensorflow_core\python\summary\writer\event_file_writer.py", line 159 in run
File "G:\anaconda\lib\threading.py", line 917 in _bootstrap_inner
File "G:\anaconda\lib\threading.py", line 885 in _bootstrap
Thread 0x00005df0 (most recent call first):
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1443 in _call_tf_sessionrun
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1350 in _run_fn
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1365 in _do_call
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1359 in _do_run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1180 in _run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 956 in run
File "G:\anaconda\lib\site-packages\tensorflow_core\contrib\slim\python\slim\learning.py", line 490 in train_step
File "G:\anaconda\lib\site-packages\tensorflow_core\contrib\slim\python\slim\learning.py", line 775 in train
File "H:/dataSet/CityscapesV2/cityscapesScripts/models/research/deeplab/train.py", line 462 in main
File "G:\anaconda\lib\site-packages\absl\app.py", line 250 in _run_main
File "G:\anaconda\lib\site-packages\absl\app.py", line 299 in run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\platform\app.py", line 40 in run
File "H:/dataSet/CityscapesV2/cityscapesScripts/models/research/deeplab/train.py", line 468 in
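The "Windows fatal exception: access violation" output above (Thread 0x00004ef4 and the other threads) occurs because the TFRecord files we generated have names starting with train, while the code reads files whose names start with train_fine. The generated TFRecord files therefore need to be renamed to match the split name.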
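Renaming many shards by hand is error-prone, so the rename can be scripted. The sketch below is only one way to do it. rename_split_prefix is a hypothetical helper, and it assumes the shards are named like train-00000-of-00010.tfrecord, that is, the split name followed by a hyphen.

```python
import glob
import os


def rename_split_prefix(tfrecord_dir, old_split, new_split):
    """Rename '<old_split>-*.tfrecord' files to '<new_split>-*.tfrecord'."""
    renamed = []
    pattern = os.path.join(tfrecord_dir, old_split + "-*.tfrecord")
    for old_path in sorted(glob.glob(pattern)):
        name = os.path.basename(old_path)
        # Swap only the leading split name; keep the shard suffix unchanged.
        new_name = new_split + name[len(old_split):]
        new_path = os.path.join(tfrecord_dir, new_name)
        os.rename(old_path, new_path)
        renamed.append(new_path)
    return renamed


# Example: rename_split_prefix("path/to/tfrecord", "train", "train_fine")
#          rename_split_prefix("path/to/tfrecord", "val", "val_fine")
```

Because the pattern requires a hyphen right after the old split name, files that already start with train_fine are left alone if the script is run twice.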
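4. Problems caused by differences between running on Windows and on Linux
The official documentation and the tutorials all describe Linux, and some things go wrong on Windows. The first issue is running the .sh files: on Windows they can be run from the Git Bash window. For the import errors raised by some .py files, however, the tutorials all tell you to extend the Python environment the Linux way; on Windows, only the sys.path.append() approach worked for me.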
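The sys.path.append() workaround can be wrapped in a small helper so that each script adds the same two folders only once. This is a sketch under assumptions: add_research_paths is a hypothetical name, and research_dir stands for your local models/research folder.

```python
import os
import sys


def add_research_paths(research_dir):
    """Put research_dir and its slim subfolder on sys.path if not already there."""
    added = []
    for path in (research_dir, os.path.join(research_dir, "slim")):
        if path not in sys.path:
            sys.path.append(path)
            added.append(path)
    return added


# Example, using the path from earlier in this post:
# add_research_paths("H:/dataSet/CityscapesV2/cityscapesScripts/models/research")
```

Call it at the top of a script, before any import of deeplab or slim modules, since sys.path is only consulted when the import statement runs.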
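You do not have to use the .sh scripts to run the train and val commands. Running train.py and the other files directly in PyCharm works, but note two things: 1. Adjust each configuration option. 2. In common.py, change the network structure option to xception (which one depends, of course, on the architecture your downloaded pretrained model was trained on); the default is mobilenet_v2.
Otherwise errors like the following are reported: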
Not found: Key MobilenetV2/Conv/BatchNorm/beta not found in checkpoint
Total size of new array must be unchanged for image_pooling/weights lh_shape
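Both errors indicate that the checkpoint and the configured network structure do not match. One way to see which backbone a checkpoint was trained with is to look at the scope prefixes of its variable names. The sketch below is hypothetical: guess_backbone is not a DeepLab function, the MobilenetV2 prefix comes from the error message above, and the xception_65 prefix is an assumption about how those checkpoints name their variables.

```python
def guess_backbone(variable_names):
    """Guess the backbone of a checkpoint from the scope prefixes of its variables."""
    prefixes = {name.split("/")[0] for name in variable_names}
    if "MobilenetV2" in prefixes:
        return "mobilenet_v2"
    if "xception_65" in prefixes:  # assumed prefix, verify against your checkpoint
        return "xception_65"
    return None


# Usage (requires TensorFlow 1.x; list_variables yields (name, shape) pairs):
#   import tensorflow as tf
#   names = [name for name, _ in tf.train.list_variables(checkpoint_path)]
#   print(guess_backbone(names))
```

If the guess differs from the network structure configured in common.py, change the configuration or download the checkpoint that matches it.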