TensorFlow in Practice: Chapter 9, part 1 (DeepLabv3+ code walkthrough) - main reference 1
TensorFlow in Practice: Chapter 9, part 2 (training DeepLabv3+ on your own dataset) - main reference 2
Semantic segmentation - training Deeplab v3+ on your own data - fix for oscillating loss
Semantic segmentation - training Deeplab v3+ on the VOC2012 dataset - training with VOC
deeplabv3+ and Xception - covers some of the concepts
Semantic segmentation - fix for the Deeplab v3+ error [predictions out of bound]
Semantic segmentation - training DeepLab v3+ on your own dataset
Tensorflow - semantic segmentation Deeplab API demo - mainly a walkthrough of the demo; this blogger's other posts are also worth reading
Deeplab v3 (2): source analysis of train.py and eval.py
Semantic segmentation - a code walkthrough of DEEPLAB V3+
--------------------------- Thanks to these bloggers for their generous write-ups! ------------------------------
The dataset is annotated with labelme.
Launching labelme:
cxx@cxx-211:~/labelmemaster$ python labelme/main.py
[Notes on making datasets with labelme] (the three code snippets mentioned below are in that note; remember to adapt the paths)
Convert label.png to a grayscale class map
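For a single file, the conversion can look like this (a sketch, assuming labelme's label.png is a palette-mode PNG whose palette indices are the class ids):

# Convert an indexed label.png into a flat grayscale class map.
import numpy as np
from PIL import Image

lbl = Image.open('label.png')          # palette-mode PNG from labelme
arr = np.asarray(lbl, dtype=np.uint8)  # palette indices are the class ids
Image.fromarray(arr).save('label_gray.png')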
Batch conversion:
cxx@cxx-211:~/labelmemaster$ python labelme/cli/json_to_dataset.py /home/cxx/labelmemaster/data
# the argument is the folder containing the json files
This generates five files per sample, e.g. 000000.png, 000000_gt.png, 000000_viz.png, info.yaml, and label_names.txt. The _gt.png file is the label file we need.
Extract all the _gt.png files:
cxx@cxx-211:~/labelmemaster$ python get_gt.py
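get_gt.py is one of the snippets from the note above; a minimal sketch of what it does, assuming the json_to_dataset.py outputs live under /home/cxx/labelmemaster/data and the labels belong in the mask folder (both paths are assumptions):

# get_gt.py - gather every *_gt.png label file into one folder
import glob
import os
import shutil

src_root = '/home/cxx/labelmemaster/data'  # json_to_dataset.py output folder (assumed)
dst_dir = '/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/mask'

os.makedirs(dst_dir, exist_ok=True)
for path in glob.glob(os.path.join(src_root, '**', '*_gt.png'), recursive=True):
    shutil.copy(path, os.path.join(dst_dir, os.path.basename(path)))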
The directory layout is as follows:
# from /home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/
+ image
+ mask
+ index
  - train.txt
  - trainval.txt
  - val.txt
+ tfrecord
image and mask
image holds all input images: training, validation, and test sets alike.
mask holds all label images, in one-to-one correspondence with the images in image.
PS: one thing to watch out for here: the filenames in image and mask must match exactly and be all lowercase. The previous step leaves the image files with uppercase extensions, which rename 'y/A-Z/a-z/' * fixes; the mask files are named like 000000_gt.png, which rename 's/_gt.png/.png/' ./* fixes, so that the image and mask filenames line up. The commands:
rename 's/\_gt.png/.png/' ./*  # strip the _gt suffix
rename 'y/A-Z/a-z/' *  # lowercase all filenames
index
This directory contains three .txt files:
train.txt: the filenames of the training set
trainval.txt: the filenames of the validation set
val.txt: the filenames of the test set
Each line is a bare filename without its extension, e.g. 000000.
These three files are generated by /home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/train_data.py:
cxx@cxx-211:~/Deeplab/models/research/deeplab/datasets/screw_seg$ python train_data.py
Modify the file paths in the script to produce the different train/val lists; the full code is in the note linked above.
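A minimal sketch of what train_data.py might look like (the 80/10/10 split ratio and the random shuffle are assumptions):

# train_data.py - split the image filenames into train/trainval/val lists
import os
import random

image_dir = '/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/image'
index_dir = '/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/index'

names = sorted(os.path.splitext(f)[0] for f in os.listdir(image_dir))
random.shuffle(names)
n1, n2 = int(0.8 * len(names)), int(0.9 * len(names))
splits = {'train': names[:n1], 'trainval': names[n1:n2], 'val': names[n2:]}
for split, lines in splits.items():
    with open(os.path.join(index_dir, split + '.txt'), 'w') as f:
        f.write('\n'.join(lines) + '\n')

tfrecord
Convert the images, masks, and index lists into tfrecord files: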
python deeplab/datasets/build_voc2012_data.py \
--image_folder="/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/image" \
--semantic_segmentation_folder="/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/mask" \
--list_folder="/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/index" \
--image_format="png" \
--output_dir="/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/tfrecord"
At around line 100 of deeplab/datasets/segmentation_dataset.py (the file that registers datasets), add the following (note: num_classes = number of labels + 2):
_SCREW = DatasetDescriptor(
    splits_to_sizes={
        'train': 119,  # num of samples in images/training
        # 'train_aug': 10582,
        # 'trainval': 2913,
        # 'val': 3000,
    },
    num_classes=3,  # background, label, ignore_label: number of labels + 2
    ignore_label=255,
)
At around line 112 of the same file, register the dataset's name:
_DATASETS_INFORMATION = {
    'cityscapes': _CITYSCAPES_INFORMATION,
    'pascal_voc_seg': _PASCAL_VOC_SEG_INFORMATION,
    'ade20k': _ADE20K_INFORMATION,
    'tank': _TANK,
    'screw': _SCREW,  # screws
}
At around line 109 of deeplab/utils/train_utils.py, adjust the exclude_list setting; this keeps the logits layer from being restored when loading pretrained weights:
# Variables that will not be restored.
# exclude_list = ['global_step', 'logits']
exclude_list = ['global_step']
if not initialize_last_layer:
    exclude_list.extend(last_layers)
Modify the class weights at around line 70 of train_utils.py:
################# change
ignore_weight = 0
label0_weight = 1   # background
label1_weight = 10  # object1
not_ignore_mask = (
    tf.to_float(tf.equal(scaled_labels, 0)) * label0_weight +
    tf.to_float(tf.equal(scaled_labels, 1)) * label1_weight +
    tf.to_float(tf.equal(scaled_labels, ignore_label)) * ignore_weight)
one_hot_labels = slim.one_hot_encoding(
    scaled_labels, num_classes, on_value=1.0, off_value=0.0)
tf.losses.softmax_cross_entropy(
    one_hot_labels,
    tf.reshape(logits, shape=[-1, num_classes]),
    weights=not_ignore_mask,
    scope=loss_scope)
Since this is a three-class problem in which background takes up a very large share and object2 is slightly rarer than object1, the final weight ratio is set to 1:10:15 (see the extended mask after the constants):
ignore_weight = 0
label0_weight = 1   # gray value 0 in the mask, i.e. background
label1_weight = 10  # object1, gray value 1 in the mask
label2_weight = 15  # object2, gray value 2 in the mask
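With a second object class, not_ignore_mask gains one more term following the same pattern as before (this extension is implied but not shown above):

not_ignore_mask = (
    tf.to_float(tf.equal(scaled_labels, 0)) * label0_weight +
    tf.to_float(tf.equal(scaled_labels, 1)) * label1_weight +
    tf.to_float(tf.equal(scaled_labels, 2)) * label2_weight +
    tf.to_float(tf.equal(scaled_labels, ignore_label)) * ignore_weight)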
Because we train on our own classes rather than the checkpoint's, also set these flags in deeplab/train.py:
initialize_last_layer = False
last_layers_contain_logits_only = True
Training:
python deeplab/train02.py \
--logtostderr \
--training_number_of_steps=30000 \
--train_split="train" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size=513 \
--train_crop_size=513 \
--train_batch_size=2 \
--dataset="screw" \
--tf_initial_checkpoint='/home/cxx/Deeplab/models/research/deeplab/backbone/deeplabv3_cityscapes_train/model.ckpt' \
--train_logdir='/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/exp/train_on_train_set/train' \
--dataset_dir='/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/tfrecord'
Visualization:
python deeplab/vis.py \
--logtostderr \
--vis_split="val" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--vis_crop_size=1025 \
--vis_crop_size=2049 \
--dataset="tank" \
--colormap_type="pascal" \
--checkpoint_dir='/home/cxx/Deeplab/models/research/deeplab/datasets/tank/exp/train_on_train_set/train' \
--vis_logdir='/home/cxx/Deeplab/models/research/deeplab/datasets/tank/exp/train_on_train_set/vis' \
--dataset_dir='/home/cxx/Deeplab/models/research/deeplab/datasets/tank/tfrecord'
Evaluation:
python deeplab/eval.py \
--logtostderr \
--eval_split="train" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--eval_crop_size=800 \
--eval_crop_size=1200 \
--dataset="screw" \
--checkpoint_dir='/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/exp/train_on_train_set/train' \
--eval_logdir='/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/exp/train_on_train_set/eval' \
--dataset_dir='/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/tfrecord'
[Making viz.py] (the code is in that note)
It simply blends each input image with its prediction.
cxx@cxx-211:~/Deeplab/models/research/deeplab/datasets/screw_seg$ python creat_viz.py
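A minimal sketch of what creat_viz.py does, assuming vis.py wrote *_prediction.png files into the segmentation_results folder (the folder layout and the 0.5 blend ratio are assumptions):

# creat_viz.py - overlay each prediction on its input image
import glob
import os

from PIL import Image

image_dir = 'image'
pred_dir = 'exp/train_on_train_set/vis/segmentation_results'  # assumed vis.py output layout
out_dir = 'viz'

os.makedirs(out_dir, exist_ok=True)
for pred_path in glob.glob(os.path.join(pred_dir, '*_prediction.png')):
    name = os.path.basename(pred_path).replace('_prediction', '')
    image = Image.open(os.path.join(image_dir, name)).convert('RGB')
    pred = Image.open(pred_path).convert('RGB').resize(image.size)
    Image.blend(image, pred, alpha=0.5).save(os.path.join(out_dir, name))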
ERROR 1
E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 20.62M (21626880 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Solution: the GPU is out of memory; reduce batch_size or crop_size.
ERROR 2
INFO:tensorflow:Error reported to Coordinator: , Dst tensor is not initialized.
[[Node: xception_65/middle_flow/block1/unit_8/xception_module/separable_conv1_depthwise/BatchNorm/beta/read/_1965 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1837_.../beta/read", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Solution: make sure stale processes are not holding GPU memory; find and kill them:
nvidia-smi
sudo kill -9 PID  # PID is the process ID
ERROR 3
INFO:tensorflow:Error reported to Coordinator: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_2
[[Node: image_pooling/BatchNorm/moving_variance_2 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](image_pooling/BatchNorm/moving_variance_2/tag, image_pooling/BatchNorm/moving_variance/read)]]
[[Node: xception_65/middle_flow/block1/unit_11/xception_module/separable_conv2_pointwise/BatchNorm/beta/read/_687 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1983_.../beta/read", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Solution: set a smaller learning rate and fine_tune_batch_norm = False.
Reference: https://github.com/tensorflow/models/issues/3716
"It seems that this is really due to limited GPU memory. As stated in the document, setting --fine_tune_batch_norm=False will solve this problem. I tried setting this option and can be able to train with a training step of 30,000 now."
Most likely this is a GPU-memory problem: set --fine_tune_batch_norm=False, make batch_size as large as possible, and adjust the output stride (set output_stride = 16 or maybe even 32, remembering to change atrous_rates accordingly, e.g. atrous_rates = [3, 6, 9] for output_stride = 32).
Training log near the end of the run:
INFO:tensorflow:global step 29910: loss = 0.3183 (0.347 sec/step)
INFO:tensorflow:global step 29920: loss = 0.2940 (0.353 sec/step)
INFO:tensorflow:global step 29930: loss = 0.3336 (0.377 sec/step)
INFO:tensorflow:global step 29940: loss = 0.3120 (0.364 sec/step)
INFO:tensorflow:global step 29950: loss = 0.2715 (0.349 sec/step)
INFO:tensorflow:global step 29960: loss = 0.2974 (0.367 sec/step)
INFO:tensorflow:global step 29970: loss = 0.2942 (0.350 sec/step)
INFO:tensorflow:global step 29980: loss = 0.3177 (0.357 sec/step)
INFO:tensorflow:global step 29990: loss = 0.3083 (0.391 sec/step)
INFO:tensorflow:global step 30000: loss = 0.3131 (0.343 sec/step)
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.
Visualization log:
INFO:tensorflow:Visualizing batch 105 / 119
INFO:tensorflow:Visualizing batch 106 / 119
INFO:tensorflow:Visualizing batch 107 / 119
INFO:tensorflow:Visualizing batch 108 / 119
INFO:tensorflow:Visualizing batch 109 / 119
INFO:tensorflow:Visualizing batch 110 / 119
INFO:tensorflow:Visualizing batch 111 / 119
INFO:tensorflow:Visualizing batch 112 / 119
INFO:tensorflow:Visualizing batch 113 / 119
INFO:tensorflow:Visualizing batch 114 / 119
INFO:tensorflow:Visualizing batch 115 / 119
INFO:tensorflow:Visualizing batch 116 / 119
INFO:tensorflow:Visualizing batch 117 / 119
INFO:tensorflow:Visualizing batch 118 / 119
INFO:tensorflow:Visualizing batch 119 / 119
INFO:tensorflow:Finished visualization at 2018-11-09-02:56:56
Evaluation log:
INFO:tensorflow:Starting evaluation at 2018-11-09-03:03:40
INFO:tensorflow:Evaluation [11/119]
INFO:tensorflow:Evaluation [22/119]
INFO:tensorflow:Evaluation [33/119]
INFO:tensorflow:Evaluation [44/119]
INFO:tensorflow:Evaluation [55/119]
INFO:tensorflow:Evaluation [66/119]
INFO:tensorflow:Evaluation [77/119]
INFO:tensorflow:Evaluation [88/119]
INFO:tensorflow:Evaluation [99/119]
INFO:tensorflow:Evaluation [110/119]
INFO:tensorflow:Evaluation [119/119]
INFO:tensorflow:Finished evaluation at 2018-11-09-03:04:04
miou_1.0[0.780412197]