Deeplab V3+: Full Workflow for Training on Your Own Dataset

Deeplab V3+ training process

  • 0. Reference links
  • 1. Making the dataset
    • 1.1. Create the JSON files
    • 1.2. Batch conversion from JSON to PNG
    • 1.3. Dataset directory layout
    • 1.4. Generate the TFRecords
  • 2. Code changes before training
    • 2.1. Modify segmentation_dataset.py
    • 2.2. Modify train_utils.py
    • 2.3. Class imbalance
    • 2.4. Modify train.py
  • 3. TRAIN
    • 3.1. train
    • 3.2. vis
    • 3.3. val
    • 3.4. Generate results that are easier to inspect
  • 4. Afterthoughts
    • 4.1. Errors encountered
    • 4.2. train output
    • 4.3. vis output
    • 4.4. val output

0. Reference links

TensorFlow实战:Chapter-9上(DeepLabv3+代码实现) (DeepLabv3+ code implementation; primary reference 1)

TensorFlow实战:Chapter-9下(DeepLabv3+在自己的数据集训练) (training DeepLabv3+ on your own dataset; primary reference 2)

图像语义分割 — 利用Deeplab v3+训练自己的数据 loss震荡解决办法 (training your own data with Deeplab v3+: dealing with oscillating loss)

图像语义分割 — 利用Deeplab v3+训练VOC2012数据集 (training on the VOC2012 dataset)

deeplabv3+Xception (covers some of the underlying concepts)

图像语义分割 Deeplab v3+报错[predictions out of bound]解决办法 (fixing the "predictions out of bound" error)

图像语义分割 DeepLab v3+ 训练自己的数据集 (training DeepLab v3+ on your own dataset)

Tensorflow - 语义分割 Deeplab API 之 Demo (a walkthrough of the demo; this blogger's other posts are also well worth reading)

Deeplab v3 (2): train.py、eval.py源码分析 (source-code analysis of train.py and eval.py)

图像语义分割 DEEPLAB V3+的代码走读 (a code walkthrough of the DeepLab v3+ code)

--------------------------- Many thanks to these bloggers for sharing their work so generously! ------------------------------

1. Making the dataset

The dataset is annotated with labelme.

1.1. Create the JSON files

Launch labelme:

cxx@cxx-211:~/labelmemaster$ python labelme/main.py

1.2. Batch conversion from JSON to PNG

[Making a dataset with labelme](the three code snippets mentioned below are in that note; remember to adapt the paths)

  1. Convert label.png to a grayscale image

  2. Batch conversion

     cxx@cxx-211:~/labelmemaster$ python labelme/cli/json_to_dataset.py /home/cxx/labelmemaster/data
     
     # the last argument is the folder containing the JSON files
    

    This generates five files per image, e.g. 000000.png, 000000_gt.png, 000000_viz.png, info.yaml and label_names.txt. The _gt.png file is the label we need.

  3. Collect all the _gt.png files (a minimal sketch of such a script is shown after this list)

     cxx@cxx-211:~/labelmemaster$ python get_gt.py 
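
A minimal sketch of what get_gt.py might do, collecting every *_gt.png label into one folder (the source and destination paths are assumptions; the actual script is in the note linked above):

    # get_gt.py -- minimal sketch; paths are assumptions, adjust to your own layout
    import os
    import shutil

    src_root = '/home/cxx/labelmemaster/data'     # labelme output root (assumed)
    dst_dir = '/home/cxx/labelmemaster/data_gt'   # folder that will collect the labels (assumed)
    os.makedirs(dst_dir, exist_ok=True)

    # walk every folder produced by json_to_dataset.py and copy the *_gt.png labels
    for dirpath, _, filenames in os.walk(src_root):
        for name in filenames:
            if name.endswith('_gt.png'):
                shutil.copy(os.path.join(dirpath, name), os.path.join(dst_dir, name))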
    

1.3. Dataset directory layout

The directory layout is as follows:

#from /home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/
    + image
    + mask
    + index 
        - train.txt
        - trainval.txt
        - val.txt
    + tfrecord

  1. image and mask

    1. image holds all the input images: the training, validation and test images together.

    2. mask holds all the label images, in one-to-one correspondence with the input images in image.

    • Note: the file names in image and mask must match exactly and be all lowercase. The images from the previous step have uppercase extensions, which can be fixed with rename 'y/A-Z/a-z/' *; the mask files are named 000000_gt.png, which can be renamed with rename 's/_gt.png/.png/' ./*. After this, the image and mask file names correspond. The commands are:

      rename 's/\_gt.png/.png/' ./*  # strip the _gt suffix
      
      rename 'y/A-Z/a-z/' *   # lowercase all file names
      
  2. index

    This directory contains three .txt files:

    train.txt: file names of all training-set images
    trainval.txt: file names of all validation-set images
    val.txt: file names of all test-set images
    

These three files are generated by /home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/train_data.py; the command is:

cxx@cxx-211:~/Deeplab/models/research/deeplab/datasets/screw_seg$ python train_data.py

Adjust the file paths in the script to generate the txt files for the different train/val splits; a minimal sketch of the script is shown below.
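
A minimal sketch of what train_data.py might look like, assuming a simple random 80/20 train/val split (the split ratio and the handling of trainval.txt are assumptions):

    # train_data.py -- minimal sketch; split ratio and trainval handling are assumptions
    import os
    import random

    image_dir = '/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/image'
    index_dir = '/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/index'

    # every image basename (without extension) becomes one line in the index files
    names = sorted(os.path.splitext(f)[0] for f in os.listdir(image_dir) if f.endswith('.png'))
    random.shuffle(names)
    split = int(0.8 * len(names))   # 80/20 train/val split (assumed)

    with open(os.path.join(index_dir, 'train.txt'), 'w') as f:
        f.write('\n'.join(names[:split]) + '\n')
    with open(os.path.join(index_dir, 'val.txt'), 'w') as f:
        f.write('\n'.join(names[split:]) + '\n')
    with open(os.path.join(index_dir, 'trainval.txt'), 'w') as f:
        f.write('\n'.join(names) + '\n')   # all names here; adjust if trainval should differ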

1.4. Generate the TFRecords

  • image_folder: directory containing the original input images
  • semantic_segmentation_folder: directory containing the label (mask) images
  • list_folder: directory with the index files that split the dataset into training, validation, etc.
  • image_format: format of the input images (png here)
  • output_dir: directory where the generated TFRecords will be stored (create it yourself)
    python deeplab/datasets/build_voc2012_data.py \
      --image_folder="/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/image" \
      --semantic_segmentation_folder="/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/mask" \
      --list_folder="/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/index" \
      --image_format="png" \
      --output_dir="/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/tfrecord"

2. Code changes before training

2.1. Modify segmentation_dataset.py

Around line 100, add the following code (num_classes counts the background plus the object labels; ignore_label is not an extra class):

_SCREW = DatasetDescriptor(
    splits_to_sizes={
        'train': 119,   # num of samples in images/training
        #'train_aug': 10582,
        #'trainval': 2913,
        #'val': 3000,
    },
    num_classes=3,   # background + 2 object classes (ignore_label is not counted as a class)
    ignore_label=255,
)

Around line 112, register the new dataset name:

_DATASETS_INFORMATION = {
    'cityscapes': _CITYSCAPES_INFORMATION,
    'pascal_voc_seg': _PASCAL_VOC_SEG_INFORMATION,
    'ade20k': _ADE20K_INFORMATION,
    'tank': _TANK,
    'screw': _SCREW   # the screw dataset added above
}

2.2. Modify train_utils.py

Around line 109, adjust the exclude_list so that the logits layer is not restored when loading the pretrained weights:

  # Variables that will not be restored.
  #exclude_list = ['global_step','logits']
  exclude_list = ['global_step']
  if not initialize_last_layer:
    exclude_list.extend(last_layers)

2.3. Class imbalance

Modify the class weights around line 70 of train_utils.py:

    ################# change: per-class weights
    ignore_weight = 0
    label0_weight = 1    # background
    label1_weight = 10   # object1

    not_ignore_mask = tf.to_float(tf.equal(scaled_labels, 0)) * label0_weight + \
                      tf.to_float(tf.equal(scaled_labels, 1)) * label1_weight + \
                      tf.to_float(tf.equal(scaled_labels, ignore_label)) * ignore_weight

    one_hot_labels = slim.one_hot_encoding(
        scaled_labels, num_classes, on_value=1.0, off_value=0.0)
    tf.losses.softmax_cross_entropy(
        one_hot_labels,
        tf.reshape(logits, shape=[-1, num_classes]),
        weights=not_ignore_mask,
        scope=loss_scope)

Because this is a three-class problem, in which the background takes up a very large share of the pixels and object2 appears slightly less often than object1, the final weight ratio is set to 1:10:15 (the corresponding mask expression is sketched after the snippet below):

ignore_weight = 0
label0_weight = 1    # grayscale value 0 in the mask, i.e. background
label1_weight = 10   # object1, grayscale value 1 in the mask
label2_weight = 15   # object2, grayscale value 2 in the mask
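
With three classes, the not_ignore_mask shown in the snippet above needs one term per label; a sketch of the corresponding line, following the same style:

    # weighted mask covering background, object1, object2 and the ignore label
    not_ignore_mask = tf.to_float(tf.equal(scaled_labels, 0)) * label0_weight + \
                      tf.to_float(tf.equal(scaled_labels, 1)) * label1_weight + \
                      tf.to_float(tf.equal(scaled_labels, 2)) * label2_weight + \
                      tf.to_float(tf.equal(scaled_labels, ignore_label)) * ignore_weight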

2.4. Modify train.py

Because the number of classes differs from the pretrained checkpoint, do not restore the last (logits) layer:

    initialize_last_layer=False
    
    last_layers_contain_logits_only=True
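
Both settings are also defined as command-line flags in train.py, so depending on the code version they can be appended to the training command in section 3.1 instead of editing the source:

    --initialize_last_layer=false \
    --last_layers_contain_logits_only=true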

3. TRAIN

3.1. train

  • tf_initial_checkpoint: the pretrained weights; the Cityscapes pretrained checkpoint is used here
  • train_logdir: directory where the training outputs are written
  • dataset_dir: the dataset's TFRecord directory
  • dataset: set to the dataset name registered in segmentation_dataset.py
python deeplab/train02.py \
    --logtostderr \
    --training_number_of_steps=30000 \
    --train_split="train" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --train_crop_size=513 \
    --train_crop_size=513 \
    --train_batch_size=2 \
    --dataset="screw" \
    --tf_initial_checkpoint='/home/cxx/Deeplab/models/research/deeplab/backbone/deeplabv3_cityscapes_train/model.ckpt' \
    --train_logdir='/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/exp/train_on_train_set/train' \
    --dataset_dir='/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/tfrecord'

3.2. vis

  • vis_split: set to the test split
  • vis_crop_size: the crop height and width used for visualization; set to at least the image size
  • dataset: set to the dataset name registered in segmentation_dataset.py
  • dataset_dir: the TFRecord directory we created
  • colormap_type: the color map used for the visualized labels
python deeplab/vis.py \
    --logtostderr \
    --vis_split="val" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --vis_crop_size=1025 \
    --vis_crop_size=2049 \
    --dataset="tank" \
    --colormap_type="pascal" \
    --checkpoint_dir='/home/cxx/Deeplab/models/research/deeplab/datasets/tank/exp/train_on_train_set/train' \
    --vis_logdir='/home/cxx/Deeplab/models/research/deeplab/datasets/tank/exp/train_on_train_set/vis' \
    --dataset_dir='/home/cxx/Deeplab/models/research/deeplab/datasets/tank/tfrecord'

3.3. val

  • eval_split: the split to evaluate on
  • eval_crop_size: set to the image height and width (800 and 1200 here)
  • dataset: set to the dataset name registered in segmentation_dataset.py (screw here)
  • dataset_dir: the TFRecord directory we created
python deeplab/eval.py \
    --logtostderr \
    --eval_split="train" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --eval_crop_size=800 \
    --eval_crop_size=1200 \
    --dataset="screw" \
    --checkpoint_dir='/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/exp/train_on_train_set/train' \
    --eval_logdir='/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/exp/train_on_train_set/eval' \
    --dataset_dir='/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/tfrecord'

3.4. Generate results that are easier to inspect

[Making viz.py](the full code is in that note)

This simply blends each input image with its prediction.

    cxx@cxx-211:~/Deeplab/models/research/deeplab/datasets/screw_seg$ python creat_viz.py 
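
A minimal sketch of what creat_viz.py might do (the paths, the file-name matching between image and prediction, and the blend ratio are all assumptions; the actual script is in the note linked above):

    # creat_viz.py -- minimal sketch; paths, file naming and blend ratio are assumptions
    import os
    from PIL import Image

    image_dir = '/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/image'
    # vis.py writes its colored predictions under the vis log directory's segmentation_results
    # folder; the exact output file names depend on the vis.py version, so adjust the matching below
    pred_dir = '/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/exp/train_on_train_set/vis/segmentation_results'
    out_dir = '/home/cxx/Deeplab/models/research/deeplab/datasets/screw_seg/exp/train_on_train_set/vis/blend'
    os.makedirs(out_dir, exist_ok=True)

    for name in sorted(os.listdir(image_dir)):
        pred_path = os.path.join(pred_dir, name)
        if not os.path.exists(pred_path):
            continue                      # skip images that have no prediction
        image = Image.open(os.path.join(image_dir, name)).convert('RGB')
        pred = Image.open(pred_path).convert('RGB').resize(image.size)
        Image.blend(image, pred, 0.5).save(os.path.join(out_dir, name))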

4. Afterthoughts

4.1. Errors encountered

ERROR 1

E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 20.62M (21626880 bytes) from device: CUDA_ERROR_OUT_OF_MEMO

Solution: the GPU ran out of memory; reduce the batch size or the crop size.

ERROR 2

INFO:tensorflow:Error reported to Coordinator: , Dst tensor is not initialized.
	 [[Node: xception_65/middle_flow/block1/unit_8/xception_module/separable_conv1_depthwise/BatchNorm/beta/read/_1965 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1837_.../beta/read", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Solution: free up GPU memory; check for and kill stale processes:

nvidia-smi

sudo kill -9 PID   # PID is the process ID

ERROR 3

INFO:tensorflow:Error reported to Coordinator: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_2
	 [[Node: image_pooling/BatchNorm/moving_variance_2 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](image_pooling/BatchNorm/moving_variance_2/tag, image_pooling/BatchNorm/moving_variance/read)]]
	 [[Node: xception_65/middle_flow/block1/unit_11/xception_module/separable_conv2_pointwise/BatchNorm/beta/read/_687 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1983_.../beta/read", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Solution: use a smaller learning rate and set fine_tune_batch_norm=False.

参考链接:https://github.com/tensorflow/models/issues/3716

It seems that this is really due to limited GPU memory. As stated in the document, setting --fine_tune_batch_norm=False will solve this problem. I tried setting this option and can be able to train with a training step of 30,000 now ?

Most likely this is also a GPU memory problem: set --fine_tune_batch_norm=False, use as large a batch_size as memory allows, and adjust the output stride (set output_stride = 16 or maybe even 32, remembering to change the atrous_rates flag accordingly, e.g. atrous_rates = [3, 6, 9] for output_stride = 32). The corresponding flag changes are sketched below.
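
For example, relative to the training command in section 3.1, the relevant flags would change roughly as follows (a sketch; keep the remaining flags of that command unchanged):

    python deeplab/train02.py \
        --fine_tune_batch_norm=false \
        --output_stride=32 \
        --atrous_rates=3 \
        --atrous_rates=6 \
        --atrous_rates=9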

4.2. train output

INFO:tensorflow:global step 29910: loss = 0.3183 (0.347 sec/step)
INFO:tensorflow:global step 29920: loss = 0.2940 (0.353 sec/step)
INFO:tensorflow:global step 29930: loss = 0.3336 (0.377 sec/step)
INFO:tensorflow:global step 29940: loss = 0.3120 (0.364 sec/step)
INFO:tensorflow:global step 29950: loss = 0.2715 (0.349 sec/step)
INFO:tensorflow:global step 29960: loss = 0.2974 (0.367 sec/step)
INFO:tensorflow:global step 29970: loss = 0.2942 (0.350 sec/step)
INFO:tensorflow:global step 29980: loss = 0.3177 (0.357 sec/step)
INFO:tensorflow:global step 29990: loss = 0.3083 (0.391 sec/step)
INFO:tensorflow:global step 30000: loss = 0.3131 (0.343 sec/step)
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.

4.3. vis output

INFO:tensorflow:Visualizing batch 105 / 119
INFO:tensorflow:Visualizing batch 106 / 119
INFO:tensorflow:Visualizing batch 107 / 119
INFO:tensorflow:Visualizing batch 108 / 119
INFO:tensorflow:Visualizing batch 109 / 119
INFO:tensorflow:Visualizing batch 110 / 119
INFO:tensorflow:Visualizing batch 111 / 119
INFO:tensorflow:Visualizing batch 112 / 119
INFO:tensorflow:Visualizing batch 113 / 119
INFO:tensorflow:Visualizing batch 114 / 119
INFO:tensorflow:Visualizing batch 115 / 119
INFO:tensorflow:Visualizing batch 116 / 119
INFO:tensorflow:Visualizing batch 117 / 119
INFO:tensorflow:Visualizing batch 118 / 119
INFO:tensorflow:Visualizing batch 119 / 119
INFO:tensorflow:Finished visualization at 2018-11-09-02:56:56

4.4. val output

INFO:tensorflow:Starting evaluation at 2018-11-09-03:03:40
INFO:tensorflow:Evaluation [11/119]
INFO:tensorflow:Evaluation [22/119]
INFO:tensorflow:Evaluation [33/119]
INFO:tensorflow:Evaluation [44/119]
INFO:tensorflow:Evaluation [55/119]
INFO:tensorflow:Evaluation [66/119]
INFO:tensorflow:Evaluation [77/119]
INFO:tensorflow:Evaluation [88/119]
INFO:tensorflow:Evaluation [99/119]
INFO:tensorflow:Evaluation [110/119]
INFO:tensorflow:Evaluation [119/119]
INFO:tensorflow:Finished evaluation at 2018-11-09-03:04:04
miou_1.0[0.780412197]
