【Semantic Segmentation】A Survey of Semantic Segmentation -- Atrous Convolution

  • Atrous Convolution
    • [DeepLabV2] Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs 2016
      • [ASPP] atrous spatial pyramid pooling
    • [DeepLabV3] Rethinking Atrous Convolution for Semantic Image Segmentation 2017-06
    • [DeepLabV3+] Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation 2018-02
      • code
    • [DenseASPP] DenseASPP for Semantic Segmentation in Street Scenes 2018

Atrous Convolution

Atrous convolution (dilated convolution) is used to address the loss of fine image detail caused by encoder+decoder architectures (a minimal sketch of the operation follows the list below).
The main problems of earlier CNNs can be summarized as:
   (1) Up-sampling / pooling layers discard spatial resolution;
   (2) the internal data structure is lost, together with the spatial hierarchy of the information;
   (3) information about small objects cannot be reconstructed (with four pooling layers, any object smaller than 2^4 = 16 pixels is theoretically unrecoverable).
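
A minimal sketch of the receptive-field effect (my own illustration, not from the papers below): with dilation d, a 3×3 kernel covers a (2d+1)×(2d+1) window, so the receptive field grows without any pooling, and setting padding = dilation preserves the spatial size.

import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# Standard 3x3 convolution: 3x3 receptive field, padding=1 keeps the size.
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)
# Atrous 3x3 convolution with dilation=6: 13x13 receptive field, padding=6 keeps the size.
atrous = nn.Conv2d(64, 64, kernel_size=3, padding=6, dilation=6)

print(conv(x).shape)    # torch.Size([1, 64, 32, 32])
print(atrous(x).shape)  # torch.Size([1, 64, 32, 32])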

[DeepLabV2] Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs 2016

https://arxiv.org/abs/1606.00915

DeepLabV2 combines atrous convolution for dense feature extraction, atrous spatial pyramid pooling (ASPP) for multi-scale context, and a fully connected CRF to refine object boundaries.

[ASPP] atrous spatial pyramid pooling

ASPP probes the incoming feature map with several parallel atrous convolutions at different dilation rates and fuses the responses, capturing objects and image context at multiple scales.

[DeepLabV3] Rethinking Atrous Convolution for Semantic Image Segmentation 2017-06

https://arxiv.org/abs/1706.05587
DeepLabV3 drops the CRF post-processing, augments ASPP with batch normalization and an image-level pooling branch, and also studies cascaded atrous blocks.

[DeepLabV3+] Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation 2018-02

https://arxiv.org/abs/1802.02611v1

Define the output stride as the factor by which a feature map is downsampled relative to the input image; for example, a 512×512 input reduced to a 32×32 feature map has output stride 16.
Encoder

  1. The DCNN backbone can be ResNet-101 or Xception;
    its low-level features have output stride = 4 and 256/128 channels (ResNet-101/Xception);
    its output tensor has output stride = 8/16 and 2048 channels.
  2. ASPP passes this tensor in parallel through four convolutional branches and one global average pooling branch;
    every branch preserves the spatial size and outputs 256 channels;
    the five outputs are concatenated and fused by a 1×1 convolution, giving a 256-channel tensor with output stride = 8/16.

Decoder

  1. The low-level features go through a 1×1 convolution for feature fusion; the size is unchanged and the channels are reduced to 48 (determined experimentally);
  2. the ASPP output tensor has output stride = 8/16 and must be upsampled by 2×/4×;
  3. the two tensors now have the same size (output stride = 4) and are concatenated into a new tensor;
  4. this tensor passes through convolutions and is upsampled to give the final result (a shape walkthrough follows this list).
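
A shape walkthrough for one concrete configuration (my own sketch, assuming a ResNet-101 backbone, output stride 16, and a 512×512 input):

# Encoder
# input:                          (N, 3, 512, 512)
# low-level features:             (N, 256, 128, 128)   output stride 4
# backbone output:                (N, 2048, 32, 32)    output stride 16
# ASPP output:                    (N, 256, 32, 32)     output stride 16
# Decoder
# 1x1 conv on low-level features: (N, 48, 128, 128)
# upsample ASPP output by 4x:     (N, 256, 128, 128)
# concat:                         (N, 304, 128, 128)
# 3x3 convs + 1x1 classifier:     (N, num_classes, 128, 128)
# upsample by 4x:                 (N, num_classes, 512, 512)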

code

ASPP
The ASPP module uses atrous convolutions with different dilation rates to extract features from the feature map at multiple scales. The padding values used in the code below follow from the output-size formula:

$O = \frac{I + 2p - d(k-1) - 1}{s} + 1$

where $I$ is the input feature map size, $s$ the stride, $p$ the padding, $d$ the dilation, and $O$ the output feature map size.

Let $O = I$ and $k = 3$:
$s(I-1) = I + 2p - 2d - 1$
$(s-1)(I-1) = 2p - 2d$
To guarantee $O = I$, the obvious choice is $s = 1$ and $p = d$ (any value of $d$).

Let $O = I$ and $k = 1$:
$s(I-1) = I + 2p - 1$
$(s-1)(I-1) = 2p$
To guarantee $O = I$, the obvious choice is $s = 1$ and $p = 0$ ($d$ can be arbitrary, since it has no effect when $k = 1$).
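
A quick numeric check of these settings (my own sketch, not part of the referenced code):

import torch
import torch.nn as nn

x = torch.randn(1, 8, 64, 64)  # I = 64

# k = 3, s = 1, p = d = 12: O = (64 + 24 - 12*(3-1) - 1)/1 + 1 = 64
print(nn.Conv2d(8, 8, 3, stride=1, padding=12, dilation=12)(x).shape)  # (1, 8, 64, 64)

# k = 1, s = 1, p = 0: O = (64 - 1)/1 + 1 = 64
print(nn.Conv2d(8, 8, 1, stride=1, padding=0)(x).shape)               # (1, 8, 64, 64)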

import torch
import torch.nn as nn
import torch.nn.functional as F
from .utils import initialize_weights
def assp_branch(in_channels, out_channels, kernel_size, dilation):
    # padding and dilation follow the derivation above:
    # k = 1 -> padding = 0; k = 3 -> padding = dilation, preserving the spatial size
    padding = 0 if kernel_size == 1 else dilation
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size, padding=padding, dilation=dilation, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True))


class ImagePooling(nn.Module):
    def __init__(self, in_channels):
        super(ImagePooling, self).__init__()

        self.avg_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Conv2d(in_channels, 256, 1, bias=False),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True))

    def forward(self, x):
        # Global context branch: pool to 1x1, project to 256 channels, then
        # bilinearly upsample back to the spatial size of x.
        return F.interpolate(self.avg_pool(x), size=(x.size(2), x.size(3)), mode='bilinear', align_corners=True)


class ASSP(nn.Module):
    def __init__(self, in_channels, output_stride):
        super(ASSP, self).__init__()

        assert output_stride in [8, 16], 'Only output strides of 8 or 16 are supported'
        if output_stride == 16:
            dilations = [1, 6, 12, 18]
        else:  # output_stride == 8
            dilations = [1, 12, 24, 36]

        self.aspp1 = assp_branch(in_channels, 256, 1, dilation=dilations[0])
        self.aspp2 = assp_branch(in_channels, 256, 3, dilation=dilations[1])
        self.aspp3 = assp_branch(in_channels, 256, 3, dilation=dilations[2])
        self.aspp4 = assp_branch(in_channels, 256, 3, dilation=dilations[3])
        self.aspp5 = ImagePooling(in_channels)

        self.conv1 = nn.Conv2d(256 * 5, 256, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(256)
        self.relu = nn.ReLU(inplace=True)
        self.dropout = nn.Dropout(0.5)

        initialize_weights(self)

    def forward(self, x):
        x1 = self.aspp1(x)
        x2 = self.aspp2(x)
        x3 = self.aspp3(x)
        x4 = self.aspp4(x)
        x5 = self.aspp5(x)

        x = self.conv1(torch.cat((x1, x2, x3, x4, x5), dim=1))
        x = self.bn1(x)
        x = self.dropout(self.relu(x))

        return x
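
A quick shape check for the module above (my own sketch; initialize_weights comes from the repo's .utils):

aspp = ASSP(in_channels=2048, output_stride=16)
x = torch.randn(2, 2048, 32, 32)  # backbone output at output stride 16
print(aspp(x).shape)              # torch.Size([2, 256, 32, 32])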

Decoder code

class Decoder(nn.Module):
    def __init__(self, low_level_channels, num_classes):
        super(Decoder, self).__init__()

        self.low_level_conv = nn.Sequential(
            nn.Conv2d(low_level_channels, 48, 1, bias=False),
            nn.BatchNorm2d(48),
            nn.ReLU(inplace=True),
        )

        # Table 2, best performance with two 3x3 convs
        self.output = nn.Sequential(
            nn.Conv2d(48 + 256, 256, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(inplace=True),
            nn.Conv2d(256, 256, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(inplace=True),
            nn.Dropout(0.1),
            nn.Conv2d(256, num_classes, 1, stride=1),
        )
        initialize_weights(self)

    def forward(self, x, low_level_features, H, W):
        # reduce low_level_features to 48 channels
        low_level_features = self.low_level_conv(low_level_features)
        # upsample by 2x/4x so the size matches the low-level features (output stride 4)
        x = F.interpolate(x, size=(low_level_features.size(2), low_level_features.size(3)),
                          mode='bilinear',
                          align_corners=True)
        # concat
        x = torch.cat((x, low_level_features), dim=1)
        # convolutions producing the per-class logits
        x = self.output(x)
        # upsample by 4x back to the original size; passing H and W guarantees the input
        # size is recovered even when it is not an exact multiple of the network stride
        x = F.interpolate(x, size=(H, W), mode='bilinear', align_corners=True)

        return x
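
A quick shape check (my own sketch, assuming 256 low-level channels as with ResNet-101 and 21 classes as in PASCAL VOC):

decoder = Decoder(low_level_channels=256, num_classes=21)
aspp_out = torch.randn(2, 256, 32, 32)       # ASPP output, output stride 16
low_level = torch.randn(2, 256, 128, 128)    # low-level features, output stride 4
print(decoder(aspp_out, low_level, 512, 512).shape)  # torch.Size([2, 21, 512, 512])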

The full DeepLabV3+ network

from itertools import chain

import torch
import torch.nn as nn
import torch.nn.functional as F

from .models import getBackBone
from .models.aspp import ASSP
from .models.decoder import Decoder


class DeepLabV3Plus(nn.Module):
    def __init__(self, num_classes, in_channels=3, backbone='xception', pretrained=True,
                 output_stride=16, freeze_bn=False, **_):
        super(DeepLabV3Plus, self).__init__()
        assert 'xception' in backbone or 'resnet' in backbone, 'Only xception and resnet backbones are supported'
        self.backbone, low_level_channels = getBackBone(backbone, in_channels=in_channels, output_stride=output_stride,
                                                        pretrained=pretrained)

        self.assp = ASSP(in_channels=2048, output_stride=output_stride)

        self.decoder = Decoder(low_level_channels, num_classes)

        if freeze_bn:
            self.freeze_bn()

    def forward(self, x):
        H, W = x.size(2), x.size(3)
        x, low_level_features = self.backbone(x)
        x = self.assp(x)
        x = self.decoder(x, low_level_features, H, W)
        return x

    # Two functions to yield the parameters of the backbone
    # & Decoder / ASSP, to allow using different learning rates
    # FIXME: in xception, we use the parameters from xception and not aligned xception
    # better to have higher lr for this backbone

    def get_backbone_params(self):
        return self.backbone.parameters()

    def get_decoder_params(self):
        return chain(self.assp.parameters(), self.decoder.parameters())

    def freeze_bn(self):
        for module in self.modules():
            if isinstance(module, nn.BatchNorm2d): module.eval()
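
A minimal usage sketch (assuming the repo's getBackBone, ASSP, and Decoder are importable as above):

model = DeepLabV3Plus(num_classes=21, backbone='resnet101', pretrained=False, output_stride=16)
x = torch.randn(1, 3, 512, 512)
print(model(x).shape)  # torch.Size([1, 21, 512, 512])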

References
[detailed walkthrough] https://blog.csdn.net/u011974639/article/details/79518175
[code] https://github.com/ArronHZG/segmentation-pytorch/tree/master/torch_model/net/DeepLabV3Plus

[DenseASPP] DenseASPP for Semantic Segmentation in Street Scenes 2018

http://openaccess.thecvf.com/content_cvpr_2018/papers/Yang_DenseASPP_for_Semantic_CVPR_2018_paper.pdf


DenseASPP combines the strengths of ASPP and DenseNet.
ASPP is a parallel structure: different atrous convolutions produce different receptive fields, covering both macro and micro semantic information.
Cascaded structures such as encoder-decoder, by contrast, naturally obtain a larger receptive field but are less accurate at delineating object boundaries.
DenseASPP combines the advantages of both structures, achieving a category-level mIoU (IoU cat) of 90.7 on the Cityscapes dataset.
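
A minimal sketch of the dense connectivity (my own illustration, not the paper's reference code; the paper also places 1×1 bottleneck convolutions before each dilated 3×3, omitted here for brevity): each dilated layer receives the concatenation of the input and all earlier outputs, so receptive fields of many scales are composed densely.

import torch
import torch.nn as nn

class DenseASPPSketch(nn.Module):
    def __init__(self, in_channels, mid_channels=64):
        super().__init__()
        self.branches = nn.ModuleList()
        channels = in_channels
        # Dilation rates as in the paper: 3, 6, 12, 18, 24.
        for d in [3, 6, 12, 18, 24]:
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, mid_channels, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(mid_channels),
                nn.ReLU(inplace=True)))
            channels += mid_channels  # each layer sees all previous outputs

    def forward(self, x):
        features = [x]
        for branch in self.branches:
            # Dense connection: concatenate the input with every earlier output.
            features.append(branch(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)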
