paper: Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs
For the first problem (the loss of spatial resolution caused by repeated pooling and downsampling in DCNNs), the authors propose atrous (dilated) convolution, which enlarges the receptive field of a convolution kernel without increasing the number of parameters. In addition, the pooling layers in the last few stages are removed to avoid the loss of spatial information that pooling causes.
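As a quick PyTorch illustration (our own minimal sketch, not from the paper): a 3x3 convolution with dilation 2 covers a 5x5 input region while keeping the same 9 weights.

import torch.nn as nn

# Both layers have 3x3 = 9 weights, but the dilated one "sees" a 5x5
# region: effective kernel size = k + (k - 1) * (d - 1) = 3 + 2 * 1 = 5.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
atrous = nn.Conv2d(1, 1, kernel_size=3, padding=2, dilation=2, bias=False)
print(sum(p.numel() for p in conv.parameters()))    # 9
print(sum(p.numel() for p in atrous.parameters()))  # 9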
The second problem is caused by the spatial invariance inherent in classification networks, which limits the spatial accuracy a DCNN can achieve. To address it, the authors introduce a Conditional Random Field (CRF). In traditional image processing, CRFs are mainly used for smoothing; here the goal is to recover local detail rather than to smooth the image further, so the authors adopt a fully connected CRF: each pixel is a node, the relationship between each pair of pixels is an edge, and together they form a conditional random field. A pairwise potential describes the relationship between pixels, encouraging similar pixels to take the same label and very different pixels to take different labels, where this "distance" is defined in terms of both color values and actual spatial distance. As a result, the CRF markedly improves segmentation quality along object boundaries.
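For reference, the energy that the fully connected CRF minimizes in the paper is

$$E(\mathbf{x}) = \sum_i \theta_i(x_i) + \sum_{ij} \theta_{ij}(x_i, x_j)$$

where the unary potential is $\theta_i(x_i) = -\log P(x_i)$, with $P(x_i)$ the label probability output by the DCNN at pixel $i$, and the pairwise potential is

$$\theta_{ij}(x_i, x_j) = \mu(x_i, x_j)\left[w_1 \exp\!\left(-\frac{\|p_i - p_j\|^2}{2\sigma_\alpha^2} - \frac{\|I_i - I_j\|^2}{2\sigma_\beta^2}\right) + w_2 \exp\!\left(-\frac{\|p_i - p_j\|^2}{2\sigma_\gamma^2}\right)\right]$$

with $\mu(x_i, x_j) = 1$ if $x_i \neq x_j$ and $0$ otherwise. Here $p$ denotes pixel positions and $I$ pixel color vectors: the first (bilateral) kernel encodes exactly the color-plus-position "distance" described above, and the second kernel encourages smoothness.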
paper: DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
The structure of ASPP is as follows.
ASPP applies atrous convolutions with different dilation rates in parallel to obtain receptive fields of different sizes, thereby capturing information about objects at multiple scales simultaneously, i.e., letting the model see both large and small objects clearly.
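A minimal sketch of this idea (the parallel dilation rates 6/12/18/24 and the fusion of the per-branch score maps by summation follow the DeepLab v2 ASPP-L variant; everything else is simplified):

import torch
import torch.nn as nn

class ASPPv2(nn.Module):
    """Simplified DeepLab v2 ASPP: parallel 3x3 atrous convolutions
    whose per-branch score maps are fused by summation."""

    def __init__(self, in_channels, num_classes, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_channels, num_classes, 3, padding=r, dilation=r)
            for r in rates)

    def forward(self, x):
        return sum(branch(x) for branch in self.branches)

print(ASPPv2(2048, 21)(torch.randn(1, 2048, 60, 60)).shape)  # (1, 21, 60, 60)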
paper: Rethinking Atrous Convolution for Semantic Image Segmentation
Through experiments the authors found that adding BN layers inside ASPP improves performance. They also observed that as the dilation rate increases, the number of effective filter weights shrinks. Consider an extreme case: on a 65x65 feature map, a 3x3 convolution with dilation rate 63 degenerates into a 1x1 convolution, i.e., only the center weight of the 3x3 kernel remains effective. To fix this, the authors inject global information into ASPP by adding a global average pooling (GAP) branch: after GAP, a 1x1 convolution is applied and the result is bilinearly upsampled back to the original resolution. The dilation rates of the other branches are also adjusted: ASPP now contains one 1x1 convolution and three 3x3 atrous convolutions, whose dilation rates are 6, 12, and 18 when output_stride=16. The outputs of these five branches are concatenated along the channel dimension and passed through a 1x1 convolution to produce the final ASPP output.
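The degeneration is easy to verify numerically (a standalone sketch, not taken from the paper or MMSegmentation):

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1, 1, 65, 65)

# 3x3 conv with dilation 63: on a 65x65 map, almost every non-center
# tap falls into the zero padding.
conv3 = nn.Conv2d(1, 1, 3, dilation=63, padding=63, bias=False)

# 1x1 conv reusing only the center weight of conv3.
conv1 = nn.Conv2d(1, 1, 1, bias=False)
with torch.no_grad():
    conv1.weight.copy_(conv3.weight[:, :, 1:2, 1:2])

# Outputs are identical except within 2 pixels of the border, where a
# corner tap can still reach a valid input pixel.
print(torch.allclose(conv3(x)[..., 2:63, 2:63], conv1(x)[..., 2:63, 2:63]))  # True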
The resulting ASPP structure is shown in the figure below.
Below we go through the implementation details, taking the DeepLab v3 implementation in MMSegmentation as an example.
Assume batch_size=4 and an input of shape (4, 3, 480, 480). The backbone is ResNet-50 with output_stride=8.
import torch
import torch.nn as nn
from mmcv.cnn import ConvModule

from mmseg.ops import resize
from ..builder import HEADS
from .decode_head import BaseDecodeHead


class ASPPModule(nn.ModuleList):
"""Atrous Spatial Pyramid Pooling (ASPP) Module.
Args:
dilations (tuple[int]): Dilation rate of each layer.
in_channels (int): Input channels.
channels (int): Channels after modules, before conv_seg.
conv_cfg (dict|None): Config of conv layers.
norm_cfg (dict|None): Config of norm layers.
act_cfg (dict): Config of activation layers.
"""
def __init__(self, dilations, in_channels, channels, conv_cfg, norm_cfg,
act_cfg):
super(ASPPModule, self).__init__()
self.dilations = dilations
self.in_channels = in_channels
self.channels = channels
self.conv_cfg = conv_cfg
self.norm_cfg = norm_cfg
self.act_cfg = act_cfg
for dilation in dilations:
self.append(
ConvModule(
self.in_channels,
self.channels,
1 if dilation == 1 else 3,
dilation=dilation,
padding=0 if dilation == 1 else dilation,
conv_cfg=self.conv_cfg,
norm_cfg=self.norm_cfg,
act_cfg=self.act_cfg))
def forward(self, x):
"""Forward function."""
aspp_outs = []
for aspp_module in self:
aspp_outs.append(aspp_module(x))
return aspp_outs
@HEADS.register_module()
class ASPPHead(BaseDecodeHead):
"""Rethinking Atrous Convolution for Semantic Image Segmentation.
    This head is the implementation of `DeepLabV3
    <https://arxiv.org/abs/1706.05587>`_.
Args:
dilations (tuple[int]): Dilation rates for ASPP module.
Default: (1, 6, 12, 18).
"""
def __init__(self, dilations=(1, 6, 12, 18), **kwargs):
super(ASPPHead, self).__init__(**kwargs)
assert isinstance(dilations, (list, tuple))
self.dilations = dilations
self.image_pool = nn.Sequential(
nn.AdaptiveAvgPool2d(1),
ConvModule(
self.in_channels,
self.channels,
1,
conv_cfg=self.conv_cfg,
norm_cfg=self.norm_cfg,
act_cfg=self.act_cfg))
self.aspp_modules = ASPPModule(
dilations,
self.in_channels,
self.channels,
conv_cfg=self.conv_cfg,
norm_cfg=self.norm_cfg,
act_cfg=self.act_cfg)
self.bottleneck = ConvModule(
(len(dilations) + 1) * self.channels,
self.channels,
3,
padding=1,
conv_cfg=self.conv_cfg,
norm_cfg=self.norm_cfg,
act_cfg=self.act_cfg)
def _forward_feature(self, inputs):
"""Forward function for feature maps before classifying each pixel with
``self.cls_seg`` fc.
Args:
inputs (list[Tensor]): List of multi-level img features.
Returns:
feats (Tensor): A tensor of shape (batch_size, self.channels,
H, W) which is feature map for last layer of decoder head.
"""
x = self._transform_inputs(inputs) # (4,2048,60,60)
aspp_outs = [
resize(
self.image_pool(x),
size=x.size()[2:],
mode='bilinear',
align_corners=self.align_corners)
]
# (4,2048,60,60)->(4,2048,1,1)->(4,512,1,1)->(4,512,60,60)
aspp_outs.extend(self.aspp_modules(x)) # [(4,512,60,60),(4,512,60,60),(4,512,60,60),(4,512,60,60),(4,512,60,60)]
aspp_outs = torch.cat(aspp_outs, dim=1) # (4,2560,60,60)
feats = self.bottleneck(aspp_outs) # (4,512,60,60)
return feats
def forward(self, inputs):
"""Forward function."""
output = self._forward_feature(inputs) # (4,512,60,60)
output = self.cls_seg(output) # (4,2,60,60)
return output
The decode_head structure of the built model, as printed by MMSegmentation, is:
(decode_head): ASPPHead(
input_transform=None, ignore_index=255, align_corners=False
(loss_decode): CrossEntropyLoss(avg_non_ignore=False)
(conv_seg): Conv2d(512, 2, kernel_size=(1, 1), stride=(1, 1))
(dropout): Dropout2d(p=0.1, inplace=False)
(image_pool): Sequential(
(0): AdaptiveAvgPool2d(output_size=1)
(1): ConvModule(
(conv): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): _BatchNormXd(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
)
(aspp_modules): ASPPModule(
(0): ConvModule(
(conv): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): _BatchNormXd(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(1): ConvModule(
(conv): Conv2d(2048, 512, kernel_size=(3, 3), stride=(1, 1), padding=(12, 12), dilation=(12, 12), bias=False)
(bn): _BatchNormXd(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(2): ConvModule(
(conv): Conv2d(2048, 512, kernel_size=(3, 3), stride=(1, 1), padding=(24, 24), dilation=(24, 24), bias=False)
(bn): _BatchNormXd(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(3): ConvModule(
(conv): Conv2d(2048, 512, kernel_size=(3, 3), stride=(1, 1), padding=(36, 36), dilation=(36, 36), bias=False)
(bn): _BatchNormXd(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
)
(bottleneck): ConvModule(
(conv): Conv2d(2560, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): _BatchNormXd(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
)
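To sanity-check the shapes annotated above, the head can be run standalone (a sketch assuming an MMSegmentation 0.x install; the dilation rates are doubled to (1, 12, 24, 36) to match output_stride=8, as in the printed structure):

import torch
from mmseg.models.decode_heads import ASPPHead

head = ASPPHead(
    in_channels=2048, channels=512, num_classes=2,
    dilations=(1, 12, 24, 36), norm_cfg=dict(type='BN'))
x = torch.randn(4, 2048, 60, 60)  # ResNet-50 layer4 output at output_stride=8
# input_transform=None, so _transform_inputs picks inputs[self.in_index] (-1)
print(head([x]).shape)  # torch.Size([4, 2, 60, 60])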
paper: Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
To capture multi-scale context, DeepLab v3 applies several atrous convolutions with different dilation rates in parallel (ASPP), while PSPNet pools at several grid scales. Although the last feature map encodes rich semantic information, the pooling and strided (stride>1) convolutions in the backbone destroy detail along object boundaries. This can be alleviated by using atrous convolution to extract denser feature maps, but given today's complex network designs and limited GPU resources, producing large feature maps at output_stride=8 or 4 is computationally too expensive. Take ResNet-101 as an example: to output feature maps at output_stride=16, atrous convolution must be applied to the last 3 residual blocks (9 layers); for output_stride=8, 26 residual blocks (78 layers) are affected. Extracting denser output features with this kind of model is therefore very costly.
The encoder-decoder structure, by contrast, is fast, because no atrous convolution is needed to keep feature maps at high resolution: the encoder progressively reduces the feature map resolution to extract richer semantic information, and the decoder progressively restores the resolution to recover detailed object boundaries.
The authors therefore propose to combine the strengths of both approaches by enriching the encoder module of an encoder-decoder structure with multi-scale contextual information. The proposed model structure is shown in figure (c) below.
In addition, inspired by depthwise separable convolution, the authors adopt the Xception model and apply atrous separable convolution in both the ASPP module and the decoder, improving both accuracy and speed.
Atrous depthwise convolution is illustrated below.
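A minimal PyTorch sketch of an atrous separable convolution (a 3x3 depthwise convolution with dilation, with BN and ReLU after the depthwise step as in the modified Xception, followed by a 1x1 pointwise convolution); the helper name is ours:

import torch.nn as nn

def atrous_separable_conv(in_ch, out_ch, dilation):
    """3x3 atrous depthwise conv + BN/ReLU, then 1x1 pointwise conv."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=dilation, dilation=dilation,
                  groups=in_ch, bias=False),  # depthwise, atrous
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),  # pointwise
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))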
The last feature map before the logits in DeepLab v3 serves as the output of the encoder module in the new encoder-decoder structure; this encoder output has 256 channels and rich semantic information.
In DeepLab v3, the ASPP output features have output_stride=16 and are bilinearly upsampled by a factor of 16 back to the original input size. This upsampling can be viewed as a naive decoder module, but such a simple decoder may fail to recover fine details, so the authors propose a simple yet more effective decoder module, shown in the figure below.
The encoder output features are first bilinearly upsampled by a factor of 4 and then concatenated with the low-level backbone features of the same resolution (e.g., Conv2 before striding in ResNet-101). Before concatenation, a 1x1 convolution reduces the channel count of the low-level feature map: the low-level features typically have many channels (256 or 512) while the encoder output has 256, and concatenating without reduction would let the former dominate and make training harder. After concatenation, 3x3 convolutions refine the features, followed by another bilinear 4x upsampling (this corresponds to the forward pass of DepthwiseSeparableASPPHead shown below).
Xception achieves strong classification results on ImageNet while being fast to compute. The MSRA team improved it into Aligned Xception, further pushing performance on object detection. The authors adapt it once more for semantic segmentation, specifically: (1) a deeper network; (2) all max pooling operations are replaced with depthwise separable convolutions with stride>1 (atrous depthwise separable convolutions); (3) similar to MobileNet, BN and ReLU are added after every 3x3 depthwise convolution.
Below we go through the implementation details, taking the DeepLab v3+ implementation in MMSegmentation as an example.
Assume batch_size=4 and an input of shape (4, 3, 480, 480). The backbone is ResNet-50 with output_stride=8.
The backbone is the same as in the DeepLab v3 example above.
# Copyright (c) OpenMMLab. All rights reserved.
import torch
import torch.nn as nn
from mmcv.cnn import ConvModule, DepthwiseSeparableConvModule
from mmseg.ops import resize
from ..builder import HEADS
from .aspp_head import ASPPHead, ASPPModule
class DepthwiseSeparableASPPModule(ASPPModule):
"""Atrous Spatial Pyramid Pooling (ASPP) Module with depthwise separable
conv."""
def __init__(self, **kwargs):
super(DepthwiseSeparableASPPModule, self).__init__(**kwargs)
for i, dilation in enumerate(self.dilations):
if dilation > 1:
self[i] = DepthwiseSeparableConvModule(
self.in_channels,
self.channels,
3,
dilation=dilation,
padding=dilation,
norm_cfg=self.norm_cfg,
act_cfg=self.act_cfg)
@HEADS.register_module()
class DepthwiseSeparableASPPHead(ASPPHead):
"""Encoder-Decoder with Atrous Separable Convolution for Semantic Image
Segmentation.
    This head is the implementation of `DeepLabV3+
    <https://arxiv.org/abs/1802.02611>`_.
Args:
        c1_in_channels (int): The input channels of c1 decoder. If it is 0,
            no c1 decoder will be used.
c1_channels (int): The intermediate channels of c1 decoder.
"""
def __init__(self, c1_in_channels, c1_channels, **kwargs):
super(DepthwiseSeparableASPPHead, self).__init__(**kwargs)
assert c1_in_channels >= 0
self.aspp_modules = DepthwiseSeparableASPPModule(
dilations=self.dilations,
in_channels=self.in_channels,
channels=self.channels,
conv_cfg=self.conv_cfg,
norm_cfg=self.norm_cfg,
act_cfg=self.act_cfg)
if c1_in_channels > 0:
self.c1_bottleneck = ConvModule(
c1_in_channels,
c1_channels,
1,
conv_cfg=self.conv_cfg,
norm_cfg=self.norm_cfg,
act_cfg=self.act_cfg)
else:
self.c1_bottleneck = None
self.sep_bottleneck = nn.Sequential(
DepthwiseSeparableConvModule(
self.channels + c1_channels,
self.channels,
3,
padding=1,
norm_cfg=self.norm_cfg,
act_cfg=self.act_cfg),
DepthwiseSeparableConvModule(
self.channels,
self.channels,
3,
padding=1,
norm_cfg=self.norm_cfg,
act_cfg=self.act_cfg))
def forward(self, inputs):
"""Forward function."""
        x = self._transform_inputs(inputs)  # (4,2048,60,60)
aspp_outs = [
resize(
self.image_pool(x),
size=x.size()[2:],
mode='bilinear',
align_corners=self.align_corners)
]
aspp_outs.extend(self.aspp_modules(x)) # [(4,512,60,60),(4,512,60,60),(4,512,60,60),(4,512,60,60),(4,512,60,60)]
aspp_outs = torch.cat(aspp_outs, dim=1) # (4,2560,60,60)
output = self.bottleneck(aspp_outs) # (4,512,60,60)
if self.c1_bottleneck is not None:
c1_output = self.c1_bottleneck(inputs[0]) # (4,256,120,120) -> (4,48,120,120)
output = resize(
input=output,
size=c1_output.shape[2:],
mode='bilinear',
align_corners=self.align_corners) # (4,512,120,120)
output = torch.cat([output, c1_output], dim=1) # (4,560,120,120)
output = self.sep_bottleneck(output) # (4,512,120,120)
output = self.cls_seg(output) # (4,2,120,120)
return output
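As before, the shapes can be checked by running the head standalone (a sketch under the same MMSegmentation 0.x assumption):

import torch
from mmseg.models.decode_heads import DepthwiseSeparableASPPHead

head = DepthwiseSeparableASPPHead(
    c1_in_channels=256, c1_channels=48,
    in_channels=2048, channels=512, num_classes=2,
    dilations=(1, 12, 24, 36), norm_cfg=dict(type='BN'))
c1 = torch.randn(4, 256, 120, 120)  # ResNet-50 layer1 output, stride 4
c4 = torch.randn(4, 2048, 60, 60)   # ResNet-50 layer4 output, output_stride=8
# inputs[0] feeds the c1 decoder; in_index=-1 selects c4 for ASPP
print(head([c1, c4]).shape)  # torch.Size([4, 2, 120, 120])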