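Important note: this post is compiled from material found online and only records the author's own process of learning the topic. It will be taken down on request if it infringes any rights.
Classic semantic segmentation network models
Original paper: [1]
A Brief Analysis of the DeepLabV1 Network
bilibili video walkthrough: Introduction to the DeepLabV1 Network (Semantic Segmentation)
DeepLab v1 adds multi-scale features and is an upgraded version of LargeFOV.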
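For semantic segmentation, the paper targets two problems: reduced resolution caused by signal downsampling, and spatial "insensitivity".
Reduced resolution from signal downsampling. The authors attribute this mainly to max-pooling, and address it by introducing the 'atrous' (with holes) algorithm, i.e. atrous convolution, also called dilated convolution.
Spatial "insensitivity". The authors attribute this to the classifier itself, since a classifier is inherently invariant to spatial transformations to some degree. To address it they use a fully connected CRF (Conditional Random Field). This method appears only in DeepLab v1 and v2 and is dropped from v3 onward; it is also fairly time-consuming.
The backbone of DeepLab v1 is VGG-16.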
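The atrous convolution mentioned above can be illustrated with a minimal PyTorch sketch. It is not code from the paper; the tensor sizes and the helper name effective_kernel_size are chosen here purely for illustration. It shows that a dilated 3x3 convolution keeps the spatial resolution and the parameter count of an ordinary 3x3 convolution while covering a larger window.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)  # illustrative input, not from the paper

# Ordinary 3x3 convolution: each output pixel sees a 3x3 window.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)

# Atrous 3x3 convolution with dilation 2: the same 9 weights are spread
# over a 5x5 window. For a 3x3 kernel, padding = dilation keeps the size.
atrous = nn.Conv2d(1, 1, kernel_size=3, padding=2, dilation=2)


def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Side length of the window covered by a dilated kernel (hypothetical helper)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


out_conv = conv(x)
out_atrous = atrous(x)
n_params_conv = sum(p.numel() for p in conv.parameters())
n_params_atrous = sum(p.numel() for p in atrous.parameters())

print(tuple(out_conv.shape), tuple(out_atrous.shape))
print(n_params_conv, n_params_atrous)
print(effective_kernel_size(3, 2), effective_kernel_size(3, 6))
```

Because no stride or pooling is involved, the receptive field grows without any further loss of resolution, which is the property the authors rely on.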
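Original paper: [2]
A Brief Analysis of the DeepLabV2 Network
Interpreting DeepLab v2
bilibili video walkthrough: Introduction to the DeepLabV2 Network (Semantic Segmentation)
DeepLab v2 adds the ASPP (Atrous Spatial Pyramid Pooling) module: four parallel branches of dilated convolution, each using a different dilation rate. The dilated convolution layers here are not followed by BatchNorm and use a bias term. The outputs of the four branches are then fused by element-wise addition.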
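In the introduction, the authors list the problems that DCNNs run into when applied to semantic segmentation, among them reduced feature resolution (caused by layers with stride > 1). The remedy for that problem is to remove the downsampling and use dilated convolution instead.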
The backbone of DeepLab v2 is ResNet-101.
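As shown in the figure below, the DeepLab v2 pipeline is similar to that of v1: Input -> CNN feature extraction -> coarse segmentation map (1/8 of the original size) -> bilinear interpolation back to the original size -> CRF post-processing -> final Output.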
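The bilinear-interpolation step of this pipeline can be sketched as follows. This is only an illustration under stated assumptions: the coarse score map is a random stand-in for the network output, upsample_scores is a hypothetical helper name, and the CRF post-processing step is left out entirely.

```python
import torch
import torch.nn.functional as F


def upsample_scores(scores: torch.Tensor, size) -> torch.Tensor:
    """Bilinearly resize per-class score maps to the given (H, W) (hypothetical helper)."""
    return F.interpolate(scores, size=size, mode="bilinear", align_corners=False)


image = torch.randn(1, 3, 224, 224)   # illustrative input image
coarse = torch.randn(1, 21, 28, 28)   # stand-in for the 1/8-resolution network output

full = upsample_scores(coarse, image.shape[-2:])  # back to 224x224
pred = full.argmax(dim=1)                         # per-pixel class index, before any CRF

print(tuple(full.shape), tuple(pred.shape))
```

In the pipeline described above, the CRF would then refine the upsampled scores; here argmax is taken directly just to show the shapes involved.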
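Here ResNet-101 is used as the example backbone. In the original ResNet, Bottleneck1 of Layer3 downsamples (its 3x3 convolution has stride=2), but DeepLab v2 sets that stride to 1, so no further downsampling takes place, and every 3x3 convolution in Layer3 becomes a dilated convolution with dilation rate 2. Layer4 is handled the same way: downsampling is removed and every 3x3 convolution becomes a dilated convolution with dilation rate 4. Finally, note the ASPP module: with ResNet-101 as the backbone, each branch has only a single 3x3 dilated convolution layer, and its number of kernels equals num_classes.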
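A minimal sketch of the ASPP head just described for the ResNet-101 backbone might look like this. The class name ASPPv2 is hypothetical, the dilation rates (6, 12, 18, 24) are an assumption carried over from the VGG-16 listing in this post, and in_channels=2048 assumes the standard ResNet-101 Layer4 output width.

```python
import torch
import torch.nn as nn


class ASPPv2(nn.Module):
    """ASPP head with one 3x3 dilated conv per branch (illustrative sketch)."""

    def __init__(self, in_channels: int, num_classes: int, rates=(6, 12, 18, 24)):
        super().__init__()
        # Each branch maps directly to num_classes channels; no BatchNorm, bias enabled.
        # padding = dilation keeps the spatial size for a 3x3 kernel.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, num_classes, kernel_size=3,
                      padding=r, dilation=r, bias=True)
            for r in rates
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Fuse the branches by element-wise addition.
        return sum(branch(x) for branch in self.branches)


aspp = ASPPv2(in_channels=2048, num_classes=21)
feat = torch.randn(1, 2048, 28, 28)  # stand-in for the Layer4 output at 1/8 resolution
scores = aspp(feat)
print(tuple(scores.shape))
```

Compared with the VGG-16 variant shown further down, the two 1x1 convolutions per branch are gone, so each branch is a single layer.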
Here VGG-16 is used as the example backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ASPP(nn.Module):
    # Four parallel branches with dilation rates 6, 12, 18 and 24.
    # No BatchNorm is used and every convolution has a bias.
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_channels=in_channels, out_channels=128, kernel_size=3, stride=1, padding=6, dilation=6, bias=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=1, stride=1, padding=0, bias=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=128, out_channels=num_classes, kernel_size=1, stride=1, padding=0, bias=True),
        )
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels=in_channels, out_channels=128, kernel_size=3, stride=1, padding=12, dilation=12, bias=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=1, stride=1, padding=0, bias=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=128, out_channels=num_classes, kernel_size=1, stride=1, padding=0, bias=True),
        )
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels=in_channels, out_channels=128, kernel_size=3, stride=1, padding=18, dilation=18, bias=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=1, stride=1, padding=0, bias=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=128, out_channels=num_classes, kernel_size=1, stride=1, padding=0, bias=True),
        )
        self.branch4 = nn.Sequential(
            nn.Conv2d(in_channels=in_channels, out_channels=128, kernel_size=3, stride=1, padding=24, dilation=24, bias=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=1, stride=1, padding=0, bias=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=128, out_channels=num_classes, kernel_size=1, stride=1, padding=0, bias=True),
        )

    def forward(self, x):
        # Fuse the four branches by element-wise addition.
        return self.branch1(x) + self.branch2(x) + self.branch3(x) + self.branch4(x)


class DeepLabv2(nn.Module):
    def __init__(self, in_channels: int = 3, num_classes: int = 21):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels=in_channels, out_channels=64, kernel_size=3, stride=1, padding=1, bias=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1, padding=1, bias=True),
            nn.ReLU(inplace=True),
        )
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)

        self.conv2 = nn.Sequential(
            nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=1, padding=1, bias=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, padding=1, bias=True),
            nn.ReLU(inplace=True),
        )
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)

        self.conv3 = nn.Sequential(
            nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3, stride=1, padding=1, bias=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=256, out_channels=256, kernel_size=3, stride=1, padding=1, bias=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=256, out_channels=256, kernel_size=3, stride=1, padding=1, bias=True),
            nn.ReLU(inplace=True),
        )
        self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2)

        self.conv4 = nn.Sequential(
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=1, bias=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=512, out_channels=512, kernel_size=3, stride=1, padding=1, bias=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=512, out_channels=512, kernel_size=3, stride=1, padding=1, bias=True),
            nn.ReLU(inplace=True),
        )
        # pool4 and pool5 use stride 1, so the resolution stays at 1/8 of the input.
        self.pool4 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

        # conv5 uses dilated 3x3 convolutions (dilation 2) instead of downsampling.
        self.conv5 = nn.Sequential(
            nn.Conv2d(in_channels=512, out_channels=512, kernel_size=3, stride=1, padding=2, dilation=2, bias=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=512, out_channels=512, kernel_size=3, stride=1, padding=2, dilation=2, bias=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=512, out_channels=512, kernel_size=3, stride=1, padding=2, dilation=2, bias=True),
            nn.ReLU(inplace=True),
        )
        self.pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

        self.ASPP = ASPP(in_channels=512, num_classes=num_classes)

    def forward(self, x):
        # The shape after every stage is printed to make the downsampling visible.
        conv1_x = self.conv1(x)
        print('# Conv1 output shape:', conv1_x.shape)
        pool1_x = self.pool1(conv1_x)
        print('# Pool1 output shape:', pool1_x.shape)
        conv2_x = self.conv2(pool1_x)
        print('# Conv2 output shape:', conv2_x.shape)
        pool2_x = self.pool2(conv2_x)
        print('# Pool2 output shape:', pool2_x.shape)
        conv3_x = self.conv3(pool2_x)
        print('# Conv3 output shape:', conv3_x.shape)
        pool3_x = self.pool3(conv3_x)
        print('# Pool3 output shape:', pool3_x.shape)
        conv4_x = self.conv4(pool3_x)
        print('# Conv4 output shape:', conv4_x.shape)
        pool4_x = self.pool4(conv4_x)
        print('# Pool4 output shape:', pool4_x.shape)
        conv5_x = self.conv5(pool4_x)
        print('# Conv5 output shape:', conv5_x.shape)
        pool5_x = self.pool5(conv5_x)
        print('# Pool5 output shape:', pool5_x.shape)
        out = self.ASPP(pool5_x)
        print('# Output shape:', out.shape)
        return out


if __name__ == '__main__':
    inputs = torch.randn(4, 3, 224, 224)
    print('# input shape:', inputs.shape)
    net = DeepLabv2(in_channels=3, num_classes=21)
    output = net(inputs)
Output:
# input shape: torch.Size([4, 3, 224, 224])
# Conv1 output shape: torch.Size([4, 64, 224, 224])
# Pool1 output shape: torch.Size([4, 64, 112, 112])
# Conv2 output shape: torch.Size([4, 128, 112, 112])
# Pool2 output shape: torch.Size([4, 128, 56, 56])
# Conv3 output shape: torch.Size([4, 256, 56, 56])
# Pool3 output shape: torch.Size([4, 256, 28, 28])
# Conv4 output shape: torch.Size([4, 512, 28, 28])
# Pool4 output shape: torch.Size([4, 512, 28, 28])
# Conv5 output shape: torch.Size([4, 512, 28, 28])
# Pool5 output shape: torch.Size([4, 512, 28, 28])
# Output shape: torch.Size([4, 21, 28, 28])
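DeepLab v3 (2017): [3]
DeepLab v3+ (2018): [4]
Introduction to the DeepLab V3 Network
A Brief Analysis of the DeepLabV3 Network
bilibili video walkthrough: Introduction to the DeepLabV3 Network (Semantic Segmentation)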
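DeepLab v3 improves the ASPP module. It now has five parallel branches: a 1x1 convolution layer, three 3x3 dilated convolution layers, and a global average pooling layer. The global average pooling layer is followed by a 1x1 convolution layer, and its output is restored to the input W and H by bilinear interpolation; this branch adds global context information. The outputs of the five branches are then concatenated along the channel dimension (Concat), and a final 1x1 convolution layer fuses the information further.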
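The five-branch structure described above can be sketched in PyTorch as below. This is an illustrative sketch rather than the paper's reference code: the names ASPPv3 and conv_bn_relu are hypothetical, and the BatchNorm + ReLU after each convolution, the 256 output channels, the rates (6, 12, 18) and in_channels=2048 are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_bn_relu(in_ch: int, out_ch: int, kernel_size: int, dilation: int = 1) -> nn.Sequential:
    """Conv -> BatchNorm -> ReLU block (BN/ReLU placement is an assumption)."""
    padding = 0 if kernel_size == 1 else dilation  # keeps H and W unchanged
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding, dilation=dilation, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class ASPPv3(nn.Module):
    def __init__(self, in_channels: int = 2048, out_channels: int = 256, rates=(6, 12, 18)):
        super().__init__()
        # Branch 1: 1x1 conv. Branches 2-4: 3x3 dilated convs.
        self.branches = nn.ModuleList(
            [conv_bn_relu(in_channels, out_channels, 1)]
            + [conv_bn_relu(in_channels, out_channels, 3, r) for r in rates]
        )
        # Branch 5: global average pooling followed by a 1x1 conv.
        self.pool_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            conv_bn_relu(in_channels, out_channels, 1),
        )
        # Final 1x1 conv that fuses the concatenated branches.
        self.project = conv_bn_relu(out_channels * (len(rates) + 2), out_channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        outs = [branch(x) for branch in self.branches]
        pooled = self.pool_branch(x)
        # Restore the pooled 1x1 map to the input H and W by bilinear interpolation.
        outs.append(F.interpolate(pooled, size=(h, w), mode="bilinear", align_corners=False))
        return self.project(torch.cat(outs, dim=1))


aspp = ASPPv3().eval()  # eval mode: BatchNorm on a 1x1 map fails in training with batch size 1
with torch.no_grad():
    out = aspp(torch.randn(1, 2048, 28, 28))
print(tuple(out.shape))
```

The main contrast with the v2 module is the fusion step: the branches are concatenated and projected instead of being added.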
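Here ResNet-101 is used as the example backbone.
For a detailed introduction to ASPP, see another post: An Accessible Guide to the SPP, ASPP, DSPP, and MDSPP Spatial Pyramid Pooling Family (Comprehensive Edition)
The ASPP structure in the DeepLab v3 paper is shown in the figure below: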
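As for the 1x1 convolution, the paper explains that as the rate approaches the feature map size, a 3x3 dilated convolution degenerates into a 1x1 convolution, because only the center weight still falls on valid features. This 1x1 convolution therefore stands in for a dilated convolution with a very large rate. Global pooling is also added, with its output upsampled back to the original feature map size, an idea borrowed from PSPNet. Why rate = [6, 12, 18]? It comes from the paper's experiments: this combination gives the highest mIoU.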
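The degenerate case can be checked numerically. The sketch below uses arbitrary illustrative sizes and random weights (none of them from the paper): once the dilation rate is at least as large as the feature map, the eight outer taps of the 3x3 kernel only ever land on zero padding, so the result matches a 1x1 convolution built from the center weight.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
feat = torch.randn(1, 4, 8, 8)    # 8x8 feature map with 4 channels
weight = torch.randn(2, 4, 3, 3)  # 3x3 kernel mapping 4 -> 2 channels

# Dilation rate equal to the feature map size; padding = rate keeps the 8x8 size.
rate = 8
out_dilated = F.conv2d(feat, weight, padding=rate, dilation=rate)

# 1x1 convolution that uses only the center weight of the 3x3 kernel.
center = weight[:, :, 1:2, 1:2]
out_1x1 = F.conv2d(feat, center)

same_large_rate = torch.allclose(out_dilated, out_1x1, atol=1e-5)

# With a small rate the outer taps do hit valid features, so the outputs differ.
out_small_rate = F.conv2d(feat, weight, padding=2, dilation=2)
same_small_rate = torch.allclose(out_small_rate, out_1x1, atol=1e-5)

print(same_large_rate, same_small_rate)
```

This is why an explicit 1x1 branch is a cheaper way to obtain the same effect than adding a branch with an even larger rate.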
[1] Chen L C, Papandreou G, Kokkinos I, et al. Semantic image segmentation with deep convolutional nets and fully connected CRFs[J]. arXiv preprint arXiv:1412.7062, 2014.
[2] Chen L C, Papandreou G, Kokkinos I, et al. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(4): 834-848.
[3] Chen L C, Papandreou G, Schroff F, et al. Rethinking atrous convolution for semantic image segmentation[J]. arXiv preprint arXiv:1706.05587, 2017.
[4] Chen L C, Zhu Y, Papandreou G, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 801-818.