https://github.com/fyu/dilation (the authors' implementation, written in Caffe)
This prompts new questions motivated by the structural differences between image classification and dense prediction. Which aspects of the repurposed networks are truly necessary and which reduce accuracy when operated densely? Can dedicated modules designed specifically for dense prediction improve accuracy further?
Modern image classification networks integrate multi-scale contextual information via successive pooling and subsampling layers that reduce resolution until a global prediction is obtained.
In contrast, dense prediction calls for multiscale contextual reasoning in combination with full-resolution output.
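Pooling in a CNN reduces resolution and therefore discards positional information, which conflicts with the goal of semantic segmentation: dense prediction requires multi-scale contextual reasoning combined with full-resolution output.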
One approach involves repeated up-convolutions that aim to recover lost resolution while carrying over the global perspective from downsampled layers (Noh et al., 2015; Fischer et al., 2015).
"Learning deconvolution network for semantic segmentation" and "Learning optical flow with convolutional neural networks" both use this idea.
https://cloud.tencent.com/developer/article/1008415 (see this note for a more detailed walkthrough)
This leaves open the question of whether severe intermediate downsampling was truly necessary.
Another approach involves providing multiple rescaled versions of the image as input to the network and combining the predictions obtained for these multiple inputs.
The main idea is to feed the network several rescaled versions of the input image and combine the predictions obtained for them.
"Learning hierarchical features for scene labeling", "Efficient piecewise training of deep structured models for semantic segmentation", and "Scale-aware semantic image segmentation" all use this idea.
Again, it is not clear whether separate analysis of rescaled input images is truly necessary.
So we want to use dedicated modules, designed specifically for dense prediction, to further improve semantic segmentation accuracy.
In this work, we develop a convolutional network module that aggregates multi-scale contextual information without losing resolution or analyzing rescaled images. The module can be plugged into existing architectures at any resolution. Unlike pyramid-shaped architectures carried over from image classification, the presented context module is designed specifically for dense prediction. It is a rectangular prism of convolutional layers, with no pooling or subsampling. The module is based on dilated convolutions, which support exponential expansion of the receptive field without loss of resolution or coverage.
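We propose a convolutional network module that aggregates multi-scale contextual information without losing resolution. The module can be plugged into existing architectures at any resolution (this works because its input and output are both C feature maps, i.e. they have the same form). Unlike the pyramid-shaped architectures carried over from image classification, the context module is designed specifically for dense prediction. It has no pooling or subsampling. It is built on dilated convolutions, which support exponential expansion of the receptive field without loss of resolution or coverage. [In other words, dilated convolutions alone give a large receptive field, with no downsampling.]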
In recent work on convolutional networks for semantic segmentation,
Each unit in Conv1 sees a 3×3 region of the original image. Each unit in Conv2 is built from a 2×2 region of Conv1, so traced back to the original image it sees a 5×5 region. We therefore say that the receptive field of Conv1 is 3 and that of Conv2 is 5. The receptive field of each unit of the input image is defined as 1, which is easy to see, since each pixel only sees itself.
$RF_{l+1} = RF_l + (kernel\_size_{l+1} - 1) \times feature\_stride_l$
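The recursion can be checked with a few lines of Python. This is a minimal sketch: the function name `receptive_fields` and the layer settings (Conv1 as 3×3 with stride 2, Conv2 as 2×2 with stride 1) are assumptions chosen to reproduce the 3 and 5 quoted here, not values stated in the notes.

```python
def receptive_fields(layers):
    """Return the receptive field after each layer.

    `layers` is a list of (kernel_size, stride) pairs, input side first.
    """
    rf = 1              # an input pixel only sees itself
    feature_stride = 1  # cumulative stride of the current feature map
    out = []
    for kernel_size, stride in layers:
        rf = rf + (kernel_size - 1) * feature_stride
        feature_stride *= stride
        out.append(rf)
    return out

# Hypothetical two-layer example: Conv1 is 3x3 with stride 2, Conv2 is 2x2 with stride 1.
print(receptive_fields([(3, 2), (2, 1)]))  # [3, 5]
```

Note that `feature_stride` is the product of all strides up to the previous layer, which is why a stride-1 stack grows the receptive field only by `kernel_size - 1` per layer.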
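A dilated convolution is simply a convolution with a dilated filter. Compared with a standard convolution, it has one extra hyperparameter, the dilation rate, which is the spacing between the points of the kernel. This gives a larger receptive field with the same number of parameters and the same amount of computation.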
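A one-dimensional sketch may make the idea concrete. The helper `dilated_conv1d` below is hypothetical and written in plain Python; it uses the cross-correlation convention common in deep learning frameworks rather than a flipped kernel, and it does no padding.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1-D dilated correlation: taps are `dilation` samples apart."""
    span = (len(kernel) - 1) * dilation + 1   # input samples covered by one output
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(w * signal[start + j * dilation] for j, w in enumerate(kernel)))
    return out

signal = list(range(10))  # 0, 1, ..., 9
kernel = [1, 1, 1]        # three weights in both cases

standard = dilated_conv1d(signal, kernel, dilation=1)  # each output spans 3 inputs
dilated = dilated_conv1d(signal, kernel, dilation=2)   # each output spans 5 inputs
print(standard)  # [3, 6, 9, 12, 15, 18, 21, 24]
print(dilated)   # [6, 9, 12, 15, 18, 21]
```

Both calls use the same three weights and the same three multiplications per output; only the spacing between the taps changes.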
$F_{i+1} = F_i *_{2^i} k_i$ for $i = 0, 1, \ldots, n-2$
That is, each feature map is obtained from the previous one by convolving it with a 3×3 kernel $k_i$ at dilation factor $2^i$.
① The receptive field of each element of $F_{i+1}$ is $(2^{i+2}-1)\times(2^{i+2}-1)$.
② With a k×k kernel and dilation factors doubling from 1 up to n, the receptive field side length is $(k-1)(2n-1)+1$; for the 3×3 kernels used here this equals $(k+1)\times n - 1 = 4n - 1$, which matches ① with $n = 2^i$.
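The closed form for 3×3 kernels can be verified numerically. This is a small sketch under the assumption that every layer has stride 1; `stacked_receptive_field` is a hypothetical helper name.

```python
def stacked_receptive_field(dilations, kernel_size=3):
    """Receptive field side length after stacking stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d  # each layer widens the view by (k - 1) * dilation
    return rf

for i in range(5):
    dilations = [2 ** j for j in range(i + 1)]  # 1, 2, ..., 2^i produce F_{i+1}
    assert stacked_receptive_field(dilations) == 2 ** (i + 2) - 1

print(stacked_receptive_field([1, 2, 4]))  # 15
```

The number of layers grows linearly while the receptive field grows exponentially, which is the property the module relies on.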
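This section describes the context network architecture used to aggregate multi-scale information. The module takes C feature maps as input and produces C feature maps as output. Because the input and output have the same number of channels, the module can be plugged into any existing dense prediction architecture.
The paper presents a basic form and a large form of the context network. The large form is a bigger context network that uses more feature maps in the deeper layers.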
The basic context module has 7 layers that apply 3×3 convolutions with different dilation factors. The dilations are 1, 1, 2, 4, 8, 16, and 1. Each convolution operates on all layers: strictly speaking, these are 3×3×C convolutions with dilation in the first two dimensions. Each of these convolutions is followed by a pointwise truncation max(·, 0). A final layer performs 1×1×C convolutions and produces the output of the module.
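The figure below shows the basic structure of the context network: 7 layers of 3×3 convolutions with different dilation factors. The dilations are 1, 1, 2, 4, 8, 16, and 1. Each convolution operates on all channels; that is, these are 3×3×C convolutions dilated in the first two dimensions.
Because the input in the experiments is 64×64, the receptive field has already reached $65\times65$ at the sixth layer, so after the sixth layer the dilation factor is set back to 1: there is no need to enlarge the receptive field further.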
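The per-layer receptive fields of the basic module follow directly from the dilation list. A minimal sketch, assuming stride 1 throughout; the names `BASIC_DILATIONS` and `cumulative_receptive_fields` are illustrative only.

```python
BASIC_DILATIONS = [1, 1, 2, 4, 8, 16, 1]  # the seven 3x3 layers of the basic module

def cumulative_receptive_fields(dilations, kernel_size=3):
    """Receptive field side length after each stride-1 dilated layer."""
    rf, fields = 1, []
    for d in dilations:
        rf += (kernel_size - 1) * d
        fields.append(rf)
    return fields

fields = cumulative_receptive_fields(BASIC_DILATIONS)
print(fields)  # [3, 5, 9, 17, 33, 65, 67]
```

The sixth entry is the 65×65 field mentioned here, and the seventh layer (dilation 1) brings the total to 67×67. The final 1×1 layer does not change it.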
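The truncation applied in the first 7 layers is the pointwise max(·, 0) that follows each convolution, not a spatial crop; an unpadded dilated convolution shrinks the feature map rather than enlarging it. The dilation grows from small to large: the early layers perceive small regions and extract local features, while the later, wider convolutions spread those features over larger regions.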
Our initial attempts to train the context module failed to yield an improvement in prediction accuracy. Experiments revealed that standard initialization procedures do not readily support the training of the module. Convolutional networks are commonly initialized using samples from random distributions.
Our first attempts to train the context module failed. Experiments showed that standard initialization procedures do not work well for this module. Convolutional networks are usually initialized with samples from random distributions.
The basic module instead uses the identity initialization below:
$k^b(t, a) = 1_{[t=0]} 1_{[a=b]}$
where a is the index of the input feature map and b is the index of the output feature map.
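To see that this initialization really passes the input through, the sketch below builds the kernel and applies it at every dilation of the basic module. It is plain Python for illustration only: `identity_kernel` and `dilated_conv2d` are hypothetical helpers, zero padding is assumed, and the max(·, 0) step is left out because the toy input is non-negative.

```python
def identity_kernel(channels, size=3):
    """kernel[b][a][y][x] = 1 only at the centre tap (t = 0) and when a == b."""
    centre = size // 2
    return [[[[1.0 if (a == b and y == centre and x == centre) else 0.0
               for x in range(size)] for y in range(size)]
             for a in range(channels)] for b in range(channels)]

def dilated_conv2d(maps, kernel, dilation):
    """Zero-padded, stride-1 dilated convolution over a list of 2-D feature maps."""
    height, width = len(maps[0]), len(maps[0][0])
    size = len(kernel[0][0])
    centre = size // 2
    out = []
    for b in range(len(kernel)):
        plane = [[0.0] * width for _ in range(height)]
        for a, in_map in enumerate(maps):
            for ky in range(size):
                for kx in range(size):
                    w = kernel[b][a][ky][kx]
                    if w == 0.0:
                        continue  # skip taps that contribute nothing
                    dy, dx = (ky - centre) * dilation, (kx - centre) * dilation
                    for y in range(height):
                        for x in range(width):
                            sy, sx = y + dy, x + dx
                            if 0 <= sy < height and 0 <= sx < width:
                                plane[y][x] += w * in_map[sy][sx]
        out.append(plane)
    return out

# Toy input: 2 feature maps of size 4x4 with distinct values.
maps = [[[float(c * 100 + y * 10 + x) for x in range(4)] for y in range(4)]
        for c in range(2)]
kernel = identity_kernel(channels=2)

passed_through = maps
for dilation in [1, 1, 2, 4, 8, 16, 1]:
    passed_through = dilated_conv2d(passed_through, kernel, dilation)

assert passed_through == maps  # every layer copies its input unchanged
```

Because only the centre tap is non-zero, the dilation factor has no effect at initialization; training then has to move the off-centre weights away from zero to pick up context.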
This initialization sets all filters such that each layer simply passes the input directly to the next. A natural concern is that this initialization could put the network in a mode where backpropagation cannot significantly improve the default behavior of simply passing information through. However, experiments indicate that this is not the case. Backpropagation reliably harvests the contextual information provided by the network to increase the accuracy of the processed maps.
The basic context module has only $64C^2$ parameters. That is very few, yet the experimental results are already very good.
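The $64C^2$ figure can be reproduced by counting weights layer by layer. A short sketch, assuming biases are not counted; the value of `C` is a hypothetical example.

```python
def basic_context_params(C):
    """Weight count of the basic module, ignoring biases (an assumption)."""
    dilated_layers = 7 * (3 * 3 * C * C)  # seven 3x3 layers, C -> C channels
    final_layer = 1 * 1 * C * C           # 1x1 output layer
    return dilated_layers + final_layer

C = 21  # hypothetical channel count, e.g. one map per class
assert basic_context_params(C) == 64 * C ** 2
print(basic_context_params(C))  # 28224
```

The seven 3×3 layers contribute $63C^2$ weights and the final 1×1 layer contributes the remaining $C^2$.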
The large context module has more feature maps; the exact numbers are given in the figure below.
We generalize the initialization scheme to account for the difference in the number of feature maps in different layers.
We use the following initialization scheme to handle layers with different numbers of feature maps.
Here $c_i$ and $c_{i+1}$ are the numbers of feature maps in two consecutive layers.
Here $\varepsilon \sim N(0, \sigma^2)$ and $\sigma \ll C/c_{i+1}$.
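The selection rule itself is not reproduced in these notes. The sketch below assumes it takes this form: the centre tap (t = 0) gets weight $C/c_{i+1}$ when $\lfloor aC/c_i \rfloor = \lfloor bC/c_{i+1} \rfloor$, and every other weight is drawn as noise ε. Treat that rule, the helper name `generalized_identity_kernel`, and the requirement that C divides both channel counts as assumptions to check against the paper.

```python
import random

def generalized_identity_kernel(c_in, c_out, C, sigma, size=3, seed=0):
    """kernel[b][a][y][x] for a layer with c_in input and c_out output maps."""
    assert c_in % C == 0 and c_out % C == 0, "C must divide both channel counts"
    rng = random.Random(seed)
    centre = size // 2
    kernel = []
    for b in range(c_out):
        per_output = []
        for a in range(c_in):
            # Input a and output b match when they descend from the same original map.
            same_group = (a * C) // c_in == (b * C) // c_out
            plane = [[C / c_out if (same_group and y == centre and x == centre)
                      else rng.gauss(0.0, sigma)  # small noise everywhere else
                      for x in range(size)] for y in range(size)]
            per_output.append(plane)
        kernel.append(per_output)
    return kernel

# Hypothetical layer that doubles the number of maps: c_i = 2, c_{i+1} = 4, C = 2.
kernel = generalized_identity_kernel(c_in=2, c_out=4, C=2, sigma=1e-3)
print(kernel[0][0][1][1], kernel[1][0][1][1])  # 0.5 0.5
```

Outputs 0 and 1 both descend from input 0 and share the same centre weight, so without the noise on the remaining taps they would stay identical during training.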
The random noise breaks ties among feature maps that share a common predecessor.
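We trained a front-end prediction module.
Using only the front-end, without the context module, already gives a clear improvement in accuracy. This is attributed to removing the parts of the original network that are not suited to dense prediction.
We plug the basic and the large context modules into the front-end module in turn. Concretely, the feature maps produced by the front-end are used as input when training the context module. (Because the receptive field of the context module is $67\times67$, the input feature maps are padded with a buffer of width 33; zero padding and reflection padding made no difference to the results.)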
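The buffer width of 33 is half of the 67-wide receptive field, excluding the centre. The sketch below shows that arithmetic and the two padding modes on a single row; `pad_width` and `pad1d` are hypothetical helpers, and the reflection variant assumed here does not repeat the edge sample.

```python
def pad_width(receptive_field):
    """Border needed on each side so every output sees a full window."""
    return (receptive_field - 1) // 2

def pad1d(row, width, mode):
    """Pad a 1-D list with zeros or by reflection (edge sample not repeated)."""
    if mode == "zero":
        return [0] * width + row + [0] * width
    if mode == "reflect":
        left = [row[i] for i in range(width, 0, -1)]
        right = [row[-1 - i] for i in range(1, width + 1)]
        return left + row + right
    raise ValueError(mode)

assert pad_width(67) == 33

row = [1, 2, 3, 4]
print(pad1d(row, 2, "zero"))     # [0, 0, 1, 2, 3, 4, 0, 0]
print(pad1d(row, 2, "reflect"))  # [3, 2, 1, 2, 3, 4, 3, 2]
```

With a 33-wide border on each side, an unpadded stack with a 67×67 receptive field returns an output of the same size as the original feature map.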
Joint training of the context module and the front-end module did not yield a significant improvement in our experiments.
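Training the context module jointly with the front-end does not noticeably improve the results. In other words, it is enough to feed the trained front-end's output to the context module; the two do not need to be trained together.
The table below shows the results of adding the context module to three different semantic segmentation architectures. The context module clearly improves accuracy whether or not the front-end is followed by structured prediction.
The table below shows the evaluation of our models on the VOC-2012 test set. "Context" refers to the model with the large context module plugged into the front-end.
In one sentence: the paper obtains a front-end by removing the parts of an existing network that serve image classification, designs a context module based on dilated convolutions, and plugs the context module into the front-end to form the full network (concretely, the front-end's feature maps serve as the input of the context module).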