Diving into Deep Convolutional Semantic Segmentation Networks and Deeplab_V3

by Thalles Silva

Deep Convolutional Neural Networks (DCNNs) have achieved remarkable success in various Computer Vision applications. Semantic segmentation is no exception to this trend.

This piece provides an introduction to Semantic Segmentation with a hands-on TensorFlow implementation. We’ll go over one of the most relevant papers on Semantic Segmentation of general objects — Deeplab_v3. You can clone the notebook for this post here.

Semantic Segmentation

Regular image classification DCNNs share a similar structure. These models take images as input and output a single value representing the category of that image.

Usually, classification DCNNs have four main operations: convolutions, activation functions, pooling, and fully-connected layers. Passing an image through a series of these operations outputs a feature vector containing the probabilities for each class label. Note that in this setup, we categorize an image as a whole. That is, we assign a single label to an entire image.

Different from image classification, in semantic segmentation we want to make decisions for every pixel in an image. So, for each pixel, the model needs to classify it as one of the pre-determined classes. Put another way, semantic segmentation means understanding images at a pixel level.

Keep in mind that semantic segmentation doesn’t differentiate between object instances. Here, we try to assign an individual label to each pixel of a digital image. Thus, if we have two objects of the same class, they end up having the same category label. Instance Segmentation is the class of problems that differentiates instances of the same class.

Yet, regular DCNNs such as AlexNet and VGG aren’t suitable for dense prediction tasks. First, these models contain many layers designed to reduce the spatial dimensions of the input features. As a consequence, these layers end up producing highly decimated feature vectors that lack sharp details. Second, fully-connected layers have fixed sizes and lose spatial information during computation.

As an example, instead of having pooling and fully-connected layers, imagine passing an image through a series of convolutions. We can set each convolution to have a stride of 1 and “SAME” padding. Doing this, each convolution preserves the spatial dimensions of its input. We can stack a bunch of these convolutions and have a segmentation model.

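To make the size-preserving claim concrete, here is a minimal sketch of the output-size arithmetic. It assumes TensorFlow-style “SAME” and “VALID” padding rules; the helper name `conv_output_size` is made up for this illustration and is not part of any library.

```python
import math

def conv_output_size(size, kernel, stride=1, padding="SAME", dilation=1):
    """Spatial output size of a convolution under TensorFlow-style padding rules."""
    # A dilated kernel covers k + (k - 1) * (rate - 1) input positions.
    effective_kernel = kernel + (kernel - 1) * (dilation - 1)
    if padding == "SAME":
        return math.ceil(size / stride)
    if padding == "VALID":
        return math.ceil((size - effective_kernel + 1) / stride)
    raise ValueError(f"unknown padding: {padding}")

size = 224
for _ in range(10):  # ten stacked 3x3 convolutions, stride 1, SAME padding
    size = conv_output_size(size, kernel=3, stride=1, padding="SAME")
print(size)  # 224

print(conv_output_size(224, kernel=3, stride=1, padding="VALID"))  # 222
print(conv_output_size(224, kernel=3, stride=2, padding="SAME"))   # 112
```

With stride 1 and “SAME” padding the size never changes, no matter how many layers are stacked. Any other choice shrinks the feature map.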

This model could output a probability tensor with shape [W,H,C], where W and H represent the Width and Height, and C is the number of class labels. Applying the argmax function (on the third axis) gives us a tensor of shape [W,H,1] holding the predicted class of each pixel. For training, we compute the cross-entropy loss between each pixel of the ground-truth images and the predicted probabilities. In the end, we average that value and train the network using backpropagation.

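The per-pixel loss and the argmax step can be sketched in a few lines of plain Python. This is only an illustration on a tiny 2x2 image with 3 classes. The function names are hypothetical, and the nested lists are indexed as [row][column][class] rather than [W,H,C].

```python
import math

def softmax(logits):
    """Turn one pixel's class scores into probabilities."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def segmentation_loss(logits, labels):
    """Mean cross-entropy over all pixels."""
    losses = []
    for logit_row, label_row in zip(logits, labels):
        for pixel_logits, label in zip(logit_row, label_row):
            probs = softmax(pixel_logits)
            losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)

def predict(logits):
    """Per-pixel argmax over the class axis."""
    return [[max(range(len(p)), key=p.__getitem__) for p in row] for row in logits]

logits = [[[2.0, 0.5, 0.1], [0.2, 3.0, 0.1]],
          [[0.1, 0.2, 2.5], [1.5, 1.4, 0.3]]]
labels = [[0, 1], [2, 1]]

print(predict(logits))  # [[0, 1], [2, 0]]
print(segmentation_loss(logits, labels))
```

Note that the loss is computed from the probabilities, not from the argmax output. The argmax is only needed to read off the predicted label map.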

There is one problem with this approach, though. As we mentioned, using convolutions with stride 1 and “SAME” padding preserves the input dimensions. However, doing that would make the model super expensive in both memory consumption and computation complexity.

To ease that problem, segmentation networks usually have three main components: convolutions, downsampling, and upsampling layers.

There are two common ways to do downsampling in neural nets: by using convolution striding or regular pooling operations. In general, downsampling has one goal, and that is to reduce the spatial dimensions of given feature maps. For that reason, downsampling allows us to perform deeper convolutions without many memory concerns. Yet, it does so at the cost of losing some features in the process.

Also, note that the first part of this architecture looks a lot like usual classification DCNNs, with one exception: it does not put fully-connected layers in place.

After the first part, we have a feature vector with shape [W, H, D] where W, H, and D are the width, height and depth of the feature tensor. Note that the spatial dimensions of this compressed vector are smaller (yet denser) than the original input.

At this point, regular classification DCNNs would output a dense (non-spatial) vector containing probabilities for each class label. Instead, we feed this compressed feature vector to a series of upsampling layers. These layers work on reconstructing the output of the first part of the network. The goal is to increase the spatial resolution so the output vector has the same dimensions as the input.

Usually, upsampling layers are based on strided transpose convolutions. These functions go from deep and narrow layers to wider and shallower ones. Here, we use transpose convolutions to increase the feature vector’s dimensions to the desired value.

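As a rough illustration of how a strided transpose convolution enlarges its input, here is a one-dimensional sketch without padding. The function name is hypothetical, and real frameworks work on batched multi-channel tensors, so treat this only as the core idea.

```python
def conv_transpose_1d(signal, kernel, stride):
    """1-D transposed convolution without padding.

    Each input value is multiplied by the whole kernel and added into the
    output at an offset of index * stride, so the output length is
    (len(signal) - 1) * stride + len(kernel).
    """
    out = [0.0] * ((len(signal) - 1) * stride + len(kernel))
    for i, value in enumerate(signal):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out

print(conv_transpose_1d([1.0, 2.0, 3.0], kernel=[1.0, 1.0], stride=2))
# [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
```

With a stride of 2, three input values become six output values. In a real decoder the kernel weights are learned instead of fixed.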

In most papers, these two components of a segmentation network are called encoder and decoder. In short, the first “encodes” its information into a compressed vector used to represent its input. The second (the decoder) works on reconstructing this signal to the desired outcome.

There are many network implementations based on encoder-decoder architectures. FCNs, SegNet, and UNet are some of the most popular ones. As a result, we have seen many successful segmentation models in a variety of fields.

Model Architecture

Unlike most encoder-decoder designs, Deeplab offers a different approach to semantic segmentation. It presents an architecture for controlling signal decimation and learning multi-scale contextual features.

Image credits: Rethinking Atrous Convolution for Semantic Image Segmentation.

Deeplab uses an ImageNet pre-trained ResNet as its main feature extractor network. However, it proposes a new Residual block for multi-scale feature learning. Instead of regular convolutions, the last ResNet block uses atrous convolutions. Also, each convolution (within this new block) uses different dilation rates to capture multi-scale context.

Additionally, on top of this new block, it uses Atrous Spatial Pyramid Pooling (ASPP). ASPP uses dilated convolutions with different rates as an attempt at classifying regions of an arbitrary scale.

To understand the Deeplab architecture, we need to focus on three components: (i) the ResNet architecture, (ii) atrous convolutions, and (iii) Atrous Spatial Pyramid Pooling (ASPP). Let’s go over each one of them.

ResNets

ResNet is a very popular DCNN that won the ILSVRC 2015 classification task. One of the main contributions of ResNets was to provide a framework to ease the training of deeper models.

In its original form, ResNet contains 4 computational blocks. Each block contains a different number of Residual Units. These units perform a series of convolutions in a special way. Also, the feature maps are downsampled between blocks to reduce spatial dimensions (the original ResNet does this mostly with stride-2 convolutions, with a single max-pooling layer near the input).

The original paper presents two types of Residual Units: the baseline and the bottleneck blocks.

The baseline unit contains two 3x3 convolutions with Batch Normalization (BN) and ReLU activations.

The second, the bottleneck unit, consists of three stacked operations. A series of 1x1, 3x3 and 1x1 convolutions replaces the previous design. The two 1x1 operations are designed for reducing and restoring dimensions. This leaves the 3x3 convolution, in the middle, to operate on a less dense feature vector. Also, BN is applied after each convolution and before the ReLU non-linearity.

To help clarify, let’s denote this group of operations as a function F of its input x, written F(x).

After the non-linear transformations in F(x), the unit combines the result of F(x) with the original input x. This combination is done by adding the two functions. Merging the original input x with the non-linear function F(x) offers some advantages. It allows earlier layers to access the gradient signal from later layers. In other words, skipping the operations on F(x) allows earlier layers to have access to a stronger gradient signal. As a result, this type of connectivity has been shown to ease the training of deeper networks.

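A toy experiment can show why the skip connection helps the gradient. The sketch below replaces the convolution stack with a scalar stand-in, F(x) = w2 * relu(w1 * x), which is purely an assumption for illustration. It then compares the gradient of the output with respect to the input for stacks with and without the added x.

```python
def relu(v):
    return max(0.0, v)

def plain_unit(x, w1, w2):
    """y = F(x), with F(x) = w2 * relu(w1 * x) standing in for the conv stack."""
    return w2 * relu(w1 * x)

def residual_unit(x, w1, w2):
    """y = x + F(x): the input skips over F and is added back."""
    return x + plain_unit(x, w1, w2)

def stack(unit, depth, w1=0.5, w2=0.5):
    """Chain the same unit depth times."""
    def network(x):
        for _ in range(depth):
            x = unit(x, w1, w2)
        return x
    return network

def numerical_grad(fn, x, eps=1e-6):
    """Central-difference estimate of d fn / d x."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

for depth in (1, 10, 20):
    g_plain = numerical_grad(stack(plain_unit, depth), 1.0)
    g_residual = numerical_grad(stack(residual_unit, depth), 1.0)
    print(depth, g_plain, g_residual)
```

In this toy setup each plain unit scales the gradient by 0.25, so it vanishes quickly with depth. Each residual unit scales it by 1 + 0.25, so the input keeps receiving a strong signal.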

Non-bottleneck units also show a gain in accuracy as we increase model capacity. Yet, bottleneck residual units have some practical advantages. First, they perform more computations with almost the same number of parameters. Second, their computational complexity is similar to that of their counterparts.

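The parameter claim can be checked with a quick count. The channel widths used here (64 channels for the baseline unit, 256 channels reduced to 64 for the bottleneck unit) are assumptions taken from the example in the ResNet paper, not from this implementation. Biases and BN parameters are ignored.

```python
def conv_params(kernel, in_channels, out_channels):
    """Number of weights in a kernel x kernel convolution (no bias, no BN)."""
    return kernel * kernel * in_channels * out_channels

# Baseline unit: two 3x3 convolutions on 64 channels.
baseline = 2 * conv_params(3, 64, 64)

# Bottleneck unit: 1x1 reduces 256 -> 64, 3x3 works on 64, 1x1 restores 64 -> 256.
bottleneck = (conv_params(1, 256, 64)
              + conv_params(3, 64, 64)
              + conv_params(1, 64, 256))

print(baseline)    # 73728
print(bottleneck)  # 69632
```

Under these assumptions the bottleneck unit has three layers instead of two and works on four times more input channels, yet it holds slightly fewer weights than the baseline unit.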

In practice, bottleneck units are more suitable for training deeper models because less training time and computational resources are needed.

For our implementation, we’ll use the full pre-activation Residual Unit. The only difference from the standard bottleneck unit lies in the order in which BN and ReLU activations are placed. For the full pre-activation, BN and ReLU (in this order) occur before convolutions.

As shown in Identity Mappings in Deep Residual Networks, the full pre-activation unit performs better than other variants.

Note that the only difference among these designs is the order of BN and ReLU in the convolution stack.

Atrous Convolutions

Atrous (or dilated) convolutions are regular convolutions with an extra factor, the dilation rate, that allows us to expand the filter’s field of view.

Consider a 3x3 convolution filter, for instance. When the dilation rate is equal to 1, it behaves like a standard convolution. But, if we set the dilation factor to 2, it has the effect of enlarging the convolution kernel.

In theory, it works like this. First, it expands (dilates) the convolution filter according to the dilation rate. Second, it fills the empty spaces with zeros, creating a sparse-like filter. Finally, it performs regular convolution using the dilated filter.

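The first two steps above can be sketched directly: spread the weights apart and fill the gaps with zeros. The helper `dilate_kernel` is a hypothetical name used only for this illustration. As the text notes, frameworks usually avoid materializing the sparse filter.

```python
def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between neighbouring weights of a square kernel."""
    k = len(kernel)
    size = k + (k - 1) * (rate - 1)  # effective size of the dilated filter
    dilated = [[0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            dilated[i * rate][j * rate] = kernel[i][j]
    return dilated

kernel = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]

for row in dilate_kernel(kernel, rate=2):
    print(row)
# [1, 0, 2, 0, 3]
# [0, 0, 0, 0, 0]
# [4, 0, 5, 0, 6]
# [0, 0, 0, 0, 0]
# [7, 0, 8, 0, 9]

print(len(dilate_kernel(kernel, rate=3)))  # 7
```

A rate of 1 returns the kernel unchanged. The number of non-zero weights stays at nine for any rate, which is why the parameter count does not grow.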

As a consequence, a 3x3 filter with a dilation rate of 2 is able to cover an area equivalent to 5x5. Yet, because it acts like a sparse filter, only the original 3x3 cells do computation and produce results. I say “acts like” because most frameworks don’t implement atrous convolutions using sparse filters (because of memory concerns).

In a similar way, setting the atrous factor to 3 allows a regular 3x3 convolution to get signals from a 7x7 corresponding area.

This effect allows us to control the resolution at which we compute feature responses. Also, atrous convolution adds larger context without increasing the number of parameters or the amount of computation.

Deeplab also shows that the dilation rate must be tuned according to the size of the feature maps. They studied the consequences of using large dilation rates over small feature maps.

When the dilation rate is very close to the feature map’s size, a regular 3x3 atrous filter acts as a standard 1x1 convolution.

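This degeneration can be observed by counting how many of the nine filter weights actually land inside the feature map. The sketch below assumes zero padding and a hypothetical 33x33 feature map. The function name is made up for this example.

```python
def valid_taps(feature_size, rate):
    """Average number of 3x3 filter weights that fall inside the feature map.

    Assumes zero padding, so weights that land outside the map only ever
    multiply zeros and contribute nothing.
    """
    offsets = (-rate, 0, rate)
    total = 0
    for row in range(feature_size):
        for col in range(feature_size):
            for d_row in offsets:
                for d_col in offsets:
                    inside_rows = 0 <= row + d_row < feature_size
                    inside_cols = 0 <= col + d_col < feature_size
                    if inside_rows and inside_cols:
                        total += 1
    return total / (feature_size * feature_size)

for rate in (1, 6, 12, 18, 30, 33):
    print(rate, round(valid_taps(33, rate), 2))
```

The average drops steadily as the rate grows. Once the rate reaches the feature map size, only the centre weight is ever inside the map, so the filter behaves like a 1x1 convolution.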

Put another way, the effectiveness of atrous convolutions depends on a good choice of the dilation rate. Because of that, it is important to know the concept of output stride in neural networks.

Output stride is the ratio of the input image size to the output feature map size. It defines how much signal decimation the input vector suffers as it passes through the network.

For an output stride of 16, an input image of size 224x224x3 produces a feature map whose spatial dimensions are 16 times smaller, that is, 14x14.

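The arithmetic is simple enough to write down. This small sketch assumes that every strided layer rounds up, as “SAME”-style padding does. The helper name is hypothetical.

```python
import math

def feature_map_size(input_size, output_stride):
    """Spatial size of the final feature map, assuming strided layers round up."""
    return math.ceil(input_size / output_stride)

for output_stride in (8, 16, 32):
    side = feature_map_size(224, output_stride)
    print(output_stride, side, side * side)
# 8 28 784
# 16 14 196
# 32 7 49
```

Halving the output stride doubles each side of the feature map, so the number of spatial positions that later layers must process grows by a factor of four.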

Deeplab also discusses the effects of different output strides on segmentation models. It argues that excessive signal decimation is harmful for dense prediction tasks. In summary, models with smaller output stride (less signal decimation) tend to output finer segmentation results. Yet, training models with smaller output stride demands more training time.

Deeplab reports experiments with two configurations of output strides, 8 and 16. As expected, output stride = 8 was able to produce slightly better results. Here we choose output stride = 16 for practical reasons.

Also, because the atrous block doesn’t implement downsampling, ASPP also runs on the same feature response size. As a result, it allows learning features from multi-scale context using relatively large dilation rates.

The new Atrous Residual Block contains three residual units. In total, the 3 units have three 3x3 convolutions. Motivated by multigrid methods, Deeplab proposes different dilation rates for each convolution. In summary, multigrid defines the dilation rates for each of the three convolutions.

In practice:

For the new block4, when output stride = 16 and Multi Grid = (1, 2, 4), the three convolutions have rates = 2 · (1, 2, 4) = (2, 4, 8) respectively.

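The rule is just an element-wise product of the block’s base rate and the multi-grid tuple. A minimal sketch, with a hypothetical helper name:

```python
def multigrid_rates(unit_rate, multi_grid=(1, 2, 4)):
    """Dilation rate of each 3x3 convolution in the atrous residual block."""
    return tuple(unit_rate * grid for grid in multi_grid)

# Base rate 2 corresponds to block4 at output stride = 16, as in the example above.
print(multigrid_rates(2))             # (2, 4, 8)
print(multigrid_rates(2, (1, 2, 1)))  # (2, 4, 2)
```

The second call shows how a different multi-grid setting would change the rates. Only (1, 2, 4) is used in this post.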

Atrous Spatial Pyramid Pooling

For ASPP, the idea is to provide the model with multi-scale information. To do that, ASPP adds a series of atrous convolutions with different dilation rates. These rates are designed to capture long-range context. Also, to add global context information, ASPP incorporates image-level features via Global Average Pooling (GAP).

This version of ASPP contains 4 parallel operations. These are a 1x1 convolution and three 3x3 convolutions with dilation rates =(6,12,18). As we mentioned, at this point, the feature maps’ nominal stride is equal to 16.

Based on the original implementation, we use crop sizes of 513x513 for both training and testing. Thus, using an output stride of 16 means that ASPP receives feature maps with a spatial size of about 33x33 (513/16 ≈ 32, rounded up when the strided layers use “SAME”-style padding).

Also, to add more global context information, ASPP incorporates image-level features. First, it applies GAP to the features output from the last atrous block. Second, the resulting features are fed to a 1x1 convolution with 256 filters. Finally, the result is bilinearly upsampled to the correct dimensions.

In the end, the features from all the branches are combined into a single vector via concatenation. This output is then convolved with another 1x1 kernel, using BN and 256 filters.

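To keep track of the tensor shapes through ASPP, here is a small bookkeeping sketch. It does no convolution at all. It assumes that each of the four parallel branches uses 256 filters (the text only states this for the image-level branch and the final projection) and that the incoming feature map is 33x33, that is, ceil(513 / 16). All names are hypothetical.

```python
FILTERS = 256  # assumed number of filters in every branch

# (name, kernel size, dilation rate) of each ASPP branch.
BRANCHES = [
    ("conv_1x1", 1, 1),
    ("atrous_3x3_rate_6", 3, 6),
    ("atrous_3x3_rate_12", 3, 12),
    ("atrous_3x3_rate_18", 3, 18),
    ("image_level", 1, 1),  # GAP -> 1x1 conv -> bilinear upsampling
]

def aspp_shapes(height, width, filters=FILTERS):
    """Shapes of each branch output, of their concatenation, and of the projection."""
    branch_shapes = {name: (height, width, filters) for name, _, _ in BRANCHES}
    concatenated = (height, width, filters * len(BRANCHES))
    projected = (height, width, filters)  # after the final 1x1 convolution
    return branch_shapes, concatenated, projected

branch_shapes, concatenated, projected = aspp_shapes(33, 33)
print(concatenated)  # (33, 33, 1280)
print(projected)     # (33, 33, 256)
```

Every branch keeps the spatial size, which is what makes the concatenation along the channel axis possible.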

After ASPP, we feed the result to another 1x1 convolution — to produce the final segmentation logits.

Implementation Details

Using the ResNet-50 as a feature extractor, this implementation of Deeplab_v3 employs the following network configuration:

  • output stride = 16

  • Fixed multi-grid atrous convolution rates of (1,2,4) to the new Atrous Residual block (block 4).

  • ASPP with rates (6,12,18) after the last Atrous Residual block.

Setting output stride to 16 gives us the advantage of substantially faster training. Compared to an output stride of 8, a stride of 16 makes the Atrous Residual block deal with feature maps that are four times smaller than the ones its counterpart deals with.

The multi-grid dilation rates are applied to the 3 convolutions inside the Atrous Residual block.

Finally, each of the three parallel 3x3 convolutions in ASPP gets a different dilation rate — (6,12,18).

Before computing the cross-entropy error, we resize the logits to the input’s size. As argued in the paper, it’s better to resize the logits than the ground-truth labels to keep resolution details.

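Resizing the logits is typically done with bilinear interpolation. The sketch below resizes a single channel stored as nested lists and maps corner to corner. This convention is an assumption, and it is not necessarily the one used by the resize op in the actual implementation.

```python
def bilinear_resize(grid, out_h, out_w):
    """Bilinear resize of a 2-D grid, mapping corners onto corners."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        # Position of output row i expressed in input coordinates.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out

logits = [[0.0, 2.0],
          [4.0, 6.0]]
for row in bilinear_resize(logits, 3, 3):
    print(row)
# [0.0, 1.0, 2.0]
# [2.0, 3.0, 4.0]
# [4.0, 5.0, 6.0]
```

Interpolating logits produces sensible in-between values. Class labels are categorical, so shrinking the ground truth instead would throw away the fine details that the loss is supposed to penalize.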

Based on the original training procedures, we scale each image using a random factor from 0.5 to 2. Also, we apply random left-right flipping to the scaled images.

Finally, we crop patches of size 513x513 for both training and testing.

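The three augmentation steps (random scaling, random flipping, cropping) can be sketched on a tiny image stored as nested lists. This is not the pipeline used in the implementation. Nearest-neighbour rescaling, zero padding for images smaller than the crop, and all function names are assumptions made for the example.

```python
import random

def rescale_nearest(image, factor):
    """Nearest-neighbour rescale of a 2-D image by the given factor."""
    in_h, in_w = len(image), len(image[0])
    out_h = max(1, round(in_h * factor))
    out_w = max(1, round(in_w * factor))
    return [[image[min(in_h - 1, int(i / factor))][min(in_w - 1, int(j / factor))]
             for j in range(out_w)]
            for i in range(out_h)]

def flip_left_right(image):
    return [row[::-1] for row in image]

def random_crop(image, crop, rng, pad_value=0):
    """Pad on the right and bottom if needed, then cut a crop x crop patch."""
    h, w = len(image), len(image[0])
    padded = [row + [pad_value] * max(0, crop - w) for row in image]
    padded += [[pad_value] * max(w, crop) for _ in range(max(0, crop - h))]
    top = rng.randint(0, len(padded) - crop)
    left = rng.randint(0, len(padded[0]) - crop)
    return [row[left:left + crop] for row in padded[top:top + crop]]

def augment(image, crop, rng):
    image = rescale_nearest(image, rng.uniform(0.5, 2.0))  # random scale in [0.5, 2]
    if rng.random() < 0.5:                                 # random left-right flip
        image = flip_left_right(image)
    return random_crop(image, crop, rng)

rng = random.Random(0)
image = [[r * 8 + c for c in range(8)] for r in range(8)]
patch = augment(image, crop=4, rng=rng)
print(len(patch), len(patch[0]))  # 4 4
```

In a real pipeline the same scale, flip, and crop must be applied to the label map as well, so that every pixel keeps its ground-truth class.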

To implement atrous convolutions with multi-grid in the block4 of the resnet, we just changed this piece in the resnet_utils.py file.

Training

To train the network, we decided to use the augmented Pascal VOC dataset provided by Semantic contours from inverse detectors.

The training data is composed of 8,252 images. There are 5,623 from the training set and 2,299 from the validation set. To test the model using the original VOC 2012 val dataset, I removed 558 images from the 2,299 validation set. These 558 samples were also present in the official VOC validation set. Also, I added 330 images from the VOC 2012 train set that were present in neither the 5,623 nor the 2,299 sets. Finally, 10% of the 8,252 images (~825 samples) are held out for validation, leaving the rest for training.

Note that this is different from the original paper: this implementation is not pre-trained on the COCO dataset. Also, some of the techniques described in the paper for training and evaluation were not implemented.

Results

The model was able to achieve decent results on the PASCAL VOC validation set.

  • Pixel accuracy: ~91%
  • Mean Accuracy: ~82%
  • Mean Intersection over Union (mIoU): ~74%
  • Frequency weighted Intersection over Union: ~86%

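For reference, the four metrics above are commonly computed from a class confusion matrix. The sketch below uses those standard definitions. The function name and the toy two-class matrix are made up for this example, and the numbers have nothing to do with the reported results.

```python
def segmentation_metrics(confusion):
    """confusion[i][j] = number of pixels of true class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    true_per_class = [sum(confusion[i]) for i in range(n)]
    pred_per_class = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    correct = [confusion[i][i] for i in range(n)]
    present = [i for i in range(n) if true_per_class[i] > 0]  # skip absent classes

    # IoU = intersection / union = correct / (true + predicted - correct).
    iou = {i: correct[i] / (true_per_class[i] + pred_per_class[i] - correct[i])
           for i in present}
    return {
        "pixel_accuracy": sum(correct) / total,
        "mean_accuracy": sum(correct[i] / true_per_class[i] for i in present) / len(present),
        "mean_iou": sum(iou.values()) / len(present),
        "frequency_weighted_iou": sum(true_per_class[i] * iou[i] for i in present) / total,
    }

confusion = [[50, 10],
             [5, 35]]
metrics = segmentation_metrics(confusion)
for name, value in metrics.items():
    print(name, round(value, 4))
```

Mean IoU treats every class equally. The frequency weighted variant weights each class by how many pixels it has, which is why it tends to sit closer to pixel accuracy on datasets dominated by background.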

Below, you can check out some of the results on a variety of images from the PASCAL VOC validation set.

Conclusion

The field of Semantic Segmentation is no doubt one of the hottest ones in Computer Vision. Deeplab presents an alternative to classic encoder-decoder architectures. It advocates the usage of atrous convolutions for feature learning in multi-range contexts. Feel free to clone the repo and tune the model to achieve results closer to the original implementation. The complete code is here.

Hope you enjoyed reading!

Originally published at sthalles.github.io.

Translated from: https://www.freecodecamp.org/news/diving-into-deep-convolutional-semantic-segmentation-networks-and-deeplab-v3-4f094fa387df/
