PSPNet Paper Translation and Notes

Pyramid Scene Parsing Network

Abstract

Scene parsing is challenging for unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information by different-region-based context aggregation through our pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective to produce good quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction. The proposed approach achieves state-of-the-art performance on various datasets. It came first in ImageNet scene parsing challenge 2016, PASCAL VOC 2012 benchmark and Cityscapes benchmark. A single PSPNet yields the new record of mIoU accuracy 85.4% on PASCAL VOC 2012 and accuracy 80.2% on Cityscapes.

1. Introduction

Scene parsing, based on semantic segmentation, is a fundamental topic in computer vision. The goal is to assign each pixel in the image a category label. Scene parsing provides complete understanding of the scene. It predicts the label, location, as well as shape for each element. This topic is of broad interest for potential applications of automatic driving, robot sensing, to name a few.

Difficulty of scene parsing is closely related to scene and label variety. The pioneer scene parsing task [23] is to classify 33 scenes for 2,688 images on LMO dataset [22]. More recent PASCAL VOC semantic segmentation and PASCAL context datasets [8, 29] include more labels with similar context, such as chair and sofa, horse and cow, etc. The new ADE20K dataset [43] is the most challenging one with a large and unrestricted open vocabulary and more scene classes. A few representative images are shown in Fig. 1. To develop an effective algorithm for these datasets needs to conquer a few difficulties.

Figure 1. Examples of complex scenes in the ADE20K dataset.

State-of-the-art scene parsing frameworks are mostly based on the fully convolutional network (FCN) [26]. The deep convolutional neural network (CNN) based methods boost dynamic object understanding, and yet still face challenges considering diverse scenes and unrestricted vocabulary. One example is shown in the first row of Fig. 2, where a boat is mistaken as a car. These errors are due to similar appearance of objects. But when viewing the image regarding the context prior that the scene is described as a boathouse near a river, correct prediction should be yielded.

Towards accurate scene perception, the knowledge graph relies on prior information of scene context. We found that the major issue for current FCN based models is lack of suitable strategy to utilize global scene category clues. For typical complex scene understanding, previously to get a global image-level feature, spatial pyramid pooling [18] was widely employed where spatial statistics provide a good descriptor for overall scene interpretation. Spatial pyramid pooling network [12] further enhances the ability.

Different from these methods, to incorporate suitable global features, we propose pyramid scene parsing network (PSPNet). In addition to traditional dilated FCN [3, 40] for pixel prediction, we extend the pixel-level feature to the specially designed global pyramid pooling one. The local and global clues together make the final prediction more reliable. We also propose an optimization strategy with deeply supervised loss. We give all implementation details, which are key to our decent performance in this paper, and make the code and trained models publicly available.

Our approach achieves state-of-the-art performance on all available datasets. It is the champion of ImageNet scene parsing challenge 2016 [43], and took 1st place on the PASCAL VOC 2012 semantic segmentation benchmark [8] and on the urban scene Cityscapes data [6]. They manifest that PSPNet gives a promising direction for pixel-level prediction tasks, which may even benefit CNN-based stereo matching, optical flow, depth estimation, etc. in follow-up work. Our main contributions are threefold.

  • We propose a pyramid scene parsing network to embed difficult scenery context features in an FCN based pixel prediction framework.
  • We develop an effective optimization strategy for deep ResNet [13] based on deeply supervised loss.
  • We build a practical system for state-of-the-art scene parsing and semantic segmentation where all crucial implementation details are included.

2. Related Work

In the following, we review recent advances in scene parsing and semantic segmentation tasks. Driven by powerful deep neural networks [17, 33, 34, 13], pixel-level prediction tasks like scene parsing and semantic segmentation achieve great progress inspired by replacing the fully-connected layer in classification with the convolution layer [26]. To enlarge the receptive field of neural networks, methods of [3, 40] used dilated convolution. Noh et al. [30] proposed a coarse-to-fine structure with deconvolution network to learn the segmentation mask. Our baseline network is FCN and dilated network [26, 3].
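
To make the receptive-field point concrete, the following minimal PyTorch sketch (PyTorch is our choice for illustration; the paper does not prescribe a framework) shows that a dilated 3×3 convolution keeps the spatial resolution of the feature map while each output location covers a wider neighborhood than a standard 3×3 convolution:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 60, 60)  # dummy feature map: N, C, H, W

# Standard 3x3 convolution: each output pixel sees a 3x3 neighborhood.
standard = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Dilated 3x3 convolution (dilation=2): same number of weights and the
# same 60x60 output size, but each output pixel now sees a 5x5 neighborhood,
# so stacking such layers enlarges the receptive field without downsampling.
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

print(standard(x).shape, dilated(x).shape)  # both torch.Size([1, 64, 60, 60])
```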

Other work mainly proceeds in two directions. One line [26, 3, 5, 39, 11] is with multi-scale feature ensembling. Since in deep networks, higher-layer features contain more semantic meaning and less location information, combining multi-scale features can improve the performance.

The other direction is based on structure prediction. The pioneer work [3] used conditional random field (CRF) as post processing to refine the segmentation result. Following methods [25, 41, 1] refined networks via end-to-end modeling. Both of the two directions ameliorate the localization ability of scene parsing where predicted semantic boundary fits objects. Yet there is still much room to exploit necessary information in complex scenes.

To make good use of global image-level priors for diverse scene understanding, methods of [18, 27] extracted global context information with traditional features not from deep neural networks. Similar improvement was made under object detection frameworks [35]. Liu et al. [24] proved that global average pooling with FCN can improve semantic segmentation results. However, our experiments show that these global descriptors are not representative enough for the challenging ADE20K data. Therefore, different from global pooling in [24], we exploit the capability of global context information by different-region-based context aggregation via our pyramid scene parsing network.

3. Pyramid Scene Parsing Network

We start with our observation and analysis of representative failure cases when applying FCN methods to scene parsing. They motivate proposal of our pyramid pooling module as the effective global context prior. Our pyramid scene parsing network (PSPNet) illustrated in Fig. 3 is then described to improve performance for open-vocabulary object and stuff identification in complex scene parsing.

3.1. Important Observations

The new ADE20K dataset [43] contains 150 stuff/object category labels (e.g., wall, sky, and tree) and 1,038 image-level scene descriptors (e.g., airport terminal, bedroom, and street). So a large amount of labels and vast distributions of scenes come into existence. Inspecting the prediction results of the FCN baseline provided in [43], we summarize several common issues for complex-scene parsing.

Mismatched Relationship Context relationship is universal and important especially for complex scene understanding. There exist co-occurrent visual patterns. For example, an airplane is likely to be on a runway or flying in the sky, but not over a road. For the first-row example in Fig. 2, FCN predicts the boat in the yellow box as a “car” based on its appearance. But the common knowledge is that a car is seldom over a river. Lack of the ability to collect contextual information increases the chance of misclassification.

Confusion Categories There are many class label pairs in the ADE20K dataset [43] that are confusing in classification. Examples are field and earth; mountain and hill; wall, house, building and skyscraper. They have similar appearance. The expert annotator who labeled the entire dataset still makes 17.60% pixel error, as described in [43]. In the second row of Fig. 2, FCN predicts the object in the box as part skyscraper and part building. These results should be excluded so that the whole object is either skyscraper or building, but not both. This problem can be remedied by utilizing the relationship between categories.

Inconspicuous Classes A scene contains objects/stuff of arbitrary size. Several small-size things, like streetlight and signboard, are hard to find while they may be of great importance. Contrarily, big objects or stuff may exceed the receptive field of FCN and thus cause discontinuous prediction. As shown in the third row of Fig. 2, the pillow has a similar appearance to the sheet. Overlooking the global scene category may fail to parse the pillow. To improve performance for remarkably small or large objects, one should pay much attention to different sub-regions that contain inconspicuous-category stuff.

Figure 2. Scene parsing issues observed on ADE20K.

To summarize these observations, many errors are partially or completely related to contextual relationship and global information for different receptive fields. Thus a deep network with a suitable global-scene-level prior can much improve the performance of scene parsing.

3.2. Pyramid Pooling Module

With above analysis, in what follows, we introduce the pyramid pooling module, which empirically proves to be an effective global contextual prior.

In a deep neural network, the size of the receptive field can roughly indicate how much context information we use. Although theoretically the receptive field of ResNet [13] is already larger than the input image, it is shown by Zhou et al. [42] that the empirical receptive field of CNN is much smaller than the theoretical one, especially on high-level layers. This makes many networks not sufficiently incorporate the momentous global scenery prior. We address this issue by proposing an effective global prior representation.

Global average pooling is a good baseline model as the global contextual prior, which is commonly used in image classification tasks [34, 13]. In [24], it was successfully applied to semantic segmentation. But regarding the complex scene images in ADE20K [43], this strategy is not enough to cover necessary information. Pixels in these scene images are annotated regarding many stuff and objects. Directly fusing them to form a single vector may lose the spatial relation and cause ambiguity. Global context information along with sub-region context is helpful in this regard to distinguish among various categories. A more powerful representation could fuse information from different sub-regions with these receptive fields. Similar conclusion was drawn in classical work [18, 12] of scene/image classification.
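
As a point of reference, a [24]-style global prior can be sketched by average-pooling the final feature map to a single vector and broadcasting it back to every location; the shapes below are illustrative assumptions. The sketch makes the limitation visible: every pixel receives the identical context vector, so the spatial layout of sub-regions is lost:

```python
import torch
import torch.nn.functional as F

feat = torch.randn(1, 512, 60, 60)              # final-layer feature map

# Global average pooling: one 512-d descriptor for the whole image.
g = F.adaptive_avg_pool2d(feat, output_size=1)  # -> (1, 512, 1, 1)

# Broadcast the single vector back and concatenate it at every location.
# Each pixel gets exactly the same context, so "sky above, road below"
# and "road above, sky below" become indistinguishable in the prior.
g_up = g.expand(-1, -1, feat.size(2), feat.size(3))
fused = torch.cat([feat, g_up], dim=1)          # -> (1, 1024, 60, 60)
print(fused.shape)
```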

In [12], feature maps in different levels generated by pyramid pooling were finally flattened and concatenated to be fed into a fully connected layer for classification. This global prior is designed to remove the fixed-size constraint of CNN for image classification. To further reduce context information loss between different sub-regions, we propose a hierarchical global prior, containing information with different scales and varying among different sub-regions. We call it pyramid pooling module for global scene prior construction upon the final-layer-feature-map of the deep neural network, as illustrated in part (c) of Fig. 3.

The pyramid pooling module fuses features under four different pyramid scales. The coarsest level highlighted in red is global pooling to generate a single bin output. The following pyramid level separates the feature map into different sub-regions and forms pooled representation for different locations. The output of different levels in the pyramid pooling module contains the feature map with varied sizes. To maintain the weight of global feature, we use a 1×1 convolution layer after each pyramid level to reduce the dimension of context representation to 1/N of the original one if the level size of pyramid is N. Then we directly upsample the low-dimension feature maps to get the same size feature as the original feature map via bilinear interpolation. Finally, different levels of features are concatenated as the final pyramid pooling global feature.

Note that the number of pyramid levels and the size of each level can be modified. They are related to the size of the feature map that is fed into the pyramid pooling layer. The structure abstracts different sub-regions by adopting varying-size pooling kernels in a few strides. Thus the multi-stage kernels should maintain a reasonable gap in representation. Our pyramid pooling module is a four-level one with bin sizes of 1×1, 2×2, 3×3 and 6×6 respectively. For the type of pooling operation between max and average, we perform extensive experiments to show the difference in Section 5.2.
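
Putting the two paragraphs above together, a minimal PyTorch sketch of such a four-level module might look as follows. The BatchNorm/ReLU placement and channel counts are our assumptions; the authors' released implementation may differ in detail:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Four-level pyramid pooling as described above: bin sizes 1, 2, 3, 6."""

    def __init__(self, in_channels, bin_sizes=(1, 2, 3, 6)):
        super().__init__()
        # A 1x1 conv after each level reduces channels to in_channels / N,
        # where N is the number of pyramid levels (here 4).
        out_channels = in_channels // len(bin_sizes)
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(output_size=b),   # pool to a b x b grid
                nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
            )
            for b in bin_sizes
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        # Upsample every pooled level back to the input size via bilinear
        # interpolation and concatenate with the original feature map.
        priors = [
            F.interpolate(stage(x), size=(h, w), mode='bilinear',
                          align_corners=False)
            for stage in self.stages
        ]
        return torch.cat([x] + priors, dim=1)

# eval(): the 1x1 bin hands BatchNorm a single value per channel,
# which is invalid in training mode.
ppm = PyramidPooling(in_channels=2048).eval()
with torch.no_grad():
    y = ppm(torch.randn(1, 2048, 60, 60))
print(y.shape)  # torch.Size([1, 4096, 60, 60])
```

With a 2048-channel input, each level contributes 2048/4 = 512 channels, so the concatenated output doubles the channel count, matching the description above.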

Figure 3. Overview of PSPNet.

3.3. Network Architecture

With the pyramid pooling module, we propose our pyramid scene parsing network (PSPNet) as illustrated in Fig. 3. Given an input image in Fig. 3(a), we use a pretrained ResNet [13] model with the dilated network strategy [3, 40] to extract the feature map. The final feature map size is 1/8 of the input image, as shown in Fig. 3(b). On top of the map, we use the pyramid pooling module shown in (c) to gather context information. Using our 4-level pyramid, the pooling kernels cover the whole, half of, and small portions of the image. They are fused as the global prior. Then we concatenate the prior with the original feature map in the final part of (c). It is followed by a convolution layer to generate the final prediction map in (d).
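
Under the same assumptions, the pipeline of Fig. 3 can be sketched by dilating the last two stages of a torchvision ResNet (its replace_stride_with_dilation option yields the 1/8-resolution map of Fig. 3(b), and assumes a recent torchvision) and stacking the PyramidPooling module from the previous sketch on top. ResNet50 is used here for brevity where the paper reports ResNet101 and deeper, and the head below is one plausible reading of part (d), not the authors' exact layers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class PSPNetSketch(nn.Module):
    def __init__(self, num_classes, ppm):
        super().__init__()
        # Dilate the last two stages so the output stride is 8 (Fig. 3(b)).
        backbone = resnet50(weights=None,
                            replace_stride_with_dilation=[False, True, True])
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.ppm = ppm                      # PyramidPooling(2048) from above
        self.head = nn.Sequential(          # conv layers producing part (d)
            nn.Conv2d(4096, 512, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, num_classes, kernel_size=1),
        )

    def forward(self, x):
        size = x.shape[2:]
        feat = self.backbone(x)             # 1/8 of the input resolution
        out = self.head(self.ppm(feat))
        # Upsample the score map back to the input resolution.
        return F.interpolate(out, size=size, mode='bilinear',
                             align_corners=False)

model = PSPNetSketch(num_classes=150, ppm=PyramidPooling(2048)).eval()
with torch.no_grad():
    out = model(torch.randn(1, 3, 473, 473))
print(out.shape)  # torch.Size([1, 150, 473, 473])
```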

To explain our structure, PSPNet provides an effective global contextual prior for pixel-level scene parsing. The pyramid pooling module can collect levels of information, more representative than global pooling [24]. In terms of computational cost, our PSPNet does not much increase it compared to the original dilated FCN network. In end-to-end learning, the global pyramid pooling module and the local FCN feature can be optimized simultaneously.

4. Deep Supervision for ResNet-Based FCN

Deep pretrained networks lead to good performance [17, 33, 13]. However, increasing depth of the network may introduce additional optimization difficulty as shown in [32, 19] for image classification. ResNet solves this problem with skip connection in each block. Latter layers of deep ResNet mainly learn residues based on previous ones.

We contrarily propose generating initial results by supervision with an additional loss, and learning the residue afterwards with the final loss. Thus, optimization of the deep network is decomposed into two problems, each of which is simpler to solve.

An example of our deeply supervised ResNet101 [13] model is illustrated in Fig. 4. Apart from the main branch using softmax loss to train the final classifier, another classifier is applied after the fourth stage, i.e., the res4b22 residue block. Different from relay backpropagation [32] that blocks the backward auxiliary loss to several shallow layers, we let the two loss functions pass through all previous layers. The auxiliary loss helps optimize the learning process, while the master branch loss takes the most responsibility. We add weight to balance the auxiliary loss.
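
A hedged sketch of this training scheme in PyTorch: an auxiliary classifier reads the stage-four (res4b22) features, both branches are trained with pixel-wise cross-entropy, and the auxiliary term is down-weighted. The weight 0.4 follows the ablation in Section 5.2, and the ignore_index convention is a common assumption, not something this paragraph specifies:

```python
import torch
import torch.nn.functional as F

def deeply_supervised_loss(main_logits, aux_logits, target, alpha=0.4):
    """Master-branch loss plus a down-weighted auxiliary loss.

    main_logits: output of the final classifier (after pyramid pooling).
    aux_logits:  output of an extra classifier attached after stage four
                 (the res4b22 block); both losses back-propagate through
                 all preceding layers, unlike relay backpropagation [32].
    alpha:       balancing weight for the auxiliary term (0.4 in Sec. 5.2).
    """
    main_loss = F.cross_entropy(main_logits, target, ignore_index=255)
    aux_loss = F.cross_entropy(aux_logits, target, ignore_index=255)
    return main_loss + alpha * aux_loss

# At test time the auxiliary branch is simply dropped: only main_logits
# from the well-optimized master branch are used for prediction.
target = torch.randint(0, 150, (1, 473, 473))
main = torch.randn(1, 150, 473, 473)
aux = torch.randn(1, 150, 473, 473)
print(deeply_supervised_loss(main, aux, target).item())
```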

Figure 4. Illustration of the auxiliary loss added to ResNet101.

In the testing phase, we abandon this auxiliary branch and only use the well optimized master branch for final prediction. This kind of deeply supervised training strategy for ResNet-based FCN is broadly useful under different experimental settings and works with the pre-trained ResNet model. This manifests the generality of such a learning strategy. More details are provided in Section 5.2.

The sections that follow (experiments and conclusion) are not translated in detail here; please refer to the original paper.
