This post is a digest of the DenseNet paper. Since the authors' own wording is the easiest to understand, the key passages are excerpted below, each followed by a short paraphrase, to help recall the paper's main points quickly.
Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections—one between each layer and its subsequent layer—our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters.
Recent work shows that convolutional networks can be trained deeper, more accurately, and more efficiently if they contain shorter connections between layers close to the input and layers close to the output. Building on this observation, the paper introduces the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. A traditional convolutional network with L layers has L connections, one between each layer and the next, whereas this network has L(L+1)/2 connections. Every layer takes the feature maps of all preceding layers as input, and its own feature maps are used as input by all subsequent layers. DenseNets have several appealing advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters.
As CNNs become increasingly deep, a new research problem emerges: as information about the input or gradient passes through many layers, it can vanish and “wash out” by the time it reaches the end (or beginning) of the network. Many recent publications address this or related problems. ResNets [11] and Highway Networks [34] bypass signal from one layer to the next via identity connections. Stochastic depth [13] shortens ResNets by randomly dropping layers during training to allow better information and gradient flow. FractalNets [17] repeatedly combine several parallel layer sequences with different numbers of convolutional blocks to obtain a large nominal depth, while maintaining many short paths in the network. Although these different approaches vary in network topology and training procedure, they all share a key characteristic: they create short paths from early layers to later layers.
As CNNs become deeper, a new problem emerges: information about the input or the gradient can vanish or "wash out" after passing through many layers before it reaches the end (or the beginning) of the network. Many recent publications address this or related problems. ResNets and Highway Networks pass signal from one layer to the next via identity connections (bypassing paths). Stochastic depth shortens ResNets by randomly dropping layers during training, allowing better information and gradient flow. FractalNets repeatedly combine several parallel layer sequences with different numbers of convolutional blocks to obtain a large nominal depth while keeping many short paths in the network. Although these approaches differ in network topology and training procedure, they share a key characteristic: they all create short paths from early layers to later layers.
In this paper, we propose an architecture that distills this insight into a simple connectivity pattern: to ensure maximum information flow between layers in the network, we connect all layers (with matching feature-map sizes) directly with each other. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes on its own feature-maps to all subsequent layers. Figure 1 illustrates this layout schematically. Crucially, in contrast to ResNets, we never combine features through summation before they are passed into a layer; instead, we combine features by concatenating them. Hence, the ℓ-th layer has ℓ inputs, consisting of the feature-maps of all preceding convolutional blocks. Its own feature-maps are passed on to all L − ℓ subsequent layers. This introduces L(L+1)/2 connections in an L-layer network, instead of just L, as in traditional architectures.
This paper proposes an architecture that distills this insight into a simple connectivity pattern: to ensure maximum information flow between layers, all layers (with matching feature-map sizes) are connected directly to each other. To preserve the feed-forward nature, each layer takes the concatenated feature maps of all preceding layers as extra input and passes its own feature maps on to all subsequent layers (Figure 1). Crucially, unlike ResNets, features are not combined by summation before being passed into a layer; they are combined by concatenation. The ℓ-th layer therefore has ℓ inputs, the feature maps of all preceding convolutional blocks, and its own feature maps are passed on to all subsequent layers. An L-layer network thus has L(L+1)/2 connections rather than the L connections of a traditional architecture.
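A minimal PyTorch-style sketch of this connectivity pattern may help; it is an illustration rather than the authors' reference implementation, and the BN-ReLU-Conv composite function, the class names, and the default growth rate are assumptions:

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One composite function H_l: BN -> ReLU -> 3x3 conv producing k new feature maps."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.norm = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        return self.conv(self.relu(self.norm(x)))

class DenseBlock(nn.Module):
    """Each layer receives the concatenation of all preceding feature maps in the block."""
    def __init__(self, num_layers, in_channels, growth_rate=12):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # concatenate every earlier feature map along the channel axis before applying H_l
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        # the block's output is itself the concatenation of all feature maps
        return torch.cat(features, dim=1)
```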
A possibly counter-intuitive effect of this dense connectivity pattern is that it requires fewer parameters than traditional convolutional networks, as there is no need to relearn redundant feature-maps. Traditional feed-forward architectures can be viewed as algorithms with a state, which is passed on from layer to layer. Each layer reads the state from its preceding layer and writes to the subsequent layer. It changes the state but also passes on information that needs to be preserved. ResNets [11] make this information preservation explicit through additive identity transformations. Recent variations of ResNets [13] show that many layers contribute very little and can in fact be randomly dropped during training. This makes the state of ResNets similar to (unrolled) recurrent neural networks [21], but the number of parameters of ResNets is substantially larger because each layer has its own weights. Our proposed DenseNet architecture explicitly differentiates between information that is added to the network and information that is preserved. DenseNet layers are very narrow (e.g., 12 filters per layer), adding only a small set of feature-maps to the “collective knowledge” of the network and keep the remaining feature-maps unchanged—and the final classifier makes a decision based on all feature-maps in the network.
One perhaps counter-intuitive effect of this dense connectivity pattern is that it needs fewer parameters than a traditional convolutional network, because there is no need to re-learn redundant feature maps. A traditional feed-forward architecture can be viewed as an algorithm with a state that is passed on from layer to layer: each layer reads the state from its predecessor and writes to its successor, changing the state but also carrying along information that must be preserved. ResNets make this information preservation explicit through additive identity transformations. Recent variations of ResNets show that many layers contribute very little and can in fact be dropped at random during training; this makes the state of ResNets similar to an (unrolled) recurrent neural network, although ResNets have far more parameters because each layer has its own weights. The proposed DenseNet architecture explicitly distinguishes information that is added to the network from information that is preserved. DenseNet layers are very narrow (e.g., 12 filters per layer) and add only a small set of feature maps to the "collective knowledge" of the network while keeping the remaining feature maps unchanged; the final classifier then makes its decision based on all feature maps in the network.
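To see why such narrow layers keep the parameter count low, here is the channel arithmetic implied by a growth rate of k = 12, using the DenseBlock sketch above; the initial channel count k0 = 16 is a made-up value for illustration:

```python
k0, k = 16, 12                   # hypothetical initial channel count and growth rate
for l in range(1, 7):            # layers 1..6 inside one dense block
    in_ch = k0 + (l - 1) * k     # input: initial maps plus k maps from each earlier layer
    print(f"layer {l}: {in_ch} channels in -> {k} new channels out")
```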
Besides better parameter efficiency, one big advantage of DenseNets is their improved flow of information and gradients throughout the network, which makes them easy to train. Each layer has direct access to the gradients from the loss function and the original input signal, leading to an implicit deep supervision [20]. This helps training of deeper network architectures. Further, we also observe that dense connections have a regularizing effect, which reduces overfitting on tasks with smaller training set sizes.
Besides better parameter efficiency, a big advantage of DenseNets is the improved flow of information and gradients through the network, which makes them easier to train. Each layer has direct access to the gradients from the loss function and to the original input signal, amounting to a kind of implicit deep supervision. This helps when training deeper architectures. Dense connections also have a regularizing effect, reducing overfitting on tasks with smaller training sets.
A cascade structure similar to our proposed dense network layout has already been studied in the neural networks literature in the 1980s [3]. Their pioneering work focuses on fully connected multi-layer perceptrons trained in a layer-by-layer fashion. More recently, fully connected cascade networks to be trained with batch gradient descent were proposed [40].
A cascade structure similar to the proposed dense layout was already studied in the neural-network literature of the 1980s. That pioneering work focused on fully connected multi-layer perceptrons trained in a layer-by-layer fashion. More recently, fully connected cascade networks trained with batch gradient descent have also been proposed.
Highway Networks [34] were amongst the first architectures that provided a means to effectively train end-to-end networks with more than 100 layers. Using bypassing paths along with gating units, Highway Networks with hundreds of layers can be optimized without difficulty. The bypassing paths are presumed to be the key factor that eases the training of these very deep networks. This point is further supported by ResNets [11], in which pure identity mappings are used as bypassing paths. ResNets have achieved impressive, record-breaking performance on many challenging image recognition, localization, and detection tasks, such as ImageNet and COCO object detection [11]. Recently, stochastic depth was proposed as a way to successfully train a 1202-layer ResNet [13]. Stochastic depth improves the training of deep residual networks by dropping layers randomly during training. This shows that not all layers may be needed and highlights that there is a great amount of redundancy in deep (residual) networks. Our paper was partly inspired by that observation. ResNets with pre-activation also facilitate the training of state-of-the-art networks with > 1000 layers [12].
Highway Networks were among the first architectures to provide a means of effectively training end-to-end networks with more than 100 layers. Using bypassing paths together with gating units, Highway Networks with hundreds of layers can be optimized without difficulty; the bypassing paths are presumed to be the key factor that eases training of these very deep networks. This point is further supported by ResNets, which use pure identity mappings as bypassing paths. ResNets achieved impressive, record-breaking performance on many challenging image recognition, localization, and detection tasks such as ImageNet and COCO object detection. More recently, stochastic depth was proposed as a way to successfully train a 1202-layer ResNet: it improves the training of deep residual networks by dropping layers randomly during training. This shows that not all layers may be needed and that there is a great amount of redundancy in deep (residual) networks; this paper was partly inspired by that observation. ResNets with pre-activation also make it possible to train state-of-the-art networks with more than 1000 layers.
An orthogonal approach to making networks deeper (e.g., with the help of skip connections) is to increase the network width. The GoogLeNet [36, 37] uses an “Inception module” which concatenates feature-maps produced by filters of different sizes. In [38], a variant of ResNets with wide generalized residual blocks was proposed. In fact, simply increasing the number of filters in each layer of ResNets can improve its performance provided the depth is sufficient [42]. FractalNets also achieve competitive results on several datasets using a wide network structure [17].
An approach orthogonal to making networks deeper (e.g., via skip connections) is to increase the network width. GoogLeNet uses an "Inception module" that concatenates feature maps produced by filters of different sizes. In [38], a variant of ResNets with wide generalized residual blocks was proposed. In fact, simply increasing the number of filters in each layer of a ResNet can already improve performance, provided the depth is sufficient. FractalNets also achieve competitive results on several datasets using a wide network structure.
Instead of drawing representational power from extremely deep or wide architectures, DenseNets exploit the potential of the network through feature reuse, yielding condensed models that are easy to train and highly parameter-efficient. Concatenating feature-maps learned by different layers increases variation in the input of subsequent layers and improves efficiency. This constitutes a major difference between DenseNets and ResNets. Compared to Inception networks [36, 37], which also concatenate features from different layers, DenseNets are simpler and more efficient.
Instead of drawing representational power from extremely deep or wide architectures, DenseNets exploit the potential of the network through feature reuse, yielding condensed models that are easy to train and highly parameter-efficient. Concatenating feature maps learned by different layers increases the variation in the input of subsequent layers and improves efficiency; this is a major difference between DenseNets and ResNets. Compared with Inception networks, which also concatenate features from different layers, DenseNets are simpler and more efficient.
There are other notable network architecture innovations which have yielded competitive results. The Network in Network (NIN) [22] structure includes micro multi-layer perceptrons into the filters of convolutional layers to extract more complicated features. In Deeply Supervised Network (DSN) [20], internal layers are directly supervised by auxiliary classifiers, which can strengthen the gradients received by earlier layers. Ladder Networks [27, 25] introduce lateral connections into autoencoders, producing impressive accuracies on semi-supervised learning tasks. In [39], Deeply-Fused Nets (DFNs) were proposed to improve information flow by combining intermediate layers of different base networks. The augmentation of networks with pathways that minimize reconstruction losses was also shown to improve image classification models [43].
Other notable architecture innovations have also yielded competitive results. The Network in Network (NIN) structure includes micro multi-layer perceptrons in the filters of convolutional layers to extract more complicated features. In Deeply Supervised Networks (DSN), internal layers are directly supervised by auxiliary classifiers, which strengthens the gradients received by earlier layers. Ladder Networks introduce lateral connections into autoencoders and achieve impressive accuracy on semi-supervised learning tasks. In [39], Deeply-Fused Nets (DFNs) improve information flow by combining intermediate layers of different base networks. Augmenting networks with pathways that minimize reconstruction losses has also been shown to improve image classification models.
Superficially, DenseNets are quite similar to ResNets: Eq. (2) differs from Eq. (1) only in that the inputs to H_ℓ(·) are concatenated instead of summed. However, the implications of this seemingly small modification lead to substantially different behaviors of the two network architectures.
On the surface, DenseNets and ResNets look very similar: Eq. (2) differs from Eq. (1) only in that the inputs to H_ℓ(·) are concatenated rather than summed. This seemingly small change, however, leads to substantially different behavior of the two architectures.
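Written out (reconstructing the paper's notation, where x_ℓ denotes the output of layer ℓ and H_ℓ(·) its composite function), the two update rules being compared are:

```latex
% Eq. (1), ResNet: the identity shortcut is added to the layer's output
x_{\ell} = H_{\ell}(x_{\ell-1}) + x_{\ell-1}

% Eq. (2), DenseNet: all preceding feature maps are concatenated and fed to H_\ell
x_{\ell} = H_{\ell}\big([x_0, x_1, \ldots, x_{\ell-1}]\big)
```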
Model compactness. As a direct consequence of the input concatenation, the feature-maps learned by any of the DenseNet layers can be accessed by all subsequent layers. This encourages feature reuse throughout the network, and leads to more compact models.
Model compactness. A direct consequence of input concatenation is that the feature maps learned by any DenseNet layer can be accessed by all subsequent layers. This encourages feature reuse throughout the network and leads to more compact models.
Implicit Deep Supervision. One explanation for the improved accuracy of dense convolutional networks may be that individual layers receive additional supervision from the loss function through the shorter connections. One can interpret DenseNets to perform a kind of “deep supervision”. The benefits of deep supervision have previously been shown in deeply-supervised nets (DSN; [20]), which have classifiers attached to every hidden layer, enforcing the intermediate layers to learn discriminative features.
Implicit deep supervision. One explanation for the improved accuracy of dense convolutional networks is that, through the shorter connections, individual layers receive additional supervision from the loss function; DenseNets can be interpreted as performing a kind of "deep supervision". The benefits of deep supervision were shown earlier in deeply-supervised nets (DSN), which attach a classifier to every hidden layer, forcing the intermediate layers to learn discriminative features.
DenseNets perform a similar deep supervision in an implicit fashion: a single classifier on top of the network provides direct supervision to all layers through at most two or three transition layers. However, the loss function and gradient of DenseNets are substantially less complicated, as the same loss function is shared between all layers.
DenseNets perform a similar deep supervision implicitly: a single classifier at the top of the network directly supervises all layers through at most two or three transition layers. However, the loss function and its gradients are substantially less complicated, because the same loss function is shared by all layers.
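The transition layers referred to here sit between dense blocks. A minimal sketch, assuming the batch normalization + 1x1 convolution + 2x2 average pooling layout the paper describes for transitions; the compression factor corresponds to the DenseNet-BC variant, and its default value here is an assumption:

```python
import torch.nn as nn

class Transition(nn.Module):
    """Transition between dense blocks: BN -> 1x1 conv -> 2x2 average pooling.
    The 1x1 convolution can also shrink the channel count (compression, as in DenseNet-BC)."""
    def __init__(self, in_channels, compression=0.5):
        super().__init__()
        out_channels = int(in_channels * compression)
        self.norm = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        return self.pool(self.conv(self.norm(x)))
```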
Stochastic vs. deterministic connection. There is an interesting connection between dense convolutional networks and stochastic depth regularization of residual networks [13]. In stochastic depth, layers in residual networks are randomly dropped, which creates direct connections between the surrounding layers. As the pooling layers are never dropped, the network results in a similar connectivity pattern as DenseNet: there is a small probability for any two layers, between the same pooling layers, to be directly connected—if all intermediate layers are randomly dropped. Although the methods are ultimately quite different, the DenseNet interpretation of stochastic depth may provide insights into the success of this regulariser.
Stochastic vs. deterministic connection. There is an interesting connection between dense convolutional networks and the stochastic-depth regularization of residual networks. With stochastic depth, layers of a residual network are randomly dropped, which creates direct connections between the surrounding layers. Since the pooling layers are never dropped, the resulting connectivity pattern is similar to that of a DenseNet: any two layers between the same pooling layers have a small probability of being directly connected, namely when all intermediate layers happen to be dropped. Although the two methods are ultimately quite different, the DenseNet interpretation of stochastic depth may provide insight into why this regularizer succeeds.
Feature Reuse:
All layers spread their weights over many inputs within the same block. This indicates that features extracted by very early layers are, indeed, directly used by deep layers throughout the same dense block.
The transition layers also spread their weights across all layers within the preceding dense block, indicating information flow from the first to the last layers of the DenseNet through few indirections.
The layers within the second and third dense block consistently assign the least weight to the outputs of the transition layer (the top row of the triangles), indicating that the transition layer outputs many redundant features (with low weight on average). This is in keeping with the strong results of DenseNet-BC where exactly these outputs are compressed.
Although the final classification layer, shown on the very right, also uses weights across the entire dense block, there seems to be a concentration towards final feature-maps, suggesting that there may be some more high-level features produced late in the network.
Feature reuse:
Within the same block, all layers spread their weights over many inputs, showing that features extracted by very early layers are directly used by deep layers of the same dense block.
The transition layers also spread their weights across all layers of the preceding dense block, meaning information flows from the first to the last layer of a DenseNet through only a few indirections.
The layers within the second and third dense block consistently assign the least weight to the outputs of the transition layer, indicating that the transition layer outputs many redundant features (with low weight on average). This is consistent with the strong results of DenseNet-BC, where exactly these outputs are compressed.
Although the final classification layer also uses weights from across the entire dense block, it concentrates on the final feature maps, suggesting that some more high-level features are produced late in the network.
We proposed a new convolutional network architecture, which we refer to as Dense Convolutional Network (DenseNet). It introduces direct connections between any two layers with the same feature-map size. We showed that DenseNets scale naturally to hundreds of layers, while exhibiting no optimization difficulties. In our experiments, DenseNets tend to yield consistent improvement in accuracy with growing number of parameters, without any signs of performance degradation or overfitting. Under multiple settings, it achieved state-of-the-art results across several highly competitive datasets. Moreover, DenseNets require substantially fewer parameters and less computation to achieve state-of-the-art performances. Because we adopted hyperparameter settings optimized for residual networks in our study, we believe that further gains in accuracy of DenseNets may be obtained by more detailed tuning of hyperparameters and learning rate schedules.
We proposed a new convolutional network architecture, the Dense Convolutional Network (DenseNet), which introduces direct connections between any two layers with the same feature-map size. DenseNets scale naturally to hundreds of layers without optimization difficulties. In the experiments, their accuracy tends to improve consistently as the number of parameters grows, with no signs of performance degradation or overfitting, and they achieve state-of-the-art results on several highly competitive datasets under multiple settings while requiring substantially fewer parameters and less computation. Because the study adopted hyperparameter settings optimized for residual networks, further accuracy gains are likely from more detailed tuning of hyperparameters and learning-rate schedules.
Whilst following a simple connectivity rule, DenseNets naturally integrate the properties of identity mappings, deep supervision, and diversified depth. They allow feature reuse throughout the networks and can consequently learn more compact and, according to our experiments, more accurate models. Because of their compact internal representations and reduced feature redundancy, DenseNets may be good feature extractors for various computer vision tasks that build on convolutional features, e.g., [4, 5].
While following this simple connectivity rule, DenseNets naturally integrate the properties of identity mappings, deep supervision, and diversified depth. They allow feature reuse throughout the network and, according to the experiments, consequently learn more compact and more accurate models. Because of their compact internal representations and reduced feature redundancy, DenseNets may be good feature extractors for various computer vision tasks that build on convolutional features.