RePr: Improved Training of Convolutional Filters (Translation, Part 2)

Previous: RePr: Improved Training of Convolutional Filters (Translation, Part 1)

6. Ablation Study


Comparison of pruning criteria We measure the correlation of our metric with the Oracle to answer the question of how good a substitute our metric is for the filter importance ranking. The Pearson correlation of our metric, henceforth referred to as Ortho, with the Oracle is 0.38. This is not a strong correlation; however, when we compare it with other known metrics, it is the closest. Molchanov et al. [9] report a Spearman correlation of 0.73 between their criterion (Taylor) and a greedy Oracle. We observed similar numbers for the Taylor ranking during the early epochs, but the correlation diminished significantly as the models converged. This is due to low gradient values from filters that have converged. The Taylor metric is a product of the activation and the gradient. High gradients correlate with important filters during early phases of learning, but once a model converges, low gradients do not necessarily mean less salient weights. It could be that the filter has already converged to a useful feature that is not contributing to the overall error of the model, or is stuck at a saddle point. With the norm of activations, the relationship is reversed. Thus, by multiplying the two terms together, the hope is to achieve a balance. But our experiments show that in a fully converged model, low gradients dominate high activations. Therefore, the Taylor term will have lower values as the model converges and will no longer be correlated with the inefficient filters. While the correlation of the values denotes how well the metric substitutes for predicting the accuracy, it is more important to measure the correlation of the rank of the filters. The correlation of the values and of the rank may not be the same, and the correlation with the rank is the more meaningful measurement for determining the weaker filters. Ortho has a correlation of 0.58 against the Oracle when measured over the rank of the filters. Other metrics show very poor correlation using the rank. Figure 3 (left and center) shows the correlation plots of various metrics with the Oracle. The table on the right of Figure 3 presents the test accuracy on CIFAR-10 of various ranking metrics. From the table, it is evident that orthogonality ranking leads to a significant boost in accuracy compared to standard training and other ranking criteria.
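
To make the comparison above concrete, the following is a minimal NumPy sketch of one way to compute a per-filter inter-filter-orthogonality score for a single layer, together with its rank correlation against an Oracle-style ranking. The normalization details of the paper's Equation 2 may differ, and the random filters and Oracle values are toy stand-ins, not real measurements.

```python
import numpy as np
from scipy.stats import spearmanr  # rank correlation, as used when comparing rankings with the Oracle

def ortho_scores(weights):
    """Per-filter orthogonality score for one conv layer.

    weights: (num_filters, k*k*in_channels), one flattened filter per row.
    Higher score = more overlap with the other filters = less important.
    """
    w = weights / (np.linalg.norm(weights, axis=1, keepdims=True) + 1e-8)  # unit-normalize each filter
    p = np.abs(w @ w.T - np.eye(len(w)))   # off-diagonal magnitudes measure pairwise overlap
    return p.sum(axis=1) / len(w)          # average overlap of each filter with the rest

# Toy example: 8 random filters of size 3x3x16 and a random stand-in for the Oracle
# (the per-filter drop in accuracy), used only to show how the rank correlation is computed.
rng = np.random.default_rng(0)
filters = rng.standard_normal((8, 3 * 3 * 16))
scores = ortho_scores(filters)
oracle = rng.standard_normal(8)
rho, _ = spearmanr(scores, oracle)
print(scores.round(3), rho)
```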


Percentage of filters pruned One of the key factors in our training scheme is the percentage of filters to prune at each pruning phase (p%). It behaves like the Dropout parameter, and impacts the training time and generalization ability of the model (see Figure 4). In general, the higher the pruned percentage, the better the performance. However, beyond 30% the gains are not significant. Up to 50%, the model seems to recover from the dropping of filters. Beyond that, training is not stable, and sometimes the model fails to converge.


Number of RePr iterations Our experiments suggest that each repeat of the RePr process has diminishing returns, and therefore it should be limited to a single-digit number (see Figure 4, right). Similar to Dense-Sparse-Dense [18] and Born-Again-Networks [20], we observe that for most networks, two to three iterations are sufficient to achieve the maximum benefit.


Optimizer and S1/S2 Figure 5 (left) shows the variance in improvement when using different optimizers. Our model works well with most well-known optimizers. Adam and Momentum perform better than SGD due to their added stability in training. We experimented with various values of S1 and S2, and there is not much difference as long as each of them is large enough for the model to converge temporarily.



Figure 5: Left: Impact of using various optimizers on the RePr training scheme. Right: Results from using different S1/S2 values. For clarity, these experiments only show results with [...]


Learning Rate Schedules SGD with a fixed learning rate does not typically produce optimal model performance. Instead, gradually annealing the learning rate over the course of training is known to produce models with higher test accuracy. State-of-the-art results on ResNet, DenseNet, and Inception were all reported with a predetermined learning rate schedule. However, the selection of the exact learning rate schedule is itself a hyper-parameter, one which needs to be specifically tuned for each model. Cyclical learning rates [52] can provide stronger performance without exhaustive tuning of a precise learning rate schedule. Figure 6 compares our training technique when applied in conjunction with a fixed learning rate schedule and with a cyclical learning rate. Our training scheme is not impacted by using either scheme, and the improvement over standard training is still apparent.
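
As a concrete reference for the cyclical schedule of [52] mentioned above, here is a minimal sketch of a triangular cyclical learning rate; the base rate, maximum rate, and cycle length are illustrative assumptions, not values used in the paper.

```python
def triangular_lr(step, base_lr=0.001, max_lr=0.1, step_size=2000):
    """Triangular cyclical learning rate (Smith [52]): the rate ramps linearly
    from base_lr up to max_lr and back down over one cycle of 2 * step_size steps."""
    cycle = step // (2 * step_size)
    x = abs(step / step_size - 2 * cycle - 1)   # position within the cycle, in [0, 1]
    return base_lr + (max_lr - base_lr) * (1.0 - x)

# Example: learning rate at a few training steps.
for s in (0, 1000, 2000, 3000, 4000):
    print(s, round(triangular_lr(s), 4))
```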


Impact of Dropout Dropout, while commonly applied in Multilayer Perceptrons, is typically not used for ConvNets. Our technique can be viewed as a type of non-random Dropout, specifically applicable to ConvNets. Unlike standard Dropout, our method acts on entire filters, rather than individual weights, and is applied only during select stages of training, rather than in every training step. Dropout prevents overfitting by discouraging co-adaptation of weights. This is effective in the case of over-parameterized models, but in compact or shallow models, Dropout may needlessly reduce already limited model capacity.
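
To contrast the two mechanisms, here is a small NumPy sketch of standard per-weight Dropout next to a RePr-style drop of entire filters; the filter indices are arbitrary placeholders standing in for the least-important filters under some ranking, and the usual test-time rescaling of Dropout is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
# Conv weights for one layer: (num_filters, in_channels, k, k).
w = rng.standard_normal((32, 16, 3, 3))

# Standard Dropout: each individual weight is zeroed at random (rescaling omitted here).
keep = rng.random(w.shape) > 0.5
w_dropout = w * keep

# RePr-style drop: a chosen set of whole filters is zeroed, based on a ranking
# (indices below are placeholders for the least-important filters).
least_important = np.array([3, 7, 19])
filter_mask = np.ones(w.shape[0], dtype=bool)
filter_mask[least_important] = False
w_repr = w * filter_mask[:, None, None, None]   # entire filters removed, the rest untouched
```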



Figure 7: Test accuracy of a three-layer ConvNet with 32 filters per layer over 100 epochs on CIFAR-10, using the standard scheme, RePr with Oracle, and RePr with Ortho. Left: with Dropout of 0.5. Right: no Dropout.



7. Orthogonality and Distillation


Our method, RePr, and Knowledge Distillation (KD) are both techniques to improve the performance of compact models. RePr reduces the overlap of filter representations and KD distills the information from a larger network. We present a brief comparison of the techniques and show that they can be combined to achieve even better performance. RePr repeatedly drops the filters whose weight directions overlap the most, as measured by the inter-filter orthogonality in Equation 2. Therefore, we expect this value to gradually reduce over the course of training. Figure 8 (left) shows the sum of this value over the entire network under three training schemes. We show RePr with two different filter ranking criteria - Ortho and Oracle. It is not surprising that the RePr training scheme with Ortho ranking has the lowest Ortho sum, but it is surprising that RePr training with Oracle ranking also reduces filter overlap compared to standard training. Once the model starts to converge, the least important filters based on Oracle ranking are the ones with the most overlap. Dropping these filters leads to better test accuracy (table on the right of Figure 3). Does this improvement come from the same source as that due to Knowledge Distillation? Knowledge Distillation (KD) is a well-proven methodology to train compact models. Using soft logits from the teacher together with the ground-truth signal, the model converges to better optima than with standard training. If we apply KD to the same three experiments (see Figure 8, right), we see that all the models have a significantly larger Ortho sum. Even the RePr (Ortho) model struggles to lower the sum, as the model is strongly guided to converge to a specific solution. This suggests that the improvement due to KD does not come from reducing filter overlap. Therefore, a model which uses both techniques should benefit from even better generalization. Indeed, that is the case: the combined model has significantly better performance than either individual model, as shown in Table 2.
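
For readers who want to see what "soft logits from the teacher together with the ground-truth signal" amounts to, here is a minimal NumPy sketch of a standard KD objective; the temperature, mixing weight, and toy logits are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def softmax(z, t=1.0):
    z = z / t
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, t=4.0, alpha=0.5):
    """Weighted sum of cross-entropy with the hard labels and cross-entropy
    with the teacher's temperature-softened distribution (standard KD objective)."""
    p_teacher = softmax(teacher_logits, t)
    log_p_student_t = np.log(softmax(student_logits, t) + 1e-12)
    soft = -(p_teacher * log_p_student_t).sum(axis=-1).mean() * (t * t)
    log_p_student = np.log(softmax(student_logits) + 1e-12)
    hard = -log_p_student[np.arange(len(labels)), labels].mean()
    return alpha * soft + (1 - alpha) * hard

# Toy usage: random logits for a batch of 4 examples and 10 classes.
rng = np.random.default_rng(0)
student, teacher = rng.standard_normal((4, 10)), rng.standard_normal((4, 10))
labels = rng.integers(0, 10, size=4)
print(kd_loss(student, teacher, labels))
```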



Figure 8: Comparison of filter orthogonality (Ortho-sum, Eq. 2) in standard training and RePr training, with and without Knowledge Distillation. A lower value signifies less overlap between filters. Dashed vertical lines denote filter dropping.


8. Results


We present the performance of our training scheme, RePr, with our ranking criterion, inter-filter orthogonality (Ortho), on different ConvNets [53, 1, 29, 54, 31]. For all the results provided, the RePr hyper-parameters S1, S2, and p% are kept fixed, and RePr is applied with three iterations, i.e. N = 3.
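
For reference, the overall schedule implied by these hyper-parameters can be sketched as a simple schedule generator; the epoch counts used as defaults below are illustrative placeholders, and `repr_schedule` is a hypothetical helper, not part of any released code.

```python
def repr_schedule(num_iterations=3, s1=20, s2=10):
    """Return the sequence of RePr training phases: train the full network for S1 epochs,
    rank and prune the bottom p% of filters, train the sub-network for S2 epochs,
    re-introduce the filters, and repeat N times before a final full-network phase."""
    phases = []
    for _ in range(num_iterations):
        phases.append(("train full network", s1))
        phases.append(("rank filters and prune bottom p%", 0))
        phases.append(("train pruned sub-network", s2))
        phases.append(("re-introduce dropped filters (re-initialized)", 0))
    phases.append(("final training with all filters", s1))
    return phases

for phase, epochs in repr_schedule():
    print(f"{phase:48s} {epochs} epochs")
```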


Figure 9: Accuracy improvement using RePr over standard training on vanilla ConvNets across many-layered networks.



We compare our training scheme with other similar schemes like BAN and DSD in Table 3. All three schemes were trained for three iterations, i.e., N = 3. All models were trained for 150 epochs with similar learning rate schedules and initialization. DSD and RePr (Weights) perform roughly the same function - sparsifying the model guided by magnitude - with the difference that DSD acts on individual weights, while RePr (Weights) acts on entire filters. Thus, we observe similar performance between these techniques. RePr (Ortho) outperforms the other techniques and is significantly cheaper to train compared to BAN, which requires N full training cycles.


Compared to modern architectures, vanilla ConvNets show significantly more inefficiency in the allocation of their feature representations. Thus, we find larger improvements from our method when applied to vanilla ConvNets, as compared to modern architectures. Table 4 shows test errors on CIFAR-10 and CIFAR-100. Vanilla CNNs with 32 filters per layer have high error compared to DenseNet or ResNet, but their inference time is significantly faster. RePr training improves the relative accuracy of vanilla CNNs by 8% on CIFAR-10 and 25% on CIFAR-100. The performance of the baseline DenseNet and ResNet models is still better than vanilla CNNs trained with RePr, but these models incur more than twice the inference cost. For comparison, we also consider a reduced DenseNet model with only 5 layers, which has similar inference time to the 3-layer vanilla ConvNet. This model has far fewer parameters than the vanilla ConvNet, leading to significantly higher error rates, but we choose to equalize inference time rather than parameter count, due to the importance of inference time in many practical applications. Figure 9 shows more results on vanilla CNNs of varying depth. Vanilla CNNs start to overfit the data, as most filters converge to similar representations. Our training scheme forces them to be different, which reduces the overfitting (Figure 4, right). This is evident in the larger test error of the 18-layer vanilla CNN on CIFAR-10 compared to the 3-layer CNN. With RePr training, the 18-layer model shows lower test error.



RePr is also able to improve the performance of ResNet and shallow DenseNet. This improvement is larger on CIFAR-100, which is a 100-class classification problem and thus a harder task requiring more specialized filters. Similarly, our training scheme shows a bigger relative improvement on ImageNet, a 1000-way classification problem. Table 5 presents the top-1 test error on ImageNet [55] of various ConvNets trained using standard training and with RePr. RePr was applied three times (N = 3), and the table shows errors after each round. We have attempted to replicate the results of the known models as closely as possible with the suggested hyper-parameters, and are within a small margin of the reported results. More details of the training and hyper-parameters are provided in the supplementary material. Each subsequent RePr iteration leads to improved performance with significantly diminishing returns. The improvement is more distinct in architectures which do not have skip connections, like Inception v1 and VGG, and which have lower baseline performance.


Our training scheme also improves other computer vision tasks that use similar ConvNets. We present a small sample of results from visual question answering and object detection tasks. Both tasks involve using ConvNets to extract features, and RePr improves their baseline results.


Visual Question Answering In the domain of visual question answering (VQA), a model is provided with an image and a question (as text) about that image, and must produce an answer to that question. Most of the models that solve this problem use standard ConvNets to extract image features and an LSTM network to extract text features. These features are then fed to a third model which learns to select the correct answer as a classification problem. State-of-the-art models use an attention layer and intricate mappings between features. We experimented with a more standard model where image features and language features are fed to a multi-layer perceptron with a softmax layer at the end that does 1000-way classification over candidate answers. Table 6 provides accuracy on VQAv1 using the VQA-LSTM-CNN model [56]. Results are reported for open-ended questions, which is a harder task compared to multiple-choice questions. We extract image features from Inception-v1, trained using standard training and with RePr (Ortho) training, and then feed these image features and the language embeddings (GloVe vectors) from the question to a two-layer fully connected network. Thus, the only difference between the two results reported in Table 6 is the training methodology of Inception-v1.
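
For concreteness, below is a hedged sketch of the simple fusion head described above: pooled image features and question features fed to a two-layer MLP ending in a 1000-way softmax over candidate answers. The feature sizes and the concatenation-based fusion are illustrative assumptions, not the exact VQA-LSTM-CNN setup of [56].

```python
import numpy as np

def vqa_head(img_feat, q_feat, w1, b1, w2, b2):
    """Two-layer MLP over concatenated image and question features,
    producing a distribution over 1000 candidate answers."""
    x = np.concatenate([img_feat, q_feat], axis=-1)
    h = np.maximum(0.0, x @ w1 + b1)                  # hidden layer with ReLU
    logits = h @ w2 + b2
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
img = rng.standard_normal((1, 1024))                  # pooled CNN features (size is an assumption)
q = rng.standard_normal((1, 300))                     # e.g. averaged GloVe embedding of the question
w1, b1 = 0.01 * rng.standard_normal((1324, 512)), np.zeros(512)
w2, b2 = 0.01 * rng.standard_normal((512, 1000)), np.zeros(1000)
probs = vqa_head(img, q, w1, b1, w2, b2)
print(probs.shape, probs.sum())                       # (1, 1000), sums to 1
```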



Object Detection For object detection, we experimented with Faster R-CNN using ResNet-50 and ResNet-101 pretrained on ImageNet. We experimented with both the Feature Pyramid Network and the baseline RPN with a C4 conv layer. We use the model structure from Tensorpack [57], which is able to reproduce the reported mAP scores. The model was trained on the 'trainval35k + minival' split of the COCO dataset (2014). Mean Average Precision (mAP) is calculated at ten IoU thresholds from 0.5 to 0.95. The mAP for boxes obtained with standard training and RePr training is shown in Table 7.
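
To make the metric explicit, here is a tiny sketch of how the COCO-style mAP quoted above averages AP over the ten IoU thresholds; the per-threshold AP values in the example are made up, since computing them requires the full matching of predicted and ground-truth boxes.

```python
import numpy as np

def coco_map(ap_per_threshold):
    """COCO-style mAP: average the AP computed at IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = np.linspace(0.5, 0.95, 10)
    assert len(ap_per_threshold) == len(thresholds)
    return float(np.mean(ap_per_threshold))

# Toy example with made-up AP values at the ten thresholds.
print(coco_map([0.60, 0.58, 0.55, 0.52, 0.48, 0.43, 0.37, 0.29, 0.18, 0.06]))
```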



9. Conclusion


We have introduced RePr, a training paradigm which cyclically drops and relearns some percentage of the least expressive filters. After dropping these filters, the pruned sub-model is able to recapture the lost features using the remaining parameters, allowing a more robust and efficient allocation of model capacity once the filters are reintroduced. We show that a reduced model needs training before re-introducing the filters, and careful selection of this training duration leads to substantial gains. We also demonstrate that this process can be repeated with diminishing returns.


Motivated by prior research which highlights inefficiencies in the feature representations learned by convolutional neural networks, we further introduce a novel inter-filter orthogonality metric for ranking filter importance for the purpose of RePr training, and demonstrate that this metric outperforms established ranking metrics. Our training method is able to significantly improve performance in under-parameterized networks by ensuring the efficient use of limited capacity, and the performance gains are complementary to knowledge distillation. Even in the case of complex, over-parameterized network architectures, our method is able to improve performance across a variety of tasks.


10. Acknowledgement


The first author would like to thank NVIDIA and Google for donating hardware resources partially used for this research. He would also like to thank Nick Moran, Solomon Garber, and Ryan Marcus for helpful comments.


References


[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 1, 2, 8

[2] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018. 1

[3] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. 2017 IEEE International Conference on Computer Vision (ICCV), 2017. 1

[4] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015. 1

[5] Michael Cogswell, Faruk Ahmed, Ross B. Girshick, C. Lawrence Zitnick, and Dhruv Batra. Reducing overfitting in deep networks by decorrelating representations. ICLR, abs/1511.06068, 2016. 1, 2, 3

[6] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. ICLR, abs/1608.08710, 2017. 1

[7] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. 2017 IEEE International Conference on Computer Vision (ICCV), 2017. 1

[8] Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural networks. JETC, 2017. 1

[9] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient transfer learning. ICLR, abs/1611.06440, 2017. 1, 2, 3, 5, 6

[10] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. 2017 IEEE International Conference on Computer Vision (ICCV), 2017. 1

[11] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. 2017 IEEE International Conference on Computer Vision (ICCV), 2017. 1

[12] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. ICLR, abs/1510.00149, 2016. 1, 2, 6

[13] Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. NIPS Workshop on Machine Learning of Phones and other Consumer Devices, abs/1710.01878, 2017. 1

[14] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Training pruned neural networks. CoRR, abs/1803.03635, 2018. 2

[15] Haibing Wu and Xiaodong Gu. Towards dropout training for convolutional neural networks. Neural Networks: the official journal of the International Neural Network Society, 71, 2015. 2

[16] Li Wan, Matthew D. Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. Regularization of neural networks using dropconnect. In ICML, 2013. 2

[17] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In ECCV, 2016. 2

[18] Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, and William J. Dally. DSD: Dense-sparse-dense training for deep neural networks. 2016. 2, 6, 8

[19] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015. 2

[20] Tommaso Furlanello, Zachary Chase Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. In ICML, 2018. 2, 6, 8

[21] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. ICLR, abs/1412.6550, 2015. 2

[22] Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In NIPS, 1989. 2, 5, 6

[23] Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In NIPS, 1992. 2, 6

[24] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. CoRR, abs/1607.03250, 2016. 2, 6

[25] Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan L. Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. CoRR, abs/1712.00559, 2017. 2

[26] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. CoRR, abs/1802.01548, 2018. 2

[27] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578, 2016. 2

[28] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010. 2

[29] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 2, 8

[30] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In NIPS, 2017. 2, 3, 4

[31] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. CVPR, 2017. 2, 8

[32] Ari S. Morcos, David G. T. Barrett, Neil C. Rabinowitz, and Matthew Botvinick. On the importance of single directions for generalization. CoRR, abs/1803.06959, 2017. 3

[33] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable bounds for learning some deep representations. In ICML, 2014. 3

[34] Pau Rodríguez, Jordi Gonzàlez, Guillem Cucurull, Josep M. Gonfaus, and F. Xavier Roca. Regularizing cnns with locally constrained decorrelations. ICLR, abs/1611.01967, 2017. 3

[35] Andrew Brock, Theodore Lim, James M. Ritchie, and Nick Weston. Neural photo editing with introspective adversarial networks. ICLR, abs/1609.07093, 2017. 3, 6

[36] Ben Poole, Jascha Sohl-Dickstein, and Surya Ganguli. Analyzing noise in autoencoders and deep networks. NIPS Workshop on Deep Learning, abs/1406.1831, 2013. 3, 6

[37] Pengtao Xie, Barnabás Póczos, and Eric P. Xing. Near-orthogonality regularization in kernel methods. In UAI, 2017. 3, 6

[38] Di Xie, Jiang Xiong, and Shiliang Pu. All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 3

[39] Wenling Shang, Kihyuk Sohn, Diogo Almeida, and Honglak Lee. Understanding and improving convolutional neural networks via concatenated rectified linear units. In ICML, 2016. 3

[40] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ICLR, abs/1312.6120, 2014. 3

[41] Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla convolutional neural networks. In ICML, 2018. 3, 5

[42] Eugene Vorontsov, Chiheb Trabelsi, Samuel Kadoury, and Christopher Joseph Pal. On orthogonality and learning recurrent networks with long term dependencies. In ICML, 2017. 3, 5

[43] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4), 1936. 3

[44] Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John E. Hopcroft. Convergent learning: Do different neural networks learn the same representations? In ICLR, 2016. 3

[45] David R. Hardoon, Sándor Szedmák, and John Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 2004. 4

[46] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, abs/1412.6980, 2015. 4

[47] T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012. 4

[48] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015. 4

[49] Xiaoliang Dai, Hongxu Yin, and Niraj K. Jha. NeST: A neural network synthesis tool based on a grow-and-prune paradigm. CoRR, abs/1711.02017, 2017. 5

[50] Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural networks. In NIPS, 2015. 5

[51] Maithra Raghu, Ben Poole, Jon M. Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In ICML, 2017. 5

[52] Leslie N. Smith. Cyclical learning rates for training neural networks. 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), 2017. 7

[53] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. 8

[54] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 8

[55] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015. 9

[56] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. 2015 IEEE International Conference on Computer Vision (ICCV), 2015. 9

[57] Yuxin Wu et al. Tensorpack. https://github.com/tensorpack/, 2016. 9

Appendix


ImageNet Training Details


Training large models such as ResNet, VGG or Inception (as discussed in Table 4) can be difficult, and models may not always converge to similar optima across training runs. With our RePr training scheme, we observed that large values of p% can sometimes produce collapse upon reintroduction of dropped filters. On analysis, we found that this was due to large random activations from the newly initialized filters. This can be overcome by initializing the new filters with relatively small values.


Another trick that minimizes this problem is to also re-initialize the corresponding kernels of the next layer for a given filter. Consider a filter f at layer ℓ. The activations from this filter f become an input to a kernel of every filter of the next layer, ℓ+1. If the filter f is pruned and then re-initialized, then all those kernels in layer ℓ+1 should also be initialized to small random values, as the features they had learned to process no longer exist. This prevents the new activations of these kernels (which are currently random) from dominating the activations from other kernels.
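
Putting the two tricks together, the following NumPy sketch re-initializes a dropped filter with a small-norm random direction and resets the corresponding input kernels of layer ℓ+1; the 1e-2 scale and the flattened weight layout are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def reintroduce_filter(w_l, w_next, f, scale=1e-2, rng=None):
    """Re-initialize dropped filter f of layer l and the layer l+1 kernels that read from it.

    w_l:    (num_filters_l, fan_in_l) flattened filters of layer l.
    w_next: (num_filters_l1, num_filters_l, k, k) weights of layer l+1;
            w_next[:, f] holds the kernels that consume filter f's activations.
    """
    rng = rng or np.random.default_rng()
    v = rng.standard_normal(w_l.shape[1])
    w_l[f] = scale * v / (np.linalg.norm(v) + 1e-8)          # small norm avoids large random activations
    w_next[:, f] = scale * rng.standard_normal(w_next[:, f].shape)  # reset kernels fed by the old feature
    return w_l, w_next

# Toy usage: layer with 32 filters of fan-in 3*3*16, next layer with 64 filters; filter 5 was dropped.
w_l = np.random.default_rng(0).standard_normal((32, 3 * 3 * 16))
w_next = np.random.default_rng(1).standard_normal((64, 32, 3, 3))
reintroduce_filter(w_l, w_next, f=5)
```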


Pruning a significant number of filters in one iteration can lead to instability in training. This is mostly due to changes in the running mean/variance of the BatchNorm parameters. To overcome this issue, filters can be pruned over multiple mini-batches. There is no need to re-evaluate the rank, as it does not change significantly over a few iterations. Instability of training is compounded in DenseNet, due to the dense connections. Removing multiple filters leads to significant changes to the forward-going dense connections, and they impact all the existing activations. One way to overcome this is to decay the filter weights over multiple iterations to a very small norm before removing the filter altogether from the network. Similarly, Squeeze-and-Excitation Networks² are also difficult to prune, because they maintain learned scaling parameters for activations from all the filters. Unlike BatchNorm, it is not trivial to remove the corresponding scaling parameters, as they are part of a fully connected layer. Removing this value would change the network structure and also the relative scaling of all the other activations.
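
As an illustration of the weight-decay trick mentioned above, here is a small NumPy sketch that gradually shrinks the filters selected for pruning over several steps before they are removed; the decay factor and number of steps are illustrative assumptions.

```python
import numpy as np

def decay_filters(weights, prune_idx, steps=10, factor=0.5):
    """Shrink the to-be-pruned filters over several steps instead of zeroing them at once.

    weights: (num_filters, fan_in) flattened filters of one layer.
    In real training, a few mini-batches of updates would run between the decay steps.
    """
    w = weights.copy()
    for _ in range(steps):
        w[prune_idx] *= factor          # norm of the selected filters shrinks each step
    return w

w = np.random.default_rng(0).standard_normal((8, 27))
w_decayed = decay_filters(w, prune_idx=[1, 4])
print(np.linalg.norm(w_decayed[[1, 4]], axis=1))   # very small norms after decay
```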


It is also possible to apply RePr to a pre-trained model. This is especially useful for ImageNet, where the cost of training from scratch is high. Applying RePr to a pre-trained model is able to produce some improvement, but it is not as effective as applying RePr throughout training. Careful selection of the fine-tuning learning rate is necessary to minimize the required training time. Our experiments show that adaptive-LR optimizers such as Adam might be better suited for fine-tuning from pre-trained weights.


Hyper-parameters


All ImageNet models were trained using TensorFlow on Tesla V100 GPUs, and model definitions were obtained from the official TF repository³. Images were augmented with brightness (0.6 to 1.4), contrast (0.6 to 1.4), saturation (0.4), lighting (0.1), random center crop, and horizontal flip. At test time, images were evaluated on a center crop. Most models were trained with a batch size of 256, but the large ones like ResNet-101, ResNet-152 and Inception-v2 were trained with a batch size of 128. Depending upon the implementation, RePr may add its own non-trainable variables, which will take up GPU memory, thus requiring a smaller batch size than that originally reported by other papers. Models with batch sizes of 256 were trained using SGD with a learning rate of 0.1 for the first 30 epochs, 0.01 for the next 30 epochs, and 0.001 for the remaining epochs. For models with batch sizes of 128, these learning rates were correspondingly reduced by half. For ResNet models, convolutional layers were initialized with MSRA initialization with FAN_OUT (scaling = 2.0), and the fully connected layer was initialized with a random normal (standard deviation = 0.01).
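
The stepped schedule described above can be written down directly; this small sketch simply restates it (0.1 / 0.01 / 0.001 at 30-epoch boundaries, halved when the batch size is 128 instead of 256).

```python
def imagenet_lr(epoch, batch_size=256):
    """Stepped learning rate: 0.1 for the first 30 epochs, 0.01 for the next 30,
    and 0.001 afterwards; all rates are halved for a batch size of 128."""
    base = 0.1 if epoch < 30 else 0.01 if epoch < 60 else 0.001
    return base / 2 if batch_size == 128 else base

for e in (0, 29, 30, 59, 60, 90):
    print(e, imagenet_lr(e), imagenet_lr(e, batch_size=128))
```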


² Hu, J., Shen, L., & Sun, G. (2017). Squeeze-and-Excitation Networks. CoRR, abs/1709.01507.


³ tensorflow/contrib/slim/python/slim/nets



Figure 10: Comparison of filter correlations with RePr and Standard Training.


Source article: http://tongtianta.site/paper/13375
Edited by Lornatang
Proofread by Lornatang
