Now that we have seen that there are many paths through residual networks and that they do not necessarily depend on each other, we investigate their characteristics.
Not all paths through residual networks are of the same length. For example, there is precisely one path that goes through all n modules and n paths that go through only a single module. More generally, each module is either traversed or skipped, so there are C(n, k) paths of length k, and the distribution of all possible path lengths through a residual network follows a Binomial distribution. Thus, we know that the path lengths are closely centered around the mean of n/2. Figure 6 (a) shows the path length distribution for a residual network with 54 modules; more than 95% of paths go through 19 to 35 modules.
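The counting argument above can be checked numerically. The following is a minimal sketch, assuming only that every module is independently either traversed or skipped, so that the number of paths of length k is C(n, k); the function names are illustrative and do not come from the paper.

```python
from math import comb

def path_length_distribution(n):
    """Fraction of the 2**n paths that pass through exactly k modules, k = 0..n."""
    total = 2 ** n
    return [comb(n, k) / total for k in range(n + 1)]

def fraction_between(dist, lo, hi):
    """Share of paths whose length lies in the closed interval [lo, hi]."""
    return sum(dist[lo:hi + 1])

n = 54
dist = path_length_distribution(n)
mean_length = sum(k * p for k, p in enumerate(dist))
central_mass = fraction_between(dist, 19, 35)

print(f"mean path length: {mean_length:.1f}")
print(f"paths with 19 to 35 modules: {central_mass:.1%}")
```

For n = 54 the mean comes out at n/2 = 27, and the share of paths with 19 to 35 modules exceeds 95%, in line with the figure quoted above.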
Generally, data flows along all paths in residual networks. However, not all paths carry the same amount of gradient. In particular, the length of the paths through the network affects the gradient magnitude during backpropagation [1, 8]. To empirically investigate the effect of vanishing gradients on residual networks, we perform the following experiment. Starting from a trained network with 54 blocks, we sample individual paths of a certain length and measure the norm of the gradient that arrives at the input. To sample a path of length k, we first feed a batch forward through the whole network. During the backward pass, we randomly sample k residual blocks. For those k blocks, we only propagate through the residual module; for the remaining n − k blocks, we only propagate through the skip connection. Thus, we only measure gradients that flow through the single path of length k. We sample 1,000 measurements for each length k using random batches from the training set. The results show that the gradient magnitude of a path decreases exponentially with the number of modules it went through in the backward pass, Figure 6 (b).
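The path-sampling procedure can be illustrated on a toy model. The sketch below is not the experiment from the paper, which uses a trained network and real batches. It assumes a hypothetical scalar network in which block i computes x + w_i * x, so the derivative through the residual module is w_i and the derivative through the skip connection is 1. The weights are random placeholders, and only the routing of the backward pass is meant to match the description above.

```python
import random

def sample_path_gradient(weights, k, upstream=1.0, rng=random):
    """Gradient reaching the input along one randomly sampled path of length k.

    Toy assumption: block i computes x + weights[i] * x on a scalar, so the
    residual-branch derivative is weights[i] and the skip derivative is 1.
    """
    n = len(weights)
    chosen = set(rng.sample(range(n), k))  # the k blocks routed through the residual module
    grad = upstream
    for i in reversed(range(n)):           # backward pass, last block first
        if i in chosen:
            grad *= weights[i]             # propagate through the residual module only
        # otherwise propagate through the skip connection only (factor 1)
    return grad

def mean_gradient_magnitude(weights, k, num_samples=1000, rng=random):
    """Average |gradient| over num_samples sampled paths of length k."""
    total = sum(abs(sample_path_gradient(weights, k, rng=rng))
                for _ in range(num_samples))
    return total / num_samples

rng = random.Random(0)
# Placeholder per-block derivatives below 1, standing in for a trained network.
weights = [rng.uniform(0.3, 0.7) for _ in range(54)]
magnitudes = {k: mean_gradient_magnitude(weights, k, rng=rng)
              for k in (0, 1, 5, 10, 20)}

for k, value in magnitudes.items():
    print(f"k = {k:2d}  mean |gradient| = {value:.3e}")
```

Because every residual-branch factor in this toy model is below 1, the measured magnitude shrinks roughly geometrically with k. That is the qualitative behavior the experiment reports, though the actual decay rate depends on the trained network.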
Finally, we can use these results to deduce whether shorter or longer paths contribute most of the gradient during training.
To find the total gradient magnitude contributed by paths of each length, we multiply the frequency of each path length with the expected gradient magnitude. The result is shown in Figure 6 (c). Surprisingly, almost all of the gradient updates during training come from paths between 5 and 17 modules long. These are the effective paths, even though they constitute only 0.45% of all paths through this network. Moreover, in comparison to the total length of the network, the effective paths are relatively shallow.
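The multiplication of path frequency by expected gradient magnitude can be sketched as follows. The measured gradient magnitudes are not reproduced here, so the code assumes a hypothetical exponential decay `decay ** k`; the value 0.3 is a placeholder chosen only for illustration. The share of paths with 5 to 17 modules, by contrast, follows directly from the Binomial counts.

```python
from math import comb

def contribution_by_length(n, grad_magnitude):
    """Normalized share of total gradient contributed by paths of each length k."""
    raw = [comb(n, k) * grad_magnitude(k) for k in range(n + 1)]
    total = sum(raw)
    return [r / total for r in raw]

n = 54
decay = 0.3  # hypothetical per-module decay factor, NOT a measured value
shares = contribution_by_length(n, lambda k: decay ** k)

# Path length that contributes the most gradient under this assumption.
peak_length = max(range(n + 1), key=lambda k: shares[k])

# Fraction of all 2**n paths that are 5 to 17 modules long.
path_fraction_5_to_17 = sum(comb(n, k) for k in range(5, 18)) / 2 ** n

print(f"peak contribution at length {peak_length}")
print(f"paths with 5 to 17 modules: {path_fraction_5_to_17:.2%} of all paths")
```

Under any decay factor below 1, the peak of the contribution curve falls below the mean path length n/2. This is why the paths that carry most of the gradient are both rare and relatively shallow.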
To validate this result, we retrain a residual network from scratch that only sees the effective paths during training. This ensures that no long path is ever used. If the retrained model is able to perform competitively compared to training the full network, we know that long paths in residual networks are not needed during training. We achieve this by only training a subset of the modules during each mini-batch. In particular, we choose the number of modules such that the distribution of paths during training aligns with the distribution of the effective paths in the whole network. For the network with 54 modules, this means we sample exactly 23 modules during each training batch. Then, the path lengths during training are centered around 11.5 modules, well aligned with the effective paths. In our experiment, the network trained only with the effective paths achieves a 5.96% error rate, whereas the full model achieves a 6.10% error rate. There is no statistically significant difference. This demonstrates that indeed only the effective paths are needed.
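The module-subsampling scheme can be sketched as below. This is a minimal illustration under two assumptions: modules that are not sampled for a mini-batch are bypassed through their skip connection, and the modules are hypothetical scalar functions. The helper names are illustrative, and the actual training loop is not shown.

```python
import random
from math import comb

def sample_active_modules(n, m, rng=random):
    """Pick the m modules (out of n) that are used for one mini-batch."""
    return sorted(rng.sample(range(n), m))

def forward(x, modules, active):
    """Forward pass in which inactive modules reduce to their skip connection."""
    active = set(active)
    for i, module in enumerate(modules):
        if i in active:
            x = x + module(x)  # residual block: skip connection plus module output
        # inactive block: identity, so no path can pass through this module
    return x

n, m = 54, 23
rng = random.Random(0)
active = sample_active_modules(n, m, rng)

# With m active modules, path lengths during training follow Binomial(m, 1/2).
train_dist = [comb(m, k) / 2 ** m for k in range(m + 1)]
mean_train_length = sum(k * p for k, p in enumerate(train_dist))
mass_on_effective = sum(train_dist[5:18])  # lengths 5 through 17

print(f"mean training path length: {mean_train_length}")
print(f"training paths with 5 to 17 modules: {mass_on_effective:.1%}")
```

With 23 active modules no path can be longer than 23, the mean path length is 23/2 = 11.5, and nearly all training paths fall inside the 5 to 17 range of the effective paths.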