在信号处理、图像处理和其它工程/科学领域,卷积都是一种使用广泛的技术。但是,深度学习领域的卷积本质上是信号/图像处理领域内的互相关(cross-correlation)。这两种操作之间存在细微的差别。
在信号/图像处理领域,卷积的定义是:
其定义是一个函数经过反转和位移后 g(t - t’) 和另一个函数 f(t’) 相乘得到的积f(t’) * g(t - t’) 的积分。在信号处理中,函数 g 是过滤器。它被反转后再沿水平轴滑动。在每一个位置,我们都计算 f 和反转后的 g 之间相交区域的面积。这个相交区域的面积就是特定位置处的卷积值。计算过程可视化如下:
在信号处理中,与标准的卷积运算不同的是互相关是两个函数之间的滑动点积或滑动内积。互相关中的过滤器不经过反转,而是直接滑过函数 f。f 与 g 之间的交叉区域即是互相关。下图展示了卷积与互相关之间的差异。
Note:在深度学习中,卷积中的过滤器不经过反转。严格来说,这是互相关,本质上是执行逐元素乘法和加法。但在深度学习中,直接将其称之为卷积更加方便。这没什么问题,因为过滤器的权重是在训练阶段学习到的。如果上面例子中的反转函数 g 是正确的函数,那么经过训练后,学习得到的过滤器看起来就会像是反转后的函数 g。因此,在训练之前,没必要像在真正的卷积中那样首先反转过滤器。
单通道和多通道也就是 filter 的个数不同,个数为1则输出 single channel ,个数为多个则输出 multi-channel。
Since we talked about depth-wise operation in the previous section of 3D convolution, let’s look at another interesting operation, 1 x 1 convolution.
You may wonder why this is helpful. Do we just multiply a number to every number in the input layer? Yes and No. The operation is trivial for layers with only one channel. There, we multiply every element by a number.
Things become interesting if the input layer has multiple channels. The following picture illustrates how 1 x 1 convolution works for an input layer with dimension H x W x D. After 1 x 1 convolution with filter size 1 x 1 x D, the output channel is with dimension H x W x 1. If we apply N such 1 x 1 convolutions and then concatenate results together, we could have a output layer with dimension H x W x N.
Initially, 1 x 1 convolutions were proposed in the Network-in-network paper. They were then highly used in the Google Inception paper. A few advantages of 1 x 1 convolutions are:
The first two advantages can be observed in the image above. After 1 x 1 convolution, we significantly reduce the dimension depth-wise. Say if the original input has 200 channels, the 1 x 1 convolution will embed these channels (features) into a single channel. The third advantage comes in as after the 1 x 1 convolution, non-linear activation such as ReLU can be added. The non-linearity allows the network to learn more complex function.
计算过程可视化例子:
普通卷积 / 转置卷积的矩阵实现
定义:C 为卷积核,Large 为输入图像,Small 为输出图像。经过卷积(矩阵乘法)后,可以将大图像下采样为小图像。这种矩阵乘法的卷积的实现遵照:C x Large = Small。
具体实现:如下图例子所示,将输入平展为 16×1 的矩阵,并将卷积核转换为一个稀疏矩阵(4×16)。然后,在稀疏矩阵和平展的输入之间使用矩阵乘法。之后,再将所得到的矩阵(4×1)转换为 2×2 的输出。
在上图中等式的两边都乘上矩阵的转置 CT,并借助“一个矩阵与其转置矩阵的乘法得到一个单位矩阵”这一性质,那么我们就能得到公式 CT x Small = Large,如下图所示。实现了从小图像到大图像的上采样的目标,转置卷积中的“转置”因此而来。
转置矩阵的算术解释可参阅:https://arxiv.org/abs/1603.07285
Checkerboard artifacts
One unpleasant behavior that people observe when using transposed convolution is the so-called checkerboard artifacts.
The paper “Deconvolution and Checkerboard Artifacts” has an excellent description about this behavior. Please check out this article for more details. Here, I just summarize a few key points.
Checkerboard artifacts result from “uneven overlap” of transposed convolution. Such overlap puts more of the metaphorical paint in some places than others.
In the image below, the layer on the top is the input layer, and the layer on the bottom is the output layer after transposed convolution. During transposed convolution, a layer with small size is mapped to a layer with larger size.
In the example (a), the stride is 1 and the filer size is 2. As outlined in red, the first pixel on the input maps to the first and second pixels on the output. As outlined in green, the second pixel on the input maps to the second and the third pixels on the output. The second pixel on the output receives information from both the first and the second pixels on the input. Overall, the pixels in the middle portion of the output receive same amount of information from the input. Here exist a region where kernels overlapped. As the filter size is increased to 3 in the example (b), the center portion that receives most information shrinks. But this may not be a big deal, since the overlap is still even. The pixels in the center portion of the output receive same amount of information from the input.
Now for the example below, we change stride = 2. In the example (a) where filter size = 2, all pixels on the output receive same amount of information from the input. They all receive information from a single pixel on the input. There is no overlap of transposed convolution here.
If we change the filter size to 4 in the example (b), the evenly overlapped region shrinks. But still, one can use the center portion of the output as the valid output, where each pixel receives the same amount of information from the input.
However, things become interesting if we change the filter size to 3 and 5 in the example © and (d). For these two cases, every pixel on the output receives different amount of information compared to its adjacent pixels. One cannot find a continuous and evenly overlapped region on the output.
The transposed convolution has uneven overlap when the filter size is not divisible by the stride. This “uneven overlap” puts more of the paint in some places than others, thus creates the checkerboard effects. In fact, the unevenly overlapped region tends to be more extreme in two dimensions. There, two patterns are multiplied together, the unevenness gets squared.
Two things one could do to reduce such artifacts, while applying transposed convolution. First, make sure you use a filer size that is divided by your stride, avoiding the overlap issue. Secondly, one can use transposed convolution with stride = 1, which helps to reduce the checkerboard effects. However, artifacts can still leak through, as seen in many recent models.
The paper further proposed a better up-sampling approach: resize the image first (using nearest-neighbor interpolation or bilinear interpolation) and then do a convolutional layer. By doing that, the authors avoid the checkerboard effects. You may want to try it for your applications.
别名:空洞卷积,膨胀卷积,多孔卷积,带孔卷积
扩张卷积由这两篇论文引入:
这是一个标准的卷积:
扩张卷积如下:
直观而言,扩张卷积就是通过在核元素之间插入空格来使核“膨胀”。新增的参数 l(扩张率)表示希望将核加宽的程度。具体实现可能各不相同,但通常是在核元素之间插入 l-1 个空格。下图展示了 l = 1, 2, 4 时的核大小。
由图可见,l=1 时感受野为 3×3,l=2 时为 7×7。l=3 时,感受野的大小就增加到了 15×15。值得注意的是,与这些操作相关的参数数量是相等的,均为3x3。从而说明「观察」更大的感受野不会有额外的成本。因此,扩张卷积可用于廉价地增大输出单元的感受野,而不会增大其核大小,这在多个扩张卷积彼此堆叠时尤其有效。
1、空间可分卷积
空间可分卷积操作的是图像的 2D 空间维度,即高和宽。从概念上看,空间可分卷积是将一个卷积分解为两个单独的运算。对于下面的示例,3×3 的 Sobel 核被分成了一个 3×1 核和一个 1×3 核。
在卷积中,3×3 核直接与图像卷积。在空间可分卷积中,3×1 核首先与图像卷积,然后再应用 1×3 核卷积。这样,执行同样的操作时仅需 6 个参数,而不是 9 个。此外,使用空间可分卷积时所需的矩阵乘法也更少。例如,5×5 图像与 3×3 核的卷积(步幅=1,填充=0)要求在 3 个位置水平地扫描核(还有 3 个垂直的位置),总共就是 9 个位置,表示为下图中的点。在每个位置,会应用 9 次逐元素乘法。总共就是 9×9=81 次乘法。
另一方面,对于空间可分卷积,首先在 5×5 的图像上应用一个 3×1 的过滤器,可以在水平 5 个位置和垂直 3 个位置扫描这样的核。总共就是 5×3=15 个位置,表示为下图中的点。在每个位置,会应用 3 次逐元素乘法。总共就是 15×3=45 次乘法。现在得到了一个 3×5 的矩阵。这个矩阵再与一个 1×3 核卷积,即在水平 3 个位置和垂直 3 个位置扫描这个矩阵。对于这 9 个位置中的每一个,应用 3 次逐元素乘法。这一步需要 9×3=27 次乘法。因此,总体而言,空间可分卷积需要 45+27=72 次乘法,少于普通卷积。
一般化推广
假设现在将卷积应用于一张 N×N 的图像上,卷积核为 m×m,步幅为 1,填充为 0。传统卷积需要 (N-2) x (N-2) x m x m 次乘法,空间可分卷积需要 N x (N-2) x m + (N-2) x (N-2) x m = (2N-2) x (N-2) x m 次乘法。空间可分卷积与标准卷积的计算成本比为:
因为图像尺寸 N 远大于过滤器大小(N>>m),所以这个比就变成了 2/m。也就是说,在这种渐进情况(N>>m)下,当过滤器大小为 3×3 时,空间可分卷积的计算成本是标准卷积的 2/3。过滤器大小为 5×5 时这一数值是 2/5;过滤器大小为 7×7 时则为 2/7。
存在的问题:
尽管空间可分卷积能节省成本,但深度学习却很少使用它。一大主要原因是并非所有的核都能分成两个更小的核。如果用空间可分卷积替代所有的传统卷积,那么就限制了在训练过程中搜索所有可能的核。这样得到的训练结果可能是次优的。
2、深度可分卷积
首先快速回顾标准的 2D 卷积。举一个具体例子,假设输入层的大小是 7×7×3(高×宽×通道),而过滤器的大小是 3×3×3。经过与一个过滤器的 2D 卷积之后,输出层的大小是 5×5×1(仅有一个通道)。
一般来说,两个神经网络层之间会应用多个过滤器。假设这里有 128 个过滤器。在应用了这 128 个 2D 卷积之后,将有 128 个 5×5×1 的输出映射图(map)。然后将这些映射图堆叠成大小为 5×5×128 的单层。通过这种操作,可将输入层(7×7×3)转换成输出层(5×5×128)。空间维度(即高度和宽度)会变小,而深度会增大。
使用深度可分卷积,目的就是利用1×1 卷积核实现同样的变换。
深度可分卷积的坏处:
因为利用1x1的卷积核来代替3x3卷积核以达到相同的 channel 深度,从而降低了卷积中参数的数量。因此,对于较小的模型而言,如果用深度可分卷积替代 2D 卷积,模型的表达能力可能会显著下降,致使得到的模型可能是次优的。但是,如果使用得当,深度可分卷积能在不降低模型性能的前提下实现效率的提升。
The flattened convolution was introduced in the paper “Flattened convolutional neural networks for feedforward acceleration”. Intuitively, the idea is to apply filter separation. Instead of applying one standard convolution filter to map the input layer to an output layer, we separate this standard filter into 3 1D filters. Such idea is similar as that in the spatial separable convolution described above, where a spatial filter is approximated by two rank-1 filters.
One should notice that if the standard convolution filter is a rank-1 filter, such filter can always be separated into cross-products of three 1D filters. But this is a strong condition and the intrinsic rank of the standard filter is higher than one in practice. As pointed out in the paper “As the difficulty of classification problem increases, the more number of leading components is required to solve the problem… Learned filters in deep networks have distributed eigenvalues and applying the separation directly to the filters results in significant information loss.”
To alleviate such problem, the paper “Flattened Convolutional Neural Networks for Feedforward Acceleration” restricts connections in receptive fields so that the model can learn 1D separated filters upon training. The paper claims that by training with flattened networks that consists of consecutive sequence of 1D filters across all directions in 3D space provides comparable performance as standard convolutional networks, with much less computation costs due to the significant reduction of learning parameters.
AlexNet 论文在 2012 年引入了分组卷积。实现分组卷积的主要原因是让网络训练可在 2 个内存有限(每个 GPU 有 1.5 GB 内存)的 GPU 上进行。下面图中的 AlexNet 表明在大多数层中都有两个分开的卷积路径。这是在两个 GPU 上执行模型并行化(当然如果可以使用更多 GPU,还能执行多 GPU 并行化)。
分组卷积的工作方式
首先,典型的 2D 卷积的步骤如下图所示。在这个例子中,通过应用 128 个大小为 3×3×3 的过滤器将输入层(7×7×3)变换到输出层(5×5×128)。推广而言,即通过应用 Dout 个大小为 h x w x Din 的核将输入层(Hin x Win x Din)变换到输出层(Hout x Wout x Dout)。
在分组卷积中,过滤器会被分为 不同的组 。每一组都负责 特定深度 的典型 2D 卷积。
上图展示了具有两个过滤器分组的分组卷积。在每个过滤器分组中,每个过滤器的深度仅有名义上的 2D 卷积的一半。它们的深度是 Din/2。每个过滤器分组包含 Dout/2 个过滤器。第一个过滤器分组(红色)与输入层的前一半([:, :, 0:Din/2])卷积,而第二个过滤器分组(橙色)与输入层的后一半([:, :, Din/2:Din])卷积。因此,每个过滤器分组都会创建 Dout/2 个通道。整体而言,两个分组会创建 2×Dout/2 = Dout 个通道。然后我们将这些通道堆叠在一起,得到有 Dout 个通道的输出层。
分组卷积的优点:
模型性能更好的原因和稀疏过滤器的关系有关。下图是相邻层过滤器的相关性。其中的关系是稀疏的。
分组矩阵的相关性映射图如下:
上图是当用 1、2、4、8、16 个过滤器分组训练模型时,相邻层的过滤器之间的相关性。那篇文章提出了一个推理:过滤器分组的效果是在通道维度上学习块对角结构的稀疏性……在网络中,具有高相关性的过滤器是使用过滤器分组以一种更为结构化的方式学习到。从效果上看,不必学习的过滤器关系就不再参数化。这样显著地减少网络中的参数数量能使其不容易过拟合,因此,一种类似正则化的效果让优化器可以学习得到更准确更高效的深度网络。 此外,每个过滤器分组都会学习数据的一个独特表征。正如 AlexNet 的作者指出的那样,过滤器分组似乎会将学习到的过滤器结构性地组织成两个不同的分组——黑白过滤器和彩色过滤器。
Shuffled grouped convolution was introduced in the ShuffleNet from Magvii Inc (Face++). ShuffleNet is a computation-efficient convolution architecture, which is designed specially for mobile devices with very limited computing power (e.g. 10–150 MFLOPs).
The ideas behind the shuffled grouped convolution are linked to the ideas behind grouped convolution (used in MobileNet and ResNeXt for examples) and depthwise separable convolution (used in Xception).
Overall, the shuffled grouped convolution involves grouped convolution and channel shuffling.
In the section about grouped convolution, we know that the filters are separated into different groups. Each group is responsible for a conventional 2D convolutions with certain depth. The total operations are significantly reduced. For examples in the figure below, we have 3 filter groups. The first filter group convolves with the red portion in the input layer. Similarly, the second and the third filter group convolves with the green and blue portions in the input. The kernel depth in each filter group is only 1/3 of the total channel count in the input layer. In this example, after the first grouped convolution GConv1, the input layer is mapped to the intermediate feature map. This feature map is then mapped to the output layer through the second grouped convolution GConv2.
Grouped convolution is computationally efficient. But the problem is that each filter group only handles information passed down from the fixed portion in the previous layers. For examples in the image above, the first filter group (red) only process information that is passed down from the first 1/3 of the input channels. The blue filter group (blue) only process information that is passed down from the last 1/3 of the input channels. As such, each filter group is only limited to learn a few specific features. This property blocks information flow between channel groups and weakens representations during training. To overcome this problem, we apply the channel shuffle.
The idea of channel shuffle is that we want to mix up the information from different filter groups. In the image below, we get the feature map after applying the first grouped convolution GConv1 with 3 filter groups. Before feeding this feature map into the second grouped convolution, we first divide the channels in each group into several subgroups. The we mix up these subgroups.
After such shuffling, we continue performing the second grouped convolution GConv2 as usual. But now, since the information in the shuffled layer has already been mixed, we essentially feed each group in GConv2 with different subgroups in the feature map layer (or in the input layer). As a result, we allow the information flow between channels groups and strengthen the representations.
The ShuffleNet paper also introduced the pointwise grouped convolution. Typically for grouped convolution such as in MobileNet or ResNeXt, the group operation is performed on the 3x3 spatial convolution, but not on 1 x 1 convolution.
The shuffleNet paper argues that the 1 x 1 convolution are also computationally costly. It suggests applying group convolution for 1 x 1 convolution as well. The pointwise grouped convolution, as the name suggested, performs group operations for 1 x 1 convolution. The operation is identical as for grouped convolution, with only one modification — performing on 1x1 filters instead of NxN filters (N>1).
In the ShuffleNet paper, authors utilized three types of convolutions we have learned: (1) shuffled grouped convolution; (2) pointwise grouped convolution; and (3) depthwise separable convolution. Such architecture design significantly reduces the computation cost while maintaining the accuracy. For examples the classification error of ShuffleNet and AlexNet is comparable on actual mobile devices. However, the computation cost has been dramatically reduced from 720 MFLOPs in AlexNet down to 40–140 MFLOPs in ShuffleNet. With relatively small computation cost and good model performance, ShuffleNet gained popularity in the field of convolutional neural net for mobile devices.
请参考我的另一篇博客:Dynamic Convolution: Attention over Convolution Kernels
持续更新中
在这里插入代码片