Visualizing and Understanding Convolutional Networks:
The projections from each layer show the hierarchical nature of the features in the network. Layer 2 responds to corners and other edge/color conjunctions. Layer 3 has more complex invariances, capturing similar textures (e.g. mesh patterns (Row 1, Col 1); text (R2, C4)). Layer 4 shows significant variation, but is more class-specific: dog faces (R1, C1); bird’s legs (R4, C2). Layer 5 shows entire objects with significant pose variation, e.g. keyboards (R1, C1) and dogs (R4).
ImageNet Classification with Deep Convolutional Neural Networks:
Below, we describe the two primary ways in which we combat overfitting.
(1) Data Augmentation
The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations (e.g., [25, 4, 5]).
The first form of data augmentation consists of generating image translations and horizontal reflections. We do this by extracting random 224 × 224 patches (and their horizontal reflections) from the 256 × 256 images and training our network on these extracted patches. This increases the size of our training set by a factor of 2048, though the resulting training examples are, of course, highly interdependent. Without this scheme, our network suffers from substantial overfitting, which would have forced us to use much smaller networks. At test time, the network makes a prediction by extracting five 224 × 224 patches (the four corner patches and the center patch) as well as their horizontal reflections (hence ten patches in all), and averaging the predictions made by the network’s softmax layer on the ten patches.
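As a concrete illustration, here is a minimal NumPy sketch of the crop-and-flip scheme and the ten-patch test-time averaging described above; the function names and the `predict_fn` callback are ours, not the paper's.

```python
import numpy as np

def random_crop_and_flip(image, crop=224):
    """Sample one random 224x224 patch from a 256x256 image and flip it with probability 0.5."""
    h, w, _ = image.shape                       # expected 256 x 256 x 3
    top = np.random.randint(0, h - crop + 1)    # random vertical offset
    left = np.random.randint(0, w - crop + 1)   # random horizontal offset
    patch = image[top:top + crop, left:left + crop]
    if np.random.rand() < 0.5:
        patch = patch[:, ::-1]                  # horizontal reflection
    return patch

def ten_crop_predict(image, predict_fn, crop=224):
    """Test time: average softmax outputs over the 4 corner crops, the center crop,
    and the horizontal reflections of all five (10 patches in total)."""
    h, w, _ = image.shape
    offsets = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
               ((h - crop) // 2, (w - crop) // 2)]
    patches = []
    for top, left in offsets:
        patch = image[top:top + crop, left:left + crop]
        patches.append(patch)
        patches.append(patch[:, ::-1])          # add the reflection of each crop
    probs = [predict_fn(p) for p in patches]    # predict_fn returns softmax probabilities
    return np.mean(probs, axis=0)
```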
The second form of data augmentation consists of altering the intensities of the RGB channels in training images. This scheme approximately captures an important property of natural images, namely, that object identity is invariant to changes in the intensity and color of the illumination. This scheme reduces the top-1 error rate by over 1%.
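The excerpt does not spell out the perturbation; in the paper it is done by adding random multiples of the principal components of the RGB pixel values over the training set. A rough NumPy sketch of that idea, with illustrative helper names:

```python
import numpy as np

def pca_color_stats(pixels):
    """Eigen-decomposition of the 3x3 covariance of RGB values across the training set.
    `pixels` is an (N, 3) array of RGB values."""
    cov = np.cov(pixels, rowvar=False)          # 3 x 3 covariance of the RGB channels
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvals, eigvecs

def jitter_rgb(image, eigvals, eigvecs, sigma=0.1):
    """Add a random multiple of the RGB principal components to every pixel;
    the random coefficients are drawn once per image."""
    alphas = np.random.normal(0.0, sigma, size=3)
    shift = eigvecs @ (alphas * eigvals)        # 3-vector added to each RGB pixel
    return image + shift                        # broadcasts over H x W x 3
```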
(2) Dropout
The recently-introduced technique, called “dropout” [10], consists of setting to zero the output of each hidden neuron with probability 0.5. The neurons which are “dropped out” in this way do not contribute to the forward pass and do not participate in backpropagation. This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons. At test time, we use all the neurons but multiply their outputs by 0.5, which is a reasonable approximation to taking the geometric mean of the predictive distributions produced by the exponentially-many dropout networks.
We use dropout in the first two fully-connected layers. Without dropout, our network exhibits substantial overfitting. Dropout roughly doubles the number of iterations required to converge.
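A minimal sketch of this dropout rule (train-time masking with probability 0.5, test-time scaling by 0.5); the function name is illustrative:

```python
import numpy as np

def dropout_forward(activations, p_drop=0.5, train=True):
    """Each hidden unit is zeroed with probability p_drop during training;
    at test time all units are kept and outputs are scaled by (1 - p_drop)."""
    if train:
        mask = np.random.rand(*activations.shape) >= p_drop  # 1 = keep, 0 = drop
        return activations * mask        # dropped units contribute nothing to the forward pass
    return activations * (1.0 - p_drop)  # approximate geometric mean of the dropout ensemble
```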
ImageNet Classification with Deep Convolutional Neural Networks:
On ImageNet, it is customary to report two error rates: top-1 and top-5, where the top-5 error rate is the fraction of test images for which the correct label is not among the five labels considered most probable by the model.
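For reference, a small sketch of how such error rates can be computed from model scores (names are ours):

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of examples whose correct label is not among the k highest-scoring classes.
    `scores` is (N, num_classes), `labels` is (N,)."""
    top_k = np.argsort(scores, axis=1)[:, -k:]      # indices of the k most probable labels
    hit = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()

# top1 = top_k_error(scores, labels, k=1)
# top5 = top_k_error(scores, labels, k=5)
```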
ImageNet Classification with Deep Convolutional Neural Networks:
Convolutional neural networks (CNNs) constitute one such class of models [16, 11, 13, 18, 15, 22, 26]. Their capacity can be controlled by varying their depth and breadth, and they also make strong and mostly correct assumptions about the nature of images (namely, stationarity of statistics and locality of pixel dependencies).
Our final network contains five convolutional and three fully-connected layers, and this depth seems to be important: we found that removing any convolutional layer (each of which contains no more than 1% of the model’s parameters) resulted in inferior performance.
ImageNet Classification with Deep Convolutional Neural Networks:
Pooling layers in CNNs summarize the outputs of neighboring groups of neurons in the same kernel map. Traditionally, the neighborhoods summarized by adjacent pooling units do not overlap (e.g., [17, 11, 4]). To be more precise, a pooling layer can be thought of as consisting of a grid of pooling units spaced $s$ pixels apart, each summarizing a neighborhood of size $z \times z$ centered at the location of the pooling unit. If we set $s = z$, we obtain traditional local pooling as commonly employed in CNNs. If we set $s < z$, we obtain overlapping pooling. This is what we use throughout our network, with $s = 2$ and $z = 3$. This scheme reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, as compared with the non-overlapping scheme $s = 2, z = 2$, which produces output of equivalent dimensions. We generally observe during training that models with overlapping pooling find it slightly more difficult to overfit.
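A 1-D sketch of the difference between the two schemes, using a 55-wide input for concreteness (the real layers pool 2-D kernel maps; the helper is illustrative):

```python
import numpy as np

def max_pool_1d(x, z, s):
    """Max pooling along one axis: windows of size z spaced s apart (s < z means overlapping)."""
    n_out = (len(x) - z) // s + 1
    return np.array([x[i * s : i * s + z].max() for i in range(n_out)])

x = np.random.randn(55)                        # one row of a 55-wide kernel map
overlapping     = max_pool_1d(x, z=3, s=2)     # the scheme used in the paper
non_overlapping = max_pool_1d(x, z=2, s=2)     # traditional local pooling
print(len(overlapping), len(non_overlapping))  # both 27: equivalent output dimensions
```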
ImageNet Classification with Deep Convolutional Neural Networks:
We trained our models using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. We found that this small amount of weight decay was important for the model to learn. In other words, weight decay here is not merely a regularizer: it reduces the model’s training error.
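Written out, these hyper-parameters correspond to an update of roughly the following form (a sketch; variable and function names are ours, not the paper's):

```python
def sgd_momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One parameter update with momentum and L2 weight decay, matching the
    hyper-parameters quoted above."""
    v = momentum * v - weight_decay * lr * w - lr * grad  # velocity update
    w = w + v                                             # parameter update
    return w, v
```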
ImageNet Classification with Deep Convolutional Neural Networks:
We initialized the weights in each layer from a zero-mean Gaussian distribution with standard deviation 0.01. We initialized the neuron biases in the second, fourth, and fifth convolutional layers, as well as in the fully-connected hidden layers, with the constant 1. This initialization accelerates the early stages of learning by providing the ReLUs with positive inputs. We initialized the neuron biases in the remaining layers with the constant 0.
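A minimal sketch of this initialization for a fully-connected layer with shape `(fan_in, fan_out)`; the `bias_one` flag marks the layers whose biases start at 1:

```python
import numpy as np

def init_layer(shape, bias_one=False, std=0.01):
    """Weights ~ N(0, 0.01); biases 1 for the layers listed above (to give the ReLUs
    positive inputs early on), 0 elsewhere."""
    weights = np.random.normal(0.0, std, size=shape)
    biases = np.ones(shape[-1]) if bias_one else np.zeros(shape[-1])
    return weights, biases
```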
ImageNet Classification with Deep Convolutional Neural Networks:
We used an equal learning rate for all layers, which we adjusted manually throughout training. The heuristic which we followed was to divide the learning rate by 10 when the validation error rate stopped improving with the current learning rate. The learning rate was initialized at 0.01 and reduced three times prior to termination.
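A rough sketch of this manual heuristic; the `patience` knob is our own simplification, not something the paper specifies:

```python
def adjust_learning_rate(lr, val_errors, patience=1):
    """Divide the learning rate by 10 once the validation error stops improving
    at the current rate. `val_errors` is the history of validation error rates."""
    if len(val_errors) > patience and min(val_errors[-patience:]) >= min(val_errors[:-patience]):
        return lr / 10.0
    return lr

lr = 0.01  # initial learning rate; reduced three times before termination in the paper
```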
ImageNet Classification with Deep Convolutional Neural Networks:
The standard way to model a neuron’s output $f$ as a function of its input $x$ is with $f(x) = \tanh(x)$ or $f(x) = (1 + e^{-x})^{-1}$. In terms of training time with gradient descent, these saturating nonlinearities are much slower than the non-saturating nonlinearity $f(x) = \max(0, x)$. Deep convolutional neural networks with ReLUs train several times faster than their equivalents with $\tanh$ units. This plot shows that we would not have been able to experiment with such large neural networks for this work if we had used traditional saturating neuron models.
Figure: A four-layer convolutional neural network with ReLUs (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with $\tanh$ neurons (dashed line). The learning rates for each network were chosen independently to make training as fast as possible. No regularization of any kind was employed. The magnitude of the effect demonstrated here varies with network architecture, but networks with ReLUs consistently learn several times faster than equivalents with saturating neurons.
ReLUs have the desirable property that they do not require input normalization to prevent them from saturating. If at least some training examples produce a positive input to a ReLU, learning will happen in that neuron.
non-saturating nonlinearity: $f(x) = \max(0, x)$
saturating nonlinearity: $f(x) = \tanh(x)$ or $f(x) = (1 + e^{-x})^{-1}$
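For concreteness, the three nonlinearities side by side (a quick sketch):

```python
import numpy as np

def relu(x):     return np.maximum(0.0, x)        # non-saturating: unbounded for x > 0
def tanh(x):     return np.tanh(x)                # saturates at -1 and 1
def logistic(x): return 1.0 / (1.0 + np.exp(-x))  # saturates at 0 and 1

x = np.linspace(-6, 6, 5)
print(relu(x))       # gradient stays 1 for all positive inputs
print(tanh(x))       # gradient vanishes for large |x|
print(logistic(x))
```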
FaceNet: A Unified Embedding for Face Recognition and Clustering:
Some recent work [15] has reduced this dimensionality using PCA, but this is a linear transformation that can be easily learnt in one layer of the network.
The works of [15, 17, 23] all employ a complex system of multiple stages that combines the output of a deep convolutional network with PCA for dimensionality reduction and an SVM for classification.
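To illustrate the first point: a PCA projection is just a matrix multiply, which is exactly what a single linear (fully-connected, no nonlinearity) layer computes. The helpers below are illustrative, not from [15]:

```python
import numpy as np

def pca_projection_matrix(features, out_dim):
    """Top `out_dim` principal directions of an (N, D) feature matrix."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:out_dim].T                   # D x out_dim projection

def linear_layer(x, W, b):
    """A fully-connected layer; with W the projection matrix and b = -(mean @ W),
    it reproduces the PCA dimensionality reduction exactly."""
    return x @ W + b
```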