“ImageNet Classification with Deep Convolutional Neural Networks” is a very influential computer-vision paper published in 2012. Below I briefly lay out its main points; if there are mistakes, corrections are welcome.
As described in my earlier blog post on how to read papers efficiently, reading a paper is done in three passes:
Abstract -> Discussion -> figures/charts.
After the first pass, we know that the authors used CNNs to beat the other models in the competition, with results several times better than theirs. However, some details were unclear even to the authors and are left for future researchers to investigate.
The goal is object recognition with a strong model that avoids overfitting, trained on a large dataset. CNNs scale up easily but also overfit easily; thanks to GPUs, training large CNNs has become practical. The authors trained a very large neural network and achieved remarkably good results. The network has 5 convolutional layers and 3 fully connected layers; they found that depth matters a great deal: removing any single layer hurts performance. They also used unusual features and new techniques to reduce overfitting.
The dataset has 15 million images in about 20,000 categories. The images do not come at a fixed resolution, so the authors resized each one to 256 × 256: the shorter side is rescaled to 256 (preserving the aspect ratio), and if the longer side still exceeds 256, the image is cropped to 256 around its center.
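The resize-then-center-crop step above can be sketched in a few lines. This is a minimal illustration of the arithmetic, not the paper's actual preprocessing code; the function names are my own.

```python
def rescaled_size(width, height, short_side=256):
    """Scale so the shorter side becomes `short_side`, keeping the aspect ratio."""
    if width < height:
        return short_side, round(height * short_side / width)
    return round(width * short_side / height), short_side

def center_crop_box(width, height, crop=256):
    """Corner coordinates (left, top, right, bottom) of a centered crop."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop

# e.g. a 512 x 384 photo: the short side 384 shrinks to 256, the long side
# scales to 341, and then the centered 256 x 256 region is kept.
w, h = rescaled_size(512, 384)
print(w, h)                    # 341 256
print(center_crop_box(w, h))   # (42, 0, 298, 256)
```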
Key point ⚠️: only the raw pixels are used; no hand-crafted feature extraction is done. This is the so-called End to End approach: the raw image (or raw text) goes straight in, with no manual feature engineering, and the neural network learns the features by itself.
The cubes in the architecture figure show the size of each layer's input and output. The input is a 224 × 224 × 3 image. The first convolutional layer (11 × 11 convolution window, stride 4) produces 48 channels on each GPU.
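The spatial size of a convolution's output follows the standard formula floor((W − K + 2P) / S) + 1. A small sketch of this arithmetic; note a well-known quirk: the paper states a 224 × 224 input, but with an 11 × 11 kernel and stride 4 the commonly quoted 55 × 55 first-layer output only comes out for a 227 × 227 input (or equivalent padding).

```python
def conv_out(size, kernel, stride, padding=0):
    """Output spatial size of a convolution: floor((W - K + 2P) / S) + 1."""
    return (size - kernel + 2 * padding) // stride + 1

# First AlexNet convolutional layer: 11 x 11 window, stride 4.
print(conv_out(227, 11, 4))   # 55  (the commonly quoted first-layer output size)
print(conv_out(224, 11, 4))   # 54  (with the 224 stated in the paper and no padding)
```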
On the engineering side, the network is split across two GPUs (GPU0 and GPU1), each holding its own convolutional layers. Each GPU has 5 convolutional layers, and each layer's output is the next layer's input. Between convolutional layers 2 and 3, GPU0 and GPU1 communicate: their outputs are merged along the channel dimension. For the other transitions (1 to 2, 3 to 4, 4 to 5), each GPU learns on its own. The height and width of the feature maps change from layer to layer.
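The merge along the channel dimension can be pictured as a simple concatenation of the two GPUs' feature maps. A minimal numpy sketch, assuming the standard AlexNet shapes (conv2 produces 128 channels of 27 × 27 per GPU; the shapes here are illustrative):

```python
import numpy as np

# Each GPU holds half of conv2's output: 128 feature maps of 27 x 27.
gpu0_out = np.zeros((128, 27, 27))
gpu1_out = np.zeros((128, 27, 27))

# The communication step: conv3 sees both halves, merged along the channel axis.
merged = np.concatenate([gpu0_out, gpu1_out], axis=0)
print(merged.shape)   # (256, 27, 27)
```

In modern frameworks the same two-GPU split is usually emulated with grouped convolutions (e.g. two groups) rather than a literal device split.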
Key property ⚠️: as the height and width of the feature maps slowly shrink, the depth slowly grows. That is, spatial information is gradually compressed as we go deeper in the network (from 224 × 224 down to 13 × 13, so each pixel in the 13 × 13 map summarizes a large patch of the original image), while the number of channels gradually increases. Each channel corresponds to some particular pattern: with 192 channels you can loosely think of the network as recognizing 192 different patterns in the image, each channel detecting something like a cat's leg, an edge, or some other feature. After that, the input to each GPU's fully connected layers is the merged output of the fifth convolutional layers of both GPUs; each GPU first does its fully connected computation independently, and the results are finally combined at the classification layer into a vector of length 4096. If the 4096-dimensional vectors of two images are very close, the two images likely show the same object; this 4096-dimensional vector represents the semantic information of an image well.
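The claim that two images with nearby 4096-dimensional vectors show the same object is typically checked with a similarity measure such as cosine similarity. A tiny illustrative sketch; the vectors here are made-up stand-ins, not real AlexNet features:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for 4096-dimensional feature vectors.
cat_1 = [0.9, 0.1, 0.0, 0.2]
cat_2 = [0.8, 0.2, 0.1, 0.2]   # a similar image yields similar features
car   = [0.0, 0.1, 0.9, 0.0]

print(cosine_similarity(cat_1, cat_2) > cosine_similarity(cat_1, car))  # True
```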
A flaw of the model: AlexNet uses three fully connected layers (the last one is the output layer; the two huge 4096-wide layers in the middle are a major bottleneck, making the model so large that the whole network cannot fit on a single GPU).
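The bottleneck shows up in simple parameter arithmetic. Assuming the standard AlexNet shapes (conv5 output pooled to 6 × 6 × 256, two 4096-wide hidden layers, a 1000-class output; biases ignored):

```python
# Weight counts for AlexNet's three fully connected layers.
fc6 = 6 * 6 * 256 * 4096   # flattened conv features -> first 4096-wide layer
fc7 = 4096 * 4096          # second 4096-wide layer
fc8 = 4096 * 1000          # 1000-class output layer

print(fc6, fc7, fc8)                 # 37748736 16777216 4096000
print((fc6 + fc7 + fc8) // 10**6)    # ~58 million weights in the FC layers alone
```

That is the bulk of AlexNet's roughly 60 million parameters, which is why the fully connected layers, not the convolutions, dominate the memory footprint.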
An intuitive picture of overfitting: you are given some practice problems and simply memorize them, without understanding what the problems are really about. So of course you do poorly on the exam.
Thanks to Mr. Li Mu for sharing; see the video walkthroughs at the links below:
Pass 1
Pass 2