Dive into Deep Learning (Li Mu): 7.1. Deep Convolutional Neural Networks (AlexNet) — 动手学深度学习 2.0.0 documentation (d2l.ai)
Dive into Deep Learning (Li Mu, PDF): zh-v2.d2l.ai/d2l-zh-pytorch.pdf
Paper reading by Li Mu (video): AlexNet论文逐段精读【论文精读】_哔哩哔哩_bilibili
Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012) 'ImageNet Classification with Deep Convolutional Neural Networks', Advances in Neural Information Processing Systems 25. doi: 10.1145/3065386
Contents
1. Deep Convolutional Neural Networks
1.1. Related Concepts
2. AlexNet
2.1. Conceptual Framework of the Overall Implementation
2.2. Key Ideas of AlexNet
2.3. AlexNet Code Implementation (Dive into Deep Learning, Li Mu, PyTorch)
2.4. Drawbacks and Limitations of AlexNet
3. Notes on the Original AlexNet Paper
3.1. Abstract:
3.2. Introduction:
3.3. The Dataset:
3.4. The Architecture:
3.5. Reducing Overfitting:
3.6. Details of learning:
3.7. Results:
3.8. Discussion:
(1) End-to-End: a system design and development approach that treats the entire system, from its input all the way to its output, as a single whole to be designed and optimized together.
(2) Convolutional neural networks (CNNs) are trained end-to-end, with the convolutional layers embedded inside that single pipeline.
(3) Normalization, standardization, and regularization (a small numeric sketch follows this list):
① Normalization: map/scale values into the range 0~1 according to some rule; for example, image pixel values are usually scaled from 0~255 down to 0~1 to simplify computation. Min-max normalization is the most common choice.
② Standardization: rescale the data so that its mean is 0 and its standard deviation is 1 (a z-score transform); note that this does not by itself make the data normally distributed.
③ Regularization: add a regularization term to the objective, which effectively reduces overfitting.
(4) The first convolutional layer usually has 3 input channels, i.e., the normalized R, G, B values.
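To make the distinction concrete, here is a minimal sketch (my own illustration, not taken from the book) of min-max normalization and standardization applied to a fake image tensor; regularization, by contrast, lives in the loss/optimizer rather than in the data:

import torch

x = torch.randint(0, 256, (3, 224, 224)).float()  # fake RGB image with values in 0~255

# Min-max normalization: scale the values into [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): zero mean, unit standard deviation per channel
mean = x_norm.mean(dim=(1, 2), keepdim=True)
std = x_norm.std(dim=(1, 2), keepdim=True)
x_std = (x_norm - mean) / std

# Regularization is added to the training objective instead, e.g. via weight decay:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=5e-4)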
(1) Build an AlexNet class, or use nn.Sequential to create the sequence of layers, implementing the core ideas of AlexNet.
(2) Split the existing dataset into a training set and a validation set.
① Split each large class folder into training and validation images by a fixed ratio and save them under new file paths (e.g., split 'datasets/dog' (10,000 images) into 'data/train' (containing 9,000 dog images) and 'data/val' (containing 1,000 dog images)).
(3) Normalize the image pixel values.
(4) Crop/resize the images to 224*224 (my code uses transforms.Resize((224, 224)), i.e., scaling rather than cropping).
(5) Convert the images to tensors (e.g., with transforms.ToTensor()); steps (3)–(5) are sketched in the snippet below.
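A minimal sketch of steps (3)–(5) with torchvision.transforms, assuming the 'data/train' and 'data/val' folder layout from step (2); the exact parameters are illustrative only:

import torchvision
from torchvision import transforms

# Resize to 224*224 (scaling, not cropping) and convert to a tensor;
# ToTensor also rescales pixel values from 0~255 to 0~1, which covers step (3)
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Assumed folder layout from step (2); adjust to your own paths
train_set = torchvision.datasets.ImageFolder('data/train', transform=train_transform)
val_set = torchvision.datasets.ImageFolder('data/val', transform=train_transform)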
(6) Train the model (a minimal training-loop sketch follows this sub-list).
① Set the batch size for the training and validation loaders (e.g., batch_size=32; ideally a power of 2, and always keep an eye on your GPU memory usage!).
② Shuffle the training and validation sets (shuffle=True).
③ Define the loss function.
④ Define the optimizer (and set the learning rate, lr).
⑤ Decay the learning rate manually, here halving it every ten epochs (the original AlexNet instead divided the learning rate by 10 whenever the validation error stopped improving).
⑥ ⭐Define the training function (including ⭐backpropagation).
⑦ ⭐Define the validation function.
⑧ Define the plotting function.
⑨ Call the functions to start training.
⑩ Set the number of epochs, initialize the accuracy, and save the weights of the best model.
⑪ Plot the curves.
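A minimal sketch of the core of steps ①–⑩ (the plotting steps are omitted), assuming the net from Section 2.3 and the train_set/val_set built above; names such as num_epochs and 'best_alexnet.pth' are my own placeholders:

import torch
from torch import nn
from torch.utils.data import DataLoader

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
net = net.to(device)  # note: the Section 2.3 net expects 1-channel input; for RGB folders change the first Conv2d to in_channels=3
num_epochs = 20       # placeholder

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)  # ① ②
val_loader = DataLoader(val_set, batch_size=32, shuffle=True)

loss_fn = nn.CrossEntropyLoss()                          # ③
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)   # ④
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)  # ⑤ halve every 10 epochs

best_acc = 0.0
for epoch in range(num_epochs):
    net.train()                                          # ⑥ training pass with backpropagation
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        loss = loss_fn(net(X), y)
        loss.backward()
        optimizer.step()
    scheduler.step()

    net.eval()                                           # ⑦ validation pass
    correct, total = 0, 0
    with torch.no_grad():
        for X, y in val_loader:
            X, y = X.to(device), y.to(device)
            correct += (net(X).argmax(dim=1) == y).sum().item()
            total += y.numel()
    acc = correct / total
    if acc > best_acc:                                   # ⑩ keep the best weights
        best_acc = acc
        torch.save(net.state_dict(), 'best_alexnet.pth')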
(7) Test the model (a short inference sketch follows this sub-list).
① Instantiate the custom AlexNet model.
② Load the saved model weights.
③ Run the validation/test data through it in evaluation mode.
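A short sketch of steps ①–③, reusing the net, device, and val_loader defined above; 'best_alexnet.pth' is the placeholder file name from the training sketch:

import torch

# ① the custom AlexNet model (net) is assumed to be constructed as in Section 2.3
net.load_state_dict(torch.load('best_alexnet.pth', map_location=device))  # ② load the saved weights
net.eval()                                                                # ③ switch to evaluation mode
with torch.no_grad():
    for X, y in val_loader:
        preds = net(X.to(device)).argmax(dim=1)
        # compare preds with y here to accumulate the test accuracy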
(1) Uses a structure split across two connected GPUs.
(2) Uses more convolutional layers.
① The 11*11 in the figure on the right is the size of the convolution kernel, not the size of the processed image.
② A convolution kernel is simply a filter.
③ The values inside the first layer's 96 kernels are initialized randomly (which is why the layer constructor is simply called rather than the weights being set by hand).
(3) Uses the ReLU activation function (in AlexNet every convolutional layer is followed by an activation, but not every convolutional layer has to be; decide based on performance).
(4) Uses dropout to regularize the model, which effectively reduces overfitting (a tiny demo follows this item).
① AlexNet applies dropout in each of the first two fully connected layers, i.e., it sets the output of each hidden neuron to zero with probability 0.5. Neurons that are "dropped out" in this way do not contribute to the forward pass and do not participate in backpropagation (Krizhevsky, Sutskever, & Hinton, 2012).
② Dropout roughly doubles the number of iterations required to converge (Krizhevsky, Sutskever, & Hinton, 2012).
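A tiny demo (my own, not from the book) of nn.Dropout zeroing activations with probability 0.5 in training mode and becoming a no-op in evaluation mode:

import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
print(drop(x))  # roughly half the entries are zeroed; survivors are scaled by 1/(1-p) = 2
drop.eval()
print(drop(x))  # identity at evaluation time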
(5) Gradient descent
"We trained our models using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. We found that this small amount of weight decay was important for the model to learn" (Krizhevsky, Sutskever, & Hinton, 2012). (An equivalent torch.optim.SGD call is sketched below.)
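The same hyperparameters can be written directly with torch.optim.SGD; the initial learning rate of 0.01 is the value reported in the paper, and net is assumed to be the model from Section 2.3:

import torch

optimizer = torch.optim.SGD(net.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)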
(6) Conclusion
In the end, AlexNet won the competition with an error rate well below that of the other entries.
(1) The AlexNet network itself
import torch
from torch import nn
from d2l import torch as d2l
net = nn.Sequential(
    # Use a larger 11*11 window here to capture larger objects.
    # The stride of 4 reduces the output height and width.
    # The number of output channels is also far larger than in LeNet
    nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    # Shrink the convolution window and use padding of 2 so the input and output
    # have the same height and width, while increasing the number of output channels
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    # Three consecutive convolutional layers with a smaller window.
    # Except for the last of them, the number of output channels keeps increasing.
    # No pooling layers are used after the first two of these layers to reduce the height and width
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    # The fully connected layers here are several times larger than in LeNet.
    # Dropout layers are used to mitigate overfitting
    nn.Linear(6400, 4096), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Dropout(p=0.5),
    # Finally the output layer. Since Fashion-MNIST is used here,
    # the number of classes is 10 rather than the paper's 1000
    nn.Linear(4096, 10))
(2) Call AlexNet on a dummy input and print the output shape of every layer (this step is optional, but it is a useful sanity check)
X = torch.randn(1, 1, 224, 224)
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape:\t', X.shape)
(3) Load the dataset (adjust this to however the dataset is stored on your own machine)
batch_size = 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
(4) Train the model
lr, num_epochs = 0.01, 10
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
The d2l.train_ch6 function comes from Dive into Deep Learning (6.6. Convolutional Neural Networks (LeNet)): 6.6. 卷积神经网络(LeNet) — 动手学深度学习 2.0.0 documentation (d2l.ai)
(1) The learning rate has to be adjusted manually.
(2) It uses two GPUs (a workaround for the limited GPUs of the time; this is not necessarily needed today, although the same approach can still be taken when it really is required).
(3) Overly large convolution kernels struggle to extract local features, and the convolution also has far more parameters (source: 深度学习-VGGNet - 知乎 (zhihu.com)); a quick parameter count follows this list.
(4) Training needs more iterations and more time: dropout with p=0.5 roughly doubles the time spent (source: 深度解读与思考——AlexNet深层卷积网络 - 知乎 (zhihu.com)).
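To make point (3) concrete, a quick parameter count (my own illustration) comparing an 11*11 kernel with a 3*3 kernel at the same channel sizes:

import torch
from torch import nn

big = nn.Conv2d(3, 96, kernel_size=11)   # 3*96*11*11 + 96 = 34,944 parameters
small = nn.Conv2d(3, 96, kernel_size=3)  # 3*96*3*3 + 96 = 2,688 parameters
print(sum(p.numel() for p in big.parameters()))
print(sum(p.numel() for p in small.parameters()))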
(1) Introduces the model's parameters, number of output classes, layers, results, and achievements.
(1) Briefly demonstrates how small datasets limit performance.
(2) Draws a conclusion about the significance of large model capacity and discusses related work by others.
(3) Elaborates the contributions and the structure of the paper.
(4) Mentions that GPU memory and compute limit the achievable performance.
(1) Introduces ImageNet and the ILSVRC.
(2) States the two competitions they entered and describes the top-1 and top-5 error rules used in the competition. (Teacher Zhang's interpretation follows.)
(3) Explains how they resize and preprocess the images. (The images are rescaled: a square image is scaled directly to 256*256, while a rectangular image has its shorter side scaled to 256 and the longer side cropped. Apart from subtracting the mean pixel activity over the training set, no other preprocessing is applied, so the network sees the (centred) raw RGB values.)
(1) Points out where the summary of the architecture sits and which of its features are novel or unusual.
(2) ReLU Nonlinearity:
① Lists the tanh and sigmoid functions, but the authors choose ReLU because, in their view, a non-saturating nonlinearity such as ReLU trains faster than those saturating nonlinearities. (A saturating activation squeezes its output f(x) into a bounded range, i.e., it has upper and lower bounds, whereas ReLU is clearly unbounded in the positive x direction and is therefore non-saturating; a tiny numeric check follows this item.)
② They also show that a four-layer convolutional network with ReLU reaches a given training error far faster than the same network with tanh, and then cite others' modifications of the traditional neuron model alongside their own, better-performing choice.
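A tiny numeric check (my own illustration) of the saturation behaviour described in ①: tanh and sigmoid flatten out for large inputs, while ReLU keeps growing:

import torch

x = torch.tensor([-10., -1., 0., 1., 10., 100.])
print(torch.tanh(x))     # saturates at -1 and 1
print(torch.sigmoid(x))  # saturates at 0 and 1
print(torch.relu(x))     # unbounded for positive inputs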
(3) Training on Multiple GPUs:
① Limited by GPU memory, they use two GPUs, each holding half of the kernels/neurons, and the two GPUs communicate only in certain layers. (I think merging everything onto a single GPU today would not make much of a difference.)
② They present their results and show that the two-GPU network trains faster than a one-GPU version.
③ Note: my current understanding is that connecting the two separate fully connected halves loses information, so the connection has to be done twice; that is why the author considers a single GPU preferable. (Not necessarily an authoritative interpretation.)
(4) Local Response Normalization:
① "local normalization scheme aids generalization".
② Introduces how the LRN layer works. (LRN was meant to aid generalization and curb overfitting, but many later studies found that its effect is small, so even modern re-implementations of AlexNet usually leave it out; a hedged sketch of the layer follows this item.)
③ Compares their normalization scheme with others'.
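For reference, PyTorch ships an nn.LocalResponseNorm layer; a sketch using the paper's hyperparameters (n=5, alpha=1e-4, beta=0.75, k=2), applied to a feature map shaped like the output of AlexNet's first convolution:

import torch
from torch import nn

lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2)
x = torch.randn(1, 96, 55, 55)  # e.g. activations after the first conv layer
print(lrn(x).shape)             # normalizes across neighbouring channels; shape is unchanged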
(5) Overlapping Pooling:
① They verify that overlapping pooling performs slightly better than non-overlapping pooling. (But they only compare the case s=2; other pooling configurations would probably need their own experiments to confirm this. A small shape comparison follows this item.)
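A small shape comparison (my own illustration) of AlexNet's overlapping z=3, s=2 pooling against non-overlapping z=2, s=2 pooling:

import torch
from torch import nn

x = torch.randn(1, 96, 55, 55)
overlap = nn.MaxPool2d(kernel_size=3, stride=2)      # z=3, s=2: the windows overlap
non_overlap = nn.MaxPool2d(kernel_size=2, stride=2)  # z=2, s=2: the windows tile the input
print(overlap(x).shape, non_overlap(x).shape)        # both roughly halve the spatial size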
(6) Overall Architecture:
① Explains the broad structure of AlexNet.
② Explains Figure 2 in terms of the two connected GPUs. ("Max-pooling layers, of the kind described in Section 3.4, follow both response-normalization layers as well as the fifth convolutional layer." A classic long sentence, but it becomes easy to parse once the parenthetical clause in the middle is removed.)
③ Elaborates every convolutional layer together with its parameters.
(1) Emphasizes the importance of reducing overfitting when the model has such a large number of parameters.
(2) Data Augmentation:
① Label-preserving transformations are the common approach.
② Generating image translations and horizontal reflections is their first form: at training time they extract random 224*224 patches (and their horizontal reflections) from the 256*256 images; at test time they take the four corner patches and the centre patch plus the horizontal reflections of these five, and average the network's predictions over the resulting ten patches. (A hedged torchvision sketch follows this sub-list.)
③ Altering the intensities of the RGB channels in the training images is the second form; it relies on PCA (Principal Component Analysis) over the RGB pixel values of the training set, adding multiples of the principal components to each image. (I haven't fully learned PCA yet.)
Image source: 主成分分析(PCA)原理详解 - 知乎 (zhihu.com)
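A hedged sketch of the translation/reflection form of augmentation using torchvision transforms (not the authors' original pipeline; the PCA colour augmentation is omitted):

from torchvision import transforms

# Training time: random 224*224 crops from a 256*256 image plus random horizontal flips
train_aug = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Test time: the four corner crops and the centre crop, plus their horizontal reflections
# (ten patches in total); the network's predictions over them would then be averaged
test_aug = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),
])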
(3) Dropout
① Explains how dropout works.
② They use dropout in the first two fully connected layers.
(1) They train the model using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005.
(2) They initialize the weights of every layer from a zero-mean Gaussian distribution with standard deviation 0.01, and they initialize the neuron biases in the second, fourth, and fifth convolutional layers and in the fully connected hidden layers to the constant 1 (the remaining biases to 0). (I still haven't fully understood neuron biases. A small initialization sketch follows this list.)
(3) Mentions how they manually adjust the learning rate (starting from 0.01 and dividing it by 10 whenever the validation error stops improving) and describes the training schedule (roughly 90 passes through the training set).
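A minimal sketch (my own, not the authors' code) of that initialization scheme applied to the nn.Sequential net from Section 2.3; for simplicity it does not single out the layers whose biases get 1 rather than 0:

import torch
from torch import nn

def init_weights(m):
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)  # zero-mean Gaussian, std 0.01
        nn.init.constant_(m.bias, 0.0)  # the paper uses 1 for some layers, 0 for the rest

net.apply(init_weights)  # net is the model defined in Section 2.3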
(1) Shows their performance on ILSVRC-2010.
(2) Explains that the ILSVRC-2012 test-set labels are not publicly available, so test error rates cannot be reported for every model they tried. In addition, they train a variant with an extra, sixth convolutional layer over the last pooling layer and then fine-tune it.
(3) Reports their results with concrete numbers.
(4) Qualitative Evaluations:
① The difference between the kernels learned on GPU 1 and those learned on GPU 2 (largely colour-agnostic versus largely colour-specific). (Li Mu seems to regard this as a coincidence.)
② Explains Figure 4.
③ Images whose feature activation vectors are separated by a small Euclidean distance are considered similar by the higher layers of the network. (I don't really understand Euclidean separation yet.)
④ Computing Euclidean distances between these high-dimensional feature vectors could be made efficient by training an auto-encoder to compress them into short binary codes.
(1) Demonstrates the importance of depth.
(2) Explains that no unsupervised pre-training was used.
(3) Notes that temporal structure, which still images lack, might help when applying large CNNs to video sequences. (A glimpse of future work.)