The previous posts covered the principles, computation, and implementation of common deep learning building blocks: affine (fully connected) layers, convolution and pooling layers, recurrent (RNN) layers, plus batch normalization, dropout, activation functions, and loss functions. I also tried using these hand-written components to build deep networks for some practical problems.
Linear layers, activation functions, loss functions
Convolutional and pooling layers
BPTT
Dropout, Batch Normalization
Today deep learning is a hot research direction in computer science, and its main application domains are CV and NLP. Conventional application scenarios such as recommender systems and data analysis, by contrast, are not a great fit for DL. The main reason is that deep learning pushes hard for end-to-end modeling: it saves the manual modeling and feature engineering steps, but it also makes models far less interpretable. When such a model is deployed, it may not only run into unpredictable, serious failures in real scenarios, it is also hard to turn into pretty slides for fundraising (^^). Clients want clear, highly distilled features (e.g. from XGBoost) and models even easier to understand than decision trees (logistic regression), so an unstable model like a deep network is a poor fit for production. It is fine for Kaggle, though.
CV is a bit different. The data it handles are images, and deep convolution is the most powerful tool for extracting features from images; features obtained by traditional methods are not necessarily more interpretable than those learned by deep networks, which is why deep learning dominates CV. Today, in the ImageNet era, ordinary needs can usually be met by fine-tuning a pretrained model.
Here I will briefly go over approaches to the most basic CV problem, n-way image classification. I implemented a simple CNN in an earlier post, but getting better results requires some additional CNN design tricks.
VGG modifies AlexNet and is considerably deeper. It replaces large convolution kernels with stacks of small ones: 5x5, 7x7, and 11x11 kernels can all be replaced by several stacked 3x3 kernels (two stacked 3x3 layers cover the same receptive field as one 5x5 layer, three cover a 7x7). Pooling is still 2x2 max pooling with stride 2. VGG does not use many more parameters than AlexNet, and although it costs more computation (each layer is a separate matrix operation), it far outperforms AlexNet, demonstrating that depth improves model performance.
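As a quick sketch (with an illustrative channel count, not a value from the paper), stacking two 3x3 convolutions covers the same 5x5 receptive field with fewer parameters and an extra non-linearity in between:

import torch.nn as nn

C = 64  # illustrative channel count

# one 5x5 convolution: 5*5*C*C weights (plus biases)
conv5 = nn.Conv2d(C, C, kernel_size=5, padding=2)

# two stacked 3x3 convolutions: same 5x5 receptive field,
# only 2*3*3*C*C weights, with one more ReLU in between
conv3x2 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1),
)

print(sum(p.numel() for p in conv5.parameters()))    # 102464
print(sum(p.numel() for p in conv3x2.parameters()))  # 73856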
Another VGG design point concerns the fully connected classification layers that follow convolutional feature extraction: they can be replaced by convolutions whose kernels are exactly as large as the feature map. Convolving a CxHxW feature map with KC kernels of shape CxHxW produces KC outputs, and this computation is identical to flattening the feature map to 1-D and applying a dense layer. So to some extent we can substitute convolutional layers for linear fully connected layers. The benefit is that the network can accept images of any size, as long as the resulting feature map is no smaller than the kernel, which removes the restriction on input dimensions.
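A minimal sketch of this equivalence (shapes are illustrative, not taken from VGG): once the weights are shared, a Conv2d whose kernel covers the whole feature map computes exactly what flatten + Linear computes.

import torch
import torch.nn as nn

x = torch.randn(1, 64, 7, 7)             # illustrative C x H x W feature map
fc = nn.Linear(64 * 7 * 7, 10)           # flatten + dense classifier
conv = nn.Conv2d(64, 10, kernel_size=7)  # kernel as large as the feature map

# copy the dense weights into the conv kernel so both paths share parameters
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(10, 64, 7, 7))
    conv.bias.copy_(fc.bias)

out_fc = fc(x.flatten(1))      # shape (1, 10)
out_conv = conv(x).flatten(1)  # shape (1, 10)
print(torch.allclose(out_fc, out_conv, atol=1e-4))  # True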
GoogLeNet uses a technique called Inception, which aims to cut down the number of channels and thereby the amount of computation. Concretely, take a feature map with M channels and, at every spatial position, linearly transform the vector of channel values with a matrix multiplication into N channels, where N is smaller than M. This is exactly a convolution with a 1x1 kernel. Keeping the output channel count N below M is the core of the Inception bottleneck.
For example, a 3x3 convolution with 512 input and 512 output channels needs 9x512x512 multiplications per output position. In a convolution, each position of an output feature map is connected to all input feature maps, a densely connected structure. GoogLeNet builds on the idea that most activations in a deep network are unnecessary (zero) or redundant due to correlations, so the most efficient architecture would connect activations sparsely: the 512 output maps do not really need to be connected to all 512 input maps. With an Inception bottleneck we can first compress the feature map to n channels using a cheap 1x1 convolution (1x512xn operations), then produce the 512 output channels with the 3x3 kernel, for a total of only 1x512xn + 9xnx512 operations. This acts as a kind of indirect pruning. A 1x1 convolution can change the channel count at will, though it is rarely used to increase it: speaking intuitively rather than mathematically, reducing the channel count does not necessarily lose information, while increasing it certainly cannot add any.
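A minimal sketch of such a bottleneck in PyTorch (n = 64 is an arbitrary illustrative choice):

import torch.nn as nn

# direct 3x3, 512 -> 512 channels: 9*512*512 ≈ 2.36M mults per output position
# 1x1 bottleneck (512 -> 64) then 3x3 (64 -> 512):
#   1*512*64 + 9*64*512 ≈ 0.33M mults per position, roughly 7x cheaper
bottleneck = nn.Sequential(
    nn.Conv2d(512, 64, kernel_size=1),             # compress channels
    nn.ReLU(),
    nn.Conv2d(64, 512, kernel_size=3, padding=1),  # convolve in the cheap space
    nn.ReLU(),
)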
Another innovation in GoogLeNet is average pooling: it applies average pooling to the final feature maps, compressing each HxW map down to 1x1. The fully connected layer then receives much less input and uses far fewer parameters, which benefits both training and deployment. In practice GoogLeNet beat VGG in both speed and accuracy.
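In PyTorch, global average pooling is a one-liner; a small sketch with illustrative shapes:

import torch
import torch.nn as nn

gap = nn.AdaptiveAvgPool2d(1)     # C x H x W -> C x 1 x 1 for any H, W
feat = torch.randn(1, 512, 7, 7)  # illustrative final feature map
print(gap(feat).shape)            # torch.Size([1, 512, 1, 1])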
There are many efficient classic CNN architectures, including the two above. But with any of these techniques, deeper networks face the vanishing gradient problem, which makes training difficult. Even VGG and GoogLeNet are only around twenty layers; once we want networks dozens of layers deep, things are no longer simple. The batch normalization discussed earlier mitigates this to some degree, but treats the symptom rather than the cause. To solve it, Kaiming He's team proposed ResNet in 2015, which allows cross-layer connections so that gradients can propagate more effectively to earlier layers.
y = f(x) + x
From the forward-propagation perspective, ResNet makes the mapping closer to an identity mapping. When the ideal mapping is very close to the identity, the residual form makes it easy to capture small deviations from the identity. In other words, we expect that learning a mapping t(x) = f(x) + x is easier than learning t(x) = f(x) directly, and practice confirms this.
The skip connection uses tensor addition, and the addition is impossible if the two sides have different shapes. So the layers being skipped should leave the feature map's height and width unchanged, and the channel counts should match too. This is very similar to VGG's design, so in practice we usually combine these two design ideas when building a network.
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
# 3x3 convolution; padding=1 keeps the spatial size when stride=1
def conv3x3(in_channels, out_channels, stride=1):
    return nn.Conv2d(in_channels, out_channels, kernel_size=3,
                     stride=stride, padding=1, bias=False)
# Residual block
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        """
        Initializes internal Module state, shared by both nn.Module and ScriptModule.
        """
        super(ResidualBlock, self).__init__()
        self.block = nn.Sequential(
            conv3x3(in_channels, out_channels, stride=stride),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
            conv3x3(out_channels, out_channels),
            nn.BatchNorm2d(out_channels)
        )
        self.downsample = downsample

    def forward(self, x):
        """
        Defines the computation performed at every call.
        x: N * C * H * W
        """
        # if the conv path changes the size of x, use downsample to bring
        # the identity branch to the same size before adding
        residual = x
        out = self.block(x)
        if self.downsample is not None:
            residual = self.downsample(x)
        out += residual
        out = F.relu(out)
        return out
The two convolutions may change the size of the tensor, and the two branches can only be added when their sizes match. So whenever the output tensor's size differs from the input's, we use downsample (passed in from outside) to bring the identity branch to the same size.
The figure above shows the ResNet-18 architecture; let's follow it to build a residual network for CIFAR-10 classification.
class ResNet(nn.Module):
    def __init__(self, block, layers, num_classes=10):
        """
        Initializes internal Module state, shared by both nn.Module and ScriptModule.
        """
        super(ResNet, self).__init__()
        self.in_channels = 64
        # part 1: stem convolution
        self.conv_in = nn.Sequential(
            conv3x3(3, self.in_channels),
            nn.BatchNorm2d(self.in_channels),
            nn.ReLU()
        )
        # part 2: four stages of residual blocks
        self.layer1 = self.make_layer(block, 64, num_blocks=layers[0])
        self.layer2 = self.make_layer(block, 128, num_blocks=layers[1], stride=2)
        self.layer3 = self.make_layer(block, 256, num_blocks=layers[2], stride=2)
        self.layer4 = self.make_layer(block, 512, num_blocks=layers[3], stride=2)
        # part 3: pooling and classifier
        self.dropout10 = nn.Dropout(0.1)
        self.dropout50 = nn.Dropout(0.5)
        self.avgpool = nn.AvgPool2d(4, 4)
        self.fc = nn.Linear(512, num_classes)

    def make_layer(self, block, out_channels, num_blocks, stride=1):
        """
        make a layer with num_blocks blocks.
        """
        downsample = None
        if (stride != 1) or (self.in_channels != out_channels):
            # use a strided conv to downsample the identity branch
            downsample = nn.Sequential(
                conv3x3(self.in_channels, out_channels, stride=stride),
                nn.BatchNorm2d(out_channels))
        # first block carries the stride and the downsample
        layers = []
        layers.append(block(self.in_channels, out_channels, stride, downsample))
        self.in_channels = out_channels
        # add the remaining num_blocks - 1 blocks
        for i in range(1, num_blocks):
            layers.append(block(out_channels, out_channels))
        # return a stage containing all blocks
        return nn.Sequential(*layers)

    def forward(self, x):
        """
        Defines the computation performed at every call.
        """
        out = self.conv_in(x)
        out = self.layer1(out)
        out = self.dropout10(out)
        out = self.layer2(out)
        out = self.dropout10(out)
        out = self.layer3(out)
        out = self.dropout10(out)
        out = self.layer4(out)
        out = self.avgpool(out)
        # view: flatten the 4-d feature map to 2 dimensions (N, C)
        out = out.view(out.size(0), -1)
        out = self.dropout50(out)
        out = self.fc(out)
        return out
resnet = ResNet(ResidualBlock, [2, 2, 2, 2]) #ResNet18
def train(model, train_loader, loss_func, optimizer, device):
    """
    train the model for one epoch using loss_func and optimizer.
    model: CNN network
    train_loader: a DataLoader object with training data
    loss_func: loss function
    device: train on cpu or gpu device
    """
    model.train()
    total_loss = 0
    # train the model using minibatches
    for i, (images, targets) in enumerate(train_loader):
        images = images.to(device)
        targets = targets.to(device)
        # forward
        outputs = model(images)
        loss = loss_func(outputs, targets)
        # backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        # every 100 iterations, print the loss
        if (i + 1) % 100 == 0:
            print("Step [{}/{}] Train Loss: {:.4f}"
                  .format(i + 1, len(train_loader), loss.item()))
    return total_loss / len(train_loader)
def evaluate(model, val_loader, device):
    """
    model: CNN network
    val_loader: a DataLoader object with validation data
    device: evaluate on cpu or gpu device
    return classification accuracy of the model on val dataset
    """
    # evaluate the model
    model.eval()
    # context manager that disables gradient computation
    with torch.no_grad():
        correct = 0
        total = 0
        for i, (images, targets) in enumerate(val_loader):
            # device: cpu or gpu
            images = images.to(device)
            targets = targets.to(device)
            outputs = model(images)
            # torch.max returns the maximum value of each row of the input
            # tensor in the given dimension dim; the second return value is
            # the index location of each maximum value found (argmax)
            _, predicted = torch.max(outputs, dim=1)
            correct += (predicted == targets).sum().item()
            total += targets.size(0)
        accuracy = correct / total
        print('Accuracy on Test Set: {:.4f} %'.format(100 * accuracy))
        return accuracy
def save_model(model, save_path):
    # save the model parameters (state_dict) to save_path
    torch.save(model.state_dict(), save_path)
import matplotlib.pyplot as plt

def show_curve(ys, title):
    """
    plot curve for loss or accuracy
    Args:
        ys: loss or acc list
        title: loss or accuracy
    """
    x = np.array(range(len(ys)))
    y = np.array(ys)
    plt.plot(x, y, c='b')
    plt.axis()
    plt.title('{} curve'.format(title))
    plt.xlabel('epoch')
    plt.ylabel('{}'.format(title))
    plt.show()
def fit(model, num_epochs, optimizer, device):
    """
    train and evaluate a classifier num_epochs times.
    We use the optimizer and cross entropy loss to train the model.
    Args:
        model: CNN network
        num_epochs: the number of training epochs
        optimizer: optimizer for the loss function
        device: train/evaluate on cpu or gpu device
    """
    # loss and optimizer
    loss_func = nn.CrossEntropyLoss()
    model.to(device)
    loss_func.to(device)
    # log train loss and test accuracy
    losses = []
    accs = []
    for epoch in range(num_epochs):
        print('Epoch {}/{}:'.format(epoch + 1, num_epochs))
        # train step
        loss = train(model, train_loader, loss_func, optimizer, device)
        losses.append(loss)
        # evaluate step
        accuracy = evaluate(model, test_loader, device)
        accs.append(accuracy)
    # show curves
    show_curve(losses, "train loss")
    show_curve(accs, "test accuracy")
from torch.utils.data import Dataset
import torch.utils.data as Data
import torchvision.transforms as transforms
import torchvision

BATCH_SIZE = 100

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

train_data = torchvision.datasets.CIFAR10(
    root='E:/cifar10',
    train=True,
    transform=transform
)
train_loader = Data.DataLoader(
    dataset=train_data,
    batch_size=BATCH_SIZE,
    shuffle=True
)
test_data = torchvision.datasets.CIFAR10(
    root='E:/cifar10',
    train=False,
    transform=transform
)
test_loader = Data.DataLoader(test_data, batch_size=BATCH_SIZE,
                              shuffle=False)
# resume from a previously saved checkpoint; skip this line to train from scratch
resnet.load_state_dict(torch.load('resnet18.cpk'))
# Hyper-parameters
num_epochs = 5
lr = 0.001
# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# optimizer
optimizer = torch.optim.Adam(resnet.parameters(), lr=lr)
fit(resnet, num_epochs, optimizer, device)
The network is fairly deep and needs more training epochs than before, so a GPU is strongly recommended. Running 15 epochs here, the accuracy climbs steadily past 85% and still has room to grow. The best result in my experiments was 92% accuracy at 30 epochs, roughly reproducing the level reported in the paper.
Compare this with the simple 4-convolution-layer network implemented earlier, which could only push the accuracy a bit above 70%. This again demonstrates the effect of depth on performance.
Squeeze-and-Excitation is a relatively recent (2017) trick for improving the residual blocks of a ResNet. It applies an extra global pooling to the residual block's output, then an encode-decode pair of fully connected layers, to produce an importance score for each channel. Since this importance should be reflected in the output, the score is multiplied directly onto the output.
This idea resembles the attention mechanism we will see later with RNNs; it is an end-to-end design. It may seem surprising, but in practice this method reliably improves model performance.
class SELayer(nn.Module):
    def __init__(self, channel, reduction=16):
        super(SELayer, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        # squeeze to channel // reduction, then excite back to channel
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x).view(b, c)  # global descriptor per channel
        y = self.fc(y).view(b, c, 1, 1)  # per-channel importance in (0, 1)
        return x * y.expand_as(x)        # rescale each channel
class SEResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, downsample=None, reduction=16):
        super(SEResidualBlock, self).__init__()
        self.res = ResidualBlock(in_channels, out_channels, stride, downsample)
        self.se = SELayer(out_channels, reduction)

    def forward(self, x):
        residual = x
        # run only the conv path of the residual block, rescale its output
        # with SE, then add the identity branch (downsampled if sizes differ)
        out = self.res.block(x)
        out = self.se(out)
        if self.res.downsample is not None:
            residual = self.res.downsample(x)
        out += residual
        return F.relu(out)
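Since SEResidualBlock keeps the same constructor signature as ResidualBlock, it can be dropped straight into the ResNet skeleton above. A usage sketch (not from the original post), reusing the earlier training setup:

# build an SE-ResNet-18 by swapping the block type; training code is unchanged
se_resnet = ResNet(SEResidualBlock, [2, 2, 2, 2])
optimizer = torch.optim.Adam(se_resnet.parameters(), lr=lr)
fit(se_resnet, num_epochs, optimizer, device)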