1. Image Classification and Semantic Segmentation

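A CNN typically follows its convolutional layers with several fully connected layers, which map the feature maps produced by the convolutions to a fixed-length feature vector. Classic CNN architectures such as AlexNet are well suited to image-level classification and regression, because they ultimately produce a single numeric description (a set of probabilities) of the whole input image. For example, the AlexNet ImageNet model outputs a 1000-dimensional, softmax-normalized vector giving the probability that the input image belongs to each class.
FCN classifies an image at the pixel level, which addresses semantic segmentation. Unlike a classic CNN, which uses fully connected layers after the convolutional layers to obtain a fixed-length feature vector for classification (fully connected layers + softmax output), FCN accepts input images of arbitrary size. It uses deconvolution (transposed convolution) layers to upsample the feature map of the last convolutional layer back to the size of the input image, so that a prediction is produced for every pixel while the spatial information of the original input is preserved. Finally, per-pixel classification is performed on the upsampled feature map (see the figure above).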
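
To make the contrast concrete, here is a minimal sketch under stated assumptions. The toy `backbone`, the 5-class setting, and the helper names `dense_scores` and `image_scores` are all made up for illustration, and bilinear interpolation stands in for the transposed convolution that FCN uses. It shows that a 1x1 convolutional head scores every pixel for any input size, while a fully connected head only works for the one size it was built for.

```python
import torch
from torch import nn
import torch.nn.functional as F

num_classes = 5  # illustrative value, not from the article

# Hypothetical toy backbone: one conv layer followed by one 2x pooling step.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
)

# Fully convolutional head: a 1x1 convolution scores every spatial position.
conv_head = nn.Conv2d(8, num_classes, kernel_size=1)

# Fully connected head: its input length is fixed, here sized for 32x32 images.
fc_head = nn.Linear(8 * 16 * 16, num_classes)


def dense_scores(x):
    """Per-pixel class scores, upsampled back to the input resolution."""
    scores = conv_head(backbone(x))
    return F.interpolate(scores, size=x.shape[-2:], mode="bilinear", align_corners=False)


def image_scores(x):
    """One score vector per image; only valid for 32x32 inputs."""
    return fc_head(backbone(x).flatten(1))


for h, w in [(32, 32), (48, 80)]:
    x = torch.randn(1, 3, h, w)
    print("dense:", tuple(dense_scores(x).shape))
    try:
        print("fc:", tuple(image_scores(x).shape))
    except RuntimeError:
        print("fc: input size not supported")
```

The convolutional path returns a `(1, 5, H, W)` tensor for both input sizes, whereas the fully connected path fails on the 48 × 80 input because the flattened feature length no longer matches the layer.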

2. Fully Convolutional Network (FCN)

As can be seen:

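FCN uses a conventional convolutional network (VGG, ResNet, etc.) as its backbone and removes the final fully connected layers.
The convolutional layers (conv1, conv2, ...) leave the spatial size unchanged, while each pooling layer halves the feature map size.
After pool3 the feature map is 1/8 the size of the original image; after pool4 it is 1/16; after pool5 it is 1/32.
The 1/32 feature map from pool5 (at this point better called a heat map) is upsampled (by transposed convolution or bilinear interpolation) to 1/16 and added element-wise to the 1/16 map from pool4. The resulting 1/16 heat map is upsampled again to 1/8 and added element-wise to the 1/8 map from pool3.
Finally, the fused 1/8 heat map is upsampled back to the original image size; its number of channels should equal the number of classes.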
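
The fusion steps above can be sketched with plain tensors. This is only an illustration under assumptions: random tensors stand in for the per-class score maps at pool3, pool4 and pool5, the helper name `fuse_fcn8s` is hypothetical, and bilinear interpolation is used for every upsampling step (the text allows either transposed convolution or bilinear interpolation).

```python
import torch
import torch.nn.functional as F


def fuse_fcn8s(score_pool3, score_pool4, score_pool5, out_size):
    """Fuse 1/8, 1/16 and 1/32 score maps, then upsample to out_size."""
    # 1/32 -> 1/16, then add the pool4 scores
    up5 = F.interpolate(score_pool5, size=score_pool4.shape[-2:],
                        mode="bilinear", align_corners=False)
    fused16 = score_pool4 + up5
    # 1/16 -> 1/8, then add the pool3 scores
    up16 = F.interpolate(fused16, size=score_pool3.shape[-2:],
                         mode="bilinear", align_corners=False)
    fused8 = score_pool3 + up16
    # 1/8 -> full resolution
    return F.interpolate(fused8, size=out_size, mode="bilinear", align_corners=False)


num_classes, H, W = 21, 224, 224  # illustrative sizes
p3 = torch.randn(1, num_classes, H // 8, W // 8)    # 28 x 28
p4 = torch.randn(1, num_classes, H // 16, W // 16)  # 14 x 14
p5 = torch.randn(1, num_classes, H // 32, W // 32)  # 7 x 7

out = fuse_fcn8s(p3, p4, p5, (H, W))
print(out.shape)  # torch.Size([1, 21, 224, 224])
```

Element-wise addition only works because every map has already been reduced to `num_classes` channels, which is why the score maps need a common channel count before they are fused.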

3. PyTorch Implementation of FCN

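import numpy as np
import torch
from torch import nn
from torchvision.models import resnet34


# Build a bilinear interpolation kernel, used to initialize the transposed convolution weights
def bilinear_kernel(in_channels, out_channels, kernel_size):
   factor = (kernel_size + 1) // 2
   if kernel_size % 2 == 1:
       center = factor - 1
   else:
       center = factor - 0.5
   og = np.ogrid[:kernel_size, :kernel_size]
   # 2D bilinear filter: product of two 1D triangle functions
   filt = (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)
   weight = np.zeros((in_channels, out_channels, kernel_size, kernel_size), dtype='float32')
   # Fill only the "diagonal" (channel i -> channel i); assumes in_channels == out_channels
   weight[range(in_channels), range(out_channels), :, :] = filt
   return torch.from_numpy(weight)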


class fcn(nn.Module):
   def __init__(self, num_classes):
       super(fcn, self).__init__()
       # ImageNet-pretrained backbone (newer torchvision versions prefer the `weights` argument)
       pretrained_net = resnet34(pretrained=True)
       self.stage1 = nn.Sequential(*list(pretrained_net.children())[:-4]) # stage 1: up to layer2, 128 channels, 1/8
       self.stage2 = list(pretrained_net.children())[-4] # stage 2: layer3, 256 channels, 1/16
       self.stage3 = list(pretrained_net.children())[-3] # stage 3: layer4, 512 channels, 1/32
       
       # 1x1 convolutions that map every stage to num_classes channels
       self.scores1 = nn.Conv2d(512, num_classes, 1)
       self.scores2 = nn.Conv2d(256, num_classes, 1)
       self.scores3 = nn.Conv2d(128, num_classes, 1)
       
       # 8x upsampling
       self.upsample_8x = nn.ConvTranspose2d(num_classes, num_classes, 16, 8, 4, bias=False)
       self.upsample_8x.weight.data = bilinear_kernel(num_classes, num_classes, 16) # bilinear kernel
       
       # 2x upsampling (note: despite its name, upsample_4x also upsamples by 2x; it is the 1/16 -> 1/8 step)
       self.upsample_4x = nn.ConvTranspose2d(num_classes, num_classes, 4, 2, 1, bias=False)
       self.upsample_4x.weight.data = bilinear_kernel(num_classes, num_classes, 4) # bilinear kernel
       self.upsample_2x = nn.ConvTranspose2d(num_classes, num_classes, 4, 2, 1, bias=False)   
       self.upsample_2x.weight.data = bilinear_kernel(num_classes, num_classes, 4) # bilinear kernel

       
   def forward(self, x):
       x = self.stage1(x)
       s1 = x # 1/8
       
       x = self.stage2(x)
       s2 = x # 1/16
       
       x = self.stage3(x)
       s3 = x # 1/32
       
       s3 = self.scores1(s3)
       s3 = self.upsample_2x(s3) # 1/32 -> 1/16
       s2 = self.scores2(s2)
       s2 = s2 + s3
       
       s1 = self.scores3(s1)
       s2 = self.upsample_4x(s2) # 1/16 -> 1/8
       s = s1 + s2

       s = self.upsample_8x(s) # 1/8 -> 1/1
       return s
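
The kernel size, stride and padding of the transposed convolutions above are what make them exact 2x and 8x upsamplers. As a quick sanity check (a sketch only; the helper name `deconv_out_size` is made up here), the output size of `nn.ConvTranspose2d` with default dilation and output padding is `(in - 1) * stride - 2 * padding + kernel_size`, which can be compared against real layers:

```python
import torch
from torch import nn


def deconv_out_size(in_size, kernel_size, stride, padding):
    # Output size of nn.ConvTranspose2d with dilation=1 and output_padding=0
    return (in_size - 1) * stride - 2 * padding + kernel_size


in_size = 8
for k, s, p in [(4, 2, 1), (16, 8, 4)]:
    layer = nn.ConvTranspose2d(1, 1, k, s, p, bias=False)
    y = layer(torch.randn(1, 1, in_size, in_size))
    assert y.shape[-1] == deconv_out_size(in_size, k, s, p)
    print(f"kernel={k}, stride={s}, padding={p}: {in_size} -> {y.shape[-1]}")
```

With an 8 × 8 input, the `(4, 2, 1)` setting yields 16 × 16 and the `(16, 8, 4)` setting yields 64 × 64, matching the 2x and 8x factors used in the network.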

Let's fabricate a batch of images, feed it through the network, and look at the output.

x = torch.randn(1, 3, 64, 64) # fake image batch
num_classes = 21
net = fcn(num_classes)
y = net(x)
print(y.shape)

Output:
torch.Size([1, 21, 64, 64])
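As shown, the output is a tensor with the same 64 × 64 spatial size as the input and 21 channels, one per label class (the VOC dataset has 21 classes, including background).
Finally, at each pixel position the 21 channels hold one score per class (a softmax over the channel dimension turns these scores into probabilities). Taking the index of the channel with the highest score at each position yields a 64 × 64 index matrix, from which the class of each pixel can be read off.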
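
This last step can be sketched in a few lines. The snippet below uses a random tensor as a stand-in for the network output (an assumption, so that it runs without the pretrained backbone); with the real model, `scores` would be `net(x)`.

```python
import torch

num_classes = 21
scores = torch.randn(1, num_classes, 64, 64)  # stand-in for the network output

# Softmax over the channel dimension gives per-pixel class probabilities.
probs = torch.softmax(scores, dim=1)

# Softmax is monotonic, so the highest raw score and the highest probability
# pick the same class; argmax over the raw scores is therefore enough.
pred = scores.argmax(dim=1)

print(pred.shape)  # torch.Size([1, 64, 64])
```

Each entry of `pred` is an integer in `[0, num_classes - 1]` that identifies the predicted class of the corresponding pixel.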
