Author: 心有宝宝人自圆
Notice: You are welcome to repost the images or text in this article; please credit the source.
Preface
Inspired by those who came before me, I've decided to start writing articles to document what I'm learning.
I've read some papers and written some code before; I'll slowly fill in those gaps later.
For now, let me share what I've been studying recently.
Here I share my own understanding and takeaways; if anything is wrong or poorly understood, please point it out.
This article is the follow-up to "SSD: Single Shot MultiBox Detector, Part 1: Paper Reading" — filling in that gap as promised......
Paper: SSD: Single Shot MultiBox Detector
Our goal: implement SSD in PyTorch.
I am using python-3.6 + pytorch-1.3.0 + torchvision-0.4.1
Training set: VOC2007 trainval, VOC2012 trainval
Test set: VOC2007 test
The object categories are as follows, 20 classes + 1 (background class):
('aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat',
'chair', 'cow', 'diningtable','dog', 'horse', 'motorbike', 'person',
'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor')
The images below are detection results after training for 45 epochs — quite a bit short of the authors' 200+ epochs (training is rather time-consuming), but the results look decent. A few randomly chosen test-set images are shown so you can judge for yourself.
0. Recap of the paper's key concepts
- single-shot vs two-stage: a typical two-stage model (the R-CNN family) follows the pipeline the SSD paper describes: lots of multi-scale region proposals, a CNN to extract features, a high-quality classifier for classification, regression to predict box locations, and so on. In short it suffers from an accuracy-speed trade-off, and its heavy computational cost makes it unsuitable for real-time detection in the real world. SSD removes the most expensive step — region-proposal selection and resampling — and instead uses fixed anchor boxes built into the model, which lets us detect objects both fast and accurately.
- Fixed anchor boxes (default boxes, priors): in my earlier paper-reading post, most of the preparation work revolved around the anchor boxes. Their design is crucial to training, because they are turned into the ground-truth targets (offset + label). The anchors are fixed inside the SSD model beforehand (hence "priors") and are identified by (aspect ratio, scale). Since the anchors correspond to feature maps at different levels, higher-level maps use larger scales and lower-level maps smaller ones (every prediction is made with respect to a prior).
- Multi-scale feature maps and predictors: SSD makes predictions on feature maps at several levels, appending the predictors after the truncated base net. Low-level maps mainly detect smaller objects and high-level maps larger ones; the predictor at each scale learns to detect objects at that scale. Because a single pixel's receptive field is larger on higher-level maps, the convolution kernels can be kept small and of fixed size.
- Hard negative mining: during training SSD produces a huge number of negatives, which severely unbalances the positive/negative ratio, so we explicitly pick only a certain proportion of the negatives with the highest confidence loss rather than using all of them.
- Non-maximum suppression: keep only the most confident predicted box and remove overlapping, redundant boxes.
The overall workload is fairly large; I'll try to keep the comments clear.
Remember to define the global variable (and the imports used throughout):
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
1. Start with the anchor boxes (the fixed default boxes of the paper — "priors" from here on)
import math
import matplotlib.pyplot as plt
def show_box(box, color):
    """
    Draw a bounding box with matplotlib
    :param box: bounding box, (xmin, ymin, xmax, ymax)
    :param color: edge color of the rectangle
    :return: matplotlib.patches.Rectangle
    """
    return plt.Rectangle(xy=(box[0], box[1]), width=box[2] - box[0], height=box[3] - box[1],
                         fill=False, edgecolor=color, linewidth=2)
Generally speaking, objects (of any class) are scattered all over an image, in all kinds of sizes. Probabilistically, an object can appear anywhere, so all we can do is discretize that probability space — then at least we can attach probabilities to it...... So we let the anchor boxes cover the whole feature map as densely as possible (a discretized probability space, if you like).
Anchor boxes are prior, fixed boxes; together they represent the probability space of possible classes and approximate box locations. To emphasize that they are priors, we give them the English name: Prior.
1.1 Alright then, Priors
- These anchors have to be chosen by hand, with sizes and scales that match the training data; for the priors to represent the probability space, they are generated around every pixel position
- As explained in the paper-reading post, lower layers use smaller scales (to detect smaller objects) and higher layers larger scales (to detect larger objects). Because the scales are expressed as fractions, they stay consistent when mapped from a feature map back to the original image
(For the detailed procedure see the paper or my earlier post; only the key steps are marked here)
def create_prior_boxes(widths: list, heights: list, scales: list, aspect_ratios: list) -> torch.Tensor:
"""
    Create prior boxes at each pixel, following the authors' method in the paper
    :param widths: widths of all feature maps used to create priors
    :param heights: heights of all feature maps used to create priors
    :param scales: scales of all feature maps used to create priors.
                   Note that each feature map has a specific scale
    :param aspect_ratios: aspect ratios of all feature maps used to create priors.
                          Note that each feature map has a different number of ratios
    :return: priors' locations in center coordinates, a tensor of shape (8732, 4)
"""
prior_boxes = []
for i, (width, height, scale, ratios) in enumerate(zip(widths, heights, scales, aspect_ratios)):
for y in range(height):
for x in range(width):
# change cxcy to the center of pixel
# change cxcy in range 0 to 1
cx = (x + 0.5) / width
cy = (y + 0.5) / height
for ratio in ratios:
# all those params are proportional form(percent coordinates)
prior_width = scale * math.sqrt(ratio)
prior_height = scale / math.sqrt(ratio)
prior_boxes.append([cx, cy, prior_width, prior_height])
                    # For the aspect ratio of 1, we also add a default box whose scale is sqrt(s_k * s_(k+1))
if ratio == 1:
try:
additional_scale = math.sqrt(scales[i] * scales[i + 1])
                        # for the last feature map there is no next scale, so fall back to 1
except IndexError:
additional_scale = 1
# ratio of 1 means scale is width and height
prior_boxes.append([cx, cy, additional_scale, additional_scale])
return torch.FloatTensor(prior_boxes).clamp_(0, 1).to(device) # (8732, 4) Note that they are percent coordinates
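As a sanity check, here is a sketch of how this function might be called for SSD300. The feature-map sizes, per-map scales and aspect ratios below follow the paper and common implementations, but treat the exact numbers as an assumption of this sketch rather than something fixed by the code above:

# Hypothetical SSD300 configuration (assumed; verify against your own model)
fmap_widths = [38, 19, 10, 5, 3, 1]
fmap_heights = [38, 19, 10, 5, 3, 1]
fmap_scales = [0.1, 0.2, 0.375, 0.55, 0.725, 0.9]
fmap_ratios = [[1., 2., 0.5],
               [1., 2., 3., 0.5, 0.333],
               [1., 2., 3., 0.5, 0.333],
               [1., 2., 3., 0.5, 0.333],
               [1., 2., 0.5],
               [1., 2., 0.5]]
priors_cxcy = create_prior_boxes(fmap_widths, fmap_heights, fmap_scales, fmap_ratios)
print(priors_cxcy.shape)  # torch.Size([8732, 4])

With these numbers the per-map counts are 38*38*4 + 19*19*6 + 10*10*6 + 5*5*6 + 3*3*4 + 1*1*4 = 8732, matching the paper.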
1.2 How priors are represented
In the paper a prior is written as (cx, cy, w, h): the center form. For programming convenience we sometimes also use the boundary form (xmin, ymin, xmax, ymax), so we need to convert between the two.
def xy_to_cxcy(xy: torch.Tensor) -> torch.Tensor:
    """
    Convert bounding boxes from boundary form (xmin, ymin, xmax, ymax) to center form (cx, cy, w, h)
    :param xy: bounding boxes in (xmin, ymin, xmax, ymax) form, a tensor of size (num_boxes, 4)
    :return: bounding boxes in (cx, cy, w, h) form, a tensor of size (num_boxes, 4)
    """
    return torch.cat([(xy[:, 2:] + xy[:, :2]) / 2, xy[:, 2:] - xy[:, :2]], dim=1)
def cxcy_to_xy(cxcy: torch.Tensor) -> torch.Tensor:
    """
    Convert bounding boxes from center form (cx, cy, w, h) to boundary form (xmin, ymin, xmax, ymax)
    :param cxcy: bounding boxes in (cx, cy, w, h) form, a tensor of size (n_boxes, 4)
    :return: bounding boxes in (xmin, ymin, xmax, ymax) form
    """
    return torch.cat([cxcy[:, :2] - (cxcy[:, 2:] / 2), cxcy[:, :2] + (cxcy[:, 2:] / 2)], 1)
Note: the paper-reading post also explained that, for several reasons, priors should be expressed in relative lengths (or relative coordinates, i.e. already normalized).
1.3 Prior to ground truth
Obviously the priors are not real ground-truth information (they deviate from the true boundaries, carry no class, and each prior's ground truth involves some uncertainty that we have to quantify). We need to turn the priors' information into ground-truth targets in order to compute the loss (and we also need to understand what exactly we are predicting and how to turn a prediction back into a real predicted box).
1.3.1 offset
The offsets are encoded as described in the paper-reading post:

$$\hat{g}_{cx} = \frac{c_x - \bar{c}_x}{\bar{w}},\qquad \hat{g}_{cy} = \frac{c_y - \bar{c}_y}{\bar{h}},\qquad \hat{g}_{w} = \log\frac{w}{\bar{w}},\qquad \hat{g}_{h} = \log\frac{h}{\bar{h}} \tag{1}$$

where $(c_x, c_y, w, h)$ is the ground-truth box's true position (center form) and $(\bar{c}_x, \bar{c}_y, \bar{w}, \bar{h})$ is the prior's true position.
In practice the encoded values are usually standardized once more with empirical parameters:

$$g_{cx} = \frac{\hat{g}_{cx}}{\sigma_{cx}},\qquad g_{cy} = \frac{\hat{g}_{cy}}{\sigma_{cy}},\qquad g_{w} = \frac{\hat{g}_{w}}{\sigma_{w}},\qquad g_{h} = \frac{\hat{g}_{h}}{\sigma_{h}} \tag{2}$$

where the empirical parameters are $\sigma_{cx} = \sigma_{cy} = 0.1$ and $\sigma_w = \sigma_h = 0.2$ (hence the factors of 10 and 5 in the code below).
def cxcy_to_gcxgcy(cxcy: torch.Tensor, priors_cxcy: torch.Tensor) -> torch.Tensor:
    """
    Compute the offsets between target boxes and priors (both in center form), encoded as in Eq. (2)
    The center-form target boxes correspond one-to-one with the priors
    :param cxcy: bounding boxes in center form, a tensor of size (n_priors, 4)
    :param priors_cxcy: prior boxes in center form, a tensor of size (n_priors, 4)
    :return: encoded bounding boxes, a tensor of size (n_priors, 4)
    """
    return torch.cat([(cxcy[:, :2] - priors_cxcy[:, :2]) / (priors_cxcy[:, 2:]) * 10,
                      torch.log(cxcy[:, 2:] / priors_cxcy[:, 2:]) * 5], 1)
To recover the actual predicted bounding boxes we have to decode the above (note: what the predictor actually outputs is exactly these encoded offsets).
def gcxgcy_to_cxcy(gcxgcy: torch.Tensor, priors_cxcy: torch.Tensor) -> torch.Tensor:
    """
    Decode the model's predicted offsets against the priors (one-to-one) into center-form bounding boxes
    :param gcxgcy: encoded bounding boxes (i.e. offsets), such as the model output, a tensor of size (n_priors, 4)
    :param priors_cxcy: prior boxes in center form, a tensor of size (n_priors, 4)
    :return: decoded bounding boxes in center-size form, a tensor of size (n_priors, 4)
    """
    return torch.cat([gcxgcy[:, :2] / 10 * priors_cxcy[:, 2:] + priors_cxcy[:, :2],
                      torch.exp(gcxgcy[:, 2:] / 5) * priors_cxcy[:, 2:]], dim=1)
For the ground-truth offsets above, cxcy simply has to hold the ground-truth boxes — but those boxes must correspond one-to-one with the priors. Establishing that correspondence is what we discuss next.
1.3.2 object class
0 stands for the background class; 1 to n_classes are the object classes. The number and classes of objects differ from image to image, so we first assign an object to each prior, and that object's class then determines the prior's class.
1.3.3 criterion
To assign classes to priors we need a criterion that measures how well a prior matches a true bounding box.
The paper uses the Jaccard overlap (intersection over union, IoU).
The functions below compute the IoU; note that the inputs are boxes in boundary form.
def find_intersection(set_1, set_2):
"""
Find the intersection of every box combination between two sets of boxes that are in boundary coordinates.
:param set_1: set 1, a tensor of dimensions (n1, 4)
:param set_2: set 2, a tensor of dimensions (n2, 4)
:return: intersection of each of the boxes in set 1 with respect to each of the boxes in set 2, a tensor of dimensions (n1, n2)
"""
# PyTorch auto-broadcasts singleton dimensions
lower_bound = torch.max(set_1[:, :2].unsqueeze(1), set_2[:, :2].unsqueeze(0)) # (n1,n2,2)
upper_bound = torch.min(set_1[:, 2:].unsqueeze(1), set_2[:, 2:].unsqueeze(0)) # (n1,n2,2)
intersection_dims = torch.clamp(upper_bound - lower_bound, 0) # (n1, n2, 2)
return intersection_dims[:, :, 0] * intersection_dims[:, :, 1] # (n1, n2)
def find_jaccard_overlap(set_1, set_2):
"""
Find the Jaccard Overlap (IoU) of every box combination between two sets of boxes that are in boundary coordinates.
:param set_1: set 1, a tensor of dimensions (n1, 4)
:param set_2: set 2, a tensor of dimensions (n2, 4)
:return: Jaccard Overlap of each of the boxes in set 1 with respect to each of the boxes in set 2, a tensor of dimensions (n1, n2)
"""
# Find intersections
intersection = find_intersection(set_1, set_2)
# Find areas of each box in both sets
areas_set_1 = (set_1[:, 2] - set_1[:, 0]) * (set_1[:, 3] - set_1[:, 1]) # (n1)
areas_set_2 = (set_2[:, 2] - set_2[:, 0]) * (set_2[:, 3] - set_2[:, 1]) # (n2)
# Find the union
# PyTorch auto-broadcasts singleton dimensions
union = areas_set_1.unsqueeze(1) + areas_set_2.unsqueeze(0) - intersection # (n1, n2)
return intersection / union # (n1, n2)
Suppose set_1 is the priors (8732, 4) and set_2 the true boxes (n_objects_per_image, 4); we end up with a tensor of shape (8732, n_objects_per_image), i.e. the IoU of every prior with every object box in that image.
1.3.4 priors to ground truth
def label_prior(priors_cxcy, boxes, classes):
    """
    Assign a ground-truth label to every prior. Note that we do this for each image in a batch.
    The priors are fixed beforehand; boxes and classes come from the dataloader.
    :param priors_cxcy: the priors we created, shape (8732, 4); note they are in center form and percent coordinates
    :param boxes: tensor of the true objects' bounding boxes in the image, in boundary form and percent coordinates
    :param classes: tensor of the true objects' class labels in the image
    :return: each prior's class label and encoded offsets
    """
    n_objects = boxes.size(0)
    # Convert the priors to boundary form for the IoU computation (keep the center form for encoding later)
    priors_xy = cxcy_to_xy(priors_cxcy)
    overlaps = find_jaccard_overlap(boxes, priors_xy)  # (n_objects, 8732)
    # For every prior, find the object with the largest overlap and assign that object (not its class yet)
    overlap_per_prior, object_per_prior = overlaps.max(dim=0)  # (8732)
    # Assigning purely by largest IoU causes two problems:
    # 1. if an object is not the best match of any prior, it never gets assigned to a prior
    # 2. priors whose overlap is below a threshold (0.5) should be assigned to the background class (class 0)
    # Fix the first problem:
    _, prior_per_object = overlaps.max(dim=1)  # (n_objects), each value is that object's best prior index in [0, 8731]
    object_per_prior[prior_per_object] = torch.LongTensor(range(n_objects)).to(device)  # assign each object to its best-matching prior
    overlap_per_prior[prior_per_object] = 1.  # so these priors are never relabelled as background below
    # Fix the second problem:
    class_per_prior = classes[object_per_prior]  # look up each prior's true class via its assigned object
    class_per_prior[overlap_per_prior < 0.5] = 0  # (8732)
    # Encode, for every prior, the offset to the box of the object assigned above
    offset_per_prior = cxcy_to_gcxgcy(xy_to_cxcy(boxes[object_per_prior]), priors_cxcy)  # (8732, 4)
    return class_per_prior, offset_per_prior
Notice that every prior now has its own ground truth; together they detect objects at different scales and positions.
label_prior() handles one image of a batch together with its object boxes and class labels (from the xml annotations, via the dataloader). Looping over the images of a batch gives the priors-to-ground-truth targets for each image, which are used in the loss computation (see 5.1) — a minimal sketch follows.
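For example, the per-image loop could look like this (the list names are just illustrative; boxes and labels come straight from the DataLoader):

# Sketch: label every prior for each image of a batch
batch_true_classes, batch_true_offsets = [], []
for b, l in zip(boxes, labels):  # one image at a time
    cls_per_prior, offset_per_prior = label_prior(priors_cxcy, b, l)
    batch_true_classes.append(cls_per_prior)      # (8732,)
    batch_true_offsets.append(offset_per_prior)   # (8732, 4)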
2. Network architecture
SSD takes VGG-16 truncated before its fully-connected layers as the base net, tweaks a few details of the base net and adds Conv6 and Conv7, and then appends extra convolutional layers after the base net.
(Note: for code readability the SSD network is split into BaseNet and AuxiliaryConvolutions.)
Because of its fully-connected layers, the full VGG-16 expects inputs of size (3, 224, 224); the author hacks the network so that it accepts 300x300 inputs (the SSD300 model).
2.0 Conv4_3:
Propagating a 300x300 input through vanilla vgg-16, the feature map at conv4_3 would be downsampled to 37x37, whereas the size stated here is 38x38. In vgg-16 only the pooling layers downsample, so this change comes from modifying maxpool3: its output-size computation is switched from rounding down (floor) to rounding up (ceiling).
self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)
2.1 Maxpool5
Instead of the layer used in the original vgg-16, use a maxpool with size=(3,3), stride=1, padding=1:
self.pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
2.2 Conv6 and Conv7: I hope I can explain this clearly enough
In VGG-16 the classifier is fc6-fc7-fc8: the (512, 7, 7) feature map is flattened and fed to fc6 (output 4096), then fc7 (4096 -> 4096), and finally the 1000-way layer. The author wants to reuse fc6's and fc7's weights directly to build the kernels of Conv6 and Conv7.
2.2.1 Let's first sort out how convolutional layers and fully-connected layers convert into each other
- Convolutional layer -> fully-connected layer: from the figure above it is easy to see that the converted fc weights form a sparse matrix drawn from the kernel weights. Each pixel of an output channel of the feature map is the sum, over all input channels, of the convolution results at the same position (i.e. the shaded red patch is the sum of the results of convolving each of the shaded blue patches (assuming several input channels) with its own kernel). So out_channels controls the number of feature maps, while in_channels and out_channels determine the height and width of the fc weight matrix.
- Fully-connected layer -> convolutional layer: consider flattening the (512, 7, 7) input pixels and mapping them to 4096 outputs; the fc weight is then (512*7*7, 4096). If we make the kernel exactly the size of the image, i.e. (4096, 512, 7, 7), then by the rules of convolution the result (within one output channel) is every pixel of every channel multiplied by its corresponding kernel weight and summed, which is identical to the fully-connected computation; the channel dimension now plays the role of the old feature dimension.
So conv6's kernel should be (4096, 512, 7, 7) and conv7's kernel (4096, 4096, 1, 1).
However, this still won't do: these filters are numerous, huge and expensive to compute, so the author downsamples (decimates) the kernels.
2.2.2 Decimating the kernels
This step is actually very simple: we just subsample the kernel parameters along three dims (out_channels, height and width).......
from collections.abc import Iterable
def decimate(tensor: torch.Tensor, m: Iterable) -> torch.Tensor:
    """
    Downsample some dimensions of a tensor; m lists the subsampling step for each dimension
    :param tensor: the tensor to be downsampled
    :param m: list of subsampling steps, one per dimension; use None for dimensions that are not downsampled
    :return: the downsampled tensor
    """
    assert tensor.dim() == len(m)
    for d in range(tensor.dim()):
        if m[d] is not None:
            tensor = tensor.index_select(dim=d, index=torch.arange(start=0, end=tensor.size(d), step=m[d]))
    return tensor
The author sets the subsampling step to 3 along the height and width dims (keep every third value) and to 4 along out_channels, which keeps roughly 1/4 x 1/3 x 1/3 of the original kernel.
At last we get the kernels of Conv6 and Conv7: (1024, 512, 3, 3) and (1024, 1024, 1, 1) respectively.
2.2.3 Atrous convolution
Atrous convolution (dilated convolution, also known as convolution with holes......) really targets adjacent pixels (adjacent pixels usually carry a lot of redundant information). To enlarge the receptive field without pooling (pooling means losing image information), we insert holes into the convolution's input window. Atrous convolution actually loses no image information: for one output pixel we simply skip the adjacent input pixels, and those skipped neighbours still get convolved with the kernel when other output pixels are computed...... enough words, the picture makes it clearer.
The figure is from vdumoulin/conv_arithmetic (you've probably seen this series of figures before; the shaded area is where the convolution is computed).
It is easy to see that every input pixel is indeed used (nothing is thrown away as with pooling) and the receptive field is enlarged.
2.2.4 Atrous convolution and kernel decimation
In the paper conv6's output is still 19x19 and uses atrous convolution.
After the decimation described above, the feature map would really have to be convolved with the original 7x7 kernel, but the decimation leaves holes in the kernel, so the natural choice would be to skip 3 pixels during convolution (dilation=3). The author's repo, however, uses dilation=6, presumably because the modified maxpool5 no longer halves the spatial size, so the dilation is doubled.
self.conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6) # atrous convolution
self.conv7 = nn.Conv2d(1024, 1024, kernel_size=1)
Next, load the original fully-connected layers' weights and biases into base_net:
# this part can be defined in class BaseNet as a function for init.
# get state_dict which only contains params
state_dict = base_net.state_dict() # base net is instance of BaseNet
pretrained_state_dict = torchvision.models.vgg16(pretrained=True).state_dict()
# fc6
conv_fc6_weight = pretrained_state_dict['classifier.0.weight'].view(4096, 512, 7, 7)  # (4096, 512, 7, 7)
conv_fc6_bias = pretrained_state_dict['classifier.0.bias']  # (4096)
state_dict['conv6.weight'] = decimate(conv_fc6_weight, m=[4, None, 3, 3])  # (1024, 512, 3, 3)
state_dict['conv6.bias'] = decimate(conv_fc6_bias, m=[4])  # (1024)
# fc7: in the pretrained model, fc7 is simply named classifier.3
conv_fc7_weight = pretrained_state_dict['classifier.3.weight'].view(4096, 4096, 1, 1) # (4096, 4096, 1, 1)
conv_fc7_bias = pretrained_state_dict['classifier.3.bias'] # (4096)
state_dict['conv7.weight'] = decimate(conv_fc7_weight, m=[4, 4, None, None]) # (1024, 1024, 1, 1)
state_dict['conv7.bias'] = decimate(conv_fc7_bias, m=[4]) # (1024)
base_net.load_state_dict(state_dict)
......and this headache-inducing part is finally over.
2.3 The remaining auxiliary convolutions:
These are all layers the author adds to extract larger-scale features; they are easy to follow, and the 1x1 convolutions have their uses (something like further condensing the feature maps?).
class AuxiliaryConvolutions(nn.Module):
"""
Additional convolutions to produce higher-level feature maps.
"""
def __init__(self):
super(AuxiliaryConvolutions, self).__init__()
# Auxiliary convolutions on top of the VGG base
self.conv8_1 = nn.Conv2d(1024, 256, kernel_size=1, padding=0)
self.conv8_2 = nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1)
self.conv9_1 = nn.Conv2d(512, 128, kernel_size=1, padding=0)
self.conv9_2 = nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1)
self.conv10_1 = nn.Conv2d(256, 128, kernel_size=1, padding=0)
self.conv10_2 = nn.Conv2d(128, 256, kernel_size=3, padding=0)
self.conv11_1 = nn.Conv2d(256, 128, kernel_size=1, padding=0)
self.conv11_2 = nn.Conv2d(128, 256, kernel_size=3, padding=0)
# Initialize convolutions' parameters
for c in self.children():
if isinstance(c, nn.Conv2d):
nn.init.xavier_normal_(c.weight)
nn.init.constant_(c.bias, 0.)
2.4 multi-level feature maps:
As the figure shows, the feature maps chosen for multi-scale prediction are conv4_3, conv7, conv8_2, conv9_2, conv10_2 and conv11_2 (a mix of low-level and high-level maps); we simply return them from forward.
BaseNet: forward returns conv4_3_features, conv7_features
AuxiliaryConvolutions: forward returns conv8_2_features, conv9_2_features, conv10_2_features, conv11_2_features
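The forward pass of AuxiliaryConvolutions is not listed above; a minimal sketch of what it might look like (assuming a ReLU after every convolution, the usual choice, and that it receives conv7's output; place it inside the class, with import torch.nn.functional as F at the top of the file):

    def forward(self, conv7_features):
        """Sketch: take conv7's output (N, 1024, 19, 19) and return the four higher-level feature maps"""
        out = F.relu(self.conv8_1(conv7_features))  # (N, 256, 19, 19)
        out = F.relu(self.conv8_2(out))  # (N, 512, 10, 10)
        conv8_2_features = out
        out = F.relu(self.conv9_1(out))  # (N, 128, 10, 10)
        out = F.relu(self.conv9_2(out))  # (N, 256, 5, 5)
        conv9_2_features = out
        out = F.relu(self.conv10_1(out))  # (N, 128, 5, 5)
        out = F.relu(self.conv10_2(out))  # (N, 256, 3, 3)
        conv10_2_features = out
        out = F.relu(self.conv11_1(out))  # (N, 128, 3, 3)
        conv11_2_features = F.relu(self.conv11_2(out))  # (N, 256, 1, 1)
        return conv8_2_features, conv9_2_features, conv10_2_features, conv11_2_features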
2.5 predictor
The multi-level feature maps are fed into their own predictors, which predict offsets and classes separately; the predictors at every level share a similar structure: kernel_size=3, padding=1.
Note that the predicted offsets are the encoded offsets with respect to that feature map's priors (see 1.3), and the class predictor must produce a score for every class.
def loc_predictor(in_channels, num_priors):
    """
    Box prediction layer: predicts 4 offsets for each of the priors at every pixel of the input
    :param in_channels: number of input channels
    :param num_priors: num_priors priors are generated around each position
    :return: convolution layer that predicts offsets
    """
    return nn.Conv2d(in_channels, num_priors * 4, kernel_size=3, padding=1)
def cls_predictor(in_channels, num_priors, num_classes):
    """
    Class prediction layer: predicts a score for every class, for each prior at every pixel of the input
    It uses a convolution that keeps the input height and width, so the output and input positions on the
    feature map correspond one-to-one
    :param in_channels: number of input channels
    :param num_priors: num_priors priors are generated around each position
    :param num_classes: number of object classes
    :return: convolution layer that predicts class scores
    """
    return nn.Conv2d(in_channels, num_priors * num_classes, kernel_size=3, padding=1)
Priors are generated at every pixel of a feature map, and the predictor's output keeps the input's w and h, so every output position corresponds to an input position; the offsets are naturally the encoded offsets of the corresponding priors, with out_channels now playing the role of the feature dimension. Because different feature maps have different w, h and num_priors, we flatten the spatial dims before concatenating all outputs. Class prediction follows the same idea; only the final feature dimension (number of output channels) differs.
- For training, the number of prediction positions taken from the chosen feature maps must add up exactly to the number of priors (a one-to-one correspondence)
Finally, concatenate the predictions of all feature maps.
class PredictionConvolution(nn.Module):
"""
Convolutions to predict class scores and bounding boxes
"""
    def __init__(self, n_classes):
        """
        :param n_classes: number of different types of objects
        """
        super(PredictionConvolution, self).__init__()
        self.n_classes = n_classes
# Number of priors, as we showing before ,at per position in each feature map
n_boxes = {'conv4_3': 4,
'conv7': 6,
'conv8_2': 6,
'conv9_2': 6,
'conv10_2': 4,
'conv11_2': 4}
self.convs = ['conv4_3', 'conv7', 'conv8_2', 'conv9_2', 'conv10_2', 'conv11_2']
for name, ic in zip(self.convs, [512, 1024, 512, 256, 256, 256]):
setattr(self, 'cls_%s' % name, cls_predictor(ic, n_boxes[name], n_classes))
setattr(self, 'loc_%s' % name, loc_predictor(ic, n_boxes[name]))
# Initialize convolutions' parameters
for c in self.children():
if isinstance(c, nn.Conv2d):
nn.init.xavier_normal_(c.weight)
nn.init.constant_(c.bias, 0.)
def _apply(self, x: torch.Tensor, conv: nn.Conv2d, num_features: int):
"""
Apply forward calculation for each conv2d with respect to specific feature map
:param x: input tensor
:param conv: conv
:param num_features: output feature, for loc_pred is 4, for label_pred is num_classes+1
:return: locations and class scores
"""
x = conv(x).permute(0, 2, 3, 1).contiguous()
return x.view(x.size(0), -1, num_features)
def forward(self, *args):
# args are feature maps needed for prediction
assert len(args) == len(self.convs)
locs = []
classes_scores = []
for name, x in zip(self.convs, args):
classes_scores.append(self._apply(x, getattr(self, 'cls_%s' %name), self.n_classes))
locs.append(self._apply(x, getattr(self, 'loc_%s' % name), 4))
locs = torch.cat(locs, dim=1) # (N, 8732, 4)
classes_scores = torch.cat(classes_scores, dim=1) # (N, 8732, n_classes)
return locs, classes_scores
2.6 SSD300
Putting BaseNet, AuxiliaryConvolutions and PredictionConvolution together gives the SSD300 model — a rough sketch follows.
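A minimal sketch of how the three parts might be wired together (class and attribute names are placeholders; the conv4_3 rescaling discussed in the closing notes is omitted for brevity, and the prior-box configuration names refer to the hypothetical ones from section 1.1):

class SSD300(nn.Module):
    """Sketch: chain BaseNet, AuxiliaryConvolutions and PredictionConvolution."""
    def __init__(self, n_classes):
        super(SSD300, self).__init__()
        self.base = BaseNet()
        self.aux_convs = AuxiliaryConvolutions()
        self.pred_convs = PredictionConvolution(n_classes)
        # the fixed (8732, 4) priors, created once as in section 1.1
        self.priors_cxcy = create_prior_boxes(fmap_widths, fmap_heights, fmap_scales, fmap_ratios)

    def forward(self, image):  # image: (N, 3, 300, 300)
        conv4_3, conv7 = self.base(image)
        conv8_2, conv9_2, conv10_2, conv11_2 = self.aux_convs(conv7)
        locs, classes_scores = self.pred_convs(conv4_3, conv7, conv8_2,
                                               conv9_2, conv10_2, conv11_2)
        return locs, classes_scores  # (N, 8732, 4), (N, 8732, n_classes)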
3. Preparing the training data
Data augmentation involves transforming not just the image but also the true bounding boxes, so we cannot directly use the classes packaged in torchvision.transform; we have to write the transforms by hand.
For the probability of 0.5 mentioned in the paper, we simply check whether random.random() is smaller than 0.5 before applying an augmentation — a rough sketch of a combined pipeline follows.
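For instance, the coin flip can gate each augmentation like this. The helper names match the functions defined in 3.1-3.3 below, but the overall ordering and the PIL/tensor conversions are assumptions of this sketch (expand from 3.5 and normalization from 3.6 would slot in as well):

import random
import torchvision.transforms.functional as FT

def transform_train(image, boxes, labels):
    """Sketch of a training-time pipeline; random_crop works on tensors, flip/resize on PIL images."""
    new_image, new_boxes, new_labels = image, boxes, labels   # image is assumed to be a PIL Image here
    if random.random() < 0.5:
        new_image, new_boxes = flip(new_image, new_boxes)     # horizontal flip (3.2)
    new_image = FT.to_tensor(new_image)                       # PIL -> tensor for cropping
    new_image, new_boxes, new_labels = random_crop(new_image, new_boxes, new_labels)  # 3.1
    new_image = FT.to_pil_image(new_image)                    # back to PIL for resizing
    new_image, new_boxes = resize(new_image, new_boxes)       # 3.3, boxes become percent coordinates
    return new_image, new_boxes, new_labels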
3.1 Random crop
The random crop is the core of the data augmentation described in the paper.
import random
import torchvision
from torchvision import transforms

def random_crop(image: torch.Tensor, boxes: torch.Tensor, labels: torch.Tensor):
    """
    Random crop. Helps the network learn objects at larger scales, but some objects may be cut out entirely.
    :param image: image, a tensor of dimensions (3, original_h, original_w)
    :param boxes: ground-truth bounding boxes in boundary form, a tensor of dimensions (n_objects, 4)
    :param labels: ground-truth object classes, a tensor of dimensions (n_objects)
    :return: randomly cropped image, updated bounding boxes, updated labels
    """
    original_width = image.size(2)
    original_height = image.size(1)
    while True:
        # 'None' means no crop, 0 means a free crop, [.1, .3, .5, .7, .9] are the minimum overlaps from the paper
        min_overlap = random.choice([0., .1, .3, .5, .7, .9, None])
        if min_overlap is None:
            return image, boxes, labels
        # Try up to 50 times for the chosen minimum overlap (not in the paper, but used in the author's repo);
        # if nothing works, go back and pick a new minimum overlap
        for _ in range(50):
            min_scale = 0.3
            # The paper samples the crop size in [.1, 1], but the author's repo uses [.3, 1]
            # random.uniform(a, b) -> closed interval [a, b]
            new_width = int(original_width * random.uniform(min_scale, 1))
            new_height = int(original_height * random.uniform(min_scale, 1))
            # The paper requires the aspect ratio after sampling to be in [0.5, 2]
            if not .5 <= new_height / new_width <= 2:
                continue
            # Pick the crop position
            # random.randint(a, b) -> closed interval [a, b]
            left = random.randint(0, original_width - new_width)
            top = random.randint(0, original_height - new_height)
            right = left + new_width
            bottom = top + new_height
            crop_bounding = torch.FloatTensor([left, top, right, bottom])
            # Compute the IoU between the crop and the true bounding boxes
            over_lap = find_jaccard_overlap(crop_bounding.unsqueeze(0), boxes).squeeze(0)  # (n_objects)
            # The paper requires the overlap with the objects to be > min_overlap
            if over_lap.max().item() < min_overlap:
                continue
            cropped_image = image[:, top:bottom, left:right]
            # Criterion for keeping an object: is the center of its true bounding box inside the crop?
            box_centers = (boxes[:, :2] + boxes[:, 2:]) / 2.  # (n_objects, 2)
            center_in_cropped_image = (box_centers[:, 0] > left) * (box_centers[:, 0] < right) * (
                    box_centers[:, 1] > top) * (box_centers[:, 1] < bottom)  # (n_objects)
            # If no object center lies inside the crop, try again
            if not center_in_cropped_image.any():
                continue
            # Discard objects that fail the criterion
            new_boxes = boxes[center_in_cropped_image]
            new_labels = labels[center_in_cropped_image]
            # Compute the box coordinates inside the cropped image
            # Clamp the left/top edges to the crop's left/top edges (keep the larger one), then shift
            new_boxes[:, :2] = torch.max(new_boxes[:, :2], crop_bounding[:2])
            new_boxes[:, :2] -= crop_bounding[:2]
            # Clamp the right/bottom edges to the crop's right/bottom edges (keep the smaller one), then shift
            new_boxes[:, 2:] = torch.min(new_boxes[:, 2:], crop_bounding[2:])
            new_boxes[:, 2:] -= crop_bounding[:2]
            return cropped_image, new_boxes, new_labels
3.2 Horizontal flip
This one is simple; it is only the true bounding boxes, not the image itself, that need extra handling.
def flip(image, boxes):
    """
    Flip image horizontally.
    :param image: a PIL Image; it must be a PIL Image because we call a torchvision function on it
    :param boxes: ground-truth bounding boxes in boundary form, a tensor of dimensions (n_objects, 4)
    :return: horizontally flipped image, updated bounding boxes
    """
# Flip image
new_image = torchvision.transforms.functional.hflip(image)
# Flip boxes
new_boxes = boxes
new_boxes[:, 0] = image.width - (boxes[:, 0] + 1)
new_boxes[:, 2] = image.width - (boxes[:, 2] + 1)
new_boxes = new_boxes[:, [2, 1, 0, 3]]
return new_image, new_boxes
3.3 Resize
The SSD300 model needs the training images resized to 300 x 300; this is also where we convert the true bounding boxes into fractional (percent) form.
def resize(image, boxes, size=(300, 300), return_percent_coords=True):
"""
Resize image. For the SSD300, resize to (300, 300).
Since percent/fractional coordinates are calculated for the bounding boxes (w.r.t image dimensions) in this process,
you may choose to retain them.
:param image: image, a PIL Image
:param boxes: bounding boxes in boundary coordinates, a tensor of dimensions (n_objects, 4)
:param size: resize to specific size
:param return_percent_coords: whether to return new bounding box coordinates in form of percent coordinates
:return: resized image, updated bounding box coordinates (or fractional coordinates, in which case they remain the same)
"""
# Resize image
new_image = transforms.functional.resize(image, size)
# Resize bounding boxes
old_size = torch.FloatTensor([image.width, image.height, image.width, image.height]).unsqueeze(0)
    # percent coordinates are unchanged by a pure resize, so dividing by the old size is enough
    new_boxes = boxes / old_size  # percent coordinates stay the same even if the image size changes
if not return_percent_coords:
new_size = torch.FloatTensor([size[0], size[1], size[0], size[1]]).unsqueeze(0)
new_boxes = new_boxes * new_size
return new_image, new_boxes
3.5 Expand
Because the model does not detect small objects very well, we "zoom out" the training images to strengthen small-object detection.
The procedure is very similar to resize, except that a larger canvas is created, the original image is placed inside it, and the remaining blank area is filled.
The recommended fill value is the per-channel mean of the three channels (see 3.6).
Since the new image is larger than the original, the true boxes only need to be shifted by [left offset, top offset, left offset, top offset] — a sketch follows below.
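No code is listed for this step, so here is a minimal sketch under the assumptions above (the maximum zoom-out factor of 4 and the random placement are borrowed from common implementations and should be treated as assumptions):

def expand(image, boxes, filler):
    """
    Sketch: place the image on a larger, mean-filled canvas ("zoom out") to help detect small objects.
    :param image: a tensor of dimensions (3, original_h, original_w)
    :param boxes: bounding boxes in boundary coordinates, a tensor of dimensions (n_objects, 4)
    :param filler: per-channel fill values, e.g. the RGB means from 3.6, a list of length 3
    :return: expanded image, updated bounding boxes
    """
    original_h = image.size(1)
    original_w = image.size(2)
    scale = random.uniform(1, 4)  # assumed maximum expansion factor
    new_h = int(scale * original_h)
    new_w = int(scale * original_w)
    # Canvas filled with the per-channel means
    filler = torch.FloatTensor(filler)  # (3)
    new_image = torch.ones((3, new_h, new_w), dtype=torch.float) * filler.unsqueeze(1).unsqueeze(1)
    # Paste the original image at a random position
    left = random.randint(0, new_w - original_w)
    top = random.randint(0, new_h - original_h)
    new_image[:, top:top + original_h, left:left + original_w] = image
    # Shift the boxes by the same offset: [left, top, left, top]
    new_boxes = boxes + torch.FloatTensor([left, top, left, top]).unsqueeze(0)
    return new_image, new_boxes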
3.6 Normalization
The input is first scaled to [0, 1]; the pretrained model additionally expects this scaled input to be standardized. This page shows exactly how torchvision.models' pretrained models expect their inputs to be processed:
mean = [0.485, 0.456, 0.406] # RGB channels
std = [0.229, 0.224, 0.225] # RGB channels
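The last two steps of the pipeline could then look like this (using torchvision's functional transforms; a sketch, assuming new_image is still a PIL Image at this point):

import torchvision.transforms.functional as FT

new_image = FT.to_tensor(new_image)                      # PIL [0, 255] -> float tensor in [0, 1]
new_image = FT.normalize(new_image, mean=mean, std=std)  # standardize with the statistics above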
4. Dataset and DataLoader
For the Dataset we subclass torch.utils.data.Dataset ourselves and apply the processing from section 3 to the images, true bounding boxes and class labels inside it.
The Dataset returns the image, the true bounding boxes and the class labels.
A problem appears, however, when the DataLoader assembles batches:
the number of objects differs between images, so the lengths of boxes and labels differ between images, and they cannot be stacked into batches directly.
So we pass a function to the DataLoader's collate_fn= argument (just the function name) that assembles the output the way we want.
def collate_fn(batch):
"""
This describes how to combine these tensors of different sizes. We use lists.
:param batch: an iterable of N sets from __getitem__()
    :return: a tensor of images, and lists of varying-size tensors of bounding boxes and labels
"""
images = list()
boxes = list()
labels = list()
for b in batch:
images.append(b[0])
boxes.append(b[1])
labels.append(b[2])
images = torch.stack(images, dim=0)
    return images, boxes, labels  # tensor (N, 3, 300, 300), 2 lists of N tensors each
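Passing it to the DataLoader might then look like this (batch size and worker count are arbitrary here, and train_dataset stands for the custom Dataset described above):

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=8, shuffle=True,
                                           collate_fn=collate_fn, num_workers=4, pin_memory=True)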
5. Training
5.1 Loss Function
location_loss=torch.nn.L1Loss()
confidence_loss=nn.CrossEntropyLoss(reduction='none')
5.2 Hard negative mining
Negatives (background) vastly outnumber positives in the training data, which badly unbalances the classes, so we apply hard negative mining: keep only the negatives with the largest loss, so that positives : negatives = 1 : 3.
def calculate_loss(priors_cxcy, pred_locs, pred_scores, boxes, labels, loc_loss, conf_loss, alpha=1):
    """
    Compute the loss with hard negative mining
    :param priors_cxcy: priors in center form
    :param pred_locs: predicted offsets, for one batch
    :param pred_scores: predicted class scores, for one batch
    :param boxes: ground-truth bounding boxes, from a batch of the dataloader
    :param labels: ground-truth class labels, from a batch of the dataloader
    :param loc_loss: nn.L1Loss()
    :param conf_loss: nn.CrossEntropyLoss(reduction='none')
    :param alpha: weight of the localization loss in the paper, default 1
    :return: total loss
    """
    n_priors = priors_cxcy.size(0)
    batch_size = pred_locs.size(0)
    n_classes = pred_scores.size(2)
    assert n_priors == pred_locs.size(1) == pred_scores.size(1)
    true_locs = torch.zeros((batch_size, n_priors, 4), dtype=torch.float).to(device)  # (N, 8732, 4)
    true_classes = torch.zeros((batch_size, n_priors), dtype=torch.long).to(device)  # (N, 8732)
    # Assign a ground-truth label to every prior, image by image
    for i in range(batch_size):
        cls, loc = label_prior(priors_cxcy, boxes[i], labels[i])
        true_locs[i] = loc
        true_classes[i] = cls
    positive_priors = (true_classes != 0)  # (N, 8732)
    # Localization loss: computed over positive (non-background) priors only
    loss_of_loc = loc_loss(pred_locs[positive_priors], true_locs[positive_priors])
    # Confidence loss
    # Pick negatives so that negatives : positives = 3 : 1, as in the paper
    n_hard_negative = 3 * positive_priors.sum(dim=1)  # (N)
    # First compute the confidence loss over all priors (positive and negative),
    # which saves us from tracking positions across the different images
    # CrossEntropyLoss(reduction='none') keeps the per-prior losses instead of summing or averaging them
    loss_of_conf_all = conf_loss(pred_scores.view(-1, n_classes), true_classes.view(-1))  # (N * 8732)
    loss_of_conf_all = loss_of_conf_all.view(batch_size, n_priors)  # (N, 8732)
    # We already know the loss of all positives
    loss_of_conf_pos = loss_of_conf_all[positive_priors]  # (sum(n_positives))
    loss_of_conf_neg = loss_of_conf_all.clone()  # (N, 8732)
    loss_of_conf_neg[positive_priors] = 0  # (N, 8732), so positives can never rank among the hard negatives
    loss_of_conf_neg, _ = loss_of_conf_neg.sort(dim=1, descending=True)  # sort the negatives by loss, descending
    neg_ranks = torch.LongTensor(range(n_priors)).unsqueeze(0).expand_as(loss_of_conf_neg).to(device)  # (N, 8732), rank within each row
    hard_negatives = (neg_ranks < n_hard_negative.unsqueeze(1))  # (N, 8732)
    loss_of_conf_hard_neg = loss_of_conf_neg[hard_negatives]  # (sum(n_hard_negatives))
    # As in the paper, averaged over positive priors only, although computed over both positive and hard-negative priors
    loss_of_conf = (loss_of_conf_pos.sum() + loss_of_conf_hard_neg.sum()) / positive_priors.sum().float()  # (), scalar
    # TOTAL LOSS
    return loss_of_conf + alpha * loss_of_loc
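With the loss in place, one epoch of training could be sketched as follows. The optimizer choice and hyper-parameters are placeholders (not the paper's exact schedule), and the sketch assumes the SSD300 module from 2.6 exposes its priors as model.priors_cxcy:

model = SSD300(n_classes=21).to(device)  # 20 object classes + background
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=5e-4)

model.train()
for images, boxes, labels in train_loader:
    images = images.to(device)                        # (N, 3, 300, 300)
    boxes = [b.to(device) for b in boxes]
    labels = [l.to(device) for l in labels]
    pred_locs, pred_scores = model(images)            # (N, 8732, 4), (N, 8732, n_classes)
    loss = calculate_loss(model.priors_cxcy, pred_locs, pred_scores,
                          boxes, labels, location_loss, confidence_loss)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()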
6. Detection
6.1 Non-maximum suppression
At detection time we do not want to output too many predicted boxes (at this point they overlap heavily), so we apply non-maximum suppression: boxes considered duplicates (the IoU between two predictions exceeds a given threshold) are removed and only the most confident box is kept.
def none_max_suppress(priors_cxcy, pred_locs, pred_scores, min_score, max_overlap, top_k):
    """
    Perform non-maximum suppression
    :param priors_cxcy: priors in center form
    :param pred_locs: predicted offsets, output of the predictor
    :param pred_scores: predicted class scores, output of the predictor
    :param min_score: minimum score for a prediction to be kept
    :param max_overlap: maximum overlap allowed before a box is suppressed
    :param top_k: keep at most top_k detections
    :return: suppressed bounding boxes in boundary form, class labels and scores
    """
    batch_size = pred_locs.size(0)
    n_priors = priors_cxcy.size(0)
    n_classes = pred_scores.size(2)
    pred_scores = torch.softmax(pred_scores, dim=2)  # (batch_size, n_priors, n_classes)
    assert n_priors == pred_scores.size(1) == pred_locs.size(1)
    boxes_all_image = []
    scores_all_image = []
    labels_all_image = []
    for i in range(batch_size):
        # Decode the predicted offsets into boundary-form boxes
        boxes = cxcy_to_xy(gcxgcy_to_cxcy(pred_locs[i], priors_cxcy))  # (n_priors, 4)
        boxes_per_image = []
        scores_per_image = []
        labels_per_image = []
        for c in range(1, n_classes):
            class_scores = pred_scores[i, :, c]  # (8732)
            score_above_min = class_scores > min_score
            n_score_above_min = score_above_min.sum().item()
            if n_score_above_min == 0:
                continue
            # Keep only predictions with score > min_score
            class_scores = class_scores[score_above_min]
            class_boxes = boxes[score_above_min]
            # Sort by detection confidence
            class_scores, sorted_ind = class_scores.sort(dim=0, descending=True)  # (n_score_above_min)
            class_boxes = class_boxes[sorted_ind]  # (n_score_above_min, 4)
            # Non-maximum suppression based on IoU
            overlap = find_jaccard_overlap(class_boxes, class_boxes)  # (n_score_above_min, n_score_above_min)
            # Mask recording which boxes are suppressed; 1 means suppressed
            suppress = torch.zeros((n_score_above_min), dtype=torch.uint8).to(device)
            for b_id in range(n_score_above_min):
                # Skip boxes the mask already marks as suppressed
                if suppress[b_id] == 1:
                    continue
                # Suppress boxes whose overlap with this one is > max_overlap, keeping earlier suppressions intact
                suppress = torch.max(suppress, (overlap[b_id] > max_overlap).byte())
                # Never suppress the current box itself
                suppress[b_id] = 0
            # For each class keep only the predictions that survived
            boxes_per_image.append(class_boxes[(1 - suppress).bool()])
            scores_per_image.append(class_scores[(1 - suppress).bool()])
            labels_per_image.append(torch.LongTensor([c] * (1 - suppress).sum().item()).to(device))
        # If no class at all was detected in this image, label the whole image as background
        if len(labels_per_image) == 0:
            boxes_per_image.append(torch.FloatTensor([[0., 0., 1., 1.]]).to(device))
            labels_per_image.append(torch.LongTensor([0]).to(device))
            scores_per_image.append(torch.FloatTensor([0.]).to(device))
        boxes_per_image = torch.cat(boxes_per_image, dim=0)  # (n_objects, 4)
        scores_per_image = torch.cat(scores_per_image, dim=0)  # (n_objects)
        labels_per_image = torch.cat(labels_per_image, dim=0)  # (n_objects)
        n_object = boxes_per_image.size(0)
        # Keep only the top_k objects by confidence
        if n_object > top_k:
            scores_per_image, sorted_ind = scores_per_image.sort(dim=0, descending=True)
            scores_per_image = scores_per_image[:top_k]
            boxes_per_image = boxes_per_image[sorted_ind][:top_k]
            labels_per_image = labels_per_image[sorted_ind][:top_k]
        boxes_all_image.append(boxes_per_image)
        scores_all_image.append(scores_per_image)
        labels_all_image.append(labels_per_image)
    return boxes_all_image, labels_all_image, scores_all_image  # three lists of length batch_size
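Putting everything together, detection on a batch could be sketched like this (the thresholds are typical values, not necessarily the ones the paper uses, and model/images are assumed to come from the earlier sketches):

model.eval()
with torch.no_grad():
    pred_locs, pred_scores = model(images)  # images: (N, 3, 300, 300), already normalized
    det_boxes, det_labels, det_scores = none_max_suppress(model.priors_cxcy, pred_locs, pred_scores,
                                                          min_score=0.2, max_overlap=0.45, top_k=200)
# det_boxes are percent coordinates; multiply by the original width/height before drawing them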
Extra notes: a few things to watch out for
- We concatenate the outputs of the different feature maps into one tensor. conv4_3's feature map sits at a low level, so its feature values are much larger than those of the higher-level maps (downsampling shrinks the feature responses), so we can normalize that feature map (e.g. L2 normalization) and then rescale it by a factor the network learns on its own. I think Batch Normalization would work as well. A minimal sketch follows.
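A sketch of that rescaling, wrapped in a small module (the initial value of 20 follows common SSD implementations and is an assumption here):

class L2Norm(nn.Module):
    """Sketch: channel-wise L2 normalization with a learnable per-channel rescale factor."""
    def __init__(self, n_channels=512, init_scale=20.):
        super(L2Norm, self).__init__()
        self.rescale_factors = nn.Parameter(torch.FloatTensor(1, n_channels, 1, 1))
        nn.init.constant_(self.rescale_factors, init_scale)

    def forward(self, x):  # x: (N, 512, 38, 38) for conv4_3
        norm = x.pow(2).sum(dim=1, keepdim=True).sqrt()  # L2 norm over channels, (N, 1, 38, 38)
        return x / norm * self.rescale_factors           # renormalize, then rescale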
- Indexing a multi-dimensional tensor with a dtype=torch.bool mask (or torch.uint8, although uint8 indexing has been deprecated since at least 1.3.0) returns a flattened result (note: this holds when the mask's dimensions correspond one-to-one with the original tensor's; if the mask only covers the leading dimensions, the remaining dims are kept (even if only one sub-array remains), whereas slicing would squeeze a dimension down once only one element is left). For example:
x = torch.rand((2, 3, 4))  # suppose about half of the values are > 0.5
y = x > 0.5                # y has shape (2, 3, 4), roughly half True and half False
print(x[y].shape)          # a flattened tensor, e.g. torch.Size([12])
- A few tricks to speed up training:
torch.backends.cudnn.benchmark = True
Set pin_memory=True in the DataLoader to use page-locked (pinned) memory, which is never swapped out and speeds up transfers to the GPU; it requires enough memory. For details see: https://blog.csdn.net/tfcy694/article/details/83270701
I did not use an eval function here to measure how well the model actually performs; mAP is a good choice. When saving the best model you can keep the parameters whenever the eval metric improves, and the same metric can be used to stop training early.
I'm new at this, so please bear with me; everything here was written by hand, and discussion is welcome.
Please credit the source when reposting.
References
a-PyTorch-Tutorial-to-Object-Detection
Dive into Deep Learning (《动手学深度学习》)