SSD Object Detection: A Code Walkthrough

I recently read through the SSD source code and sorted out its logic; these are my study notes.

Code: https://github.com/balancap/SSD-Tensorflow

I. Network Structure

First, the network structure diagram, as a reference for the analysis that follows. The diagram shows SSD 300, while the code I read is SSD 512, but the ideas barely differ. The visible difference from YOLO is that SSD does not extract candidate boxes only at the last layer: starting from several intermediate feature layers, it already extracts candidates through 3x3 convolutions, and it introduces anchors. The number of anchors differs across feature layers, from 38x38x4 at the front, through 19x19x6, down to 3x3x4 and finally 1x1x4 (grid size times anchors per cell). Summed over all layers of SSD 300, that is 38x38x4 + 19x19x6 + 10x10x6 + 5x5x6 + 3x3x4 + 1x1x4 = 8732 candidate boxes, which greatly enlarges the candidate pool, with the layers additionally dividing the labor of detecting objects of different sizes.

Now to the code. The network is assembled in ssd_512_net.py; first, a look at the parameters that define the structure:

The parameters below are used to build the network. feat_layers names the blocks that serve as feature layers for extracting candidate boxes; feat_shapes lists the corresponding feature-layer sizes, playing the role of the old cell_size, except that several feature layers are used at once, so there are several cell sizes; normalizations gives each feature layer's normalization flag. The first feature layer sits early in the network and its activations run larger than the other layers', so it is the only one that gets L2-normalized.

feat_layers = ['block4', 'block7', 'block8', 'block9', 'block10', 'block11', 'block12']
feat_shapes = [(64, 64), (32, 32), (16, 16), (8, 8), (4, 4), (2, 2), (1, 1)]
normalizations = [20, -1, -1, -1, -1, -1, -1]

The parameters below drive the construction of the anchors, chiefly anchor_sizes and anchor_ratios. The anchors are built by these rules:

First anchor: anchor_sizes[0], the reference size.

Second anchor: sqrt(anchor_sizes[0] * anchor_sizes[1]), the geometric mean of the two sizes.

Remaining anchors: anchor_sizes[0] stretched by each ratio in anchor_ratios (height divided by sqrt(r), width multiplied by sqrt(r)).

So each cell gets 1 + 1 + len(anchor_ratios) = len(anchor_sizes) + len(anchor_ratios) anchors, since every layer's anchor_sizes entry holds two values; a worked example follows the parameter block below.

anchor_size_bounds = [0.10, 0.90]
anchor_sizes = [(20.48, 51.2),
                (51.2, 133.12),
                (133.12, 215.04),
                (215.04, 296.96),
                (296.96, 378.88),
                (378.88, 460.8),
                (460.8, 542.72)]
anchor_ratios = [[2, .5],
                 [2, .5, 3, 1./3],
                 [2, .5, 3, 1./3],
                 [2, .5, 3, 1./3],
                 [2, .5, 3, 1./3],
                 [2, .5],
                 [2, .5]]
anchor_steps = [8, 16, 32, 64, 128, 256, 512]
anchor_offset = 0.5
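
As a quick check of these rules, here is a small sketch of my own (not from the repo) that computes the four relative (h, w) anchor shapes of the first feature layer, using the block4 values from the parameters above:

import math

img = 512.0                # SSD 512 input size
sizes = (20.48, 51.2)      # anchor_sizes[0] (block4)
ratios = [2, .5]           # anchor_ratios[0] (block4)

# Rule 1: the reference size, aspect ratio 1.
hw = [(sizes[0] / img, sizes[0] / img)]
# Rule 2: geometric mean of the two sizes, aspect ratio 1.
s1 = math.sqrt(sizes[0] * sizes[1]) / img
hw.append((s1, s1))
# Rule 3: the reference size stretched by each aspect ratio.
for r in ratios:
    hw.append((sizes[0] / img / math.sqrt(r),
               sizes[0] / img * math.sqrt(r)))

print(len(hw))  # 4 == len(sizes) + len(ratios)
print(hw)       # relative (h, w) of each anchor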

With the parameters covered, on to the code. First, the network construction; go straight to the ssd_net() function for the full build:

def ssd_net(inputs,
            num_classes,
            feat_layers,
            anchor_sizes,
            anchor_ratios,
            normalizations,
            is_training=True,
            dropout_keep_prob=0.5,
            prediction_fn=slim.softmax,
            reuse=None,
            scope='ssd_512_vgg'):
    """SSD net definition.
    """
    # if data_format == 'NCHW':
    #     inputs = tf.transpose(inputs, perm=(0, 3, 1, 2))

    # End_points collect relevant activations for external use.
    # Process the backbone in blocks of convolution + pooling, storing each block's output in end_points
    end_points = {}
    with tf.variable_scope(scope, 'ssd_512_vgg', [inputs], reuse=reuse):
        # Original VGG-16 blocks.
        print(inputs)
        net = slim.repeat(inputs, 2, slim.conv2d, 64, [3, 3], scope='conv1')
        end_points['block1'] = net
        print('block1', net)
        net = slim.max_pool2d(net, [2, 2], scope='pool1')
        # Block 2.
        net = slim.repeat(net, 2, slim.conv2d, 128, [3, 3], scope='conv2')
        end_points['block2'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool2')
        # Block 3.
        net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], scope='conv3')
        end_points['block3'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool3')
        # Block 4.
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv4')
        end_points['block4'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool4')
        # Block 5.
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv5')
        end_points['block5'] = net
        net = slim.max_pool2d(net, [3, 3], 1, scope='pool5')

        # Additional SSD blocks.
        # Block 6: let's dilate the hell out of it!
        net = slim.conv2d(net, 1024, [3, 3], rate=6, scope='conv6')
        end_points['block6'] = net
        # Block 7: 1x1 conv. Because the fuck.
        net = slim.conv2d(net, 1024, [1, 1], scope='conv7')
        end_points['block7'] = net

        # Block 8/9/10/11: 1x1 and 3x3 convolutions stride 2 (except lasts).
        end_point = 'block8'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 256, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 512, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        print('block8', net)
        end_point = 'block9'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        print('block9', net)
        end_point = 'block10'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        print('block10', net)
        end_point = 'block11'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        print('block11', net)
        end_point = 'block12'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 256, [4, 4], scope='conv4x4', padding='VALID')
            # Fix padding to match Caffe version (pad=1).
            # pad_shape = [(i-j) for i, j in zip(layer_shape(net), [0, 1, 1, 0])]
            # net = tf.slice(net, [0, 0, 0, 0], pad_shape, name='caffe_pad')
            print(net)
        end_points[end_point] = net

        # Prediction and localisations layers.
        predictions = []
        logits = []
        localisations = []
        # For each feature layer named in feat_layers, regress box coordinates and predict class scores
        for i, layer in enumerate(feat_layers):
            with tf.variable_scope(layer + '_box'):
                p, l = ssd_multibox_layer(end_points[layer],
                                                      num_classes,
                                                      anchor_sizes[i],
                                                      anchor_ratios[i],
                                                      normalizations[i])
            # prediction_fn is simply softmax here
            predictions.append(prediction_fn(p))
            logits.append(p)
            localisations.append(l)
        print(logits)
        #
        # predictions: [[batch_num, 64, 64, 4, class_num], .....[batch_num, 1, 1, 4, class_num]]
        # logits : [[batch_num, 64, 64, 4, class_num], .....[batch_num, 1, 1, 4, class_num]]
        # localisations : [[batch_num, 64, 64, 4, 4], .....[batch_num, 1, 1, 4, 4]]
        return predictions, localisations, logits, end_points

The stack of convolutions and pooling at the front is unremarkable. Worth a mention is the pad2d() function, which pads the tensor appropriately so that the following convolution produces the intended output size. For each feature layer stored block by block in end_points, ssd_multibox_layer() then combines the anchors to extract candidate boxes and classify them.
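
For reference, pad2d() in custom_layers.py is essentially a thin wrapper around tf.pad over the two spatial dimensions; a simplified sketch (the repo version also handles data_format and name scoping):

import tensorflow as tf

def pad2d(inputs, pad=(0, 0), mode='CONSTANT'):
    # Pad only the height and width of an NHWC tensor, so that a
    # following VALID convolution yields the intended spatial size.
    paddings = [[0, 0], [pad[0], pad[0]], [pad[1], pad[1]], [0, 0]]
    return tf.pad(inputs, paddings, mode=mode)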

Here is the implementation:

def ssd_multibox_layer(inputs,
                       num_classes,
                       sizes,
                       ratios=[1],
                       normalization=-1,
                       bn_normalization=False):
    """Construct a multibox layer, return a class and localization predictions.
    """
    net = inputs
    # Apply L2 normalization when this layer's normalization flag is positive
    if normalization > 0:
        net = custom_layers.l2_normalization(net, scaling=True)
    # Number of anchors.
    # Number of anchors per cell on this feature layer
    num_anchors = len(sizes) + len(ratios)

    # Location.
    # Localization predicts four box-regression parameters per anchor, hence num_anchors * 4 channels
    num_loc_pred = num_anchors * 4
    loc_pred = slim.conv2d(net, num_loc_pred, [3, 3], activation_fn=None,
                           scope='conv_loc')
    # Handles the NCHW vs. NHWC layouts: this helper moves channels to the last axis (NHWC)
    loc_pred = custom_layers.channel_to_last(loc_pred)
    # Reshape to [batch_num, cell_size, cell_size, num_anchors, 4]
    loc_pred = tf.reshape(loc_pred,
                          tensor_shape(loc_pred, 4)[:-1]+[num_anchors, 4])
    # Class prediction.
    # Same as above, but predicting classes, so the output has num_anchors * num_classes channels
    num_cls_pred = num_anchors * num_classes
    cls_pred = slim.conv2d(net, num_cls_pred, [3, 3], activation_fn=None,
                           scope='conv_cls')
    cls_pred = custom_layers.channel_to_last(cls_pred)
    # [BATCH_SIZE, CELL_SIZE, CELL_SIZE, NUM_ANCHORS, NUM_CLASSES]
    cls_pred = tf.reshape(cls_pred,
                          tensor_shape(cls_pred, 4)[:-1]+[num_anchors, num_classes])
    return cls_pred, loc_pred

At this point we have the network outputs: predictions (class scores) and localisations (box regressions).
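
To make the shapes concrete, a hypothetical smoke test, assuming ssd_net() and the parameter lists from above are in scope (21 classes = 20 VOC classes + background is my assumption, not something this snippet fixes):

import tensorflow as tf

images = tf.placeholder(tf.float32, [None, 512, 512, 3])
predictions, localisations, logits, end_points = ssd_net(
    images, num_classes=21,
    feat_layers=feat_layers,
    anchor_sizes=anchor_sizes,
    anchor_ratios=anchor_ratios,
    normalizations=normalizations)

print(predictions[0].shape)    # (?, 64, 64, 4, 21)
print(localisations[0].shape)  # (?, 64, 64, 4, 4)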

II. Sample Encoding

A sample, once read in, only carries the classes and positions of the objects in one image. To turn it into something a loss can be computed against, every ground-truth box must be distributed over the full anchor set by IoU. Sample encoding therefore has two parts: computing the shapes of all anchors, and the encoding itself.

1. Building the set of anchors

For the code, go straight to ssd_anchors_all_layers() in ssd_300_vgg.py:

# Build the anchors for every feature layer
def ssd_anchors_all_layers(img_shape,
                           layers_shape,
                           anchor_sizes,
                           anchor_ratios,
                           anchor_steps,
                           offset=0.5,
                           dtype=np.float32):
    """Compute anchor boxes for all feature layers.
    """
    layers_anchors = []
    # For each feature layer's spatial size
    for i, s in enumerate(layers_shape):
        # Inputs:
        # img_shape: image size, used to express everything in relative coordinates
        # s: this feature layer's size, e.g. (64, 64) for the first SSD 512 layer
        # anchor_sizes: the reference anchor sizes
        # anchor_ratios: the aspect ratios to apply
        # anchor_steps: how much this feature map is downscaled from the image
        # Output:
        # anchor_bboxes: the anchor geometry for this layer, as (y, x, h, w):
        #               for the first layer, y and x are (64, 64, 1) center grids and h, w are (4,)
        #               vectors of anchor heights/widths; some layers have 4 anchor shapes, others 6
        anchor_bboxes = ssd_anchor_one_layer(img_shape, s,
                                             anchor_sizes[i],
                                             anchor_ratios[i],
                                             anchor_steps[i],
                                             offset=offset, dtype=dtype)
        print(anchor_bboxes)
        # layers_anchors: one (y, x, h, w) tuple per layer, from the 64x64 grid down to 1x1
        layers_anchors.append(anchor_bboxes)
    return layers_anchors

Same pattern as before: for each feature layer, build its anchors according to that layer's spec, then stack them into a list. Keep following into ssd_anchor_one_layer() to see how the anchors of a single feature layer are constructed:

def ssd_anchor_one_layer(img_shape,
                         feat_shape,
                         sizes,
                         ratios,
                         step,
                         offset=0.5,
                         dtype=np.float32):
    """Computer SSD default anchor boxes for one feature layer.

    Determine the relative position grid of the centers, and the relative
    width and height.

    Arguments:
      feat_shape: Feature shape, used for computing relative position grids;
      size: Absolute reference sizes;
      ratios: Ratios to use on these features;
      img_shape: Image shape, used for computing height, width relatively to the
        former;
      offset: Grid offset.

    Return:
      y, x, h, w: Relative x and y grids, and height and width.
    """
    # Compute the position grid: simple way.
    # y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
    # y = (y.astype(dtype) + offset) / feat_shape[0]
    # x = (x.astype(dtype) + offset) / feat_shape[1]
    # Weird SSD-Caffe computation using steps values...
    # Grid of cell indices
    y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
    y = (y.astype(dtype) + offset) * step / img_shape[0]
    x = (x.astype(dtype) + offset) * step / img_shape[1]

    # Expand dims to support easy broadcasting.
    y = np.expand_dims(y, axis=-1)
    x = np.expand_dims(x, axis=-1)

    # Compute relative height and width.
    # Tries to follow the original implementation of SSD for the order.
    # The number of anchors differs across feature layers
    num_anchors = len(sizes) + len(ratios)
    h = np.zeros((num_anchors, ), dtype=dtype)
    w = np.zeros((num_anchors, ), dtype=dtype)
    # Add first anchor boxes with ratio=1.
    # Here is how each layer's anchor shapes are built:
    # from sizes: sizes[0] gives the reference anchor; sqrt(sizes[0] * sizes[1]) gives a second one
    # from ratios: each ratio rescales the reference size sizes[0]
    # so each feature layer has len(sizes) + len(ratios) anchors per cell
    h[0] = sizes[0] / img_shape[0]
    w[0] = sizes[0] / img_shape[1]
    di = 1
    if len(sizes) > 1:
        h[1] = math.sqrt(sizes[0] * sizes[1]).real / img_shape[0]
        w[1] = math.sqrt(sizes[0] * sizes[1]).real / img_shape[1]
        di += 1
    for i, r in enumerate(ratios):
        h[i+di] = sizes[0] / img_shape[0] / math.sqrt(r).real
        w[i+di] = sizes[0] / img_shape[1] * math.sqrt(r).real
    # For the first layer (a 64x64 feature map with 2 + 2 anchor shapes), the returns are
    # y, x of shape (64, 64, 1) and h, w of shape (4,)
    return y, x, h, w

We now have one feature layer's anchors; doing this layer by layer and stacking the results yields the complete anchor set, one (y, x, h, w) tuple per layer, from the 64x64 grid down to 1x1. Note that these are centers and sizes in relative image coordinates, not corner points.
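
Before moving on, it is easy to count what this produces; a tiny sketch of mine using the parameters from section I:

feat_shapes = [(64, 64), (32, 32), (16, 16), (8, 8), (4, 4), (2, 2), (1, 1)]
anchors_per_cell = [4, 6, 6, 6, 6, 4, 4]  # len(sizes) + len(ratios) per layer

# Each cell of each feature map carries its layer's full set of anchor shapes.
total = sum(h * w * n for (h, w), n in zip(feat_shapes, anchors_per_cell))
print(total)  # 24564 anchors in total for SSD 512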

2. Encoding the samples

With the exact position of every anchor in hand, we can, as in Faster R-CNN, encode the ground truth against each anchor: for every sample image, find the anchors responsible for each object to be detected.

One thing to point out here: the encoded ground-truth coordinates, and likewise the locations the network predicts, are not real coordinates but offsets expressed relative to the matched anchor and scaled by prior_scaling. Reading it off the encode code further down, the transform is:

l_cx = (b_cx - d_cx) / (d_w * s_x)
l_cy = (b_cy - d_cy) / (d_h * s_y)
l_w  = log(b_w / d_w) / s_w
l_h  = log(b_h / d_h) / s_h

Here b = (cx, cy, w, h) is the ground-truth box, d is the anchor responsible for that object, and s = prior_scaling = [0.1, 0.1, 0.2, 0.2]. What we encode, and what the network predicts, is l; that makes the mathematical relationship explicit.
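
The inverse is what detection uses to turn predictions back into boxes; a small numpy sketch of mine (the repo does the equivalent in ssd_common.py's tf_ssd_bboxes_decode_layer()):

import numpy as np

s = [0.1, 0.1, 0.2, 0.2]  # prior_scaling

def decode(l, d):
    # l: encoded/predicted (cx, cy, w, h); d: the matched anchor (cx, cy, w, h).
    b_cx = l[0] * d[2] * s[0] + d[0]
    b_cy = l[1] * d[3] * s[1] + d[1]
    b_w  = d[2] * np.exp(l[2] * s[2])
    b_h  = d[3] * np.exp(l[3] * s[3])
    return b_cx, b_cy, b_w, b_h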


The encoding routine, like the rest, is wrapped in the network class but actually calls an external function, here tf_ssd_bboxes_encode() in ssd_common.py:

def tf_ssd_bboxes_encode(labels,
                         bboxes,
                         anchors,
                         num_classes,
                         no_annotation_label,
                         ignore_threshold=0.5,
                         prior_scaling=[0.1, 0.1, 0.2, 0.2],
                         dtype=tf.float32,
                         scope='ssd_bboxes_encode'):
    """Encode groundtruth labels and bounding boxes using SSD net anchors.
    Encoding boxes for all feature layers.

    Arguments:
      labels: 1D Tensor(int64) containing groundtruth labels;
      bboxes: Nx4 Tensor(float) with bboxes relative coordinates;
      anchors: List of Numpy array with layer anchors;
      matching_threshold: Threshold for positive match with groundtruth bboxes;
      prior_scaling: Scaling of encoded coordinates.

    Return:
      (target_labels, target_localizations, target_scores):
        Each element is a list of target Tensors.
    """
    # First, the input shapes (also clear from the docstring above):
    # labels: a 1-D vector holding, in order, the classes present in the image
    # bboxes: an N x 4 tensor, where N should be len(labels): one position per labeled object
    # anchors: the list of all anchors on all feature layers, built earlier
    with tf.name_scope(scope):
        # Pre-allocate lists for the class, localization, and score targets
        target_labels = []
        target_localizations = []
        target_scores = []
        # For each feature layer
        for i, anchors_layer in enumerate(anchors):
            with tf.name_scope('bboxes_encode_block_%i' % i):
                t_labels, t_loc, t_scores = \
                    tf_ssd_bboxes_encode_layer(labels, bboxes, anchors_layer,
                                               num_classes, no_annotation_label,
                                               ignore_threshold,
                                               prior_scaling, dtype)
                target_labels.append(t_labels)
                target_localizations.append(t_loc)
                target_scores.append(t_scores)
        # target_labels: [[64, 64, 4], ..., [1, 1, 4]]
        # target_localizations: [[64, 64, 4, 4], ..., [1, 1, 4, 4]]
        # target_scores: [[64, 64, 4], ..., [1, 1, 4]]
        return target_labels, target_localizations, target_scores

Same pattern again: handle each layer's anchors separately. Go straight into tf_ssd_bboxes_encode_layer() to see how a single layer is encoded:

def tf_ssd_bboxes_encode_layer(labels,
                               bboxes,
                               anchors_layer,
                               num_classes,
                               no_annotation_label,
                               ignore_threshold=0.5,
                               prior_scaling=[0.1, 0.1, 0.2, 0.2],
                               dtype=tf.float32):
    """Encode groundtruth labels and bounding boxes using SSD anchors from
    one layer.

    Arguments:
      labels: 1D Tensor(int64) containing groundtruth labels;
      bboxes: Nx4 Tensor(float) with bboxes relative coordinates;
      anchors_layer: Numpy array with layer anchors;
      matching_threshold: Threshold for positive match with groundtruth bboxes;
      prior_scaling: Scaling of encoded coordinates.

    Return:
      (target_labels, target_localizations, target_scores): Target Tensors.
    """
    # Anchors coordinates and volume.
    # From the centers (y, x) and sizes (h, w), derive every anchor's corner coordinates
    yref, xref, href, wref = anchors_layer
    ymin = yref - href / 2.
    xmin = xref - wref / 2.
    ymax = yref + href / 2.
    xmax = xref + wref / 2.
    # Every anchor's area, used for the IoU computation below
    vol_anchors = (xmax - xmin) * (ymax - ymin)

    # Initialize tensors...
    # shape: [CELL_SIZE, CELL_SIZE, NUM_ANCHORS]
    shape = (yref.shape[0], yref.shape[1], href.size)
    # Ground-truth target tensors, one entry per anchor
    feat_labels = tf.zeros(shape, dtype=tf.int64)
    feat_scores = tf.zeros(shape, dtype=dtype)

    feat_ymin = tf.zeros(shape, dtype=dtype)
    feat_xmin = tf.zeros(shape, dtype=dtype)
    feat_ymax = tf.ones(shape, dtype=dtype)
    feat_xmax = tf.ones(shape, dtype=dtype)

    # Jaccard (IoU) overlap between one box and all anchors
    def jaccard_with_anchors(bbox):
        """Compute jaccard score between a box and the anchors.
        """
        int_ymin = tf.maximum(ymin, bbox[0])
        int_xmin = tf.maximum(xmin, bbox[1])
        int_ymax = tf.minimum(ymax, bbox[2])
        int_xmax = tf.minimum(xmax, bbox[3])
        h = tf.maximum(int_ymax - int_ymin, 0.)
        w = tf.maximum(int_xmax - int_xmin, 0.)
        # Volumes.
        inter_vol = h * w
        union_vol = vol_anchors - inter_vol \
            + (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
        jaccard = tf.div(inter_vol, union_vol)
        return jaccard

    def intersection_with_anchors(bbox):
        """Compute intersection between score a box and the anchors.
        """
        int_ymin = tf.maximum(ymin, bbox[0])
        int_xmin = tf.maximum(xmin, bbox[1])
        int_ymax = tf.minimum(ymax, bbox[2])
        int_xmax = tf.minimum(xmax, bbox[3])
        h = tf.maximum(int_ymax - int_ymin, 0.)
        w = tf.maximum(int_xmax - int_xmin, 0.)
        inter_vol = h * w
        scores = tf.div(inter_vol, vol_anchors)
        return scores

    # while_loop condition: the number of labels determines the iteration count, folding every object in the image into the targets
    def condition(i, feat_labels, feat_scores,
                  feat_ymin, feat_xmin, feat_ymax, feat_xmax):
        """Condition: check label index.
        """
        # Changed by me: my samples contain a single object per image (the original iterates over all labels)
        return i < 1

    # Build the ground-truth targets
    def body(i, feat_labels, feat_scores,
             feat_ymin, feat_xmin, feat_ymax, feat_xmax):
        """Body: update feature labels, scores and bboxes.
        Follow the original SSD paper for that purpose:
          - assign values when jaccard > 0.5;
          - only update if beat the score of other bboxes.
        """
        # Jaccard score.
        # Grab the current label and bbox; also my modification, the original uses labels[i] and bboxes[i]
        label = labels[0]
        bbox = bboxes[0]
        # IoU between this bbox and every anchor
        jaccard = jaccard_with_anchors(bbox)
        # Mask: check threshold + scores + no annotations + num_classes.
        # Where the new IoU beats an anchor's current best score, mask is True: that anchor's assignment needs updating
        mask = tf.greater(jaccard, feat_scores)
        # mask = tf.logical_and(mask, tf.greater(jaccard, matching_threshold))
        # These casts just put the mask into forms convenient for the arithmetic below
        mask = tf.logical_and(mask, feat_scores > -0.5)
        mask = tf.logical_and(mask, label < num_classes)
        imask = tf.cast(mask, tf.int64)
        fmask = tf.cast(mask, dtype)
        # Update values using mask.
        # For the anchors where mask is True, overwrite the object they are responsible for.
        # The matching rule:
        #   each ground-truth box may be assigned to several anchors,
        #   but an anchor only answers for the ground-truth box with which its IoU is highest
        feat_labels = imask * label + (1 - imask) * feat_labels
        # tf.where: anchors where mask is True get feat_scores replaced by the new IoU; the rest keep their old values
        feat_scores = tf.where(mask, jaccard, feat_scores)
        # Update the corner-format ground-truth coordinates
        feat_ymin = fmask * bbox[0] + (1 - fmask) * feat_ymin
        feat_xmin = fmask * bbox[1] + (1 - fmask) * feat_xmin
        feat_ymax = fmask * bbox[2] + (1 - fmask) * feat_ymax
        feat_xmax = fmask * bbox[3] + (1 - fmask) * feat_xmax

        # Check no annotation label: ignore these anchors...
        # interscts = intersection_with_anchors(bbox)
        # mask = tf.logical_and(interscts > ignore_threshold,
        #                       label == no_annotation_label)
        # # Replace scores by -1.
        # feat_scores = tf.where(mask, -tf.cast(mask, dtype), feat_scores)

        return [i+1, feat_labels, feat_scores,
                feat_ymin, feat_xmin, feat_ymax, feat_xmax]
    # Main loop definition.
    # i = 0
    # [i, feat_labels, feat_scores,
    #  feat_ymin, feat_xmin,
    #  feat_ymax, feat_xmax] = tf.while_loop(condition, body,
    #                                        [i, feat_labels, feat_scores,
    #                                         feat_ymin, feat_xmin,
    #                                         feat_ymax, feat_xmax])
    # Changed again for my own use.
    # The original idea: iterate over all objects in the image (condition decides when to stop), building the targets (body does the work) inside tf.while_loop
    [i, feat_labels, feat_scores,
       feat_ymin, feat_xmin,
       feat_ymax, feat_xmax] = body(1, feat_labels, feat_scores,
                                             feat_ymin, feat_xmin,
                                             feat_ymax, feat_xmax)
    # Transform to center / size.
    # This is where the coordinate encoding happens
    feat_cy = (feat_ymax + feat_ymin) / 2.
    feat_cx = (feat_xmax + feat_xmin) / 2.
    feat_h = feat_ymax - feat_ymin
    feat_w = feat_xmax - feat_xmin
    # Encode features.
    feat_cy = (feat_cy - yref) / href / prior_scaling[0]
    feat_cx = (feat_cx - xref) / wref / prior_scaling[1]
    feat_h = tf.log(feat_h / href) / prior_scaling[2]
    feat_w = tf.log(feat_w / wref) / prior_scaling[3]
    # Use SSD ordering: x / y / w / h instead of ours.
    # Stack the four encoded coordinates
    feat_localizations = tf.stack([feat_cx, feat_cy, feat_w, feat_h], axis=-1)
    # The returns are the targets of one image against one layer's anchors.
    # Output shapes:
    # feat_labels: [CELL_SIZE, CELL_SIZE, NUM_ANCHORS]
    # feat_localizations: [CELL_SIZE, CELL_SIZE, NUM_ANCHORS, 4]
    # feat_scores: [CELL_SIZE, CELL_SIZE, NUM_ANCHORS]
    return feat_labels, feat_localizations, feat_scores

The code is long, but taken in pieces it is simple, roughly three blocks:

1) Initialize storage tensors of the appropriate shape and give them initial values.

2) Define a few helper functions: computing the IoU, deciding whether all gt boxes and labels in the image have been visited, and assigning each gt box to suitable anchors.

3) Tie these together with tf.while_loop to complete the gt encoding.

The main things to look at are those helper functions. The IoU computation (jaccard_with_anchors) and the label-traversal check (condition) are simple. Since my samples contain a single object per image, I rewrote condition to stop once the count reaches 1; the original is just as easy to read, as shown below.
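
For reference, the repo's original condition (as I recall it) is nothing more than a bounds check on the label index:

def condition(i, feat_labels, feat_scores,
              feat_ymin, feat_xmin, feat_ymax, feat_xmax):
    # Keep looping while there are ground-truth labels left to process;
    # labels comes from the enclosing tf_ssd_bboxes_encode_layer() scope.
    r = tf.less(i, tf.shape(labels))
    return r[0]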

The function that really deserves attention is body(), shown above inside tf_ssd_bboxes_encode_layer(). To recap its flow: it computes the Jaccard overlap between the current ground-truth box and every anchor, masks the anchors whose overlap beats their current best score, and uses that mask to overwrite the stored label, score, and corner coordinates. A ground-truth box may claim several anchors, but each anchor only keeps the box with which its IoU is highest.

With that, the samples are encoded onto the appropriate anchors and the encoding step is done.

III. Loss Construction

With the groundwork above, the loss-construction code is straightforward. One more thing to note: only anchors with a reasonable chance of containing an object take part in building the loss graph:

# Loss definition
def ssd_losses(logits, localisations,
               gclasses, glocalisations, gscores,
               match_threshold=0.5,
               negative_ratio=3.,
               alpha=1.,
               label_smoothing=0.,
               device='/cpu:0',
               scope=None):
    with tf.name_scope(scope, 'ssd_losses'):
        lshape = get_shape(logits[0], 5)
        num_classes = lshape[-1]
        batch_size = lshape[0]

        # Flatten out all vectors!
        # The pile of operations below flattens each tensor and concatenates across layers
        # ground truth:
        #           gclasses: [batch_num*(64*64*4 + ... + 1*1*4)]
        #           gscores: [batch_num*(64*64*4 + ... + 1*1*4)]
        #           glocalisations: [batch_num*(64*64*4 + ... + 1*1*4), 4]
        # predictions:
        #           logits: [batch_num*(64*64*4 + ... + 1*1*4), num_classes]
        #           localisations: [batch_num*(64*64*4 + ... + 1*1*4), 4]
        flogits = []
        fgclasses = []
        fgscores = []
        flocalisations = []
        fglocalisations = []
        for i in range(len(logits)):
            flogits.append(tf.reshape(logits[i], [-1, num_classes]))
            fgclasses.append(tf.reshape(gclasses[i], [-1]))
            fgscores.append(tf.reshape(gscores[i], [-1]))
            flocalisations.append(tf.reshape(localisations[i], [-1, 4]))
            fglocalisations.append(tf.reshape(glocalisations[i], [-1, 4]))
        # And concat the crap!
        logits = tf.concat(flogits, axis=0)
        gclasses = tf.concat(fgclasses, axis=0)
        gscores = tf.concat(fgscores, axis=0)
        localisations = tf.concat(flocalisations, axis=0)
        glocalisations = tf.concat(fglocalisations, axis=0)
        dtype = logits.dtype

        # Compute positive matching mask...
        # Only anchors whose score (gt IoU) exceeds match_threshold count as positive samples
        pmask = gscores > match_threshold
        fpmask = tf.cast(pmask, dtype)
        n_positives = tf.reduce_sum(fpmask)

        # Hard negative mining...
        # Everything else is treated as background
        no_classes = tf.cast(pmask, tf.int32)
        # Softmax turns the logits into per-class probabilities
        predictions = slim.softmax(logits)
        # Any non-positive anchor with a valid score is a negative candidate
        nmask = tf.logical_and(tf.logical_not(pmask),
                               gscores > -0.5)
        fnmask = tf.cast(nmask, dtype)
        # For negative candidates take the predicted background probability; elsewhere use 1, so those entries never rank as hard negatives
        nvalues = tf.where(nmask,
                           predictions[:, 0],
                           1. - fnmask)
        # Flatten to [batch_num*(64*64*4 + ... + 1*1*4)]
        nvalues_flat = tf.reshape(nvalues, [-1])
        # Number of negative entries to select.
        # Re-select the negatives so they stay at about negative_ratio (3:1) times the positives
        max_neg_entries = tf.cast(tf.reduce_sum(fnmask), tf.int32)
        n_neg = tf.cast(negative_ratio * n_positives, tf.int32) + batch_size
        n_neg = tf.minimum(n_neg, max_neg_entries)

        # Hard negative mining: keep the n_neg candidates whose predicted
        # background probability is lowest (the hardest negatives)
        val, idxes = tf.nn.top_k(-nvalues_flat, k=n_neg)
        max_hard_pred = -val[-1]
        # Final negative mask.
        nmask = tf.logical_and(nmask, nvalues < max_hard_pred)
        fnmask = tf.cast(nmask, dtype)

        # Add cross-entropy loss.
        with tf.name_scope('cross_entropy_pos'):
            # Cross-entropy over the positive samples
            loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                                  labels=gclasses)
            #loss = tf.div(tf.reduce_sum(loss * fpmask), batch_size, name='value')
            loss = tf.reduce_sum(loss * fpmask)
            tf.losses.add_loss(loss)

        with tf.name_scope('cross_entropy_neg'):
            # Cross-entropy over the mined negatives, with background (class 0) as the target
            loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                                  labels=no_classes)
            #loss = tf.div(tf.reduce_sum(loss * fnmask), batch_size, name='value')
            loss = tf.reduce_sum(loss * fnmask)
            tf.losses.add_loss(loss)

        # Add localization loss: smooth L1, L2, ...
        with tf.name_scope('localization'):
            # Weights Tensor: positive mask + random negative.
            # Smooth-L1 localization regression, weighted by the positive mask
            weights = tf.expand_dims(alpha * fpmask, axis=-1)
            loss = custom_layers.abs_smooth(localisations - glocalisations)
            #loss = tf.div(tf.reduce_sum(loss * weights), batch_size, name='value')
            loss = tf.reduce_sum(loss * weights)
            tf.losses.add_loss(loss)

Once every term has been added through tf.losses.add_loss(), the loss construction is complete.
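
Since each term is registered via tf.losses.add_loss() rather than returned, the training script still has to collect them; a minimal sketch of that step (the optimizer choice is mine, not the repo's):

import tensorflow as tf

# Sum everything registered through tf.losses.add_loss(),
# plus any regularization losses in the default collection.
total_loss = tf.losses.get_total_loss()
optimizer = tf.train.MomentumOptimizer(learning_rate=1e-3, momentum=0.9)
train_op = optimizer.minimize(total_loss)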
