SSD Source Code Analysis

SSD Fundamentals

SSD code: https://github.com/balancap/SSD-Tensorflow
SSD paper: https://arxiv.org/abs/1512.02325
It is recommended to read the paper before diving into the source code.
SSD, short for Single Shot MultiBox Detector, is an object detection algorithm proposed by Wei Liu et al. at ECCV 2016. Like YOLO, it casts object detection as a regression problem. Building on the anchors of Faster R-CNN, it introduces the similar notion of prior boxes. SSD also detects on a pyramidal feature hierarchy: softmax classification and box regression are performed simultaneously on multiple feature maps.

SSD Network Architecture

[Figure 1: SSD network architecture]
SSD adds extra convolutional feature layers on top of a VGG base network and runs detection on six feature maps of decreasing resolution. Here we focus on the distinction the SSD paper draws between a feature map cell and a default box.
[Figure 2: feature map cells and default boxes]

  • A feature map cell is one of the grid cells obtained by dividing a feature map.
  • A default box is one of a set of fixed-size boxes attached to each cell; in SSD these are also called prior boxes (see the worked count below).
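
For the standard SSD300 configuration, the six detection layers have resolutions 38×38, 19×19, 10×10, 5×5, 3×3 and 1×1, with 4, 6, 6, 6, 4 and 4 default boxes per cell, giving

38²·4 + 19²·6 + 10²·6 + 5²·6 + 3²·4 + 1²·4 = 5776 + 2166 + 600 + 150 + 36 + 4 = 8732

default boxes in total; this is the fixed set of output boxes discussed below.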

SSD-Tensorflow Source Code Analysis

SSD Training

The main difference between training SSD and the R-CNN family of detectors is that SSD's ground-truth boxes must be assigned to a fixed set of output boxes. As the paper explains, SSD outputs a fixed set of bounding boxes. In the figure above, the dog's ground truth is the red bounding box; during labeling, this red ground-truth box must be assigned to one of the fixed output boxes in figure (c), namely the red dashed box. Once each ground-truth box in the image is matched to default boxes in this way, the loss can be computed and the network trained end to end by backpropagation.

SSD Anchor Generation Code

The source code runs the following function once per feature layer to generate its anchors. It returns the center coordinates of all anchors on that layer, together with the anchor heights and widths:
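
As an aside, the anchor_sizes in the configuration (reproduced in the comment at the top of the function) appear to follow the paper's linear scale rule; with m = 6 detection layers, this reading uses s_min = 0.15 and s_max = 1.05 rather than the paper's 0.2 and 0.9:

$$s_k = s_{\min} + \frac{s_{\max} - s_{\min}}{m - 1}(k - 1), \qquad k = 1, \dots, m$$

which yields relative scales 0.15, 0.33, 0.51, 0.69, 0.87, 1.05, i.e. absolute sizes 45, 99, 153, 207, 261 and 315 on a 300×300 input; the extra size 21 is a smaller box added for the first layer.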

import math
import numpy as np

def ssd_anchor_one_layer(img_shape,
                         feat_shape,
                         sizes,
                         ratios,
                         step,
                         offset=0.5,
                         dtype=np.float32):
    # Per-layer anchor sizes, ratios and steps from the network configuration:
    '''
      anchor_sizes=[(21., 45.),   # the sizes are 21, 45, 99, 153, 207, 261, 315; each
                    (45., 99.),   # pair (s_k, s_k+1) is stored as a tuple so the
                    (99., 153.),  # paper's sqrt(s_k * s_k+1) term is easy to compute
                    (153., 207.),
                    (207., 261.),
                    (261., 315.)],
      anchor_ratios=[[2, .5],
                     [2, .5, 3, 1./3],
                     [2, .5, 3, 1./3],
                     [2, .5, 3, 1./3],
                     [2, .5],
                     [2, .5]],
      anchor_steps=[8, 16, 32, 64, 100, 300],
    '''

    """Computer SSD default anchor boxes for one feature layer.
    #为每一feature layer计算anchor boxes

    Determine the relative position grid of the centers, and the relative
    width and height.
    #参数:
    img_shape:    输入的图片大小
    feat_shape:   特征图大小
    size:         框的size
    ratio:        框的长宽比
    offset:       网格偏移

    Arguments:
      feat_shape: Feature shape, used for computing relative position grids;
      size: Absolute reference sizes;
      ratios: Ratios to use on these features;
      img_shape: Image shape, used for computing height, width relatively to the
        former;
      offset: Grid offset.

    Return:
      y, x, h, w: Relative x and y grids, and height and width.
    """
    # Compute the position grid: simple way.
    # y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
    # y = (y.astype(dtype) + offset) / feat_shape[0]
    # x = (x.astype(dtype) + offset) / feat_shape[1]
    # Weird SSD-Caffe computation using steps values...
    y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]        # y, x: coordinates of every feature map point; y in [0, feat_shape[0]), x in [0, feat_shape[1])
    y = (y.astype(dtype) + offset) * step / img_shape[0]     # * step / img_shape[0] projects feature map coordinates onto the input image, normalized to [0, 1]
    x = (x.astype(dtype) + offset) * step / img_shape[1]     # + offset (0.5) shifts each center to the middle of its cell, since adjacent cells are 1 unit apart

    # y, x now hold the normalized default box centers, midway between feature map points.

    # Expand dims to support easy broadcasting.
    y = np.expand_dims(y, axis=-1)  # expand dims; axis=-1 is the last dimension
    x = np.expand_dims(x, axis=-1)

    # Compute relative height and width.
    # Tries to follow the original implementation of SSD for the order.
    num_anchors = len(sizes) + len(ratios)                   # number of anchors on this layer
    h = np.zeros((num_anchors, ), dtype=dtype)               # the computed anchor heights and widths are stored in these arrays
    w = np.zeros((num_anchors, ), dtype=dtype)
    # Add first anchor boxes with ratio=1.
    h[0] = sizes[0] / img_shape[0]                           # first anchor: h = w, i.e. aspect ratio 1
    w[0] = sizes[0] / img_shape[1]
    di = 1
    # sizes and ratios describe this layer only; sizes are already the concrete
    # box dimensions, given per layer in the configuration file.
    if len(sizes) > 1:
        h[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[0]
        w[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[1]
        di += 1
    for i, r in enumerate(ratios):
        # sizes is a tuple: sizes[0] is this layer's size, sizes[1] is the next
        # layer's anchor size, paired up to compute sqrt(s_k * s_k+1) above
        # (k being the feature layer index).
        h[i+di] = sizes[0] / img_shape[0] / math.sqrt(r)     # aspect ratio r: h is divided by sqrt(r)...
        w[i+di] = sizes[0] / img_shape[1] * math.sqrt(r)     # ...and w multiplied by it, giving the different box shapes
    return y, x, h, w
    # Returns the center coordinates of every anchor on this feature layer,
    # together with the layer's anchor heights and widths.
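
A minimal usage sketch (parameter values taken from the first-layer configuration above; the shape annotations are mine):

y, x, h, w = ssd_anchor_one_layer(img_shape=(300, 300),
                                  feat_shape=(38, 38),
                                  sizes=(21., 45.),
                                  ratios=[2, .5],
                                  step=8)
# y and x have shape (38, 38, 1): one normalized center per feature map cell.
# h and w have shape (4,): len(sizes) + len(ratios) = 2 + 2 anchors per cell.
# Broadcasting y/x against h/w later yields 38 * 38 * 4 = 5776 anchors for this layer.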

Matching Strategy: Ground Truth vs. Default Boxes

  • First, as in MultiBox, each ground-truth box is matched to the default box with the best Jaccard overlap; this guarantees that every ground-truth box corresponds to exactly one default box.
  • Unlike MultiBox, however, the paper then also matches default boxes to any ground-truth box whose Jaccard overlap exceeds a threshold, here 0.5. What the code actually implements is simpler: each anchor is assigned to the ground-truth box with which it has the largest IoU (see the numpy sketch after this list).
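
As a quick illustration of the overlap criterion (plain numpy with toy values of my own), the Jaccard overlap is intersection over union:

import numpy as np

def jaccard(box_a, box_b):
    # Boxes are [ymin, xmin, ymax, xmax] in normalized coordinates.
    ymin, xmin = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ymax, xmax = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(ymax - ymin, 0.) * max(xmax - xmin, 0.)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

gt = [0.2, 0.2, 0.6, 0.6]
print(jaccard(gt, [0.2, 0.2, 0.6, 0.6]))  # 1.0   -> perfect match
print(jaccard(gt, [0.4, 0.4, 0.8, 0.8]))  # 0.143 -> below the 0.5 threshold

The encoding function below applies the same computation to every anchor of a layer at once: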
def tf_ssd_bboxes_encode_layer(labels,
                               bboxes,
                               anchors_layer,
                               num_classes,
                               no_annotation_label,
                               ignore_threshold=0.5,
                               prior_scaling=[0.1, 0.1, 0.2, 0.2],
                               dtype=tf.float32):
    """Encode groundtruth labels and bounding boxes using SSD anchors from
    one layer.

    Arguments:
      labels: 1D Tensor(int64) containing groundtruth labels;    #groundtruth labels
      bboxes: Nx4 Tensor(float) with bboxes relative coordinates;#Nx4 N个box4个相对坐标
      anchors_layer: Numpy array with layer anchors;#每层的anchor
      matching_threshold: Threshold for positive match with groundtruth bboxes;#匹配的阈值
      prior_scaling: Scaling of encoded coordinates.#encode 的scal

    Return:
      (target_labels, target_localizations, target_scores): Target Tensors.
    """
    # Anchors coordinates and volume.
    yref, xref, href, wref = anchors_layer  # one layer's anchors: yref, xref are the centers of all anchors, href, wref the per-anchor heights and widths
    ymin = yref - href / 2.                 # convert to top-left and bottom-right corners to simplify the IoU computation
    xmin = xref - wref / 2.
    ymax = yref + href / 2.
    xmax = xref + wref / 2.
    vol_anchors = (xmax - xmin) * (ymax - ymin)  # anchor areas

    # Initialize tensors...
    shape = (yref.shape[0], yref.shape[1], href.size)  # (feature map height, feature map width, anchors per cell)
    feat_labels = tf.zeros(shape, dtype=tf.int64)
    feat_scores = tf.zeros(shape, dtype=dtype)
    # With this shape, feat_labels and feat_scores address every anchor of every cell on this layer.

    feat_ymin = tf.zeros(shape, dtype=dtype)
    feat_xmin = tf.zeros(shape, dtype=dtype)
    feat_ymax = tf.ones(shape, dtype=dtype)
    feat_xmax = tf.ones(shape, dtype=dtype)

    def jaccard_with_anchors(bbox):
        """Compute jaccard score between a box and the anchors.
        """
        int_ymin = tf.maximum(ymin, bbox[0])
        int_xmin = tf.maximum(xmin, bbox[1])
        int_ymax = tf.minimum(ymax, bbox[2])
        int_xmax = tf.minimum(xmax, bbox[3])
        h = tf.maximum(int_ymax - int_ymin, 0.)
        w = tf.maximum(int_xmax - int_xmin, 0.)
        # Volumes.
        inter_vol = h * w
        union_vol = vol_anchors - inter_vol \
            + (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
        jaccard = tf.div(inter_vol, union_vol)
        return jaccard

    def intersection_with_anchors(bbox):
        """Compute intersection between score a box and the anchors.
        """
        int_ymin = tf.maximum(ymin, bbox[0])
        int_xmin = tf.maximum(xmin, bbox[1])
        int_ymax = tf.minimum(ymax, bbox[2])
        int_xmax = tf.minimum(xmax, bbox[3])
        h = tf.maximum(int_ymax - int_ymin, 0.)
        w = tf.maximum(int_xmax - int_xmin, 0.)
        inter_vol = h * w
        scores = tf.div(inter_vol, vol_anchors)
        return scores

    def condition(i, feat_labels, feat_scores,
                  feat_ymin, feat_xmin, feat_ymax, feat_xmax):
        """Condition: check label index.
        """
        r = tf.less(i, tf.shape(labels))  # loop while i < number of groundtruth boxes
        return r[0]

    def body(i, feat_labels, feat_scores,
             feat_ymin, feat_xmin, feat_ymax, feat_xmax):
        """Body: update feature labels, scores and bboxes.
        Follow the original SSD paper for that purpose:
          - assign values when jaccard > 0.5;
          - only update if beat the score of other bboxes.
        """
        # i indexes the groundtruth boxes
        # Jaccard score.
        label = labels[i]   # label of the i-th groundtruth box
        bbox = bboxes[i]    # coordinates of the i-th groundtruth box
        # Compute the IoU between this groundtruth box and every anchor of the layer
        jaccard = jaccard_with_anchors(bbox)  # anchors and groundtruth are both in normalized coordinates
        # Because the anchor coordinates are arrays, this returns the IoU of the
        # i-th groundtruth box with all anchors of this layer at once.
        # Mask: check threshold + scores + no annotations + num_classes.
        # Keep anchors whose IoU with this box beats their current best score
        # (on the first iteration, every anchor with IoU > 0); updated below.
        mask = tf.greater(jaccard, feat_scores)
        # mask = tf.logical_and(mask, tf.greater(jaccard, matching_threshold))
        mask = tf.logical_and(mask, feat_scores > -0.5)
        mask = tf.logical_and(mask, label < num_classes)
        imask = tf.cast(mask, tf.int64)
        fmask = tf.cast(mask, dtype)
        # Update values using mask.
        # Where imask is 1 the anchor takes this box's label; otherwise it keeps its
        # previous label (0 = background). Since the loop visits every groundtruth box
        # and mask requires jaccard > feat_scores, each anchor ends up assigned to the
        # groundtruth box with which its IoU is highest. The paper additionally matches
        # every anchor whose IoU exceeds 0.5, but this code only keeps the best match:
        # feat_scores = tf.where(mask, jaccard, feat_scores) records the running maximum.
        feat_labels = imask * label + (1 - imask) * feat_labels
        # tf.where returns jaccard where mask is true and feat_scores elsewhere
        feat_scores = tf.where(mask, jaccard, feat_scores)

        # In effect every anchor of the layer is tied to one groundtruth label and box:
        # an anchor that overlaps no groundtruth box keeps label 0 (background) and the
        # default coordinates; otherwise it takes the label and coordinates of the
        # groundtruth box it overlaps most, and its score is that IoU.
        feat_ymin = fmask * bbox[0] + (1 - fmask) * feat_ymin
        feat_xmin = fmask * bbox[1] + (1 - fmask) * feat_xmin
        feat_ymax = fmask * bbox[2] + (1 - fmask) * feat_ymax
        feat_xmax = fmask * bbox[3] + (1 - fmask) * feat_xmax

        # Check no annotation label: ignore these anchors...
        # interscts = intersection_with_anchors(bbox)
        # mask = tf.logical_and(interscts > ignore_threshold,
        #                       label == no_annotation_label)
        # # Replace scores by -1.
        # feat_scores = tf.where(mask, -tf.cast(mask, dtype), feat_scores)

        return [i+1, feat_labels, feat_scores,
                feat_ymin, feat_xmin, feat_ymax, feat_xmax]
    # Main loop definition.
    i = 0
    [i, feat_labels, feat_scores,
     feat_ymin, feat_xmin,
     feat_ymax, feat_xmax] = tf.while_loop(condition, body,
                                           [i, feat_labels, feat_scores,
                                            feat_ymin, feat_xmin,
                                            feat_ymax, feat_xmax])
    # Transform to center / size.
    feat_cy = (feat_ymax + feat_ymin) / 2.
    feat_cx = (feat_xmax + feat_xmin) / 2.
    feat_h = feat_ymax - feat_ymin
    feat_w = feat_xmax - feat_xmin
    # Encode features.
    # The corner coordinates above were converted to centers, widths and heights;
    # now encode them relative to the anchors:
    feat_cy = (feat_cy - yref) / href / prior_scaling[0]
    feat_cx = (feat_cx - xref) / wref / prior_scaling[1]
    feat_h = tf.log(feat_h / href) / prior_scaling[2]
    feat_w = tf.log(feat_w / wref) / prior_scaling[3]
    # Use SSD ordering: x / y / w / h instead of ours.
    feat_localizations = tf.stack([feat_cx, feat_cy, feat_w, feat_h], axis=-1)
    # The regression targets are therefore transforms, not raw boxes: the loss compares
    # the predicted transform with the anchor-to-groundtruth transform. If the predicted
    # transform is right, applying it to the anchor recovers the true box.
    return feat_labels, feat_localizations, feat_scores
    # Returns, for every anchor of this layer:
    #   - its label (the label of the groundtruth box it is matched to),
    #   - its localization target (the encoded groundtruth coordinates),
    #   - its score (the IoU with its matched groundtruth box).
    # Each anchor is matched to the groundtruth box with which its IoU is highest, so
    # one anchor matches at most one groundtruth box, while a groundtruth box can match
    # many anchors, across this and other layers.
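
For intuition, here is a minimal numpy sketch of the inverse (decode) transform; the function and test values are my own, but the arithmetic simply inverts the encoding above:

import numpy as np

def decode(pred, yref, xref, href, wref, prior_scaling=(0.1, 0.1, 0.2, 0.2)):
    # pred uses the SSD ordering produced above: (cx, cy, w, h).
    cx = pred[..., 0] * prior_scaling[1] * wref + xref
    cy = pred[..., 1] * prior_scaling[0] * href + yref
    w = wref * np.exp(pred[..., 2] * prior_scaling[3])
    h = href * np.exp(pred[..., 3] * prior_scaling[2])
    # Back to corner coordinates.
    return np.stack([cy - h / 2., cx - w / 2., cy + h / 2., cx + w / 2.], axis=-1)

# Sanity check: a zero transform decodes to the anchor itself.
print(decode(np.zeros(4), yref=0.5, xref=0.5, href=0.2, wref=0.2))
# -> [0.4 0.4 0.6 0.6]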

Loss Function

  • The loss is computed on transforms: the network regresses the transform from anchor to box, and the loss compares the predicted transform with the actual anchor-to-groundtruth transform. If the predicted transform is right, applying it to the anchor yields the true box.
    [Figure: SSD loss function equations from the paper]
  • The total loss is the sum of the localization loss and the classification confidence loss (reproduced below).
  • N is the number of matched default boxes; x_ij^k indicates that the i-th default box is matched to the j-th groundtruth box of category k.
  • The localization loss is a Smooth L1 loss.
  • After matching, most default boxes are negatives, which unbalances the positive and negative training samples. The authors sort the default boxes by their highest confidence loss and keep only the top ones, so that the negative-to-positive ratio is at most 3:1.
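For reference, the loss from the SSD paper (my transcription):

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\!\left(l_i^m - \hat{g}_j^m\right)$$

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0})$$

where the hatted c terms are softmax confidences and the hatted g terms are the encoded groundtruth offsets, i.e. exactly the targets produced by tf_ssd_bboxes_encode_layer above. The implementation: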
def ssd_losses(logits, localisations,
               gclasses, glocalisations, gscores,
               match_threshold=0.5,
               negative_ratio=3.,
               alpha=1.,
               label_smoothing=0.,
               scope=None):
    """
    Loss functions for training the SSD 300 VGG network.

    This function defines the different loss components of the SSD, and
    adds them to the TF loss collection.

    Arguments:
      logits: (list of) predictions logits Tensors;#每一层logits输出未经softmax
      localisations: (list of) localisations Tensors;#预测的框
      #前缀为g的为ground truth
      gclasses: (list of) groundtruth labels Tensors;
      glocalisations: (list of) groundtruth localisations Tensors;
      gscores: (list of) groundtruth score Tensors;
    """
    with tf.name_scope(scope, 'ssd_losses'):
        l_cross_pos = []
        l_cross_neg = []
        l_loc = []
        for i in range(len(logits)):
            dtype = logits[i].dtype
            with tf.name_scope('block_%i' % i):
                # Determine weights Tensor.
                pmask = gscores[i] > match_threshold
                fpmask = tf.cast(pmask, dtype)
                # A second selection here: anchors whose groundtruth IoU exceeds the
                # match threshold (0.5) are treated as positives; fpmask marks positives
                # vs. negatives, and n_positives counts the positives.
                n_positives = tf.reduce_sum(fpmask)

                # Select some random negative entries.
                # n_entries = np.prod(gclasses[i].get_shape().as_list())
                # r_positive = n_positives / n_entries
                # r_negative = negative_ratio * n_positives / (n_entries - n_positives)

                # Negative mask.
                no_classes = tf.cast(pmask, tf.int32)
                # no_classes casts the boolean mask to int: 1 for foreground anchors,
                # 0 for background; predictions holds the per-class softmax probabilities.
                predictions = slim.softmax(logits[i])
                nmask = tf.logical_and(tf.logical_not(pmask),
                                       gscores[i] > -0.5)
                fnmask = tf.cast(nmask, dtype)
                nvalues = tf.where(nmask,
                                   predictions[:, :, :, :, 0],
                                   1. - fnmask)
                nvalues_flat = tf.reshape(nvalues, [-1])
                # Hard negative mining (explains the block below). n_neg is the number
                # of negatives to keep; negative_ratio is the negative-to-positive
                # ratio, 3 by default. The tf.maximum calls guarantee a minimum number
                # of negatives even when positives are scarce. max_neg_entries is the
                # total number of available negatives, and tf.minimum caps n_neg by it
                # in case fewer than three negatives per positive exist. nvalues holds
                # each negative anchor's predicted background probability, so
                # tf.nn.top_k on -nvalues_flat selects the k = n_neg negatives with the
                # lowest background confidence, i.e. the hardest ones; minval is the
                # smallest of the selected (negated) scores. The final nmask keeps an
                # entry only if it is a negative AND beats minval, i.e. it is one of
                # the selected hard negatives; fnmask is the same mask cast to float.
                # Number of negative entries to select.
                n_neg = tf.cast(negative_ratio * n_positives, tf.int32)
                n_neg = tf.maximum(n_neg, tf.size(nvalues_flat) // 8)
                n_neg = tf.maximum(n_neg, tf.shape(nvalues)[0] * 4)
                max_neg_entries = 1 + tf.cast(tf.reduce_sum(fnmask), tf.int32)
                n_neg = tf.minimum(n_neg, max_neg_entries)

                val, idxes = tf.nn.top_k(-nvalues_flat, k=n_neg)
                minval = val[-1]
                # Final negative mask.
                nmask = tf.logical_and(nmask, -nvalues > minval)
                fnmask = tf.cast(nmask, dtype)

                # Add cross-entropy loss.
                # Positive cross-entropy loss; fpmask filters out the negatives
                with tf.name_scope('cross_entropy_pos'):
                    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits[i],
                                                                          labels=gclasses[i])
                    loss = tf.losses.compute_weighted_loss(loss, fpmask)
                    l_cross_pos.append(loss)
                # Negative cross-entropy loss against the background class (label 0); fnmask filters out the positives
                with tf.name_scope('cross_entropy_neg'):
                    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits[i],
                                                                          labels=no_classes)
                    loss = tf.losses.compute_weighted_loss(loss, fnmask)
                    l_cross_neg.append(loss)

                # Add localization loss: smooth L1, L2, ...
                with tf.name_scope('localization'):
                    # Weights Tensor: positive mask + random negative.
                    weights = tf.expand_dims(alpha * fpmask, axis=-1)
                    loss = custom_layers.abs_smooth(localisations[i] - glocalisations[i])
                    loss = tf.losses.compute_weighted_loss(loss, weights)
                    l_loc.append(loss)

        # Additional total losses...
        with tf.name_scope('total'):
            total_cross_pos = tf.add_n(l_cross_pos, 'cross_entropy_pos')
            total_cross_neg = tf.add_n(l_cross_neg, 'cross_entropy_neg')
            total_cross = tf.add(total_cross_pos, total_cross_neg, 'cross_entropy')
            total_loc = tf.add_n(l_loc, 'localization')

            # Add to EXTRA LOSSES TF.collection
            tf.add_to_collection('EXTRA_LOSSES', total_cross_pos)
            tf.add_to_collection('EXTRA_LOSSES', total_cross_neg)
            tf.add_to_collection('EXTRA_LOSSES', total_cross)
            tf.add_to_collection('EXTRA_LOSSES', total_loc)
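
Since every component above is registered through tf.losses.compute_weighted_loss, it lands in the standard TF1 losses collection. A minimal sketch of how a training script could consume it (the optimizer choice is illustrative, not the repository's):

# tf.losses.compute_weighted_loss adds each term to tf.GraphKeys.LOSSES,
# so the final objective is just the collection total (plus regularization):
total_loss = tf.losses.get_total_loss()
train_op = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(total_loss)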

SSD Detection Results

[Figures: example SSD detection results]


