I recently went through the SSD source code, worked out its logic, and wrote up these study notes.
Code: https://github.com/balancap/SSD-Tensorflow
1. Network structure
First, here is the network structure diagram for reference in the analysis below. The figure shows SSD 300, whereas the code I read is SSD 512, but the idea is the same. The key difference from YOLO is that SSD does not extract candidate boxes only from the final layer: several intermediate feature layers already produce candidates through 3x3 convolutions, and anchors are introduced. Different feature layers use different numbers of anchors per cell, from 38x38x4 and 19x19x6 down to 3x3x4 and 1x1x4, which for SSD 300 adds up to 8732 default boxes in total (the SSD 512 configuration below gives even more). This greatly increases the number of candidate windows, and the layers naturally divide the work of detecting large and small objects.
Now for the code. The network is assembled in ssd_512_net.py; first, the parameters related to the network structure.
The parameters below are used to build the network. feat_layers specifies which blocks serve as feature layers for extracting candidate boxes; feat_shapes gives the spatial size of each of those feature layers, playing the role of the old cell_size, except that there are now several because multiple feature layers are used at once; normalizations gives the normalization coefficient of each feature layer. Only the first feature layer is normalized, because it sits early in the network and its activations are noticeably larger than those of the later feature layers.
feat_layers = ['block4', 'block7', 'block8', 'block9', 'block10', 'block11', 'block12']
feat_shapes = [(64, 64), (32, 32), (16, 16), (8, 8), (4, 4), (2, 2), (1, 1)]
normalizations = [20, -1, -1, -1, -1, -1, -1]
The parameters below are used to construct the anchors; the main ones are anchor_sizes and anchor_ratios. For each feature layer the anchors are built according to the following rule (a small worked example follows the parameter listing):
First anchor: anchor_sizes[0], i.e. the base size.
Second anchor: sqrt(anchor_sizes[0] * anchor_sizes[1]), the geometric mean of the two sizes.
Remaining anchors: the base size anchor_sizes[0] stretched by each aspect ratio in anchor_ratios (width multiplied by sqrt(r), height divided by sqrt(r)).
So each cell has 1 + 1 + len(anchor_ratios) = len(anchor_sizes) + len(anchor_ratios) anchors.
anchor_size_bounds = [0.10, 0.90]
anchor_sizes = [(20.48, 51.2),
(51.2, 133.12),
(133.12, 215.04),
(215.04, 296.96),
(296.96, 378.88),
(378.88, 460.8),
(460.8, 542.72)]
anchor_ratios = [[2, .5],
[2, .5, 3, 1./3],
[2, .5, 3, 1./3],
[2, .5, 3, 1./3],
[2, .5, 3, 1./3],
[2, .5],
[2, .5]]
anchor_steps = [8, 16, 32, 64, 128, 256, 512]
anchor_offset = 0.5
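As a sanity check on these numbers, here is a small stand-alone sketch (plain Python of my own, not from the repository) that applies the rule above to every layer and counts the default boxes; with these SSD 512 parameters it comes out to 24564 boxes in total:

import math

feat_shapes = [(64, 64), (32, 32), (16, 16), (8, 8), (4, 4), (2, 2), (1, 1)]
anchor_sizes = [(20.48, 51.2), (51.2, 133.12), (133.12, 215.04),
                (215.04, 296.96), (296.96, 378.88), (378.88, 460.8),
                (460.8, 542.72)]
anchor_ratios = [[2, .5], [2, .5, 3, 1./3], [2, .5, 3, 1./3], [2, .5, 3, 1./3],
                 [2, .5, 3, 1./3], [2, .5], [2, .5]]

total = 0
for (fh, fw), sizes, ratios in zip(feat_shapes, anchor_sizes, anchor_ratios):
    # (height, width) of every anchor on this layer, in pixels, following the rule above
    hw = [(sizes[0], sizes[0]),                                  # base size, ratio 1
          (math.sqrt(sizes[0] * sizes[1]),) * 2]                 # geometric mean, ratio 1
    hw += [(sizes[0] / math.sqrt(r), sizes[0] * math.sqrt(r))    # extra aspect ratios
           for r in ratios]
    assert len(hw) == len(sizes) + len(ratios)
    total += fh * fw * len(hw)
    print('%dx%d cells, %d anchors per cell' % (fh, fw, len(hw)))
print('total default boxes:', total)   # 24564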
With the parameters explained, we can look at the code. The network is built in the ssd_net() function; here is the full construction:
def ssd_net(inputs,
num_classes,
feat_layers,
anchor_sizes,
anchor_ratios,
normalizations,
is_training=True,
dropout_keep_prob=0.5,
prediction_fn=slim.softmax,
reuse=None,
scope='ssd_512_vgg'):
"""SSD net definition.
"""
# if data_format == 'NCHW':
# inputs = tf.transpose(inputs, perm=(0, 3, 1, 2))
# End_points collect relevant activations for external use.
# Process the input block by block with convolution and pooling, storing each block's output in end_points
end_points = {}
with tf.variable_scope(scope, 'ssd_512_vgg', [inputs], reuse=reuse):
# Original VGG-16 blocks.
print(inputs)
net = slim.repeat(inputs, 2, slim.conv2d, 64, [3, 3], scope='conv1')
end_points['block1'] = net
print('block1', net)
net = slim.max_pool2d(net, [2, 2], scope='pool1')
# Block 2.
net = slim.repeat(net, 2, slim.conv2d, 128, [3, 3], scope='conv2')
end_points['block2'] = net
net = slim.max_pool2d(net, [2, 2], scope='pool2')
# Block 3.
net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], scope='conv3')
end_points['block3'] = net
net = slim.max_pool2d(net, [2, 2], scope='pool3')
# Block 4.
net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv4')
end_points['block4'] = net
net = slim.max_pool2d(net, [2, 2], scope='pool4')
# Block 5.
net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv5')
end_points['block5'] = net
net = slim.max_pool2d(net, [3, 3], 1, scope='pool5')
# Additional SSD blocks.
# Block 6: let's dilate the hell out of it!
net = slim.conv2d(net, 1024, [3, 3], rate=6, scope='conv6')
end_points['block6'] = net
# Block 7: 1x1 conv. Because the fuck.
net = slim.conv2d(net, 1024, [1, 1], scope='conv7')
end_points['block7'] = net
# Block 8/9/10/11: 1x1 and 3x3 convolutions stride 2 (except lasts).
end_point = 'block8'
with tf.variable_scope(end_point):
net = slim.conv2d(net, 256, [1, 1], scope='conv1x1')
net = custom_layers.pad2d(net, pad=(1, 1))
net = slim.conv2d(net, 512, [3, 3], stride=2, scope='conv3x3', padding='VALID')
end_points[end_point] = net
print('block8', net)
end_point = 'block9'
with tf.variable_scope(end_point):
net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
net = custom_layers.pad2d(net, pad=(1, 1))
net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
end_points[end_point] = net
print('block9', net)
end_point = 'block10'
with tf.variable_scope(end_point):
net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
net = custom_layers.pad2d(net, pad=(1, 1))
net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
end_points[end_point] = net
print('block10', net)
end_point = 'block11'
with tf.variable_scope(end_point):
net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
net = custom_layers.pad2d(net, pad=(1, 1))
net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
end_points[end_point] = net
print('block11', net)
end_point = 'block12'
with tf.variable_scope(end_point):
net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
net = custom_layers.pad2d(net, pad=(1, 1))
net = slim.conv2d(net, 256, [4, 4], scope='conv4x4', padding='VALID')
# Fix padding to match Caffe version (pad=1).
# pad_shape = [(i-j) for i, j in zip(layer_shape(net), [0, 1, 1, 0])]
# net = tf.slice(net, [0, 0, 0, 0], pad_shape, name='caffe_pad')
print(net)
end_points[end_point] = net
# Prediction and localisations layers.
predictions = []
logits = []
localisations = []
# For each feature layer listed in feat_layers, regress box coordinates and predict class scores
for i, layer in enumerate(feat_layers):
with tf.variable_scope(layer + '_box'):
p, l = ssd_multibox_layer(end_points[layer],
num_classes,
anchor_sizes[i],
anchor_ratios[i],
normalizations[i])
# prediction_fn here is simply softmax
predictions.append(prediction_fn(p))
logits.append(p)
localisations.append(l)
print(logits)
#
# predictions: [[batch_num, 64, 64, 4, class_num], .....[batch_num, 1, 1, 4, class_num]]
# logits : [[batch_num, 64, 64, 4, class_num], .....[batch_num, 1, 1, 4, class_num]]
# localisations : [[batch_num, 64, 64, 4, 4], .....[batch_num, 1, 1, 4, 4]]
return predictions, localisations, logits, end_points
The convolution and pooling layers up to this point are standard. One function worth mentioning is pad2d, which pads the tensor appropriately so that the following convolution works out. Using the block outputs stored in end_points, ssd_multibox_layer() then combines each selected feature layer with its anchors to produce candidate-box locations and classifications.
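custom_layers.pad2d itself is not shown in this post; judging from how it is called (pad=(1, 1) before a stride-2 'VALID' 3x3 convolution), it zero-pads the spatial dimensions of an NHWC tensor, roughly like the minimal sketch below (my own assumption-based version, not the repository implementation):

import tensorflow as tf

def pad2d_sketch(net, pad=(1, 1)):
    # Zero-pad height and width of an NHWC tensor; batch and channel dims are untouched
    paddings = [[0, 0], [pad[0], pad[0]], [pad[1], pad[1]], [0, 0]]
    return tf.pad(net, paddings, mode='CONSTANT')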
Here is the implementation:
def ssd_multibox_layer(inputs,
num_classes,
sizes,
ratios=[1],
normalization=-1,
bn_normalization=False):
"""Construct a multibox layer, return a class and localization predictions.
"""
net = inputs
# Apply L2 normalization to the features if this layer calls for it (only block4 does)
if normalization > 0:
net = custom_layers.l2_normalization(net, scaling=True)
# Number of anchors.
# Number of anchors per cell on this feature layer
num_anchors = len(sizes) + len(ratios)
# Location.
# The localization branch predicts 4 box-regression parameters per anchor, hence num_anchors * 4 output channels
num_loc_pred = num_anchors * 4
loc_pred = slim.conv2d(net, num_loc_pred, [3, 3], activation_fn=None,
scope='conv_loc')
# channel_to_last deals with the NCHW vs. NHWC issue by moving channels to the last dimension (NHWC)
loc_pred = custom_layers.channel_to_last(loc_pred)
# reshape to [batch_num, cell_size, cell_size, num_anchors, 4]
loc_pred = tf.reshape(loc_pred,
tensor_shape(loc_pred, 4)[:-1]+[num_anchors, 4])
# Class prediction.
# Same idea for the class branch, except the output channel count becomes num_anchors * num_classes
num_cls_pred = num_anchors * num_classes
cls_pred = slim.conv2d(net, num_cls_pred, [3, 3], activation_fn=None,
scope='conv_cls')
cls_pred = custom_layers.channel_to_last(cls_pred)
#[BATCH_SIZE, CELL_SIZE, CELL_SIZE, NUM_ANCHORS, NUM_CLASSES]
cls_pred = tf.reshape(cls_pred,
tensor_shape(cls_pred, 4)[:-1]+[num_anchors, num_classes])
return cls_pred, loc_pred
At this point we have the network's outputs: predictions and localisations.
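To make the shape bookkeeping concrete, here is the arithmetic for block4, assuming num_classes = 21 (20 VOC classes plus background; the actual value depends on your dataset):

# block4: 64x64 feature map, 2 + 2 = 4 anchors per cell, num_classes = 21 (assumed)
num_anchors = 4
num_loc_pred = num_anchors * 4     # 16 output channels for conv_loc
num_cls_pred = num_anchors * 21    # 84 output channels for conv_cls
# after channel_to_last + reshape:
#   loc_pred: [batch_num, 64, 64, 4, 4]
#   cls_pred: [batch_num, 64, 64, 4, 21]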
2. Ground-truth encoding
After a sample is read in, we only have the classes and positions of the objects in that one image. To turn this into something the loss can be computed against, the ground truth must be distributed over the full set of anchors according to IoU. Encoding therefore has two parts: computing the sizes and positions of all anchors, and the encoding itself.
1) Building the anchor set
For the code, go straight to ssd_anchors_all_layers() in ssd_300_vgg.py:
# Build the anchors of every feature layer
def ssd_anchors_all_layers(img_shape,
layers_shape,
anchor_sizes,
anchor_ratios,
anchor_steps,
offset=0.5,
dtype=np.float32):
"""Compute anchor boxes for all feature layers.
"""
layers_anchors = []
# For each feature layer size
for i, s in enumerate(layers_shape):
# Inputs:
# img_shape: image size, used to express the anchor coordinates relative to the image
# s: the current feature layer size, e.g. (64, 64) for the first SSD 512 layer
# anchor_sizes: base anchor sizes
# anchor_ratios: extra aspect ratios for the anchors
# anchor_steps: downsampling factor of this feature map w.r.t. the original image
# Output:
# anchor_bboxes: the anchor layout of this feature layer as a (y, x, h, w) tuple;
# for the first layer y and x have shape (64, 64, 1) and h, w have shape (4,),
# i.e. 4 anchor sizes around each fixed cell center
# (some feature layers have 4 anchor sizes per cell, others 6)
anchor_bboxes = ssd_anchor_one_layer(img_shape, s,
anchor_sizes[i],
anchor_ratios[i],
anchor_steps[i],
offset=offset, dtype=dtype)
print(anchor_bboxes)
# layers_anchors: one (y, x, h, w) tuple per feature layer, from the 64x64 layer down to the 1x1 layer
layers_anchors.append(anchor_bboxes)
return layers_anchors
Same pattern as before: for each feature layer, build its anchors according to that layer's specification and append them to a list. Now follow ssd_anchor_one_layer() to see how the anchors of a single feature layer are actually constructed:
def ssd_anchor_one_layer(img_shape,
feat_shape,
sizes,
ratios,
step,
offset=0.5,
dtype=np.float32):
"""Computer SSD default anchor boxes for one feature layer.
Determine the relative position grid of the centers, and the relative
width and height.
Arguments:
feat_shape: Feature shape, used for computing relative position grids;
size: Absolute reference sizes;
ratios: Ratios to use on these features;
img_shape: Image shape, used for computing height, width relatively to the
former;
offset: Grid offset.
Return:
y, x, h, w: Relative x and y grids, and height and width.
"""
# Compute the position grid: simple way.
# y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
# y = (y.astype(dtype) + offset) / feat_shape[0]
# x = (x.astype(dtype) + offset) / feat_shape[1]
# Weird SSD-Caffe computation using steps values...
# Grid of cell indices
y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
y = (y.astype(dtype) + offset) * step / img_shape[0]
x = (x.astype(dtype) + offset) * step / img_shape[1]
# Expand dims to support easy broadcasting.
# Expand dims to make broadcasting easy
y = np.expand_dims(y, axis=-1)
x = np.expand_dims(x, axis=-1)
# Compute relative height and width.
# Tries to follow the original implementation of SSD for the order.
# The number of anchors per cell differs between feature layers
num_anchors = len(sizes) + len(ratios)
h = np.zeros((num_anchors, ), dtype=dtype)
w = np.zeros((num_anchors, ), dtype=dtype)
# Add first anchor boxes with ratio=1.
# Here you can see exactly how the anchor sizes of each layer are built:
# from sizes, the first entry is the base anchor, and the geometric mean of the first
# and second entries (sqrt(sizes[0] * sizes[1])) gives a second anchor;
# from ratios, each ratio rescales the base size sizes[0];
# so each feature layer has len(sizes) + len(ratios) anchors per cell
h[0] = sizes[0] / img_shape[0]
w[0] = sizes[0] / img_shape[1]
di = 1
if len(sizes) > 1:
h[1] = math.sqrt(sizes[0] * sizes[1]).real / img_shape[0]
w[1] = math.sqrt(sizes[0] * sizes[1]).real / img_shape[1]
di += 1
for i, r in enumerate(ratios):
h[i+di] = sizes[0] / img_shape[0] / math.sqrt(r).real
w[i+di] = sizes[0] / img_shape[1] * math.sqrt(r).real
# For the first layer (64x64 feature map, 2 + 2 anchors per cell) the returned
# shapes are y: (64, 64, 1), x: (64, 64, 1), h: (4,), w: (4,)
return y, x, h, w
At this point we have built the anchors of one feature layer; doing the same for every layer and stacking the results gives the complete anchor set as a list of per-layer (y, x, h, w) arrays, from the 64x64 layer down to the 1x1 layer. Note that this is the center/size form; the two-point (corner) form is only computed later, inside the encoding function, as sketched below.
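A quick numpy sketch of that conversion, using placeholder center arrays with the first layer's shapes and approximate first-layer anchor heights and widths (everything here is illustrative, not repository code):

import numpy as np

y = np.random.rand(64, 64, 1).astype(np.float32)   # placeholder anchor centers
x = np.random.rand(64, 64, 1).astype(np.float32)
h = np.array([0.04, 0.0632, 0.0283, 0.0566], np.float32)   # ~ first-layer anchor heights
w = np.array([0.04, 0.0632, 0.0566, 0.0283], np.float32)   # ~ first-layer anchor widths

ymin = y - h / 2.   # (64, 64, 1) broadcast with (4,) -> (64, 64, 4)
xmin = x - w / 2.
ymax = y + h / 2.
xmax = x + w / 2.
print(ymin.shape)   # (64, 64, 4): one corner coordinate per cell per anchor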
2) Encoding the ground truth
Once the exact position of every anchor is known, we can, much like Faster R-CNN, encode the ground truth against each anchor: for every training image, find the anchors that will be responsible for each object to be detected.
One thing worth pointing out: the encoded ground-truth coordinates, and likewise the localization values the network predicts, are not real coordinates but offsets computed relative to the responsible anchor and the image size. Writing b for the ground-truth box (cx, cy, w, h) and d for the matched anchor's (cx, cy, w, h), the encoded target l is:
l_cx = (b_cx - d_cx) / d_w,   l_cy = (b_cy - d_cy) / d_h,
l_w = log(b_w / d_w),         l_h = log(b_h / d_h)
(the code additionally divides each component by the corresponding prior_scaling factor). l is what the ground truth is encoded into and what the network actually predicts, which makes the mathematical relationship clear.
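A small numpy round trip illustrating this encoding and its inverse (the box and anchor values are made up; prior_scaling matches the defaults used in the code below):

import numpy as np

prior_scaling = [0.1, 0.1, 0.2, 0.2]

# hypothetical anchor (d) and ground-truth box (b), center/size form, relative coordinates
d_cx, d_cy, d_w, d_h = 0.50, 0.50, 0.20, 0.20
b_cx, b_cy, b_w, b_h = 0.55, 0.48, 0.30, 0.25

# encode (same transform as the end of tf_ssd_bboxes_encode_layer)
l_cx = (b_cx - d_cx) / d_w / prior_scaling[0]
l_cy = (b_cy - d_cy) / d_h / prior_scaling[1]
l_w = np.log(b_w / d_w) / prior_scaling[2]
l_h = np.log(b_h / d_h) / prior_scaling[3]

# decode (the inverse, applied to network outputs at inference time)
p_cx = l_cx * prior_scaling[0] * d_w + d_cx
p_cy = l_cy * prior_scaling[1] * d_h + d_cy
p_w = d_w * np.exp(l_w * prior_scaling[2])
p_h = d_h * np.exp(l_h * prior_scaling[3])

print(l_cx, l_cy, l_w, l_h)   # encoded offsets
print(p_cx, p_cy, p_w, p_h)   # recovers b_cx, b_cy, b_w, b_h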
The encoding routine, like everything else, is wrapped in the SSD class but delegates to an external function, in this case tf_ssd_bboxes_encode() in ssd_common.py:
def tf_ssd_bboxes_encode(labels,
bboxes,
anchors,
num_classes,
no_annotation_label,
ignore_threshold=0.5,
prior_scaling=[0.1, 0.1, 0.2, 0.2],
dtype=tf.float32,
scope='ssd_bboxes_encode'):
"""Encode groundtruth labels and bounding boxes using SSD net anchors.
Encoding boxes for all feature layers.
Arguments:
labels: 1D Tensor(int64) containing groundtruth labels;
bboxes: Nx4 Tensor(float) with bboxes relative coordinates;
anchors: List of Numpy array with layer anchors;
matching_threshold: Threshold for positive match with groundtruth bboxes;
prior_scaling: Scaling of encoded coordinates.
Return:
(target_labels, target_localizations, target_scores):
Each element is a list of target Tensors.
"""
# First, the input shapes (also described in the docstring above):
# labels: 1-D vector with the class of every object present in the image
# bboxes: N x 4 tensor, where N should equal len(labels): the box of each labelled object
# anchors: the list of per-layer anchors computed earlier
with tf.name_scope(scope):
# Pre-allocate lists for the target labels, target boxes and target scores
target_labels = []
target_localizations = []
target_scores = []
# For each feature layer
for i, anchors_layer in enumerate(anchors):
with tf.name_scope('bboxes_encode_block_%i' % i):
t_labels, t_loc, t_scores = \
tf_ssd_bboxes_encode_layer(labels, bboxes, anchors_layer,
num_classes, no_annotation_label,
ignore_threshold,
prior_scaling, dtype)
target_labels.append(t_labels)
target_localizations.append(t_loc)
target_scores.append(t_scores)
# target_labels:[[64, 64, 4].......[1, 1, 4]]
# target_localization:[[64, 64, 4,4].......[1, 1, 4,4]]
# target_scores:[[64, 64, 4].......[1, 1, 4]]
return target_labels, target_localizations, target_scores
Same pattern again: each layer's anchors are handled separately. Go straight into tf_ssd_bboxes_encode_layer() for the encoding of a single layer:
def tf_ssd_bboxes_encode_layer(labels,
bboxes,
anchors_layer,
num_classes,
no_annotation_label,
ignore_threshold=0.5,
prior_scaling=[0.1, 0.1, 0.2, 0.2],
dtype=tf.float32):
"""Encode groundtruth labels and bounding boxes using SSD anchors from
one layer.
Arguments:
labels: 1D Tensor(int64) containing groundtruth labels;
bboxes: Nx4 Tensor(float) with bboxes relative coordinates;
anchors_layer: Numpy array with layer anchors;
matching_threshold: Threshold for positive match with groundtruth bboxes;
prior_scaling: Scaling of encoded coordinates.
Return:
(target_labels, target_localizations, target_scores): Target Tensors.
"""
# Anchors coordinates and volume.
# From the centers (y, x) and sizes (h, w), compute the corner coordinates of all anchors
yref, xref, href, wref = anchors_layer
ymin = yref - href / 2.
xmin = xref - wref / 2.
ymax = yref + href / 2.
xmax = xref + wref / 2.
# Area of every anchor, used later in the IoU computation
vol_anchors = (xmax - xmin) * (ymax - ymin)
# Initialize tensors...
# shape: [CELL_SIZE, CELL_SIZE, NUM_ANCHORS]
shape = (yref.shape[0], yref.shape[1], href.size)
# Ground-truth target tensors to be filled in
feat_labels = tf.zeros(shape, dtype=tf.int64)
feat_scores = tf.zeros(shape, dtype=dtype)
feat_ymin = tf.zeros(shape, dtype=dtype)
feat_xmin = tf.zeros(shape, dtype=dtype)
feat_ymax = tf.ones(shape, dtype=dtype)
feat_xmax = tf.ones(shape, dtype=dtype)
# IoU (Jaccard) computation
def jaccard_with_anchors(bbox):
"""Compute jaccard score between a box and the anchors.
"""
int_ymin = tf.maximum(ymin, bbox[0])
int_xmin = tf.maximum(xmin, bbox[1])
int_ymax = tf.minimum(ymax, bbox[2])
int_xmax = tf.minimum(xmax, bbox[3])
h = tf.maximum(int_ymax - int_ymin, 0.)
w = tf.maximum(int_xmax - int_xmin, 0.)
# Volumes.
inter_vol = h * w
union_vol = vol_anchors - inter_vol \
+ (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
jaccard = tf.div(inter_vol, union_vol)
return jaccard
def intersection_with_anchors(bbox):
"""Compute intersection between score a box and the anchors.
"""
int_ymin = tf.maximum(ymin, bbox[0])
int_xmin = tf.maximum(xmin, bbox[1])
int_ymax = tf.minimum(ymax, bbox[2])
int_xmax = tf.minimum(xmax, bbox[3])
h = tf.maximum(int_ymax - int_ymin, 0.)
w = tf.maximum(int_xmax - int_xmin, 0.)
inter_vol = h * w
scores = tf.div(inter_vol, vol_anchors)
return scores
# while_loop condition: the number of labels determines how many iterations are needed to fold every object in the image into the ground-truth targets
def condition(i, feat_labels, feat_scores,
feat_ymin, feat_xmin, feat_ymax, feat_xmax):
"""Condition: check label index.
"""
# I changed this line: the samples I use contain exactly one object per image
return i < 1
# Build the ground-truth targets
def body(i, feat_labels, feat_scores,
feat_ymin, feat_xmin, feat_ymax, feat_xmax):
"""Body: update feature labels, scores and bboxes.
Follow the original SSD paper for that purpose:
- assign values when jaccard > 0.5;
- only update if beat the score of other bboxes.
"""
# Jaccard score.
# Get the current label and bbox; another of my own changes, the original code uses labels[i] and bboxes[i]
label = labels[0]
bbox = bboxes[0]
# IoU between this bbox and every anchor
jaccard = jaccard_with_anchors(bbox)
# Mask: check threshold + scores + no annotations + num_classes.
# Where the new IoU beats the previously recorded score, the mask is True, i.e. that anchor's assigned object must be updated
mask = tf.greater(jaccard, feat_scores)
# mask = tf.logical_and(mask, tf.greater(jaccard, matching_threshold))
# A few extra conditions and casts to make the updates below easier
mask = tf.logical_and(mask, feat_scores > -0.5)
mask = tf.logical_and(mask, label < num_classes)
imask = tf.cast(mask, tf.int64)
fmask = tf.cast(mask, dtype)
# Update values using mask.
# For anchors where the mask is True, update the object they are responsible for.
# The assignment rule is:
# each ground-truth box may be matched to several anchors,
# but each anchor is only responsible for the ground-truth box it overlaps most (highest IoU)
feat_labels = imask * label + (1 - imask) * feat_labels
# tf.where: where the mask is True, feat_scores is updated to the new IoU; elsewhere it stays unchanged
feat_scores = tf.where(mask, jaccard, feat_scores)
# Update the corner-form ground-truth coordinates
feat_ymin = fmask * bbox[0] + (1 - fmask) * feat_ymin
feat_xmin = fmask * bbox[1] + (1 - fmask) * feat_xmin
feat_ymax = fmask * bbox[2] + (1 - fmask) * feat_ymax
feat_xmax = fmask * bbox[3] + (1 - fmask) * feat_xmax
# Check no annotation label: ignore these anchors...
# interscts = intersection_with_anchors(bbox)
# mask = tf.logical_and(interscts > ignore_threshold,
# label == no_annotation_label)
# # Replace scores by -1.
# feat_scores = tf.where(mask, -tf.cast(mask, dtype), feat_scores)
return [i+1, feat_labels, feat_scores,
feat_ymin, feat_xmin, feat_ymax, feat_xmax]
# Main loop definition.
# i = 0
# [i, feat_labels, feat_scores,
# feat_ymin, feat_xmin,
# feat_ymax, feat_xmax] = tf.while_loop(condition, body,
# [i, feat_labels, feat_scores,
# feat_ymin, feat_xmin,
# feat_ymax, feat_xmax])
# Again modified for my own use case.
# The idea of the original code: iterate over the objects in the image (condition decides when to stop) and build the ground-truth targets (body does the construction)
[i, feat_labels, feat_scores,
feat_ymin, feat_xmin,
feat_ymax, feat_xmax] = body(1, feat_labels, feat_scores,
feat_ymin, feat_xmin,
feat_ymax, feat_xmax)
# Transform to center / size.
# This is where the coordinates are actually encoded
feat_cy = (feat_ymax + feat_ymin) / 2.
feat_cx = (feat_xmax + feat_xmin) / 2.
feat_h = feat_ymax - feat_ymin
feat_w = feat_xmax - feat_xmin
# Encode features.
feat_cy = (feat_cy - yref) / href / prior_scaling[0]
feat_cx = (feat_cx - xref) / wref / prior_scaling[1]
feat_h = tf.log(feat_h / href) / prior_scaling[2]
feat_w = tf.log(feat_w / wref) / prior_scaling[3]
# Use SSD ordering: x / y / w / h instead of ours.
# Stack the 4 encoded coordinates
feat_localizations = tf.stack([feat_cx, feat_cy, feat_w, feat_h], axis=-1)
# What is returned are the targets of one feature layer's anchors, for one image's labels and ground-truth boxes
# Output shapes:
# feat_labels:[CELL_SIZE, CELL_SIZE, NUM_ANCHORS]
# feat_localization:[CELL_SIZE, CELL_SIZE, NUM_ANCHORS,4]
# feat_scores:[CELL_SIZE, CELL_SIZE, NUM_ANCHORS]
return feat_labels, feat_localizations, feat_scores
The code is long, but taken block by block it is simple, roughly three parts:
1) Initialize storage tensors of the appropriate shape and give them initial values.
2) Define a few helper functions: computing IoU, deciding whether to keep iterating over the image's ground-truth boxes and labels, and assigning each ground truth to suitable anchors.
3) Tie them together with while_loop to complete the encoding.
The part worth a closer look is the helper functions. The IoU function and the loop condition, jaccard_with_anchors and condition, are straightforward. I rewrote condition to stop as soon as the object count reaches 1, because each of my images contains a single object; the original version is just as easy to follow. A small sketch of the general matching rule follows.
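For reference, here is a minimal numpy sketch (my own, with hypothetical inputs, not the repository code) of the rule the original while_loop implements: iterate over all ground-truth boxes and keep, for every anchor, the box with the highest IoU seen so far.

import numpy as np

def iou(anchors, box):
    # IoU between every anchor [N, 4] (ymin, xmin, ymax, xmax) and one box [4]
    int_ymin = np.maximum(anchors[:, 0], box[0])
    int_xmin = np.maximum(anchors[:, 1], box[1])
    int_ymax = np.minimum(anchors[:, 2], box[2])
    int_xmax = np.minimum(anchors[:, 3], box[3])
    inter = np.maximum(int_ymax - int_ymin, 0.) * np.maximum(int_xmax - int_xmin, 0.)
    vol_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    vol_b = (box[2] - box[0]) * (box[3] - box[1])
    return inter / (vol_a + vol_b - inter)

def match(anchors, gt_boxes, gt_labels):
    # Each anchor ends up assigned to the ground truth it overlaps most
    scores = np.zeros(len(anchors), np.float32)
    labels = np.zeros(len(anchors), np.int64)
    boxes = np.zeros((len(anchors), 4), np.float32)
    for box, label in zip(gt_boxes, gt_labels):
        j = iou(anchors, box)
        mask = j > scores              # only update where this gt beats the current record
        scores = np.where(mask, j, scores)
        labels = np.where(mask, label, labels)
        boxes = np.where(mask[:, None], box, boxes)
    return labels, boxes, scores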
The main function to study is body; for convenience, here it is on its own:
def body(i, feat_labels, feat_scores,
feat_ymin, feat_xmin, feat_ymax, feat_xmax):
"""Body: update feature labels, scores and bboxes.
Follow the original SSD paper for that purpose:
- assign values when jaccard > 0.5;
- only update if beat the score of other bboxes.
"""
# Jaccard score.
# Get the current label and bbox; another of my own changes, the original code uses labels[i] and bboxes[i]
label = labels[0]
bbox = bboxes[0]
# IoU between this bbox and every anchor
jaccard = jaccard_with_anchors(bbox)
# Mask: check threshold + scores + no annotations + num_classes.
# Where the new IoU beats the previously recorded score, the mask is True, i.e. that anchor's assigned object must be updated
mask = tf.greater(jaccard, feat_scores)
# mask = tf.logical_and(mask, tf.greater(jaccard, matching_threshold))
# A few extra conditions and casts to make the updates below easier
mask = tf.logical_and(mask, feat_scores > -0.5)
mask = tf.logical_and(mask, label < num_classes)
imask = tf.cast(mask, tf.int64)
fmask = tf.cast(mask, dtype)
# Update values using mask.
# For anchors where the mask is True, update the object they are responsible for.
# The assignment rule is:
# each ground-truth box may be matched to several anchors,
# but each anchor is only responsible for the ground-truth box it overlaps most (highest IoU)
feat_labels = imask * label + (1 - imask) * feat_labels
# tf.where: where the mask is True, feat_scores is updated to the new IoU; elsewhere it stays unchanged
feat_scores = tf.where(mask, jaccard, feat_scores)
# Update the corner-form ground-truth coordinates
feat_ymin = fmask * bbox[0] + (1 - fmask) * feat_ymin
feat_xmin = fmask * bbox[1] + (1 - fmask) * feat_xmin
feat_ymax = fmask * bbox[2] + (1 - fmask) * feat_ymax
feat_xmax = fmask * bbox[3] + (1 - fmask) * feat_xmax
# Check no annotation label: ignore these anchors...
# interscts = intersection_with_anchors(bbox)
# mask = tf.logical_and(interscts > ignore_threshold,
# label == no_annotation_label)
# # Replace scores by -1.
# feat_scores = tf.where(mask, -tf.cast(mask, dtype), feat_scores)
return [i+1, feat_labels, feat_scores,
feat_ymin, feat_xmin, feat_ymax, feat_xmax]
With that, the samples have been encoded onto the appropriate anchors and the encoding step is complete.
3. Loss construction
With the groundwork above, the loss code is quite simple. The one thing to keep in mind is that only anchors that are likely enough to contain an object get to participate in building the loss graph; a small sketch of that selection rule comes first, followed by the actual loss definition:
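A minimal numpy sketch (my own, with hypothetical flattened arrays) of the positive / hard-negative selection implemented in ssd_losses(): anchors whose encoded score exceeds match_threshold are positives, and among the rest only the hardest negatives, about 3 per positive, are kept for the classification loss.

import numpy as np

def select_masks(gscores, bg_prob, match_threshold=0.5, negative_ratio=3):
    # gscores: best IoU recorded for each anchor; bg_prob: predicted background probability
    pmask = gscores > match_threshold                           # positive anchors
    n_pos = int(pmask.sum())
    nmask = np.logical_and(~pmask, gscores > -0.5)              # candidate negatives
    n_neg = min(negative_ratio * n_pos + 1, int(nmask.sum()))   # +1 stands in for the batch_size term
    bg = np.where(nmask, bg_prob, 1.0)                          # non-candidates can never be "hard"
    hard = np.argsort(bg)[:n_neg]                               # lowest background prob = hardest negatives
    final_nmask = np.zeros_like(pmask)
    final_nmask[hard] = True
    return pmask, final_nmask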
# Loss definition
def ssd_losses(logits, localisations,
gclasses, glocalisations, gscores,
match_threshold=0.5,
negative_ratio=3.,
alpha=1.,
label_smoothing=0.,
device='/cpu:0',
scope=None):
with tf.name_scope(scope, 'ssd_losses'):
lshape = get_shape(logits[0], 5)
num_classes = lshape[-1]
batch_size = lshape[0]
# Flatten out all vectors!
# The block below just flattens everything and concatenates it across layers
# ground truth:
# gclasses: [batch_num*(64*64*4 +.....+ 1*1*4)]
# gscores: [batch_num*(64*64*4 +.....+ 1*1*4)]
# glocalisations: [batch_num*(64*64*4 +.....+ 1*1*4), 4]
# predictions:
# logits: [batch_num*(64*64*4 +.....+ 1*1*4), num_classes]
# localisations: [batch_num*(64*64*4 +.....+ 1*1*4), 4]
flogits = []
fgclasses = []
fgscores = []
flocalisations = []
fglocalisations = []
for i in range(len(logits)):
flogits.append(tf.reshape(logits[i], [-1, num_classes]))
fgclasses.append(tf.reshape(gclasses[i], [-1]))
fgscores.append(tf.reshape(gscores[i], [-1]))
flocalisations.append(tf.reshape(localisations[i], [-1, 4]))
fglocalisations.append(tf.reshape(glocalisations[i], [-1, 4]))
# And concat the crap!
logits = tf.concat(flogits, axis=0)
gclasses = tf.concat(fgclasses, axis=0)
gscores = tf.concat(fgscores, axis=0)
localisations = tf.concat(flocalisations, axis=0)
glocalisations = tf.concat(fglocalisations, axis=0)
dtype = logits.dtype
# Compute positive matching mask...
# Only anchors whose recorded IoU exceeds match_threshold count as positive samples
pmask = gscores > match_threshold
fpmask = tf.cast(pmask, dtype)
n_positives = tf.reduce_sum(fpmask)
# Hard negative mining...
# Everything else is treated as background
no_classes = tf.cast(pmask, tf.int32)
# Class probabilities obtained by softmax over the logits
predictions = slim.softmax(logits)
# Anything that is not positive (and not ignored) is a negative candidate
nmask = tf.logical_and(tf.logical_not(pmask),
gscores > -0.5)
fnmask = tf.cast(nmask, dtype)
# For negative candidates, take the predicted background probability; elsewhere use 1 so those entries are never selected as hard negatives
nvalues = tf.where(nmask,
predictions[:, 0],
1. - fnmask)
# Flatten to [batch_num*(64*64*4 +.....+ 1*1*4)]
nvalues_flat = tf.reshape(nvalues, [-1])
# Number of negative entries to select.
# Re-select negatives so that the negative:positive ratio is about 3:1
max_neg_entries = tf.cast(tf.reduce_sum(fnmask), tf.int32)
n_neg = tf.cast(negative_ratio * n_positives, tf.int32) + batch_size
n_neg = tf.minimum(n_neg, max_neg_entries)
val, idxes = tf.nn.top_k(-nvalues_flat, k=n_neg)
max_hard_pred = -val[-1]
# Final negative mask.
nmask = tf.logical_and(nmask, nvalues < max_hard_pred)
fnmask = tf.cast(nmask, dtype)
# Add cross-entropy loss.
with tf.name_scope('cross_entropy_pos'):
# Cross-entropy for positive samples
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
labels=gclasses)
#loss = tf.div(tf.reduce_sum(loss * fpmask), batch_size, name='value')
loss = tf.reduce_sum(loss * fpmask)
tf.losses.add_loss(loss)
with tf.name_scope('cross_entropy_neg'):
# Cross-entropy for negative (background) samples
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
labels=no_classes)
#loss = tf.div(tf.reduce_sum(loss * fnmask), batch_size, name='value')
loss = tf.reduce_sum(loss * fnmask)
tf.losses.add_loss(loss)
# Add localization loss: smooth L1, L2, ...
with tf.name_scope('localization'):
# Weights Tensor: positive mask + random negative.
# Smooth-L1 localization regression, weighted by the positive mask
weights = tf.expand_dims(alpha * fpmask, axis=-1)
loss = custom_layers.abs_smooth(localisations - glocalisations)
#loss = tf.div(tf.reduce_sum(loss * weights), batch_size, name='value')
loss = tf.reduce_sum(loss * weights)
tf.losses.add_loss(loss)
Once all of the losses have been added via tf.losses.add_loss(), the loss construction is complete.
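Since every term is registered through tf.losses.add_loss(), the training script (not covered in this post) can obtain the combined objective in the standard TF-slim way; a minimal sketch, with placeholder optimizer settings of my own choosing:

import tensorflow as tf

# after building the network, encoding the ground truth and calling ssd_losses()
total_loss = tf.losses.get_total_loss()                 # sum of the added losses plus regularization
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)  # placeholder optimizer, not the repository's settings
train_op = optimizer.minimize(total_loss)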