When Detectron trains a network, the blobs fed to the network are generated by the following function (in minibatch.py):
def get_minibatch(roidb):
    """Given a roidb, construct a minibatch sampled from it."""
    # We collect blobs from each image onto a list and then concat them into a
    # single tensor, hence we initialize each blob to an empty list
    blobs = {k: [] for k in get_minibatch_blob_names()}
    # Get the input image blob, formatted for caffe2.
    # The images are converted to the layout the network expects
    # (batch, channel, height, width); im_scales holds the resize scales.
    im_blob, im_scales = _get_image_blob(roidb)
    blobs['data'] = im_blob
    if cfg.RPN.RPN_ON:
        # RPN-only or end-to-end Faster/Mask R-CNN
        valid = rpn_roi_data.add_rpn_blobs(blobs, im_scales, roidb)
    elif cfg.RETINANET.RETINANET_ON:
        im_width, im_height = im_blob.shape[3], im_blob.shape[2]
        # im_width, im_height corresponds to the network input: padded image
        # (if needed) width and height. We pass it as input and slice the data
        # accordingly so that we don't need to use SampleAsOp
        valid = retinanet_roi_data.add_retinanet_blobs(
            blobs, im_scales, roidb, im_width, im_height
        )
    else:
        # Fast R-CNN like models trained on precomputed proposals
        valid = fast_rcnn_roi_data.add_fast_rcnn_blobs(blobs, im_scales, roidb)
    return blobs, valid
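The function that supplies blobs to RetinaNet is retinanet_roi_data.add_retinanet_blobs. The rest of this post analyzes that function in detail.
1.1 First, a review of how an ordinary RPN+FPN network generates anchors
Take an FPN with levels P2-P6 as an example (training images are 1024×1024; all other parameters are defaults):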
Level | stride | anchor_sizes | anchor_aspect_ratios | Number of anchors (×3 for the three aspect ratios) |
P2 | 4 (2^2) | 32 | 0.5, 1, 2 | (1024/4)^2×3 = 196608 |
P3 | 8 (2^3) | 64 | 0.5, 1, 2 | (1024/8)^2×3 = 49152 |
P4 | 16 (2^4) | 128 | 0.5, 1, 2 | (1024/16)^2×3 = 12288 |
P5 | 32 (2^5) | 256 | 0.5, 1, 2 | (1024/32)^2×3 = 3072 |
P6 | 64 (2^6) | 512 | 0.5, 1, 2 | (1024/64)^2×3 = 768 |
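The counts in the table can be reproduced with a few lines of arithmetic. This is only a sketch under the same assumptions as the table (a square 1024×1024 input, one anchor size and three aspect ratios per level); the variable names are illustrative and do not come from Detectron.

```python
# Anchors per FPN level for the RPN+FPN example (P2-P6, 1024x1024 input).
image_size = 1024
num_aspect_ratios = 3  # 0.5, 1, 2

counts = {}
for lvl in range(2, 7):
    stride = 2 ** lvl
    cells_per_side = image_size // stride  # anchor centers along one side
    counts["P{}".format(lvl)] = cells_per_side ** 2 * num_aspect_ratios

total = sum(counts.values())
print(counts)
print(total)
```

The finest level dominates: P2 alone accounts for about three quarters of all anchors.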
1.2 How RetinaNet generates anchors
k_max, k_min = cfg.FPN.RPN_MAX_LEVEL, cfg.FPN.RPN_MIN_LEVEL
scales_per_octave = cfg.RETINANET.SCALES_PER_OCTAVE  # scales_per_octave = 3
aspect_ratios = cfg.RETINANET.ASPECT_RATIOS  # (1.0, 2.0, 0.5), the anchor aspect ratios
num_aspect_ratios = len(cfg.RETINANET.ASPECT_RATIOS)  # number of aspect ratios
anchor_scale = cfg.RETINANET.ANCHOR_SCALE
# get anchors from all levels for all scales/aspect ratios
foas = []
for lvl in range(k_min, k_max + 1):
    # Each FPN level has its own stride: 8, 16, 32, 64, 128 for P3-P7.
    # One anchor center is placed every `stride` pixels.
    stride = 2. ** lvl
    for octave in range(scales_per_octave):  # octave = 0, 1, 2
        # octave_scale = 2^0, 2^(1/3), 2^(2/3)
        octave_scale = 2 ** (octave / float(scales_per_octave))
        for idx in range(num_aspect_ratios):  # idx = 0, 1, 2
            # side length of the anchor
            anchor_sizes = (stride * octave_scale * anchor_scale, )
            # aspect ratio of the anchor: 1.0, 2.0, 0.5
            anchor_aspect_ratios = (aspect_ratios[idx], )
            # generate the anchors
            foa = data_utils.get_field_of_anchors(
                stride, anchor_sizes, anchor_aspect_ratios, octave, idx)
            foas.append(foa)
all_anchors = np.concatenate([f.field_of_anchors for f in foas])
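RetinaNet introduces two new parameters, scales_per_octave and anchor_scale.
Walking through the main loops shows what these two parameters do:
(1) The first for loop sets the stride of each FPN level, just as in the RPN+FPN network above; for P3-P7 the strides run from 8 to 128. The loop starts at P3, where stride = 8.
(2) The second for loop sets octave_scale, which takes the values 2^(0/3), 2^(1/3), 2^(2/3), starting from 2^(0/3).
At this point the anchor_sizes of the level is already determined by the formula below. The code places it inside the third loop, which makes no difference.
anchor_sizes = (stride * octave_scale * anchor_scale, )
The two new parameters are therefore used to build the anchor_sizes of each FPN level: the stride is multiplied by anchor_scale and then by octave_scale. Because octave_scale takes three values, a RetinaNet FPN level produces three anchor_sizes, whereas an RPN+FPN level produces only one. This increases the number of anchors and makes it much more likely that some anchor overlaps each gt box well.
(3) The third loop selects anchor_aspect_ratios: for each anchor_sizes already fixed, it generates anchors with each of the three aspect ratios in turn. The generating call is shown below; the key inputs are its first three arguments (stride, anchor_sizes, anchor_aspect_ratios).
foa = data_utils.get_field_of_anchors(stride, anchor_sizes, anchor_aspect_ratios, octave, idx)
The table below uses level P3 as an example (training images are 1024×1024; all other parameters are defaults). P3 generates 16384×9 = 147456 anchors in total.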
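Table 1
Level | stride (1st for loop) | octave_scale (2nd for loop) | anchor_scale | anchor_sizes | anchor_aspect_ratios (3rd for loop) | Number of anchors (each row is one foa) |
P3 | 8 (2^3) | 2^0 | 4 | 8×2^0×4 = 32 | 1.0 | (1024/8)^2 = 16384 |
P3 | 8 (2^3) | 2^0 | 4 | 8×2^0×4 = 32 | 2.0 | (1024/8)^2 = 16384 |
P3 | 8 (2^3) | 2^0 | 4 | 8×2^0×4 = 32 | 0.5 | (1024/8)^2 = 16384 |
P3 | 8 (2^3) | 2^(1/3) | 4 | 8×2^(1/3)×4 = 40.317 | 1.0 | (1024/8)^2 = 16384 |
P3 | 8 (2^3) | 2^(1/3) | 4 | 8×2^(1/3)×4 = 40.317 | 2.0 | (1024/8)^2 = 16384 |
P3 | 8 (2^3) | 2^(1/3) | 4 | 8×2^(1/3)×4 = 40.317 | 0.5 | (1024/8)^2 = 16384 |
P3 | 8 (2^3) | 2^(2/3) | 4 | 8×2^(2/3)×4 = 50.797 | 1.0 | (1024/8)^2 = 16384 |
P3 | 8 (2^3) | 2^(2/3) | 4 | 8×2^(2/3)×4 = 50.797 | 2.0 | (1024/8)^2 = 16384 |
P3 | 8 (2^3) | 2^(2/3) | 4 | 8×2^(2/3)×4 = 50.797 | 0.5 | (1024/8)^2 = 16384 |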
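The anchor_sizes column can be checked numerically. The sketch below only re-evaluates the formula stride * octave_scale * anchor_scale with the values used in this post (scales_per_octave = 3, anchor_scale = 4); the helper name anchor_sizes_for_level is made up for illustration and is not part of Detectron.

```python
scales_per_octave = 3  # value used in this post
anchor_scale = 4       # value used in this post


def anchor_sizes_for_level(lvl):
    """Return the three anchor side lengths for FPN level `lvl`."""
    stride = 2.0 ** lvl
    return [
        stride * 2 ** (octave / float(scales_per_octave)) * anchor_scale
        for octave in range(scales_per_octave)
    ]


p3_sizes = anchor_sizes_for_level(3)
p7_sizes = anchor_sizes_for_level(7)
print([round(s, 3) for s in p3_sizes])
print([round(s, 3) for s in p7_sizes])
```

Each level covers one octave of sizes, so the largest size on one level stays below the smallest size on the next level (64 for P4).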
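Each generated foa is then appended to foas via foas.append(foa). Each FPN level contributes 9 foas, so after the three nested for loops finish, foas holds 5×9 = 45 foas. Finally, all_anchors = np.concatenate([f.field_of_anchors for f in foas]) stacks the anchors of all levels together, giving: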
Level | Number of anchors |
P3 | (1024/8)^2×9 = 147456 |
P4 | (1024/16)^2×9 = 36864 |
P5 | (1024/32)^2×9 = 9216 |
P6 | (1024/64)^2×9 = 2304 |
P7 | (1024/128)^2×9 = 576 |
Total | 196416 (same as in the figure above) |
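The totals can be verified the same way. This is a sketch under the post's assumptions (square 1024×1024 input, 3 scales per octave, 3 aspect ratios); it counts anchors arithmetically and does not call Detectron.

```python
image_size = 1024
scales_per_octave = 3
num_aspect_ratios = 3
anchors_per_location = scales_per_octave * num_aspect_ratios  # 9

# Number of anchors on each RetinaNet level P3-P7.
per_level = {
    lvl: (image_size // 2 ** lvl) ** 2 * anchors_per_location
    for lvl in range(3, 8)
}
total_anchors = sum(per_level.values())
# One foa per (level, octave, aspect ratio) combination.
num_foas = len(per_level) * anchors_per_location

print(per_level)
print(total_anchors, num_foas)
```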
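1.3 Summary
(1) Both kinds of network use the same default anchor_aspect_ratios: 1.0, 2.0, 0.5.
(2) In an RPN+FPN network, each FPN level has a single anchor_sizes; with 3 aspect ratios this gives 1×3 = 3 types of anchor.
(3) In RetinaNet, each FPN level has three anchor_sizes; with 3 aspect ratios this gives 3×3 = 9 types of anchor.
(4) Both FPN and RetinaNet call get_field_of_anchors to generate anchors. For a detailed walkthrough, see "Understanding the Detectron code (6): how anchors are generated for input samples" (detectron代码理解(六):对输入样本如何产生anchor).
(5) all_anchors concatenates the anchors of all FPN levels.
2.1 Generating blobs for each roidb entry, mainly by calling _get_retinanet_blobs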
blobs['retnet_fg_num'], blobs['retnet_bg_num'] = 0.0, 0.0
for im_i, entry in enumerate(roidb):
    scale = im_scales[im_i]  # resize scale of this image
    # entry['height'] and entry['width'] are the original image size;
    # multiplying by scale gives the size after resizing
    im_height = np.round(entry['height'] * scale)
    im_width = np.round(entry['width'] * scale)
    # indices of the gt boxes (foreground class, not crowd)
    gt_inds = np.where(
        (entry['gt_classes'] > 0) & (entry['is_crowd'] == 0))[0]
    assert len(gt_inds) > 0, \
        'Empty ground truth empty for image is not allowed. Please check.'
    gt_rois = entry['boxes'][gt_inds, :] * scale  # gt boxes in the resized image
    gt_classes = entry['gt_classes'][gt_inds]  # gt class labels
    im_info = np.array([[im_height, im_width, scale]], dtype=np.float32)
    blobs['im_info'].append(im_info)
    retinanet_blobs, fg_num, bg_num = _get_retinanet_blobs(
        foas, all_anchors, gt_rois, gt_classes, image_width, image_height)
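The ground-truth filtering in the loop above can be tried on a toy entry. The entry below is made up for illustration (real roidb entries carry more fields), and the scale of 1.5 is an arbitrary assumption.

```python
import numpy as np

# Hypothetical roidb entry with three annotated boxes.
entry = {
    'height': 480,
    'width': 640,
    'boxes': np.array([[10., 20., 110., 220.],
                       [50., 60., 150., 160.],
                       [0., 0., 30., 30.]], dtype=np.float32),
    'gt_classes': np.array([3, 0, 7], dtype=np.int32),  # 0 means background
    'is_crowd': np.array([0, 0, 1], dtype=np.int32),    # third box is a crowd
}
scale = 1.5  # assumed resize factor

# Keep only non-crowd boxes with a foreground class.
gt_inds = np.where((entry['gt_classes'] > 0) & (entry['is_crowd'] == 0))[0]
gt_rois = entry['boxes'][gt_inds, :] * scale
gt_classes = entry['gt_classes'][gt_inds]
im_info = np.array(
    [[np.round(entry['height'] * scale), np.round(entry['width'] * scale), scale]],
    dtype=np.float32)

print(gt_inds, gt_rois, gt_classes, im_info)
```

Only the first box survives: the second has class 0 and the third is marked as a crowd region.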
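The outer for im_i, entry in enumerate(roidb) loop handles one input image at a time and collects its gt information, mainly the gt boxes (gt_rois), the gt class labels (gt_classes), and im_info.
With this information in hand, it calls _get_retinanet_blobs (shown below) to build the blobs for that image.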
def _get_retinanet_blobs(
        foas, all_anchors, gt_boxes, gt_classes, im_width, im_height):
    # 1. Filter the anchors.
    # all_anchors holds every generated anchor. Judging by the variable names,
    # the intent is to keep only the anchors that lie inside the image.
    # (The actual code below does no filtering: inds_inside is simply the
    # index of every anchor.)
    total_anchors = all_anchors.shape[0]
    logger.debug('Getting mad blobs: im_height {} im_width: {}'.format(
        im_height, im_width))
    inds_inside = np.arange(all_anchors.shape[0])
    anchors = all_anchors
    num_inside = len(inds_inside)
    logger.debug('total_anchors: {}'.format(total_anchors))
    logger.debug('inds_inside: {}'.format(num_inside))
    logger.debug('anchors.shape: {}'.format(anchors.shape))
    # 2. Compute the overlaps and derive the labels from them:
    # label >= 1 is positive (the gt class), 0 is negative, -1 is ignored
    labels = np.empty((num_inside, ), dtype=np.float32)
    labels.fill(-1)  # start with everything set to -1
    if len(gt_boxes) > 0:  # the image has gt boxes
        # Compute overlaps between the anchors and the gt boxes overlaps
        # anchor_by_gt_overlap.shape = (anchors_num, gt_num)
        anchor_by_gt_overlap = box_utils.bbox_overlaps(anchors, gt_boxes)
        # Map from anchor to gt box that has highest overlap
        # anchor_to_gt_argmax.shape = (anchors_num, )
        anchor_to_gt_argmax = anchor_by_gt_overlap.argmax(axis=1)
        # For each anchor, amount of overlap with most overlapping gt box
        # anchor_to_gt_max.shape = (anchors_num, )
        anchor_to_gt_max = anchor_by_gt_overlap[
            np.arange(num_inside), anchor_to_gt_argmax]
        # Map from gt box to an anchor that has highest overlap
        # gt_to_anchor_argmax.shape = (gt_num, ). axis=0 takes the maximum of
        # each column, and each column holds the overlaps of all anchors with
        # one gt box.
        gt_to_anchor_argmax = anchor_by_gt_overlap.argmax(axis=0)
        # For each gt box, amount of overlap with most overlapping anchor
        gt_to_anchor_max = anchor_by_gt_overlap[
            gt_to_anchor_argmax, np.arange(anchor_by_gt_overlap.shape[1])]
        # Find all anchors that share the max overlap amount
        # (this includes many ties)
        anchors_with_max_overlap = np.where(
            anchor_by_gt_overlap == gt_to_anchor_max)[0]
        # Fg label: for each gt use anchors with highest overlap
        # (including ties); these anchors get the class of their gt box
        gt_inds = anchor_to_gt_argmax[anchors_with_max_overlap]
        labels[anchors_with_max_overlap] = gt_classes[gt_inds]
        # Fg label: above threshold IOU; anchors whose overlap is at least
        # cfg.RETINANET.POSITIVE_OVERLAP get the class of their gt box
        inds = anchor_to_gt_max >= cfg.RETINANET.POSITIVE_OVERLAP
        gt_inds = anchor_to_gt_argmax[inds]
        labels[inds] = gt_classes[gt_inds]
    fg_inds = np.where(labels >= 1)[0]  # indices of the fg anchors
    # indices of the bg anchors
    bg_inds = np.where(anchor_to_gt_max < cfg.RETINANET.NEGATIVE_OVERLAP)[0]
    labels[bg_inds] = 0  # bg anchors get label 0
    num_fg, num_bg = len(fg_inds), len(bg_inds)  # number of fg and bg anchors
    # 3. Compute bbox_targets, i.e. the offsets (tx, ty) and the scale
    # factors (tw, th). The inputs are:
    # (1) anchors[fg_inds, :]: the anchors labeled as positive
    # (2) gt_boxes[anchor_to_gt_argmax[fg_inds], :]: the gt box with the
    #     highest overlap for each of those anchors
    bbox_targets = np.zeros((num_inside, 4), dtype=np.float32)
    bbox_targets[fg_inds, :] = data_utils.compute_targets(
        anchors[fg_inds, :], gt_boxes[anchor_to_gt_argmax[fg_inds], :])
    # 4. Restore the original ordering (this step matters only if the anchors
    # were actually filtered in step 1)
    labels = data_utils.unmap(labels, total_anchors, inds_inside, fill=-1)
    bbox_targets = data_utils.unmap(bbox_targets, total_anchors, inds_inside, fill=0)
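The labeling rule in step 2 can be reproduced on a handful of boxes. The following is a simplified sketch, not Detectron's implementation: iou_matrix is a hand-written stand-in for box_utils.bbox_overlaps (without its pixel conventions), assign_labels is a hypothetical helper, and the thresholds 0.5 and 0.4 are assumed values standing in for cfg.RETINANET.POSITIVE_OVERLAP and cfg.RETINANET.NEGATIVE_OVERLAP.

```python
import numpy as np


def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU of (x1, y1, x2, y2) boxes; result shape (num_anchors, num_gt)."""
    ax1, ay1, ax2, ay2 = [anchors[:, i][:, None] for i in range(4)]
    gx1, gy1, gx2, gy2 = [gt_boxes[:, i][None, :] for i in range(4)]
    iw = np.clip(np.minimum(ax2, gx2) - np.maximum(ax1, gx1), 0, None)
    ih = np.clip(np.minimum(ay2, gy2) - np.maximum(ay1, gy1), 0, None)
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    return inter / (area_a + area_g - inter)


def assign_labels(anchors, gt_boxes, gt_classes, pos_thresh=0.5, neg_thresh=0.4):
    """Label anchors: gt class for fg, 0 for bg, -1 for ignored."""
    overlaps = iou_matrix(anchors, gt_boxes)
    labels = np.full((anchors.shape[0], ), -1, dtype=np.float32)
    anchor_to_gt_argmax = overlaps.argmax(axis=1)
    anchor_to_gt_max = overlaps.max(axis=1)
    gt_to_anchor_max = overlaps.max(axis=0)
    # Best anchor(s) for each gt box, including ties.
    best = np.where(overlaps == gt_to_anchor_max)[0]
    labels[best] = gt_classes[anchor_to_gt_argmax[best]]
    # Anchors above the positive threshold.
    pos = anchor_to_gt_max >= pos_thresh
    labels[pos] = gt_classes[anchor_to_gt_argmax[pos]]
    # Anchors below the negative threshold become background.
    labels[anchor_to_gt_max < neg_thresh] = 0
    return labels


anchors = np.array([[0., 0., 10., 10.],    # IoU 1.0  -> fg
                    [0., 0., 10., 6.],     # IoU 0.6  -> fg
                    [20., 20., 30., 30.],  # IoU 0.0  -> bg
                    [0., 0., 10., 4.5]])   # IoU 0.45 -> ignored
gt_boxes = np.array([[0., 0., 10., 10.]])
gt_classes = np.array([5])
labels = assign_labels(anchors, gt_boxes, gt_classes)
print(labels)
```

The fourth anchor falls between the two thresholds, so it keeps the initial label -1 and contributes to neither the fg nor the bg count.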
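The comments in the code above (steps 1-4) summarize what it does. At this point we have, for all FPN levels of the image, the labels and the bbox_targets.
In summary, the labels and bbox_targets of an image are computed by comparing all_anchors, which covers every FPN level, with the gt boxes of that image. all_anchors is built by concatenating the anchors of each foa in foas in order, so labels and bbox_targets have the same number of rows as all_anchors and follow the same order.
This is what the for loop in the code below relies on.
First it obtains H and W, which are the number of anchor positions generated along one side of the field (mentioned in item (4) of the summary in 1.3; see the linked post). H×W is therefore the number of anchors in that foa, which fixes the start index and the end index. The corresponding _labels and _bbox_targets are sliced out and then reshaped.
Next it finds the entries of _labels that are greater than 0. Because _labels has shape (1, 1, H, W), a four-dimensional array, the returned inds_4d is a tuple of four index arrays, each giving the position along one dimension of the labels that are greater than 0. Since the first two dimensions have size 1, only the H and W dimensions carry positional information: y, x = inds_4d[2], inds_4d[3].
In most cases the four arrays in inds_4d are empty, because fg anchors are rare. lbls is then empty and the body of the for i, lbl in enumerate(lbls) loop never runs. Execution goes straight to blobs_out.append, which stores the foa's retnet_cls_labels, retnet_roi_bbox_targets, and retnet_roi_fg_bbox_locs. In that case the last two are empty, and retnet_cls_labels contains only 0 (bg) or -1 (ignore).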
    # (continuation of _get_retinanet_blobs)
    blobs_out = []
    start_idx = 0
    for foa in foas:
        H = foa.field_size
        W = foa.field_size
        end_idx = start_idx + H * W  # end index of this foa's anchors
        _labels = labels[start_idx:end_idx]  # labels of this foa
        _bbox_targets = bbox_targets[start_idx:end_idx, :]  # bbox_targets of this foa
        start_idx = end_idx
        # labels output with shape (1, height, width)
        _labels = _labels.reshape((1, 1, H, W))
        # bbox_targets output with shape (1, 4 * A, height, width)
        _bbox_targets = _bbox_targets.reshape((1, H, W, 4)).transpose(0, 3, 1, 2)
        stride = foa.stride
        w = int(im_width / stride)
        h = int(im_height / stride)
        # data for select_smooth_l1 loss
        num_classes = cfg.MODEL.NUM_CLASSES - 1
        inds_4d = np.where(_labels > 0)
        # Note: len(inds_4d) is the number of index arrays (4 for a 4-D array),
        # not the number of fg anchors, so the check below always passes.
        # The number of fg anchors is len(inds_4d[0]).
        M = len(inds_4d)
        _roi_bbox_targets = np.zeros((0, 4))
        _roi_fg_bbox_locs = np.zeros((0, 4))
        if M > 0:
            # y holds the row indices, x holds the column indices
            im_inds, y, x = inds_4d[0], inds_4d[2], inds_4d[3]
            _roi_bbox_targets = np.zeros((len(im_inds), 4))
            _roi_fg_bbox_locs = np.zeros((len(im_inds), 4))
            lbls = _labels[im_inds, :, y, x]
            for i, lbl in enumerate(lbls):
                l = lbl[0] - 1
                if not cfg.RETINANET.CLASS_SPECIFIC_BBOX:
                    l = 0
                assert l >= 0 and l < num_classes, 'label out of the range'
                _roi_bbox_targets[i, :] = _bbox_targets[:, :, y[i], x[i]]
                _roi_fg_bbox_locs[i, :] = np.array([[0, l, y[i], x[i]]])
        blobs_out.append(
            dict(
                retnet_cls_labels=_labels[:, :, 0:h, 0:w].astype(np.int32),
                retnet_roi_bbox_targets=_roi_bbox_targets.astype(np.float32),
                retnet_roi_fg_bbox_locs=_roi_fg_bbox_locs.astype(np.float32),
            ))
    out_num_fg = np.array([num_fg + 1.0], dtype=np.float32)
    # (num_bg + 1) × (num_classes - 1) + (num_fg + 1) × (num_classes - 2)
    out_num_bg = (
        np.array([num_bg + 1.0]) * (cfg.MODEL.NUM_CLASSES - 1) +
        out_num_fg * (cfg.MODEL.NUM_CLASSES - 2))
    return blobs_out, out_num_fg, out_num_bg
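The per-foa slicing at the top of the loop can be illustrated on a small flat array. This is a sketch with made-up field sizes (4, 4, and 2); in Detectron the sizes come from foa.field_size.

```python
import numpy as np

field_sizes = [4, 4, 2]  # hypothetical H (= W) of three foas
num_anchors = sum(s * s for s in field_sizes)  # 16 + 16 + 4 = 36
labels = np.arange(num_anchors, dtype=np.float32)  # stand-in for real labels
bbox_targets = np.zeros((num_anchors, 4), dtype=np.float32)

label_chunks, target_chunks = [], []
start_idx = 0
for size in field_sizes:
    end_idx = start_idx + size * size
    # (H*W,) -> (1, 1, H, W)
    label_chunks.append(labels[start_idx:end_idx].reshape((1, 1, size, size)))
    # (H*W, 4) -> (1, H, W, 4) -> (1, 4, H, W)
    target_chunks.append(
        bbox_targets[start_idx:end_idx, :]
        .reshape((1, size, size, 4))
        .transpose(0, 3, 1, 2))
    start_idx = end_idx

print([c.shape for c in label_chunks])
print([c.shape for c in target_chunks])
```

Because each chunk starts where the previous one ended, the slices only line up if the foas are visited in the same order in which all_anchors was concatenated.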
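What if inds_4d is non-empty? See the left figure below. Because _labels is four-dimensional and cannot be visualized directly, I drop the first dimension, i.e. (1, 1, H, W) -> (1, H, W). Some entries then have the value 2, which marks them as fg. Comparing the labels with the third and fourth arrays of inds_4d shows that the third holds the row indices of the labels and the fourth holds the column indices.
Then _roi_bbox_targets[i, :] = _bbox_targets[:, :, y[i], x[i]] and _roi_fg_bbox_locs[i, :] = np.array([[0, l, y[i], x[i]]]) pick out the values for each fg anchor. Unlike the bg-only case, _roi_fg_bbox_locs is no longer empty: as shown below, its third and fourth columns store the row and column index of each fg anchor (in row-major order, one row at a time).
2.2 Merging blobs by FPN level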
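The behavior of np.where on a four-dimensional label array is easy to check on a tiny example. The array below is made up; class_specific is a local stand-in for cfg.RETINANET.CLASS_SPECIFIC_BBOX.

```python
import numpy as np

# A (1, 1, H, W) label map with H = 3, W = 4.
_labels = np.zeros((1, 1, 3, 4), dtype=np.float32)
_labels[0, 0, 1, 2] = 2.0   # fg of class 2 at row 1, column 2
_labels[0, 0, 2, 0] = 5.0   # fg of class 5 at row 2, column 0
_labels[0, 0, 0, 3] = -1.0  # ignored anchor

inds_4d = np.where(_labels > 0)
num_index_arrays = len(inds_4d)  # one index array per dimension
num_fg = len(inds_4d[0])         # number of fg locations
y, x = inds_4d[2], inds_4d[3]    # row and column indices

class_specific = False  # assumed setting for this sketch
fg_locs = np.zeros((num_fg, 4))
for i in range(num_fg):
    l = _labels[0, 0, y[i], x[i]] - 1 if class_specific else 0
    fg_locs[i, :] = [0, l, y[i], x[i]]

print(num_index_arrays, num_fg)
print(fg_locs)
```

The length of the tuple is always the number of dimensions, so it cannot be used to test whether any fg anchor exists; the length of any one of its arrays gives the fg count.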
    # (still inside the `for im_i, entry in enumerate(roidb)` loop)
    for i, foa in enumerate(foas):
        for k, v in retinanet_blobs[i].items():
            # the way it stacks is:
            # [[anchors for image1] + [anchors for images 2]]
            level = int(np.log2(foa.stride))
            key = '{}_fpn{}'.format(k, level)
            if k == 'retnet_roi_fg_bbox_locs':
                v[:, 0] = im_i
                # loc_stride: 80 * 4 if cls_specific else 4
                loc_stride = 4  # 4 coordinate corresponding to bbox prediction
                if cfg.RETINANET.CLASS_SPECIFIC_BBOX:
                    loc_stride *= (cfg.MODEL.NUM_CLASSES - 1)
                anchor_ind = foa.octave * num_aspect_ratios + foa.aspect
                # v[:, 1] is the class label [range 0-80] if we do
                # class-specfic bbox otherwise it is 0. In case of class
                # specific, based on the label, the location of current
                # anchor is class_label * 4 and then we take into account
                # the anchor_ind if the anchors
                v[:, 1] *= 4
                v[:, 1] += loc_stride * anchor_ind
            blobs[key].append(v)
    blobs['retnet_fg_num'] += fg_num
    blobs['retnet_bg_num'] += bg_num
blobs['retnet_fg_num'] = blobs['retnet_fg_num'].astype(np.float32)
blobs['retnet_bg_num'] = blobs['retnet_bg_num'].astype(np.float32)
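The blobs generated for each roidb entry are shown in the left figure below, and the contents of each element in the right figure. retinanet_blobs has 45 elements, in one-to-one correspondence with foas. For example, the first foa in foas holds anchors produced at fpn3; fpn3 produces several types of anchor, and this one is the type in the first row of Table 1.
The first for loop follows the order of foas, and the second for loop over retinanet_blobs[i].items() keeps the one-to-one correspondence with foa. So whatever the second loop iterates over, level = int(np.log2(foa.stride)) stays fixed until the first loop advances. This code therefore merges the retnet_cls_labels, retnet_roi_bbox_targets, and retnet_roi_fg_bbox_locs of the different anchor types that belong to the same FPN level. The resulting blobs are shown below; for example, retnet_cls_labels_fpn3 becomes a list of 9 arrays.
2.3 Converting the lists into arrays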
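The grouping by level and the column offset written into retnet_roi_fg_bbox_locs can be sketched without Detectron. Foa here is a hypothetical namedtuple carrying only the three fields the loop reads, and bbox_channel_offset is a made-up helper that restates the two lines v[:, 1] *= 4 and v[:, 1] += loc_stride * anchor_ind.

```python
from collections import defaultdict, namedtuple

import numpy as np

Foa = namedtuple('Foa', ['stride', 'octave', 'aspect'])  # hypothetical stand-in
num_aspect_ratios = 3
foas = [Foa(2.0 ** lvl, octave, aspect)
        for lvl in range(3, 8)
        for octave in range(3)
        for aspect in range(num_aspect_ratios)]

# Group the 45 foas by FPN level; store anchor_ind instead of real blobs.
blobs = defaultdict(list)
for foa in foas:
    level = int(np.log2(foa.stride))
    key = 'retnet_cls_labels_fpn{}'.format(level)
    anchor_ind = foa.octave * num_aspect_ratios + foa.aspect
    blobs[key].append(anchor_ind)


def bbox_channel_offset(class_idx, anchor_ind, num_fg_classes, class_specific):
    """Channel of the first box coordinate for a given class and anchor type."""
    loc_stride = 4 * num_fg_classes if class_specific else 4
    return class_idx * 4 + loc_stride * anchor_ind


print(sorted(blobs.keys()))
print(blobs['retnet_cls_labels_fpn3'])
print(bbox_channel_offset(0, 5, 80, False))
print(bbox_channel_offset(2, 1, 80, True))
```

With class-agnostic boxes the offset is simply 4 * anchor_ind, which matches a prediction tensor that stores 4 coordinates for each of the 9 anchor types.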
N = len(roidb)
for k, v in blobs.items():
    if isinstance(v, list) and len(v) > 0:
        # compute number of anchors
        A = int(len(v) / N)
        # for the cls branch labels [per fpn level],
        # we have blobs['retnet_cls_labels_fpn{}'] as a list until this step
        # and length of this list is N x A where
        # N = num_images, A = num_anchors for example, N = 2, A = 9
        # Each element of the list has the shape 1 x 1 x H x W where H, W are
        # spatial dimension of curret fpn lvl. Let a{i} denote the element
        # corresponding to anchor i [9 anchors total] in the list.
        # The elements in the list are in order [[a0, ..., a9], [a0, ..., a9]]
        # however the network will make predictions like 2 x (9 * 80) x H x W
        # so we first concatenate the elements of each image to a numpy array
        # and then concatenate the two images to get the 2 x 9 x H x W
        if k.find('retnet_cls_labels') >= 0:
            tmp = []
            # concat anchors within an image
            for i in range(0, len(v), A):
                tmp.append(np.concatenate(v[i: i + A], axis=1))
            # concat images
            blobs[k] = np.concatenate(tmp, axis=0)
        else:
            # for the bbox branch elements [per FPN level],
            # we have the targets and the fg boxes locations
            # in the shape: M x 4 where M is the number of fg locations in a
            # given image at the current FPN level. For the given level,
            # the bbox predictions will be. The elements in the list are in
            # order [[a0, ..., a9], [a0, ..., a9]]
            # Concatenate them to form M x 4
            blobs[k] = np.concatenate(v, axis=0)
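The two-stage concatenation for the classification labels can be checked on dummy data. This sketch assumes N = 2 images, A = 9 anchor types, and a 4×4 field; each dummy array is filled with its own list position so the final layout is easy to verify.

```python
import numpy as np

N, A, H, W = 2, 9, 4, 4
# List of N * A arrays of shape (1, 1, H, W), ordered image by image.
v = [np.full((1, 1, H, W), img * A + a, dtype=np.int32)
     for img in range(N) for a in range(A)]

tmp = []
# Concatenate the A anchor types of each image along the channel axis.
for i in range(0, len(v), A):
    tmp.append(np.concatenate(v[i:i + A], axis=1))
# Concatenate the images along the batch axis.
cls_labels = np.concatenate(tmp, axis=0)

print(cls_labels.shape)
print(cls_labels[1, 3, 0, 0])
```

The result has one channel per anchor type, so channel a of image n holds the labels of list element n * A + a.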
After merging, the blobs fed into the network for one input sample are named:
[u'im_info', u'retnet_fg_num', u'retnet_bg_num',
u'retnet_cls_labels_fpn3', u'retnet_roi_bbox_targets_fpn3', u'retnet_roi_fg_bbox_locs_fpn3',
u'retnet_cls_labels_fpn4', u'retnet_roi_bbox_targets_fpn4', u'retnet_roi_fg_bbox_locs_fpn4',
u'retnet_cls_labels_fpn5', u'retnet_roi_bbox_targets_fpn5', u'retnet_roi_fg_bbox_locs_fpn5',
u'retnet_cls_labels_fpn6', u'retnet_roi_bbox_targets_fpn6', u'retnet_roi_fg_bbox_locs_fpn6',
u'retnet_cls_labels_fpn7', u'retnet_roi_bbox_targets_fpn7', u'retnet_roi_fg_bbox_locs_fpn7']