Contents
Hourglass
Corner Pooling
Building the labels
Loss
Test pipeline
References
input:(batch_size, 3, 511, 511)
backbone: hourglass
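The input first goes through a stem module, made up of a conv(7x7-c128-s2-p3) convolution module and a residual module whose convolution layers use 3x3-c256-s2-p1 kernels. The output becomes (batch_size, 256, 128, 128).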
self.stem = nn.Sequential(
    ConvModule(3, 128, 7, padding=3, stride=2, norm_cfg=norm_cfg),  # torch.Size([2, 128, 256, 256])
    ResLayer(BasicBlock, 128, 256, 1, stride=2, norm_cfg=norm_cfg))
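The shapes quoted for the stem can be checked with the standard convolution output-size formula. This is a minimal sketch; the helper name `conv_out_size` is made up for illustration and is not part of mmdetection.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution (floor mode, dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


size = 511
# 7x7 convolution, stride 2, padding 3
after_conv = conv_out_size(size, kernel=7, stride=2, padding=3)
# 3x3 convolution inside the residual module, stride 2, padding 1
after_res = conv_out_size(after_conv, kernel=3, stride=2, padding=1)
print(after_conv, after_res)  # 256 128
```

Two stride-2 stages therefore shrink 511 to 128, which is why a stride of roughly 4 relates image coordinates to feature-map coordinates later on.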
Two hourglass modules follow, with some extra convolutions interleaved between them. The code is as follows:
for ind in range(self.num_stacks):  # 2
    single_hourglass = self.hourglass_modules[ind]
    out_conv = self.out_convs[ind]
    hourglass_feat = single_hourglass(inter_feat)
    out_feat = out_conv(hourglass_feat)
    out_feats.append(out_feat)
    if ind < self.num_stacks - 1:  # 2-1
        inter_feat = self.conv1x1s[ind](
            inter_feat) + self.remap_convs[ind](
                out_feat)
        inter_feat = self.inters[ind](self.relu(inter_feat))
return out_feats
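The hourglass structure is shown below.
As the figure shows, the whole structure is symmetric and nested. Each level consists of five operations: up1, low1, low2, low3 and up2. The code is shown below. It returns up1 + up2, and the nesting in the figure is implemented by having the low2 step recursively call the module itself.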
def forward(self, x):
    """Forward function."""
    up1 = self.up1(x)
    low1 = self.low1(x)
    low2 = self.low2(low1)
    low3 = self.low3(low2)
    up2 = self.up2(low3)
    return up1 + up2
Every hourglass module has the same input and output dimensions.
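To make the recursion and the "same shape in, same shape out" property concrete, here is a 1D toy analogue. It is only a sketch under strong assumptions: identity stands in for every residual block, a pairwise max stands in for the stride-2 downsampling, and `toy_hourglass` is a made-up name. Only the nesting and the sizes mirror the real module.

```python
def toy_hourglass(x, depth):
    """1D toy analogue of the recursive hourglass; returns up1 + up2.

    len(x) must be divisible by 2 ** depth.
    """
    up1 = list(x)                                               # skip branch, full resolution
    low1 = [max(x[i], x[i + 1]) for i in range(0, len(x), 2)]   # downsample by 2
    # The low2 step calls the hourglass itself: this is the nesting.
    low2 = toy_hourglass(low1, depth - 1) if depth > 1 else low1
    low3 = low2                                                 # identity placeholder
    up2 = [v for v in low3 for _ in range(2)]                   # nearest-neighbour upsample by 2
    return [a + b for a, b in zip(up1, up2)]


x = [1, 2, 3, 4, 5, 6, 7, 8]
out = toy_hourglass(x, depth=2)
print(len(x), len(out))  # 8 8
```

Because every level downsamples once and upsamples once, the output length always equals the input length, whatever the depth.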
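The overall flow is as follows.
Note that the output here differs between training and testing: training uses the outputs of both hourglass modules, whereas testing uses only the output of the second one. During training the output is [(batch_size, 256, 128, 128), (batch_size, 256, 128, 128)].
The hourglass output is then fed into two prediction modules, which detect the top-left corner and the bottom-right corner respectively.
The overall flow chart of the model is shown below.
The figure above only draws the output of the second hourglass module. During training the output of the first hourglass module is also fed into the prediction modules, and both outputs are processed in the same way, so the rest of the walkthrough follows the output of a single module.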
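The prediction modules for the top-left corner and the bottom-right corner do not share parameters. Apart from using different corner pooling operations, they are identical. Taking the top-left prediction module as an example, it consists of two parts: the top-left corner pooling module, and the part after it that outputs the Heatmaps, Embeddings and Offsets. First look at the top-left corner pooling module, shown in the figure below.
The code is as follows: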
direction1_conv = self.direction1_conv(x) # torch.Size([2, 128, 128, 128])
direction2_conv = self.direction2_conv(x) # torch.Size([2, 128, 128, 128])
direction1_feat = self.direction1_pool(direction1_conv) # torch.Size([2, 128, 128, 128])
direction2_feat = self.direction2_pool(direction2_conv) # torch.Size([2, 128, 128, 128])
aftpool_conv = self.aftpool_conv(direction1_feat + direction2_feat) # torch.Size([2, 256, 128, 128])
conv1 = self.conv1(x) # torch.Size([2, 256, 128, 128])
relu = self.relu(aftpool_conv + conv1) # torch.Size([2, 256, 128, 128])
conv2 = self.conv2(relu) # torch.Size([2, 256, 128, 128])
return conv2
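The output of the backbone, that is, of the hourglass module, has shape (batch_size, 256, 128, 128). It passes through two 3×3 Conv-BN-ReLU blocks and one 1×1 Conv-BN block, which correspond to self.direction1_conv, self.direction2_conv and self.conv1 in the code above. Their outputs have shapes (b,128,128,128), (b,128,128,128) and (b,256,128,128), and are the three feature maps in the second column of the figure above. The first two outputs are then fed into top corner pooling and left corner pooling respectively, which correspond to self.direction1_pool and self.direction2_pool in the code above.
Corner pooling was proposed by the author specifically to locate an object's top-left and bottom-right corners using prior knowledge. As the figure below shows, standing at the top-left corner and looking from left to right determines the object's top boundary, while looking from top to bottom determines its left boundary.
So the top-left corner pooling value at a given point is the sum of two maxima, both taken starting from that point: the maximum found looking horizontally to the right, and the maximum found looking vertically downward.
Taking left corner pooling as an example, the author uses a neat trick for the actual computation: traverse each row from right to left and replace the current pixel with the largest value seen so far. Top corner pooling does the same along each column, from bottom to top.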
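The running-maximum trick can be written out on plain nested lists. This is a sketch for illustration only: the function names are made up, and both poolings are applied to the same toy map here, whereas the real module applies them to two different convolution outputs.

```python
def left_pool(feat):
    """Each pixel becomes the max of itself and everything to its right."""
    out = [row[:] for row in feat]
    for row in out:
        for x in range(len(row) - 2, -1, -1):   # scan right -> left
            row[x] = max(row[x], row[x + 1])
    return out


def top_pool(feat):
    """Each pixel becomes the max of itself and everything below it."""
    out = [row[:] for row in feat]
    for y in range(len(out) - 2, -1, -1):       # scan bottom -> top
        for x in range(len(out[y])):
            out[y][x] = max(out[y][x], out[y + 1][x])
    return out


feat = [[2, 1, 3, 0],
        [5, 4, 1, 1],
        [1, 6, 3, 0]]
left = left_pool(feat)   # [[3, 3, 3, 0], [5, 4, 1, 1], [6, 6, 3, 0]]
top = top_pool(feat)     # [[5, 6, 3, 1], [5, 6, 3, 1], [1, 6, 3, 0]]
# Top-left corner pooling is the element-wise sum of the two results.
top_left = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(top, left)]
```

Each scan touches every pixel once, so the cost is linear in the size of the feature map rather than quadratic.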
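The outputs of top corner pooling and left corner pooling are added together and passed through a 3×3 Conv-BN. The result is added to the output of the initial 1×1 Conv-BN, followed by a ReLU and then another 3×3 Conv-BN-ReLU. This yields the gray feature map in the middle of the prediction module in the figure, which is the conv2 returned by the code, with shape (b,256,128,128).
The corner pooling result is then fed into three branches, each a 3×3 Conv-ReLU followed by a 1×1 Conv, producing the Heatmaps, Embeddings and Offsets with shapes (b, num_class, 128, 128), (b, 1, 128, 128) and (b, 2, 128, 128) respectively.
At this point we have the final outputs of the model. The bottom-right prediction module produces three outputs of the same shapes, giving six outputs in total.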
result_list = [tl_heat, br_heat, tl_emb, br_emb, tl_off, br_off]
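During training, the output of the first hourglass module is also fed directly into the subsequent prediction modules, producing the same six outputs as above. These outputs are then compared against the ground truth to compute the loss.
Before computing the loss, we first have to build the ground truth that corresponds to the model outputs. Start with the heatmap ground truth. The heatmap has one channel per class. If the original image contains an object of some class, the location of its corner (mapped through the stride) on the corresponding channel should be positive and everything else negative. But this is too strict: a prediction determined by a point near the corner may still have a very large IoU with the ground truth, say 0.9, and labeling such a point as negative clearly hurts training. So instead of treating every location other than the corner as a negative sample with equal penalty, the author reduces the penalty for locations within a certain radius of the corner. The radius is determined from the IoU with the ground truth, and the labels inside the circle are given by a modified Gaussian proposed by the author: the closer to the center, the larger the label, and the farther away, the closer the label gets to that of a negative sample.
For the radius computation, see https://zhuanlan.zhihu.com/p/96856635?from_voters_page=true. There are three cases: both corners inside the ground-truth box, both corners outside it, and one inside with the other outside. Note that although the figure draws a circle, the actual computation treats the region as a square. Given the chosen IoU, solve the quadratic equation (in one unknown) for each of the three cases, and take the smallest result as the final radius r. The code is as follows: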
from math import sqrt


def gaussian_radius(det_size, min_overlap):
    r"""Generate 2D gaussian radius.

    This function is modified from the `official github repo`_.

    Given ``min_overlap``, radius could computed by a quadratic equation
    according to Vieta's formulas.

    There are 3 cases for computing gaussian radius, details are following:

    - Explanation of figure: ``lt`` and ``br`` indicates the left-top and
      bottom-right corner of ground truth box. ``x`` indicates the
      generated corner at the limited position when ``radius=r``.

    - Case1: one corner is inside the gt box and the other is outside.

    .. code:: text

        |<   width   >|

        lt-+----------+         -
        |  |          |         ^
        +--x----------+--+
        |  |          |  |
        |  |          |  |    height
        |  | overlap  |  |
        |  |          |  |
        |  |          |  |      v
        +--+---------br--+      -
           |          |  |
           +----------+--x

    To ensure IoU of generated box and gt box is larger than ``min_overlap``:

    .. math::
        \cfrac{(w-r)*(h-r)}{w*h+(w+h)r-r^2} \ge {iou} \quad\Rightarrow\quad
        {r^2-(w+h)r+\cfrac{1-iou}{1+iou}*w*h} \ge 0 \\
        {a} = 1,\quad{b} = {-(w+h)},\quad{c} = {\cfrac{1-iou}{1+iou}*w*h}
        {r} \le \cfrac{-b-\sqrt{b^2-4*a*c}}{2*a}

    - Case2: both two corners are inside the gt box.

    .. code:: text

        |<   width   >|

        lt-+----------+         -
        |  |          |         ^
        +--x-------+  |
        |  |       |  |
        |  |overlap|  |       height
        |  |       |  |
        |  +-------x--+
        |          |  |         v
        +----------+-br         -

    To ensure IoU of generated box and gt box is larger than ``min_overlap``:

    .. math::
        \cfrac{(w-2*r)*(h-2*r)}{w*h} \ge {iou} \quad\Rightarrow\quad
        {4r^2-2(w+h)r+(1-iou)*w*h} \ge 0 \\
        {a} = 4,\quad {b} = {-2(w+h)},\quad {c} = {(1-iou)*w*h}
        {r} \le \cfrac{-b-\sqrt{b^2-4*a*c}}{2*a}

    - Case3: both two corners are outside the gt box.

    .. code:: text

           |<   width   >|

        x--+----------------+
        |  |                |
        +-lt-------------+  |   -
        |  |             |  |   ^
        |  |             |  |
        |  |   overlap   |  | height
        |  |             |  |
        |  |             |  |   v
        |  +------------br--+   -
        |                |  |
        +----------------+--x

    To ensure IoU of generated box and gt box is larger than ``min_overlap``:

    .. math::
        \cfrac{w*h}{(w+2*r)*(h+2*r)} \ge {iou} \quad\Rightarrow\quad
        {4*iou*r^2+2*iou*(w+h)r+(iou-1)*w*h} \le 0 \\
        {a} = {4*iou},\quad {b} = {2*iou*(w+h)},\quad {c} = {(iou-1)*w*h} \\
        {r} \le \cfrac{-b+\sqrt{b^2-4*a*c}}{2*a}

    Args:
        det_size (list[int]): Shape of object.
        min_overlap (float): Min IoU with ground truth for boxes generated by
            keypoints inside the gaussian kernel.

    Returns:
        radius (int): Radius of gaussian kernel.
    """
    height, width = det_size

    # Case 1: b1 stores -b, so r1 = (-b - sqrt(...)) / (2a).
    a1 = 1
    b1 = (height + width)
    c1 = width * height * (1 - min_overlap) / (1 + min_overlap)
    sq1 = sqrt(b1**2 - 4 * a1 * c1)
    r1 = (b1 - sq1) / (2 * a1)

    # Case 2: b2 stores -b as well.
    a2 = 4
    b2 = 2 * (height + width)
    c2 = (1 - min_overlap) * width * height
    sq2 = sqrt(b2**2 - 4 * a2 * c2)
    r2 = (b2 - sq2) / (2 * a2)

    # Case 3: b3 stores -b, so r3 = (-b + sqrt(...)) / (2a).
    a3 = 4 * min_overlap
    b3 = -2 * min_overlap * (height + width)
    c3 = (min_overlap - 1) * width * height
    sq3 = sqrt(b3**2 - 4 * a3 * c3)
    r3 = (b3 + sq3) / (2 * a3)

    return min(r1, r2, r3)
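As a sanity check on the three quadratics, the sketch below plugs each root back into the IoU expression for its own case. The helper `iou_at_radius` and the sample box size are made up for illustration. The radii are written out inline so that the block runs on its own.

```python
from math import sqrt


def iou_at_radius(w, h, r, case):
    """IoU between a w x h gt box and the box whose corners are moved by r."""
    if case == 1:    # one corner inside, one outside: box shifted diagonally
        inter = (w - r) * (h - r)
        return inter / (2 * w * h - inter)
    if case == 2:    # both corners moved inward
        return (w - 2 * r) * (h - 2 * r) / (w * h)
    return w * h / ((w + 2 * r) * (h + 2 * r))   # case 3: both moved outward


w, h, iou = 40.0, 20.0, 0.3   # arbitrary example values
r1 = ((h + w) - sqrt((h + w) ** 2 - 4 * w * h * (1 - iou) / (1 + iou))) / 2
r2 = (2 * (h + w) - sqrt(4 * (h + w) ** 2 - 16 * (1 - iou) * w * h)) / 8
r3 = (-2 * iou * (h + w)
      + sqrt((2 * iou * (h + w)) ** 2 + 16 * iou * (1 - iou) * w * h)) / (8 * iou)

for case, r in enumerate((r1, r2, r3), start=1):
    # Each root sits exactly on the boundary where IoU == min_overlap.
    print(case, round(r, 3), round(iou_at_radius(w, h, r, case), 3))
```

Each root reproduces the target IoU in its own case, which is why taking the minimum of the three gives a radius that is safe in all of them.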
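Once the radius r is fixed, the label values inside this region are computed from an unnormalized 2D Gaussian centered on the corner, $e^{-(x^2+y^2)/(2\sigma^2)}$, where $\sigma$ is derived from the radius r (the CornerNet paper sets $\sigma$ to 1/3 of the radius).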
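A minimal sketch of writing such a Gaussian label into a heatmap follows. The function name `draw_gaussian` is made up, and sigma = radius / 3 follows the paper's description. Actual implementations may derive sigma slightly differently, so treat the exact value as an assumption.

```python
from math import exp


def draw_gaussian(heatmap, cx, cy, radius):
    """Write an unnormalized 2D Gaussian centred on (cx, cy) into heatmap."""
    sigma = max(radius / 3.0, 1e-6)   # assumption: sigma = r / 3, as in the paper
    height, width = len(heatmap), len(heatmap[0])
    # The affected region is a square of side 2 * radius + 1, clipped to the map.
    for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
            value = exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
            # Where the regions of two objects overlap, keep the larger label.
            heatmap[y][x] = max(heatmap[y][x], value)
    return heatmap


heatmap = [[0.0] * 7 for _ in range(7)]
draw_gaussian(heatmap, cx=3, cy=3, radius=2)
```

The corner itself gets label 1, labels decay with distance, and everything outside the square stays 0.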
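Next come the offset labels, which are easy to understand. When a corner's coordinates in the original image are mapped onto the output feature map according to the stride, they are rounded down, which introduces an error relative to the original value. The offset label is exactly this difference: $o_k = \left(\frac{x_k}{n} - \left\lfloor \frac{x_k}{n} \right\rfloor,\ \frac{y_k}{n} - \left\lfloor \frac{y_k}{n} \right\rfloor\right)$, where $(x_k, y_k)$ are the coordinates of corner k in the original image and n is the stride.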
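A worked example of the offset label, as a sketch: the helper name `corner_offset` and the sample coordinates are made up, and a stride of 4 is assumed.

```python
def corner_offset(x, y, stride):
    """Map an image-space corner to the feature map.

    Returns the integer cell index and the fractional offset label.
    """
    fx, fy = x / stride, y / stride
    ix, iy = int(fx), int(fy)        # int() floors non-negative coordinates
    return (ix, iy), (fx - ix, fy - iy)


idx, off = corner_offset(x=203, y=118, stride=4)
print(idx, off)  # (50, 29) (0.75, 0.5)
```

At test time the predicted offset is added back to the integer location before multiplying by the stride, which recovers the sub-cell precision lost to rounding.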
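The embedding labels are even easier to understand: they simply associate the pair of corners that belong to the same object. For each object, the feature-map coordinates of its two corners are put into a list.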
corner_match.append([[top_idx, left_idx], [bottom_idx, right_idx]])
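Here top_idx is the y coordinate of an object's top-left corner after it has been mapped onto the output feature map and truncated to an integer.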
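The author designs a variant of focal loss as the loss for the corner heatmaps:
$$L_{det} = -\frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{H}\sum_{j=1}^{W}\begin{cases}(1-p_{cij})^{\alpha}\log(p_{cij}) & \text{if } y_{cij}=1\\ (1-y_{cij})^{\beta}\,(p_{cij})^{\alpha}\log(1-p_{cij}) & \text{otherwise}\end{cases}$$
Here N is the number of objects in the image, $\alpha$ and $\beta$ are hyperparameters set to 2 and 4 respectively in the paper, $p_{cij}$ is the predicted value for class c at location (i, j), and $y_{cij}$ is the corresponding Gaussian label. In the code, alpha and gamma play the roles of $\alpha$ and $\beta$. The code is as follows: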
eps = 1e-12
alpha = 2.0
gamma = 4.0
pos_weights = gaussian_target.eq(1)
neg_weights = (1 - gaussian_target).pow(gamma)
pos_loss = -(pred + eps).log() * (1 - pred).pow(alpha) * pos_weights
neg_loss = -(1 - pred + eps).log() * pred.pow(alpha) * neg_weights
heatmap_loss = pos_loss + neg_loss
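To see how the Gaussian label softens the penalty near a corner, here is a scalar version of the same computation. It is a sketch: the function name `corner_focal_loss` and the sample predictions are made up for illustration.

```python
from math import log


def corner_focal_loss(pred, target, alpha=2.0, gamma=4.0, eps=1e-12):
    """Per-location loss mirroring the tensor code above, for one scalar."""
    if target == 1:
        # Positive location: only the positive term is active.
        return -log(pred + eps) * (1 - pred) ** alpha
    # Negative location: weight shrinks as the Gaussian label approaches 1.
    return -log(1 - pred + eps) * pred ** alpha * (1 - target) ** gamma


far = corner_focal_loss(pred=0.8, target=0.0)    # confident false positive far from any corner
near = corner_focal_loss(pred=0.8, target=0.9)   # same prediction right next to a corner
print(far > near)  # True
```

With a label of 0.9 the negative term is scaled by (1 - 0.9)^4, so a near-miss costs about four orders of magnitude less than the same prediction far from every corner.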
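The embedding loss is given by
$$L_{pull} = \frac{1}{N}\sum_{k=1}^{N}\left[(e_{t_k}-e_k)^2 + (e_{b_k}-e_k)^2\right],\qquad L_{push} = \frac{1}{N(N-1)}\sum_{k=1}^{N}\sum_{\substack{j=1\\ j\neq k}}^{N}\max\left(0,\ \Delta-\lvert e_k-e_j\rvert\right)$$
Here $e_{t_k}$ is the value of the predicted map at the top-left corner of object k, $e_{b_k}$ is the value at its bottom-right corner, $e_k$ is the mean of the two, and $\Delta$ is the margin (1 in the code). The author designs two losses, pull and push: the idea is to pull the pair of corners belonging to the same object together and to push the corners of different objects apart. The code is as follows: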
import torch
import torch.nn.functional as F


def ae_loss_per_image(tl_preds, br_preds, match):
    """Associative Embedding Loss in one image.

    Associative Embedding Loss including two parts: pull loss and push loss.
    Pull loss makes embedding vectors from same object closer to each other.
    Push loss distinguish embedding vector from different objects, and makes
    the gap between them is large enough.

    During computing, usually there are 3 cases:
        - no object in image: both pull loss and push loss will be 0.
        - one object in image: push loss will be 0 and pull loss is computed
            by the two corner of the only object.
        - more than one objects in image: pull loss is computed by corner pairs
            from each object, push loss is computed by each object with all
            other objects. We use confusion matrix with 0 in diagonal to
            compute the push loss.

    Args:
        tl_preds (tensor): Embedding feature map of left-top corner.
        br_preds (tensor): Embedding feature map of bottom-right corner.
        match (list): Downsampled coordinates pair of each ground truth box.
    """
    tl_list, br_list, me_list = [], [], []
    if len(match) == 0:  # no object in image
        pull_loss = tl_preds.sum() * 0.
        push_loss = tl_preds.sum() * 0.
    else:
        for m in match:
            [tl_y, tl_x], [br_y, br_x] = m
            # torch.Size([1]) -> torch.Size([1, 1]), e.g. tensor([[0.0916]])
            tl_e = tl_preds[:, tl_y, tl_x].view(-1, 1)
            br_e = br_preds[:, br_y, br_x].view(-1, 1)
            tl_list.append(tl_e)
            br_list.append(br_e)
            me_list.append((tl_e + br_e) / 2.0)

        tl_list = torch.cat(tl_list)  # torch.Size([3, 1])
        br_list = torch.cat(br_list)
        me_list = torch.cat(me_list)

        assert tl_list.size() == br_list.size()

        # N is object number in image, M is dimension of embedding vector
        N, M = tl_list.size()  # 3,1

        pull_loss = (tl_list - me_list).pow(2) + (br_list - me_list).pow(2)
        pull_loss = pull_loss.sum() / N

        margin = 1  # exp setting of CornerNet, details in section 3.3 of paper

        # confusion matrix of push loss
        conf_mat = me_list.expand((N, N, M)).permute(1, 0, 2) - me_list
        conf_weight = 1 - torch.eye(N).type_as(me_list)
        conf_mat = conf_weight * (margin - conf_mat.sum(-1).abs())

        if N > 1:  # more than one object in current image
            push_loss = F.relu(conf_mat).sum() / (N * (N - 1))
        else:
            push_loss = tl_preds.sum() * 0.

    return pull_loss, push_loss
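The same pull and push terms can be followed by hand on scalar embeddings. The sketch below is a plain-Python illustration with made-up values. The function name `pull_push` is hypothetical, and a margin of 1 is assumed, as in the code above.

```python
def pull_push(tl, br, margin=1.0):
    """Pull/push losses for 1-D embeddings; tl[k] and br[k] belong to object k."""
    n = len(tl)
    if n == 0:                       # no object: both losses are zero
        return 0.0, 0.0
    mean = [(t + b) / 2 for t, b in zip(tl, br)]
    # Pull: squared distance of each corner embedding to its object's mean.
    pull = sum((t - m) ** 2 + (b - m) ** 2
               for t, b, m in zip(tl, br, mean)) / n
    push = 0.0
    if n > 1:
        # Push: hinge on the distance between the means of different objects.
        push = sum(max(0.0, margin - abs(mean[k] - mean[j]))
                   for k in range(n) for j in range(n) if j != k) / (n * (n - 1))
    return pull, push


# Two objects whose mean embeddings (0.2 and 1.5) are farther apart than the margin.
pull, push = pull_push(tl=[0.1, 1.0], br=[0.3, 2.0])
```

Here the push loss is zero because the two means already differ by more than the margin, while the pull loss still penalizes the spread within each pair.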
The offsets loss is just an ordinary SmoothL1Loss, so it is not covered further here.
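def _local_maximum(self, heat, kernel=3):
    # Max pooling with stride 1 keeps the map size; a location survives only
    # if it equals the maximum of its kernel x kernel neighbourhood.
    pad = (kernel - 1) // 2
    hmax = F.max_pool2d(heat, kernel, stride=1, padding=pad)
    keep = (hmax == heat).float()
    return heat * keep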
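Select the top_k (100) top-left and bottom-right locations by confidence, regardless of class.
Refine these 200 locations with their corresponding offsets, then map them back to the original image (multiply by the stride and subtract the padding, 127/2).
Pair each of the 100 top-left locations with each of the 100 bottom-right locations, giving 10000 possible combinations. The score of a combination is the mean of its top-left and bottom-right scores.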
scores = (tl_scores + br_scores) / 2
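Compute the distance between the embeddings of the paired corners, and rule out any of the 10000 combinations whose distance exceeds the threshold (0.5), whose two corners belong to different classes, or whose coordinates are inconsistent (the bottom-right x or y coordinate is smaller than the top-left one).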
scores[cls_inds] = -1
scores[width_inds] = -1
scores[height_inds] = -1
scores[dist_inds] = -1
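From these 10000 combinations, select the top 1000 by score, and use the corresponding indices to pick out the 1000 bboxes and classes.
Apply soft-NMS per class, then keep the first max_per_img=100.
Keep the results whose score is greater than thresh=0.3 as the final output.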
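The pairing and filtering steps above can be sketched in plain Python. This is only an illustration under assumptions: the dictionary layout of each corner and the function name `pair_corners` are made up, embeddings are scalars, and soft-NMS and the final score threshold are left out.

```python
def pair_corners(tl, br, dist_thresh=0.5, top_n=1000):
    """tl / br: lists of dicts with keys score, cls, x, y, emb (hypothetical layout)."""
    candidates = []
    for t in tl:
        for b in br:
            score = (t["score"] + b["score"]) / 2           # mean of the two corner scores
            rejected = (
                t["cls"] != b["cls"]                        # corners of different classes
                or b["x"] < t["x"] or b["y"] < t["y"]       # bottom-right above or left of top-left
                or abs(t["emb"] - b["emb"]) > dist_thresh   # embeddings too far apart
            )
            if rejected:
                score = -1
            candidates.append((score, t["cls"], (t["x"], t["y"], b["x"], b["y"])))
    candidates.sort(key=lambda c: c[0], reverse=True)
    # Simplification: drop rejected pairs here instead of via a later threshold.
    return [c for c in candidates[:top_n] if c[0] >= 0]


tl = [{"score": 0.9, "cls": 0, "x": 10, "y": 10, "emb": 0.1},
      {"score": 0.8, "cls": 1, "x": 50, "y": 50, "emb": 2.0}]
br = [{"score": 0.7, "cls": 0, "x": 40, "y": 30, "emb": 0.2},
      {"score": 0.6, "cls": 1, "x": 90, "y": 80, "emb": 2.1}]
dets = pair_corners(tl, br)   # two boxes survive, one per class
```

Of the four candidate pairs, the two cross-class pairs are rejected, leaving one box per object, sorted by score.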
https://zhuanlan.zhihu.com/p/188587434
https://zhuanlan.zhihu.com/p/103705172