paper: High Performance Visual Tracking with Siamese Region Proposal Network
code: https://github.com/STVIR/pysot [another reproduction that includes both training and testing code]
A more readable version I adapted from pysot myself: https://github.com/laisimiao/siamrpn.pytorch
SiamRPN is a classic of the Siamese tracking family and the earliest anchor-based method, so once you understand its key operations it also helps with the later DaSiamRPN, SiamMask, and SiamRPN++. Here I note down several key points of the SiamRPN code:
The figure above already makes the overall model pipeline quite clear:
1. The template frame and the detection frame pass through the same Siamese network to obtain features, which then go through the RPN's classification branch and regression branch; the template feature acts as a kernel that is correlated over the detection feature.
2. The classification branch predicts which anchors on the original image have an IoU with the target above a certain threshold; the corresponding locations on the final feature map are labeled 1. The regression branch predicts the x, y, w, h offsets between each anchor and the target box.
In the actual code, however, the model is built slightly differently (see the red notes after the underline in the figure above). This does not affect the shapes of the predicted tensors, which once again shows the black-box nature of CNNs [as long as it trains and works, plus a bit of explanation afterwards]. The code below reflects this; it is mainly in pysot/models/head/rpn.py, reached through get_rpn_head:
import torch.nn as nn
from pysot.core.xcorr import xcorr_depthwise

class DepthwiseXCorr(nn.Module):
    def __init__(self, in_channels, hidden, out_channels, kernel_size=3, hidden_kernel_size=5):
        # in_channels: 256, hidden: 256, out_channels: 2*K (K is the number of anchors)
        super(DepthwiseXCorr, self).__init__()
        # adjust layer for the template (kernel) feature
        self.conv_kernel = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=kernel_size, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
        )
        # adjust layer for the search (detection) feature
        self.conv_search = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=kernel_size, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
        )
        # 1x1 conv head applied to the correlation response
        self.head = nn.Sequential(
            nn.Conv2d(hidden, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, out_channels, kernel_size=1)
        )

    def forward(self, kernel, search):
        kernel = self.conv_kernel(kernel)
        search = self.conv_search(search)
        feature = xcorr_depthwise(search, kernel)
        out = self.head(feature)
        return out
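The key operation here is xcorr_depthwise (from pysot/core/xcorr.py): every channel of the template feature is used as its own convolution kernel and slid over the corresponding channel of the search feature. Below is a minimal sketch of that idea using a grouped conv2d; the function name xcorr_depthwise_sketch and the example shapes are mine, and the real pysot implementation may differ in details:

import torch
import torch.nn.functional as F

def xcorr_depthwise_sketch(search, kernel):
    # search: (N, C, Hs, Ws), kernel: (N, C, Hk, Wk)
    n, c = kernel.size(0), kernel.size(1)
    # fold batch into channels so every (sample, channel) pair gets its own group
    search = search.view(1, n * c, search.size(2), search.size(3))
    kernel = kernel.view(n * c, 1, kernel.size(2), kernel.size(3))
    out = F.conv2d(search, kernel, groups=n * c)   # per-channel cross-correlation
    return out.view(n, c, out.size(2), out.size(3))

# e.g. 256-channel features: 4x4 template kernel over a 20x20 search map -> 17x17 response
resp = xcorr_depthwise_sketch(torch.randn(1, 256, 20, 20), torch.randn(1, 256, 4, 4))
print(resp.shape)  # torch.Size([1, 256, 17, 17])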
I had always known that anchors are placed over the original image (the detection frame in the figure above), but only after reading the code did I realize this also involves mapping points on the feature map back onto the original image, and then placing anchors of different scales and aspect ratios centered on those mapped-back points [it seems this step is indispensable whether the method is anchor-based or anchor-free]. Let's see how the code implements it, mainly in generate_all_anchors in pysot/utils/anchor.py:
# a method of the Anchors class; corner2center/center2corner come from pysot/utils/bbox.py
def generate_all_anchors(self, im_c, size):
    """
    im_c: image center -> cfg.TRAIN.SEARCH_SIZE // 2
    size: score/feature map size -> cfg.TRAIN.OUTPUT_SIZE
    """
    if self.image_center == im_c and self.size == size:
        return False
    self.image_center = im_c
    self.size = size

    a0x = im_c - size // 2 * self.stride
    ori = np.array([a0x] * 4, dtype=np.float32)
    # self.anchors holds the K anchors of a single location;
    # ori is the top-left feature-map location mapped back onto the detection frame
    zero_anchors = self.anchors + ori

    x1 = zero_anchors[:, 0]
    y1 = zero_anchors[:, 1]
    x2 = zero_anchors[:, 2]
    y2 = zero_anchors[:, 3]

    x1, y1, x2, y2 = map(lambda x: x.reshape(self.anchor_num, 1, 1),
                         [x1, y1, x2, y2])
    cx, cy, w, h = corner2center([x1, y1, x2, y2])

    disp_x = np.arange(0, size).reshape(1, 1, -1) * self.stride
    disp_y = np.arange(0, size).reshape(1, -1, 1) * self.stride

    cx = cx + disp_x
    cy = cy + disp_y

    # broadcast to (anchor_num, size, size)
    zero = np.zeros((self.anchor_num, size, size), dtype=np.float32)
    cx, cy, w, h = map(lambda x: x + zero, [cx, cy, w, h])
    x1, y1, x2, y2 = center2corner([cx, cy, w, h])

    self.all_anchors = (np.stack([x1, y1, x2, y2]).astype(np.float32),
                        np.stack([cx, cy, w, h]).astype(np.float32))
    return True
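To make the feature-map-to-image mapping concrete, here is a small worked example; the values SEARCH_SIZE = 255, OUTPUT_SIZE = 25 and stride = 8 are assumed defaults from the pysot config:

stride, search_size, output_size = 8, 255, 25   # assumed default config values
im_c = search_size // 2                         # 127, center of the detection frame
a0x = im_c - output_size // 2 * stride          # 127 - 12*8 = 31
# feature-map location (i, j) maps to image point (a0x + j*stride, a0x + i*stride)
centers_x = [a0x + j * stride for j in range(output_size)]
print(centers_x[0], centers_x[-1])              # 31 223 -> symmetric around 127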
The figure below is a diagram I drew myself (left: the detection frame; right: the same frame covered with all the anchors):
Next is how the paper describes the labels:
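Restating the paper's definitions here (re-typeset, with $A=(A_x,A_y,A_w,A_h)$ an anchor box and $T=(T_x,T_y,T_w,T_h)$ the ground-truth box), the regression targets are the normalized distances
$$\delta[0]=\frac{T_x-A_x}{A_w},\quad \delta[1]=\frac{T_y-A_y}{A_h},\quad \delta[2]=\ln\frac{T_w}{A_w},\quad \delta[3]=\ln\frac{T_h}{A_h}$$
and for classification, anchors whose IoU with the target exceeds $th_{hi}$ (0.6) are labeled positive while those below $th_{lo}$ (0.3) are labeled negative, with at most 16 positives and 64 samples in total kept from one training pair.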
Now let's see how the code implements this, mainly in the __call__ method of pysot/datasets/anchor_target.py:
# the __call__ method of AnchorTarget; IoU and corner2center come from pysot/utils/bbox.py
def __call__(self, target, size, neg=False):
    anchor_num = len(cfg.ANCHOR.RATIOS) * len(cfg.ANCHOR.SCALES)

    # -1 ignore, 0 negative, 1 positive
    cls = -1 * np.ones((anchor_num, size, size), dtype=np.int64)
    delta = np.zeros((4, anchor_num, size, size), dtype=np.float32)
    delta_weight = np.zeros((anchor_num, size, size), dtype=np.float32)

    def select(position, keep_num=16):
        num = position[0].shape[0]
        if num <= keep_num:
            return position, num
        slt = np.arange(num)
        np.random.shuffle(slt)
        slt = slt[:keep_num]
        return tuple(p[slt] for p in position), keep_num

    tcx, tcy, tw, th = corner2center(target)

    if neg:
        # l = size // 2 - 3
        # r = size // 2 + 3 + 1
        # cls[:, l:r, l:r] = 0

        cx = size // 2
        cy = size // 2
        cx += int(np.ceil((tcx - cfg.TRAIN.SEARCH_SIZE // 2) /
                  cfg.ANCHOR.STRIDE + 0.5))
        cy += int(np.ceil((tcy - cfg.TRAIN.SEARCH_SIZE // 2) /
                  cfg.ANCHOR.STRIDE + 0.5))
        l = max(0, cx - 3)
        r = min(size, cx + 4)
        u = max(0, cy - 3)
        d = min(size, cy + 4)
        cls[:, u:d, l:r] = 0

        neg, neg_num = select(np.where(cls == 0), cfg.TRAIN.NEG_NUM)
        cls[:] = -1
        cls[neg] = 0

        overlap = np.zeros((anchor_num, size, size), dtype=np.float32)
        return cls, delta, delta_weight, overlap

    anchor_box = self.anchors.all_anchors[0]
    anchor_center = self.anchors.all_anchors[1]
    x1, y1, x2, y2 = anchor_box[0], anchor_box[1], \
        anchor_box[2], anchor_box[3]
    cx, cy, w, h = anchor_center[0], anchor_center[1], \
        anchor_center[2], anchor_center[3]

    # regression targets: normalized offsets between every anchor and the target box
    delta[0] = (tcx - cx) / w
    delta[1] = (tcy - cy) / h
    delta[2] = np.log(tw / w)
    delta[3] = np.log(th / h)

    overlap = IoU([x1, y1, x2, y2], target)

    pos = np.where(overlap > cfg.TRAIN.THR_HIGH)
    neg = np.where(overlap < cfg.TRAIN.THR_LOW)

    pos, pos_num = select(pos, cfg.TRAIN.POS_NUM)
    neg, neg_num = select(neg, cfg.TRAIN.TOTAL_NUM - cfg.TRAIN.POS_NUM)

    cls[pos] = 1
    delta_weight[pos] = 1. / (pos_num + 1e-6)

    cls[neg] = 0
    return cls, delta, delta_weight, overlap
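The positive/negative assignment follows exactly this rule: IoU above cfg.TRAIN.THR_HIGH gives label 1, below cfg.TRAIN.THR_LOW gives label 0, anything in between stays -1 (ignored), and select() then subsamples them. A tiny self-contained illustration (the thresholds 0.6 and 0.3 are assumed defaults):

import numpy as np

THR_HIGH, THR_LOW = 0.6, 0.3               # assumed default thresholds
overlap = np.array([[0.05, 0.35, 0.72],
                    [0.61, 0.10, 0.45]])   # toy IoU values for 6 anchors
cls = -1 * np.ones_like(overlap, dtype=np.int64)   # -1 = ignore
cls[np.where(overlap > THR_HIGH)] = 1              # positives
cls[np.where(overlap < THR_LOW)] = 0               # negatives
print(cls)
# [[ 0 -1  1]
#  [ 1  0 -1]]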
Below is a visualization of the cls labels, with each of the K (K = 5) channels plotted separately; the delta regression targets are harder to visualize because of their higher dimensionality [yellow = 1, purple = 0, teal = -1]:
Below is a visualization of the four delta channels for one particular anchor (a fixed K); since the w offset and h offset are negative there, those two maps appear entirely purple:
Next is the loss function as described in the paper:
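Summarizing in my own words rather than quoting the paper verbatim: the classification and regression branches are optimized jointly,
$$loss = L_{cls} + \lambda L_{reg}$$
where $L_{cls}$ is the cross-entropy loss over the anchor labels, $L_{reg}$ is a smooth $L_1$ loss on the normalized offsets $\delta[0],\dots,\delta[3]$ defined above, and $\lambda$ is a hyper-parameter balancing the two terms.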
Now let's look at how the code implements it; the entry point is the forward method in pysot/models/model_builder.py:
cls_loss = select_cross_entropy_loss(cls, label_cls)
loc_loss = weight_l1_loss(loc, label_loc, label_loc_weight)
One look at the code shows that this is the classic binary cross-entropy loss; there are just three points that deserve attention:
import torch
import torch.nn.functional as F

def get_cls_loss(pred, label, select):
    if len(select.size()) == 0 or \
            select.size() == torch.Size([0]):
        return 0
    pred = torch.index_select(pred, 0, select)
    label = torch.index_select(label, 0, select)
    return F.nll_loss(pred, label)

def select_cross_entropy_loss(pred, label):
    """
    :param pred: (N, K, 17, 17, 2) log-probabilities
    :param label: (N, K, 17, 17) in {-1, 0, 1}
    :return: balanced cross-entropy over the selected positives and negatives
    """
    pred = pred.view(-1, 2)
    label = label.view(-1)
    pos = label.data.eq(1).nonzero().squeeze().cuda()  # indices of positive anchors
    neg = label.data.eq(0).nonzero().squeeze().cuda()  # indices of negative anchors
    loss_pos = get_cls_loss(pred, label, pos)
    loss_neg = get_cls_loss(pred, label, neg)
    return loss_pos * 0.5 + loss_neg * 0.5
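One of those points: F.nll_loss expects log-probabilities, so the raw (N, 2K, 17, 17) cls output has to be reshaped and passed through log_softmax before it reaches select_cross_entropy_loss. pysot does this with a small helper in model_builder.py; the sketch below (written here as a free function, without the self argument) shows the idea:

import torch.nn.functional as F

def log_softmax(cls):
    # cls: (N, 2K, H, W) raw scores from the RPN classification branch
    b, a2, h, w = cls.size()
    cls = cls.view(b, 2, a2 // 2, h, w)            # split the 2K channels into (2, K)
    cls = cls.permute(0, 2, 3, 4, 1).contiguous()  # -> (N, K, H, W, 2)
    return F.log_softmax(cls, dim=4)               # log-probabilities over the 2 classes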
The code does not use the smooth L1 loss but a plain L1 loss, i.e. $L_1\ \mathrm{loss}=\frac{1}{n}\sum_{i=1}^{n}\left|f(x_i)-y_i\right|$. There is also a point worth noting here:
def weight_l1_loss(pred_loc, label_loc, loss_weight):
    """
    :param pred_loc: (N, 4K, 17, 17)
    :param label_loc: (N, 4, K, 17, 17)
    :param loss_weight: (N, K, 17, 17), 1/pos_num on positive anchors, 0 elsewhere
    :return: L1 loss averaged over the batch
    """
    b, _, sh, sw = pred_loc.size()
    pred_loc = pred_loc.view(b, 4, -1, sh, sw)
    diff = (pred_loc - label_loc).abs()
    diff = diff.sum(dim=1).view(b, -1, sh, sw)   # sum the 4 offset channels
    loss = diff * loss_weight                    # only positive anchors contribute
    return loss.sum().div(b)
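A quick shape sanity check of the function above, assuming K = 5 anchors and the 17x17 output size used in the docstring (all tensors here are just random placeholders):

import torch

K, H = 5, 17
pred_loc = torch.randn(2, 4 * K, H, H)
label_loc = torch.randn(2, 4, K, H, H)
loss_weight = torch.zeros(2, K, H, H)
loss_weight[:, 0, 8, 8] = 1.0   # pretend each sample has a single positive anchor
print(weight_l1_loss(pred_loc, label_loc, loss_weight))  # a scalar tensor

Because delta_weight is 1/pos_num on the positive anchors and 0 elsewhere, loss.sum().div(b) ends up being the per-sample average L1 error over the positive anchors.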
This part lives in pysot/tracker/siamrpn_tracker.py and mainly implements two methods, init and track.
init exploits the prior information from the first frame, i.e. the first image and its ground-truth bbox, which makes the whole thing a kind of one-shot detection: the template frame is fixed from then on and acts like a kernel:
def init(self, img, bbox):
    """
    args:
        img(np.ndarray): BGR image
        bbox: (x, y, w, h) bbox
    """
    # self.center_pos and self.size will be updated frame by frame in track()
    self.center_pos = np.array([bbox[0] + (bbox[2] - 1) / 2,
                                bbox[1] + (bbox[3] - 1) / 2])
    self.size = np.array([bbox[2], bbox[3]])

    # calculate z crop size (exemplar region with added context)
    w_z = self.size[0] + cfg.TRACK.CONTEXT_AMOUNT * np.sum(self.size)
    h_z = self.size[1] + cfg.TRACK.CONTEXT_AMOUNT * np.sum(self.size)
    s_z = round(np.sqrt(w_z * h_z))

    # calculate channel average (used to pad crops that go out of the image)
    self.channel_average = np.mean(img, axis=(0, 1))

    # get crop
    z_crop = self.get_subwindow(img, self.center_pos,
                                cfg.TRACK.EXEMPLAR_SIZE,
                                s_z, self.channel_average)
    self.model.template(z_crop)
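As a concrete example of the context-padded exemplar crop size (CONTEXT_AMOUNT = 0.5 is the assumed default), a 100x50 target gives:

import numpy as np

context_amount = 0.5                   # assumed cfg.TRACK.CONTEXT_AMOUNT
w, h = 100, 50
w_z = w + context_amount * (w + h)     # 100 + 75 = 175
h_z = h + context_amount * (w + h)     # 50 + 75 = 125
s_z = round(np.sqrt(w_z * h_z))        # round(147.9...) = 148
print(w_z, h_z, s_z)

get_subwindow then crops this square region around the target center and resizes it to cfg.TRACK.EXEMPLAR_SIZE (127 by default), padding with the channel average wherever the crop goes beyond the image border.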
track takes a subsequent frame, applies a scale penalty and an aspect-ratio penalty to the raw predictions, uses a cosine window to suppress large displacements, and finally regresses the target position from the anchor with the highest (penalized) classification score:
def track(self, img):
    """
    args:
        img(np.ndarray): BGR image
    return:
        bbox(list): [x, y, width, height]
    """
    w_z = self.size[0] + cfg.TRACK.CONTEXT_AMOUNT * np.sum(self.size)
    h_z = self.size[1] + cfg.TRACK.CONTEXT_AMOUNT * np.sum(self.size)
    s_z = np.sqrt(w_z * h_z)
    scale_z = cfg.TRACK.EXEMPLAR_SIZE / s_z
    s_x = s_z * (cfg.TRACK.INSTANCE_SIZE / cfg.TRACK.EXEMPLAR_SIZE)
    x_crop = self.get_subwindow(img, self.center_pos,
                                cfg.TRACK.INSTANCE_SIZE,
                                round(s_x), self.channel_average)

    outputs = self.model.track(x_crop)

    score = self._convert_score(outputs['cls'])
    pred_bbox = self._convert_bbox(outputs['loc'], self.anchors)

    def change(r):
        return np.maximum(r, 1. / r)

    def sz(w, h):
        pad = (w + h) * 0.5
        return np.sqrt((w + pad) * (h + pad))

    # scale penalty
    s_c = change(sz(pred_bbox[2, :], pred_bbox[3, :]) /
                 (sz(self.size[0] * scale_z, self.size[1] * scale_z)))

    # aspect ratio penalty
    r_c = change((self.size[0] / self.size[1]) /
                 (pred_bbox[2, :] / pred_bbox[3, :]))
    penalty = np.exp(-(r_c * s_c - 1) * cfg.TRACK.PENALTY_K)
    pscore = penalty * score

    # window penalty (cosine window suppresses large displacements)
    pscore = pscore * (1 - cfg.TRACK.WINDOW_INFLUENCE) + \
        self.window * cfg.TRACK.WINDOW_INFLUENCE
    best_idx = np.argmax(pscore)

    bbox = pred_bbox[:, best_idx] / scale_z
    lr = penalty[best_idx] * score[best_idx] * cfg.TRACK.LR

    cx = bbox[0] + self.center_pos[0]
    cy = bbox[1] + self.center_pos[1]

    # smooth bbox (linear interpolation with the previous size)
    width = self.size[0] * (1 - lr) + bbox[2] * lr
    height = self.size[1] * (1 - lr) + bbox[3] * lr

    # clip boundary
    cx, cy, width, height = self._bbox_clip(cx, cy, width,
                                            height, img.shape[:2])

    # update state
    self.center_pos = np.array([cx, cy])
    self.size = np.array([width, height])

    bbox = [cx - width / 2,
            cy - height / 2,
            width,
            height]
    best_score = score[best_idx]
    return {
        'bbox': bbox,
        'best_score': best_score
    }
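_convert_score and _convert_bbox turn the raw network outputs into per-anchor quantities: the former takes a softmax over the two class channels and keeps the positive-class probability, while the latter inverts the delta encoding that was used when building the labels. A sketch of that decoding step (the real pysot method also permutes and reshapes the network output first; decode_delta is my own name, and anchor is assumed to be the (num_anchors, 4) center-form array from the tracker's anchor generation):

import numpy as np

def decode_delta(delta, anchor):
    # delta: (4, num_anchors) predicted offsets; anchor: (num_anchors, 4) as (cx, cy, w, h)
    box = np.zeros_like(delta)
    box[0, :] = delta[0, :] * anchor[:, 2] + anchor[:, 0]  # cx = dx * aw + acx
    box[1, :] = delta[1, :] * anchor[:, 3] + anchor[:, 1]  # cy = dy * ah + acy
    box[2, :] = np.exp(delta[2, :]) * anchor[:, 2]         # w  = exp(dw) * aw
    box[3, :] = np.exp(delta[3, :]) * anchor[:, 3]         # h  = exp(dh) * ah
    return box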
Because the anchors were generated with the image center (rather than the top-left corner) as the origin, the decoded bbox[0] and bbox[1] are displacements relative to the crop center, which is why the lines cx = bbox[0] + self.center_pos[0] and cy = bbox[1] + self.center_pos[1] above add them directly to the previous frame's center position.
Video walkthrough: https://www.bilibili.com/video/BV1tz4y1f7Cj
Download link for the slides used in the video [if you have CSDN points, your support would be appreciated]: https://download.csdn.net/download/laizi_laizi/12776130
Writing up to this point, I feel the Siamese family is becoming more and more alike; trackers built on Siamese networks generally base their code on the open-source pysot. It seems I have reached a certain depth in the tracking field and hit something of a bottleneck; the next stage is to come up with and implement ideas of my own. Keep going!