[paper reading] CenterNet (Objects as Points)

GitHub:Notes of Classic Detection Papers

Update 2020.11.09: added the Use Yourself section, i.e. my own understanding of and thoughts on this paper; for details see GitHub: Notes-of-Classic-Detection-Papers


topic: CenterNet (Objects as Points)
motivation: Problem to Solve
technique: CenterNet Architecture
key element: Center Point & Anchor, Getting Ground-Truth, Model Output, Data Augmentation, Compared with SOTA, Additional Experiments
math: Loss Function, KeyPoint Loss $\text{L}_k$, Offset Loss $\text{L}_{off}$, Size Loss $\text{L}_{size}$
use yourself: ……
relativity: Anchor-Based


Problem to Solve

Anchor-based methods have the following drawbacks:

  • wasteful & inefficient

  • need post-processing (e.g. NMS)


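  • In essence:

    Object Detection is recast as Standard Keypoint Estimation

  • In terms of the idea:

    An object is represented by the center point of its bounding box

  • In terms of the concrete procedure:

    Keypoint estimation is used to find the center point, and the other properties are then regressed from the center point (because every other property has a well-defined mathematical relationship to the center point)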


CenterNet Architecture


  • Backbone

    • Stacked Hourglass Network

      See [CornerNet](./[paper reading] RetinaNet.md)

    • Upconvolutional Residual Network

    • Deep Layer Aggregation (DLA)

  • Task-Specific Modality

    • one 3×3 Convolution
    • ReLU
    • one 1×1 Convolution


  • simpler & faster & accurate

  • end-to-end differentiable

    All outputs come directly from the keypoint estimation network; no NMS (or any other post-processing) is needed

    Peak Keypoint Extraction is implemented with a $3 \times 3$ Max Pooling, which is enough to replace NMS

  • estimate additional object properties in one single forward pass

    Multiple object properties can be estimated in a single forward pass

Key Element

Center Point & Anchor


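A center point can be viewed as a shape-agnostic anchor


  • A center point depends only on location (not on box overlap)


  • Each object corresponds to exactly 1 center point

    Local peaks are extracted directly on the keypoint heatmap, so there is no duplicate-detection problem

  • CenterNet has a larger output resolution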


Getting Ground-Truth

See [Symbol Definition](#Symbol Definition)

Keypoint Ground-Truth

Ground-Truth: Input Image ==> Output Feature Map
  • $p \in \mathcal{R}^2$: ground-truth keypoint
  • $\widetilde{p} = \lfloor \frac{p}{R} \rfloor$: low-resolution equivalent

The ground-truth keypoint $p$ in the image is mapped to the ground-truth keypoint $\widetilde{p}$ on the output feature map:
$$\widetilde{p} = \lfloor \frac{p}{R} \rfloor$$
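The mapping above can be sketched in a few lines. This is only an illustration: the function names and the example center are made up here, and `stride` stands for the output stride $R$ (4 in the experiments).

```python
import math

def to_low_res(p, stride=4):
    """Map an image-space keypoint (x, y) to its output-map cell: floor(p / R)."""
    x, y = p
    return (math.floor(x / stride), math.floor(y / stride))

def discretization_error(p, stride=4):
    """Sub-cell remainder p / R - floor(p / R) that the flooring throws away."""
    x, y = p
    px, py = to_low_res(p, stride)
    return (x / stride - px, y / stride - py)

center = (201.0, 118.5)               # hypothetical center in a 512x512 image
print(to_low_res(center))             # (50, 29)
print(discretization_error(center))   # (0.25, 0.625)
```

The remainder returned by `discretization_error` is exactly the quantity the offset branch is later asked to predict.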

Gaussian Penalty Reduction

$$Y_{x y c}=\exp \left(-\frac{\left(x-\tilde{p}_{x}\right)^{2}+\left(y-\tilde{p}_{y}\right)^{2}}{2 \sigma_{p}^{2}}\right)$$

  • $\sigma_{p}$: object size-adaptive standard deviation

If 2 Gaussians of the same class overlap, the element-wise maximum is taken
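A minimal sketch of how one class channel of the ground-truth heatmap could be rendered, including the element-wise maximum for overlapping Gaussians. The function name, the tiny 8×6 grid and the fixed `sigma=1.0` are assumptions for illustration; in the paper $\sigma_p$ depends on the object size, and real implementations usually only fill a small window around each center rather than the whole map.

```python
import math

def draw_gaussian(heatmap, center, sigma):
    """Splat exp(-((x - px)^2 + (y - py)^2) / (2 sigma^2)) onto one class channel,
    keeping the element-wise maximum where Gaussians overlap."""
    px, py = center
    for y, row in enumerate(heatmap):
        for x, old in enumerate(row):
            value = math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * sigma ** 2))
            row[x] = max(old, value)   # maximum, not sum
    return heatmap

width, height = 8, 6
heatmap = [[0.0] * width for _ in range(height)]   # indexed as heatmap[y][x]
draw_gaussian(heatmap, center=(2, 3), sigma=1.0)   # two hypothetical objects
draw_gaussian(heatmap, center=(4, 3), sigma=1.0)   # of the same class
```

Both centers keep the value 1, and the cell between them holds the larger of the two Gaussian values instead of their sum, so the target never leaves $[0, 1]$.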

keypoint heatmap
$$\hat{Y}\in[0,1]^{\frac{W}{R} \times \frac{H}{R} \times C}$$

  • $\hat Y_{x,y,c} = 1$ ==> keypoint
  • $\hat Y_{x,y,c} = 0$ ==> background

Note: the center here is the geometric center of the bounding box, i.e. the center is equidistant from the left and right edges and from the top and bottom edges

Size Ground-Truth

A bounding box is represented by 4 coordinates (for the $k$-th object, of class $c_k$):
$$(x_1^{(k)}, y_1^{(k)}, x_2^{(k)}, y_2^{(k)})$$
The center is:
$$p_k = \big( \frac{x_1^{(k)} + x_2^{(k)}}{2}, \frac{y_1^{(k)} + y_2^{(k)}}{2} \big)$$
The Size Ground-Truth is:
$$s_k = \big(x_2^{(k)} - x_1^{(k)}, y_2^{(k)} - y_1^{(k)} \big)$$

Note: the scale is not normalized; raw pixel coordinates are used directly
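As a quick worked example of the two formulas above (the function name and the box values are hypothetical):

```python
def center_and_size(box):
    """Return (p_k, s_k) for a box given as (x1, y1, x2, y2) in raw pixels."""
    x1, y1, x2, y2 = box
    center = ((x1 + x2) / 2, (y1 + y2) / 2)   # p_k
    size = (x2 - x1, y2 - y1)                 # s_k, not normalized
    return center, size

center, size = center_and_size((48.0, 30.0, 152.0, 90.0))
print(center)   # (100.0, 60.0)
print(size)     # (104.0, 60.0)
```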

Model Output

Input & Output Resolution

  • Input: 512×512
  • Output: 128×128


  • keypoint $\hat Y$ ==> $C$ channels
  • offset $\hat O$ ==> 2 channels
  • size $\hat S$ ==> 2 channels


For each modality, the feature is passed through:

  • one 3×3 Convolution
  • ReLU
  • one 1×1 Convolution

Data Augmentation

  • random flip

  • random scaling (0.6 to 1.3)

  • cropping

  • color jittering


CenterNet inference is a single network forward pass

  1. The image is fed into the backbone (e.g. FCN), which produces 3 outputs

    • keypoint $\hat Y$ ==> $C$


      Peak criterion: the value is $\ge$ that of its 8 neighbors

    • offset $\hat O$ ==> 2

    • size $\hat S$ ==> 2

  2. Compute the bounding box from keypoint $\hat Y$, offset $\hat O$ and size $\hat S$


    • $(\delta \hat x_i, \delta \hat y_i) = \hat O_{\hat x_i, \hat y_i}$: offset prediction
    • $(\hat w_i, \hat h_i) = \hat S_{\hat x_i, \hat y_i}$: size prediction
  3. Compute the keypoint confidence: the value at the keypoint's location
    $\hat Y_{x_i, y_i, c}$
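The three steps above can be sketched as follows. This is a plain-Python illustration, not the reference implementation. It assumes that, following the paper, the box around a peak $(\hat x_i, \hat y_i)$ is $(\hat x_i + \delta \hat x_i \mp \hat w_i/2,\ \hat y_i + \delta \hat y_i \mp \hat h_i/2)$, and that the offset and size maps are expressed in the same units as the heatmap grid. The function names and the `threshold` parameter are made up for the example.

```python
def find_peaks(heat, threshold=0.1):
    """Return (x, y, score) for cells whose value is >= all of their 8 neighbors.
    This plays the role of the 3x3 max pooling used in place of NMS."""
    h, w = len(heat), len(heat[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            v = heat[y][x]
            if v < threshold:          # skip flat background
                continue
            neighbors = [heat[j][i]
                         for j in range(max(0, y - 1), min(h, y + 2))
                         for i in range(max(0, x - 1), min(w, x + 2))
                         if (i, j) != (x, y)]
            if all(v >= n for n in neighbors):
                peaks.append((x, y, v))
    return peaks

def decode(heat, offset, size, threshold=0.1):
    """Turn one class channel plus offset/size maps into (x1, y1, x2, y2, score)."""
    boxes = []
    for x, y, score in find_peaks(heat, threshold):
        dx, dy = offset[y][x]          # offset prediction at the peak
        w, h = size[y][x]              # size prediction at the peak
        cx, cy = x + dx, y + dy        # refined center
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, score))
    return boxes

# Tiny hypothetical 5x5 example with a single object.
heat = [[0.0] * 5 for _ in range(5)]
offset = [[(0.0, 0.0)] * 5 for _ in range(5)]
size = [[(0.0, 0.0)] * 5 for _ in range(5)]
heat[2][3], heat[2][2] = 0.9, 0.5      # 0.5 sits next to the peak and is dropped
offset[2][3] = (0.25, 0.5)
size[2][3] = (2.0, 1.0)
print(decode(heat, offset, size))      # [(2.25, 2.0, 4.25, 3.0, 0.9)]
```

The returned score is simply the heatmap value at the peak, which matches step 3.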



  1. no augmentation

  2. flip augmentation

    flip: the outputs are averaged before decoding

  3. flip & multi-scale (0.5, 0.75, 1, 1.25, 1.5)
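A minimal sketch of the flip variant for the heatmap only: the network is run on the image and on its horizontal mirror, the second heatmap is mirrored back, and the two are averaged before any peak extraction. The helper names are hypothetical, and the offset branch is deliberately left out because its horizontal component would need extra care under mirroring.

```python
def hflip(grid):
    """Mirror a 2D grid (list of rows) left to right."""
    return [list(reversed(row)) for row in grid]

def flip_average(heat, heat_from_flipped):
    """Average the heatmap of the image with the mirrored-back heatmap
    predicted from the horizontally flipped image."""
    restored = hflip(heat_from_flipped)
    return [[(a + b) / 2 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(heat, restored)]

heat = [[0.1, 0.8, 0.0, 0.0]]                 # hypothetical output for the image
heat_from_flipped = [[0.0, 0.0, 0.6, 0.3]]    # hypothetical output for its mirror
averaged = flip_average(heat, heat_from_flipped)
```

Decoding then runs once on `averaged` instead of on either single output.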


Compared with SOTA

[Figure 1]

Additional Experiments

Center Point Collision

After downsampling, the center keypoints of multiple objects may coincide

CenterNet reduces Center Keypoint collisions
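To make the collision idea concrete, here is a small sketch that counts how many same-class centers land in an already occupied output cell for a given stride. The function name and the example centers are made up; the point is only that a smaller stride (larger output resolution) leaves fewer collisions.

```python
from collections import Counter

def count_collisions(centers, stride):
    """Number of centers that fall into a cell already taken by another center."""
    cells = Counter((int(x // stride), int(y // stride)) for x, y in centers)
    return sum(n - 1 for n in cells.values() if n > 1)

centers = [(100, 100), (102, 101), (110, 100), (300, 40)]   # hypothetical, same class
print(count_collisions(centers, stride=4))    # 1
print(count_collisions(centers, stride=16))   # 2
```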



Training & Testing Resolution

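  1. Low resolution is the fastest but the least accurate
  2. High resolution improves accuracy but reduces speed
  3. The original size is slightly more accurate than high resolution, but slightly slower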

Regression Loss

smooth L1 Loss performs slightly worse than L1 Loss

[Figure 2]

Bounding Box Size Weight

$\lambda_{size} = 0.1$ works best; AP drops quickly as it increases, while the result is robust as it decreases

[Figure 3]

Training Schedule


[Figure 4]


Symbol Definition

  • $I \in R^{W \times H \times 3}$: image
  • $R$: output stride, 4 in the experiments
  • $C$: number of keypoint classes

Loss Function

$$\text{L}_{det} = \text{L}_k + \lambda_{size} \text{L}_{size} + \lambda_{off} \text{L}_{off}$$

  • $\lambda_{size} = 0.1$
  • $\lambda_{off} = 1$

KeyPoint Loss $\text{L}_k$

penalty-reduced pixel-wise logistic regression with focal loss

[Figure 5]

  • $\hat{Y}_{xyc}$: predicted keypoint confidence
  • $\alpha = 2, \beta = 4$
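A sketch of the penalty-reduced focal loss on a single class channel. It assumes the form given in the paper: positives ($Y = 1$) contribute $(1 - \hat Y)^{\alpha} \log \hat Y$, every other cell contributes $(1 - Y)^{\beta} \hat Y^{\alpha} \log(1 - \hat Y)$, and the negated sum is divided by the number of keypoints $N$. The function name and the `eps` guard are additions for the example.

```python
import math

def keypoint_focal_loss(pred, target, alpha=2, beta=4, eps=1e-12):
    """Penalty-reduced pixel-wise focal loss over one heatmap channel.

    pred   -- predicted confidences in (0, 1), as a list of rows
    target -- Gaussian-splatted ground truth in [0, 1], same shape
    """
    total, num_pos = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t == 1:
                total += (1 - p) ** alpha * math.log(p + eps)
                num_pos += 1
            else:
                # (1 - t)^beta shrinks the penalty for cells close to a center
                total += (1 - t) ** beta * p ** alpha * math.log(1 - p + eps)
    return -total / max(num_pos, 1)

pred = [[0.9, 0.1]]
target = [[1.0, 0.0]]
loss = keypoint_focal_loss(pred, target)
```

With a target of 0.8 instead of 0.0 at a negative cell, the factor $(1 - 0.8)^4 = 0.0016$ almost removes its penalty, which is the "penalty reduction" around each center.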

Offset Loss $\text{L}_{off}$

Purpose: recover the discretization error introduced by downsampling


  • $\hat O \in \mathcal R^{\frac{W}{R} \times \frac{H}{R} \times 2}$: predicted local offset


  • Computed only at keypoint locations (positives)
  • All classes share the same offset prediction
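A sketch of the offset loss under the assumption that it is the L1 distance used in the paper, $\frac{1}{N}\sum_p \left| \hat O_{\tilde p} - \left(\frac{p}{R} - \tilde p\right) \right|$, evaluated only at the ground-truth keypoint cells. The function name and the toy grid are hypothetical.

```python
import math

def offset_loss(pred_offset, centers, stride=4):
    """Mean L1 error between predicted offsets at floor(p / R) and the true
    remainder p / R - floor(p / R); only keypoint cells are visited."""
    total = 0.0
    for x, y in centers:
        cx, cy = math.floor(x / stride), math.floor(y / stride)
        target = (x / stride - cx, y / stride - cy)
        dx, dy = pred_offset[cy][cx]
        total += abs(dx - target[0]) + abs(dy - target[1])
    return total / max(len(centers), 1)

pred_offset = [[(0.0, 0.0)] * 4 for _ in range(4)]   # indexed [y][x]
pred_offset[1][2] = (0.5, 0.5)
print(offset_loss(pred_offset, centers=[(10.0, 5.0)]))   # 0.25
```

Here the center (10, 5) falls into cell (2, 1) with a true remainder of (0.5, 0.25), so only the vertical component is off, by 0.25.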

Size Loss $\text{L}_{size}$


  • $\hat{S} \in \mathcal R^{\frac{W}{R} \times \frac{H}{R} \times 2}$: predicted size map, with $\hat{S}_{p_{k}}$ its value at center $p_k$
  • $s_k = \big(x_2^{(k)} - x_1^{(k)}, y_2^{(k)} - y_1^{(k)} \big)$
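A sketch of the size loss and of how the three terms combine, assuming the L1 form from the paper, $\frac{1}{N}\sum_k \left| \hat S_{p_k} - s_k \right|$, read out at the low-resolution center cell. Function names and example numbers are made up; the weights are the $\lambda_{size} = 0.1$ and $\lambda_{off} = 1$ listed under Loss Function.

```python
def size_loss(pred_size, boxes, stride=4):
    """Mean L1 error between the predicted size at each center cell and s_k."""
    total = 0.0
    for x1, y1, x2, y2 in boxes:
        cx = int(((x1 + x2) / 2) // stride)   # center cell, x
        cy = int(((y1 + y2) / 2) // stride)   # center cell, y
        w, h = pred_size[cy][cx]
        total += abs(w - (x2 - x1)) + abs(h - (y2 - y1))
    return total / max(len(boxes), 1)

def detection_loss(l_k, l_size, l_off, lambda_size=0.1, lambda_off=1.0):
    """L_det = L_k + lambda_size * L_size + lambda_off * L_off."""
    return l_k + lambda_size * l_size + lambda_off * l_off

pred_size = [[(0.0, 0.0)] * 4 for _ in range(4)]   # indexed [y][x]
pred_size[1][2] = (7.0, 5.0)
l_size = size_loss(pred_size, boxes=[(6.0, 2.0, 14.0, 8.0)])
print(l_size)   # 2.0
```

Because sizes are raw pixel values, this term is numerically much larger than the other two, which is one way to read the small weight of 0.1.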

Use Yourself


Related work

Anchor-Based Method



Two-Stage Method

  1. Place anchors on the image (same as [One-Stage Method](#One-Stage Method))

    That is: anchors are sampled densely on a grid at low resolution and classified as foreground/background ==> proposal


    • foreground

      IoU > 0.7 with any ground-truth box

    • background

      IoU < 0.3 with every ground-truth box

    • ignored

      best IoU with the ground-truth boxes $\in [0.3, 0.7]$

  2. Resample features for each anchor


  • RCNN: takes crops on the image
  • Fast-RCNN: takes crops on the feature map
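The foreground/background/ignored rule from step 1 can be sketched as below. Boxes are `(x1, y1, x2, y2)` tuples; the function names are hypothetical and the rule is applied to the best IoU over all ground-truth boxes, which is one reasonable reading of the thresholds above.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_anchor(anchor, gt_boxes, hi=0.7, lo=0.3):
    """Label an anchor from its best overlap with the ground-truth boxes."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best > hi:
        return "foreground"
    if best < lo:
        return "background"
    return "ignored"

gt_boxes = [(0, 0, 10, 10)]
print(label_anchor((0, 0, 10, 9), gt_boxes))      # foreground (IoU 0.9)
print(label_anchor((5, 0, 15, 10), gt_boxes))     # ignored (IoU 1/3)
print(label_anchor((20, 20, 30, 30), gt_boxes))   # background (IoU 0)
```

A center point needs none of this: it is assigned from location alone, with no overlap thresholds to tune.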

One-Stage Method

  1. Place anchors on the image
  2. Classify the anchor locations directly

Some improvements to one-stage methods

  • anchor shape prior
  • different feature resolution (e.g. Feature Pyramid Network)
  • loss re-weighting (e.g. Focal Loss)


  • Purpose: suppress duplicate detections of the same instance, based on IoU

  • Drawback: hard to differentiate and train, so the vast majority of detectors cannot be made end-to-end trainable
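Assuming the post-processing step described by this Purpose/Drawback pair is NMS (the one named in the drawbacks at the top of these notes), a minimal greedy version looks like this. The function names and the 0.5 threshold are illustrative choices, not taken from the paper.

```python
def iou(a, b):
    """Intersection over union; only the first four entries of each box are used."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(detections, iou_threshold=0.5):
    """Greedy NMS over (x1, y1, x2, y2, score) tuples: keep the highest-scoring
    box, drop everything that overlaps it too much, repeat."""
    remaining = sorted(detections, key=lambda d: d[4], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining if iou(best, d) <= iou_threshold]
    return kept

detections = [(0, 0, 10, 10, 0.9), (1, 1, 11, 11, 0.8), (20, 20, 30, 30, 0.7)]
kept = nms(detections)   # the 0.8 box overlaps the 0.9 box (IoU about 0.68) and is dropped
```

The sorting and the hard keep-or-drop decision are discrete operations, which gives some intuition for why this step is hard to differentiate and train through.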

KeyPoint-Based Method


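Detection is recast as keypoint estimation

The backbone is a KeyPoint Estimation Network in every case


CornerNet: detects 2 corners as keypoints, which together represent 1 bounding box


ExtremeNet: detects the top-most, left-most, bottom-most, right-most and center points as keypoints


Multiple keypoints are detected for 1 object, which requires an additional grouping stage (and slows the algorithm down)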
