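Pedestrian detection is a popular research topic in the computer vision community. A general object detector is not necessarily optimal for pedestrian detection, so the authors adapt Faster R-CNN to better fit the task and, at the same time, propose CityPersons, built on top of the Cityscapes segmentation dataset.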
1)Training, testing ($MR^O$, $MR^N$)
The log miss-rate (MR) is averaged over the FPPI (false positives per image) range $[10^{-2}, 10^{0}]$.
$MR^O$ refers to the original annotations
$MR^N$ refers to the new annotations
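To make the metric concrete, here is a minimal sketch of a log-average miss rate, assuming the Caltech-style convention of sampling the curve at 9 FPPI reference points evenly spaced in log space and taking the geometric mean. The function name, the number of points, and the handling of curves that never reach a low FPPI are assumptions for illustration, not details taken from these notes.

```python
import numpy as np


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of miss rates sampled at log-spaced FPPI references.

    fppi and miss_rate describe one detector curve, ordered by increasing FPPI.
    num_points=9 is an assumed convention, not something stated in the notes.
    """
    fppi = np.asarray(fppi, dtype=float)
    miss_rate = np.asarray(miss_rate, dtype=float)
    # Reference FPPI values, evenly spaced in log space over [1e-2, 1e0].
    refs = np.logspace(-2.0, 0.0, num_points)
    sampled = np.ones(num_points)
    for i, ref in enumerate(refs):
        idx = np.flatnonzero(fppi <= ref)
        if idx.size > 0:
            # Operating point with the largest FPPI not exceeding the reference.
            sampled[i] = miss_rate[idx[-1]]
        # Otherwise keep 1.0 (assumption: the curve never gets that low).
    # Geometric mean; clamp to avoid log(0).
    return float(np.exp(np.mean(np.log(np.maximum(sampled, 1e-10)))))


if __name__ == "__main__":
    # Toy curve with made-up numbers, only to show the call.
    print(log_average_miss_rate([0.001, 0.5], [0.8, 0.2]))
```

Averaging in log space keeps a single very good or very bad operating point from dominating the summary number.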
2)FasterRCNN
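The goal is to overcome the problems that arise when converting segmentation labels into detection boxes, and to align with the annotation format of existing datasets (INRIA, Caltech, KITTI). (On the authors' second point: surely the tightest bounding rectangle of a segmentation label is fairly accurate? Haha. Nobody said the centre of the box has to be the centre of the object, so where does "segment centre rather than the object centre" in the horizontal direction come from?)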
CityPersons uses amodal bounding box annotations,
that is, the occluded parts of a person are also included in the box
1)Fine-grained categories
Four categories: pedestrian, rider, sitting person, other person
2)Annotation protocol
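As shown in Figure 2, pedestrian and rider are annotated in amodal fashion: two points are marked, one at the top and one at the bottom centre, and a box is generated from them with a width-to-height ratio of 0.41. For sitting person and other person, only the segment bounding box is provided.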
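The box construction described above can be sketched as follows. This is an illustrative reading, assuming image coordinates with y growing downwards, that the top point is the top of the head and the bottom point the middle of the feet, and that the box is centred horizontally on the midpoint of the two points. The names `aligned_box`, `head_top` and `feet_mid` are hypothetical.

```python
def aligned_box(head_top, feet_mid, aspect_ratio=0.41):
    """Build an (x, y, w, h) box from a top point and a bottom-centre point.

    head_top and feet_mid are (x, y) tuples; y grows downwards.
    aspect_ratio is width / height, 0.41 as stated in the notes.
    """
    x_top, y_top = head_top
    x_bot, y_bot = feet_mid
    height = y_bot - y_top
    if height <= 0:
        raise ValueError("bottom point must lie below the top point")
    width = aspect_ratio * height
    # Assumption: centre the box on the midpoint of the annotated line.
    x_center = (x_top + x_bot) / 2.0
    return (x_center - width / 2.0, y_top, width, height)


if __name__ == "__main__":
    # A 200 px tall person gives a box roughly 82 px wide.
    print(aligned_box((100.0, 50.0), (100.0, 250.0)))
```

Because the width follows from the height, the box does not change when a person swings their arms or is partly hidden, which is the point of the amodal protocol.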
Fake human regions (people on posters, statues, mannequins, people's reflections in mirrors or windows, etc.) are marked as ignore regions.
The occlusion ratio is computed as follows:
$\frac{BB\text{-}vis}{BB\text{-}full}$
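As a small worked example of the fraction above, here is a sketch that assumes BB-vis and BB-full stand for the areas of the visible (segment) box and the full-body box, both given as `(x, y, w, h)`. As written, the fraction measures the visible share of the full box, so the sketch also returns one minus that value as the occluded share. Which of the two the occlusion thresholds refer to is an assumption here, and the function names are hypothetical.

```python
def box_area(box):
    """Area of an (x, y, w, h) box; degenerate boxes count as zero area."""
    _, _, w, h = box
    return max(w, 0.0) * max(h, 0.0)


def vis_full_ratio(bb_vis, bb_full):
    """Return (area(BB-vis) / area(BB-full), 1 - that ratio)."""
    full_area = box_area(bb_full)
    if full_area <= 0:
        raise ValueError("full-body box must have positive area")
    ratio = box_area(bb_vis) / full_area
    return ratio, 1.0 - ratio


if __name__ == "__main__":
    # Lower half of the person hidden: half visible, half occluded.
    print(vis_full_ratio((0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 20.0)))
```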
3)Annotation tool
The tool pops out one person segment at a time
First, annotate the fine-grained category
Then, do the full body annotation for pedestrians and riders
But the ignore region annotations have to be done by searching over the whole images
2)Diversity
The authors' dataset also contains a large number of identical persons, i.e., person IDs
provides fine-grained labels
3)Occlusion
Reasonable refers to samples with less than 35% occlusion
CityPersons has more occlusion cases
The 9 most common occlusion patterns are as follows:
MR stands for log-average miss rate on the “reasonable” setup (scale [50, ∞], occlusion ratio [0, 0.35]) unless otherwise specified.
cyclists/sitting persons/other persons/ignore regions are not considered
So the other classes are just there for show, right? Haha
only use the reasonable subset of pedestrians for training
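The "reasonable" selection described above can be sketched as a simple filter, assuming each annotation is a dict with hypothetical keys `category`, `height` (box height in pixels) and `occlusion` (ratio in [0, 1]). The field names are illustrative and not the dataset's actual schema.

```python
def is_reasonable(ann, min_height=50.0, max_occlusion=0.35):
    """True for pedestrians with scale in [50, inf) and occlusion in [0, 0.35]."""
    return (
        ann["category"] == "pedestrian"
        and ann["height"] >= min_height
        and 0.0 <= ann["occlusion"] <= max_occlusion
    )


def reasonable_subset(annotations):
    """Keep only the annotations that fall in the reasonable setup."""
    return [ann for ann in annotations if is_reasonable(ann)]


if __name__ == "__main__":
    # Made-up annotations, only to show the call.
    anns = [
        {"category": "pedestrian", "height": 120.0, "occlusion": 0.10},
        {"category": "pedestrian", "height": 40.0, "occlusion": 0.00},
        {"category": "pedestrian", "height": 90.0, "occlusion": 0.60},
        {"category": "rider", "height": 150.0, "occlusion": 0.00},
    ]
    print(len(reasonable_subset(anns)))
```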
CityPersons generalizes better than Caltech and KITTI.
The authors attribute this to:
the size and diversity of the Cityscapes data
the quality of the bounding boxes annotations
Now look at the results on Caltech
improves more for harder cases
better-aligned detections: when the IoU threshold is raised from 0.5 to 0.75, $\Delta MR$ is actually larger
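To illustrate why a stricter IoU threshold rewards better-aligned boxes, here is a minimal IoU sketch for boxes given as `(x1, y1, x2, y2)`. The box format and the example numbers are assumptions chosen for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extents, clamped at zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    gt = (0.0, 0.0, 41.0, 100.0)
    det = (8.0, 0.0, 49.0, 100.0)  # same size, shifted 8 px to the right
    print(iou(gt, det))
```

In this toy case the shifted detection still counts as a match at IoU 0.5 but becomes a miss at 0.75, so a detector that localises more tightly gains more under the stricter threshold.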
An FCN-8s is trained on the Cityscapes coarse annotations and used to predict the semantic map
concatenate semantic channels with RGB channels and feed them altogether into convnets
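The channel concatenation can be sketched with NumPy as below, assuming an `(H, W, 3)` RGB image and an `(H, W, C)` semantic map (for example per-class probabilities) at the same resolution. The function name and the channels-last layout are assumptions. The first layer of the network would then need to accept `3 + C` input channels.

```python
import numpy as np


def add_semantic_channels(rgb, semantic):
    """Stack semantic channels behind the RGB channels of an image.

    rgb: array of shape (H, W, 3); semantic: array of shape (H, W, C).
    Returns a float32 array of shape (H, W, 3 + C).
    """
    if rgb.shape[:2] != semantic.shape[:2]:
        raise ValueError("RGB image and semantic map must share height and width")
    return np.concatenate(
        [rgb.astype(np.float32), semantic.astype(np.float32)], axis=-1
    )


if __name__ == "__main__":
    rgb = np.zeros((4, 6, 3), dtype=np.uint8)
    semantic = np.ones((4, 6, 2), dtype=np.float32)
    print(add_semantic_channels(rgb, semantic).shape)
```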