CVPR2021 Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations

0. Keywords

annotated videos, 3D object detection, object-centric videos, pose annotations, Objectron dataset, 3D object tracking, 3D shape representation, object-centric short videos, annotated images, robotics, image retrieval, augmented reality

1. Links

This paper comes from Google Research (it's Google :-( — accessing its resources from mainland China requires a VPN). In keeping with Google's usual habit of building technical moats, its work either relies on brute-force scale, using massive clusters or high-end GPUs so that outsiders cannot easily train or verify the method, or it leverages Google's broad sources of raw data and large engineering teams to collect and annotate high-quality datasets with strong generalization. Objectron, a dataset paper, clearly belongs to the latter category.

Paper: https://ieeexplore.ieee.org/abstract/document/9578264

Code and project page: https://github.com/google-research-datasets/Objectron/

Official introduction (MediaPipe): https://google.github.io/mediapipe/solutions/objectron

For 3D object detection and 6D pose estimation, Objectron contributes a high-quality, in-the-wild public dataset. However, because 3D objects are so much more complex, even at the time this paper appeared there was still nothing in the 3D field that plays the role MS-COCO plays for 2D object detection. Objectron contains 4 million still images, but only 14,819 videos and 17,095 object instances (COCO's instance diversity is far richer), and it covers only 9 object categories (COCO has 80), nowhere near everything we encounter in daily life. That said, Objectron remains one of the most challenging datasets in the field to date.

Samples: Objectron is a dataset of short object centric video clips with pose annotations.

Although Objectron describes its construction pipeline in great detail and generously releases the data, two important pieces are withheld: 1) the training code for the baseline models; 2) the 3D bounding-box annotation tool. Experienced practitioners can probably find substitutes or re-implement them, but that clearly costs time and effort. This is the technical moat typical of Google releases, and most companies would probably do the same.


2. Summary of the Main Content

※ Abstract

3D object detection has a wide range of applications (robotics, augmented reality, autonomy, and image retrieval). The proposed Objectron dataset aims to push 3D object detection forward, along with several related areas (including 3D object tracking, view synthesis, and improved 3D shape representation).

Dataset specifics: the dataset contains object-centric short videos with pose annotations for nine categories and includes 4 million annotated images in 14,819 annotated videos.

In addition, for the 3D object detection task, the paper proposes a new evaluation metric: 3D Intersection over Union (3D IoU).

Finally, on the self-built benchmark, the authors provide two baselines: one for 3D object detection and one for novel view synthesis.


※ Introduction

Backed by machine-learning algorithms and large collections of training images, computer-vision tasks have made huge gains in accuracy, and 3D object understanding has advanced as well. Still, understanding objects in 3D is much harder than in 2D because of the lack of large real-world datasets compared to 2D tasks (e.g., ImageNet [8], COCO [22], and Open Images [20]). The goal of this paper is to build an object-centric video dataset: short clips that circle an object and observe it continuously from different viewpoints.

Concretely, each short video is captured with an AR-enabled device, and the metadata includes camera poses, sparse point-clouds, and surface planes. In addition, every object carries a manually annotated 3D bounding box that describes its 9-DoF state: position, orientation, and dimensions 【i.e., X, Y, Z, pitch, yaw, roll, length, width, height】. To keep the data diverse, the 14,819 video samples were collected from a geo-diverse sample covering ten countries across five continents 【one of Google's advantages: offices all over the world】.
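To make the 9-DoF parameterization concrete, here is a minimal sketch (my own illustration, not the official tooling) that assembles the eight box corners from a translation, Euler angles, and a size vector; note that the released annotations actually store a 3×3 rotation matrix, include the box center as a ninth keypoint, and use their own corner ordering, so this only shows the geometry.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def box_corners(translation, euler_deg, size):
    """Assemble the 8 corners of a 9-DoF box from center, Euler angles, and dimensions."""
    # Unit-cube corners centered at the origin (one per sign combination).
    corners = np.array([(sx, sy, sz) for sx in (-0.5, 0.5)
                                     for sy in (-0.5, 0.5)
                                     for sz in (-0.5, 0.5)])
    R = Rotation.from_euler("xyz", euler_deg, degrees=True).as_matrix()
    # Scale, rotate, then translate: v = R @ (corner * size) + t
    return (R @ (corners * np.asarray(size)).T).T + np.asarray(translation)

# Example: a 0.4 x 0.3 x 0.8 m box, 1 m in front of the camera, yawed 30 degrees.
print(box_corners((0.0, 0.0, 1.0), (0.0, 30.0, 0.0), (0.4, 0.3, 0.8)).shape)  # (8, 3)
```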

Figure 2: Our dataset consists of object-centric videos, which capture different views of the same objects from different angles.

The Objectron dataset has the following advantages:

● Videos contain multiple views of the same object, enabling many applications well beyond 3D object detection. This includes multi-view geometric understanding, view synthesis, 3D shape reconstruction, etc.

● The 3D bounding box is present in the entire video and is temporally consistent, thus enabling 3D tracking applications.

● Our dataset is collected in the wild to provide better generalization for real-world scenarios in contrast to datasets that are collected in a controlled environment [13] [4]. 【the controlled-environment datasets referred to here are mainly LineMOD (ACCV 2012) and YCB (IJRR 2017)】

● Each instance’s translation and size are stored in metric scale, thanks to accurate on-device AR tracking and provides sparse point clouds in 3D, enabling sparse depth estimation techniques. The images are calibrated and the camera parameters are provided, enabling the recovery of the object’s true scale.

● Our annotations are dense and continuous, unlike some of the previous work [30] where viewpoints have been discretized to fit into bins. 【[30] is the 3DObject dataset (ICCV 2007) from Fei-Fei Li's group, which is rather dated by now】

● Each object category contains hundreds of instances, collected from different locations across different countries in different lighting conditions. 【emphasizing both the per-category scale and the diversity of the data distribution】


※ Previous Work

The authors mainly compare Objectron against the representative 3D object detection datasets commonly used in prior work:

● Larger scale, higher-resolution video, and more realistic everyday objects than: the BOP challenge, T-LESS, Rutgers APC, LineMOD, IC-BIN, YCB;

● Richer annotation dimensions (9-DoF vs. 6-DoF) than: ObjectNet3D, Pascal3D+, Pix3D, 3DObject;

● Simpler capture than scene datasets that require RGB-D or LiDAR rigs: ScanNet, Scan2CAD, Rio;

● Real-world data rather than synthetic or photo-realistic scenes: ShapeNet, HyperSim. 【Synthetic datasets offer valuable data for training and benchmarking, but the ability to generalize to the real world is unknown.】

For a more detailed comparison see the original paper, or my related post, 6D Object Pose Estimation Datasets.


※ Data Collection and Annotation

● Object Categories

The paper first lays out the rough criteria for choosing the object categories:

1) In the Objectron dataset, the aim was to select meaningful categories of common objects that form a representative set of all categories that are practically relevant and technically challenging. 【this motivates cups, chairs, and bikes】

2) The object categories in the dataset contain both rigid and non-rigid objects. 【this brings in the non-rigid bikes and laptops; naturally, the non-rigid objects remain stationary while the videos are recorded】

3) Many 3D object detection models are known to exhibit difficulties in estimating rotations of symmetric objects [21]. Symmetric objects have ambiguity in their one, two, or even three degrees of rotation. 【this motivates the strongly symmetric cups and bottles】

4) It has been shown that vision models pay special attention to texts in the images. Reproducing texts and labels correctly is important in generative models too. 【stressing that fine details such as text may need to be recovered; this motivates books and cereal boxes】

5) Since we strive for real-time perception, we included a few categories (shoes and chairs) that enable exciting applications, such as augmented reality and image retrieval. 【this motivates shoes and chairs】

Nine categories in total: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes.


● Data Collection

The data were collected with ordinary handheld phones running on-device AR sessions (ARKit and ARCore); the released dataset includes both the recorded videos and the metadata captured by the AR session. 【We assume the standard pinhole camera model, and provide calibration, extrinsics and intrinsics matrix for every frame in the dataset】 All videos are 1920×1080 at 30 FPS, shot with the phone's rear camera, under the following constraints:

1) No more than five device models were used, to keep the imaging consistent;

2) Each video is kept to roughly 10 seconds, to limit the drift introduced by the on-device AR tracking itself;

3) Collectors were asked not to move quickly, to avoid motion-blurred frames.

And, of course, the object must stay still throughout. Precisely because capture only needs a phone, the researchers could quickly launch collection campaigns in different parts of the world; the figure below shows the geographic distribution of the data sources (ten countries across five continents):

Figure 3: Countries where we collected data from.

● Data Annotation

Efficient and accurate data annotation is the key to building large-scale datasets. 【easier said than done】

Annotating 3D bounding boxes for each image is time-consuming and expensive. 【building a high-quality dataset is inevitably hard】

Instead, we annotate 3D objects in a video clip and populate them to all frames in the clip, scaling up the annotation process, and reducing the per image annotation cost. 【in other words, each short video only needs to be annotated once, and the annotation for every subsequent frame is generated automatically from the camera parameters, which greatly reduces the labeling burden】 The annotation tool's user interface looks like this:

Figure 4: Data annotation. The annotated 3D box is verified at multiple views, and then populated to all images in the sequence.

The original paper describes the annotation process as follows:

Next, we show the 3D world map to the annotator side-by-side with the images from the video sequence (Figure 4a). The annotator draws a 3D bounding box in the 3D world map, and our tool projects the 3D bounding box over all the frames given pre-computed camera poses from the AR sessions (such as ARKit or ARCore). The annotator looks at the projected bounding box and makes necessary adjustments (position, orientation, and the scale of a 3D bounding box) so the projected bounding box looks consistent across different frames. At the end of the process, the user saves the 3D bounding box and the annotation. The benefits of our approach are 1) by annotating a video once, we get annotated images for all frames in the video sequence; 2) by using AR, we can get accurate metric sizes for bounding boxes.

【From this description we can see that although 2D frames are displayed, what is actually annotated is the 3D point cloud: the 2D images only help the annotator judge each dimension of the object, and the annotator can only manipulate the 3D box in the 3D world map on the right, while everything on the left is projected and computed. 3D annotation tools that operate on point clouds like this are common, so even though the paper does not open-source its tool, a substitute is easy to find, e.g. CVAT, the annotation tool open-sourced by openvinotoolkit.】
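To illustrate how a single annotation gets populated to every frame, here is a minimal sketch that projects the annotated box vertices into each frame under the pinhole model, given per-frame intrinsics and extrinsics (my own code with made-up argument names; the dataset itself ships 4×4 view and projection matrices per frame, so the exact conventions differ — see the official notebooks).

```python
import numpy as np

def project_box_to_frames(box_vertices_world, intrinsics, extrinsics_per_frame):
    """Project annotated 3D box vertices (world frame) into every video frame.

    box_vertices_world  : (8, 3) box corners in world coordinates
    intrinsics          : (3, 3) pinhole camera matrix K
    extrinsics_per_frame: iterable of (3, 4) world-to-camera [R | t] matrices
    Returns a list of (8, 2) pixel coordinates, one entry per frame.
    """
    verts_h = np.hstack([box_vertices_world, np.ones((len(box_vertices_world), 1))])
    projections = []
    for world_to_cam in extrinsics_per_frame:
        cam_pts = (world_to_cam @ verts_h.T).T        # corners in the camera frame
        pix = (intrinsics @ cam_pts.T).T              # apply K
        projections.append(pix[:, :2] / pix[:, 2:3])  # perspective divide -> pixels
    return projections
```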


● Annotation Variance

A rigorous, high-quality annotation process must come with a quantitative analysis. The authors name two factors that affect annotation accuracy 【The accuracy of our annotation hinges on two factors】:

1) the amount of drift in the estimated camera pose throughout the captured video, and 

2) the accuracy of the raters annotating the 3D bounding box. 

【Factor 1 is the estimation error (drift) of the camera pose; factor 2 is the inconsistency between different human annotators.】

We compared the relative positional drift in our camera pose against an offline refined camera pose (obtained by an offline bundle adjustment) 【For the error in factor 1, the authors compare the camera poses obtained directly on-device against the result of an offline bundle adjustment (a structure-from-motion refinement); the details are given in the last paragraph of Section 4.】【In addition, to limit the accumulated error, the video length is kept to about 10 seconds, as shown below.】

Figure 5: Distribution of the video length in our dataset. Majority of the videos are 10 seconds long (300 frames) and the longest video is 2022 frames long.

To evaluate the accuracy of the rater, we asked eight annotators to re-annotate same sequences. 【For factor 2, the authors rely on redundant annotation: multiple raters repeatedly annotate the same samples, and the results are aggregated to reduce human labeling error — eight raters per sequence here. Google clearly has the budget for this.】

Overall for the chairs, the standard deviation for the chair orientation, translation, and scale was 4.6°, 1cm, and 4cm, respectively which demonstrates insignificant variance of the annotation results between different raters. 【The figure below shows annotated chairs as an example. The reported statistics claim the variance of the final annotation error is small; one might wonder whether these numbers also represent an upper bound on what an algorithm can be expected to predict.】

Figure 6: The overlay of the 3D bounding boxes annotated by different annotators shows the annotations from different raters are very close.
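For intuition, here is a minimal sketch of how such inter-rater spread could be measured, given each rater's box as a rotation matrix, a center, and a size vector (my own simplification, not the paper's evaluation protocol; the orientation spread is taken as the geodesic angle relative to the first rater).

```python
import numpy as np

def inter_rater_spread(rotations, translations, scales):
    """Rough spread of several raters' annotations of the same object.

    rotations    : list of (3, 3) rotation matrices, one per rater
    translations : list of (3,) box centers in meters
    scales       : list of (3,) box sizes in meters
    """
    R_ref = rotations[0]
    angles = []
    for R in rotations:
        # Geodesic angle between this rater's rotation and the reference rotation.
        cos = (np.trace(R_ref.T @ R) - 1.0) / 2.0
        angles.append(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
    angle_std = float(np.std(angles))                         # degrees
    trans_std = float(np.mean(np.std(translations, axis=0)))  # meters
    scale_std = float(np.mean(np.std(scales, axis=0)))        # meters
    return angle_std, trans_std, scale_std
```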

※ Objectron Dataset

In this section, we describe the details of our Objectron dataset and provide some statistics. 【this section is a statistical analysis of the dataset's characteristics】 There are nine categories in total: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes, with bikes and laptops being the non-rigid ones. In total there are 17,095 object instances that appear in 4M annotated images from 14,819 annotated videos (not counting the unreleased evaluation set for future competitions). The train/test split is as follows:

Table 1: Per-category statistics of Objectron dataset.

In addition, the authors visualize the distribution of azimuth and elevation angles for each object category, as shown below:

Figure 7: View-point distribution of samples per object category. The top row shows the azimuth distribution in polar graph, and the bottom row denotes the elevation distribution.
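As a side note, azimuth and elevation of a viewpoint can be derived from the camera position expressed in the object's coordinate frame; a minimal sketch below, assuming a y-up object frame (the paper's exact axis convention may differ).

```python
import numpy as np

def viewpoint_angles(camera_position_in_object_frame):
    """Azimuth and elevation of the camera around an object, in degrees (y-up assumed)."""
    x, y, z = camera_position_in_object_frame
    azimuth = np.degrees(np.arctan2(x, z))                  # angle in the x-z plane
    elevation = np.degrees(np.arctan2(y, np.hypot(x, z)))   # angle above the x-z plane
    return azimuth, elevation

print(viewpoint_angles((1.0, 1.0, 1.0)))  # (45.0, ~35.26)
```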

※ Baseline Experiments and Evaluations

The authors release the training and test splits along with the evaluation code, but not the training code or pretrained models for some of the categories. The evaluation metrics are: 3D IoU, 2D projection error, view-point error, polar and azimuth error, and rotation error. Except for 3D IoU, these are all standard definitions and need no further explanation.

● 3D Intersection Over Union

The authors claim that their 3D IoU definition is new. They point out that the ways earlier work computes the overlap of 3D boxes are oversimplified: either the boxes are axis-aligned before the overlap is computed, or the 3D boxes are projected onto some plane and the overlap of the 2D projected polygons is used. In general scenes these assumptions do not hold. Although this approach works for vehicles on the road, it has two limitations:

1) The object should sit on the same ground plane, which limits the degrees of freedom of the box from 9 to 7. The box only has freedom in yaw, and the roll and pitch are set to 0. 

2) it assumes the boxes have the same height. For the Objectron datasets, these assumptions do not hold.

The authors therefore propose a general method for computing accurate 3D IoU values for general 3D-oriented boxes. The paper does not describe the procedure in much detail and instead points to the key classical algorithm it builds on, the Sutherland-Hodgman polygon-clipping algorithm, probably the best-known polygon-clipping algorithm in computer graphics; the figure below illustrates the idea, and a 2D sketch of the clipping step is given after it. For the exact computation, refer to the released evaluation code.

Figure 8: Accurate computation of 3D IoU using polygon-clipping algorithm.
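Since the released evaluation hinges on it, here is a compact 2D sketch of the Sutherland-Hodgman clipping step itself (the standard textbook algorithm, not the released code, which applies the idea to the faces of oriented 3D boxes).

```python
def clip_polygon(subject, clip):
    """Sutherland-Hodgman: clip a polygon against a convex clip polygon.

    Both polygons are lists of (x, y) vertices in counter-clockwise order;
    returns the vertices of the clipped (intersection) polygon.
    """
    def inside(p, a, b):
        # True if p lies on the left of (or on) the directed clip edge a -> b.
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersect(p, q, a, b):
        # Intersection of segment p-q with the infinite line through a and b.
        dx1, dy1 = q[0] - p[0], q[1] - p[1]
        dx2, dy2 = b[0] - a[0], b[1] - a[1]
        t = ((a[0] - p[0]) * dy2 - (a[1] - p[1]) * dx2) / (dx1 * dy2 - dy1 * dx2)
        return (p[0] + t * dx1, p[1] + t * dy1)

    output = list(subject)
    for a, b in zip(clip, clip[1:] + clip[:1]):               # each clip edge a -> b
        input_list, output = output, []
        for p, q in zip(input_list, input_list[1:] + input_list[:1]):
            if inside(q, a, b):
                if not inside(p, a, b):
                    output.append(intersect(p, q, a, b))
                output.append(q)
            elif inside(p, a, b):
                output.append(intersect(p, q, a, b))
    return output

# Two overlapping unit squares, one shifted by (0.5, 0.5):
print(clip_polygon([(0, 0), (1, 0), (1, 1), (0, 1)],
                   [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (0.5, 1.5)]))
```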

In addition, for rotationally symmetric objects such as cups and bottles, whose shape is invariant under rotation about the symmetry axis, 3D IoU has to be redefined. During evaluation the authors therefore rotate the predicted 3D box uniformly around the symmetry axis and keep the best-matching rotation (maximum 3D IoU) as the prediction, as illustrated below (a small code sketch follows the figure).

Figure 9: 3D IoU computation for symmetric objects: Rotating the bounding box along the Y axis of symmetry to maximize 3D IoU.
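The sweep itself is simple. Below is a minimal sketch, where `iou_3d` is a placeholder for any plain 3D IoU routine over (rotation, translation, scale) boxes (e.g. the routine in the released evaluation code) and the prediction is rotated about its own local Y axis, i.e. the symmetry axis.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def symmetric_iou(pred, gt, iou_3d, num_steps=100):
    """Max 3D IoU over rotations of the prediction about its Y (symmetry) axis.

    pred, gt : (rotation 3x3, translation (3,), scale (3,)) tuples
    iou_3d   : plain 3D IoU function over two such tuples (placeholder)
    """
    R_pred, t_pred, s_pred = pred
    best = 0.0
    for angle in np.linspace(0.0, 2.0 * np.pi, num_steps, endpoint=False):
        # Post-multiplying rotates the box about its own local y axis.
        R = R_pred @ Rotation.from_euler("y", angle).as_matrix()
        best = max(best, iou_3d((R, t_pred, s_pred), gt))
    return best
```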

● Baselines for 3D object detection

The authors first provide baselines for 3D object detection, using two detectors for comparison: 1) MobilePose, a lightweight single-stage network previously proposed by Google (posted on arXiv but not published at a conference); 2) a two-stage SSD + EfficientNet-Lite architecture introduced in this paper. The two architectures are sketched below.

The architecture of the MobilePose model
Architecture of the two-stage model. The red blocks are 1 × 1 convolutional layers, green blocks are depth-wise convolutional layers, and blue blocks are addition layers for skip connection. The black block at the end is a fully connected layer.

Note that both networks regress the 2D projections of the 3D bounding-box vertices, so the 2D predictions still need to be lifted back to 3D; the authors use an EPnP-style algorithm for this. 【We use a similar EPnP algorithm as in [16] to lift the 2D predicted keypoints to 3D.】【Although OpenCV also ships an EPnP implementation, its interface does not fit this setting: classical PnP takes 2D points and their corresponding 3D points as input and outputs an estimated camera pose, whereas here the mutual geometry of the eight projected box vertices is known, the camera is taken as fixed, and the 3D information of the box is what gets recovered. See the released code for the exact computation.】
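For intuition only, here is a hedged sketch of the classical PnP route the bracketed note contrasts against: if the box dimensions were known, OpenCV's EPnP solver could recover the box pose in the camera frame from the eight predicted 2D corners. This is not what the released Lift2DTo3D does (which works without known dimensions); the corner ordering and the known-size assumption here are mine.

```python
import numpy as np
import cv2

def box_pose_from_keypoints(keypoints_2d, box_size, camera_matrix):
    """Recover a box pose with classical EPnP, assuming the box dimensions are known.

    keypoints_2d : (8, 2) predicted pixel coordinates of the box corners
    box_size     : (3,) box dimensions in meters (assumed known -- see note above)
    camera_matrix: (3, 3) intrinsics K
    """
    half = np.asarray(box_size, dtype=np.float64) / 2.0
    # Canonical corners in the box frame, assumed to match the keypoint ordering.
    corners = np.array([(sx, sy, sz) for sx in (-1, 1)
                                     for sy in (-1, 1)
                                     for sz in (-1, 1)], dtype=np.float64) * half
    ok, rvec, tvec = cv2.solvePnP(corners, np.asarray(keypoints_2d, dtype=np.float64),
                                  camera_matrix, None, flags=cv2.SOLVEPNP_EPNP)
    rotation, _ = cv2.Rodrigues(rvec)  # axis-angle -> 3x3 rotation matrix
    return ok, rotation, tvec
```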

Some quantitative detection results are shown below; see the original paper for the interpretation and analysis of the numbers.

Table 2: Evaluation of different baseline models for the Objectron dataset.
Figure 10: Evaluation of MobilePose network [16] on the Objectron dataset
Figure 11: Evaluation of two-stage network on the Objectron dataset

● Baselines for Neural Radiance Field

NeRF in a nutshell: it can learn scene and object representations with fine detail; the model learns the color and density at each point of the scene and can render novel views. The authors use NeRF for two baselines: 1) computing segmentation masks and 2) novel view synthesis. Example results are shown below; see reference [24] for the detailed procedure.

Figure 12: Results of the NeRF [24] model on our dataset. 1) Input image with annotation, 2) NeRF rendering from the same view, 3) Estimated depth map from NeRF, 4) Extracted segmentation mask, and 5) Synthesized novel view.

※ Conclusion

This paper introduces the Objectron dataset: a large scale object-centric dataset of 14,819 short videos in the wild with object pose annotation. We developed an efficient and scalable data collection and annotation framework based on on-device AR libraries. By releasing this dataset, we hope to enable the research community to push the limits of 3D object geometry understanding and foster new research and applications in 3D understanding, video models, object retrieval, view synthesis, and 3D reconstruction. 【Put plainly, the paper's biggest contribution lies in the dataset-building process: the AR-device capture pipeline and the 3D annotation framework. The approach is well worth borrowing.】

3. Details Worth Special Attention

While digesting the dataset and reproducing the evaluation code, the following easily overlooked details came up:

Lifting the 2D projected points back to 3D with EPnP: the official repo https://github.com/google-research-datasets/Objectron/ does not cover this part, and the original C++ code it points to is awkward to use. After some digging, the Python interface turned up on the other official page, https://google.github.io/mediapipe/solutions/objectron, which leads to the open-source mediapipe package https://github.com/google/mediapipe. The code and files related to EPnP (Lift2DTo3D) are:

https://github.com/google/mediapipe/blob/master/mediapipe/python/solutions/objectron.py

https://github.com/google/mediapipe/blob/master/mediapipe/python/solution_base.py

https://github.com/google/mediapipe/blob/master/mediapipe/modules/objectron/calculators/decoder.cc#L201

https://github.com/google/mediapipe/blob/master/mediapipe/modules/objectron/calculators/epnp.cc

Because the Lift2DTo3D function above is wrapped in too many layers to be easy to understand or reuse, further searching showed that CenterPose (arXiv 2021) uses the Objectron dataset as its benchmark, and reading its code directly is more convenient. Its implementation of Lift2DTo3D is here:

https://github.com/NVlabs/CenterPose/blob/4355198a492b72e785a02ee911a9db8d8b63c0ab/src/tools/objectron_eval/eval_image_official.py#L805

Using the paper's newly defined 3D IoU metric: the official repo already provides a usage walkthrough

https://github.com/google-research-datasets/Objectron/blob/master/notebooks/3D_IOU.ipynb

However, it does not cover the 3D IoU computation for symmetric objects; see the workarounds some users posted in the issues, or, following the previous note, refer to the CenterPose code.
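For completeness, a hedged usage sketch of the official IoU module, following the module and class names used in the 3D_IOU.ipynb notebook (verify them against the current repo layout); the symmetry sweep from the previous section can be layered on top of it.

```python
# Assumes the Objectron repo is on PYTHONPATH; names follow the 3D_IOU.ipynb notebook.
import numpy as np
from objectron.dataset import box, iou

rotation = np.eye(3)                      # 3x3 rotation matrix of the box
translation = np.array([0.0, 0.0, 1.0])   # box center in meters
scale = np.array([0.4, 0.3, 0.8])         # box dimensions in meters

b_pred = box.Box.from_transformation(rotation, translation, scale)
b_gt = box.Box.from_transformation(rotation, translation + 0.05, scale)
print(iou.IoU(b_pred, b_gt).iou())        # plain (non-symmetric) 3D IoU
```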

Using or reproducing the baselines in the paper: the two 3D detection baselines are described only in passing, and their training code is not released. The inference code can probably be found by reading https://github.com/google/mediapipe, but that is still quite involved. A more practical route is to swap the backbone for something more common such as YOLOv5, or to turn to CenterPose, whose open-source code happens to train and test on Objectron.

4. Novelty

See the multiple advantages of the Objectron dataset enumerated in the last part of the Introduction section.

5. Takeaways

Dataset papers like this are not common at CVPR, but reading one closely for the first time, I was impressed by its writing structure and the logic of its presentation. Compared with longer dataset papers such as MS-COCO at ECCV or Pascal VOC in IJCV, Objectron is clearly short and to the point. Given its scale and the current stage of the field, the comparison may not be entirely fair, but a few lessons still stand out:

● Can 3D information be annotated directly on 2D images? Without a 3D point cloud, annotating a large number of independent 2D images will necessarily yield fewer than 9 degrees of freedom; could 6, or even 5, degrees of freedom still be achieved?

● How do we establish that annotation labels are both well-founded and consistent? Well-founded means grounded in principles or formulas; consistent means labels from different human annotators must agree with one another. The Annotation Variance section of this paper offers a good template for demonstrating both.

● How should such a paper handle the novelty question it will face at a top venue? Although the Introduction lists as many as six advantages of the Objectron dataset, in my view the design of the baselines in the experiments also matters. Testing only MobilePose would have made the content feel thin (heavy on data, light on methods); adding the two-stage detector and the NeRF section balances the experimental part, which is an approach worth borrowing.
