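Please credit the author and source when reposting: http://blog.csdn.net/john_bh/
Paper: BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs
Authors and team: Google
Venue and date: arXiv, July 2019
Code: BlazeFace (PyTorch version)
Code: BlazeFace (TensorFlow version)
The paper makes three main contributions:
The first two mainly address inference speed; the third improves prediction quality.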
We present BlazeFace, a lightweight and well-performing face detector tailored for mobile GPU inference. It runs at a speed of 200–1000+ FPS on flagship devices. This super-realtime performance enables it to be applied to any augmented reality pipeline that requires an accurate facial region of interest as an input for task-specific models, such as 2D/3D facial keypoint or geometry estimation, facial features or expression classification, and face region segmentation. Our contributions include a lightweight feature extraction network inspired by, but distinct from, MobileNetV1/V2, a GPU-friendly anchor scheme modified from Single Shot MultiBox Detector (SSD), and an improved tie resolution strategy alternative to non-maximum suppression.
In recent years, a variety of architectural improvements in deep networks ([4, 6, 8]) have enabled real-time object detection. In mobile applications, this is usually the first step in a video processing pipeline, and is followed by task-specific components such as segmentation, tracking, or geometry inference. Therefore, it is imperative that the object detection model inference runs as fast as possible, preferably with the performance much higher than just the standard real-time benchmark.
We propose a new face detection framework called BlazeFace that is optimized for inference on mobile GPUs, adapted from the Single Shot Multibox Detector (SSD) framework [4]. Our main contributions are:
1. Related to the inference speed:
1.1. A very compact feature extractor convolutional neural network related in structure to MobileNetV1/V2 [3, 9], designed specifically for lightweight object detection.
1.2. A novel GPU-friendly anchor scheme modified from SSD [4], aimed at effective GPU utilization. Anchors [8], or priors in SSD terminology, are predefined static bounding boxes that serve as the basis for the adjustment by network predictions and determine the prediction granularity.
2. Related to the prediction quality: A tie resolution strategy alternative to non-maximum suppression [4, 6, 8] that achieves stabler, smoother tie resolution between overlapping predictions.
While the proposed framework is applicable to a variety of object detection tasks, in this paper we focus on detecting faces in a mobile phone camera viewfinder. We build separate models for the front-facing and rear-facing cameras owing to the different focal lengths and typical captured object sizes.
In addition to predicting axis-aligned face rectangles, our BlazeFace model produces 6 facial keypoint coordinates (for eye centers, ear tragions, mouth center, and nose tip) that allow us to estimate face rotation (roll angle). This enables passing a rotated face rectangle to later task-specific stages of the video processing pipeline, alleviating the requirement of significant translation and rotation invariance in subsequent processing steps (see Section 5).
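As a rough illustration of how a roll angle could be derived from the keypoints, the sketch below uses only the two eye centers. The paper does not say which keypoints or which sign convention it uses, so the function name `roll_angle_deg`, the choice of the eyes, and the image-coordinate convention (x to the right, y downward) are all assumptions made here.

```python
import math

def roll_angle_deg(eye_a, eye_b):
    """Angle of the line from eye_a to eye_b, in degrees.

    Points are (x, y) in image coordinates (y grows downward).
    eye_a is assumed to be the eye that appears further left in the image.
    A result of 0 means the eyes are level.
    """
    dx = eye_b[0] - eye_a[0]
    dy = eye_b[1] - eye_a[1]
    return math.degrees(math.atan2(dy, dx))

level = roll_angle_deg((40.0, 60.0), (88.0, 60.0))
tilted = roll_angle_deg((40.0, 60.0), (88.0, 108.0))
print(level, tilted)
```

Rotating the face rectangle by the negative of this angle before cropping would give later models an approximately upright face, which is the effect the paragraph above describes.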
BlazeFace model architecture is built around four important design considerations discussed below.
Enlarging the receptive field sizes. While most of the modern convolutional neural network architectures (including both MobileNet [3, 9] versions) tend to favor 3×3 convolution kernels everywhere along the model graph, we note that the depthwise separable convolution computations are dominated by their pointwise parts. On an $s \times s \times c$ input tensor, a $k \times k$ depthwise convolution involves $s^2 c k^2$ multiply-add operations, while the subsequent $1 \times 1$ convolution into $d$ output channels is comprised of $s^2 c d$ such operations, within a factor of $d/k^2$ of the depthwise part.
In practice, for instance, on an Apple iPhone X with the Metal Performance Shaders implementation [1], a 3×3 depthwise convolution in 16-bit floating point arithmetic takes 0.07 ms for a $56 \times 56 \times 128$ tensor, while the subsequent 1×1 convolution from 128 to 128 channels is 4.3× slower at 0.3 ms (this is not as significant as the pure arithmetic operation count difference due to fixed costs and memory access factors).
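To put numbers on the operation counts, here is a small sketch that evaluates the two formulas for the 56×56×128 tensor from the iPhone X example. The helper names `depthwise_macs` and `pointwise_macs` are made up for this note. Only multiply-adds are counted, so fixed dispatch costs and memory traffic are ignored.

```python
def depthwise_macs(s, c, k):
    """Multiply-adds of a k x k depthwise convolution on an s x s x c tensor."""
    return s * s * c * k * k

def pointwise_macs(s, c, d):
    """Multiply-adds of a 1 x 1 convolution from c to d channels on an s x s grid."""
    return s * s * c * d

s, c, d = 56, 128, 128
dw3 = depthwise_macs(s, c, 3)
dw5 = depthwise_macs(s, c, 5)
pw = pointwise_macs(s, c, d)

print(dw3, dw5, pw)             # 3612672 10035200 51380224
print(pw / dw3)                 # d / k^2 = 128 / 9, about 14.2
print((dw5 + pw) / (dw3 + pw))  # cost of the 5x5 variant relative to 3x3, about 1.12
```

By pure arithmetic the pointwise part is about 14.2× the 3×3 depthwise part, whereas the measured gap quoted above is 4.3×. Going from a 3×3 to a 5×5 depthwise kernel raises the multiply-add count of the separable pair by only about 12% in this setting.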
This observation implies that increasing the kernel size of the depthwise part is relatively cheap. We employ 5×5 kernels in our model architecture bottlenecks, trading the kernel size increase for the decrease in the total amount of such bottlenecks required to reach a particular receptive field size (Figure 1).
The resulting BlazeBlock comes in two variants, single and double (Figure 1).
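The trade between kernel size and depth can be checked with the usual receptive-field recurrence for stacked convolutions. The sketch assumes stride 1 and no dilation, which is a simplification of the real network, and the target size of 17 is just an example value.

```python
def layers_needed(kernel, target_rf):
    """Number of stacked stride-1 k x k convolutions needed
    to reach a receptive field of at least target_rf pixels."""
    rf, n = 1, 0
    while rf < target_rf:
        rf += kernel - 1  # each layer widens the field by k - 1
        n += 1
    return n

print(layers_needed(3, 17), layers_needed(5, 17))  # 8 4
```

Under these assumptions, 5×5 kernels reach the same receptive field with half as many layers as 3×3 kernels.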
A MobileNetV2 [9] bottleneck contains subsequent depth-increasing expansion and depth-decreasing projection pointwise convolutions separated by a non-linearity. To accommodate for the fewer number of channels in the intermediate tensors, we swap these stages so that the residual connections in our bottlenecks operate in the “expanded” (increased) channel resolution.
Finally, the low overhead of a depthwise convolution allows us to introduce another such layer between these two pointwise convolutions, accelerating the receptive field size progression even further. This forms the essence of a double BlazeBlock that is used as the bottleneck of choice for the higher abstraction level layers of BlazeFace (see Figure 1, right).
Feature extractor. For a specific example, we focus on the feature extractor for the front-facing camera model. It has to account for a smaller range of object scales and therefore has lower computational demands. The extractor takes an RGB input of 128×128 pixels and consists of a 2D convolution followed by 5 single BlazeBlocks and 6 double BlazeBlocks (see Table 4 in Appendix A for the full layout). The highest tensor depth (channel resolution) is 96, while the lowest spatial resolution is 8×8 (in contrast to SSD, which reduces the resolution all the way down to 1×1).
Anchor scheme. SSD-like object detection models rely on pre-defined fixed-size base bounding boxes called priors, or anchors in Faster-R-CNN [8] terminology. A set of regression (and possibly classification) parameters such as center offset and dimension adjustments is predicted for each anchor. They are used to adjust the pre-defined anchor position into a tight bounding rectangle.
It is a common practice to define anchors at multiple resolution levels in accordance with the object scale ranges. Aggressive downsampling is also a means for computational resource optimization. A typical SSD model uses predictions from 1×1, 2×2, 4×4, 8×8, and 16×16 feature map sizes. However, the success of the Pooling Pyramid Network (PPN) architecture [7] implies that additional computations could be redundant after reaching a certain feature map resolution.
A key feature specific to GPU as opposed to CPU computation is a noticeable fixed cost of dispatching a particular layer computation, which becomes relatively significant for deep low-resolution layers inherent to popular CPU-tailored architectures. As an example, in one experiment we observed that out of 4.9 ms of MobileNetV1 inference time only 3.9 ms were spent in actual GPU shader computation.
Taking this into consideration, we have adopted an alternative anchor scheme that stops at the 8×8 feature map dimensions without further downsampling (Figure 2). We have replaced 2 anchors per pixel in each of the 8×8, 4×4 and 2×2 resolutions by 6 anchors at 8×8. Due to the limited variance in human face aspect ratios, limiting the anchors to the 1:1 aspect ratio was found sufficient for accurate face detection.
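The anchor layout can be sketched by enumerating cell centers per feature map. The text above only describes what happens from 8×8 downward. The additional 16×16 level with 2 anchors per cell is taken from the paper's Figure 2 and should be read as an assumption of this note, as should the helper name `make_anchors`. Anchor sizes are left out because the text does not give them.

```python
def make_anchors(levels):
    """Build anchor centers for a list of (grid_size, anchors_per_cell) levels.

    Returns (cx, cy) pairs normalized to [0, 1]. All anchors in a cell
    share the same center because only the 1:1 aspect ratio is used.
    """
    anchors = []
    for grid, per_cell in levels:
        for row in range(grid):
            for col in range(grid):
                cx = (col + 0.5) / grid
                cy = (row + 0.5) / grid
                anchors.extend([(cx, cy)] * per_cell)
    return anchors

# SSD-like layout: keeps downsampling to 2x2 (assumed levels, see lead-in).
ssd_like = make_anchors([(16, 2), (8, 2), (4, 2), (2, 2)])
# BlazeFace-like layout: stops at 8x8 and places 6 anchors per cell there.
blaze_like = make_anchors([(16, 2), (8, 6)])

print(len(ssd_like), len(blaze_like))  # 680 896
```

In this sketch the 2 + 2 + 2 anchors that would have been spread over the 8×8, 4×4 and 2×2 maps are all placed on the 8×8 grid, so no layers below 8×8 need to be dispatched.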
Post-processing. As our feature extractor is not reducing the resolution below 8×8, the number of anchors overlapping a given object significantly increases with the object size. In a typical non-maximum suppression scenario, only one of the anchors “wins” and is used as the final algorithm outcome. When such a model is applied to subsequent video frames, the predictions tend to fluctuate between different anchors and exhibit temporal jitter (human-perceptible noise).
To minimize this phenomenon, we replace the suppression algorithm with a blending strategy that estimates the regression parameters of a bounding box as a weighted mean between the overlapping predictions. It incurs virtually no additional cost to the original NMS algorithm. For our face detection task, this adjustment resulted in a 10% increase in accuracy.
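A minimal sketch of such a blending step is shown below. The paper only says "weighted mean between the overlapping predictions", so weighting by detection score, the IoU threshold of 0.3, and the function names `iou` and `blend_overlapping` are assumptions of this note rather than the authors' exact procedure.

```python
def iou(a, b):
    """Intersection-over-union of boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def blend_overlapping(boxes, scores, iou_threshold=0.3):
    """Like greedy NMS, but each group of overlapping boxes is replaced by
    its score-weighted mean instead of keeping only the top box.
    Scores are assumed to be strictly positive."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    results = []
    while order:
        top = order[0]
        # The top box always belongs to its own group.
        group = [top] + [i for i in order[1:]
                         if iou(boxes[top], boxes[i]) > iou_threshold]
        total = sum(scores[i] for i in group)
        blended = tuple(
            sum(scores[i] * boxes[i][k] for i in group) / total
            for k in range(4)
        )
        results.append((blended, scores[top]))
        order = [i for i in order if i not in group]
    return results

demo = blend_overlapping(
    [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)],
    [0.9, 0.6, 0.8],
)
print(demo)
```

The grouping loop is the same one that greedy NMS already runs. The only extra work is the weighted average per group, which fits the claim that the change adds virtually no cost.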
We quantify the amount of jitter by passing several slightly offset versions of the same input image into the network and observing how the model outcomes (adjusted to account for the translation) are affected. After the described tie resolution strategy modification, the jitter metric, defined as the root mean squared difference between the predictions for the original and displaced inputs, dropped down by 40% on our frontal camera dataset and by 30% on a rear-facing camera dataset containing smaller faces.
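The jitter metric can be sketched as follows for a single predicted point. The function name `jitter_rms` and the reduction of a prediction to one (x, y) point are simplifications introduced here. In the paper the metric is computed over the model's regression outputs.

```python
import math

def jitter_rms(original, displaced, offsets):
    """Root mean squared difference between the prediction on the original
    image and the predictions on shifted copies, after undoing each shift.

    original:  (x, y) predicted on the unshifted input
    displaced: list of (x, y) predicted on the shifted inputs
    offsets:   list of (dx, dy) shifts that were applied to the input
    """
    squared = []
    for (px, py), (dx, dy) in zip(displaced, offsets):
        ex = (px - dx) - original[0]  # adjust for the known translation
        ey = (py - dy) - original[1]
        squared.append(ex * ex + ey * ey)
    return math.sqrt(sum(squared) / len(squared))

stable = jitter_rms((10.0, 10.0), [(11.0, 10.0), (10.0, 11.0)], [(1, 0), (0, 1)])
noisy = jitter_rms((10.0, 10.0), [(11.0, 10.0), (10.0, 12.0)], [(1, 0), (0, 1)])
print(stable, noisy)
```

A detector that follows the input shift exactly scores 0. Any residual after compensation shows up as jitter.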
We trained our model on a dataset of 66K images. For evaluation, we used a private geographically diverse dataset consisting of 2K images. For the frontal camera model, only faces that occupy more than 20% of the image area were considered due to the intended use case (the threshold for the rear-facing camera model was 5%).
The regression parameter errors were normalized by the inter-ocular distance (IOD) for scale invariance, and the median absolute error was measured to be 7.4% of IOD. The jitter metric evaluated by the procedure mentioned above was 3% of IOD.
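For readers unfamiliar with IOD normalization, the sketch below divides each error by the distance between the two ground-truth eye centers and then takes the median. Using the Euclidean distance of a single predicted point as the error is an assumption of this note, as is the name `median_error_fraction_of_iod`. The paper does not spell out the exact per-parameter definition.

```python
import math
import statistics

def median_error_fraction_of_iod(samples):
    """samples: list of (predicted, ground_truth, eye_a, eye_b), each an (x, y) point.
    Returns the median of |predicted - ground_truth| / inter-ocular distance."""
    normalized = []
    for pred, truth, eye_a, eye_b in samples:
        iod = math.hypot(eye_b[0] - eye_a[0], eye_b[1] - eye_a[1])
        err = math.hypot(pred[0] - truth[0], pred[1] - truth[1])
        normalized.append(err / iod)
    return statistics.median(normalized)

samples = [
    ((0.0, 3.0), (0.0, 0.0), (0.0, 0.0), (30.0, 0.0)),  # error 3, IOD 30
    ((4.0, 0.0), (0.0, 0.0), (0.0, 0.0), (20.0, 0.0)),  # error 4, IOD 20
    ((0.0, 2.0), (0.0, 0.0), (0.0, 0.0), (40.0, 0.0)),  # error 2, IOD 40
]
median_fraction = median_error_fraction_of_iod(samples)
print(median_fraction)
```

Dividing by IOD makes the error comparable across faces of different sizes, which is why a large close-up face and a small distant face can share one metric.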
Table 1 shows the average precision (AP) metric [5] (with a standard 0.5 intersection-over-union bounding box match threshold) and the mobile GPU inference time for the proposed frontal face detection network and compares it to a MobileNetV2-based object detector with the same anchor coding scheme (MobileNetV2-SSD). We have used TensorFlow Lite GPU [2] in 16-bit floating point mode as the framework for inference time evaluation.
Table 3 shows the amount of degradation in the regression parameter prediction quality that is caused by the smaller model size. As explored in the following section, this does not necessarily incur a proportional degradation of the whole AR pipeline quality.
The proposed model, operating on the full image or a video frame, can serve as the first step of virtually any face-related computer vision application, such as 2D/3D facial keypoints, contour, or surface geometry estimation, facial features or expression classification, and face region segmentation. The subsequent task in the computer vision pipeline can thus be defined in terms of a proper facial crop. Combined with the few facial keypoint estimates provided by BlazeFace, this crop can be also rotated so that the face inside is centered, scale-normalized and has a roll angle close to zero. This removes the requirement of significant translation and rotation invariance from the task-specific model, allowing for better computational resource allocation.
We illustrate this pipelined approach with a specific example of face contour estimation. In Figure 3, we show how the output of BlazeFace, i.e. the predicted bounding box and the 6 keypoints of the face (red), are further refined by a more complex face contour estimation model applied to a slightly expanded crop. The detailed keypoints yield a finer bounding box estimate (green) that can be reused for tracking in the subsequent frame without running the face detector. To detect failures of this computation saving strategy, the contours model can also detect whether the face is indeed present and reasonably aligned in the provided rectangular crop. Whenever that condition is violated, the BlazeFace face detector is run on the whole video frame again.
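The control flow of this detect-then-track pipeline can be sketched with two placeholder callables. `detect` and `refine` are hypothetical stand-ins for the BlazeFace detector and the contour model, not real APIs. Re-running the detector on the next frame after a failure, rather than on the same frame, is also an assumption of this sketch.

```python
def run_pipeline(frames, detect, refine):
    """detect(frame) -> crop or None
    refine(frame, crop) -> (contours, next_crop, face_ok)

    The detector runs only when there is no valid crop carried over
    from the previous frame."""
    crop = None
    detector_calls = 0
    outputs = []
    for frame in frames:
        if crop is None:
            crop = detect(frame)
            detector_calls += 1
            if crop is None:  # no face found in this frame
                outputs.append(None)
                continue
        contours, next_crop, face_ok = refine(frame, crop)
        if face_ok:
            outputs.append(contours)
            crop = next_crop  # reuse the refined box for tracking
        else:
            outputs.append(None)
            crop = None       # forces the detector on the next frame
    return outputs, detector_calls

# Toy stand-ins: the "contour model" loses the face on frame 2.
def fake_detect(frame):
    return ("box", frame)

def fake_refine(frame, crop):
    return ("contours", frame), ("box", frame), frame != 2

outputs, detector_calls = run_pipeline(range(5), fake_detect, fake_refine)
print(outputs, detector_calls)
```

In the toy run the detector is invoked on only two of the five frames: once at the start and once after the tracking failure.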
The technology described in this paper is driving major AR self-expression applications and AR developer APIs on mobile phones.