Focal Loss for Dense Object Detection(RetinaNet)读书笔记

照例是先翻译,再介绍。

Focal Loss for Dense Object Detection

Abstract

The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. Code is at: https://github.com/facebookresearch/Detectron

迄今为止,精度最高的目标检测器是基于 R-CNN 推广的两阶段方法,其中分类器被应用于一个稀疏的候选目标位置集合。相比之下,在可能目标位置的规则密集采样上直接应用的一阶段检测器有潜力更快、更简单,但其精度迄今仍落后于两阶段检测器。在本文中,我们研究造成这种情况的原因,发现密集检测器训练过程中遇到的极端前景-背景类别不平衡是核心原因。我们提出通过重塑标准交叉熵损失来解决这种类别不平衡,使其降低分配给已被正确分类样本的损失权重。我们提出的 Focal Loss 将训练集中在一组稀疏的难例上,防止大量简单负样本在训练中淹没检测器。为了评估该损失的有效性,我们设计并训练了一个简单的密集检测器,称为 RetinaNet。结果表明,使用 focal loss 训练时,RetinaNet 能够达到以往一阶段检测器的速度,同时超过所有现有最先进两阶段检测器的精度。代码位于:https://github.com/facebookresearch/Detectron

1. Introduction

Current state-of-the-art object detectors are based on a two-stage, proposal-driven mechanism. As popularized in the R-CNN framework [11], the first stage generates a sparse set of candidate object locations and the second stage classifies each candidate location as one of the foreground classes or as background using a convolutional neural network. Through a sequence of advances [10, 28, 20, 14], this two-stage framework consistently achieves top accuracy on the challenging COCO benchmark [21].

Despite the success of two-stage detectors, a natural question to ask is: could a simple one-stage detector achieve similar accuracy? One stage detectors are applied over a regular, dense sampling of object locations, scales, and aspect ratios. Recent work on one-stage detectors, such as YOLO [26, 27] and SSD [22, 9], demonstrates promising results, yielding faster detectors with accuracy within 10-40% relative to state-of-the-art two-stage methods.

This paper pushes the envelope further: we present a one-stage object detector that, for the first time, matches the state-of-the-art COCO AP of more complex two-stage detectors, such as the Feature Pyramid Network (FPN) [20] or Mask R-CNN [14] variants of Faster R-CNN [28]. To achieve this result, we identify class imbalance during training as the main obstacle impeding one-stage detector from achieving state-of-the-art accuracy and propose a new loss function that eliminates this barrier.

当前最先进的对象检测器基于两阶段的提案驱动机制。正如在R-CNN框架中流行的那样[11],第一阶段生成稀疏的候选对象位置集,第二阶段使用卷积神经网络将每个候选位置分类为前景之一或背景。通过一系列进展[10,28,20,14],此两阶段框架始终在具有挑战性的COCO基准测试中实现最高准确性[21]。

尽管两阶段检测器取得了成功,一个自然的问题是:一个简单的一阶段检测器能否达到类似的精度?一阶段检测器在目标位置、尺度和长宽比的规则密集采样上进行预测。YOLO [26,27] 和 SSD [22,9] 等一阶段检测器的近期工作展示了可喜的结果:它们速度更快,精度与最新两阶段方法的相对差距在 10%~40% 以内。

本文进一步突破了这一界限:我们提出了一种一阶段目标检测器,它首次在 COCO 上达到了更复杂的两阶段检测器(例如 Faster R-CNN [28] 的特征金字塔网络(FPN)[20] 或 Mask R-CNN [14] 变体)的最先进 AP。为了获得这一结果,我们把训练过程中的类别不平衡确定为阻碍一阶段检测器达到最先进精度的主要障碍,并提出了消除这一障碍的新损失函数。

Class imbalance is addressed in R-CNN-like detectors by a two-stage cascade and sampling heuristics. The proposal stage (e.g., Selective Search [35], EdgeBoxes [39], DeepMask [24, 25], RPN [28]) rapidly narrows down the number of candidate object locations to a small number (e.g., 1-2k), filtering out most background samples. In the second classification stage, sampling heuristics, such as a fixed foreground-to-background ratio (1:3), or online hard example mining (OHEM) [31], are performed to maintain a manageable balance between foreground and background.

在类 R-CNN 检测器中,类别不平衡通过两级级联和启发式采样来解决。候选框生成阶段(例如选择性搜索(Selective Search)[35]、EdgeBoxes [39]、DeepMask [24,25]、RPN [28])快速地将候选目标位置的数目缩小到很小的规模(例如 1–2k),过滤掉了大多数背景样本。在第二个分类阶段,则通过启发式采样来维持前景与背景之间可控的平衡,例如固定的前景-背景比例(1:3),或在线难例挖掘(OHEM)[31]。

In contrast, a one-stage detector must process a much larger set of candidate object locations regularly sampled across an image. In practice this often amounts to enumerating 100k locations that densely cover spatial positions, scales, and aspect ratios. While similar sampling heuristics may also be applied, they are inefficient as the training procedure is still dominated by easily classified background examples. This inefficiency is a classic problem in object detection that is typically addressed via techniques such as bootstrapping [33, 29] or hard example mining [37, 8, 31].

相比之下,一阶段检测器必须处理在整幅图像上规则采样得到的、规模大得多的候选目标位置集合。在实践中,这通常相当于枚举约 10 万个密集覆盖空间位置、尺度和长宽比的位置。虽然也可以应用类似的采样启发式方法,但由于训练过程仍然被容易分类的背景样本主导,这些方法效率很低。这种低效是目标检测中的经典问题,通常通过自举(bootstrapping)[33,29]或难例挖掘(hard example mining)[37,8,31]等技术来解决。

In this paper, we propose a new loss function that acts as a more effective alternative to previous approaches for dealing with class imbalance. The loss function is a dynamically scaled cross entropy loss, where the scaling factor decays to zero as confidence in the correct class increases, see Figure 1. Intuitively, this scaling factor can automatically down-weight the contribution of easy examples during training and rapidly focus the model on hard examples. Experiments show that our proposed Focal Loss enables us to train a high-accuracy, one-stage detector that significantly outperforms the alternatives of training with the sampling heuristics or hard example mining, the previous state-of-the-art techniques for training one-stage detectors. Finally, we note that the exact form of the focal loss is not crucial, and we show other instantiations can achieve similar results.

在本文中,我们提出了一个新的损失函数,作为处理类别不平衡问题的一种更有效的替代方案。该损失函数是一个动态缩放的交叉熵损失,当对正确类别的置信度增加时,缩放因子衰减为零,见图1。直观地说,这个缩放因子可以在训练过程中自动降低简单样本的贡献,并迅速将模型聚焦在难例上。实验表明,我们提出的 Focal Loss 能够训练出高精度的一阶段检测器,其性能明显优于以往训练一阶段检测器的最先进技术,即启发式采样或难例挖掘。最后,我们注意到 focal loss 的确切形式并不重要,其他形式的实现也能取得类似的结果。

(图1:不同 γ 取值下 focal loss 与标准交叉熵损失的对比曲线。图略。)

To demonstrate the effectiveness of the proposed focal loss, we design a simple one-stage object detector called RetinaNet, named for its dense sampling of object locations in an input image. Its design features an efficient in-network feature pyramid and use of anchor boxes. It draws on a variety of recent ideas from [22, 6, 28, 20]. RetinaNet is efficient and accurate; our best model, based on a ResNet-101-FPN backbone, achieves a COCO test-dev AP of 39.1 while running at 5 fps, surpassing the previously best published single-model results from both one and two-stage detectors, see Figure 2.

为了证明所提 focal loss 的有效性,我们设计了一种简单的一阶段目标检测器 RetinaNet,因其对输入图像中的目标位置进行密集采样而得名。其设计特点是高效的网络内特征金字塔和锚框的使用,借鉴了[22,6,28,20]中的多种近期思想。RetinaNet 既高效又准确:我们的最佳模型基于 ResNet-101-FPN 主干网,在以 5 fps 运行的同时 COCO test-dev AP 达到 39.1,超过了此前发表的一阶段与两阶段检测器的最佳单模型结果,见图2。

(图2:COCO test-dev 上的速度/精度权衡对比。图略。)

2. Related Work

Classic Object Detectors: The sliding-window paradigm, in which a classifier is applied on a dense image grid, has a long and rich history. One of the earliest successes is the classic work of LeCun et al. who applied convolutional neural networks to handwritten digit recognition [19, 36]. Viola and Jones [37] used boosted object detectors for face detection, leading to widespread adoption of such models. The introduction of HOG [4] and integral channel features [5] gave rise to effective methods for pedestrian detection. DPMs [8] helped extend dense detectors to more general object categories and had top results on PASCAL [7] for many years. While the sliding-window approach was the leading detection paradigm in classic computer vision, with the resurgence of deep learning [18], two-stage detectors, described next, quickly came to dominate object detection.

经典目标检测器:在密集图像网格上应用分类器的滑动窗口范式有着悠久而丰富的历史。最早的成功之一是 LeCun 等人将卷积神经网络应用于手写数字识别的经典工作[19,36]。Viola 和 Jones [37] 将 boosted 检测器用于人脸检测,使这类模型得到广泛采用。HOG [4] 和积分通道特征[5]的提出为行人检测带来了有效的方法。DPM [8] 帮助把密集检测器扩展到更一般的目标类别,并多年在 PASCAL [7] 上保持最好结果。虽然滑动窗口方法是经典计算机视觉中主导的检测范式,但随着深度学习的复兴[18],下面介绍的两阶段检测器很快主导了目标检测。

Two-stage Detectors: The dominant paradigm in modern object detection is based on a two-stage approach. As pioneered in the Selective Search work [35], the first stage generates a sparse set of candidate proposals that should contain all objects while filtering out the majority of negative locations, and the second stage classifies the proposals into foreground classes / background. R-CNN [11] upgraded the second-stage classifier to a convolutional network yielding large gains in accuracy and ushering in the modern era of object detection. R-CNN was improved over the years, both in terms of speed [15, 10] and by using learned object proposals [6, 24, 28]. Region Proposal Networks (RPN) integrated proposal generation with the second-stage classifier into a single convolution network, forming the Faster RCNN framework [28]. Numerous extensions to this framework have been proposed, e.g. [20, 31, 32, 16, 14].

两阶段检测器:现代目标检测的主导范式基于两阶段方法。正如选择性搜索(Selective Search)[35] 所开创的那样,第一阶段生成一个稀疏的候选框集合,它应当包含所有目标,同时过滤掉大多数负样本位置;第二阶段将这些候选框分类为前景类别或背景。R-CNN [11] 将第二阶段分类器升级为卷积网络,使精度大幅提升,开创了现代目标检测时代。多年来 R-CNN 不断被改进,既包括速度方面[15,10],也包括使用学习得到的候选框(object proposals)[6,24,28]。区域建议网络(RPN)把候选框生成与第二阶段分类器集成到同一个卷积网络中,形成了 Faster R-CNN 框架[28]。此后还有许多针对该框架的扩展,例如[20,31,32,16,14]。

One-stage Detectors: OverFeat [30] was one of the first modern one-stage object detector based on deep networks. More recently SSD [22, 9] and YOLO [26, 27] have renewed interest in one-stage methods. These detectors have been tuned for speed but their accuracy trails that of twostage methods. SSD has a 10-20% lower AP, while YOLO focuses on an even more extreme speed/accuracy trade-off. See Figure 2. Recent work showed that two-stage detectors can be made fast simply by reducing input image resolution and the number of proposals, but one-stage methods trailed in accuracy even with a larger compute budget [17]. In contrast, the aim of this work is to understand if one-stage detectors can match or surpass the accuracy of two-stage detectors while running at similar or faster speeds.

一阶段检测器:OverFeat [30] 是最早基于深度网络的现代一阶段目标检测器之一。最近,SSD [22,9] 和 YOLO [26,27] 重新激发了人们对一阶段方法的兴趣。这些检测器针对速度进行了调优,但精度落后于两阶段方法:SSD 的 AP 低 10–20%,而 YOLO 则专注于更极端的速度/精度权衡,见图2。最近的研究表明,只需降低输入图像分辨率和候选框数量,两阶段检测器也可以变得很快;但即使计算预算更大,一阶段方法在精度上仍然落后[17]。相比之下,本工作的目的是弄清一阶段检测器在以相似或更快的速度运行时,能否达到或超过两阶段检测器的精度。

The design of our RetinaNet detector shares many similarities with previous dense detectors, in particular the concept of ‘anchors’ introduced by RPN [28] and use of features pyramids as in SSD [22] and FPN [20]. We emphasize that our simple detector achieves top results not based on innovations in network design but due to our novel loss.

我们的 RetinaNet 检测器在设计上与以前的密集检测器有许多相似之处,特别是 RPN [28] 引入的“锚(anchor)”概念,以及 SSD [22] 和 FPN [20] 中使用的特征金字塔。我们强调,这个简单的检测器取得最好结果并非依靠网络设计上的创新,而是得益于我们提出的新损失。

Class Imbalance: Both classic one-stage object detection methods, like boosted detectors [37, 5] and DPMs [8], and more recent methods, like SSD [22], face a large class imbalance during training. These detectors evaluate 10^4-10^5 candidate locations per image but only a few locations contain objects. This imbalance causes two problems: (1) training is inefficient as most locations are easy negatives that contribute no useful learning signal; (2) en masse, the easy negatives can overwhelm training and lead to degenerate models. A common solution is to perform some form of hard negative mining [33, 37, 8, 31, 22] that samples hard examples during training or more complex sampling/reweighing schemes [2]. In contrast, we show that our proposed focal loss naturally handles the class imbalance faced by a one-stage detector and allows us to efficiently train on all examples without sampling and without easy negatives overwhelming the loss and computed gradients.

类别不平衡:无论是经典的一阶段目标检测方法,如 boosted 检测器[37,5]和 DPM [8],还是较新的方法,如 SSD [22],在训练过程中都面临严重的类别不平衡。这些检测器在每张图像上要评估 10^4–10^5 个候选位置,但只有很少的位置包含目标。这种不平衡导致两个问题:(1)训练效率低,因为大多数位置都是不提供有用学习信号的简单负样本(easy negatives);(2)这些简单负样本会在整体上压倒训练,导致模型退化。常见的解决方案是进行某种形式的难负例挖掘[33,37,8,31,22],即在训练中只采样难例,或采用更复杂的采样/重新加权方案[2]。相比之下,我们证明了所提出的 focal loss 能自然地处理一阶段检测器面临的类别不平衡,使我们可以在不采样的情况下高效地在所有样本上训练,而不会让简单负样本淹没损失和梯度。

Robust Estimation: There has been much interest in designing robust loss functions (e.g., Huber loss [13]) that reduce the contribution of outliers by down-weighting the loss of examples with large errors (hard examples). In contrast, rather than addressing outliers, our focal loss is designed to address class imbalance by down-weighting inliers (easy examples) such that their contribution to the total loss is small even if their number is large. In other words, the focal loss performs the opposite role of a robust loss: it focuses training on a sparse set of hard examples.

鲁棒估计:人们一直很关注设计鲁棒损失函数(如 Huber 损失[13]),通过降低误差较大的样本(难例)的损失权重来减少离群点的贡献。与之相反,我们的 focal loss 处理的不是离群点,而是类别不平衡:它降低内点(inliers,即简单样本)的权重,使得即便其数量很大,对总损失的贡献也很小。换言之,focal loss 扮演着与鲁棒损失相反的角色:它把训练集中在一组稀疏的难例上。

3. Focal Loss

The Focal Loss is designed to address the one-stage object detection scenario in which there is an extreme imbalance between foreground and background classes during training (e.g., 1:1000). We introduce the focal loss starting from the cross entropy (CE) loss for binary classification:

CE(p, y) = -log(p) if y = 1, and -log(1 - p) otherwise.

Focal Loss 是为了解决训练期间前景类和背景类之间存在极端不平衡(例如 1:1000)的一阶段目标检测场景而设计的。我们从二分类的交叉熵(CE)损失出发引入 focal loss,如上式。

In the above, y ∈ {±1} specifies the ground-truth class and p ∈ [0, 1] is the model's estimated probability for the class with label y = 1. For notational convenience, we define pt:

pt = p if y = 1, and 1 - p otherwise.

其中 y 是真实类别,p 是模型对标签 y = 1 这一类的预测概率。为书写方便,按上式定义 pt。

and rewrite CE(p, y) = CE(pt) = -log(pt).  由于 pt 统一了 y 的两种取值,交叉熵可以直接写成这个形式。

The CE loss can be seen as the blue (top) curve in Figure 1. One notable property of this loss, which can be easily seen in its plot, is that even examples that are easily classified (pt ≫ .5) incur a loss with non-trivial magnitude. When summed over a large number of easy examples, these small loss values can overwhelm the rare class.

在图1中,CE 损失对应蓝色(最上方的)曲线。从曲线上容易看出该损失的一个显著特点:即使是易于分类的样本(pt ≫ 0.5)也会产生不可忽略的损失。当对大量简单样本求和时,这些小的损失值就会淹没稀有类别的损失。
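下面用一个小的数值例子说明这一点(数字纯属示意、为笔者假设):假设一张图里有约 10 万个容易分类的负样本(pt = 0.99)和 100 个还不确定的前景样本(pt = 0.5),把标准交叉熵加起来,简单负样本的总损失仍远大于前景样本。

```python
import math

# 示意用的假设数字:一张图中约 10 万个容易分类的负样本和 100 个前景样本
num_easy_neg, pt_easy = 100_000, 0.99   # 容易分类:预测概率已很接近正确
num_pos, pt_pos = 100, 0.5              # 前景样本:预测还不确定

ce = lambda pt: -math.log(pt)           # 交叉熵 CE(pt) = -log(pt)

easy_total = num_easy_neg * ce(pt_easy)  # ≈ 100000 * 0.01 ≈ 1005
pos_total = num_pos * ce(pt_pos)         # ≈ 100 * 0.69 ≈ 69

print(f"easy negatives total CE: {easy_total:.1f}")
print(f"foreground total CE:     {pos_total:.1f}")
# 每个简单负样本的损失都很小,但求和后比前景样本的总损失大一个数量级以上
```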

3.1 Balanced Cross Entropy

A common method for addressing class imbalance is to introduce a weighting factor α ∈ [0, 1] for class 1 and 1 - α for class -1. In practice α may be set by inverse class frequency or treated as a hyperparameter to set by cross validation. For notational convenience, we define αt analogously to how we defined pt. We write the α-balanced CE loss as:

CE(pt) = -αt log(pt).

解决类别不平衡的一种常用方法是为类别 1 引入加权因子 α ∈ [0, 1],为类别 −1 引入 1 − α。实际中,α 可以按类别频率的倒数设置,也可以作为超参数通过交叉验证来确定。为书写方便,我们仿照 pt 的定义来定义 αt,并将 α-平衡交叉熵损失写成上式。

This loss is a simple extension to CE that we consider as an experimental baseline for our proposed focal loss.

这个损失是对交叉熵损失的一个简单扩展,我们认为它是我们提出的focal loss的实验基线。

3.2 Focal Loss Definition

As our experiments will show, the large class imbalance encountered during training of dense detectors overwhelms the cross entropy loss. Easily classified negatives comprise the majority of the loss and dominate the gradient. While α balances the importance of positive/negative examples, it does not differentiate between easy/hard examples. Instead, we propose to reshape the loss function to down-weight easy examples and thus focus training on hard negatives. More formally, we propose to add a modulating factor (1 - pt)^γ to the cross entropy loss, with tunable focusing parameter γ ≥ 0. We define the focal loss as:

FL(pt) = -(1 - pt)^γ log(pt).

正如我们的实验所显示的,训练密集检测器时遇到的严重类别不平衡会淹没交叉熵损失:容易分类的负样本构成了损失的大部分并主导了梯度。虽然 α 平衡了正/负样本的重要性,但它不区分简单/困难样本。因此,我们建议重塑损失函数,降低简单样本的权重,从而把训练的重点放在难分的负样本上。形式上,我们在交叉熵损失中加入调制因子 (1 − pt)^γ,其中 γ ≥ 0 是可调的聚焦参数,并按上式定义 focal loss。

The focal loss is visualized for several values of γ ∈ [0, 5] in Figure 1. We note two properties of the focal loss. (1) When an example is misclassified and pt is small, the modulating factor is near 1 and the loss is unaffected. As pt → 1, the factor goes to 0 and the loss for well-classified examples is down-weighted. (2) The focusing parameter γ smoothly adjusts the rate at which easy examples are down-weighted. When γ = 0, FL is equivalent to CE, and as γ is increased the effect of the modulating factor is likewise increased (we found γ = 2 to work best in our experiments).

图1展示了 γ 在 0 到 5 之间取若干值时的 focal loss。可以看出 focal loss 具有两个特性:(1)当样本被错分且 pt 很小时,调制因子接近 1,损失几乎不变;随着 pt 趋向于 1,调制因子趋向于 0,即正确分类样本的损失被降权。(2)聚焦参数 γ 平滑地调节简单样本被降权的速率:当 γ = 0 时,FL 等价于 CE;随着 γ 增大,调制因子的作用也随之增强(我们在实验中发现 γ = 2 效果最好)。

Intuitively, the modulating factor reduces the loss contribution from easy examples and extends the range in which an example receives low loss. For instance, with γ = 2, an example classified with pt = 0.9 would have 100× lower loss compared with CE and with pt ≈ 0.968 it would have 1000× lower loss. This in turn increases the importance of correcting misclassified examples (whose loss is scaled down by at most 4× for pt ≤ .5 and γ = 2).

直观地,调制因子减少了简单样本的损失贡献,并扩大了样本获得低损失的范围。例如,当 γ = 2 时,pt = 0.9 的样本与 CE 相比损失降低 100 倍;pt ≈ 0.968 时损失降低 1000 倍。这反过来提高了纠正错误分类样本的重要性(对于 pt ≤ 0.5 且 γ = 2 的样本,其损失最多只缩小 4 倍)。

In practice we use an α-balanced variant of the focal loss:

FL(pt) = -αt (1 - pt)^γ log(pt).

实际中,我们使用 focal loss 的 α-平衡变体,如上式。

We adopt this form in our experiments as it yields slightly improved accuracy over the non-α-balanced form. Finally, we note that the implementation of the loss layer combines the sigmoid operation for computing p with the loss computation, resulting in greater numerical stability.

我们在实验中采用了这种形式,因为它比非 α-平衡形式的精度稍有提高。最后,我们注意到损失层的实现把计算 p 的 sigmoid 运算与损失计算结合在一起,从而获得更好的数值稳定性。
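下面给出 α-平衡 focal loss 的一个基于 PyTorch 的参考实现草图(并非论文官方代码,函数名与默认参数均为笔者设定)。它按照上文的做法把 sigmoid 与损失计算合并在一起(通过 binary_cross_entropy_with_logits)以保证数值稳定,γ = 2、α = 0.25 对应文中推荐的设置。

```python
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0, reduction="sum"):
    """α-平衡 focal loss 的示意实现。

    logits:  原始分类输出(未过 sigmoid),任意形状
    targets: 与 logits 同形状的 0/1 标签(1 = 前景类)
    """
    # 与 sigmoid 合并计算的交叉熵,数值上更稳定:ce = -log(pt)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    pt = p * targets + (1 - p) * (1 - targets)              # pt:预测正确的概率
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    loss = alpha_t * (1 - pt) ** gamma * ce                 # FL = -αt (1-pt)^γ log(pt)

    if reduction == "sum":
        return loss.sum()
    if reduction == "mean":
        return loss.mean()
    return loss

# 用法示意:按论文做法,对全部锚求和后除以被分配为前景的锚数来归一化
logits = torch.randn(8, 1000)                       # 假设 8 张图、每张 1000 个锚、单类
targets = (torch.rand(8, 1000) < 0.01).float()      # 假设约 1% 的锚是前景
num_fg = targets.sum().clamp(min=1)
loss = sigmoid_focal_loss(logits, targets) / num_fg
```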

While in our main experimental results we use the focal loss definition above, its precise form is not crucial. In the appendix we consider other instantiations of the focal loss and demonstrate that these can be equally effective.

虽然在我们的主要实验结果中,我们使用了上述的focal loss定义,但它的精确形式并不重要。在附录中,我们考虑了focal loss的其他实例,并证明了这些实例同样有效。

3.3 Class Imbalance and Model Initialization


Binary classification models are by default initialized to have equal probability of outputting either y = -1 or 1. Under such an initialization, in the presence of class imbalance, the loss due to the frequent class can dominate total loss and cause instability in early training. To counter this, we introduce the concept of a ‘prior’ for the value of p estimated by the model for the rare class (foreground) at the start of training. We denote the prior by π and set it so that the model’s estimated p for examples of the rare class is low, e.g. 0.01. We note that this is a change in model initialization (see x4.1) and not of the loss function. We found this to improve training stability for both the cross entropy and focal loss in the case of heavy class imbalance.

默认情况下,二分类模型被初始化为输出 y = −1 或 1 的概率相等。在这样的初始化下,当存在类别不平衡时,数量多的类别所产生的损失会主导总损失,造成训练早期的不稳定。为此,我们为训练开始时模型对稀有类(前景)估计的 p 值引入“先验”的概念。我们用 π 表示该先验,并把它设置得使模型对稀有类样本估计的 p 很低,例如 0.01。我们注意到这是对模型初始化的改变(见 4.1 节),而不是对损失函数的改变。我们发现,在类别严重不平衡的情况下,这能同时提高交叉熵和 focal loss 的训练稳定性。

3.4 Class Imbalance and Two-stage Detectors

Two-stage detectors are often trained with the cross entropy loss without use of α-balancing or our proposed loss. Instead, they address class imbalance through two mechanisms: (1) a two-stage cascade and (2) biased minibatch sampling. The first cascade stage is an object proposal mechanism [35, 24, 28] that reduces the nearly infinite set of possible object locations down to one or two thousand. Importantly, the selected proposals are not random, but are likely to correspond to true object locations, which removes the vast majority of easy negatives. When training the second stage, biased sampling is typically used to construct minibatches that contain, for instance, a 1:3 ratio of positive to negative examples. This ratio is like an implicit α- balancing factor that is implemented via sampling. Our proposed focal loss is designed to address these mechanisms in a one-stage detection system directly via the loss function.

两阶段检测器通常直接用交叉熵损失训练,而不使用 α-平衡或我们提出的损失。它们通过两种机制来解决类别不平衡:(1)两级级联和(2)有偏的小批量采样。第一级级联是候选框生成机制[35,24,28],它把几乎无限多的可能目标位置减少到一两千个。重要的是,所选的候选框并非随机,而是很可能对应真实目标位置,这就去除了绝大多数简单负样本。在训练第二阶段时,通常使用有偏采样来构造小批量,例如正负样本比例为 1:3。这个比例就像一个通过采样实现的隐式 α-平衡因子。我们提出的 focal loss 则是直接通过损失函数,在一阶段检测系统中实现这些机制。

4. RetinaNet Detector

RetinaNet is a single, unified network composed of a backbone network and two task-specific subnetworks. The backbone is responsible for computing a convolutional feature map over an entire input image and is an off-the-self convolutional network. The first subnet performs convolutional object classification on the backbone’s output; the second subnet performs convolutional bounding box regression. The two subnetworks feature a simple design that we propose specifically for one-stage, dense detection, see Figure 3. While there are many possible choices for the details of these components, most design parameters are not particularly sensitive to exact values as shown in the experiments. We describe each component of RetinaNet next.

RetinaNet 是一个单一、统一的网络,由一个主干网和两个针对特定任务的子网组成。主干网负责在整幅输入图像上计算卷积特征图,是一个现成的(off-the-shelf)卷积网络。第一个子网在主干网的输出上做卷积式的目标分类;第二个子网做卷积式的边界框回归。这两个子网采用了我们专门为一阶段密集检测提出的简单设计,见图3。虽然这些组件的细节有许多可能的选择,但实验表明,大多数设计参数对具体取值并不特别敏感。下面我们依次介绍 RetinaNet 的各个组成部分。

(图3:RetinaNet 网络结构:(a)(b) ResNet-FPN 主干,(c) 分类子网,(d) 框回归子网。图略。)

Feature Pyramid Network Backbone: We adopt the Feature Pyramid Network (FPN) from [20] as the backbone network for RetinaNet. In brief, FPN augments a standard convolutional network with a top-down pathway and lateral connections so the network efficiently constructs a rich, multi-scale feature pyramid from a single resolution input image, see Figure 3(a)-(b). Each level of the pyramid can be used for detecting objects at a different scale. FPN improves multi-scale predictions from fully convolutional networks (FCN) [23], as shown by its gains for RPN [28] and DeepMask-style proposals [24], as well as at two-stage detectors such as Fast R-CNN [10] or Mask R-CNN [14]. Following [20], we build FPN on top of the ResNet architecture [16]. We construct a pyramid with levels P3 through P7, where l indicates pyramid level (Pl has resolution 2^l lower than the input). As in [20] all pyramid levels have C = 256 channels. Details of the pyramid generally follow [20] with a few modest differences. While many design choices are not crucial, we emphasize the use of the FPN backbone is; preliminary experiments using features from only the final ResNet layer yielded low AP.

特征金字塔网络主干:我们采用文献[20]中的特征金字塔网络(FPN)作为 RetinaNet 的主干网。简言之,FPN 通过自上而下的路径和横向连接来增强标准卷积网络,使网络能够从单一分辨率的输入图像高效地构建丰富的多尺度特征金字塔,见图3(a)-(b)。金字塔的每一层都可以用于检测不同尺度的目标。FPN 改进了全卷积网络(FCN)[23]的多尺度预测,这一点在 RPN [28]、DeepMask 风格的候选框[24]以及 Fast R-CNN [10]、Mask R-CNN [14] 等两阶段检测器上都得到了验证。按照[20],我们在 ResNet 架构[16]之上构建 FPN,构造 P3 到 P7 层的金字塔,其中 l 表示金字塔层级(Pl 的分辨率是输入的 1/2^l)。与[20]一样,所有金字塔层都有 C = 256 个通道。金字塔的细节大体遵循[20],只有少量差异。虽然许多设计选择并不关键,但我们强调使用 FPN 主干网是关键的:只使用 ResNet 最后一层特征的初步实验得到的 AP 很低。

Anchors: We use translation-invariant anchor boxes similar to those in the RPN variant in [20]. The anchors have areas of 32^2 to 512^2 on pyramid levels P3 to P7, respectively. As in [20], at each pyramid level we use anchors at three aspect ratios {1:2, 1:1, 2:1}. For denser scale coverage than in [20], at each level we add anchors of sizes {2^0, 2^(1/3), 2^(2/3)} of the original set of 3 aspect ratio anchors. This improves AP in our setting. In total there are A = 9 anchors per level and across levels they cover the scale range 32-813 pixels with respect to the network's input image.

锚(Anchors):我们使用与[20]中 RPN 变体类似的平移不变锚框。在金字塔的 P3 到 P7 层上,锚的面积分别为 32^2 到 512^2。与[20]一样,在每个金字塔层我们使用三种长宽比 {1:2, 1:1, 2:1} 的锚。为了获得比[20]更密集的尺度覆盖,在每一层我们在原有 3 种长宽比锚的基础上,再加入尺寸为原尺寸 {2^0, 2^(1/3), 2^(2/3)} 倍的锚。这在我们的设置中提升了 AP。这样每层共有 A = 9 个锚,跨层来看它们相对于网络输入图像覆盖了 32–813 像素的尺度范围。

Each anchor is assigned a length K one-hot vector of classification targets, where K is the number of object classes, and a 4-vector of box regression targets. We use the assignment rule from RPN [28] but modified for multiclass detection and with adjusted thresholds. Specifically, anchors are assigned to ground-truth object boxes using an intersection-over-union (IoU) threshold of 0.5; and to background if their IoU is in [0, 0.4). As each anchor is assigned to at most one object box, we set the corresponding entry in its length K label vector to 1 and all other entries to 0. If an anchor is unassigned, which may happen with overlap in [0.4, 0.5), it is ignored during training. Box regression targets are computed as the offset between each anchor and its assigned object box, or omitted if there is no assignment.

每个锚被赋予一个长度为 K 的 one-hot 分类目标向量(K 是目标类别数)和一个 4 维的框回归目标。我们沿用 RPN [28] 的分配规则,但针对多类别检测做了修改并调整了阈值。具体地,交并比(IoU)不低于 0.5 的锚被分配给对应的真值目标框;IoU 在 [0, 0.4) 内的锚被分配为背景。由于每个锚最多分配给一个目标框,我们把其长度为 K 的标签向量中对应位置设为 1,其余设为 0。如果某个锚未被分配(IoU 落在 [0.4, 0.5) 时可能发生),则在训练中被忽略。框回归目标计算为每个锚与其被分配的目标框之间的偏移;若该锚没有分配,则省略回归目标。
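下面是这一锚标签分配规则的一个简化示意(numpy 实现,函数与变量名均为笔者假设):IoU ≥ 0.5 记为前景,IoU < 0.4 记为背景,落在 [0.4, 0.5) 的锚在训练中忽略。

```python
import numpy as np

def box_iou(anchors, gt_boxes):
    """计算 [N,4] 锚与 [M,4] 真值框(均为 x1,y1,x2,y2)之间的 IoU,返回 [N,M]。"""
    lt = np.maximum(anchors[:, None, :2], gt_boxes[None, :, :2])
    rb = np.minimum(anchors[:, None, 2:], gt_boxes[None, :, 2:])
    wh = np.clip(rb - lt, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def assign_anchors(anchors, gt_boxes, fg_thresh=0.5, bg_thresh=0.4):
    """返回每个锚的标签:1=前景, 0=背景, -1=忽略;以及各锚匹配到的真值框下标。"""
    iou = box_iou(anchors, gt_boxes)                 # [N, M]
    max_iou = iou.max(axis=1)
    matched_gt = iou.argmax(axis=1)                  # 每个锚最多匹配一个真值框
    labels = np.full(len(anchors), -1, dtype=np.int64)
    labels[max_iou < bg_thresh] = 0                  # IoU ∈ [0, 0.4):背景
    labels[max_iou >= fg_thresh] = 1                 # IoU ≥ 0.5:前景
    return labels, matched_gt                        # [0.4, 0.5) 保持 -1,训练时忽略
```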

Classification Subnet: The classification subnet predicts the probability of object presence at each spatial position for each of the A anchors and K object classes. This subnet is a small FCN attached to each FPN level; parameters of this subnet are shared across all pyramid levels. Its design is simple. Taking an input feature map with C channels from a given pyramid level, the subnet applies four 3×3 conv layers, each with C filters and each followed by ReLU activations, followed by a 3×3 conv layer with KA filters. Finally sigmoid activations are attached to output the KA binary predictions per spatial location, see Figure 3 (c). We use C = 256 and A = 9 in most experiments.

In contrast to RPN [28], our object classification subnet is deeper, uses only 3×3 convs, and does not share parameters with the box regression subnet (described next). We found these higher-level design decisions to be more important than specific values of hyperparameters.

分类子网:分类子网预测A个锚和K个目标类在每个空间位置的存在概率。此子网是附加到每个FPN级别的小型FCN;此子网的参数在所有金字塔级别上共享。它的设计很简单。子网从给定的金字塔级别获取一个带有C通道的输入特征映射,然后应用四个3×3 conv层,每个层都有C个过滤器,每个层都有ReLU激活,然后是一个带KA过滤器的3×3 conv层。最后,sigmoid激活被附加到每个空间位置输出KA二进制预测,参见图3(c)。我们在大多数实验中使用C=256和A=9。

与RPN[28]相比,我们的对象分类子网更深,仅使用3×3 conv,并且不与box回归子网共享参数(下面将介绍)。我们发现这些更高层次的设计决策比超参数的特定值更重要。

Box Regression Subnet: In parallel with the object classification subnet, we attach another small FCN to each pyramid level for the purpose of regressing the offset from each anchor box to a nearby ground-truth object, if one exists. The design of the box regression subnet is identical to the classification subnet except that it terminates in 4A linear outputs per spatial location, see Figure 3 (d). For each of the A anchors per spatial location, these 4 outputs predict the relative offset between the anchor and the groundtruth box (we use the standard box parameterization from RCNN [11]). We note that unlike most recent work, we use a class-agnostic bounding box regressor which uses fewer parameters and we found to be equally effective. The object classification subnet and the box regression subnet, though sharing a common structure, use separate parameters.

框回归子网:与目标分类子网并行,我们在每个金字塔层再接一个小的 FCN,用于把每个锚框到邻近真值目标(若存在)的偏移量回归出来。框回归子网的设计与分类子网相同,只是它在每个空间位置输出 4A 个线性结果,见图3(d)。对于每个空间位置的 A 个锚,这 4 个输出预测锚与真值框之间的相对偏移(我们使用 R-CNN [11] 中的标准框参数化)。我们注意到,与最近的大多数工作不同,我们使用与类别无关的边界框回归器,它参数更少,而我们发现效果同样好。目标分类子网和框回归子网虽然共享相同的结构,但使用各自独立的参数。
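下面按上文描述给出两个子网的一个 PyTorch 结构草图(C = 256、A = 9 为文中设置,K = 80 对应 COCO,其余命名与写法为笔者假设):分类子网是 4 个带 ReLU 的 3×3 卷积接一个输出 KA 通道的 3×3 卷积;框回归子网结构相同,只是最后输出 4A 通道,且两者不共享参数。

```python
import math
import torch
import torch.nn as nn

def make_head(in_ch, out_ch, num_convs=4, mid_ch=256):
    """4 个 3x3 conv + ReLU,再接一个 3x3 conv 输出 out_ch 通道。"""
    layers, ch = [], in_ch
    for _ in range(num_convs):
        layers += [nn.Conv2d(ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True)]
        ch = mid_ch
    layers.append(nn.Conv2d(mid_ch, out_ch, 3, padding=1))
    return nn.Sequential(*layers)

num_classes, num_anchors, C = 80, 9, 256
cls_head = make_head(C, num_classes * num_anchors)   # 每个空间位置 K*A 个 sigmoid 输出
box_head = make_head(C, 4 * num_anchors)             # 每个空间位置 A 组 4 维回归量

# 初始化:所有新 conv 层权重用 σ=0.01 的高斯、偏置 0;
# 分类子网最后一层的偏置用先验 π 初始化(见 4.1 节)
for m in list(cls_head.modules()) + list(box_head.modules()):
    if isinstance(m, nn.Conv2d):
        nn.init.normal_(m.weight, std=0.01)
        nn.init.constant_(m.bias, 0)
pi = 0.01
nn.init.constant_(cls_head[-1].bias, -math.log((1 - pi) / pi))

# 两个子网在所有 FPN 层之间共享参数:对每层特征图都调用同一个 head
p_level = torch.randn(2, C, 50, 50)                  # 假设的某一层 FPN 特征
cls_out, box_out = cls_head(p_level), box_head(p_level)
```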

4.1 Inference and Training

Inference: RetinaNet forms a single FCN comprised of a ResNet-FPN backbone, a classification subnet, and a box regression subnet, see Figure 3. As such, inference involves simply forwarding an image through the network. To improve speed, we only decode box predictions from at most 1k top-scoring predictions per FPN level, after thresholding detector confidence at 0.05. The top predictions from all levels are merged and non-maximum suppression with a threshold of 0.5 is applied to yield the final detections.

推理(Inference):RetinaNet 是由 ResNet-FPN 主干网、分类子网和框回归子网组成的单一 FCN,见图3,因此推理只需把图像在网络中前向传播一次。为了提高速度,在以 0.05 的检测器置信度阈值过滤之后,我们在每个 FPN 层最多只解码得分最高的 1k 个框预测。合并所有层的最高得分预测,并应用阈值为 0.5 的非极大值抑制,得到最终的检测结果。
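下面是这一解码流程的一个简化示意(PyTorch + torchvision,函数名、张量形状均为笔者假设;为简洁起见省略了按类别分别处理与框回归的具体解码):每层先按 0.05 的置信度阈值过滤,再取最多 1k 个最高得分的预测,最后合并各层并做阈值 0.5 的 NMS。

```python
import torch
from torchvision.ops import nms

def decode_detections(per_level_scores, per_level_boxes,
                      score_thresh=0.05, topk=1000, nms_thresh=0.5):
    """per_level_scores: 每层 [N_l] 的得分;per_level_boxes: 每层 [N_l, 4] 的已解码框。"""
    all_boxes, all_scores = [], []
    for scores, boxes in zip(per_level_scores, per_level_boxes):
        keep = scores > score_thresh                 # 置信度阈值 0.05
        scores, boxes = scores[keep], boxes[keep]
        if scores.numel() > topk:                    # 每层最多保留 1k 个最高得分
            scores, idx = scores.topk(topk)
            boxes = boxes[idx]
        all_boxes.append(boxes)
        all_scores.append(scores)
    boxes = torch.cat(all_boxes)
    scores = torch.cat(all_scores)
    keep = nms(boxes, scores, nms_thresh)            # 合并各层后做 NMS(阈值 0.5)
    return boxes[keep], scores[keep]
```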

Focal Loss: We use the focal loss introduced in this work as the loss on the output of the classification subnet. As we will show in Section 5, we find that γ = 2 works well in practice and the RetinaNet is relatively robust to γ ∈ [0.5, 5]. We emphasize that when training RetinaNet, the focal loss is applied to all ~100k anchors in each sampled image. This stands in contrast to common practice of using heuristic sampling (RPN) or hard example mining (OHEM, SSD) to select a small set of anchors (e.g., 256) for each minibatch. The total focal loss of an image is computed as the sum of the focal loss over all ~100k anchors, normalized by the number of anchors assigned to a ground-truth box. We perform the normalization by the number of assigned anchors, not total anchors, since the vast majority of anchors are easy negatives and receive negligible loss values under the focal loss. Finally, we note that α, the weight assigned to the rare class, also has a stable range, but it interacts with γ making it necessary to select the two together (see Tables 1a and 1b). In general, α should be decreased slightly as γ is increased (for γ = 2, α = 0.25 works best).

Focal Loss:我们把本文引入的 focal loss 用作分类子网输出上的损失。正如我们将在第 5 节展示的,γ = 2 在实践中效果良好,并且 RetinaNet 对 γ ∈ [0.5, 5] 都相对鲁棒。我们强调,在训练 RetinaNet 时,focal loss 被施加在每张采样图像中的全部约 10 万个锚上。这与使用启发式采样(RPN)或难例挖掘(OHEM、SSD)为每个小批量选择一小组锚(例如 256 个)的常见做法形成对比。一张图像的总 focal loss 是全部约 10 万个锚上 focal loss 的和,再除以被分配到真值框的锚的数量来归一化。我们用被分配的锚数而不是总锚数做归一化,因为绝大多数锚都是简单负样本,在 focal loss 下得到的损失值可以忽略不计。最后,我们注意到分配给稀有类的权重 α 也有一个稳定的范围,但它与 γ 相互作用,因此需要同时选择两者(见表1a和1b)。一般来说,随着 γ 的增大,α 应略微减小(对于 γ = 2,α = 0.25 效果最好)。

Initialization: We experiment with ResNet-50-FPN and ResNet-101-FPN backbones [20]. The base ResNet-50 and ResNet-101 models are pre-trained on ImageNet1k; we use the models released by [16]. New layers added for FPN are initialized as in [20]. All new conv layers except the final one in the RetinaNet subnets are initialized with bias b = 0 and a Gaussian weight fill with σ = 0.01. For the final conv layer of the classification subnet, we set the bias initialization to b = -log((1 - π)/π), where π specifies that at the start of training every anchor should be labeled as foreground with confidence of ~π. We use π = 0.01 in all experiments, although results are robust to the exact value. As explained in Section 3.3, this initialization prevents the large number of background anchors from generating a large, destabilizing loss value in the first iteration of training.

初始化:我们使用 ResNet-50-FPN 和 ResNet-101-FPN 主干网进行实验[20]。基础的 ResNet-50 和 ResNet-101 模型在 ImageNet1k 上预训练;我们使用[16]发布的模型。为 FPN 新增的层按[20]初始化。RetinaNet 子网中除最后一层外,所有新的卷积层都用偏置 b = 0 和标准差 σ = 0.01 的高斯分布初始化。对于分类子网的最后一个卷积层,我们把偏置初始化为 b = −log((1 − π)/π),其中 π 表示训练开始时每个锚被标记为前景的置信度约为 π。我们在所有实验中使用 π = 0.01,结果对具体取值并不敏感。如 3.3 节所述,这一初始化能防止大量背景锚在训练的第一次迭代中产生过大且不稳定的损失值。
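下面用一个小计算说明这一初始化的含义(纯示意):取 b = −log((1 − π)/π),则初始时 sigmoid(b) = π;再比较 π = 0.01 与默认初始化(等价于 π = 0.5)下,约 10 万个背景锚在第一次迭代中贡献的交叉熵总量。

```python
import math

def prior_bias(pi):
    return -math.log((1 - pi) / pi)

for pi in (0.5, 0.01):
    b = prior_bias(pi)
    p = 1 / (1 + math.exp(-b))              # 初始的前景概率,应等于 π
    ce_per_bg = -math.log(1 - p)            # 单个背景锚的交叉熵 -log(1-p)
    total = 100_000 * ce_per_bg             # 约 10 万个背景锚的总损失
    print(f"pi={pi}: bias={b:.3f}, p={p:.3f}, background CE total ≈ {total:.0f}")

# π=0.5 时背景总损失约 69315,远大于 π=0.01 时的约 1005,
# 这就是默认初始化会在第一轮训练中造成不稳定的原因。
```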

Optimization: RetinaNet is trained with stochastic gradient descent (SGD). We use synchronized SGD over 8 GPUs with a total of 16 images per minibatch (2 images per GPU). Unless otherwise specified, all models are trained for 90k iterations with an initial learning rate of 0.01, which is then divided by 10 at 60k and again at 80k iterations. We use horizontal image flipping as the only form of data augmentation unless otherwise noted. Weight decay of 0.0001 and momentum of 0.9 are used. The training loss is the sum of the focal loss and the standard smooth L1 loss used for box regression [10]. Training time ranges between 10 and 35 hours for the models in Table 1e.

优化:RetinaNet 用随机梯度下降(SGD)训练。我们在 8 块 GPU 上使用同步 SGD,每个小批量共 16 张图像(每块 GPU 2 张)。除非另有说明,所有模型训练 90k 次迭代,初始学习率为 0.01,在 60k 和 80k 次迭代时分别除以 10。除非另有说明,我们只使用水平翻转作为数据增强。权重衰减为 0.0001,动量为 0.9。训练损失是 focal loss 与用于框回归的标准 smooth L1 损失[10]之和。表1e中各模型的训练时间在 10 到 35 小时之间。

5. Experiments

We present experimental results on the bounding box detection track of the challenging COCO benchmark [21]. For training, we follow common practice [1, 20] and use the COCO trainval35k split (union of 80k images from train and a random 35k subset of images from the 40k image val split). We report lesion and sensitivity studies by evaluating on the minival split (the remaining 5k images from val). For our main results, we report COCO AP on the test-dev split, which has no public labels and requires use of the evaluation server.

我们在具有挑战性的 COCO 基准[21]的边界框检测任务上给出实验结果。训练时,我们遵循常见做法[1,20],使用 COCO trainval35k 划分(由 train 的 8 万张图像和从 val 的 4 万张图像中随机抽取的 3.5 万张子集合并而成)。我们在 minival 划分(val 中剩余的 5 千张图像)上评估,给出消融(lesion)与敏感性分析。对于主要结果,我们在 test-dev 划分上报告 COCO AP,该划分没有公开标签,需要使用评估服务器。

5.1 Training Dense Detection

We run numerous experiments to analyze the behavior of the loss function for dense detection along with various optimization strategies. For all experiments we use depth 50 or 101 ResNets [16] with a Feature Pyramid Network (FPN) [20] constructed on top. For all ablation studies we use an image scale of 600 pixels for training and testing.

我们进行了大量实验,分析密集检测下损失函数的行为以及各种优化策略。所有实验均使用深度为 50 或 101 的 ResNet [16],并在其上构建特征金字塔网络(FPN)[20]。所有消融实验都使用 600 像素的图像尺度进行训练和测试。

Network Initialization: Our first attempt to train RetinaNet uses standard cross entropy (CE) loss without any modifications to the initialization or learning strategy. This fails quickly, with the network diverging during training. However, simply initializing the last layer of our model such that the prior probability of detecting an object is π = 0.01 (see Section 4.1) enables effective learning. Training RetinaNet with ResNet-50 and this initialization already yields a respectable AP of 30.2 on COCO. Results are insensitive to the exact value of π so we use π = 0.01 for all experiments.

网络初始化:我们第一次尝试训练 RetinaNet 时使用标准交叉熵(CE)损失,不对初始化或学习策略做任何修改。这很快就失败了,网络在训练中发散。然而,只要把模型最后一层初始化为检测到目标的先验概率为 π = 0.01(见 4.1 节),就可以实现有效学习。用 ResNet-50 训练 RetinaNet,仅凭这一初始化就已在 COCO 上取得 30.2 的 AP。结果对 π 的具体取值不敏感,因此所有实验都使用 π = 0.01。

Balanced Cross Entropy: Our next attempt to improve learning involved using the α-balanced CE loss described in Section 3.1. Results for various α are shown in Table 1a. Setting α = 0.75 gives a gain of 0.9 points AP.

平衡交叉熵:我们改进学习的下一个尝试是使用 3.1 节描述的 α-平衡 CE 损失。不同 α 的结果见表1a。设置 α = 0.75 可带来 0.9 个点的 AP 增益。

Focal Loss: Results using our proposed focal loss are shown in Table 1b. The focal loss introduces one new hyperparameter, the focusing parameter γ, that controls the strength of the modulating term. When γ = 0, our loss is equivalent to the CE loss. As γ increases, the shape of the loss changes so that “easy” examples with low loss get further discounted, see Figure 1. FL shows large gains over CE as γ is increased. With γ = 2, FL yields a 2.9 AP improvement over the α-balanced CE loss.

For the experiments in Table 1b, for a fair comparison we find the best α for each γ. We observe that lower α's are selected for higher γ's (as easy negatives are down-weighted, less emphasis needs to be placed on the positives). Overall, however, the benefit of changing γ is much larger, and indeed the best α's ranged in just [.25, .75] (we tested α ∈ [.01, .999]). We use γ = 2.0 with α = .25 for all experiments but α = .5 works nearly as well (.4 AP lower).


Analysis of the Focal Loss: To understand the focal loss better, we analyze the empirical distribution of the loss of a converged model. For this, we take our default ResNet-101 600-pixel model trained with γ = 2 (which has 36.0 AP). We apply this model to a large number of random images and sample the predicted probability for 10^7 negative windows and 10^5 positive windows. Next, separately for positives and negatives, we compute FL for these samples, and normalize the loss such that it sums to one. Given the normalized loss, we can sort the loss from lowest to highest and plot its cumulative distribution function (CDF) for both positive and negative samples and for different settings for γ (even though model was trained with γ = 2).

Cumulative distribution functions for positive and negative samples are shown in Figure 4. If we observe the positive samples, we see that the CDF looks fairly similar for different values of γ. For example, approximately 20% of the hardest positive samples account for roughly half of the positive loss, as γ increases more of the loss gets concentrated in the top 20% of examples, but the effect is minor.

(图4:正、负样本归一化损失的累积分布函数(CDF),对应不同的 γ 取值。图略。)

The effect of γ on negative samples is dramatically different. For γ = 0, the positive and negative CDFs are quite similar. However, as γ increases, substantially more weight becomes concentrated on the hard negative examples. In fact, with γ = 2 (our default setting), the vast majority of the loss comes from a small fraction of samples. As can be seen, FL can effectively discount the effect of easy negatives, focusing all attention on the hard negative examples.
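下面是复现这一分析思路的一个 numpy 草图(数据为随机模拟、参数为笔者假设,仅示意流程):对采样到的预测概率计算 FL、排序并归一化,得到累积损失分布,再看损失最大的那部分样本占了多少比例的损失。

```python
import numpy as np

def focal_loss_np(pt, gamma):
    return -((1.0 - pt) ** gamma) * np.log(np.clip(pt, 1e-12, 1.0))

def normalized_loss_cdf(pt_samples, gamma):
    """返回按损失从小到大排序后的归一化累积损失(CDF)。"""
    losses = np.sort(focal_loss_np(pt_samples, gamma))
    return np.cumsum(losses) / losses.sum()

rng = np.random.default_rng(0)
# 示意:负样本大多容易分类(pt 接近 1),正样本的分布更分散
pt_neg = np.clip(rng.beta(8, 1, size=1_000_000), 1e-6, 1 - 1e-6)
pt_pos = np.clip(rng.beta(2, 2, size=10_000), 1e-6, 1 - 1e-6)

for gamma in (0.0, 0.5, 2.0):
    cdf_neg = normalized_loss_cdf(pt_neg, gamma)
    # 损失最大的 10% 负样本贡献的损失占比:γ 越大,占比越高
    frac_top10 = 1.0 - cdf_neg[int(0.9 * len(cdf_neg))]
    print(f"gamma={gamma}: top-10% hardest negatives carry {frac_top10:.1%} of the loss")
```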

Online Hard Example Mining (OHEM): [31] proposed to improve training of two-stage detectors by constructing minibatches using high-loss examples. Specifically, in OHEM each example is scored by its loss, non-maximum suppression (nms) is then applied, and a minibatch is constructed with the highest-loss examples. The nms threshold and batch size are tunable parameters. Like the focal loss, OHEM puts more emphasis on misclassified examples, but unlike FL, OHEM completely discards easy examples. We also implement a variant of OHEM used in SSD [22]: after applying nms to all examples, the minibatch is constructed to enforce a 1:3 ratio between positives and negatives to help ensure each minibatch has enough positives.

在线难例样本挖掘(OHEM):[31]提出通过使用高损失示例构造小批量来改进两阶段检测器的训练。具体来说,在OHEM中,每个例子都是按其损失来评分的,然后应用非最大抑制(nms),并用损失最大的例子构造一个小批量。nms阈值和批处理大小是可调参数。像focal loss一样,OHEM更强调错误分类的例子,但不像FL,OHEM完全抛弃了简单的例子。我们还实现了SSD[22]中使用的OHEM变体:在将nms应用于所有示例之后,构建小批量以强制正样本和负样本之间的1:3比率,以帮助确保每个小批量都有足够的正样本。
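作为对比,下面是 OHEM 取样思路的一个简化示意(省略了对候选框先做 NMS 的步骤,只按损失排序取 top-k;1:3 变体额外强制正负比例。函数名与参数均为笔者假设):

```python
import torch

def ohem_select(losses, labels, batch_size=128, pos_ratio=None):
    """losses: [N] 每个候选的损失;labels: [N],1=正样本,0=负样本。
    返回被选入 minibatch 的下标。pos_ratio=0.25 对应 1:3 变体。"""
    if pos_ratio is None:
        # 原始 OHEM:不区分正负,直接取损失最大的 batch_size 个
        return losses.topk(min(batch_size, losses.numel())).indices

    num_pos = int(batch_size * pos_ratio)
    pos_idx = torch.nonzero(labels == 1, as_tuple=True)[0]
    neg_idx = torch.nonzero(labels == 0, as_tuple=True)[0]
    pos_sel = pos_idx[losses[pos_idx].topk(min(num_pos, pos_idx.numel())).indices]
    num_neg = batch_size - pos_sel.numel()
    neg_sel = neg_idx[losses[neg_idx].topk(min(num_neg, neg_idx.numel())).indices]
    return torch.cat([pos_sel, neg_sel])
```

与 focal loss 不同,这种做法把没有被选中的简单样本完全排除在梯度之外,而 focal loss 只是把它们的权重压低。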

We test both OHEM variants in our setting of one-stage detection which has large class imbalance. Results for the original OHEM strategy and the ‘OHEM 1:3’ strategy for selected batch sizes and nms thresholds are shown in Table 1d. These results use ResNet-101, our baseline trained with FL achieves 36.0 AP for this setting. In contrast, the best setting for OHEM (no 1:3 ratio, batch size 128, nms of .5) achieves 32.8 AP. This is a gap of 3.2 AP, showing FL is more effective than OHEM for training dense detectors. We note that we tried other parameter setting and variants for OHEM but did not achieve better results.

我们在存在严重类别不平衡的一阶段检测设置中测试了这两种 OHEM 变体。原始 OHEM 策略以及不同批量大小和 nms 阈值下的“OHEM 1:3”策略的结果见表1d。这些结果使用 ResNet-101;在同样设置下,我们用 FL 训练的基线达到 36.0 AP。相比之下,OHEM 的最佳设置(不使用 1:3 比例、批量大小 128、nms 阈值 0.5)达到 32.8 AP,相差 3.2 AP,说明在训练密集检测器时 FL 比 OHEM 更有效。我们也尝试了 OHEM 的其他参数设置和变体,但没有取得更好的结果。

Hinge Loss: Finally, in early experiments, we attempted to train with the hinge loss [13] on pt, which sets loss to 0 above a certain value of pt. However, this was unstable and we did not manage to obtain meaningful results. Results exploring alternate loss functions are in the appendix.

5.2 Model Architecture Design

Anchor Density: One of the most important design factors in a one-stage detection system is how densely it covers the space of possible image boxes. Two-stage detectors can classify boxes at any position, scale, and aspect ratio using a region pooling operation [10]. In contrast, as one-stage detectors use a fixed sampling grid, a popular approach for achieving high coverage of boxes in these approaches is to use multiple ‘anchors’ [28] at each spatial position to cover boxes of various scales and aspect ratios.

锚密度:一阶段检测系统最重要的设计因素之一,是它对可能的图像框空间覆盖得有多密。两阶段检测器可以通过区域池化操作[10]对任意位置、尺度和长宽比的框进行分类。相比之下,一阶段检测器使用固定的采样网格,因此在这类方法中实现高覆盖率的常见做法,是在每个空间位置使用多个“锚”[28]来覆盖不同尺度和长宽比的框。

We sweep over the number of scale and aspect ratio anchors used at each spatial position and each pyramid level in FPN. We consider cases from a single square anchor at each location to 12 anchors per location spanning 4 sub-octave scales (2^(k/4), for k ≤ 3) and 3 aspect ratios [0.5, 1, 2]. Results using ResNet-50 are shown in Table 1c. A surprisingly good AP (30.3) is achieved using just one square anchor. However, the AP can be improved by nearly 4 points (to 34.0) when using 3 scales and 3 aspect ratios per location. We used this setting for all other experiments in this work.

我们在 FPN 的每个空间位置和每个金字塔层上,对尺度和长宽比锚的数量进行了扫描。我们考虑了从每个位置只有一个方形锚,到每个位置 12 个锚(跨越 4 个亚倍频程尺度 2^(k/4),k ≤ 3,以及 3 种长宽比 [0.5, 1, 2])的各种情况。使用 ResNet-50 的结果见表1c。仅用一个方形锚就能取得出人意料的好 AP(30.3)。而当每个位置使用 3 种尺度和 3 种长宽比时,AP 可提高近 4 个点(达到 34.0)。本文其余所有实验都使用这一设置。

Finally, we note that increasing beyond 6-9 anchors did not show further gains. Thus while two-stage systems can classify arbitrary boxes in an image, the saturation of performance w.r.t. density implies the higher potential density of two-stage systems may not offer an advantage.

最后,我们注意到把锚的数量增加到 6–9 个以上并没有带来进一步的收益。因此,虽然两阶段系统可以对图像中任意的框进行分类,但性能随采样密度的饱和意味着,两阶段系统潜在的更高密度可能并不能带来优势。

Speed versus Accuracy: Larger backbone networks yield higher accuracy, but also slower inference speeds. Likewise for input image scale (defined by the shorter image side). We show the impact of these two factors in Table 1e. In Figure 2 we plot the speed/accuracy trade-off curve for RetinaNet and compare it to recent methods using public numbers on COCO test-dev. The plot reveals that RetinaNet, enabled by our focal loss, forms an upper envelope over all existing methods, discounting the low-accuracy regime. RetinaNet with ResNet-101-FPN and a 600 pixel image scale (which we denote by RetinaNet-101-600 for simplicity) matches the accuracy of the recently published ResNet- 101-FPN Faster R-CNN [20], while running in 122 ms per image compared to 172 ms (both measured on an Nvidia M40 GPU). Using larger scales allows RetinaNet to surpass the accuracy of all two-stage approaches, while still being faster. For faster runtimes, there is only one operating point (500 pixel input) at which using ResNet-50-FPN improves over ResNet-101-FPN. Addressing the high frame rate regime will likely require special network design, as in [27], and is beyond the scope of this work. We note that after publication, faster and more accurate results can now be obtained by a variant of Faster R-CNN from [12].

速度与精度:更大的主干网带来更高的精度,但推理速度也更慢;输入图像尺度(由较短边定义)同理。我们在表1e中展示了这两个因素的影响。在图2中,我们绘制了 RetinaNet 的速度/精度权衡曲线,并与近期方法在 COCO test-dev 上公布的数字进行比较。图中显示,得益于 focal loss,若不计低精度区间,RetinaNet 在所有现有方法之上形成了一条上包络线。使用 ResNet-101-FPN、600 像素图像尺度的 RetinaNet(为简洁起见记作 RetinaNet-101-600)与最近发表的 ResNet-101-FPN Faster R-CNN [20] 精度相当,而每张图像的运行时间为 122 ms,后者为 172 ms(均在 Nvidia M40 GPU 上测得)。使用更大的尺度可以让 RetinaNet 超过所有两阶段方法的精度,同时仍然更快。对于更快的运行时间,只有在一个工作点(500 像素输入)上 ResNet-50-FPN 优于 ResNet-101-FPN。解决高帧率场景可能需要像[27]那样的特殊网络设计,超出了本文的范围。我们注意到,在论文发表之后,Faster R-CNN 的一个变体[12]已经可以获得更快且更准确的结果。

5.3 Comparison to State of the Art

We evaluate RetinaNet on the challenging COCO dataset and compare test-dev results to recent state-of-the-art methods including both one-stage and two-stage models. Results are presented in Table 2 for our RetinaNet-101-800 model trained using scale jitter and for 1.5× longer than the models in Table 1e (giving a 1.3 AP gain). Compared to existing one-stage methods, our approach achieves a healthy 5.9 point AP gap (39.1 vs. 33.2) with the closest competitor, DSSD [9], while also being faster, see Figure 2. Compared to recent two-stage methods, RetinaNet achieves a 2.3 point gap above the top-performing Faster R-CNN model based on Inception-ResNet-v2-TDM [32]. Plugging in ResNeXt- 32x8d-101-FPN [38] as the RetinaNet backbone further improves results another 1.7 AP, surpassing 40 AP on COCO.

我们在具有挑战性的 COCO 数据集上评估 RetinaNet,并将 test-dev 结果与包括一阶段和两阶段模型在内的最新方法进行比较。表2给出了我们的 RetinaNet-101-800 模型的结果,它使用尺度抖动训练,训练时长是表1e中模型的 1.5 倍(带来 1.3 AP 的增益)。与现有的一阶段方法相比,我们的方法与最接近的竞争者 DSSD [9] 拉开了 5.9 个点的 AP 差距(39.1 对 33.2),同时速度也更快,见图2。与最近的两阶段方法相比,RetinaNet 比基于 Inception-ResNet-v2-TDM [32] 的最佳 Faster R-CNN 模型高出 2.3 个点。把 ResNeXt-32x8d-101-FPN [38] 用作 RetinaNet 的主干网可再提升 1.7 AP,在 COCO 上超过 40 AP。

6. Conclusion

In this work, we identify class imbalance as the primary obstacle preventing one-stage object detectors from surpassing top-performing, two-stage methods. To address this, we propose the focal loss which applies a modulating term to the cross-entropy loss in order to focus learning on hard negative examples. Our approach is simple and highly effective. We demonstrate its efficacy by designing a fully convolutional one-stage detector and report extensive experimental analysis showing that it achieves state-of-the-art accuracy and speed. Source code is available at https://github.com/facebookresearch/Detectron [12].

在这项工作中,我们确认类别不平衡是阻碍一阶段目标检测器超越性能最好的两阶段方法的主要障碍。为了解决这一问题,我们提出了 focal loss,它在交叉熵损失上施加一个调制项,使学习聚焦在难分的负样本上。我们的方法简单且高效。我们通过设计一个全卷积的一阶段检测器来证明其有效性,并给出了大量实验分析,表明它达到了最先进的精度和速度。源代码见 https://github.com/facebookresearch/Detectron [12]。

Appendix A: Focal Loss*

The exact form of the focal loss is not crucial. We now show an alternate instantiation of the focal loss that has similar properties and yields comparable results. The following also gives more insights into properties of the focal loss. We begin by considering both cross entropy (CE) and the focal loss (FL) in a slightly different form than in the main text. Specifically, we define a quantity xt as follows:

xt = yx,

where y ∈ {±1} specifies the ground-truth class as before. We can then write pt = σ(xt) (this is compatible with the definition of pt in Equation 2). An example is correctly classified when xt > 0, in which case pt > .5. We can now define an alternate form of the focal loss in terms of xt. We define p*t and FL* as follows:

p*t = σ(γxt + β),
FL* = -log(p*t) / γ.

FL* has two parameters, γ and β, that control the steepness and shift of the loss curve. We plot FL* for two selected settings of γ and β in Figure 5 alongside CE and FL. As can be seen, like FL, FL* with the selected parameters diminishes the loss assigned to well-classified examples. We trained RetinaNet-50-600 using identical settings as before but we swap out FL for FL* with the selected parameters. These models achieve nearly the same AP as those trained with FL, see Table 3. In other words, FL* is a reasonable alternative for the FL that works well in practice.

We found that various γ and β settings gave good results. In Figure 7 we show results for RetinaNet-50-600 with FL* for a wide set of parameters. The loss plots are color coded such that effective settings (models converged and with AP over 33.5) are shown in blue. We used α = 0.25 in all experiments for simplicity. As can be seen, losses that reduce weights of well-classified examples (xt > 0) are effective. More generally, we expect any loss function with similar properties as FL or FL* to be equally effective.
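按附录的定义,FL* 可以直接写成对 logit 的函数。下面是一个示意实现(PyTorch;γ、β 的取值只是示例,并与 CE、γ=2 的 FL 做一个粗略对比):

```python
import torch
import torch.nn.functional as F

def focal_loss_star(logits, y, gamma=2.0, beta=1.0):
    """FL* = -log(sigmoid(gamma * x_t + beta)) / gamma,其中 x_t = y * x,y ∈ {+1, -1}。"""
    xt = y * logits
    return -F.logsigmoid(gamma * xt + beta) / gamma

# 与标准 CE、FL 的数值对比(示意)
x = torch.linspace(-4, 4, steps=9)
y = torch.ones_like(x)                      # 全部取正类,此时 xt = x
pt = torch.sigmoid(x)
ce = -torch.log(pt)
fl = (1 - pt) ** 2 * ce                     # γ = 2 的 focal loss
fl_star = focal_loss_star(x, y)
print(torch.stack([x, ce, fl, fl_star], dim=1))
# 与 FL 类似,FL* 在 xt > 0(分类正确)后损失快速衰减
```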

Appendix B: Derivatives

For reference, derivatives for CE, FL, and FL* w.r.t. x are:

dCE/dx = y(pt - 1)
dFL/dx = y(1 - pt)^γ (γ pt log(pt) + pt - 1)
dFL*/dx = y(p*t - 1)

Plots for selected settings are shown in Figure 6. For all loss functions, the derivative tends to -1 or 0 for high-confidence predictions. However, unlike CE, for effective settings of both FL and FL*, the derivative is small as soon as xt > 0.


总结

1.主要思想

本文指出,目前目标检测有两种类型框架:

一种是两阶段的,例如 R-CNN、Fast R-CNN、Faster R-CNN、Mask R-CNN 等这一系列。其分为两个阶段:第一个阶段用 CNN 提取特征图之后,使用 region proposal network 得到候选的感兴趣区域;第二个阶段使用分类器对这些候选区域做类别判别和框回归,最终实现检测。这种方式分两个步骤,检测速度比较慢,但是准确率很高。

另一种是一阶段的,例如 YOLO、SSD 等。其用 CNN 提取多尺度特征图之后,根据人为选定的 anchor,以不同 scale 和不同 aspect ratio 在各尺度的 feature map 上铺设感兴趣区域来覆盖整张图片,然后直接对这些 anchor 用 CNN 做位置和类别的回归,最后用 Non-Maximum Suppression 得到最终检测结果。这种方式一步到位,检测速度较快,但准确率相较于 two-stage 的方法有所下降。

这篇文章分析认为,one-stage 相对 two-stage 方法准确率下降的原因在于类别分布极不平衡。two-stage 使用 proposal 方法(Selective Search、EdgeBoxes、DeepMask、RPN)在每张图上只得到 1–2k 左右的感兴趣区域,很大程度上过滤掉了背景,只留下需要分类的前景;而 one-stage 使用多尺度 anchor 的方法在每张图上会产生 100k 左右的感兴趣区域,相当于对全图的均匀采样,并没有把背景和需要分类的前景区分开,两者数量分布极不平衡。在这种情况下使用传统的交叉熵 loss(−log p)会导致训练效果变差,检测率随之下降。因此本文提出了一种能够解决这种类别不平衡的 loss,称为 focal loss;为了验证这个 focal loss 有效,又提出了 one-stage 网络 RetinaNet 来验证。

2.focal loss的提出

先从交叉熵的一般形式来提出(这里是二分类问题,作者指出多类问题的性质相同):

CE(p, y) = -log(p),若 y = 1;否则为 -log(1 - p)。

将 p 作改写:

pt = p,若 y = 1;否则 pt = 1 - p。

代入上式,交叉熵的形式变为:

CE(p, y) = CE(pt) = -log(pt).

另外在这个形式的基础上,提出不同类别的不平衡问题,并分别设置不同的调节参数,对于正类参数是α(在0,1之间),对于负类采用1-α,并仿照上述形式表达为:

CE(pt) = -αt log(pt).

但实验显示,在 proposal 框很密集的情况下,虽然 α 平衡了正/负样本的权重,但它不区分简单/困难的样本。由此在交叉熵损失中加入调制因子 (1 − pt)^γ,将 focal loss 定义为:

FL(pt) = -(1 - pt)^γ log(pt)(实际使用时还会乘上 αt)。

这样就自动降低了简单样本的损失权重,相对提高了难例样本在训练中的作用,迫使模型把精力放在难例上。在作者的实验中,γ 取 2 时效果最好。

3.focal loss与OHEM的区别

按理来说,focal loss 和 OHEM 不都是对难例样本的处理吗?为什么 OHEM 的效果要差一点呢?

按作者的话说,OHEM采用的是对所有样本的loss排序,采用非极大值抑制的方法,只提取其中loss最大的若干样本来训练,也就是完全抛弃了简单样本的作用。这可能是导致OHEM效果不好的原因。

4. RetinaNet的网络结构

(图:RetinaNet 网络结构示意(ResNet + FPN 主干加上分类与框回归两个子网)。图略。)

RetinaNet 说到底就是 ResNet + FPN + 两个 FCN 子网络的组合,FPN 使网络能够结合多尺度的特征信息。在 FPN 的每层 feature map 上分别接 class 和 box 两个子网络(两个子网结构相同,但并不共享权重)。

FPN的内容可以查看下面这张图像:

(图:四种生成多尺度特征组合的方式 (a)–(d) 对比,出自 FPN 论文。图略。)

这是四种不同的生成多尺度特征组合的方法。

图(a)中的方法是生成一张图片多尺度特征组合的经典方法:对某一输入图片,通过缩小或放大形成不同尺度的图片作为模型输入,用同一模型分别处理这些不同尺度的图片,最后把分别得到的特征(feature maps)组合起来,就得到了能反映多尺度信息的特征集。此种方法的缺点在于,同一图片要在改变尺度后多次输入处理,因此对计算机的算力及内存大小都有较高要求。

图(b)中的方法则只拿单一尺度的图片作为输入,经 CNN 模型处理后,拿最后一层的 feature maps 作为最终的特征集。显然此种方法只能得到单一尺度的信息;优点是计算简单,对算力及内存大小都无过高需求。此方法为大多数 R-CNN 系列目标检测方法所用,像 R-CNN/Fast R-CNN/Faster R-CNN 等,因此这些模型对小尺度目标的检测性能不是很好。

图(c)中的方法同样拿单一尺度的图片作为输入,不过在选取用于后续分类或检测任务的特征组合时,不只选用最后一层的 high level feature maps,也会选用稍靠下的、反映图片 low level 信息的 feature maps,然后将这些不同层次(反映不同 level 图片信息)的特征简单合并起来(一般为 concat 处理),作为最终的特征组合输出。此方法可见于 SSD 当中。不过 SSD 在选取层特征时选用的层次都比较高,比如在以 VGG16 作为主干网络的检测模型里所选用的最低卷积层是 Conv4,这样一些具有更低级别信息的层特征(像 Conv2/Conv3)就被漏掉了,于是它对更小尺度的目标检测效果就不太好。

图(d)中的方法同图(c)有些类似,也是拿单一尺度的图片作为输入,然后选取所有层的特征来处理,再联合起来作为最终的特征输出组合(作者以 ResNet 为实例时并没有选用 Conv1 层,这是出于算力及内存上的考虑,毕竟 Conv1 层的 size 还比较大,所包含的特征跟原始像素信息也过于接近)。另外还对这些反映不同级别图片信息的各层自上而下进行了再处理,以便更好地组合出较强的特征表达(详细过程可参考 FPN 原论文)。而此方法正是 RetinaNet 所采用的 FPN 特征提取方式。
