When learning with long-tailed data, a common challenge is that instance-rich (or head) classes dominate the training procedure. The learned classification model tends to perform better on these classes, while performance is significantly worse for instance-scarce (or tail) classes (under-fitting).
The prevailing scheme for long-tailed recognition learns the classifier either jointly with the representation in an end-to-end fashion, or via a two-stage approach in which the classifier and the representation are jointly fine-tuned with variants of class-balanced sampling as a second stage.
In our work, we argue for decoupling representation and classification. We demonstrate that in a long-tailed scenario, this separation allows straightforward approaches to achieve high recognition performance, without the need to design sampling strategies or balance-aware losses, or to add memory modules.
Directions explored by recent studies for solving the long-tailed recognition problem:
For most sampling strategies presented below, the probability $p_j$ of sampling a data point from class $j$ is given by:

$$p_j = \frac{n_j^q}{\sum_{i=1}^{C} n_i^q}$$

where $q \in [0, 1]$, $n_j$ denotes the number of training samples for class $j$, and $C$ is the number of training classes. Different sampling strategies arise for different values of $q$, and below we present strategies that correspond to $q = 1$, $q = 0$, and $q = 1/2$.
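As a concrete illustration, here is a minimal NumPy sketch of this sampling rule; the function name and the toy class counts are illustrative, not from the paper.

```python
import numpy as np

def sampling_probs(class_counts, q):
    """Per-class sampling probabilities p_j = n_j^q / sum_i n_i^q.

    In the paper's terminology: q = 1 is instance-balanced sampling,
    q = 0 is class-balanced sampling, q = 1/2 is square-root sampling.
    """
    counts = np.asarray(class_counts, dtype=np.float64)
    weights = counts ** q
    return weights / weights.sum()

# Toy long-tailed distribution: head class with 1000 samples, tail class with 10.
counts = [1000, 100, 10]
for q in (1.0, 0.0, 0.5):
    print(q, sampling_probs(counts, q))
```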
Re-train the classifier with class-balanced sampling. That is, keeping the representations fixed, we randomly re-initialize and optimize the classifier weights $W$ and $b$ for a small number of epochs using class-balanced sampling.
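A minimal PyTorch sketch of this re-training step, assuming a `backbone` that maps images to fixed feature vectors and a `dataset` yielding (image, label) pairs; all names and hyper-parameters are illustrative, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, WeightedRandomSampler

def retrain_classifier(backbone, dataset, labels, feat_dim, num_classes,
                       epochs=10, lr=0.1, batch_size=128):
    """Freeze the representation and re-train only a linear classifier
    with class-balanced sampling (every class equally likely per draw)."""
    labels = torch.as_tensor(labels)
    # Class-balanced sampling: weight each sample by 1 / (size of its class).
    counts = torch.bincount(labels, minlength=num_classes).float()
    sample_weights = 1.0 / counts[labels]
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)

    # Randomly re-initialized linear classifier; the backbone stays frozen.
    classifier = nn.Linear(feat_dim, num_classes)
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)

    optimizer = torch.optim.SGD(classifier.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, targets in loader:
            with torch.no_grad():
                features = backbone(images)      # fixed representation
            loss = criterion(classifier(features), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```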
Nearest class mean classifier (NCM): first compute the mean feature representation of each class on the training set, then perform nearest-neighbor search using either cosine similarity or the Euclidean distance computed on L2-normalized mean features. (Cosine similarity alleviates the weight imbalance problem via its inherent normalization.)
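One way the nearest class mean classifier could be implemented in PyTorch, using cosine similarity (function name and signature are illustrative):

```python
import torch
import torch.nn.functional as F

def ncm_classify(train_feats, train_labels, test_feats, num_classes):
    """Nearest class mean classification with cosine similarity."""
    # Mean feature representation per class, computed on the training set.
    means = torch.zeros(num_classes, train_feats.size(1))
    for c in range(num_classes):
        means[c] = train_feats[train_labels == c].mean(dim=0)
    # L2-normalize means and queries; the dot product is then the cosine
    # similarity, and the most similar class mean gives the prediction.
    means = F.normalize(means, dim=1)
    test_feats = F.normalize(test_feats, dim=1)
    return (test_feats @ means.t()).argmax(dim=1)
```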
We investigate an efficient approach to re-balance the decision boundaries of classifiers, inspired by an empirical observation:
Empirical Observation: after joint training with instance-balanced sampling, the norms of the classifier weights $\|w_j\|$ are correlated with the cardinality of the classes $n_j$, while, after fine-tuning the classifiers using class-balanced sampling, the norms of the classifier weights tend to be more similar.
Inspired by the above observations, we consider rectifying the imbalance of decision boundaries by adjusting the classifier weight norms directly through the following $\tau$-normalization procedure. Formally, let $W = \{w_j\} \in \mathbb{R}^{d \times C}$, where $w_j \in \mathbb{R}^d$ are the classifier weights corresponding to class $j$. We scale the weights of $W$ to get $\widetilde{W} = \{\widetilde{w_j}\}$ by:

$$\widetilde{w_i} = \frac{w_i}{\|w_i\|^{\tau}}$$

where $\tau$ is a hyper-parameter controlling the "temperature" of the normalization, and $\|\cdot\|$ denotes the $L_2$ norm. When $\tau = 1$, it reduces to standard $L_2$-normalization; when $\tau = 0$, no scaling is imposed. We empirically choose $\tau \in (0, 1)$ such that the weights can be rectified smoothly.
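A minimal PyTorch sketch of this rescaling; note that `nn.Linear` stores its weight as a (C, d) matrix with one row per class, whereas the text uses the d × C convention.

```python
import torch

def tau_normalize(weight, tau):
    """tau-normalization of classifier weights: w_i <- w_i / ||w_i||^tau.

    weight: (C, d) tensor, one row per class (nn.Linear layout).
    tau = 1 recovers standard L2 normalization; tau = 0 leaves the weights unchanged.
    """
    norms = weight.norm(p=2, dim=1, keepdim=True)   # (C, 1) per-class L2 norms
    return weight / norms.pow(tau)

# Example: rectify the final linear layer of a jointly trained model.
# classifier.weight.data = tau_normalize(classifier.weight.data, tau=0.7)
```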
Learnable weight scaling (LWS). Another way of interpreting $\tau$-normalization is as a re-scaling of the magnitude of each classifier weight $w_i$ while keeping its direction unchanged. This can be written as

$$\widetilde{w_i} = f_i * w_i, \quad \text{where } f_i = \frac{1}{\|w_i\|^{\tau}}$$

While for $\tau$-normalization $\tau$ is in general chosen through cross-validation, LWS instead treats the scaling factors $f_i$ as learnable parameters and optimizes them on the training set with class-balanced sampling, keeping both the representation and the classifier weight directions fixed.
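One way LWS could look in PyTorch, wrapping a frozen linear classifier with per-class learnable scales (the module name is mine):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableWeightScaling(nn.Module):
    """Learn one scaling factor f_i per class on top of a frozen linear classifier."""
    def __init__(self, classifier):
        super().__init__()
        self.classifier = classifier
        for p in self.classifier.parameters():
            p.requires_grad_(False)                         # weight directions stay fixed
        num_classes = classifier.weight.size(0)
        self.scales = nn.Parameter(torch.ones(num_classes))  # f_i, initialized to 1

    def forward(self, features):
        # Scale each class weight vector: w~_i = f_i * w_i.
        scaled_weight = self.scales.unsqueeze(1) * self.classifier.weight
        return F.linear(features, scaled_weight, self.classifier.bias)

# The scales would then be optimized with class-balanced sampling, e.g.:
# lws = LearnableWeightScaling(classifier)
# optimizer = torch.optim.SGD([lws.scales], lr=0.1, momentum=0.9)
```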
Note: NCM and the $\tau$-normalized classifier give competitive performance even though they are free of additional training and involve no additional sampling procedure.
We perform extensive experiments on three large-scale long-tailed datasets.
To better examine performance variations across classes with different numbers of examples seen during training, we report accuracy on three splits of the set of classes: Many-shot (more than 100 images), Medium-shot (20∼100 images), and Few-shot (less than 20 images).
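A small helper that could compute these split accuracies from test predictions and the per-class training counts (thresholds as stated above; all names are illustrative):

```python
import numpy as np

def split_accuracy(preds, labels, train_class_counts):
    """Accuracy over many-/medium-/few-shot class splits defined by training-set counts."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    counts = np.asarray(train_class_counts)
    splits = {
        "many-shot":   counts > 100,
        "medium-shot": (counts >= 20) & (counts <= 100),
        "few-shot":    counts < 20,
    }
    results = {}
    for name, class_mask in splits.items():
        keep = class_mask[labels]          # test samples whose class falls in this split
        results[name] = float((preds[keep] == labels[keep]).mean())
    return results
```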
This figure compares different sampling strategies for the conventional joint training scheme to a number of variations of the decoupled learning scheme on the ImageNet-LT dataset.
Among all decoupled methods, we see that instance-balanced sampling gives the best results. This is particularly interesting, as it implies that data imbalance might not be an issue for learning high-quality representations.
To further justify our claim that it is beneficial to decouple representation and classifier, we experiment with fine-tuning the backbone network (ResNeXt-50) jointly with the linear classifier. Here is the result:
Fine-tuning the whole network yields the worst performance, while keeping the representation frozen performs best. This result suggests that decoupling representation and classifier is desirable for long-tailed recognition.
This figure shows the L2 norms of the classifier weight vectors for all classes, together with the training data distribution, with classes sorted in descending order of the number of instances in the training set.
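A sketch of how such a plot could be produced with matplotlib, assuming access to the trained classifier weights (one row per class) and the per-class training counts:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_weight_norms(classifier_weight, train_class_counts):
    """Plot per-class classifier weight norms next to the class frequencies."""
    counts = np.asarray(train_class_counts)
    order = np.argsort(-counts)                  # sort classes by descending frequency
    norms = np.linalg.norm(classifier_weight, axis=1)[order]

    fig, ax1 = plt.subplots()
    ax1.plot(norms, label="classifier weight L2 norm")
    ax1.set_xlabel("class index (sorted by training-set frequency)")
    ax1.set_ylabel("weight norm")
    ax2 = ax1.twinx()
    ax2.plot(counts[order], color="gray", alpha=0.5, label="training instances")
    ax2.set_ylabel("number of training instances")
    fig.legend(loc="upper right")
    plt.show()
```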
We compare the performance of the decoupled schemes to other recent works that report state-of-the-art results on three common long-tailed benchmarks. Below is the result for ImageNet-LT.