Long-tailed Recognition (The Long-Tail Problem)

Contents

  • Long-Tailed Recognition
    • Long-tailed data
    • Basic methods
      • Data re-sampling
      • Loss re-weighting
      • Transfer learning (from head- to tail-classes)
  • Long-tailed datasets
    • Long-tailed CIFAR
    • iNaturalist 2018
    • Long-Tailed ImageNet
    • Places-LT
  • References

Long-Tailed Recognition

Long-tailed data

  • In traditional classification and recognition tasks, the distribution of the training data is usually manually balanced, i.e., there is no significant difference in sample count across classes. A balanced dataset greatly relaxes the robustness requirements on the algorithm and, to some extent, guarantees the reliability of the resulting model. However, as the number of classes of interest grows, keeping all classes balanced incurs exponentially growing collection costs.
    • As a simple example, when building an animal classification dataset, millions of images of common classes such as cats and dogs can be collected with ease, but to keep the dataset balanced we would also have to collect an equal number of samples for rare animals such as snow leopards, and the collection cost tends to grow exponentially with class rarity.
  • What if we drop manual balancing entirely and collect all relevant data as it naturally occurs? Such data is the long-tailed data this article focuses on. As shown in the figure below, a few head classes contain the majority of the labeled training data, while the many remaining tail classes each have only a few labeled samples (a few classes occupy most of the data, while most classes have very few samples). A classification or recognition system trained directly on long-tailed data tends to overfit the instance-rich head classes and underfit the instance-scarce tail classes.
    (Figure: illustration of a long-tailed label distribution, with head classes holding most of the training data and tail classes holding only a few samples each)

Note that although the training set in a long-tailed problem is imbalanced, its test set must be balanced.

Basic methods

Data re-sampling

  • Re-sampling: images are sampled with a per-class probability determined by the class sample counts (a code sketch of these strategies follows this list):
    $$p_j=\frac{n_j^q}{\sum_{i=1}^C n_i^q}$$
    where $C$ is the number of classes, $n_i$ is the number of samples in class $i$, and $p_j$ is the probability of sampling an image from class $j$.
    • Re-sampling in the narrow sense corresponds to $q\in[0,1)$: each tail-class image then has a higher probability of being sampled than a head-class image. Since tail-class images may be sampled repeatedly, simple data augmentation such as flipping and random cropping is usually applied as well.
    • Traditional instance-balanced sampling (IB) corresponds to $q=1$: every image is sampled with equal probability.
    • Class-balanced sampling (CB) corresponds to $q=0$: every class is sampled the same expected number of times. The process can be viewed as two stages: first pick a class uniformly at random, then do instance-balanced sampling within that class.
    • Square-root sampling is another variant of this scheme, setting $q=1/2$.
    • Progressively-balanced sampling (PB) is a mixed strategy: training starts with instance-balanced sampling ($p_j=p_j^{IB}$) and gradually transitions to class-balanced sampling ($p_j=p_j^{CB}$):
      $$p_j^{PB}(t)=\left(1-\frac{t}{T}\right)p_j^{IB}+\frac{t}{T}p_j^{CB}$$
      where $t$ is the current epoch and $T$ is the total number of training epochs.
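
To make the formulas above concrete, here is a minimal NumPy sketch (the class counts are made up for illustration) that computes the per-class sampling probabilities for a given $q$ and the progressively-balanced schedule:

```python
import numpy as np

def class_sampling_probs(counts, q):
    """p_j = n_j^q / sum_i n_i^q.
    q=1: instance-balanced, q=0: class-balanced, q=0.5: square-root."""
    counts = np.asarray(counts, dtype=np.float64)
    weights = counts ** q
    return weights / weights.sum()

def progressively_balanced_probs(counts, t, T):
    """Linear interpolation from instance-balanced sampling (epoch t=0)
    to class-balanced sampling (epoch t=T)."""
    p_ib = class_sampling_probs(counts, q=1.0)
    p_cb = class_sampling_probs(counts, q=0.0)
    return (1 - t / T) * p_ib + (t / T) * p_cb

# Hypothetical long-tailed class counts, head -> tail
counts = [5000, 1200, 300, 80, 20]
print(class_sampling_probs(counts, q=1.0))   # proportional to class size
print(class_sampling_probs(counts, q=0.5))   # square-root sampling
print(progressively_balanced_probs(counts, t=50, T=100))
```

In a PyTorch training loop, one common way to realize these probabilities is to convert $p_j$ into per-image weights $p_{y_i}/n_{y_i}$ and pass them to `torch.utils.data.WeightedRandomSampler`.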

Drawbacks

  • Overall, given imbalanced data, re-sampling artificially makes the training samples the model sees class-balanced, which reduces overfitting to the head classes to some extent.
  • However, since the few tail samples are learned repeatedly and lack sufficient intra-class variation, the learned tail representations are not robust; meanwhile, the abundant and diverse head-class data is not fully exploited. Re-sampling is therefore not a truly complete solution.

Loss re-weighting

  • Re-weighting mainly operates on the classification loss. Unlike sampling, thanks to the flexibility and convenience of loss computation, more complex tasks such as object detection and instance segmentation tend to use re-weighted losses to handle long-tailed distributions: when an image contains multiple objects to detect or segment, filtering them by class during sampling is far more cumbersome than image-level sampling. Re-weighting is not only simple to implement but often more flexible, ranging from class-level inverse weighting based on the class distribution (Cui et al., 2019; Khan et al., 2017; Cao et al., 2019; Khan et al., 2019; Huang et al., 2019) to sample-level hard example mining that needs no class information and relies directly on the classification confidence, such as focal loss, Meta-Weight-Net, and re-weighted training. More recently, Hayat et al. (2019) proposed using an affinity measure to make the class centers uniformly and equidistantly distributed.
  • A general form of the re-weighted cross-entropy loss (a code sketch follows this list):
    $$Loss=-\beta\log\frac{\exp(z_j)}{\sum_{i=1}^C\exp(z_i)}$$
    where $z_i$ is the logit output by the network and $\beta$ is the re-weighting coefficient. Note that $\beta$ is not a constant but a computed weight that depends on the specific method; its general trend is to assign lower weights to head classes and higher weights to tail classes, counteracting the long-tail effect.
    • The simplest re-weighting scheme sets $\beta=g\left(\frac{\sum_{i=1}^C f(n_i)}{f(n_j)}\right)$, where $f,g$ can be any monotonically increasing functions, e.g., $\log$ or power functions with positive exponents.
    • In focal loss, letting $y_i$ denote the class of sample $x_i$ and $h_i$ the model's predicted probability for that class, $\beta=(1-h_i)^\gamma$.
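
As an illustration of the formulas above, here is a minimal NumPy sketch of the re-weighted cross-entropy (taking $g$ as the identity and $f=\log$) and of the focal-loss weight; the logits and class counts are made up:

```python
import numpy as np

def reweighted_ce(logits, label, class_counts, f=np.log):
    """Loss = -beta * log softmax(z)[y], with the simple class-level
    weight beta = sum_i f(n_i) / f(n_y) (larger for rarer classes)."""
    z = logits - logits.max()                      # numerical stability
    log_prob = z[label] - np.log(np.exp(z).sum())  # log-softmax at the true class
    fn = f(np.asarray(class_counts, dtype=np.float64))
    beta = fn.sum() / fn[label]
    return -beta * log_prob

def focal_loss(logits, label, gamma=2.0):
    """Sample-level weight beta = (1 - h)^gamma, where h is the predicted
    probability of the true class: hard, low-confidence samples weigh more."""
    z = logits - logits.max()
    prob = np.exp(z) / np.exp(z).sum()
    h = prob[label]
    return -((1.0 - h) ** gamma) * np.log(h)

logits = np.array([2.0, 0.5, -1.0])
print(reweighted_ce(logits, label=2, class_counts=[5000, 300, 20]))
print(focal_loss(logits, label=2))
```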

Transfer learning (from head- to tail-classes)

  • Transfer learning: learn general knowledge from the data-rich head classes and transfer it to the few-shot tail classes. This usually requires considerable effort to design special modules for feature transfer (e.g. external memory). Recent work includes transferring the intra-class variance (Yin et al., 2019) and transferring semantic deep features (Liu et al., 2019).

Long-tailed datasets

  • Generally, in long-tailed recognition tasks, the classes are categorized into many-shot (more than 100 training samples), medium-shot (20∼100 training samples) and few-shot (fewer than 20 training samples) splits.
  • The imbalance factor (IF) of a long-tailed dataset, defined as the number of training samples of the largest class divided by that of the smallest, varies from 10 to over 500 across the datasets below (see the sketch after this list).
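
In code, these conventions reduce to a couple of one-liners; a minimal sketch (with made-up class counts):

```python
def shot_splits(class_counts):
    """Group class indices into the conventional many-/medium-/few-shot splits."""
    splits = {"many": [], "medium": [], "few": []}
    for cls, n in enumerate(class_counts):
        if n > 100:
            splits["many"].append(cls)
        elif n >= 20:
            splits["medium"].append(cls)
        else:
            splits["few"].append(cls)
    return splits

def imbalance_factor(class_counts):
    """IF = training-sample count of the largest class / that of the smallest."""
    return max(class_counts) / min(class_counts)

counts = [1280, 520, 101, 100, 19, 5]
print(shot_splits(counts))       # {'many': [0, 1, 2], 'medium': [3], 'few': [4, 5]}
print(imbalance_factor(counts))  # 256.0
```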

Long-tailed CIFAR

  • Both CIFAR-10 and CIFAR-100 contain 60,000 images: 50,000 for training and 10,000 for validation, with 10 and 100 categories respectively. CIFAR-10-LT and CIFAR-100-LT contain the same categories as the original CIFAR datasets, but are created by reducing the number of training samples per class according to an exponential function $n=n_t\times\mu^t$, where $t$ is the class index (0-indexed), $n_t$ is the original number of training images per class, and $\mu\in(0,1)$. The test set remains unchanged.
  • The imbalance factor of a long-tailed CIFAR dataset is defined as the number of training samples in the largest class divided by that of the smallest, and typically ranges from 10 to 200. Imbalance factors of 50 and 100 are widely used in the literature, with around 12,000 training images under each (see the construction sketch after this list).
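
Here is a minimal sketch of how such a long-tailed split can be generated (per-class counts only; following the common implementations, $\mu$ is chosen so that the head-to-tail ratio equals the desired imbalance factor):

```python
def cifar_lt_counts(n_classes=10, n_per_class=5000, imbalance_factor=100):
    """Per-class training-set sizes n = n_t * mu^t for class index t,
    with mu set so that counts[0] / counts[-1] == imbalance_factor."""
    mu = imbalance_factor ** (-1.0 / (n_classes - 1))
    return [round(n_per_class * mu ** t) for t in range(n_classes)]

counts = cifar_lt_counts()     # CIFAR-10-LT with IF = 100
print(counts)                  # [5000, 2997, 1797, ..., 50]
print(counts[0] / counts[-1])  # 100.0, the imbalance factor
print(sum(counts))             # ~12,400 training images in total
```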

iNaturalist 2018

  • The iNaturalist species classification datasets are large-scale real-world datasets for identifying animal and plant species, and they suffer from extremely imbalanced label distributions.
  • The most challenging version is iNaturalist 2018, which contains 437,513 images from 8,142 categories. Besides the extreme imbalance (IF = 512), the iNaturalist datasets also pose a fine-grained recognition problem.

Long-Tailed ImageNet

  • The long-tailed ImageNet (ImageNet-LT) is derived from the original ImageNet-2012 by sampling a subset of its 1,000 categories following a Pareto distribution with power value $\alpha=6$. It consists of 115.8K images, with at most 1,280 and at least 5 images per class. The test set is balanced, following Liu et al. (2019).

Places-LT

  • Places-LT has an imbalanced training set with 62,500 images over 365 classes, drawn from Places2. The class frequencies follow a natural power-law distribution, with a maximum of 4,980 and a minimum of 5 images per class. The validation and test sets are balanced and contain 20 and 100 images per class, respectively.

References

  • 长尾分布下分类问题简介与基本方法 (An Introduction to Classification under Long-Tailed Distributions and Basic Methods)
  • 长尾分布下分类问题的最新研究 (Recent Research on Classification under Long-Tailed Distributions, continuously updated)
  • 长尾分布下的物体检测和实例分割最新研究 (Recent Research on Object Detection and Instance Segmentation under Long-Tailed Distributions)
  • 一种崭新的长尾分布下分类问题的通用算法 (A Novel General Algorithm for Classification under Long-Tailed Distributions)
  • Decoupling Representation and Classifier for Long-Tailed Recognition, ICLR 2020
  • Places-LT, ImageNet-LT, iNaturalist
  • Zhang, Yongshun, et al. “Bag of tricks for long-tailed visual recognition with deep convolutional neural networks.” Proceedings of the AAAI conference on artificial intelligence. Vol. 35. No. 4. 2021.
  • Cai, Jiarui, Yizhou Wang, and Jenq-Neng Hwang. “Ace: Ally complementary experts for solving long-tailed recognition in one-shot.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
