Self-supervised learning: DINO and PAWS

I. DINO (self-distillation with no labels)

1.1 Overall framework:

  • DINO is inspired by BYOL.
  • In DINO, the model passes two different random transformations of an input image to the student network $g_{\theta_s}$ and the teacher network $g_{\theta_t}$.
  • The student and teacher networks have the same architecture but different parameters.
  • The output of the teacher network is centered with a mean computed over the batch. Each network outputs a K-dimensional feature that is normalized with a temperature softmax over the feature dimension, giving the output probability distributions $P_s$ and $P_t$. For the student, with temperature $\tau_s$:
    $$P_s(x)^{(i)} = \frac{\exp\left(g_{\theta_s}(x)^{(i)} / \tau_s\right)}{\sum_{k=1}^{K} \exp\left(g_{\theta_s}(x)^{(k)} / \tau_s\right)}$$
    $P_t$ is defined in the same way with temperature $\tau_t$.
  • With a fixed teacher, the similarity of the two distributions is then measured with a cross-entropy loss.
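The bullets above (center the teacher output, apply a temperature softmax to both networks, compare with cross-entropy) can be sketched in plain Python. This is only an illustrative sketch, not the reference implementation: the function names, the toy output vectors, and the temperature values `tau_t=0.04` and `tau_s=0.1` are assumptions chosen for the example, and the center is taken as a plain batch mean as described in the text.

```python
import math

def softmax(logits, temperature):
    """Temperature softmax over the feature dimension of one output vector."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def batch_mean(outputs):
    """Per-dimension mean of the teacher outputs over the batch (the center)."""
    n = len(outputs)
    return [sum(col) / n for col in zip(*outputs)]

def cross_entropy(p_t, p_s, eps=1e-12):
    """H(P_t, P_s) = -sum_i P_t[i] * log P_s[i]; the teacher is treated as fixed."""
    return -sum(t * math.log(s + eps) for t, s in zip(p_t, p_s))

def dino_pair_loss(teacher_out, student_out, tau_t=0.04, tau_s=0.1):
    """Mean cross-entropy between centered teacher and student distributions.

    tau_t and tau_s are assumed example temperatures, not values from the text.
    """
    center = batch_mean(teacher_out)
    losses = []
    for t_vec, s_vec in zip(teacher_out, student_out):
        centered = [t - c for t, c in zip(t_vec, center)]  # centering step
        p_t = softmax(centered, tau_t)
        p_s = softmax(s_vec, tau_s)
        losses.append(cross_entropy(p_t, p_s))
    return sum(losses) / len(losses)

# Toy K=3 outputs for a batch of two images (made-up numbers).
teacher_out = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student_out = [[1.8, 0.4, -0.9], [0.0, 1.2, 0.5]]
loss = dino_pair_loss(teacher_out, student_out)
```

In a real training loop only the student would receive gradients from this loss; the sketch computes the scalar value only.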

1.2 Loss function design

  • More precisely, from a given image, a set $V$ of different views is generated. This set contains two global views, $x_1^g$ and $x_2^g$, and several local views of smaller resolution.
  • All crops are passed through the student while only the global views are passed through the teacher, which encourages "local-to-global" correspondences.
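  • The student parameters are learned by minimizing the loss:
    $$\min_{\theta_s} \sum_{x \in \{x_1^g, x_2^g\}} \sum_{x' \in V,\ x' \neq x} H\left(P_t(x), P_s(x')\right)$$
    where $H(a, b) = -a \log b$.
  • In practice, the standard multi-crop setting uses 2 global views at resolution 224² covering a large area (for example greater than 50%) of the original image, and several local views at resolution 96² covering only small areas (for example less than 50%) of the original image.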
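To make the multi-crop pairing concrete, the following sketch compares each teacher global view with every student view except the same view. It is a minimal illustration under assumptions: `multicrop_loss` is a hypothetical helper name, the probability vectors are made-up numbers, the student list is assumed to hold the global views first in the same order as the teacher list, and the sum is averaged over the number of pairs for readability.

```python
import math

def cross_entropy(p_t, p_s, eps=1e-12):
    """H(P_t, P_s) = -sum_i P_t[i] * log P_s[i]."""
    return -sum(t * math.log(s + eps) for t, s in zip(p_t, p_s))

def multicrop_loss(teacher_probs, student_probs):
    """Average cross-entropy over all (teacher view, student view) pairs.

    teacher_probs: P_t for the global views only.
    student_probs: P_s for all views, global views first in the same order
                   (an assumed layout for this sketch).
    Returns the averaged loss and the number of pairs that were compared.
    """
    total, n_terms = 0.0, 0
    for i, p_t in enumerate(teacher_probs):
        for j, p_s in enumerate(student_probs):
            if i == j:
                continue  # skip the pair where both networks see the same view
            total += cross_entropy(p_t, p_s)
            n_terms += 1
    return total / n_terms, n_terms

# 2 global views through the teacher; 2 global + 6 local views through the student.
teacher_probs = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
student_probs = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]] + [[0.5, 0.3, 0.2]] * 6
loss, n_terms = multicrop_loss(teacher_probs, student_probs)
```

With 2 global and 6 local views, the student sees 8 views, so 2 × 8 − 2 = 14 pairs contribute to the loss.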

II. PAWS (semi-supervised learning using limited labeled data)

1.1 Overall structure

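  • PAWS uses a large unlabeled dataset and a small labeled dataset. Both datasets are used together for pre-training, and the small labeled dataset is used for fine-tuning.
  • The anchor view is a sample from the large unlabeled dataset; data augmentation produces the positive view, which is an unlabeled positive sample. The support samples are a small labeled set drawn from the small labeled dataset by first randomly sampling classes and then randomly sampling a subset of examples within each of those classes.
  • All views pass through the same mapping $f_\theta: R^{3\times m\times n} \rightarrow R^{d}$ (here $m \times n$ is the image size; below, $m$ and $n$ count the support samples and the unlabeled samples). The support representations $z_s \in R^{m\times d}$ are compared for similarity with each of the representations $z$ and $z^{+}$ (both in $R^{n\times d}$), and the normalized similarities are multiplied by the corresponding support labels $y_s \in R^{m\times k}$, giving a predicted probability distribution over the $k$ classes in $R^{n\times k}$:
    $$p_i = \sum_{j=1}^{m} \frac{d(z_i, z_{s_j})}{\sum_{l=1}^{m} d(z_i, z_{s_l})}\, y_{s_j}$$
    where $d(a,b)=\exp\left(\frac{a^{T}b}{\|a\|\,\|b\|\,\tau}\right)$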
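The similarity-weighted labeling described above can be sketched as a soft nearest-neighbor classifier: each representation is compared with every support representation using $d(a,b)$, the similarities are normalized, and the support labels are averaged with those weights. This is a sketch under assumptions: `soft_nn_predict` is a hypothetical name, `tau=0.1` is an assumed temperature, and the 2-dimensional vectors are toy data.

```python
import math

def d(a, b, tau):
    """d(a, b) = exp(a^T b / (||a|| ||b|| tau)): exponentiated cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return math.exp(dot / (norm_a * norm_b * tau))

def soft_nn_predict(z, z_s, y_s, tau=0.1):
    """Map representations z (n x d) to class distributions (n x k).

    z_s: support representations (m x d); y_s: one-hot support labels (m x k).
    tau is an assumed example temperature.
    """
    k = len(y_s[0])
    preds = []
    for z_i in z:
        weights = [d(z_i, s, tau) for s in z_s]
        total = sum(weights)
        # Weighted average of the support labels, one entry per class.
        preds.append([
            sum(w * y[c] for w, y in zip(weights, y_s)) / total
            for c in range(k)
        ])
    return preds

# Toy data: m=3 support samples, k=2 classes, n=2 unlabeled representations.
z_s = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
y_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
z = [[1.0, 0.1], [0.0, 2.0]]
p = soft_nn_predict(z, z_s, y_s)
```

The same function would be applied to $z^{+}$ to obtain $p^{+}$. Because the support labels are one-hot and the weights are normalized, each row of the result sums to 1.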

1.2 Loss function design

  • Accentuate the differences between data representations


  • Use the cross-entropy loss $H$ to minimize the difference between the prediction $p$ and the target $p^{+}$, updating the parameters $\theta$ of the mapping $f_\theta$; $H(\overline{p})$ is the regularization term.
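One way the pieces of this loss could fit together is sketched here. Several details are assumptions rather than statements from these notes: the targets are sharpened with $\rho(p)_c = p_c^{1/T} / \sum_{c'} p_{c'}^{1/T}$, each view serves as the target for the other, $\overline{p}$ is the average of the sharpened predictions, and the regularizer enters as the negative entropy of $\overline{p}$. The function names and `T=0.25` are illustrative.

```python
import math

def sharpen(p, T=0.25):
    """Assumed sharpening: raise to the power 1/T and renormalize."""
    powered = [v ** (1.0 / T) for v in p]
    total = sum(powered)
    return [v / total for v in powered]

def cross_entropy(target, pred, eps=1e-12):
    """H(target, pred) = -sum_c target[c] * log pred[c]."""
    return -sum(t * math.log(q + eps) for t, q in zip(target, pred))

def entropy(p, eps=1e-12):
    """Entropy of a single distribution."""
    return -sum(v * math.log(v + eps) for v in p)

def paws_loss(p, p_plus, T=0.25):
    """Consistency term between the two views minus the entropy of p_bar.

    p, p_plus: lists of n class distributions for the anchor and positive views.
    Returns the scalar loss and the average sharpened prediction p_bar.
    """
    n = len(p)
    consistency = 0.0
    sharpened = []
    for p_i, p_i_plus in zip(p, p_plus):
        t_i, t_i_plus = sharpen(p_i, T), sharpen(p_i_plus, T)
        # Each view's sharpened prediction is the target for the other view.
        consistency += cross_entropy(t_i_plus, p_i) + cross_entropy(t_i, p_i_plus)
        sharpened += [t_i, t_i_plus]
    consistency /= 2 * n
    p_bar = [sum(col) / (2 * n) for col in zip(*sharpened)]
    # Subtracting the entropy rewards spreading predictions across classes.
    return consistency - entropy(p_bar), p_bar

# Toy predictions for n=2 samples and k=2 classes (made-up numbers).
p = [[0.6, 0.4], [0.2, 0.8]]
p_plus = [[0.7, 0.3], [0.3, 0.7]]
loss, p_bar = paws_loss(p, p_plus)
```

In training, the sharpened targets would be treated as constants, so gradients flow only through the unsharpened predictions.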





