Visual Prompt Tuning Notes


  • Visual Prompt Tuning
    • Basic information
    • The content
    • References

Visual Prompt Tuning

Basic information

  • Title: Visual Prompt Tuning
  • Paper: arXiv
  • Code: official code, unofficial code
  • Slides: unofficial Colab, unofficial GitHub
  • Video: (unofficial) Bilibili 1, (unofficial) Bilibili 2
  • Authors: Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, Ser-Nam Lim
  • Time: ECCV 2022

The content

  • Background

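    1. The current way to adapt a pre-trained model is full fine-tuning. Transferring a pre-trained model to a downstream task this way requires storing a full copy of the model for that task, and every parameter of the model is trained, which is computationally expensive;
    2. As computer vision has developed, Transformer-based models have grown larger than CNN-based ones, so parameter counts have risen sharply and training has become harder;
    3. In recent years NLP has entered the large-model era. To transfer large pre-trained NLP models to downstream tasks, researchers proposed an alternative to fine-tuning, namely prompt tuning: with the pre-trained model kept frozen, only a small number of extra parameters need to be trained to transfer the model to a downstream task, and the results are good.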
  • Motivation
    How can a pre-trained Transformer be adapted to downstream tasks more effectively?
    In the paper's words: "what is the best way to adapt large pre-trained Transformers to downstream tasks in terms of effectiveness and efficiency?"

  • Contribution

    1. The paper proposes a simple and effective method, Visual-Prompt Tuning (VPT), for adapting pre-trained Transformer models to downstream tasks.
    2. VPT is evaluated on many downstream tasks, and on 20 of them it even outperforms full fine-tuning.
      [Image 1]
  • Method

    1. Method overview
      [Image 2]

    2. Preliminaries
      [Image 3]

    3. Visual-Prompt Tuning (VPT)
      [Image 4]

    4. Storing Visual Prompts
      VPT is beneficial in the presence of multiple downstream tasks. We only need to store the learned prompts and classification head for each task and re-use the original copy of the pre-trained Transformer model, significantly reducing the storage cost. For instance, given a ViT-Base with 86 million (M) parameters and $d = 768$, 50 shallow prompts and deep prompts yield an additional $p \times d = 50 \times 768 = 0.038$M and $N \times p \times d = 0.46$M parameters, amounting to only 0.04% and 0.53% of all ViT-Base parameters, respectively.
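
The storage arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function name `prompt_param_count` is made up for illustration, the classification head is left out of the count, and the layer count N = 12 for ViT-Base is an assumption that is not stated in the text (it is the value implied by the 0.46M figure).

```python
def prompt_param_count(num_prompts: int, hidden_dim: int, num_layers: int, deep: bool) -> int:
    """Count learnable prompt parameters (classification head excluded).

    Shallow: a single set of p prompts of dimension d at the first layer.
    Deep: one such set for each of the N Transformer layers.
    """
    per_layer = num_prompts * hidden_dim
    return per_layer * num_layers if deep else per_layer


# p = 50 and d = 768 come from the text; N = 12 is an assumption (see above).
VIT_BASE_PARAMS = 86_000_000
shallow = prompt_param_count(50, 768, 12, deep=False)
deep = prompt_param_count(50, 768, 12, deep=True)

for name, count in (("VPT-Shallow", shallow), ("VPT-Deep", deep)):
    share = 100 * count / VIT_BASE_PARAMS
    print(f"{name}: {count:,} prompt parameters, {share:.3f}% of the backbone")
```

The counts come out to 38,400 and 460,800, matching the quoted 0.038M and 0.46M; the shares are roughly 0.045% and 0.536%, in line with the quoted 0.04% and 0.53%.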

  • Experiment result

    1. Experimental setup, points worth noting:

      1. Pre-trained backbones: All backbones in this section are pre-trained on ImageNet-21k.
      2. Baselines:
        • Full: fully update all backbone and classification head parameters.
        • Linear: only use a linear layer as the classification head.
        • Partial-$k$: fine-tune the last $k$ layers of backbone while freezing the others.
        • Mlp-$k$: utilize a multilayer perceptron (MLP) with $k$ layers, instead of a linear layer, as classification head.
        • Sidetune: train a “side” network and linearly interpolate between pretrained features and side-tuned features before being fed into the head.
        • Bias: fine-tune only the bias terms of a pre-trained backbone.
        • Adapter: insert new MLP modules with residual connection inside Transformer layers.
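
Several of the baselines above differ only in which existing parameters are left trainable. The sketch below illustrates that selection for Full, Linear, Partial-k and Bias. It is not taken from the official code: the function `trainable_names` and the naming scheme (`blocks.<i>.…` for backbone layer i, `head.…` for the classification head) are hypothetical, and it assumes the head is always trained. Mlp-k, Sidetune and Adapter add new modules, so they are not covered here.

```python
def trainable_names(param_names, protocol, num_layers, k=1):
    """Return the parameter names that would be updated under a protocol.

    Hypothetical naming scheme: 'blocks.<i>.<...>' is backbone layer i
    (0-indexed) and 'head.<...>' is the classification head.
    """
    def layer_index(name):
        parts = name.split(".")
        return int(parts[1]) if parts[0] == "blocks" else None

    selected = []
    for name in param_names:
        is_head = name.startswith("head.")
        if protocol == "full":          # update everything
            keep = True
        elif protocol == "linear":      # update the head only
            keep = is_head
        elif protocol == "partial":     # head plus the last k backbone layers
            idx = layer_index(name)
            keep = is_head or (idx is not None and idx >= num_layers - k)
        elif protocol == "bias":        # head plus every bias term
            keep = is_head or name.endswith(".bias")
        else:
            raise ValueError(f"unknown protocol: {protocol}")
        if keep:
            selected.append(name)
    return selected


# Toy two-layer "backbone" with made-up parameter names.
names = [
    "patch_embed.weight",
    "blocks.0.attn.weight", "blocks.0.attn.bias",
    "blocks.1.attn.weight", "blocks.1.attn.bias",
    "head.weight", "head.bias",
]
for protocol in ("full", "linear", "partial", "bias"):
    print(protocol, trainable_names(names, protocol, num_layers=2, k=1))
```

In a real framework the same selection would typically be applied by turning gradient tracking on or off per parameter; the list-of-names form is used here only to keep the example dependency-free.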
    2. Results
      [Image 5]

      • Even if storage is not a concern, VPT is a promising approach for adapting larger Transformers in vision. VPT-Deep outperforms all the other parameter-efficient tuning protocols across all task groups, indicating that VPT-Deep is the best fine-tuning strategy in storage-constrained environments. Although it trails VPT-Deep, VPT-Shallow still offers a non-trivial performance gain over head-oriented tuning methods, indicating that VPT-Shallow is a worthwhile choice for deploying multi-task fine-tuned models if the storage constraint is severe.
        [Image 6]
        [Image 7]

      • The experiments are conducted on the ImageNet-21k supervised pre-trained Swin-Base. VPT continues to outperform other parameter-efficient fine-tuning methods (b, c) for all three subgroups of VTAB, though in this case Full yields the highest accuracy scores overall (at a heavy cost in total parameters).
        [Image 8]
        [Image 9]

      • In the prompt-depth ablation, the accuracy drops if we insert prompts from top to bottom, suggesting that prompts at earlier Transformer layers matter more than those at later layers.
        [Image 10]
        [Image 11]

      • In the case of MoCo v3, VPT no longer holds the best performance, though it is still competitive with the others. This suggests that the two self-supervised ViTs studied (MAE and MoCo v3) are fundamentally different from the supervised ones in previous sections. Exactly why and how these differences arise remain open questions.

        [Image 12]

      • We examine the idea of adding trainable parameters in the input space of ConvNets: padding both height and width by $p$ learnable prompt pixels for the input image. Though this operation seems unconventional, we implement VPT this way given there is no obvious solution to add location-invariant prompts similar to the Transformer counterparts. In fact, this approach has been explored before in the adversarial attack literature. VPT works well in a larger ConvNet backbone, ConvNeXt-B, offering accuracy gains over other sparse tuning protocols (b, c), and outperforming Full on 8 out of 19 cases. The advantages of VPT, however, diminish with a smaller ConvNet (ResNet50), as there is no clear winner for all 19 VTAB-1k tasks.
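
The pixel-padding idea for ConvNets can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: it assumes the p prompt pixels are added on every side (so an H × W image becomes (H + 2p) × (W + 2p)), it handles a single channel with nested lists instead of tensors, and the names `pad_with_prompts` and `num_prompt_pixels` are hypothetical. No training is shown; the border values simply stand in for the learnable parameters.

```python
def pad_with_prompts(image, prompts, p):
    """Surround an H x W single-channel image with a border of prompt pixels.

    `prompts` is an (H + 2p) x (W + 2p) grid of learnable values. Only its
    border is used; the interior is overwritten by the image pixels.
    """
    h, w = len(image), len(image[0])
    if len(prompts) != h + 2 * p or any(len(row) != w + 2 * p for row in prompts):
        raise ValueError("prompts must have shape (H + 2p) x (W + 2p)")
    padded = [row[:] for row in prompts]  # copy so the prompt grid is not modified
    for i in range(h):
        for j in range(w):
            padded[i + p][j + p] = image[i][j]
    return padded


def num_prompt_pixels(h, w, p):
    """Number of learnable border pixels per channel under the same assumption."""
    return (h + 2 * p) * (w + 2 * p) - h * w


# Toy example: a 2 x 2 image with a one-pixel prompt border.
image = [[1.0, 2.0], [3.0, 4.0]]
prompts = [[0.5] * 4 for _ in range(4)]
padded = pad_with_prompts(image, prompts, p=1)
for row in padded:
    print(row)
print(num_prompt_pixels(2, 2, 1), "prompt pixels per channel")
```

Because the prompts live in pixel space, the backbone itself is untouched and only the border values (plus the classification head) would need to be stored per task.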

  • Conclusion

    1. We present Visual Prompt Tuning, a new parameter-efficient approach to leverage large vision Transformer models for a wide range of downstream tasks. VPT introduces task-specific learnable prompts in the input space, keeping the pretrained backbone fixed.
    2. We show that VPT can surpass other fine-tuning protocols (often including full fine-tuning) while dramatically reducing the storage cost.
    3. Our experiments also raise intriguing questions on fine-tuning dynamics of vision Transformers with different pre-training objectives, and how to transfer to broader vision recognition tasks in an efficient manner.


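References

  1. "Visual Prompt Tuning": visual prompts
  2. An alternative to fine-tuning! Meta AI proposes VPT: Visual Prompt Tuning
  3. Visual Prompt Tuning (GitHub)
  4. A new visual prompt method: outperforms all fine-tuning methods with far fewer parameters
  5. A new idea for training CV models: replacing fine-tuning with the prompts popular in NLP, with across-the-board performance gains
  6. prompt-tuning (GitHub)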
