Paper reading: Vision Transformers are Parameter-Efficient Audio-Visual Learners

Terminology
LAVISH (Latent Audio-VISual Hybrid) adapter

Abstract

Vision transformers (ViTs) have achieved impressive results on various computer vision tasks in the last several years.
In this work, we study the capability of frozen ViTs, pretrained only on visual data, to generalize to audio-visual data without finetuning any of its original parameters.
To do so, we propose a latent
