基于遮罩自适应CLIP的开放词汇语义分割

OPEN-VOCABULARY SEMANTIC SEGMENTATION WITH MASK-ADAPTED CLIP
基于MASK-ADAPTED剪辑的开放词汇语义分割

摘要

Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training.
Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions.
We identify the performance bottleneck of this paradigm to be the pre-trained CLIP model, since it does not perform well on masked images.
To address this, we propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions.
We collect traini

你可能感兴趣的:(深度学习,计算机视觉,人工智能)