1, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
2, Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks
3, LongT5: Efficient Text-To-Text Transformer for Long Sequences
4, LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention
5, Improving Language Understanding by Generative Pre-Training
6, AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities
7, VisualBERT: A Simple and Performant Baseline for Vision and Language
8, Expanding Language-Image Pretrained Models for General Video Recognition
9, FLAVA: A Foundational Language And Vision Alignment Model
10, GIT: A Generative Image-to-text Transformer for Vision and Language
11, OCR-free Document Understanding Transformer
12, data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
13, Image Segmentation Using Text and Image Prompts
14, Learning Transferable Visual Models From Natural Language Supervision
15, Masked Siamese Networks for Label-Efficient Learning
16, Masked Autoencoders Are Scalable Vision Learners
17, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
18, VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
19, Visual Attention Network
20, Unified Perceptual Parsing for Scene Understanding
21, Is Space-Time Attention All You Need for Video Understanding?
22, PubTables-1M: Towards comprehensive table extraction from unstructured documents
23, Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration
24, Swin Transformer V2: Scaling Up Capacity and Resolution
25, Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
26, SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
27, MetaFormer Is Actually What You Need for Vision
28, Neighborhood Attention Transformer
29, MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer
30, MobileNetV2: Inverted Residuals and Linear Bottlenecks
31, MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
32, Per-Pixel Classification is Not All You Need for Semantic Segmentation
33, Masked-attention Mask Transformer for Universal Image Segmentation
34, LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference
35, Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth