(46) VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

  • Abstract
  • 1. Introduction
  • 2. Related work
    • 2.1. Transformers in Vision
    • 2.2. Self-Supervised Learning
  • 3. Approach
    • 3.1. Tokenization and Positional Encoding
      • 3.1.1. DropToken
    • 3.2. The Transformer Architecture
    • 3.3. Common Space Projection
    • 3.4. Multimodal Contrastive Learning