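CVPR 2022 opens on June 22, and the conference accepted 2,067 papers. Because of the volume, this list is split across four posts; click a paper title to open the paper.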
Part 1, Part 2, Part 4.
A Stitch in Time Saves Nine: A Train-Time Regularizing Loss for Improved Neural Network Calibration [supp] |
The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy [supp] |
Deep Image-Based Illumination Harmonization [supp] |
ViM: Out-of-Distribution With Virtual-Logit Matching [supp] |
Active Learning by Feature Mixing [supp] |
Towards Accurate Facial Landmark Detection via Cascaded Transformers [supp] |
Class-Aware Contrastive Semi-Supervised Learning [supp] |
Long-Term Visual Map Sparsification With Heterogeneous GNN [supp] |
Debiased Learning From Naturally Imbalanced Pseudo-Labels |
RNNPose: Recurrent 6-DoF Object Pose Refinement With Robust Correspondence Field Estimation and Pose Optimization [supp] |
Ditto: Building Digital Twins of Articulated Objects From Interaction [supp] |
Dual-AI: Dual-Path Actor Interaction Learning for Group Activity Recognition [supp] |
Harmony: A Generic Unsupervised Approach for Disentangling Semantic Content From Parameterized Transformations [supp] |
Talking Face Generation With Multilingual TTS |
A Brand New Dance Partner: Music-Conditioned Pluralistic Dancing Controlled by Multiple Dance Genres [supp] |
Kernelized Few-Shot Object Detection With Efficient Integral Aggregation [supp] |
Transformer Based Line Segment Classifier With Image Context for Real-Time Vanishing Point Detection in Manhattan World |
Self-Sustaining Representation Expansion for Non-Exemplar Class-Incremental Learning [supp] |
Adaptive Early-Learning Correction for Segmentation From Noisy Annotations [supp] |
Cross-Domain Correlation Distillation for Unsupervised Domain Adaptation in Nighttime Semantic Segmentation [supp] |
Context-Aware Video Reconstruction for Rolling Shutter Cameras [supp] |
Towards Efficient Data Free Black-Box Adversarial Attack |
Robust Contrastive Learning Against Noisy Views [supp] |
More Than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech [supp] |
Cross-Modal Perceptionist: Can Face Geometry Be Gleaned From Voices? [supp] |
On Generalizing Beyond Domains in Cross-Domain Continual Learning [supp] |
RSTT: Real-Time Spatial Temporal Transformer for Space-Time Video Super-Resolution [supp] |
Learning Memory-Augmented Unidirectional Metrics for Cross-Modality Person Re-Identification [supp] |
A Closer Look at Few-Shot Image Generation [supp] |
Depth-Supervised NeRF: Fewer Views and Faster Training for Free |
Unsupervised Domain Generalization by Learning a Bridge Across Domains [supp] |
Partial Class Activation Attention for Semantic Segmentation |
Multi-Scale Memory-Based Video Deblurring [supp] |
SkinningNet: Two-Stream Graph Convolutional Neural Network for Skinning Prediction of Synthetic Characters [supp] |
A Scalable Combinatorial Solver for Elastic Geometrically Consistent 3D Shape Matching [supp] |
Learning Trajectory-Aware Transformer for Video Super-Resolution [supp] |
Differentiable Dynamics for Articulated 3D Human Motion Reconstruction [supp] |
Geometric Structure Preserving Warp for Natural Image Stitching [supp] |
GOAL: Generating 4D Whole-Body Motion for Hand-Object Grasping [supp] |
Multi-Robot Active Mapping via Neural Bipartite Graph Matching [supp] |
Adversarial Texture for Fooling Person Detectors in the Physical World [supp] |
Focal Length and Object Pose Estimation via Render and Compare [supp] |
TO-FLOW: Efficient Continuous Normalizing Flows With Temporal Optimization Adjoint With Moving Speed [supp] |
Arbitrary-Scale Image Synthesis [supp] |
Cross-Modal Representation Learning for Zero-Shot Action Recognition [supp] |
Conditional Prompt Learning for Vision-Language Models |
Graph Sampling Based Deep Metric Learning for Generalizable Person Re-Identification [supp] |
Retrieval-Based Spatially Adaptive Normalization for Semantic Image Synthesis [supp] |
Undoing the Damage of Label Shift for Cross-Domain Semantic Segmentation |
GPV-Pose: Category-Level Object Pose Estimation via Geometry-Guided Point-Wise Voting [supp] |
Dynamic 3D Gaze From Afar: Deep Gaze Estimation From Temporal Eye-Head-Body Coordination [supp] |
Expressive Talking Head Generation With Granular Audio-Visual Control [supp] |
Trustworthy Long-Tailed Classification [supp] |
Primitive3D: 3D Object Dataset Synthesis From Randomly Assembled Primitives [supp] |
Mix and Localize: Localizing Sound Sources in Mixtures |
FisherMatch: Semi-Supervised Rotation Regression via Entropy-Based Filtering [supp] |
NPBG++: Accelerating Neural Point-Based Graphics [supp] |
SphericGAN: Semi-Supervised Hyper-Spherical Generative Adversarial Networks for Fine-Grained Image Synthesis |
HairMapper: Removing Hair From Portraits Using GANs [supp] |
Affine Medical Image Registration With Coarse-To-Fine Vision Transformer [supp] |
SMPL-A: Modeling Person-Specific Deformable Anatomy [supp] |
Image Dehazing Transformer With Transmission-Aware 3D Position Embedding [supp] |
Out-of-Distribution Generalization With Causal Invariant Transformations [supp] |
Panoptic-PHNet: Towards Real-Time and High-Precision LiDAR Panoptic Segmentation via Clustering Pseudo Heatmap [supp] |
Dual-Key Multimodal Backdoors for Visual Question Answering [supp] |
A Differentiable Two-Stage Alignment Scheme for Burst Image Reconstruction With Large Shift [supp] |
Unifying Panoptic Segmentation for Autonomous Driving [supp] |
Learning Motion-Dependent Appearance for High-Fidelity Rendering of Dynamic Humans From a Single Camera [supp] |
On the Road to Online Adaptation for Semantic Image Segmentation [supp] |
Deformable ProtoPNet: An Interpretable Image Classifier Using Deformable Prototypes [supp] |
Context-Aware Sequence Alignment Using 4D Skeletal Augmentation [supp] |
Perturbed and Strict Mean Teachers for Semi-Supervised Semantic Segmentation [supp] |
Motion-Modulated Temporal Fragment Alignment Network for Few-Shot Action Recognition |
Focal Sparse Convolutional Networks for 3D Object Detection [supp] |
Masked Autoencoders Are Scalable Vision Learners [supp] |
Point-BERT: Pre-Training 3D Point Cloud Transformers With Masked Point Modeling [supp] |
Nested Collaborative Learning for Long-Tailed Visual Recognition [supp] |
Crowd Counting in the Frequency Domain [supp] |
Restormer: Efficient Transformer for High-Resolution Image Restoration [supp] |
STRPM: A Spatiotemporal Residual Predictive Model for High-Resolution Video Prediction |
Learning From Untrimmed Videos: Self-Supervised Video Representation Learning With Hierarchical Consistency [supp] |
Aladdin: Joint Atlas Building and Diffeomorphic Registration Learning With Pairwise Alignment [supp] |
IFRNet: Intermediate Feature Refine Network for Efficient Frame Interpolation [supp] |
Large Loss Matters in Weakly Supervised Multi-Label Classification [supp] |
Toward Practical Monocular Indoor Depth Estimation [supp] |
Attention Concatenation Volume for Accurate and Efficient Stereo Matching |
Learning Distinctive Margin Toward Active Domain Adaptation [supp] |
Zero-Query Transfer Attacks on Context-Aware Object Detectors [supp] |
Neural Inertial Localization [supp] |
Speed Up Object Detection on Gigapixel-Level Images With Patch Arrangement |
Finding Fallen Objects via Asynchronous Audio-Visual Integration |
Learning sRGB-to-Raw-RGB De-Rendering With Content-Aware Metadata [supp] |
GraftNet: Towards Domain Generalized Stereo Matching With a Broad-Spectrum and Task-Oriented Feature [supp] |
Towards Total Recall in Industrial Anomaly Detection [supp] |
DTA: Physical Camouflage Attacks Using Differentiable Transformation Network [supp] |
Neural Recognition of Dashed Curves With Gestalt Law of Continuity [supp] |
Semi-Supervised Object Detection via Multi-Instance Alignment With Global Class Prototypes [supp] |
HODOR: High-Level Object Descriptors for Object Re-Segmentation in Video Learned From Static Images [supp] |
Point Cloud Color Constancy [supp] |
VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning [supp] |
Catching Both Gray and Black Swans: Open-Set Supervised Anomaly Detection [supp] |
MLSLT: Towards Multilingual Sign Language Translation [supp] |
Towards an End-to-End Framework for Flow-Guided Video Inpainting [supp] |
Contrastive Test-Time Adaptation |
Multimodal Colored Point Cloud to Image Alignment [supp] |
MotionAug: Augmentation With Physical Correction for Human Motion Prediction [supp] |
Active Teacher for Semi-Supervised Object Detection |
CrossLoc: Scalable Aerial Localization Assisted by Multimodal Synthetic Data [supp] |
Audio-Adaptive Activity Recognition Across Video Domains [supp] |
Collaborative Learning for Hand and Object Reconstruction With Attention-Guided Graph Convolution [supp] |
On Learning Contrastive Representations for Learning With Noisy Labels [supp] |
Unsupervised Deraining: Where Contrastive Learning Meets Self-Similarity [supp] |
Modeling Indirect Illumination for Inverse Rendering |
BACON: Band-Limited Coordinate Networks for Multiscale Scene Representation [supp] |
Regional Semantic Contrast and Aggregation for Weakly Supervised Semantic Segmentation [supp] |
Class Re-Activation Maps for Weakly-Supervised Semantic Segmentation |
TransWeather: Transformer-Based Restoration of Images Degraded by Adverse Weather Conditions |
Merry Go Round: Rotate a Frame and Fool a DNN [supp] |
H2FA R-CNN: Holistic and Hierarchical Feature Alignment for Cross-Domain Weakly Supervised Object Detection [supp] |
Modeling sRGB Camera Noise With Normalizing Flows [supp] |
A ConvNet for the 2020s [supp] |
Reference-Based Video Super-Resolution Using Multi-Camera Video Triplets [supp] |
Self-Supervised Image Representation Learning With Geometric Set Consistency [supp] |
Deep Anomaly Discovery From Unlabeled Videos via Normality Advantage and Self-Paced Refinement [supp] |
P3Depth: Monocular Depth Estimation With a Piecewise Planarity Prior [supp] |
GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection |
Simple Multi-Dataset Detection [supp] |
MLP-3D: A MLP-Like 3D Architecture With Grouped Time Mixing |
Proactive Image Manipulation Detection [supp] |
Sketch3T: Test-Time Training for Zero-Shot SBIR [supp] |
BANMo: Building Animatable 3D Neural Models From Many Casual Videos [supp] |
StyTr2: Image Style Transfer With Transformers [supp] |
Towards Discriminative Representation: Multi-View Trajectory Contrastive Learning for Online Multi-Object Tracking |
Global Matching With Overlapping Attention for Optical Flow Estimation [supp] |
Language As Queries for Referring Video Object Segmentation [supp] |
Investigating the Impact of Multi-LiDAR Placement on Object Detection for Autonomous Driving [supp] |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection [supp] |
Audio-Visual Generalised Zero-Shot Learning With Cross-Modal Attention and Language [supp] |
Rethinking Efficient Lane Detection via Curve Modeling [supp] |
GreedyNASv2: Greedier Search With a Greedy Path Filter [supp] |
Self-Supervised Arbitrary-Scale Point Clouds Upsampling via Implicit Neural Representation |
Co-Advise: Cross Inductive Bias Distillation |
AdaMixer: A Fast-Converging Query-Based Object Detector [supp] |
DTFD-MIL: Double-Tier Feature Distillation Multiple Instance Learning for Histopathology Whole Slide Image Classification [supp] |
BEVT: BERT Pretraining of Video Transformers [supp] |
Deep Generalized Unfolding Networks for Image Restoration |
Automatic Relation-Aware Graph Network Proliferation [supp] |
AIM: An Auto-Augmenter for Images and Meshes |
VISOLO: Grid-Based Space-Time Aggregation for Efficient Online Video Instance Segmentation [supp] |
Deep Unlearning via Randomized Conditionally Independent Hessians [supp] |
Patch-Level Representation Learning for Self-Supervised Vision Transformers [supp] |
Sylph: A Hypernetwork Framework for Incremental Few-Shot Object Detection |
Incremental Learning in Semantic Segmentation From Image Labels [supp] |
Playable Environments: Video Manipulation in Space and Time [supp] |
Robust Cross-Modal Representation Learning With Progressive Self-Distillation [supp] |
What To Look at and Where: Semantic and Spatial Refined Transformer for Detecting Human-Object Interactions [supp] |
Compressive Single-Photon 3D Cameras [supp] |
Stereo Magnification With Multi-Layer Images [supp] |
CO-SNE: Dimensionality Reduction and Visualization for Hyperbolic Data [supp] |
Revisiting Skeleton-Based Action Recognition [supp] |
Rethinking Controllable Variational Autoencoders [supp] |
Contextual Instance Decoupling for Robust Multi-Person Pose Estimation |
LMGP: Lifted Multicut Meets Geometry Projections for Multi-Camera Multi-Object Tracking [supp] |
Boosting Crowd Counting via Multifaceted Attention |
Stereo Depth From Events Cameras: Concentrate and Focus on the Future [supp] |
A Probabilistic Graphical Model Based on Neural-Symbolic Reasoning for Visual Relationship Detection |
A Simple Data Mixing Prior for Improving Self-Supervised Learning |
Knowledge Distillation As Efficient Pre-Training: Faster Convergence, Higher Data-Efficiency, and Better Transferability [supp] |
LOLNerf: Learn From One Look [supp] |
Geometry-Aware Guided Loss for Deep Crack Recognition |
Multi-Modal Alignment Using Representation Codebook |
Maintaining Reasoning Consistency in Compositional Visual Question Answering [supp] |
Structure-Aware Motion Transfer With Deformable Anchor Model [supp] |
BigDL 2.0: Seamless Scaling of AI Pipelines From Laptops to Distributed Cluster [supp] |
Integrative Few-Shot Learning for Classification and Segmentation [supp] |
Acquiring a Dynamic Light Field Through a Single-Shot Coded Image [supp] |
Attentive Fine-Grained Structured Sparsity for Image Restoration [supp] |
Pix2NeRF: Unsupervised Conditional p-GAN for Single Image to Neural Radiance Fields Translation [supp] |
HARA: A Hierarchical Approach for Robust Rotation Averaging [supp] |
Diffusion Autoencoders: Toward a Meaningful and Decodable Representation [supp] |
Learning Fair Classifiers With Partially Annotated Group Labels [supp] |
StylizedNeRF: Consistent 3D Scene Stylization As Stylized NeRF via 2D-3D Mutual Learning [supp] |
NightLab: A Dual-Level Architecture With Hardness Detection for Segmentation at Night [supp] |
Knowledge Distillation With the Reused Teacher Classifier [supp] |
Contrastive Learning for Unsupervised Video Highlight Detection |
InfoGCN: Representation Learning for Human Skeleton-Based Action Recognition [supp] |
Rethinking Image Cropping: Exploring Diverse Compositions From Global Views [supp] |
Constrained Few-Shot Class-Incremental Learning [supp] |
Self-Supervised Material and Texture Representation Learning for Remote Sensing Tasks [supp] |
Threshold Matters in WSSS: Manipulating the Activation for the Robust and Accurate Segmentation Model Against Thresholds [supp] |
Data-Free Network Compression via Parametric Non-Uniform Mixed Precision Quantization [supp] |
Sparse to Dense Dynamic 3D Facial Expression Generation [supp] |
Think Twice Before Detecting GAN-Generated Fake Images From Their Spectral Domain Imprints [supp] |
Crafting Better Contrastive Views for Siamese Representation Learning |
RSCFed: Random Sampling Consensus Federated Semi-Supervised Learning [supp] |
TransMVSNet: Global Context-Aware Multi-View Stereo Network With Transformers [supp] |
ROCA: Robust CAD Model Retrieval and Alignment From a Single Image [supp] |
Continual Learning for Visual Search With Backward Consistent Feature Embedding [supp] |
iFS-RCNN: An Incremental Few-Shot Instance Segmenter [supp] |
DPGEN: Differentially Private Generative Energy-Guided Network for Natural Image Synthesis [supp] |
MetaFSCIL: A Meta-Learning Approach for Few-Shot Class Incremental Learning [supp] |
The Majority Can Help the Minority: Context-Rich Minority Oversampling for Long-Tailed Classification [supp] |
Dense Depth Priors for Neural Radiance Fields From Sparse Input Views [supp] |
EyePAD++: A Distillation-Based Approach for Joint Eye Authentication and Presentation Attack Detection Using Periocular Images [supp] |
IntentVizor: Towards Generic Query Guided Interactive Video Summarization [supp] |
Wnet: Audio-Guided Video Object Segmentation via Wavelet-Based Cross-Modal Denoising Networks [supp] |
Camera Pose Estimation Using Implicit Distortion Models [supp] |
Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations [supp] |
Shape-Invariant 3D Adversarial Point Clouds [supp] |
LAS-AT: Adversarial Training With Learnable Attack Strategy [supp] |
Bootstrapping ViTs: Towards Liberating Vision Transformers From Pre-Training [supp] |
PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents |
Styleformer: Transformer Based Generative Adversarial Networks With Style Vector [supp] |
Efficient Two-Stage Detection of Human-Object Interactions With a Novel Unary-Pairwise Transformer [supp] |
ELSR: Efficient Line Segment Reconstruction With Planes and Points Guidance [supp] |
Meta-Attention for ViT-Backed Continual Learning [supp] |
DST: Dynamic Substitute Training for Data-Free Black-Box Attack |
Photorealistic Monocular 3D Reconstruction of Humans Wearing Clothing [supp] |
A Low-Cost & Real-Time Motion Capture System |
Unified Contrastive Learning in Image-Text-Label Space [supp] |
Unifying Motion Deblurring and Frame Interpolation With Events [supp] |
Generalizing Interactive Backpropagating Refinement for Dense Prediction Networks [supp] |
Unsupervised Pre-Training for Temporal Action Localization Tasks [supp] |
Light Field Neural Rendering [supp] |
Fast Point Transformer [supp] |
Look Outside the Room: Synthesizing a Consistent Long-Term 3D Scene Video From a Single Image [supp] |
Unimodal-Concentrated Loss: Fully Adaptive Label Distribution Learning for Ordinal Regression [supp] |
Augmented Geometric Distillation for Data-Free Incremental Person ReID [supp] |
Deep Stereo Image Compression via Bi-Directional Coding |
Come-Closer-Diffuse-Faster: Accelerating Conditional Diffusion Models for Inverse Problems Through Stochastic Contraction [supp] |
Smooth-Swap: A Simple Enhancement for Face-Swapping With Smoothness [supp] |
Full-Range Virtual Try-On With Recurrent Tri-Level Transform [supp] |
Style Neophile: Constantly Seeking Novel Styles for Domain Generalization |
High-Fidelity Human Avatars From a Single RGB Camera [supp] |
ADAPT: Vision-Language Navigation With Modality-Aligned Action Prompts [supp] |
Multiview Transformers for Video Recognition [supp] |
RIO: Rotation-Equivariance Supervised Learning of Robust Inertial Odometry [supp] |
How Good Is Aesthetic Ability of a Fashion Model? [supp] |
Mining Multi-View Information: A Strong Self-Supervised Framework for Depth-Based 3D Hand Pose and Mesh Estimation [supp] |
Automated Progressive Learning for Efficient Training of Vision Transformers [supp] |
BTS: A Bi-Lingual Benchmark for Text Segmentation in the Wild [supp] |
Learning Structured Gaussians To Approximate Deep Ensembles [supp] |
Adaptive Trajectory Prediction via Transferable GNN [supp] |
Total Variation Optimization Layers for Computer Vision |
Defensive Patches for Robust Recognition in the Physical World [supp] |
Single-Stage Is Enough: Multi-Person Absolute 3D Pose Estimation [supp] |
Deformation and Correspondence Aware Unsupervised Synthetic-to-Real Scene Flow Estimation for Point Clouds [supp] |
Learn From Others and Be Yourself in Heterogeneous Federated Learning |
Sequential Voting With Relational Box Fields for Active Object Detection [supp] |
Semantic-Aware Auto-Encoders for Self-Supervised Representation Learning |
Learning Transferable Human-Object Interaction Detector With Natural Language Supervision |
Fourier Document Restoration for Robust Document Dewarping and Recognition |
Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection [supp] |
Consistent Explanations by Contrastive Learning [supp] |
Text2Pos: Text-to-Point-Cloud Cross-Modal Localization [supp] |
MulT: An End-to-End Multitask Learning Transformer [supp] |
Hierarchical Modular Network for Video Captioning [supp] |
Learning With Neighbor Consistency for Noisy Labels [supp] |
Depth Estimation by Combining Binocular Stereo and Monocular Structured-Light [supp] |
Salient-to-Broad Transition for Video Person Re-Identification [supp] |
Object-Region Video Transformers [supp] |
DeeCap: Dynamic Early Exiting for Efficient Image Captioning |
AME: Attention and Memory Enhancement in Hyper-Parameter Optimization [supp] |
Alignment-Uniformity Aware Representation Learning for Zero-Shot Video Classification [supp] |
RepMLPNet: Hierarchical Vision MLP With Re-Parameterized Locality [supp] |
DR.VIC: Decomposition and Reasoning for Video Individual Counting [supp] |
LiDARCap: Long-Range Marker-Less 3D Human Motion Capture With LiDAR Point Clouds [supp] |
GeoEngine: A Platform for Production-Ready Geospatial Research |
Revisiting Document Image Dewarping by Grid Regularization [supp] |
Semi-Supervised Few-Shot Learning via Multi-Factor Clustering [supp] |
CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation [supp] |
Weakly-Supervised Generation and Grounding of Visual Descriptions With Conditional Generative Models [supp] |
Novel Class Discovery in Semantic Segmentation [supp] |
ARCS: Accurate Rotation and Correspondence Search [supp] |
Learning To Anticipate Future With Dynamic Context Removal [supp] |
GCFSR: A Generative and Controllable Face Super Resolution Method Without Facial and GAN Priors [supp] |
Perception Prioritized Training of Diffusion Models [supp] |
Using 3D Topological Connectivity for Ghost Particle Reduction in Flow Reconstruction [supp] |
On the Integration of Self-Attention and Convolution [supp] |
Progressively Generating Better Initial Guesses Towards Next Stages for High-Quality Human Motion Prediction [supp] |
CHEX: CHannel EXploration for CNN Model Compression [supp] |
M2I: From Factored Marginal Trajectory Prediction to Interactive Prediction |
Domain Adaptation on Point Clouds via Geometry-Aware Implicits [supp] |
Consistency Driven Sequential Transformers Attention Model for Partially Observable Scenes [supp] |
GroupViT: Semantic Segmentation Emerges From Text Supervision [supp] |
NeuralHOFusion: Neural Volumetric Rendering Under Human-Object Interactions [supp] |
Generalizable Human Pose Triangulation [supp] |
DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation [supp] |
Occlusion-Aware Cost Constructor for Light Field Depth Estimation [supp] |
SmartPortraits: Depth Powered Handheld Smartphone Dataset of Human Portraits for State Estimation, Reconstruction and Synthesis |
BppAttack: Stealthy and Efficient Trojan Attacks Against Deep Neural Networks via Image Quantization and Contrastive Adversarial Learning [supp] |
GlideNet: Global, Local and Intrinsic Based Dense Embedding NETwork for Multi-Category Attributes Prediction [supp] |
Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation [supp] |
Ensembling Off-the-Shelf Models for GAN Training |
Towards Better Plasticity-Stability Trade-Off in Incremental Learning: A Simple Linear Connector |
Topology-Preserving Shape Reconstruction and Registration via Neural Diffeomorphic Flow [supp] |
Segment and Complete: Defending Object Detectors Against Adversarial Patch Attacks With Robust Patch Detection [supp] |
Cross-Domain Few-Shot Learning With Task-Specific Adapters [supp] |
MAXIM: Multi-Axis MLP for Image Processing [supp] |
Learning Part Segmentation Through Unsupervised Domain Adaptation From Synthetic Vehicles [supp] |
Delving Into the Estimation Shift of Batch Normalization in a Network [supp] |
Towards Better Understanding Attribution Methods [supp] |
Learning Object Context for Novel-View Scene Layout Generation |
PSTR: End-to-End One-Step Person Search With Transformers |
Neural Fields As Learnable Kernels for 3D Reconstruction [supp] |
A Deeper Dive Into What Deep Spatiotemporal Networks Encode: Quantifying Static vs. Dynamic Information [supp] |
Detector-Free Weakly Supervised Group Activity Recognition [supp] |
NFormer: Robust Person Re-Identification With Neighbor Transformer [supp] |
Joint Forecasting of Panoptic Segmentations With Difference Attention [supp] |
HairCLIP: Design Your Hair by Text and Reference Image [supp] |
Imposing Consistency for Optical Flow Estimation [supp] |
Style Transformer for Image Inversion and Editing [supp] |
OakInk: A Large-Scale Knowledge Repository for Understanding Hand-Object Interaction [supp] |
Pyramid Adversarial Training Improves ViT Performance [supp] |
Bridging Global Context Interactions for High-Fidelity Image Completion [supp] |
SwinBERT: End-to-End Transformers With Sparse Attention for Video Captioning [supp] |
Maximum Spatial Perturbation Consistency for Unpaired Image-to-Image Translation [supp] |
Unseen Classes at a Later Time? No Problem [supp] |
InfoNeRF: Ray Entropy Minimization for Few-Shot Neural Volume Rendering [supp] |
Learning the Degradation Distribution for Blind Image Super-Resolution |
Dist-PU: Positive-Unlabeled Learning From a Label Distribution Perspective [supp] |
SC2-PCR: A Second Order Spatial Compatibility for Efficient and Robust Point Cloud Registration [supp] |
Relative Pose From a Calibrated and an Uncalibrated Smartphone Image [supp] |
Towards Robust and Reproducible Active Learning Using Neural Networks [supp] |
Retrieval Augmented Classification for Long-Tail Visual Recognition [supp] |
Not All Tokens Are Equal: Human-Centric Visual Analysis via Token Clustering Transformer [supp] |
Temporally Efficient Vision Transformer for Video Instance Segmentation |
The Devil Is in the Margin: Margin-Based Label Smoothing for Network Calibration [supp] |
NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks [supp] |
Bringing Old Films Back to Life [supp] |
Sound and Visual Representation Learning With Multiple Pretraining Tasks |
WarpingGAN: Warping Multiple Uniform Priors for Adversarial 3D Point Cloud Generation [supp] |
RePaint: Inpainting Using Denoising Diffusion Probabilistic Models [supp] |
Revealing Occlusions With 4D Neural Fields [supp] |
Meta Agent Teaming Active Learning for Pose Estimation [supp] |
Forward Propagation, Backward Regression, and Pose Association for Hand Tracking in the Wild |
Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding [supp] |
E2(GO)MOTION: Motion Augmented Event Stream for Egocentric Action Recognition [supp] |
ES6D: A Computation Efficient and Symmetry-Aware 6D Pose Regression Framework [supp] |
Self-Supervised Deep Image Restoration via Adaptive Stochastic Gradient Langevin Dynamics [supp] |
Towards Discovering the Effectiveness of Moderately Confident Samples for Semi-Supervised Learning [supp] |
OoD-Bench: Quantifying and Understanding Two Dimensions of Out-of-Distribution Generalization [supp] |
An Empirical Study of Training End-to-End Vision-and-Language Transformers [supp] |
Multimodal Dynamics: Dynamical Fusion for Trustworthy Multimodal Classification [supp] |
The Neurally-Guided Shape Parser: Grammar-Based Labeling of 3D Shape Regions With Approximate Inference [supp] |
Unsupervised Homography Estimation With Coplanarity-Aware GAN [supp] |
LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection [supp] |
AutoLoss-Zero: Searching Loss Functions From Scratch for Generic Tasks [supp] |
PatchNet: A Simple Face Anti-Spoofing Framework via Fine-Grained Patch Recognition [supp] |
OnePose: One-Shot Object Pose Estimation Without CAD Models |
Weakly-Supervised Online Action Segmentation in Multi-View Instructional Videos [supp] |
Rethinking Minimal Sufficient Representation in Contrastive Learning [supp] |
Disentangling Visual Embeddings for Attributes and Objects [supp] |
Scalable Penalized Regression for Noise Detection in Learning With Noisy Labels |
Effective Conditioned and Composed Image Retrieval Combining CLIP-Based Features |
Registering Explicit to Implicit: Towards High-Fidelity Garment Mesh Reconstruction From Single Images [supp] |
Federated Class-Incremental Learning [supp] |
MiniViT: Compressing Vision Transformers With Weight Multiplexing [supp] |
Practical Stereo Matching via Cascaded Recurrent Network With Adaptive Correlation [supp] |
D-Grasp: Physically Plausible Dynamic Grasp Synthesis for Hand-Object Interactions [supp] |
Show, Deconfound and Tell: Image Captioning With Causal Inference [supp] |
Extracting Triangular 3D Models, Materials, and Lighting From Images [supp] |
Weakly Supervised Segmentation on Outdoor 4D Point Clouds With Temporal Matching and Spatial Graph Propagation [supp] |
ImFace: A Nonlinear 3D Morphable Face Model With Implicit Neural Representations [supp] |
MobRecon: Mobile-Friendly Hand Mesh Reconstruction From Monocular Image [supp] |
Layered Depth Refinement With Mask Guidance [supp] |
Parameter-Free Online Test-Time Adaptation [supp] |
SIGMA: Semantic-Complete Graph Matching for Domain Adaptive Object Detection [supp] |
Global Convergence of MAML and Theory-Inspired Neural Architecture Search for Few-Shot Learning [supp] |
LAKe-Net: Topology-Aware Point Cloud Completion by Localizing Aligned Keypoints [supp] |
Scribble-Supervised LiDAR Semantic Segmentation [supp] |
AlignMixup: Improving Representations by Interpolating Aligned Features [supp] |
No Pain, Big Gain: Classify Dynamic Point Cloud Sequences With Static Models by Fitting Feature-Level Space-Time Surfaces [supp] |
HiVT: Hierarchical Vector Transformer for Multi-Agent Motion Prediction [supp] |
HerosNet: Hyperspectral Explicable Reconstruction and Optimal Sampling Deep Network for Snapshot Compressive Imaging |
Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space |
Brain-Inspired Multilayer Perceptron With Spiking Neurons |
Learning To Estimate Robust 3D Human Mesh From In-the-Wild Crowded Scenes [supp] |
ObjectFormer for Image Manipulation Detection and Localization |
Detecting Deepfakes With Self-Blended Images [supp] |
Correlation-Aware Deep Tracking [supp] |
Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos [supp] |
NeurMiPs: Neural Mixture of Planar Experts for View Synthesis [supp] |
Implicit Sample Extension for Unsupervised Person Re-Identification |
Energy-Based Latent Aligner for Incremental Learning [supp] |
Towards Semi-Supervised Deep Facial Expression Recognition With an Adaptive Confidence Margin [supp] |
GanOrCon: Are Generative Models Useful for Few-Shot Segmentation? [supp] |
Bi-Level Doubly Variational Learning for Energy-Based Latent Variable Models [supp] |
SplitNets: Designing Neural Architectures for Efficient Distributed Computing on Head-Mounted Systems [supp] |
Masked-Attention Mask Transformer for Universal Image Segmentation [supp] |
Reading To Listen at the Cocktail Party: Multi-Modal Speech Separation |
AxIoU: An Axiomatically Justified Measure for Video Moment Retrieval [supp] |
NOC-REK: Novel Object Captioning With Retrieved Vocabulary From External Knowledge [supp] |
Boosting Robustness of Image Matting With Context Assembling and Strong Data Augmentation [supp] |
Group R-CNN for Weakly Semi-Supervised Object Detection With Points [supp] |
Weakly-Supervised Action Transition Learning for Stochastic Human Motion Prediction [supp] |
Speech Driven Tongue Animation [supp] |
Hybrid Relation Guided Set Matching for Few-Shot Action Recognition [supp] |
Self-Supervised Spatial Reasoning on Multi-View Line Drawings [supp] |
Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation |
Cross-Patch Dense Contrastive Learning for Semi-Supervised Segmentation of Cellular Nuclei in Histopathologic Images [supp] |
Frame-Wise Action Representations for Long Videos via Sequence Contrastive Learning [supp] |
Coarse-To-Fine Deep Video Coding With Hyperprior-Guided Mode Prediction |
Generalized Binary Search Network for Highly-Efficient Multi-View Stereo [supp] |
SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation [supp] |
Adaptive Hierarchical Representation Learning for Long-Tailed Object Detection [supp] |
FlexIT: Towards Flexible Semantic Image Translation [supp] |
Face2Exp: Combating Data Biases for Facial Expression Recognition |
SAR-Net: Shape Alignment and Recovery Network for Category-Level 6D Object Pose and Size Estimation [supp] |
Whose Hands Are These? Hand Detection and Hand-Body Association in the Wild [supp] |
Mega-NeRF: Scalable Construction of Large-Scale NeRFs for Virtual Fly-Throughs [supp] |
PINA: Learning a Personalized Implicit Neural Avatar From a Single RGB-D Video Sequence [supp] |
Forecasting From LiDAR via Future Object Detection [supp] |
CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow [supp] |
Adversarial Eigen Attack on Black-Box Models [supp] |
Training Quantised Neural Networks With STE Variants: The Additive Noise Annealing Algorithm [supp] |
Split Hierarchical Variational Compression [supp] |
Video Swin Transformer |
Privacy Preserving Partial Localization [supp] |
Cross-Modal Background Suppression for Audio-Visual Event Localization |
Mutual Quantization for Cross-Modal Search With Noisy Labels |
Lagrange Motion Analysis and View Embeddings for Improved Gait Recognition [supp] |
SphereSR: 360° Image Super-Resolution With Arbitrary Projection via Continuous Spherical Image Representation [supp] |
Neural Mesh Simplification [supp] |
Cloth-Changing Person Re-Identification From a Single Image With Gait Prediction and Regularization [supp] |
BoxeR: Box-Attention for 2D and 3D Transformers [supp] |
Neural Architecture Search With Representation Mutual Information [supp] |
Deep Hyperspectral-Depth Reconstruction Using Single Color-Dot Projection [supp] |
M3T: Three-Dimensional Medical Image Classifier Using Multi-Plane and Multi-Slice Transformer [supp] |
3MASSIV: Multilingual, Multimodal and Multi-Aspect Dataset of Social Media Short Videos [supp] |
Can Neural Nets Learn the Same Model Twice? Investigating Reproducibility and Double Descent From the Decision Boundary Perspective [supp] |
Cross Domain Object Detection by Target-Perceived Dual Branch Distillation [supp] |
A Proposal-Based Paradigm for Self-Supervised Sound Source Localization in Videos |
Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation |
GroupNet: Multiscale Hypergraph Neural Networks for Trajectory Prediction With Relational Reasoning [supp] |
Unbiased Subclass Regularization for Semi-Supervised Semantic Segmentation |
P3IV: Probabilistic Procedure Planning From Instructional Videos With Weak Supervision [supp] |
Hierarchical Nearest Neighbor Graph Embedding for Efficient Dimensionality Reduction [supp] |
Coupled Iterative Refinement for 6D Multi-Object Pose Estimation [supp] |
Multi-View Transformer for 3D Visual Grounding |
Structured Sparse R-CNN for Direct Scene Graph Generation [supp] |
Multi-Grained Spatio-Temporal Features Perceived Network for Event-Based Lip-Reading [supp] |
Semi-Supervised Video Paragraph Grounding With Contrastive Encoder |
Continual Predictive Learning From Videos [supp] |
Weakly Paired Associative Learning for Sound and Image Representations via Bimodal Associative Memory [supp] |
BARC: Learning To Regress 3D Dog Shape From Images by Exploiting Breed Information [supp] |
Knowledge Distillation: A Good Teacher Is Patient and Consistent [supp] |
PCA-Based Knowledge Distillation Towards Lightweight and Content-Style Balanced Photorealistic Style Transfer Models [supp] |
Frame Averaging for Equivariant Shape Space Learning [supp] |
Transformer Tracking With Cyclic Shifting Window Attention [supp] |
ProposalCLIP: Unsupervised Open-Category Object Proposal Generation via Exploiting CLIP Cues |
Towards Understanding Adversarial Robustness of Optical Flow Networks [supp] |
Panoptic SegFormer: Delving Deeper Into Panoptic Segmentation With Transformers [supp] |
Training High-Performance Low-Latency Spiking Neural Networks by Differentiation on Spike Representation [supp] |
AnyFace: Free-Style Text-To-Face Synthesis and Manipulation |
HL-Net: Heterophily Learning Network for Scene Graph Generation [supp] |
Lifelong Graph Learning [supp] |
Hypergraph-Induced Semantic Tuplet Loss for Deep Metric Learning [supp] |
Computing Wasserstein-p Distance Between Images With Linear Cost [supp] |
DLFormer: Discrete Latent Transformer for Video Inpainting |
Unsupervised Representation Learning for Binary Networks by Joint Classifier Learning [supp] |
High Quality Segmentation for Ultra High-Resolution Images [supp] |
Investigating Tradeoffs in Real-World Video Super-Resolution [supp] |
MERLOT Reserve: Neural Script Knowledge Through Vision and Language and Sound [supp] |
Differentiable Stereopsis: Meshes From Multiple Views Using Differentiable Rendering [supp] |
Towards Practical Certifiable Patch Defense With Vision Transformer |
A Conservative Approach for Unbiased Learning on Unknown Biases [supp] |
Large-Scale Video Panoptic Segmentation in the Wild: A Benchmark [supp] |
Label, Verify, Correct: A Simple Few Shot Object Detection Method |
Aesthetic Text Logo Synthesis via Content-Aware Layout Inferring [supp] |
Global Tracking via Ensemble of Local Trackers [supp] |
Autoregressive Image Generation Using Residual Quantization [supp] |
MPC: Multi-View Probabilistic Clustering [supp] |
End-to-End Compressed Video Representation Learning for Generic Event Boundary Detection |
GrainSpace: A Large-Scale Dataset for Fine-Grained and Domain-Adaptive Recognition of Cereal Grains [supp] |
BokehMe: When Neural Rendering Meets Classical Rendering [supp] |
Learning Modal-Invariant and Temporal-Memory for Video-Based Visible-Infrared Person Re-Identification [supp] |
MSDN: Mutually Semantic Distillation Network for Zero-Shot Learning |
Oriented RepPoints for Aerial Object Detection |
OccAM's Laser: Occlusion-Based Attribution Maps for 3D Object Detectors on LiDAR Data [supp] |
BigDatasetGAN: Synthesizing ImageNet With Pixel-Wise Annotations [supp] |
Align Representations With Base: A New Approach to Self-Supervised Learning |
Exploring Denoised Cross-Video Contrast for Weakly-Supervised Temporal Action Localization |
Pre-Train, Self-Train, Distill: A Simple Recipe for Supersizing 3D Reconstruction [supp] |
Meta Distribution Alignment for Generalizable Person Re-Identification |
TeachAugment: Data Augmentation Optimization Using Teacher Knowledge [supp] |
SVIP: Sequence VerIfication for Procedures in Videos [supp] |
Weakly Supervised Temporal Sentence Grounding With Gaussian-Based Contrastive Proposal Learning |
Low-Resource Adaptation for Personalized Co-Speech Gesture Generation [supp] |
BoosterNet: Improving Domain Generalization of Deep Neural Nets Using Culpability-Ranked Features |
Task-Specific Inconsistency Alignment for Domain Adaptive Object Detection [supp] |
HDR-NeRF: High Dynamic Range Neural Radiance Fields [supp] |
MS2DG-Net: Progressive Correspondence Learning via Multiple Sparse Semantics Dynamic Graph |
Neural Emotion Director: Speech-Preserving Semantic Control of Facial Expressions in "In-the-Wild" Videos [supp] |
Learning To Listen: Modeling Non-Deterministic Dyadic Facial Motion |
3PSDF: Three-Pole Signed Distance Function for Learning Surfaces With Arbitrary Topologies [supp] |
Capturing Humans in Motion: Temporal-Attentive 3D Human Pose and Shape Estimation From Monocular Video [supp] |
MixFormer: End-to-End Tracking With Iterative Mixed Attention [supp] |