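CVPR 2022 opens on June 22, and the conference accepted a total of 2,067 papers. Because of the large number, this article is split into four sub-articles; click a paper title to go directly to the document.
Part 1, Part 3, Part 4.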
Cannot See the Forest for the Trees: Aggregating Multiple Viewpoints To Better Classify Objects in Videos [supp] |
Learning Canonical F-Correlation Projection for Compact Multiview Representation [supp] |
DIFNet: Boosting Visual Information Flow for Image Captioning |
Weakly Supervised Object Localization As Domain Adaption [supp] |
Tencent-MVSE: A Large-Scale Benchmark Dataset for Multi-Modal Video Similarity Evaluation |
Dynamic Prototype Convolution Network for Few-Shot Semantic Segmentation [supp] |
Deep Orientation-Aware Functional Maps: Tackling Symmetry Issues in Shape Matching [supp] |
Tree Energy Loss: Towards Sparsely Annotated Semantic Segmentation [supp] |
Mr.BiQ: Post-Training Non-Uniform Quantization Based on Minimizing the Reconstruction Error [supp] |
MatteFormer: Transformer-Based Image Matting via Prior-Tokens [supp] |
Video Shadow Detection via Spatio-Temporal Interpolation Consistency Training [supp] |
Ranking Distance Calibration for Cross-Domain Few-Shot Learning [supp] |
Robust and Accurate Superquadric Recovery: A Probabilistic Approach [supp] |
Zero-Shot Text-Guided Object Generation With Dream Fields [supp] |
Learning Pixel Trajectories With Multiscale Contrastive Random Walks |
Self-Supervised Correlation Mining Network for Person Image Generation |
Grounding Answers for Visual Questions Asked by Visually Impaired People [supp] |
Task Adaptive Parameter Sharing for Multi-Task Learning [supp] |
Sparse Instance Activation for Real-Time Instance Segmentation |
Automatic Color Image Stitching Using Quaternion Rank-1 Alignment [supp] |
VisualGPT: Data-Efficient Adaptation of Pretrained Language Models for Image Captioning [supp] |
ESCNet: Gaze Target Detection With the Understanding of 3D Scenes [supp] |
Can You Spot the Chameleon? Adversarially Camouflaging Images From Co-Salient Object Detection |
Finding Badly Drawn Bunnies [supp] |
Point2Cyl: Reverse Engineering 3D Objects From Point Clouds to Extrusion Cylinders [supp] |
All-Photon Polarimetric Time-of-Flight Imaging [supp] |
MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation [supp] |
Surface-Aligned Neural Radiance Fields for Controllable 3D Human Synthesis [supp] |
Learning From Temporal Gradient for Semi-Supervised Action Recognition [supp] |
Towards Implicit Text-Guided 3D Shape Generation [supp] |
Audio-Driven Neural Gesture Reenactment With Video Motion Graphs [supp] |
SoftCollage: A Differentiable Probabilistic Tree Generator for Image Collage [supp] |
Transforming Model Prediction for Tracking [supp] |
A Unified Framework for Implicit Sinkhorn Differentiation [supp] |
DGECN: A Depth-Guided Edge Convolutional Network for End-to-End 6D Pose Estimation [supp] |
Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs With Language Structures via Dependency Relationships |
Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling [supp] |
Locality-Aware Inter- and Intra-Video Reconstruction for Self-Supervised Correspondence Learning [supp] |
A Versatile Multi-View Framework for LiDAR-Based 3D Object Detection With Guidance From Panoptic Segmentation [supp] |
Query and Attention Augmentation for Knowledge-Based Explainable Reasoning [supp] |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality [supp] |
RFNet: Unsupervised Network for Mutually Reinforcing Multi-Modal Image Registration and Fusion [supp] |
Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection [supp] |
Interactron: Embodied Adaptive Object Detection [supp] |
3D Scene Painting via Semantic Image Synthesis [supp] |
MeMOT: Multi-Object Tracking With Memory |
Revisiting Weakly Supervised Pre-Training of Visual Perception Models [supp] |
Semi-Supervised Semantic Segmentation With Error Localization Network |
Meta Convolutional Neural Networks for Single Domain Generalization [supp] |
Generalizing Gaze Estimation With Rotation Consistency |
Anomaly Detection via Reverse Distillation From One-Class Embedding [supp] |
Fine-Grained Object Classification via Self-Supervised Pose Alignment [supp] |
Spatio-Temporal Gating-Adjacency GCN for Human Motion Prediction [supp] |
CellTypeGraph: A New Geometric Computer Vision Benchmark [supp] |
Clustering Plotted Data by Image Segmentation |
Accelerating Neural Network Optimization Through an Automated Control Theory Lens [supp] |
Animal Kingdom: A Large and Diverse Dataset for Animal Behavior Understanding [supp] |
Learning To Learn Across Diverse Data Biases in Deep Face Recognition [supp] |
Back to Reality: Weakly-Supervised 3D Object Detection With Shape-Guided Label Enhancement [supp] |
Long-Tail Recognition via Compositional Knowledge Transfer [supp] |
EI-CLIP: Entity-Aware Interventional Contrastive Learning for E-Commerce Cross-Modal Retrieval [supp] |
Multi-Dimensional, Nuanced and Subjective - Measuring the Perception of Facial Expressions [supp] |
PyMiceTracking: An Open-Source Toolbox for Real-Time Behavioral Neuroscience Experiments |
Self-Taught Metric Learning Without Labels [supp] |
MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition [supp] |
Fine-Grained Temporal Contrastive Learning for Weakly-Supervised Temporal Action Localization |
Embracing Single Stride 3D Object Detector With Sparse Transformer [supp] |
Multidimensional Belief Quantification for Label-Efficient Meta-Learning [supp] |
UTC: A Unified Transformer With Inter-Task Contrastive Learning for Visual Dialog |
Relieving Long-Tailed Instance Segmentation via Pairwise Class Balance [supp] |
Online Convolutional Re-Parameterization [supp] |
Mimicking the Oracle: An Initial Phase Decorrelation Approach for Class Incremental Learning [supp] |
RIDDLE: Lidar Data Compression With Range Image Deep Delta Encoding [supp] |
RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition [supp] |
HODEC: Towards Efficient High-Order DEcomposed Convolutional Neural Networks |
RigidFlow: Self-Supervised Scene Flow Learning on Point Clouds by Local Rigidity Prior [supp] |
Smooth Maximum Unit: Smooth Activation Function for Deep Networks Using Smoothing Maximum Technique [supp] |
Learning Invisible Markers for Hidden Codes in Offline-to-Online Photography [supp] |
Personalized Image Aesthetics Assessment With Rich Attributes |
Task2Sim: Towards Effective Pre-Training and Transfer From Synthetic Data [supp] |
Part-Based Pseudo Label Refinement for Unsupervised Person Re-Identification [supp] |
Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation [supp] |
HDNet: High-Resolution Dual-Domain Learning for Spectral Compressive Imaging |
OW-DETR: Open-World Detection Transformer [supp] |
Learning Deep Implicit Functions for 3D Shapes With Dynamic Code Clouds [supp] |
Reversible Vision Transformers [supp] |
Amodal Panoptic Segmentation [supp] |
Gravitationally Lensed Black Hole Emission Tomography [supp] |
3D-Aware Image Synthesis via Learning Structural and Textural Representations [supp] |
Text-to-Image Synthesis Based on Object-Guided Joint-Decoding Transformer [supp] |
Correlation Verification for Image Retrieval [supp] |
Unsupervised Vision-and-Language Pre-Training via Retrieval-Based Multi-Granular Alignment [supp] |
Protecting Facial Privacy: Generating Adversarial Identity Masks via Style-Robust Makeup Transfer [supp] |
PONI: Potential Functions for ObjectGoal Navigation With Interaction-Free Learning [supp] |
Noise Is Also Useful: Negative Correlation-Steered Latent Contrastive Learning |
Temporal Feature Alignment and Mutual Information Maximization for Video-Based Human Pose Estimation |
Spatially-Adaptive Multilayer Selection for GAN Inversion and Editing |
Self-Supervised Transformers for Unsupervised Object Discovery Using Normalized Cut [supp] |
Exploring Structure-Aware Transformer Over Interaction Proposals for Human-Object Interaction Detection [supp] |
Towards Robust Adaptive Object Detection Under Noisy Annotations [supp] |
Decoupled Multi-Task Learning With Cyclical Self-Regulation for Face Parsing |
Pastiche Master: Exemplar-Based High-Resolution Portrait Style Transfer [supp] |
Learning To Memorize Feature Hallucination for One-Shot Image Generation |
AUV-Net: Learning Aligned UV Maps for Texture Transfer and Synthesis |
Open-Vocabulary One-Stage Detection With Hierarchical Visual-Language Knowledge Distillation [supp] |
Glass: Geometric Latent Augmentation for Shape Spaces |
COAP: Compositional Articulated Occupancy of People [supp] |
Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation |
Evading the Simplicity Bias: Training a Diverse Set of Models Discovers Solutions With Superior OOD Generalization [supp] |
Assembly101: A Large-Scale Multi-View Video Dataset for Understanding Procedural Activities [supp] |
Deterministic Point Cloud Registration via Novel Transformation Decomposition [supp] |
Motion-Adjustable Neural Implicit Video Representation |
Neural Prior for Trajectory Estimation [supp] |
DPICT: Deep Progressive Image Compression Using Trit-Planes [supp] |
Rethinking Depth Estimation for Multi-View Stereo: A Unified Representation [supp] |
Long-Tailed Recognition via Weight Balancing [supp] |
Text to Image Generation With Semantic-Spatial Aware GAN |
The Norm Must Go On: Dynamic Unsupervised Domain Adaptation by Normalization [supp] |
ShapeFormer: Transformer-Based Shape Completion via Sparse Representation [supp] |
PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures [supp] |
Eigencontours: Novel Contour Descriptors Based on Low-Rank Approximation [supp] |
Generalizable Cross-Modality Medical Image Segmentation via Style Augmentation and Dual Normalization [supp] |
Learning Optical Flow With Kernel Patch Attention |
Learning To Prompt for Open-Vocabulary Object Detection With Vision-Language Model [supp] |
TimeReplayer: Unlocking the Potential of Event Cameras for Video Interpolation [supp] |
General Incremental Learning With Domain-Aware Categorical Representations [supp] |
Interactive Segmentation and Visualization for Tiny Objects in Multi-Megapixel Images |
ActiveZero: Mixed Domain Learning for Active Stereovision With Zero Annotation [supp] |
DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers [supp] |
Global-Aware Registration of Less-Overlap RGB-D Scans [supp] |
RayMVSNet: Learning Ray-Based 1D Implicit Fields for Accurate Multi-View Stereo [supp] |
ContrastMask: Contrastive Learning To Segment Every Thing [supp] |
Efficient Deep Embedded Subspace Clustering [supp] |
Neural MoCon: Neural Motion Control for Physically Plausible Human Motion Capture [supp] |
Revisiting Temporal Alignment for Video Restoration [supp] |
Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning [supp] |
Neural Reflectance for Shape Recovery With Shadow Handling [supp] |
Rep-Net: Efficient On-Device Learning via Feature Reprogramming [supp] |
Surface Representation for Point Clouds [supp] |
Implicit Motion Handling for Video Camouflaged Object Detection [supp] |
OVE6D: Object Viewpoint Encoding for Depth-Based 6D Object Pose Estimation [supp] |
DeepLIIF: An Online Platform for Quantification of Clinical Pathology Slides |
Joint Video Summarization and Moment Localization by Cross-Task Sample Transfer [supp] |
WALT: Watch and Learn 2D Amodal Representation From Time-Lapse Imagery [supp] |
Learning With Twin Noisy Labels for Visible-Infrared Person Re-Identification [supp] |
Optical Flow Estimation for Spiking Camera [supp] |
MetaFormer Is Actually What You Need for Vision [supp] |
GradViT: Gradient Inversion of Vision Transformers [supp] |
Spatial-Temporal Space Hand-in-Hand: Spatial-Temporal Video Super-Resolution via Cycle-Projected Mutual Learning |
InstaFormer: Instance-Aware Image-to-Image Translation With Transformer [supp] |
Revisiting Near/Remote Sensing With Geospatial Attention [supp] |
Joint Global and Local Hierarchical Priors for Learned Image Compression [supp] |
Knowledge Distillation via the Target-Aware Transformer [supp] |
Recurring the Transformer for Video Action Recognition [supp] |
Subspace Adversarial Training [supp] |
3D-VField: Adversarial Augmentation of Point Clouds for Domain Generalization in 3D Object Detection [supp] |
Image Segmentation Using Text and Image Prompts [supp] |
AutoMine: An Unmanned Mine Dataset [supp] |
Neural Data-Dependent Transform for Learned Image Compression [supp] |
Background Activation Suppression for Weakly Supervised Object Localization [supp] |
How Many Observations Are Enough? Knowledge Distillation for Trajectory Forecasting [supp] |
Evaluation-Oriented Knowledge Distillation for Deep Face Recognition |
Improving Subgraph Recognition With Variational Graph Information Bottleneck |
Slot-VPS: Object-Centric Representation Learning for Video Panoptic Segmentation [supp] |
Motion-From-Blur: 3D Shape and Motion Estimation of Motion-Blurred Objects in Videos [supp] |
Efficient Video Instance Segmentation via Tracklet Query and Proposal [supp] |
Synthetic Generation of Face Videos With Plethysmograph Physiology |
TransRAC: Encoding Multi-Scale Temporal Correlation With Transformers for Repetitive Action Counting [supp] |
Hallucinated Neural Radiance Fields in the Wild [supp] |
NeuralHDHair: Automatic High-Fidelity Hair Modeling From a Single Image Using Implicit Neural Representations [supp] |
The Two Dimensions of Worst-Case Training and Their Integrated Effect for Out-of-Domain Generalization [supp] |
Global Tracking Transformers |
Backdoor Attacks on Self-Supervised Learning [supp] |
Multimodal Token Fusion for Vision Transformers [supp] |
Exploring Frequency Adversarial Attacks for Face Forgery Detection |
GMFlow: Learning Optical Flow via Global Matching [supp] |
Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [supp] |
FLAVA: A Foundational Language and Vision Alignment Model [supp] |
Signing at Scale: Learning to Co-Articulate Signs for Large-Scale Photo-Realistic Sign Language Production [supp] |
Explore Spatio-Temporal Aggregation for Insubstantial Object Detection: Benchmark Dataset and Baseline |
OCSampler: Compressing Videos to One Clip With Single-Step Sampling [supp] |
Learning Bayesian Sparse Networks With Full Experience Replay for Continual Learning |
Graph-Based Spatial Transformer With Memory Replay for Multi-Future Pedestrian Trajectory Prediction |
Scanline Homographies for Rolling-Shutter Plane Absolute Pose [supp] |
TableFormer: Table Structure Understanding With Transformers [supp] |
Exemplar-Based Pattern Synthesis With Implicit Periodic Field Network |
Grounded Language-Image Pre-Training [supp] |
Spectral Unsupervised Domain Adaptation for Visual Recognition [supp] |
AdaInt: Learning Adaptive Intervals for 3D Lookup Tables on Real-Time Image Enhancement [supp] |
PatchFormer: An Efficient Point Transformer With Patch Attention |
Recurrent Glimpse-Based Decoder for Detection With Transformer [supp] |
Generating 3D Bio-Printable Patches Using Wound Segmentation and Reconstruction To Treat Diabetic Foot Ulcers [supp] |
SimMIM: A Simple Framework for Masked Image Modeling [supp] |
OmniFusion: 360 Monocular Depth Estimation via Geometry-Aware Fusion [supp] |
Label Matching Semi-Supervised Object Detection [supp] |
RegionCLIP: Region-Based Language-Image Pretraining [supp] |
Video Frame Interpolation Transformer |
An MIL-Derived Transformer for Weakly Supervised Point Cloud Segmentation [supp] |
Fast Light-Weight Near-Field Photometric Stereo [supp] |
BCOT: A Markerless High-Precision 3D Object Tracking Benchmark [supp] |
Omni-DETR: Omni-Supervised Object Detection With Transformers [supp] |
Uniform Subdivision of Omnidirectional Camera Space for Efficient Spherical Stereo Matching [supp] |
High-Resolution Image Synthesis With Latent Diffusion Models [supp] |
Improving Adversarially Robust Few-Shot Image Classification With Generalizable Representations |
Transferable Sparse Adversarial Attack |
CREAM: Weakly Supervised Object Localization via Class RE-Activation Mapping |
Semi-Weakly-Supervised Learning of Complex Actions From Instructional Task Videos [supp] |
APRIL: Finding the Achilles' Heel on Privacy for Vision Transformers [supp] |
Text Spotting Transformers |
Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields [supp] |
VALHALLA: Visual Hallucination for Machine Translation [supp] |
StyleSDF: High-Resolution 3D-Consistent Image and Geometry Generation [supp] |
Incorporating Semi-Supervised and Positive-Unlabeled Learning for Boosting Full Reference Image Quality Assessment [supp] |
GLAMR: Global Occlusion-Aware Human Mesh Recovery With Dynamic Cameras [supp] |
HINT: Hierarchical Neuron Concept Explainer [supp] |
Capturing and Inferring Dense Full-Body Human-Scene Contact [supp] |
Advancing High-Resolution Video-Language Representation With Large-Scale Video Transcriptions [supp] |
Target-Aware Dual Adversarial Learning and a Multi-Scenario Multi-Modality Benchmark To Fuse Infrared and Visible for Object Detection [supp] |
En-Compactness: Self-Distillation Embedding & Contrastive Generation for Generalized Zero-Shot Learning [supp] |
Neural Face Identification in a 2D Wireframe Projection of a Manifold Object [supp] |
LC-FDNet: Learned Lossless Image Compression With Frequency Decomposition Network [supp] |
Nonuniform-to-Uniform Quantization: Towards Accurate Quantization via Generalized Straight-Through Estimation [supp] |
Deep Rectangling for Image Stitching: A Learning Baseline [supp] |
PCL: Proxy-Based Contrastive Learning for Domain Generalization [supp] |
SurfEmb: Dense and Continuous Correspondence Distributions for Object Pose Estimation With Learnt Surface Embeddings |
Diverse Plausible 360-Degree Image Outpainting for Efficient 3DCG Background Creation [supp] |
Learning 3D Object Shape and Layout Without 3D Supervision [supp] |
An Empirical Study of End-to-End Temporal Action Detection [supp] |
SimVP: Simpler Yet Better Video Prediction [supp] |
Object Localization Under Single Coarse Point Supervision [supp] |
Unsupervised Learning of Accurate Siamese Tracking [supp] |
Bayesian Nonparametric Submodular Video Partition for Robust Anomaly Detection [supp] |
Brain-Supervised Image Editing [supp] |
3D Shape Variational Autoencoder Latent Disentanglement via Mini-Batch Feature Swapping for Bodies and Faces [supp] |
Unified Transformer Tracker for Object Tracking [supp] |
Non-Parametric Depth Distribution Modelling Based Depth Inference for Multi-View Stereo [supp] |
Equalized Focal Loss for Dense Long-Tailed Object Detection [supp] |
Generating High Fidelity Data From Low-Density Regions Using Diffusion Models [supp] |
DeepDPM: Deep Clustering With an Unknown Number of Clusters [supp] |
Spiking Transformers for Event-Based Single Object Tracking [supp] |
FocalClick: Towards Practical Interactive Image Segmentation |
ISDNet: Integrating Shallow and Deep Networks for Efficient Ultra-High Resolution Segmentation [supp] |
Unsupervised Domain Adaptation for Nighttime Aerial Tracking [supp] |
Balanced Multimodal Learning via On-the-Fly Gradient Modulation [supp] |
RestoreFormer: High-Quality Blind Face Restoration From Undegraded Key-Value Pairs [supp] |
Understanding Uncertainty Maps in Vision With Statistical Testing [supp] |
CAFE: Learning To Condense Dataset by Aligning Features |
Causality Inspired Representation Learning for Domain Generalization [supp] |
Mask-Guided Spectral-Wise Transformer for Efficient Hyperspectral Image Reconstruction |
A Variational Bayesian Method for Similarity Learning in Non-Rigid Image Registration |
Not Just Selection, but Exploration: Online Class-Incremental Continual Learning via Dual View Consistency |
PPDL: Predicate Probability Distribution Based Loss for Unbiased Scene Graph Generation [supp] |
Block-NeRF: Scalable Large Scene Neural View Synthesis [supp] |
Coupling Vision and Proprioception for Navigation of Legged Robots [supp] |
Fine-Grained Predicates Learning for Scene Graph Generation |
Generalized Few-Shot Semantic Segmentation [supp] |
Exploiting Rigidity Constraints for LiDAR Scene Flow Estimation [supp] |
Neural Head Avatars From Monocular RGB Videos [supp] |
B-Cos Networks: Alignment Is All We Need for Interpretability [supp] |
EMOCA: Emotion Driven Monocular Face Capture and Animation [supp] |
Burst Image Restoration and Enhancement [supp] |
What Makes Transfer Learning Work for Medical Images: Feature Reuse & Other Factors [supp] |
Towards Diverse and Natural Scene-Aware 3D Human Motion Synthesis [supp] |
Quarantine: Sparsity Can Uncover the Trojan Attack Trigger for Free [supp] |
Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [supp] |
Localized Adversarial Domain Generalization [supp] |
X-Trans2Cap: Cross-Modal Knowledge Transfer Using Transformer for 3D Dense Captioning [supp] |
How Much Does Input Data Type Impact Final Face Model Accuracy? [supp] |
Image-to-Lidar Self-Supervised Distillation for Autonomous Driving Data [supp] |
HumanNeRF: Free-Viewpoint Rendering of Moving People From Monocular Video [supp] |
PoseKernelLifter: Metric Lifting of 3D Human Pose Using Sound |
Which Images To Label for Few-Shot Medical Landmark Detection? |
Why Discard if You Can Recycle?: A Recycling Max Pooling Module for 3D Point Cloud Analysis [supp] |
Explaining Deep Convolutional Neural Networks via Latent Visual-Semantic Filter Attention [supp] |
AlignQ: Alignment Quantization With ADMM-Based Correlation Preservation [supp] |
Self-Distillation From the Last Mini-Batch for Consistency Regularization |
Interactive Multi-Class Tiny-Object Detection [supp] |
Learning From Pixel-Level Noisy Label: A New Perspective for Light Field Saliency Detection [supp] |
UBoCo: Unsupervised Boundary Contrastive Learning for Generic Event Boundary Detection [supp] |
Multi-View Depth Estimation by Fusing Single-View Depth Probability With Multi-View Geometry [supp] |
Learning To Collaborate in Decentralized Learning of Personalized Models |
CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields [supp] |
ART-Point: Improving Rotation Robustness of Point Cloud Classifiers via Adversarial Rotation [supp] |
Ref-NeRF: Structured View-Dependent Appearance for Neural Radiance Fields [supp] |
360-Attack: Distortion-Aware Perturbations From Perspective-Views |
Targeted Supervised Contrastive Learning for Long-Tailed Recognition [supp] |
Both Style and Fog Matter: Cumulative Domain Adaptation for Semantic Foggy Scene Understanding |
Ev-TTA: Test-Time Adaptation for Event-Based Object Recognition [supp] |
Balanced Contrastive Learning for Long-Tailed Visual Recognition [supp] |
Slimmable Domain Adaptation [supp] |
Bandits for Structure Perturbation-Based Black-Box Attacks To Graph Neural Networks With Theoretical Guarantees |
NODEO: A Neural Ordinary Differential Equation Based Optimization Framework for Deformable Image Registration [supp] |
DIP: Deep Inverse Patchmatch for High-Resolution Optical Flow [supp] |
Few-Shot Object Detection With Fully Cross-Transformer [supp] |
Pyramid Architecture for Multi-Scale Processing in Point Cloud Segmentation |
Decoupling Makes Weakly Supervised Local Feature Better [supp] |
Cross-Architecture Self-Supervised Video Representation Learning |
High-Resolution Image Harmonization via Collaborative Dual Transformations [supp] |
Homography Loss for Monocular 3D Object Detection |
A Unified Model for Line Projections in Catadioptric Cameras With Rotationally Symmetric Mirrors [supp] |
Dynamic Sparse R-CNN |
MM-TTA: Multi-Modal Test-Time Adaptation for 3D Semantic Segmentation [supp] |
Stable Long-Term Recurrent Video Super-Resolution [supp] |
Dual-Generator Face Reenactment |
Towards Bidirectional Arbitrary Image Rescaling: Joint Optimization and Cycle Idempotence |
Self-Supervised Neural Articulated Shape and Appearance Models [supp] |
A Hybrid Quantum-Classical Algorithm for Robust Fitting [supp] |
Topology Preserving Local Road Network Estimation From Single Onboard Camera Image [supp] |
Eigenlanes: Data-Driven Lane Descriptors for Structurally Diverse Lanes [supp] |
Human Instance Matting via Mutual Guidance and Multi-Instance Refinement [supp] |
TCTrack: Temporal Contexts for Aerial Tracking [supp] |
SpaceEdit: Learning a Unified Editing Space for Open-Domain Image Color Editing [supp] |
GAN-Supervised Dense Visual Alignment [supp] |
SwinTextSpotter: Scene Text Spotting via Better Synergy Between Text Detection and Text Recognition [supp] |
Multi-Level Feature Learning for Contrastive Multi-View Clustering |
RendNet: Unified 2D/3D Recognizer With Latent Space Rendering |
iPLAN: Interactive and Procedural Layout Planning [supp] |
Video Frame Interpolation With Transformer [supp] |
GIFS: Neural Implicit Function for General Shape Representation [supp] |
Deblur-NeRF: Neural Radiance Fields From Blurry Images [supp] |
Egocentric Prediction of Action Target in 3D [supp] |
TemporalUV: Capturing Loose Clothing With Temporally Coherent UV Coordinates [supp] |
Whose Track Is It Anyway? Improving Robustness to Tracking Errors With Affinity-Based Trajectory Prediction |
DoubleField: Bridging the Neural Surface and Radiance Fields for High-Fidelity Human Reconstruction and Rendering [supp] |
Towards Real-World Navigation With Deep Differentiable Planners [supp] |
An Iterative Quantum Approach for Transformation Estimation From Point Sets [supp] |
Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation [supp] |
UnweaveNet: Unweaving Activity Stories [supp] |
Balanced MSE for Imbalanced Visual Regression [supp] |
Local Learning Matters: Rethinking Data Heterogeneity in Federated Learning [supp] |
PhysFormer: Facial Video-Based Physiological Measurement With Temporal Difference Transformer |
Dimension Embeddings for Monocular 3D Object Detection |
Look Closer To Supervise Better: One-Shot Font Generation via Component-Based Discriminator [supp] |
NeRFReN: Neural Radiance Fields With Reflections [supp] |
Blind Image Super-Resolution With Elaborate Degradation Modeling on Noise and Kernel [supp] |
Finding Good Configurations of Planar Primitives in Unorganized Point Clouds [supp] |
PhyIR: Physics-Based Inverse Rendering for Panoramic Indoor Images [supp] |
SCS-Co: Self-Consistent Style Contrastive Learning for Image Harmonization [supp] |
Beyond Fixation: Dynamic Window Visual Transformer |
Progressive End-to-End Object Detection in Crowded Scenes [supp] |
FMCNet: Feature-Level Modality Compensation for Visible-Infrared Person Re-Identification [supp] |
Improving GAN Equilibrium by Raising Spatial Awareness [supp] |
Neural Convolutional Surfaces [supp] |
HyperSegNAS: Bridging One-Shot Neural Architecture Search With 3D Medical Image Segmentation Using HyperNet [supp] |
A Comprehensive Study of Image Classification Model Sensitivity to Foregrounds, Backgrounds, and Visual Attributes [supp] |
ConDor: Self-Supervised Canonicalization of 3D Pose for Partial Shapes [supp] |
Source-Free Domain Adaptation via Distribution Estimation [supp] |
Robust Combination of Distributed Gradients Under Adversarial Perturbations [supp] |
Exploring Endogenous Shift for Cross-Domain Detection: A Large-Scale Benchmark and Perturbation Suppression Network |
VisCUIT: Visual Auditor for Bias in CNN Image Classifier |
Automatic Synthesis of Diverse Weak Supervision Sources for Behavior Analysis [supp] |
Transferability Estimation Using Bhattacharyya Class Separability [supp] |
DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition |
Hierarchical Self-Supervised Representation Learning for Movie Understanding |
Robust Egocentric Photo-Realistic Facial Expression Transfer for Virtual Reality |
Does Robustness on ImageNet Transfer to Downstream Tasks? [supp] |
Propagation Regularizer for Semi-Supervised Learning With Extremely Scarce Labeled Samples [supp] |
Bailando: 3D Dance Generation by Actor-Critic GPT With Choreographic Memory [supp] |
Faithful Extreme Rescaling via Generative Prior Reciprocated Invertible Representations [supp] |
Distillation Using Oracle Queries for Transformer-Based Human-Object Interaction Detection [supp] |
Proto2Proto: Can You Recognize the Car, the Way I Do? [supp] |
Learning Local-Global Contextual Adaptation for Multi-Person Pose Estimation [supp] |
Learning Video Representations of Human Motion From Synthetic Data [supp] |
TVConv: Efficient Translation Variant Convolution for Layout-Aware Visual Processing |
Dual Adversarial Adaptation for Cross-Device Real-World Image Super-Resolution |
FS6D: Few-Shot 6D Pose Estimation of Novel Objects [supp] |
Habitat-Web: Learning Embodied Object-Search Strategies From Human Demonstrations at Scale [supp] |
The Probabilistic Normal Epipolar Constraint for Frame-to-Frame Rotation Optimization Under Uncertain Feature Positions [supp] |
Vision-Language Pre-Training for Boosting Scene Text Detectors |
Reflection and Rotation Symmetry Detection via Equivariant Learning [supp] |
BoostMIS: Boosting Medical Image Semi-Supervised Learning With Adaptive Pseudo Labeling and Informative Active Annotation |
Simple but Effective: CLIP Embeddings for Embodied AI [supp] |
NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition [supp] |
HOI4D: A 4D Egocentric Dataset for Category-Level Human-Object Interaction |
Collaborative Transformers for Grounded Situation Recognition [supp] |
DyRep: Bootstrapping Training With Dynamic Re-Parameterization [supp] |
Not All Labels Are Equal: Rationalizing the Labeling Costs for Training Object Detection [supp] |
CPPF: Towards Robust Category-Level 9D Pose Estimation in the Wild [supp] |
Interact Before Align: Leveraging Cross-Modal Knowledge for Domain Adaptive Action Recognition [supp] |
Interactive Disentanglement: Learning Concepts by Interacting With Their Prototype Representations [supp] |
CDGNet: Class Distribution Guided Network for Human Parsing [supp] |
Recall@k Surrogate Loss With Large Batches and Similarity Mixup [supp] |
Direct Voxel Grid Optimization: Super-Fast Convergence for Radiance Fields Reconstruction [supp] |
Continual Test-Time Domain Adaptation [supp] |
URetinex-Net: Retinex-Based Deep Unfolding Network for Low-Light Image Enhancement [supp] |
Towards Multi-Domain Single Image Dehazing via Test-Time Training |
Vox2Cortex: Fast Explicit Reconstruction of Cortical Surfaces From 3D MRI Scans With Geometric Deep Neural Networks [supp] |
Deep Safe Multi-View Clustering: Reducing the Risk of Clustering Performance Degradation Caused by View Increase [supp] |
Dynamic MLP for Fine-Grained Image Classification by Leveraging Geographical and Temporal Information [supp] |
HP-Capsule: Unsupervised Face Part Discovery by Hierarchical Parsing Capsule Network [supp] |
ScanQA: 3D Question Answering for Spatial Scene Understanding [supp] |
MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-Based Visual Question Answering [supp] |
Class-Incremental Learning by Knowledge Distillation With Adaptive Feature Consolidation [supp] |
Learning Program Representations for Food Images and Cooking Recipes |
Bending Graphs: Hierarchical Shape Matching Using Gated Optimal Transport [supp] |
Transform-Retrieve-Generate: Natural Language-Centric Outside-Knowledge Visual Question Answering [supp] |
Federated Learning With Position-Aware Neurons [supp] |
Fair Contrastive Learning for Facial Attribute Classification [supp] |
MDAN: Multi-Level Dependent Attention Network for Visual Emotion Analysis |
Nested Hyperbolic Spaces for Dimensionality Reduction and Hyperbolic NN Design [supp] |
BNUDC: A Two-Branched Deep Neural Network for Restoring Images From Under-Display Cameras [supp] |
RGB-Depth Fusion GAN for Indoor Depth Completion [supp] |
Training Object Detectors From Scratch: An Empirical Study in the Era of Vision Transformer |
RCL: Recurrent Continuous Localization for Temporal Action Detection [supp] |
C2SLR: Consistency-Enhanced Continuous Sign Language Recognition [supp] |
Human Trajectory Prediction With Momentary Observation |
FoggyStereo: Stereo Matching With Fog Volume Representation [supp] |
Trajectory Optimization for Physics-Based Reconstruction of 3D Human Pose From Monocular Video [supp] |
Directional Self-Supervised Learning for Heavy Image Augmentations [supp] |
Lifelong Unsupervised Domain Adaptive Person Re-Identification With Coordinated Anti-Forgetting and Adaptation [supp] |
No-Reference Point Cloud Quality Assessment via Domain Adaptation |
Generating Representative Samples for Few-Shot Classification [supp] |
Comprehending and Ordering Semantics for Image Captioning |
Dynamic Scene Graph Generation via Anticipatory Pre-Training |
A Large-Scale Comprehensive Dataset and Copy-Overlap Aware Evaluation Protocol for Segment-Level Video Copy Detection [supp] |
GaTector: A Unified Framework for Gaze Object Prediction [supp] |
ELIC: Efficient Learned Image Compression With Unevenly Grouped Space-Channel Contextual Adaptive Coding [supp] |
CSWin Transformer: A General Vision Transformer Backbone With Cross-Shaped Windows [supp] |
LaTr: Layout-Aware Transformer for Scene-Text VQA [supp] |
Label Relation Graphs Enhanced Hierarchical Residual Network for Hierarchical Multi-Granularity Classification [supp] |
ITSA: An Information-Theoretic Approach to Automatic Shortcut Avoidance and Domain Generalization in Stereo Matching Networks [supp] |
Enhancing Face Recognition With Self-Supervised 3D Reconstruction |
HeadNeRF: A Real-Time NeRF-Based Parametric Head Model |
FvOR: Robust Joint Shape and Pose Optimization for Few-View Object Reconstruction [supp] |
Reduce Information Loss in Transformers for Pluralistic Image Inpainting [supp] |
Replacing Labeled Real-Image Datasets With Auto-Generated Contours |
Cross-Modal Transferable Adversarial Attacks From Images to Videos [supp] |
Few Could Be Better Than All: Feature Sampling and Grouping for Scene Text Detection [supp] |
Do Explanations Explain? Model Knows Best [supp] |
WebQA: Multihop and Multimodal QA [supp] |
Occlusion-Robust Face Alignment Using a Viewpoint-Invariant Hierarchical Network Architecture [supp] |
BasicVSR++: Improving Video Super-Resolution With Enhanced Propagation and Alignment [supp] |
IDR: Self-Supervised Image Denoising via Iterative Data Refinement [supp] |
MogFace: Towards a Deeper Appreciation on Face Detection [supp] |
GuideFormer: Transformers for Image Guided Depth Completion [supp] |
Multi-Label Iterated Learning for Image Classification With Label Ambiguity [supp] |
Region-Aware Face Swapping |
Towards Language-Free Training for Text-to-Image Generation [supp] |
Learning Affinity From Attention: End-to-End Weakly-Supervised Semantic Segmentation With Transformers [supp] |
Pushing the Envelope of Gradient Boosting Forests via Globally-Optimized Oblique Trees [supp] |
Physical Simulation Layer for Accurate 3D Modeling [supp] |
Deformable Sprites for Unsupervised Video Decomposition [supp] |
CamLiFlow: Bidirectional Camera-LiDAR Fusion for Joint Optical Flow and Scene Flow Estimation [supp] |
FERV39k: A Large-Scale Multi-Scene Dataset for Facial Expression Recognition in Videos [supp] |
Learning To Detect Mobile Objects From LiDAR Scans Without Labels [supp] |
BNV-Fusion: Dense 3D Reconstruction Using Bi-Level Neural Volume Fusion [supp] |
Probabilistic Representations for Video Contrastive Learning [supp] |
EnvEdit: Environment Editing for Vision-and-Language Navigation [supp] |
Omnivore: A Single Model for Many Visual Modalities [supp] |
Neural Shape Mating: Self-Supervised Object Assembly With Adversarial Shape Priors |
Reflash Dropout in Image Super-Resolution [supp] |
WildNet: Learning Domain Generalized Semantic Segmentation From the Wild [supp] |
Auditing Privacy Defenses in Federated Learning via Generative Gradient Leakage [supp] |
DAIR-V2X: A Large-Scale Dataset for Vehicle-Infrastructure Cooperative 3D Object Detection |
DECORE: Deep Compression With Reinforcement Learning |
Time3D: End-to-End Joint Monocular 3D Object Detection and Tracking for Autonomous Driving [supp] |
MonoJSG: Joint Semantic and Geometric Cost Volume for Monocular 3D Object Detection [supp] |
Task Discrepancy Maximization for Fine-Grained Few-Shot Classification [supp] |
FedDC: Federated Learning With Non-IID Data via Local Drift Decoupling and Correction [supp] |
Efficient Classification of Very Large Images With Tiny Objects [supp] |
SWEM: Towards Real-Time Video Object Segmentation With Sequential Weighted Expectation-Maximization [supp] |
Point-to-Voxel Knowledge Distillation for LiDAR Semantic Segmentation [supp] |
Leveling Down in Computer Vision: Pareto Inefficiencies in Fair Deep Classifiers [supp] |
Generating Diverse 3D Reconstructions From a Single Occluded Face Image [supp] |
RBGNet: Ray-Based Grouping for 3D Object Detection [supp] |
Stand-Alone Inter-Frame Attention in Video Models |
Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose Estimation [supp] |
Open-Domain, Content-Based, Multi-Modal Fact-Checking of Out-of-Context Images via Online Resources [supp] |
Memory-Augmented Deep Conditional Unfolding Network for Pan-Sharpening |
Semi-Supervised Wide-Angle Portraits Correction by Multi-Scale Transformer [supp] |
Large-Scale Pre-Training for Person Re-Identification With Noisy Labels [supp] |
Adiabatic Quantum Computing for Multi Object Tracking [supp] |
Feature Erasing and Diffusion Network for Occluded Person Re-Identification |
Is Mapping Necessary for Realistic PointGoal Navigation? [supp] |
Node-Aligned Graph Convolutional Network for Whole-Slide Image Representation and Classification |
Represent, Compare, and Learn: A Similarity-Aware Framework for Class-Agnostic Counting [supp] |
Masked Feature Prediction for Self-Supervised Visual Pre-Training [supp] |
Critical Regularizations for Neural Surface Reconstruction in the Wild [supp] |
EASE: Unsupervised Discriminant Subspace Learning for Transductive Few-Shot Learning [supp] |
Object-Relation Reasoning Graph for Action Recognition |
Semantic Segmentation by Early Region Proxy [supp] |
GIQE: Generic Image Quality Enhancement via Nth Order Iterative Degradation [supp] |
Instance Segmentation With Mask-Supervised Polygonal Boundary Transformers |
FaceVerse: A Fine-Grained and Detail-Controllable 3D Face Morphable Model From a Hybrid Dataset [supp] |
Bring Evanescent Representations to Life in Lifelong Class Incremental Learning [supp] |
Single-Stage 3D Geometry-Preserving Depth Estimation Model Training on Dataset Mixtures With Uncalibrated Stereo Data [supp] |
LD-ConGR: A Large RGB-D Video Dataset for Long-Distance Continuous Gesture Recognition [supp] |
SimVQA: Exploring Simulated Environments for Visual Question Answering [supp] |
Thin-Plate Spline Motion Model for Image Animation [supp] |
Learning Local Displacements for Point Cloud Completion [supp] |
Human Hands As Probes for Interactive Object Understanding [supp] |
Understanding and Increasing Efficiency of Frank-Wolfe Adversarial Training [supp] |
Certified Patch Robustness via Smoothed Vision Transformers [supp] |
Look Back and Forth: Video Super-Resolution With Explicit Temporal Difference Modeling |
UCC: Uncertainty Guided Cross-Head Co-Training for Semi-Supervised Semantic Segmentation |
HVH: Learning a Hybrid Neural Volumetric Representation for Dynamic Hair Performance Capture [supp] |
RADU: Ray-Aligned Depth Update Convolutions for ToF Data Denoising [supp] |
Rethinking Visual Geo-Localization for Large-Scale Applications [supp] |
Learning Based Multi-Modality Image and Video Compression [supp] |