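CVPR 2022 opens on June 22 and has accepted 2,067 papers. Because the list is long, it is split across four articles; click a paper title to open the paper.
Other parts: Part 2, Part 3, Part 4.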
Dual Cross-Attention Learning for Fine-Grained Visual Categorization and Object Re-Identification [supp] |
SimAN: Exploring Self-Supervised Representation Learning of Scene Text via Similarity-Aware Normalization [supp] |
GASP, a Generalized Framework for Agglomerative Clustering of Signed Graphs and Its Application to Instance Segmentation [supp] |
Estimating Example Difficulty Using Variance of Gradients [supp] |
One Loss for Quantization: Deep Hashing With Discrete Wasserstein Distributional Matching [supp] |
Pixel Screening Based Intermediate Correction for Blind Deblurring [supp] |
Weakly Supervised Semantic Segmentation by Pixel-to-Prototype Contrast |
Controllable Animation of Fluid Elements in Still Images |
Holocurtains: Programming Light Curtains via Binary Holography [supp] |
Recurrent Dynamic Embedding for Video Object Segmentation [supp] |
Deep Hierarchical Semantic Segmentation [supp] |
f-SfT: Shape-From-Template With a Physics-Based Deformation Model [supp] |
Continual Object Detection via Prototypical Task Correlation Guided Gating Mechanism [supp] |
DATA: Domain-Aware and Task-Aware Self-Supervised Learning [supp] |
TWIST: Two-Way Inter-Label Self-Training for Semi-Supervised 3D Instance Segmentation [supp] |
Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection From Point Clouds |
Learning Adaptive Warping for Real-World Rolling Shutter Correction [supp] |
Siamese Contrastive Embedding Network for Compositional Zero-Shot Learning |
Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions [supp] |
RIM-Net: Recursive Implicit Fields for Unsupervised Learning of Hierarchical Shape Structures |
Do Learned Representations Respect Causal Relationships? [supp] |
ZebraPose: Coarse To Fine Surface Encoding for 6DoF Object Pose Estimation [supp] |
ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [supp] |
Learning To Affiliate: Mutual Centralized Learning for Few-Shot Classification [supp] |
CAPRI-Net: Learning Compact CAD Shapes With Adaptive Primitive Assembly [supp] |
ATPFL: Automatic Trajectory Prediction Model Design Under Federated Learning Framework |
Revisiting Learnable Affines for Batch Norm in Few-Shot Transfer Learning |
Bridging the Gap Between Classification and Localization for Weakly Supervised Object Localization [supp] |
Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation [supp] |
3D Moments From Near-Duplicate Photos |
Exact Feature Distribution Matching for Arbitrary Style Transfer and Domain Generalization [supp] |
Blind2Unblind: Self-Supervised Image Denoising With Visible Blind Spots [supp] |
Balanced and Hierarchical Relation Learning for One-Shot Object Detection |
End-to-End Generative Pretraining for Multimodal Video Captioning [supp] |
Delving Deep Into the Generalization of Vision Transformers Under Distribution Shifts |
NICE-SLAM: Neural Implicit Scalable Encoding for SLAM [supp] |
HyperDet3D: Learning a Scene-Conditioned 3D Object Detector [supp] |
Stochastic Trajectory Prediction via Motion Indeterminacy Diffusion [supp] |
CLRNet: Cross Layer Refinement Network for Lane Detection [supp] |
Cross-Modal Map Learning for Vision and Language Navigation [supp] |
Motion-Aware Contrastive Video Representation Learning via Foreground-Background Merging [supp] |
Incremental Transformer Structure Enhanced Image Inpainting With Masking Positional Encoding [supp] |
Pointly-Supervised Instance Segmentation [supp] |
Cross-Modal Clinical Graph Transformer for Ophthalmic Report Generation |
Human-Object Interaction Detection via Disentangled Transformer [supp] |
DINE: Domain Adaptation From Single and Multiple Black-Box Predictors |
LGT-Net: Indoor Panoramic Room Layout Estimation With Geometry-Aware Transformer Network [supp] |
CRIS: CLIP-Driven Referring Image Segmentation |
Multi-View Mesh Reconstruction With Neural Deferred Shading [supp] |
CVF-SID: Cyclic Multi-Variate Function for Self-Supervised Image Denoising by Disentangling Noise From Image [supp] |
Infrared Invisible Clothing: Hiding From Infrared Detectors at Multiple Angles in Real World [supp] |
Distribution-Aware Single-Stage Models for Multi-Person 3D Pose Estimation |
FaceFormer: Speech-Driven 3D Facial Animation With Transformers [supp] |
Exploring Patch-Wise Semantic Relation for Contrastive Learning in Image-to-Image Translation Tasks [supp] |
High-Resolution Face Swapping via Latent Semantics Disentanglement [supp] |
Searching the Deployable Convolution Neural Networks for GPUs [supp] |
Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning [supp] |
DeepFake Disrupter: The Detector of DeepFake Is My Friend [supp] |
Rotationally Equivariant 3D Object Detection [supp] |
Accelerating DETR Convergence via Semantic-Aligned Matching [supp] |
Long-Short Temporal Contrastive Learning of Video Transformers |
Vision Transformer With Deformable Attention [supp] |
Towards General Purpose Vision Systems: An End-to-End Task-Agnostic Vision-Language Architecture [supp] |
Deep Vanishing Point Detection: Geometric Priors Make Dataset Variations Vanish [supp] |
RM-Depth: Unsupervised Learning of Recurrent Monocular Depth in Dynamic Scenes |
LiT: Zero-Shot Transfer With Locked-Image Text Tuning [supp] |
Cloning Outfits From Real-World Images to 3D Characters for Generalizable Person Re-Identification [supp] |
GeoNeRF: Generalizing NeRF With Geometry Priors [supp] |
ABPN: Adaptive Blend Pyramid Network for Real-Time Local Retouching of Ultra High-Resolution Photo [supp] |
PhoCaL: A Multi-Modal Dataset for Category-Level Object Pose Estimation With Photometrically Challenging Objects [supp] |
Neural Compression-Based Feature Learning for Video Restoration [supp] |
Expanding Low-Density Latent Regions for Open-Set Object Detection [supp] |
Drop the GAN: In Defense of Patches Nearest Neighbors As Single Image Generative Models |
Uformer: A General U-Shaped Transformer for Image Restoration [supp] |
Exploring Dual-Task Correlation for Pose Guided Person Image Generation |
Portrait Eyeglasses and Shadow Removal by Leveraging 3D Synthetic Data [supp] |
Neural Rays for Occlusion-Aware Image-Based Rendering [supp] |
Modeling 3D Layout for Group Re-Identification |
Open-World Instance Segmentation: Exploiting Pseudo Ground Truth From Learned Pairwise Affinity [supp] |
SIOD: Single Instance Annotated per Category per Image for Object Detection [supp] |
Toward Fast, Flexible, and Robust Low-Light Image Enhancement [supp] |
Online Learning of Reusable Abstract Models for Object Goal Navigation |
Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos |
SimMatch: Semi-Supervised Learning With Similarity Matching |
OrphicX: A Causality-Inspired Latent Variable Model for Interpreting Graph Neural Networks [supp] |
HandOccNet: Occlusion-Robust 3D Hand Mesh Estimation Network [supp] |
EfficientNeRF: Efficient Neural Radiance Fields [supp] |
Quantifying Societal Bias Amplification in Image Captioning [supp] |
Modular Action Concept Grounding in Semantic Video Prediction [supp] |
StyleSwin: Transformer-Based GAN for High-Resolution Image Generation [supp] |
Reinforced Structured State-Evolution for Vision-Language Navigation |
Sub-Word Level Lip Reading With Visual Attention |
Weakly Supervised High-Fidelity Clothing Model Generation [supp] |
Highly-Efficient Incomplete Large-Scale Multi-View Clustering With Consensus Bipartite Graph [supp] |
Towards Principled Disentanglement for Domain Generalization [supp] |
Discrete Cosine Transform Network for Guided Depth Map Super-Resolution [supp] |
Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing [supp] |
E2V-SDE: From Asynchronous Events to Fast and Continuous Video Reconstruction via Neural Stochastic Differential Equations [supp] |
CoSSL: Co-Learning of Representation and Classifier for Imbalanced Semi-Supervised Learning [supp] |
Discovering Objects That Can Move [supp] |
Knowledge Mining With Scene Text for Fine-Grained Recognition |
Self-Supervised Learning of Object Parts for Semantic Segmentation [supp] |
Iterative Corresponding Geometry: Fusing Region and Depth for Highly Efficient 3D Tracking of Textureless Objects [supp] |
Single-Photon Structured Light [supp] |
Deblurring via Stochastic Refinement [supp] |
3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds |
TransGeo: Transformer Is All You Need for Cross-View Image Geo-Localization [supp] |
R(Det)²: Randomized Decision Routing for Object Detection [supp] |
Abandoning the Bayer-Filter To See in the Dark [supp] |
SASIC: Stereo Image Compression With Latent Shifts and Stereo Attention [supp] |
Exploiting Temporal Relations on Radar Perception for Autonomous Driving [supp] |
Multi-Instance Point Cloud Registration by Efficient Correspondence Clustering [supp] |
Contrastive Boundary Learning for Point Cloud Segmentation [supp] |
Details or Artifacts: A Locally Discriminative Learning Approach to Realistic Image Super-Resolution [supp] |
CVNet: Contour Vibration Network for Building Extraction |
Hyperbolic Image Segmentation [supp] |
Forward Compatible Training for Large-Scale Embedding Retrieval Systems [supp] |
Everything at Once - Multi-Modal Fusion Transformer for Video Retrieval [supp] |
Swin Transformer V2: Scaling Up Capacity and Resolution [supp] |
Neural Template: Topology-Aware Reconstruction and Disentangled Generation of 3D Meshes [supp] |
DEFEAT: Deep Hidden Feature Backdoor Attacks by Imperceptible Perturbation and Latent Representation Constraints |
Projective Manifold Gradient Layer for Deep Rotation Regression [supp] |
CLIMS: Cross Language Image Matching for Weakly Supervised Semantic Segmentation [supp] |
Learning To Refactor Action and Co-Occurrence Features for Temporal Action Localization [supp] |
It's Time for Artistic Correspondence in Music and Video [supp] |
Mixed Differential Privacy in Computer Vision [supp] |
AdaFace: Quality Adaptive Margin for Face Recognition [supp] |
Learning Soft Estimator of Keypoint Scale and Orientation With Probabilistic Covariant Loss [supp] |
DN-DETR: Accelerate DETR Training by Introducing Query DeNoising [supp] |
HCSC: Hierarchical Contrastive Selective Coding [supp] |
TransRank: Self-Supervised Video Representation Learning via Ranking-Based Transformation Recognition [supp] |
KeyTr: Keypoint Transporter for 3D Reconstruction of Deformable Objects in Videos |
Invariant Grounding for Video Question Answering [supp] |
Prompt Distribution Learning [supp] |
RAGO: Recurrent Graph Optimizer for Multiple Rotation Averaging [supp] |
Arch-Graph: Acyclic Architecture Relation Predictor for Task-Transferable Neural Architecture Search [supp] |
On Aliased Resizing and Surprising Subtleties in GAN Evaluation |
Lepard: Learning Partial Point Cloud Matching in Rigid and Deformable Scenes [supp] |
Virtual Elastic Objects [supp] |
DiSparse: Disentangled Sparsification for Multitask Model Compression [supp] |
Pushing the Limits of Simple Pipelines for Few-Shot Learning: External Data and Fine-Tuning Make a Difference [supp] |
Opening Up Open World Tracking [supp] |
Towards Efficient and Scalable Sharpness-Aware Minimization [supp] |
VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention [supp] |
Rethinking Deep Face Restoration [supp] |
OSSO: Obtaining Skeletal Shape From Outside [supp] |
Temporal Alignment Networks for Long-Term Video [supp] |
Few-Shot Head Swapping in the Wild [supp] |
A Study on the Distribution of Social Biases in Self-Supervised Learning Visual Models [supp] |
LAR-SR: A Local Autoregressive Model for Image Super-Resolution [supp] |
Bayesian Invariant Risk Minimization [supp] |
Democracy Does Matter: Comprehensive Feature Mining for Co-Salient Object Detection [supp] |
Alleviating Semantics Distortion in Unsupervised Low-Level Image-to-Image Translation via Structure Consistency Constraint [supp] |
Doodle It Yourself: Class Incremental Learning by Drawing a Few Sketches [supp] |
Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes [supp] |
ICON: Implicit Clothed Humans Obtained From Normals [supp] |
Comparing Correspondences: Video Prediction With Correspondence-Wise Losses |
Uni-Perceiver: Pre-Training Unified Architecture for Generic Perception for Zero-Shot and Few-Shot Tasks [supp] |
The Auto Arborist Dataset: A Large-Scale Benchmark for Multiview Urban Forest Monitoring Under Domain Shift [supp] |
On the Instability of Relative Pose Estimation and RANSAC's Role [supp] |
Shape From Polarization for Complex Scenes in the Wild |
Real-Time, Accurate, and Consistent Video Semantic Segmentation via Unsupervised Adaptation and Cross-Unit Deployment on Mobile Device [supp] |
SNUG: Self-Supervised Neural Dynamic Garments [supp] |
Towards Fewer Annotations: Active Learning via Region Impurity and Prediction Uncertainty for Domain Adaptive Semantic Segmentation [supp] |
Glass Segmentation Using Intensity and Spectral Polarization Cues [supp] |
CrossPoint: Self-Supervised Cross-Modal Contrastive Learning for 3D Point Cloud Understanding |
Few Shot Generative Model Adaption via Relaxed Spatial Structural Alignment |
Target-Relevant Knowledge Preservation for Multi-Source Domain Adaptive Object Detection [supp] |
Pyramid Grafting Network for One-Stage High Resolution Saliency Detection [supp] |
A Style-Aware Discriminator for Controllable Image Translation [supp] |
Non-Iterative Recovery From Nonlinear Observations Using Generative Models [supp] |
Incremental Cross-View Mutual Distillation for Self-Supervised Medical CT Synthesis |
Enhancing Adversarial Training With Second-Order Statistics of Weights [supp] |
Partially Does It: Towards Scene-Level FG-SBIR With Partial Input [supp] |
Dual Temperature Helps Contrastive Learning Without Many Negative Samples: Towards Understanding and Simplifying MoCo [supp] |
Moving Window Regression: A Novel Approach to Ordinal Regression [supp] |
UniCoRN: A Unified Conditional Image Repainting Network |
Forecasting Characteristic 3D Poses of Human Actions [supp] |
ACPL: Anti-Curriculum Pseudo-Labelling for Semi-Supervised Medical Image Classification [supp] |
Learning to Deblur Using Light Field Generated and Real Defocus Images [supp] |
Self-Supervised Predictive Convolutional Attentive Block for Anomaly Detection [supp] |
Safe Self-Refinement for Transformer-Based Domain Adaptation |
Density-Preserving Deep Point Cloud Compression [supp] |
StyleMesh: Style Transfer for Indoor 3D Scene Reconstructions [supp] |
Which Model To Transfer? Finding the Needle in the Growing Haystack [supp] |
Fast and Unsupervised Action Boundary Detection for Action Segmentation |
Class-Incremental Learning With Strong Pre-Trained Models [supp] |
Robust Optimization As Data Augmentation for Large-Scale Graphs [supp] |
Robust Structured Declarative Classifiers for 3D Point Clouds: Defending Adversarial Attacks With Implicit Gradients |
PhotoScene: Photorealistic Material and Lighting Transfer for Indoor Scenes [supp] |
Improving the Transferability of Targeted Adversarial Examples Through Object-Based Diverse Input [supp] |
IRON: Inverse Rendering by Optimizing Neural SDFs and Materials From Photometric Images [supp] |
ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer |
Versatile Multi-Modal Pre-Training for Human-Centric Perception |
360MonoDepth: High-Resolution 360° Monocular Depth Estimation [supp] |
Splicing ViT Features for Semantic Appearance Transfer |
Contrastive Regression for Domain Adaptation on Gaze Estimation [supp] |
MUSE-VAE: Multi-Scale VAE for Environment-Aware Long Term Trajectory Prediction [supp] |
Multi-View Consistent Generative Adversarial Networks for 3D-Aware Image Synthesis [supp] |
Putting People in Their Place: Monocular Regression of 3D People in Depth [supp] |
POCO: Point Convolution for Surface Reconstruction [supp] |
Memory-Augmented Non-Local Attention for Video Super-Resolution [supp] |
Neural Texture Extraction and Distribution for Controllable Person Image Synthesis |
Classification-Then-Grounding: Reformulating Video Scene Graphs As Temporal Bipartite Graphs [supp] |
Transformer-Empowered Multi-Scale Contextual Matching and Aggregation for Multi-Contrast MRI Super-Resolution [supp] |
GazeOnce: Real-Time Multi-Person Gaze Estimation [supp] |
GateHUB: Gated History Unit With Background Suppression for Online Action Detection [supp] |
Few-Shot Font Generation by Learning Fine-Grained Local Styles [supp] |
Bridging Video-Text Retrieval With Multiple Choice Questions [supp] |
Depth-Aware Generative Adversarial Network for Talking Head Video Generation [supp] |
Dual-Path Image Inpainting With Auxiliary GAN Inversion [supp] |
DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis |
Generative Flows With Invertible Attentions [supp] |
Clipped Hyperbolic Classifiers Are Super-Hyperbolic Classifiers [supp] |
Estimating Fine-Grained Noise Model via Contrastive Learning |
DiffPoseNet: Direct Differentiable Camera Pose Estimation |
The Flag Median and FlagIRLS [supp] |
Implicit Feature Decoupling With Depthwise Quantization [supp] |
Graph-Context Attention Networks for Size-Varied Deep Graph Matching [supp] |
FENeRF: Face Editing in Neural Radiance Fields |
CoNeRF: Controllable Neural Radiance Fields [supp] |
Noise2NoiseFlow: Realistic Camera Noise Modeling Without Clean Images [supp] |
ZeroWaste Dataset: Towards Deformable Object Segmentation in Cluttered Scenes |
Remember Intentions: Retrospective-Memory-Based Trajectory Prediction [supp] |
Measuring Compositional Consistency for Video Question Answering [supp] |
Category Contrast for Unsupervised Domain Adaptation in Visual Tasks [supp] |
SwapMix: Diagnosing and Regularizing the Over-Reliance on Visual Context in Visual Question Answering |
UNIST: Unpaired Neural Implicit Shape Translation Network [supp] |
Local-Adaptive Face Recognition via Graph-Based Meta-Clustering and Regularized Adaptation [supp] |
The DEVIL Is in the Details: A Diagnostic Evaluation Benchmark for Video Inpainting [supp] |
Mutual Information-Driven Pan-Sharpening |
Shifting More Attention to Visual Backbone: Query-Modulated Refinement Networks for End-to-End Visual Grounding [supp] |
A Framework for Learning Ante-Hoc Explainable Models via Concepts [supp] |
Generating Useful Accident-Prone Driving Scenarios via a Learned Traffic Prior [supp] |
FLOAT: Factorized Learning of Object Attributes for Improved Multi-Object Multi-Part Scene Parsing |
Efficient Geometry-Aware 3D Generative Adversarial Networks [supp] |
DO-GAN: A Double Oracle Framework for Generative Adversarial Networks [supp] |
Dancing Under the Stars: Video Denoising in Starlight [supp] |
FocusCut: Diving Into a Focus View in Interactive Segmentation |
Medial Spectral Coordinates for 3D Shape Analysis |
Contextualized Spatio-Temporal Contrastive Learning With Self-Supervision [supp] |
Rethinking Architecture Design for Tackling Data Heterogeneity in Federated Learning [supp] |
APES: Articulated Part Extraction From Sprite Sheets [supp] |
Dressing in the Wild by Watching Dance Videos [supp] |
SPAct: Self-Supervised Privacy Preservation for Action Recognition [supp] |
Uni6D: A Unified CNN Framework Without Projection Breakdown for 6D Pose Estimation [supp] |
De-Rendering 3D Objects in the Wild [supp] |
SPAMs: Structured Implicit Parametric Models [supp] |
Global Sensing and Measurements Reuse for Image Compressed Sensing |
SeeThroughNet: Resurrection of Auxiliary Loss by Preserving Class Probability Information [supp] |
Representing 3D Shapes With Probabilistic Directed Distance Fields [supp] |
Learning ABCs: Approximate Bijective Correspondence for Isolating Factors of Variation With Weak Supervision [supp] |
ABO: Dataset and Benchmarks for Real-World 3D Object Understanding [supp] |
DETReg: Unsupervised Pretraining With Region Priors for Object Detection [supp] |
Learning To Restore 3D Face From In-the-Wild Degraded Images [supp] |
Practical Evaluation of Adversarial Robustness via Adaptive Auto Attack [supp] |
Convolutions for Spatial Interaction Modeling [supp] |
MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection [supp] |
Salvage of Supervision in Weakly Supervised Object Detection [supp] |
Cross-View Transformers for Real-Time Map-View Semantic Segmentation |
Distinguishing Unseen From Seen for Generalized Zero-Shot Learning |
Online Continual Learning on a Contaminated Data Stream With Blurry Task Boundaries [supp] |
Controllable Dynamic Multi-Task Architectures [supp] |
Learning To Imagine: Diversify Memory for Incremental Learning Using Unlabeled Data [supp] |
SmartAdapt: Multi-Branch Object Detection Framework for Videos on Mobiles [supp] |
VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks [supp] |
Deep Hybrid Models for Out-of-Distribution Detection [supp] |
Accelerating Video Object Segmentation With Compressed Video [supp] |
Exploring Domain-Invariant Parameters for Source Free Domain Adaptation |
FastDOG: Fast Discrete Optimization on GPU [supp] |
Fire Together Wire Together: A Dynamic Pruning Approach With Self-Supervised Mask Prediction [supp] |
Multi-Source Uncertainty Mining for Deep Unsupervised Saliency Detection |
Self-Supervised Equivariant Learning for Oriented Keypoint Detection [supp] |
Wavelet Knowledge Distillation: Towards Efficient Image-to-Image Translation |
Focal and Global Knowledge Distillation for Detectors |
Learning To Prompt for Continual Learning [supp] |
Human Mesh Recovery From Multiple Shots [supp] |
Improving Adversarial Transferability via Neuron Attribution-Based Attacks [supp] |
Better Trigger Inversion Optimization in Backdoor Scanning [supp] |
GANSeg: Learning To Segment by Unsupervised Hierarchical Image Generation [supp] |
Dense Learning Based Semi-Supervised Object Detection |
Fixing Malfunctional Objects With Learned Physical Simulation and Functional Prediction [supp] |
Convolution of Convolution: Let Kernels Spatially Collaborate |
Make It Move: Controllable Image-to-Video Generation With Text Descriptions [supp] |
C2AM Loss: Chasing a Better Decision Boundary for Long-Tail Object Detection |
Neural Points: Point Cloud Representation With Neural Fields for Arbitrary Upsampling |
Distribution Consistent Neural Architecture Search |
Video-Text Representation Learning via Differentiable Weak Temporal Alignment [supp] |
Bi-Directional Object-Context Prioritization Learning for Saliency Ranking [supp] |
FreeSOLO: Learning To Segment Objects Without Annotations [supp] |
What Do Navigation Agents Learn About Their Environment? [supp] |
Progressive Minimal Path Method With Embedded CNN |
FIFO: Learning Fog-Invariant Features for Foggy Scene Segmentation [supp] |
3D Human Tongue Reconstruction From Single "In-the-Wild" Images [supp] |
Enhancing Adversarial Robustness for Deep Metric Learning [supp] |
Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation [supp] |
Lite-MDETR: A Lightweight Multi-Modal Detector |
CoordGAN: Self-Supervised Dense Correspondences Emerge From GANs [supp] |
A Simple Multi-Modality Transfer Learning Baseline for Sign Language Translation [supp] |
Unsupervised Visual Representation Learning by Online Constrained K-Means [supp] |
Neural Point Light Fields [supp] |
Vehicle Trajectory Prediction Works, but Not Everywhere [supp] |
PSMNet: Position-Aware Stereo Merging Network for Room Layout Estimation [supp] |
MonoDTR: Monocular 3D Object Detection With Depth-Aware Transformer [supp] |
Learning Graph Regularisation for Guided Super-Resolution [supp] |
Instance-Wise Occlusion and Depth Orders in Natural Scenes [supp] |
Look for the Change: Learning Object States and State-Modifying Actions From Untrimmed Web Videos [supp] |
Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-Shot Learning [supp] |
Generalized Category Discovery [supp] |
Maximum Consensus by Weighted Influences of Monotone Boolean Functions [supp] |
TransforMatcher: Match-to-Match Attention for Semantic Correspondence [supp] |
Robust Outlier Detection by De-Biasing VAE Likelihoods [supp] |
Contour-Hugging Heatmaps for Landmark Detection [supp] |
Voxel Field Fusion for 3D Object Detection |
Divide and Conquer: Compositional Experts for Generalized Novel Class Discovery [supp] |
Programmatic Concept Learning for Human Motion Description and Synthesis [supp] |
Interpretable Part-Whole Hierarchies and Conceptual-Semantic Relationships in Neural Networks [supp] |
Fast Algorithm for Low-Rank Tensor Completion in Delay-Embedded Space |
Panoptic, Instance and Semantic Relations: A Relational Context Encoder To Enhance Panoptic Segmentation [supp] |
Point2Seq: Detecting 3D Objects As Sequences [supp] |
Less Is More: Generating Grounded Navigation Instructions From Landmarks [supp] |
Task-Adaptive Negative Envision for Few-Shot Open-Set Recognition [supp] |
DisARM: Displacement Aware Relation Module for 3D Detection [supp] |
ETHSeg: An Amodel Instance Segmentation Network and a Real-World Dataset for X-Ray Waste Inspection [supp] |
MixFormer: Mixing Features Across Windows and Dimensions [supp] |
Killing Two Birds With One Stone: Efficient and Robust Training of Face Recognition CNNs by Partial FC |
NeRF-Editing: Geometry Editing of Neural Radiance Fields [supp] |
Optimal Correction Cost for Object Detection Evaluation [supp] |
Contextual Similarity Distillation for Asymmetric Image Retrieval |
FineDiving: A Fine-Grained Dataset for Procedure-Aware Action Quality Assessment [supp] |
Artistic Style Discovery With Independent Components |
HEAT: Holistic Edge Attention Transformer for Structured Reconstruction [supp] |
HyperStyle: StyleGAN Inversion With HyperNetworks for Real Image Editing [supp] |
DASO: Distribution-Aware Semantics-Oriented Pseudo-Label for Imbalanced Semi-Supervised Learning [supp] |
Mobile-Former: Bridging MobileNet and Transformer [supp] |
Exploiting Pseudo Labels in a Self-Supervised Learning Framework for Improved Monocular Depth Estimation [supp] |
DESTR: Object Detection With Split Transformer [supp] |
LTP: Lane-Based Trajectory Prediction for Autonomous Driving [supp] |
CycleMix: A Holistic Strategy for Medical Image Segmentation From Scribble Supervision [supp] |
VideoINR: Learning Video Implicit Neural Representation for Continuous Space-Time Super-Resolution [supp] |
Towards End-to-End Unified Scene Text Detection and Layout Analysis [supp] |
Image Based Reconstruction of Liquids From 2D Surface Detections [supp] |
Contextual Outpainting With Object-Level Contrastive Learning [supp] |
AP-BSN: Self-Supervised Denoising for Real-World Images via Asymmetric PD and Blind-Spot Network [supp] |
AutoSDF: Shape Priors for 3D Completion, Reconstruction and Generation |
ISNAS-DIP: Image-Specific Neural Architecture Search for Deep Image Prior [supp] |
Depth-Guided Sparse Structure-From-Motion for Movies and TV Shows [supp] |
End-to-End Referring Video Object Segmentation With Multimodal Transformers [supp] |
Unpaired Cartoon Image Synthesis via Gated Cycle Mapping [supp] |
IterMVS: Iterative Probability Estimation for Efficient Multi-View Stereo [supp] |
Not All Points Are Equal: Learning Highly Efficient Point-Based Detectors for 3D LiDAR Point Clouds [supp] |
FedCorr: Multi-Stage Federated Learning for Label Noise Correction [supp] |
Detecting Camouflaged Object in Frequency Domain [supp] |
RigNeRF: Fully Controllable Neural 3D Portraits |
CLIP-Forge: Towards Zero-Shot Text-To-Shape Generation [supp] |
Style-Based Global Appearance Flow for Virtual Try-On |
Source-Free Object Detection by Learning To Overlook Domain Style |
Active Learning for Open-Set Annotation |
SceneSqueezer: Learning To Compress Scene for Camera Relocalization [supp] |
SelfRecon: Self Reconstruction Your Digital Avatar From Monocular Video |
Instance-Dependent Label-Noise Learning With Manifold-Regularized Transition Matrix Estimation |
Rethinking the Augmentation Module in Contrastive Learning: Learning Hierarchical Augmentation Invariance With Expanded Views |
Self-Supervised Models Are Continual Learners [supp] |
Dreaming To Prune Image Deraining Networks [supp] |
Equivariant Point Cloud Analysis via Learning Orientations for Message Passing [supp] |
When Does Contrastive Visual Representation Learning Work? |
One Step at a Time: Long-Horizon Vision-and-Language Navigation With Milestones |
Node Representation Learning in Graph via Node-to-Neighbourhood Mutual Information Maximization [supp] |
Point Cloud Pre-Training With Natural 3D Structures [supp] |
Scene Consistency Representation Learning for Video Scene Segmentation [supp] |
Two Coupled Rejection Metrics Can Tell Adversarial Examples Apart [supp] |
Exploiting Explainable Metrics for Augmented SGD [supp] |
Semi-Supervised Video Semantic Segmentation With Inter-Frame Feature Reconstruction |
GenDR: A Generalized Differentiable Renderer [supp] |
Improving Neural Implicit Surfaces Geometry With Patch Warping [supp] |
XYLayoutLM: Towards Layout-Aware Multimodal Networks for Visually-Rich Document Understanding [supp] |
Amodal Segmentation Through Out-of-Task and Out-of-Distribution Generalization With a Bayesian Model [supp] |
How Well Do Sparse ImageNet Models Transfer? [supp] |
REX: Reasoning-Aware and Grounded Explanation [supp] |
Dynamic Dual-Output Diffusion Models [supp] |
StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis [supp] |
JoinABLe: Learning Bottom-Up Assembly of Parametric CAD Joints [supp] |
CaDeX: Learning Canonical Deformation Coordinate Space for Dynamic Surface Representation via Neural Homeomorphism [supp] |
Canonical Voting: Towards Robust Oriented Bounding Box Detection in 3D Scenes [supp] |
V-Doc: Visual Questions Answers With Documents |
AEGNN: Asynchronous Event-Based Graph Neural Networks [supp] |
Layer-Wised Model Aggregation for Personalized Federated Learning [supp] |
Polarity Sampling: Quality and Diversity Control of Pre-Trained Generative Networks via Singular Values [supp] |
Style-Structure Disentangled Features and Normalizing Flows for Diverse Icon Colorization [supp] |
Object-Aware Video-Language Pre-Training for Retrieval |
OSKDet: Orientation-Sensitive Keypoint Localization for Rotated Object Detection |
MAT: Mask-Aware Transformer for Large Hole Image Inpainting [supp] |
Exploring Geometric Consistency for Monocular 3D Object Detection [supp] |
Neural Window Fully-Connected CRFs for Monocular Depth Estimation |
CodedVTR: Codebook-Based Sparse Voxel Transformer With Geometric Guidance [supp] |
Uncertainty-Aware Deep Multi-View Photometric Stereo [supp] |
Coherent Point Drift Revisited for Non-Rigid Shape Matching and Registration |
Unleashing Potential of Unsupervised Pre-Training With Intra-Identity Regularization for Person Re-Identification [supp] |
Align and Prompt: Video-and-Language Pre-Training With Entity Prompts |
A Unified Query-Based Paradigm for Point Cloud Understanding [supp] |
It's About Time: Analog Clock Reading in the Wild [supp] |
MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens [supp] |
Cross Modal Retrieval With Querybank Normalisation [supp] |
Contrastive Dual Gating: Learning Sparse Features With Contrastive Learning |
Universal Photometric Stereo Network Using Global Lighting Contexts [supp] |
Hire-MLP: Vision MLP via Hierarchical Rearrangement [supp] |
Ray3D: Ray-Based 3D Human Pose Estimation for Monocular Absolute 3D Localization [supp] |
Occluded Human Mesh Recovery [supp] |
Multi-Object Tracking Meets Moving UAV |
ASM-Loc: Action-Aware Segment Modeling for Weakly-Supervised Temporal Action Localization [supp] |
Uncertainty-Guided Probabilistic Transformer for Complex Action Recognition |
Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs [supp] |
End-to-End Multi-Person Pose Estimation With Transformers |
REGTR: End-to-End Point Cloud Correspondences With Transformers [supp] |
Neural 3D Scene Reconstruction With the Manhattan-World Assumption [supp] |
V2C: Visual Voice Cloning [supp] |
Revisiting AP Loss for Dense Object Detection: Adaptive Ranking Pair Selection [supp] |
3DeformRS: Certifying Spatial Deformations on Point Clouds [supp] |
ElePose: Unsupervised 3D Human Pose Estimation by Predicting Camera Elevation and Learning Normalizing Flows on 2D Poses [supp] |
MAD: A Scalable Dataset for Language Grounding in Videos From Movie Audio Descriptions [supp] |
EvUnroll: Neuromorphic Events Based Rolling Shutter Image Correction [supp] |
Gait Recognition in the Wild With Dense 3D Representations and a Benchmark [supp] |
ArtiBoost: Boosting Articulated 3D Hand-Object Pose Estimation via Online Exploration and Synthesis [supp] |
Temporal Context Matters: Enhancing Single Image Prediction With Disease Progression Representations [supp] |
QueryDet: Cascaded Sparse Query for Accelerating High-Resolution Small Object Detection |
IDEA-Net: Dynamic 3D Point Cloud Interpolation via Deep Embedding Alignment [supp] |
UniCon: Combating Label Noise Through Uniform Selection and Contrastive Learning [supp] |
Learning From All Vehicles [supp] |
BEHAVE: Dataset and Method for Tracking Human Object Interactions [supp] |
Disentangled3D: Learning a 3D Generative Model With Disentangled Geometry and Appearance From Monocular Images [supp] |
Revisiting Random Channel Pruning for Neural Network Compression [supp] |
One-Bit Active Query With Contrastive Pairs [supp] |
Estimating Egocentric 3D Human Pose in the Wild With External Weak Supervision [supp] |
Performance-Aware Mutual Knowledge Distillation for Improving Neural Architecture Search |
Does Text Attract Attention on E-Commerce Images: A Novel Saliency Prediction Dataset and Method |
Topologically-Aware Deformation Fields for Single-View 3D Reconstruction [supp] |
HyperInverter: Improving StyleGAN Inversion via Hypernetwork [supp] |
Sparse Non-Local CRF [supp] |
Dataset Distillation by Matching Training Trajectories |
Towards Driving-Oriented Metric for Lane Detection Models [supp] |
EPro-PnP: Generalized End-to-End Probabilistic Perspective-N-Points for Monocular Object Pose Estimation [supp] |
Rethinking Reconstruction Autoencoder-Based Out-of-Distribution Detection [supp] |
XYDeblur: Divide and Conquer for Single Image Deblurring [supp] |
Generating Diverse and Natural 3D Human Motions From Text [supp] |
E-CIR: Event-Enhanced Continuous Intensity Recovery |
Towards Robust Rain Removal Against Adversarial Attacks: A Comprehensive Benchmark Analysis and Beyond [supp] |
STCrowd: A Multimodal Dataset for Pedestrian Perception in Crowded Scenes [supp] |
Deep Decomposition for Stochastic Normal-Abnormal Transport [supp] |
Global Context With Discrete Diffusion in Vector Quantised Modelling for Image Generation [supp] |
Symmetry and Uncertainty-Aware Object SLAM for 6DoF Object Pose Estimation [supp] |
AziNorm: Exploiting the Radial Symmetry of Point Cloud for Azimuth-Normalized 3D Perception |
Towards Multimodal Depth Estimation From Light Fields [supp] |
Learning To Recognize Procedural Activities With Distant Supervision [supp] |
Multimodal Material Segmentation [supp] |
Multi-Frame Self-Supervised Depth With Transformers [supp] |
Weakly Supervised Rotation-Invariant Aerial Object Detection Network |
Modeling Motion With Multi-Modal Features for Text-Based Video Segmentation [supp] |
Surface Reconstruction From Point Clouds by Learning Predictive Context Priors [supp] |
Deformable Video Transformer |
Self-Supervised Keypoint Discovery in Behavioral Videos [supp] |
IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes [supp] |
DynamicEarthNet: Daily Multi-Spectral Satellite Dataset for Semantic Change Segmentation [supp] |
Connecting the Complementary-View Videos: Joint Camera Identification and Subject Association [supp] |
End-to-End Trajectory Distribution Prediction Based on Occupancy Grid Maps [supp] |
Fast, Accurate and Memory-Efficient Partial Permutation Synchronization [supp] |
Quantization-Aware Deep Optics for Diffractive Snapshot Hyperspectral Imaging [supp] |
Weakly Supervised Temporal Action Localization via Representative Snippet Knowledge Propagation [supp] |
Parametric Scattering Networks [supp] |
SketchEdit: Mask-Free Local Image Manipulation With Partial Sketches [supp] |
ScaleNet: A Shallow Architecture for Scale Estimation [supp] |
E2EC: An End-to-End Contour-Based Method for High-Quality High-Speed Instance Segmentation |
Bounded Adversarial Attack on Deep Content Features [supp] |
BatchFormer: Learning To Explore Sample Relationships for Robust Representation Learning [supp] |
Self-Supervised Image-Specific Prototype Exploration for Weakly Supervised Semantic Segmentation |
CAD: Co-Adapting Discriminative Features for Improved Few-Shot Classification [supp] |
Fingerprinting Deep Neural Networks Globally via Universal Adversarial Perturbations [supp] |
Learning Multi-View Aggregation in the Wild for Large-Scale 3D Semantic Segmentation [supp] |
ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-Wise Semantic Alignment and Generation [supp] |
Improving Video Model Transfer With Dynamic Representation Learning [supp] |
PIE-Net: Photometric Invariant Edge Guided Network for Intrinsic Image Decomposition [supp] |
Clothes-Changing Person Re-Identification With RGB Modality Only [supp] |
ChiTransformer: Towards Reliable Stereo From Cues [supp] |
Robust Image Forgery Detection Over Online Social Network Shared Images [supp] |
QS-Attn: Query-Selected Attention for Contrastive Learning in I2I Translation [supp] |
Physically Disentangled Intra- and Inter-Domain Adaptation for Varicolored Haze Removal [supp] |
Modality-Agnostic Learning for Radar-Lidar Fusion in Vehicle Detection |
A Re-Balancing Strategy for Class-Imbalanced Classification Based on Instance Difficulty [supp] |
Representation Compensation Networks for Continual Semantic Segmentation [supp] |
Adaptive Gating for Single-Photon 3D Imaging [supp] |
Tracking People by Predicting 3D Appearance, Location and Pose [supp] |
Text2Mesh: Text-Driven Neural Stylization for Meshes [supp] |
Learning To Solve Hard Minimal Problems [supp] |
H4D: Human 4D Modeling by Learning Neural Compositional Representation [supp] |
FWD: Real-Time Novel View Synthesis With Forward Warping and Depth [supp] |
Non-Generative Generalized Zero-Shot Learning via Task-Correlated Disentanglement and Controllable Samples Synthesis |
C-CAM: Causal CAM for Weakly Supervised Semantic Segmentation on Medical Image |
Leveraging Real Talking Faces via Self-Supervision for Robust Forgery Detection [supp] |
Forward Compatible Few-Shot Class-Incremental Learning [supp] |
BaLeNAS: Differentiable Architecture Search via the Bayesian Learning Rule [supp] |