今日arXiv精选 (Today's arXiv Picks) | 11 New ICCV 2021 Papers


About #今日arXiv精选 (Today's arXiv Picks)

This is a column run by 「AI 学术前沿」 (AI Academic Frontier): each day, the editors select high-quality papers from arXiv and deliver them to readers.

Explain Me the Painting: Multi-Topic Knowledgeable Art Description Generation

Comment: ICCV 2021

Link: http://arxiv.org/abs/2109.05743

Abstract

Have you ever looked at a painting and wondered what the story behind it is? This work presents a framework to bring art closer to people by generating comprehensive descriptions of fine-art paintings. Generating informative descriptions for artworks, however, is extremely challenging, as it requires 1) describing multiple aspects of the image, such as its style, content, or composition, and 2) providing background and contextual knowledge about the artist, their influences, or the historical period. To address these challenges, we introduce a multi-topic and knowledgeable art description framework, which organizes the generated sentences according to three artistic topics and, additionally, enhances each description with external knowledge. The framework is validated through an exhaustive analysis, both quantitative and qualitative, as well as a comparative human evaluation, demonstrating outstanding results in terms of both topic diversity and information veracity.
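As a purely illustrative sketch of how such a topic-modular pipeline can be organized: the snippet below composes one sentence per artistic topic and appends retrieved knowledge. The topic split, `generate_sentence`, and `retrieve_knowledge` are hypothetical stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of a multi-topic art-description pipeline (not the
# authors' code): one sentence per artistic topic, each optionally
# augmented with retrieved external knowledge.
TOPICS = ["content", "form", "context"]  # assumed topic split

def retrieve_knowledge(painting_meta: dict, topic: str) -> str:
    """Stand-in for an external-knowledge lookup (e.g. artist biography)."""
    return painting_meta.get("knowledge", {}).get(topic, "")

def describe(painting_meta: dict, generate_sentence) -> str:
    """Compose the final description topic by topic.

    `generate_sentence(image, topic)` is a placeholder for a trained,
    topic-conditioned captioning model.
    """
    parts = []
    for topic in TOPICS:
        sentence = generate_sentence(painting_meta["image"], topic)
        knowledge = retrieve_knowledge(painting_meta, topic)
        parts.append(f"{sentence} {knowledge}".strip())
    return " ".join(parts)

# Toy usage with a dummy generator:
if __name__ == "__main__":
    meta = {"image": None,
            "knowledge": {"context": "Painted during the artist's late period."}}
    dummy = lambda img, t: f"[{t} sentence]"
    print(describe(meta, dummy))
```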

Image Shape Manipulation from a Single Augmented Training Sample

Comment: ICCV 2021 (Oral). arXiv admin note: text overlap with  arXiv:2007.01289

Link: http://arxiv.org/abs/2109.06151

Abstract

In this paper, we present DeepSIM, a generative model for conditional image manipulation based on a single image. We find that extensive augmentation is key to enabling single-image training, and incorporate thin-plate-spline (TPS) warping as an effective augmentation. Our network learns to map from a primitive representation of the image to the image itself. The choice of primitive representation affects the ease and expressiveness of the manipulations, and can be automatic (e.g., edges), manual (e.g., segmentation), or hybrid, such as edges on top of segmentations. At manipulation time, our generator allows complex image changes to be made by modifying the primitive input representation and mapping it through the network. Our method is shown to achieve remarkable performance on image manipulation tasks.
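To make the augmentation idea concrete, here is a minimal PyTorch sketch of a smooth random warp: it jitters a coarse control grid and upsamples it into a dense sampling grid. This is a simplified stand-in in the spirit of TPS augmentation, not the paper's exact transform.

```python
import torch
import torch.nn.functional as F

def random_smooth_warp(img: torch.Tensor, grid_size: int = 4,
                       strength: float = 0.05) -> torch.Tensor:
    """Warp a (B, C, H, W) image with a smooth random deformation.

    A coarse control grid is jittered and bilinearly upsampled to a dense
    sampling grid -- a cheap stand-in for thin-plate-spline augmentation.
    """
    b, _, h, w = img.shape
    # Identity sampling grid in [-1, 1].
    theta = torch.eye(2, 3).unsqueeze(0).repeat(b, 1, 1)
    base = F.affine_grid(theta, img.shape, align_corners=False)  # (B, H, W, 2)
    # Random offsets on a coarse grid, upsampled to full resolution.
    offsets = (torch.rand(b, 2, grid_size, grid_size) - 0.5) * 2 * strength
    offsets = F.interpolate(offsets, size=(h, w), mode="bilinear",
                            align_corners=False).permute(0, 2, 3, 1)
    return F.grid_sample(img, base + offsets, align_corners=False,
                         padding_mode="border")

# img_aug = random_smooth_warp(torch.rand(1, 3, 128, 128))
```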

Weakly Supervised Person Search with Region Siamese Networks

Comment: Accepted by ICCV 2021

Link: http://arxiv.org/abs/2109.06109

Abstract

Supervised learning is dominant in person search, but it requires elaborate labeling of bounding boxes and identities. Large-scale labeled training data is often difficult to collect, especially for person identities. A natural question is whether a good person search model can be trained without identity supervision. In this paper, we present a weakly supervised setting where only bounding box annotations are available. Based on this new setting, we provide an effective baseline model termed Region Siamese Networks (R-SiamNets). To learn representations useful for recognition in the absence of identity labels, we supervise the R-SiamNet with an instance-level consistency loss and a cluster-level contrastive loss. For instance-level consistency learning, the R-SiamNet is constrained to extract consistent features from each person region with and without out-of-region context. For cluster-level contrastive learning, we enforce the aggregation of the closest instances and the separation of dissimilar ones in feature space. Extensive experiments validate the utility of our weakly supervised method. Our model achieves a rank-1 accuracy of 87.1% and an mAP of 86.0% on the CUHK-SYSU benchmark, surpassing several fully supervised methods, such as OIM and MGTS, by a clear margin. More promising performance can be reached by incorporating extra training data. We hope this work encourages future research in this field.
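A minimal PyTorch sketch of the two loss ideas, as we read them from the abstract (not the authors' code): cosine consistency between context/no-context features, and an InfoNCE-style cluster contrastive term.

```python
import torch
import torch.nn.functional as F

def consistency_loss(feat_with_ctx, feat_no_ctx):
    """Instance-level consistency: features of the same person region,
    extracted with and without out-of-region context, should agree."""
    return (1 - F.cosine_similarity(feat_with_ctx, feat_no_ctx, dim=1)).mean()

def cluster_contrastive_loss(feats, cluster_ids, temperature=0.1):
    """Cluster-level contrast: pull same-cluster instances together,
    push different clusters apart (an InfoNCE-style sketch)."""
    feats = F.normalize(feats, dim=1)
    sim = feats @ feats.t() / temperature                  # (N, N)
    eye = torch.eye(len(feats), dtype=torch.bool)
    log_prob = sim.masked_fill(eye, -1e9).log_softmax(dim=1)
    same = cluster_ids.unsqueeze(0) == cluster_ids.unsqueeze(1)
    pos = same & ~eye                                      # positive pairs
    # Average log-likelihood of positive pairs per anchor.
    loss = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
    return loss.mean()

# feats = torch.randn(8, 128); ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
# print(cluster_contrastive_loss(feats, ids))
```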

Learning Indoor Inverse Rendering with 3D Spatially-Varying Lighting

Comment: ICCV 2021 (Oral Presentation)

Link: http://arxiv.org/abs/2109.06061

Abstract

In this work, we address the problem of jointly estimating albedo, normals, depth, and 3D spatially-varying lighting from a single image. Most existing methods formulate the task as image-to-image translation, ignoring the 3D properties of the scene. However, indoor scenes contain complex 3D light transport for which a 2D representation is insufficient. In this paper, we propose a unified, learning-based inverse rendering framework that formulates 3D spatially-varying lighting. Inspired by classic volume rendering techniques, we propose a novel Volumetric Spherical Gaussian representation for lighting, which parameterizes the exitant radiance of the 3D scene surfaces on a voxel grid. We design a physics-based differentiable renderer that utilizes our 3D lighting representation and formulates the energy-conserving image formation process, enabling joint training of all intrinsic properties with a re-rendering constraint. Our model ensures physically correct predictions and avoids the need for ground-truth HDR lighting, which is not easily accessible. Experiments show that our method outperforms prior work both quantitatively and qualitatively, and can produce photorealistic results for AR applications such as virtual object insertion, even for highly specular objects.
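The lighting representation builds on spherical Gaussians, whose standard closed form G(v; μ, λ, a) = a·exp(λ(μ·v − 1)) is cheap to evaluate. The snippet below shows just that form; the paper's voxel-grid parameterization and differentiable renderer are not reproduced here.

```python
import torch

def spherical_gaussian(v, mu, lam, a):
    """Evaluate a spherical Gaussian lobe G(v) = a * exp(lam * (v.mu - 1)).

    v, mu: (..., 3) unit direction vectors; lam: sharpness; a: amplitude.
    A voxel grid of such lobes is one way to parameterize exitant radiance,
    in the spirit of the paper's Volumetric Spherical Gaussians.
    """
    return a * torch.exp(lam * ((v * mu).sum(-1) - 1.0))

# Query the lobe straight on vs. 90 degrees off-axis:
mu = torch.tensor([0.0, 0.0, 1.0])
print(spherical_gaussian(mu, mu, lam=10.0, a=1.0))                    # ~1.0
print(spherical_gaussian(torch.tensor([1.0, 0.0, 0.0]), mu, 10.0, 1.0))
```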

Mutual Supervision for Dense Object Detection

Comment: ICCV 2021 camera ready version

Link: http://arxiv.org/abs/2109.05986

Abstract

The classification and regression heads are both indispensable components of a dense object detector; they are usually supervised by the same training samples and are thus expected to be consistent with each other for accurately detecting objects in the detection pipeline. In this paper, we break the convention of using the same training samples for these two heads in dense detectors and explore a novel supervisory paradigm, termed Mutual Supervision (MuSu), which respectively and mutually assigns training samples to the classification and regression heads to ensure this consistency. MuSu defines training samples for the regression head mainly based on classification prediction scores and, in turn, defines samples for the classification head based on localization scores from the regression head. Experimental results show that the convergence of detectors trained by this mutual supervision is guaranteed, and the effectiveness of the proposed method is verified on the challenging MS COCO benchmark. We also find that tiling more anchors at the same location benefits detectors and leads to further improvements under this training scheme. We hope this work can inspire further research on the interaction of the classification and regression tasks in detection and on the supervision paradigm for detectors, especially when designed separately for these two heads.
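A toy sketch of the mutual assignment idea, assuming per-anchor scores for a single ground-truth object; the real MuSu assignment is more elaborate than this top-k swap.

```python
import torch

def mutual_assign(cls_scores, loc_scores, k=9):
    """Sketch of mutual supervision for a dense detector head pair.

    For one ground-truth object: the regression head trains on anchors the
    classification head scores highest, and vice versa. `cls_scores` and
    `loc_scores` are per-anchor scores for that object, shape (num_anchors,).
    """
    reg_train_idx = cls_scores.topk(k).indices  # samples for regression head
    cls_train_idx = loc_scores.topk(k).indices  # samples for classification head
    return cls_train_idx, reg_train_idx

cls_s, loc_s = torch.rand(100), torch.rand(100)
cls_idx, reg_idx = mutual_assign(cls_s, loc_s)
```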

Graph-Based 3D Multi-Person Pose Estimation Using Multi-View Images

Comment: Accepted by ICCV'2021

Link: http://arxiv.org/abs/2109.05885

Abstract

This paper studies the task of estimating the 3D human poses of multiple persons from multiple calibrated camera views. Following the top-down paradigm, we decompose the task into two stages, i.e., person localization and pose estimation. Both stages are processed in a coarse-to-fine manner, and we propose three task-specific graph neural networks for effective message passing. For 3D person localization, we first use the Multi-view Matching Graph Module (MMG) to learn cross-view associations and recover coarse human proposals. The Center Refinement Graph Module (CRG) further refines the results via flexible point-based prediction. For 3D pose estimation, the Pose Regression Graph Module (PRG) learns both the multi-view geometry and the structural relations between human joints. Our approach achieves state-of-the-art performance on the CMU Panoptic and Shelf datasets with significantly lower computational complexity.
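For readers unfamiliar with message passing, the sketch below shows a generic dense message-passing layer of the kind such task-specific graph modules build on; it is not the MMG/CRG/PRG architecture itself.

```python
import torch
import torch.nn as nn

class MessagePassing(nn.Module):
    """Minimal dense message-passing layer (a generic sketch, not the
    authors' architecture)."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)
        self.upd = nn.GRUCell(dim, dim)

    def forward(self, x, adj):
        # x: (N, D) node features (e.g. per-view joint candidates);
        # adj: (N, N) 0/1 adjacency (e.g. cross-view or skeleton edges).
        n = x.size(0)
        pair = torch.cat([x.unsqueeze(1).expand(n, n, -1),
                          x.unsqueeze(0).expand(n, n, -1)], dim=-1)
        messages = self.msg(pair) * adj.unsqueeze(-1)   # mask non-edges
        agg = messages.sum(dim=1)                       # aggregate neighbors
        return self.upd(agg, x)                         # update node states

layer = MessagePassing(16)
out = layer(torch.rand(5, 16), (torch.rand(5, 5) > 0.5).float())
```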

Meta Navigator: Search for a Good Adaptation Policy for Few-shot Learning

Comment: Accepted by ICCV2021

Link: http://arxiv.org/abs/2109.05749

Abstract

Few-shot learning aims to adapt knowledge learned from previous tasks to novel tasks with only a limited amount of labeled data. The research literature on few-shot learning exhibits great diversity, and different algorithms often excel in different few-shot learning scenarios. It is therefore tricky to decide which learning strategy to use under different task conditions. Inspired by the recent success of the Automated Machine Learning (AutoML) literature, in this paper we present Meta Navigator, a framework that attempts to solve this limitation in few-shot learning by seeking a higher-level strategy and automating the selection among various few-shot learning designs. The goal of our work is to search for good parameter adaptation policies that are applied at different stages of the network for few-shot classification. We present a search space that covers many popular few-shot learning algorithms in the literature and develop a differentiable searching and decoding algorithm based on meta-learning that supports gradient-based optimization. We demonstrate the effectiveness of our search-based method on multiple benchmark datasets. Extensive experiments show that our approach significantly outperforms baselines and demonstrates performance advantages over many state-of-the-art methods. Code and models will be made publicly available.
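The differentiable search over policies can be pictured as a DARTS-style softmax mixture. The sketch below is a generic illustration under that assumption, not the paper's actual search space or decoding algorithm.

```python
import torch
import torch.nn as nn

class PolicySearch(nn.Module):
    """Differentiable selection among candidate adaptation policies,
    DARTS-style (a sketch of the idea only)."""
    def __init__(self, policies):
        super().__init__()
        self.policies = policies                      # list of callables
        self.alpha = nn.Parameter(torch.zeros(len(policies)))

    def forward(self, feats):
        w = self.alpha.softmax(dim=0)
        # Soft mixture during search; argmax over alpha to decode the
        # final discrete policy after training.
        return sum(wi * p(feats) for wi, p in zip(w, self.policies))

policies = [lambda f: f, lambda f: f * 0.5, lambda f: f.relu()]
searcher = PolicySearch(policies)
mixed = searcher(torch.randn(4, 8))   # alpha receives gradients via `mixed`
```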

ADNet: Leveraging Error-Bias Towards Normal Direction in Face Alignment

Comment: ICCV 2021

Link: http://arxiv.org/abs/2109.05721

Abstract

Recent progress in CNNs has dramatically improved face alignment performance. However, few works have paid attention to the error-bias in the error distribution of facial landmarks. In this paper, we investigate the error-bias issue in face alignment, where the distributions of landmark errors tend to spread along the tangent line of the landmark curves. This error-bias is not trivial, since it is closely connected to the ambiguity of the landmark labeling task. Inspired by this observation, we seek a way to leverage the error-bias property for better convergence of the CNN model. To this end, we propose an anisotropic direction loss (ADL) and an anisotropic attention module (AAM) for coordinate and heatmap regression, respectively. ADL imposes a strong binding force in the normal direction for each landmark point on facial boundaries. AAM, on the other hand, is an attention module that produces an anisotropic attention mask focusing on the region of a point and its local edge connected by adjacent points; it responds more strongly along the tangent than the normal, meaning constraints in the tangent direction are relaxed. These two methods work in a complementary manner to learn both facial structures and texture details. Finally, we integrate them into an optimized end-to-end training pipeline named ADNet. Our ADNet achieves state-of-the-art results on the 300W, WFLW, and COFW datasets, which demonstrates its effectiveness and robustness.
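The core of ADL can be pictured as splitting the landmark error into tangent and normal components and penalizing the normal component more heavily. The sketch below illustrates that decomposition; the weighting and curve handling here are our assumptions, not the paper's exact loss.

```python
import torch

def anisotropic_direction_loss(pred, target, tangent, w_normal=2.0):
    """Decompose landmark error into tangent/normal components and weight
    the normal direction more (a sketch of ADL's core idea).

    pred, target: (N, 2) landmark coordinates;
    tangent: (N, 2) unit tangents of the facial boundary at each landmark.
    """
    err = pred - target
    t_comp = (err * tangent).sum(-1)                      # along the curve
    normal = torch.stack([-tangent[:, 1], tangent[:, 0]], dim=-1)
    n_comp = (err * normal).sum(-1)                       # across the curve
    return (t_comp ** 2 + w_normal * n_comp ** 2).mean()

pts = torch.rand(68, 2, requires_grad=True)
tgt, tan = torch.rand(68, 2), torch.tensor([[1.0, 0.0]]).expand(68, 2)
anisotropic_direction_loss(pts, tgt, tan).backward()
```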

Low-Shot Validation: Active Importance Sampling for Estimating Classifier Performance on Rare Categories

Comment: Accepted to ICCV 2021; 12 pages, 12 figures

Link: http://arxiv.org/abs/2109.05720

Abstract

For machine learning models trained with limited labeled training data, validation stands to become the main bottleneck to reducing overall annotation costs. We propose a statistical validation algorithm that accurately estimates the F-score of binary classifiers for rare categories, where finding relevant examples to evaluate on is particularly challenging. Our key insight is that simultaneous calibration and importance sampling enable accurate estimates even in the low-sample regime (< 300 samples). Critically, we also derive an accurate single-trial estimator of the variance of our method and demonstrate that this estimator is empirically accurate at low sample counts, enabling a practitioner to know how well they can trust a given low-sample estimate. When validating state-of-the-art semi-supervised models on ImageNet and iNaturalist2017, our method achieves the same estimates of model performance with up to 10x fewer labels than competing approaches. In particular, we can estimate model F1 scores with a variance of 0.005 using as few as 100 labels.
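A bare-bones illustration of importance-sampled F1 estimation, without the paper's calibration step: label a small sample drawn from a proposal distribution and reweight the TP/FP/FN counts. Because F1 is a ratio, the common 1/m normalization cancels. `labels_fn` and the proposal here are hypothetical stand-ins.

```python
import numpy as np

def is_f1_estimate(scores, labels_fn, proposal, n_labels=100, thresh=0.5,
                   rng=np.random.default_rng(0)):
    """Importance-sampled F1 estimate: label only `n_labels` points drawn
    from `proposal` (a distribution over the pool) and reweight.

    scores: classifier scores over the unlabeled pool;
    labels_fn(i): oracle returning the true 0/1 label of item i."""
    n = len(scores)
    idx = rng.choice(n, size=n_labels, p=proposal)
    w = 1.0 / (n * proposal[idx])             # importance weights
    y = np.array([labels_fn(i) for i in idx], dtype=float)
    pred = (scores[idx] >= thresh).astype(float)
    tp = np.sum(w * y * pred)                 # the shared 1/m factor
    fp = np.sum(w * (1 - y) * pred)           # cancels in the F1 ratio
    fn = np.sum(w * y * (1 - pred))
    return 2 * tp / (2 * tp + fp + fn + 1e-12)

scores = np.random.rand(10_000)
prop = scores / scores.sum()                  # oversample high-score items
print(is_f1_estimate(scores, lambda i: int(scores[i] > 0.7), prop))
```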

Spatial and Semantic Consistency Regularizations for Pedestrian Attribute Recognition

Comment: Accepted to ICCV 2021

Link: http://arxiv.org/abs/2109.05686

Abstract

While recent studies on pedestrian attribute recognition have shown remarkable progress in leveraging complicated networks and attention mechanisms, most of them neglect inter-image relations and an important prior: the spatial and semantic consistency of attributes under surveillance scenarios. The spatial locations of the same attribute should be consistent between different pedestrian images, e.g., the "hat" attribute and the "boots" attribute are always located at the top and bottom of the picture, respectively. In addition, the inherent semantic features of the "hat" attribute should be consistent, whether it is a baseball cap, beret, or helmet. To fully exploit inter-image relations and aggregate this human prior in the model learning process, we construct a Spatial and Semantic Consistency (SSC) framework that consists of two complementary regularizations to achieve spatial and semantic consistency for each attribute. Specifically, we first propose a spatial consistency regularization to focus on reliable and stable attribute-related regions. Based on the precise attribute locations, we further propose a semantic consistency regularization to extract intrinsic and discriminative semantic features. We conduct extensive experiments on popular benchmarks including PA100K, RAP, and PETA. Results show that the proposed method performs favorably against state-of-the-art methods without increasing parameters.
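One plausible reading of the two regularizers, sketched in PyTorch: a spatial term that keeps an attribute's attention centroid consistent across a batch, and a semantic term that pulls per-image attribute features toward a shared prototype. Both are illustrative guesses at the mechanism, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def spatial_consistency(attn):
    """attn: (B, H, W) attention maps of one attribute across a batch.
    Penalize deviation of each map's spatial centroid from the batch mean,
    encouraging the attribute to appear at a consistent location."""
    b, h, w = attn.shape
    probs = attn.flatten(1).softmax(-1).view(b, h, w)
    ys = torch.linspace(0, 1, h).view(1, h, 1)
    xs = torch.linspace(0, 1, w).view(1, 1, w)
    centroid = torch.stack([(probs * ys).sum((1, 2)),
                            (probs * xs).sum((1, 2))], dim=-1)   # (B, 2)
    return ((centroid - centroid.mean(0, keepdim=True)) ** 2).sum(-1).mean()

def semantic_consistency(feats, prototype):
    """Pull per-image features of one attribute toward a shared prototype."""
    return (1 - F.cosine_similarity(feats, prototype.unsqueeze(0), dim=1)).mean()

# loss = spatial_consistency(torch.rand(8, 16, 8)) + \
#        semantic_consistency(torch.randn(8, 64), torch.randn(64))
```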

Shape-Biased Domain Generalization via Shock Graph Embeddings

Comment: Accepted to ICCV 2021

Link: http://arxiv.org/abs/2109.05671

Abstract

There is an emerging sense that the vulnerability of image convolutional neural networks (CNNs), i.e., their sensitivity to image corruptions, perturbations, and adversarial attacks, is connected with texture bias. This relative lack of shape bias is also responsible for poor performance in domain generalization (DG). Including a role for shape alleviates these vulnerabilities, and some approaches have achieved this by training on negative images, images endowed with edge maps, or images with conflicting shape and texture information. This paper advocates an explicit and complete representation of shape using a classical computer vision approach, namely, representing the shape content of an image with the shock graph of its contour map. The resulting graph and its descriptor form a complete representation of contour content and are classified using recent graph neural network (GNN) methods. Experimental results on three domain-shift datasets, Colored MNIST, PACS, and VLCS, demonstrate that even without using appearance, the shape-based approach exceeds classical image-CNN-based methods in domain generalization.
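To show the classification end of such a pipeline, here is a tiny degree-normalized, GCN-style graph classifier over a dense adjacency matrix. The shock-graph construction itself (the hard part) is omitted, and this is not the specific GNN used in the paper.

```python
import torch
import torch.nn as nn

class GraphClassifier(nn.Module):
    """Tiny GCN-style classifier for a graph given dense adjacency
    (a generic GNN sketch, not the paper's architecture)."""
    def __init__(self, in_dim, hid, n_classes):
        super().__init__()
        self.lin1, self.lin2 = nn.Linear(in_dim, hid), nn.Linear(hid, hid)
        self.head = nn.Linear(hid, n_classes)

    def forward(self, x, adj):
        # x: (N, in_dim) shock-graph node descriptors; adj: (N, N) with
        # self-loops; normalize by degree, propagate twice, mean-pool.
        deg = adj.sum(1, keepdim=True).clamp(min=1)
        a = adj / deg
        x = torch.relu(self.lin1(a @ x))
        x = torch.relu(self.lin2(a @ x))
        return self.head(x.mean(0))          # graph-level logits

model = GraphClassifier(8, 16, 7)
adj = torch.eye(5) + (torch.rand(5, 5) > 0.7).float()
logits = model(torch.rand(5, 8), adj)
```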
