Today's arXiv Picks | TNNLS/ICCV/TIP/ACM MM/CIKM/WWW/ICME

About "Today's arXiv Picks"

This is a column from「AI 学术前沿」(AI Academic Frontier): every day, the editors select high-quality papers from arXiv and deliver them to readers.

Medical-VLBERT: Medical Visual Language BERT for COVID-19 CT Report Generation With Alternate Learning

Comment: Accepted by IEEE Transactions on Neural Networks and Learning Systems

Link: http://arxiv.org/abs/2108.05067

Abstract

Medical imaging technologies, including computed tomography (CT) and chest X-ray (CXR), are widely employed to facilitate the diagnosis of COVID-19. Since manual report writing is usually too time-consuming, a more intelligent auxiliary medical system that can generate medical reports automatically and immediately is urgently needed. In this article, we propose the medical visual language BERT (Medical-VLBERT) model to identify abnormalities on COVID-19 scans and generate medical reports automatically based on the detected lesion regions. To produce more accurate medical reports and minimize the visual-and-linguistic differences, the model adopts an alternate learning strategy with two procedures: knowledge pretraining and transferring. More precisely, the knowledge pretraining procedure memorizes knowledge from medical texts, while the transferring procedure utilizes the acquired knowledge to generate professional medical sentences from observations of medical images. In practice, for automatic medical report generation on COVID-19 cases, we constructed a dataset of 368 medical findings in Chinese and 1104 chest CT scans from The First Affiliated Hospital of Jinan University, Guangzhou, China, and The Fifth Affiliated Hospital of Sun Yat-sen University, Zhuhai, China. In addition, to alleviate the insufficiency of COVID-19 training samples, the model was first trained on the large-scale Chinese CX-CHR dataset and then transferred to the COVID-19 CT dataset for further fine-tuning. Experimental results show that Medical-VLBERT achieves state-of-the-art performance on terminology prediction and report generation on both the Chinese COVID-19 CT dataset and the CX-CHR dataset. The Chinese COVID-19 CT dataset is available at https://covid19ct.github.io/.

Person Re-identification via Attention Pyramid

Comment: Accepted by IEEE Transactions on Image Processing.

Code: https://github.com/CHENGY12/APNet

Link: http://arxiv.org/abs/2108.05340

Abstract

In this paper, we propose an attention pyramid method for person re-identification. Unlike conventional attention-based methods, which only learn a global attention map, our attention pyramid exploits attention regions in a multi-scale manner because human attention varies with scale. Our attention pyramid imitates human visual perception, which tends to notice the foreground person over the cluttered background and, on closer observation, further focuses on specific details such as the color of the shirt. Specifically, we describe our attention pyramid by a "split-attend-merge-stack" principle: we first split the features into multiple local parts and learn the corresponding attentions; we then merge the local attentions and stack these merged attentions with residual connections into an attention pyramid. The proposed attention pyramid is a lightweight plug-and-play module that can be applied to off-the-shelf models. We implement our attention pyramid with two different attention mechanisms: channel-wise attention and spatial attention. We evaluate our method on four large-scale person re-identification benchmarks: Market-1501, DukeMTMC, CUHK03, and MSMT17. Experimental results demonstrate the superiority of our method, which outperforms state-of-the-art methods by a large margin with limited computational cost.
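
To make the "split-attend-merge-stack" principle more concrete, here is a minimal PyTorch sketch of a channel-attention pyramid. It is an illustration based only on the abstract: the module names, the squeeze-and-excitation-style attention, the stripe-wise split along the height axis, and the (1, 2, 4) pyramid configuration are assumptions, not the official APNet implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (illustrative choice)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.fc(x)  # (B, C, 1, 1) attention weights

class PyramidLevel(nn.Module):
    """One 'split-attend-merge' level: split the feature map into `parts`
    horizontal stripes, attend to each stripe, then merge the attended stripes."""
    def __init__(self, channels, parts):
        super().__init__()
        self.parts = parts
        self.attns = nn.ModuleList(ChannelAttention(channels) for _ in range(parts))

    def forward(self, x):
        stripes = torch.chunk(x, self.parts, dim=2)           # split along height
        attended = [s * att(s) for s, att in zip(stripes, self.attns)]
        return torch.cat(attended, dim=2)                     # merge

class AttentionPyramid(nn.Module):
    """Stack levels from coarse (1 part) to fine (4 parts) with residual connections."""
    def __init__(self, channels, pyramid=(1, 2, 4)):
        super().__init__()
        self.levels = nn.ModuleList(PyramidLevel(channels, p) for p in pyramid)

    def forward(self, x):
        for level in self.levels:
            x = x + level(x)                                  # residual stacking
        return x

feat = torch.randn(2, 256, 24, 8)            # backbone feature map (B, C, H, W)
print(AttentionPyramid(256)(feat).shape)     # torch.Size([2, 256, 24, 8])
```

Such a module would be dropped in after backbone stages of an off-the-shelf re-identification model, matching the "lightweight plug-and-play" claim.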

Towards Interpretable Deep Networks for Monocular Depth Estimation

Comment: Accepted by ICCV 2021

Link: http://arxiv.org/abs/2108.05312

Abstract

Deep networks for monocular depth estimation (MDE) have recently achieved promising performance, and it is of great importance to further understand the interpretability of these networks. Existing methods attempt to provide post-hoc explanations by investigating visual cues, which may not explore the internal representations learned by deep networks. In this paper, we find that some hidden units of the network are selective to certain ranges of depth, and such behavior can serve as a way to interpret the internal representations. Based on this observation, we quantify the interpretability of a deep MDE network by the depth selectivity of its hidden units. Moreover, we propose a method to train interpretable MDE networks without changing their original architectures, by assigning a depth range for each unit to select. Experimental results demonstrate that our method enhances the interpretability of deep MDE networks by largely improving the depth selectivity of their units, without harming, and in some cases even improving, depth estimation accuracy. We further provide a comprehensive analysis to show the reliability of the selective units, the applicability of our method to different layers, models, and datasets, and a demonstration of model error analysis. Source code and models are available at https://github.com/youzunzhi/InterpretableMDE.
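
The notion of depth selectivity can be illustrated with a simple index that compares a unit's mean activation inside an assigned depth range with its mean activation outside it. The sketch below is one possible reading of the idea, not the paper's exact definition.

```python
import torch

def depth_selectivity(activation, depth, lo, hi, eps=1e-8):
    """Illustrative selectivity index for one hidden unit: how much more the unit
    activates for pixels whose ground-truth depth lies in [lo, hi) than elsewhere.
    Returns (mu_in - mu_out) / (mu_in + mu_out) in [-1, 1]; the paper's exact
    definition may differ.

    activation: (H, W) unit response, upsampled to the depth-map resolution
    depth:      (H, W) ground-truth depth
    """
    inside = (depth >= lo) & (depth < hi)
    mu_in = activation[inside].mean() if inside.any() else torch.tensor(0.0)
    mu_out = activation[~inside].mean() if (~inside).any() else torch.tensor(0.0)
    return (mu_in - mu_out) / (mu_in + mu_out + eps)

act = torch.rand(96, 320)                  # one unit's (upsampled) activation map
dep = torch.rand(96, 320) * 10.0           # depth in metres
print(depth_selectivity(act, dep, lo=2.0, hi=4.0).item())
```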

Video Transformer for Deepfake Detection with Incremental Learning

Comment: Accepted at ACM International Conference on Multimedia, October 20 to 24, 2021, Virtual Event, China

Link: http://arxiv.org/abs/2108.05307

Abstract

Face forgery by deepfakes is widespread on the internet and raises severe societal concerns. In this paper, we propose a novel video transformer with incremental learning for detecting deepfake videos. To better align the input face images, we use a 3D face reconstruction method to generate a UV texture map from a single input face image. The aligned face image also provides pose, eye-blink and mouth-movement information that cannot be perceived in the UV texture image, so we use both face images and their UV texture maps to extract image features. We present an incremental learning strategy to fine-tune the proposed model on a smaller amount of data and achieve better deepfake detection performance. Comprehensive experiments on various public deepfake datasets demonstrate that the proposed video transformer model with incremental learning achieves state-of-the-art performance on the deepfake video detection task, with enhanced feature learning from the sequenced data.

ConvNets vs. Transformers: Whose Visual Representations are More Transferable?

Comment: Accepted by ICCV 2021 Workshop on Multi-Task Learning in Computer Vision (DeepMTL)

Link: http://arxiv.org/abs/2108.05305

Abstract

Vision transformers have attracted much attention from computer vision researchers as they are not restricted to the spatial inductive bias of ConvNets. However, although Transformer-based backbones have achieved much progress on ImageNet classification, it is still unclear whether the learned representations are as transferable as, or even more transferable than, ConvNet features. To address this point, we systematically investigate the transfer learning ability of ConvNets and vision transformers in 15 single-task and multi-task performance evaluations. Given the strong correlation between the performance of pre-trained models and transfer learning, we include 2 residual ConvNets (i.e., R-101x3 and R-152x4) and 3 Transformer-based visual backbones (i.e., ViT-B, ViT-L and Swin-B) that have close error rates on ImageNet, which would suggest similar transfer learning performance on downstream datasets. We observe consistent advantages of Transformer-based backbones on 13 of the 15 downstream tasks, including but not limited to fine-grained classification, scene recognition (classification, segmentation and depth estimation), open-domain classification, and face recognition. More specifically, we find that the two ViT models rely heavily on whole-network fine-tuning to achieve performance gains, while Swin Transformer has no such requirement. Moreover, vision transformers behave more robustly in multi-task learning, i.e., they bring larger improvements when managing mutually beneficial tasks and smaller performance losses when tackling irrelevant tasks. We hope our findings can facilitate the exploration and exploitation of vision transformers in the future.

Mutual Affine Network for Spatially Variant Kernel Estimation in Blind Image Super-Resolution

Comment: Accepted by ICCV 2021.

Code: https://github.com/JingyunLiang/MANet

Link: http://arxiv.org/abs/2108.05302

Abstract

Existing blind image super-resolution (SR) methods mostly assume that blur kernels are spatially invariant across the whole image. However, such an assumption rarely holds for real images, whose blur kernels are usually spatially variant due to factors such as object motion and out-of-focus blur. Hence, existing blind SR methods inevitably give rise to poor performance in real applications. To address this issue, this paper proposes a mutual affine network (MANet) for spatially variant kernel estimation. Specifically, MANet has two distinctive features. First, it has a moderate receptive field so as to keep the locality of degradation. Second, it involves a new mutual affine convolution (MAConv) layer that enhances feature expressiveness without increasing the receptive field, model size or computational burden. This is made possible by exploiting channel interdependence: each channel split is transformed by an affine transformation module whose input is the remaining channel splits. Extensive experiments on synthetic and real images show that the proposed MANet not only performs favorably for both spatially variant and invariant kernel estimation, but also leads to state-of-the-art blind SR performance when combined with non-blind SR methods.
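
The channel-interdependence idea behind MAConv, where each channel split is modulated by an affine transform predicted from the other split, can be sketched as follows. The layer below is a rough PyTorch approximation based on the abstract; the two-way split, the 1x1 modulation heads, and the sigmoid-bounded scale are my assumptions, not the official MANet code.

```python
import torch
import torch.nn as nn

class MAConvSketch(nn.Module):
    """Rough sketch of a mutual affine convolution: split the channels in two,
    modulate each split with a scale/shift predicted from the *other* split,
    convolve each modulated split, and concatenate. Not the official MANet layer."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        half_in, half_out = in_ch // 2, out_ch // 2
        def affine_head():
            # maps one split to (scale, shift) for the other split
            return nn.Sequential(nn.Conv2d(half_in, half_in, 1), nn.ReLU(True),
                                 nn.Conv2d(half_in, 2 * half_in, 1))
        self.affine_from_x1 = affine_head()
        self.affine_from_x2 = affine_head()
        self.conv1 = nn.Conv2d(half_in, half_out, 3, padding=1)
        self.conv2 = nn.Conv2d(half_in, out_ch - half_out, 3, padding=1)

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)
        s2, t2 = torch.chunk(self.affine_from_x1(x1), 2, dim=1)  # modulation for x2
        s1, t1 = torch.chunk(self.affine_from_x2(x2), 2, dim=1)  # modulation for x1
        y1 = self.conv1(x1 * torch.sigmoid(s1) + t1)
        y2 = self.conv2(x2 * torch.sigmoid(s2) + t2)
        return torch.cat([y1, y2], dim=1)

print(MAConvSketch(64, 64)(torch.randn(1, 64, 32, 32)).shape)  # [1, 64, 32, 32]
```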

Hierarchical Conditional Flow: A Unified Framework for Image Super-Resolution and Image Rescaling

Comment: Accepted by ICCV 2021.

Code: https://github.com/JingyunLiang/HCFlow

Link: http://arxiv.org/abs/2108.05301

Abstract

Normalizing flows have recently demonstrated promising results on low-level vision tasks. For image super-resolution (SR), they learn to predict diverse photo-realistic high-resolution (HR) images from a low-resolution (LR) image rather than learning a deterministic mapping. For image rescaling, they achieve high accuracy by jointly modelling the downscaling and upscaling processes. While existing approaches employ specialized techniques for these two tasks, we set out to unify them in a single formulation. In this paper, we propose the hierarchical conditional flow (HCFlow) as a unified framework for image SR and image rescaling. More specifically, HCFlow learns a bijective mapping between HR and LR image pairs by simultaneously modelling the distribution of the LR image and the remaining high-frequency component. In particular, the high-frequency component is conditioned on the LR image in a hierarchical manner. To further enhance performance, other losses such as perceptual loss and GAN loss are combined with the commonly used negative log-likelihood loss during training. Extensive experiments on general image SR, face image SR and image rescaling demonstrate that the proposed HCFlow achieves state-of-the-art performance in terms of both quantitative metrics and visual quality.
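
The core ingredient of such a flow is an invertible, LR-conditioned coupling step whose log-determinant feeds the negative log-likelihood loss. The block below is a generic conditional affine coupling layer written for illustration; HCFlow's actual hierarchical blocks and squeeze/split structure are more involved.

```python
import torch
import torch.nn as nn

class ConditionalAffineCoupling(nn.Module):
    """Generic conditional affine coupling: half of the channels are scaled and
    shifted by a small network that sees the other half plus the LR conditioning
    image. A sketch of the flow building block, not the paper's exact module."""
    def __init__(self, channels, cond_channels, hidden=64):
        super().__init__()
        half = channels // 2
        self.net = nn.Sequential(
            nn.Conv2d(half + cond_channels, hidden, 3, padding=1), nn.ReLU(True),
            nn.Conv2d(hidden, 2 * half, 3, padding=1),
        )

    def forward(self, x, cond):
        x1, x2 = torch.chunk(x, 2, dim=1)
        log_s, t = torch.chunk(self.net(torch.cat([x1, cond], dim=1)), 2, dim=1)
        log_s = torch.tanh(log_s)                 # keep scales bounded for stability
        y2 = x2 * torch.exp(log_s) + t
        logdet = log_s.flatten(1).sum(dim=1)      # contribution to the NLL objective
        return torch.cat([x1, y2], dim=1), logdet

    def inverse(self, y, cond):
        y1, y2 = torch.chunk(y, 2, dim=1)
        log_s, t = torch.chunk(self.net(torch.cat([y1, cond], dim=1)), 2, dim=1)
        x2 = (y2 - t) * torch.exp(-torch.tanh(log_s))
        return torch.cat([y1, x2], dim=1)

layer = ConditionalAffineCoupling(channels=8, cond_channels=3)
hf = torch.randn(1, 8, 32, 32)                    # high-frequency component
lr = torch.randn(1, 3, 32, 32)                    # upsampled LR conditioning
z, logdet = layer(hf, lr)
print(torch.allclose(layer.inverse(z, lr), hf, atol=1e-5))   # True: bijective
```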

Fog Simulation on Real LiDAR Point Clouds for 3D Object Detection in Adverse Weather

Comment: Camera-Ready Version for ICCV 2021

Link: http://arxiv.org/abs/2108.05249

Abstract

This work addresses the challenging task of LiDAR-based 3D object detection in foggy weather. Collecting and annotating data in such a scenario is very time-, labor- and cost-intensive. In this paper, we tackle this problem by simulating physically accurate fog in clear-weather scenes, so that the abundant existing real datasets captured in clear weather can be repurposed for our task. Our contributions are twofold: 1) We develop a physically valid fog simulation method that is applicable to any LiDAR dataset. This unleashes the acquisition of large-scale foggy training data at no extra cost. These partially synthetic data can be used to improve the robustness of several perception methods, such as 3D object detection and tracking or simultaneous localization and mapping, on real foggy data. 2) Through extensive experiments with several state-of-the-art detection approaches, we show that our fog simulation can be leveraged to significantly improve the performance of 3D object detection in the presence of fog. Thus, we are the first to provide strong 3D object detection baselines on the Seeing Through Fog dataset. Our code is available at www.trace.ethz.ch/lidar_fog_simulation.
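
As a rough intuition for how clear-weather scans can be "fogged", the toy function below applies Beer-Lambert-style two-way attenuation to LiDAR intensities and replaces lost returns with near-range scatter. The extinction coefficient, noise floor, and scatter model are invented for illustration and are much simpler than the physically valid simulation described in the paper.

```python
import numpy as np

def toy_fog_attenuation(points, intensity, alpha=0.06, noise_floor=0.05, seed=0):
    """Toy fogging of a LiDAR scan (NOT the paper's physically valid model):
    intensities decay exponentially with two-way range attenuation, and returns
    that fall below a detection floor become spurious near-range scatter points.

    points: (N, 3) xyz in metres, intensity: (N,) in [0, 1]
    """
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(points, axis=1)
    attenuated = intensity * np.exp(-2.0 * alpha * dist)        # Beer-Lambert style
    lost = attenuated < noise_floor
    direction = points[lost] / np.maximum(dist[lost, None], 1e-6)
    scatter_range = rng.uniform(2.0, 15.0, size=lost.sum())     # fake fog returns
    foggy_points = points.copy()
    foggy_points[lost] = direction * scatter_range[:, None]
    attenuated[lost] = rng.uniform(0.0, noise_floor, size=lost.sum())
    return foggy_points, attenuated

pts = np.random.randn(1000, 3) * 30.0
inten = np.random.rand(1000)
foggy_pts, foggy_inten = toy_fog_attenuation(pts, inten)
print(foggy_pts.shape, float(foggy_inten.mean()))
```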

ProAI: An Efficient Embedded AI Hardware for Automotive Applications - a Benchmark Study

Comment: Accepted by IEEE International Conference on Computer Vision (ICCV) 2021

Link: http://arxiv.org/abs/2108.05170

Abstract

Development in the field of Single Board Computers (SBCs) has been accelerating for several years. They provide a good balance between computing performance and power consumption, which is usually required for mobile platforms such as in-vehicle applications for Advanced Driver Assistance Systems (ADAS) and Autonomous Driving (AD). However, there is an ever-increasing need for more powerful and efficient SBCs that can run power-intensive Deep Neural Networks (DNNs) in real time and also satisfy the necessary functional safety requirements, such as an Automotive Safety Integrity Level (ASIL). ProAI is being developed by ZF mainly to run powerful and efficient applications such as multitask DNNs, and on top of that it also has the required safety certification for AD. In this work, we compare and discuss state-of-the-art SBCs on the basis of a power-intensive multitask DNN architecture called Multitask-CenterNet, with respect to performance measures such as FPS and power efficiency. As an automotive supercomputer, ProAI delivers an excellent combination of performance and efficiency, managing nearly twice as many FPS per watt as a modern workstation laptop and almost four times as many as the Jetson Nano. Furthermore, based on the CPU and GPU utilization during the benchmark, we show that the ProAI still has headroom for further and more complex tasks.

Efficient Surfel Fusion Using Normalised Information Distance

Comment: 4 pages, 4 figures, presented at CVPR 2019 Workshop on 3D Scene Understanding for Vision, Graphics, and Robotics

Link: http://arxiv.org/abs/2108.05163

Abstract

We present a new technique that achieves a significant reduction in the number of measurements required for a fusion-based dense 3D mapping system to converge to an accurate, de-noised surface reconstruction. This is achieved through the use of a Normalised Information Distance metric that computes the novelty of the information contained in each incoming frame with respect to the reconstruction, and avoids fusing those frames that exceed a redundancy threshold. This provides a principled approach for optimising the trade-off between surface reconstruction accuracy and the computational cost of processing frames. The technique builds upon the ElasticFusion (EF) algorithm; we report results on the technique's scalability and the accuracy of the resultant maps by applying it to both the ICL-NUIM and TUM RGB-D datasets. These results demonstrate the ability of the approach to perform accurate surface reconstructions whilst utilising only a fraction of the frames used by the original EF algorithm.
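
The gating idea (fuse a frame only if it carries enough new information relative to the current reconstruction) can be illustrated with a histogram-based Normalised Information Distance. The threshold value and the use of raw depth images below are illustrative choices, not the authors' implementation.

```python
import numpy as np

def normalised_information_distance(a, b, bins=64):
    """NID between two images estimated from their joint histogram:
    NID = (H(A,B) - I(A;B)) / H(A,B), in [0, 1]; 0 means no new information."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p_ab = joint / joint.sum()
    p_a, p_b = p_ab.sum(axis=1), p_ab.sum(axis=0)
    entropy = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    h_ab, h_a, h_b = entropy(p_ab), entropy(p_a), entropy(p_b)
    mutual_info = h_a + h_b - h_ab
    return (h_ab - mutual_info) / h_ab

# Fuse an incoming depth frame only if it is sufficiently novel with respect to
# the rendered model view, i.e. skip frames whose redundancy exceeds a threshold.
incoming = np.random.rand(480, 640)
rendered = np.clip(incoming + 0.01 * np.random.rand(480, 640), 0.0, 1.0)
NOVELTY_THRESHOLD = 0.2                      # illustrative value, not the paper's
if normalised_information_distance(incoming, rendered) > NOVELTY_THRESHOLD:
    print("fuse frame")
else:
    print("skip redundant frame")
```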

Zero-Shot Domain Adaptation with a Physics Prior

Comment: ICCV 2021 Oral presentation. 

Code: https://github.com/Attila94/CIConv

Link: http://arxiv.org/abs/2108.05137

Abstract

We explore the zero-shot setting for day-night domain adaptation. The traditional domain adaptation setting is to train on one domain and adapt to the target domain by exploiting unlabeled data samples from the test set. As gathering relevant test data is expensive and sometimes even impossible, we remove any reliance on test data imagery and instead exploit a visual inductive prior derived from physics-based reflection models for domain adaptation. We cast a number of color invariant edge detectors as trainable layers in a convolutional neural network and evaluate their robustness to illumination changes. We show that the color invariant layer reduces the day-night distribution shift in feature map activations throughout the network. We demonstrate improved performance for zero-shot day-to-night domain adaptation on both synthetic and natural datasets in various tasks, including classification, segmentation and place recognition.

M3D-VTON: A Monocular-to-3D Virtual Try-On Network

Comment: Accepted at ICCV 2021

Link: http://arxiv.org/abs/2108.05126

Abstract

Virtual 3D try-on can provide an intuitive and realistic view for online shopping and has huge potential commercial value. However, existing 3D virtual try-on methods mainly rely on annotated 3D human shapes and garment templates, which hinders their application in practical scenarios. 2D virtual try-on approaches provide a faster alternative for manipulating clothed humans, but lack a rich and realistic 3D representation. In this paper, we propose a novel Monocular-to-3D Virtual Try-On Network (M3D-VTON) that builds on the merits of both 2D and 3D approaches. By integrating 2D information efficiently and learning a mapping that lifts the 2D representation to 3D, we make the first attempt to reconstruct a 3D try-on mesh taking only the target clothing and a person image as inputs. The proposed M3D-VTON includes three modules: 1) the Monocular Prediction Module (MPM), which estimates an initial full-body depth map and accomplishes 2D clothes-person alignment through a novel two-stage warping procedure; 2) the Depth Refinement Module (DRM), which refines the initial body depth to produce more detailed pleat and face characteristics; and 3) the Texture Fusion Module (TFM), which fuses the warped clothing with the non-target body parts to refine the results. We also construct a high-quality synthesized monocular-to-3D virtual try-on dataset, in which each person image is associated with a front and a back depth map. Extensive experiments demonstrate that the proposed M3D-VTON can manipulate and reconstruct a 3D human body wearing the given clothing with compelling details and is more efficient than other 3D approaches.

Representation Learning for Remote Sensing: An Unsupervised Sensor Fusion Approach

Comment: Work completed in 2019 and submitted to ICLR in 2020.

Code: https://github.com/descarteslabs/contrastive_sensor_fusion

Data: https://storage.cloud.google.com/public-published-datasets/osm_example_dataset.zip?folder=true&organizationId=272688069953

Link: http://arxiv.org/abs/2108.05094

Abstract

In the application of machine learning to remote sensing, labeled data is often scarce or expensive, which impedes the training of powerful models like deep convolutional neural networks. Although unlabeled data is abundant, recent self-supervised learning approaches are ill-suited to the remote sensing domain. In addition, most remote sensing applications currently use only a small subset of the multi-sensor, multi-channel information available, motivating the need for fused multi-sensor representations. We propose a new self-supervised training objective, Contrastive Sensor Fusion, which exploits coterminous data from multiple sources to learn useful representations of every possible combination of those sources. This method uses information common across multiple sensors and bands by training a single model to produce a representation that remains similar when any subset of its input channels is used. Using a dataset of 47 million unlabeled coterminous image triplets, we train an encoder to produce semantically meaningful representations from any possible combination of channels from the input sensors. These representations outperform fully supervised ImageNet weights on a remote sensing classification task and improve as more sensors are fused. Our code is available at https://storage.cloud.google.com/public-published-datasets/csf_code.zip.
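
The training objective, a representation that stays similar no matter which subset of sensor channels is fed in, can be sketched as an InfoNCE-style loss over two random channel-subset views of the same scene. The toy encoder, the 50% channel-drop augmentation, and the temperature below are assumptions for illustration; the actual Contrastive Sensor Fusion objective and encoder differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def random_channel_subset(x):
    """Keep a random, non-empty subset of input channels (simulating the use of
    only some sensors/bands); the 50% drop rate is an illustrative choice."""
    b, c, _, _ = x.shape
    mask = (torch.rand(b, c, 1, 1, device=x.device) > 0.5).float()
    empty = mask.sum(dim=1, keepdim=True).expand_as(mask) == 0
    mask[empty] = 1.0                         # guarantee at least one channel
    return x * mask

encoder = nn.Sequential(                      # toy stand-in for the shared encoder
    nn.Conv2d(12, 32, 3, stride=2, padding=1), nn.ReLU(True),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(True),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

def csf_style_loss(x, temperature=0.1):
    """InfoNCE-style loss: two channel-subset views of the same scene are pulled
    together, views of different scenes in the batch are pushed apart."""
    z1 = F.normalize(encoder(random_channel_subset(x)), dim=1)
    z2 = F.normalize(encoder(random_channel_subset(x)), dim=1)
    logits = z1 @ z2.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(x.size(0))         # positives on the diagonal
    return F.cross_entropy(logits, targets)

batch = torch.randn(8, 12, 64, 64)            # 12 stacked sensor channels
print(csf_style_loss(batch).item())
```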

Multi-Source Fusion and Automatic Predictor Selection for Zero-Shot Video Object Segmentation

Comment: This work was accepted as ACM MM 2021 oral

Link: http://arxiv.org/abs/2108.05076

Abstract

Location and appearance are the key cues for video object segmentation. Many sources, such as RGB, depth, optical flow and static saliency, can provide useful information about the objects. However, existing approaches only utilize RGB, or RGB plus optical flow. In this paper, we propose a novel multi-source fusion network for zero-shot video object segmentation. With the help of an interoceptive spatial attention module (ISAM), the spatial importance of each source is highlighted. Furthermore, we design a feature purification module (FPM) to filter out inter-source incompatible features. Through the ISAM and FPM, the multi-source features are effectively fused. In addition, we put forward an automatic predictor selection network (APS) to select the better prediction from either the static saliency predictor or the moving object predictor, in order to prevent over-reliance on failed results caused by low-quality optical flow maps. Extensive experiments on three challenging public benchmarks (DAVIS16, Youtube-Objects and FBMS) show that the proposed model achieves compelling performance against the state-of-the-art methods. The source code will be publicly available at https://github.com/Xiaoqi-Zhao-DLUT/Multi-Source-APS-ZVOS.

MultiTask-CenterNet (MCN): Efficient and Diverse Multitask Learning using an Anchor Free Approach

Comment: Accepted by IEEE International Conference on Computer Vision (ICCV) 2021

Link: http://arxiv.org/abs/2108.05060

Abstract

Multitask learning is a common approach in machine learning that allows multiple objectives to be trained with a shared architecture. It has been shown that training multiple tasks together saves inference time and compute resources while the objectives' performance remains at a similar or even higher level. However, perception-related multitask networks typically only combine closely related tasks, such as object detection, instance and semantic segmentation, or depth estimation. Multitask networks with diverse tasks, and their effects on one another with respect to efficiency, are not well studied. In this paper we augment the anchor-free CenterNet approach to train multiple diverse perception-related tasks together, including object detection and semantic segmentation as well as human pose estimation. We refer to this DNN as Multitask-CenterNet (MCN). Additionally, we study different MCN settings for efficiency. The MCN can perform several tasks at once while maintaining, and in some cases even exceeding, the performance of its corresponding single-task networks. More importantly, the MCN architecture decreases inference time and reduces network size compared to a composition of single-task networks.
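
The architectural idea, one shared backbone with separate anchor-free heads for detection, segmentation and pose, can be sketched as follows. The toy backbone and head/channel sizes are placeholders, not the paper's CenterNet configuration.

```python
import torch
import torch.nn as nn

class MultiTaskCenterNetSketch(nn.Module):
    """One shared backbone with separate anchor-free heads for detection
    (center heatmaps + box sizes), semantic segmentation and pose keypoints.
    Backbone and head sizes are placeholders, not the paper's configuration."""
    def __init__(self, num_classes=80, num_seg_classes=19, num_keypoints=17):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(True),
            nn.Conv2d(64, 128, 3, stride=1, padding=1), nn.ReLU(True),
        )
        def head(out_ch):
            return nn.Sequential(nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(True),
                                 nn.Conv2d(64, out_ch, 1))
        self.center_head = head(num_classes)      # object-center heatmaps
        self.size_head = head(2)                  # box width/height regression
        self.seg_head = head(num_seg_classes)     # semantic segmentation logits
        self.pose_head = head(num_keypoints)      # keypoint heatmaps

    def forward(self, x):
        f = self.backbone(x)                      # shared features
        return {
            "centers": self.center_head(f).sigmoid(),
            "sizes": self.size_head(f),
            "segmentation": self.seg_head(f),
            "keypoints": self.pose_head(f).sigmoid(),
        }

outputs = MultiTaskCenterNetSketch()(torch.randn(1, 3, 512, 512))
print({k: v.shape for k, v in outputs.items()})
```

Sharing the backbone is what yields the reported savings in inference time and network size compared to running one network per task.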

Rethinking Coarse-to-Fine Approach in Single Image Deblurring

Comment: Accepted by IEEE International Conference on Computer Vision (ICCV) 2021

Link: http://arxiv.org/abs/2108.05054

Abstract

Coarse-to-fine strategies have been extensively used for the architecture design of single image deblurring networks. Conventional methods typically stack sub-networks with multi-scale input images and gradually improve the sharpness of images from the bottom sub-network to the top sub-network, inevitably yielding high computational costs. Toward a fast and accurate deblurring network design, we revisit the coarse-to-fine strategy and present a multi-input multi-output U-net (MIMO-UNet). The MIMO-UNet has three distinct features. First, the single encoder of the MIMO-UNet takes multi-scale input images to ease the difficulty of training. Second, the single decoder of the MIMO-UNet outputs multiple deblurred images at different scales to mimic multi-cascaded U-nets using a single U-shaped network. Last, asymmetric feature fusion is introduced to merge multi-scale features in an efficient manner. Extensive experiments on the GoPro and RealBlur datasets demonstrate that the proposed network outperforms the state-of-the-art methods in terms of both accuracy and computational complexity. Source code is available for research purposes at https://github.com/chosj95/MIMO-UNet.
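
The multi-output side of MIMO-UNet implies supervision at several scales: each decoder output is compared against a correspondingly downscaled sharp image. The loss below is a simplified sketch of that idea (plain L1 per scale); the paper's exact loss terms may differ.

```python
import torch
import torch.nn.functional as F

def multi_scale_deblur_loss(outputs, sharp):
    """Supervise every decoder output against a downscaled sharp target.
    outputs: list of deblurred images at 1x, 1/2x, 1/4x resolution, sharp: (B,3,H,W)."""
    loss = 0.0
    for out in outputs:
        target = F.interpolate(sharp, size=out.shape[-2:], mode="bilinear",
                               align_corners=False)
        loss = loss + F.l1_loss(out, target)
    return loss

sharp = torch.rand(2, 3, 256, 256)
outputs = [torch.rand(2, 3, 256, 256), torch.rand(2, 3, 128, 128),
           torch.rand(2, 3, 64, 64)]
print(multi_scale_deblur_loss(outputs, sharp).item())
```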

Learning Action Completeness from Points for Weakly-supervised Temporal Action Localization

Comment: Accepted by ICCV 2021 (Oral). 

Code: https://github.com/Pilhyeon

Link: http://arxiv.org/abs/2108.05029

Abstract

We tackle the problem of localizing temporal intervals of actions with only a single frame label for each action instance at training time. Owing to label sparsity, existing work fails to learn action completeness, resulting in fragmentary action predictions. In this paper, we propose a novel framework in which dense pseudo-labels are generated to provide completeness guidance for the model. Concretely, we first select pseudo background points to supplement the point-level action labels. Then, taking these points as seeds, we search for the optimal sequence that is likely to contain complete action instances while agreeing with the seeds. To learn completeness from the obtained sequence, we introduce two novel losses that contrast action instances with background ones in terms of action score and feature similarity, respectively. Experimental results demonstrate that our completeness guidance indeed helps the model to locate complete action instances, leading to large performance gains, especially under high IoU thresholds. Moreover, we demonstrate the superiority of our method over existing state-of-the-art methods on four benchmarks: THUMOS'14, GTEA, BEOID, and ActivityNet. Notably, our method even performs comparably to recent fully-supervised methods at a 6 times lower annotation cost. Our code is available at https://github.com/Pilhyeon.

Prototype Completion for Few-Shot Learning

Comment: Extended version of 'Prototype Completion with Primitive Knowledge for Few-Shot Learning' in CVPR 2021

Link: http://arxiv.org/abs/2108.05010

Abstract

Few-shot learning aims to recognize novel classes with few examples. Pre-training based methods effectively tackle the problem by pre-training a feature extractor and then fine-tuning it through nearest-centroid based meta-learning. However, results show that the fine-tuning step yields only marginal improvements. In this paper, 1) we identify the reason: in the pre-trained feature space, the base classes already form compact clusters while novel classes spread as groups with large variances, which implies that fine-tuning the feature extractor is of limited value; and 2) instead of fine-tuning the feature extractor, we focus on estimating more representative prototypes. Consequently, we propose a novel prototype completion based meta-learning framework. This framework first introduces primitive knowledge (i.e., class-level part or attribute annotations) and extracts representative features for seen attributes as priors. Second, a part/attribute transfer network is designed to learn to infer the representative features of unseen attributes as supplementary priors. Finally, a prototype completion network is devised to learn to complete prototypes with these priors. Moreover, to avoid prototype completion error, we further develop a Gaussian-based prototype fusion strategy that fuses the mean-based and completed prototypes by exploiting the unlabeled samples. Extensive experiments show that our method (i) obtains more accurate prototypes and (ii) achieves superior performance in both inductive and transductive FSL settings.
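
The Gaussian-based prototype fusion can be read as a variance-weighted average of the mean-based and completed prototypes, giving more weight to the lower-variance estimate. The function below sketches that reading; in the paper the per-class variances would be estimated with the help of unlabeled samples, and the exact weighting scheme may differ.

```python
import torch

def gaussian_style_fusion(mean_proto, completed_proto, var_mean, var_completed, eps=1e-8):
    """Variance-weighted fusion of two prototype estimates (illustrative sketch).

    mean_proto, completed_proto: (C, D) per-class prototypes
    var_mean, var_completed:     (C,) scalar variance estimates per class
    """
    w_mean = 1.0 / (var_mean + eps)            # lower variance -> higher weight
    w_comp = 1.0 / (var_completed + eps)
    w = torch.stack([w_mean, w_comp], dim=0)
    w = w / w.sum(dim=0, keepdim=True)         # normalise the two weights per class
    return w[0, :, None] * mean_proto + w[1, :, None] * completed_proto

protos = gaussian_style_fusion(torch.randn(5, 64), torch.randn(5, 64),
                               torch.rand(5), torch.rand(5))
print(protos.shape)   # torch.Size([5, 64])
```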

Large-Scale Modeling of Mobile User Click Behaviors Using Deep Learning

Comment: Accepted to RecSys'21

Link: http://arxiv.org/abs/2108.05342

Abstract

Modeling tap or click sequences of users on a mobile device can improve our understanding of interaction behavior and offers opportunities for UI optimization by recommending the next element the user might want to click on. We analyzed a large-scale dataset of over 20 million clicks from more than 4,000 mobile users who opted in. We then designed a deep learning model that predicts the next element the user will click given the user's click history, the structural information of the UI screen, and the current context, such as the time of day. We thoroughly investigated the deep model by comparing it with a set of baseline methods on this dataset. The experiments show that our model achieves 48% and 71% accuracy (top-1 and top-3) for predicting next clicks on a held-out set of test users, significantly outperforming all the baseline methods. We also discuss a few scenarios for integrating the model into mobile interaction and how users can potentially benefit from it.
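
The reported 48% top-1 and 71% top-3 figures correspond to standard top-k accuracy over the candidate UI elements, which can be computed as below (the candidate count of 50 is a placeholder, not taken from the paper).

```python
import torch

def top_k_accuracy(logits, targets, k):
    """Fraction of examples whose true next-click element appears among the
    model's k highest-scoring candidates."""
    topk = logits.topk(k, dim=1).indices               # (B, k) predicted element ids
    return (topk == targets[:, None]).any(dim=1).float().mean().item()

logits = torch.randn(1000, 50)                         # scores over 50 candidate UI elements
targets = torch.randint(0, 50, (1000,))
print(top_k_accuracy(logits, targets, 1), top_k_accuracy(logits, targets, 3))
```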

Estimation of Fair Ranking Metrics with Incomplete Judgments

Comment: Published in Proceedings of the Web Conference 2021 (WWW '21)

Link: http://arxiv.org/abs/2108.05152

Abstract

There is increasing attention to evaluating the fairness of search system ranking decisions. Fairness metrics often consider the membership of items in particular groups, typically identified using protected attributes such as gender or ethnicity. To date, these metrics assume the availability and completeness of protected attribute labels for the items. However, the protected attributes of individuals are rarely present, limiting the application of fair ranking metrics in large-scale systems. To address this problem, we propose a sampling strategy and estimation technique for four fair ranking metrics. We formulate a robust and unbiased estimator that can operate even with a very limited number of labeled items. We evaluate our approach using both simulated and real-world data. Our experimental results demonstrate that our method can estimate this family of fair ranking metrics and provides a robust, reliable alternative to exhaustive or random data annotation.

Cooperative Learning for Noisy Supervision

Comment: ICME 2021 Oral

Link: http://arxiv.org/abs/2108.05092

Abstract

Learning with noisy labels has gained enormous interest in the robust deep learning area. Recent studies have empirically shown that utilizing dual networks can enhance the performance of a single network, but without theoretical proof. In this paper, we propose a Cooperative Learning (CooL) framework for noisy supervision that analytically explains the effects of leveraging dual or multiple networks. Specifically, the simple but efficient combination in CooL yields a more reliable risk minimization for unseen clean data. A range of experiments has been conducted on several benchmarks with both synthetic and real-world settings. Extensive results indicate that CooL outperforms several state-of-the-art methods.

ULTRA: An Unbiased Learning To Rank Algorithm Toolbox

Comment: 10 pages, 6 figures, CIKM conference

Link: http://arxiv.org/abs/2108.05073

Abstract

Learning to rank systems have become an important part of our daily life. However, the implicit user feedback used to train many learning to rank models is usually noisy and suffers from user bias (i.e., position bias). Thus, obtaining an unbiased model from biased feedback has become an important research field in IR. Existing studies on unbiased learning to rank (ULTR) can be grouped into two families: algorithms that attain unbiasedness with logged data (offline learning), and algorithms that achieve unbiasedness by estimating unbiased parameters from real-time user interactions (online learning). While many algorithms exist in both families, there is no unified way to compare and benchmark them. As a result, it can be challenging for researchers to choose the right technique for their problems, or for newcomers to the field to learn and understand existing algorithms. To solve this problem, we introduce ULTRA, a flexible, extensible, and easily configurable ULTR toolbox. Its key features include support for multiple ULTR algorithms with configurable hyperparameters, a variety of built-in click models that can be used separately to simulate clicks, different ranking model architectures and evaluation metrics, and simple learning-to-rank pipeline creation. In this paper, we discuss the general framework of ULTR, briefly describe the algorithms in ULTRA, and detail the structure and pipeline of the toolbox. We experimented with all the algorithms supported by ULTRA and show that the toolbox's performance is reasonable. Our toolbox is an important resource for researchers to conduct experiments on ULTR algorithms with different configurations, as well as to test their own algorithms with the supported features.

Boosting the Generalization Capability in Cross-Domain Few-shot Learning via Noise-enhanced Supervised Autoencoder

Comment: Accepted at ICCV 2021

Link: http://arxiv.org/abs/2108.05028

Abstract

State-of-the-art (SOTA) few-shot learning (FSL) methods suffer a significant performance drop in the presence of domain differences between the source and target datasets. Strong discrimination ability on the source dataset does not necessarily translate to high classification accuracy on the target dataset. In this work, we address this cross-domain few-shot learning (CDFSL) problem by boosting the generalization capability of the model. Specifically, we teach the model to capture broader variations of the feature distributions with a novel noise-enhanced supervised autoencoder (NSAE). NSAE trains the model by jointly reconstructing inputs and predicting the labels of the inputs as well as their reconstructed pairs. Theoretical analysis based on intra-class correlation (ICC) shows that the feature embeddings learned from NSAE have stronger discrimination and generalization abilities in the target domain. We also take advantage of the NSAE structure and propose a two-step fine-tuning procedure that achieves better adaptation and improves classification performance in the target domain. Extensive experiments and ablation studies are conducted to demonstrate the effectiveness of the proposed method. Experimental results show that our proposed method consistently outperforms SOTA methods under various conditions.
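
The NSAE training signal, reconstruct a noise-perturbed input while classifying both the input and its reconstruction, can be sketched with a small convolutional autoencoder plus a classifier head. Architecture sizes, the Gaussian noise level, and equal loss weights are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NSAESketch(nn.Module):
    """Encoder/decoder pair plus a classifier head, trained to (i) reconstruct a
    noise-enhanced input and (ii) predict labels for both the input and its
    reconstruction. An illustrative sketch, not the paper's architecture."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(True),
                                     nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(True))
        self.decoder = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(True),
                                     nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid())
        self.classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(64, num_classes))

    def loss(self, x, y, noise_std=0.1):
        z = self.encoder(x + noise_std * torch.randn_like(x))   # noise-enhanced input
        x_rec = self.decoder(z)
        logits_in = self.classifier(z)
        logits_rec = self.classifier(self.encoder(x_rec))       # classify the reconstruction too
        return (F.mse_loss(x_rec, x)
                + F.cross_entropy(logits_in, y)
                + F.cross_entropy(logits_rec, y))

model = NSAESketch()
x, y = torch.rand(4, 3, 84, 84), torch.randint(0, 5, (4,))
print(model.loss(x, y).item())
```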

LightMove: A Lightweight Next-POI Recommendation for Taxicab Rooftop Advertising

Comment: Accepted in CIKM 2021

Link: http://arxiv.org/abs/2108.04993

Abstract

Mobile digital billboards are an effective way to augment brand awareness. Among such mobile billboards, taxicab rooftop devices are emerging in the market as a brand new medium. Motov is a leading company in South Korea in the taxicab rooftop advertising market. In this work, we present a lightweight yet accurate deep learning-based method to predict taxicabs' next locations, to better prepare for targeted advertising based on the demographic information of locations. Considering that next-POI recommendation datasets are frequently sparse, we design our model based on neural ordinary differential equations (NODEs), which are known to be robust to sparse or incorrect input, with several enhancements. Our model, which we call LightMove, offers higher prediction accuracy, a smaller number of parameters, and/or a smaller training/inference time when evaluated on various datasets, in comparison with state-of-the-art models.
